Thanks for that investigation!
The general insight here is that using basic entropy coding is not the
best you can do here because you don't make any use of the fact that
you are only interested in the (ordered) set of fixed-length words and
not in any permutation of a sequence of any length. That's why your
run-length encoding works better.
This is also why general-purpose LZ+entropy coding compression formats
do not work well: as we have seen entropy coding is not good enough
and LZ does not help enough on the expected short common substrings.
Also it doesn't benefit much from starting with a sorted set.
In fact, what you look for, is to find a short representation to
select those 12972 words out of the total set of 5 letter words (26^5
combinations). In that case, the difference between two adjacent words
in the ordered list will be significantly smaller than the
representation of the word itself.
Here's an article about how to do that efficiently (if the symbols
would be drawn randomly):
https://preshing.com/20121105/arithmetic-coding-and-the-1mb-sorting-problem/
Selecting 12972 from 26^5 words means the mean difference between
words (when represented as a base26 number) is 26^5 / 12972 = 915,
needing only around 10 bits per word on average.
In the wordle case, words are not drawn randomly (English words have
common prefixes/suffixes), so the difference between two adjacent
words is often small and less frequently larger. Out of that insight,
the author of
http://alexanderpruss.blogspot.com/2022/02/game-boy-wordle-how-to-compress-12972.html
found a simple compression scheme for the differences between words
similar to the UTF-8 varint representation that compressed the data
further to 17871 bytes with the benefit that the compressed stream is
byte-based and not a bit stream which might be easier to handle on
small CPUs.
One interesting fact of the wordle dataset is that the differences
compress best when you order them by reversing each word first (I also
tried all other permutations before sorting but that seems to be the
best). The reason could be that there are more common word suffixes
than prefixes.
You can find even smaller representations than in the latter blog post
by representing the differences again using entropy coding.
The entropy of the base 26 reverse-word ordered difference
distribution is ~12378 bytes not including the distribution table with
1828 elements. However, representing that table itself will take a lot
of space.
A common way to avoid the huge table is to use a simple logarithmic
scheme of encoding integers (like in zstd or deflate length codes)
where you encode the magnitude of a difference using a variable length
encoding and the noise in the less significant bits using fixed-sized
bits. Using that scheme, my calculations give a lower bound of 13245
bytes for representing just the data with perfect entropy coding
(Huffman is worse but actually not much, you could try tANS or
algorithmic coding for better results). Here is the distribution for
the magnitudes for the reversed words represented in base 26:
```
1 1721 13.27 % 2.91 bits + 1 extra bits Total 3.91 bits For
all: 6735.94 bits
6 1645 12.68 % 2.98 bits + 6 extra bits Total 8.98 bits For
all: 14770.67 bits
2 1635 12.61 % 2.99 bits + 2 extra bits Total 4.99 bits For
all: 8155.26 bits
0 1398 10.78 % 3.21 bits + 0 extra bits Total 3.21 bits For
all: 4492.97 bits
7 1376 10.61 % 3.24 bits + 7 extra bits Total 10.24 bits For
all: 14085.75 bits
3 1186 9.14 % 3.45 bits + 3 extra bits Total 6.45 bits For
all: 7651.02 bits
5 909 7.01 % 3.83 bits + 5 extra bits Total 8.83 bits For
all: 8030.89 bits
8 680 5.24 % 4.25 bits + 8 extra bits Total 12.25 bits For
all: 8332.46 bits
4 625 4.82 % 4.38 bits + 4 extra bits Total 8.38 bits For
all: 5234.56 bits
9 525 4.05 % 4.63 bits + 9 extra bits Total 13.63 bits For
all: 7154.08 bits
10 500 3.85 % 4.70 bits + 10 extra bits Total 14.70 bits For
all: 7348.61 bits
11 421 3.25 % 4.95 bits + 11 extra bits Total 15.95 bits For
all: 6712.98 bits
12 147 1.13 % 6.46 bits + 12 extra bits Total 18.46 bits For
all: 2714.11 bits
14 64 0.49 % 7.66 bits + 14 extra bits Total 21.66 bits For
all: 1386.43 bits
15 57 0.44 % 7.83 bits + 15 extra bits Total 22.83 bits For
all: 1301.32 bits
13 56 0.43 % 7.86 bits + 13 extra bits Total 20.86 bits For
all: 1167.92 bits
16 19 0.15 % 9.42 bits + 16 extra bits Total 25.42 bits For
all: 482.89 bits
17 5 0.04 % 11.34 bits + 17 extra bits Total 28.34 bits For
all: 141.71 bits
18 2 0.02 % 12.66 bits + 18 extra bits Total 30.66 bits For
all: 61.33 bits
Size: 19 Entropy: 3.58 Sum: 46400.9 bits / 5800.1 bytes Total
bits: 105960.88482598629 bits / 13245.110603248286 bytes
```
I.e. the difference between adjacent words is e.g. 1721 times 2 or 3
(magnitude 1), 1645 times between 64 and 127 (magnitude 6), 1398 times
just 1 (magnitude 0), etc. The decimal number of bits is the shannon
entropy for that frequency, extra bits correspond to the magnitude
(there might be better choices for those based on experiments).
The reason that magnitude 6 is frequent is because that's the range
where you find a difference in the two least significant letters (diff
= 26 * dx + dy).
I haven't implemented that scheme yet, so I cannot say how well it
actually works and how big the code would be.
As you also mentioned I would imagine that there's a clever thing you
could do with a custom version of bzip / Burrows–Wheeler to get even
better results but I haven't dug into that yet.
Cheers,
Johannes
> LZ does not help enough on the expected short common substrings
Indeed. It does worse than I had expected when I started digging into the
problem, which is why I thought I could do better.
> the mean difference between words (when represented as a base26 number)
> is 26^5 / 12972 = 915
I looked at my email in the wrong order, because a previous message had me
digging into this same idea, all the way through implementing it. Your
estimate is right on: in the original Wordle list, the mean difference is
913!
https://lists.sr.ht/~skeeto/public-inbox/%3CYiXIayjFm2O0um8e%40triptico.com%3E
> Here's an article about how to do that efficiently
Very interesting, thanks for sharing. Seems I'm a month behind!
> Using that scheme, my calculations give a lower bound of 13245 bytes for
> representing just the data with perfect entropy coding
With my Huffman coding of deltas over the entire list, I got 13,182 bytes
of compressed data. Curious that it's so close, but also that it's just
under your lower bound. As you pointed out, that does not include the
table, which is substantial in this case (1,768 leaves). Representing the
table efficiently is a whole new problem, and the variable-length encoding
seems like a better trade-off.