~skeeto/public-inbox

2 2

Re: Compressing and embedding a Wordle word list

Angel Ortega <angel@triptico.com>
Details
Message ID
<YiXIayjFm2O0um8e@triptico.com>
DKIM signature
pass
Download raw message
I see another approach; given the assumption that the first letter always happen, it's not necessary to store it in the table at all, only to increment by one from the first letter of the previously seen word when the second character of the word has a lower value than the one in the previously seen word. If the word[5] buffer contains the previous word, this is easily achieved.

Given this, we can store each word in 20 bytes, and using a similar way to index the database by bit, we will need a total of 12672 words x 5 letters x 20 bits / 8 bytes = 17741 bytes. I agree that the decoder will be a bit more complicated (but not that much) and not as elegant.

Best regards,
Ángel

Re: Compressing and embedding a Wordle word list

Details
Message ID
<20220308001352.unpovkjxsnmrf7eu@nullprogram.com>
In-Reply-To
<YiXIayjFm2O0um8e@triptico.com> (view parent)
DKIM signature
missing
Download raw message
I don't understand the specifics of your scheme. It sounds like delta 
encoding, but I'm not sure. I also don't follow your math.

I hadn't thought of delta encoding, though, so I just tried it out in a 
straightforward way. With each word as a base-26 number, I encoded the 
differences between adjacent words using Huffman coding, and I get 13,182 
bytes of compressed data. Initially this seems much better, except the 
state table has 3,535 entries (from a Huffman tree of 1,768 leaves), many 
of which are large (up to 18 bits). In other words, it pushed all the 
storage into the state table, which isn't particularly efficient.

The largest delta in the original Wordle list is 164,068, meaning every 
delta fits in 18 bits. Storing 12,972 of these makes for 29,187 bytes, 
which isn't quite as good as the Huffman coded version even counting the 
inefficient state table.

Re: Compressing and embedding a Wordle word list

Ángel Ortega <angel@triptico.com>
Details
Message ID
<d0373ec9-9b11-f649-ebc9-eaee52e6bd09@triptico.com>
In-Reply-To
<20220308001352.unpovkjxsnmrf7eu@nullprogram.com> (view parent)
DKIM signature
pass
Download raw message
8/3/22 1:13, Christopher Wellons:

> I don't understand the specifics of your scheme. It sounds like delta 
> encoding, but I'm not sure. I also don't follow your math.

Let's just forget about encoding each symbol in 5 bits and use full 
bytes by now.

Let the stored database be:

bbey cute gile lbum lloy mple pron rray ttic wful
bide dapt ging lert lone ngel rbor rrow udio abes
bout dded gree lgae long nger reas shes udit acks
bove dmit head lias loud ngle rena side utos acon
buse dobe ided lien lpha ngry rgue sked vail adge
cids dopt ides lign ltar nkle rise spen void adly
corn dult imed like lter nnex rmed sses wait aked
cres fter ired live mber part rmor sset wake aker
cted gain isle lley mend pple roma tlas ward alls
ctor gent larm llow mong pply rose toms ware ands

That is, 4 bytes per word.

The word[5] buffer stores the previously seen word.

The algorithm to take the next word is:

if db[0] < word[1]
   word[0]++
word[1] = db[0]
word[2] = db[1]
word[3] = db[2]
word[4] = db[3]

You start iterating the db with the magic string "`z   " inside word[5].

This way, for the first word in the db, you'll get "abbey", because
the if condition is true and word[0] changes from ` (0x60) to 'a'
(0x61).

After having "awful" stored in word[5] and finding "abes", the if
condition is true again and word[0] changed from 'a' to 'b'.

Just by using this algorithm you just get a 5/4 compression.

Of course, this makes the assumption that there are at least one
word starting with every letter of the English alphabet, which may
or may not be true because I've never played Wordle.

Combine this with the same way of storing each letter in 5 bits, and
you'll get what I meant (I may got my maths bad in my previous email,
I wrote it while on the bus without my glasses on :-)

Anyway, I just want to tell you that I follow your blog with great
interest because the topics you cover are fascinating and you are
very, very didactic. Keep the good work.

Best regards,
Ángel
Reply to thread Export thread (mbox)