Authentication-Results: mail-b.sr.ht; dkim=none Received: from mail.nullprogram.com (mail.nullprogram.com [192.241.191.137]) by mail-b.sr.ht (Postfix) with ESMTPS id 628AA11EF15 for <~skeeto/public-inbox@lists.sr.ht>; Tue, 8 Mar 2022 00:13:58 +0000 (UTC) Received: from nullprogram.com (localhost [127.0.0.1]) by mail.nullprogram.com (Postfix) with ESMTPS id 82AFFC7986; Mon, 7 Mar 2022 19:13:53 -0500 (EST) Date: Mon, 7 Mar 2022 19:13:52 -0500 From: Christopher Wellons To: Angel Ortega Cc: ~skeeto/public-inbox@lists.sr.ht Subject: Re: Compressing and embedding a Wordle word list Message-ID: <20220308001352.unpovkjxsnmrf7eu@nullprogram.com> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Disposition: inline In-Reply-To: User-Agent: NeoMutt/20170113 (1.7.2) I don't understand the specifics of your scheme. It sounds like delta encoding, but I'm not sure. I also don't follow your math. I hadn't thought of delta encoding, though, so I just tried it out in a straightforward way. With each word as a base-26 number, I encoded the differences between adjacent words using Huffman coding, and I get 13,182 bytes of compressed data. Initially this seems much better, except the state table has 3,535 entries (from a Huffman tree of 1,768 leaves), many of which are large (up to 18 bits). In other words, it pushed all the storage into the state table, which isn't particularly efficient. The largest delta in the original Wordle list is 164,068, meaning every delta fits in 18 bits. Storing 12,972 of these makes for 29,187 bytes, which isn't quite as good as the Huffman coded version even counting the inefficient state table.