From: Johannes Rudolph
Date: Mon, 7 Mar 2022 19:59:03 +0100
Subject: Fwd: Compressing and embedding a Wordle word list
To: ~skeeto/public-inbox@lists.sr.ht

Thanks for that investigation!

The general insight here is that basic entropy coding is not the best you can do, because it makes no use of the fact that you are only interested in the (ordered) set of fixed-length words, not in arbitrary permutations of a sequence of arbitrary length. That's why your run-length encoding works better. It is also why general-purpose LZ+entropy compression formats do not work well: as we have seen, entropy coding alone is not good enough, and LZ does not help much with the expected short common substrings. LZ also doesn't benefit much from starting with a sorted set.

In fact, what you are looking for is a short representation that selects those 12972 words out of the total set of 5-letter words (26^5 combinations). In that case, the difference between two adjacent words in the ordered list will be significantly smaller than the representation of the word itself. Here's an article about how to do that efficiently (if the symbols were drawn randomly): https://preshing.com/20121105/arithmetic-coding-and-the-1mb-sorting-problem/

Selecting 12972 out of 26^5 words means the mean difference between adjacent words (represented as base-26 numbers) is 26^5 / 12972 = 915, needing only around 10 bits per word on average. In the Wordle case, words are not drawn randomly (English words share common prefixes/suffixes), so the difference between two adjacent words is often small and only occasionally larger.
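To make the delta idea concrete, here is a minimal sketch (my own illustration, not code from the thread): map each lowercase 5-letter word to a base-26 integer, sort, and look at the gaps between neighbors. The tiny word list is just a stand-in for the real 12972-word set.

```python
def word_to_int(word):
    """Interpret a lowercase 5-letter word as a base-26 number."""
    n = 0
    for c in word:
        n = n * 26 + (ord(c) - ord('a'))
    return n

# Tiny stand-in list; the real Wordle set has 12972 entries.
words = ["cigar", "rebut", "sissy", "humph", "awake"]
nums = sorted(word_to_int(w) for w in words)
deltas = [b - a for a, b in zip(nums, nums[1:])]

# For the full list the mean delta is 26**5 // 12972 = 915, so the gaps
# need only ~10 bits on average instead of ~23.5 bits for a raw word.
```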
Building on that insight, the author of http://alexanderpruss.blogspot.com/2022/02/game-boy-wordle-how-to-compress-12972.html found a simple compression scheme for the differences between words, similar to the UTF-8 varint representation, that compresses the data further to 17871 bytes. It has the benefit that the compressed stream is byte-based rather than a bit stream, which may be easier to handle on small CPUs.

One interesting property of the Wordle dataset is that the differences compress best when you reverse each word before sorting (I also tried all the other letter permutations before sorting, but that one seems to be the best). The reason could be that there are more common word suffixes than prefixes.

You can find even smaller representations than in the latter blog post by encoding the differences with entropy coding again. The entropy of the base-26, reverse-word-ordered difference distribution is ~12378 bytes, not including the distribution table with 1828 elements. However, representing that table itself would take a lot of space. A common way to avoid the huge table is a simple logarithmic scheme for encoding integers (as in the zstd or deflate length codes), where you encode the magnitude of a difference with a variable-length code and the noise in the less significant bits with fixed-size bits. Using that scheme, my calculations give a lower bound of 13245 bytes for representing just the data with perfect entropy coding (Huffman is worse, but actually not by much; you could try tANS or arithmetic coding for better results).
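For reference, a byte-oriented varint in the UTF-8 spirit can look like the following sketch (my own illustration with 7 payload bits per byte and a continuation flag; the blog post's exact byte layout may differ):

```python
def encode_varint(n):
    """Little-endian base-128 varint: high bit set on continuation bytes."""
    out = []
    while n >= 0x80:
        out.append(0x80 | (n & 0x7F))
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_varint(data, pos=0):
    """Return (value, next_position)."""
    n, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        n |= (b & 0x7F) << shift
        if b < 0x80:
            return n, pos
        shift += 7
```

Since most deltas are small, the vast majority of them fit in one or two bytes, and the whole stream stays byte-aligned.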
Here is the distribution of the magnitudes for the reversed words represented in base 26:
```
 1  1721  13.27 %   2.91 bits +  1 extra bits  Total  3.91 bits  For all:  6735.94 bits
 6  1645  12.68 %   2.98 bits +  6 extra bits  Total  8.98 bits  For all: 14770.67 bits
 2  1635  12.61 %   2.99 bits +  2 extra bits  Total  4.99 bits  For all:  8155.26 bits
 0  1398  10.78 %   3.21 bits +  0 extra bits  Total  3.21 bits  For all:  4492.97 bits
 7  1376  10.61 %   3.24 bits +  7 extra bits  Total 10.24 bits  For all: 14085.75 bits
 3  1186   9.14 %   3.45 bits +  3 extra bits  Total  6.45 bits  For all:  7651.02 bits
 5   909   7.01 %   3.83 bits +  5 extra bits  Total  8.83 bits  For all:  8030.89 bits
 8   680   5.24 %   4.25 bits +  8 extra bits  Total 12.25 bits  For all:  8332.46 bits
 4   625   4.82 %   4.38 bits +  4 extra bits  Total  8.38 bits  For all:  5234.56 bits
 9   525   4.05 %   4.63 bits +  9 extra bits  Total 13.63 bits  For all:  7154.08 bits
10   500   3.85 %   4.70 bits + 10 extra bits  Total 14.70 bits  For all:  7348.61 bits
11   421   3.25 %   4.95 bits + 11 extra bits  Total 15.95 bits  For all:  6712.98 bits
12   147   1.13 %   6.46 bits + 12 extra bits  Total 18.46 bits  For all:  2714.11 bits
14    64   0.49 %   7.66 bits + 14 extra bits  Total 21.66 bits  For all:  1386.43 bits
15    57   0.44 %   7.83 bits + 15 extra bits  Total 22.83 bits  For all:  1301.32 bits
13    56   0.43 %   7.86 bits + 13 extra bits  Total 20.86 bits  For all:  1167.92 bits
16    19   0.15 %   9.42 bits + 16 extra bits  Total 25.42 bits  For all:   482.89 bits
17     5   0.04 %  11.34 bits + 17 extra bits  Total 28.34 bits  For all:   141.71 bits
18     2   0.02 %  12.66 bits + 18 extra bits  Total 30.66 bits  For all:    61.33 bits

Size: 19  Entropy: 3.58  Sum: 46400.9 bits / 5800.1 bytes
Total bits: 105960.88482598629 bits / 13245.110603248286 bytes
```
I.e. the difference between adjacent words is 2 or 3 (magnitude 1) 1721 times, between 64 and 127 (magnitude 6) 1645 times, exactly 1 (magnitude 0) 1398 times, etc.
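As a quick sanity check (my own sketch, not code from the thread): the magnitude of a difference is just its bit length minus one, and the table's entropy and total-size figures follow directly from the magnitude counts.

```python
import math

def magnitude(d):
    """Magnitude m means the difference d lies in [2**m, 2**(m+1))."""
    return d.bit_length() - 1   # m extra bits then carry d - 2**m

# Counts copied from the table above (12971 deltas for 12972 words).
counts = {1: 1721, 6: 1645, 2: 1635, 0: 1398, 7: 1376, 3: 1186,
          5: 909, 8: 680, 4: 625, 9: 525, 10: 500, 11: 421,
          12: 147, 14: 64, 15: 57, 13: 56, 16: 19, 17: 5, 18: 2}
total = sum(counts.values())
entropy = sum(-c / total * math.log2(c / total) for c in counts.values())
total_bits = sum(c * (-math.log2(c / total) + m) for m, c in counts.items())
# entropy comes out near 3.58 bits and total_bits near 105961
# (~13245 bytes), matching the table.
```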
The fractional bit counts are the Shannon entropy for each frequency; the extra bits correspond to the magnitude (there might be better choices for those, based on experiments). The reason that magnitude 6 is frequent is that this is the range where the difference lies in the two least significant letters (diff = 26 * dx + dy). I haven't implemented that scheme yet, so I cannot say how well it actually works or how big the code would be.

As you also mentioned, I would imagine there's something clever you could do with a custom version of bzip / Burrows–Wheeler to get even better results, but I haven't dug into that yet.

Cheers,
Johannes