~skeeto/public-inbox

Branchless UTF-8 encoding

Charles Eckman <cceckman@gmail.com>
Message ID
<CAMqU4MmBNPDKHy_v2N80MjsDJWZ=S-2r-gMa_vCF8pRhF4-ZLA@mail.gmail.com>
Sender timestamp
1736976658
Hi Chris (and others),

An acquaintance nerd-sniped me yesterday by pointing to your branchless
UTF-8 decoder and asking whether a branchless *en*coder was possible.

I worked out how to do it after a bit of discussion and experimentation.
The writeup and code are at
<https://cceckman.com/writing/branchless-utf8-encoding/>.

I haven't benchmarked it -- my goal was just proof-of-concept,
not performance -- but I thought you might be interested.

Thanks for your writeup & inspiration!

-Charles
Message ID
<20250121051426.m7th6azwxjbosfju@nullprogram.com>
In-Reply-To
<CAMqU4MmBNPDKHy_v2N80MjsDJWZ=S-2r-gMa_vCF8pRhF4-ZLA@mail.gmail.com>
Sender timestamp
1737418466
Thanks for sharing, Charles! It's interesting you included validation in 
the encoder. I wouldn't have (and didn't below) in mine, but it does match 
the spirit of the decoder. Mapping leading zeros onto a length is trickier 
than I anticipated, and your table resolves that nicely.
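
For readers following along, here is a rough sketch of what a table that maps a leading-zero count onto an encoded length could look like. It is an illustration only, not necessarily the layout used in the linked writeup: the helper name utf8_len_from_lz and the convention that 0 means "not encodable" are assumptions made for this example. It assumes the count was taken over a 32-bit value, so it ranges from 0 to 32.

```c
#include <stdint.h>

/* Hypothetical helper: map the number of leading zero bits in a 32-bit
 * code point (0..32) to its UTF-8 length in bytes. A result of 0 means
 * the value has more than 21 significant bits and cannot be encoded.
 * Surrogates and 21-bit values above U+10FFFF are NOT rejected here;
 * that range check would have to happen separately. */
static int utf8_len_from_lz(int lz)
{
    static const uint8_t len[33] = {
        /* lz  0..10: 22..32 significant bits, too wide */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* lz 11..15: 17..21 significant bits */
        4, 4, 4, 4, 4,
        /* lz 16..20: 12..16 significant bits */
        3, 3, 3, 3, 3,
        /* lz 21..24: 8..11 significant bits */
        2, 2, 2, 2,
        /* lz 25..32: 0..7 significant bits (ASCII, including zero) */
        1, 1, 1, 1, 1, 1, 1, 1,
    };
    return len[lz];
}
```

The length boundaries are uneven (7, 11, 16, and 21 significant bits), which is why a lookup is simpler than trying to derive the length arithmetically from the count.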

To resolve your undefined bsr issue, maybe you could OR on a bit you don't 
care about just for bsr. Then it's never zero, and the final length result 
is unchanged. I see that done often with the GCC built-in.
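
A minimal sketch of that OR trick with the GCC/Clang built-in, assuming a 32-bit unsigned int. The function names are made up for this example, and the comparison-based length computation is just one simple way to consume the bit width, not the approach from either implementation in this thread.

```c
#include <stdint.h>

/* Hypothetical helper: number of significant bits in cp, at least 1.
 * __builtin_clz(0) is undefined, so OR in the lowest bit first. Bit 0
 * never changes the position of the highest set bit unless cp is zero,
 * and in that case the width becomes 1 instead of 0. */
static int bit_width_nonzero(uint32_t cp)
{
    return 32 - __builtin_clz(cp | 1);
}

/* UTF-8 length from the bit width. Widths 0 and 1 both fall in the
 * 1-byte range, so the OR above leaves the final length unchanged.
 * No validation: values wider than 21 bits also report 4. */
static int utf8_len(uint32_t cp)
{
    int w = bit_width_nonzero(cp);
    return 1 + (w > 7) + (w > 11) + (w > 16);
}
```

With this, utf8_len(0) is 1, the same as for any other ASCII code point, and the built-in is never called with a zero argument.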

You got some ideas turning in my head, and I came up with this:

https://github.com/skeeto/scratch/blob/master/misc/utf8_branchless.c

Compared to my usual encoder, the results were about what I expected, with 
the branching version faster in the typical ASCII-only case, but the new 
one faster if input occasionally has code points outside the ASCII range.