~skeeto/public-inbox

[PATCH][branchless-utf8] optimization: use unsigned char for lengths table

Details
Message ID
<20220626201828.rkvmvfjzzl2scqdm@gen2.localdomain>
DKIM signature
pass
Download raw message
Patch: +2 -2
while trying to investigate [this issue], i noticed that there was a
huge speed up from turning the lengths table to unsigned on my system
(amd 3700x, GNU/Linux, gcc v11.1.2, clang v13.0.0).

[this issue]: https://github.com/skeeto/branchless-utf8/pull/7#issuecomment-1163999593

	master:       ~650 MB/s
	uchar table:  ~705 MB/s

this made me curious on why there was such a drastic speed difference so
i started digging into the asm generated via:

	$ gcc -std=c99 -Wall -Wextra -O3 -m64 -masm=intel -S utf8.c

NOTE: i've removed the `static` qualifier from the function so that it
doesn't get optimized out and added some `restrict` qualifiers to the
pointers for better aliasing optimization.

turns out the only difference in the generated asm comes down to just
one instruction; it changes `movsx` to `movzx`.

	--- utf8-m64-signed.s	2022-06-24 15:03:07.765169221 +0600
	+++ utf8-m64-unsigned.s	2022-06-24 15:02:43.102171163 +0600
	@@ -19,10 +19,10 @@
	 	movzx	r9d, BYTE PTR 3[r10]
	 	shr	al, 3
	 	and	eax, 31
	-	movsx	rdi, BYTE PTR [rdx+rax]
	+	movzx	edi, BYTE PTR [rdx+rax]
	 	lea	rdx, err_extras.4[rip]
	-	movsx	r8, BYTE PTR [rdx+rax]
	-	movsx	rax, dil
	+	movzx	r8d, BYTE PTR [rdx+rax]
	+	movzx	eax, dil
	 	add	r8, rax
	 	movzx	eax, BYTE PTR 1[r10]
	 	add	r8, r10

it seems that the sign extension turns out to be costly on my system and
neither GCC nor Clang were smart enough to figure out that all the
values inside `lengths` are unsigned. thus, manually make the LUT
`unsigned`.

NOTE: `len` is also changed to unsigned, it doesn't make any difference
performance wise, but i suppose it shouldn't harm either.

P.S: on intel cpu (4790k, gcc v12.0.0), this doesn't seem to have much
effect: `579 MB/s` -> `581 MB/s` (which could've just been noise).
---

NOTE: Was hesitant on sending this since it seemed like a very system
specific optimization. But I also wanted to let you know about my
findings, so sending it regardless.

Weather to apply it or not is completely up to you.

 utf8.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/utf8.h b/utf8.h
index 8c6a7a0..a81f9be 100644
--- a/utf8.h
+++ b/utf8.h
@@ -25,7 +25,7 @@
static void *
utf8_decode(void *buf, uint32_t *c, int *e)
{
    static const char lengths[] = {
    static const unsigned char lengths[] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
    };
@@ -35,7 +35,7 @@ utf8_decode(void *buf, uint32_t *c, int *e)
    static const int shifte[] = {0, 6, 4, 2, 0};

    unsigned char *s = buf;
    int len = lengths[s[0] >> 3];
    unsigned int len = lengths[s[0] >> 3];

    /* Compute the pointer to the next character early so that the next
     * iteration can start working on the next character. Neither Clang
-- 
2.35.1
Reply to thread Export thread (mbox)