~sircmpwn/hare-dev

Performance of bytes::index(), potential issue w/ bytes::index_tw?

Message ID: <32c566cb-b858-4c6e-82fc-44e59ae313d6@bitfehler.net>
Hello,

Sorry, longish email. It is about the performance of Hare-generated 
code, specifically bytes::index() in this case. It would be amazing if 
someone with an interest in or experience with profiling/optimizing 
Hare code could chime in. I also CC'ed Bor, who it seems wrote the 
two-way search implementation, which plays a central role in this.

I'll skip all the background and just say: I became interested in the 
performance of bytes::index(), especially after I noticed that it uses 
different functions depending on the length of the needle to search for.

To get some numbers, I set up the following benchmark:

- Haystack: a text file containing "War and Peace" [1]
- Needle: "presses"

Why? The token "presses" first shows up at position 674784, so there 
are a good number of bytes to be searched. Additionally, there is a 
substantial number of prefix matches [2]:

- pre: 434
- pres: 272
- press: 208
- presse: 39

Prefix matches can have quite an impact on the runtime of naive matching 
algorithms, so it's great to have a bunch of those. I could of course 
have generated a bunch of gibberish with even more prefix matches, but 
I figured using "War and Peace" gives this a veneer of "real world" 
applicability ;)
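
For reference, counts like the ones above can be reproduced with a 
small loop along these lines (just a sketch, the helper name is made 
up; the actual numbers are from [2]):

use bytes;

// Count occurrences (including overlapping ones) of needle in haystack
// by repeatedly calling bytes::index() and advancing one byte past each
// hit.
fn count_matches(haystack: []u8, needle: []u8) size = {
	let n = 0z;
	let off = 0z;
	for (off < len(haystack)) {
		match (bytes::index(haystack[off..], needle)) {
		case let i: size =>
			n += 1;
			off += i + 1;
		case void =>
			break;
		};
	};
	return n;
};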

I then used a small test program [3] to measure the time it takes to 
find the first occurrence of "presses". With the length of the needle 
being > 4, bytes::index will use bytes::index_tw [4] for the task. I 
proceeded to add 3 other implementations [5] for a needle length > 4 
(the first two are sketched below):

- index_very_naive: the "brute force" method: at each position of the
   input, check if the next len(needle) bytes bytes::equal() the needle
- index_very_naive_inline: same as above, but inlining the for-loop that
   is bytes::equal()
- index_less_naive: re-uses the existing bytes::index_4bytes() [6] to
   find 4-byte prefix matches, then calls bytes::hasprefix(needle) on
   each match
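
Roughly, the first two look like this (a sketch based on the 
descriptions above, not the exact code from [5]):

use bytes;

// Brute force: at each position, compare the next len(needle) bytes
// against the needle via bytes::equal(). Every iteration passes two
// slice headers by value.
fn index_very_naive(haystack: []u8, needle: []u8) (size | void) = {
	if (len(needle) == 0 || len(needle) > len(haystack)) {
		return;
	};
	for (let i = 0z; i <= len(haystack) - len(needle); i += 1) {
		if (bytes::equal(haystack[i..i + len(needle)], needle)) {
			return i;
		};
	};
};

// Same idea, but with the comparison loop written out in place, so
// there is no function call and no subslice temporary in the hot path.
fn index_very_naive_inline(haystack: []u8, needle: []u8) (size | void) = {
	if (len(needle) == 0 || len(needle) > len(haystack)) {
		return;
	};
	for (let i = 0z; i <= len(haystack) - len(needle); i += 1) {
		let found = true;
		for (let j = 0z; j < len(needle); j += 1) {
			if (haystack[i + j] != needle[j]) {
				found = false;
				break;
			};
		};
		if (found) {
			return i;
		};
	};
};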

Now for the results. I'll only provide one result for each, but the
measurements have been pretty consistent for me, so they should be
representative:

$ ./test_index_tw
   16157495 µs (674784)
$ ./test_very_naive
   24713303 µs (674784)
$ ./test_very_naive_inline
    8399746 µs (674784)
$ ./test_less_naive
    5045460 µs (674784)

This seems a bit surprising to me. The two-way algorithm is faster than 
the "brute force" method, but it already gets beaten significantly by 
just "inlining" the call to bytes::equal(). I also fail to explain to 
myself how the inlined version can be so much faster. The most likely 
explanation would of course be "function call overhead", but then I 
wonder what would make the two-way algorithm so slow? It only uses 
rather simple operations, no function call in the main for loop, etc.

To make sure my test case wasn't accidentally biased, I picked another 
needle: "the first case it was necessary to renounce". The match is 
almost at the very end (byte 3202033), and the more naive versions have 
to go through 21878 prefix matches for "the " along the way. I picked a 
long needle because that is apparently where two-way search shines. 
However, the results were pretty similar.

At this point, I wondered if there is something wrong with the 
implementation of two-way search. However, it is pretty hard to 
proof-read. There are many variations, and I don't know what the 
"official" source of the current implementation is. So I did the 
next-best thing I could think of: I took the implementation from musl 
[7] and made a horrible but working Hare port [8]. Lo and behold, that 
implementation indeed outperforms _all others_.

Now, there are some obvious differences. For one, the musl 
implementation uses some sort of shift table (and one other lookup 
table), which take up a bounded but non-trivial amount of stack memory. 
Due to that, even with this implementation at hand, I cannot really say 
whether the current implementation has a bug, or whether these lookup 
tables are simply what makes it so much faster.

I guess, to make it easier to reply to this, my questions would be:

1. Is there a reference (code or pseudo-code) for the current
    implementation, that it could be checked against to see if it maybe
    has a bug?
2. If it is not a bug, what is the part that makes this implementation
    so slow?
3. Assuming the speed-up in the musl version was due to the lookup
    tables, is that something you would consider for inclusion in the
    stdlib? Or does it eat too much memory? (note that I would of course
    clean up the code!).
4. In case you'd prefer to avoid the lookup tables, and the current two-
    way implementation is indeed not buggy, would it maybe make sense to
    use one of the more naive approaches instead?
5. Two-way aside, can someone explain the difference between the naive
    and the naive_inline versions? Are function calls really that
    expensive? Or am I missing something?

Any pointers and insights would be much appreciated!

Thanks,
Conrad

[1]: 
https://github.com/mmcky/nyu-econ-370/blob/master/notebooks/data/book-war-and-peace.txt
[2]: https://paste.sr.ht/~bitfehler/c971007b64f5089ea1c6ca23d13dfacc19e2f64c
[3]: https://paste.sr.ht/~bitfehler/597c851bc6bcf11c15cf7b9e3f69c981535e9cf1
[4]: https://git.sr.ht/~sircmpwn/hare/tree/master/item/bytes/two_way.ha
[5]: https://paste.sr.ht/~bitfehler/e46006fb346250546bd9f2fd75b4512e7cb26e12
[6]: https://git.sr.ht/~sircmpwn/hare/tree/master/item/bytes/index.ha#L49
[7]: http://git.musl-libc.org/cgit/musl/tree/src/string/memmem.c
[8]: https://paste.sr.ht/~bitfehler/d0a7c2e4e71713c8adac4d8aed5ac50a5fae1676
Message ID: <CV7U645NX3TA.3B490OMZDIE3C@attila>
In-Reply-To: <32c566cb-b858-4c6e-82fc-44e59ae313d6@bitfehler.net>
Hi,

I'm quite busy with exams at the moment, so I won't be able to really take a
detailed look at the code for a couple of weeks at least, but here's what I
gathered from memory and old notes:

There are plenty of string matching algorithms available; which one is the
best depends heavily on the use case. For a stdlib string matching
implementation, the most important requirements are constant space and no
pathological inputs that would make the algorithm really slow. This latter
requirement is more important than optimal performance on "most common" inputs,
such as actual meaningful texts written in a human language.

So being able to beat the standard library implementation is to be expected:
the key feature of two way is that it's not possible to construct an input
that would make it run 10000x slower than it ran on "War and Peace", while it
is possible to do so for your functions (and, at least theoretically, for
every other algorithm that beats two way on common inputs as well).

That said, two way is still supposed to be a bit smarter than the naive
versions, so we definitely have some room to improve.

> To get some numbers, I set up the following benchmark:
>
> - Haystack: a text file containing "War and Peace" [1]
> - Needle: "presses"
>
> Why? The token "presses" shows up first at position 674784, so there is 
> a good deal of bytes to be searched. Additionally, there are a 
> substantial amount of prefix matches [2]:
>
> - pre: 434
> - pres: 272
> - press: 208
> - presse: 39
>
> Prefix matches can have quite an impact on the runtime of naive matching 
> algorithms, so it's great to have a bunch of those. I could have of 
> course generated a bunch of gibberish with even more prefix matches, but 
> I figured using "War and Peace" gives this a veneer of "real world" 
> applicability ;)

I think you'd need to pick a haystack and a needle with much more periodic
structure to get any real advantage from two way. In a 3 MB text, this is not
a substantial number of prefix matches. See the explanation below with some
numbers for the "the " case.

> I then used a small test program [3] to measure the time it takes to 
> find the first occurance of "presses". With the length of the needle 
> being > 4, bytes::index will use bytes::index_tw [4] for the task. I 
> proceeded to add 3 other implementations [5] for a needle length > 4:
>
> - index_very_naive: the "brute force" method: at each position of the
>    input, check if the next len(needle) bytes bytes::equal() the needle
> - index_very_naive_inline: same as above, but inlining the for-loop that
>    is bytes::equal()
> - index_less_naive: re-uses the existing bytes::index_4bytes() [6] to
>    find 4-bytes prefix matches, then do bytes::hasprefix(needle) on each
>    match
>
> Now for the results. I'll only provide one result for each, but the
> measurements have been pretty consistent for me, so they should be
> representative:
>
> $ ./test_index_tw
>    16157495 µs (674784)
> $ ./test_very_naive
>    24713303 us (674784)
> $ ./test_very_naive_inline
>     8399746 us (674784)
> $ ./test_less_naive
>     5045460 us (674784)
>
> This seems a bit surprising to me. The two-way algorithm is faster than 
> the "brute force" method, but it already gets beaten significantly by 
> just "inlining" the call to bytes::equal(). I also fail to explain to 
> myself how the inlined version can be so much faster. The most likely 
> explanation would of course be "function call overhead",

Yeah, passing slices around is slow, because it happens on the stack. So is
subslicing, which makes the problem twice as bad here.

Looking at the produced assembly, it's clear that there is an enormous amount
of copying overhead happening in very_naive.

> but then I 
> wonder what would make the two-way algorithm so slow? It only uses 
> rather simple operations, no function call in the main for loop, etc.
>
> To make sure my test case wasn't accidentally biased, I picked another 
> needle: "the first case it was necessary to renounce". The match is 
> almost at the very end (byte 3202033), and the more naive versions have 
> to go through 21878 prefix matches for "the " along the way. I picked a 
> long needle because that is apparently where two-way search shines. 
> However, the results were pretty similar.

Yeah, as I said earlier, the gain from two_way can only happen with lots of
periodicity and partial occurrences.

The fancy algorithm is only able to gain an advantage when there are a lot of
partial matches; in the absence of partial matches it still has to do a
comparison on every byte of the haystack at least once, just like the naive
version. About 7% of the bytes of the file are 't', and less than 0.7% of
4-byte strings are "the ", so there's not a lot of room for two_way to gain
any advantage, and with the length of the partial matches being so small, the
advantage is negligible. In fact, the main loop of two way is more
complicated, which does not help performance, and the logic that runs on
every occurrence of 't' is likely to confuse the branch predictor more in
two_way than in the naive algorithm, further complicating the comparison.


> At this point, I wondered if there is something wrong with the 
> implementation of two-way sort. However, it is pretty hard to 
> proof-read. There are many variations, and I don't know what the 
> "official" source of the current implementation is. So I did the 
> next-best thing I could think of: I took the implementation from musl 
> [7] and made a horrible but working Hare port [8]. Lo and behold, that 
> implementation indeed outperforms _all others_.

The source for the implementation is [1]. I modified it a bit based on what
Musl does iirc, but not much. There was no attempt to optimize it, because I
just wanted to have something better than the basic O(n*m) version as a
starting point, hoping to actually make it fast later. The stdlib was about 4 months
old at the time, and performance wasn't really a concern. I remember comparing
my version to Musl memmem and being somewhat able to tell they do the same
thing, so I'd say it's unlikely to contain actual bugs that make it slower,
though I'm sure there are plenty of microoptimizations to be made.


> Now, there are some obvious differences. For once, the musl 
> implementation uses some sort of shift table (and one other lookup 
> table), which take up a bounded, but non-trivial amount of stack memory. 
> Due to that, even with this implementation at hand, I cannot really say 
> if the current implementation has a bug, or if these lookup tables just 
> make it so much faster?

The tables are an optimization, a clever way to run some kind of modified
Boyer-Moore simultaneously with the main algorithm, which I never understood
well enough to confidently reimplement in Hare. I'm quite confident this is
what makes the Musl version so much faster; BM is one of the best performing
algorithms on "common" inputs. IIRC glibc does something similar, but it's
really hard to tell what is what in that mess.
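
For illustration, the plain "bad character" shift idea on its own looks
roughly like this; note this is ordinary Boyer-Moore-Horspool, not the exact
trick musl folds into two way, and the names are made up:

use bytes;

// A 256-entry table records how far the window may slide when a window
// ending in a given byte did not produce a match.
fn index_horspool(haystack: []u8, needle: []u8) (size | void) = {
	const n = len(needle);
	if (n == 0 || n > len(haystack)) {
		return;
	};
	let shift: [256]size = [0...];
	for (let b = 0z; b < 256; b += 1) {
		shift[b] = n;
	};
	for (let i = 0z; i + 1 < n; i += 1) {
		shift[needle[i]: size] = n - 1 - i;
	};
	let i = 0z;
	for (i + n <= len(haystack)) {
		if (bytes::equal(haystack[i..i + n], needle)) {
			return i;
		};
		// Skip ahead based on the last byte of the current window.
		i += shift[haystack[i + n - 1]: size];
	};
};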


> I guess, to make it easier to reply to this, my questions would be:
>
> 1. Is there a reference (code or pseudo-code) for the current
>     implementation, that it could be checked against to see if it maybe
>     has a bug?

It's taken from [1], with some modifications taken from Musl.


> 2. If it is not a bug, what is the part that makes this implementation
>     so slow?

I think a bug is unlikely, but far from impossible. I won't be able to take a
look at it myself until October, but my first step would be implementing the
BM optimization in the stdlib version of two_way and then comparing the
performance; that seems like the easiest way to check for sure whether it's a
bug or not.


> 3. Assuming the speed-up in the musl version was due to the lookup
>     tables, is that something you would consider for inclusion in the
>     stdlib? Or does it eat too much memory? (note that I would of course
>     clean up the code!).

Yes, I think that's worth the stack cost. It may even be possible to allocate
it statically. Hare is generally pretty stack-heavy, so 2kB is relatively even
less than it is in C.


> 4. In case you'd prefer to avoid the lookup tables, and the current two-
>     way implementation is indeed not buggy, would it maybe make sense to
>     use one of the more naive approaches instead?

No, even if we don't implement the lookup tables, we should keep using an
algorithm that doesn't have awful worst-case performance. I am not aware of
anything comparable or better than two way in this space.


> 5. Two-way aside, can someone explain the difference between the naive
>     and the naive_inline versions? Are function calls really that
>     expensive? Or am I missing something?

See the explanation above. It's also quite easy to see if you look at the
generated assembly yourself.


> Any pointers and insights would be much appreciated!

I found [2] (asked by Rich Felker) at some point _after_ I wrote this thing;
sadly there are no answers, but the question itself is quite interesting and
relevant.


Hope this helps,
Bor

[1]: http://www-igm.univ-mlv.fr/~lecroq/string/node26.html
[2]: https://stackoverflow.com/questions/3183582/what-is-the-fastest-substring-search-algorithm
Message ID: <CV8ZRGFF8PH2.1G8I7LU8KSMD2@eiger>
In-Reply-To: <CV7U645NX3TA.3B490OMZDIE3C@attila>
On Fri Sep 1, 2023 at 7:43 PM UTC, Bor Grošelj Simić wrote:
> > This seems a bit surprising to me. The two-way algorithm is faster than 
> > the "brute force" method, but it already gets beaten significantly by 
> > just "inlining" the call to bytes::equal(). I also fail to explain to 
> > myself how the inlined version can be so much faster. The most likely 
> > explanation would of course be "function call overhead",
>
> Yeah, passing slices around is slow, because it happens on the stack. So is
> subslicing, which makes the problem twice as bad here.
>
> Looking at the produced assembly it's clear that there is an enormous amount of
> copying overhead happenning in very_naive.

fyi i'm pretty sure that qbe generates really suboptimal code for
passing aggregates on the stack, and i suspect that this is the cause of
a lot of the perf issues we currently have in hare: i replaced the str
argument to rt::abort/rt::abort_fixed with *str in a wip patch of mine
and it shrunk hare binaries by roughly 10% (on the same order as
deduplicating paths in abort messages, which i also do in that patch).
that particular patch doesn't cause any meaningful perf improvements,
since aborts aren't actually hot, but if 10% of hare binaries are mov's
for passing a str into a function, i can't imagine this is doing good
things for actually hot codepaths that do similar things

> > 3. Assuming the speed-up in the musl version was due to the lookup
> >     tables, is that something you would consider for inclusion in the
> >     stdlib? Or does it eat too much memory? (note that I would of course
> >     clean up the code!).
>
> Yes, I think that's worth the stack cost. It may even be possible to allocate
> it statically. Hare is generally pretty stack-heavy, so 2kB is relatively even
> less that it is in C.

i'd also like to investigate this further at some point, hare's
excessive stack usage has somewhat concerned me in the past, since i
think it's likely to be indicative of deeper problems somewhere
Message ID: <5594a54d-c4e0-4c39-aa28-d97e068cb9df@bitfehler.net>
In-Reply-To: <CV8ZRGFF8PH2.1G8I7LU8KSMD2@eiger>
Thank you both for these great responses!

In the meantime, I had gained at least some of this insight myself by 
removing the lookup tables from the musl version and seeing its 
performance drop to the same level as the stdlib one. So I 100% agree 
that a lower-level issue is at play here. More inline.

On 9/3/23 06:18, Ember Sawady wrote:
> On Fri Sep 1, 2023 at 7:43 PM UTC, Bor Grošelj Simić wrote:
>>> This seems a bit surprising to me. The two-way algorithm is faster than
>>> the "brute force" method, but it already gets beaten significantly by
>>> just "inlining" the call to bytes::equal(). I also fail to explain to
>>> myself how the inlined version can be so much faster. The most likely
>>> explanation would of course be "function call overhead",
>>
>> Yeah, passing slices around is slow, because it happens on the stack. So is
>> subslicing, which makes the problem twice as bad here.

That is great to know, because it explains why my implementation of 
Aho-Corasick performs so poorly :)

>> Looking at the produced assembly it's clear that there is an enormous amount of
>> copying overhead happenning in very_naive.

I guess I will need to dust off my assembly skills, so feel free to let 
me figure it out myself, but I wonder: when you say subslicing happens 
on the stack and there is copying, do you mean that passing a subslice 
actually creates a copy on the new function's stack? I can modify 
passed-in subslices in functions, so are the modified versions then 
copied back over the original? Again, I guess I'll start poking around 
myself, but any pointers would help.

> fyi i'm pretty sure that qbe generates really suboptimal code for
> passing aggregates on the stack, and i suspect that this is the cause of
> a lot of the perf issues we currently have in hare: i replaced the str
> argument to rt::abort/rt::abort_fixed with *str in a wip patch of mine
> and it shrunk hare binaries by roughly 10% (on the same order as
> deduplicating paths in abort messages, which i also do in that patch).
> that particular patch doesn't cause any meaningful perf improvements,
> since aborts aren't actually hot, but if 10% of hare binaries are mov's
> for passing a str into a function, i can't imagine this is doing good
> things for actually hot codepaths that do similar things

Interesting. From what I understand (and I have not gone deeper than 
the Hare level yet, no C, qbe intermediate, or assembly), str is already 
a pointer to some []u8, but I suppose by aggregate type you mean that it 
is internally represented as, probably, a struct with, say, a size and a 
pointer to the data, or something? And since the compiler knows the size 
of slices, the same is true for them? And *str is a "real" pointer?

And, combining this with the questions above, this means that even 
creating a string/byte slice on the heap and passing subslices around 
will make the result very stack-heavy?

Again, I'll start answering some of that myself, but any help would be 
appreciated.

>>> 3. Assuming the speed-up in the musl version was due to the lookup
>>>      tables, is that something you would consider for inclusion in the
>>>      stdlib? Or does it eat too much memory? (note that I would of course
>>>      clean up the code!).
>>
>> Yes, I think that's worth the stack cost. It may even be possible to allocate
>> it statically. Hare is generally pretty stack-heavy, so 2kB is relatively even
>> less that it is in C.

Thanks, I guess I'll try to write a patch adding the lookup tables to 
the current implementation, and then this can be discussed by all 
interested parties.

> i'd also like to investigate this further at some point, hare's
> excessive stack usage has somewhat concerned me in the past, since i
> think it's likely to be indicative of deeper problems somewhere

It would indeed be great to improve this. I remember there was some talk 
of inline functions at some point; what is the state of that?

I'll also try to get a better understanding of qbe and its output, but 
it might take some time :)

Cheers and thanks a lot,
Conrad
Message ID: <CVAURZU87ON0.2BWNSC6B668TM@monch>
In-Reply-To: <5594a54d-c4e0-4c39-aa28-d97e068cb9df@bitfehler.net>
On Tue Sep 5, 2023 at 8:12 AM UTC, Conrad Hoffmann wrote:
> Interesting. From what I understand (and I have not gone deeper than 
> Hare-level yet, no C, qbe intermediate, or assembly yet) str is already 
> a pointer to some []u8, but I suppose by aggregate type you mean that it 
> is internally represented as probably a struct with say a size and a 
> pointer to the data or something? And since the compiler knows the size 
> of slices, the same is true for them? And *str is a "real" pointer?

str is internally represented the same as []u8 (a struct of data, size,
and a currently-unused-but-still-there-for-Reasons capacity)

> And, combining this with the questions above, this means that even 
> creating a string/byte slice on the heap and passing subslices around 
> will make the result very stack-heavy?

yeah, probably, at least as of now, with the caveat that i haven't
actually tested this. qbe should do a good job of reusing its stack
slots, but you'll still end up with a bunch of unnecessary copies

> > i'd also like to investigate this further at some point, hare's
> > excessive stack usage has somewhat concerned me in the past, since i
> > think it's likely to be indicative of deeper problems somewhere
>
> It would indeed be great to improve this. I remember there was some talk 
> of inline functions at some point, what is the state of that?

existed in a branch on harec at some point in time, wip patch here
https://lists.sr.ht/~sircmpwn/hare-dev/patches/34734. that being said,
i'm not sure inline functions are necessarily the way to improve this -
i'm hoping that a combination of fixing the below qbe issues, getting
rid of string capacity (and, in most cases, slice capacity as well), and
refactoring some apis to take pointers to aggregates rather than
aggregates themselves will improve things enough that we're not forced
to inline lots of stuff

> I'll also try to get a better understanding of qbe and its output, but 
> it might take some time :)

i'm pretty sure the thread to pull on for the qbe side of things is here
https://lists.sr.ht/~mpu/qbe/%3CCTY9XWX8FMWF.1XE7T7DO8CGQV%40taiga%3E
Message ID: <CVC2M1T6Y7S6.1TRWZ0DBPIKDD@attila>
In-Reply-To: <CV8ZRGFF8PH2.1G8I7LU8KSMD2@eiger>
> > > This seems a bit surprising to me. The two-way algorithm is faster than 
> > > the "brute force" method, but it already gets beaten significantly by 
> > > just "inlining" the call to bytes::equal(). I also fail to explain to 
> > > myself how the inlined version can be so much faster. The most likely 
> > > explanation would of course be "function call overhead",
> >
> > Yeah, passing slices around is slow, because it happens on the stack. So is
> > subslicing, which makes the problem twice as bad here.
> >
> > Looking at the produced assembly it's clear that there is an enormous amount of
> > copying overhead happenning in very_naive.
>
> fyi i'm pretty sure that qbe generates really suboptimal code for
> passing aggregates on the stack

It's not fast, but qbe can't get around abi requirements, so at least half of
the copying in the call to bytes::index(something[sub..slice], ...); is
unavoidable. The other part, the one where the subslice struct is created on
the caller's stack first, is also hard to avoid, but at least theoretically
possible.


> i replaced the str
> argument to rt::abort/rt::abort_fixed with *str in a wip patch of mine
> and it shrunk hare binaries by roughly 10% (on the same order as
> deduplicating paths in abort messages, which i also do in that patch).

I did some math and it turns out just passing a pointer in a register instead
of copying the entire struct saves about 30 bytes of instructions per
abort_fixed call. Multiplying by 10000, which is roughly the number of
abort_fixed calls in the binary produced by make check, gets us to about
300 kB, which is more than 5% of the test binary size. So your numbers make
sense, and at least half of your observed gain is unobtainable through
optimizations because of abi limitations.


> that particular patch doesn't cause any meaningful perf improvements,
> since aborts aren't actually hot, but if 10% of hare binaries are mov's
> for passing a str into a function, i can't imagine this is doing good
> things for actually hot codepaths that do similar things

agreed. On a similar note, I'm also worried about the speedup reported by
Autumn! when they optimized memcpy/move/set.
Message ID: <CVC2KDD8FYMU.241S5N8WBVKGD@monch>
In-Reply-To: <CVC2M1T6Y7S6.1TRWZ0DBPIKDD@attila>
On Wed Sep 6, 2023 at 7:10 PM UTC, Bor Grošelj Simić wrote:
> > fyi i'm pretty sure that qbe generates really suboptimal code for
> > passing aggregates on the stack
>
> It's not fast, but qbe can't get around abi requirements, so at least half of
> the copying in the call to bytes::index(something[sub..slice], ...); is
> unavoidable. The other part, the one where the subslice struct is created on
> the caller's stack first, is also hard to avoid, but at least theoretically
> possible.

right, makes sense. would trimming str down to 16 bytes (*[*]t + size)
help with this? my vague understanding is that that'd get passed in
registers rather than on the stack

> I did some math and it turns out just passing a pointer in a register instead
> of copying the entire struct saves about 30 bytes of instructions per
> abort_fixed call. Multiplying by 10000, which is roughly the amount of
> abort_fixed calls in the binary produced by make check, gets us to more
> than 5% of the test binary size. So your numbers make sense and at least half
> of your observed gain is unobtainable through optimizations because of abi
> limitations.

makes sense, good to know

> agreed. On a similar note, I'm also worried about the speedup reported by
> Autumn! when they optimized memcpy/move/set.

worried it's too high or too low?
Message ID: <CVC2W34K74PJ.3Y1W1MIG6Z2R@attila>
In-Reply-To: <5594a54d-c4e0-4c39-aa28-d97e068cb9df@bitfehler.net>
On Tue Sep 5, 2023 at 10:12 AM CEST, Conrad Hoffmann wrote:
> Thank you both for these great responses!
>
> I had in the meantime gained at least some of the insight provided by 
> removing the hash tables from the musl version and seeing its 
> performance drop to the same level as the stdlib one. So I 100% agree 
> that a lower-level issue is at play here. More inline.
>
> On 9/3/23 06:18, Ember Sawady wrote:
> > On Fri Sep 1, 2023 at 7:43 PM UTC, Bor Grošelj Simić wrote:
> >>> This seems a bit surprising to me. The two-way algorithm is faster than
> >>> the "brute force" method, but it already gets beaten significantly by
> >>> just "inlining" the call to bytes::equal(). I also fail to explain to
> >>> myself how the inlined version can be so much faster. The most likely
> >>> explanation would of course be "function call overhead",
> >>
> >> Yeah, passing slices around is slow, because it happens on the stack. So is
> >> subslicing, which makes the problem twice as bad here.
>
> That is great to know, because it explains why my implementation of 
> Aho-Corasick performs so poorly :)
>
> >> Looking at the produced assembly it's clear that there is an enormous amount of
> >> copying overhead happenning in very_naive.
>
> I guess I will need to dust off my assembly skills, so feel free to let 
> me figure it out myself, but I wonder: when you say subslicing happens 
> on the stack and there is copying, do you mean passing a subslice 
> actually creates a copy for the new function's stack?

A slice is a (*[*]something, size, size) tuple internally. Passing a slice to a
function will create a copy of it on the new function's stack as mandated by
the abi.

Passing a subslice like func(a[sub..slice]); is even worse, because the
subslice temporary object is first created in the current function's stack
space and then copied for the called function's stack. This intermediate
stage on the caller's stack is avoidable in theory, because the caller never
gets to use the intermediate value, but qbe would need to be much smarter to
detect that.
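
To make that concrete, the layout and the two ways of passing it look roughly
like this (illustrative names, not actual stdlib code; sizes assume a 64-bit
target):

// Roughly the shape described above.
type u8_slice_repr = struct {
	data: *[*]u8,
	length: size,
	capacity: size,
};

// Passing the slice by value copies the whole 24-byte header for the
// call, as the abi mandates for aggregates of this size.
fn by_value(s: []u8) size = {
	return len(s);
};

// Passing a pointer to it only moves 8 bytes, which fits in a register.
fn by_pointer(s: *[]u8) size = {
	return len(*s);
};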
Message ID: <CVC653E42P6N.18OFOYPFED332@attila>
In-Reply-To: <CVC2KDD8FYMU.241S5N8WBVKGD@monch>
On Wed Sep 6, 2023 at 9:08 PM CEST, Ember Sawady wrote:
> On Wed Sep 6, 2023 at 7:10 PM UTC, Bor Grošelj Simić wrote:
> > > fyi i'm pretty sure that qbe generates really suboptimal code for
> > > passing aggregates on the stack
> >
> > It's not fast, but qbe can't get around abi requirements, so at least half of
> > the copying in the call to bytes::index(something[sub..slice], ...); is
> > unavoidable. The other part, the one where the subslice struct is created on
> > the caller's stack first, is also hard to avoid, but at least theoretically
> > possible.
>
> right, makes sense. would trimming str down to 16 bytes (*[*]t + size)
> help with this? my vague understanding is that that'd get passed in
> registers rather than on the stack

Yep, that should help.

> > I did some math and it turns out just passing a pointer in a register instead
> > of copying the entire struct saves about 30 bytes of instructions per
> > abort_fixed call. Multiplying by 10000, which is roughly the amount of
> > abort_fixed calls in the binary produced by make check, gets us to more
> > than 5% of the test binary size. So your numbers make sense and at least half
> > of your observed gain is unobtainable through optimizations because of abi
> > limitations.
>
> makes sense, good to know
>
> > agreed. On a similar note, I'm also worried about the speedup reported by
> > Autumn! when they optimized memcpy/move/set.
>
> worried it's too high or too low?

It's too high. The reported gain on the test suite is 2.05 -> 1.76 seconds, a
14% improvement, from just optimizing those functions a bit! An enormous
amount of time would have to be spent in there for that to make sense. Upon
some more thought, it looks like the overwhelming majority of test suite time
is spent in curve25519::x25519basepoint, bcrypt::bcrypt and rsa::pkcs1, which
is probably not representative of real-world code, so I'm retracting my
concern until further measurements are performed. Unfortunately, I lost the
script that did `perf record` magic for me and converted the result into a
flamegraph, so I can't easily do that myself right now.