~skeeto/public-inbox

4 3

An easy-to-implement, arena-friendly hash map

Steadman Dillroper <dillstead@gmail.com>
Details
Message ID
<CAOJH3v5yZNXm2M_C6c3QhBngAyXtLUpfDdCOxB-WN+wWPvBwBg@mail.gmail.com>
DKIM signature
missing
Download raw message
Chris,

I just finished your article on the hash map and I was curious why you
didn't rotate the hash value? I see if you get to the bottom of the
trie you just start linking nodes off the first child creating a
linked list.  I assume rotation isn't really necessary if you have a
64-bit hash map because of the diminished chance of collision?

Also, you mentioned relative pointers and I was curious how you
implemented them in the context of arena allocation so I looked at
markov.c:

#define p32store(i, p) *(i) = (p) ? (p32)(((uptr)(p) - (uptr)(i))>>2) : 0
#define p32load(i)     (*(i) ? (void *)((uptr)(i) + ((uptr)*(i)<<2)) : 0)

I'm assuming you are shift by two because you'll never have an
alignment less than 4?  I quickly checked the code and that doesn't
appear to be the case.  It doesn't really save any more space, it just
allows you to "reach" farther with the same number of bits, no?

Steadman
Details
Message ID
<20231012013833.y2ywm6vbhe2mbydy@nullprogram.com>
In-Reply-To
<CAOJH3v5yZNXm2M_C6c3QhBngAyXtLUpfDdCOxB-WN+wWPvBwBg@mail.gmail.com> (view parent)
DKIM signature
missing
Download raw message
Good to hear from you again, Steadman!

> rotate the hash value?

If it's exhausted the hash — 32 levels deep — then it's a full, 64-bit 
collision. One of multiple so far. Rotating will not accomplish anything 
because it's reverting to a linked list regardless. Everything this far is 
rotating/shifting the same hash. Shifting is slightly simpler, so I picked 
that instead. NRK had initially proposed a rotation, and I did that myself 
until I noticed it didn't matter.

> because you'll never have an alignment less than 4?

Yes, relative pointers are 4-byte integers, and therefore 4-byte aligned, 
that point to at least 4-byte aligned objects. I shift out the always-zero 
lowest two bits to extend it to a 34-bit pointer that can reach 8GiB each 
direction. It's "saving space" in the sense that I can fit a 34-bit offset 
in a 32-bit integer.

With some difficulty and inconvenience I could make it relative to the 
containing map node, which is 8-byte aligned, and double the reach again 
to 16GiB each direction. If I make map nodes 16-byte aligned — which is 
probably still reasonable — I could double it yet again to 32GiB each way, 
which could then access all the physical memory of a good workstation. I 
stopped before taking it this far due to the debugging issues mentioned in 
my article.

In the macros notice I'm using uptr, not char*, to compute addresses. It's 
necessary for the shifts, but also because char* must follow pointer rules 
about comparing pointers from different objects. I made this mistake in an 
earlier experiment and quick ran afoul of GCC's optimizer.

> I quickly checked the code and that doesn't appear to be the case.

The arena ensures everything is properly aligned, and p32 pointers only 
point to two kinds of objects: "counts" and "markov" each a kind of hash 
trie, and therefore each pointer-aligned (8-byte aligned). That's greater 
than 4, so my 2-bit shift trick works. Always-zero bits are determined by 
min(src_align, dst_align).

If a relative pointer points at string data, I'd need to ensure the data 
is at least 4-byte aligned. Fortunately allocating such an aligned string 
using an arena is a piece of cake!

If you want to see the consequences of over-aggressively shifting out low 
bits, try changing the macros to 3-bit shifts and observe the results! It 
should quickly crash.

Also, thanks for taking a look at my Markov Chain / relative pointer demo!
Details
Message ID
<nl7tpap7w7gigmpypppimenogab2dfxvdbjzwcuka3fssivwl2@wga3iix2g3qr>
In-Reply-To
<20231012013833.y2ywm6vbhe2mbydy@nullprogram.com> (view parent)
DKIM signature
missing
Download raw message
> Rotating will not accomplish anything because it's reverting to a
> linked list regardless.

Yup, using rotation was a rough first idea. And after thinking about it
more, I also came to realize that it isn't doing anything compared to a
shift. And shifting is both simpler and it communicates the "degrade
into linked list" condition more clearly so that's what I use now too.

And while this might be obvious, it's worth noting that unlike a
(chained) hash-table, a full hash collision doesn't necessarily trigger
a linear linked list search for hash-tries.

In order for it to degrade, it needs to first exhaust all the nodes in
that "path". That's 16 nodes if you're using a 32bit hash (assuming
4-ary trie) and 32 nodes for 64bit hash. And for this reason, if you're
planning to inserts thousands of items, using a 64bit hash becomes more
important for hash-tries.

You could also keep count of how deep you've traversed and re-hash with
a new seed upon detecting full hash collision. But I've noticed that
with 64bit hashes and my synthetic tests, I couldn't trigger the linked
list condition without going OOM first.

But what if you're worried about malicious input? This also might be
obvious, but haven't been explicitly pointed out so far, you can deal
with this in the exact way hash-tables deal with it. By using a
cryptographic hash function (e.g siphash) with a secure seed.

So far, I'm fairly impressed with just how flexible the underlying
design of hash-tries really is. It seems to be pretty easy to adapt it
to many different use-cases.

- NRK
Steadman Dillroper <dillstead@gmail.com>
Details
Message ID
<CAOJH3v6=zr9bYRvyQf+Yrd7ZGu3_pZvihwZZghTDWxd1VSUogQ@mail.gmail.com>
In-Reply-To
<nl7tpap7w7gigmpypppimenogab2dfxvdbjzwcuka3fssivwl2@wga3iix2g3qr> (view parent)
DKIM signature
missing
Download raw message
Yeah, it's funny right after I sent the email to Chris I thought about
it a bit more and came to the same conclusion that both of you did.
Good point about collisions not degrading immediately into a linked
list search, I didn't think of that.

I like this data structure so much I decided to take up Chris' offer
and add linked list that maintains insertion order:
https://gist.github.com/dillstead/141e2e5a6f0d2e5386c3aa7d069cea9c

No matter what, I wanted insertion to be constant after I added the
element to the hash trie.  In order to do this I had to add an
additional pointer (beyond the next pointer added to each strset) and
I toyed with several different ways of doing it, e.g., pass an in/out
parameter to is_member.  Ultimately I decided I wanted to keep the
is_member interface unchanged so I wrapped is_member with code that
creates one additional strset that serves as the root.  The actual
hash trie starts at root's 0th child. In order to eliminate an extra
NULL check on insertion, I unioned the next pointer with a tail
pointer to pointer which I use for constant insertion to the list.  As
far as I can tell, the union is being used correctly and does not
exhibit undefined behavior.  Feel free to point out if I'm wrong on
that.

I also toyed with a few ideas for iteration but ultimately settled
with a set of simple macros that abstract away the additional root:

#define list_begin(x)  (x ? x->child[0] : x)
#define list_next(x)   (x ? x->next : x)
#define list_item(x)   (x ? &x->key : 0)

I smoke tested the code with a simple test, it worked, which is good
enough for me at this point.  I'm definitely going to add this to my
toolbox!

Shifting gears, really interesting point you bring up, Chris,
regarding using uptr not char * in your relative pointer
implementation.  It's obvious to me that you did this for shifts given
that char can be signed but running afoul of the optimizer is much
more subtle.  How did you figure this out?

Steadman


On Thu, Oct 12, 2023 at 11:40 PM NRK <nrk@disroot.org> wrote:
>
> > Rotating will not accomplish anything because it's reverting to a
> > linked list regardless.
>
> Yup, using rotation was a rough first idea. And after thinking about it
> more, I also came to realize that it isn't doing anything compared to a
> shift. And shifting is both simpler and it communicates the "degrade
> into linked list" condition more clearly so that's what I use now too.
>
> And while this might be obvious, it's worth noting that unlike a
> (chained) hash-table, a full hash collision doesn't necessarily trigger
> a linear linked list search for hash-tries.
>
> In order for it to degrade, it needs to first exhaust all the nodes in
> that "path". That's 16 nodes if you're using a 32bit hash (assuming
> 4-ary trie) and 32 nodes for 64bit hash. And for this reason, if you're
> planning to inserts thousands of items, using a 64bit hash becomes more
> important for hash-tries.
>
> You could also keep count of how deep you've traversed and re-hash with
> a new seed upon detecting full hash collision. But I've noticed that
> with 64bit hashes and my synthetic tests, I couldn't trigger the linked
> list condition without going OOM first.
>
> But what if you're worried about malicious input? This also might be
> obvious, but haven't been explicitly pointed out so far, you can deal
> with this in the exact way hash-tables deal with it. By using a
> cryptographic hash function (e.g siphash) with a secure seed.
>
> So far, I'm fairly impressed with just how flexible the underlying
> design of hash-tries really is. It seems to be pretty easy to adapt it
> to many different use-cases.
>
> - NRK
Details
Message ID
<20231031161431.aellqitcbbudbs3v@nullprogram.com>
In-Reply-To
<CAOJH3v6=zr9bYRvyQf+Yrd7ZGu3_pZvihwZZghTDWxd1VSUogQ@mail.gmail.com> (view parent)
DKIM signature
missing
Download raw message
> How did you figure this out?

(I just remembered I forgot to respond to this.) I wrote my first attempt 
in C++ using operator overloading in order to have nice ergonomics while I 
got the hang of relative pointers, and GCC's optimization punished me for 
misuse of char* pointer arithmetic:

https://lists.sr.ht/~skeeto/public-inbox/%3C20230902232504.hcihm5fdw6k7kejq%40gen2.localdomain%3E#%3C20230911223809.2rvs357olqfzt4d2@nullprogram.com%3E

While that was mostly a result of being cheeky in my test, using local 
variables rather than a proper arena, I expect it's necessary even with 
arenas when also using the "malloc" function attribute. It specifically 
declares that returned pointers are unrelated, and therefore cannot be 
compared using char*. So uintptr_t is really the only safe way to do it.
Reply to thread Export thread (mbox)