~skeeto/public-inbox

1

Re: The quick and practical "MSI" hash table

Details
Message ID
<CAENF2Wo3v9=BnzdjrMTbAVaXXdFmbptsbV_5VeTdA-08W185cg@mail.gmail.com>
DKIM signature
pass
Download raw message
Very interesting approach for hash table design.


> char *intern(struct ht *t, char *key)
> {
>     char *dest = 0;
>     // ...
>         if (!t->ht[i]) {
>             // ...
>             dest = dest ? dest : &t->ht[i];
>             *dest = key;
>             return key;
>         } else if (t->ht[i] == gravestone) {
>             dest = dest ? dest : &t->ht[i];
>         } else if (!strcmp(...)) {
>             // ...
>         }
>     // ...
> }

For deletion, I think the code provided might have some issues. In
particular, the variable "dest" should have type "char**" instead of
"char*" since it's a pointer to a string (char*).

Another issue is that we only update the table entry once we reach an
empty location (in the if-branch "!t->ht[i]"). While the code is
technically correct like this, it will not perform well if there are a
lot of gravestones (since we can only update and exit the loop once we
*eventually* reach an empty location). Since the variable "dest"
stores the earliest gravestone or empty spot we encountered while
probing, why can't we update when we find a gravestone (in the
if-branch "t->ht[i] == gravestone") as well?

In terms of the append-buffer approach to store keys (string), it
doesn't work very well for deletion. Since the buffer is
densely-packed, we cannot remove data from the buffer without
rebuilding it. I guess we can zero out the key we deleted in the
buffer, but that just leads to fragmentation problems over time.

For iteration, we might want to store a pointer to the next key (and
perhaps a back pointer as well) we inserted (to make it a linked
hashmap). This will probably perform better than just iterating
through the array directly when the table is large.

Overall, it's a very enjoyable read. I think this design really shines
for use cases involving only read and insert operations. Another
interesting topic would be how to persist a hashtable.

Re: The quick and practical "MSI" hash table

Details
Message ID
<20221031145720.tcfpygrm26364f3c@nullprogram.com>
In-Reply-To
<CAENF2Wo3v9=BnzdjrMTbAVaXXdFmbptsbV_5VeTdA-08W185cg@mail.gmail.com> (view parent)
DKIM signature
missing
Download raw message
> the variable "dest" should have type "char**"

Absolutely right! I've pushed a correction. I never compiled this code, so 
I never noticed.

> why can't we update when we find a gravestone (in the if-branch 
> "t->ht[i] == gravestone") as well?

The string may already be in the table, and there might be a gravestone 
ahead of it in the search. It's not until we find an empty spot that we're 
sure it's not already in the table, so we have to keep looking despite 
gravestones.

> but that just leads to fragmentation problems over time.

Yup, this may require a smarter approach. That depends entirely on the 
context which I omitted in my example. Since few applications need to 
delete table entries it's hard to imagine a practical context. That is, 
aside from a programming language implementation — essentially the worst 
case scenario since it really does need a general purpose hash table with 
all the bells and whistles.

In my expected use cases, all strings have the same lifetime anyway (i.e. 
arena allocation), even the "deleted" strings, and so any fragmentation 
from table deletion changes little.

> we might want to store a pointer to the next key (and perhaps a back 
> pointer as well) we inserted (to make it a linked hashmap).

Note that traversal order is dependent on the key. Even when two different 
searches land on the same slot, they will arrive from and will depart to 
different slots due to using different steps (aside from an actual hash 
collision). So such pointers would only be useful for a particular hash 
result.

It's possible to traverse backwards without a back pointer: subtract the 
"step" instead of adding it. The links already exist implicitly without 
needing to store them.

> Overall, it's a very enjoyable read.

Thanks!

> Another interesting topic would be how to persist a hashtable.

Good thinking! I should have written a section on this since it's an area 
where MSI hash tables shine. Being flat, they're trivial to serialize so 
long as you avoid pointers in the table. For example, instead of storing a 
string pointer, I could have stored an offset into the append-buffer. Then 
to persist, memory dump the table and the string buffer, which can later 
be loaded/mapped by a different instance/process. Since it would be 
position independent, multiple processes could even share it via shared 
memory, mapped to different addresses in each process.
Reply to thread Export thread (mbox)