Very interesting approach to hash table design.
> char *intern(struct ht *t, char *key)
> {
>     char *dest = 0;
>     // ...
>     if (!t->ht[i]) {
>         // ...
>         dest = dest ? dest : &t->ht[i];
>         *dest = key;
>         return key;
>     } else if (t->ht[i] == gravestone) {
>         dest = dest ? dest : &t->ht[i];
>     } else if (!strcmp(...)) {
>         // ...
>     }
>     // ...
> }
For deletion, I think the code provided might have some issues. In
particular, the variable "dest" should have type "char**" instead of
"char*", since it points at a table slot, i.e. at a string pointer
(char*).
Another issue is that we only write the new entry once we reach an
empty slot (in the "!t->ht[i]" branch). While the code is technically
correct this way, it will not perform well if there are a lot of
gravestones, since we can only write and exit the loop once we
*eventually* reach an empty slot. Since the variable "dest" stores the
earliest gravestone or empty slot we encountered while probing, why
can't we write as soon as we find a gravestone (in the
"t->ht[i] == gravestone" branch) as well?
As for the append-buffer approach to storing keys (strings), it
doesn't work very well with deletion. Since the buffer is densely
packed, we cannot remove data from it without rebuilding it. I guess
we could zero out a deleted key in the buffer, but that just leads to
fragmentation problems over time.
For iteration, we might want to store with each entry a pointer to the
next key we inserted (and perhaps a back pointer as well), making it a
linked hashmap. This would probably perform better than iterating
through the array directly when the table is large but sparsely
populated.
Overall, it's a very enjoyable read. I think this design really shines
for use cases involving only read and insert operations. Another
interesting topic would be how to persist a hashtable.
> the variable "dest" should have type "char**"
Absolutely right! I've pushed a correction. I never compiled this code, so
I never noticed.
> why can't we update when we find a gravestone (in the if-branch
> "t->ht[i] == gravestone") as well?
The string may already be in the table, and there might be a gravestone
ahead of it in the search. It's not until we find an empty spot that we're
sure it's not already in the table, so we have to keep looking despite
gravestones.
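Concretely, the corrected loop looks something like this hypothetical sketch (the table size, hash function, and gravestone sentinel here are stand-ins of mine, not the article's exact definitions): "dest" remembers the earliest reusable slot, but probing continues until an empty slot proves the key absent.

```c
#include <stdint.h>
#include <string.h>

#define EXP 12                       // table holds 1<<EXP slots (my choice)

static char sentinel;                // unique address marking deleted slots
#define gravestone (&sentinel)

struct ht {
    char *ht[1 << EXP];
};

static uint64_t hash64(char *s)      // stand-in FNV-style string hash
{
    uint64_t h = 0x100;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 1111111111111111111u;
    }
    return h;
}

char *intern(struct ht *t, char *key)
{
    uint64_t h = hash64(key);
    uint32_t mask = ((uint32_t)1 << EXP) - 1;
    uint32_t step = (uint32_t)(h >> (64 - EXP)) | 1;  // odd: visits every slot
    char **dest = 0;
    for (uint32_t i = (uint32_t)h;;) {
        i = (i + step) & mask;
        if (!t->ht[i]) {
            // Empty slot: the key is definitely absent, so insert at the
            // earliest gravestone we saw, or here if we saw none.
            dest = dest ? dest : &t->ht[i];
            *dest = key;
            return key;
        } else if (t->ht[i] == gravestone) {
            // Remember the first gravestone, but keep probing: the key
            // may still appear later in this probe sequence.
            dest = dest ? dest : &t->ht[i];
        } else if (!strcmp(t->ht[i], key)) {
            return t->ht[i];         // already interned
        }
    }
}
```

A real version would also bound the load factor, which guarantees the loop terminates.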
> but that just leads to fragmentation problems over time.
Yup, this may require a smarter approach, and it depends entirely on
the context, which I omitted from my example. Since few applications
need to delete table entries, it's hard to imagine a practical
context, aside from a programming language implementation: essentially
the worst case scenario, since it really does need a general-purpose
hash table with all the bells and whistles.
In my expected use cases, all strings have the same lifetime anyway (i.e.
arena allocation), even the "deleted" strings, and so any fragmentation
from table deletion changes little.
> we might want to store a pointer to the next key (and perhaps a back
> pointer as well) we inserted (to make it a linked hashmap).
Note that traversal order is dependent on the key. Even when two different
searches land on the same slot, they will arrive from and will depart to
different slots due to using different steps (aside from an actual hash
collision). So such pointers would only be useful for a particular hash
result.
It's possible to traverse backwards without a back pointer: subtract the
"step" instead of adding it. The links already exist implicitly without
needing to store them.
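To sketch that idea (helper names are mine): with a power-of-two table and an odd step, stepping forward and stepping backward are exact inverses modulo the table size, so no stored links are needed in either direction.

```c
#include <stdint.h>

// Successor of slot i in a key's double-hashed probe sequence.
uint32_t probe_next(uint32_t i, uint32_t step, uint32_t mask)
{
    return (i + step) & mask;
}

// Predecessor: subtract the step instead of adding it. Unsigned
// wraparound plus the mask makes this exact, no back pointer needed.
uint32_t probe_prev(uint32_t i, uint32_t step, uint32_t mask)
{
    return (i - step) & mask;
}
```

For any slot, probe_prev(probe_next(i, step, mask), step, mask) == i, since addition and subtraction are inverses modulo the table size.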
> Overall, it's a very enjoyable read.
Thanks!
> Another interesting topic would be how to persist a hashtable.
Good thinking! I should have written a section on this since it's an area
where MSI hash tables shine. Being flat, they're trivial to serialize so
long as you avoid pointers in the table. For example, instead of storing a
string pointer, I could have stored an offset into the append-buffer. Then
to persist, memory dump the table and the string buffer, which can later
be loaded/mapped by a different instance/process. Since it would be
position independent, multiple processes could even share it via shared
memory, mapped to different addresses in each process.
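A minimal sketch of the offset scheme (the struct layout and names here are my invention): slots store offsets into the append-buffer rather than pointers, so the table is position independent and a verbatim memory dump round-trips.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define EXP 12

// Position-independent table: slots hold offset+1 into buf (0 = empty),
// so the whole struct can be dumped and reloaded at any address.
struct flatht {
    uint32_t slots[1 << EXP];
    uint32_t buflen;
    char buf[1 << 16];           // append-buffer of NUL-terminated keys
};

// Translate a slot back into a string using the table's own base address.
static char *slotkey(struct flatht *t, uint32_t i)
{
    return t->slots[i] ? t->buf + t->slots[i] - 1 : 0;
}

// Append a key to the buffer and record its offset in slot i
// (no bounds checks, for brevity).
static void place(struct flatht *t, uint32_t i, char *key)
{
    uint32_t len = (uint32_t)strlen(key) + 1;
    memcpy(t->buf + t->buflen, key, len);
    t->slots[i] = t->buflen + 1;
    t->buflen += len;
}

// Persistence is a verbatim dump: no pointers, so no fixups on load.
int save(struct flatht *t, FILE *f)
{
    return fwrite(t, sizeof *t, 1, f) == 1;
}

int load(struct flatht *t, FILE *f)
{
    return fread(t, sizeof *t, 1, f) == 1;
}
```

The same property is what allows mapping the dump into several processes at different base addresses: every slot resolves relative to wherever the struct happens to live.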