I've recently noticed some of the activities on your scratch repo with
arena-based dynamic arrays (https://github.com/skeeto/scratch/blob/master/misc/slice.c)
and I had a couple comments (and questions) regarding that.
TBH I'm not too fond of leaving the old array in place. But reusing that
space would require a somewhat more sophisticated allocator, which would
likely lose the usability benefits of a linear allocator.
Instead of dynamic arrays, have you looked into "unrolled lists"?
I've played around with them a bit recently and some rudimentary
benchmarks show they're pretty competitive with arrays in terms of fast
iteration (although just today I noticed some oddities when trying to
use a non-power-of-two array size to avoid wasting 4 byte padding):
https://codeberg.org/NRK/slashtmp/src/branch/master/data-structures/u-list.c
Due to not having any relocation requirement, it plays more nicely with
linear allocators and has less memory overhead. Interacting with it is a
bit finicky though due to being a hybrid data-structure. I suspect some
type-safe generic macros can fix that issue but I haven't gone that
route (yet) since I'm still feeling it out.
I've also noticed that you prefer putting the arena metadata into the
arena itself instead of just returning it by value. Is there any
specific reason for it other than style?
And lastly, since we're talking about linear allocators and treaps play
nice with it: have you considered using a hash of the pointer as a
source of randomness?
(This is an idea I recently got from how hashing the for loop index can
serve as a fast and decent source of randomness when using OpenMP
parallel for as opposed to having the threads bottleneck on a single
PRNG state.)
So far, I've been using an explicit priority member, but that (a) requires
slightly more memory and (b) requires passing around the PRNG state. I
liked how u-config's treap stores the priority implicitly inside the key
itself but didn't like having to go through the O(n) hash every time.
From some preliminary testing, just an xmx permutation seems to work
quite well. You do give up determinism though (maybe that's what you
were going for by hashing the key?).
- NRK
I've got to read your code more often, because I learned some new-to-me,
unrelated techniques like your use of ASan poisoning, and enum for scoped
constants.
I should have elaborated more on my expected use cases for slices in the
header comment. Gradually accumulating long-lived stale arrays would be
suboptimal, almost as bad as a memory leak, and I expect that's why you
don't like the idea. Instead I see slices as being for short-lived array
construction when arrays are essential.
For example, imagine assembling shader data for scene rendering. OpenGL
needs arrays, not linked lists, and I probably need around one array per
shader attribute. I can start with empty vertex attribute slices, push as
many vertices into them as I need, hand the result to OpenGL, then reset
the arena. Or if I'm running multiple shader programs, I can reset the
slice lengths to zero and avoid all the regrowing for the next part of the
scene, then finally reset the arena (implicitly) at the end of the frame.
That probably uses no more memory than an allocator trying to reuse the
old arrays. A bespoke slice implementation could start larger or grow more
aggressively as needed.
(Consider: Old school immediate mode OpenGL essentially did this work in
the background for applications through glBegin/glEnd, though it turned
out to be a performance bottleneck crossing in and out of OpenGL so much.)
Perhaps the old array should be poisoned, which wouldn't be difficult to
arrange. However, my slice experiment is obviously influenced by Go, which
similarly leaves behind the old array on resize. Accidental reuse of the
old array has just never been a problem for me in Go. (This is the same
story for why I've stopped bothering with pointer-to-const in C.)
I'll have to think more on unrolled linked lists. They seem to fall into a
space where they rarely have the right trade-offs. Either a linked list is
sufficient, or it's so insufficient that even an unrolled one won't do.
If you follow the "git blame" you'll see my slices experiment began as
Sean Barrett-style "stretchy buffers". Length/capacity is stored just
before the data, the whole thing is held by the start of the data, and
null pointers are a special case of empty. That wasn't the first time I
attempted to combine this concept with linear allocation, but as before I
wasn't satisfied with the results. Parts felt hacky, more complicated than
I liked (preferably it's simple enough I can quickly work it out again
from scratch, so it's more concept than library), and alignment required
compromises that made the edge cases dominate the implementation. Slices
are way simpler, and the macro is mainly there to smuggle compile-time
type information.
> I've also noticed that you prefer putting the arena metadata into the
> arena itself instead of just returning it by value.
I've been working out some basic rules of thumb to guide decisions that
are otherwise mostly arbitrary. In fact, I've been sitting on a new, not
yet announced libc replacement that captures the things I keep rewriting
(arena, buffered output, typedefs, etc.) so that, at the very least, I can
start new, small projects from a good foundation. Working through it has
also helped me refine these rules-of-thumb in order to justify their use.
The rule in this case: Except for small "out" parameters (e.g. frexp),
which exist because C can only return one value, do not pass pointers to stack
objects. If I pass an arena expecting the callee to allocate the result
out of it (i.e. I can't just pass a copy), it's not an out parameter. If I
repeatedly pass it to many different functions, it's a long-lived object
and so should be on the "heap."
Justification: It's tempting to use the stack beyond local variables, as a
source of fast, automatically managed allocation. (I'm guilty, too, which
is how I came up with the rule!) That's why alloca is popular, and it's
the primary use of VLAs. Both are bad. A scratch arena is superior and
should be used the moment this temptation arises. It's amazing how much it
simplifies. Taking the address of a stack variable is a sign that you may be
misusing the stack for allocation. So consider: Is this really an out
parameter, or am I abusing the stack? (There's an interesting analogy here
to LLVM SafeStack.)
This is just a rule of thumb, and so of course there are more exceptions
beyond small out parameters.
Another rule: The sizeof operator should not appear outside of allocator
code. (Notable exception: Some APIs like Win32 require its use as a kind
of struct versioning, so it can't be helped.) Justification: Computing
sizes is error-prone and mistakes have an outsized impact. It's a common
source of high profile bugs in C programs. This job should be pushed as
much as possible into code specialized for the task. The C standard
library has utterly failed in this regard. If you're tempted to use
sizeof, you most likely instead need a countof macro. Talking about object
counts (e.g. number of elements in an array) is fine.
A lot of the others should already be familiar: signed counts/subscripts,
avoid null terminated strings, zero-initialization by default. I'm still
working out rules about arena usage, e.g. pass scratch arena by copy, swap
arenas back and forth down the call stack. Also still working through the
idea of separating octets (non-aliasing, uint8_t/u8, UTF-8 strings) from
bytes (aliasing, unsigned char, "raw memory" in allocators), though with
current compilers it makes no practical difference. Finally, also some
rules about Win32 usage that stand out from convention.
> using a hash of the pointer as a source of randomness
That's a good idea! You're right about u-config.
When it comes to linear allocators, trees are a much better fit for
associative arrays than hash tables, which have all the discussed array
difficulties. It's too bad that implementing them is finicky, even the
relatively simple treap.
> You do give up determinism though (maybe that's what you were going for
> by hashing the key?).
Not deliberate, but a consequence of a hash table later turned into a
treap. Determinism is handy for debugging, though, and the benefits of
randomization (breaking pathological inputs) don't apply for u-config.
Thanks for the example patch!
On Sat, Sep 02, 2023 at 11:00:43PM -0400, Christopher Wellons wrote:
> Instead I see slices as being for short-lived array construction when
> arrays are essential.
That makes more sense. I was thinking of it being a long-lived object.
> If I repeatedly pass it to many different functions, it's a long-lived
> object and so should be on the "heap."
I've tried this out a bit recently and so far it doesn't seem too
appealing to me, so I'll likely drop it. Although I used a slightly
different variant that gets rid of the buf/mem member and just stores an
offset and end pointer: `typedef struct { u8 *off, *end; } Arena;`
> It's too bad that implementing them is finicky, even the relatively
> simple treap.
This is something that's been bothering me as well. I've spent the last
couple of days toying around with different ideas and figured out two
different techniques that I'm quite happy with.
Before that though, here's another cool trick I figured you can do with
node pointer hashes:
static uintptr_t
hashptr(void *p)
{
    uintptr_t m = 0x1337ACDCBABE5AAD;
    uintptr_t h = (uintptr_t)p - (uintptr_t)(void *)0;
    h ^= h >> (sizeof h * 4);
    h *= m;
    h ^= h >> (sizeof h * 4);
    return h - 1;
}
This maps hashptr(NULL) to UINTPTR_MAX so you can drop the null check
from the while loop. It doesn't really make any practical difference but
little tricks like these are quite fun :)
With that out of the way, here's my current "default" search data
structure that doesn't require resizes: if the key idea behind a treap
is to achieve a balanced tree via well-distributed numbers (obtained
either from a PRNG or a hash function), then why not take this one step
further and make the BST's order itself depend on a well-distributed
number?
In other words, make the BST ordered by the hash of the key instead of
the string itself. And if we do that, then we can just get rid of the
heap shenanigans entirely since the BST itself now has a high
probability of being fairly balanced. Here's the crux of the code:
u64 keyhash = hash(key);
HTHNode **insert = t;
while (*insert != NULL) {
    u64 phash = insert[0]->hash;
    switch ((keyhash > phash) - (keyhash < phash)) {
    case  0: if (str_eq(key, insert[0]->key)) return insert[0]; // fallthrough
    case -1: insert = insert[0]->child + 0; break;
    case +1: insert = insert[0]->child + 1; break;
    default: ASSERT(!"unreachable"); break;
    }
}
// insertion code here
Fairly straightforward and non-finicky code. From some testing and
benchmarking, I found that the depth of the tree doesn't really get any
bigger than a treap's.
But there's no active effort to re-balance, and if that worries you, we
can follow the theme of randomization and have an array of trees, one of
which gets picked by the hash. It takes a minimal amount of modification
to support:
+ #define HTH_SLOT 16
- HTHNode **insert = t;
+ HTHNode **insert = t + (keyhash & (HTH_SLOT - 1));
And with that we've come full circle back to a hash table, although a
different kind: one that doesn't resize and uses a hash-ordered BST to
resolve collisions. In other words, it's a "hash table of hash trees,"
which I'm calling HTHTree for short.
From some benchmarking, it seems to always perform better than a treap
while at the same time being significantly simpler. Speaking of treaps,
my 2nd technique is still a treap, but with much simpler and less
error-prone re-balancing logic:
https://codeberg.org/NRK/slashtmp/commit/09699d8c6952b2e579de529bcfd2d57b854c6d88
Also, while I was looking to see if there's any existing data structure
like HTHTree, I found out that the individual concepts here aren't new
(hash-ordered trees/tries, BSTs for collision resolution, etc.), but so
far I haven't quite found anything that combines these concepts into a
single data structure.
(The closest I found was the hash-array-mapped-trie: https://en.wikipedia.org/wiki/Hash_array_mapped_trie
An interesting data structure, but implementing it is way too finicky:
https://codeberg.org/NRK/slashtmp/src/branch/master/data-structures/hamt.c).
Right now I'm in the process of polishing up my (currently private)
benchmarking code and figuring out how to graph the result. But I
figured it'd be nice to get some "external" feedback first in case
there's something obvious (or non-obvious) I'm missing.
- NRK
> we can just get rid of the heap shenanigans entirely
Nice, I like it! It strikes a nice balance between ease of implementation
and performance. Fast enough when it's not on a performance critical path,
but simple enough to whip up from scratch. I played around with it here:
https://github.com/skeeto/scratch/blob/master/misc/treap.c
I wanted to get a feel for tree shapes, especially for highly correlated
keys, so it prints a Graphviz "dot" graph for inspection. Perhaps that
solves your graph question. In the future I can use this to tune a key
hash to do the minimum possible work without throwing the tree out of
whack. I didn't follow through to HTHTree, but it looks like a reasonable
extension of the idea.
Sample tree graph: https://i.imgur.com/Au6hYYZ.png
I also couldn't help but try out some other ideas, including your new
arena layout. Once it's down to just two pointers, it feels more like a
span "value" and less like a stateful "object".
I think I misunderstood your "HTHTree" concept, which on revisit seems to
mean only the top level is a small hash table. That would reduce the depth
by a fixed amount, a big win for smaller trees. However, I interpreted it
as the "child" field being a small hash table, with a branch chosen based
on the key hash.
As you had written (since it's not what you intended), that wouldn't work
because keys aren't distributed going down the tree. So I added an MCG to
remix the key at each level, and I'm really happy with the results. It
substantially reduces the tree depth at some memory cost. At n=2 it's
really no different than the pre-HTH treap, n=4 is a significant speed
boost, and n=8 is barely faster with a substantial increase in memory use,
so n=4 is (usually?) the only choice that makes sense.
In effect, it's a 4-ary trie where elements are truncated MCG outputs
seeded from the key, and nodes are deposited in the first free spot. The
MCG maintains treap-like randomized balance. My "treap.c" has all these
new changes, including dot rendering of the hash-treap.
With a hash selecting the branch, hashes aren't used in comparisons, so I
don't need to store them in the tree. Overall that means it only costs
one more pointer-worth of memory per node. This is one of those moments I
wish C, even if just as a GCC extension, had a concept of "large" and
"small" pointers, like "near" and "far" on 16-bit machines. On a 64-bit
host I could place my arena in the low 4GB of the address space, then tell
the compiler that tree node pointers are "small" 32-bit pointers, saving a
lot of memory. As far as I know, there are only three ways to do this today:
(1) Use 32-bit integers and cast to/from pointers. Noisy, especially the
extra casting necessary to shut up warnings. (A macro or helper function
could hide the noise.) By taking advantage of alignment (e.g. the low 3
bits are always zero because a node contains a 64-bit pointer) "small"
pointers could point anywhere in the first 32GB of memory.
(2) Use 32-bit offsets relative to the arena (base), and combine with the
base to derive a real pointer. Reduces casting noise, only puts a limit on
the arena size, not its location, but requires passing the base pointer
around. With some cleverness in placing the arena (e.g. align the whole
arena to its capacity), the program could always rediscover the base by
masking any arena-allocated pointer. Like (1), exploiting node alignment
extends this to a 32GB arena.
(3) Use 32-bit offsets relative to the current child. The root is still
held by a full pointer. As before, when offsets address larger units than
bytes, it extends to at least 32GB. (Caveat: It's more complex than simply
treating the arena as an array of nodes, as nodes aren't "node-aligned"
relative to each other, just pointer-aligned.) This option sounds the most
appealing to me.
All cases require trade-offs (e.g. all nodes in a tree allocated from the
same arena, specialized arena placement, etc.) that generally aren't worth
it. The Linux x32 ABI was kind of headed in this direction, retrofitting
existing software with these properties, but it was never quite popular
enough to receive decent support.
> seems you mean only the top-level is a small hash table.
Yup, I noticed that yesterday but didn't have the time to reply -- looks
like you've noticed it on your own! For some context: supporting the
hash table (at the root) was just a matter of a simple one-line change.
But it provided large speedups for small N (< 1K) -- the common case --
making it closer to the O(1) of a hash table.
But for really large N (roughly 100K or above) the "acceleration"
provided by the root level hash-table becomes negligible because the
O(log N) tree dominates.
Your idea of embedding a "hash-table" at the leaves instead of the root
makes it much closer to a "hash-array-mapped-trie" (HAMT) except that:
* HAMT traverses the trie by "peeling off" the hash bits
* it turns into a linked list if it has traversed the entire trie (i.e. a
hash collision) without reaching a conclusion.
Your idea, on the other hand, gets rid of special case (2) and traverses
based on a permutation of the hash rather than by peeling it off.
Effectively you've vastly simplified HAMT while still employing some of
its underlying concepts. I like it! And this is probably going to be my
new default choice.
When I was playing around with HAMT, I implemented it by following
Wikipedia a bit too literally, which resulted in a finicky and
larger-than-I'd-like implementation. The performance was good, but the
memory overhead and implementation size made me not investigate it
further.
From my casual testing:
* this consistently outperforms the HAMT implementation I had by about
6% while consuming about half the memory
* it also outperforms HTHTree by nearly 2x but consumes about 1.2x more
memory
* it outperforms the simplified "parent-less treap" by 3x but consumes
about 1.6x more memory.
I've published my benchmarking code on a WIP branch in case you want to
take a look at it:
https://codeberg.org/NRK/slashtmp/src/branch/search-bench/data-structures/search-tree-benchmark.c

> Overall that means it only costs one more pointer-worth of memory per
> node. This is one of those moments I wish C, even if just as a GCC
> extension, had a concept of "large" and "small" pointers, like "near"
> and "far" on 16-bit machines. On a 64-bit host I could place my arena
> in the low 4GB of the address space, then tell the compiler that tree
> node pointers are "small" 32-bit pointers, saving a lot of memory.
I *did* think about pointer compression. But ultimately didn't use it
for my testing because:
* I was more interested in figuring out underlying concepts and ideas
than a specific implementation.
* I was looking for something simple and straightforward (read: not
error-prone) that can be rolled from memory easily.
* Back when I *did* play around with self-relative pointers [0] it was way
too finicky to do in C, even with some "helper macros".
* If space efficiency is a concern then there's likely a more suitable
data structure for it, one designed with space efficiency in mind.
The last point is more of a speculation, so it might turn out to be
wrong, but the other three were good enough reasons to not bother with
pointer compression. I do think it's a trick worth keeping in mind in
case you need it. (One other benefit of self-relative pointers is that
you can dump them to disk and load them back without needing to
mmap/VirtualAlloc at a specific hardcoded address.)
[0]: https://www.gingerbill.org/article/2020/05/17/relative-pointers/#self-relative-pointers
- NRK
> I've published my benchmarking code on a WIP branch
Be mindful about benchmarking hash tables with random keys. If keys are
already uniformly distributed, it's not *really* testing the hash table's
ability to handle a lumpy, more realistic distribution. That's why I like
benchmarking/testing with monotonic keys, which are realistic and a kind
of non-pathological worst case.
> self-relative pointers [0]
That Jon Blow video was great! I hadn't seen that one before. Perhaps
that's for the better because 6 years later Jai is still behind closed
curtains. The example in the article had me itching to try out relative
pointers using C++ operator overloading. However, that implementation was
unsatisfying and incomplete, so I started over and came up with this:
https://github.com/skeeto/scratch/blob/master/misc/ptr.cpp
Getting all those operators right is finicky, but it was enough to let me
try out relative pointers using familiar syntax. They really are tricky,
though perhaps with practice they're as natural as normal pointers. When
copying objects containing relative pointers, I've got to pause to think
if the destination is appropriately nearby. Considering temporaries, I
suspect C++ semantics are not actually strict enough to use these properly
(e.g. in neither C nor C++ is the stack more than an implementation
detail).
Relatedly, my code doesn't work with GCC starting at -O1. It optimizes
away my "nodes[N-1].next = 0" so that the linked list runs off the end,
but I don't see why GCC thinks that's a valid optimization. At -O3 it
optimizes away nearly all relative pointer stores. Neither Clang nor MSVC
do these "optimizations," and it works fine under both. Perhaps a GCC bug?
It probably wreaks havoc with its pointer provenance analysis, making it
think nodes beyond nodes[0] are unused.
Figured out the GCC thing! Just needed to sleep on it. Turning it over in
my mind, I realized I took a bad shortcut not actually allocating out of
an arena, and so my pointer arithmetic was illegal. Changing it all to
uintptr_t fixed it. Lesson learned.
Following up on another experiment with compressed pointers:
https://github.com/skeeto/scratch/blob/master/misc/markov.c
Markov chain text generation was a useful test for hash-tries. When I've
built implementations before, dealing with nested maps felt fiddly. They
need to grow with the corpus, and each layer maps over different types. In
Go terms for an N-gram chain: map[[N]string]map[string]int. Hash tries
proved to be so easy that I had no trouble at all.
My initial implementation used plain pointers, and I got it up and running
quickly. (In retrospect I should have checked that in first.) However, I
was disappointed with memory use. I could only load ~512MiB of text before
exhausting physical RAM. Next I tried compressed relative pointers (named
p32) for everything but strings, and I exploited alignment to extend the
range to 8GiB. I can load ~640MiB of the same corpus, but then I hit that
8GiB limit. It's nice that I'm no longer exhausting RAM, but it's little
more capable than before! Oh well.
Once I got the macros worked out, which *was* a bit tricky, I was able to
convert the code over to compressed pointers almost mechanically. Working
out the C++ operators was the practice I needed to use them here, as that
put the necessary concepts in my head, particularly when copying pointers.
Unfortunately, debugging in the presence of compressed pointers is not fun
at all, and that basically killed them for me unless I'm also prepared to
build custom tooling to go with it (e.g. the debugger built into Handmade
Hero). I'd been using GDB to examine the Markov chains by hand, but after
converting over I can no longer do that without great pain.
The next step would be to copy tokens into the arena, with interning, so
that I could use compressed string pointers (and lengths), too. I suspect
the additional overhead would eat into most of the savings.
More hash-trie news: I wanted to try out a generic Go implementation, then
benchmark it against its built-in map type. In my benchmarks it's 1.5x to
2x faster depending on the key type and hash function! Memory footprint is
1.3x to 2x larger (at worst), depending on the number of entries. Only ~30
lines of code, and easy to write. Your rotation idea was important, as an
MCG is far slower in this implementation.
https://github.com/skeeto/scratch/blob/master/misc/hashtrie_test.go
Linear allocation isn't popular in Go (though it's being seriously
investigated [1]), but this little implementation could easily adapt to
allocate from a given arena, too.
[1]: https://github.com/golang/go/issues/51317
On Sat, Sep 23, 2023 at 04:32:18PM -0400, Christopher Wellons wrote:
> Following up on another experiment with compressed pointers:
> https://github.com/skeeto/scratch/blob/master/misc/markov.c
> Markov chain text generation was a useful test for hash-tries.
Funny timing, I was thinking of writing a markov chain with a hash-trie
as well :-) (prompted by reading the 3rd chapter of "The Practice of
Programming", which used a chained, fixed-size hash table that I wasn't
impressed with). Unfortunately, I haven't found the time to do it yet.
> It's nice that I'm no longer exhausting RAM
One other compression technique I've thought about was using tagged
pointers for the "child pointers". This comes from the observation that
there will be a bunch of "leaf nodes" with no children. So with a tagged
pointer, you can avoid allocating the node by embedding the key into the
parent's pointer.
Unfortunately, this is quite finicky because pointers are 8 bytes
whereas (sized) string keys are 16 bytes. It probably won't work for
key-value pairs either.
And the space saving will be proportional to the number of leaf nodes
whereas with compressed pointers the saving is proportional to the
number of total nodes. And so I didn't investigate this idea any
further. But putting this out here since it might be a worthwhile thing
to do for smaller/integer keys.
> Unfortunately, debugging in the presence of compressed pointers is not
> fun at all
I tend to use duel [0] for examining trees, linked lists, etc. inside of
GDB, and compressed pointers end up killing that too. It gets to the
point where hand-writing a custom "print/dump" function starts becoming
more efficient and less annoying than trying to use GDB.
Speaking of pointer compression, I tried it out on my existing (string
key) benchmark and, unlike your previous int-key benchmark, it didn't
show any slowdown even at small Ns. But similar to the int-key bench, it
became faster at large Ns.
I suspect this is because (a) my benchmark uses same-size strings, so
the equality check cannot bail early on a length mismatch, and (b) about
half the strings are near duplicates, which means more bytes are being
compared.
And due to the above, the string comparison (probably) ends up hiding
the pointer decoding overhead. But the integer comparison was fast
enough that the pointer decoding became the bottleneck. Storing the hash
to short circuit the string equality check also speeds things up in my
benches probably for the same reasons.
The markov chain might be a more realistic benchmark to see how much
"optimizations" like short-circuiting the string check via the hash
actually matter in non-synthetic situations.
> More hash-trie news: I wanted to try out a generic Go implementation [...]
Good to know that the concept can be implemented efficiently and easily
in other languages as well.
> Linear allocation isn't popular in Go
Interesting to see arenas being investigated in a GC-ed language. Kind
of reminds me of this clip [1] where people end up basically doing
manual memory management in GC-ed languages anyway for performance
reasons.
[0]: https://mariadb.org/duel-gdb-vs-linked-lists-trees-hash-tables/
[1]: https://youtube.com/watch?v=tK50z_gUpZI&t=400s
- NRK
Interesting, I hadn't heard of duel before! Now I really wish that patch
got merged all those years ago, as I'm unhappy with GDB generally moving
towards implementing newer features in Python. Python is a bad fit as an
extension language.
Thanks for the updates on your pointer decoding experiments. That all
sounds about right.
I had another insight last night, and then had some time today to try it
out: It's nearly trivial to make hash-tries thread-safe. You just need an
atomic acquire for loading child pointers, and a release compare-and-swap
when setting child pointers. Keys don't change after construction, so they
don't need atomic access, and child references are permanent once set, so
the trie doesn't change shape underneath the thread.
https://github.com/skeeto/scratch/blob/master/misc/concurrent-hash-trie.c
In my demo, each thread has an arena, so there's no locking when
allocating new nodes. It saves an arena snapshot so that it can roll back
when the CAS fails. In a real program, upsert() might set the value before
insertion (requiring a change in interface), so that the initial key+value
is inserted atomically.
My demo doesn't read values until after the join, which synchronizes the
value stores. If the value is, say, a shared counter, then the current
interface is good enough because threads would atomically increment them
anyway.
I'm quite happy with how it turned out. A fast, concurrent, unordered map
in ~30 lines of code ain't bad at all!