~skeeto/public-inbox


Re: [scratch] misc/slice.c

Details
Message ID
<20230902232504.hcihm5fdw6k7kejq@gen2.localdomain>
DKIM signature
missing
Download raw message
I've recently noticed some activity on your scratch repo with
arena-based dynamic arrays (https://github.com/skeeto/scratch/blob/master/misc/slice.c)
and I had a couple of comments (and questions) about it.

TBH I'm not too fond of leaving the old array in place. But reusing that
space would require a somewhat more sophisticated allocator, which would
likely end up losing the usability benefits of a linear allocator.

Instead of dynamic arrays, have you looked into "unrolled lists"? I've
played around with them a bit recently, and some rudimentary benchmarks
show they're pretty competitive with arrays in terms of fast iteration
(although just today I noticed some oddities when trying to use a
non-power-of-two array size to avoid wasting 4 bytes of padding):
https://codeberg.org/NRK/slashtmp/src/branch/master/data-structures/u-list.c

Due to not having any relocation requirement, it plays more nicely with
linear allocators and has less memory overhead. Interacting with it is a
bit finicky though due to being a hybrid data-structure. I suspect some
type-safe generic macros can fix that issue but I haven't gone that
route (yet) since I'm still feeling it out.
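To make the shape concrete, here's a minimal unrolled-list sketch (names
and sizes are mine, not taken from u-list.c): each node holds a small
fixed-size chunk of elements plus a next pointer, so nodes never
relocate and iteration walks contiguous elements. A static pool stands
in for a linear allocator.

```c
#define CHUNK 15  // non-power-of-two, chosen to illustrate the padding point

typedef struct UNode UNode;
struct UNode {
    UNode *next;
    int    len;          // elements used in this chunk
    int    elems[CHUNK];
};

// Stand-in for a linear allocator: a static pool of zeroed nodes.
static UNode pool[64];
static int   npool;
static UNode *unode_alloc(void) { return &pool[npool++]; }

// Push onto the head chunk, starting a new chunk when the head is full.
static UNode *upush(UNode *head, int v)
{
    if (!head || head->len == CHUNK) {
        UNode *n = unode_alloc();
        n->next = head;
        head = n;
    }
    head->elems[head->len++] = v;
    return head;
}

static int ucount(UNode *head)
{
    int n = 0;
    for (; head; head = head->next) n += head->len;
    return n;
}
```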

I've also noticed that you prefer putting the arena metadata into the
arena itself instead of just returning it by value. Is there any
specific reason for it other than style?

And lastly, since we're talking about linear allocators and treaps play
nice with it: have you considered using a hash of the pointer as a
source of randomness?

(This is an idea I recently got from how hashing the for loop index can
serve as a fast and decent source of randomness when using OpenMP
parallel for as opposed to having the threads bottleneck on a single
PRNG state.)
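A sketch of that trick: derive an independent random value per iteration
by hashing i, so parallel iterations share no PRNG state. The mixer
below is an arbitrary xor-multiply permutation of my own choosing, not
from any particular source.

```c
#include <stdint.h>

static uint64_t hash64(uint64_t x)
{
    x ^= x >> 32;  x *= 0xd6e8feb86659fd93u;
    x ^= x >> 32;  x *= 0xd6e8feb86659fd93u;
    x ^= x >> 32;
    return x;
}

// Monte Carlo pi estimate; with OpenMP the loop body is embarrassingly
// parallel ("#pragma omp parallel for reduction(+:inside)") since no
// iteration touches shared PRNG state.
static double pi_estimate(int n)
{
    int inside = 0;
    for (int i = 0; i < n; i++) {
        uint64_t h = hash64(i);
        double x = (h >> 32)         / 4294967296.0;  // [0,1)
        double y = (h & 0xffffffffu) / 4294967296.0;  // [0,1)
        inside += x*x + y*y < 1.0;
    }
    return 4.0 * inside / n;
}
```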

So far, I've been using an explicit priority member, but that (a)
requires slightly more memory and (b) requires passing around the PRNG
state. I liked how u-config's treap stores the priority implicitly
inside the key itself, but didn't like having to go through the O(n)
hash every time.

From some preliminary testing, just an xmx permutation seems to work
quite well. You do give up determinism though (maybe that's what you
were going for by hashing the key?).

- NRK

Re: [scratch] misc/slice.c

Message ID
<20230903030043.nliawbqtj7z6rhhi@nullprogram.com>
In-Reply-To
<20230902232504.hcihm5fdw6k7kejq@gen2.localdomain>
I've got to read your code more often, because I learned some new-to-me, 
unrelated techniques like your use of ASan poisoning and enums for scoped 
constants.

I should have elaborated more on my expected use cases for slices in the 
header comment. Gradually accumulating long-lived stale arrays would be 
suboptimal, almost as bad as a memory leak, and I expect that's why you 
don't like the idea. Instead I see slices as being for short-lived array 
construction when arrays are essential.

For example, imagine assembling shader data for scene rendering. OpenGL 
needs arrays, not linked lists, and I probably need around one array per 
shader attribute. I can start with empty vertex attribute slices, push as 
many vertices into them as I need, hand the result to OpenGL, then reset 
the arena. Or if I'm running multiple shader programs, I can reset the 
slice lengths to zero and avoid all the regrowing for the next part of the 
scene, then finally reset the arena (implicitly) at the end of the frame. 
That probably uses no more memory than an allocator trying to reuse the 
old arrays. A bespoke slice implementation could start larger or grow more 
aggressively as needed.
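A stripped-down sketch of the idea as I've described it (not the actual
slice.c macro): growth copies into fresh arena memory and abandons the
old array. Alignment is ignored here since the sketch only allocates
floats from a float-aligned buffer.

```c
#include <stddef.h>
#include <string.h>

typedef struct { char *beg, *end; } Arena;

typedef struct {
    float    *data;
    ptrdiff_t len, cap;
} FloatSlice;

static void *alloc(Arena *a, ptrdiff_t size)
{
    char *p = a->beg;
    a->beg += size;
    return p;  // assumes the arena is large enough for the demo
}

static void push(FloatSlice *s, float v, Arena *a)
{
    if (s->len == s->cap) {
        ptrdiff_t cap = s->cap ? 2*s->cap : 8;
        float *data = alloc(a, cap*sizeof(float));
        if (s->len) {
            memcpy(data, s->data, s->len*sizeof(float));  // old array left behind
        }
        s->data = data;
        s->cap  = cap;
    }
    s->data[s->len++] = v;
}
```

Setting `s.len = 0` between shader programs reuses the final capacity,
matching the "reset the slice lengths to zero" step above.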

(Consider: Old school immediate mode OpenGL essentially did this work in 
the background for applications through glBegin/glEnd, though it turned 
out to be a performance bottleneck crossing in and out of OpenGL so much.)

Perhaps the old array should be poisoned, which wouldn't be difficult to 
arrange. However, my slice experiment is obviously influenced by Go, which 
similarly leaves behind the old array on resize. Accidental reuse of the 
old array has just never been a problem for me in Go. (This is the same 
story for why I've stopped bothering with pointer-to-const in C.)

I'll have to think more on unrolled linked lists. They seem to fall into a 
space where they rarely have the right trade-offs. Either a linked list is 
sufficient, or it's so insufficient that even an unrolled one won't do.

If you follow the "git blame" you'll see my slices experiment began as 
Sean Barrett-style "stretchy buffers". Length/capacity is stored just 
before the data, the whole thing is held by the start of the data, and 
null pointers are a special case of empty. That wasn't the first time I 
attempted to combine this concept with linear allocation, but as before I 
wasn't satisfied with the results. Parts felt hacky, more complicated than 
I liked (preferably it's simple enough I can quickly work it out again 
from scratch, so it's more concept than library), and alignment required 
compromises that made the edge cases dominate the implementation. Slices 
are way simpler, and the macro is mainly there to smuggle compile-time 
type information.

> I've also noticed that you prefer putting the arena metadata into the 
> arena itself instead of just returning it by value.

I've been working out some basic rules of thumb to guide decisions that 
are otherwise mostly arbitrary. In fact, I've been sitting on a new, not 
yet announced libc replacement that captures the things I keep rewriting 
(arena, buffered output, typedefs, etc.) so that, at the very least, I can 
start new, small projects from a good foundation. Working through it has 
also helped me refine these rules-of-thumb in order to justify their use.

The rule in this case: Except for small "out" parameters (ex. frexp), which 
exist because C can only return one value, do not pass pointers to stack 
objects. If I pass an arena expecting the callee to allocate the result 
out of it (i.e. I can't just pass a copy), it's not an out parameter. If I 
repeatedly pass it to many different functions, it's a long-lived object 
and so should be on the "heap."

Justification: It's tempting to use the stack beyond local variables, as a 
source of fast, automatically managed allocation. (I'm guilty, too, which 
is how I came up with the rule!) That's why alloca is popular, and it's 
the primary use of VLAs. Both are bad. A scratch arena is superior and 
should be used the moment this temptation arises. It's amazing how much it 
simplifies. Taking the address of a stack variable is a sign that you may be 
misusing the stack for allocation. So consider: Is this really an out 
parameter, or am I abusing the stack? (There's an interesting analogy here 
to LLVM SafeStack.)

This is just a rule of thumb, and so of course there are more exceptions 
beyond small out parameters.

Another rule: The sizeof operator should not appear outside of allocator 
code. (Notable exception: Some APIs like Win32 require its use as a kind 
of struct versioning, so it can't be helped.) Justification: Computing 
sizes is error-prone and mistakes have an outsized impact. It's a common 
source of high profile bugs in C programs. This job should be pushed as 
much as possible into code specialized for the task. The C standard 
library has utterly failed in this regard. If you're tempted to use 
sizeof, you most likely instead need a countof macro. Talking about object 
counts (e.g. number of elements in an array) is fine.
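The sort of countof macro meant here, as a sketch: a signed element
count, so it composes with the signed-size convention mentioned below.
sizeof then appears only inside the macro itself.

```c
#include <stddef.h>

// Signed element count; sizeof is confined to this one definition.
#define countof(a) (ptrdiff_t)(sizeof(a) / sizeof(*(a)))

static int sum(int *vals, ptrdiff_t n)
{
    int total = 0;
    for (ptrdiff_t i = 0; i < n; i++) total += vals[i];
    return total;
}
```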

A lot of the others should already be familiar: signed counts/subscripts, 
avoid null terminated strings, zero-initialization by default. I'm still 
working out rules about arena usage, e.g. pass scratch arena by copy, swap 
arenas back and forth down the call stack. Also still working through the 
idea of separating octets (non-aliasing, uint8_t/u8, UTF-8 strings) from 
bytes (aliasing, unsigned char, "raw memory" in allocators), though with 
current compilers it makes no practical difference. Finally, also some 
rules about Win32 usage that stand out from convention.

> using a hash of the pointer as a source of randomness

That's a good idea! You're right about u-config.

When it comes to linear allocators, trees are a much better fit for 
associative arrays than hash tables, which have all the discussed array 
difficulties. It's too bad that implementing them is finicky, even the 
relatively simple treap.

> You do give up determinism though (maybe that's what you were going for 
> by hashing the key?).

Not deliberate, but a consequence of a hash table later turned into a 
treap. Determinism is handy for debugging, though, and the benefits of 
randomization (breaking pathological inputs) don't apply for u-config. 
Thanks for the example patch!

Re: [scratch] misc/slice.c

Message ID
<20230909145040.pq457m6q2tc6qo7h@gen2.localdomain>
In-Reply-To
<20230903030043.nliawbqtj7z6rhhi@nullprogram.com>
On Sat, Sep 02, 2023 at 11:00:43PM -0400, Christopher Wellons wrote:
> Instead I see slices as being for short-lived array construction when
> arrays are essential.

That makes more sense. I was thinking of it being a long-lived object.

> If I repeatedly pass it to many different functions, it's a long-lived
> object and so should be on the "heap."

I've tried this out a bit recently and so far it doesn't seem too
appealing to me, so I'll likely drop it. Although I used a slightly
different variant that gets rid of the buf/mem member and just stores an
offset and end pointer: `typedef struct { u8 *off, *end; } Arena;`
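A sketch of an allocator over that two-pointer layout: bump `off`
forward, padding for alignment, and fail when the request doesn't fit.
The failure behavior is a placeholder; a real version might abort or
longjmp instead.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned char u8;

typedef struct { u8 *off, *end; } Arena;

static void *arena_alloc(Arena *a, ptrdiff_t size, ptrdiff_t align)
{
    // bytes needed to round `off` up to the requested alignment
    ptrdiff_t pad = -(uintptr_t)a->off & (align - 1);
    if (size > a->end - a->off - pad) {
        return 0;  // out of memory
    }
    void *p = a->off + pad;
    a->off += pad + size;
    return p;
}
```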

> It's too bad that implementing them is finicky, even the relatively
> simple treap.

This is something that's been bothering me as well. And I've spent the
last couple days toying around with different ideas and figured out 2
different techniques that I'm quite happy with.

Before that though, here's another cool trick I figured you can do with
node pointer hashes:

	static uintptr_t
	hashptr(void *p)
	{
		uintptr_t m = 0x1337ACDCBABE5AAD;
		uintptr_t h = (uintptr_t)p - (uintptr_t)(void *)0;
		h ^= h >> (sizeof h * 4);
		h *= m;
		h ^= h >> (sizeof h * 4);
		return h - 1;
	}

This maps hashptr(NULL) to UINTPTR_MAX so you can drop the null check
from the while loop. It doesn't really make any practical difference but
little tricks like these are quite fun :)

With that out of the way, here's my current "default" search data
structure that doesn't require resizes: if the key idea behind a treap
is to achieve a balanced tree via well-distributed numbers (which can be
produced either by a PRNG or a hash function), then why not take this
one step further and make the BST order itself depend on a
well-distributed number?

In other words, make the BST ordered by the hash of the key instead of
the string itself. And if we do that, then we can just get rid of the
heap shenanigans entirely since the BST itself now has a high
probability of being fairly balanced. Here's the crux of the code:

	u64 keyhash = hash(key);
	HTHNode **insert = t;
	while (*insert != NULL) {
		u64 phash = insert[0]->hash;
		switch ((keyhash > phash) - (keyhash < phash)) {
		case  0: if (str_eq(key, insert[0]->key)) return insert[0]; // fallthrough
		case -1: insert = insert[0]->child + 0; break;
		case +1: insert = insert[0]->child + 1; break;
		default: ASSERT(!"unreachable"); break;
		}
	}
	// insertion code here

Fairly straightforward and non-finicky code. From some testing and
benchmarking, I found that the depth of the tree doesn't really get
bigger than a treap's.
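For completeness, here's a self-contained rendering of that loop with
the insertion step filled in. Plain C strings and FNV-1a stand in for my
actual key and hash types, and a static pool stands in for a linear
allocator.

```c
#include <stdint.h>
#include <string.h>

typedef struct HTHNode HTHNode;
struct HTHNode {
    HTHNode *child[2];
    char    *key;
    uint64_t hash;
};

static uint64_t hash(char *s)
{
    uint64_t h = 0xcbf29ce484222325u;  // FNV-1a
    for (; *s; s++) h = (h ^ (uint8_t)*s) * 0x100000001b3u;
    return h;
}

static HTHNode pool[64];
static int     npool;

static HTHNode *upsert(HTHNode **insert, char *key)
{
    uint64_t keyhash = hash(key);
    while (*insert) {
        uint64_t phash = insert[0]->hash;
        if (keyhash == phash && !strcmp(key, insert[0]->key)) {
            return insert[0];
        }
        // order by hash; equal-hash, unequal-key ties fall left
        insert = insert[0]->child + (keyhash > phash);
    }
    HTHNode *n = &pool[npool++];  // zeroed static storage
    n->key  = key;
    n->hash = keyhash;
    return *insert = n;
}
```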

But there's no active effort to re-balance, and if that worries you, we
can follow the theme of randomization and have an array of trees picked
by the hash. It takes a minimal amount of modification to support:

	+ #define HTH_SLOT 16
	
	- HTHNode **insert = t;
	+ HTHNode **insert = t + (keyhash & (HTH_SLOT - 1));

And with that we've come full circle back to a hash table, although a
different kind: one that doesn't resize and uses a hash-ordered BST to
resolve collisions. In other words, it's a "hash table of hash trees",
which I'm calling HTHTree for short.

From some benchmarking, it seems to always perform better than a treap
while at the same time being significantly simpler. Speaking of treaps,
my 2nd technique is still a treap, but with much simpler and less
error-prone re-balancing logic:
https://codeberg.org/NRK/slashtmp/commit/09699d8c6952b2e579de529bcfd2d57b854c6d88

Also, while I was looking to see if there's any existing data-structure
like HTHTree, I found out that the individual concepts here aren't new
(tree/trie based on hash, BST for collision resolution, etc.) but so far
I haven't quite found anything that combines these concepts into a
single data-structure.

(Closest I found was hash-array-mapped-trie: https://en.wikipedia.org/wiki/Hash_array_mapped_trie
 interesting data-structure, but implementing it is way too finicky:
 https://codeberg.org/NRK/slashtmp/src/branch/master/data-structures/hamt.c).

Right now I'm in the process of polishing up my (currently private)
benchmarking code and figuring out how to graph the result. But I
figured it'd be nice to get some "external" feedback first in case
there's something obvious (or non-obvious) I'm missing.

- NRK

Re: [scratch] misc/slice.c

Message ID
<20230910032326.7egrskaexe226tru@nullprogram.com>
In-Reply-To
<20230909145040.pq457m6q2tc6qo7h@gen2.localdomain>
> we can just get rid of the heap shenanigans entirely 

Nice, I like it! It strikes a good balance between ease of implementation 
and performance. Fast enough when it's not on a performance critical path, 
but simple enough to whip up from scratch. I played around with it here:

https://github.com/skeeto/scratch/blob/master/misc/treap.c

I wanted to get a feel for tree shapes, especially for highly correlated 
keys, so it prints a Graphviz "dot" graph for inspection. Perhaps that 
solves your graph question. In the future I can use this to tune a key 
hash to do the minimum possible work without throwing the tree out of 
whack. I didn't follow through to HTHTree, but it looks like a reasonable 
extension of the idea.

Sample tree graph: https://i.imgur.com/Au6hYYZ.png

I also couldn't help but try out some other ideas, including your new 
arena layout. Once it's down to just two pointers, it feels more like a 
span "value" and less like a stateful "object".

Re: [scratch] misc/slice.c

Message ID
<20230910155653.7kowsdjajyknqhsm@nullprogram.com>
In-Reply-To
<20230909145040.pq457m6q2tc6qo7h@gen2.localdomain>
I think I misunderstood your "HTHTree" concept, which on revisit seems to 
mean only the top level is a small hash table. That would reduce the depth 
by a fixed amount — a big win for smaller trees. However, I interpreted it 
as the "child" field being a small hash table, and choosing a branch based 
on the key hash.

As you had written — since it's not what you intended — that wouldn't work 
because keys aren't distributed going down the tree. So I added an MCG to 
remix the key at each level, and I'm really happy with the results. It 
substantially reduces the tree depth at some memory cost. At n=2 it's 
really no different than the pre-HTH treap, n=4 is a significant speed 
boost, and n=8 is barely faster with a substantial increase in memory use, 
so n=4 is (usually?) the only choice that makes sense.

In effect, it's a 4-ary trie where elements are truncated MCG outputs 
seeded from the key, and nodes are deposited in the first free spot. The 
MCG maintains treap-like randomized balance. My "treap.c" has all these 
new changes, including dot rendering of the hash-treap.
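Here's the shape of the idea as a sketch (illustrative names, not the
actual treap.c): each level picks one of four children from the top two
bits of a remixed key hash, and a node settles into the first empty slot
along its path. The odd multiplier is an arbitrary stand-in for the MCG
step, and FNV-1a stands in for the key hash.

```c
#include <stdint.h>
#include <string.h>

typedef struct Trie Trie;
struct Trie {
    Trie *child[4];
    char *key;
};

static uint64_t thash(char *s)
{
    uint64_t h = 0xcbf29ce484222325u;  // FNV-1a stand-in
    for (; *s; s++) h = (h ^ (uint8_t)*s) * 0x100000001b3u;
    return h;
}

static Trie tpool[64];
static int  ntpool;

static Trie *trie_upsert(Trie **t, char *key)
{
    // remix h each level with an odd multiply (the MCG step)
    for (uint64_t h = thash(key); *t; h *= 0x3243f6a8885a308du) {
        if (!strcmp(key, (*t)->key)) {
            return *t;
        }
        t = &(*t)->child[h >> 62];  // top two bits pick the branch
    }
    *t = &tpool[ntpool++];  // zeroed static storage
    (*t)->key = key;
    return *t;
}
```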

With a hash selecting the branch, hashes aren't used in comparisons, so I 
don't need to store hashes in the tree. Overall that means it only costs 
one more pointer-worth of memory per node. This is one of those moments I 
wish C, even if just as a GCC extension, had a concept of "large" and 
"small" pointers, like "near" and "far" on 16-bit machines. On a 64-bit 
host I could place my arena in the low 4GB of the address space, then tell 
the compiler that tree node pointers are "small" 32-bit pointers, saving a 
lot of memory. As far as I know, there are only three ways to do this today:

(1) Use 32-bit integers and cast to/from pointers. Noisy, especially the 
extra casting necessary to shut up warnings. (A macro or helper function 
could hide the noise.) By taking advantage of alignment (e.g. the low 3 
bits are always zero because a node contains a 64-bit pointer) "small" 
pointers could point anywhere in the first 32GB of memory.

(2) Use 32-bit offsets relative to the arena (base), and combine with the 
base to derive a real pointer. Reduces casting noise, only puts a limit on 
the arena size, not its location, but requires passing the base pointer 
around. With some cleverness in placing the arena (e.g. align the whole 
arena to its capacity), the program could always rediscover the base by 
masking any arena-allocated pointer. Like (1), exploiting node alignment 
extends this to a 32GB arena.
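A minimal sketch of option (2), with types and helper names of my own:
32-bit offsets from the arena base, reserving offset zero as the null
encoding (assuming the first arena byte is never handed out, e.g. by
burning one byte at initialization).

```c
#include <stdint.h>

typedef uint32_t p32;  // compressed pointer: offset from the arena base

static p32 p32_enc(void *p, char *base)
{
    return p ? (p32)((char *)p - base) : 0;
}

static void *p32_dec(p32 off, char *base)
{
    return off ? base + off : 0;
}
```

Exploiting node alignment would mean storing the offset shifted right by
3, extending the reach from 4GB to the 32GB mentioned above.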

(3) Use 32-bit offsets relative to the current child. The root is still 
held by a full pointer. As before, when offsets address larger units than 
bytes, it extends to at least 32GB. (Caveat: It's more complex than simply 
treating the arena as an array of nodes, as nodes aren't "node-aligned" 
relative to each other, just pointer-aligned.) This option sounds the most 
appealing to me.

All cases require trade-offs (e.g. all nodes in a tree allocated from the 
same arena, specialized arena placement, etc.) that generally aren't worth 
it. The Linux x32 ABI was kind of headed in this direction, retrofitting 
existing software with these properties, but it was never quite popular 
enough to receive decent support.

Re: [scratch] misc/slice.c

Message ID
<20230911021112.zqegw3wrlgluyzvt@gen2.localdomain>
In-Reply-To
<20230910155653.7kowsdjajyknqhsm@nullprogram.com>
> seems you mean only the top-level is a small hash table.

Yup, I noticed that yesterday but didn't have the time to reply; looks
like you've noticed it on your own! For some context: supporting the
hash table (at the root) was just a matter of a simple one-line change,
but it provided large speedups for small N (< 1K) -- the common case --
bringing it closer to the O(1) of a hash table.

But for really large N (roughly 100K or above) the "acceleration"
provided by the root level hash-table becomes negligible because the
O(log N) tree dominates.

Your idea of embedding a "hash-table" at the leaf instead of the root
makes it much closer to a "hash-array-mapped-trie" (HAMT) except that:

* HAMT traverses the trie by "peeling off" the hash bits
* it turns into a linked list if it has traversed the entire trie (i.e.
  a hash collision) without reaching a conclusion.

Your idea, on the other hand, gets rid of special case (2) and
traverses based on a permutation of the hash rather than by peeling it
off. Effectively you've vastly simplified HAMT while still employing
some of its underlying concepts. I like it! And this is probably going
to be my new default choice.

When I was playing around with HAMT I implemented it by following
Wikipedia a bit too literally, which ended up with a finicky and larger
implementation than I'd like. The performance was good, but the memory
overhead and implementation size made me not investigate it further.

From my casual testing:

* this consistently outperforms the HAMT implementation I had by about
  ~6% while consuming ~2x less memory
* It also outperforms HTHTree by nearly 2x but consumes about 1.2x more
  memory.
* It outperforms the simplified "parent-less treap" by 3x but consumes
  about 1.6x more memory.

I've published my benchmarking code on a WIP branch in case you want to
take a look at it: https://codeberg.org/NRK/slashtmp/src/branch/search-bench/data-structures/search-tree-benchmark.c

> Overall that means it only costs one more pointer-worth of memory per
> node. This is one of those moments I wish C, even if just as a GCC
> extension, had a concept of "large" and "small" pointers, like "near"
> and "far" on 16-bit machines. On a 64-bit host I could place my arena
> in the low 4GB of the address space, then tell the compiler that tree
> node pointers are "small" 32-bit pointers, saving a lot of memory.

I *did* think about pointer compression. But ultimately didn't use it
for my testing because:

* I was more interested in figuring out underlying concepts and ideas
  than a specific implementation.
* I was looking for something simple and straight-forward (read: not
  error-prone) that can be rolled from memory easily.
* Back when I *did* play around with self-relative pointers [0] it was way
  too finicky to do in C, even with some "helper macros".
* If space efficiency is a concern then there's likely a more suitable
  data-structure for it that was designed with space efficiency in mind.

The last point is more of a speculation, so it might turn out to be
wrong, but the other three were good enough reasons not to bother with
pointer compression. I do still think it's a trick worth keeping in mind
in case you need it. (One other benefit of self-relative pointers is
that you can dump them to disk and load them back without needing to
mmap/VirtualAlloc at a specific hardcoded address.)

[0]: https://www.gingerbill.org/article/2020/05/17/relative-pointers/#self-relative-pointers
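For reference, the core of a self-relative pointer fits in a few lines
(my own sketch, not from the article): the stored value is the signed
distance from the field's own address, so a structure keeps working
after being memcpy'd, or dumped to disk, as a unit. Zero is reserved for
null, and uintptr_t arithmetic sidesteps cross-object pointer
subtraction.

```c
#include <stdint.h>

typedef int32_t rel32;

typedef struct { rel32 next; int v; } RNode;

static void rel_set(rel32 *field, void *target)
{
    // distance from the field itself; assumes target != field
    *field = target ? (rel32)((uintptr_t)target - (uintptr_t)field) : 0;
}

static void *rel_get(rel32 *field)
{
    return *field ? (void *)((uintptr_t)field + (intptr_t)*field) : 0;
}
```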

- NRK

Re: [scratch] misc/slice.c

Message ID
<20230911223809.2rvs357olqfzt4d2@nullprogram.com>
In-Reply-To
<20230911021112.zqegw3wrlgluyzvt@gen2.localdomain>
> I've published my benchmarking code on a WIP branch

Be mindful about benchmarking hash tables with random keys. If keys are 
already uniformly distributed, it's not *really* testing the hash table's 
ability to handle a lumpy, more realistic distribution. That's why I like 
benchmarking/testing with monotonic keys, which are realistic and a kind 
of non-pathological worst case.

> self-relative pointers [0]

That Jon Blow video was great! I hadn't seen that one before. Perhaps 
that's for the better because 6 years later Jai is still behind closed 
curtains. The example in the article had me itching to try out relative 
pointers using C++ operator overloading. However, that implementation was 
unsatisfying and incomplete, so I started over and came up with this:

https://github.com/skeeto/scratch/blob/master/misc/ptr.cpp

Getting all those operators right is finicky, but it was enough to let me 
try out relative pointers using familiar syntax. They really are tricky, 
though perhaps with practice they're as natural as normal pointers. When 
copying objects containing relative pointers, I've got to pause to think 
if the destination is appropriately nearby. Considering temporaries, I 
suspect C++ semantics are not actually strict enough to use these properly 
(e.g. in both C and C++ the stack is only an implementation detail).

Relatedly, my code doesn't work with GCC starting at -O1. It optimizes 
away my "nodes[N-1].next = 0" so that the linked list runs off the end, 
but I don't see why GCC thinks that's a valid optimization. At -O3 it 
optimizes away nearly all relative pointer stores. Neither Clang nor MSVC 
do these "optimizations", and it works fine under both. Perhaps a GCC bug? 
It probably wreaks havoc with its pointer provenance analysis, making it 
think nodes beyond nodes[0] are unused.

Re: [scratch] misc/slice.c

Message ID
<20230912090746.e2pwapegw7ff5hmh@nullprogram.com>
In-Reply-To
<20230911223809.2rvs357olqfzt4d2@nullprogram.com>
Figured out the GCC thing! Just needed to sleep on it. Turning it over in 
my mind, I realized I took a bad shortcut by not actually allocating out of 
an arena, and so my pointer arithmetic was illegal. Changing it all to 
uintptr_t fixed it. Lesson learned.

Re: [scratch] misc/slice.c

Message ID
<20230923203218.ui2tlrc5znriyr25@nullprogram.com>
In-Reply-To
<20230911021112.zqegw3wrlgluyzvt@gen2.localdomain>
Following up on another experiment with compressed pointers:
https://github.com/skeeto/scratch/blob/master/misc/markov.c

Markov chain text generation was a useful test for hash-tries. When I've 
built implementations before, dealing with nested maps felt fiddly. They 
need to grow with the corpus, and each layer maps over different types. In 
Go terms for an N-gram chain: map[[N]string]map[string]int. Hash tries 
proved to be so easy that I had no trouble at all.

My initial implementation used plain pointers, and I got it up and running 
quickly. (In retrospect I should have checked that in first.) However, I 
was disappointed with memory use. I could only load ~512MiB of text before 
exhausting physical RAM. Next I tried compressed relative pointers (named 
p32) for everything but strings, and I exploited alignment to extend the 
range to 8GiB. I can load ~640MiB of the same corpus, but then I hit that 
8GiB limit. It's nice that I'm no longer exhausting RAM, but it's little 
more capable than before! Oh well.

Once I got the macros worked out, which *was* a bit tricky, I was able to 
convert the code over to compressed pointers almost mechanically. Working 
out the C++ operators was the practice I needed to use them here, as that 
put the necessary concepts in my head, particularly when copying pointers. 
Unfortunately, debugging in the presence of compressed pointers is not fun 
at all, and that basically killed them for me unless I'm also prepared to 
build custom tooling to go with it (e.g. the debugger built into Handmade 
Hero). I'd been using GDB to examine the Markov chains by hand, but after 
converting over I can no longer do that without great pain.

The next step would be to copy tokens into the arena, with interning, so 
that I could use compressed string pointers (and lengths), too. I suspect 
the additional overhead would eat into most of the savings.

Re: [scratch] misc/slice.c

Message ID
<20230925022818.mj35oyitbrrnuyeb@nullprogram.com>
In-Reply-To
<20230911021112.zqegw3wrlgluyzvt@gen2.localdomain>
More hash-trie news: I wanted to try out a generic Go implementation, then 
benchmark it against its built-in map type. In my benchmarks it's 1.5x to 
2x faster depending on the key type and hash function! Memory footprint is 
1.3x to 2x larger (at worst), depending on the number of entries. Only ~30 
lines of code, and easy to write. Your rotation idea was important, as an 
MCG is far slower in this implementation.

https://github.com/skeeto/scratch/blob/master/misc/hashtrie_test.go

Linear allocation isn't popular in Go — though it's being seriously 
investigated [1] — but this little implementation could easily adapt to 
allocate from a given arena, too.

[1]: https://github.com/golang/go/issues/51317

Re: [scratch] misc/slice.c

Message ID
<20230925142844.tup3jbsna76thsob@gen2.localdomain>
In-Reply-To
<20230923203218.ui2tlrc5znriyr25@nullprogram.com>
On Sat, Sep 23, 2023 at 04:32:18PM -0400, Christopher Wellons wrote:
> Following up on another experiment with compressed pointers:
> https://github.com/skeeto/scratch/blob/master/misc/markov.c
> Markov chain text generation was a useful test for hash-tries.

Funny timing, I was thinking of writing a markov chain with a hash-trie
as well :-) (prompted by reading the 3rd chapter of "The Practice of
Programming", which used a chained, fixed-size hash table that I wasn't
impressed with). Unfortunately I haven't gotten the time to do it yet.

> It's nice that I'm no longer exhausting RAM

One other compression technique I've thought about is using tagged
pointers for the "child pointers". This comes from the observation that
there will be a bunch of "leaf nodes" with no children. So with a tagged
pointer, you can avoid allocating the node by embedding the key into the
parent's pointer.

Unfortunately, this is quite finicky because pointers are 8 bytes
whereas (sized-)string keys are 16 bytes. It probably won't work for
key-value pairs either.

And the space saving will be proportional to the number of leaf nodes,
whereas with compressed pointers the saving is proportional to the total
number of nodes. So I didn't investigate this idea any further. But I'm
putting it out here since it might be a worthwhile thing to do for
smaller/integer keys.

> Unfortunately, debugging in the presence of compressed pointers is not fun
> at all

I tend to use duel [0] for examining trees, linked lists, etc. inside
gdb, and compressed pointers end up killing that too. It gets to the
point where hand-writing a custom "print/dump" function becomes more
efficient and less annoying than trying to use gdb.

Speaking of pointer compression, I tried it out on my existing (string
key) benchmark and, unlike your previous int-key benchmark, it didn't
show any slowdown even at small Ns. But similar to the int-key bench, it
became faster at large Ns.

I suspect this is because (a) my benchmark uses same-size strings, so
the equality check cannot bail early on a length mismatch, and (b) about
half the strings are near-duplicates, which means there are more bytes
being compared.

And due to the above, the string comparison (probably) ends up hiding
the pointer-decoding overhead, whereas the integer comparison was fast
enough that the pointer decoding became the bottleneck. Storing the hash
to short-circuit the string equality check also speeds things up in my
benches, probably for the same reasons.

The markov chain might be a more realistic benchmark to see how much
"optimizations" like short circuiting the string check via the hash
actually matters in non-synthetic situations.

> More hash-trie news: I wanted to try out a generic Go implementation [...]

Good to know that the concept can be implemented efficiently and easily
in other languages as well.

> Linear allocation isn't popular in Go

Interesting to see arenas being investigated in a GC-ed language. Kind
of reminds me of this clip [1] where people end up basically doing
manual memory management in GC-ed languages anyway for performance
reasons.

[0]: https://mariadb.org/duel-gdb-vs-linked-lists-trees-hash-tables/
[1]: https://youtube.com/watch?v=tK50z_gUpZI&t=400s

- NRK

Re: [scratch] misc/slice.c

Message ID
<20230925234831.p4xpxjcwx5spix56@nullprogram.com>
In-Reply-To
<20230925142844.tup3jbsna76thsob@gen2.localdomain>
Interesting, I hadn't heard of duel before! Now I really wish that patch 
got merged all those years ago, as I'm unhappy with GDB generally moving 
towards implementing newer features in Python. Python is a bad fit as an 
extension language.

Thanks for the updates on your pointer decoding experiments. That all 
sounds about right.

I had another insight last night, and then had some time today to try it 
out: It's nearly trivial to make hash-tries thread-safe. You just need an 
atomic acquire for loading child pointers, and a release compare-and-swap 
when setting child pointers. Keys don't change after construction, so they 
don't need atomic access, and child references are permanent once set, so 
the trie doesn't change shape underneath the thread.

https://github.com/skeeto/scratch/blob/master/misc/concurrent-hash-trie.c

In my demo, each thread has an arena, so there's no locking when 
allocating new nodes. It saves an arena snapshot so that it can roll 
back when CAS fails. 
In a real program, upsert() might set the value before insertion (requires 
a change in interface), so that initial key+value is inserted atomically. 
My demo doesn't read values until after the join, which synchronizes the 
value stores. If the value is, say, a shared counter, then the current 
interface is good enough because threads would atomically increment them 
anyway.
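The scheme sketched in C11 atomics (illustrative names, not the actual
concurrent-hash-trie.c): children load with acquire, insertion is a
release compare-and-swap, and a failed CAS falls through to the winning
node. The static pool here is single-threaded demo scaffolding; a real
version allocates from a per-thread arena and rolls it back on a lost
race.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef struct CNode CNode;
struct CNode {
    _Atomic(CNode *) child[4];
    char            *key;
};

static uint64_t chash(char *s)
{
    uint64_t h = 0xcbf29ce484222325u;  // FNV-1a stand-in
    for (; *s; s++) h = (h ^ (uint8_t)*s) * 0x100000001b3u;
    return h;
}

static CNode cpool[64];
static int   ncpool;  // demo pool, not thread-safe itself

static CNode *cupsert(_Atomic(CNode *) *t, char *key)
{
    for (uint64_t h = chash(key);; h <<= 2) {
        CNode *n = atomic_load_explicit(t, memory_order_acquire);
        if (!n) {
            CNode *fresh = &cpool[ncpool++];  // zeroed static storage
            fresh->key = key;
            if (atomic_compare_exchange_strong_explicit(
                    t, &n, fresh, memory_order_release,
                    memory_order_acquire)) {
                return fresh;
            }
            // lost the race: n now holds the winner; fall through to it
            // (a real version would roll back its arena here)
        }
        if (!strcmp(key, n->key)) {
            return n;
        }
        t = &n->child[h >> 62];  // top two bits of h pick the branch
    }
}
```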

I'm quite happy with how it turned out. A fast, concurrent, unordered map 
in ~30 lines of code ain't bad at all!