~sircmpwn/hare-dev

hare-json (and other) hash maps

Message-ID: <D7J33FFKQZPC.3OCFOHT8WYKMB@rosiesworkshop.net>
Sender timestamp: 1738595630

Hi everyone, I have a somewhat open question to ask about hash maps in
the stdlib. In particular, I am hoping to get some clarity on what sorts
of ownership semantics are desired for hash-mapped objects in the stdlib.

I encountered the following issue early on while hacking on
hare::types::, but an example of it that is more likely to cause issues
right now is in hare-json (although that is extlib).

The issue is that reallocating the buckets in a json::object invalidates
*value pointers which borrow from the object. The docstring for
json::get does state "The return value is borrowed from the object",
but in my opinion that does not clearly communicate "this pointer may be
invalidated by future mutations with put/set". I think this is
especially problematic given that haredoc will not (by default) show me
the internal structure of json::object, due to current conventions.

Here is a quick demo for code that fails in this way:
https://paste.sr.ht/~roselandgoose/aa15a7897bce80ecff20035577f9c08730330a77
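
In essence the failing pattern looks like this (a rough sketch from
memory - the exact json:: signatures may differ, and nomem handling is
elided):

use fmt;
use json;

export fn main() void = {
	let obj = json::newobject();
	defer json::finish(obj);

	json::set(&obj, "first", 1.0);
	// Borrow a pointer into the object's bucket storage.
	const first = json::get(&obj, "first") as *json::value;

	// Insert enough new keys to force the buckets to reallocate.
	let buf: [64]u8 = [0...];
	for (let i = 0z; i < 1000; i += 1) {
		json::set(&obj, fmt::bsprintf(buf, "key{}", i), 0.0);
	};

	// `first` now dangles; this match inspects freed memory.
	match (*first) {
	case let n: f64 =>
		fmt::printfln("{}", n)!;
	case => void;
	};
};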

Currently it manifests (on my machine) as a match on an invalid tagged
union, which in turn is caught by the familiar "compiler bug" assertion,
though it could just as well lead to a segfault in a different context.

So, independent of what change will make the most sense for hare-json,
an extlib package, my question is: which of the possible responses would
pass muster for code in the stdlib?

The easiest option is probably just to expand the comment warning about
pointers returned by get(). I'm planning to send a patch for this, btw.
Would that be considered OK for a module in the stdlib too?
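
Concretely, I have in mind something along these lines (exact wording up
for bikeshedding; signature quoted from memory):

// Retrieves a value from a JSON object by key. The return value is
// borrowed from the object and is only valid until the next mutation
// of the object (e.g. via put or set): a mutation may reallocate the
// object's internal storage, invalidating the pointer.
export fn get(obj: *object, key: str) (*value | void);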

Another option I've seen is to make the size of each bucket fixed (once
allocated) and use static slice operations, aborting if a bucket gets
full. I assume that this would not be appropriate for the stdlib unless
that case were very unlikely to arise (unlike here).
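
For illustration, that approach looks roughly like the following sketch
(placeholder entry type; `static append` aborts instead of reallocating
once the backing array is full):

def BUCKET_CAP: size = 16;

type entry = struct {
	key: str,
	val: int, // stand-in for the real value type
};

type bucket = struct {
	storage: [BUCKET_CAP]entry,
	items: []entry, // borrows from storage; never reallocated
};

fn bucket_init(b: *bucket) void = {
	b.items = b.storage[..0];
};

fn bucket_put(b: *bucket, e: entry) void = {
	static append(b.items, e); // aborts if the bucket is full
};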

The option that I've taken in my hare::unit:: prototype is analogous to
allocating each json::value individually and only storing pointers in
the buckets. Then each bucket can be safely realloced as needed. There
are a number of downsides to this of course, but it seems the simplest
to me.
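
In other words, something like this (sketch only; nomem handling
elided):

type bucket = []*json::value;

fn bucket_put(b: *bucket, val: json::value) void = {
	// Growing the bucket moves only the pointers; the values
	// themselves stay put, so borrows of them remain valid.
	append(*b, alloc(val));
};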

And of course we could use a more complicated storage scheme: unrolled
linked lists for buckets, storing the values in a tree instead of a hash
map, etc., each with its own tradeoffs.

What are your thoughts?

-RKL

Message-ID: <D7JY9LNZH9DA.3VA8I36VRLNJL@spxtr.net>
In-Reply-To: <D7J33FFKQZPC.3OCFOHT8WYKMB@rosiesworkshop.net>
Sender timestamp: 1738701568

> The option that I've taken in my hare::unit:: prototype is analogous to
> allocating each json::value individually and only storing pointers in
> the buckets. Then each bucket can be safely realloced as needed. There
> are a number of downsides to this of course, but it seems the simplest
> to me.

Personally, I think this is the best way for most cases, and it's what I
use for my own projects. The downsides are relatively minor:

* You have to alloc and free once per item.
* You use an extra pointer's worth of memory per item.
* You have to do an extra pointer deref.
* You lose a bit of cache locality.

(any I missed?)

IMO correctness and simplicity are more important than these, to start
out with.

Message-ID: <D7K33UXW9UQX.1F0JPSCNFCZ99@turminal.net>
In-Reply-To: <D7J33FFKQZPC.3OCFOHT8WYKMB@rosiesworkshop.net>
Sender timestamp: 1738718823

On Mon Feb 3, 2025 at 9:13 PM CET, Rosie Keith Languet wrote:
> Hi everyone, I have a somewhat open question to ask about hash maps in
> the stdlib. In particular, I am hoping to get some clarity on what sorts
> of ownership semantics are desired for hash-mapped objects in the stdlib.
>
> I encountered the following issue early on while hacking on
> hare::types::, but an example of it that is more likely to cause issues
> right now is in hare-json (although that is extlib).
>
> The issue is that reallocating the buckets in a json::object invalidates
> *value pointers which borrow from the object. The docstring for
> json::get does state "The return value is borrowed from the object",
> but in my opinion that does not clearly communicate "this pointer may be
> invalidated by future mutations with put/set". I think this is
> especially problematic given that haredoc will not (by default) show me
> the internal structure of json::object, due to current conventions.
>
> Here is a quick demo for code that fails in this way:
> https://paste.sr.ht/~roselandgoose/aa15a7897bce80ecff20035577f9c08730330a77
>
> Currently it manifests (on my machine) as a match on an invalid tagged
> union, which in turn is caught by the familiar "compiler bug" assertion,
> though it could just as well lead to a segfault in a different context.
>
> So, independent of what change will make the most sense for hare-json,
> an extlib package, my question is: which of the possible responses would
> pass muster for code in the stdlib?
>
> The easiest option is probably just to expand the comment warning about
> pointers returned by get(). I'm planning to send a patch for this, btw.
> Would that be considered OK for a module in the stdlib too?

I don't think that's a good solution, either in general or in the
specific case of hare-json. It may be a plausible solution for a
special-purpose hash table that is almost entirely static, where
insertion and deletion are very rare and special occurrences. The
general solution should be to make get() return pointers that are not
invalidated on modification; how to achieve that in practice depends a
lot on the specifics of each situation, but yeah, just adding an extra
layer of indirection is the easiest possibility.

>
> Another option I've seen is to make the size of each bucket fixed (once
> allocated) and use static slice operations, aborting if a bucket gets
> full. I assume that this would not be appropriate for the stdlib unless
> that case were very unlikely to arise (unlike here).
>
> The option that I've taken in my hare::unit:: prototype is analogous to
> allocating each json::value individually and only storing pointers in
> the buckets. Then each bucket can be safely realloced as needed. There
> are a number of downsides to this of course, but it seems the simplest
> to me.
>
> And of course we could use a more complicated storage scheme: unrolled
> linked lists for buckets, storing the values in a tree instead of a hash
> map, etc., each with its own tradeoffs.

I think your way of fixing hare::types is fine. You could also use a
fixed-size open addressing table, as the number of types in the store is
fairly limited - anything more than 15k entries has a big enough
probability of a full type id collision that it isn't worth supporting
at the moment (though this will be addressed at some point).
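
Roughly like this (placeholder types, linear probing, no deletion
support):

def TABLE_SIZE: size = 32768; // power of two, comfortably above 15k

type slot = struct {
	occupied: bool,
	key: u32, // e.g. a type id
	val: uintptr, // placeholder payload
};

// Returns either the occupied slot holding key, or the empty slot
// where it should be inserted; null if the table is full.
fn lookup(table: []slot, key: u32) nullable *slot = {
	let i = key: size & (TABLE_SIZE - 1);
	for (let n = 0z; n < TABLE_SIZE; n += 1) {
		const s = &table[i];
		if (!s.occupied || s.key == key) {
			return s;
		};
		i = (i + 1) & (TABLE_SIZE - 1);
	};
	return null;
};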

>
> What are your thoughts?

I have to admit I've recently grown a bit disillusioned with the "roll
your own" approach to data structures that Hare is advocating for. This
makes me question it even more - if very capable people, the ones most
familiar with the language, get hash tables wrong repeatedly, can we
really expect everyone else to do better?

Message-ID: <D7K3DP3GYO8L.3SFZ7KYVC8L3G@turminal.net>
In-Reply-To: <D7JY9LNZH9DA.3VA8I36VRLNJL@spxtr.net>
Sender timestamp: 1738719594

On Tue Feb 4, 2025 at 9:39 PM CET, Joe Finney wrote:
> > The option that I've taken in my hare::unit:: prototype is analogous to
> > allocating each json::value individually and only storing pointers in
> > the buckets. Then each bucket can be safely realloced as needed. There
> > are a number of downsides to this of course, but it seems the simplest
> > to me.
>
> Personally, I think this is the best way for most cases, and it's what I
> use for my own projects. The downsides are relatively minor:
>
> * You have to alloc and free once per item.
> * You use an extra pointer's worth of memory per item.

There is also the memory cost of allocator metadata for each additional
allocation, but this can be countered by allocating in bulk.
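
E.g. a trivial chunk pool (sketch with placeholder types; nomem handling
elided):

def CHUNK: size = 64;

type item = struct { x: int }; // stand-in for the stored value type

type pool = struct {
	chunks: [][]item,
	next: size, // next free slot in the newest chunk
};

fn pool_alloc(p: *pool) *item = {
	if (len(p.chunks) == 0 || p.next == CHUNK) {
		// One allocation per CHUNK items instead of one per item.
		append(p.chunks, alloc([item { x = 0 }...], CHUNK));
		p.next = 0;
	};
	const v = &p.chunks[len(p.chunks) - 1][p.next];
	p.next += 1;
	return v;
};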


> * You have to do an extra pointer deref.
> * You lose a bit of cache locality.

It's possible to gain that back by doing open addressing.

Storing pointers to objects instead of the objects themselves can also
be beneficial because the table data is smaller that way (unless the
stored objects are <= 8 bytes), which means that resizing the table data
is faster.


> IMO correctness and simplicity are more important than these, to start
> out with.

Agreed.