~skeeto/public-inbox

1

Re: The quick and practical "MSI" hash table

Eric Farmer <erfarmer@gmail.com>
Details
Message ID
<CAAuXPZpxfPKjAAt7-jqyWn4aEhe9uc5pcBMRxrUM1wKgAKAafw@mail.gmail.com>
DKIM signature
pass
Download raw message
Thanks for an interesting read. I have a question about computing the step:

> The low bits of the hash tell it where to start, and the high bits tell it how to step.

Is it necessary to compute the step from the *highest* `exp` bits of
the hash, as opposed to just the "higher" bits of the hash, that is:

    uint32_t step = (hash >> exp) | 1;

The `hash >> (64 - exp)` in the article has the advantage of a "full
size" step even if it's more than half the width of the hash; but the
32-bit types effectively preclude that situation anyway, right?

Re: The quick and practical "MSI" hash table

Details
Message ID
<20220812154833.lgcdeixemqk3udid@nullprogram.com>
In-Reply-To
<CAAuXPZpxfPKjAAt7-jqyWn4aEhe9uc5pcBMRxrUM1wKgAKAafw@mail.gmail.com> (view parent)
DKIM signature
missing
Download raw message
With a good-quality hash (especially xorshift finalized) it's unnecessary, 
and "higher" vs. "highest" will have no measurable difference. When using 
a multiplicative hash like FNV-1a without a finalizer, you probably want 
to use the highest bits since they're higher quality, and specifically 
using the highest bits may let you avoid paying for a finalizer. On the 
other hand, Java's String.hashCode() has poor-quality high bits (i.e. 
zeros) for short inputs. To a certain degree it's about knowing your 
inputs.

I picked the highest bits for my example since it's simple, obvious, and 
maximizes the effect. Sometimes I've even used overlapping regions of the 
hash which, as you noticed, can't happen in my example. My example would 
reasonably still work well with a 32-bit hash input, overlapping when exp 
is 17 or greater. (The main detriment may be in the hash function itself 
if it's only using a 32-bit state, since hash states are more likely to 
collide mid-hash.) The benefit of double hashing vs. linear probing is 
modest, and it doesn't take much to get that modest benefit.

Related: There's common wisdom about using prime numbers for table sizes. 
It reduces collisions for poor hash functions (i.e. String.hashCode()), 
but it's unnecessary — and not worth the cost — if you're working with 
good hash results. I prefer better hashes over playing tricks with the 
table size, especially since, in the worst case, a permutation can turn a 
bad hash into a good hash.
Reply to thread Export thread (mbox)