~skeeto/public-inbox

1

Re: A flexible, lightweight, spin-lock barrier

Eric Farmer <erfarmer@gmail.com>
Details
Message ID
<CAAuXPZr8Q4opGTkTf0Vn67tde307GWrdmqTKKppTJHbG-Ain8w@mail.gmail.com>
DKIM signature
pass
Download raw message
Thanks for a really interesting post, I was not aware of the
"impossible" memory re-ordering issue.

> The denominator is ideally a compile-time constant in order to avoid paying for division in the spin-lock loop.

I might be missing something simple here-- won't the mod still be
needed inside the loop even in the case where we know the
(non-power-of-two) number of threads at compile time?

Re: A flexible, lightweight, spin-lock barrier

Details
Message ID
<20220315182423.73t34xgjcbml7ttb@nullprogram.com>
In-Reply-To
<CAAuXPZr8Q4opGTkTf0Vn67tde307GWrdmqTKKppTJHbG-Ain8w@mail.gmail.com> (view parent)
DKIM signature
missing
Download raw message
> won't the mod still be needed inside the loop

Yes, but with a compile-time constant denominator that mod can be reduced 
to multiplication with shift/sub/add, much like division. (On common 
architectures, division and modulo are computed simultaneously from one 
instruction.) It's not as lean as a bitwise mask, but should be better 
than a division.

Though now that I think about it, even in the worst case you could compute 
modulo ahead of the poll loop as a lower bound, then use a greater-than 
comparison in the loop, exiting when the barrier exceeds the pre-computed 
threshold. This avoids expensive (read: latency inducing) operations in 
the poll loop.

int phase = current / n;
int next_phase = phase + 1;
int target = next_phase * n;
while (*barrier < target);
Reply to thread Export thread (mbox)