Re: A flexible, lightweight, spin-lock barrier

Eric Farmer <erfarmer@gmail.com>
Message ID
DKIM signature
Download raw message
Thanks for a really interesting post, I was not aware of the
"impossible" memory re-ordering issue.

> The denominator is ideally a compile-time constant in order to avoid paying for division in the spin-lock loop.

I might be missing something simple here-- won't the mod still be
needed inside the loop even in the case where we know the
(non-power-of-two) number of threads at compile time?

Re: A flexible, lightweight, spin-lock barrier

Message ID
<CAAuXPZr8Q4opGTkTf0Vn67tde307GWrdmqTKKppTJHbG-Ain8w@mail.gmail.com> (view parent)
DKIM signature
Download raw message
> won't the mod still be needed inside the loop

Yes, but with a compile-time constant denominator that mod can be reduced 
to multiplication with shift/sub/add, much like division. (On common 
architectures, division and modulo are computed simultaneously from one 
instruction.) It's not as lean as a bitwise mask, but should be better 
than a division.

Though now that I think about it, even in the worst case you could compute 
modulo ahead of the poll loop as a lower bound, then use a greater-than 
comparison in the loop, exiting when the barrier exceeds the pre-computed 
threshold. This avoids expensive (read: latency inducing) operations in 
the poll loop.

int phase = current / n;
int next_phase = phase + 1;
int target = next_phase * n;
while (*barrier < target);
Reply to thread Export thread (mbox)