Thanks for a really interesting post, I was not aware of the "impossible" memory re-ordering issue. > The denominator is ideally a compile-time constant in order to avoid paying for division in the spin-lock loop. I might be missing something simple here-- won't the mod still be needed inside the loop even in the case where we know the (non-power-of-two) number of threads at compile time?
> won't the mod still be needed inside the loop Yes, but with a compile-time constant denominator that mod can be reduced to multiplication with shift/sub/add, much like division. (On common architectures, division and modulo are computed simultaneously from one instruction.) It's not as lean as a bitwise mask, but should be better than a division. Though now that I think about it, even in the worst case you could compute modulo ahead of the poll loop as a lower bound, then use a greater-than comparison in the loop, exiting when the barrier exceeds the pre-computed threshold. This avoids expensive (read: latency inducing) operations in the poll loop. int phase = current / n; int next_phase = phase + 1; int target = next_phase * n; while (*barrier < target);