Thanks for an interesting read. I have a couple of questions:
1. "The load might be more efficient if this were an explicit
'acquire' operation." I don't quite understand the details of the
various "promises" of sequence points, allowed reorderings, etc., of
__ATOMIC_ACQUIRE and company. What exactly is happening differently with
the plain non-atomic variable assignment in the article that might make
it less efficient than in the linked queue.c?
2. The naming tripped me up at the beginning. The "head" and "tail"
semantics seem backward throughout?
1. Great question, since I had to think through this more before I could
put it into words. Acquire/release didn't click with me until I read Russ
Cox's Memory Models series, specifically part 2:
Memory Models
https://research.swtch.com/mm
Acquire/release establishes a happens-before edge, and the terminology
evokes mutexes. Stores before a release are visible to loads after a
matching acquire. The key insight for me: "These probably exist only
because they are free on x86." In other words, these weak orderings
probably exist to expose x86 memory ordering semantics to high level
programs. If your program can safely rely on x86 memory ordering, then you
really just need to ensure the compiler doesn't reorder your loads/stores.
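To make the happens-before edge concrete, here is a minimal sketch using the GCC/Clang __atomic builtins. The mailbox struct and both function names are hypothetical, made up for illustration, and are not taken from queue.c. The producer writes a plain payload and then does a release store on a flag; a consumer that sees the flag through an acquire load is guaranteed to also see the payload.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical one-slot mailbox for illustration, not the article's queue.
struct mailbox {
    int payload;  // plain, non-atomic data
    int ready;    // flag, accessed only through the __atomic builtins
};

// Producer: write the payload, then publish it with a release store.
// The payload write cannot be reordered after the store to `ready`.
static void mailbox_publish(struct mailbox *m, int value)
{
    m->payload = value;
    __atomic_store_n(&m->ready, 1, __ATOMIC_RELEASE);
}

// Consumer: returns 1 and fills *out if a value has been published,
// otherwise returns 0. The acquire load pairs with the release store,
// so the payload read cannot be reordered before the flag check.
static int mailbox_poll(struct mailbox *m, int *out)
{
    assert(out != NULL);
    if (!__atomic_load_n(&m->ready, __ATOMIC_ACQUIRE)) {
        return 0;
    }
    *out = m->payload;
    return 1;
}
```

In a real program mailbox_publish and mailbox_poll would run on different threads. The point is only that the flag carries the ordering and the payload itself needs no atomics.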
None of this is related to C/C++ sequence points. Aside from volatile,
they're really just about expression evaluation and have nothing to do
with memory models. Depending on who you ask, even volatile might not
count when it comes to memory models. (That's long been controversial.)
I don't remember when it happened, but it was a paradigm shift for me to
think about concurrency in terms of happens-before, synchronization edges,
orderings, and invariants, rather than in terms of locks, exclusive
access, or critical sections. Quite a lot can be accomplished without
locks just by reasoning about the happens-before relationships of existing
synchronization edges established by system calls or by use of concurrent
data structures.
That being said, both GCC and Clang generate identical code for my test
program on x86-64 regardless of acquire/release or sequential consistency,
so it literally doesn't matter there. They generate different code on
ARM64, though I measured no performance difference on my Raspberry Pi 4.
However, when I compiled a simple toy function in isolation with both GCC
and Clang on x86-64, acquire/release produced a plain store (i.e. relying
on the usual x86 memory ordering semantics) while sequential consistency
produced an xchg instruction, which is implicitly locked. Acquire/release
also let the compilers generate fewer instructions. So I bet in a
different situation it could make a difference.
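For anyone who wants to try the comparison themselves, a toy along these lines should do. This is a sketch and not the exact function I compiled, and both function names are hypothetical. Compile it with something like "cc -O2 -S" and compare the two function bodies in the assembly output.

```c
// Two toy stores that differ only in the requested memory order.
// On x86-64 the release version can be an ordinary store, while the
// sequentially consistent version needs something stronger.

void store_release(int *p, int v)
{
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
}

void store_seq_cst(int *p, int v)
{
    __atomic_store_n(p, v, __ATOMIC_SEQ_CST);
}
```

The exact instructions depend on the compiler version, target, and flags, so treat the output as something to inspect, not something to assume.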
2. Perhaps I've gotten the convention backwards and I should have just
used "read/write index" to be unambiguous. A quick search for circular
buffers brings up articles using head/tail just as I did in the article.
*shrug* Vanity makes me prefer head/tail: They're the same number of
letters and so line up nicely in the code. :-)