Thanks for an interesting read. I had a couple of questions:

1. "The load might be more efficient if this were an explicit 'acquire' operation." I don't quite understand the details of the various "promises" of sequence points, allowed reorderings, etc., of __ATOMIC_ACQUIRE and company. What exactly is happening with the plain non-atomic variable assignment in the article that is more or different, and so possibly less efficient, than in the linked queue.c?

2. The naming tripped me up at the beginning. The "head" and "tail" semantics seem backward throughout?
1. Great question, since I had to think through this more before I could put it into words. Acquire/release didn't click with me until I read Russ Cox's Memory Models series, specifically part 2:

Memory Models
https://research.swtch.com/mm

Acquire/release establishes a happens-before edge, and the terminology evokes mutexes. Stores before a release are visible to loads after a matching acquire. The key insight for me: "These probably exist only because they are free on x86." In other words, these weak orderings probably exist to expose x86 memory ordering semantics to high level programs. If your program can safely rely on x86 memory ordering, then you really just need to ensure the compiler doesn't reorder your loads/stores.

None of this is related to C/C++ sequence points. Aside from volatile, they're really just about expression evaluation and have nothing to do with memory models. Depending on who you ask, even volatile might not count when it comes to memory models. (That's long been controversial.)

I don't remember when it happened, but it was a paradigm shift for me to think about concurrency in terms of happens-before, synchronization edges, orderings, and invariants, rather than in terms of locks, exclusive access, or critical sections. Quite a lot can be accomplished without locks just by reasoning about the happens-before relationships of existing synchronization edges established by system calls or by use of concurrent data structures.

That being said, both GCC and Clang generate identical code for my test program on x86-64 regardless of acquire/release or sequential consistency, so it literally doesn't matter there. They generate different code on ARM64, though I measure no performance difference on my Raspi4.

I also compiled a simple toy function in isolation with both GCC and Clang on x86-64. Using acquire/release results in a plain store (i.e. relying on the usual x86 memory order semantics), while using sequential consistency generates an xchg instruction, which is implicitly locked. The former also lets the compilers generate fewer instructions. So I bet in a different situation it could make a difference.

2. Perhaps I've gotten the convention backwards and I should have just used "read/write index" to be unambiguous. A quick search for circular buffers brings up articles using head/tail just as I did in the article. *shrug* Vanity makes me prefer head/tail: They're the same number of letters and so line up nicely in the code. :-)
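
Going back to point 1, here is a minimal sketch of what such a toy comparison could look like, assuming the GCC/Clang __atomic builtins. The function and variable names are hypothetical and are not taken from queue.c. The idea is that the plain write to payload happens before the release store to ready, so a reader that sees ready set through an acquire load is guaranteed to also see the payload.

```c
#include <assert.h>

static int payload;  /* plain, non-atomic data being published */
static int ready;    /* flag, only touched through __atomic builtins */

/* Release store. On x86-64, GCC and Clang typically compile the
 * atomic store to a plain mov. (Hypothetical name.) */
void publish_release(int value)
{
    payload = value;
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
}

/* Sequentially consistent store. On x86-64 this typically becomes
 * an xchg, which is implicitly locked. (Hypothetical name.) */
void publish_seq_cst(int value)
{
    payload = value;
    __atomic_store_n(&ready, 1, __ATOMIC_SEQ_CST);
}

/* Acquire load, pairing with either store above. Returns 1 and
 * fills *out if a value was published, otherwise returns 0. */
int try_consume(int *out)
{
    if (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE)) {
        return 0;
    }
    *out = payload;  /* ordered after the acquire load */
    __atomic_store_n(&ready, 0, __ATOMIC_RELEASE);  /* reset the flag */
    return 1;
}

/* Single-threaded sanity check of the pairing. The interesting case
 * has the publisher and the consumer in different threads. */
int roundtrip(int value)
{
    int got = 0;
    publish_release(value);
    int ok = try_consume(&got);
    assert(ok && got == value);
    return got;
}
```

Compiling just the two publish functions with optimization on and looking at the assembly (for example with -O2 -S) is one way to see the store-versus-xchg difference for yourself. The exact output depends on compiler version and target.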