Re: A lock-free, concurrent, generic queue in 32 bits

Eric Farmer <erfarmer@gmail.com>
Message ID
DKIM signature
Download raw message
Thanks for an interesting read, I had a couple of questions:

1. "The load might be more efficient if this were an explicit
'acquire' operation." I don't quite understand the details of the
various "promises" of sequence points, allowed reorderings, etc., of
__ATOMIC_ACQUIRE and company. What exactly more/different is happening
with the plain unatomic variable assignment in the article that might
be less efficient than in the linked queue.c?

2. The naming tripped me up at the beginning. The "head" and "tail"
semantics seem backward throughout?

Re: A lock-free, concurrent, generic queue in 32 bits

Message ID
<CAAuXPZoesYGv38FOZbLisrt8yZ8ij4uEoWaXU8HyibdGtF6jMA@mail.gmail.com> (view parent)
DKIM signature
Download raw message
1. Great question, since I had to think through this more before I could 
put it into words. Acquire/release didn't click with me until I read Russ 
Cox's Memory Models series, specifically part 2:

Memory Models

Acquire/release establishes a happens-before edge, and the terminology 
evokes mutexes. Stores before a release are visible to loads after a 
matching acquire. The key insight for me: "These probably exist only 
because they are free on x86." In other words, these weak orderings 
probably exist to expose x86 memory ordering semantics to high level 
programs. If your program can safely rely on x86 memory ordering, then you 
really just need to ensure the compiler doesn't reorder your loads/stores.

None of this is related to C/C++ sequence points. Aside from volatile, 
they're really just about expression evaluation and have nothing to do 
with memory models. Depending on who you ask, even volatile might not 
count when it comes to memory models. (That's long been controversial.)

I don't remember when it happened, but it was a paradigm shift for me to 
think about concurrency in terms of happens-before, synchronization edges, 
orderings, and invariants, rather than in terms of locks, exclusive 
access, or critical sections. Quite a lot can be accomplished without 
locks just by reasoning about the happens-before relationships of existing 
synchronization edges established by system calls or by use of concurrent 
data structures.

That being said, both GCC and Clang generate identical code for my test 
program on x86-64 regardless of acquire/release or sequential consistency, 
so it literally doesn't matter there. They generate different code on 
ARM64, though I measure no performance difference on my Raspi4. I compiled 
a simple toy function in isolation with both GCC and Clang on x86-64, and 
using acquire/release results in a plain store (i.e. relying on the usual 
x86 memory order semantics) while using sequential consistency generated 
an xchg instruction, which is implicitly locked. The former also let the 
compilers generate fewer instructions. So I bet in a different situation 
it could make a difference.

2. Perhaps I've gotten the convention backwards and I should have just 
used "read/write index" to be unambiguous. A quick search for circular 
buffers brings up articles using head/tail just as I did in the article. 
*shrug* Vanity makes me prefer head/tail: They're the same number of 
letters and so line up nicely in the code. :-)
Reply to thread Export thread (mbox)