Impossible memory state with OpenMP

Andreas Korb <korb@in.tum.de>
Message ID
DKIM signature
Download raw message
I also implemented a program demonstrating this a few months ago. I used 
a simpler approach, using OpenMP and using its implicit synchronization 

You can find this at https://github.com/akorb/x86-tso-demonstration

Now I wonder if my demonstration is correct at all? Or does it show 
another race condition that I was not aware of?
Message ID
<1540151c-7727-90a0-e7f7-4e80abc00697@in.tum.de> (view parent)
DKIM signature
Download raw message
Interesting approach. I used assembly in my version specifically to avoid 
the data race I mentioned. As undefined behavior, it's difficult to 
predict what a compiler might do. The memory re-ordering in the experiment 
is one of the many surprising outcomes of a data race, after all. The 
compiler could re-order loads and stores on its own because the statements 
are semantically independent (undermining the experiment). Since the 
variables aren't volatile, there may even be multiple loads and stores per 
assignment in the C source, giving an "impossible" result less/more often 
than it should. It may even elide the elide the "1" stores since those 
stores are never synchronized, and so never legitimately loaded.

I inspected GCC's output on Linux — a little tricky with all the OpenMP 
control flow around it — and even without volatile the output roughly 
matches the desired assembly, with everything in the right order, so the 
experiment should be valid in the case I inspected. So it comes down to 
OpenMP actually parallelizing the two sections across threads and getting 
the timing tight enough. In this case I'm getting about the same results 
as my experiment: ~10% of trials observe a re-ordering. Replacing the data 
race with a bit of assembly from my article gives the same results. GCC's 
OpenMP on Linux has impressively tight timing!

However, under Clang on Linux, it takes on the order of 100,000 trials to 
observe a re-ordering. I haven't dug into the details, but I bet it's the 
situation I was avoiding in the first place: The first thread to arrive 
goes to sleep, and waking it up throws off the timing. It's even worse 
with Mingw-w64 on Windows (w64devkit). It has a similar order of magnitude 
for the trial count, but takes several seconds to run 100k iterations. (Is 
it spinning up two new threads and destroying them per iteration?)

I also tried with MSVC, and it gives results like Clang on Linux: takes 
~100k iterations, but isn't slow about it like Mingw-w64.

This serves as another example of where it's necessary to build something 
yourself versus relying on a generic implementation not tuned to your 
situation. Neither pthreads nor OpenMP specify semantics completely enough 
to necessarily run this particular experiment well. If you require precise 
timing then you need to do it yourself. So far, of the five generic 
synchronization tools I've tried (glibc pthreads barrier, Mingw-w64 
pthreads barrier, GCC OpenMP Linux, Clang OpenMP Linux, MSVC OpenMP 
Windows), only one has had tight enough timing to give accurate results in 
this experiment, and even that may change someday. The barrier I designed 
specifically for it gives accurate results across all implementations, but 
is inappropriate for the generic case.
Reply to thread Export thread (mbox)