I also implemented a program demonstrating this a few months ago. I took a simpler approach, using OpenMP and relying on its implicit synchronization barriers. You can find it at https://github.com/akorb/x86-tso-demonstration. Now I wonder: is my demonstration correct at all? Or does it show another race condition that I was not aware of?
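The core of it is roughly this (a simplified sketch of the idea, not the exact code from the repository): two threads each store 1 to one shared variable and then load the other, with OpenMP's implicit barriers providing the synchronization around each trial.

    #include <stdio.h>

    int x, y, r0, r1;

    int main(void)
    {
        int reorders = 0;
        for (int i = 0; i < 1000000; i++) {
            x = y = 0;
            #pragma omp parallel sections
            {
                #pragma omp section
                {
                    x = 1;   /* thread A: store to x ...  */
                    r0 = y;  /* ... then load y           */
                }
                #pragma omp section
                {
                    y = 1;   /* thread B: store to y ...  */
                    r1 = x;  /* ... then load x           */
                }
            }
            /* implicit barrier at the end of the sections construct */
            if (r0 == 0 && r1 == 0) {
                reorders++;  /* "impossible" under sequential consistency */
            }
        }
        printf("%d re-orderings observed\n", reorders);
    }

(Compiled with -fopenmp; the shared variables are plain, non-volatile, non-atomic ints.)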
Interesting approach. I used assembly in my version specifically to avoid the data race I mentioned. Since a data race is undefined behavior, it's difficult to predict what a compiler will do with it; the memory re-ordering in the experiment is one of the many surprising outcomes of a data race, after all. The compiler could re-order the loads and stores on its own because the statements are semantically independent, undermining the experiment. Since the variables aren't volatile, there may even be multiple loads and stores per assignment in the C source, producing an "impossible" result more or less often than it should. The compiler may even elide the "1" stores entirely, since those stores are never synchronized and so are never legitimately loaded.

I inspected GCC's output on Linux (a little tricky with all the OpenMP control flow around it), and even without volatile the output roughly matches the desired assembly, with everything in the right order, so the experiment should be valid in the case I inspected. It then comes down to OpenMP actually parallelizing the two sections across threads and getting the timing tight enough. In this case I'm getting about the same results as my experiment: ~10% of trials observe a re-ordering. Replacing the data race with a bit of assembly from my article gives the same results. GCC's OpenMP on Linux has impressively tight timing!

However, under Clang on Linux it takes on the order of 100,000 trials to observe a re-ordering. I haven't dug into the details, but I bet it's the situation I was avoiding in the first place: the first thread to arrive goes to sleep, and waking it up throws off the timing. It's even worse with Mingw-w64 on Windows (w64devkit). It needs a similar order of magnitude for the trial count, but takes several seconds to run 100k iterations. (Is it spinning up two new threads and destroying them per iteration?) I also tried MSVC, which gives results like Clang on Linux: it takes ~100k iterations, but isn't slow about it like Mingw-w64.

This is another example of where it's necessary to build something yourself rather than rely on a generic implementation not tuned to your situation. Neither pthreads nor OpenMP specifies its semantics completely enough to necessarily run this particular experiment well. If you require precise timing, you need to do it yourself. So far, of the five generic synchronization tools I've tried (glibc pthreads barrier, Mingw-w64 pthreads barrier, GCC OpenMP on Linux, Clang OpenMP on Linux, MSVC OpenMP on Windows), only one has had tight enough timing to give accurate results in this experiment, and even that may change someday. The barrier I designed specifically for this experiment gives accurate results across all implementations, but is inappropriate for the generic case.
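For reference, the race-free core I mean looks roughly like this: a sketch of the idea using GCC-style inline assembly on x86-64 so the compiler can neither re-order nor elide the accesses (illustrative only, not the exact code from the article):

    /* Store 1 to *store_to, then load *load_from, in that order, as real
       machine instructions the compiler cannot rearrange or remove. */
    static int store_then_load(int *store_to, int *load_from)
    {
        int r;
        __asm__ volatile (
            "movl $1, %[s]\n\t"   /* store 1 to our variable      */
            "movl %[l], %[r]"     /* then load the other variable */
            : [s] "=m" (*store_to), [r] "=r" (r)
            : [l] "m" (*load_from)
            : "memory"
        );
        return r;
    }

Each thread calls it with its own variable first and the other thread's variable second; both loads returning 0 in the same trial is the re-ordering the experiment looks for.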
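And the kind of purpose-built barrier I mean is, in spirit, a spin barrier that never sleeps, so both threads leave it at nearly the same instant. A minimal sketch (not the actual barrier from the article, and deliberately wasteful: it burns a CPU core while waiting):

    #include <stdatomic.h>

    #define NTHREADS 2

    static _Atomic int waiting;     /* threads that have arrived this round */
    static _Atomic int generation;  /* bumped when everyone has arrived     */

    static void spin_barrier_wait(void)
    {
        int gen = atomic_load(&generation);
        if (atomic_fetch_add(&waiting, 1) == NTHREADS - 1) {
            /* last arrival: reset the count and release the others */
            atomic_store(&waiting, 0);
            atomic_fetch_add(&generation, 1);
        } else {
            /* spin, never sleep, until the last arrival bumps the generation */
            while (atomic_load(&generation) == gen) {
                /* busy-wait */
            }
        }
    }

That's also why it's inappropriate for the generic case: it assumes waits are short and that burning a core is acceptable, which happens to be exactly right for this experiment.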