ISO/IEC JTC1 SC22 WG21 N4375 - 2015-02-06
Paul E. McKenney, paulmck@linux.vnet.ibm.com
Alan Jeffrey, ajeffrey@bell-labs.com
Ali Sezgin, as2418@cam.ac.uk
This paper is an updated version of N4323, revised based on discussions with Michael Wong and Maged Michael. N4323 was in turn an updated version of N4216, revised based on discussions at the 2014 UIUC meeting (with much good feedback especially from Hans Boehm and Victor Luchangco) and on email reflector discussion.
Out-of-thin-air (OOTA) values have proven to be a thorny issue for memory models, including the Java memory model (JMM) and the C11 and C++11 memory models. The current C and C++ draft standards simply advise implementers to avoid OOTA values, without precisely defining what OOTA values might be. A number of publications have looked at this, including that of Vafeiadis et al., Batty et al., Boehm and Demsky, Sevcik, Jeffrey, and the JMM Causality Test Cases. These publications establish that OOTA is harmful, and look at a number of interesting consequences. Unfortunately, these publications focus only on a small (relatively) sensible-seeming subset of possible OOTA scenarios, This paper will explore some of the less sane scenarios, which will have the side-effect of demonstrating that out-of-thin-air execution is, in the C and C++ worlds, vacuous.
To that end, this paper will look at an interesting open problem, which is the fact that harmful OOTA programs can be very closely related to benign operation-reordering programs. This paper will discuss a general method, called perturbation analysis that may be used to distinguish harmful from benign.
NOTE: There are no known production-quality implementations that can induce OOTA. Therefore, implementations need not do anything to avoid OOTA, other than avoid dubious (at best) data-speculation optimizations. The OOTA shortcoming is therefore a shortcoming of theory rather than a change needed in practice.
The canonical harmful-OOTA example is as follow, where x
and y are both initially zero, and where all accesses
to shared variables are memory_order_relaxed (though
loads may instead be memory_order_consume loads):
Thread 1 Thread 2 -------- -------- r1 = x; r2 = y; y = r1; x = r2;
The current C and C++ standards do not rule out the outcome of
r1 and r2 both equalling 42—or any
other value that can be represented by x and y.
This outcome would of course be quite surprising, and would have
a number of
fatal consequences.
In contrast, the following closely related program is an example of benign reordering:
Thread 1 Thread 2 -------- -------- r1 = x; r2 = y; y = 42; x = r2;
Here, the outcome of r1 and r2 is perfectly
legitimate, and in fact occurs on actual implementations.
However, the presence of the constant 42 in and of itself cannot distinguish between the benign and harmful cases, for example, this program is an example of harmful OOTA:
Thread 1               Thread 2
--------               --------
r1 = x;                r2 = y;
if (r1 == 42)          x = r2;
    y = 42;
else
    y = r1;
This example can be extended to produce a variety of OOTA values by expanding Thread 1's “if” statement to provide additional values. A very large number and variety of examples can be generated, a few of which appear in the the JMM Causality Test Cases.
In the canonical harmful OOTA example, the value of 42 comes from nowhere,
and circulates between x and y.
This situation suggests the following perverse modification,
which assumes that x and y are unsigned and thus
not subject to undefined behavior upon overflow:
Thread 1 Thread 2 -------- -------- r1 = x; r2 = y; y = r1; x = r2 + 1;
This cannot result in r1 and r2 both having the
value 42.
To see this, note that the only way that r1 can have a
non-zero value is if it loads the value stored by Thread 2.
Similarly, the only way that r2 can have a non-zero value
is if it loads from Thread 1's store.
So suppose that r2 has the value 42.
This means that Thread 2 stores 43, which means that the value of
r1 will also be 43.
But this means that Thread 1 will store 43 to y,
which means that r2 also cannot be 42, contradicting the
initial assumption.
This example demonstrates that OOTA execution is similar
to the classic spreadsheet “solve” functionality: OOTA
conceptually requires iterating until a fixed point is reached.
This functionality has its place in spreadsheets, but has no place
in the confines of the C and C++ memory models, most especially for
non-converging test cases such as above.
Hence, this example demonstrates that OOTA is not just confusing
and harmful (for example, by inflicting undefined behavior on
unsuspecting developers and code), but is also vacuous in the
context of the C and C++ memory models.
Please note that this problem does not occur if either or both of the loads get the initial value of zero. The three convergent cases are as follows:
| r1Source | r2Source | r1 | r2 | x | y | 
|---|---|---|---|---|---|
| Initial Value | Initial Value | 0 | 0 | 1 | 0 | 
| Initial Value | Thread 1 Store | 0 | 0 | 1 | 0 | 
| Thread 2 Store | Initial Value | 1 | 0 | 1 | 1 | 
In contrast, the benign example does not diverge when modified:
Thread 1 Thread 2 -------- -------- r1 = x; r2 = y; y = 42; x = r2 + 1;
This program produces well-defined results regardless of where the loaded values come from, as shown below:
| r1Source | r2Source | r1 | r2 | x | y | 
|---|---|---|---|---|---|
| Initial Value | Initial Value | 0 | 0 | 1 | 42 | 
| Initial Value | Thread 1 Store | 0 | 42 | 43 | 42 | 
| Thread 2 Store | Initial Value | 1 | 0 | 1 | 42 | 
| Thread 2 Store | Thread 1 Store | 43 | 42 | 1 | 42 | 
The if-statement version of the harmful example also diverges in response to perturbation:
Thread 1               Thread 2
--------               --------
r1 = x;                r2 = y;
if (r1 == 42)          x = r2 + 1;
    y = 42;
else
    y = r1;
If Thread 2 reads from Thread 1's store, it might see the store
of the constant 42.
In that case, it will store 43 to x.
But if Thread 1 also reads Thread 2's store, it will load
the value 43, and thus won't execute the store of 42,
which means that Thread 2's load gives 43, not 42, contradicting the
initial assumption.
A general two-thread/two-variable form of OOTA is as follows:
Thread 1               Thread 2
--------               --------
r1 = x;                r3 = y;
if (f(r1, &r2))        if (g(r3, &r4))
    y = r2;                x = r4;
The previous section applied a perturbation function to Thread 2's store, resulting in the following:
Thread 1               Thread 2
--------               --------
r1 = x;                r3 = y;
if (f(r1, &r2))        if (g(p(r3), &r4))
    y = r2;                x = r4;
In the previous section, we chose p() to the increment
function.
Unfortunately, this choice does not always force an inconsistency,
for example:
int f(unsigned a, unsigned *b)
{
    *b = a & ~0x1;
    return 1;
}
unsigned g(unsigned a, unsigned *b)
{
    *b = a;
    return 1;
}
unsigned p(unsigned a)
{
    return a + 1;
}
Thread 1               Thread 2
--------               --------
r1 = x;                r3 = y;
if (f(r1, &r2))        if (g(p(r3), &r4))
    y = r2;                x = r4;
If Thread 1 stores the value 42 to y, Thread 2
will increment it and thus store 43 to x.
But Thread 1's call to f() will strip off the bottom
bit, restoring both the value 42 and consistent execution.
In this case, a perturbation function p that is an
increment fails to force an inconsistency (although it does succeed
in changing the overal behavior).
The choice of the perturbation function p() depends on
the algorithm, and is in the general case undecidable.
However, all is not lost.
First, it can easily be seen that a function p() that
increments by two rather than one suffices to produce the needed
inconsistency.
This is still a total function and results in the following, where
the functions f(),
g(), and p() have all been inlined for ease
of exposition:
Thread 1 Thread 2 -------- -------- r1 = x; r2 = y; y = r1 & ~0x1; x = r2 + 2;
Here, if Thread 2's load returns 42, it will store the value 44. Thread 1's load will thus return 44, which is unaffected by the bitwise AND, so that Thread 1 stores 44. This contradicts Thread 2's initial load of 42, thus providing the needed inconsistency.
Although the choice of p() is in theory undecidable,
the examples in this paper can be solved for a suitable p()
using (at most) simple algebra.
We further conjecture that a randomly chosen function would have a high
probability of forcing an inconsistency.
In fact, it is possible for the identity function to
result in an inconsistency, for example, in the following case:
Thread 1 Thread 2 -------- -------- r1 = x; r2 = y; y = r1; x = ~r2;
Because the example is itself inconsistent, the choice of the identity
function for p() suffices to preserve this inconsistency.
We further conjecture that the choice of p() is
not only decidable but trivial in the case where all variables
are boolean.
In this case, p() can be simple boolean NOT,
as in the C and C++ prefix ! operator.
In fact, the only reasonable choices for p() are
NOT on the one hand and the identity function on the other.
It might be necessary to apply p() to the Thread 1's
load instead of that of Thread 2.
Of course, just as with integers, it is necessary to check the original
example for inconsistencies before applying a perturbation function.
We will see that the choice of perturbation function p()
is constrained as follows:
p() must be total over the set of possible argument
	values.
p() must not violate constraints deduced from
	global analysis.
This section applies perturbation to each of the JMM causality test cases, comparing the results to the decisions.
Thread 1               Thread 2
--------               --------
r1 = x;                r2 = y;
if (r1 >= 0)           x = r2;
    y = 1;
Behavior in question: r1 == r2 == 1
Decision: Allowed.
The decision is based on the assumption that the compiler determines that
the variables are all non-negative.
We can define p() to be the increment function, and see
that although this choice of perturbation function does change the
behavior, it does not introduce an inconsistency.
On the other hand, if we violate the non-negativity assumption by choosing
p()
to be the function that decrements by two, we have
r2 == 1 && x == -2 && r1 == -2 && y == 0,
which is an inconsistent execution.
This example therefore illustrates another constraint on the perturbation function, namely that it not violate constraints deduced from global analysis.
Thread 1               Thread 2
--------               --------
r1 = x;                r3 = y;
r2 = x;                x = r3;
if (r1 == r2)
    y = 1;
Behavior in question: r1 == r2 == r3 == 1
Decision: Allowed.
Assume an arbitrary perturbation function p(), and that
Thread 1's loads happen after Thread 2's store.
Then r1 will always be equal to r2, so
that Thread 1 will always store to y.
Therefore, we have a consistent execution regardless of the perturbation
function.
Thread 1               Thread 2               Thread 3
--------               --------               --------
r1 = x;                r3 = y;                x = 2
r2 = x;                x = r3;
if (r1 == r2)
    y = 1;
Behavior in question: r1 == r2 == r3 == 1
Decision: Allowed.
Assume an arbitrary perturbation function p(), and that
Thread 1's loads happen after both Thread 2's and
Thread 3's stores.
Then r1 will always be equal to r2, so
that Thread 1 will always store to y.
Therefore, we again have a consistent execution regardless of the perturbation
function.
Thread 1 Thread 2 -------- -------- r1 = x; r2 = y; y = r1; x = r2; Behavior in question: r1 == r2 == 1 Decision: Disallowed.
This test case was analyzed earlier, and that analysis agrees with the
decision of “disallowed.”
Interestingly enough, a compiler examining this test case could deduce
that only the value 0 is assigned to x and y
(at initialization time).
The JMM applied this sort of compiler-based variable-value deduction to
other test cases, so it is curious that they chose not to apply it to
this case.
(Or, alternatively, given that they did not apply it to this case, it is
curious that they felt comfortable applying it to other cases.)
Of course, in general, the range of values of variables is also
undecidable.
Thread 1 Thread 2 Thread 3 Thread 4 -------- -------- -------- -------- r1 = x; r2 = y; z = 1; r3 = z; y = r1; x = r2; x = r3; Behavior in question: r1 == r2 == 1, r3 == 0 Decision: Disallowed.
Because r3 is zero, we know that Thread 4 stored zero
to x.
Therefore, the only way for r1 and r2 to
equal 1 is for an OOTA cycle involving only Threads 1 and 2.
However, this part of the test case is the same as test case 4,
and perturbation analysis gives the same outcome of “disallowed.”
Thread 1               Thread 2
--------               --------
r1 = A;                r2 = B;
if (r1 == 1)           if (r2 == 1)
    B = 1;                 A = 1;
                       if (r2 == 0)
		           A = 1;
Behavior in question: r1 == r2 == 1
Decision: Allowed.
B is always either zero or one, so Thread 2
will load either zero or one into r2.
This means that one or the other of the two if statements
will always be taken, so Thread 2 will always store the value 1 to
A.
This means that a sufficiently aggressive compiler could eliminate
Thread 2's if statements and simply unconditionally
assign to A.
Because all memory references are relaxed, the order of Thread 2's
load and store can be reversed, after which the result is allowed even in an
SC execution.
Perturbation can chance the values of r1 and r2,
but cannot introduce inconsistencies.
If the compiler cannot determine that B is always either zero
or one, perturbation still cannot introduce inconsistencies.
Either way, perturbation analysis agrees with the decision of
“allowed.”
Thread 1 Thread 2 -------- -------- r1 = x; r3 = y; r2 = x; z = r3; y = r2; x = 1; Behavior in question: r1 == r2 == r3 == 1 Decision: Allowed.
Simple reordering can produce the behavior, and adding perturbation can
change the behavior, but cannot result in inconsistencies.
For example, applying an arbitrary perturbation function p()
to the value stored to y results in the following:
Thread 1 Thread 2 -------- -------- r1 = x; r3 = y; r2 = x; z = r3; y = p(r2); x = 1;
In this case, we can see r1 == r2 == 1 && r3 == p(1).
So because perturbation does not introduce inconsistencies (instead merely
changing the behavior), perturbation analysis agrees with the decision
of “allowed.”
Thread 1               Thread 2
--------               --------
r1 = x;                r3 = y;
                       x = r3;
r2 = 1 + r1 * r1 - r1;
y = r2;
Behavior in question: r1 == r2 == 1
Decision: Allowed.
One wonders why r3 = 1 was not included in the behavior.
The analysis in the JMM test cases assumes that inter-thread analysis
determines that x and y is always either zero
or one, so that the compiler converts the code to the following:
Thread 1 Thread 2 -------- -------- r1 = x; r3 = y; r2 = 1; x = r3; y = r2;
In this case, perturbation results in the following:
Thread 1 Thread 2 -------- -------- r1 = x; r3 = y; r2 = 1; x = p(r3); y = r2;
Note that applying the perturbation function to Thread 1 has
no effect: The r1 variable is dead code.
Applying the perturbation to Thread 2 causes the value 2 to be
stored to x, which again has no effect in Thread 1
other than changing the value of r1.
Therefore, perturbations do not result in inconsistency, which agrees
with the decision of “allowed.”
This analysis applies given the range determination for x
and y even without the optimization.
In this case, the only reasonable perturbation function is the
! operator, resulting in the following:
Thread 1               Thread 2
--------               --------
r1 = x;                r3 = y;
                       x = !r3;
r2 = 1 + r1 * r1 - r1;
y = r2;
In this case, Thread 1's load from x returns either
zero or one, but it will always store the value of one to y.
This means that Thread 2's load from y will always
return the value 1, so that there is no inconsistency.
This again means that this test case is an example of benign reordering
rather than harmful OOTA, again agreeing with the decision of
“allowed.”
Thread 1               Thread 2               Thread 3
--------               --------               --------
r1 = x;                r3 = y;                x = 2
                       x = r3;
r2 = 1 + r1 * r1 - r1;
y = r2;
Behavior in question: r1 == r2 == 1
Decision: Allowed.
If the compiler can determine that Thread 3 executes only after
both Threads 1 and 2, then analysis proceeds as with
test case 8 above.
On the other hand, if Thread 3 can execute before Threads 1
and 2, then the compiler cannot limit the values of x
and y to zero and one, and so the perturbation might proceed
as follows:
Thread 1               Thread 2               Thread 3
--------               --------               --------
r1 = x;                r3 = y;                x = 2
                       x = r3 + 1;
r2 = 1 + r1 * r1 - r1;
y = r2;
If each of the Thread 1's and Thread 2's loads returns the
value stored by the other thread, inconsistency results.
For example, if we assume Thread 1 stores the value 1, then Thread 2
will store the value 2.
But that would mean that Thread 1 would calculate and store the
value 4, which is inconsistent with the assumption that Thread 2
loaded the value 1.
Therefore, if the compiler is unable to determine that the values of
x and y are limited to zero and one,
then a load-store cycle is illegal.
This situation might seem a bit disturbing, but it in fact will help lead to key insight, namely that optimizations that replace computations with the equivalent constants are legal and cannot result in OOTA values.
This variation has the initial value of x be two rather
than zero:
Thread 1               Thread 2               Thread 3
--------               --------               --------
r1 = x;                r3 = y;                x = 0
                       x = r3;
r2 = 1 + r1 * r1 - r1;
y = r2;
Behavior in question: r1 == r2 == 1
Decision: Allowed.
This plays out very similarly to test cases 8 and 9: If the
compiler can assign a constant to r2, then Threads 1
and 2 can legitimately load from each other's stores, otherwise
they cannot.
Thread 1              Thread 2              Thread 3              Thread 4
--------              --------              --------              --------
r1 = x;               r2 = y;               z = 1;                r3 = z;
if (r1 == 1)          if (r2 == 1)                                if (r3 == 1)
    y = 1;                x = 1;                                      x = 1;
Behavior in question: r1 == r2 == 1, r3 == 0
Decision: Disallowed.
Given that r3 is equal to zero, we know that Thread 4's
load could not have read from Thread 2's store (possibly due to
Thread 2's store not having executed in the first place).
We also know that Thread 4 did not store to x.
This test case therefore can be analyzed by looking only at Threads 1
and 2.
Perturbation then proceeds as follows:
Thread 1              Thread 2
--------              --------
r1 = x;               r2 = y - 1;
if (r1 == 1)          if (r2 == 1)
    y = 1;                x = 1;
Suppose that Thread 1's load returns the value that Thread 2
stored.
Then Thread 1's if statement will execute the store
in its then clause.
If Thread 2's load in turn returns the value that Thread 1
stored, r2 will be zero, which will mean that Thread 2's
if statement will not execute the store in its
then clause.
But that means that Thread 1's load cannot possibly return the
value that Thread 2 stored because nothing was stored.
This inconsistency means that this test case is an example of harmful
OOTA, which agrees with the JMM decision of “disallowed”.
Thread 1 Thread 2 -------- -------- r1 = z; r4 = w; w = r1; r3 = y; r2 = x; z = r3; y = r2; x = 1; Behavior in question: r1 == r2 == r3 == r4 == 1 Decision: Disallowed.
We again assume that each load returns the value of the corresponding store
from the other thread.
This results in an update order of x, y,
z, w.
Because this is acyclic, perturbation cannot introduce an inconsistency,
so this is an example of simple reordering, and not OOTA at all.
Thus, perturbation analysis agrees with the JMM decision of
“allowed.”
This test case has initial values of zero for x and y,
1 for a[0], and 2 for a[1].
Thread 1 Thread 2 -------- -------- r1 = x; r3 = y; a[r1] = 0; x = r3; r2 = a[0]; y = r2; Behavior in question: r1 == r2 == r3 == 1 Decision: Disallowed.
Suppose as usual that each thread's load returns the value from the corresponding store by the other thread. We can perturb as follows:
Thread 1 Thread 2 -------- -------- r1 = x; r3 = y; a[r1] = 0; x = r3 + 1; r2 = a[0]; y = r2;
Given this perturbation, if Thread 2 loads the value 1 from
y, then it will store the value 2 to x.
Thread 1 will then load 2, and run off the end of array a,
resulting in undefined behavior (or, if the array has three elements,
uninitialized values).
This is clearly inconsistent, so this is an example of harmful OOTA,
which agrees with the JMM decision of “disallowed.”
Thread 1              Thread 2
--------              --------
r1 = x;               r2 = y;
if (r1 == 1)          if (r2 == 1)
    y = 1;                x = 1;
Behavior in question: r1 == r2 == 1
Decision: Disallowed.
The following perturbation works in this case:
Thread 1              Thread 2
--------              --------
r1 = x;               r2 = y;
if (r1 == 1)          if (r2 + 1 == 1)
    y = 1;                x = 1;
As before, suppose that each threads' load returns the value from the
other thread's corresponding store.
Then r2 will be one, so that r2 + 1 will not
be equal to one, in turn meaning that Thread 2's store will not be
executed.
In this case, r1 must be zero, so that Thread 1's
store also is not executed.
This means that r2 cannot possibly have the value one, resulting
in an inconsistency.
This agrees with the JMM decision of “disallowed.”
In this test case, accesses to y use
memory_order_seqcst.
Thread 1              Thread 2
--------              --------
r1 = a;               do {
if (r1 == 0)              r2 = y;
    y = 1;                r3 = b;
else                  } while (r2 + r3 == 0);
    b = 1;            a = 1;
Behavior in question: r1 == r3 == 1 && r2 == 0
Decision: Disallowed.
If Thread 2 leaves its loop due to Thread 1's store to y,
the resulting synchronized-with relationship will force the load from
a to happen before the store, so that r1 == 0.
We therefore consider executions where Thread 2 leaves its loop due
to Thread 1's store to b.
In this case, we can use the following perturbation:
Thread 1              Thread 2
--------              --------
r1 = a;               do {
if (r1 - 1 == 0)          r2 = y;
    y = 1;                r3 = b;
else                  } while (r2 + r3 == 0);
    b = 1;            a = 1;
Suppose that Thread 1's load from a returns the value
stored by Thread 2.
Then Thread 1 will store to y, which, as noted above,
ensures that either Thread 2 never exits its loop or that there is
a synchronized-with relationship between the store to and the load from
y.
Either outcome makes it impossible for Thread 1 to load the value
from Thread 2's store to a, resulting in an inconsistency.
This agrees with the JMM's decision of “disallowed.”
In this test case, accesses to x and y use
memory_order_seqcst.
Thread 1              Thread 2              Thread 2
--------              --------              --------
r0 = x;               do {                  x = 1;
if (r0 == 1)              r2 = y;
    r1 = a;               r3 = b;
else                  } while (r2 + r3 == 0);
    r1 = 0;           a = 1;
if (r1 == 0)
    y = 1;
else
    b = 1;
Behavior in question: r0 == r1 == r3 == 1 && r2 == 0
Decision: Disallowed.
Just as with test case 14, we must consider cases where Thread 2
leaves its loop due to Thread 1's store to b.
And a similar perturbation suffices:
Thread 1              Thread 2              Thread 2
--------              --------              --------
r0 = x;               do {                  x = 1;
if (r0 == 1)              r2 = y;
    r1 = a - 1;           r3 = b;
else                  } while (r2 + r3 == 0);
    r1 = 0;           a = 1;
if (r1 == 0)
    y = 1;
else
    b = 1;
Suppose that Thread 1's load from x returns the value
stored by Thread 3 and that Thread 1's load from a
returns the value stored by Thread 2.
But this means that Thread 1 will store to y,
which forces Thread 2's store to a to happen after
Thread 1's load from a, thus forcing an inconsistency.
Hence perturbation analysis agrees with the JMM's decision of
“disallowed.”
Thread 1 Thread 2 -------- -------- r1 = x; r2 = x; x = 1; x = 2; Behavior in question: r1 == 2 && r2 == 1 Decision: Allowed.
An arbitrary perturbation function applied to either load from
x has no effect on subsequent execution (for
some definition of “subsequent”).
Therefore, perturbation analysis cannot induce an inconsistency,
which agrees with the JMM decision of “allowed.”
Thread 1              Thread 2
--------              --------
r3 = x;               r2 = y;
if (r3 != 42)         x = r2;
    x = 42;
r1 = x;
y = r1;
Behavior in question: r1 == r2 == r3 == 42
Decision: Allowed.
At the point where Thread 1 loads x into r1,
it has either just loaded the value 42 from x or just stored
the value 42 to x.
Therefore, the compiler could simply set r1 to the constant 42.
Once it has done that, because relaxed accesses do not provide any ordering
guarantees, the assignment to r1 (as well as the subsequent
store to y may be reordered.
Note that this transformation might be a bit controversial, because as soon
as the assignment of 42 to r1 is moved to precede the store to
x, the rationale for replacing the load from x
with 42 disappears.
For the purpose of this analysis, we will assume that relaxed loads and
stores permit even this somewhat extreme reordering.
Given that transformation, no perturbation can change the value
that Thread 1 stores to y, which eliminates any
possibility of inconsistency.
Perturbation analysis thus agrees with the JMM decision of
“allowed.”
Thread 1              Thread 2
--------              --------
r3 = x;               r2 = y;
if (r3 == 0)          x = r2;
    x = 42;
r1 = x;
y = r1;
Behavior in question: r1 == r2 == r3 == 42
Decision: Allowed.
Given a compiler that could figure out that the only possible values that
could be loaded from x are 0 and 42, the perturbation analysis
is restricted to perturbing within these two values, which gives the same
result as test case 17.
Thread 1 Thread 2 Thread 3 -------- -------- -------- join thread 3 r2 = y; r3 = x; r1 = x; x = r2; if (r3 != 42) y = r1; x = 42; Behavior in question: r1 == r2 == r3 == 42 Decision: Allowed.
If the compiler is allowed to optimize across the join,
this is the same as test case 17.
Thread 1 Thread 2 Thread 3 -------- -------- -------- join thread 3 r2 = y; r3 = x; r1 = x; x = r2; if (r3 == 0) y = r1; x = 42; Behavior in question: r1 == r2 == r3 == 42 Decision: Allowed.
If the compiler is allowed to optimize across the join,
this is the same as test case 18.
In all cases, perturbation analysis gives the same decision as did the JMM's deliberations. We therefore hypothesize that the analysis distinguishes benign reordering from harmful OOTA.
It is important to note that the perturbation-analysis approach sidesteps the issue of which compiler optimizations may be used in a given situation: Optimizations are applied first, and only then is perturbation analysis undertaken. However, this sidestepping has the benefit that perturbation analysis applies equally well to C, C++, and Java, despite the very different restrictions on optimizations across these three languages.
It would be nice to have a succinct description of the set of test cases
in which perturbation functions introduced inconsistencies.
Ali Sezgin pointed out this set is described by rf ∪ sdep,
where rf is the reads-from relationship and sdep
is “semantic dependence”, roughly defined as those dependency
relationships in which at least some changes in the value at the head of the
dependency relationship propagate through, resulting in a change at the
tail of that relationship.
Prohibiting executions that have cycles in rf ∪ sdep
can therefore be expected to prohibit OOTA behaviors.
One beneficial consequence of this relationship to semantic dependency is
that rf ∪ nsdep cycles are allowed, where
nsdep ∩ sdep is the empty set and where
nsdep ∪ sdep = dep.
This means that the compiler is free to replace expressions that are
known to always result in a single value with the corresponding constant,
without danger of introducing OOTA behavior.
We hypothesize that non-speculative code-reordering optimizations are
similarly unable to introduce OOTA behavior.
Defining “semantic dependency” sufficiently for formal modeling remains an open issue. In the general case, this the question of whether or not a given dependency is a semantic dependency is of course undecidable. However, this question can be decided straightforwardly in many common cases. One approach would be to flag dependencies that the tool was unable to classify. Another approach would be to consider cases that a given compiler might optimize, and to classify other cases as semantic dependencies.
The following sections present asides on undecidabilty, inferred ordering, and code generated by old compilers.
The fact that the choice of the perturbation function is undecidable is no greater obstacle for OOTA than it is for anything else. After all, almost all interesting questions about Turing-complete languages are undecidable. (As Doug Lea pointed out, others are “merely” NP.) The following simple example is a case in point:
Thread 1                      Thread 2
--------                      --------
x = 1;                        r1 = y;
if (undecidable())            r2 = x.load(acquire);
    y.store(1, relaxed)
else
    y.store(1, release)
Is the outcome r1 == 1 && r2 ==0 permitted?
In general, this is undecidable.
So what do we do about this?
The same things that we have always done.
The ppcmem and herd tools permit only a small
finite number of variables, thus avoiding undecidability.
The cbmc model checker limits the number of passes through
each loop, thus considering only finite executions, again avoiding
undecidability.
These two strategies should also work well for perturbation analysis.
And selection of a perturbation function is usually straightforward:
So the undecidability should not normally be a problem in practice.
Suppose that a highly optimizing compiler and a less-aggressive analysis tool are applied to the following litmus test (put forward by Hans Boehm):
Thread 1                      Thread 2
--------                      --------
r1 = x;                       r2 = y;
y = r1;                       r3 = f(y);
                              x = r3;
Suppose the compiler determined that function f() did
not represent a semantic dependency, but that the analysis tool was
unable to make this determination.
Might the analysis tool therefore incorrectly report to the developer
that Thread 2's load from y is ordered before
its store to x?
When considering this question, keep in mind that all accesses are relaxed. This means that there are no ordering properties, unless they are supplied by other non-relaxed accesses. Therefore the answer to the question is that the analysis tool should not report that Thread 2's accesses are ordered in any case. It should instead confine itself to disregarding any candidate executions involving OOTA results.
What can happen is that the compiler might be able to determine
that f(y) always returns 42, thus allowing it to transform
the code as follows:
Thread 1                      Thread 2
--------                      --------
r1 = x;                       r2 = y;
y = r1;                       r3 = 42;
                              x = r3;
The compiler might then generate code that resulted in
x = y = 42.
If the analysis tool had less sophisticated analysis than did the compiler,
the tool might well exclude this result.
But this would constitute a bug in the tool rather than a problem with OOTA:
The compiler correctly determined that the dependency of r3
on r2 was not a semantic dependency, while the tool failed
to make this distinction.
However, a high-quality tool would report that it was unable to prove
whether or not f() represented a semantic dependency.
Suppose that the code for function f() in the prior example
was generated by an old compiler, perhaps even one that is unaware of
C11 and C++11 atomics.
Mightn't such a compiler carry out optimizations that could result in
OOTA executions?
(This possibility was raised in a small-group discussion by Hans Boehm
at the 2014 UIUC meeting.)
The answer is "no" for all known compilers used in production. The only possible exception would be research compilers used to investigate value speculation and similar extreme optimizations. However, these research compilers could potentially generate OOTA executions even in sequential code, so it makes sense to exclude them from consideration.
Furthermore, old binaries often introduce data races due to overwriting adjacent fields, which means that old binaries (for example, from gcc 4.6 and earlier) introduce undefined behavior. This means that use of old binaries in concurrent code is a dubious practice in any case.
Nevertheless, this possibility does not permit data-race-free legacy libraries to inflict OOTA executions on multithreaded programs.
This document has shown that all of the harmful OOTA examples in the the JMM Causality Test Cases are special cases that have a fixed point, and that slight perturbations result in inconsistent results. This supports the hypothesis that any harmful-OOTA test case can be perturbed into an inconsistent state and that benign-reordering test cases cannot be.
This perturbation analysis appears to be equivalent to requiring that
rf ∪ sdep be acyclic.
This is an extremely important result:  It means that any compiler
optimization that substitutes a constant value for a read known to return
that value cannot induce OOTA behavior.
This constraint should also ensure that non-speculative
code-movemenet optimizations should be similarly unable to induce
OOTA behavior.
Effective and efficient modeling of semantic dependencies
(sdep) remains an important open problem.