Doc. No.:	WG21/N2746=J16/08-0256
Almost identical to:	WG14/N1315
Date:	2008-8-22
Reply to:	Hans-J. Boehm
Phone:	+1-650-857-3406
Email:	Hans.Boehm@hp.com

N2746: Rationale for the C++ working paper definition of "memory location".

An identical paper, except for header and introductory information, was previously published as WG14/N1315, largely in response to a C committee liaison report at the Sophia-Antipolis meeting.

This paper outlines the reasoning behind the definition of "memory location" used in the C++ concurrency memory model (see C++ papers N2429 and N2480) in the current C++ working paper.

It is in part a reaction to concern expressed by members of the C committee (including Rich Peterson and Clark Nelson), and in a liaison report, that this definition may be too restrictive on the implementation. This is intended to reflect the arguments for the current C++ solution; it is not intended to represent HP's position on the C memory model.

This follows substantial email discussion with some C committee members on the issue. Although that discussion did no so far result in agreement, it has certainly helped to clarify the issues, and hence contributed to this paper.

Background: The significance of "memory locations"

In the C++ working paper concurrency memory model, a memory location is essentially the smallest unit of memory that we can guarantee to be updated independently of surrounding memory. Objects that share the same memory location may not be independently updatable; updating one may be implemented by reading a larger unit of memory including the other one, replacing a piece of the larger unit, typically in a register, and then storing the larger unit back. This implies that two objects sharing a memory location may not be simultaneously updated by two threads.

For example, consider the declaration

struct {char a; char b; }

If a and b share the same memory location, and they are each updated by loading and then storing the entire struct, then a simultaneous update of a and b might be performed by

Copying the entire struct into a temporary in thread 1.
Updating the a field in the temporary.
Copying the struct into a temporary in thread 2, updating b in the temporary, and writing the temporary back, still in thread 2.
Copying thread 1's temporary back.

This effectively loses thread 2's update of b.

This becomes an issue if, for example, the two fields a and b are protected by different locks, and hence may be accessed concurrently.

With the usual definition of data race, if a and b share a memory location, then we may introduce a race even if, say, b is protected by a lock, and a is never written.

This is also potentially an issue for adjacent array elements, either under similar conditions, or because work on disjoint sections of the array are performed in parallel by different threads.

From a pure programming convenience perspective, it would be optimal if each scalar object and bit-field were its own memory location. Unfortunately, in the case of bit-fields, this is generally considered impractical, since little existing hardware supports independent updates of individual bits without the use of atomic read-modify-write operations. And the latter are generally too expensive to be practical for this application.

As a result, the memory model in the C++ working paper defines each scalar object to be a separate memory location, but considers an entire contiguous sequence of (non-zero length) bit-fields to be a single location, i.e. it allows simultaneous updates of bit-fields to interfere with each other.

We believe that there are only two ways in which major commercial compilers for general purpose processors currently deviate from this:

Updates of bit-fields commonly overwrite adjacent small non-bit-fields. On all modern mainstream machines this is avoidable at a cost of generating several smaller stores instead of a single load and store. This theoretically may incur some slowdown. But experiments by Sandya Mannarswamy at HP on SPECint for PA-RISC and Itanium found differences in run-time produced by these changes in bit-field compilation to be significantly under 1%, with varying signs, and all well within measurement noise. (This is unsurprising, since most programmers generally have a favorable alignment in mind when they declare structs with bit-fields, and the added overhead is minor in any case. We know of no reason to expect different results on other architectures.)
Compilers for the Alpha processor tend to avoid byte stores by default for the sake of compatibility with the earliest generations of Alpha processors, which did not support them. Alpha processors since the 21164A (introduced in late 1995) have supported them, and all Alpha machines have since been discontinued by HP. Thus they do not appear to be of concern for a 2010 (?) standard.

It is conceivable that there exist other multiprocessors for which the C++ restrictions are a serious issue, but we are not aware of them. The C++ memory model is not difficult to implement on X86, ARM, PowerPC, MIPS, IBM Mainframes, SPARC, Motorola 68K, and various others. It imposes strictly weaker hardware requirements than the Java memory model, and is therefore implementable on any hardware that supports Java.

A Note on uniprocessors

The C++ memory model is implementable at no cost in any environment that does not really support multiple concurrent threads. So long as atomic variables, and variables of type sig_atomic_t are sufficiently aligned (and I don't believe this is a new restriction for sig_atomic_t), such environments do not allow the user to observe whether stores of small objects rewrite adjacent bytes. Thus the concurrency memory model imposes no added constraints.

At more cost, this can be extended to uniprocessors running multithreaded code, even in those rare cases in which the processor does not directly support byte stores. This approach uses restartable atomic sequences ( "Fast mutual exclusion for uniprocessors", Bershad, Redell, and Ellis, ASPLOS 92). It requires that byte stores be implemented by the usual load-replace-store sequence, but the compiler must generate enough additional information, and must obey some additional code generation constraints, so that the OS can restart the sequence on a context switch. (The detailed implementation involves a time-space tradeoff. The easiest solution is to compile all byte stores to a function call, so that there is only one such sequence. This clearly involves some run-time cost, but typically far less than generating an atomic read-modify-write sequence. The time cost can be converted to a space cost by duplicating the sequence, and generating a table describing the locations of the sequences.)

In cases in which the thread-switching code is willing to trust client code, simpler solutions are also possible. The client code can just set a flag during the execution of "atomic sequences" like byte stores, which should not be interrupted by a context switch.

The argument for a "coarser" definition of "memory location"

My understanding is that many members of the C committee feel uncomfortable with the above solution, particularly in the context of C, since it may slow down existing code when it is recompiled. This is particularly true when we consider compilers that fully exploit the freedom given by the current Posix specification, which appears to leave the definition of "memory location" completely up to the compiler.

A particularly egregious case (pointed out by David Keaton) may occur on hardware that supports vector instructions, but requires vector stores to unconditionally store every element of the vector. Thus the loop

for (i = 1; i < n; i++) { if (a[i] != 0.0) a[i] = 1.0/a[i]; }

could be vectorized under existing rules, but not easily under the proposed rules. Under existing rules, the above loop could arguably be transformed to

for (i = 1; i < n; i++) { a[i] == (a[i] != 0.0? 1.0/a[i] : a[i]); }

which could be directly vectorized. This transformation is disallowed under C++ working paper rules, since it introduces stores to the elements a[i] that were originally zero. We all believe that the resulting performance difference could be substantial (though experiments might be useful). Thus it is at least conceivable that simply recompiling an existing C program, on a system that previously implemented the above transformation, would result in a substantial slowdown, requiring existing code to be rewritten.

Why I still believe the current C++ definition is correct

It is possible to "coarsen" the C++ definition in relatively minimal ways, notably by allowing bit-field updates to overwrite adjacent fields whose type is smaller than the bit-field type. This would bring the definition into line with existing practice on most platforms. It would deviate from Java and the current C++ working paper definition only for structs involving bit-fields, which do not exist in Java.

But, on balance, this still appears to be an undesirable trade-off:

As we pointed out above, all evidence suggests that the performance benefit is negligible. This would not allow something like the vectorization transformation above. In very rare cases, it would allow fewer stores and bit masking operations to be generated, at the cost of adding a load.
It adds a source of completely unexpected, and nearly untestable, errors. It is relatively easy to explain that individual bit-fields cannot be independently updated, and thus must be separated by e.g. a zero-length bit-field to enable concurrent access. I doubt that many programmers expect this behavior for, say, a short field followed by a bit-field.
It is critically important for the industry that we make it easier to write parallel and specifically multi-threaded programs, so that a larger variety of application see continued performance improvement from multicore processors, where they previously saw performance improvements from added clock speed. I believe it is clearly worth the modest amount of required compiler engineering effort to avoid unnecessary pitfalls like this.

A much more drastic alternative would be to revert to something like the classic Posix rules, which (arguably) allow the above vectorization example. I believe that, in spite of the much more substantial performance advantages, this would be a disastrous mistake, and would make C++ essentially unusable for multithreaded programming:

Although detailed information is hard to come by, as far as I can tell, no existing mainstream compilers implement transformations such as these, except possibly when the user requests unsafe optimizations. I have heard discussion in the past of compiler merging of non-adjacent field updates. As far as I can tell, this is no longer performed by current production compilers. Based on comments from several compiler teams, optimizations along these lines have tended to be removed in response to complaints from programmers writing multithreaded code, often particularly from kernel developers, who have so far probably been the most aggressive authors of such code. I expect that as more application programmers need to aggressively parallelize code to take advantage of multicore processors, they will generate more of these complaints.
Thus the effect of a carefully worded "coarse" definition of memory location in the language standard is likely to be that either it will be ignored, or it will cause mainstream compilers to implement optimizations that have so far largely been considered unsafe. The latter is likely to cause far more serious code breakage than performance regressions from failed vectorization.
Such a "coarse" memory location definition, especially for arrays, breaks even the simplest programs that try to use threading for parallelization. Consider a program that increments each element in an array by forking N threads, each of which increments the elements of a chunk containing roughly one N^th of the array elements. Such a coarse definition would allow data races at the chunk boundaries, and a much more complex algorithm would be needed. Given the inherent difficult of writing such programs, I don't believe we can tolerate such added complexities. I would be unwilling to program in such a language.
It is unclear how much of the pressure to perform optimizations such as the above vectorization example is coming from platforms that would be seriously affected, i.e. that can support threads and cache-coherent shared memory across multiple, probably homogeneous, processors.
More philosophically, the parallel semantics of a piece of program text are fundamentally different from the sequential ones. If we do want to support concurrency, we need to respect the parallel semantics, which unavoidably restricts some compiler transformations. The two versions of the for-loop in the vectorization example have identical sequential semantics, but different parallel semantics. If the programmer wants vectorization, and the second version has acceptable semantics in the presence of threads, (s)he needs to write the second version, which is not seriously more complicated than the first.
These rules would be unexpectedly different from those learned by beginning programmers, who are likely to have been exposed to the Java rules. Problems introduced by failing to note the difference are likely to be very intermittent, and may be missed during testing.
If the presence of data races in parallel programs depends on the sizes of integral types, it becomes much more difficult to write parallel algorithms that are templatized with respect to the type. This would greatly hinder the development of parallel libraries for C++.

What about attributes for shared data?

One "compromise" that came up during earlier discussions is to allow potentially shared data to be annotated with some kind of attribute or type qualifier, so that it can be treated specially, and a less coarse definition of "memory location" can be applied. After some thought, I no longer believe this is at all feasible, except possibly as an attribute that explicitly allows coarser accesses. But even that would have limited effectiveness.

Threads like those defined by pthreads are fundamentally based on the assumption that that there is no difference between variables accessed only by a single thread, and those protected by proper mutual exclusion. This allows abstract data types (e.g. a C++ vector or map) to be implemented in a way that can be used both for thread private and shared, but lock-protected data. A consequence of this is that shared data, e.g. the array in a C++ vector, is often allocated and accessed by code that inherently cannot know whether the data is shared. Hence there is no way to consistently supply correct attributes; the best we might be able to do is to declare certain data to not be shared in those instances, in which we happen to know this at the point of declaration.