Audience: SG6
S. Davis Herring <herring@lanl.gov>
Los Alamos National Laboratory
February 15, 2025
Since r0:
+ FLT_EVAL_METHOD explanation
The standard says very little about the actual results of floating-point evaluations. [basic.fundamental]/12 says the “accuracy of operations” is implementation-defined; [expr.pre]/6 makes the situation even less clear by suggesting that floating-point operands and results are somehow not even values of their types. Indeed, it is of practical value that implementations often interpret expressions involving floating-point types as mathematical expressions in order to improve the performance and accuracy of the computation of their overall results. Common techniques include fusing multiplications and additions, discarding canceling terms, and temporarily using extra precision (e.g., using x87 registers). Strict application of well-defined floating-point operations is of course critical to other numerical algorithms; the footnote in /6 suggests that “The cast and assignment operators” may be used for the purpose.
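For concreteness, here is a minimal sketch of the kinds of transformations in question (the function names are illustrative only):

#include <cmath>

// Might be computed with a single fused multiply-add, as if by
// std::fma(a, b, c), avoiding the intermediate rounding of a * b.
double mac(double a, double b, double c) { return a * b + c; }

// Might be simplified algebraically to just x, discarding the rounding
// (or even overflow) that computing t + x and then subtracting t would incur.
double cancel(double x, double t) { return t + x - t; }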
These ideas are derived from C, which additionally defines the FLT_EVAL_METHOD macro to describe the implementation’s choices about such transformations. Matthias Kretz presented Floating-Point Excess Precision to SG6 and EWG seeking guidance on how to most consistently interpret these ideas (in particular about floating-point literals, the subject of CWG2752) in the context of C++’s stronger type system, constant evaluation, and the larger set of contemporary floating-point types. No clear direction has yet been reached, suggesting that further research may be needed.
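For reference, a minimal program using the standard <cfloat> macro can inspect the implementation’s choice:

#include <cfloat>
#include <cstdio>

int main() {
  // Per C: 0 means evaluate in the range and precision of the operand types;
  // 1 means evaluate float and double as double; 2 means evaluate all three
  // standard types as long double; -1 means indeterminable.
  std::printf("FLT_EVAL_METHOD == %d\n", FLT_EVAL_METHOD);
}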
This paper proposes a model for the use of extra precision (when available) for floating-point calculations that is more appropriate for C++ than FLT_EVAL_METHOD and resolves CWG2752. It does not, however, suggest mandating any particular arithmetic specification like IEC 60559.
The idea that an operator result of a type does not have one of the values of that type is obviously problematic from the perspective of defining semantics for such an operator. Moreover, the idea that assigning a variable forces extended precision to be discarded is problematic in C++ because of the routine use of wrapper class types in mathematical expressions. The creation of every such object involves the initialization of its member variable, which seems to be just as strong as assignment in terms of incompatibility with extralinguistic extended precision.
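A minimal sketch of such a wrapper (the type and its operations are illustrative only):

// Every operation initializes the member m_v of a new temporary; under the
// footnote's reading, that initialization would discard any excess precision
// at each step of a compound expression.
struct Quantity {
  double m_v;
  friend Quantity operator*(Quantity a, Quantity b) { return {a.m_v * b.m_v}; }
  friend Quantity operator+(Quantity a, Quantity b) { return {a.m_v + b.m_v}; }
};

Quantity mac(Quantity a, Quantity b, Quantity c) {
  return a * b + c; // an FMA is just as desirable here as for plain double
}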
An alternative approach is to extend the set of values for a floating-point type beyond those that can even theoretically be stored in the memory that an object of that type occupies. The result of the subexpression a * b in a * b + c (all doubles) might then have a value outside the set of values that can be stored in a double’s space in memory (typically the binary64 set); the choice of that value conveys the additional information needed to obtain the correctly rounded result of the overall expression, as of course implemented by an FMA instruction. Similar careful choices of value from a larger set might capture the bits (or finiteness) lost in a + b - b; for x87 implementations, the larger set is simply the values supported by the register format. FLT_EVAL_METHOD==1 corresponds to using the normal set of doubles as the extended set for float.
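The observable effect of such an extended intermediate value can be demonstrated with std::fma, which computes the correctly rounded a * b + c directly. In this sketch (assuming round-to-nearest and a correctly rounded std::fma), the inputs are chosen so that the mathematical product is 1 - 0x1p-54, which rounds to exactly 1 in binary64:

#include <cassert>
#include <cmath>

void demo() {
  const double a = 1 + 0x1p-27, b = 1 - 0x1p-27, c = -1;
  const double separate = a * b + c;      // 0.0 if a * b is rounded to
                                          // binary64 before the addition
  const double fused = std::fma(a, b, c); // exactly -0x1p-54
  assert(fused == -0x1p-54);
  (void)separate; // may equal either result, since contraction is permitted
}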
The crucial specification technology is the same as used for pointer provenance: the values are put in a many-to-one correspondence with the value representations of the type. (The presence of multiple rounding modes might require the formal duplication of values based on the representable value to which they round, but this matters only if the value representation is examined.) Note that every operation and object still has a single value. Aside from merely being tractable semantics, the stability of values prevents unfortunate practical results like taking both branches in
const double x = /* ... */, y = x + epsilon / 4;
if(x < y) { // x87 comparison
  // ...
}
// further operations that cause spilling...
if(x == y) { // binary64 comparison
  // ...
}
or failing the assert in
float id(float f) {return f;} // no computations

void call() {
  const float x = /* ... */;
  assert(3 * x == 3 * id(x));
}
Note the implication that if id is not inlined, x must be given a representable value for consistency. When accuracy is more important than consistency, the unary + may be used to perform a trivial calculation to be (re)rounded:
void sink(double); // not inlined

double tee(double x, double y, double a) {
  const double m = x * y;
  sink(+m);
  return m + a;
}
Because sink need not be provided the exact value of m, an FMA may still be used for the return value. An unusually aggressive implementation might exploit the unspecified accuracy of even the + operation to perform the same optimization if the + appeared instead in the return statement, but that would not be consistent with advertising arithmetic conforming to (say) IEC 60559.
For obvious practical reasons, a value that escapes past an optimization frontier cannot actually store information beyond its bit pattern. The stability requirement implies that any such value must be normalized to its “memory type” upon computation. However, even the member variables of wrapper objects can have an extended value if the optimizer sees their entire lifetime (as is typical for temporaries in the sorts of expressions we want to optimize freely), because the extended values truly are values of the type. Similarly, assignment does not need the normalization effect described in the [expr.pre]/6 footnote; even a value updated in a loop may remain extended so long as its intermediate values do not escape. Values passed to and returned from standard-library mathematical functions can also be extended.
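As a sketch, an accumulation whose intermediate values never escape can therefore stay extended throughout (dot is an illustrative name):

double dot(const double* x, const double* y, int n) {
  double sum = 0;       // intermediate values of sum never escape...
  for (int i = 0; i < n; ++i)
    sum += x[i] * y[i]; // ...so each step may be a fused multiply-add,
                        // or accumulate in an extended register
  return sum;           // normalized to binary64 here if dot is not inlined
}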
Even literals are subject to the same rules: those that appear as operands or that are used only locally can preserve extra digits with any available extra precision, while those that “escape” must be represented in the ordinary fashion for their type for consistency. One means of such escape is a constant variable that might be used in multiple translation units (because it is inline, appears in an importable module unit, or is used in a template argument), unless whole-program optimization can arrange for some particular, more accurate value to be used in all cases.
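A sketch of the distinction (whether extra digits are actually retained remains implementation-dependent):

// Used only locally as an operand: the literal may carry digits beyond
// binary64 precision if an extended evaluation format is available.
double circumference(double r) { return 2 * 3.14159265358979323846 * r; }

// Escapes (usable from multiple translation units): it must have the same,
// ordinary double value everywhere for consistency.
inline constexpr double pi = 3.14159265358979323846;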
As there are no opaque functions (merely insufficiently clever optimizers), it is only prudent to retain an explicit means of requiring normalization; static_cast is the obvious candidate (the other part of the footnote), although std::memcpy would also have the effect of selecting the canonical value associated with a value representation. (For pointers, std::memcpy needs to be able to preserve the abstract-machine information to prevent undefined behavior, but here it would be unnecessary and difficult to specify since it does affect observable behavior.)
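A sketch of both normalization mechanisms:

#include <cstring>

double via_cast(double m) {
  return static_cast<double>(m); // rounds m to its canonical binary64 value
}

double via_memcpy(double m) {
  double out;
  std::memcpy(&out, &m, sizeof m); // copying the value representation selects
  return out;                      // the canonical value corresponding to it
}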
While it would be a conforming implementation of the existing standard and of this proposal to spill all values immediately and to never use fused multiply-adds, generating performant code with this proposal requires the implementation to track the usage of existing values, rather than merely rounding (and avoiding FMA contraction) when it encounters casts and assignments. There is an analogy to escape analysis of pointers: when a use of a value requires that it be representable in the in-memory format (perhaps because it crosses beyond the current optimization frontier), the implementation must arrange for that value to be appropriately rounded when it is computed, or for it to be excluded from FMA contraction. (In a sufficiently long function, it might be difficult to maintain that information about all values; it is permissible to conservatively apply the treatment for values that escape to any that have not been proven not to escape.)
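For example (observe is an illustrative opaque function, by analogy with sink above):

void observe(const double*); // not inlined: the pointee's value escapes

double spill(double x, double y, double a) {
  double m = x * y; // &m escapes, so m must receive its rounded binary64
                    // value when computed, and the multiplication must be
                    // excluded from contraction into the final addition
  observe(&m);
  return m + a;
}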
In the SG6 discussion of this paper in Hagenberg, it was pointed out that some existing implementations do not maintain in their intermediate representations adequate information to identify when a value is spilled from an x87 register (and thus rounded to its memory representation) or involved in an FMA contraction. However, it should be noted that existing implementations do not conform to the current wording (which is both inadequate and questionably normative) by default, even with options such as -std=c++… -pedantic; in that regard, this proposal can be said to improve the alignment with existing implementations (in the absence of special options like -ffp-contract=on, which despite its name actually reduces the use of FMA) in that assignment (or initialization) is no longer considered to necessitate the loss of any excess precision.
Modify the definition of trivially-copyable types to allow floating-point types to have multiple values (one of which is canonical) per value representation. Specify that acquiring a value representation gives a floating-point object the corresponding canonical value.
Replace the “greater precision and range” provision ([expr.pre]/6) with a note about the contextual dependence of rounding. Specify that unary + can, and that static_cast does, round floating-point values to their corresponding canonical values.