WG14/N829 J11/98-028 Previous version: WG14/N808. DR 63 (and 56): Floating-Point accuracy Proposed wording for the Response for DR 63 (and 56). The following wording (approved by the editorial review committee) should be incorporated into a future version of the standard in section 5.2.4.2.2 <float.h>, page 25, between paragraphs 3 and 4: The accuracy of the floating-point operations (+, -, *, /) and of the math library (<math.h> and <complex.h>) functions that return a floating-point result is implementation defined. The implementation shall document the accuracy, but is allowed to state that the accuracy is unknown. Rationale (section 5.2.4.2.2 <float.h>) The overflow and/or underflow thresholds may not be the same for all arithmetic operations. For example, there is at least one machine where the overflow threshold for addition is twice as big as for multiplication. Another implementation uses a pair of doubles to represent a long double. In that implementation, the next representable long double value after 1.0L is 1.0L+LDBL_MIN, yet, the difference bewteen those two numbers (LDBL_MIN) is not [note to editor: math type] b ** (1-p), otherwise known as LDBL_EPSILON. Because of anomalies like these, there are few hard requirements on the <float.h> values. But, the values in <float.h> should be in terms of the hardware representation used to store floating-point values in memory (not in terms of the effective accuracy of operations, nor in terms of registers) and should apply to all operations. The representation stored in memory may have padding bits and/or bytes that do not contribute to the value. The padding should not be included in the <float.h> values. Because of the practical difficulty involved in defining a uniform metric that applies to both real and complex types and that all vendors would be willing to follow (just computing the accuracy reliably could be a significant burden that varied depending on the required metric), and because the importance of floating point accuracy differs greatly among users, the standard allows a great deal of latitude in how an implementation documents the accuracy of the real and complex floating-point operations and functions. Here are some ways that an implementation might address the need to define the accuracy ... digits correct digits wrong maximum Units in the Last Place (ULPs) error maximum absolute error maximum relative error For complex values, some methods are ... error in terms of both real and imaginary parts error in terms of Euclidean norm, ||a + b*i|| = sqrt(a*a + b*b). There are two usages of the term ULP. One is in the context of differences between two numbers, e.g., f(x) differs from F(x) by 3 ULPs. The other is the value of the ULP of a number, e.g., an ULP of the value 1.0 is DBL_EPSILON. For this discussion, we are interested in the former; the difference between the computed value and the infinitely precise value. The error between two floating-point numbers in ULPs depends on the radix and the precision used in representing the number, but not the exponent. With a decimal radix and 3 digits of precision, the computed value 0.314e+1 differs from the value 0.31416e+1 by 0.16 ULPs. If both numbers are scaled by the same power of the radix, e.g., 0.314e+49 and 0.31416e+49, they still differ by 0.16 ULPs. When the two numbers being compared span a power of the radix, the two possible ULP error calculations differ by a factor of the radix. For a decimal radix and 3-digits of precision, consider the two values 9.99e2 and 1.01e3. These are the two values adjacent to the value 1.00e3, a power of the radix, in this number system. If 999 is the correct value and 1010 is the computed value, the error is 11 ulps. But, if 1.01e3 is the correct value and 0.999e3 is the computed value, then the error is 1.1 ulps. Some math functions, such as those that do argument reduction modulo an approximation of pi, have good accuracy for small arguments, but poor accuracy for large arguments. It is not unusual for an implementation of the trig functions to have zero bits correct in the computed result for large arguments. For cases like this, an implementation might break the domain of the function into disjoint regions and specify the accuracy in each region. If an implementation documents worst case error, there is no requirement that it be the minimum worst case error. That is, if a vendor believes that the worst case error for a function is around 5 ULPs, they could document it as 7 ULPs to be on the safe side. The committee could not agree on upper limits on accuracy that all conforming imlementations must meet, e.g., addition is no worse than 2 ULPs for all implementations. It is a quality of implementation issue. Implementations that conform to IEC-559 have half ULP accuracy in round to nearest mode, and 1-ULP accuracy in the other three rounding modes, for the basic arithmetic operations and sqrt. For other floating-point arithmetics, it is a rare implementation that has worse than 1-ULP accuracy for the basic arithmetic operations. The accuracy of binary-decimal converions and format conversions are discussed elsewhere in the standard. For the math library functions, currently, fast correctly rounded 0.5 ULP accuracy is still a research problem. Some implementations provide two math libraries: one being faster but less accurate than the other. The C9X committee discussed the idea of allowing the programmer to find out the accuracy of floating-point operations and math functions during compilation (say via macro symbols) or during execution (via a function call), but neither got enough support (even though this is what some users want) to warrant the change to the standard. The use of macro symbols would require over one hundred macro symbols to name every math function (such as ULP_SINF, ULP_SIN, and ULP_SIND just for the real valued sin function). One possible function implementation might be a function that takes the name of the operation or math function as a string, e.g., ulp_err( "sin" ) that would return a double such as 3.5 to indicate the worst case error or the value -1.0 to indicate unknown error. But such a "simple" scheme would likely be of very limited use given that so many functions have accuracies that differ significantly across their domains. Constrained to worst case error across the entire domain, most implementations would wind up reporting either unknown or else a uselessly large value for a very large percentage of functions - useless because most programs that care about accuracy are written in the first place to try to compensate for accuracy problems that typically arise when pushing domain boundaries. And implementing something more useful, like the worst case error for a user-specified partition of the domain, would be a nightmare.