This issue has been automatically converted from the original issue lists and some formatting may not have been preserved.
Authors: Fred Tydeman, WG14
Date: 1992-12-10
Reference document: X3J11/91-005
Submitted against: C90
Status: Closed
Cross-references: 0023
Converted from: dr.htm, dr_025.html
What is meant by “representable floating-point value?” Assume double precision, unless stated otherwise.
First, some definitions based partially upon the floating-point model in subclause 5.2.4.2.2, on pages 14-16 of the C Standard:
DBL_MIN
to DBL_MAX
, inclusive; normalized (first significand digit is non-zero), sign is +1.-DBL_MAX
to -DBL_MIN
, inclusive; normalized.+DBL_DeN
(inclusive) to +DBL_MIN
(exclusive). (Let DBL_DeN
be the symbol for the minimum positive denormal, so we can talk about it by name.)-DBL_MIN
(exclusive) to -DBL_DeN
(inclusive).On the real number line, these symbols order as:
[ 1 )[ 2 ]( 3 ]( 4 )[5]( 6 )[ 7 )[ 8 ]( 9 ]
+--------+-------+--------+------+-+------+--------+-------+--------+
-INF -DBL_MAX -DBL_MIN -DBL_Den -0 +0 +DBL_Den +DBL_MIN +DBL_MAX +INF
Non-real numbers are: SNaN, QNaN, and NaN; call this region 10.
Regions 1 and 9 are overflow, 2 and 8 are normal numbers, 3 and 7 are denormal numbers (pseudo underflow), 4 and 6 are true underflow, and 5 is zero.
So, the question is: What does “representable (double-precision) floating-point value” mean:
a. Regions 2, 5 and 8 (+/- normals and zero)
b. Regions 2, 3, 5, 7, and 8 (+/- normals, denormals, and zero)
c. Regions 2 through 8 [-DBL_MAX
... +DBL_MAX
]
d. Regions 1 through 9 [-INF ... +INF]
e. Regions 1 through 10 (reals and non-reals)
f. What the hardware can represent
g. Something else? What?
Some things to consider in your answer follow. The questions that follow are rhetorical and do not need answers.
Subclause 5.2.4.2.2 Characteristics of floating types <float.h>
, page 14,
lines 32-34:
The characteristics of floating types are defined in terms of a model that describes a representation of floating-point numbers and values that provide information about an implementation's floating-point arithmetic.
Same section, page 15, line 6:
A normalized floating-point number x ... is defined by the following model: ...
That model is just normalized numbers and zero (appears to include signed zeros). It excludes denormal and unnormal numbers, infinities, and NaNs. Are signed zeros required, or just allowed?
Subclause 6.1.3.1 Floating constants, page 26, lines 32-35: “If the scaled value is in the range of representable values (for its type) the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest value, chosen in an implementation-defined manner.”
A B y C x D E z F
-DBL_Den 0.0 +DBL_Den +DBL_MIN +DBL_MAX +INF
The representable numbers are A, B, C, D, E, and F. The number x can be
converted to B, C, or D! But what if B is zero, C is DBL_DeN
(denormal), and D
is DBL_MIN
(normalized). Is x representable? It is not in the range DBL_MIN ... DBL_MAX
and its inverse causes overflow; so those say not valid. On the
other hand, it is in the range DBL_DeN ... DBL_MAX
and it does not cause
underflow; so those say it is valid.
What if B is zero, A is -DBL_DeN
(denormal), and C is +DBL_DeN
(denormal);
is y representable? If so, its nearest value is zero, and the immediately
adjacent values include a positive and a negative number. So a user-written
positive number is allowed to end up with a negative value!
What if E is DBL_MAX
and F is infinity (on a machine that uses infinities,
IEEE-754)? Does z have a representation? If z came from 1.0/x, then z caused
overflow which says invalid. But on IEEE-754 machines, it would either be
DBL_MAX
or infinity depending upon the rounding control, so it has a
representation and is valid.
What is “nearest?” In linear or logarithmic sense? If the number is between 0
and DBL_DeN
, e.g.,
10-99999, it is linear-nearest to zero, but log-nearest to DBL_DeN
.
If the number is between DBL_MAX
and INF, e.g., 10+99999, it is
linear- and log-nearest to DBL_MAX
. Or is everything bigger than DBL_MAX
nearest to INF?
Subclause 6.2.1.3 Floating and integral, page 35, Footnote 29: “Thus, the
range of portable floating values is (-1,Utype_MAX
+1).”
Subclause 6.2.1.4 Floating types, page 35, lines 11-15: “When a double
is
demoted to float
or a long double
to double
or float
, if the value being
converted is outside the range of values that can be represented, the behavior
is undefined. If the value being converted is in the range of values that can be
represented but cannot be represented exactly, the result is either the nearest
higher or nearest lower value, chosen in an implementation-defined manner.”
Subclause 6.3 Expressions, page 38, lines 15-17: “If an exception occurs during the evaluation of an expression (that is, if the result is not mathematically defined or not in the range of representable values for its type), the behavior is undefined.”
w = 1.0 / 0.0 ; /* infinity in IEEE-754 */
x = 0.0 / 0.0 ; /* NaN in IEEE-754 */
y = +0.0 ; /* plus zero */
z = - y ; /* minus zero: Must this be -0.0? May it be +0.0? */
Are the above representable?
Subclause 7.5.1 Treatment of error conditions, page 111, lines 11-12: “The behavior of each of these functions is defined for all representable values of its input arguments.”
What about non-numbers? Are they representable? What is sin(
NaN
)
? If you
got a NaN as input, then you can return NaN as output. But, is it a domain
error? Must errno
be set to EDOM
? The NaN already indicates an error, so
setting errno
adds no more information. Assuming NaN is not part of Standard C
“representable,” but the hardware supports it, then using NaNs is an extension
of Standard C and setting errno
need not be required, but is allowed. Correct?
Subclause 7.5.1 Treatment of error conditions, on page 111, lines 20-27
says: “Similarly, a range error occurs if the result of the function cannot be
represented as a double
value. If the result overflows (the magnitude of the
result is so large that it cannot be represented in an object of the specified
type), the function returns the value of the macro HUGE_VAL
, with the same
sign (except for the tan
function) as the correct value of the function; the
value of the macro ERANGE
is stored in errno
. If the result underflows (the
magnitude of the result is so small that it cannot be represented in an object
of the specified type), the function returns zero; whether the integer
expression errno
acquires the value of the macro ERANGE
is
implementation-defined.”
What about denormal numbers? What is sin(DBL_MIN/3.0L)
? Must this be
considered underflow and therefore return zero, and maybe set errno
to
ERANGE
? Or may it return DBL_MIN/3.0
, a denormal number? Assuming denormals
are not part of Standard C “representable,” but the hardware supports it, then
using them is an extension of Standard C and setting errno
need not be
required, but is allowed. Correct?
What about infinity? What is exp(
INF
)
? If you got an INF as input, then
you can return INF as output. But, is it a range error? The output value is
representable, so that says: no error. The output value is bigger than
DBL_MAX
, so that says: an error and set errno
to ERANGE
. Assuming infinity
is not part of Standard C “representable,” but the hardware supports it, then
using INFs is an extension of Standard C and setting errno
need not be
required, but is allowed. Correct?
What about signed zeros? What is sin(-0.0)
? Must this return -0.0? May it
return -0.0? May it return +0.0? Signed zeros appear to be required in the
model in subclause 5.2.4.2.2 on page 15.
What is sqrt(-0.0)
? IEEE-754 and IEEE-854 (floating-point standards) say this
must be -0. Is -0.0 negative? Is this a domain error?
Subclause 7.9.6.1 The fprintf
function on page 132, lines 32-33 says: “(It
will begin with a sign only when a negative value is converted if this flag is
not specified.)”
What is fprintf(stdout, "%+.1f", -0.0);
? Must it be -0.0? May it be +0.0? Is
-0.0 a negative value? The model on page 15 appears to require support for
signed zeros.
What is fprintf(stdout, "%f %f", 1.0/0.0, 0.0/0.0);
? May it be the IEEE-854
strings of inf
or infinity
for the infinity and NaN
for the quiet NaN?
Would NaNQ
also be allowed for a quiet NaN? Would NaNS
be allowed for a
signaling NaN? Must the sign be printed? Signs are optional in IEEE-754 and
IEEE-854. Or, must it be some decimal notation as specified by subclause
7.9.6.1, page 133, line 19? Does the locale matter?
Subclause 7.10.1.4 The strtod
function on page 151, lines 2-3 says: “If
the subject sequence begins with a minus sign, the value resulting from the
conversion is negated.”
What is strtod("-0.0", &ptr)
? Must it be -0.0? May it be +0.0? The model on
page 15 appears to require support for signed zeros. All floating-point hardware
I know about support signed zeros at least at the load, store, and
negate/complement instruction level.
Subclause 7.10.1.4 The strtod
function on page 151, lines 12-15 say: “If
the correct value is outside the range of representable values, plus or minus
HUGE_VAL
is returned (according to the sign of the value), and the value of
the macro ERANGE
is stored in errno
. If the correct value would cause
underflow, zero is returned and the value of the macro ERANGE
is stored in
errno
.”
If HUGE_VAL
is +infinity, then is strtod("1e99999", &ptr)
outside the range
of representable values, and a range error? Or is it the “nearest” of DBL_MAX
and INF?
Comment from WG14 on 1997-09-23:
Principles for C floating-point representation:
(These principles are intended to clarify the use of some terms in the standard; they are not meant to impose additional constraints on conforming implementations.)
(Points 3 and 4 are meant to apply to representations of the same floating type, not meant for comparison between different types.)
This implies that a conforming implementation is allowed to accept a floating-point constant of any arbitrarily large or small value.