N2487: short float

Submitter: Philipp Klaus Krause
Submission Date: 2020-02-20

Summary:

short float.

This proposal follows up on N2016 and WG21 P0192R4. It is also motivated by the desire for more efficient floating-point types on small systems that lack hardware support for floating-point arithmetic.

Justification:

Floating-point formats that do not meet the requirements for float have become important. Examples include the IEEE 16-bit format (1 sign bit, 5 exponent bits, 10 mantissa bits; in OpenGL since OpenGL 3.0 and supported by virtually all GPUs for years), bfloat16 (1 sign bit, 8 exponent bits, 7 mantissa bits; supported e.g. in some Xeons, ARMv8.6-A, AMD GPUs), and the 16-, 20- or 24-bit formats of some older hardware. Both N2016 and WG21 P0192R4 give a more detailed list of hardware support and use cases for floating-point types that do not fulfill the requirements for float. GCC supports a 16-bit floating-point type __fp16 on ARM hardware.
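The two 16-bit layouts described above can be illustrated by widening their bit patterns to float. The following is a minimal sketch, not part of the proposal: the function names are hypothetical, and it assumes that float is the 32-bit IEEE format and that the 16-bit value is passed as a raw bit pattern in a uint16_t.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption of this sketch: float is the 32-bit IEEE format. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Reinterpret a 32-bit pattern as a float without aliasing problems. */
static float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* IEEE 16-bit format: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits. */
float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;

    if (exp == 0x1Fu)                       /* infinity or NaN */
        return float_from_bits(sign | 0x7F800000u | (man << 13));
    if (exp != 0)                           /* normal: rebias from 15 to 127 */
        return float_from_bits(sign | ((exp + 112u) << 23) | (man << 13));
    if (man != 0) {                         /* subnormal: normalize first */
        uint32_t shift = 0;
        while ((man & 0x400u) == 0) {
            man <<= 1;
            shift++;
        }
        man &= 0x3FFu;                      /* drop the now-implicit leading 1 */
        return float_from_bits(sign | ((113u - shift) << 23) | (man << 13));
    }
    return float_from_bits(sign);           /* positive or negative zero */
}

/* bfloat16: 1 sign bit, 8 exponent bits, 7 mantissa bits.
   It is the upper half of a 32-bit IEEE float. */
float bfloat16_bits_to_float(uint16_t b)
{
    return float_from_bits((uint32_t)b << 16);
}
```

For example, half_bits_to_float(0x3C00) and bfloat16_bits_to_float(0x3F80) both yield 1.0f. Under the stated assumption, both widenings are exact, since every value of either 16-bit format is also a float value.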

The IEEE format, earlier variants of it that do not support denormalized numbers, and the bfloat16 format have found widespread use, so each should be a valid implementation of short float.

Even where there is no hardware support for floating-point calculations, a short float could be implemented as a more efficient alternative to float.
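As a rough illustration of why a software implementation can be cheaper than float, consider multiplication in the bfloat16 layout: the significands are only 8 bits wide, so a single 8x8-bit multiply suffices where float needs a 24x24-bit one. The sketch below is hypothetical and deliberately simplified. The name bf16_mul is an assumption, operands are raw bit patterns, the result is truncated rather than rounded, zero and subnormal operands are treated as zero, and infinity and NaN operands are not handled.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified software multiply on bfloat16 bit patterns
   (1 sign bit, 8 exponent bits with bias 127, 7 mantissa bits).
   Not a conforming implementation: truncates, flushes small results
   to zero, and ignores infinity and NaN operands. */
uint16_t bf16_mul(uint16_t a, uint16_t b)
{
    uint16_t sign = (uint16_t)((a ^ b) & 0x8000u);
    int ea = (a >> 7) & 0xFF;
    int eb = (b >> 7) & 0xFF;

    if (ea == 0 || eb == 0)                 /* zero or subnormal operand */
        return sign;

    /* 8-bit significands with the implicit leading 1 made explicit. */
    unsigned ma = (a & 0x7Fu) | 0x80u;
    unsigned mb = (b & 0x7Fu) | 0x80u;
    unsigned prod = ma * mb;                /* fits in 16 bits */
    assert(prod >= 0x4000u);                /* both significands are >= 0x80 */

    int e = ea + eb - 127;                  /* remove one copy of the bias */
    if (prod & 0x8000u) {                   /* product in [2, 4): renormalize */
        prod >>= 8;
        e++;
    } else {                                /* product in [1, 2) */
        prod >>= 7;
    }

    if (e <= 0)                             /* underflow: flush to zero */
        return sign;
    if (e >= 255)                           /* overflow: infinity */
        return (uint16_t)(sign | 0x7F80u);

    return (uint16_t)(sign | ((unsigned)e << 7) | (prod & 0x7Fu));
}
```

For example, bf16_mul(0x3FC0, 0x3FC0) multiplies 1.5 by 1.5 and returns 0x4010, the bit pattern of 2.25.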

In London in 2016, WG14 viewed the short float proposal in N2016 favorably and asked the authors for a more detailed proposal, which has not appeared so far.

This is a minimal proposal to add short float. In particular, library support is left to a future proposal.

Sign bit?

Some of the use cases in N2016 and WG21 P0192R4 concern float formats that do not have a sign bit (OpenGL GL_R11F_G11F_B10F and GL_RGB9_E5). However, most formats and uses involve a sign bit. Mandating a sign bit seems like a small restriction that would make short float more useful to users.

Do we want short float to have a sign bit?

8-bit implementation?

There are use cases for tiny floats, such as the format with 1 sign bit, 3 exponent bits and 4 mantissa bits used in the G.711 telephony standard. On the other hand, allowing such tiny implementations of short float would make short float less useful to many other users.

Do we want the requirements on short float to be loose enough to allow implementation in 8 bits?

Proposed changes for short float (vs. the standard draft N2310), assuming both questions above are answered "no":

Do we want short float as below?

§5.2.4.2.2p10: Replace "evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;" by "evaluate operations and constants of type short float, float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;".

§5.2.4.2.2p12: Insert "SFLT_MANT_DIG" in the obvious place. Insert "SFLT_DECIMAL_DIG 3" in the obvious place. Insert "SFLT_DIG 2" in the obvious place. Insert "SFLT_MIN_10_EXP -4" in the obvious place. Insert "SFLT_MAX_10_EXP 4" in the obvious place. Insert "SFLT_MAX 1E+4" in the obvious place. Insert "SFLT_EPSILON 1E-2" in the obvious place. Insert "SFLT_MIN 1E-4" in the obvious place. Insert "SFLT_TRUE_MIN 1E-4" in the obvious place.
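The proposed minima can be checked against candidate formats using the floating-point model of §5.2.4.2.2 (radix 2, p significand digits, exponent range emin to emax). The following sketch is not part of the proposal. All names in it are hypothetical, the TINY8 parameters are an assumed IEEE-style 8-bit layout chosen only for illustration, and log10(2) is approximated as 0.30103, which is adequate only for small p.

```c
#include <assert.h>
#include <stdbool.h>

/* Radix-2 model in the style of C 5.2.4.2.2:
   p significand digits, exponents emin..emax, significand in [1/2, 1). */
struct fp_model { int p; int emin; int emax; };

static const struct fp_model BINARY16 = { 11, -13, 16 };   /* IEEE 16-bit */
static const struct fp_model BFLOAT16 = { 8, -125, 128 };
static const struct fp_model TINY8    = { 5, -1, 4 };      /* assumed 8-bit layout */

/* 2 to the power e, computed exactly for the small exponents used here. */
static double pow2(int e)
{
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* floor((p - 1) * log10(2)) */
int model_dig(struct fp_model m)
{
    assert(m.p >= 2);
    return (int)(((long)(m.p - 1) * 30103L) / 100000L);
}

/* ceil(1 + p * log10(2)) */
int model_decimal_dig(struct fp_model m)
{
    return 1 + (int)(((long)m.p * 30103L + 99999L) / 100000L);
}

double model_epsilon(struct fp_model m) { return pow2(1 - m.p); }
double model_min(struct fp_model m)     { return pow2(m.emin - 1); }
double model_max(struct fp_model m)     { return (1.0 - pow2(-m.p)) * pow2(m.emax); }

/* True if the model satisfies the minima proposed for the SFLT_ macros. */
bool meets_proposed_minima(struct fp_model m)
{
    return model_decimal_dig(m) >= 3
        && model_dig(m) >= 2
        && model_max(m) >= 1e4       /* covers SFLT_MAX and SFLT_MAX_10_EXP */
        && model_epsilon(m) <= 1e-2
        && model_min(m) <= 1e-4;     /* covers SFLT_MIN and SFLT_MIN_10_EXP */
}
```

Under these assumptions, BINARY16 and BFLOAT16 both satisfy the proposed minima, while TINY8 fails on precision and on range.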

§6.2.5p10: Replace "There are three real floating types, designated as float, double, and long double. The set of values of the type float is a subset of the set of values of the type double;" by "There are four real floating types, designated as short float, float, double, and long double. The set of values of the type short float is a subset of the set of values of the type float; the set of values of the type float is a subset of the set of values of the type double;".

§6.3.1.8p1: Insert "Otherwise, if the corresponding real type of either operand is short float, the other operand is converted, without change of type domain, to a type whose corresponding real type is short float." in the obvious place.

§6.7.2p2: Insert "short float" in the obvious place.

A suffix?

We currently do not have a suffix for integer constants of type short.

Should there be a suffix for floating-point constants of type short float? If yes:

§6.4.4.2p1: Replace "floating-suffix: one of f l F L" by "floating-suffix: one of sf f l SF F L".

§6.4.4.2p4: Replace "If suffixed by the letter f or F, it has type float. If suffixed by the letter l or L, it has type long double." by "If suffixed by sf or SF, it has type short float. If suffixed by the letter f or F, it has type float. If suffixed by the letter l or L, it has type long double.".