WG14 Document: N938
BDTI Proposal for FixedPoint Arithmetic in Embedded C
John R. Hauser
Berkeley Design Technology, Inc.
2001 March 16
BDTI has been writing software for and evaluating the performance of DSPs
(digital signal processors) for almost a decade, and we are naturally
interested in WG14's effort to standardize C extensions for fixedpoint
arithmetic as part of its "Embedded C" work item. To promote discussion on
this topic and help foster the best possible standard, we would like to put
forth a proposal for the fixedpoint types and arithmetic that differs in
some respects from the working draft WDTR18037. This document consists of
some preliminary comments followed by an outline of the specification we
would prefer.

Preliminary Commentary
 As was discussed in Toronto, BDTI would like the syntax to be such that
the embedded C features can all be implemented directly in C++ without
extra compiler support. Of course, for actual application development,
a C or C++ compiler would need to treat the new features specially or
they won't be efficient. But making the syntax compatible with existing
C++ has two positive effects: (1) It makes the C++ people happier by
allowing them to be compatible without having to extend C++ and change
all of their compilers. (2) Equally important, it gives everyone with
access to a C++ compiler a convenient prototyping platform, even if their
C compilers don't support the embedded C features.
 We approve of the basic plan to have separate "typeA" and "typeB"
fixedpoint types, with three nominal sizes of each ("short", normal, and
"long"). As defined by Willem Wakker, typeA fixedpoint formats have
no integer bits and support a range of 1 to 1, while typeB formats have
both integer and fractional bits.
 It is BDTI's view that unsigned fixedpoint is not worth pursuing at this
time. While unsigned fixedpoint types can sometimes be useful, they
really only buy one extra bit of precision over the signed types. To
take the typeA case (and echoing Willem), the corresponding unsigned
fixedpoint types would either have the same number of fraction bits and
one integer bit to cover the range 0 to 2, or they would have an extra
fraction bit and cover the range 0 to 1. The first style would be more
consistent with the correspondence between signed and unsigned integers;
however, the range of 0 to 2 has no obvious justification other than to
maintain alignment with the signed fixedpoint format.
Some DSPs don't fully support an unsigned fractional type in hardware.
For those that do, the main benefit is to permit multiword operations to
be built from processor primitives more conveniently. For such purposes,
the location of the binary point in individual operations becomes rather
murky. The C language already has unsigned integers that can be used for
constructing multiword operations about as effectively.
Because one bit isn't much, it's our experience that true unsigned
fixedpoint formats rarely come up. Note that the choice not to require
unsigned fixedpoint types has precedent in the fact that there are no
unsigned _floatingpoint_ types in C, either. We advocate postponing the
standardization of unsigned fixedpoint (and thus coercing all Embedded C
implementations to provide it) until there is a clearer need for it.
 As was briefly touched on at the Toronto meeting, a saturation attribute
is most conveniently attached to fixedpoint types, even though
saturation would ideally be an attribute of operations, not types.
There's no convenient syntax we know of for attaching a "sat" keyword to
operators like "+" and "=".
 As with unsigned fixedpoint, we don't see enough need for saturating
_integer_ arithmetic to justify standardizing it at this time. We were
unable to think of a single instance where support for this feature would
significantly benefit one of our applications.
 Besides saturation, some DSP algorithms require modular wraparound on
overflow (like the wraparound that C requires for unsigned integers).
We propose that there be an attribute for this that can be attached to
types in the same manner as the saturation attribute.
============================================================================
The following outlines our proposal for fixedpoint extensions for C. BDTI
has developed a prototype implementation of this system in C++ and has been
building experience using it to express DSP calculations.
Additional comments in the text below are enclosed in boxes.

FixedPoint Types
Eighteen new fixedpoint type names are defined:
short_fract short_fract_sat short_fract_mod
fract fract_sat fract_mod
long_fract long_fract_sat long_fract_mod
short_accum short_accum_sat short_accum_mod
accum accum_sat accum_mod
long_accum long_accum_sat long_accum_mod
All of these fixedpoint types are signed; no unsigned fixedpoint types are
defined. The "_sat" and "_mod" types are identical to their respective base
types except for the treatment of overflows as explained later.
The "fract" types by definition have no integer bits. The binary point in
the "fract" types is always just to the left of the most significant non
sign bit, so that "fract" types support values in the range of 1 to 1.
(The endpoints 1 and 1 are not necessarily representable exactly by a
"fract" type.) The minimum number of bits for each "fract" type (including
the sign bit) is as follows:
short_fract: 8 bits
fract: 16 bits
long_fract: 32 bits
The "accum" types have integer bits as well as fractional bits. Each
"accum" type (short, normal, and long) must have exactly the same number of
fractional bits as its corresponding "fract" type, plus a minimum of four
integer bits. (It is not required that the three "accum" types all have the
same number of integer bits.)
+
 Willem and others have proposed names such as "short accum", together
 with a separate "sat" keyword. Although this would be more consistent
 with existing C, names requiring more than one token have always been
 problematic because they can't be defined either through "typedef"s
 or macros in C or with classes in C++. Without the ability to declare
 the proper type names, a working implementation can't be created in C++.
 This is important because we expect most C compilers _not_ to support
 the embedded extensions, so a C++ implementation will be the only one
 available to many people.
+

Overflow and Rounding
Conversion of a real arithmetic value to a fixedpoint type may overflow
and/or may require rounding. When the source value does not fit within
the range of the fixedpoint type, the conversion overflows. Two different
behaviors are defined for overflow:
 Saturation: If the fixedpoint type is a "_sat" type, the source value
is replaced by the closest available fixedpoint value (which will be
either the maximal negative or maximal positive value of the fixedpoint
type).
 Modular wraparound: If the fixedpoint type is a "_mod" type, the
source value is replaced by a value in the range of (2^I) to 2^I that is
congruent (in the mathematical sense) to the source value modulo 2^(I+1),
where I is the number of integer bits in the fixedpoint format. For
example, for "fract" types, the source value is replaced by a value in
the range 1 to 1 that is congruent to the source value modulo 2. (This
has the effect of discarding all the bits above the most significant bit
of the fixedpoint format. This is exactly the same sort of overflow
behavior required of unsigned integers.)
Otherwise, if the fixedpoint type is not a "_sat" or "_mod" type, overflow
is undefined (and may legitimately terminate the program, for example).
If, after overflow handling, the source value cannot be represented exactly
by the fixedpoint type, the source value is rounded to either the closest
fixedpoint value greater than the source value (rounded up) or to the
closest fixedpoint value less than the source value (rounded down).
Whether rounding is up or down is implementationdefined and may differ for
different values and different situations.

Type Conversions
All conversions between a fixedpoint type and another arithmetic type
(which can be another fixedpoint type) are defined. Overflow and
rounding are handled according to the usual rules for the destination type.
Conversions from a fixedpoint to an integer type round toward zero, just as
conversions from floatingpoint to integer do. The rounding of conversions
from fixedpoint to floatingpoint is unspecified.

Operations
The usual operators for basic arithmetic and for comparisons (add, subtract,
multiply, divide, equal, less than, etc.) are defined to work with fixed
point values. In addition, fixedpoint shifts left and right are defined
to be equivalent to multiplications and divisions by a power of two. The
"roundfx" operator is provided for explicitly rounding a fixedpoint value
to a number of fractional bits. Finally, a "bits" operator allows the bits
of a fixedpoint value to be accessed and manipulated as an integer, for
situations not covered by the other operations. All of these operations are
detailed in the subsections below.
                                     
Basic arithmetic
The standard binary arithmetic operators +, , *, and / accept fixedpoint
operands as follows:
 If both operands are fixedpoint, the two operand types must be
"comparable" according to the following partial order (or lattice):
short_fract <= fract <= long_fract
short_accum <= accum <= long_accum
short_fract <= short_accum
fract <= accum
long_fract <= long_accum
(This means that one of the types must be guaranteed to have at least
as many integer bits and as many fractional bits as the other type, and
thus able to hold all values of the other type.) The "_sat" or "_mod"
attributes do not affect whether two fixedpoint types are comparable.
It is a constraint error for the two fixedpoint types not to be
comparable according to this partial order. The type of the result is
the larger type, with the adoption of any "_sat" or "_mod" attribute
from the other operand. For example, if the operands of an addition have
types "long_accum" and "fract_sat", the result type of the operation is
"long_accum_sat". It is a constraint error for one fixedpoint operand
to be a "_sat" type and the other a "_mod" type.
+
 If one of the fixedpoint types isn't strictly "bigger" than the other,
 there's no certain way to choose the result type. Rather than invent
 an arbitrary rule, we've simply prohibited the questionable cases.
 Programmers can achieve what they want unambiguously through the use of
 type casts.
+
 If one operand is fixedpoint and the other an integer, the result type
is that of the fixedpoint operand.
 It is a constraint error for one operand to be fixedpoint and the other
floatingpoint.
+
 The prohibition against mixing fixedpoint and floatingpoint may come as
 a surprise, but it serves an important purpose. Imagine a program written
 for an embedded processor with no floatingpoint hardware. Fixedpoint
 types are used in the program for certain noninteger calculations.
 Assume that some numeric constants appear in the code, e.g.,
 "a * 0.333333", where the variable "a" has fixedpoint type. Now, if
 fixedpoint and floatingpoint operands were allowed to mix, this would
 (presumably) count as a floatingpoint multiplication, which might take
 hundreds of clock cycles to emulate instead of the two or threecycle
 fixedpoint multiplication the programmer intended. The problem is that
 obvious constants like 0.333333 have floatingpoint type in C, so any
 inadvertent use of them as operands will drag in floatingpoint emulation
 software and suck up clock cycles. From the perspective of the embedded
 software developer, these situations are bugs that have to be found and
 eliminated.

 It's a safe bet that embedded software that uses fixedpoint arithmetic
 is going to be run on machines without efficient floatingpoint. Having
 fixedpoint values automatically promote to floatingpoint is not a
 convenience; it's an annoyance working against the programmer. Rather
 than think of fixedpoint as being between integers and floatingpoint,
 it's more helpful to think of it as an _alternative_ to floatingpoint.
 With that view, an operation with both fixedpoint and floatingpoint
 operands has no preferred result type. Of course, programmers can easily
 use explicit type casts and assignments to convert between the two forms
 when needed.
+
In all cases above, a fixedpoint arithmetic operation is done exactly
(according to its mathematical definition), and then overflow handling
and rounding is performed for the result type as explained in the earlier
section _Overflow_and_Rounding_. An integer operand is _not_ first
converted to fixedpoint before the operation is performed.
+
 The proposal not to promote integers to fixedpoint to balance the
 operands is clearly a departure from the way C is normally defined
 and, in particular, the way the same operations work when integer and
 floatingpoint operands are mixed. At BDTI, we've struggled with this
 inconsistency, but we keep running into a fundamental problem: integer
 values often can't be promoted honestly to fixedpoint types. None of
 the "fract" types have _any_ integer bits, and some DSPs will have as
 few as four integer bits in their "accum" types. On such a machine, it's
 impossible to promote an integer bigger than 8 to any fixedpoint type,
 which leaves only a limited range of integers to work with. Consider, for
 example, how you divide a fixedpoint value by an integer variable whose
 value could be as large as, say, 15.

 The floatingpoint types have the advantage that (on all machines I know)
 the _range_ of all the integers fits within even the smallest floating
 point type, so promoting an integer to floatingpoint at worst suffers
 a rounding error (and often not even that). This is absolutely not the
 case for the fixedpoint types. On the other hand, unlike floatingpoint,
 from a hardware perspective, fixedpoint and integer operations are nearly
 the same thing. Thus, it's no trouble for the hardware to mix integer
 and fixedpoint operands and perform the calculation directly, as we've
 proposed. Since it seems the only argument against this approach is the
 wrinkle it incurs on the C language, we think the balance of the argument
 favors defining mixed fixedpoint and integer operations as working
 directly without the integer argument being promoted to fixedpoint first.
+
Division by zero is undefined.
The standard assignment operators +=, =, *=, and /= are defined in the
usual way when either operand is fixedpoint. Note, in particular, that,
given the declarations
fract_sat a;
fract_mod b;
the expression "a += b" is a constraint violation for the same reason that
"a + b" is.
The pre and postfix ++ and  operators have their usual meaning of adding
or subtracting the integer 1 to/from the operand and returning the value
before or after the addition as the result.
The unary plus (+) and negation () operations are defined for fixedpoint
as well. The result type is the same as that of the operand. The negation
operation is equivalent to subtracting the operand from the integer zero.
                                     
Comparisons
The standard comparison operators (<, <=, ==, >=, >, and !=) accept fixed
point operands. For consistency, the rules for mixing operands of different
types are the same as for the basic arithmetic operations, except that it
is permitted to compare a fixedpoint value of "_sat" type with a fixed
point value of "_mod" type. (Note this means a fixedpoint value cannot be
compared to a floatingpoint value without a type cast.) Fixedpoint and
integer values are compared directly; the integer operand is not first
converted to fixedpoint before the comparison is made.
                                     
Shifts left and right
Shifts of fixedpoint values using the standard << and >> operators are
defined to be equivalent to multiplication or division by a power of two.
The right operand is converted to type "int" and must be nonnegative and
less than the total number of bits of the fixedpoint operand (the left
operand). The result type is the same as that of the fixedpoint operand.
An exact result is calculated and then converted to the result type in the
same way as the other fixedpoint arithmetic operations.
The standard assignment operators <<= and >>= are defined in the usual way
when the left operand is fixedpoint.
+
 Shifts of fixedpoint values are common, particularly for block floating
 point, and need to be supported.
+
                                     
roundfx
The "roundfx" operator rounds a fixedpoint operand to a specified number of
fractional bits. The operator has a functionlike syntax:
roundfx ( , )
The result type is the type of the first operand, which must be fixedpoint.
The second operand is converted to type "int" and must be nonnegative and
less than the number of fractional bits in the fixedpoint type. The fixed
point value is rounded to the specified number of fractional bits in an
implementationdefined way, and this rounded value is returned as the
result. Fractional bits beyond the rounding point are set to zero in the
result.
                                     
Accessing the bits of the fixedpoint type
The "bits" operator provides access to the bits of a fixedpoint value as an
integer. The operator has a functionlike syntax:
bits ( )
The single operand must be a fixedpoint type. When used as an expression,
the "bits" operator returns an integer equal to the fixedpoint value
multiplied by 2^F, where F is the number of fractional bits in the fixed
point type. The result type is a signed integer type big enough to hold all
valid result values for the given operand fixedpoint type. For example, if
"fract" is 16 bits, then after the declaration
fract a = 0.5;
the value of "bits(a)" is 0.5 * 2^16 = 0x4000.
If the operand to "bits" is an lvalue, the same construct is also an lvalue
that can be used as the destination of an assignment. Continuing the
example above, the assignment
bits(a) = 0x2000;
sets "a" to the fixedpoint value 0.25. The assignment causes the value on
the right hand side to be converted to an integer. If this integer value is
too large for the fixedpoint type of the "bits" operand, overflow handling
occurs in the usual manner according to the fixedpoint type, as defined
earlier in the section _Overflow_and_Rounding_.
The "bits" construct cannot have its address taken (by the "&" operator).
+
 It is often necessary to access a fixedpoint value as an integer in this
 way, and the other arithmetic operations don't provide as convenient a
 means to do this.
+

FixedPoint Constant Expressions
Fixedpoint constant expressions whose type is neither a "_sat" nor a "_mod"
type are evaulated by the compiler with saturation on overflow.
+
 Fixedpoint constant expressions can be constructed using type casts, as
 in "(fract) 1.0".
+
