ISO/IEC JTC1/SC22/WG14 N938

WG14 Document: N938

BDTI Proposal for Fixed-Point Arithmetic in Embedded C

John R. Hauser
Berkeley Design Technology, Inc.
2001 March 16

BDTI has been writing software for and evaluating the performance of DSPs
(digital signal processors) for almost a decade, and we are naturally
interested in WG14's effort to standardize C extensions for fixed-point
arithmetic as part of its "Embedded C" work item.  To promote discussion on
this topic and help foster the best possible standard, we would like to put
forth a proposal for the fixed-point types and arithmetic that differs in
some respects from the working draft WDTR18037.  This document consists of
some preliminary comments followed by an outline of the specification we
would prefer.

Preliminary Commentary

 - As was discussed in Toronto, BDTI would like the syntax to be such that
   the embedded C features can all be implemented directly in C++ without
   extra compiler support.  Of course, for actual application development,
   a C or C++ compiler would need to treat the new features specially or
   they won't be efficient.  But making the syntax compatible with existing
   C++ has two positive effects:  (1) It makes the C++ people happier by
   allowing them to be compatible without having to extend C++ and change
   all of their compilers.  (2) Equally important, it gives everyone with
   access to a C++ compiler a convenient prototyping platform, even if their
   C compilers don't support the embedded C features.

 - We approve of the basic plan to have separate "type-A" and "type-B"
   fixed-point types, with three nominal sizes of each ("short", normal, and
   "long").  As defined by Willem Wakker, type-A fixed-point formats have
   no integer bits and support a range of -1 to 1, while type-B formats have
   both integer and fractional bits.

 - It is BDTI's view that unsigned fixed-point is not worth pursuing at this
   time.  While unsigned fixed-point types can sometimes be useful, they
   really only buy one extra bit of precision over the signed types.  To
   take the type-A case (and echoing Willem), the corresponding unsigned
   fixed-point types would either have the same number of fraction bits and
   one integer bit to cover the range 0 to 2, or they would have an extra
   fraction bit and cover the range 0 to 1.  The first style would be more
   consistent with the correspondence between signed and unsigned integers;
   however, the range of 0 to 2 has no obvious justification other than to
   maintain alignment with the signed fixed-point format.

   Some DSPs don't fully support an unsigned fractional type in hardware.
   For those that do, the main benefit is to permit multi-word operations to
   be built from processor primitives more conveniently.  For such purposes,
   the location of the binary point in individual operations becomes rather
   murky.  The C language already has unsigned integers that can be used for
   constructing multi-word operations about as effectively.

   Because one bit isn't much, it's our experience that true unsigned
   fixed-point formats rarely come up.  Note that the choice not to require
   unsigned fixed-point types has precedent in the fact that there are no
   unsigned _floating-point_ types in C, either.  We advocate postponing the
   standardization of unsigned fixed-point (which would oblige all Embedded
   C implementations to provide it) until there is a clearer need for it.

 - As was briefly touched on at the Toronto meeting, a saturation attribute
   is most conveniently attached to fixed-point types, even though
   saturation would ideally be an attribute of operations, not types.
   There's no convenient syntax we know of for attaching a "sat" keyword to
   operators like "+" and "=".

 - As with unsigned fixed-point, we don't see enough need for saturating
   _integer_ arithmetic to justify standardizing it at this time.  We were
   unable to think of a single instance where support for this feature would
   significantly benefit one of our applications.

 - Besides saturation, some DSP algorithms require modular wrap-around on
   overflow (like the wrap-around that C requires for unsigned integers).
   We propose that there be an attribute for this that can be attached to
   types in the same manner as the saturation attribute.


The following outlines our proposal for fixed-point extensions for C.  BDTI
has developed a prototype implementation of this system in C++ and has been
building experience using it to express DSP calculations.

Additional comments in the text below are enclosed in boxes.

Fixed-Point Types

Eighteen new fixed-point type names are defined:

    short_fract   short_fract_sat   short_fract_mod
          fract         fract_sat         fract_mod
     long_fract    long_fract_sat    long_fract_mod

    short_accum   short_accum_sat   short_accum_mod
          accum         accum_sat         accum_mod
     long_accum    long_accum_sat    long_accum_mod

All of these fixed-point types are signed; no unsigned fixed-point types are
defined.  The "_sat" and "_mod" types are identical to their respective base
types except for the treatment of overflows as explained later.

The "fract" types by definition have no integer bits.  The binary point in
the "fract" types is always just to the left of the most significant non-
sign bit, so that "fract" types support values in the range of -1 to 1.
(The endpoints -1 and 1 are not necessarily representable exactly by a
"fract" type.)  The minimum number of bits for each "fract" type (including
the sign bit) is as follows:

    short_fract:   8 bits
          fract:  16 bits
     long_fract:  32 bits

The "accum" types have integer bits as well as fractional bits.  Each
"accum" type (short, normal, and long) must have exactly the same number of
fractional bits as its corresponding "fract" type, plus a minimum of four
integer bits.  (It is not required that the three "accum" types all have the
same number of integer bits.)
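
As a rough illustration of these formats, the following C sketch models a
16-bit "fract" and a 20-bit "accum" using ordinary integers.  The bit counts
chosen (15 fractional bits, 4 integer bits) and all of the names are
assumptions made for the example; they are not part of the proposal.

```c
#include <stdint.h>

/* Hypothetical model: a 16-bit "fract" with 1 sign bit and 15 fractional
   bits, and an "accum" with the same 15 fractional bits plus 4 integer
   bits (20 bits in all, held here in a 32-bit integer). */
#define FRACT_FBITS 15
#define ACCUM_IBITS 4

typedef int16_t q15_t;   /* stands in for "fract" */
typedef int32_t acc_t;   /* stands in for "accum" */

/* Real value represented by a raw bit pattern: raw / 2^15. */
static double fract_value(q15_t raw) { return raw / 32768.0; }
static double accum_value(acc_t raw) { return raw / 32768.0; }

/* Largest and smallest raw values of the modeled "accum". */
static acc_t accum_raw_max(void)
{
    return ((acc_t)1 << (FRACT_FBITS + ACCUM_IBITS)) - 1;
}

static acc_t accum_raw_min(void)
{
    return -((acc_t)1 << (FRACT_FBITS + ACCUM_IBITS));
}
```

In this model the "fract" covers -1 up to just under 1, and the "accum"
covers -16 up to just under 16.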

| Willem and others have proposed names such as "short accum", together
| with a separate "sat" keyword.  Although this would be more consistent
| with existing C, names requiring more than one token have always been
| problematic because they can't be defined either through "typedef"s
| or macros in C or with classes in C++.  Without the ability to declare
| the proper type names, a working implementation can't be created in C++.
| This is important because we expect most C compilers _not_ to support
| the embedded extensions, so a C++ implementation will be the only one
| available to many people.

Overflow and Rounding

Conversion of a real arithmetic value to a fixed-point type may overflow
and/or may require rounding.  When the source value does not fit within
the range of the fixed-point type, the conversion overflows.  Two different
behaviors are defined for overflow:

 - Saturation:  If the fixed-point type is a "_sat" type, the source value
   is replaced by the closest available fixed-point value (which will be
   either the maximal negative or maximal positive value of the fixed-point
   type).

 - Modular wrap-around:  If the fixed-point type is a "_mod" type, the
   source value is replaced by a value in the range of -(2^I) to 2^I that is
   congruent (in the mathematical sense) to the source value modulo 2^(I+1),
   where I is the number of integer bits in the fixed-point format.  For
   example, for "fract" types, the source value is replaced by a value in
   the range -1 to 1 that is congruent to the source value modulo 2.  (This
   has the effect of discarding all the bits above the most significant bit
   of the fixed-point format.  This is exactly the same sort of overflow
   behavior required of unsigned integers.)

Otherwise, if the fixed-point type is not a "_sat" or "_mod" type, overflow
is undefined (and may legitimately terminate the program, for example).
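
The two defined overflow behaviors can be sketched in C for a 16-bit "fract"
with 15 fractional bits.  The format and the function names are assumptions
made for illustration only.

```c
#include <stdint.h>

/* Hypothetical 16-bit "fract" model: raw values in [-32768, 32767],
   each representing raw / 2^15. */

/* "_sat" behavior: clamp to the nearest representable value. */
static int16_t fract_saturate(int32_t raw)
{
    if (raw > INT16_MAX) return INT16_MAX;
    if (raw < INT16_MIN) return INT16_MIN;
    return (int16_t)raw;
}

/* "_mod" behavior: pick the value in [-32768, 32767] that is congruent
   to raw modulo 2^16 raw units (modulo 2 in terms of real values). */
static int16_t fract_wrap(int32_t raw)
{
    int32_t m = raw % 65536;          /* in (-65536, 65536) */
    if (m < 0) m += 65536;            /* now in [0, 65536) */
    if (m >= 32768) m -= 65536;       /* now in [-32768, 32768) */
    return (int16_t)m;
}
```

For instance, the source value 1.25 (raw 40960) saturates to raw 32767, the
largest "fract" value, but wraps to raw -24576, which is -0.75.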

If, after overflow handling, the source value cannot be represented exactly
by the fixed-point type, the source value is rounded to either the closest
fixed-point value greater than the source value (rounded up) or to the
closest fixed-point value less than the source value (rounded down).
Whether rounding is up or down is implementation-defined and may differ for
different values and different situations.
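
To make the two permitted rounding directions concrete, the sketch below
converts an exact ratio to a raw value with 15 fractional bits, once rounding
down and once rounding up.  The helper names and the 15-bit format are
assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Floor of a / b for b > 0.  (C's "/" truncates toward zero.) */
static int64_t floor_div(int64_t a, int64_t b)
{
    int64_t q = a / b;
    return (a % b != 0 && a < 0) ? q - 1 : q;
}

/* Hypothetical helpers: convert the exact ratio num/den, assumed to be
   in range for a "fract", to a raw value with 15 fractional bits. */
static int32_t fract_round_down(int32_t num, int32_t den)
{
    assert(den > 0);
    return (int32_t)floor_div((int64_t)num * 32768, den);
}

static int32_t fract_round_up(int32_t num, int32_t den)
{
    assert(den > 0);
    return (int32_t)-floor_div(-(int64_t)num * 32768, den);
}
```

The value 1/3 becomes raw 10922 when rounded down and raw 10923 when rounded
up; a conforming implementation could produce either.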

Type Conversions

All conversions between a fixed-point type and another arithmetic type
(which can be another fixed-point type) are defined.  Overflow and
rounding are handled according to the usual rules for the destination type.
Conversions from a fixed-point to an integer type round toward zero, just as
conversions from floating-point to integer do.  The rounding of conversions
from fixed-point to floating-point is unspecified.
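
A minimal sketch of the fixed-point to integer rule, assuming an "accum"
modeled as a 32-bit integer holding the value times 2^15 (the name
accum_to_int is hypothetical):

```c
#include <stdint.h>

/* Hypothetical "accum" model: raw / 2^15 is the represented value. */
static int accum_to_int(int32_t raw)
{
    /* C99 integer division truncates toward zero, which matches the
       rule given for fixed-point to integer conversion. */
    return (int)(raw / 32768);
}
```

With this model, 1.5 (raw 49152) converts to 1 and -1.5 (raw -49152) converts
to -1.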


Operations

The usual operators for basic arithmetic and for comparisons (add, subtract,
multiply, divide, equal, less than, etc.) are defined to work with fixed-
point values.  In addition, fixed-point shifts left and right are defined
to be equivalent to multiplications and divisions by a power of two.  The
"roundfx" operator is provided for explicitly rounding a fixed-point value
to a number of fractional bits.  Finally, a "bits" operator allows the bits
of a fixed-point value to be accessed and manipulated as an integer, for
situations not covered by the other operations.  All of these operations are
detailed in the subsections below.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Basic arithmetic

The standard binary arithmetic operators +, -, *, and / accept fixed-point
operands as follows:

 - If both operands are fixed-point, the two operand types must be
   "comparable" according to the following partial order (or lattice):

       short_fract <= fract <= long_fract
       short_accum <= accum <= long_accum
          short_fract <= short_accum
                fract <= accum
           long_fract <= long_accum

   (This means that one of the types must be guaranteed to have at least
   as many integer bits and as many fractional bits as the other type, and
   thus able to hold all values of the other type.)  The "_sat" or "_mod"
   attributes do not affect whether two fixed-point types are comparable.
   It is a constraint error for the two fixed-point types not to be
   comparable according to this partial order.  The type of the result is
   the larger type, with the adoption of any "_sat" or "_mod" attribute
   from the other operand.  For example, if the operands of an addition have
   types "long_accum" and "fract_sat", the result type of the operation is
   "long_accum_sat".  It is a constraint error for one fixed-point operand
   to be a "_sat" type and the other a "_mod" type.

| If one of the fixed-point types isn't strictly "bigger" than the other,
| there's no certain way to choose the result type.  Rather than invent
| an arbitrary rule, we've simply prohibited the questionable cases.
| Programmers can achieve what they want unambiguously through the use of
| type casts.

 - If one operand is fixed-point and the other an integer, the result type
   is that of the fixed-point operand.

 - It is a constraint error for one operand to be fixed-point and the other
   floating-point.
| The prohibition against mixing fixed-point and floating-point may come as
| a surprise, but it serves an important purpose.  Imagine a program written
| for an embedded processor with no floating-point hardware.  Fixed-point
| types are used in the program for certain non-integer calculations.
| Assume that some numeric constants appear in the code, e.g.,
| "a * 0.333333", where the variable "a" has fixed-point type.  Now, if
| fixed-point and floating-point operands were allowed to mix, this would
| (presumably) count as a floating-point multiplication, which might take
| hundreds of clock cycles to emulate instead of the two- or three-cycle
| fixed-point multiplication the programmer intended.  The problem is that
| obvious constants like 0.333333 have floating-point type in C, so any
| inadvertent use of them as operands will drag in floating-point emulation
| software and suck up clock cycles.  From the perspective of the embedded
| software developer, these situations are bugs that have to be found and
| eliminated.
|
| It's a safe bet that embedded software that uses fixed-point arithmetic
| is going to be run on machines without efficient floating-point.  Having
| fixed-point values automatically promote to floating-point is not a
| convenience; it's an annoyance working against the programmer.  Rather
| than think of fixed-point as being between integers and floating-point,
| it's more helpful to think of it as an _alternative_ to floating-point.
| With that view, an operation with both fixed-point and floating-point
| operands has no preferred result type.  Of course, programmers can easily
| use explicit type casts and assignments to convert between the two forms
| when needed.
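
The result-type rule for two fixed-point operands can be sketched as a small
C function.  The encoding of the type names as (size, kind, attribute)
triples is an assumption made for the example, as are all identifiers; mixed
integer and floating-point operands are not modeled.

```c
#include <stdbool.h>

/* Hypothetical encoding of the eighteen type names. */
enum fx_size { FX_SHORT, FX_NORMAL, FX_LONG };
enum fx_kind { FX_FRACT, FX_ACCUM };
enum fx_attr { FX_PLAIN, FX_SAT, FX_MOD };

struct fx_type { enum fx_size size; enum fx_kind kind; enum fx_attr attr; };

/* a <= b in the partial order: b is at least as large in both size
   and kind.  Attributes play no part in comparability. */
static bool fx_le(struct fx_type a, struct fx_type b)
{
    return a.size <= b.size && a.kind <= b.kind;
}

/* Computes the result type of a binary +, -, *, or /.  Returns false
   where the proposal calls for a constraint error. */
static bool fx_result_type(struct fx_type a, struct fx_type b,
                           struct fx_type *out)
{
    if (!fx_le(a, b) && !fx_le(b, a))
        return false;                       /* not comparable */
    if (a.attr != FX_PLAIN && b.attr != FX_PLAIN && a.attr != b.attr)
        return false;                       /* "_sat" mixed with "_mod" */
    *out = fx_le(a, b) ? b : a;             /* the larger type ... */
    out->attr = (a.attr != FX_PLAIN) ? a.attr : b.attr;  /* ... plus attribute */
    return true;
}
```

Under this encoding, "long_accum" with "fract_sat" yields "long_accum_sat",
while "fract" with "short_accum" is rejected because neither type is at least
as large as the other.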

In all cases above, a fixed-point arithmetic operation is done exactly
(according to its mathematical definition), and then overflow handling
and rounding are performed for the result type as explained in the earlier
section _Overflow_and_Rounding_.  An integer operand is _not_ first
converted to fixed-point before the operation is performed.
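
The "compute exactly, then handle overflow and rounding" rule can be
illustrated with a saturating multiply.  This is only a sketch: it assumes a
16-bit "fract_sat" with 15 fractional bits and an implementation that rounds
down, and the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical "fract_sat" multiply on 16-bit values with 15
   fractional bits.  The 32-bit product is exact (30 fractional
   bits); it is then rounded down to 15 fractional bits, and the
   single overflowing case, (-1) * (-1) = +1, saturates. */
static int16_t fract_sat_mul(int16_t a, int16_t b)
{
    int32_t exact = (int32_t)a * (int32_t)b;      /* exact, scaled by 2^30 */
    int32_t q = exact / 32768;                    /* truncates toward zero */
    if (exact % 32768 != 0 && exact < 0) q -= 1;  /* make it round down */
    if (q > INT16_MAX) q = INT16_MAX;             /* +1 is not representable */
    return (int16_t)q;
}
```

For example, 0.5 * 0.5 (raw 16384 each) gives raw 8192, which is 0.25, and
(-1) * (-1) saturates to raw 32767.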

| The proposal not to promote integers to fixed-point to balance the
| operands is clearly a departure from the way C is normally defined
| and, in particular, the way the same operations work when integer and
| floating-point operands are mixed.  At BDTI, we've struggled with this
| inconsistency, but we keep running into a fundamental problem:  integer
| values often can't be promoted honestly to fixed-point types.  None of
| the "fract" types have _any_ integer bits, and some DSPs will have as
| few as four integer bits in their "accum" types.  On such a machine, it's
| impossible to promote an integer bigger than 15 to any fixed-point type,
| which leaves only a limited range of integers to work with.  Consider, for
| example, how you divide a fixed-point value by an integer variable whose
| value could be as large as, say, 100.
|
| The floating-point types have the advantage that (on all machines I know)
| the _range_ of all the integers fits within even the smallest floating-
| point type, so promoting an integer to floating-point at worst suffers
| a rounding error (and often not even that).  This is absolutely not the
| case for the fixed-point types.  On the other hand, unlike floating-point,
| from a hardware perspective, fixed-point and integer operations are nearly
| the same thing.  Thus, it's no trouble for the hardware to mix integer
| and fixed-point operands and perform the calculation directly, as we've
| proposed.  Since it seems the only argument against this approach is the
| wrinkle it incurs on the C language, we think the balance of the argument
| favors defining mixed fixed-point and integer operations as working
| directly without the integer argument being promoted to fixed-point first.
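
The sketch below shows what operating directly on the integer operand might
look like for a division.  It assumes a 16-bit "fract" with 15 fractional
bits and rounding down; the function name is hypothetical.  The divisor can
be a value such as 100, which does not fit in any "fract" type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-bit "fract" divided by an int, with the result
   rounded down.  The divisor is used as-is; it never has to be
   representable as a "fract" itself. */
static int16_t fract_div_int(int16_t raw, int n)
{
    assert(n != 0);                      /* division by zero is undefined */
    int32_t num = raw, den = n;
    if (den < 0) { num = -num; den = -den; }
    int32_t q = num / den;               /* truncates toward zero */
    if (num % den != 0 && num < 0) q -= 1;   /* round down instead */
    /* Only -1.0 / -1 overflows; that case is left unhandled here, as
       overflow of a plain "fract" is undefined. */
    return (int16_t)q;
}
```

Dividing 0.5 (raw 16384) by 100 gives raw 163, about 0.00497, without the
integer 100 ever being converted to fixed-point.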

Division by zero is undefined.

The standard assignment operators +=, -=, *=, and /= are defined in the
usual way when either operand is fixed-point.  Note, in particular, that,
given the declarations

    fract_sat a;
    fract_mod b;

the expression "a += b" is a constraint violation for the same reason that
"a + b" is.

The pre- and postfix ++ and -- operators have their usual meaning of adding
or subtracting the integer 1 to/from the operand and returning the value
before or after the addition as the result.

The unary plus (+) and negation (-) operations are defined for fixed-point
as well.  The result type is the same as that of the operand.  The negation
operation is equivalent to subtracting the operand from the integer zero.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparisons

The standard comparison operators (<, <=, ==, >=, >, and !=) accept fixed-
point operands.  For consistency, the rules for mixing operands of different
types are the same as for the basic arithmetic operations, except that it
is permitted to compare a fixed-point value of "_sat" type with a fixed-
point value of "_mod" type.  (Note this means a fixed-point value cannot be
compared to a floating-point value without a type cast.)  Fixed-point and
integer values are compared directly; the integer operand is not first
converted to fixed-point before the comparison is made.
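
A direct comparison between a fixed-point value and an integer might be
implemented along the following lines.  The 16-bit "fract" format with 15
fractional bits and the function name are assumptions for the example.

```c
#include <stdint.h>

/* Hypothetical comparison of a 16-bit "fract" (15 fractional bits)
   with an integer.  Both sides are brought to a common scale in a
   wider type, so the integer never has to fit in a "fract".
   Returns -1, 0, or 1 as the fract value is less than, equal to, or
   greater than n. */
static int fract_cmp_int(int16_t raw, int n)
{
    int64_t lhs = raw;
    int64_t rhs = (int64_t)n * 32768;
    return (lhs > rhs) - (lhs < rhs);
}
```

Comparing 0.5 (raw 16384) with the integer 100 reports "less than" even
though 100 is far outside the range of the "fract" type.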

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Shifts left and right

Shifts of fixed-point values using the standard << and >> operators are
defined to be equivalent to multiplication or division by a power of two.
The right operand is converted to type "int" and must be nonnegative and
less than the total number of bits of the fixed-point operand (the left
operand).  The result type is the same as that of the fixed-point operand.
An exact result is calculated and then converted to the result type in the
same way as the other fixed-point arithmetic operations.

The standard assignment operators <<= and >>= are defined in the usual way
when the left operand is fixed-point.

| Shifts of fixed-point values are common, particularly for block floating-
| point, and need to be supported.
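
As an illustration of shifts defined as multiplication and division by a
power of two, consider the following sketch for a 16-bit "fract_sat" with 15
fractional bits.  The function names and the choice to round down on right
shifts are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical left shift for a 16-bit "fract_sat".  The count must be
   nonnegative and less than the total number of bits (16). */
static int16_t fract_sat_shl(int16_t raw, int count)
{
    assert(count >= 0 && count < 16);
    int32_t exact = (int32_t)raw * ((int32_t)1 << count);  /* raw * 2^count */
    if (exact > INT16_MAX) return INT16_MAX;
    if (exact < INT16_MIN) return INT16_MIN;
    return (int16_t)exact;
}

/* Hypothetical right shift: exact division by 2^count, rounded down
   (one of the permitted rounding choices).  It cannot overflow. */
static int16_t fract_shr(int16_t raw, int count)
{
    assert(count >= 0 && count < 16);
    int32_t den = (int32_t)1 << count;
    int32_t q = raw / den;                       /* truncates toward zero */
    if (raw % den != 0 && raw < 0) q -= 1;       /* round down */
    return (int16_t)q;
}
```

Shifting 0.5 (raw 16384) left by one position would give 1.0, which
saturates to raw 32767; shifting it right by one gives raw 8192, or 0.25.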

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Rounding

The "roundfx" operator rounds a fixed-point operand to a specified number of
fractional bits.  The operator has a function-like syntax:

    roundfx ( <fixed-value> , <num-fract-bits> )

The result type is the type of the first operand, which must be fixed-point.
The second operand is converted to type "int" and must be nonnegative and
less than the number of fractional bits in the fixed-point type.  The fixed-
point value is rounded to the specified number of fractional bits in an
implementation-defined way, and this rounded value is returned as the
result.  Fractional bits beyond the rounding point are set to zero in the
result.
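
One possible behavior for "roundfx" is sketched below for a 16-bit "fract"
with 15 fractional bits.  Because the rounding is implementation-defined,
the round-to-nearest policy used here is only an assumption, as is the
function name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical roundfx for a 16-bit "fract" (15 fractional bits).
   Rounds to nearest, with ties toward +infinity, except that it rounds
   down when the nearest value would not be representable. */
static int16_t fract_roundfx(int16_t raw, int nbits)
{
    assert(nbits >= 0 && nbits < 15);
    int32_t unit = (int32_t)1 << (15 - nbits);  /* weight of last kept bit */
    int32_t t = (int32_t)raw + unit / 2;        /* add half a unit */
    int32_t r = t % unit;
    if (r < 0) r += unit;                       /* remainder in [0, unit) */
    t -= r;                                     /* clear the dropped bits */
    if (t > INT16_MAX) t -= unit;               /* stay within range */
    return (int16_t)t;
}
```

With 7 fractional bits kept, raw 0x1234 rounds to 0x1200 and raw 0x12C0
rounds to 0x1300; in both results the low 8 bits are zero.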

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Accessing the bits of the fixed-point type

The "bits" operator provides access to the bits of a fixed-point value as an
integer.  The operator has a function-like syntax:

    bits ( <fixed-value> )

The single operand must be a fixed-point type.  When used as an expression,
the "bits" operator returns an integer equal to the fixed-point value
multiplied by 2^F, where F is the number of fractional bits in the fixed-
point type.  The result type is a signed integer type big enough to hold all
valid result values for the given operand fixed-point type.  For example, if
"fract" is 16 bits, then after the declaration

    fract a = 0.5;

the value of "bits(a)" is 0.5 * 2^15 = 0x4000 (F is 15 here, because one of
the 16 bits is the sign bit).

If the operand to "bits" is an lvalue, the same construct is also an lvalue
that can be used as the destination of an assignment.  Continuing the
example above, the assignment

    bits(a) = 0x2000;

sets "a" to the fixed-point value 0.25.  The assignment causes the value on
the right hand side to be converted to an integer.  If this integer value is
too large for the fixed-point type of the "bits" operand, overflow handling
occurs in the usual manner according to the fixed-point type, as defined
earlier in the section _Overflow_and_Rounding_.

The "bits" construct cannot have its address taken (by the "&" operator).
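
The reading and writing behavior of "bits" can be modeled in C roughly as
follows.  The sketch assumes a 16-bit "fract_sat" with 15 fractional bits
(one of the 16 bits being the sign), and the type and function names are
hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of "bits" for a 16-bit "fract_sat" variable whose
   value is raw / 2^15. */
typedef struct { int16_t raw; } fract_sat_t;

/* Reading: the fixed-point value multiplied by 2^15, as an integer. */
static int32_t bits_get(fract_sat_t a) { return a.raw; }

/* Writing: store an integer, saturating if it is out of range for the
   "_sat" type. */
static void bits_set(fract_sat_t *a, long v)
{
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    a->raw = (int16_t)v;
}
```

Storing 0x2000 through bits_set leaves the variable holding 0.25, and storing
an out-of-range integer such as 0x12345 leaves it at the maximum, raw 0x7FFF.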

| It is often necessary to access a fixed-point value as an integer in this
| way, and the other arithmetic operations don't provide as convenient a
| means to do this.

Fixed-Point Constant Expressions

Fixed-point constant expressions whose type is neither a "_sat" nor a "_mod"
type are evaluated by the compiler with saturation on overflow.

| Fixed-point constant expressions can be constructed using type casts, as
| in "(fract) 1.0".
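
A sketch of how a compiler might fold such a constant, assuming a 16-bit
"fract" with 15 fractional bits and truncation toward zero as the rounding
choice (the function name is hypothetical):

```c
#include <stdint.h>

/* Hypothetical folding of a floating constant to a 16-bit "fract"
   (15 fractional bits): saturate on overflow, and otherwise truncate
   toward zero (one permitted rounding). */
static int16_t fract_const(double x)
{
    double scaled = x * 32768.0;
    if (scaled >= 32767.0) return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    return (int16_t)scaled;      /* conversion truncates toward zero */
}
```

Under these assumptions, "(fract) 1.0" folds to raw 32767, the largest
representable value, rather than being an error.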