WG14 Document: N938 BDTI Proposal for Fixed-Point Arithmetic in Embedded C John R. Hauser Berkeley Design Technology, Inc. 2001 March 16 BDTI has been writing software for and evaluating the performance of DSPs (digital signal processors) for almost a decade, and we are naturally interested in WG14's effort to standardize C extensions for fixed-point arithmetic as part of its "Embedded C" work item. To promote discussion on this topic and help foster the best possible standard, we would like to put forth a proposal for the fixed-point types and arithmetic that differs in some respects from the working draft WDTR18037. This document consists of some preliminary comments followed by an outline of the specification we would prefer. ---------------------------------------------------------------------------- Preliminary Commentary - As was discussed in Toronto, BDTI would like the syntax to be such that the embedded C features can all be implemented directly in C++ without extra compiler support. Of course, for actual application development, a C or C++ compiler would need to treat the new features specially or they won't be efficient. But making the syntax compatible with existing C++ has two positive effects: (1) It makes the C++ people happier by allowing them to be compatible without having to extend C++ and change all of their compilers. (2) Equally important, it gives everyone with access to a C++ compiler a convenient prototyping platform, even if their C compilers don't support the embedded C features. - We approve of the basic plan to have separate "type-A" and "type-B" fixed-point types, with three nominal sizes of each ("short", normal, and "long"). As defined by Willem Wakker, type-A fixed-point formats have no integer bits and support a range of -1 to 1, while type-B formats have both integer and fractional bits. - It is BDTI's view that unsigned fixed-point is not worth pursuing at this time. While unsigned fixed-point types can sometimes be useful, they really only buy one extra bit of precision over the signed types. To take the type-A case (and echoing Willem), the corresponding unsigned fixed-point types would either have the same number of fraction bits and one integer bit to cover the range 0 to 2, or they would have an extra fraction bit and cover the range 0 to 1. The first style would be more consistent with the correspondence between signed and unsigned integers; however, the range of 0 to 2 has no obvious justification other than to maintain alignment with the signed fixed-point format. Some DSPs don't fully support an unsigned fractional type in hardware. For those that do, the main benefit is to permit multi-word operations to be built from processor primitives more conveniently. For such purposes, the location of the binary point in individual operations becomes rather murky. The C language already has unsigned integers that can be used for constructing multi-word operations about as effectively. Because one bit isn't much, it's our experience that true unsigned fixed-point formats rarely come up. Note that the choice not to require unsigned fixed-point types has precedent in the fact that there are no unsigned _floating-point_ types in C, either. We advocate postponing the standardization of unsigned fixed-point (and thus coercing all Embedded C implementations to provide it) until there is a clearer need for it. - As was briefly touched on at the Toronto meeting, a saturation attribute is most conveniently attached to fixed-point types, even though saturation would ideally be an attribute of operations, not types. There's no convenient syntax we know of for attaching a "sat" keyword to operators like "+" and "=". - As with unsigned fixed-point, we don't see enough need for saturating _integer_ arithmetic to justify standardizing it at this time. We were unable to think of a single instance where support for this feature would significantly benefit one of our applications. - Besides saturation, some DSP algorithms require modular wrap-around on overflow (like the wrap-around that C requires for unsigned integers). We propose that there be an attribute for this that can be attached to types in the same manner as the saturation attribute. ============================================================================ The following outlines our proposal for fixed-point extensions for C. BDTI has developed a prototype implementation of this system in C++ and has been building experience using it to express DSP calculations. Additional comments in the text below are enclosed in boxes. ---------------------------------------------------------------------------- Fixed-Point Types Eighteen new fixed-point type names are defined: short_fract short_fract_sat short_fract_mod fract fract_sat fract_mod long_fract long_fract_sat long_fract_mod short_accum short_accum_sat short_accum_mod accum accum_sat accum_mod long_accum long_accum_sat long_accum_mod All of these fixed-point types are signed; no unsigned fixed-point types are defined. The "_sat" and "_mod" types are identical to their respective base types except for the treatment of overflows as explained later. The "fract" types by definition have no integer bits. The binary point in the "fract" types is always just to the left of the most significant non- sign bit, so that "fract" types support values in the range of -1 to 1. (The endpoints -1 and 1 are not necessarily representable exactly by a "fract" type.) The minimum number of bits for each "fract" type (including the sign bit) is as follows: short_fract: 8 bits fract: 16 bits long_fract: 32 bits The "accum" types have integer bits as well as fractional bits. Each "accum" type (short, normal, and long) must have exactly the same number of fractional bits as its corresponding "fract" type, plus a minimum of four integer bits. (It is not required that the three "accum" types all have the same number of integer bits.) +--------------------------------------------------------------------------- | Willem and others have proposed names such as "short accum", together | with a separate "sat" keyword. Although this would be more consistent | with existing C, names requiring more than one token have always been | problematic because they can't be defined either through "typedef"s | or macros in C or with classes in C++. Without the ability to declare | the proper type names, a working implementation can't be created in C++. | This is important because we expect most C compilers _not_ to support | the embedded extensions, so a C++ implementation will be the only one | available to many people. +--------------------------------------------------------------------------- ---------------------------------------------------------------------------- Overflow and Rounding Conversion of a real arithmetic value to a fixed-point type may overflow and/or may require rounding. When the source value does not fit within the range of the fixed-point type, the conversion overflows. Two different behaviors are defined for overflow: - Saturation: If the fixed-point type is a "_sat" type, the source value is replaced by the closest available fixed-point value (which will be either the maximal negative or maximal positive value of the fixed-point type). - Modular wrap-around: If the fixed-point type is a "_mod" type, the source value is replaced by a value in the range of -(2^I) to 2^I that is congruent (in the mathematical sense) to the source value modulo 2^(I+1), where I is the number of integer bits in the fixed-point format. For example, for "fract" types, the source value is replaced by a value in the range -1 to 1 that is congruent to the source value modulo 2. (This has the effect of discarding all the bits above the most significant bit of the fixed-point format. This is exactly the same sort of overflow behavior required of unsigned integers.) Otherwise, if the fixed-point type is not a "_sat" or "_mod" type, overflow is undefined (and may legitimately terminate the program, for example). If, after overflow handling, the source value cannot be represented exactly by the fixed-point type, the source value is rounded to either the closest fixed-point value greater than the source value (rounded up) or to the closest fixed-point value less than the source value (rounded down). Whether rounding is up or down is implementation-defined and may differ for different values and different situations. ---------------------------------------------------------------------------- Type Conversions All conversions between a fixed-point type and another arithmetic type (which can be another fixed-point type) are defined. Overflow and rounding are handled according to the usual rules for the destination type. Conversions from a fixed-point to an integer type round toward zero, just as conversions from floating-point to integer do. The rounding of conversions from fixed-point to floating-point is unspecified. ---------------------------------------------------------------------------- Operations The usual operators for basic arithmetic and for comparisons (add, subtract, multiply, divide, equal, less than, etc.) are defined to work with fixed- point values. In addition, fixed-point shifts left and right are defined to be equivalent to multiplications and divisions by a power of two. The "roundfx" operator is provided for explicitly rounding a fixed-point value to a number of fractional bits. Finally, a "bits" operator allows the bits of a fixed-point value to be accessed and manipulated as an integer, for situations not covered by the other operations. All of these operations are detailed in the subsections below. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Basic arithmetic The standard binary arithmetic operators +, -, *, and / accept fixed-point operands as follows: - If both operands are fixed-point, the two operand types must be "comparable" according to the following partial order (or lattice): short_fract <= fract <= long_fract short_accum <= accum <= long_accum short_fract <= short_accum fract <= accum long_fract <= long_accum (This means that one of the types must be guaranteed to have at least as many integer bits and as many fractional bits as the other type, and thus able to hold all values of the other type.) The "_sat" or "_mod" attributes do not affect whether two fixed-point types are comparable. It is a constraint error for the two fixed-point types not to be comparable according to this partial order. The type of the result is the larger type, with the adoption of any "_sat" or "_mod" attribute from the other operand. For example, if the operands of an addition have types "long_accum" and "fract_sat", the result type of the operation is "long_accum_sat". It is a constraint error for one fixed-point operand to be a "_sat" type and the other a "_mod" type. +--------------------------------------------------------------------------- | If one of the fixed-point types isn't strictly "bigger" than the other, | there's no certain way to choose the result type. Rather than invent | an arbitrary rule, we've simply prohibited the questionable cases. | Programmers can achieve what they want unambiguously through the use of | type casts. +--------------------------------------------------------------------------- - If one operand is fixed-point and the other an integer, the result type is that of the fixed-point operand. - It is a constraint error for one operand to be fixed-point and the other floating-point. +--------------------------------------------------------------------------- | The prohibition against mixing fixed-point and floating-point may come as | a surprise, but it serves an important purpose. Imagine a program written | for an embedded processor with no floating-point hardware. Fixed-point | types are used in the program for certain non-integer calculations. | Assume that some numeric constants appear in the code, e.g., | "a * 0.333333", where the variable "a" has fixed-point type. Now, if | fixed-point and floating-point operands were allowed to mix, this would | (presumably) count as a floating-point multiplication, which might take | hundreds of clock cycles to emulate instead of the two- or three-cycle | fixed-point multiplication the programmer intended. The problem is that | obvious constants like 0.333333 have floating-point type in C, so any | inadvertent use of them as operands will drag in floating-point emulation | software and suck up clock cycles. From the perspective of the embedded | software developer, these situations are bugs that have to be found and | eliminated. | | It's a safe bet that embedded software that uses fixed-point arithmetic | is going to be run on machines without efficient floating-point. Having | fixed-point values automatically promote to floating-point is not a | convenience; it's an annoyance working against the programmer. Rather | than think of fixed-point as being between integers and floating-point, | it's more helpful to think of it as an _alternative_ to floating-point. | With that view, an operation with both fixed-point and floating-point | operands has no preferred result type. Of course, programmers can easily | use explicit type casts and assignments to convert between the two forms | when needed. +--------------------------------------------------------------------------- In all cases above, a fixed-point arithmetic operation is done exactly (according to its mathematical definition), and then overflow handling and rounding is performed for the result type as explained in the earlier section _Overflow_and_Rounding_. An integer operand is _not_ first converted to fixed-point before the operation is performed. +--------------------------------------------------------------------------- | The proposal not to promote integers to fixed-point to balance the | operands is clearly a departure from the way C is normally defined | and, in particular, the way the same operations work when integer and | floating-point operands are mixed. At BDTI, we've struggled with this | inconsistency, but we keep running into a fundamental problem: integer | values often can't be promoted honestly to fixed-point types. None of | the "fract" types have _any_ integer bits, and some DSPs will have as | few as four integer bits in their "accum" types. On such a machine, it's | impossible to promote an integer bigger than 8 to any fixed-point type, | which leaves only a limited range of integers to work with. Consider, for | example, how you divide a fixed-point value by an integer variable whose | value could be as large as, say, 15. | | The floating-point types have the advantage that (on all machines I know) | the _range_ of all the integers fits within even the smallest floating- | point type, so promoting an integer to floating-point at worst suffers | a rounding error (and often not even that). This is absolutely not the | case for the fixed-point types. On the other hand, unlike floating-point, | from a hardware perspective, fixed-point and integer operations are nearly | the same thing. Thus, it's no trouble for the hardware to mix integer | and fixed-point operands and perform the calculation directly, as we've | proposed. Since it seems the only argument against this approach is the | wrinkle it incurs on the C language, we think the balance of the argument | favors defining mixed fixed-point and integer operations as working | directly without the integer argument being promoted to fixed-point first. +--------------------------------------------------------------------------- Division by zero is undefined. The standard assignment operators +=, -=, *=, and /= are defined in the usual way when either operand is fixed-point. Note, in particular, that, given the declarations fract_sat a; fract_mod b; the expression "a += b" is a constraint violation for the same reason that "a + b" is. The pre- and postfix ++ and -- operators have their usual meaning of adding or subtracting the integer 1 to/from the operand and returning the value before or after the addition as the result. The unary plus (+) and negation (-) operations are defined for fixed-point as well. The result type is the same as that of the operand. The negation operation is equivalent to subtracting the operand from the integer zero. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Comparisons The standard comparison operators (<, <=, ==, >=, >, and !=) accept fixed- point operands. For consistency, the rules for mixing operands of different types are the same as for the basic arithmetic operations, except that it is permitted to compare a fixed-point value of "_sat" type with a fixed- point value of "_mod" type. (Note this means a fixed-point value cannot be compared to a floating-point value without a type cast.) Fixed-point and integer values are compared directly; the integer operand is not first converted to fixed-point before the comparison is made. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Shifts left and right Shifts of fixed-point values using the standard << and >> operators are defined to be equivalent to multiplication or division by a power of two. The right operand is converted to type "int" and must be nonnegative and less than the total number of bits of the fixed-point operand (the left operand). The result type is the same as that of the fixed-point operand. An exact result is calculated and then converted to the result type in the same way as the other fixed-point arithmetic operations. The standard assignment operators <<= and >>= are defined in the usual way when the left operand is fixed-point. +--------------------------------------------------------------------------- | Shifts of fixed-point values are common, particularly for block floating- | point, and need to be supported. +--------------------------------------------------------------------------- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - roundfx The "roundfx" operator rounds a fixed-point operand to a specified number of fractional bits. The operator has a function-like syntax: roundfx ( <fixed-value> , <num-fract-bits> ) The result type is the type of the first operand, which must be fixed-point. The second operand is converted to type "int" and must be nonnegative and less than the number of fractional bits in the fixed-point type. The fixed- point value is rounded to the specified number of fractional bits in an implementation-defined way, and this rounded value is returned as the result. Fractional bits beyond the rounding point are set to zero in the result. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Accessing the bits of the fixed-point type The "bits" operator provides access to the bits of a fixed-point value as an integer. The operator has a function-like syntax: bits ( <fixed-value> ) The single operand must be a fixed-point type. When used as an expression, the "bits" operator returns an integer equal to the fixed-point value multiplied by 2^F, where F is the number of fractional bits in the fixed- point type. The result type is a signed integer type big enough to hold all valid result values for the given operand fixed-point type. For example, if "fract" is 16 bits, then after the declaration fract a = 0.5; the value of "bits(a)" is 0.5 * 2^16 = 0x4000. If the operand to "bits" is an lvalue, the same construct is also an lvalue that can be used as the destination of an assignment. Continuing the example above, the assignment bits(a) = 0x2000; sets "a" to the fixed-point value 0.25. The assignment causes the value on the right hand side to be converted to an integer. If this integer value is too large for the fixed-point type of the "bits" operand, overflow handling occurs in the usual manner according to the fixed-point type, as defined earlier in the section _Overflow_and_Rounding_. The "bits" construct cannot have its address taken (by the "&" operator). +--------------------------------------------------------------------------- | It is often necessary to access a fixed-point value as an integer in this | way, and the other arithmetic operations don't provide as convenient a | means to do this. +--------------------------------------------------------------------------- ---------------------------------------------------------------------------- Fixed-Point Constant Expressions Fixed-point constant expressions whose type is neither a "_sat" nor a "_mod" type are evaulated by the compiler with saturation on overflow. +--------------------------------------------------------------------------- | Fixed-point constant expressions can be constructed using type casts, as | in "(fract) 1.0". +--------------------------------------------------------------------------- ----------------------------------------------------------------------------