Identifying array length state

Jens Gustedt, INRIA and ICube, France
Martin Uecker, Graz University of Technology, Austria

2023-12-13

document history

document number date comment
n3188 202312 this paper, original proposal

license

CC BY, see https://creativecommons.org/licenses/by/4.0

1 Overview

This paper builds on n3187 where we clarify the evaluation model of VM Types. In essence we reduce the evaluation that is necessary to obtain array length or size information to a simple lvalue conversion of a hidden state variable that holds the array length. Because that state variable can be assumed to be immutable (it doesn’t change during the lifetime of any instance of an object of VM type), such an evaluation is completely harmless and can be effected by the compiler whenever the corresponding object or type definition has been met.

The goal of this paper here is to allow user code to name that otherwise hidden state variable and use it directly to declare VM Types. By that we introduce a simple mechanism to track array length

and all of that without changing any function call ABI. The only new feature that has to be taught compilers is how to compute size information of arrays when the array length is not in a hidden state but at a specific location, that is either a specific variable (with automatic or static duration and with or without linkage) or a structure member at a specific offset from the array.

Note though, that even the existing model already has some subtlety that comes from the fact that for now these state variables are always assumed to be automatic. Although VM type declarators are limited to block scope, they have not necessarily automatic storage duration, pointers to VLA may be declared with static. This opens the possibility to declare static objects that have a different type each time the definition is met. Even worse, if such an object is found in a recursive function, the same object can simultaneously have several types. Valid pointer values that point to the static VM but use these different types may be propagated along recursive calls and so the type-based aliasing rule might simply break down. So some care is in order to not make this situation even worse.

In this paper we so propose to add an annotation to bind an array length (constexpr or const) to a const-qualified integer variable that holds the length. By allowing the identifier of such a variable to act as a forward declaration, this generalizes the current concept of incomplete arrays to VLA and allows to improve cross TU declarations of array objects.

To illustrate the simplicity of the proposed concept suppose that for C23 we have for example

extern size_t const arrayLength;
extern double array[];

in a header. Users of this header need implicit knowledge to compute the size of array (which has an incomplete type) as sizeof(double[arrayLength]), but nothing in the code indicates this relationship and no array length checking compiler can track this size and diagnose out-of-bounds accesses.

The defining translation unit in C23 could then look like this

#ifdef TOTO
# define MYDAILYDOSE 20
#else
# define MYDAILYDOSE 60
#endif

size_t const arrayLength = MYDAILYDOSE;
double array[MYDAILYDOSE] = {};

Which, again, has to ensure the implicit relationship between the value of arrayLength and the effective length with which array is defined. So this is a fragile relation that can only be ensured by documentation, and not by language constructs.

With our proposal this code becomes much simpler. In the header

extern double array[.arrayLength];

Here, if arrayLength has no other declaration it is implicitly declared to have type size_t const and external linkage. Any user of this header can (in block scope) simply query for the size information by using sizeof array much as for any other VLA. Note that thereby this sizeof expression is not an integer constant expression, and indeed it can’t be, because the necessary information (the value of arrayLength) is not available at translation time.

The defining TU now becomes

#ifdef TOTO
size_t const arrayLength = 20;
#else
size_t const arrayLength = 60;
#endif

double array[.arrayLength] = {};

To ensure that the array can always be constructed during translation (and not only during linking) we impose that arrayLength must be located in the same TU as array and that thus at the end of the translation of the TU, all information for the code generation of the array is present. In particular, the array is considered a VLA, and so only default initialization as above is allowed, we don’t have to deal with specific initialization at this point.

The proposal extends this simple idea to other usages of arrays, namely to

This new feature is intended to provide an easy to use tool that helps our users to annotate array declarations with array length information, that put simple sizeof expressions at their disposal to access this information, and that also can be used by analyzing compilers to track out-of-bounds accesses to a much larger extend than is currently possible.

With this approach, it is also possible to improve a bit on the side of those C library interfaces that traditionally have integer parameters that encode array lengths but that come after these in the parameter list.

The syntax that we propose for an array length specifier is to have

    . identifier

in places where otherwise could be an expression in an array length declarator. This is much similar to member designators in initializers and should be easy to grasp for C programmers, even if they are not yet familiar with that notation. This notation has the advantage to disambiguate clearly from expression that would be evaluated. But any other notation that ensures this distinction would be fine.

All of this is meant that it does not introduce ABI changes. Code that doesn’t use these features can continue to interact with code that uses them. Even the new feature for static arrays that have lengths that are not integer constant expressions do not need much more overhead than current static arrays: the only novelty here is that they may take the length parameter from a static const variable definition. (This might be novel to C, but is commonly used in C++).

Note also, that the imposed constraints for flexible array members are relatively rigid. In particular, to be used in the declaration of the flexible array member an integer member must be const-qualified. Otherwise a reliable evaluation with sizeof (and reliable static analysis) would not be possible and the undefined behavior would open a new attack surface. With the usual tools of the language it would even be difficult to create (or maintain) such structure objects: the const-qualified integer member can never be written to. But this aspect of creating and reallocating structure objects with flexible array members and const-qualified members is more general than that and independent of the proposal at hand; a solution that always views reallocated structures as new objects (similar as done by realloc) is proposed in n3186.

2 Proposed text

2.1 Model array lengths by state variables

The syntax should be added to array-declarator in 6.7.6.

direct-declarator [ type-qualifier-listopt . identifier ]

and to array-abstract-declarator in 6.7.7

direct-abstract-declaratoropt [ type-qualifier-listopt . identifier ]

Note that the syntax with a dot ensures that this construct cannot be confounded with expression syntax, and that this thereby does not constitute an evaluation of the given identifier. Other syntax might be possible for this feature, the important thing is that a distinction between the use of the identifier as a reference to the name of an object and to the use as an lvalue expression should always be easily detectable.

2.2 Changes to 6.7.6.2, array declarators

Modify 6.7.6.2 p1 in the “Constraints”, reorganization of the text into new paragraphs is left to the editors.

1 In addition to optional type qualifiers and the keyword static, the [ and ] may delimit an expression, a . followed by an identifier (called a length designator), or *. If there is neither an assignment expression nor a length designator, the array length specifier shall be the outermost length specifier in the declaration; such an array type without expression or length designator is incomplete and shall not be used as the element type of another array type. If they delimit an expression (which specifies the size of an array), the expression shall have an integer type. If the expression is a constant expression, it shall have a value greater than zero. If it is an expression and that expression is not an integer constant expression, the declaration shall not appear in file scope and not be a member declaration; if the declaration declares a variable length array it shall not have static or thread storage duration. The element type shall not be an incomplete type or function type. The optional type qualifiers and the keyword static shall appear only in a declaration of a function parameter with an array type, and then only in the outermost array type derivation.

2 If an identifier is declared as having a variably modified type, it shall be an ordinary identifier (as defined in 6.2.3), have no linkage, and have either block scope or function prototype scope. If an identifier is declared to be an object with static or thread storage duration, it shall not have a variable length array type.

Let ID be an identifier that is used as a length designator in the declarator or abstract declarator of a type XT (including implicitly in the declaration of an object or the definition of type). If ID has been declared prior to XT, it shall have integer type. If ID is not declared when it is used as a length designator but then later declared in the same scope or structure, the type shall be size_t. If there is a visible declaration of ID, it shall have a const-qualified type unless the declaration appears in function prototype scope. In any redeclaration of XT, the corresponding declarator or abstract declarator shall use a * (if admited) or the same identifier ID; ID shall refer to the same parameter in the parameter list, to the same structure member or to the same instance of the same object if it is another kind of integer object, respectively.
The following constraints apply to the lexical positions of XT and a declaration of ID.

If a declaration declares an identifier A and uses a length designator ID: if A has linkage and there is any declaration of ID it shall also have linkage; additionally, if the linkage of A is external the linkage of ID shall also be external. If the declaration is the definition of an array with static storage duration, ID shall also have static storage duration and be defined in the same translation unit as A.

For the “Semantics” section

4 If the sizearray length is not present, the array type is an incomplete type. Otherwise, if the sizelength is * instead of being an expression, the array type is a variable length array type of unspecified sizelength, which can only be used in declarations or type names with function prototype scope172) and in the declarator of a function return type; such arrays are nonetheless complete types. If the sizelength is an integer constant expression or if the length designator is constexpr and if the element type has a known constant size, the array type is a constant length array type; otherwise, the array type is a variable length array type. (Variable length arrays with automatic storage duration are a conditional feature that implementations need not support; see 6.10.9.3.)

5 If the sizelength is an expression that is not an integer constant expression or it is a length designator that is not previously declared as constexpr: if it appears in a declaration at function prototype scope or in the return type of a function declaration that is not a definition, it is treated as if it were replaced by *. Otherwise, each time itan array length expression is evaluated it shall have a value greater than zero and if it is a length designator .ID, ID shall have value greater than or equal to zero. If an object or type with a length specifier with value zero is accessed the behavior is undefined. If a length designator ID is not declared in the current scope it behaves as if a declaration within the same scope as A as followsFNT1) is visible and the declaration of A is nonetheless complete.FNT2)

extern size_t const ID;

The sizelength of each instance of a variable length array type does not change during its lifetime. Where a size expression is part of the operand of a typeof or sizeof operator and changing the value of the size expression would not affect the result of the operator, it is unspecified whether or not the size expression is evaluated. Where a sizelength expression is part of the operand of an alignof operator, that expression is not evaluated; if it is a length designator, that designator is ignored.

FNT1) All constraints for a subsequent declaration or definition of such an object ID apply.

FNT2) That is, in that case .ID acts as the forward declaration of an object of type size_t const. To form a valid program ID has an explicit definition with an non-zero initializer in this or another translation unit. Although the size of the array type may not be known at translation time, the array type is still complete and sizeof(A) is valid (in block scope) and possibly evaluates ID and is thus not an integer constant expression. If the definition of ID is located in a different translation unit and that translation unit has no compatible definition of A with length designator ID, the behavior is undefined.

Amend EXAMPLE 2 by

extern int *x;
extern int y[];
extern int z[.n];

The first declares x to be a pointer to int; the second declares y to be an array of int of unknown size (an incomplete type), the storage for which is defined elsewhere. In contrast to that, z has a complete variable length array type with a length designator as if declared as

extern size_t const n;
The translation unit that defines z then also contains a file-scope definition of n that initializes n to a positive value.
size_t const n = 50;
...
int z[.n] = {};
Here, the order of the two definitions may also be inverted. Even though n is constant (and it is implementation-defined if the expression n is an integer constant expression) z is a variable length array. An explicit element-wise initialization is a constraint violation. In block scope, the array z has a less-constrained use than y.
sizeof(y); // constraint violation, incomplete type
sizeof(z); // variable sizeof expression, accesses n

Replace Example 4

EXAMPLE 4 Objects declared with linkage that have a VM type declare the length designator explicitly and that length designator has static storage duration and also has linkage; if the linkage is external the linkage of the length designator is also external unless the length designator is also defined by the definition in which case it is fixed and has internal linkage. Variable length arrays (VLA) are more restricted because the definition needs to compute the total size when the lifetime of the instance of the VLA begins and eventually during the lifetime when requested with a sizeof operator. Therefore the lifetime of any length designator (explicit or implicit) does not have a lifetime that is shorter than the lifetime of the VLA. Finally, all identifiers declared with a VM type that are structure members, have a length designator that is another member of the same structure; if additionally that member with length designator is an array, it is a flexible array member that possibly extends beyond the size of the structure. In the following, “invalid” indicates a constraint violation.
int B[100];            // valid, file scope but not VLA
extern int const n;
static int const j = 8;
static int H[.j];      // valid, internal linkage

int A1[n];             // invalid, not an integer constant expression
extern int (*p2)[n];   // invalid, not an integer constant expression

void fvla(int m, int C[m][m]);     // valid, VLA with prototype scope
void fvlb(int[m][m], size_t m);    // valid, VLA with prototype scope, misleading
void fvla(int m, int C[m][m])      // valid, C adjusted to auto pointer to VLA
{
    if (m <= 0) unreachable(); // undefined behavior if m is not positive
    typedef int T[m][m];       // valid, block scope typedef of VLA
    struct tag {
        int (*y)[n];           // invalid, not an integer constant expression
        int z[n];              // invalid, not an integer constant expression
    };
    int D[m] = { };            // valid, auto VLA, initialized per default
    static int E[m];           // invalid, not an integer constant expression
    extern int F[m];           // invalid, not an integer constant expression
    int (*s)[m];               // valid, auto pointer to VLA
    extern int (*r)[m];        // invalid, not an integer constant expression
    if (m > 100) unreachable();// undefined behavior, if m is too big
    static int (*q)[m] = &B;   // valid if m≤100, q has no linkage
}

int A2[.n];            // invalid, no definition of n in the TU
extern int A3[.n];     // valid, declaration of an array with external linkage
extern int (*p3)[.n];  // valid, pointer to array

void fvla(int m, int[.m][.m]);             // valid, no need for const
void fvlb(int[.m][.m], size_t m);          // valid, no need for const
void fvlb(int C[.m][.m], size_t const m)   // valid, C adjusted to auto pointer to VLA
{
    if (m == 0) unreachable(); // undefined behavior if m is not positive
    typedef int T[.m][.m];     // valid, m has no linkage and automatic duration
    struct tag {
        int (*y)[.k];          // valid, y not an array
        size_t const k;
        int z[.n];             // invalid, n not a member
    };
    int A4[.n];                // valid, auto VLA object but static array length
    int D[.m] = { };           // valid, auto VLA with auto array length, initialized per default
    static int E[.m];          // invalid, static VLA definition with auto length
    extern int F[.m];          // invalid, static VLA declaration, auto VLA type
    extern int G[.j];          // invalid, G external and j internal linkage
    extern int H[.j];          // valid, H and j have internal linkage
    static size_t const i = 7;
    extern int K[.i];          // invalid, i has no linkage
    int (*s)[.m];              // valid, auto pointer to VLA
    extern int (*r)[.m];       // invalid, static object, auto array length
    extern int (*t)[.n];       // valid, local scope, static array length with external linkage
    if (m > 100) unreachable();// undefined behavior, if m is too big
    static int (*q)[.m] = &B;  // valid if m≤100, q and m have no linkage
    if (n > 100) unreachable();// undefined behavior, if n is too big
    static int (*p)[.n] = &B;  // valid if n≤100, q has no linkage, n extern
}

2.3 Allow VM types in structures, 6.7.2.1

Since we also want to be able to treat flexible array members of structures as VLA, we have to enable these as members for them. Change 6.7.2.1:

11 A member of a structure or union may have any complete object type other than a variably modified type.145) In addition, a member may be declared to consist of a specified number of bits (including a sign bit, if any). Such a member is called a bit-field;146) its width is preceded by a colon.

Amend p20, flexible array members:

20 As a special case, the last member of a structure with more than one named member may have an incomplete array type or a variable length array type with a length specifier; this is called a flexible array member. In most situations, the flexible array member is ignored. In particular, the size of the structure is as if the flexible array member were omitted except that it may have more trailing padding than the omission would imply. However, When a . (or ->) operator has a left operand that is (a pointer to) a structure with a flexible array member and the right operand names that member:

The offset of the array shall remain that of the flexible array member, even if this would differ from that of the replacement array. If this array would have no elements (because no element fits into the allocated space for the instance or because ID happens to be zero), it behaves as if it had one element but the behavior is undefined if any attempt is made to access that element or to generate a pointer one past it.

2.4 Changes to 6.7.6.3, function declarators

For VLA parameters that are adjusted to pointers, a length specifier plays a similar role to a length expression that has a static specification. Namely it establishes a contract between the caller of the function and the function implementation about the minimal length of the array that is represented by the adjusted pointer argument.

6 A declaration of a parameter as “array of type” shall be adjusted to “qualified pointer to type”, where the type qualifiers (if any) are those specified within the [ and ] of the array type derivation. If the length is given by a specifier ID or the keyword static also appears within the [ and ] of the array type derivation, then for each call to the function, the value of the corresponding actual argument shall provide access to the first element of an array with at least as many elements as specified by the value of ID for the call or the value of the size expression.

2.5 Annotation of C library interfaces

A lot of C library interfaces receive length information through parameters that could profit from the new length specifier. It should be discussed on a per interface base which of these actually already have formulated restrictions and for which of these that would be a semantic change.

We propose to make these annotations always via the length specifier. Doing it for the few cases where it would be possible to use a [static] annotation would have issues for backwards compatibility and for compatibility with C++.

The length parameters here don’t need to be const-qualified, because these are not the definitions but only declarations. Implementations would have to change their function definitions with const to be able to track the length internally for the purpose of the function in question itself.

Note also:

int snprintf(char s[restrict .n], size_t n, const char * restrict format, ...);
int vsnprintf(char s[restrict .n], size_t n, const char * restrict format, va_list arg);
int snprintf_s(char s[restrict .n], rsize_t n, const char * restrict format, ...);
int sprintf_s(char s[restrict .n], rsize_t n, const char * restrict format, ...);
int vsnprintf_s(char s[restrict .n], rsize_t n, const char *restrict format, va_list arg);
int vsprintf_s(char s[restrict .n], rsize_t n, const char * restrict format, va_list arg);
char *gets_s(char s[.n], rsize_t n);
int strfromd(char s[restrict .n], size_t n, const char *restrict format, double fp);
int strfromf(char s[restrict .n], size_t n, const char *restrict format, float fp);
int strfroml(char s[restrict .n], size_t n, const char *restrict format, long double fp);
int mblen(const char s[.n], size_t n);
int mbtowc(wchar_t * restrict pwc, const char s[restrict .n], size_t n);
size_t mbstowcs(wchar_t * restrict pwcs, const char s[restrict .n], size_t n);
size_t wcstombs(char * restrict s, const wchar_t pwcs[restrict .n], size_t n);
int strfromd32(char s[restrict .n], size_t n, const char*restrict format, _Decimal32 fp);
int strfromd64(char s[restrict .n], size_t n, const char*restrict format, _Decimal64 fp);
int strfromd128(char s[restrict .n], size_t n, const char*restrict format, _Decimal128 fp);
char *strncpy(char s1[restrict .n], const char s2[restrict .n], size_t n);
char *strncat(char s1[restrict .n], const char s2[restrict .n], size_t n);
int strncmp(const char s1[.n], const char s2[.n], size_t n);
size_t strxfrm(char s1[restrict .n], const char s2[restrict .n], size_t n);
errno_t strncpy_s(char pwcs[restrict .s1max], rsize_t s1max, const char s2[restrict .n], rsize_t n);
errno_t strncat_s(char pwcs[restrict .s1max], rsize_t s1max, const char s2[restrict .n], rsize_t n);
size_t mbrtoc8(char8_t * restrict pc8, const char s[restrict .n], size_t n, mbstate_t * restrict ps);
size_t mbrtoc16(char16_t * restrict pc16, const char s[restrict .n], size_t n, mbstate_t * restrict ps);
size_t mbrtoc32(char32_t * restrict pc32, const char s[restrict .n], size_t n, mbstate_t * restrict ps);
int swprintf(wchar_t s[restrict .n], size_t n, const wchar_t * restrict format, ...);
int vswprintf(wchar_t s[restrict .n], size_t n, const wchar_t * restrict format, va_list arg);
wchar_t *wcsncpy(wchar_t* restrict s1, const wchar_t s2[restrict .n], size_t n);
wchar_t *wmemcpy(wchar_t* restrict s1, const wchar_t s2[restrict .n], size_t n);
wchar_t *wmemmove(wchar_t *s1, const wchar_t s2[.n], size_t n);
wchar_t *wcsncat(wchar_t* restrict s1, const wchar_t s2[restrict .n], size_t n);
int wcsncmp(const wchar_t *s1, const wchar_t s2[.n], size_t n);
size_t wcsxfrm(wchar_t * restrict s1, const wchar_t s2[restrict .n], size_t n);
int wmemcmp(const wchar_t *s1, const wchar_t s2[.n], size_t n);
QWchar_t *wmemchr(QWchar_t s[.n], wchar_t c, size_t n);
wchar_t *wmemset(wchar_t s[.n], wchar_t c, size_t n);
size_t mbrlen(const char s[restrict .n], size_t n, mbstate_t * restrict ps);
size_t mbrtowc(wchar_t * restrict pwc, const char s[restrict .n], size_t n, mbstate_t * restrict ps);
int snwprintf_s(wchar_t s[restrict .n], rsize_t n, const wchar_t * restrict format, ...);
int swprintf_s(wchar_t s[restrict .n], rsize_t n, const wchar_t * restrict format, ...);
int vsnwprintf_s(wchar_t s[restrict .n], rsize_t n, const wchar_t *restrict format, va_list arg);
int vswprintf_s(wchar_t s[restrict .n], rsize_t n, const wchar_t *restrict format, va_list arg);
errno_t wcsncpy_s(wchar_t pwcs[restrict .s1max], rsize_t s1max, const wchar_t s2[restrict .n], rsize_t n);
errno_t wmemcpy_s(wchar_t pwcs[restrict .s1max], rsize_t s1max, const wchar_t s2[restrict .n], rsize_t n);
errno_t wcsncat_s(wchar_t pwcs[restrict .s1max], rsize_t s1max, const wchar_t s2[restrict .n], rsize_t n);