Document Number: N2300
Submitter: Martin Sebor
Submission Date: September 29, 2018
Subject: Aliasing By String Functions

Summary

Among the expressions listed in §6.5, paragraph 7, as allowed to access the value of an object is an lvalue of a character type, so that code like the following is well-defined:

	extern int n;
	unsigned char *p = (unsigned char*)&n;
	int i = *p;
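
As an illustration of the character-type access that §6.5, paragraph 7 permits, the following minimal sketch reads every byte of an object through an lvalue of type unsigned char. The helper name count_nonzero_bytes is hypothetical and not part of the standard or of this paper. The sketch assumes an implementation whose int has no padding bits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (an assumption for illustration): counts the nonzero
   bytes in the object representation of any object by reading it through
   an lvalue of type unsigned char. */
static size_t count_nonzero_bytes(const void *obj, size_t size)
{
    assert(obj != NULL);
    const unsigned char *p = obj;   /* character-type view of the object */
    size_t count = 0;
    for (size_t i = 0; i < size; ++i)
        if (p[i] != 0)
            ++count;
    return count;
}
```

Because the size of the object is passed explicitly, the loop never reads past the end of the object, whatever its type.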

In §7.24.1 String function conventions the standard further states that:

–1–The header <string.h> declares one type and several functions, and defines one macro useful for manipulating arrays of character type and other objects treated as arrays of character type. …

In particular, the text doesn't mention any constraints on the other objects treated as arrays of character type.

Furthermore, string functions specified in §7.24 such as strlen and strcpy are described as operating on strings. For instance, strlen is described in §7.24.6.3 as follows:

Synopsis
    #include <string.h>
    size_t strlen(const char *s);
Description
The strlen function computes the length of the string pointed to by s.

And the strcpy function is described like so:

Synopsis
    #include <string.h>
    char *strcpy(char * restrict s1,
                 const char * restrict s2);
Description
The strcpy function copies the string pointed to by s2 (including the terminating null character) into the array pointed to by s1. …

With the exception of the memcpy, memmove, memcmp, memchr, and memset functions that take void* arguments, all other string functions that take char* arguments are also described as operating on strings.

Finally, the term string is defined in §7.1.1 Definitions of terms as follows:

A string is a contiguous sequence of characters terminated by and including the first null character. The term multibyte string is sometimes used instead to emphasize special processing given to multibyte characters contained in the string or to avoid confusion with a wide string. A pointer to a string is a pointer to its initial (lowest addressed) character. The length of a string is the number of bytes preceding the null character and the value of a string is the sequence of the values of the contained characters, in order.

The specification and definitions above give rise to a number of questions illustrated by the following examples.

Is the following variation on the example above also intended to be well-defined?

	int n = 1;
	unsigned char *p = (unsigned char*)&n;
	int i = strlen (p);

In particular, since it contains at least one null character (byte), is the array of characters (bytes) that constitute the representation of the value of n in the example a string?

Note that if the answer is yes then in the following function the result of the first call to strlen cannot be substituted for the second call because the pointer s could store the address of n.

	int n = 1;
	size_t f (const char *s)
	{
	  n = strlen (s);
	  //  code that doesn't modify *s
	  return strlen (s);
	}
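
The effect of such aliasing can be sketched without relying on the behavior in question. In the code below, bounded_len is a hypothetical stand-in for strlen that takes an explicit size so it never reads outside the object; f_bounded mirrors f from the example, and lengths_differ_when_aliased passes it the address of n. None of these names come from the standard. The sketch assumes sizeof (int) is greater than 1 and that int has no padding bits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int n;   /* stands in for the global n of the example */

/* Hypothetical stand-in for strlen: stops at the first zero byte or after
   size bytes, whichever comes first. */
static size_t bounded_len(const char *s, size_t size)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t len = 0;
    while (len < size && p[len] != 0)
        ++len;
    return len;
}

/* Mirrors f from the example: stores the first length in n, then
   recomputes it.  *first receives the result of the first computation. */
static size_t f_bounded(const char *s, size_t size, size_t *first)
{
    *first = bounded_len(s, size);
    n = (int)*first;             /* modifies the bytes at s if s aliases n */
    return bounded_len(s, size);
}

/* Calls f_bounded with s pointing at n itself; returns nonzero when the
   first length covers all of n and the second length differs from it. */
static int lengths_differ_when_aliased(void)
{
    size_t first;
    memset(&n, 1, sizeof n);     /* make every byte of n nonzero */
    size_t second = f_bounded((const char *)&n, sizeof n, &first);
    return first == sizeof n && second != first;
}
```

When s points to an ordinary array of char that is distinct from n, the two lengths are equal and the second computation is redundant; when s points to n, the store to n changes the bytes being scanned, so the first result cannot be reused.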

Furthermore, if the strlen examples above are well-defined, is the following call to the strcpy function also intended to be?

	struct Data {
	  int i;
	  void (*pf)(void);
	};

	void f (struct Data *p)
	{
	  strcpy (p, "some text");
	}

Note that the strcpy call partially overwrites the function pointer member of struct Data. Such instances of data corruption have been linked to security exploits.
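
How far such a copy reaches into the function pointer member depends on the layout of the structure. The following sketch computes it from offsetof; the helper names bytes_clobbered and pf_bytes_clobbered are hypothetical, and the result is specific to the implementation's sizes and padding.

```c
#include <assert.h>
#include <stddef.h>

struct Data {
    int i;
    void (*pf)(void);
};

/* Hypothetical helper: number of bytes of a member at offset member_off
   with size member_size that are overwritten when nbytes bytes are written
   starting at the beginning of the enclosing object. */
static size_t bytes_clobbered(size_t member_off, size_t member_size,
                              size_t nbytes)
{
    if (nbytes <= member_off)
        return 0;                       /* the write stops before the member */
    size_t reach = nbytes - member_off; /* bytes written at or past the member */
    return reach < member_size ? reach : member_size;
}

/* Bytes of pf overwritten by copying "some text" with its terminating null
   character to the start of a struct Data. */
static size_t pf_bytes_clobbered(void)
{
    assert(sizeof "some text" == 10);   /* 9 characters plus the null */
    return bytes_clobbered(offsetof(struct Data, pf),
                           sizeof(void (*)(void)),
                           sizeof "some text");
}
```

On an implementation with a 4-byte int and an 8-byte function pointer aligned to 8 bytes, pf would sit at offset 8 and the 10-byte copy would overwrite its first 2 bytes.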

In contrast to the functions defined in the <string.h> header, the %s directive to the snprintf function specified in §7.21 Input/output <stdio.h> takes an argument constrained as follows.

If no l length modifier is present, the argument shall be a pointer to the initial element of an array of character type. …

This is a much stronger requirement than on the string functions. It makes it clear that a valid %s argument cannot point to an object of some other (non-character) type. As a result, unlike in the analogous strlen example above, in the code below the second snprintf call can be replaced by the result of the first.

	extern int n;
	int g (const char *s)
	{
	  n = snprintf (0, 0, "%s", s);
	  // ...
	  return snprintf (0, 0, "%s", s);
	}

The implication of the above is that, against intent and intuition, calling snprintf can (at least in some cases) be a faster way to compute the length of a string than calling strlen.
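
For reference, computing a length with snprintf can be sketched as follows. The helper names len_via_snprintf and agrees_with_strlen are hypothetical. The sketch relies only on the specified behavior that snprintf, given a null buffer and a size of zero, writes nothing and returns the number of characters the output would have contained.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: length of the string s as reported by snprintf.
   Returns a negative value if snprintf reports an error. */
static int len_via_snprintf(const char *s)
{
    assert(s != NULL);
    return snprintf(NULL, 0, "%s", s);
}

/* Returns nonzero when snprintf and strlen report the same length. */
static int agrees_with_strlen(const char *s)
{
    int len = len_via_snprintf(s);
    return len >= 0 && (size_t)len == strlen(s);
}
```

For arguments that are arrays of char holding a string, the two functions are expected to agree; the difference discussed above lies only in which argument objects each specification permits.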

Furthermore, while conforming compilers can detect, diagnose, and even prevent some past-the-end accesses to subobjects by the printf family of functions caused by arguments to the %s directive that aren't properly nul-terminated strings, the same strategy could not be employed for the <string.h> functions if such past-the-end subobject accesses were meant to be valid.

Proposed Resolution

It is important for string manipulation to be efficient. Allowing string functions like strlen or strcpy to operate on the representation of objects of any type makes them less than optimal in the common case (when their arguments are arrays of type char). It is also a highly unlikely use case to call a string function like strlen to operate on an array of type other than character.

By the same token, string functions such as strcpy have been linked to successfully exploited vulnerabilities due to their susceptibility to buffer overflow. It is, therefore, also important to judiciously constrain their accesses to reduce such incidents. To make that possible, we propose to explicitly require arguments to string handling functions to be arrays of char analogously to %s arguments, and not objects of other types.

To that effect, we propose to make the following changes. In §7.1.1 Definitions of terms make changes as indicated below:

–1–A string is a contiguous sequence of characters {+stored in an object or array of character type and+} terminated by and including the first null character. The term multibyte string is sometimes used instead to emphasize special processing given to multibyte characters contained in the string or to avoid confusion with a wide string. A pointer to a string is a pointer to its initial (lowest addressed) character. The length of a string is the number of {+characters+} [-bytes-] preceding the null character and the value of a string is the sequence of the values of the contained characters, in order.

In §7.24.1 String function conventions make changes as indicated below:

–1–The header <string.h> declares one type and several functions, and defines one macro useful for manipulating strings and arrays of character type and other objects treated as arrays of type char. 307) An argument to a qualified or unqualified char* parameter of a function declared in the header shall point to an object of character type. An argument to a qualified or unqualified void* parameter may point to an object of any type.
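
Under wording along these lines, code that needs to scan the bytes of an object of arbitrary type would use a function with a void* parameter instead of strlen. A minimal sketch, assuming a hypothetical helper named first_zero_byte that is not part of the proposal:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of the first zero byte among the size bytes
   of the object at obj, or size if there is none.  memchr takes a void*
   argument, so obj may point to an object of any type. */
static size_t first_zero_byte(const void *obj, size_t size)
{
    assert(obj != NULL);
    const unsigned char *start = obj;
    const unsigned char *hit = memchr(obj, 0, size);
    return hit ? (size_t)(hit - start) : size;
}
```

The explicit size bounds the search to the object itself, which a call to strlen cannot do.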

Also Consider

In addition, as an independent improvement that makes a crisp distinction between strings and arrays of character type manipulated by string functions like strcpy, and objects of any type manipulated by the "raw memory" functions like memcpy, we suggest revisiting DR 446 and considering the changes proposed there. The changes are duplicated below for reference (the memset changes are missing from DR 446).

Change §7.24.2.1, paragraph 2 as indicated below:

–2–The memcpy function copies n {+bytes+} [-characters-] from the object pointed to by s2 into the object pointed to by s1.

Change §7.24.2.2, paragraph 2 as indicated below:

–2–The memmove function copies n {+bytes+} [-characters-] from the object pointed to by s2 into the object pointed to by s1. Copying takes place as if the n {+bytes+} [-characters-] from the object pointed to by s2 are first copied into a temporary array of n {+bytes+} [-characters-] that does not overlap the objects pointed to by s1 and s2, and then the n characters from the temporary array are copied into the object pointed to by s1.

Change §7.24.4.1, paragraph 2 as indicated below:

–2–The memcmp function compares the first n {+bytes+} [-characters-] of the object pointed to by s1 to the first n {+bytes+} [-characters-] of the object pointed to by s2.

Change §7.24.5.1, paragraphs 2 and 3 as indicated below:

Description

–2–The memchr function locates the first occurrence of c (converted to an unsigned char) in the initial n {+bytes+} [-characters-] (each interpreted as unsigned char) of the object pointed to by s. The implementation shall behave as if it reads the {+bytes+} [-characters-] sequentially and stops as soon as a matching {+byte+} [-character-] is found.

Returns

–3–The memchr function returns a pointer to the located {+byte+} [-character-], or a null pointer if the {+byte+} [-character-] does not occur in the first n bytes of the object.

And finally, change §7.24.6.1, paragraph 2 as indicated below:

–2–The memset function copies the value of c (converted to an unsigned char) into each of the first n {+bytes+} [-characters-] of the object pointed to by s.