Doc Number: X3J16/93-0182
            WG21/N0389
Date:       24 November 1993
Project:    Programming Language C++
Reply to:   Sean A. Corfield
            Programming Research
            Sean.Corfield@prl0.co.uk

Making String Literals 'const'

Sean A. Corfield

1. INTRODUCTION

Currently, attempting to modify a string literal results in undefined
behaviour (2.9.4, para 1). Many compilers can arrange for string
literals to be placed in 'read-only' memory which means that they are
effectively 'const' - an attempt to modify a string literal will cause
a run-time fault (probably).

Since attempting to modify a string literal is undefined, it would be
beneficial to programmers to trap such attempts as early as possible.
If string literals were 'const' this would allow the compiler to flag
code that could attempt such modification, e.g.,

    void doSomethingTo(char*);

    doSomethingTo("hello");  // useful if compiler could flag this

However, this on its own would break a lot of innocent, if somewhat
inadvisable code, e.g.,

    char* msg = "hello";
    cout << msg << '\n';

This paper proposes that string literals be made 'const' and that
certain implicit conversions are introduced to allow 'const'-unaware
code to continue to compile - the conversions are immediately
deprecated.

2. DISCUSSION

String literals are not modifiable, but only because the WP says, like
ISO C, that attempting to modify them is undefined. This is a special
rule that exists outside the type system.
Because a character string literal has type 'char[]', we have an
inconsistency in the following:

    char* msg1 = "hello";
    char  msg2[] = "hello";
    char* msg3 = msg2;

    msg1[0] = 'H';  // undefined
    msg2[0] = 'H';  // OK
    msg3[0] = 'H';  // OK

The C library strxxx routines have been "fixed" to be 'const'-correct
which would seem to pave the way for "fixing" string literals so that
their type is what everyone already has to treat it as, e.g.,

    char const* msg1a = "hello";  // 'correct' initialisation
    // msg2 is OK, it really is non-'const' and has been initialised
    // with a sequence of characters
    // msg3 is OK, it is non-'const' and points at non-'const' data

    msg1a[0] = 'H';  // ill-formed: attempt to modify 'const'

Here, the compiler is able to detect the problem. Modifying a string
literal through a non-'const' pointer is (and would remain) undefined,
e.g.,

    char* msg1b = const_cast<char*>(msg1a);
    msg1b[0] = 'H';  // undefined

Simply changing the type of string literals to 'const char[]' (and
'const wchar_t[]') would break a lot of code, e.g.,

    char* msg = "hello";  // ill-formed: char* = char const*

That means that this C++ standard cannot make the change directly.
However, there seem to be only two situations where code will be
broken:

* assigning a string literal to a non-'const' pointer
* initialising a non-'const' pointer with a string literal

Introducing implicit conversions for these cases would allow any
existing code to continue to compile, with the same caveat that
attempting to modify the (now 'const') string literal through the
non-'const' pointer would be undefined, but now by the general rule
about modifying 'const' objects through non-'const' access.
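
The 'const'-correct forms discussed above can be put together in one
small, compilable sketch. It is illustrative only: the function name
constCorrectDemo is hypothetical and not part of the proposal. The code
should compile whether string literals have type 'char[]' or the
proposed 'const char[]', since it only ever reads the literal through a
'char const*' and only modifies the array copy.

```cpp
#include <cassert>   // for the usage shown after this block
#include <cstring>

// Hypothetical helper (not from the paper): returns true if the array
// copy was modified as expected while the literal was only read.
bool constCorrectDemo()
{
    char const* msg1a = "hello";  // literal is only read through msg1a
    char  msg2[] = "hello";       // array initialised with a copy of the characters
    char* msg3 = msg2;            // non-'const' pointer to non-'const' data

    msg2[0] = 'H';                // OK: modifies the array
    msg3[1] = 'E';                // OK: modifies the same array via the pointer

    return std::strcmp(msg2, "HEllo") == 0
        && std::strcmp(msg1a, "hello") == 0;
}
```

A caller might simply write 'assert(constCorrectDemo());'. Nothing in
the sketch needs the deprecated conversion to 'char*'.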
Under this scheme, we would have the following situation:

    void f(char*);

    char* msg = "Hello";  // implicit, deprecated, conversion to 'char*'
    msg = "hello";        // implicit, deprecated, conversion to 'char*'
    f("hello");           // implicit, deprecated, conversion to 'char*'

    if (strcmp(msg, "hello") == 0)
        // msg is implicitly converted to 'char const *'
        // "hello" now has exact match, would previously
        // have been implicitly converted

The first three cases shown above are the only cases that would need
the implicit conversion.

This does introduce a problem with overloaded functions:

    char  index(char const* s, int i) { return (s[i]); }
    char& index(char* s, int i) { return (s[i]); }

Currently:

    char& rc = index("abc", 1);

invokes the second form (with the potential undefined behaviour if you
assign to the result). If string literals were 'const', then this
would (silently) change to invoke the first form, but at least you
would not be able to assign to the result. I believe this is still a
gain in terms of type safety - it removes the possibility that
undefined behaviour could occur at run-time by making the code
'const'-correct.

3. SUMMARY

Changing the type of string literals to 'const char[]' ('const
wchar_t[]') and introducing deprecated implicit conversions for
assignment and initialisation will provide a path to safer code.
Programmers will be encouraged to treat string literals in a
'const'-correct manner. Calls involving overloaded functions and
string literals will become 'const'-correct. The next C++ committee
would be able to complete the transition if it is appropriate.

Of those people who participated in the e-mail discussions, everyone
(who expressed a preference) felt that string literals should be
'const', but no-one felt this standard should make the change. If we
do not make the change now, the next committee will be in the same
position and no progress will be possible.

4. PROPOSAL

* String literals have type 'const char[]', wide string literals have
  type 'const wchar_t[]'.

* String literals are implicitly converted to 'char*' in the following
  situations:
    * initialisation of 'char*' object (including passing arguments
      in function calls)
    * assignment to 'char*' variable
  (similarly for implicit conversion of wide string literals to
  'wchar_t*')

* Such implicit conversions are deprecated

5. WORKING PAPER CHANGES

2.9.4 String Literals, para 1, line 2:
    Change 'A string has type "array of char"'
    to 'A string has type "array of const char"'
...para 4, line 2:
    Change 'A wide-character string is of type "array of wchar_t"'
    to 'A wide-character string is of type "array of const wchar_t"'

5.17 Assignment Operators
    Add after para 3
    'A character string literal can also be assigned to a pointer of
    type char*. An implicit conversion, from const char* to char*,
    takes place. A wide-character string literal can also be assigned
    to a pointer of type wchar_t*. An implicit conversion, from const
    wchar_t* to wchar_t*, takes place. Both of these implicit
    conversions are deprecated.'

8.4 Initializers, para 3:
    After 'the reverse initialization is not allowed.' add
    'Additionally, a pointer of type char* may be initialized with a
    character string literal, and a pointer of type wchar_t* may be
    initialized with a wide-character string literal. In both such
    initializations, an implicit conversion takes place. This implicit
    conversion is deprecated.'

6. CREDITS

This issue was discussed on the core and extensions reflectors and in
private e-mail during June 1993 (c++std-core-2288..2320,
c++std-ext-1325..1344). Richard Minner raised the issue in -core, John
Skaller suggested the deprecated implicit conversions. Various people
contributed to both the public and private e-mail, including Chuck
Allison, Mike Ball, Bill Gibbons, James Kanze, Andrew Koenig, Doug
Landauer, Jerry Schwarz, Patrick Smith and Bjarne Stroustrup.
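
As an illustration of the overloading point made in the discussion, the
following sketch reports which form of the overloaded function a string
literal selects. It assumes a compiler that already gives string
literals the proposed type 'const char[]'; under that assumption the
literal call should select the 'char const*' form. The function is
renamed indexAt here (an assumption, to avoid clashing with a library
function called 'index' on some platforms), and the isModifiable probes
and the two reporting functions are hypothetical helpers, not part of
the proposal.

```cpp
#include <cassert>   // for the usage shown after this block

char  indexAt(char const* s, int i) { return s[i]; }  // first form
char& indexAt(char* s, int i)       { return s[i]; }  // second form

// Hypothetical probes: a 'char&' result binds to the first overload,
// a 'char' result (a temporary) can only bind to the second.
bool isModifiable(char&)       { return true; }
bool isModifiable(char const&) { return false; }

// Reports whether indexing a string literal yields a modifiable result,
// i.e. whether the 'char*' form of indexAt was selected.
bool literalGivesModifiableResult()
{
    return isModifiable(indexAt("abc", 1));
}

// A genuinely writable array still selects the 'char*' form.
bool arrayGivesModifiableResult()
{
    char buf[] = "abc";
    indexAt(buf, 1) = 'X';  // OK: buf is non-'const'
    return buf[1] == 'X' && isModifiable(indexAt(buf, 1));
}
```

With 'const' string literals, 'assert(!literalGivesModifiableResult());'
and 'assert(arrayGivesModifiableResult());' are both expected to hold:
the literal can no longer reach the form that returns 'char&', while
writable arrays are unaffected.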