New string literal lexeme

Version 1
WG14/N1152

Ivan A. Kosarev
Unicals Group, Russia

15 October, 2005

Abstract

This paper proposes an additional string literal lexeme that resolves a number of issues with the use of string literals in the preprocessor.

Conformance

No valid C99 program will be affected by this proposal.

The Problem

It appears that string literals were originally designed solely to serve as a kind of primary expression. Accordingly, the Standard imposes corresponding requirements on these lexemes:

6.4.5 #3 ("String literals"):

[#3] The same considerations apply to each element of the sequence in a character string literal or a wide string literal as if it were in an integer character constant or a wide character constant, except that the single-quote ' is representable either by itself or by the escape sequence \', but the double-quote " shall be represented by the escape sequence \".

6.4.4.4 ("Character constants"):

[#3] The single-quote ', the double-quote ", the question-mark ?, the backslash \, and arbitrary integer values are representable according to the following table of escape sequences...

[#5] The octal digits that follow the backslash in an octal escape sequence are taken to be part of the construction of a single character for an integer character constant or of a single wide character for a wide character constant...

[#6] The hexadecimal digits that follow the backslash and the letter x in a hexadecimal escape sequence are taken to be part of the construction of a single character for an integer character constant or of a single wide character for a wide character constant...

[#8] In addition, characters not in the basic character set are representable by universal character names and certain nongraphic characters are representable by escape sequences consisting of the backslash \ followed by a lowercase letter: \a, \b, \f, \n, \r, \t, and \v.64)

64) The semantics of these characters were discussed in 5.2.2. If any other character follows a backslash, the result is not a token and a diagnostic is required. See ''future language directions'' (6.11.4).

That is, if a string literal that is part of a program violates these requirements, a diagnostic message shall be produced and the whole program is treated as invalid.

At the same time, the Standard allows string literals to be used in some completely different contexts, and exactly the same requirements apply to them there, which causes artificial problems. The examples below demonstrate various aspects of this point.

Example #1

6.10.9 #2 ("Pragma operator"):

[#2] EXAMPLE A directive of the form:

    #pragma listing on "..\listing.dir"

can also be expressed as:

    _Pragma ( "listing on \"..\\listing.dir\"" )

The problem is that both lines violate the passages of the Standard cited above, so they shall not appear in a conforming program (see N1123, DR 324), which seems too restrictive.

Example #2

#include "\unicals.h"             // (1)
#define HEADER_NAME "\unicals.h"  // (2)
#include HEADER_NAME              // (3)

The problem is that while directive (1) may be accepted by a conforming implementation, directive (2) shall not be, so the current wording of the Standard makes directives (1) and (3) non-equivalent.

Example #3

#include "\oops"      // (1)
#define f()
#include "\oops" f()  // (2)

Similarly, while directive (1) may be accepted by a conforming implementation, directive (2) shall not be. In contrast to the previous example, the problem token in directive (2) (which is "\oops", an instance of the string-literal grammar entity, not a quoted q-char-sequence) is neither part of a macro replacement list nor a misplaced token. No further processing is needed, intended, or even possible for the character sequence that forms the token, yet it still constitutes invalid code because of the macro invocation that follows it.

Example #4

#include "\x.h"  // (1) may be accepted
#line 1 "\x.h"   // (2) error: violates a syntax clause

Since the #line directive is defined in terms of s-characters, and not q-characters, directive (2) is not allowed by a conforming implementation. Note that directive (1) may still be accepted.

Example #5

#if 0
Whether these lines will actually be skipped or not,
both implementers and users should always care about "\escape \sequences".
#endif

As with Example #1, forbidding such use of preprocessing facilities seems too restrictive.

The Solution

The solution is to introduce a preprocessing string literal lexeme (pp-string-literal), analogous to the preprocessing number (pp-number): it serves the same purpose and has similar semantics.

Changes to the Standard

In clause 5.1.1.2 #1 modify the sentence of phase 5 to read:

Each source character set member and escape sequence in character constants and preprocessing string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

In clause 5.1.1.2 #1 modify the sentence of phase 6 to read:

Adjacent preprocessing string literal tokens are concatenated.
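The ordering of phases 5 and 6 matters for the result of concatenation. The following minimal sketch uses ordinary C99 string literals to show that escape sequences are converted before adjacent literals are joined; the names two_chars and check_concatenation are illustrative only and are not part of the proposal.

```c
#include <assert.h>

/* "\x12" "3" is converted in phase 5 and only then concatenated in
   phase 6, so the result holds the two characters '\x12' and '3'
   (plus the terminating null), not the single character '\x123'. */
const char two_chars[] = "\x12" "3";

void check_concatenation(void)
{
    assert(sizeof two_chars == 3);      /* two characters + '\0' */
    assert(two_chars[0] == 0x12);
    assert(two_chars[1] == '3');
    assert(sizeof("abc" "def") == 7);   /* six characters + '\0' */
}
```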

In clause 5.2.1 #3 modify the last sentence to read:

If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a preprocessing string literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior is undefined.

Modify clause 5.2.1.2 #2 to read:

In clause 5.2.4.1 #1 modify the 16th bullet to read:

4095 characters in a character string literal, preprocessing character string literal, wide string literal, or preprocessing wide string literal (after concatenation)

In clause 6.4 #1, modify the preprocessing-token syntax to read:

preprocessing-token:
        header-name
        identifier
        pp-number
        character-constant
        string-literal
        pp-string-literal
        punctuator
        each non-white-space character that cannot be one of the above

In clause 6.4 #3, modify the 2nd sentence to read:

The categories of preprocessing tokens are: header names, identifiers, preprocessing numbers, character constants, string literals, preprocessing string literals, punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories.

In clause 6.4 #3, modify the last sentence to read:

White space may appear within a preprocessing token only as part of a header name or between the quotation characters in a character constant or preprocessing string literal.

In clause 6.4 #4, modify the last sentence to read:

There is one exception to this rule: a header name preprocessing token is only recognized within a #include preprocessing directive, and within such a directive, a sequence of characters that could be either a header name or a preprocessing string literal is recognized as the former.

In clause 6.4.9 #1, modify the first sentence to read:

Except within a character constant, a preprocessing string literal, or a comment, the characters /* introduce a comment.

In clause 6.4.9 #2, modify the first sentence to read:

Except within a character constant, a preprocessing string literal, or a comment, the characters // introduce a comment that includes all multibyte characters up to, but not including, the next new-line character.

Add a new clause 6.4.10, entitled "Preprocessing string literals":

Syntax

[#1] pp-string-literal:
        " pp-s-char-sequence(opt) "
        L" pp-s-char-sequence(opt) "

pp-s-char-sequence:
        pp-s-char
        pp-s-char-sequence pp-s-char

pp-s-char:
        any member of the source character set except the double-quote ", new-line character or backslash \ followed by a new-line character

Description

[#2] A preprocessing character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes. A preprocessing wide string literal is the same, except prefixed by the letter L.

[#3] Preprocessing string literal tokens lexically include all string literal tokens.

Semantics

[#4] A preprocessing string literal does not have a type or a value; it acquires both after a successful conversion (as part of translation phase 7) to a string literal token.
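The grammar above can be illustrated with a small hand-written scanner. This is only a sketch under stated assumptions: the function name scan_pp_string_literal is hypothetical, line splicing (phase 2) is assumed to have been done already, and the sketch assumes that a double-quote preceded by a backslash does not close the literal. The last assumption is what paragraph #3 appears to require, but the pp-s-char production does not state it explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length of the pp-string-literal spelled at the start of s,
   or 0 if s does not start with one. No escape sequence is validated:
   any character other than a new-line is accepted between the quotes. */
size_t scan_pp_string_literal(const char *s)
{
    size_t i = 0;

    assert(s != NULL);
    if (s[i] == 'L')                  /* optional wide prefix */
        i++;
    if (s[i] != '"')
        return 0;
    i++;
    while (s[i] != '\0' && s[i] != '\n') {
        if (s[i] == '"')
            return i + 1;             /* closing double-quote found */
        /* ASSUMPTION (not in the grammar text): a backslash hides the
           next character, so that \" does not terminate the token. */
        if (s[i] == '\\' && s[i + 1] != '\0' && s[i + 1] != '\n')
            i++;
        i++;
    }
    return 0;                         /* unterminated on this line */
}
```

With this sketch, the spelling "\unicals.h" from Example #2 is accepted as a single token, although it is not a valid string literal.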

Modify clause 6.10.3.2 #2 to read:

If, in the replacement list, a parameter is immediately preceded by a # preprocessing token, both are replaced by a single preprocessing character string literal token that contains the spelling of the preprocessing token sequence for the corresponding argument. Each occurrence of white space between the argument's preprocessing tokens becomes a single space character in the preprocessing character string literal. White space before the first preprocessing token and after the last preprocessing token composing the argument is deleted. Otherwise, the original spelling of each preprocessing token in the argument is retained in the preprocessing character string literal, except for special handling for producing the spelling of preprocessing string literals and character constants: a \ character is inserted before each " and \ character of a character constant or preprocessing string literal (including the delimiting " characters), except that it is implementation-defined whether a \ character is inserted before the \ character beginning a universal character name. If the replacement that results is not a valid preprocessing character string literal, the behavior is undefined. The preprocessing character string literal corresponding to an empty argument is "". The order of evaluation of # and ## operators is unspecified.
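For arguments that are already valid in C99, the stringizing rule can be observed with an ordinary macro. The sketch below is illustrative only; the macro STR and the function names are not part of the proposal, and the example uses only string literals that are valid today.

```c
#include <assert.h>
#include <string.h>

#define STR(x) #x

/* The argument "a\n" is spelled with four special characters: two
   delimiting double-quotes and one backslash. A backslash is inserted
   before each " and \, so the result spells  "\"a\\n\""  in source. */
const char *stringized_literal(void) { return STR("a\n"); }

/* White space between tokens collapses to one space; leading and
   trailing white space is deleted. */
const char *stringized_expr(void) { return STR(   1   +   2   ); }

void check_stringize(void)
{
    assert(strcmp(stringized_literal(), "\"a\\n\"") == 0);
    assert(strcmp(stringized_expr(), "1 + 2") == 0);
    assert(strcmp(STR(), "") == 0);   /* empty argument yields "" */
}
```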

Modify clause 6.10.4 #1 to read:

The preprocessing string literal of a #line directive, if present, shall be a preprocessing character string literal.

Modify clause 6.10.4 #4 to read:

A preprocessing directive of the form

# line digit-sequence "pp-s-char-sequence(opt)" new-line

sets the presumed line number similarly and changes the presumed name of the source file to be the contents of the preprocessing character string literal.
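The effect of this form of the directive can be observed through __LINE__ and __FILE__. The following sketch uses a file name that is already a valid C99 string literal; the name "renamed.c" and the function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <string.h>

/* The source line that follows the directive gets presumed line number
   100, and the presumed file name becomes renamed.c. */
#line 100 "renamed.c"
int presumed_line(void) { return __LINE__; }           /* on line 100 */
const char *presumed_file(void) { return __FILE__; }   /* on line 101 */

void check_line_directive(void)
{
    assert(presumed_line() == 100);
    assert(strcmp(presumed_file(), "renamed.c") == 0);
}
```

Under the proposed wording, a name such as "\x.h" from Example #4 would also be admissible in the directive, because only the pp-string-literal syntax would have to be satisfied.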

In clause 6.10.8 #1, modify the sentence of macro __DATE__ to read:

The date of translation of the preprocessing translation unit: a preprocessing character string literal of the form "Mmm dd yyyy", where the names of the months are the same as those generated by the asctime function, and the first character of dd is a space character if the value is less than 10.

In clause 6.10.8 #1, modify the sentence of macro __FILE__ to read:

The presumed name of the current source file (a preprocessing character string literal).

In clause 6.10.8 #1, modify the sentence of macro __TIME__ to read:

The time of translation of the preprocessing translation unit: a preprocessing character string literal of the form "hh:mm:ss" as in the time generated by the asctime function. If the time of translation is not available, an implementation-defined valid time shall be supplied.
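The shapes of the __DATE__ and __TIME__ replacement lists can be checked with a few lines of C. This is a minimal sketch; the helper names has_date_shape and has_time_shape are hypothetical, and only the lengths and separator positions stated in the wording above are tested.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if s has the layout "Mmm dd yyyy": 11 characters with
   space characters at positions 3 and 6. */
int has_date_shape(const char *s)
{
    return strlen(s) == 11 && s[3] == ' ' && s[6] == ' ';
}

/* Returns 1 if s has the layout "hh:mm:ss": 8 characters with colons
   at positions 2 and 5. */
int has_time_shape(const char *s)
{
    return strlen(s) == 8 && s[2] == ':' && s[5] == ':';
}

void check_predefined_macros(void)
{
    assert(has_date_shape(__DATE__));
    assert(has_time_shape(__TIME__));
}
```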

Modify clause 6.10.8 #3 to read:

The replacement lists of the predefined macros (except for __FILE__ and __LINE__) remain constant throughout the translation unit.

Modify clause 6.10.9 #1 to read:

A unary operator expression of the form:

_Pragma ( pp-string-literal )

is processed as follows: The preprocessing string literal is destringized by deleting the L prefix, if present, deleting the leading and trailing double-quotes, replacing each escape sequence \" by a double-quote, and replacing each escape sequence \\ by a single backslash. The resulting sequence of characters is processed through translation phase 3 to produce preprocessing tokens that are executed as if they were the pp-tokens in a pragma directive. The original four preprocessing tokens in the unary operator expression are removed.
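The destringizing step described above can be sketched as a small C function. The function name destringize and its interface are assumptions made for illustration; the sketch performs only the four operations listed in the paragraph and leaves every other character, including other backslashes, untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Destringizes the spelling src of a (preprocessing) string literal
   into dst, which must have room for strlen(src) + 1 characters.
   Returns 0 on success, -1 if src is not enclosed in double-quotes. */
int destringize(const char *src, char *dst)
{
    size_t len, i, out = 0;

    assert(src != NULL && dst != NULL);
    if (*src == 'L')                      /* delete the L prefix */
        src++;
    len = strlen(src);
    if (len < 2 || src[0] != '"' || src[len - 1] != '"')
        return -1;
    for (i = 1; i + 1 < len; i++) {       /* skip the delimiting quotes */
        if (src[i] == '\\' && i + 2 < len &&
            (src[i + 1] == '"' || src[i + 1] == '\\'))
            i++;                          /* \" becomes ", \\ becomes \ */
        dst[out++] = src[i];
    }
    dst[out] = '\0';
    return 0;
}
```

Applied to the spelling of the _Pragma operand in Example #1, the function yields the text listing on "..\listing.dir", which is the character sequence that is then processed through translation phase 3.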

Conclusion

Although the proposal requires many changes to the formal text of the Standard, the essential changes amount to just a few lines. The benefits are that programs that use preprocessor facilities can be simpler and more portable, and the Standard becomes clearer and more neutral with respect to the host platform.