JTC1/SC22/WG14 N770


                                N770 J11/97-134
                    Trigraphs and Universal Character Names
                                  Randy Meyers
                                  23 Sept 1997


     Both C and C++ have adopted the same proposal for handling the full
     range of natural language characters in source programs.  Basically,
     during phase 1 of translation any character not in the basic source
     character set is mapped into its Universal Character Name (UCN).
     After phase 1, C/C++ programs are represented only using basic source
     characters and UCNs.
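
     As a rough illustration of the UCN half of phase 1, the following C
     sketch spells each input character either as itself or as a UCN.
     It rests on several assumptions that are not part of either Working
     Paper:  the input has already been decoded into ISO 10646 code
     points, the host character set is ASCII-compatible, and the names
     is_basic_source_char and map_to_ucn are invented for this example.
     It is not meant to describe how any real translator is built.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Nonzero if code point cp is in the basic source character set.
   Assumes an ASCII-compatible host, so code points below 128 equal the
   host's char values. */
int is_basic_source_char(unsigned long cp)
{
    static const char graphic[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";

    if (cp == ' ' || cp == '\t' || cp == '\v' || cp == '\f' || cp == '\n')
        return 1;
    /* cp != 0 keeps strchr from matching the string terminator. */
    return cp != 0 && cp < 128 && strchr(graphic, (int)cp) != NULL;
}

/* Write the phase 1 spelling of n code points into out (capacity cap,
   including the terminating null).  Basic characters are copied; all
   others become a backslash-u or backslash-U name.  Returns the length
   written, or (size_t)-1 if out is too small. */
size_t map_to_ucn(const unsigned long *cps, size_t n, char *out, size_t cap)
{
    size_t i, used = 0;

    assert(cps != NULL && out != NULL);
    if (cap == 0)
        return (size_t)-1;
    out[0] = '\0';

    for (i = 0; i < n; i++) {
        char buf[11];   /* backslash, U, eight hex digits, null */
        int len;

        assert(cps[i] <= 0x7FFFFFFFUL);
        if (is_basic_source_char(cps[i])) {
            buf[0] = (char)cps[i];
            buf[1] = '\0';
            len = 1;
        } else if (cps[i] <= 0xFFFFUL) {
            len = snprintf(buf, sizeof buf, "\\u%04lX", cps[i]);
        } else {
            len = snprintf(buf, sizeof buf, "\\U%08lX", cps[i]);
        }
        if (used + (size_t)len + 1 > cap)
            return (size_t)-1;
        memcpy(out + used, buf, (size_t)len + 1);
        used += (size_t)len;
    }
    return used;
}
```

     In this sketch, "$" is spelled as a UCN because it is not a member
     of the basic source character set, which is why it can serve as a
     flag byte in the example that follows.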

     Phase 1 of translation also handles another mapping:  it recognizes
     trigraphs and translates them into their single character
     representation.  This raises an ordering problem:  if the initial
     source character set is multibyte, do you recognize trigraphs before
     or after recognizing multibyte characters?

     Consider phase 1 input that looks like this:

             $??)

     where in the multibyte encoding, a byte containing the code for "$" is
     the first byte of a single multibyte character made from the byte
     containing the "$" and the byte that follows it.  Bytes containing the
     codes for "?" and ")" are treated as single byte characters unless
     immediately preceded by a special flag byte like "$".

     If you process trigraphs before decoding multibyte characters, you
     would recognize the trigraph for "]", and map the input into "$]",
     which would then be translated into a surprising multibyte character.
     The translator would interpret the source completely differently than
     any display hardware or text processing program.

     The alternative, of course, is to perform multibyte processing before
     trigraph recognition.  In that case, the source would be interpreted
     the same way that the programmer's editor probably displayed it:  a
     multibyte character followed by the characters "?" and ")".  This is
     clearly the most reasonable interpretation, and it also is the most
     defensible interpretation since phase 1 in the Working Paper talks
     about recognizing and mapping characters, and trigraph sequences are
     defined to be sequences of characters.  A byte stream before multibyte
     processing is not a sequence of characters, and you cannot find
     trigraph sequences in it until you turn it into characters by
     multibyte processing.
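
     The two orderings can be compared with a small C sketch.  It
     assumes a toy encoding in which "$" is the only flag byte, handles
     only the trigraph for "]", and uses invented names (src_char,
     trigraphs_first, multibyte_first, and so on).  It is offered only
     to make the ordering problem concrete, not as a model of a real
     translator.  The trigraph is never written out inside the code, so
     the sketch itself is unaffected by trigraph replacement.

```c
#include <assert.h>
#include <stddef.h>

/* Toy encoding, assumed for illustration only: '$' is a lead byte that
   joins with the byte after it to form one two-byte character; every
   other byte is a single-byte character. */
#define LEAD_BYTE '$'
#define MAX_SRC 64

/* One decoded source character: one or two bytes of the toy encoding. */
struct src_char {
    size_t len;
    char bytes[2];
};

/* Decode bytes into characters.  out needs room for n entries. */
size_t decode_multibyte(const char *in, size_t n, struct src_char *out)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (i < n) {
        out[o].bytes[0] = in[i];
        if (in[i] == LEAD_BYTE && i + 1 < n) {
            out[o].bytes[1] = in[i + 1];
            out[o].len = 2;
            i += 2;
        } else {
            out[o].bytes[1] = '\0';
            out[o].len = 1;
            i += 1;
        }
        o++;
    }
    return o;
}

/* Trigraph replacement on raw bytes: question mark, question mark,
   right parenthesis becomes ']', with no regard for lead bytes. */
size_t trigraphs_on_bytes(const char *in, size_t n, char *out)
{
    size_t i = 0, o = 0;

    while (i < n) {
        if (i + 2 < n && in[i] == '?' && in[i + 1] == '?'
            && in[i + 2] == ')') {
            out[o++] = ']';
            i += 3;
        } else {
            out[o++] = in[i++];
        }
    }
    return o;
}

/* Trigraph replacement on decoded characters: only three consecutive
   single-byte characters can form a trigraph. */
size_t trigraphs_on_chars(const struct src_char *in, size_t n,
                          struct src_char *out)
{
    size_t i = 0, o = 0;

    while (i < n) {
        if (i + 2 < n
            && in[i].len == 1 && in[i].bytes[0] == '?'
            && in[i + 1].len == 1 && in[i + 1].bytes[0] == '?'
            && in[i + 2].len == 1 && in[i + 2].bytes[0] == ')') {
            out[o].len = 1;
            out[o].bytes[0] = ']';
            out[o].bytes[1] = '\0';
            i += 3;
        } else {
            out[o] = in[i++];
        }
        o++;
    }
    return o;
}

/* Ordering 1: trigraphs are replaced before multibyte decoding. */
size_t trigraphs_first(const char *in, size_t n, struct src_char *out)
{
    char tmp[MAX_SRC];

    assert(n <= MAX_SRC);
    return decode_multibyte(tmp, trigraphs_on_bytes(in, n, tmp), out);
}

/* Ordering 2: multibyte decoding happens before trigraph replacement. */
size_t multibyte_first(const char *in, size_t n, struct src_char *out)
{
    struct src_char tmp[MAX_SRC];

    assert(n <= MAX_SRC);
    return trigraphs_on_chars(tmp, decode_multibyte(in, n, tmp), out);
}
```

     Given the four bytes "$", "?", "?", ")", trigraphs_first yields a
     single two-byte character whose second byte is "]", while
     multibyte_first yields three characters:  the two-byte character
     made from "$" and "?", followed by "?" and ")".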

     Unfortunately, the wording for Phase 1 (Subclause 2.1) in the
     Post-London Preview Edition of the C++ Working Paper is very easy to
     misread as requiring trigraph processing before multibyte processing:

      1.  Physical source file characters are mapped, in an
          implementation-defined manner, to the basic source character
          set (including new-line characters for end-of-line
          indicators) if necessary.  Trigraph sequences (2.3) are
          replaced by corresponding single-character internal
          representations.  Any source file character not in the basic
          source character set (2.2) is replaced by the
          universal-character-name that designates that character.
          (An implementation may use any internal encoding, so long as
          an actual extended character in the source file, and the
          same extended character expressed in the source file as a
          universal-character-name (i.e., using the \uXXXX notation), are
          handled equivalently.)

     (The wording in the C Working Paper is not yet available, but is
     expected to be similar.)

     Since the above wording discusses trigraph processing before UCN
     processing, it appears that trigraph processing happens first.

     A simple reordering of the paragraph seems sufficient to clear this
     problem up:

      1.  Physical source file characters are mapped, in an
          implementation-defined manner, to the basic source character
          set (including new-line characters for end-of-line
          indicators) if necessary.  Any source file character not in
          the basic source character set (2.2) is replaced by the
          universal-character-name that designates that character.
          (An implementation may use any internal encoding, so long as
          an actual extended character in the source file, and the
          same extended character expressed in the source file as a
          universal-character-name (i.e., using the \uXXXX notation), are
          handled equivalently.) Trigraph sequences (2.3) are replaced
          by corresponding single-character internal representations.

     If the committee wishes, the word "Then" could be inserted at the
     start of the last sentence to add more emphasis:

          Then, trigraph sequences (2.3) are replaced by corresponding
          single-character internal representations.

     Both the C and C++ Working Papers should reorder the paragraph for
     clarity and optionally add the word "Then".