SG16: Unicode meeting summaries 2019-10-09 through 2019-12-11
Summaries of SG16 meetings are maintained at
https://github.com/sg16-unicode/sg16-meetings.  This paper contains a
snapshot of select meeting summaries from that repository.
October 9th, 2019
Draft agenda:
  - P1880R0 - u8string, u16string, and u32string Don't Guarantee UTF Encoding
- P1879R0 - The u8 string literal prefix does not do what you think it does
- P1844R0: Enhancement of regex
Attendees:
  - Corentin Jabot
- David Wendt
- Henri Sivonen
- JeanHeyd Meneide
- Peter Bindels
- Peter Brett
- Tom Honermann
- Zach Laine
Meeting summary:
  - P1880R0 - u8string, u16string, and u32string Don't Guarantee UTF Encoding

      - https://github.com/tzlaine/small_wg1_papers/blob/master/P1880_uNstring_shall_be_utf_n_encoded.md
- Zach introduced:
        
          - The idea is that interfaces taking these string types expect that
              contents of these strings are well-formed UTF-8, UTF-16, UTF-32
              respectively; this requirement needs to be reflected in the
              standard.
- We should state a blanket requirement for these expectations.
- The paper proposes a 4th bullet to
              [res.on.arguments].
 
- PeterBr asked if the requirement should be for well-formed data.
- Zach replied that it should be.  LWG should confirm that.
- Henri asked what happens if an ill-formed code unit sequence is
          passed.  Is it undefined behavior or as-if the Unicode replacement
          character was present?
- Zach replied that the current wording makes it undefined
          behavior.
- PeterBr provided an example of why the behavior is undefined.
          Consider a string that ends with an incomplete code unit sequence; the
          implementation could run off the end of the buffer (see the editor's
          note at the end of this discussion for an illustrative sketch).
- Zach responded that, for std::basic_string types, the buffer
          overrun can be avoided, but in that case, the interface specification
          should state that behavior.  The proposed blanket wording is for the
          weakest interface requirements and can be strengthened by individual
          interfaces.
- Henri asked if that is useful as it seems like undefined behavior is
          a huge foot cannon; replacement character semantics would provide a
          safer interface.
- Zach responded that, if this is a foot gun, then so is
          std::vector operator[].  You must meet preconditions.
          Implementations can always constrain their handling if they want.
          The intent here is to enable the fast path.
- PeterBr added that it would add complexity to implement replacement
          character behavior; interfaces would not be able to use SIMD
          instructions if ill-formed strings must be handled.
- Zach repeated that the proposal just specifies the default behavior
          unless otherwise specified for an interface.
- Corentin opined that this seems almost editorial.
- Henri stated that, for char8_t, there are values that are
          never valid in well-formed UTF-8 text and asked what an individual
          char8_t means; it must be restricted to ASCII.
- Tom noted that this matches UTF-8 character literals; they can only
          specify ASCII values.
- Zach read the existing content in
          [res.on.arguments]
          in order to demonstrate similarity in existing requirements.
- Henri asked if this represents a requirement that is more difficult
          to satisfy than the existing requirements.  For example, in UTF-16,
          almost all code bases will allow unpaired surrogates.  Does this
          requirement make the standard library useless for their code
          bases?
- Zach stated that interfaces can specify their handling of unpaired
          surrogates.
- Henri asked again if this is a practical requirement.
- Tom responded that this is needed for our mantra of not leaving
          performance on the floor; we can't both check for ill-formed text and
          maximize performance.
- Zach added that ICU already does this for performance.  Within
          Boost.Text, Zach added interfaces for both unchecked and checked
          text.
- PeterBr opined that this paper is great and sorely needed.
- Tom and Corentin agreed.
- Henri asked Zach to expound on his statement that ICU already
          exhibits undefined behavior.
- Zach responded that, in ICU normalization code, assumptions are made
          when decoding UTF-8.  For example, unsafe unpacking of UTF-8 is
          performed.
- Henri asked if ICU does likewise for UTF-16 for unpaired
          surrogates.
- Zach responded that he thought so, but is not completely sure.
- Corentin expressed support for an NB comment to include this in
          C++20.
- Tom opined that it doesn't much matter if this makes C++20 as
          implementors will already do the right thing.
- Henri asked if this might introduce a backward compatibility issue in
          C++23 if added after C++20.
- Tom responded that the undefined behavior is effectively already
          there; this is fixing an underspecification.
- Henri stated it would be a huge task to scrub existing code bases to
          avoid this undefined behavior.
- Zach predicted that we'll end up with separate interfaces for
          assuming an encoding vs checking the encoding.  This isn't hurting
          anybody, it is just enabling fast path implementations.
- Henri expressed concern about digging deeper into making default
          interfaces unsafe; like std::optional::operator* is.  He
          would prefer unsafe interfaces be clearly marked as unsafe.  This
          undefined behavior has the potential to introduce security
          issues.
- Zach responded that most standard interfaces are unsafe in some way,
          for example every function that accepts arguments of pointer
          type.
- Henri countered that the undefined behavior can be avoided in this
          case; just like we could for std::optional::operator*.
- Zach suggested that C++ is often used for its performance advantages;
          we want the default to be fast.  But this proposal isn't really about
          that; it is about documenting our default behavior.
- PeterBr stated that std::u8string is
          std::basic_string with char8_t.
          std::basic_string provides many interfaces that allow
          mutating the string in a way that would break otherwise well-formed
          UTF-8.  Rust doesn't do that.  We could specify a UTF-8 string type
          that maintains invariants, but it wouldn't be a
          std::basic_string any more.  Thus, it is up to the
          programmer to not violate UTF-8 requirements.
- Corentin agreed that we don't want to change std::u8string;
          it is just a container of code units.  String mutation should be
          managed via some overlying type like std::text.  This paper
          just reflects existing behavior.
- Henri asked if we really want to enable so much performance that we
          risk our users.  In Firefox, lots of string checking is done to avoid
          security issues even though ill-formed UTF-8 is very rare.  The
          performance isn't bad.
- PeterBr responded that an implementation can choose to define its
          behavior.
- Henri countered that, if it isn't required everywhere, then it can't
          be relied on.
- Corentin suggested that, if you want safety, then
          std::basic_string is not the type you're looking for.
          We're going to need other types on top and, eventually, we'll have
          more trusted types.
- Zach added that no interfaces are being specified in this paper, so
          there are no ergonomic concerns.  Again, this is just proposing
          blanket wording that can be strengthened in individual interfaces.
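- [ Editor's note: a minimal sketch, not taken from the paper, of the
          kind of unchecked fast-path decoding the proposed precondition
          enables; the function name is the editor's own.  Given a pointer to
          a lead byte within well-formed UTF-8, the decoder reads continuation
          bytes without bounds checks; if the text ends with an incomplete
          sequence, it reads past the end of the buffer, which is the
          undefined behavior discussed above.  An individual interface could
          strengthen the blanket requirement by validating its input first.

              // Precondition: p points at the lead byte of a complete,
              // well-formed UTF-8 code unit sequence (C++20, char8_t).
              char32_t decode_trusting(const char8_t* p) {
                  if ((*p & 0x80) == 0x00)
                      return *p;
                  if ((*p & 0xE0) == 0xC0)
                      return (char32_t(*p & 0x1F) << 6) | (p[1] & 0x3F);
                  if ((*p & 0xF0) == 0xE0)
                      return (char32_t(*p & 0x0F) << 12)
                           | (char32_t(p[1] & 0x3F) << 6)
                           | (p[2] & 0x3F);
                  return (char32_t(*p & 0x07) << 18)
                       | (char32_t(p[1] & 0x3F) << 12)
                       | (char32_t(p[2] & 0x3F) << 6)
                       | (p[3] & 0x3F);
              }
          ]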
 
- Tom initiated a discussion about polling during telecons.
    
      - Tom introduced:
        
          - He prefers to avoid polling during telecons in favor of polling
              during face to face meetings.  This is due to 1) larger numbers
              of attendees at face to face meetings, 2) more opportunity for
              input from those that do not regularly attend telecons, and 3)
              more opportunity for background thinking after a discussion
              before having to respond to a poll.
- He also sees the telecons as useful for priming discussion and
              identifying non-obvious concerns.
 
- Tom asked if anyone wanted to argue for a change in practice.
- The group expressed general agreement to continue doing what we've
          been doing.
 
- P1879R0 - The u8 string literal prefix does not do what you think it does
    
      - https://github.com/tzlaine/small_wg1_papers/blob/master/P1879_please_dont_rewrite_my_string_literals.md
- Zach introduced:
        
          - This started from an experience from a while back that we have
              previously discussed.
- Tests involving UTF-8 formatted source files failed when compiled
              with the Microsoft compiler, but not with other compilers.
- The source files did not have a UTF-8 BOM and Microsoft's
              /source-charset:utf-8 option wasn't being used, so the
              source files were decoded as Windows-1252.
- String literals therefore did not contain what was expected
              because code units were not interpreted as expected (see the
              editor's note at the end of this discussion for an illustrative
              example).
- The paper proposes prohibiting use of u8, u,
              and U literals unless the source file encoding is a
              Unicode encoding.
 
- Corentin suggested relaxing the prohibition to allow use of these
          literals so long as the source contents of the literal only use
          characters from the basic source character set.
          [ Editor's note: presumably this would still allow characters
          outside the basic source character set if specified with
          universal-character-name escape sequences. ]
- Corentin also stated that the current behavior makes sense according
          to the standard, but most programmers aren't aware of source file
          encoding vs execution encoding concerns.
- Henri stated that the behavior makes sense if you think of C++ source
          code as text rather than bytes and agreed that this isn't what
          programmers expect.
- PeterBr expressed support for the paper because it ensures you get
          the same abstract characters written in the source file and added
          that it would be nice if this paper used the same terminology as
          proposed in Steve's recent terminology paper
          (P1859R0).
          [ Editor's note: this paper will be in the Belfast pre-meeting
          mailing. ]
- Zach agreed regarding use of terminology.
- Tom expressed concerns regarding breaking backward compatibility,
          particularly for z/OS where source files are EBCDIC and u8
          literals are used to produce ASCII strings.
- Zach asked if it would help to only allow characters from ASCII.
- PeterBr stated that, if the compiler is not explicitly told what the
          source encoding is, you are in trouble since the compiler can't
          always detect an encoding expectation mismatch.
- Henri noted that the translation model matches what is done on the
          web where HTML source is transcoded to some internal (Unicode)
          encoding.  A compiler could preserve meta data about the encoding a
          literal came from and, if the transcoded code point is above 0x80,
          issue a diagnostic.
- Zach asked for more information regarding concerns for z/OS and
          EBCDIC.
- Tom explained the source translation model according to
          translation phase 1.
          Source files are first transcoded from an implementation defined
          encoding to an implementation defined internal encoding.  The internal
          encoding has to be effectively Unicode (or isomorphic to it) due to
          possible use of universal-character-name sequences in the
          source code.  The internal encoding is then transcoded to the various
          execution encodings where needed.
- Tom went on to explain that there are multiple EBCDIC code pages and
          that many of the characters available in them are not defined in
          ASCII.  Restricting UTF literals to just ASCII would prevent use of
          those characters.
- Tom restated PeterBr's point from earlier.  This problem is always due
          to mojibake; the source file being encoded in something other than
          what the compiler expects.
- PeterBr agreed that the root cause is the encoding mismatch and opined
          that this is a problem worth solving.  The question is how best to
          solve it.  The first place to look is at the translation from source
          encoding to internal encoding.
- Henri expressed belief that it makes sense to address the problem
          where Zach suggests.
- Zach stated that the right place to detect this is during parsing;
          when parsing a UTF literal, it is critical to know what the source
          encoding is.
- Tom countered that it is necessary to know the encoding as soon as you
          hit a code unit that doesn't represent a member of the basic source character set.
- Henri stated that diagnosing any such code unit is a harder sell than
          just diagnosing one in a UTF literal.
- Tom agreed.
- PeterBr noted that it is implementation defined how (or if) characters
          outside the basic source character set are represented.  The goal of
          the paper is effectively to tighten that up.  That means
          implementations can have extensions to relax diagnostics.
- Henri responded that such arguments apply to any change to the
          standard.
- Zach agreed, but noted this is restricted to source files that have
          UTF literals with transcoded code points outside of ASCII.
- Henri stated that there is more potential for failures for some
          character sets than others.  For example, some character sets don't
          roundtrip through Unicode.  This failure mode already exists, but
          there is little value in trying to diagnose this outside of UTF
          literals.
- PeterBr stated that a source file with code units representing
          characters outside of the basic source character set is ill-formed
          subject to implementation defined behavior.  When a programmer writes
          a UTF literal, that is a request for a specific encoding, but it is
          perfectly valid for the source file to be written in Shift-JIS.
- Henri acknowledged that perspective as logically valid, but doesn't
          address the problems caused by the Microsoft compiler's default
          behavior not matching user expectations.  Programmers are using UTF-8
          editors these days.
- PeterBr asserted that is a quality of implementation concern and not
          an issue with the standard.
- Tom agreed.
- Zach stated that the proposed restrictions can be worked around by
          using universal-character-name escapes and stated a
          preference for implementing a solution that results in a diagnosis
          for the problem he encountered, but that this isn't a critical
          issue.
- Corentin brought up static reflection and that, at some point,
          reflection will require defining or reflecting the source file
          encoding.
- Tom stated that dovetails nicely with Steve's P1859R0 draft that
          provides a callable for conversion of string literal encoding.
- Corentin noted that Vcpkg compiles all of its packages with the
          Microsoft compiler's /utf-8 option and that Microsoft may
          be open to defaulting source encoding to UTF-8 when compiling as
          C++20.
- Zach added that the Visual Studio editor, by default, adds a UTF-8
          BOM to new source files it creates, though it doesn't implicitly add
          a UTF-8 BOM when existing files are added to a project.
- Corentin observed that, because source encoding is not portable,
          most programmers just don't use characters outside of ASCII except
          in comments; which is why such characters are ignored.
- PeterBr suggested that an evening session in Belfast to discuss this
          or other ideas might be an option and that it would be good to talk
          directly with implementors.
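- [ Editor's note: an illustrative example, not from the paper, of the
          failure mode Zach described.  If this source file is UTF-8 but the
          compiler decodes it as Windows-1252, the two bytes of the é are
          treated as two separate characters and each is converted to UTF-8,
          so the literal ends up with four code units and the assertion
          fails; with the source correctly decoded as UTF-8, it holds two.

              #include <string_view>

              constexpr std::u8string_view s = u8"é";
              static_assert(s.size() == 2,
                            "source file was not decoded as UTF-8");
          ]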
 
- Tom confirmed that the next meeting will be on October 23rd and will be
      the last meeting before Belfast.
October 23rd, 2019
Draft agenda:
  - P1844R0: Enhancement of regex
- P1892R0 - Extended locale-specific presentation specifiers for std::format
- P1859R0 - Standard terminology for execution character set encodings
Attendees:
  - David Wendt
- Mark Zeren
- Peter Brett
- Steve Downey
- Tom Honermann
- Yehezkel Bernat
- Zach Laine
Meeting summary:
  - Tom initiated a round of introductions for new attendees.
- P1844R0: Enhancement of regex
    
      - https://wg21.link/P1844R0
- Tom introduced the paper on behalf of the author:
        
          - The proposal is an expansion of std::basic_regex
              specializations.
- We've discussed issues with std::basic_regex before.
              The author has put significant effort into this proposal.  It
              includes wording.  We owe it to the author to set aside any biases
              and consider the benefits of this paper.
- An implementation is available though it only implements the
              proposed char8_t, char16_t, and
              char32_t specializations, not the existing char
              or wchar_t specializations.
- The paper does not propose an alternative to
              std::basic_regex, but rather attempts to address
              shortcomings of it for UTF encodings via specializations.
              [ Editor's note: this implies that the proposal doesn't
              address issues with support of UTF encodings with the
              char and wchar_t specializations. ]
- The paper proposes a new regex syntax option,
              ECMAScript2019, to be used to select a regular expression
              engine that implements the ECMAScript 2019 specification.  This
              option would be available for use with all
              std::basic_regex specializations.
- The paper proposes a new dotall syntax option that allows
              the . character to match any Unicode code point,
              including new line characters, when using the
              ECMAScript2019 option.
- The new ECMAScript2019 syntax option would be the only
              syntax option supported for the char8_t, char16_t, and char32_t specializations.
- The ECMAScript2019 regular expression engine would
              NOT exactly match the ECMAScript 2019 specification:
            
              - The \xHH expression is redefined to match code points
                  rather than code units.
- However, the author would be fine with removing support for the
                  \xHH expression since support for code points is
                  provided by the \uHHHH and \u{H...}
                  expressions.
 
- The proposal removes locale dependency for the char8_t,
              char16_t, and char32_t specializations and
              therefore does not propose any new specializations of
              std::regex_traits.
- The paper proposes new overloads of std::regex_match and
              std::regex_search to allow specifying look behind limits
              on ranges.
- The proposed changes to std::regex_iterator are ABI
              breaking.
 
- PeterBr observed that the proposal doesn't deal with language specific
          aspects like case folding.
- PeterBr stated he liked the motivation for this paper and the notion
          that std::regex can be made to work.
- Zach asked about support for collation and whether anyone was familiar
          with the existing collate syntax option.
- PeterBr responded that the paper states that the collate
          option is ignored for these specializations.
- Zach stated that the default collation is not useful and that
          tailoring is required.
- Tom summarized, so the paper needs to address collation.
- Zach disputed that need since it could profoundly impact
          performance.
- PeterBr suggested that, perhaps, regex for Unicode should operate on
          std::text.
- Tom expanded that suggestion to any sequence of code points and
          observed that the proposal kind of does that already via the changes
          to regex_iterator.
- Zach agreed it would be useful to use as an adapter for code
          points.
- Tom asked if a new regex feature for non-compile-time regex support
          would be preferred over specializing std::basic_regex as
          proposed.
- Zach responded that he doesn't think std::regex is DOA, but
          if we're going to support Unicode regex with dynamic patterns, then,
          we should pursue some of the design of CTRE.
- Zach added that solving the problem is important and that he wants to
          see Unicode regex support but would prefer to take a wait-and-see
          approach on this paper while watching how CTRE and
          std::format evolve.
- PeterBr acknowledged the benefits of CTRE, but stated that we do need
          a solution for dynamic regex.
- Zach reported that he believes that Hana is planning to make CTRE
          capable of supporting dynamic pattern strings and that, if that
          were to happen, then we wouldn't need std::regex any longer.
- Mark lamented the lack of a proposal like this one when C++11 was
          being designed since the approach looks good relative to other papers
          from the past.
- Mark added that it is an embarrassment that we don't have a solution
          for this today, but that he feels kind of neutral on it as well due
          to concerns about allocating time for this relative to other things
          we could do.
- Mark asked what implementors would think and if they get requests for
          Unicode std::regex support.
- Mark asserted that the implicit use of the ECMAScript2019
          engine when a different syntax option is specified has to be
          changed.
- Zach reiterated that this proposal is definitely an ABI break, that
          an ABI break is a serious problem, and that the need for such a break
          suggests we need a different family of types.
- Mark added that the paper should make it clear that it does break ABI,
          not that it might.
- Tom asked if this proposal solves the std::basic_regex
          issues with support for variable length encodings.
- Zach responded that std::regex doesn't handle incomplete or
          ill-formed code unit sequences and suggested that perhaps those should
          match against \uFFFD.
- Zach reported that std::regex can also match code unit ranges
          that straddle code unit sequences since std::regex effectively
          matches bytes (see the editor's note at the end of this discussion
          for an illustrative example).
- Tom asked what guidance we should offer to LEWG.
- Zach suggested:
        
          - We should solve this problem.
- This approach is premature given other things in flight now, but
              if this had been proposed three years ago he might have felt
              differently about it.
 
- PeterBr suggested it should be prioritized behind CTRE.
- Tom asked whether support for tailoring is important.
- Zach suggested placing tailoring at the lowest priority and mentioned
          that he doesn't think ICU supports it as people don't often want to
          do collation aware searching.
- Tom reiterated that we should offer guidance that it be ill-formed to
          specify a syntax option other than ECMAScript2019 for the
          proposed specializations.
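- [ Editor's note: an illustrative example, not from the paper, of the
          byte-oriented matching Zach described.  With the char
          specialization, std::regex matches code units, so "." can match a
          single byte of a multi-byte UTF-8 sequence.

              #include <cassert>
              #include <regex>
              #include <string>

              int main() {
                  std::string s = "\xC3\xA9";   // UTF-8 encoding of é
                  std::smatch m;
                  assert(std::regex_search(s, m, std::regex(".")));
                  assert(m[0].length() == 1);   // matched only the lead byte
              }
          ]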
 
- P1892R0 - Extended locale-specific presentation specifiers for std::format
    
      - PeterBr introduced the paper:
        
          - Looking through the std::format specification he found
              that there are useful floating point formats that cannot be
              produced in locale specific formats.
- Locale specific formats are important in scientific fields.
- The 'n' specifier has a different meaning for integers
              than it does for floating point.
- An NB comment was filed to make the 'n' specifier
              indicate a locale specific format rather than a type
              modifier.
- The proposed change should not affect existing well-formed
              std::format calls except for bool which would
              now be formatted as locale variants of "true" or "false" instead
              of 1 or 0.
- This would make std::format unambiguously the best choice
              for localized formatting since locales can be easily specified and
              std::format already solves shortfalls of iostreams and
              printf such as ordering (see the editor's note at the end of
              this discussion for an example of locale-specific formatting).
- Without this change, there is still a need to use printf
              for locale sensitive formatting.
 
- Mark noted that this change will break existing users of
          {fmt}.
- PeterBr responded that it will for existing uses of bool but
          that he isn't concerned about existing users of
          {fmt}.
- Tom observed that use of 'l' as the specifier as suggested in
          the paper avoids the break and aligns with Victor's
          P1868R0 paper to enable locale
          specific handling of character encodings.
- Mark stated that the core issue is that there remains some uses of
          printf that can't be directly replicated with
          std::format and asked how a programmer would print, for
          example, the locale specific decimal character but without the locale
          specific thousands separator.
- PeterBr responded that the programmer can create a custom locale.
- Zach stated that we can't defer this until C++23 because changing the
          meaning of 'n' would break compatibility and asked why we
          can't just introduce an 'l' specifier in C++23?
- PeterBr responded that doing so makes things more complicated and
          asked whether we would deprecate 'n' if 'l' were to
          be adopted.  We can postpone addressing this, but we get a cleaner
          solution in the long term by addressing it now.
- Zach agreed with the motivation being to avoid a wart that we'll need
          to teach but that some opposition will be raised due to perceived risk
          at this late stage.
- Zach stated that he likes the change, but that it needs good
          motivation.
- PeterBr suggested that 'n' could be removed now and then
          restored with desired changes in C++23.
- Zach suggested that if Victor supports the paper, it will probably
          pass, but if he disagrees with it, then it is probably DOA.
- Mark stated that the choices need to be clearly presented for
          LEWG.
- Zach observed that there are a few options and suggested presenting a
          cost/benefit of each so that LEWG is given clear choices.
- Mark suggested socializing the issue on the LEWG mailing list now to
          flush out any objections.
- PeterBr stated that any help improving the paper would be
          appreciated.
- Mark suggested presenting either slides or a different paper that
          presents the options and analysis.
- PeterBr stated he would create a doc that could be collaboratively
          edited.
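- [ Editor's note: an illustrative example, not from the paper, of
          locale-specific formatting with std::format using the uppercase
          'L' option present in the C++20 specification as ultimately
          adopted.  The named locale must be installed on the system;
          otherwise std::locale's constructor throws.

              #include <format>
              #include <locale>
              #include <string>

              int main() {
                  // Locale-independent by default.
                  std::string plain = std::format("{}", 1234567);  // "1234567"
                  // Locale-specific grouping via the 'L' option.
                  std::string local = std::format(
                      std::locale("de_DE.UTF-8"), "{:L}", 1234567);
                                                              // "1.234.567"
              }
          ]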
 
- P1859R0 - Standard terminology for execution character set encodings
    
      - Steve introduced the paper:
        
          - The goal is to not affect implementations, but rather to fix
              wording so that we can use modern terminology and understand
              each other better.
- We often use terms like "execution encoding" that are not defined
              in the standard and are opportunities for confusion.
- We need to admit that wchar_t is not, in practice, able
              to hold all code points of the wide execution character set.
 
- Zach asked what "literal encoding" is for.
- Steve responded that it reflects the encoding for non-UTF
          literals.
- Zach asked what difference is intended by "character set" and
          "character repertoire".
- Steve responded that the goal is to tighten up the meanings of
          existing terminology so as to avoid massive changes to the
          standard.
- Mark observed that there seems to be a missing word in the
          definition of "Basic execution character set"; that there seems to
          be a missing "that".
- PeterBr stated that this should be high priority in C++23 so we can
          get everyone on board with terminology.
- Steve agreed and asserted we'll need to socialize these new
          terms.
- Tom asked if there are any terms being dropped; it looks like the
          paper adds "literal encoding" and "dynamic encoding".
- Steve responded that none are dropped and stated there will be an
          additional associated encoding added for character types as well.
- Mark noticed that the paper discusses literal_encoding and
          wide_literal_encoding but doesn't define a term for "Wide
          literal encoding".
- Tom asked if "source encoding" should be added.
- Tom asked if we should add a statement that the dynamic encoding must
          be able to represent all of the characters of the execution character
          set.
- Steve responded that we could add that.
- PeterBr observed a potential problem with doing so on Windows where
          the dynamic encoding might be UCS-2, but the execution character set
          is UTF-16.
- Tom suggested refining the requirement such that characters used in
          literals must have a representation in the dynamic encoding.
- Mark suggested it would be helpful to have a cheat sheet with
          mathematical notation of which terms denote a subset of other
          terms.
- Steve agreed.
- Tom suggested that we also need "wide dynamic encoding".
- Zach asked about the difference between the "encoding" and "character
          set" terms.
- Steve responded that the former states how characters are represented
          while the latter states what characters must be representable.
- Zach stated it would be useful to have text explaining the
          difference.
- Tom asked how ODR violations would be avoided for
          literal_encoding since literal encoding can vary by TU.
- Steve responded that the same technique used for
          std::source_location can be used; a value is provided.
 
- Tom confirmed that the next meeting will be November 20th.
November 20th, 2019
Draft agenda:
  - Belfast follow up and review.
- Volunteers to draft a library design guidelines paper.
Attendees:
  - JeanHeyd Meneide
- Mark Zeren
- Steve Downey
- Tom Honermann
- Yehezkel Bernat
- Zach Laine
Meeting summary:
  - P1868 - 🦄 width: clarifying units of width and precision in std::format:
    
      - Tom introduced the topic:
        
          - Concerns were raised in Belfast with regard to the stability of
              the proposed code point ranges to be used for display width
              estimation.  The currently proposed ranges map all extended
              grapheme clusters (EGCs) to a display width of one or two despite
              there being a number of known cases of EGCs that consume no
              display width (e.g., U+200B {ZERO WIDTH SPACE}) or more
              than two display width units (e.g., U+FDFD {ARABIC LIGATURE
              BISMILLAH AR-RAHMAN AR-RAHEEM}).
- Additionally, the EGC breaking algorithm is dependent on Unicode
              version and the proposed wording does not specify which version
              of Unicode to implement.  Concerns were raised regarding having a
              floating reference to the Unicode standard and the potential for
              differences in behavior across implementations if the Unicode
              version is implementation defined and subject to change across
              compiler versions.
- How should we address these concerns?
 
- Zach commented that the wording review went through LWG ok and that
          he had posted a message to the LWG mailing list responding to one
          concern that was raised.
- Zach reported that Jonathan Wakely stated that floating references
          to other standards are not permitted but that implementors can, as
          QoI, offer support for other versions.
- Tom expressed surprise regarding that restriction given that we have
          a floating reference to ISO 10646 in the working paper today.
- Zach responded that LWG stated a requirement for a normative reference
          and is therefore planning to add a normative reference to Unicode 12
          with the intent that we update the normative reference with each
          standard release.
- Tom asked that, if we reference a particular version, can
          implementations use a later version and remain conforming.
- Zach responded that doing so seems to be acceptable to
          implementors.
- Steve remarked that CWG expressed a preference for a floating
          reference.
- JeanHeyd confirmed and added that is how the working paper ended up
          with the floating reference to ISO 10646.
- Zach said he will follow up about this discrepancy.
- Mark asked if we have a preference for floating vs fixed.
- Zach responded that implementations will do what they need to do for
          their users.
- Tom turned the discussion back to concerns raised by Billy regarding
          changes to the width estimate algorithm being a breaking change; e.g.,
          changing the width estimate for a given EGC.  This is a related but
          distinct concern from the EGC algorithm changing due to a change in
          Unicode version.
- Zach stated that U+FDFD is an example of something we need
          to fix that can also be a breaking change.
- Steve repeated that the concern is basically any change in behavior
          potentially resulting in a surprising or undesirable change.
- Mark asserted that we're going to continue having difficulties with
          dependencies on Unicode data and that the situation is analogous with
          respect to the timezone database.  Implementors can enable stable
          behavior by allowing choice of Unicode version.
- Steve noted that the rate of change of the Unicode standard has skewed
          towards stability.
- Mark opined that we should not solve this problem in the
          standard.
- Tom agreed and added that we can specify a minimum version, but leave
          the actual version implementation defined.
- Mark asked which version of the Unicode standard the proposed code
          point ranges were pulled from.
- Tom responded that the Unicode standard doesn't contain character
          display width data and that these were extracted from an
          implementation of wcswidth() (see the editor's note at the end of
          this discussion for a sketch of this style of range-based
          estimation).
- Steve stated that he maintained a list of double wide characters for
          years and that it was not a significant burden.
- Tom stated that his desire for a floating reference to the Unicode
          standard with an implementation defined choice of version is intended
          to allow implementors to keep up with new Unicode versions.  Unicode
          releases happen every year while C++ standards are only released
          every three years.  Implementors probably can't lag Unicode by three
          years.
- Zach acknowledged the goal and stated that will result in some
          implementation divergence as some implementors will keep up and some
          won't, but that the differences are likely to be minor.
- Tom asked if ISO 10646 annex U constitutes a reference to
          UAX#31.
- Steve suggested this is probably a bureaucratic issue and added that
          having a normative reference is helpful.
- Zach responded that it could be harmful if we get conflicting
          floating and non-floating references for ISO 10646 vs Unicode, but
          this should fall to LWG and CWG to decide.
- Tom asked how we should go about fixing the currently proposed width
          estimates since the proposed ranges are clearly missing support for
          cases of zero width or width greater than two.
- Zach opined that he wasn't sure there is a problem to be fixed since
          what is specified matches existing practice.
- Tom asked if we know where this implementation of wcswidth()
          came from and how widely deployed it is.
- Zach suggested asking Victor.
- [ Editor's note: According to
          P1868R0, the implementation
          of wcswidth() is the one at
          https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c.
          ]
- Tom asked for opinions regarding writing a short paper that explains
          the Unicode stability guarantees and argues for floating references
          and implementations.
- Zach suggested waiting for a more motivating reason to do so.
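- [ Editor's note: a minimal sketch, not the proposed wording, of the
          style of range-based width estimation under discussion: code points
          falling in a table of ranges count as two display width units and
          everything else as one.  The ranges below are illustrative only and
          the function name is the editor's own; note that such a table maps
          U+200B and U+FDFD to one unit, which is the concern raised above.

              int estimated_width(char32_t cp) {
                  struct range { char32_t lo, hi; };
                  static const range wide[] = {
                      {0x1100, 0x115F},   // Hangul Jamo (illustrative)
                      {0x4E00, 0x9FFF},   // CJK Unified Ideographs (illustrative)
                      {0xFF00, 0xFF60},   // Fullwidth forms (illustrative)
                  };
                  for (auto r : wide)
                      if (cp >= r.lo && cp <= r.hi)
                          return 2;
                  return 1;
              }
          ]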
 
- P1949 - C++ Identifier Syntax using Unicode Standard Annex 31:
    
      - Tom introduced the topic:
        
          - EWG rejected the SG16 guidance offered in response to NB comment
              NL029
              to deprecate identifiers that do not conform to
              UAX#31 with
              noted exceptions for the _ character.
- A suggestion was made that a CWG issue be filed to consider the
              lack of updates to the allowed identifiers since C++11 as a
              defect.
- Tom agreed to file a core issue and started to do some
              research.
- According to N3146, the
              original identifier allowances appear to have been aggregated
              from various sources including
              UAX#31 and
              XML 2008,
              and following guidance in annex A of a draft of
              ISO/IEC TR 10176:2003.
- Thank you to Corentin for quickly providing a way to query the
              code point ranges that have the XID_Start or
              XID_Continue property set.
              https://godbolt.org/z/h7ThEh.
              These ranges differ substantially from what is in the current
              standard.
- What should the proposed resolution for the core issue be?
 
- Steve stated that
          UAX#31 permits
          extensions, and what was adopted for C++11 effectively whitelisted
          a large set of code points.
- Zach asked what EWG's concern was.
- Steve replied that they were nervous about such a late change and
          want more time to think it through.
- Zach opined that this seems like something better addressed in
          C++23.
- Steve noted that what is done can be back ported to prior standards
          though, that Clang and gcc support Unicode encoded source code
          [ Editor's note: so does MSVC ], and that the longer we wait
          to address this, the more code we potentially break.
- Tom stated that, from the DR perspective, we could either figure out
          what we want for C++23 and recommend that as the proposed resolution,
          or we can do a more targeted fix for C++20 for specific problematic
          cases knowing that we'll likely do differently for C++23.
- Steve stated that the only difference C++ needs from
          UAX#31 is support
          for _, and such an extension is conforming.  It would also
          be ok to restrict identifiers to a common script to avoid homoglyph
          attacks.
- Steve added that there is also the issue of normalization forms and
          that gcc will currently warn if identifiers are not in NFC form (see
          the editor's note at the end of this discussion for an illustrative
          example).
- Mark asked if we should make it ill-formed for identifiers to not be
          in NFC form.
- Steve responded that doing so could break existing code.
- Tom suggested normalizing when comparing identifiers is another
          approach.
- Steve noted that doing so requires the Unicode normalization
          algorithms.
- JeanHeyd mentioned that we'll also have the problem of reflecting
          identifiers in the future and that normalization will be relevant
          there.  Corentin brought this up in SG7.  Requiring NFC would be
          helpful there.
- Mark expressed support for the idea of requiring NFC.
- Steve suggested that there is always the
          universal-character-name escape hatch.
- Mark opined that EWG probably won't like requiring conversion to NFC
          in name lookup.
- Tom responded that gcc is at least detecting non-normalized
          identifiers today, that doing so must require some level of Unicode
          database support, and that performance costs are presumably
          reasonable.
- Steve stated that gcc looks for some range of combining code points
          and may not be 100% accurate.
- Mark asked if non-NFC text can be detected without having to fully
          normalize.
- Zach responded that he didn't think so.
- Mark asked if normalization was brought up in EWG.
- Steve responded that it wasn't, that we didn't get that far in the
          discussion.
- Tom suggested that we have a good amount to think about here and that
          he is looking forward to the next revision of Steve's paper.
- Steve took the bait and agreed that the paper will have to provide
          good arguments for why this is important.
- Zach suggested that this should be easy for implementors if they
          don't have to deal with normalization and that we should just
          require NFC for performance reasons.
- Mark asked if we could make non-NFC identifiers ill-formed, no
          diagnostic required, so that implementations are not required to
          diagnose violations.
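- [ Editor's note: an illustrative example, not from the paper, of the
          normalization issue discussed above.  These two identifiers are
          canonically equivalent under Unicode normalization but are spelled
          with different code point sequences, so a compiler that does not
          normalize treats them as distinct names (gcc's -Wnormalized warns
          about the first, non-NFC spelling).

              int re\u0301sume\u0301 = 0;  // "résumé", combining acute accents (NFD)
              int r\u00E9sum\u00E9 = 1;    // "résumé", precomposed characters (NFC)
          ]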
 
- P1097 - Named character escapes:
    
      - Tom introduced the topic:
        
          - EWG narrowly rejected the paper, but expressed good support for
              the direction.
          - Most concerns had to do with implementation impact and, in
              particular, the potential increase in compiler binary size.  Some
              distributed build systems distribute compilers as part of the
              build process and the additional latency imposed by increasing
              the size of compiler binaries adds cost.  Numbers haven't been
              obtained, but guesses were around 2MB, which could probably be
              reduced to under 600K.
- One prominent EWG member was strongly opposed to the design
              because he would prefer a solution that avoids baking Unicode
              into the core language.  Something like a string interpolation
              solution that could call out to constexpr library
              functions to do character name lookup.
- Martinho was working on an implementation in Clang at Kona, but
              Tom doesn't know the state of it or where to find it.  Tom
              reached out to Martinho via email, but didn't hear back.
- Anyone have time and interest to experiment and produce some
              estimates to address the implementation impact concerns?
 
- Steve stated that he could probably do some work on it and that the
          name DB should compress really well with use of a trie.
- JeanHeyd suggested that the
          UAX44-LM2
          compression scheme could help to reduce size.
- Tom expressed uncertainty that it would help much over a trie, but
          we could experiment and put the results in a paper.
- Zach suggested splitting names that contain "with" in them since the
          suffixes that tend to follow "with" are highly repeated.
- Tom noted that the algorithmically generated names could be specially
          handled as well.
- Steve added that a tokenization approach could help too.
- Tom asked if anyone might know of a link to Martinho's
          implementation.
- Zach replied that a link was provided at some point, possibly in
          Slack.
- [ Editor's note: Tom searched Slack, but failed to find a
          reference. ]
 
- P1880 - uNstring Arguments Shall Be UTF-N Encoded:
    
      - Tom introduced the topic:
        
          - LEWG rejected the SG16 guidance offered in response to NB comment
              FR164
              to adopt P1880 for C++20.
- What should we do next?
 
- Zach expressed frustration that he was available when the NB comment
          and paper were discussed in LEWG, but that no one notified him that
          the discussion was happening.
- Zach stated that, after the SG16 meeting, he went through all
          references to std::basic_string and added missing references
          to PMR strings and std::basic_string_view.  This research
          also identified a number of references that are deserving of more
          scrutiny.
- Zach opined that this isn't very important for C++20 and that he will
          work on a revision for C++23, though not for the Prague meeting.
- Zach stated he was surprised at how many references to these types he
          found in function templates.
 
- Tom asked for volunteers to draft a library design guidelines paper.
    
      - Tom introduced the topic:
        
          - During the
              SG16 meeting on July 31st,
              we discussed guidelines for when to add function overloads for
              each of char, wchar_t, char8_t,
              char16_t, and char32_t and he would like to have
              a library guideline paper that records our guidance.
- Would anyone be interested and willing to work on this?
 
- Zach expressed interest in doing so.
 
- Mark brought up a wording update email Zach sent to LWG with regard to
      P1868:
    
      - Mark noted that the wording introduces a new term of art: "estimated
          display width units".
- Zach responded that the new term was intentional; we're leaving the
          width estimation effectively unspecified for non-Unicode encodings.
          Implementors expressed a preference for not having to document their
          choices and we didn't want to force embedded compilers to have to be
          Unicode aware.  So, we needed a non-Unicode term.
- Tom noted that the wording appears to require embedded compilers to
          use the proposed Unicode algorithm if their execution character set
          is Unicode.
- Zach acknowledged that would be the case.
- Mark suggested that is probably what we want if they are actually
          doing Unicode.
- Tom agreed and suggested such implementors could otherwise state that
          their execution character set is ASCII.
 
- Tom communicated that the next meeting will be on December 11th.
December 11th, 2019
Draft agenda:
  - Vocabulary type(s) for extended grapheme clusters?
    
      - Per Michael McLaughlin's questions posted to the (old) mailing list
          on 11/01.
 
- P1097: Named character escapes
    
      - Review research on minimizing the name lookup DB and code size.
 
Attendees:
  - Corentin Jabot
- David Wendt
- Peter Bindels
- Peter Brett
- Steve Downey
- Tom Honermann
Meeting summary:
  - P1097: Named character escapes:
    
      - Tom introduced the topic:
        
          - Since our last meeting, Corentin did some outstanding
              investigative and evaluation work and blogged about his results.
- Corentin's implementation of his size reduction techniques is
              also available.
- The goal for today is to review his results and determine next
              steps.
 
- Corentin opined that the data is still kind of large at approximately
          260K.
- Zach noted that Corentin did a good job of estimating a theoretical
          lower bound for reducing the data at around 180K, so achieving a
          result of 260K is great.
- Steve commented that the code shows the challenges C++ has with
          variable length data.  The natural representation would use variants,
          but that cannot be represented as efficiently.
- Corentin agreed noting that good performance demands working at the
          byte level.
- Zach expressed a similar experience working on
          Boost.text; flat arrays
          of bytes had to be used to achieve scaling goals.
- Tom stated that we need to draft a revision of this paper and that he
          is happy to do so, but would welcome any other volunteers.
- Corentin asked if we know how to get in touch with Martinho.
- Tom responded that he tried, but did not get a response.
- Tom noted that, if we can't get in touch with Martinho, then we'll
          need to submit a new paper rather than a new revision.
- Corentin asked if a new paper was really necessary.
- Steve responded that, as a matter of procedure, we need a new paper to
          get it on the schedule.
- PeterBi added that we need a place to record the new information.
- Tom stated he would attempt to contact Martinho again.
- [ Editor's note: Tom did reach out again via email, but again did
          not get a response. ]
- Tom asked Corentin if he wanted to take this and run with it given the
          considerable investment he has already made.
- Corentin responded that he is unfortunately time constrained.
- Corentin mentioned that the new paper should state the need for
          matching name aliases and case insensitivity.
- Tom agreed and noted that we have polls on those topics from
          presentation to EWGI in San Diego that record a trail of intent for
          those cases.
- Zach asked Corentin if dashes are handled properly in his
          experiment.
- Corentin replied affirmatively that spaces, dashes, and underscores
          can be omitted or swapped as recommended by Unicode in
          UAX #44 (see the editor's note at the end of this discussion for a
          sketch of this loose matching).
- Corentin added that the current 260K size includes support for name
          aliases.
- Steve observed that there is motivation for allowing spaces, dashes,
          and underscores to be swappable; that behavior falls out of a good
          implementation.
- Corentin stated that, should a desire arise to be able to map code
          points to names, then a different implementation would provide a more
          optimized data set that handles mapping both directions.
- Tom asked Corentin for an estimated size for a perfect hash
          approach.
- Corentin responded with 300K to 400K.
- Corentin pointed out a potential challenge; that it may be desirable
          to support code point to name mapping in the standard library, but
          probably not in the compiler.  This implies a potential need for the
          Unicode character name data to be available to both.
- Steve stated that it seems unfortunate to not expose the compiler data
          to the library.
- Corentin suggested the data would probably need to be present in both
          the compiler and the library.
- Tom provided a possible way to avoid that; by making it available in
          the library, but accessible from the core language.  At least one EWG
          member strongly advocated for such an approach; a string interpolation
          like facility.
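- [ Editor's note: a rough sketch, not Corentin's implementation, of
          UAX44-LM2 style loose matching: comparisons ignore case, spaces,
          underscores, and medial hyphens.  The function name is the editor's
          own, and the real rule has a special case for U+1180 HANGUL
          JUNGSEONG O-E that this sketch ignores.

              #include <cctype>
              #include <cstddef>
              #include <string>
              #include <string_view>

              std::string loose_name(std::string_view name) {
                  std::string out;
                  for (std::size_t i = 0; i < name.size(); ++i) {
                      char c = name[i];
                      if (c == ' ' || c == '_')
                          continue;
                      if (c == '-' && i > 0 && i + 1 < name.size()
                          && name[i - 1] != ' ' && name[i + 1] != ' ')
                          continue;   // drop medial hyphens
                      out.push_back(static_cast<char>(
                          std::tolower(static_cast<unsigned char>(c))));
                  }
                  return out;
              }

              // loose_name("ZERO WIDTH SPACE") == loose_name("zero_width-space")
          ]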
 
- Vocabulary types for extended grapheme clusters:
    
      - Tom introduced the topic:
        
          - Michael McLaughlin had posted some questions to the (old) mailing
              list on 2019-11-01.
- These questions are related to representation of extended grapheme
              clusters (EGCs), specifically, how a collection or sequence of
              them might be stored.
- Should the standard library provide vocabulary types for EGCs?
 
- Zach explained the choices he made for
          Boost.text.  There are
          two vocabulary types;
          grapheme
          provides value semantics and stores a small vector optimized sequence
          of code units with a maximum size limited according to the
          Unicode stream-safe text format described in UAX #15,
          and grapheme_ref
          provides read-only reference/view semantics over a code point range
          denoted by an iterator pair.
- Zach added that he is unsure if anyone is using the value type.
- Corentin acknowledged the uncertainty regarding use cases for a value
          type.
- Corentin asked why the reference/view version is not an alias of a
          span.
- Zach responded that he wanted to support subranges and non-contiguous
          storage.  The implementation uses the view_interface CRTP
          base from C++20 ranges.
- Steve asked who the anticipated consumers are for use of EGCs.
- PeterBr expressed similar curiosity and provided some background
          experience; he previously worked on a product that was text based and
          everything was done on graphemes.  Support was available for
          individual grapheme replacement, but a value type was never needed
          because reference/view semantics were always desired.  All text
          processing was performed in terms of ranges of graphemes.
- Zach offered a couple of examples.  Text rendering depends on
          knowledge of EGC boundaries.  Additionally, an EGC reference is the
          value type of an (EGC-based) iterator on a text range.
- Zach observed that breaking algorithms don't always break on EGC
          boundaries, though split EGCs still remain EGCs on either side of the
          boundary.
- Steve stated that having a named type is very useful.  An EGC view is
          essentially a subrange, but naming it is useful.
- PeterBr clarified that an EGC is effectively a range of code
          points.
- Tom asked if there is a good distinction between an EGC type that
          represents a range of code units or code points that constitute
          exactly one grapheme vs a type that represents a range of EGCs in
          terms of a range of code units or code points.
- Zach replied yes,
          Boost.text has a type
          that represents the latter case as well;
          grapheme_view
          is a view that provides an EGC iterator.  So, yes, there are three
          potentially useful types: an owning EGC, a reference EGC, and an EGC
          view.
- Steve asked how breaking algorithms that split EGCs interact with
          these types.
- Zach replied that all Unicode algorithms are specified in terms of
          code points, not EGCs.  So, a split EGC just becomes two EGCs.  The
          sentence breaking algorithm may cause this to happen.
- Tom recalled prior conversations where we discovered that the number
          of EGCs in the parts of a text may be greater than the number of EGCs
          in the whole text.
- Steve asked for confirmation that you can still view the split code
          point ranges as EGCs.
- Zach confirmed, yes.
- Corentin asked if all of these types aren't effectively
          subranges.
- Steve replied yes, but distinct types are useful to avoid subranges
          of subranges.
- Corentin countered that, if you have a text_view and you
          split it, you get a text_view.
- Zach stated that the idea that the Unicode algorithms produce
          sequences of code points but programmers want EGCs is a key idea.
- PeterBr observed that rendering text requires more than just
          EGCs.
- Steve returned conversation to the motivation for EGC types and
          mentioned the DB field example; there is a known limit on how many
          bytes can be stored, and EGC boundaries indicate where text should be
          truncated.
- Tom asked if there is a need to distinguish between an EGC view and a
          subrange of EGC view other than an EGC reference; as Corentin
          mentioned, a subrange of a text_view is a text_view,
          so is a subrange of an EGC view an EGC view?
- Zach stated he didn't see a need for such a distinction.  Most
          interfaces should operate on EGC views, but for Unicode algorithms,
          it is necessary to drop down a level to a code point view.
- Steve summarized: an EGC reference is a view over code points with a
          contract that its range represents exactly one EGC (see the editor's
          note at the end of this discussion for a sketch).
- PeterBr imagined a scenario in which a range of code points is sliced
          to produce multiple EGCs, but when recombined with additional text,
          might yield different EGCs.
- [ Editor's note: Some discussion was missed here. ]
- Tom stated a need for consistent terminology.  Tom originally proposed
          text_view as a sequence of code points, but we now think it
          should be EGC based.
- PeterBr expressed concern; most people think they want code points.
          LEWG might object to an EGC based design.
- Zach stated that a concern we have is that we're the Unicode experts
          and everyone with strong opinions is pretty much on this call; we
          need to be aware of echo chamber issues.
- Tom added that echo chamber issues are the thing that keeps him up at
          night; how do we ensure we deliver what is truly useful?
- Steve added that he frequently is asked why some simple thing isn't
          implemented.  The answer is, because it isn't actually simple.
- Corentin stated that he gets quite concerned whenever we discuss going
          in a direction that doesn't align with Unicode recommendations; the UTC
          (Unicode Technical Committee) doesn't get things wrong very
          often.
- Steve noted that, fortunately, we're kind of late to the game, we can
          learn from the experience of other languages, and we don't have to
          discover all the problems ourselves.
- Tom returned discussion to the subrange of subrange concern; there may
          be a need to put subranges back together.
- Corentin replied that there is an ongoing effort to support that, but
          it is complicated.  JeanHeyd is working on
          P1664 and it should be discussed
          more in Prague.
- Steve described one of the challenges; for efficiency, when we have an
          EGC view and want to get down to the code unit range for efficient IO,
          reassembly can get difficult.
- Zach replied that, if you have an EGC view over a code point view over
          a sequence of code units, that is easy.
- Tom countered that doing so requires that you know that the underlying
          storage is contiguous if you want to operate on it at the code unit
          level.
- Steve added that there can't be a missing range in the middle.
- Corentin expressed a belief that this will be solved; maybe not for
          C++20, but for C++23.
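- [ Editor's note: a minimal sketch, hypothetical and not Boost.Text's
          actual definition, of the "EGC reference" idea summarized above: a
          non-owning view over a code point range whose contract is that the
          range forms exactly one extended grapheme cluster.

              template <typename CodePointIter>
              class grapheme_ref_sketch {
              public:
                  // Precondition: [first, last) is exactly one extended
                  // grapheme cluster.
                  grapheme_ref_sketch(CodePointIter first, CodePointIter last)
                      : first_(first), last_(last) {}

                  CodePointIter begin() const { return first_; }
                  CodePointIter end() const { return last_; }
                  bool empty() const { return first_ == last_; }

              private:
                  CodePointIter first_;
                  CodePointIter last_;
              };
          ]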
 
- Tom stated that our normal meeting cadence would have us meeting again on
      December 25th 🎅, but expected meeting that day would be unpopular,
      so we'll plan to meet next on January 8th.