Document Number P1844R1
Date 2019-11-22
Audience LEWGI
Intended Ship Vehicle C++23
Reply-To Nozomu Katō
Revises P1844R0

Enhancement of regex

Table of Contents

  1. Changes
  2. Introduction and Motivation
  3. Why new syntax option? Why not simply enhance the existing ECMAScript mode?
  4. Why RegExp of ECMAScript?
  5. Scope and Impact on the Standard
  6. Technical Specifications
  7. Relevant Matter
  8. Review
  9. References
  10. Appendix

Abstract

I. Changes

R1

II. Introduction and Motivation

The default engine of C++'s regex is based on RegExp, the regular expression object of the ECMAScript specification third edition. Although its original RegExp had not been being modified for many years, in these years it has been enhanced as follows:

Unfortunately, while C++'s regex supports six regular expression grammars, all are inferior to other languages' regular expression features in richness of available expressions nowadays. In addition, the problem of how to support Unicode is still unsettled. To resolve these problems, or at least to improve the current situation, this paper proposes adding the new syntax option, ECMAScript2019 to regex.

III. Why new syntax option? Why not simply enhance the existing ECMAScript engine?

Short answer: The ECMAScript based engine of C++ has been modified to depend on the locale deeply. The author of this proposal would like to release regex from the locale and to revert it to being locale independent as its original RegExp, then to add new features on it. But doing so involves deprecating some features of the default engine and seems to be difficult.

Long answer: RegExp of ECMAScript is locale independent and treats an input sequence as UTF-16. For example, /[a-z]/ is always interpreted "any character in the range from U+0061 to U+007A inclusive". This has the benefit of allowing that the set of characters that some character class expression matches can be settled even at compile time. This is true even if the icase flag is set, by preparing a reversed case-folding table (while case-folding means converting each of "S", "s", and U"ſ" to "s", reversed case-folding here means returning the set of characters that are converted into the same character when case-folding is done, such as converting each of "S", "s", and U"ſ" to U"Ssſ".)

However, the ECMAScript engine of regex can be modified as being locale sensitive by setting the collate flag. If this flag is used with the icase flag, the pattern compiler is required to call tolower and toupper per character in one character class [] for gathering all characters that the character class can match. This clumsiness was filed as Defect Report 523 in 2005 and is still Status Open.

Furthermore, the ECMAScript engine of C++ was modified to support the POSIX character class and to change the expressions \d, \D, \s, \S, \w, and \W to be equivalent to some of the POSIX character class. This also made the ECMAScript engine depend on the locale.

Although this paper is not intended to propose making regex constexpr, if there is any ECMAScript based engine that inherits the nature of being locale independent from its original RegExp, it might make constexpr support easier than now.

The modifications defined in [re.grammar] also caused DR2986, DR2987 and they are still unresolved. To fix these issues originating in the modifications, I considered to propose deprecating all the modifications and reverting the ECMAScript engine to its original specification of RegExp, before adding something new on it. But the ECMAScript engine is the default one of regex. Trying to deprecate some features of it seems to increase difficulty in getting the proposal accepted.

Moreover, while ECMAScript supports UTF-16 or now deprecated UCS-2 only, C++'s regex needs to support various character sets. It is a difficult question how Unicode property escapes should be processed in legacy (non-Unicode based) character sets.

Thus, this paper does not touch any part of the current ECMAScript engine and leaves it as is for legacy character sets that use char and wchar_t; and instead propose introducing a new syntax option being compliant to a recent version of ECMAScript keeping the nature of being locale independent, for use with char8_t, char16_t, or char32_t through the template specialization.

IV. Why RegExp of ECMAScript?

Because the ECMAScript specification explains the behavior of RegExp through defining closures in detail, there seems to be less room for any undefined or undocumented behaviour to appear. The author thinks that it is an important factor to the standard.

V. Scope and Impact on the Standard

The new syntax is implemented through the template specialization for char8_t, char16_t, and char32_t. As of C++20, <regex> supports only char and wchar_t and compiling basic_regex with the other types leads to compile time error by the reasons explained in P0169. So, the existing implementations would not be affected by this proposal.

basic_regex

When the first template parameter charT of the basic_regex class is char8_t, char16_t, or char32_t, its constructors and assign functions are required to interpret an input sequence as UTF-8, UTF-16, or UTF-32 respectively. In addition, syntax_option_types other than icase, nosubs, optimize, and multiline are disabled. An input sequence is always interpreted assuming that ECMAScript2019 is set.

These three specializations are not required necessarily to be implemented separately. It is expected that typical implementations use an internal iterator class template that translates a sequence of UTF-8, UTF-16, or UTF-32 to a sequence of Unicode code points and construct a finite state machine by parsing that translated sequence in their common base class.

One obstacle to implementing in such a way is the expression \xHH where H is a hexadecimal digit. In the spec of ECMAScript this expression is defined to specify a character by its code unit value. However, appearance of an isolated code unit in a UTF-8 sequence requires special treatment, because unlike in UTF-16 and UTF-32, an isolated code unit in UTF-8 cannot be converted to any code point when its value is in 0x80-0xFF inclusive. (This does not matter in ECMAScript, since it does not support UTF-8.)

Considering that ECMAScript assumes that a string is a sequence of UTF-16 or now deprecated UCS-2, in the proposed ECMAScript2019 syntax, the expression \xHH is treated to represent a character whose code unit value in UTF-16 is 0xHH, even it appears in a sequence of UTF-8. Since code unit values 0x00-0xFF in UTF-16 represent U+0000-U+00FF respectively, in this case, \xHH represents a code point U+00HH virtually.

(If the committee does not like the inconsistency with the meaning of C++'s own \xHH, the previous paragraph will be changed to "... use of the expression \xHH is disabled". ECMAScript's RegExp has \uHHHH and \u{H...} for specifying a character by its code point. Removing support for \xHH is not likely to cause inconvenience.)

regex_traits

basic_regex<char8_t>, basic_regex<char16_t>, and basic_regex<char32_t> must be locale independent. It means that these specializations have to construct an internal finite state machine only based on the Unicode code point values that are translated from the input sequence and may not use regex_traits.

Thus, regex_traits does not need to be specialized for char8_t, char16_t, and char32_t.

Particularly, when basic_regex is used with char8_t, char16_t, or char32_t, case-folding may not be performed using regex_traits<charT>::translate_nocase(), but performed as defined in the ECMAScript specification.

Algorithm functions (regex_match, regex_search, and regex_replace)

As these take an instance of the basic_regex class as a parameter, they get charT as a template parameter. So, they also can be implemented in a way similar to basic_regex using the template specialization.

regex_search

The proposed ECMAScrip2019 syntax supports lookbehind assertions, which none of the existing six engines of C++'s regex has support for. When performing a lookbehind assertion, the algorithm function reads the input sequence backwards. This raises a new issue.

When a user wants to find all the portions that some regular expression matches against some input sequence, the user will call regex_search against the subrange [the end of the previous matched portion of the entire sequence, the end of the entire sequence) while the previous call succeeds. In this case, the subrange [the beginning of the entire sequence, the end of the previous matched portion of the entire sequence) is also a valid range and it is safe for an algorithm function to read a character in that subrange.

However, there is no way at this time to inform an algorithm function about the limit until where it can read backwards for lookbehind.

For example, when a user calls regex_search with the regular expression /(?<!\d{2,})ABC123/ ("ABC123" not preceded by two or more digit characters) against "ABC123ABC1234ABC12345", only the first six characters should be matched, because the second and later "ABC123"s are all preceded by two or more digits. But if there is no way to tell regex_search about until where it can lookbehind, the second and later call to regex_search against [end of previous matched subsequence, end of entire sequence) also return "abc123".

The match_prev_avail flag is not suitable for this purpose. It only indicates that the preceding one point is a valid iterator position.

As of C++20, regex_search takes as an input sequence 1) two bidirectional iterators that specify [begin, end), 2) const reference to an instance of std::basic_string, or 3) a pointer to a null-terminated string. To fix the problem mentioned above, this paper proposes adding new overload functions that take three bidirectional iterators, which specify [begin, end) and the limit of lookbehind, to regex_search.

This addition is useful only when an algorithm function is used with an instance of basic_regex constructed with char8_t, char16_t, or char32_t. If any variant of algorithm functions that takes three bidirectional iterators is called when its charT is char or wchar_t, the third iterator for specifying the limit of lookbehind is simply ignored.

No specialization is proposed for regex_replace. It does not do matching by itself but uses regex_iterator that calls regex_search internally.

match_results

It is preferable that 1) the new function group() and its overload functions be added to match_results for access to captured sequences by group name, and 2) the member function format() be modified to support the replacement text symbol $<GROUPNAME> that was introduced in ES2018.

However, as match_results does not take charT as a template parameter, it is not easy to implement something specific to the proposed ECMAScript2019 syntax option through the template specialization.

Thus, in this proposal, the new member function gname_to_gnumber() and its overload functions that convert a group name to the group number assigned with it, are added instead into basic_regex specialized for char8_t, char16_t, and char32_t.

regex_iterator

regex_iterator is changed to use one variant of regex_search that takes three bidirectional iterators mentioned above, instead of the current one that takes two bidirectional iterators. In this proposal this is the only proposed change that requires modifications to the existing implementations.

It is not expected that the change above causes ABI breaking, because it does not modify the members of regex_iterator or the parameters of the member functions at all.

The link to a sample implementation based on the method above is shown in the Appendix section of this document.

VI. Technical Specifications

1. <regex>

The following changes are proposed:

Others

The undated version of the ECMAScript Specification is added to references.

VII. Relevant Matter

VIII. Review

P1844R0 was reviewed and discussed by SG16 in a telecon and at the C++ committee meeting in Belfast. The author of this document received the following feedback:

In general, it was well received, but there is considerable reluctance to investing in std::regex.  Here is some of the specific feedback from the discussions:

Since the feedback recommends abandoning <regex> and rebuilding the proposal virtually, it is difficult to revise P1844R0 that proposed enhancement of regex, based on the feedback.

Regarding the second last opinion, it was intentional. I think that touching regex for char and wchar_t can lead to ABI breaking.

Regarding the last one, there does not seem to be a technical problem to change to throwing an error. If required, I change it.

IX. References

X. Appendix

This example can be used with char8_t, char16_t, char32_t only. char and wchar_t versions are not implemented. If basic_regex or an algorithm function is used with a type other than char8_t, char16_t, and char32_t, assert(0) is called. It demonstrates that adding a new syntax option for char8_t, char16_t, and char32_t through the template specialization is a real option.

All the classes and algorithms are declared in namespace regex_proposal, instead of std.