Errata to the Regular Expression Proposal

Doc.no.:J16/03-0090=WG21/N1507=
Date: 16 Sept 2003
Project: Programming Language C++
Subgroup: Library
Wiki: http://www.research.att.com/~ark/cgi-bin/lwg-wiki/wiki?RegularExpressions
Reply to: John Maddock <john@johnmaddock.co.uk>

This document contains a list of corrections to N1429: “A Proposal to add Regular Expressions to the Standard Library”.

The document is divided up into two sections: the first deals with errors in the original proposal (mostly minor issues), the second with more substantive issues which may require further debate.

Section 1: Errata to the original proposal

Bad rationale for regex_ prefixes

Pete Becker writes: The proposal has the following statement:

Many people may see the latter as an evil, but experience suggests that statements such as:

using namespace whatever;

are remarkably common in practice. For this reason the "regex_" prefixed names that Boost regex uses are retained in this proposal.

I'm not strongly for or against the regex_ prefixes. They may well be helpful in understanding code. But I'm strongly against the notion that the standard library should use prefixes because users abuse using declarations. One of the key reasons for the existence of namespaces is to provide a language-level way of distinguishing otherwise identical names in disparate libraries. If the standard library has to use prefixes for this purpose then namespaces have failed. We should either encourage users who have written bad code to clean it up, or we should remove namespaces from the language.

John Maddock notes: there was strong agreement on the library reflector that the rationale here is just plain wrong.  There was some feeling that the regex_ prefixes did help improve code readability though.

Proposed changes:

Replace all of:

The names of the algorithms in this proposal are based upon the names Boost uses and all have a "regex_" prefix, which may cause some controversy (as may the naming used in general in this proposal). This proposal does not use overly generic names like match/compare/search/find here, as the possibility for ambiguous overloads occurring should be obvious. An alternative is to use a generic name, but placed in a nested namespace (let's say std::regex). This may be an appropriate option, but ambiguities can still occur either as a result of argument dependent lookup, or a using directive. Many people may see the latter as an evil, but experience suggests that statements such as:

using namespace whatever;

are remarkably common in practice. For this reason the "regex_" prefixed names that Boost regex uses are retained in this proposal. Here are the twelve forms for regex_match and regex_search:

With:

The names of the algorithms in this proposal are based upon the names Boost uses and all have a "regex_" prefix, which may cause some controversy (as may the naming used in general in this proposal). These names are retained largely on the grounds of code legibility. Here are the twelve forms for regex_match and regex_search:

Unintended occurrence of reg_expression:

There is a systematic error in the "proposed text" section: the various algorithms have been defined to accept a type "reg_expression" which does not in fact exist in the proposal, and which should of course be called "basic_regex".  This is an editing error that crept in when the name of that class was changed from reg_expression to basic_regex.

The fix is to just replace all occurrences of "reg_expression" with "basic_regex" throughout that section.

Iterators have incorrect definitions of the types "reference" and "pointer".

In regex_iterator and regex_token_iterator the definitions given for the types "iterator" and "reference" are wrong: as given these types refer/point to the value_type of the underlying iterator type, but should of course refer/point to the actual value_type being enumerated (the two are not the same type).

The fix is to change:

typedef typename iterator_traits<BidirectionalIterator>::pointer pointer;
typedef typename iterator_traits<BidirectionalIterator>::reference reference; 

To:

typedef const value_type* pointer;
typedef const value_type& reference; 

In both the regex_iterator and regex_token_iterator definitions.

regex_iterator does not handle zero-length matches correctly.

There is a subtle bug in regex_iterator::operator++; when the previous match found matched a zero-length string, then the iterator needs to take special action to avoid going into an infinite loop, the current wording does this but gets it wrong because it does not allow two consecutive zero length matches, for example iterating occurrences of “^” in the text “\n\n” yields just one match rather than three as it should.  The actual behavior should be as follows:

When the previous match was of zero length, then check to see if there is a non-zero-length match starting at the same position, otherwise move one position to the right of the last match (if such a position exists), and continue searching as normal for a (possibly zero length) match.

The reason for checking for a non-zero length match at the same position as the last match is shown by the following example:  enumerating occurrences of “^|abc|$” in the text “abc” should find three matches:

A zero length string at the start of the text.

The string “abc”.

A zero length string at the end of the text.

This behavior is then consistent with Perl, however ECMAScript appears to fall into the infinite loop trap in its description of RegExp.prototype.exec(string) (ECMAScript 15.10.6.2), while String.prototype.match (regexp) (ECMAScript 15.5.4.10) does get this right, albeit in a different manner: enumerating “^|abc|$” in the text “abc” would find only two matches (the zero length strings at the start and end of the text). 

This issue also effects regex_token_iterator which uses the similar standardese as regex_iterator, it might be better rewritten in terms of regex_iterator to avoid the duplication.

Proposed changes: are incorporated into the following issue.

Regex_iterator does not set match_results::postion correctly.

As currently specified, given:

regex_iterator<something> i;

then i->position() == i->prefix().length() for all matches found.

This is correct for the first match found, but makes little sense for subsequent matches where the result of i->position() is only useful if it returns the distance from the start of the string being searched to the start of the match found.

[ footnote: recall that i->prefix() contains everything from the end of the last match found, to the start of the current match, this allows search and replace operations to be constructed by copying i->prefix() unchanged to output, and then outputting a modified version of whatever matched. ]

For example this problem showed up when converting a boost.regex example program from the regex_grep algorithm (not part of the proposal) to use regex_iterator: the example takes the contents of a C++ source file as a string, and creates an index that maps C++ class names to file positions in the form of a std::map<std::string, int>.  In order for the program to take a regex_iterator and from that add an item to the index, it needs to know how far it is from the start of the text being searched to the start of the current match: that was what regex_match::position() was intended for, but as the proposal stands it instead returns the distance from the end of the last match to the start of the current match. 

The proposed text for regex_iterator::operator++ needs to be modified to handle this, likewise the text the for match_results::position needs updating accordingly (see also the issue “What does match_results::position return when passed an out of range index?”).

Proposed changes:

Replace:

difference_type position(unsigned int sub = 0)const;

Effects: Returns std::distance(prefix().first, (*this)[sub].first).

With:

difference_type position(unsigned int sub = 0)const;

Effects: If !(*this)[sub].matched then returns -1.  Otherwise returns distance(base, (*this)[sub].first), where base is the start iterator of the sequence that was searched.  [Note – unless this is part of a repeated search with a regex_iterator then base is the same as prefix().first – end note]

Replace:

regex_iterator& operator++();

Effects: if what.prefix().first != what[0].second and if the element match_prev_avail is not set in flags then sets it. Then calls regex_search(what[0].second, end, what, *pre, ((what[0].first == what[0].second) ? flags | match_not_null : flags)), and if this returns false then sets *this equal to the end of sequence iterator.

With:

regex_iterator& operator++();

Effects: if what.prefix().first != what[0].second and if the element match_prev_avail is not set in flags then sets it. Then behaves as if by calling regex_search(what[0].second, end, what, *pre, flags), with the following variation: in the event that the previous match found was of zero length (what[0].length() == 0) then attempts to find a non-zero length match starting at what[0].second, only if that fails and provided what[0].second != suffix().second does it look for a (possibly zero length) match starting from what[0].second + 1.  If no further match is found then sets *this equal to the end of sequence iterator.

Postconditions: provided *this is not set to the end-of-sequence iterator then the effects on this are given in table RE20:

Table RE20-- regex_iterator& operator++() effects

Element

Value

(*this)->size()

pre->mark_count()

(*this)->empty()

false

(*this)->prefix().first

An iterator denoting the end point of the previous match found

(*this)->prefix().last

(**this)[0].first

(*this)->prefix().matched

(*this)->prefix().first != (*this)->prefix().second

(*this)->suffix().first

(**this)[0].second

(*this)->suffix().last

end

(*this)->suffix().matched

(*this)->suffix().first != (*this)->suffix().second

(**this)[0].first

The starting iterator for this match.

(**this)[0].second

The ending iterator for this match.

(**this)[0].matched

true if a full match was found, and false if it was a partial match (found as a result of the match_partial flag being set).

(**this)[n].first

For all integers n < (*this)->size(), the start of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then end.

(**this)[n].second

For all integers n < (*this)->size(), the end of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then end.

(**this)[n].matched

For all integers n < (*this)->size(), true if sub-expression n participated in the match, false otherwise.

(*this)->position()

The distance from the start of the original sequence being iterated, to the start of this match.

Similar rewording is needed for regex_token_iterator, replace:

regex_token_iterator& operator++();

Effects: if N == -1 then sets *this equal to the end of sequence iterator. Otherwise if N+1 < subs.size(), then increments N and sets result equal to ((subs[N] == -1) ? value_type(what.prefix().str()) : value_type(what[subs[N]].str())).

Otherwise if what.prefix().first != what[0].second and if the element match_prev_avail is not set in flags then sets it. Then if regex_search(what[0].second, end, what, *pre, ((what[0].first == what[0].second) ? flags | match_not_null : flags)) == true sets N equal to zero, and sets result equal to ((subs[N] == -1) ? value_type(what.prefix().str()) : value_type(what[subs[N]].str())).

Otherwise if the call to regex_search returns false, then let last_end be the value of what[0].second prior to the call to regex_search. Then if last_end != end and subs[0] == -1 sets N equal to -1 and sets result equal to value_type(last_end, end). Otherwise sets *this equal to the end of sequence iterator.

With:

regex_token_iterator& operator++();

Effects: if N == -1 then sets *this equal to the end of sequence iterator. Otherwise if N+1 < subs.size(), then increments N and sets result equal to ((subs[N] == -1) ? value_type(what.prefix().str()) : value_type(what[subs[N]].str())).

Otherwise if what.prefix().first != what[0].second and if the element match_prev_avail is not set in flags then sets it.  Then locates the next match as if by calling regex_search(what[0].second, end, what, *pre, flags), with the following variation: in the event that the previous match found was of zero length (what[0].length() == 0) then attempts to find a non-zero length match starting at what[0].second, only if that fails and provided what[0].second != suffix().second does it look for a (possibly zero length) match starting from what[0].second + 1.  If such a match is found then sets N equal to zero, and sets result equal to ((subs[N] == -1) ? value_type(what.prefix().str()) : value_type(what[subs[N]].str())).

Otherwise if no further matches were found, then let last_end be the endpoint of the last match that was found. Then if last_end != end and subs[0] == -1 sets N equal to -1 and sets result equal to value_type(last_end, end). Otherwise sets *this equal to the end of sequence iterator.

Naming of basic_regex::getflags

Pete Becker writes: basic_regex has member functions named getflags and get_allocator. The latter is consistent with the use of the same name in STL containers. In general, it seems to me, the library tries to use an underscore to separate a verb from its object for names of this nature. That convention would mean that we should call the other one get_flags. On the other hand, we do have getline, but that's arguably different because it's not a state query. Do we have a general policy here? If so, what is it, and what should the name of getflags be?

Nathan Mayers: what does "get_flags()" convey that "flags()" does not?  Generally, names should be tested in context to see if they make the code more readable. Thus, a verb-noun pair name may make code like "if (thing.is_open())" read more like English.  The verb in get_flags() doesn't serve, there: "if (re.get_flags() & re.greedy)" is less clear than "if (re.flags() & re.greedy)" (and even less clear than "if (re.is_greedy())").  I would prefer to reserve imperative verbs to names of functions that change the state of the object.

Suggested change: Replace all occurrences of “getflags” in the document with “flags” (this is consistent with ios_base as well).

Missing namespace prefix in regex_iterator description

Pete Becker writes: The definition of regex_iterator in RE.8.1 mentions

regex_iterator(BidirectionalIterator a, BidirectionalIterator b, const regex_type& re, match_flag_type m = match_default);

And

match_flag_type flags; // for exposition only

match_flag_type and match_default are defined in the nested namespace regex_constants, so these two names need to be qualified with regex_constants::Same thing in the first RE.8.1.1.

John Maddock adds: These are used almost without exception without the needed namespace qualifier throughout the text, likewise the other symbols listed below.

Suggested changes: Go through the text and replace all occurrences of:

match_flag_type with regex_constants::match_flag_type,

match_default with regex_constants::match_default,

match_partial with regex_constants::match_partial,

match_prev_avail with regex_constants::match_prev_avail,

match_not_null with regex_constants::match_not_null,

format_default with regex_constants::format_default,

format_no_copy with regex_constants::format_no_copy,

format_first_only with regex_constants::format_first_only,

 except in the section which defines these (RE.3.1).

Unnecessary sub-section headers in regex_iterator

Pete Becker writes: The first clause labeled RE.8.1.1 has the title "regex_iterator constructors". It contains descriptions of the constructors, plus several operators. The second clause labeled RE.8.1.1 has the title "regex_iterator dereference". It contains operator*, operator->, and the two versions of operator++. Seems like both of these labels should be removed.

John Maddock replies: Agreed, they don’t seem to add anything, it’s easier to just remove those sub-section headers (they are misnamed and misnumbered in places as well):

Rename the section “RE.8.1.1 regex_iterator constructors” as “regex_iterator members”.

Remove the section “RE.8.1.1 regex_iterator dereference”.

Rename the section “RE.8.2.1 regex_iterator constructors” as “regex_token_iterator members”.

Remove the section: “RE.8.2.1 regex_token_iterator dereference”.

Names of symbolic constants

Pete Becker writes: ECMAScript has five control escapes: t, n, v, f, r. The regex proposal has named constants for four of them: escape_type_control_f, _n, _r, and _t. escape_type_control_v seems to be missing. (Okay, that's not about names, but the next two are).

This is minor, but in C and C++ those five things are escape sequences, and using names that include 'control' is a bit confusing. Granted, it fits with the terminology in ECMAScript, but I'd lean toward more C-like names, on the line of escape_type_f.

And finally, there's escape_type_ascii_control. (For those not familiar with the details of the proposal, this refers to things that we might write in ordinary text as <ctrl>-X, for example.) We've pretty much avoided the term "ascii" in the standard (it's only used twice, in footnotes, apologetically), and I'm a bit uncomfortable with its use here. I'd prefer escape_type_control_letter, which picks up the name of the production in the ECMAScript grammar for the letter that follows the escape. I think it's pretty clear what it means, and it avoids "ascii".

Proposed changes: replace all occurrences of:

escape_type_control_f with escape_type_f

escape_type_control_n with escape_type_n

escape_type_control_r with escape_type_r

escape_type_control_t with escape_type_t

escape_type_ascii_control with escape_type_control

Then immediately after the line:

static const escape_syntax_type escape_type_t;

add the line:

static const escape_syntax_type escape_type_v;

Then immediately after the table entry:

escape_type_t

t

Add the new table entry:

escape_type_v

v

Traits class versioning incompletely edited in.

Pete Becker writes: The paper talks about versioning of regex_traits classes, and RE.1.1 (in table RE2) says that a traits class shall have a member X::version_tag whose type is regex_traits_version_1_tag or a class that publicly inherits from that. Neither the <regex> synopsis (RE.2) nor the description of regex_traits (RE.3.3) mentions either of these types. I can't tell whether this was partially edited in or partially edited out. <g> So, is regex_traits versioning part of the proposal?

John Maddock replies: It’s an editing mix-up on my part, the versioning mechanism is part of the rationale, but is only incompletely edited into the main text.  Footnote: for those unfamiliar with the proposal, it is almost impossible to provide compatible extensions to the regex library proposal without extending or altering the traits class design.  Specific optimizations may also require modification of the design (one of the problems here is that this is a competitive field with new algorithms or heuristics for regular expression searches being published at frequent intervals, it is clearly not possible to build in support for as yet un-thought-of advances in the field).  By providing a mechanism that supports a hierarchy of traits class versions the basic_regex class template can adapt to the supplied traits class, taking advantage of whatever features it offers, and ignoring those it does not.

Pete Becker replies: I don't have strong views one way or the other on versioning the regex_traits class per se. It's an example, though, of a library-wide issue that we need to give some thought to (the earlier effort to "solve the versioning problem" having foundered for lack of a well defined goal). Although there are currently no concrete proposals, there have also been suggestions of changes to allocators, which would have similar problems. I think those two between them can give us a tight enough focus to understand what's at stake and come up with a workable mechanism to support backward compatibility and extensions mandated by the standard as well as vendor- or user-supplied extensions. (I don't mean to suggest that I see problems with regex_traits_version_1_tag, just that I don't want to prejudge what the solution should look like).

Proposed resolution: one of the problems with regex_traits is that one can predict right from the outset is that vendors will want to add their own refinements to the class to support new advances in the field.  The versioning mechanism in regex_traits is therefore needed but is not intended as a general library-wide solution (note that the mechanism versions a specific interface within the library, and not the whole standard library in general).  However the general technique used may be applicable to Allocators which suffer from a similar problem.  This warrants further discussion (and perhaps a proposal in its own right), but to fix the regular expression proposal as it stands:

In the “Header <regex> synopsis” section, right after:

class bad_expression;

Add:

struct regex_traits_version_1_tag;

Immediately before the section: “RE.3.3 Template class regex_traits” add:

RE.3.3 struct regex_traits_version_1_tag

struct regex_traits_version_1_tag{};

The struct regex_traits_version_1_tag defines a type that is used to identify the interface to which a regular expression traits class conforms.  Traits classes that offer an implementation defined superset of the functionality of that offered by regex_traits should define a class that is derived from regex_traits_version_1_tag and use that type as their version_tag member.

The number of the remaining sections will then need adjusting.

Finally add the following typedef member to class template regex_traits:

typedef regex_traits_version_1_tag version_tag;

Specification of sub_match::length incorrect

The specification for sub_match::length has acquired a couple of typos (a misplaced static, and the logic in the effects clause is back-to-front), it should read:

difference_type length();

Effects: returns (matched ? distance(first, second) : 0).

Traits class sentry language

Pete Becker writes: the proposal says:

“An object of type regex_traits<charT>::sentry shall be constructed from a regex_traits object, and tested to be not equal to null, before any of the member functions of that object other than length, getloc, and imbue shall be called. Type sentry performs implementation defined initialization of the traits class object, and represents an opportunity for the traits class to cache data obtained from the locale object.”

The first sentence is in passive voice, and begs the question of who shall do it: the user of the regex instance that holds the regex_traits object, or the regex instance itself. Unless the user is hacking around with a standalone instance of regex_traits, it probably ought to be the regex object that "shall" do this.

Second, sentry "performs implementation defined initialization." I think this ought to be implementation specific, not implementation defined. I don't want to have to document the details of the initialization that sentry performs.

Proposed changes: replace with the wording:

“An object of type regex_traits<charT>::sentry shall be constructed from the regex_traits object held by objects of type basic_regex, and tested to be not equal to null, before any of the member functions of that object other than length, getloc, and imbue shall be called. Type sentry performs implementation specific initialization of the traits class object, and represents an opportunity for the traits class to cache data obtained from the locale object.”

Imprecise specification of regex_traits::char_class_type

Pete Becker writes: Roughly speaking, there are three categories of character class: the ones that are supported by C and C++ locales (alnum, etc.), the additional ones for the regex proposal (d s w) and user-supplied character classes (through extensions to regex_traits). 

Is the intent of the proposal to require that for the first category, the value returned by, for example, lookup_classname("alnum") be the value alnum as defined by ctype_base::mask? (I don't care one way or the other, but we have to be clear about what's required).

John Maddock replies: I don’t think that the proposal should specify that there is any specific relationship between ctype_base::mask (and its associated values) and regex_traits::char_class_type (and its values). 

Proposed changes:

Replace:

“The type char_class_type is used to represent a character classification and is capable of holding an implementation defined superset of the values held by ctype_base::mask (22.2.1).”

with:

“The type char_class_type is used to represent a character classification and is capable of holding the implementation specific set of values returned by lookup_classname.”

Can anything other than basic_regex throw bad_expression objects?

Pete Becker writes: The text describing the class bad_expresions says it is the type of the object thrown to report errors "during the conversion from a string ... to a finite state machine." This suggests that it is not thrown by the functions that try to match a string to and a basic_regex object, and this is borne out by the throws clauses for the constructors and assignment operators for basic_regex, which say that they throw bad_expression if the string isn't a valid regular expression, and by the lack of throws clauses for regex_match, etc. 

On the other hand, error_type has two values, error_complexity and error_stack, that only occur during matching. There's no other mention of these values, so the only thing that can be done with them is for the implementation to pass them to regex_traits::error_string, and the only way the user can see the resulting string is by catching an exception. This suggests that bad_expression can also be thrown by the match functions. And the text says, in the last paragraph of RE.4, that "the functions described in this clause can report errors by throwing exceptions of type bad_expression."

So: can the various match functions throw bad_expression, and, if so, is bad_expression the appropriate name?

John Maddock replies: It is correct that those values (error_complexity and error_stack) are for the use of the regex-matching algorithms and not for the basic_regex construct and assign members.  It is also true that it should be documented that these throw, and that bad_expression is an inappropriate name for this, but how about “throws exceptions derived from runtime_error” (which is the base class for bad_expression).

Proposed changes: Add the following throw clause:

Throws: exceptions derived from std::runtime_error if either the runtime complexity or memory space required to obtain a match exceeds an implementation defined threshold.

To the following functions:

template <class BidirectionalIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(BidirectionalIterator first, BidirectionalIterator last,
                 match_results<BidirectionalIterator, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

and:

template <class BidirectionalIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
                  match_results<BidirectionalIterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

and:

template <class OutputIterator, class BidirectionalIterator, class traits,
          class Allocator, class charT>
OutputIterator regex_replace(OutputIterator out,
                           BidirectionalIterator first,
                           BidirectionalIterator last,
                           const reg_expression<charT, traits, Allocator>& e,
                           const basic_string<charT>& fmt,
                           match_flag_type flags = match_default);

Note that the other algorithm overloads are described in terms of these three main forms, so their throw clauses are assumed to be implicit here.

Unneeded basic_regex members

The following basic_regex members are redundant and should be removed:

basic_regex(const charT* p1, const charT* p2, flag_type f = regex_constants::normal,

               const Allocator& a = Allocator());

basic_regex& assign(const charT* first, const charT* last,

                       flag_type f = regex_constants::normal);

Historical note: these were present in the original Boost implementation as a workaround for non-conforming compilers and didn’t get edited out of the proposal as they should have been.

Missing basic_regex members

Pete Becker writes: The proposal has member functions named 'assign' that take argument lists that correspond to the argument lists for constructors, with two exceptions: there's basic_regex(const charT *, size_type len, flag_type), but no assign(const charT *, size_type, flag_type); and there's basic_regex(), but no assign(). Are these omissions intentional?

John Maddock replies: I don’t believe that there should be an assign() member (any more than std::string has one), I don’t see any great need for a clear() member either.  The assign(const chart*, size_type, flag_type) should be present however.

Proposed changes: add the following member to the basic_regex class synopsis:

basic_regex& assign(const charT* ptr, size_type len, flag_type f = regex_constants::normal);

The add the following description in the RE4.5 section:

basic_regex& assign(const charT* ptr, size_type len, flag_type f = regex_constants::normal);

Effects: Returns assign(string_type(ptr, len), f).

Types of match_results typedefs members

Pete Becker writes: The proposal says that match_results has a nested typedef

typedef const value_type& const_reference

Since match_results has an allocator, this should be

typedef typename allocator::const_reference const_reference

What does match_results::size() return?

Pete Becker writes: The member funtion size() returns "the number of sub_match elements stored in *this". Aside from the suggested implementation above, there are the prefix() and suffix() sub_match elements. Is the intention that size() should return the number of capture groups in the original expression, and not include those two extra sub_matches? (I think the answer is probably yes).

John Maddock adds: yes this is correct, it’s the number of capture groups plus one (because index 0 holds the whole match).

Proposed changes:

Replace:

size_type size()const;

Effects: Returns the number of sub_match elements stored in *this.

With:

size_type size()const;

Effects: Returns one plus the number marked sub-expressions in the regular expression that was matched.

What does match_results::position return when passed an out of range index?

Pete Becker writes: match_results::position() doesn't say what happens when someone asks for the position of a non-matched group. The specification says that it's distance(first1, first2), where first1 is the beginning of the target text and first2 is the beginning of the nth match. The specification for sub_match says that for a failed match the iterators have unspecified contents. Do we want this to be unspecified or undefined, or is there some meaningful value we can return?

Having looked ahead <g>, the match and search algorithms specify that non-matched groups hold iterators that point to the end of the target text.  This conflicts with the specification for sub_match, which says they're undefined. Is that text in sub_match incorrect?

John Maddock replies: I think it is reasonable to require specified values in non-matched sub-expressions, so yes the text for sub_match is wrong.  The return value from position() should be defined in all cases as well, since the returned type is signed and negative values can never be produced by matched sub-expressions, how about -1?

Proposed changes:

Changes to:

difference_type position(unsigned int sub = 0)const;

Effects: Returns std::distance(prefix().first, (*this)[sub].first).

Are covered in “Regex_iterator does not set match_results::postion correctly”.

Delete the following paragraphs from the sub_match specification:

When the marked sub-expression denoted by an object of type sub_match<> participated in a regular expression match then member matched evaluates to true, and members first and second denote the range of characters [first,second) which formed that match. Otherwise matched is false, and members first and second contained undefined values.

If an object of type sub_match<> represents sub-expression 0 - that is to say the whole match - then member matched is always true, unless a partial match was obtained as a result of the flag match_partial being passed to a regular expression algorithm, in which case member matched is false, and members first and second represent the character range that formed the partial match.

The add the following to the match_results specification, immediately after the sentence ending “except that only operations defined for const-qualified Sequences are supported.:

The sub_match<> object stored at index zero represents sub-expression 0; that is to say the whole match.  In this case the sub_match<> member matched is always true, unless a partial match was obtained as a result of the flag regex_constants::match_partial being passed to a regular expression algorithm, in which case member matched is false, and members first and second represent the character range that formed the partial match.

The sub_match<> object stored at index n denotes what matched the marked sub-expression n within the matched expression.  If the sub-expression n participated in a regular expression match then the sub_match<> member matched evaluates to true, and members first and second denote the range of characters [first,second) which formed that match. Otherwise matched is false, and members first and second point to the end of sequence that was searched.

What happens if match_results::operator[] is out of range?

Pete Becker writes: with respect to match_results::operator[]: We need to say what happens for an index out of range. Seems to me there are two reasonable possibilities: undefined behavior, or returns a no-match object.

While I strongly favor undefined behavior over artificially well-defined results, I also favor well-defined behavior when it is not too artificial. Thus, the behavior of sqrt(-2.0) is undefined; free(0) does nothing. While undefined behavior provides a convenient hook for debugging implementations, that's not its purpose, and if we can specify reasonable (which includes inexpensive) behavior we ought to do it, rather than provide another place where users can go astray.

In this case, I think I prefer to view operator[] as indexing into an unbounded array of sub_match objects. The objects at match_results.size() and above would look like failed sub-matches: their boolean flag would be false, and both their iterators would point to the end of the target string. Since we've agreed that sub_match objects for failed sub-matches need not have distinct addresses, this can be implemented by simply adding one sub_match element beyond those needed for the actual results, and returning it for an index that's otherwise out of bounds.

Proposed changes: replace:

const_reference operator[](int n) const;

Effects: Returns a reference to the sub_match object representing the character sequence that matched marked sub-expression n. If n == 0 then returns a reference to a sub_match object representing the character sequence that matched the whole regular expression.

With:

const_reference operator[](int n) const;

Effects: Returns a reference to the sub_match object representing the character sequence that matched marked sub-expression n. If n == 0 then returns a reference to a sub_match object representing the character sequence that matched the whole regular expression.  If n >= size() then returns a sub_match object representing an unmatched sub-expression.

Incorrect case insensitive match specification

The following wording:

"During matching of a regular expression finite state machine against a sequence of characters, comparison of a collating element range c1-c2 against a character c is conducted as follows: if getflags() & regex_constants::collate is true, then the character c is matched if traits_inst.transform(string_type(1,c1)) <= traits_inst.transform(string_type(1,c)) && traits_inst.transform(string_type(1,c)) <= traits_inst.transform(string_type(1,c2)), otherwise c is matched if c1 <= c && c <= c2.  During matching of a regular expression finite state machine against a sequence of characters, testing whether a collating element is a member of a primary equivalence class is conducted by first converting the collating element and the equivalence class to a sort keys using traits::transform_primary, and then comparing the sort keys for equality."

Is defective in that it does not take account of case-insensitive matches, all input characters, and all collating elements in the finite state machine should be passed through traits.inst.translate before being converted into a sort key.  The changes are trivial to make, but rather verbose...

Proposed changes: are incorporated into the following issue.

Character class extensions to ECMAScript grammar need a formal grammar

Pete Becker writes: The regex proposal adds to ECMAScript the ability to use named character classes through "expressions of the form":

[[:class-name:]]
[[.collating-name.]]
[[=collating-name=]]

This isn't sufficient. In ECMAScript the expression [[] is valid, and names a character set containing the character '['. Similarly, [[:] is also valid, and names a character set containing the characters '[' and ':'. We need to say whether these two expressions (and their analogs for collating names) are still valid. I suspect the answer is that they're not -- a '[' as the first character in a character class is a special character, which must be follwed by one of ':', '.', or '=', then a name that does not contain any of ']', ':', ".', or '=' (technically we could allow ']', but that seems unnecessarily baroque), then the appropriate close marker.

John Maddock replies: After some discussion on the c++-std-lib list it was agreed that the character class extensions were worth having (in fact they were requested after a previous committee meeting), but that they required a change to the ECMAScript grammar.

Proposed changes: In section RE.4, remove everything starting from:

“The regular expression grammar recognized by type specialization of basic_regex is described in”

and ending with:

“During matching of a regular expression finite state machine against a sequence of characters, a character c is a member of character class some_name, if traits_inst.is_class(c, traits_inst.lookup_classname("some_name")).”

and replace with the paragraph:

“The regular expression grammar recognized by class template basic_regex is that specified by Appendix XXXX”

Then add the following as an appendix to the proposal:

Appendix: Modified ECMAScript regular expression grammar

The regular expression grammar recognized by class template basic_regex is that specified by ECMA-262, ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects (FWD.1) except as specified below.

Objects of type specialization of basic_regex store within themselves a default-constructed instance of their traits template parameter, henceforth referred to as traits_inst. This traits_inst object is used to support localization of the regular expression; no basic_regex object shall call any locale dependent C or C++ API, including the formatted string input functions, instead it shall call the appropriate traits member function to achieve the required effect.

The transformation from a sequence of characters to a finite state machine is accomplished by first by transforming the sequence of characters to the sequence of tokens obtained by calling traits_inst.syntax_type(c) for each input character c. The regular expression grammar is then applied to the sequence of tokens in order to construct the finite state machine (this is to allow the regular expression syntax to be localized to a specific character set; for example to use Far Eastern ideographs, rather than Latin characters). Where traits_inst.syntax_type(c) returns syntax_escape, then the implementation shall call traits_inst.escape_syntax_type(c) for the character following the escape, in order to determine the grammatical meaning of the escape sequence. When traits_inst.escape_syntax_type(c)returns escape_type_class or escape_type_not_class, then the single character following the escape is converted to a string n, and passed to traits_inst.lookup_classname(n), to determine the character classification to be matched against. If traits_inst.lookup_classname(n) returns null, then the escape is treated as an identity escape.  The mapping of values returned by syntax_type and escape_syntax_type to characters within the ECMAScript grammar is specified in tables RE5 and RE6.

The following productions within the ECMAScript grammar are modified as follows:

CharacterClass ::

[ [lookahead . [{^}] ClassRanges ]

[ ^ ClassRanges ]


ClassAtom
::

-
ClassAtomNoDash
ClassAtomExClass
ClassAtomCollatingElement
ClassAtomEquivalence

The following new productions are then added:

ClassAtomExClass::
  [: ClassName :]

ClassAtomCollatingElement::
  [. ClassName .]

ClassAtomEquivalence::
  [= ClassName =]

ClassName::
  ClassNameCharacter
  ClassNameCharacter ClassName

ClassNameCharacter::
  SourceCharacter but not one of "." "=" ":"

The productions ClassAtomExClass, ClassAtomCollatingElement and ClassAtomEquivalence provide the equivalent functionality to the same features in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Base Definitions and Headers, Section 9, Regular Expressions.

The regular expression grammar may be modified by any regex_constants::syntax_option_type flags specified when constructing an object of type specialization of basic_regex according to the rules in table RE.3.1.1.

A ClassName production when used in ClassAtomExClass is not valid if, when the value returned by traits_inst.lookup_classname for that name is zero.  If this is not the case then a bad_expression exception shall be thrown.  The range of values recognized as a valid ClassName is determined by the type of the traits class, but at least the following names shall be recognized: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit, d, s, w.  In addition the following expressions shall be equivalent:

\d and [[:digit:]]

\D and [^[:digit:]]

\s and [[:space:]]

\S and [^[:space:]]

\w and [_[:alnum:]]

\W and [^_[:alnum:]]

The results from multiple calls to traits_inst.lookup_classname can be bitwise OR'ed together and subsequently passed to traist_inst.is_class.

A ClassName production when used in a ClassAtomCollatingElement production is not valid if the value returned by traits_inst.lookup_collatename for that name is an empty string.

A ClassName production when used in a ClassAtomEquivalence production is not valid if, when the value returned by traits_inst.lookup_collatename for that name is an empty string or if the value returned by traits_inst.transform_primary for the result of the call to traits_inst.lookup_collatename is an empty string.

When the sequence of characters being transformed to a finite state machine contains an invalid class name the translator shall throw an exception object of type bad_expression.

Where the regular expression grammar requires the conversion of a sequence of characters to an integral value, this is accomplished by calling traits_inst.toi.

The behavior of the internal finite state machine representation, when used to match a sequence of characters is as described in ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects (FWD.1). The behavior is modified according to any match_flag_type flags specified (RE.3.1.2) when using the regular expression object in one of the regular expression algorithms (RE.7). The behavior is also localized by interaction with the traits class template parameter as follows:

During matching of a regular expression finite state machine against a sequence of characters, two characters c and d are compared using traits_inst.translate(c, getflags() & regex_constants::icase) == traits_inst.translate(d, getflags() & regex_constants::icase).

During matching of a regular expression finite state machine against a sequence of characters, comparison of a collating element range c1-c2 against a character c is conducted as follows: if getflags() & regex_constants::collate is true, then the character c is matched if traits_inst.transform(string_type(1,traits_inst.translate(c1, flags() & icase)) <= traits_inst.transform(string_type(1,traits_inst.translate(c, flags() & icase))) && traits_inst.transform(string_type(1,traits_inst.translate(c, flags() & icase))) <= traits_inst.transform(string_type(1,traits_inst.translate(c2, flags() & icase))), otherwise c is matched if c1 <= c && c <= c2.

During matching of a regular expression finite state machine against a sequence of characters, testing whether a collating element is a member of a primary equivalence class is conducted by first converting the collating element and the equivalence class to a sort keys using traits::transform_primary, and then comparing the sort keys for equality.

During matching of a regular expression finite state machine against a sequence of characters, a character c is a member of character class some_name, if traits_inst.is_class(c, traits_inst.lookup_classname("some_name")).

Imprecise Specification of regex_replace

Pete Becker writes:

RE.7.3 (regex_replace) says:

Finds all the non-overlapping matches 'm' of type match_results<BidirectionalIterator> that occur in the sequence [first, last).

Having found them or not, it then writes stuff depending on its arguments.  It's not clear, though, what "non-overlapping matches" are. It took me about five minutes to convince myself that these are matches of the complete expression, and not matches of internal capture groups (which would always overlap the full match). I think a footnote is sufficient for this. More important, though, is what happens when matches overlap. Suppose we're searching for "aba" in the text "ababa". There are two matches: the first three characters match, and the last three match. These two matches overlap. Do we discard them both? Keep the first? Keep the second? My guess is that the intention is to keep the first one, but we need to say so.

John Maddock replies: thinking about this, I think the text that specifies how multiple matches are enumerated should be in one place only (and obviously be completely precise), currently regex_iterator has the text for this, so I think it makes sense to specify the behavior of regex_replace in terms of that type, even though regex_iterator hasn’t actually been described yet when regex_replace is encountered.

Proposed changes:

Replace the following clause:

Effects: Finds all the non-overlapping matches m of type match_results<BidirectionalIterator> that occur within the sequence [first, last). If no such matches are found and !(flags & format_no_copy) then calls std::copy(first, last, out). Otherwise, for each match found, if !(flags & format_no_copy) calls std::copy(m.prefix().first, m.prefix().last, out), and then calls m.format(out, fmt, flags). Finally if !(flags & format_no_copy) calls std::copy(last_m.suffix().first, last_m,suffix().last, out) where last_m is a copy of the last match found. If flags & format_first_only is non-zero then only the first match found is replaced.

With:

Effects: Constructs an regex_iterator object: regex_iterator<BidirectionalIterator, charT, traits, Allocator> i(first, last, e, flags), and uses i to enumerate through all of the matches m of type match_results<BidirectionalIterator> that occur within the sequence [first, last). If no such matches are found and !(flags & format_no_copy) then calls std::copy(first, last, out). Otherwise, for each match found, if !(flags & format_no_copy) calls std::copy(m.prefix().first, m.prefix().last, out), and then calls m.format(out, fmt, flags). Finally if !(flags & format_no_copy) calls std::copy(last_m.suffix().first, last_m,suffix().last, out) where last_m is a copy of the last match found. If flags & format_first_only is non-zero then only the first match found is replaced.

Section 2: Issue list

What is an invalid/empty regular expression?

Pete Becker writes: The default constructor for basic_regex constructs an object for which empty() returns true, i.e. the object does not contain a valid regular expression. The consequences of this aren't clear. For example, regex_match (RE.7.1) doesn't say what happens if the basic_regex object passed to it does not contain a valid regular expression. 

Would it make sense for the default constructor to construct an object as if the user had used basic_regex("")? That way the object holds a valid (although trivial) expression, and the need to check for invalid regular expressions goes away, because the rest of the ctors throw exceptions for invalid regular expressions.  I generally prefer objects that do reasonable things, even if those things are useless. It's not hard for an implementation to handle "" efficiently, and that eliminates an oddball case that otherwise requires explicit wording in several places -- if the default constructor creates an object with an invalid expression then copy constructor, assign, and operator= all need to say what happens when they're passed one of these. Currently they say that the postcondition is empty() == false, and they'd have to add a qualification for default-constructed objects. Same for anything that uses a basic_regex object. It's simpler to not have to say anything about invalid regular expressions.

John Maddock replies: I accept it's simpler, but I'm still not sure that it's a good idea, consider: 

The user default constructs a std::regex object, and the code then goes off on one of several possible execution paths, later the user tries to use the expression, and gets very strange behavior (a match is found at every single position in the string), how would that user track down the bug in their program that leaves the expression uninitialized?  It would be much easier if they could just assert(!my_expression.empty()), or better still if the implementation of regex_match/regex_search did just that. 

Note also that invalid regular expression objects can be produced by basic_regex<>::assign (if the assign encounters an invalid expression it throws, but what kind of object is left behind is unspecified - unless we insist on the strong exception guarantee it will be empty(). 

Finally consider the following sequence of actions: 

std::regex e; //1
e.imbue(my_locale); //2
e.assign(my_expression);

If statement 1 creates a valid regex, then its traits class will presumably be called upon to initialize itself (which may involve caching lots of locale specific data), statement 2 then wipes out all that work done. 

I do agree though that as the text stands, it needs an audit to fix unspecified behavior when the expression is invalid.

Proposed changes: At the very least we must indicate what happens to the existing object in the event that basic_regex::assign throws.

To the effects clause for:

template <class string_traits, class A>
basic_regex& assign(const basic_string<charT, string_traits, A>& s, 

                    flag_type f = regex_constants::normal);

Add the sentence:

“In the event that an exception is thrown, then the basic_regex object shall remain unchanged.”

Then we must either indicate that the basic_regex default constructor is equivalent to the regular expression “”, or we must add the following precondition:

Precondition: !e.empty()

to the following algorithms (other overloads inherit this precondition implicitly since they are described in terms of the ones below):

template <class BidirectionalIterator, class Allocator, class charT,

          class traits, class Allocator2>

bool regex_match(BidirectionalIterator first, BidirectionalIterator last,

                 match_results<BidirectionalIterator, Allocator>& m,

                 const reg_expression<charT, traits, Allocator2>& e,

                 match_flag_type flags = match_default);

 

template <class BidirectionalIterator, class Allocator, class charT,

          class traits, class Allocator2>

bool regex_search(BidirectionalIterator first, BidirectionalIterator last,

                  match_results<BidirectionalIterator, Allocator>& m,

                  const reg_expression<charT, traits, Allocator2>& e,

                  match_flag_type flags = match_default);

 

template <class OutputIterator, class BidirectionalIterator, class traits,

          class Allocator, class charT>

OutputIterator regex_replace(OutputIterator out,

                           BidirectionalIterator first,

                           BidirectionalIterator last,

                           const reg_expression<charT, traits, Allocator>& e,

                           const basic_string<charT>& fmt,

                           match_flag_type flags = match_default);

Regular expression constructor language

Pete Becker writes: Probably editorial. For the basic_regex ctor that takes a const charT *p, the proposal says: 

Effects: Constructs an object of class basic_regex; the object's internal finite state machine is constructed from the regular expression contained in the null-terminated string p...

p is not a null-terminated string. It is a pointer. The analogous phrasing for basic_string is:

Effects: Constructs an object of class basic_string and determines its initial string value from the array of charT of length traits::length(s) whose first element is designated by s ...

We need to maintain a similar level of formalism.

Incorrect usage of “undefined”

In several places in the document the term “undefined” should be replaced by “unspecified”:

“Otherwise matched is false, and members first and second contained undefined values.”

“If the function returns false, then the effect on parameter m is undefined, otherwise the effects on parameter m are given in table RE18”

“If the function returns false, then the effect on parameter m is undefined, otherwise the effects on parameter m are given in table RE19”

Incorrect usage of “implementation defined”

In several places in the document the term “implementation defined” should be replaced by either “implementation specific” or “unspecified”:

“Type sentry performs implementation defined initialization of the traits class object, and represents an opportunity for the traits class to cache data obtained from the locale object.”

char_class_type lookup_classname(const string_type& name) const;

Effects: returns an implementation defined value that represents the character classification name”

Returns: converts f into a value m of type ctype_base::mask in an implementation defined manner”

Effects: constructs an object result of type int. If first == last or if is_class(*first, lookup_classname("d")) == false then sets result equal to -1. Otherwise constructs a basic_istream<charT> object which uses an implementation defined stream buffer type which represents the character sequence [first,last), and sets the format flags on that object as appropriate for argument radix.”

Are sub_match objects all unique?

Pete Becker writes: Are sub_match objects for non-matched capture groups required to be distinct? I can picture a match_type implementation that holds sub_match objects only for the capture groups that matched, and returns a generic no-match object for others. Is this intended to be legal? (My inclination is that it ought to be allowed, because I don't see any good reason not to allow it).

John Maddock replies: there is nothing to explicitly disallow this in the proposal, so yes it is allowed, and no I don’t see any reason to disallow it either.

Proposed changes: Either none, or add a non-normative note.

How are Unicode escape sequences handled?

Pete Becker writes: ECMA-Script supports character escapes of the form "\uxxxx", where each 'x' is a hex digit. Each such escape sequence represents the character whose code point is the value of 'xxxx' translated to a number in the usual way.  What do such character escapes mean when the character type for basic_regex is too small to hold that value? Do we intend to require multi-byte support here (I hope not)? Or is such a value invalid when the target character type is too small?

John Maddock adds: this sparked off a furious discussion about Unicode support in C++ generally, it was clear that:

  1. Unicode support in C++ is highly desirable.
  2. basic_regex probably isn’t the place to fix this – we need to review std::string, std::locale etc first.

For now, we should assume that if the underlying character type isn’t wide enough to hold a specified Unicode code point, then it should be treated as a syntax error.

Meaning of the match_partial flag

From Pete Becker:

RE.3.1.2 says that the match_partial flag

Specifies that if no match can be found, then it is acceptable to return a match [from, last) where from!=last, if there exists some sequence of  characters [from,to) of which [from,last) is a prefix, and which would result in a full match.

Taking this literally, if I have the expression "a(?=b)(?!b)" and try to match it against "a", the partial match must fail, because the two assertions are contradictory. Is the matcher really required to do this sort of analysis of the expression, and determine that there is no possible continuation that could succeed?

From the name, I would think that partial_match would mean, roughly, that if you reach the end of the search text but are only partway through the regular expression, that's okay.  So in the example above, the partial match would succeed. Is that what's intended here?

John Maddock replies: Yes that is the intent; the text may be a little too strict.

Proposed change: Add the following text to the end of the above clause:

[Note – implementations are not required to go to heroic efforts to determine whether a partial match is truly possible, there may be combinations of text and regular expression where ruling out an impossible partial match would require excessive effort – end note]

Name of regex_traits::is_class

Pete Becker writes: That name is confusing. I'd prefer inclass, or some variant. The function takes two arguments: a character and a character class, and tells you whether the character belongs to the class. is_class sounds too much like querying whether some object represents a character class.

John Maddock replies: ctype uses “is” for the equivalent function but that is too short, maybe “is_class_member” or something, but finding a good name is hard (as ever).

No proposed change at this time; more suggestions or comments are needed.

Can traits::error_string be simplified?

Pete Becker writes: In the proposal, the template regex_traits has a member function error_string that takes an error code that indicates what error occurred and returns a string corresponding to that error, which is then used as the argument to the constructor for an exception object. Seems to me it would be simpler to have regex_traits simply provide a function that throws the exception, called with the error code. Is this string needed for anything else?

John Maddock replies: The only use I can think of is if you implement C API's on top of the C++ library (as boost.regex does with the POSIX regex API's).  These have a separate API that returns the error string from a numeric ID.  However from the point of view of the proposal I think you are correct, it could be changed to just throw.

Proposed changes: none are presented here, since the proposal is not actually defective, it would be easy enough to change if there is any strong feelings one way or another.

Can traits::translate be improved?

Pete Becker writes: The regex_traits member function 'translate' is used when comparing a character in the pattern string with a character in the target string. It takes two arguments: the character to translate, and a boolean flag that indicates whether the translation should be case sensitive. So two characters are equal if

translate(pch, icase) == translate(tch, icase)

So with pattern text of "abcde", checking for a match would look something like this:

for (int i = 0; i < 5; ++i)

   if (translate(pch[i], icase) == translate(tch[i], icase))

      return false;

return true;

The implementation of regex_traits::translate in the library-supplied traits class is:

return (icase ? use_facet<ctype<charT> >(getloc()).tolower(ch) : ch);

There's potential for a significant speedup, though, if case sensitive and case insensitive comparisons go through two different functions. The obvious transformation of the preceding loop would be:

if (icase)

   for (int i = 0; i < 5; ++i)

      if (translate_ic(pch[i]) == translate_ic(tch[i]))

         return false;

else

   for (int i = 0; i < 5; ++i)

      if (translate(pch[i]) == translate(tch[i]))

         return false;

return true;

For the default regex_traits class, the calls to translate in the second branch of the if statement would be inline calls to a translate function that simply returns its argument, so the loop turns into a sequence of direct comparisons, with no distractions from the possibility of case insensitivity. Further, since case sensitivity is determined by a flag that's set at the time the regular expression is compiled, one of the two branches of the outer if statement will always be unnecessary.

I made up the names 'translate_ic' and 'translate' for this e-mail. I'm not suggesting that we use them.

John Maddock replies: I’ve no object to this, but the speedup may not be as big as you think, an earlier Boost.regex implementation (actually before it was part of Boost), used something similar, and the difference between the two is pretty minimal compared to other costs involved – such as the number of states matched.  The change also involves quite a bit of redrafting.

Improving on traits::toi

Pete Becker writes: It says, in part:

If first == last or if is_class(*first, lookup_classname("d")) == false then sets result equal to -1.

And "d" by default is the digits 0-9. Since the radix for the conversion can be 8, 10, or 16, the condition involving "d" isn't right. For a hex value it precludes the value 'a0'. For an octal value it allows '90', but the ensuing conversion will fail. We need to find a different way to express this. The idea is to return -1 on a failed conversion, and the appropriate unsigned value on success.

And further: I'm starting to think that toi is too high level an interface. Regular expression grammars go character by charcter. For example, the value of a HexEscapeSequence (\xhh) is "(16 times the MV of the first hex digit) plus the MV of the second HexDigit". toi (hypertechnically) doesn't require that. In order to implement the specification literally, the regex parser needs to translate individual characters, not groups of characters, into values, and accumulate those values as appropriate. Thus, regex_traits ought to provide int value(charT ch),which returns -1 if isxdigit(ch) is false, otherwise the numeric value represented by the character.

And: I've just implemented it. Here are the changes I made:

1.      I removed escape_type_backref and escape_type_decimal

2.      I added escape_type_numeric (0-9)

3.      I added int regex_traits::value(charT ch, int base)

The first two aren't technically necessary for this change, but escape_type_backref is a bit misleading. ECMAScript doesn't restrict the number of capture groups, so \10 can be a valid back reference. This means that escape_type_backref alone isn't sufficient. So I figured it's enough to know that you're starting a numeric constant (i.e. escape_type_numeric), and then you can use value() == -1 to determine when you've reached the end of a constant.

The second argument to value is needed in order to decide whether the character is a valid digit for the base. value returns -1 for an invalid digit, and the (unsigned) numeric value for a valid digit.

John Maddock adds: I’ve no objections to the change, but it involves a fair bit of redrafting, it should be dealt with in an overall re-drafting of the traits class section when we have a complete set of changes.

Improving on traits::lookup_classname

Pete Becker writes: I think this needs a change in specification. It returns a value that identifies the named character class identified by its string argument. The cases I'm concerned about are the ones with names like [:alnum:]. When the code encounters the opening [: it has to scan ahead for the matching :, pick up the characters in between, stuff them into a string, and call lookup_classname. This is a lot of wheel spinning. In particular, creating the string is expensive. If lookup_classname took two iterators instead of a string it could simply look at the characters without the intervening string object.

John Maddock replies: Funnily enough Boost.regex currently does it that way, I changed it to simplify the interface.

I'm not against the change - but if we change that one then lookup_collatename should also change - and there are numerous places in the text that use: lookup_classname("string literal"), which would actually be quite tricky to reword correctly.

This should be dealt with in an overall re-drafting of the traits class section when we have a complete set of changes.