A Proposal to add Regular Expressions to the Standard Library

Doc. no.: J16/02-0044 = WG21/N1386
Date: 06 September 2002
Project: Programming Language C++
Subgroup: Library
Wiki: http://www.research.att.com/~ark/cgi-bin/lwg-wiki/wiki?RegularExpressions
Reply to: John Maddock <john_maddock@compuserve.com>

Forward

This proposal is based heavily on the Boost regular expression library, although GRETA is also discussed. The proposal represents widespread existing practice; the Boost regex library is in use on over twenty compiler/platform combinations, and is one of the most visited libraries on the Boost web site. The proposal is necessarily long and quite detailed and covers many difficult and contentious issues. The intent is to provide a concrete "straw man" proposal for discussion along with a clear discussion of the design issues involved, and not necessarily the final form that a proposal will take. In short, regard this as target practice!

Motivation
   Target Audience
Impact on the standard
Design
   Summary of key design decisions
   Summary of key limitations imposed by the design
   Regular expression syntax (and complexity)
      Regular expression algorithms
   Regular expression representation
   Regular expression traits classes
   Representing regular expression matches
   Algorithms
   Repeated regular expression searches
   Iterator requirements
Proposed text
   Regular Expression Traits
   Header <regex> synopsis
   Supporting Types
   Template class basic_regex
   Template class sub_match
   Template class match_results
   Regular expression algorithms
   Regular expression Iterators

I. Motivation

Regular expressions should be familiar to many programmers; a regular expression is a string used as a pattern to select specific strings from a set of character strings. Regular expressions are used by many familiar unix utilities, including egrep, grep, lex, and sed[1], and are supported directly in many programming languages including awk[1], perl[2], C#[3], and ECMAScript[4] (also known as JavaScript), and indeed are often used ubiquitously wherever text processing occurs.

Until recently most regular expression libraries were implemented in C, no doubt helped by the fact that the POSIX standard provides a standardised regular expression API[1]. Well known C implementations include the GNU regex library[5], the RX library[6], Henry Spencer's regex library[7] (which forms part of the BSD distribution), and PCRE[8] (for perl like regular expressions). Most early C++ implementations were simple wrappers around the C API's[9][10], although these add a measure of safety (by using the Resource Acquisition is Initialisation principle), as well an object oriented interface. However wrapper classes and C API's alike suffer from a key defect when interfacing with the C++ standard library: they are limited to searching narrow character strings held in a const char*.

Native C++ implementations can overcome the limitations suffered by C libraries, and there are two main candidates for standardisation: the Boost regex library[11] (formerly regex++), and GRETA[12]. Both these libraries support narrow and wide character regular expressions, and iterator-based searches. GRETA implements the regular expression algorithms as member functions of the regular expression class, while the Boost library make these separate non-member functions. GRETA is based on perl regular expressions, where as the Boost library uses POSIX regular expression matching rules, albeit with some perl compatible extensions.

Target audience

The are essentially three possible target audiences for a regular expression library:

End Users: regular expression may be used by the end user of a program - typically this would be in a text editor for search (and replace) operations. These users would require that the expression is localised to use their native character set, and never exhibit the kind of exponential search times that some regular expression engines can produce under worst case conditions. Typically this means that the regular expression recognised by the engine has to be restricted to some kind of safe subset of the syntax recognised by programming tools such as perl.

Programmers: this section includes all users of scripting languages like perl[2], awk[1], and ECMAScript[4]. In these cases the regular expressions are authored by the programmer, and these are converted to a state machine at run time. Typically these users require an expressive regular expression syntax, at the expense of worst case performance. Localisation of the syntax is not normally required, although the state machine itself should usually honour the run time locale (for example when matching an alphanumeric character).

Programmers (code generators): this section of users include those using tools like lex[1], which convert one or more regular expressions into a C or C++ source file. Here the expressions are written by the programmer, and then converted to a state machine at compile time. There is no localisation, either of the syntax or of the way in which the state machine behaves at runtime. The generated state machines are usually highly optimised; as a result the regular expressions may take some time to be analysed by the generator program, and are typically quite restricted in the syntax they accept.

This proposal concentrates on the second group; that is to say a rich regular expression syntax, generally used by programmers, and interpreted at run time. However, thought has been given to the other possible user types, and the design by no means rule these out completely.

II. Impact on the Standard

This is a pure library proposal, it does not add any new language features, or alter any existing standard library headers.

III. Design

Overview

This proposal consists of a pair of classes, and a set of algorithms that operate on instances of those classes. Regular expressions are represented by instances of the basic_regex class template, matches against a regular expression are represented by instances of the match_results class template. The algorithms regex_match and regex_search take as input a regular expression and a character string, and fill in a match_results structure if a match is found. For search and replace operations the algorithm regex_replace performs formatted search and replace on a character string, producing a new string as output. Finally, for repeated find operations, there are two iterator types; one enumerates a sequence of full regular expression matches, the other a sequence of strings.

Summary of key design decisions

Regular expression syntax: ECMAScript (discussion).

Complexity: None specified (discussion).

Regular Expression Object: A template class modelled after basic_string; the regular expression object does not have access to the iterator type that will be searched (discussion).

Traits class: Regular expression objects may be customised via a traits class template parameter (TODO).

Character types: both narrow and wide character strings are supported.

Representation of regular expression matches: a container of sub-expression matches, what matched each marked sub-expression in the regular expression is reported back to the user (discussion).

Regular expression algorithms: iterator based non-member functions for searching for a match, and for finding an exact match (discussion).

Search and replace: ECMAScript-like search and replace via a non-member function to transform one string to another (discussion).

Repeated searches: an iterator type to enumerate all of the regular expression matches within another iterator sequence (discussion).

Splitting a string: an iterator type that enumerates one or more strings for each regular expression match within another iterator sequence (discussion).

Summary of key limitations imposed by the design

Regular expression syntax: whatever syntax is chosen will severely limit the choice of state machine used along with any complexity guarantees given. The choice of ECMAScript regular expressions implies that some regular expressions can only be matched by a backtracking Non-deterministic Finite Automata (NFA), other algorithms may be used for some subsets of ECMAScript regular expressions (discussion).

Regular expression object type: the fact that the regular expression object is templated on the character type, and not on an iterator type, may make some implementation choices more difficult. In practice this choice seems to have no impact on performance (discussion).

Regular expression traits class: the traits class design will be a major constraining factor on the kinds of internal representations that a regular expression state machine may use. No design is presented here, and both GRETA and Boost designs are considered defective (they are sufficient for their own needs but are not necessarily flexible enough), more work need to be done here.

Support for wide characters: much regular expression theoretical work is based on small finite character sets, and can therefore not be applied to wide character strings, however wide character string support is considered essential. This constraint may be mitigated to some degree by providing a specialisation for basic_regex<char> and then overloading its associated non-member algorithms (discussion).

Sub-expression matching: User feedback suggests that support for matching marked sub-expressions is essential. However, this can have a negative impact on the kinds of algorithms that may be employed. This is mitigated to some degree in this proposal by providing overloaded forms of the key algorithms that do not report what matched, allowing alternative algorithms to be used in these cases for some expressions (discussion).

Iterator type: support for forward iterators would allow for multibyte character support via iterator adapters, however this would also preclude some perl-like features, and may have negative performance implications. This proposal requires support for random access iterators, however this choice needs more investigation and may need to be changed at a later date (discussion).

Regular expression syntax (and complexity)

The complexity of converting a character string to a state machine, and the complexity of searching for a match to that machine, are deeply dependent upon the syntax used to represent the regular expression. There are three ways in which the regular expression syntax can be specified:

Leave the manner in which the string representation is interpreted as implementation defined. This makes it impossible for programmers to include portable regular expressions in their code.
Define completely within the standard how the regular expression finite state machine is interpreted. This option incurs a lot of work, and possibly duplicates information that has already been standardised elsewhere.
Make a normative reference to another standard. There are only really two pertinent standards available: IEEE Std 1003.1-2001: POSIX, Chapter 9 Regular Expressions[1], and ECMA-262: ECMAScript Language Specification[4].

The following brief summary indicates the main differences between the regular expression syntaxes that have been standardised. It is assumed that the reader is already familiar with one or more of these.

POSIX-basic

The basic POSIX regular expression syntax is most often encountered in Unix utilities such as grep or sed.

Only the characters .[\*^$ are special. Other features are accessed as escape sequences, for example grouping is accomplished with $ and $, quantified repeats using \{ and \} and alternation with \|.

Bracket expressions (contained in []), have access to the full range of POSIX features including named collating elements and equivalence classes, however escape sequences within bracket expressions are not valid (unlike in perl).

There are no + or ? quantifiers.

Back-references to previously marked sub-expressions (with $ and $) are allowed. This last option means that pattern recognition is almost certainly an NP-Complete problem[13].

Expressions are matched using the "leftmost longest" rule.

POSIX-extended

The extended POSIX regular expression syntax is used by the Unix utilities egrep and awk[1].

The extended POSIX regular expression syntax is generally more natural in use than the basic syntax, the characters .[]{}+*?|\^$ are all special, escape sequences are used to make special characters ordinary (in other words there are no special escape sequences defined by the POSIX standard, however such sequences are allowed as compatible extensions).

Expressions are matched using the "leftmost longest" rule[1].

Back-references are not allowed in POSIX extended regular expressions, as a result a POSIX extended regular expression can be converted to a Deterministic Finite Automata (DFA) if required.

The following features are supported by both POSIX-extended and POSIX-basic regular expressions but not by ECMAScript:

Named collating elements, for example "[[.left-square-bracket.][.right-square-bracket.]]" would match either '[' or ']'.
Primary equivalence classes, for example "[[=a=]]" would match any character whose primary collation key is the same as the primary collation key for 'a', i.e. 'à', 'á', 'â', 'ã', 'ä', 'å', 'A', 'À', 'Á', 'Â', 'Ã', 'Ä', or 'Å'.

ECMAScript

ECMAScript[4] is based on the perl regular expression syntax[2], which in turn is loosely based on the POSIX extended syntax[1]. ECMAScript is also known as JavaScript or JScript.

Unlike the POSIX regular expression syntaxes, ECMAScript does not use the "leftmost longest" rule to determine the best possible match, instead the regular expression is treated as a directed graph, and a depth first search is performed, with the first match found being returned. The following examples illustrate the main differences between the two:

Expression	Text	POSIX leftmost longest match	ECMAScript depth first search match
`a\|ab`	`xaby`	`"ab"`	`"a"`
`.([[:alnum:]]+).`	`" abc def xyz "`	$0 = " abc def xyz " $1 = "abc"	$0 = " abc def xyz " $1 = "z"
`.*(a\|xayy)`	`zzxayyzz`	`"zzxayy"`	`"zzxa"`

These differences between ECMAScript matching rules, and POSIX matching rules, mean that these two regular expression syntaxes differ not only in the features offered, but also in the form that the state machine takes and/or the algorithms used to traverse the state machine.

The following features are supported by ECMAScript, but not by the POSIX-basic or POSIX-extended syntaxes:

Non-greedy repeats: for example "a+?" will match the shortest possible sequence of repeating a's.
Non-marking parenthesis: the form "(?:sub)" can be use to logically group the expression "sub" without marking it for future reference.
Forward lookahead asserts: the form "(?=sub)" matches zero characters that are followed by "sub", and the form "(?!sub)" matches zero characters that are not followed by "sub".
The word boundary assertions \b and \B match zero characters at a word boundary, or not at a word boundary respectively.

The main advantage offered by ECMAScript is that the richer syntax makes it very much easier to author patterns that will match sections of mark-up languages. For example to match a "…" section using the POSIX syntax is much more complicated than it first seems, the obvious expression:

"<font[^>]*>.*</font>"

would match everything from the first occurrence of "<font" to the last occurrence of "", including any intervening or tags. It is possible to do the right thing using something much more complex like:

"<font[^>]*>([^<]|<[^/]|</[^f]|</f[^o]|</fo[^n]|</fon[^t]|</font[^>])*</font>"

which is not especially memorable (nor even a general solution). Where as the ECMAScript equivalent:

"<font[^>]*>.*?</font>"

is very much easier. I've seen suggestions that a POSIX-like syntax, but using a "leftmost shortest" matching rule would be a better a way of handling mark-up languages[14], but while this is true, it still doesn't approach the flexibility of ECMAScript in this important area.

Perl

Perl 5 regular expressions offer the same facilities as ECMAScript, as well as several additional features such as independent sub-expressions, zero width look-behind assertions, and interpreted sub-expressions. The latter feature allows recursive regular expressions; something that can only be fully implemented when regular expressions are an intrinsic part of the language, and have access to the program's variables. Given that perl isn't standardised, the perl syntax probably isn't a good candidate for the C++ standard. It is also worth noting that the perl regular expression syntax will undergo a complete rewrite in perl 6[15].

Recommendation

This proposal recommends that support for ECMAScript syntax is required, while support for the POSIX syntax is optional. This choice is different to the one made by Boost, but is based on a very clear bias in user expectations (and feedback), in particular for features like "non-greedy repeats" which simply do not fit well into the POSIX "leftmost longest" model. This choice is also the same as that made by almost all of the newer programming languages: not only perl and JavaScript, but also Python, PHP, and Microsoft's .NET framework.

Regular expression algorithms

The main state machine representations and algorithms used for implementing regular expressions are shown below, along with their complexities [16][17]:

Finite state machine / algorithm	Size of machine for M character expression	Pattern recognition complexity, for an N character text, and S states.	Comments
DFA	O(2^M)	O(N)	May have a high cost for the construction of the finite state machine, typically only used by code generators such as lex.
Bit-parallel non-backtracking NFA	O(M) for most expressions, but O(2^M) for nested bounded-repeats.	O(1+(S/B))N	Only applicable to narrow character regular expressions and does not report what matched. If B (the number of bits in a machine word) is greater than S (a reasonably common experience), then performance is O(N).
Non backtracking NFA	O(M) for most expressions, but O(2^M) for nested bounded-repeats.	O(SN)	Typically used for POSIX-extended regular expression engines, especially tools like egrep where the expression is to be authored by an "end user" rather than a programmer.
Backtracking NFA	O(M)	O(2^N)	Typically used for ECMAScript regular expressions, and for POSIX-basic expressions that contain back-references.

The main lesson from this table is that unless the standard defines the regular expression grammar used, the complexity of the regular expression algorithms can not be specified. One option that could be explored is to allow for the runtime monitoring of algorithm complexity. That is, the stated algorithm complexity is enforced not by using an algorithm that has the stated complexity, but by monitoring the number of states in the machine that are visited, and throwing an exception if the number visited exceeds some threshold. This option allows for the use of richer regular expression grammars, while at the same time preventing the state machine from executing eternally (this is currently being tested with the Boost.regex library).

Please also note that for carefully limited regular expression grammars, and for narrow character strings, there are many clever algorithms and/or heuristics available with much better performance (and in some cases worst case complexity as well) than the ones given here[18][19].

This proposal does not make any performance guarantees, given that this is essentially impossible for ECMAScript regular expressions.

Regular expression representation

In most programming languages, regular expressions are represented as a character strings. Ultimately though, a regular expression is a finite state machine, the character string is just is a user-friendly representation. This leads to the first design decision: regular expressions will be represented by a type that is constructible from a character string. This allows the data structure representing the finite state machine (DFA or NFA etc) to be completely opaque to the end user. This form also ensures that character strings are converted to a state machine only once (whereas if regular expression were represented by basic_string then the transformation to a state machine would have to occur every time the expression was used). On the other hand, since character strings and regular expressions are so closely related, the regular expression type should "look like" a basic_string, in particular the member functions should make the underlying character string representation easily accessible. There are some limits to this; other than basic_regex<>::assign, member functions will be limited to those that do not modify the string. This limitation arises from the fact that one can not arbitrarily modify a regular expression and expect the result to still be a valid expression, nor can one make arbitrary changes to the underlying state machine without repeating any static analysis that may have been performed on the machine. One reviewer has asked for arbitrary modification to the machine to be allowed, in order to support emacs-style incremental searches. However, it is not clear what kind of impact this would have on the state machine representation, or even whether this is possible for regular expressions. As a result this is not part of this proposal, and experience with Boost so far is that not one user has ever asked for this feature.

The GRETA and Boost regular expression libraries use slightly different forms for the regular expression object. Boost uses:

template <class charT, 
          class traits = regex_traits<charT>, 
          class Allocator = allocator<charT> >
class reg_expression;

While GRETA uses:

template< typename ConstIterator, 
          typename traits = perl_syntax<iterator_traits<ConstIterator>::value_type> >
class basic_rpattern;

There are two main differences here: Boost has an Allocator template parameter, where as GRETA does not; and GRETA is templated on the iterator type that will be searched rather than on the character type. In addition, GRETA uses member functions for regular expression search and match operations, where as Boost uses non-member functions. This proposal follows the Boost interface for the following reasons:

The class will need to allocate memory: it is in effect both a container of characters, and a container of states, as such an allocator parameter is appropriate.
The Boost interface closely models that of basic_string. This is a deliberate design decision; the basic_string interface is well known and understood, following the same interface design decisions (in particular using the same template arguments) makes the library both easier to learn, and easier to teach. I believe that it is hard to overstate the importance of this.
With the Boost interface, the same regular expression object can be used with multiple iterator types, where as GRETA requires a different regular expression type for each iterator type that is to be searched. Although most programs do not in general use regular expressions with a large number of different iterator types, it is none the less annoying to have to typedef a different regular expression type for each particular iterator type to be operated upon. In contrast, most Boost users will only ever use the typedefs boost::regex or boost::wregex (see discussion below), again in a manner completely analogous to the standard string classes.
There are some implemenation techniques that offer significant performance enhancements but which can only be readily applied to narrow character strings, there is therefore a need to be able to specialize basic_regex<char> if these techniques are to be used. This is harder (though not necessarily impossible) if the interface is iterator based.
If the regular expression type is templated only on the character type, then it is possible to create a code generator that takes a regular expression string as input, and outputs C++ code containing a pre-computed state machine for that expression. In comparison if the GRETA approach were taken then the code generator would have to be recompiled for each iterator type supported. This option is not yet supported by the Boost code, but it is something that users request from time to time, often because they want the their regular expressions to be obfuscated, and not embedded in the executable as character strings.
One can imagine compilers that would (perhaps as a compatible extension), replace regular expression objects that are initialised with a string literal, with a pre-computed state machine. I believe that a compiler can only do this if the range of state machine types that it can generate is bounded (for example to narrow and wide character regular expressions), and not dependent upon a (possibly user defined) iterator type.

As with all design decisions there are some compromises involved. In particular, the GRETA implementation by default uses a recursive depth first search; each state in the machine is represented by a polymorphic object, so the iterator type needs to be known by the machine in order for iterators to be passed through virtual function calls. However there is a one to one transformation from a virtual function call, to a call through a function pointer looked up in a table indexed by the object's type. A refactored version of Boost regex uses this technique to implement the same algorithm that GRETA uses, the result has been found to be faster than either Boost or GRETA so far, and this implementation will probably be used in a future Boost version[21]. In other words, although there are some implementation choices that do not fit easily into the Boost interface, there are no algorithms that can not fit into this interface, either in theory or practice.

There is one other restriction worth pointing out; the Boost interface does not easily allow an implementation to generate self-modifying machine code for the state machine. That would require the state machine to be fixed to a finite number of iterator types, and in practice there appear to be no regex implementations that make this design choice in any case. If this option were required, then full construction of the state machine would have to be delayed until it was required (and the iterator type was known to be one that is supported), the result could then be cached in the regular expression object (and indexed by a const typeinfo*).

This proposal has made some small modifications the Boost interface where experience has indicated that the existing version is imperfect, in particular the name of the regular expression object has been changed from reg_expression to basic_regex, to make the basic_string analogy clearer. The actual class used to represent the regular expression is:

template <class charT, class traits = regex_traits<charT>, class Allocator = allocator<charT> >
class basic_regex;

In many cases there are multiple ways in which a character string can be converted to a state machine, the most important of these is to tell the state machine whether it should regard the text as case sensitive or not, but other options are possible: for example hints on optimisations and/or the expression syntax to use (for example POSIX basic or extended, perl or JavaScript etc). So in addition to a string, the form of the state machine is determined by a series of flags (represented by a bitmask type), these flags should take on well defined default values which results in any given string producing the same behaviour on all implementations.

In the Boost reference implementation these flags are defined as members of a non-template base class to basic_expression, called regbase. Originally this base class had several members in addition to these constants (in a design rather similar to std::ios_base), but as time has gone on these members have either been removed or moved to other locations (either the traits class or the basic_regex class), so that leaves a base class with no purpose other than to hold some symbolic constants. At the same time there are other symbolic constants used elsewhere by the regular expression library, which in the Boost implementation are defined at namespace scope. To tidy all this up, this proposal adds these to the nested namespace std::re, whose only purpose is to hold the symbolic constants used by the regular expression library:

namespace std{ namespace re{
typedef bitmask_type syntax_type;
// default syntax, the same on all implementations:
static const syntax_type normal = 0;
// case insensitivity:
static const syntax_type icase = 1;
// possibly other flags follow.

// Other flag types follow...
} }

One of the more controversial aspects of this proposal will be which flags to include here, as this depends heavily upon the regular expression syntax we standardise upon. As a preliminary choice, this proposal suggests that ECMAScript regular expression must be supported, and that POSIX style regular expression are an optional extra.

The proposal does not document any member functions for accessing the internal state machine representation. Obviously such member functions will exist, but their form will depend upon the implementation. In most cases these member functions will be declared private, and one or more non-member functions will be declared as friends.

When constructing a state machine, one must recognise that not all strings are valid regular expressions, in this proposal the regular expression object throws an exception derived from runtime_error when the string is not a valid expression.

Finally as with basic_string, most users will only be using one of two typedefs of basic_regex:

typedef basic_regex<char> regex;
typedef basic_regex<wchar_t> wregex;

Regular expression traits classes

TODO: the traits class design will be a major constraining factor on the kinds of internal representations that a regular expression state machine may use. No design is presented here, and both GRETA and Boost designs are considered defective (they are sufficient for their own needs but are not necessarily flexible enough). More work needs to be done here in order to ensure that the traits class offers the kinds of localisation and customisation facilities that users expect.

Representing regular expression matches

Many of the regular expression algorithms report what matched the expression. Unlike matches against a normal sequence of characters, regular expression matches can be of variable length, and can even match the null string. Furthermore, the end of sequence iterator can not be used to indicate that no match was found, as a null string at the end of a sequence can be a valid regular expression match.

Regular expressions also contain marked sub-expressions, each of which has its own "sub-match" in addition to what matched the whole expression. Further more, each of these sub-expression matches can be either matched or non-matched depending upon the context. For example, consider the expression:

(^[[:space:]]*)|([[:space:]]*$)

Here the expression contains two marked sub-expressions (which we'll call $1 and $2 using perl terminology), both the expression as a whole and the two marked sub-expressions can match variable quantities of text, and either $1 or $2 can be matched, but never both at once. The expression can also match the null string at either the end or the start of the text being searched.

In order to represent this range of possibilities, each match found must be represented as a collection of sub-expression matches, with each sub-expression match represented by a pair of iterators. A Boolean flag is needed to indicate whether a sub-expression participated in the match or not; ideally one would use singular iterators for this, but not all iterators are required to have singular values (nor are is it guaranteed by the standard that one can compare two iterators if one of them is singular), so this option is prohibited.

This proposal uses:

template <class Iterator>
struct sub_match : public std::pair<Iterator, Iterator>
{
   bool matched;

   typename iterator_traits<Iterator>::distance_type length()const;
   operator basic_string<typename iterator_traits<Iterator>::value_type>()const;
};

to represent matched sub-expressions. This is the simplest solution, however experience with Boost has shown that some users new to C++ get confused by the "pair of iterators" concept, and just assume that the first member contains a null terminated string representing the match. An alternative and more object oriented approach would be to develop this into a fully fledged substring class, perhaps as part of a wider strings library (the author has some code for this, but it is not yet ready for primetime). However there are also dangers to this approach; a substring class would still be a pair of iterators internally, there is therefore a potential for the string to which the substring refers to be destroyed before the substring object, leaving invalidated iterators behind. Of course this is still a possible problem with the "pair of iterators" approach, however at least it is obvious that they are iterators and not something that looks like it may be a "real" string.

There are two options for representing a collection of sub_matches: either to use an existing container class, or provide a new container. This proposal takes the latter option - providing a new container called match_results - for the following reasons:

The match_results type adds a number of member functions that are useful for this particular concept, these include prefix() and suffix() (for returning what perl calls $` and $'), position(n) and length(n) provide the position and length of sub-expression n, and str(n) returns a copy of sub-expression n as a string.
The match_results type does not contain any member functions for modifying its contents - that is the sole responsibility of the regular expression algorithms - and not the end user. In fact the match_results type models a "const sequence".
The regular expression algorithms may need to copy and/or cache instances of match_results internally; for some algorithms significant performance benefit may be obtained by carefully managing the memory used by the class.
A regular expression match can be reformatted to create a new string based upon a format string (using either the perl or the sed syntax); this is achieved using the format member function. This is actually a separate algorithm in the Boost regex library (called regex_format), however there appears to be no benefit in making this a non-member function.

Finally there are some typedefs of the match_results type to simplify usage:

typedef match_results<const char*> cmatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<string::const_iterator> smatch;
typedef match_results<wstring::const_iterator> wsmatch;

Algorithms

In this proposal regular expression algorithms are present as iterator-based non-member functions that operate upon instances of basic_regex and match_results. As previously discussed this maintains the independence of the regular expression type from the iterator type. Each algorithm is overloaded for three common cases: where the source text is a pair of iterators, or a null terminated string, or an instance of basic_string. In addition, the algorithms are split into two types: those that report what matched using an instance of the match_results class, and those that simply indicate whether there was a match or not. Although the latter option may not seem that useful for those used to tools such as perl, in practice it does have important uses, and can be implemented using algorithms that are significantly more efficient than those that report what matched. Indeed for long searches and a limited subset of regular expressions there are algorithms that offer better than O(N) average case performance[22]. Typical uses here include search engines (grep etc), or data input validation (where you just want to compare the string against a regular expression based filter). Note however, that the worst case performance of the "non-reporting" algorithms is still likely to be the same as the "reporting" ones, it's the real world performance that can differ. For this reason the "non-reporting" algorithms represent an opportunity for the vendor to optimise the implementation, there is no enforceable requirement that they do so (indeed Boost does not currently optimise these at present). The regular expression algorithms all also take a series of flags - expressed as a combination of bitmasks - to indicate things like whether the string being searched represents the start/end of a line etc. These flags are all defined in the nested namespace std::re, and are of type std::re::match_flag_type.

The first two algorithms are regex_search and regex_match. The former will find an occurrence of a regular expression within a string, while the latter will determine whether a regular expression matches the string exactly and is mainly used for data input validation. Note that the effect of regex_match can be obtained by using regex_search with an expression that is suitably anchored at either end. However, from the programmers point of view, the advantage of using regex_match is that it is obvious from reading the code that it requires an exact match against the expression, and not just an occurrence somewhere within the string.

The names of the algorithms in this proposal are based upon the names Boost uses and all have a "regex_" prefix, which may cause some controversy (as may the naming used in general in this proposal). This proposal does not use overly generic names like match/compare/search/find here, as the possibility for ambiguous overloads occurring should be obvious. An alternative is to use a generic name, but placed in a nested namespace (let's say std::re). This may be an appropriate option, but ambiguities can still occur either as a result of Koenig lookup, or a using directive. Many people may see the latter as an evil, but experience suggests that statements such as:

using namespace whatever;

are remarkably common in practice. For this reason the "regex_" prefixed names that Boost regex uses are retained in this proposal. Here are the twelve forms for regex_match and regex_search:

template <class RandomAccessIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(RandomAccessIterator first, RandomAccessIterator last,
                 match_results<RandomAccessIterator, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class RandomAccessIterator, class charT, class traits, class Allocator2>
bool regex_match(RandomAccessIterator first, RandomAccessIterator last,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class charT, class Allocator, class traits, class Allocator2>
bool regex_match(const charT* str, match_results<const charT*, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m, 
                 const reg_expression<charT, traits, Allocator2>& e, 
                 match_flag_type flags = match_default);
template <class charT, class traits, class Allocator2>
bool regex_match(const charT* str,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class ST, class SA, class charT, class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

template <class RandomAccessIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(RandomAccessIterator first, RandomAccessIterator last,
                  match_results<RandomAccessIterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);
template <class charT, class Allocator, class traits, class Allocator2>
bool regex_search(const charT* str, match_results<const charT*, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);
template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);
template <class RandomAccessIterator, class Allocator, class charT,
          class traits>
bool regex_search(RandomAccessIterator first, RandomAccessIterator last,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);
template <class charT, class Allocator, class traits>
bool regex_search(const charT* str
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);
template <class ST, class SA, class Allocator, class charT,
          class traits>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);

The algorithms regex_match and regex_search both support a feature not commonly seen in regular expression libraries: a partial match. When the flag std::re::match_partial is set in the flags passed to the algorithm, then a result of true may be returned if one or more characters were matched, and the state machine then ran out of text to match against while there were still states to be matched. Partial matches answer the question: "could a match have been found if more input was provided?". They have proven to be extremely useful in two situations: for data input validation, where validation occurs a character at a time as the input is entered, and for long searches through texts that are too long to be loaded into memory and therefore have to be buffered (in this case the partial match allows the code to determine whether a match could occur at an overlap between two buffers)[23].

Many tools such as perl or sed offer algorithms for performing search and replace operations on a string. These take an input text, find all occurrences of a regular expression within it, and replace whatever matched with the result of merging a format string with what matched. The equivalent in the Boost library is the algorithm regex_merge; there are two main forms, one sends the new text string to an output iterator, while the other returns a basic_string<> object. In this proposal this algorithm has been renamed regex_replace in order to make it a little more memorable:

template <class OutputIterator, class RandomAccessIterator, class traits,
          class Allocator, class charT>
OutputIterator regex_replace(OutputIterator out,
                           RandomAccessIterator first,
                           RandomAccessIterator last,
                           const reg_expression<charT, traits, Allocator>& e,
                           const basic_string<charT>& fmt,
                           unsigned int flags = match_default);
template <class traits, class Allocator, class charT>
basic_string<charT> regex_replace(const basic_string<charT>& s,
                            const reg_expression<charT, traits, Allocator>& e,
                            const basic_string<charT>& fmt,
                            unsigned int flags = match_default);

Repeated regular expression searches

It is common practice to want to find all of the (non-overlapping) occurrences of a regular expression within a text. Boost uses the algorithm regex_grep for this; this algorithm searches through the text, and for each match found calls a unary predicate with the results of the match. However there is a sense in which this is the wrong interface; basically it provides a "push" interface, where as the standard library normally uses "pull" interfaces in the form of iterators. Of course, one could just call regex_search repeatedly, however these algorithms often have to do some housekeeping tasks (memory allocation is a typical example) each time they are called. By providing a separate interface for finding all the occurrences of an expression, the implementation can ensure that these housekeeping tasks occur only once, rather than once for each occurrence of the expression. This proposal does not provide the regex_grep interface that Boost does, instead it provides a regex_iterator, whose value_type is an instance of the match_results class template.

template <class RandomAccessIterator, 
          class charT = iterator_traits<RandomAccessIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_iterator;

Another common task is a perl-like split operation; to find all the non-overlapping occurrences of an expression within a text, and for each occurrence spit out one or more new strings (usually one new string for each marked sub-expression). In Boost regex this is accomplished using the regex_split algorithm, however like regex_grep this algorithm has a "push" rather than a "pull" interface. In addition the semantics of the perl split function (on which the Boost algorithm is modelled) are rather strange; giving different behaviour depending upon whether there are any marked sub-expressions or not. For these reasons, in this proposal an iterator-based interface (regex_token_iterator) is provided that actually has a much more flexible interface than the Boost regex_split function. The iterator can be constructed with an integer argument indicating the index of the sub-expression to enumerate for each match, in which case the iterator will enumerate one string for each match found. Alternatively the iterator can be constructed from an array literal, containing a list of the sub-expressions to iterate, in which case there will be several strings enumerated for each match found.

template <class RandomAccessIterator, 
          class charT = iterator_traits<RandomAccessIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_token_iterator;

Typical usage would be for example:

extern std::string   text;
extern std::regex    re;
//
// This iterator will enumerate one string for each occurrence of re in text,
// each string will be equivalent to what perl calls $` (i.e. field splitting):
std::regex_token_iterator<std::string::const_iterator> it1(text.begin(), text.end(), re, -1);
//
// This iterator will enumerate one string for each occurrence of re in text,
// each string will be equivalent to what perl calls $& (the whole match):
std::regex_token_iterator<std::string::const_iterator> it1(text.begin(), text.end(), re, 0);
//
// This iterator will enumerate three string's for each occurrence of re in text,
// these will be what perl calls $1, $2 and $5 :
std::regex_token_iterator<std::string::const_iterator> it1(text.begin(), text.end(), re, {1,2,5});

Iterator requirements

While all the algorithms discussed so far are iterator-based, no mention has been made so far of iterator requirements. The following options are available along with their pro's and con's:

InputIterator: this is discounted as an option, input iterators can not indicate what matched (as many of the algorithms must do), nor can they be used to match back-references.

ForwardIterator: this is a real option, forward iterators can be used to indicate what matched. In addition there is in practice a real world use case: multibyte character sequences can be converted to a single wide character "on the fly" using a forward iterator adapter. Note that multi-byte sequences can not usually be represented by a bi-directional or random access iterator. There are downsides as well: requiring forward iterator support places a burden on the implementation, in particular matching boundary conditions (such as ^ or $) require that the implementation remembers the last character visited as well as the current position. In addition perl-style backtracking algorithms become much less efficient for single character repeats (a very common construct), since every backtrack-position has to be saved, rather than computed by decrementing an iterator. There are also some perl features (zero width look-behind assertions) that can not be matched using forward iterators, however these features are not required by the ECMAScript standard. It is unclear just how much of a burden forward iterator support would be in practice; some experimental evidence would be of help here.

BidirectionalIterator: This requirement is much easier to implement, but it offers no real benefits.

RandomAccessIterator: As with bidirectional iterators, this requirement is easy to implement, it also offers the potential for significant performance gains: for POSIX style matches, computing "leftmost longest" matches is much more efficient, while for perl style matches, skipping through greedy ".*" repeats are much more efficient, and for some expressions Boyer-Moore like search heuristics are possible.

Of the two template libraries discussed here, Boost requires bidirectional iterators, while GRETA doesn't document it's requirements (but appears to also require bidirectional iterators). In practice there are only two requirements worth considering - random access iterators (on grounds of efficiency), or forward iterators (on grounds of user friendliness). Experience with Boost suggests that the bidirectional iterator requirement option offers only the disadvantages of random access iterators, but without their potential performance gains. In addition, practice suggests that there are no containers that are likely to be used with character strings that have a bidirectional interface, rather than a random access one. This proposal currently documents that all iterators must be random access, this can be tightened to a requirement to support forward iterators if practice shows that this is indeed possible (in the meantime implementations should be encouraged to support the widest range of iterator types possible).

Proposed Text

Forward

FW.1 Normative References

TODO

FW.2 Definitions

TODO

Regular Expressions

This clause describes components used by C++ programs to describe regular expressions, and perform search and replace operations on character strings using those expressions.

Table RE1 - Regular Expression Summary

Clause	Header
Regular Expression Traits	<regex>
Header <regex> synopsis	<regex>
Supporting Types	<regex>
Template class basic_regex	<regex>
Template class sub_match	<regex>
Template class match_results	<regex>
Regular expression algorithms	<regex>
Regular expression Iterators	<regex>

RE.1 Regular Expression Traits

RE.1.1 Regular Expression Traits Requirements

TODO

RE.1.2 Regular expression Traits Classes

TODO

RE.2 Regular Expressions

The header <regex> defines a basic regular expression class template and its traits that can handle all char-like (lib.strings) template arguments.

The header <regex> defines a container-like class template that holds the result of a regular expression match.

The header <regex> defines a series of algorithms that allow an iterator sequence to be operated upon by a regular expression.

The header <regex> defines two specific template classes, regex and wregex and their special traits.

The header <regex> also defines two iterator types for enumerating regular expression matches.

Header <regex> synopsis

namespace std{

namespace re{

typedef bitmask_type syntax_option_type;
typedef bitmask_type match_flag_type;
typedef implementation defined syntax_type;

} // namespace re

class bad_expression;

template <class charT>
struct regex_traits;

template <class charT,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class basic_regex;

template <class charT, class traits, class Allocator>
bool operator == (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);
template <class charT, class traits, class Allocator>
bool operator != (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);
template <class charT, class traits, class Allocator>
bool operator < (const basic_regex<charT, traits, Allocator>& lhs,
                 const basic_regex<charT, traits, Allocator>& rhs);
template <class charT, class traits, class Allocator>
bool operator <= (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);
template <class charT, class traits, class Allocator>
bool operator >= (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);
template <class charT, class traits, class Allocator>
bool operator > (const basic_regex<charT, traits, Allocator>& lhs,
                 const basic_regex<charT, traits, Allocator>& rhs);

template <class charT, class io_traits, class re_traits, class Allocator>
basic_ostream<charT, io_traits>&
   operator << (basic_ostream<charT, io_traits>& os,
                const basic_regex<charT, re_traits, Allocator>& e);

template <class charT, class traits, class Allocator>
void swap(basic_regex<charT, traits, Allocator>& e1,
          basic_regex<charT, traits, Allocator>& e2);

typedef basic_regex<char> regex;
typedef basic_regex<wchar_t> wregex;

template <class RandomAccessIterator>
class sub_match;

template <class RandomAccessIterator>
bool operator == (const sub_match<RandomAccessIterator>& lhs,
                  const sub_match<RandomAccessIterator>& rhs);
template <class RandomAccessIterator>
bool operator != (const sub_match<RandomAccessIterator>& lhs,
                  const sub_match<RandomAccessIterator>& rhs);
template <class RandomAccessIterator>
bool operator < (const sub_match<RandomAccessIterator>& lhs,
                 const sub_match<RandomAccessIterator>& rhs);
template <class RandomAccessIterator>
bool operator <= (const sub_match<RandomAccessIterator>& lhs,
                  const sub_match<RandomAccessIterator>& rhs);
template <class RandomAccessIterator>
bool operator >= (const sub_match<RandomAccessIterator>& lhs,
                  const sub_match<RandomAccessIterator>& rhs);
template <class RandomAccessIterator>
bool operator > (const sub_match<RandomAccessIterator>& lhs,
                 const sub_match<RandomAccessIterator>& rhs);

template <class charT, class traits, class RandomAccessIterator>
basic_ostream<charT, traits>&
   operator << (basic_ostream<charT, traits>& os,
                const sub_match<RandomAccessIterator>& m);

template <class RandomAccessIterator,
          class Allocator = allocator<
               typename iterator_traits<RandomAccessIterator>::value_type > >
class match_results;

template <class RandomAccessIterator, class Allocator>
bool operator == (const match_results<RandomAccessIterator, Allocator>& m1,
                  const match_results<RandomAccessIterator, Allocator>& m2);
template <class RandomAccessIterator, class Allocator>
bool operator != (const match_results<RandomAccessIterator, Allocator>& m1,
                  const match_results<RandomAccessIterator, Allocator>& m2);

template <class charT, class traits, class RandomAccessIterator, class Allocator>
basic_ostream<charT, traits>&
   operator << (basic_ostream<charT, traits>& os,
                const match_results<RandomAccessIterator, Allocator>& m);

template <class RandomAccessIterator, class Allocator>
void swap(match_results<RandomAccessIterator, Allocator>& m1,
          match_results<RandomAccessIterator, Allocator>& m2);

typedef match_results<const char*> cmatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<string::const_iterator> smatch;
typedef match_results<wstring::const_iterator> wsmatch;

template <class RandomAccessIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(RandomAccessIterator first, RandomAccessIterator last,
                 match_results<RandomAccessIterator, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class RandomAccessIterator, class charT, class traits, class Allocator2>
bool regex_match(RandomAccessIterator first, RandomAccessIterator last,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class charT, class Allocator, class traits, class Allocator2>
bool regex_match(const charT* str, match_results<const charT*, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m, 
                 const reg_expression<charT, traits, Allocator2>& e, 
                 match_flag_type flags = match_default);
template <class charT, class traits, class Allocator2>
bool regex_match(const charT* str,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class ST, class SA, class charT, class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

template <class RandomAccessIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(RandomAccessIterator first, RandomAccessIterator last,
                  match_results<RandomAccessIterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);
template <class charT, class Allocator, class traits, class Allocator2>
bool regex_search(const charT* str, match_results<const charT*, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);
template <class RandomAccessIterator, class Allocator, class charT,
          class traits>
bool regex_search(RandomAccessIterator first, RandomAccessIterator last,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);
template <class charT, class Allocator, class traits>
bool regex_search(const charT* str
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);
template <class ST, class SA, class Allocator, class charT,
          class traits>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);
template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

template <class OutputIterator, class RandomAccessIterator, class traits,
          class Allocator, class charT>
OutputIterator regex_replace(OutputIterator out,
                           RandomAccessIterator first,
                           RandomAccessIterator last,
                           const reg_expression<charT, traits, Allocator>& e,
                           const basic_string<charT>& fmt,
                           match_flag_type flags = match_default);
template <class traits, class Allocator, class charT>
basic_string<charT> regex_replace(const basic_string<charT>& s,
                            const reg_expression<charT, traits, Allocator>& e,
                            const basic_string<charT>& fmt,
                            match_flag_type flags = match_default);

// regular expression iterators:
template <class RandomAccessIterator, 
          class charT = iterator_traits<RandomAccessIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_iterator;
template <class RandomAccessIterator, 
          class charT = iterator_traits<RandomAccessIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_token_iterator;

} // namespace std

RE.3 Supporting Types

RE.3.1 Namespace std::re

The namespace std::re acts as a repository for the symbolic constants used by the regular expression library.

The namespace std::re defines two types: syntax_option_type and match_flag_type, along with a series of constants of these types.

RE.3.1.1 Bitmask Type syntax_option_type

namespace std{ namespace re{

typedef bitmask_type syntax_option_type;
// these flags are required:
static const syntax_option_type normal;
static const syntax_option_type icase;
static const syntax_option_type nosubs;
static const syntax_option_type optimize;
static const syntax_option_type ECMAScript = normal;
static const syntax_option_type JavaScript = normal;
static const syntax_option_type JScript = normal;
// these flags are optional, if the functionality is supported
// then the flags shall take these names.
static const syntax_option_type basic;
static const syntax_option_type extended;
static const syntax_option_type emacs;
static const syntax_option_type awk;
static const syntax_option_type grep;
static const syntax_option_type egrep;
static const syntax_option_type sed;
static const syntax_option_type perl;

} // namespace re
} // namespace std

The type syntax_option_type is an implementation defined bitmask type (17.3.2.1.2).

TODO: Document these values.

RE.3.1.2 Bitmask Type match_flag_type

namespace std{ namespace re{

typedef bitmask_type match_flag_type;

static const match_flag_type match_default;
static const match_flag_type match_not_bol;
static const match_flag_type match_not_eol;
static const match_flag_type match_not_bow;
static const match_flag_type match_not_eow;
static const match_flag_type match_any;
static const match_flag_type match_not_null;
static const match_flag_type match_continuous;
static const match_flag_type match_partial;
static const match_flag_type format_default;
static const match_flag_type format_sed;
static const match_flag_type format_perl;
static const match_flag_type format_no_copy;
static const match_flag_type format_first_only;

} // namespace re
} // namespace std

The type match_flag_type is an implementation defined bitmask type (17.3.2.1.2).

TODO: Document these values.

RE.3.2 Class bad_expression

class bad_expression : public runtime_error
{
public:
   explicit bad_expression(const string& what_arg);
};

The class bad_expression defines the type of objects thrown as exceptions to report errors during the conversion from a string representing a regular expression to a finite state machine.

bad_expression(const string& what_arg);

Effects: Constructs an object of class bad_expression.

Postcondition: strcmp(what(), what_arg.c_str()) == 0.

RE.3.3 Template class regex_traits

TODO

RE.4 Template class basic_regex

For a char-like type charT, the template class basic_regex describes objects that represent a finite state machine constructed from a sequence of charT's which are interpreted as a regular expression. In the rest of this clause, charT denotes such a given char-like type. Storage for the regular expression is allocated and freed as necessary by the member functions of class basic_regex.

The template class basic_regex conforms to the requirements of a Sequence, as specified in (lib.sequence.reqmts), except that only operations defined for const-qualified Sequences are supported.

Objects of type specialization of basic_regex are responsible for converting the sequence of charT objects to an internal finite state machine representation. It is not specified what form this finite state machine takes, nor how it is accessed by algorithms that operate on regular expressions (implementations will typically declare some template functions as friends of basic_regex to achieve this).

[- Author's note, what follows is preliminary and based on the Boost regex_traits design, this will almost certainly change -

Objects of type specialization of basic_regex store within themselves a default-constructed instance of their traits template parameter, henceforth referred to as traits_inst. This traits_inst object is used to support localisation of the regular expression, no basic_regex object shall call any locale dependent C or C++ API, including the formatted string input functions, instead it shall call the appropriate traits member function to achieve the required effect.

Before an object of type specialization of basic_regex calls any traits member function, it shall first construct an object of type traits::sentry from it; values returned from traits member functions may be cached between calls provided traits::imbue is not called.

The transformation from a sequence of characters to a finite state machine is accomplished by first by transforming the sequence of characters to the sequence of tokens obtained by calling traits_inst.syntax_type(c) for each input character c. The regular expression grammar is then applied to the sequence of tokens in order to construct the finite state machine (this is to allow the regular expression syntax to be localised to a specific character set; for example to use Far Eastern ideographs, rather than Latin characters).

Where the regular expression grammar requires the conversion of a sequence of characters to an integral value, this is accomplished by calling traits::toi.

Where the regular expression grammar requires the conversion of a sequence of characters representing a named collating element to the character(s) it represents, this is accomplished by calling traits::lookup_collatename.

Where the regular expression grammar requires the conversion of a sequence of characters representing a named character class to an internal representation, this is accomplished by calling traits::lookup_classname. The results from subsequent calls to this function can be bitwise OR'ed together and subsequently passed to traist::is_class.

During matching of a regular expression finite state machine against a sequence of characters, two characters c and d are compared using traits_inst.translate(c, getflags() & re::icase) == traits_inst.translate(d, getflags() & re::icase).

During matching of a regular expression finite state machine against a sequence of characters, comparison of collating elements is conducted by first converting each collating element to a sort key using traits::transform, and then comparing the sort keys.

During matching of a regular expression finite state machine against a sequence of characters, testing whether a collating element is a member of a primary equivalence class is conducted by first converting the collating element and the equivalence class to a sort keys using traits::transform_primary, and then comparing the sort keys for equality.

- end author's note ]

The functions described in this clause can report errors by throwing exceptions of type bad_expression.

namespace std{

template <class charT,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class basic_regex
{
public:
   // types:
   typedef          charT                           value_type;
   typedef          implementation defined          const_iterator;
   typedef          const_iterator                  iterator;
   typedef typename Allocator::reference            reference;
   typedef typename Allocator::const_reference      const_reference;
   typedef typename Allocator::difference_type      difference_type;
   typedef typename Allocator::size_type            size_type;
   typedef          Allocator                       allocator_type;
   typedef          re::syntax_option_type          flag_type;
   typedef typename traits::locale_type             locale_type;

   // construct/copy/destroy:
   explicit basic_regex(const Allocator& a = Allocator());
   explicit basic_regex(const charT* p, flag_type f = re::normal,
                        const Allocator& a = Allocator());
   basic_regex(const charT* p1, const charT* p2, flag_type f = re::normal,
               const Allocator& a = Allocator());
   basic_regex(const charT* p, size_type len, flag_type f,
               const Allocator& a = Allocator());
   basic_regex(const basic_regex&);
   template <class ST, class SA>
   explicit basic_regex(const basic_string<charT, ST, SA>& p,
                        flag_type f = re::normal,
                        const Allocator& a = Allocator());
   template <class InputIterator>
   basic_regex(InputIterator first, inputIterator last,
               flag_type f = re::normal,
               const Allocator& a = Allocator());

   ~basic_regex();
   basic_regex& operator=(const basic_regex&);
   basic_regex& operator=(const charT* ptr);
   template <class ST, class SA>
   basic_regex& operator=(const basic_string<charT, ST, SA>& p);

   // iterators:
   const_iterator begin() const;
   const_iterator end() const;
   // capacity:
   size_type size() const;
   size_type max_size() const;
   bool empty() const;
   unsigned mark_count() const;

   //
   // modifiers:
   basic_regex& assign(const basic_regex& that);
   basic_regex& assign(const charT* ptr, flag_type f = re::normal);
   basic_regex& assign(const charT* first, const charT* last,
                       flag_type f = re::normal);
   template <class string_traits, class A>
   basic_regex& assign(const basic_string<charT, string_traits, A>& s,
                       flag_type f = re::normal);
   template <class InputIterator>
   basic_regex& assign(InputIterator first, InputIterator last,
                       flag_type f = re::normal);

   // const operations:
   Allocator get_allocator() const;
   flag_type getflags() const;
   basic_string<charT> str() const;
   int compare(basic_regex&) const;
   // locale:
   locale_type imbue(locale_type loc);
   locale_type getloc() const;
   // swap
   void swap(basic_regex&) throw();
};

} // namespace std

RE.4.1 basic_regex constructors

In all basic_regex constructors, a copy of the Allocator argument is used for any memory allocation performed by the constructor or member functions during the lifetime of the object.

basic_regex(const Allocator& a = Allocator());

Effects: Constructs an object of class basic_regex. The postconditions of this function are indicated in Table RE3:

Table RE3--basic_regex(const Allocator&) effects

Element	Value
empty()	true
size()	0
str()	basic_string<charT>()

basic_regex(const charT* p, flag_type f = re::normal, const Allocator& a = Allocator());

Requires: p shall not be a null pointer.

Throws: bad_expression if p is not a valid regular expression.

Effects: Constructs an object of class basic_regex; the object's internal finite state machine is constructed from the regular expression contained in the null-terminated string p, and interpreted according to the flags specified in f. The postconditions of this function are indicated in Table RE4:

Table RE4--basic_regex(const charT* p, flag_type f, const Allocator&) effects

Element	Value
empty()	false
size()	char_traits<charT>::length(p)
str()	basic_string<charT>(p)
getflags()	f
mark_count()	The number of marked sub-expressions within the expression.

basic_regex(const charT* p1, const charT* p2, flag_type f = re::normal, const Allocator& a = Allocator());

Requires: p1 and p2 are not null pointers, p1 < p2.

Throws: bad_expression if [p1,p2) is not a valid regular expression.

Effects: Constructs an object of class basic_regex; the object's internal finite state machine is constructed from the regular expression contained in the sequence of characters [p1,p2), and interpreted according the flags specified in f. The postconditions of this function are indicated in Table RE5:

Table RE5--basic_regex(const charT* p1, const charT* p2, flag_type f, const Allocator&) effects

Element	Value
empty()	false
size()	std::distance(p1,p2)
str()	basic_string<charT>(p1,p2)
getflags()	f
mark_count()	The number of marked sub-expressions within the expression.

basic_regex(const charT* p, size_type len, flag_type f, const Allocator& a = Allocator());

Requires: p shall not be a null pointer, len < max_size().

Throws: bad_expression if p is not a valid regular expression.

Effects: Constructs an object of class basic_regex; the object's internal finite state machine is constructed from the regular expression contained in the sequence of characters [p, p+len), and interpreted according the flags specified in f. The postconditions of this function are indicated in Table RE6:

Table RE6--basic_regex(const charT* p1, size_type len, flag_type f, const Allocator&) effects

Element	Value
empty()	false
size()	len
str()	basic_string<charT>(p, len)
getflags()	f
mark_count()	The number of marked sub-expressions within the expression.

basic_regex(const basic_regex& e);

Effects: Constructs an object of class basic_regex as a copy of the object e. The postconditions of this function are indicated in Table RE7:

Table RE7--basic_regex(const basic_regex& e) effects

Element	Value
empty()	e.empty()
size()	e.size()
str()	e.str()
getflags()	e.getflags()
mark_count()	e.mark_count()

template <class ST, class SA>
basic_regex(const basic_string<charT, ST, SA>& s,
            flag_type f = re::normal, const Allocator& a = Allocator());

Throws: bad_expression if s is not a valid regular expression.

Effects: Constructs an object of class basic_regex; the object's internal finite state machine is constructed from the regular expression contained in the string s, and interpreted according to the flags specified in f. The postconditions of this function are indicated in Table RE8:

Table RE8--basic_regex(const basic_regex& e) effects

Element	Value
empty()	false
size()	s.size()
str()	s
getflags()	f
mark_count()	The number of marked sub-expressions within the expression.

template <class ForwardIterator>
basic_regex(ForwardIterator first, ForwardIterator last,
            flag_type f = re::normal, const Allocator& a = Allocator());

Throws: bad_expression if the sequence [first, last) is not a valid regular expression.

Effects: Constructs an object of class basic_regex; the object's internal finite state machine is constructed from the regular expression contained in the sequence of characters [first, last), and interpreted according to the flags specified in f. The postconditions of this function are indicated in Table RE9:

Table RE9--basic_regex(ForwardIterator first, ForwardIterator last, flag_type f, const Allocator&) effects

Element	Value
empty()	false
size()	distance(first,last)
str()	basic_string<charT>(first,last)
getflags()	f
mark_count()	The number of marked sub-expressions within the expression.

basic_regex& operator=(const basic_regex& e);

Effects: Returns the result of assign(e.str(), e.getflags()).

basic_regex& operator=(const charT* ptr);

Requires: p shall not be a null pointer.

Effects: Returns the result of assign(ptr).

template <class ST, class SA>
basic_regex& operator=(const basic_string<charT, ST, SA>& p);

Effects: Returns the result of assign(p).

RE.4.2 basic_regex iterators

const_iterator begin() const;

Effects: Returns a starting iterator to a sequence of characters representing the regular expression.

const_iterator end() const;

Effects: Returns termination iterator to a sequence of characters representing the regular expression.

RE.4.3 basic_regex capacity

size_type size() const;

Effects: Returns the length of the sequence of characters representing the regular expression.

size_type max_size() const;

Effects: Returns the maximum length of the sequence of characters representing the regular expression.

bool empty() const;

Effects: Returns true if the object does not contain a valid regular expression, otherwise false.

unsigned mark_count() const;

Effects: Returns the number of marked sub-expressions within the regular expresion.

RE.4.4 basic_regex assign

basic_regex& assign(const basic_regex& that);

Effects: Returns assign(that.str(), that.getflags()).

basic_regex& assign(const charT* ptr, flag_type f = re::normal);

Effects: Returns assign(string_type(ptr), f).

basic_regex& assign(const charT* first, const charT* last,
                    flag_type f = re::normal);

Effects: Returns assign(string_type(first, last), f).

template <class string_traits, class A>
basic_regex& assign(const basic_string<charT, string_traits, A>& s,
                    flag_type f = re::normal);

Throws: bad_expression if s is not a valid regular expression.

Returns: *this.

Effects: Assigns the regular expression contained in the string s, interpreted according the flags specified in f. The postconditions of this function are indicated in Table RE10:

Table RE10-- basic_regex& assign(const basic_string<charT, string_traits, A>& s, flag_type f) effects

Element	Value
empty()	false
size()	s.size()
str()	s
getflags()	f
mark_count()	The number of marked sub-expressions within the expression.

template <class InputIterator>
basic_regex& assign(InputIterator first, InputIterator last,
                    flag_type f = re::normal);

Requires: The type InputIterator corresponds to the Input Iterator requirements (24.1.1).

Effects: Returns assign(string_type(first, last), f).

RE.4.5 basic_regex constant operations

Allocator get_allocator() const;

Effects: Returns a copy of the Allocator that was passed to the object's constructor.

flag_type getflags() const;

Effects: Returns a copy of the regular expression syntax flags that were passed to the object's constructor, or the last call to assign.

basic_string<charT> str() const;

Effects: Returns a copy of the character sequence passed to the object's constructor, or the last call to assign.

int compare(basic_regex& e)const;

Effects: If getflags() == e.getflags() then returns str().compare(e.str()), otherwise returns getflags() - e.getflags().

RE.4.6 basic_regex locale

locale_type imbue(locale_type l);

Effects: Returns the result of traits_inst.imbue(l) where traits_inst is a (default initialised) instance of the template parameter traits stored within the object.

locale_type getloc() const;

Effects: Returns the result of traits_inst.getloc() where traits_inst is a (default initialised) instance of the template parameter traits stored within the object.

RE.4.7 basic_regex swap

void swap(basic_regex& e) throw();

Effects: Swaps the contents of the two regular expressions.

Postcondition: *this contains the characters that were in e, e contains the regular expression that was in *this.

Complexity: constant time.

RE.4.8 basic_regex non-member functions

RE.4.8.1 basic_regex non-member comparison operators

template <class charT, class traits, class Allocator>
bool operator == (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) == 0.

template <class charT, class traits, class Allocator>
bool operator != (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) != 0.

template <class charT, class traits, class Allocator>
bool operator < (const basic_regex<charT, traits, Allocator>& lhs,
                 const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) < 0.

template <class charT, class traits, class Allocator>
bool operator <= (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) <= 0.

template <class charT, class traits, class Allocator>
bool operator >= (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) >= 0.

template <class charT, class traits, class Allocator>
bool operator > (const basic_regex<charT, traits, Allocator>& lhs,
                 const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) > 0.

RE.4.8.2 basic_regex inserter.

template <class charT, class io_traits, class re_traits, class Allocator>
basic_ostream<charT, io_traits>&
   operator << (basic_ostream<charT, io_traits>& os
                const basic_regex<charT, re_traits, Allocator>& e);

Effects: Returns (os << e.str()).

RE.4.8.3 basic_regex non-member swap

template <class charT, class traits, class Allocator>
void swap(basic_regex<charT, traits, Allocator>& lhs,
          basic_regex<charT, traits, Allocator>& rhs);

Effects: calls lhs.swap(rhs).

RE.5 Template class sub_match

Template class sub_match denotes the sequence of characters matched by a particular marked sub-expression.

template <class RandomAccessIterator>
class sub_match : public std::pair<RandomAccessIterator, RandomAccessIterator>
{
public:
   typedef typename iterator_traits<RandomAccessIterator>::value_type       value_type;
   typedef typename iterator_traits<RandomAccessIterator>::difference_type  difference_type;
   typedef          RandomAccessIterator                                    iterator;

   bool matched;

   difference_type length()const;
   operator basic_string<value_type>()const;
};

RE.5.1 sub_match members

TODO: add documentation here

RE.5.2 sub_match non-member operators

template <class RandomAccessIterator>
bool operator == (const sub_match<RandomAccessIterator>&,
                  const sub_match<RandomAccessIterator>&);
template <class RandomAccessIterator>
bool operator != (const sub_match<RandomAccessIterator>&,
                  const sub_match<RandomAccessIterator>&);
template <class RandomAccessIterator>
bool operator < (const sub_match<RandomAccessIterator>&,
                 const sub_match<RandomAccessIterator>&);
template <class RandomAccessIterator>
bool operator <= (const sub_match<RandomAccessIterator>&,
                  const sub_match<RandomAccessIterator>&);
template <class RandomAccessIterator>
bool operator >= (const sub_match<RandomAccessIterator>&,
                  const sub_match<RandomAccessIterator>&);
template <class RandomAccessIterator>
bool operator > (const sub_match<RandomAccessIterator>&,
                 const sub_match<RandomAccessIterator>&);
template <class charT, class traits, class RandomAccessIterator>
basic_ostream<charT, traits>&
   operator << (basic_ostream<charT, traits>&
                const sub_match<RandomAccessIterator>&);

RE.6 Template class match_results

Template class match_results denotes a collection of character sequences representing the result of a regular expression match. Storage for the collection is allocated and freed as necessary by the member functions of class match_results.

The template class match_results conforms to the requirements of a Sequence, as specified in (lib.sequence.reqmts), except that only operations defined for const-qualified Sequences are supported.

template <class RandomAccessIterator,
          class Allocator = allocator<sub_match<RandomAccessIterator> >
class match_results
{ 
public: 
   typedef          sub_match<RandomAccessIterator>                         value_type;
   typedef          const value_type&                                       const_reference;
   typedef          const_reference                                         reference;
   typedef          implementation defined                                  const_iterator;
   typedef          const_iterator                                          iterator;
   typedef typename iterator_traits<RandomAccessIterator>::difference_type  difference_type;
   typedef typename Allocator::size_type                                    size_type;
   typedef          Allocator                                               allocator_type;
   typedef typename iterator_traits<RandomAccessIterator>::value_type       char_type;
   typedef          basic_string<char_type>                                 string_type;

   // construct/copy/destroy:
   explicit match_results(const Allocator& a = Allocator());
   match_results(const match_results& m);
   match_results& operator=(const match_results& m); 
   ~match_results();

   // size:
   size_type size() const;
   size_type max_size() const;
   bool empty() const;
   // element access:
   difference_type length(int sub = 0) const;
   difference_type position(unsigned int sub = 0) const;
   string_type str(int sub = 0) const;
   const_reference operator[](int n) const;

   const_reference prefix() const;

   const_reference suffix() const;
   const_iterator begin() const;
   const_iterator end() const;
   // format:
   template <class OutputIterator>
   OutputIterator format(OutputIterator out,
                         const string_type& fmt,
                         match_flag_type flags = format_default) const;
   string_type format(const string_type& fmt,
                      match_flag_type flags = format_default) const;

   allocator_type get_allocator() const;
   void swap(match_results& that);
};

RE.6.1 match_results constructors

In all match_results constructors, a copy of the Allocator argument is used for any memory allocation performed by the constructor or member functions during the lifetime of the object.

match_results(const Allocator& a = Allocator());

Effects: Constructs an object of class match_results. The postconditions of this function are indicated in Table RE12:

Table RE12--match_results(const Allocator&) effects

Element	Value
empty()	true
size()	0
str()	basic_string<charT>()

match_results(const match_results& m);

Effects: Constructs an object of class match_results, as a copy of m.

match_results& operator=(const match_results& m);

Effects: Assigns m to *this. The postconditions of this function are indicated in Table RE13:

Table RE13--match_results(const Allocator&) effects

Element	Value
empty()	m.empty().
size()	m.size().
str(n)	m.str(n) for all integers n < m.size().
prefix()	m.prefix().
suffix()	m.suffix().
(*this)[n]	m[n] for all integers n < m.size().
length(n)	m.length(n) for all integers n < m.size().
position(n)	m.position(n) for all integers n < m.size().

RE.6.2 match_results size

size_type size()const;

Effects: Returns the number of sub_match elements stored in *this.

size_type max_size()const;

Effects: Returns the maximum number of sub_match elements that can be stored in *this.

bool empty()const;

Effects: Returns size() == 0.

RE.6.3 match_results element access

difference_type length(int sub = 0)const;

Effects: Returns (*this)[sub].length().

difference_type position(unsigned int sub = 0)const;

Effects: Returns std::distance(prefix().first, (*this)[sub].first).

string_type str(int sub = 0)const;

Effects: Returns string_type((*this)[sub]).

const_reference operator[](int n) const;

Effects: Returns a reference to the sub_match object representing the character sequence that matched marked sub-expression n. If n == 0 then returns a reference to a sub_match object representing the character sequence that matched the whole regular expression.

const_reference prefix()const;

Effects: Returns a reference to the sub_match object representing the character sequence from the start of the string being matched/searched, to the start of the match found.

const_reference suffix()const;

Effects: Returns a reference to the sub_match object representing the character sequence from the end of the match found to the end of the string being matched/searched.

const_iterator begin()const;

Effects: Returns a starting iterator that enumerates over all the marked sub-expression matches stored in *this.

const_iterator end()const;

Effects: Returns a terminating iterator that enumerates over all the marked sub-expression matches stored in *this.

RE.6.4 match_results reformatting

template <class OutputIterator>
OutputIterator format(OutputIterator out,
                      const string_type& fmt,
                      match_flag_type flags = format_default);

Requires: The type OutputIterator conforms to the Output Iterator requirements (24.1.2).

Effects: Copies the character sequence [fmt.begin(), fmt.end()) to OutputIterator out. For each format specifier or escape sequence in fmt, replace that sequence with either the character(s) it represents, or the sequence of characters within *this to which it refers. The bitmasks specified in flags determines what format specifiers or escape sequences are recognised.

Returns: out.

string_type format(const string_type& fmt,
                   match_flag_type flags = format_default);

Effects: Returns a copy of the string fmt. For each format specifier or escape sequence in fmt, replace that sequence with either the character(s) it represents, or the sequence of characters within *this to which it refers. The bitmasks specified in flags determines what format specifiers or escape sequences are recognised.

allocator_type get_allocator()const;

Effects: Returns a copy of the Allocator that was passed to the object's constructor.

void swap(match_results& that);

Effects: Swaps the contents of the two sequences.

Postcondition: *this contains the sequence of matched sub-expressions that were in that, that contains the sequence of matched sub-expressions that were in *this.

Complexity: constant time.

RE.7 Regular expression algorithms

RE.7.1 regex_match

template <class RandomAccessIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(RandomAccessIterator first, RandomAccessIterator last,
                 match_results<RandomAccessIterator, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

Requires: Type RandomAccessIterator meets the requirements of a Random Access Iterator (24.1.5).

Effects: Determines whether there is an exact match between the regular expression e, and all of the character sequence [first, last), parameter flags is used to control how the expression is matched against the character sequence. Returns true if such a match exists, false otherwise.

Postconditions: If the function returns false, then the effect on parameter m is undefined, otherwise the effects on parameter m are given in table RE14:

Table RE14--regex_match(RandomAccessIterator first, RandomAccessIterator last, match_results<RandomAccessIterator, Allocator>& m, const reg_expression<charT, traits, Allocator2>& e, match_flag_type flags) effects

Element	Value
m.size()	e.mark_count()
m.empty()	false
m.prefix().first	first
m.prefix().last	first
m.prefix().matched	false
m.suffix().first	last
m.suffix().last	last
m.suffix().matched	false
m[0].first	first
m[0].second	last
m[0].matched	`true` if a full match was found, and `false` if it was a partial match (found as a result of the `match_partial` flag being set).
m[n].first	For all integers n < m.size(), the start of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.
m[n].second	For all integers n < m.size(), the end of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.
m[n].matched	For all integers n < m.size(), true if sub-expression n participated in the match, false otherwise.

Complexity: depends upon the syntax chosen (TODO).

template <class RandomAccessIterator, class charT, class traits, class Allocator2>
bool regex_match(RandomAccessIterator first, RandomAccessIterator last,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

Effects: Behaves "as if" by constructing an instance of match_results<RandomAccessIterator> what, and then returning the result of regex_match(first, last, what, e, flags).

template <class charT, class Allocator, class traits, class Allocator2>
bool regex_match(const charT* str, match_results<const charT*, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

Effects: Returns the result of regex_match(str, str + char_traits<charT>::length(str), m, e, flags).

template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m, 
                 const reg_expression<charT, traits, Allocator2>& e, 
                 match_flag_type flags = match_default);

Effects: Returns the result of regex_match(s.begin(), s.end(), m, e, flags).

template <class charT, class traits, class Allocator2>
bool regex_match(const charT* str,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

Effects: Returns the result of regex_match(str, str + char_traits<charT>::length(str), e, flags).

template <class ST, class SA, class charT, class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

Effects: Returns the result of regex_match(s.begin(), s.end(), e, flags).

RE.7.2 regex_search

template <class RandomAccessIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(RandomAccessIterator first, RandomAccessIterator last,
                  match_results<RandomAccessIterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

Requires: Type RandomAccessIterator meets the requirements of a Random Access Iterator (24.1.5).

Effects: Determines whether there is some sub-sequence within [first,last) that matches the regular expression e, parameter flags is used to control how the expression is matched against the character sequence. Returns true if such a sequence exists, false otherwise.

Postconditions: If the function returns false, then the effect on parameter m is undefined, otherwise the effects on parameter m are given in table RE15:

Table RE15--regex_search(RandomAccessIterator first, RandomAccessIterator last, match_results<RandomAccessIterator, Allocator>& m, const reg_expression<charT, traits, Allocator2>& e, match_flag_type flags) effects

Element	Value
m.size()	e.mark_count()
m.empty()	false
m.prefix().first	first
m.prefix().last	m[0].first
m.prefix().matched	m.prefix().first != m.prefix().second
m.suffix().first	m[0].second
m.suffix().last	last
m.suffix().matched	m.suffix().first != m.suffix().second
m[0].first	The start of the sequence of characters that matched the regular expression
m[0].second	The end of the sequence of characters that matched the regular expression
m[0].matched	`true` if a full match was found, and `false` if it was a partial match (found as a result of the `match_partial` flag being set).
m[n].first	For all integers n < m.size(), the start of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.
m[n].second	For all integers n < m.size(), the end of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.
m[n].matched	For all integers n < m.size(), true if sub-expression n participated in the match, false otherwise.

Complexity: depends upon the syntax chosen (TODO).

template <class charT, class Allocator, class traits, class Allocator2>
bool regex_search(const charT* str, match_results<const charT*, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

Effects: Returns the result of regex_search(str, str + char_traits<charT>::length(str), m, e, flags).

template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

Effects: Returns the result of regex_search(s.begin(), s.end(), m, e, flags).

template <class iterator, class Allocator, class charT,
          class traits>
bool regex_search(iterator first, iterator last,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);

Effects: Behaves "as if" by constructing an instance of match_results<RandomAccessIterator> what, and then returning the result of regex_search(first, last, what, e, flags).

template <class charT, class Allocator, class traits>
bool regex_search(const charT* str
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);

Effects: Returns the result of regex_search(str, str + char_traits<charT>::length(str), e, flags).

template <class ST, class SA, class Allocator, class charT,
          class traits>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);

Effects: Returns the result of regex_search(s.begin(), s.end(), e, flags).

RE.7.3 regex_replace

template <class OutputIterator, class RandomAccessIterator, class traits,
          class Allocator, class charT>
OutputIterator regex_replace(OutputIterator out,
                           RandomAccessIterator first,
                           RandomAccessIterator last,
                           const reg_expression<charT, traits, Allocator>& e,
                           const basic_string<charT>& fmt,
                           match_flag_type flags = match_default);

Effects: Finds all the non-overlapping matches m of type match_results<RandomAccessIterator> that occur within the sequence [first, last). If no such matches are found then calls std::copy(first, last, out). Otherwise, for each match found, if !(flags & format_no_copy) calls std::copy(m.prefix().first, m.prefix().last, out), and then calls m.format(out, fmt, flags). Finally if !(flags & format_no_copy) calls std::copy(m.suffix().first, m,suffix().last, out). If flags & format_first_only is non-zero then only the first match found is replaced.

Returns: out.

Complexity: TODO, hard to figure out accurately.

template <class traits, class Allocator, class charT>
basic_string<charT> regex_replace(const basic_string<charT>& s,
                            const reg_expression<charT, traits, Allocator>& e,
                            const basic_string<charT>& fmt,
                            match_flag_type flags = match_default);

Effects: Constructs an object basic_string<charT> result, calls regex_replace(back_inserter(result), s.begin(), s.end(), e, fmt, flags), and then returns result.

RE.8 Regular expression Iterators

RE.8.1 Template class regex_iterator

The class template regex_iterator is an iterator adapter; that is to say it represents a new view of an existing iterator sequence, by enumerating all the occurrences of a regular expression within that sequence. regex_iterator finds (using regex_search) successive regular expression matches within the sequence from which it was constructed. After it is constructed, and every time ++ is used, the iterator finds and stores a value of match_results<RandomAccessIterator>. If the end of sequence is reached (regex_search returns false), the iterator becomes equal to the end-of-sequence iterator value. The constructor with no arguments regex_iterator() always constructs an end of sequence iterator object, which is the only legitimate iterator to be used for the end condition. The result of operator* on an end of sequence is not defined. For any other iterator value a const match_results<RandomAccessIterator>& is returned. The result of operator-> on an end of sequence is not defined. For any other iterator value a const match_results<RandomAccessIterator>* is returned. It is impossible to store things into regex_iterator's. Two end-of-sequence iterators are always equal. An end-of-sequence iterator is not equal to a non-end-of-sequence iterator. Two non-end-of-sequence iterators are equal when they are constructed from the same arguments.

namespace std{

template <class RandomAccessIterator, 
          class charT = iterator_traits<RandomAccessIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_iterator 
{
public:
   typedef          basic_regex<charT, traits, Allocator>                  regex_type;
   typedef          match_results<RandomAccessIterator>                    value_type;
   typedef typename iterator_traits<RandomAccessIterator>::difference_type difference_type;
   typedef typename iterator_traits<RandomAccessIterator>::pointer         pointer;
   typedef typename iterator_traits<RandomAccessIterator>::reference       reference;
   typedef          std::input_iterator_tag                                iterator_category;
   
   regex_iterator();
   regex_iterator(RandomAccessIterator a, RandomAccessIterator b, 
                  const regex_type& re, 
                  match_flag_type m = match_default);
   regex_iterator(const regex_iterator&);
   regex_iterator& operator=(const regex_iterator&);
   bool operator==(const regex_iterator&);
   bool operator!=(const regex_iterator&);
   const value_type& operator*();
   const value_type* operator->();
   regex_iterator& operator++();
   regex_iterator operator++(int);
private:
   match_results<RandomAccessIterator> what;  // for exposition only
   RandomAccessIterator*               end;   // for exposition only
   const regex_type*                   pre;   // for exposition only
   match_flag_type                     flags; // for exposition only
};

} // namespace std

RE.8.1.1 regex_iterator constructors

regex_iterator();

TODO

regex_iterator(RandomAccessIterator a, RandomAccessIterator b, 
               const regex_type& re, 
               match_flag_type m = match_default);

TODO

regex_iterator(const regex_iterator&);

TODO

regex_iterator& operator=(const regex_iterator&);

TODO

RE.8.1.1 regex_iterator dereference

const value_type& operator*();
const value_type* operator->();
regex_iterator& operator++();
regex_iterator& operator++(int);

TODO

RE.8.1 Template class regex_token_iterator

The template class regex_token_iterator is an iterator adapter; that is to say it represents a new view of an existing iterator sequence, by enumerating all the occurrences of a regular expression within that sequence, and presenting one or more new strings for each match found. After it is constructed, the iterator finds and stores a value of match_results<RandomAccessIterator> what (by calling regex_search) and sets the internal count N to 1. Every time operator++ is used the count N is incremented, if N exceeds or equals this->subs.size(), then the iterator finds and stores the next value of match_results<RandomAccessIterator> and sets count N to zero. If the end of sequence is reached (regex_search returns false), the iterator becomes equal to the end-of-sequence iterator value. The constructor with no arguments regex_iterator() always constructs an end of sequence iterator object, which is the only legitimate iterator to be used for the end condition. The result of operator* on an end of sequence is not defined. For any other iterator value a const basic_string<charT>& is returned (obtained by calling this->what[subs[N]]). The result of operator-> on an end of sequence is not defined. For any other iterator value a const basic_string<charT>* is returned. It is impossible to store things into regex_iterator's. Two end-of-sequence iterators are always equal. An end-of-sequence iterator is not equal to a non-end-of-sequence iterator. Two non-end-of-sequence iterators are equal when they are constructed from the same arguments.

namespace std{

template <class RandomAccessIterator, 
          class charT = iterator_traits<RandomAccessIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_token_iterator 
{
public:
   typedef          basic_regex<charT, traits, Allocator>                  regex_type;
   typedef          basic_string<charT>                                    value_type;
   typedef typename iterator_traits<RandomAccessIterator>::difference_type difference_type;
   typedef typename iterator_traits<RandomAccessIterator>::pointer         pointer;
   typedef typename iterator_traits<RandomAccessIterator>::reference       reference;
   typedef          std::input_iterator_tag                                iterator_category;
   
   regex_token_iterator();
   regex_token_iterator(RandomAccessIterator a, RandomAccessIterator b, const regex_type& re, 
                        int submatch = 0, match_flag_type m = match_default);
   regex_token_iterator(RandomAccessIterator a, RandomAccessIterator b, const regex_type& re, 
                        const std::vector<int>& submatches, match_flag_type m = match_default);
   template <std::size_t N>
   regex_token_iterator(RandomAccessIterator a, RandomAccessIterator b, const regex_type& re, 
                        const int (&submatches)[N], match_flag_type m = match_default);
   regex_token_iterator(const regex_token_iterator&);
   regex_token_iterator& operator=(const regex_token_iterator&);
   bool operator==(const regex_token_iterator&);
   bool operator!=(const regex_token_iterator&);
   const value_type& operator*();
   const value_type* operator->();
   regex_token_iterator& operator++();
   regex_token_iterator operator++(int);
private:
   match_results<RandomAccessIterator> what;   // for exposition only
   RandomAccessIterator*               end;    // for exposition only
   const regex_type*                   pre;    // for exposition only
   match_flag_type                     flags;  // for exposition only
   basic_string<charT>            result; // for exposition only
   std::size_t                         N;      // for exposition only
   std::vector<int>                    subs;   // for exposition only
};
} // namespace std

RE.8.1.1 regex_iterator constructors

regex_token_iterator();

TODO

regex_token_iterator(RandomAccessIterator a, RandomAccessIterator b, const regex_type& re, 
                     int submatch = 0, match_flag_type m = match_default);

TODO

regex_token_iterator(RandomAccessIterator a, RandomAccessIterator b, const regex_type& re, 
                     const std::vector<int>& submatches, match_flag_type m = match_default);

TODO

template <std::size_t N>
regex_token_iterator(RandomAccessIterator a, RandomAccessIterator b, const regex_type& re, 
                     const int (&submatches)[N], match_flag_type m = match_default);

TODO

regex_token_iterator(const regex_token_iterator&);

TODO

regex_token_iterator& operator=(const regex_token_iterator&);

TODO

RE.8.1.1 regex_token_iterator dereference

const value_type& operator*();
const value_type* operator->();
regex_token_iterator& operator++();
regex_token_iterator& operator++(int);

References

IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX). See also, The Single UNIX Specification, Version 3, http://www.unix-systems.org/version3/.
The Perl Programming Language, http://www.perl.org/.
Microsoft Corporation, Visual C#, http://msdn.microsoft.com/vcsharp/.
Standard ECMA-262 3^rd Edition - December 1999, ECMAScript, http://www.ecma.ch/.
GNU, The GNU C Library, http://www.gnu.org/software/libc/.
Tom Lord, The Hackerlab C Library, http://regexps.com/.
Henry Spencer, BSD regular expression library, ftp://ftp.zoo.toronto.edu/pub/ and http://arglist.com/regex/.
Philip Hazel, Perl Compatible Regular Expressions, http://www.pcre.org/.
Hugo Etchegoyen, A C++ wrapper around pcre-3.2, ftp://ftp.cus.cam.ac.uk/pub/software/programs/pcre/.
Michael Tesch, A C++ Wrapper for the "Perl Compatible Regular Expressions" library, ftp://ftp.cus.cam.ac.uk/pub/software/programs/pcre/.
John Maddock, Boost regular expression library, http://www.boost.org/.
Eric Niebler, The GRETA Regular Expression Template Archive, http://research.microsoft.com/downloads/.
Mark Jason Dominus, Perl Regular Expression Matching is NP-Complete, http://perl.plover.com/NPC/.
Charles L. A. Clarke, Gordon V. Cormack, On the use of Regular Expressions for Searching Text, ACM Trans. Programming Languages and Systems 19, (May 1997), University of Waterloo, Waterloo, Canada, Technical Report CS9507, http://plg.uwaterloo.ca/~gvcormac/PAPERS.html.
Larry Wall, Apocalypse 5, http://www.perl.com/pub/a/2002/06/04/apo5.html.
Robert Sedgewick, Algorithms in C++, Addison Wesley.
Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest, Introduction to Algorithms, MIT Press.
Gonzalo Navarro and Mathieu Raffinot, Fast Regular Expression Search, In Proceedings of WAE'99, LNCS 1668. Pages 198-212, 1999, http://www.dcc.uchile.cl/~gnavarro/publ.html.
Sun Wu and Udi Manber, Agrep - A Fast Approximate Pattern Matching Tool, Usenix Winter 1992 Technical Conference, San Francisco (January 1992), pp. 153-162, http://www.tgries.de/agrep/doc/agrep2ps.zip.
Ricardo A. Baeza-Yates and Gaston H. Gonnet, A New Approach to Text Searching, Communications of the ACM, 35:74--82, Oct 1992, http://www.dcc.uchile.cl/~rbaeza/cv/journals.html.
An experimental Boost regex version uses the same stack recursive algorithm that GRETA does by default, and has proven to be just as fast in practice: http://ourworld.compuserve.com/homepages/proposals/exregex.htm.
Gonzalo Navarro. "NR-grep: a Fast and Flexible Pattern Matching Tool". Software Practice and Experience (SPE) 31:1265-1312, 2001. http://www.dcc.uchile.cl/~gnavarro/publ.html.
Partial matches are described in more detail in the Boost documentation at: http://www.boost.org/libs/regex/template_class_ref.htm#partial_matches.

Acknowledgements

I am particularly grateful to GRETA's author Eric Niebler for many long and detailed discussions, to Edward Diener for much helpful feedback over many years, to Bjorn Karlsson, Jeremy Siek, Richard Peters, and Beman Dawes for many helpful comments on a draft of this proposal, and of course to all the Boost-list members and regex users for their continued help with this project.

A Proposal to add Regular Expressions to the Standard Library

Forward

Contents

I. Motivation

Target audience

II. Impact on the Standard

III. Design

Overview

Summary of key design decisions

Summary of key limitations imposed by the design

Regular expression syntax (and complexity)

POSIX-basic

POSIX-extended

ECMAScript

Perl

Recommendation

Regular expression algorithms

Regular expression representation

Regular expression traits classes

Representing regular expression matches

Algorithms

Repeated regular expression searches

Iterator requirements

Proposed Text

Forward

FW.1 Normative References

FW.2 Definitions

Regular Expressions

Table RE1 - Regular Expression Summary

RE.1 Regular Expression Traits

RE.1.1 Regular Expression Traits Requirements

RE.1.2 Regular expression Traits Classes

RE.2 Regular Expressions

Header <regex> synopsis

RE.3 Supporting Types

RE.3.1 Namespace std::re

RE.3.1.1 Bitmask Type syntax_option_type

RE.3.1.2 Bitmask Type match_flag_type

RE.3.2 Class bad_expression

RE.3.3 Template class regex_traits

RE.4 Template class basic_regex

RE.4.1 basic_regex constructors

Table RE3--basic_regex(const Allocator&) effects

Table RE4--basic_regex(const charT* p, flag_type f, const Allocator&) effects

Table RE5--basic_regex(const charT* p1, const charT* p2, flag_type f, const Allocator&) effects

Table RE6--basic_regex(const charT* p1, size_type len, flag_type f, const Allocator&) effects

Table RE7--basic_regex(const basic_regex& e) effects

Table RE8--basic_regex(const basic_regex& e) effects

Table RE9--basic_regex(ForwardIterator first, ForwardIterator last, flag_type f, const Allocator&) effects

RE.4.2 basic_regex iterators

RE.4.3 basic_regex capacity

RE.4.4 basic_regex assign

Table RE10-- basic_regex& assign(const basic_string<charT, string_traits, A>& s, flag_type f) effects

RE.4.5 basic_regex constant operations

RE.4.6 basic_regex locale

RE.4.7 basic_regex swap

RE.4.8 basic_regex non-member functions

RE.4.8.1 basic_regex non-member comparison operators

RE.4.8.2 basic_regex inserter.

RE.4.8.3 basic_regex non-member swap

RE.5 Template class sub_match

RE.5.1 sub_match members

RE.5.2 sub_match non-member operators

RE.6 Template class match_results

RE.6.1 match_results constructors

Table RE12--match_results(const Allocator&) effects

Table RE13--match_results(const Allocator&) effects

RE.6.2 match_results size

RE.6.3 match_results element access

RE.6.4 match_results reformatting

RE.7 Regular expression algorithms

RE.7.1 regex_match

Table RE14--regex_match(RandomAccessIterator first, RandomAccessIterator last, match_results<RandomAccessIterator, Allocator>& m, const reg_expression<charT, traits, Allocator2>& e, match_flag_type flags) effects

RE.7.2 regex_search

Table RE15--regex_search(RandomAccessIterator first, RandomAccessIterator last, match_results<RandomAccessIterator, Allocator>& m, const reg_expression<charT, traits, Allocator2>& e, match_flag_type flags) effects

RE.7.3 regex_replace

RE.8 Regular expression Iterators

RE.8.1 Template class regex_iterator

RE.8.1.1 regex_iterator constructors

RE.8.1.1 regex_iterator dereference