A Proposal to add Regular Expressions to the Standard Library

Doc. no.: J16/03-0011= WG21/N1429.
Previous Doc. no.: J16/02-0044 = WG21/N1386
Date: 03 March 2003
Project: Programming Language C++
Subgroup: Library
Wiki: http://www.research.att.com/~ark/cgi-bin/lwg-wiki/wiki?RegularExpressions
Reply to: John Maddock <john_maddock@compuserve.com>

Forward

This proposal is based heavily on the Boost regular expression library, although GRETA is also discussed. The proposal represents widespread existing practice; the Boost regex library is in use on over twenty compiler/platform combinations, and is one of the most visited libraries on the Boost web site. The proposal is necessarily long and quite detailed and covers many difficult and contentious issues. The intent is to provide a concrete "straw man" proposal for discussion along with a clear discussion of the design issues involved.

Contents

Motivation
   Target Audience
Impact on the standard
Design
   Summary of key design decisions
   Summary of key limitations imposed by the design
   Regular expression syntax (and complexity)
      Regular expression algorithms
   Regular expression representation
   Regular expression traits classes
   Representing regular expression matches
   Algorithms
   Repeated regular expression searches
   Iterator requirements
Proposed text
   Regular Expression Traits
   Header <regex> synopsis
   Supporting Types
   Template class basic_regex
   Template class sub_match
   Template class match_results
   Regular expression algorithms
   Regular expression Iterators

I. Motivation

Regular expressions should be familiar to many programmers; a regular expression is a string used as a pattern to select specific strings from a set of character strings. Regular expressions are used by many familiar Unix utilities, including egrep, grep, lex, and sed[1], and are supported directly in many programming languages including awk[1], perl[2], C#[3], and ECMAScript[4] (also known as JavaScript), and indeed are often used ubiquitously wherever text processing occurs.

Until recently most regular expression libraries were implemented in C, no doubt helped by the fact that the POSIX standard provides a standardized regular expression API[1]. Well known C implementations include the GNU regex library[5], the RX library[6], Henry Spencer's regex library[7] (which forms part of the BSD distribution), and PCRE[8] (for perl like regular expressions). Most early C++ implementations were simple wrappers around the C API's[9][10], although these add a measure of safety (by using the Resource Acquisition is Initialization principle), as well an object oriented interface. However wrapper classes and C API's alike suffer from a key defect when interfacing with the C++ standard library: they are limited to searching narrow character strings held in a const char*.

Native C++ implementations can overcome the limitations suffered by C libraries, and there are two main candidates for standardization: the Boost regex library[11] (formerly regex++), and GRETA[12]. Both these libraries support narrow and wide character regular expressions, and iterator-based searches. GRETA implements the regular expression algorithms as member functions of the regular expression class, while the Boost library make these separate non-member functions. GRETA is based on perl regular expressions, where as the Boost library uses POSIX regular expression matching rules, albeit with some perl compatible extensions.

Target audience

The are essentially three possible target audiences for a regular expression library:

End Users: regular expression may be used by the end user of a program - typically this would be in a text editor for search (and replace) operations. These users would require that the expression is localized to use their native character set, and never exhibit the kind of exponential search times that some regular expression engines can produce under worst case conditions. Typically this means that the regular expression recognized by the engine has to be restricted to some kind of safe subset of the syntax recognized by programming tools such as perl.

Programmers: this section includes all users of scripting languages like perl[2], awk[1], and ECMAScript[4]. In these cases the regular expressions are authored by the programmer, and these are converted to a state machine at run time. Typically these users require an expressive regular expression syntax, at the expense of worst case performance. Localization of the syntax is not normally required, although the state machine itself should usually honor the run time locale (for example when matching an alphanumeric character).

Programmers (code generators): this section of users include those using tools like lex[1], which convert one or more regular expressions into a C or C++ source file. Here the expressions are written by the programmer, and then converted to a state machine at compile time. There is no localization, either of the syntax or of the way in which the state machine behaves at runtime. The generated state machines are usually highly optimized; as a result the regular expressions may take some time to be analyzed by the generator program, and are typically quite restricted in the syntax they accept.

This proposal concentrates on the second group; that is to say a rich regular expression syntax, generally used by programmers, and interpreted at run time. However, thought has been given to the other possible user types, and the design by no means rule these out completely.

II. Impact on the Standard

This is a pure library proposal, it does not add any new language features, or alter any existing standard library headers.

III. Design

Overview

This proposal consists of a pair of classes, and a set of algorithms that operate on instances of those classes. Regular expressions are represented by instances of the basic_regex class template, matches against a regular expression are represented by instances of the match_results class template. The algorithms regex_match and regex_search take as input a regular expression and a character string, and fill in a match_results structure if a match is found. For search and replace operations the algorithm regex_replace performs formatted search and replace on a character string, producing a new string as output. Finally, for repeated find operations, there are two iterator types; one enumerates a sequence of full regular expression matches, the other a sequence of strings.

Summary of key design decisions

Regular expression syntax: ECMAScript (discussion).

Complexity: None specified (discussion).

Regular Expression Object: A template class modeled after basic_string; the regular expression object does not have access to the iterator type that will be searched (discussion).

Traits class: Regular expression objects may be customized via a traits class template parameter; all localization of the library is handled via the traits class (discussion).

Character types: both narrow and wide character strings are supported.

Representation of regular expression matches: a container of sub-expression matches, what matched each marked sub-expression in the regular expression is reported back to the user (discussion).

Regular expression algorithms: iterator based non-member functions for searching for a match, and for finding an exact match (discussion).

Search and replace: ECMAScript-like search and replace via a non-member function to transform one string to another (discussion).

Repeated searches: an iterator type to enumerate all of the regular expression matches within another iterator sequence (discussion).

Splitting a string: an iterator type that enumerates one or more strings for each regular expression match within another iterator sequence (discussion).

Summary of key limitations imposed by the design

Regular expression syntax: whatever syntax is chosen will severely limit the choice of state machine used along with any complexity guarantees given. The choice of ECMAScript regular expressions implies that some regular expressions can only be matched by a backtracking Non-deterministic Finite Automata (NFA), other algorithms may be used for some subsets of ECMAScript regular expressions (discussion).

Regular expression object type: the fact that the regular expression object is templated on the character type, and not on an iterator type, may make some implementation choices more difficult. In practice this choice seems to have no impact on performance (discussion).

Regular expression traits class: the traits class design will be a major constraining factor on the kinds of internal representations that a regular expression state machine may use. The design presented here is the simplest possible; implementations wishing to extend the regular expression library may need to extend the traits class design (discussion).

Support for wide characters: much regular expression theoretical work is based on small finite character sets, and can therefore not be applied to wide character strings, however wide character string support is considered essential. This constraint may be mitigated to some degree by providing a specialization for basic_regex<char> and then overloading its associated non-member algorithms (discussion).

Sub-expression matching: User feedback suggests that support for matching marked sub-expressions is essential. However, this can have a negative impact on the kinds of algorithms that may be employed. This is mitigated to some degree in this proposal by providing overloaded forms of the key algorithms that do not report what matched, allowing alternative algorithms to be used in these cases for some expressions (discussion).

Iterator type: support for forward iterators would allow for multibyte character support via iterator adapters, however this would also preclude some perl-like features, and may have negative performance implications. This proposal requires support for bidirectional iterators, however this choice needs more investigation and may need to be changed at a later date (discussion).

Regular expression syntax (and complexity)

The complexity of converting a character string to a state machine, and the complexity of searching for a match to that machine, are deeply dependent upon the syntax used to represent the regular expression. There are three ways in which the regular expression syntax can be specified:

  1. Leave the manner in which the string representation is interpreted as implementation defined. This makes it impossible for programmers to include portable regular expressions in their code.
  2. Define completely within the standard how the regular expression finite state machine is interpreted. This option incurs a lot of work, and possibly duplicates information that has already been standardized elsewhere.
  3. Make a normative reference to another standard. There are only really two pertinent standards available: IEEE Std 1003.1-2001: POSIX, Chapter 9 Regular Expressions[1], and ECMA-262: ECMAScript Language Specification[4].

The following brief summary indicates the main differences between the regular expression syntaxes that have been standardized. It is assumed that the reader is already familiar with one or more of these.

POSIX-basic

The basic POSIX regular expression syntax is most often encountered in Unix utilities such as grep or sed.

Only the characters .[\*^$ are special. Other features are accessed as escape sequences, for example grouping is accomplished with \( and \), quantified repeats using \{ and \} and alternation with \|.

Bracket expressions (contained in []), have access to the full range of POSIX features including named collating elements and equivalence classes, however escape sequences within bracket expressions are not valid (unlike in perl).

There are no + or ? quantifiers.

Back-references to previously marked sub-expressions (with \( and \)) are allowed. This last option means that pattern recognition is almost certainly an NP-Complete problem[13].

Expressions are matched using the "leftmost longest" rule.

POSIX-extended

The extended POSIX regular expression syntax is used by the Unix utilities egrep and awk[1].

The extended POSIX regular expression syntax is generally more natural in use than the basic syntax, the characters .[]{}+*?|\^$ are all special, escape sequences are used to make special characters ordinary (in other words there are no special escape sequences defined by the POSIX standard, however such sequences are allowed as compatible extensions).

Expressions are matched using the "leftmost longest" rule[1].

Back-references are not allowed in POSIX extended regular expressions, as a result a POSIX extended regular expression can be converted to a Deterministic Finite Automata (DFA) if required.

The following features are supported by both POSIX-extended and POSIX-basic regular expressions but not by ECMAScript:

ECMAScript

ECMAScript[4] is based on the perl regular expression syntax[2], which in turn is loosely based on the POSIX extended syntax[1]. ECMAScript is also known as JavaScript or JScript.

Unlike the POSIX regular expression syntaxes, ECMAScript does not use the "leftmost longest" rule to determine the best possible match, instead the regular expression is treated as a directed graph, and a depth first search is performed, with the first match found being returned. The following examples illustrate the main differences between the two:

Expression

Text

POSIX leftmost longest match

ECMAScript depth first search match

a|ab

xaby

"ab"

"a"

.*([[:alnum:]]+).*

" abc def xyz "

$0 = " abc def xyz "
$1 = "abc"

$0 = " abc def xyz "
$1 = "z"

.*(a|xayy)

zzxayyzz

"zzxayy"

"zzxa"

These differences between ECMAScript matching rules, and POSIX matching rules, mean that these two regular expression syntaxes differ not only in the features offered, but also in the form that the state machine takes and/or the algorithms used to traverse the state machine.

The following features are supported by ECMAScript, but not by the POSIX-basic or POSIX-extended syntaxes:

The main advantage offered by ECMAScript is that the richer syntax makes it very much easier to author patterns that will match sections of mark-up languages. For example to match a "<font>…</font>" section using the POSIX syntax is much more complicated than it first seems, the obvious expression:

"<font[^>]*>.*</font>"

would match everything from the first occurrence of "<font" to the last occurrence of "</font>", including any intervening <font> or </font> tags. It is possible to do the right thing using something much more complex like:

"<font[^>]*>([^<]|<[^/]|</[^f]|</f[^o]|</fo[^n]|</fon[^t]|</font[^>])*</font>"

which is not especially memorable (nor even a general solution). Where as the ECMAScript equivalent:

"<font[^>]*>.*?</font>"

is very much easier. I've seen suggestions that a POSIX-like syntax, but using a "leftmost shortest" matching rule would be a better a way of handling mark-up languages[14], but while this is true, it still doesn't approach the flexibility of ECMAScript in this important area.

Perl

Perl 5 regular expressions offer the same facilities as ECMAScript, as well as several additional features such as independent sub-expressions, zero width look-behind assertions, and interpreted sub-expressions. The latter feature allows recursive regular expressions; something that can only be fully implemented when regular expressions are an intrinsic part of the language, and have access to the program's variables. Given that perl isn't standardized, the perl syntax probably isn't a good candidate for the C++ standard. It is also worth noting that the perl regular expression syntax will undergo a complete rewrite in perl 6[15].

Recommendation

This proposal recommends that support for ECMAScript syntax is required, while support for the POSIX syntax is optional. This choice is different to the one made by Boost, but is based on a very clear bias in user expectations (and feedback), in particular for features like "non-greedy repeats" which simply do not fit well into the POSIX "leftmost longest" model. This choice is also the same as that made by almost all of the newer programming languages: not only perl and JavaScript, but also Python, PHP, and Microsoft's .NET framework. The next major revision of the boost regex library will offer a choice of either POSIX or ECMAScript semantics

However it should be noted that ECMA script has no support for localization in regular expressions, the proposal therefore requires that the following features (taken from POSIX-extended regular expressions) be supported, and that these features are localizable via the supplied traits class:

Regular expression algorithms

The main state machine representations and algorithms used for implementing regular expressions are shown below, along with their complexities [16][17]:

Finite state machine / algorithm

Size of machine for M character expression

Pattern recognition complexity, for an N character sequence, and S states.

Comments

DFA

O(2M)

O(N)

May have a high cost for the construction of the finite state machine, typically only used by code generators such as lex.

Bit-parallel non-backtracking NFA

O(M) for most expressions, but O(2M) for nested bounded-repeats.

O(1+(S/B))N

Only applicable to narrow character regular expressions and does not report what matched. If B (the number of bits in a machine word) is greater than S (a reasonably common experience), then performance is O(N).

Non backtracking NFA

O(M) for most expressions, but O(2M) for nested bounded-repeats.

O(SN)

Typically used for POSIX-extended regular expression engines, especially tools like egrep where the expression is to be authored by an "end user" rather than a programmer.

Backtracking NFA

O(M)

O(2N)

Typically used for ECMAScript regular expressions, and for POSIX-basic expressions that contain back-references.

The main lesson from this table is that unless the standard defines the regular expression grammar used, the complexity of the regular expression algorithms can not be specified. One option that could be explored is to allow for the runtime monitoring of algorithm complexity. That is, the stated algorithm complexity is enforced not by using an algorithm that has the stated complexity, but by monitoring the number of states in the machine that are visited, and throwing an exception if the number visited exceeds some threshold. This option allows for the use of richer regular expression grammars, while at the same time preventing the state machine from executing eternally (this is currently in use with the Boost.regex library and appears to work well in practice). 

Please also note that for carefully limited regular expression grammars, and for narrow character strings, there are many clever algorithms and/or heuristics available with much better performance (and in some cases worst case complexity as well) than the ones given here[18][19].

This proposal does not make any performance guarantees, given that this is essentially impossible for ECMAScript regular expressions.

Regular expression representation

In most programming languages, regular expressions are represented as a character strings. Ultimately though, a regular expression is a finite state machine, the character string is just is a user-friendly representation. This leads to the first design decision: regular expressions will be represented by a type that is constructible from a character string. This allows the data structure representing the finite state machine (DFA or NFA etc) to be completely opaque to the end user. This form also ensures that character strings are converted to a state machine only once (whereas if regular expression were represented by basic_string then the transformation to a state machine would have to occur every time the expression was used). On the other hand, since character strings and regular expressions are so closely related, the regular expression type should "look like" a basic_string, in particular the member functions should make the underlying character string representation easily accessible. There are some limits to this; other than basic_regex<>::assign, member functions will be limited to those that do not modify the string. This limitation arises from the fact that one can not arbitrarily modify a regular expression and expect the result to still be a valid expression, nor can one make arbitrary changes to the underlying state machine without repeating any static analysis that may have been performed on the machine. One reviewer has asked for arbitrary modification to the machine to be allowed, in order to support emacs-style incremental searches. However, it is not clear what kind of impact this would have on the state machine representation, or even whether this is possible for regular expressions. As a result this is not part of this proposal, and experience with Boost so far is that not one user has ever asked for this feature.

The GRETA and Boost regular expression libraries use slightly different forms for the regular expression object. Boost uses:

template <class charT, 
          class traits = regex_traits<charT>, 
          class Allocator = allocator<charT> >
class reg_expression;

While GRETA uses:

template< typename ConstIterator, 
          typename traits = perl_syntax<iterator_traits<ConstIterator>::value_type> >
class basic_rpattern;

There are two main differences here: Boost has an Allocator template parameter, whereas GRETA does not; and GRETA is templated on the iterator type that will be searched rather than on the character type. In addition, GRETA uses member functions for regular expression search and match operations, where as Boost uses non-member functions. This proposal follows the Boost interface for the following reasons:

As with all design decisions there are some compromises involved. In particular, the GRETA implementation by default uses a recursive depth first search; each state in the machine is represented by a polymorphic object, so the iterator type needs to be known by the machine in order for iterators to be passed through virtual function calls. However there is a one to one transformation from a virtual function call, to a call through a function pointer looked up in a table indexed by the object's type. A refactored version of Boost regex uses this technique to implement the same algorithm that GRETA uses, the result has been found to be faster than either Boost or GRETA so far, and this implementation will be used in the next major revision of the boost.regex library [21]. In other words, although there are some implementation choices that do not fit easily into the Boost interface, there are no algorithms that can not fit into this interface, either in theory or practice.

There is one other restriction worth pointing out; the Boost interface does not easily allow an implementation to generate self-modifying machine code for the state machine. That would require the state machine to be fixed to a finite number of iterator types, and in practice there appear to be no regex implementations that make this design choice in any case. If this option were required, then full construction of the state machine would have to be delayed until it was required (and the iterator type was known to be one that is supported), the result could then be cached in the regular expression object (and indexed by a const typeinfo*).

This proposal has made some small modifications to the Boost interface where experience has indicated that the existing version is imperfect, in particular the name of the regular expression object has been changed from reg_expression to basic_regex, to make the basic_string analogy clearer. The actual class used to represent the regular expression is:

template <class charT, class traits = regex_traits<charT>, class Allocator = allocator<charT> >
class basic_regex;

In many cases there are multiple ways in which a character string can be converted to a state machine, the most important of these is to tell the state machine whether it should regard a character sequence as case sensitive or not, but other options are possible: for example hints on optimizations and/or the expression syntax to use (for example POSIX basic or extended, perl or JavaScript etc). So in addition to a string, the form of the state machine is determined by a series of flags (represented by a bitmask type), these flags should take on well defined default values which results in any given string producing the same behavior on all implementations.

In the Boost reference implementation these flags are defined as members of a non-template base class to basic_expression, called regbase. Originally this base class had several members in addition to these constants (in a design rather similar to std::ios_base), but as time has gone on these members have either been removed or moved to other locations (either the traits class or the basic_regex class), so that leaves a base class with no purpose other than to hold some symbolic constants. At the same time there are other symbolic constants used elsewhere by the regular expression library, which in the Boost implementation are defined at namespace scope. To tidy all this up, this proposal adds these to the nested namespace std::regex_constants, whose only purpose is to hold the symbolic constants used by the regular expression library:

namespace std{ namespace regex_constants{
typedef bitmask_type syntax_type;
// default syntax, the same on all implementations:
static const syntax_type normal = 0;
// case insensitivity:
static const syntax_type icase = 1;
// possibly other flags follow.

// Other flag types follow...
} }

One of the more controversial aspects of this proposal will be which flags to include here, as this depends heavily upon the regular expression syntax we standardize upon. As a preliminary choice, this proposal suggests that ECMAScript regular expression must be supported, and that POSIX style regular expression are an optional extra.

The proposal does not document any member functions for accessing the internal state machine representation. Obviously such member functions will exist, but their form will depend upon the implementation. In most cases these member functions will be declared private, and one or more non-member functions will be declared as friends.

When constructing a state machine, one must recognize that not all strings are valid regular expressions, in this proposal the regular expression object throws an exception derived from runtime_error when the string is not a valid expression.

Finally as with basic_string, most users will only be using one of two typedefs of basic_regex:

typedef basic_regex<char> regex;
typedef basic_regex<wchar_t> wregex;

Regular expression traits classes

The traits class used by class template basic_regex provides two related facilities: localization and customization. Unlike most traits classes used in the standard library, the one used by the class template basic_regex is instance based, allowing three main use cases:

The default traits class (regex_traits) is stateful, and uses std::locale to provide localization facilities. Note however that stateful traits classes are not tied to std::locale; user defined traits classes could use some other type for localization: for example on the Microsoft Windows platform an LCID might provide a better locale type.

The default traits class is defined as follows:

template <class charT>
struct regex_traits
{
public:
   typedef charT                        char_type;
   typedef std::size_t                  size_type;
   typedef std::basic_string<char_type> string_type;
   typedef std::locale                  locale_type;
   typedef bitmask_type                 char_class_type;
   typedef regex_traits_version_1_tag   version_tag;

   struct sentry
   {
      sentry(regex_traits<charT>&);
      operator void*();
   };

   static size_type length(const char_type* p);
   regex_constants::syntax_type syntax_type(charT c) const;
   regex_constants::escape_syntax_type escape_syntax_type(charT c) const;
   charT translate(charT c, bool icase) const;
   string_type transform(const string_type& in) const;
   string_type transform_primary(const string_type& in) const;
   char_class_type lookup_classname(const string_type& name) const;
   string_type lookup_collatename(const string_type& name) const;
   bool is_class(charT c, char_class_type f) const;
   template <class InputIterator>
   int toi(InputIterator& first, InputIterator last, int radix) const;
   locale_type imbue(locale_type l);
   locale_type getloc()const;
   std::string error_string(regex_constants::error_type) const;
};

Traits class versioning

Designing the correct traits class for basic_regex is difficult, for example the current boost version has been through several design revisions already, and is still imperfect. The design presented here uses combinations of features from the boost regex design and from GRETA, it has also been simplified compared to either of these as far as is possible. However, it is a fact that vendors wishing to provide new and innovative features in their regular expression code will almost certainly need to extend or modify the traits class design presented here. In order to allow for this, the traits class contains a version tag: a type that signifies the interface version that the traits class conforms to. Any traits class that provides a superset of the interface specified in this document should specify a version tag that derives from std::regex_traits_version_1_tag. Likewise, implementations of basic_regex should accept any traits class whose version tag derives from regex_traits_version_1_tag, and may of course accept traits classes with some other version tag as a compatible extension. A class type is used for this purpose to allow for a hierarchy of traits class versions:

// required by the standard:
struct regex_traits_version_1_tag {}; 
// vendor specific traits version, compatible with regex_traits_version_1_tag:
struct regex_traits_vendor1_tag : regex_traits_version_1_tag {}; 
// some future standard version:
struct regex_traits_version_2_tag : regex_traits_version_1_tag {};

traits class initialization

Traits classes used by basic_regex, must be default constructible, they are also capable of being imbued with a locale. However the implementation of many traits class designs requires the initialization and or caching of a fair amount of locale-specific data - it makes sense to delay this initialization until the traits class is actually about to be used (and after any call to imbue) - to ensure that the user only pays for what they actually use:

std::regex e;

int main(int argc, const char**argv)
{
   if(argc > 2)
   {
      e.imbue(std::locale("en_GB");
      e.assign(argv[2]);
      // do some work
   }
   else
   {
      // do something else, no regex traits class initialization performed
   }
}

The traits class uses a sentry design, similar to that used by the iostreams library for this purpose:

template <class charT, class traits, class Allocator>
class basic_regex
{
private:
  // for exposition only:
  traits traits_inst;
  Allocator alloc_inst;
public:
  basic_regex(const Allocator& a = Allocator())
    : traits_inst(), alloc_inst(a) {}
  basic_regex& assign(const charT* ptr, flag_type f = regex_constants::normal)
  {
    traits::sentry s(traits_inst);
    if(s){
      // perform state machine construction here
    }
  }
};

Customizing the expression syntax

One of the key features of the traits class design is that it allows the expression syntax to be customized. There are limits to what can be accomplished with this: the traits class is not a parser and has no knowledge of the internal state machine structure (and nor should it). What it can do is replace those characters that have special meaning in the regex grammar with some other (possibly localized) equivalent, by use of the following member functions:

regex_constants::syntax_type syntax_type(charT c) const;
regex_constants::escape_syntax_type escape_syntax_type(charT c) const;
char_class_type lookup_classname(const string_type& name) const;
string_type lookup_collatename(const string_type& name) const;
template<class InputIterator>
int toi(InputIterator& first, InputIterator last, int radix) const;

Member function syntax_type, returns an enumerated value which represents the meaning of the next input character: the regular expression grammar is actually applied to the sequence of values obtained by calling traits.syntax_type, and not the characters in the expression itself. Implementations may define values for regex_constants::syntax_type that are not specified in this standard in order to support compatible extensions.

Likewise the member function escape_syntax_type returns an enumerated value indicating the meaning of an escaped character, implementations may support escape sequences not required by the ECMA standard by defining additional values for regex_constants::escape_syntax_type.

Together the functions syntax_type and escape_syntax_type allow the primary regular expression grammar to be customized by a custom traits class: for example a traits class could use Han-ideographs rather that Latin punctuation symbols in place of the usual ? * + regular expression operators.

The member function lookup_classname converts a string into a bitmap type, which can later be used to determine whether a character is a member of a particular character class. For example when the parser encounters a \d escape sequence, syntax_type will be passed '\' and return syntax_escape. escape_syntax_type will then be passed 'd' and return escape_type_class. Finally the parser will pass "d" to lookup_classname to obtain a bitmask type that can later be passed to member function is_class to determine whether a character is a digit or not. This process allows the traits class to support custom character class names, in addition to the standard forms:

\d \s \w [[:alnum:]] [[:alpha:]] [[:blank:]] [[:cntrl:]] [[:digit:]] [[:graph:]] [[:lower:]] [[:print:]] [[:punct:]] [[:space:]] [[:upper:]] [[:xdigit:]]

It is interesting to note that while the C localization API has the function wctype to support locale specific character class names, there is no equivalent in std::locale.

Names looked up by lookup_classname are case insensitive, so for example passing the values "UPPER" "upper" and "Upper" would all have the same result. In the case of escape sequences, the value returned by escape_syntax_type determines whether the character class is to be negated (for example in "\D") or not (as in "\d").

The member function lookup_collatename converts a string into a single collating element, it is called when the parser encounters a [[.collating-element-name.]] sequence. The default traits class supports all the names specified in the POSIX standard, for example if the parser encountered "[[.left-square-bracket.]]" then " left-square-bracket " would be passed to lookup_classname, and "[" would be returned. Even though support for this functionality is required in POSIX regular expressions, note that neither the C, POSIX, nor C++ standards have any portable mechanism for converting locale-specific collating element names to collating elements. Consequently it is implementation defined whether the default traits class supports collating names other than those in the default "POSIX" locale. Finally, the collating element is returned as a string not as a character to allow for multi-character collating elements (for example ae in English, or ll in traditional Spanish).

Finally the toi member function is used to convert numeric sequences to integers according to the traits classes locale: the default traits class uses std::num_get for this purpose, user defined traits classes that don't use std::locale need to make their own locale specific arrangements.

Customizing how expressions are matched

The following traits class member functions are used to customize how an expression is matched against an input sequence:

charT translate(charT c, bool icase) const;
string_type transform(const string_type& in) const;
string_type transform_primary(const string_type& in) const;
bool is_class(charT c, char_class_type f) const;

Matching of individual characters can generally be accomplished in one of two ways: either the character to be matched is converted to the set of all characters that are equivalent to it, and the input character is then matched against that set, or the both the input character and the character in the expression are converted to a consistent form (for example lower case) and then directly compared. The regex traits class design takes the latter approach, on the grounds that it can be applied to large character sets, where as it is not in general possible to enumerate all the characters equivalent to some character c, when the character set is large. For example the character set may contain more than two cases (upper, lower and title case variants in Unicode for example), as well as other forms of canonical or compatibility equivalence. Thus two characters c1 and c2 are compared with regard to case using:

translate(c1, false) == translate(c2, false)

and caseless comparisons using:

translate(c1, true) == translate(c2, true)

Character ranges can be matched either in a locale sensitive manner or not: POSIX leaves it up to the implementation whether or not character ranges such as [a-b] are sensitive to the locale, while ECMAScript requires that such ranges are matched only according to character code points. This proposal makes locale sensitivity an option: when the syntax_option_type flag collate is set when constructing a regular expression, then character ranges are sensitive to the locale, and when it’s not set then they are matched only by comparing code points.

Matching locale sensitive character ranges is accomplished by converting all the input sequences to collating keys with the transform member function: in the default traits class this just calls std::collate<charT>::transform. For example when the regex-option flag regex_constants::collate is set then the sequence [a-b] would match some character c1 if:

traits.transform("a") <= traits.transform(c1) <= traits.transform("b");

Matching equivalence classes is accomplished by comparing the sort keys returned from the transform_primary member function, for example the expression [[=a=]] would match some character c1 if:

traits.transform_primary(c1) == traits.transform_primary("a")

Note that the transform and transform_primary member function accept a string rather than a character as input in order to cope with multi-character collating elements used in expressions such as [[.ae.]-d] or [[=ll=]]. Note also that there is no portable way to implement transform_primary in terms of std::locale, since even if the sort key format returned by std::collate_byname<>::transform is known and can be converted into a primary sort key, the user can still install their own custom std::collate implementation into the locale object used, and that can use any sort key format they see fit. The transform_primary member function is therefore more of use to custom traits classes, and should throw an exception if it cannot be implemented for a particular locale. Unfortunately this significantly reduces the usefulness of POSIX style equivalence classes within regular expressions, but that cannot be fixed without modifying the std::collate facet. Note that primary sort keys can not be obtained by converting to all lower case and then obtaining a regular sort key: primary keys take into account only the primary character shape, case, accentation and locale specific tailoring are not taken into account, so for example the characters "AÀÁÂÃÄÅaàáâãäå" should all produce the same primary sort key.

Finally the is_class member function is used to determine whether some input character is a member of a particular character class, for example the expression [[:upper:]] would match some character c1 if:

traits.is_class(c1, traits.lookup_classname("upper")) == true

The default traits class is implemented in terms of a call to std::ctype<charT>::is.

Error messages

Whenever the regular expression parser encounters a syntax error, a bad_expression exception needs to be thrown. The traits class provides the member function error_string to convert a numeric error condition into a suitably localized string for presentation to the end user, the default traits class returns only English language error strings, but user defined traits classes can easily customize this behavior.

Traits class locale

The member functions:

locale_type imbue(locale_type l);
locale_type getloc()const;

allow the traits class to have a per-instance locale when required, and to be imbued with that locale. If the traits class is stateless then these functions have no effect, and the member-type locale_type is a dummy/placeholder type.

Representing regular expression matches

Many of the regular expression algorithms report what matched the expression. Unlike matches against a normal sequence of characters, regular expression matches can be of variable length, and can even match the null string. Furthermore, the end of sequence iterator can not be used to indicate that no match was found, as a null string at the end of a sequence can be a valid regular expression match.

Regular expressions also contain marked sub-expressions, each of which has its own "sub-match" in addition to what matched the whole expression. Further more, each of these sub-expression matches can be either matched or non-matched depending upon the context. For example, consider the expression:

(^[[:space:]]*)|([[:space:]]*$)

Here the expression contains two marked sub-expressions (which we'll call $1 and $2 using perl terminology), both the expression as a whole and the two marked sub-expressions can match variable length sequences, and either $1 or $2 can be matched, but never both at once. The expression can also match the null string at either the end or the start of the sequence being searched.

In order to represent this range of possibilities, each match found must be represented as a collection of sub-expression matches, with each sub-expression match represented by a pair of iterators. A Boolean flag is needed to indicate whether a sub-expression participated in the match or not; ideally one would use singular iterators for this, but not all iterators are required to have singular values (nor are is it guaranteed by the standard that one can compare two iterators if one of them is singular), so this option is prohibited.

This proposal uses:

template <class Iterator>
struct sub_match : public std::pair<Iterator, Iterator>
{
   bool matched;

   typename iterator_traits<Iterator>::distance_type length()const;
   operator basic_string<typename iterator_traits<Iterator>::value_type>()const;
};

to represent matched sub-expressions. This is the simplest solution, however experience with Boost has shown that some users new to C++ get confused by the "pair of iterators" concept, and just assume that the first member contains a null terminated string representing the match. An alternative and more object oriented approach would be to develop this into a fully fledged substring class, perhaps as part of a wider strings library (the author has some code for this, but it is not yet ready for primetime). However there are also dangers to this approach; a substring class would still be a pair of iterators internally, there is therefore a potential for the string to which the substring refers to be destroyed before the substring object, leaving invalidated iterators behind. Of course this is still a possible problem with the "pair of iterators" approach, however at least it is obvious that they are iterators and not something that looks like it may be a "real" string.

There are two options for representing a collection of sub_matches: either to use an existing container class, or provide a new container. This proposal takes the latter option - providing a new container called match_results - for the following reasons:

Finally there are some typedefs of the match_results type to simplify usage:

typedef match_results<const char*> cmatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<string::const_iterator> smatch;
typedef match_results<wstring::const_iterator> wsmatch;

Algorithms

In this proposal regular expression algorithms are present as iterator-based non-member functions that operate upon instances of basic_regex and match_results. As previously discussed this maintains the independence of the regular expression type from the iterator type. Each algorithm is overloaded for three common cases: where the source character-container sequence is a pair of iterators, or a null terminated string, or an instance of basic_string. In addition, the algorithms are split into two types: those that report what matched using an instance of the match_results class, and those that simply indicate whether there was a match or not. Although the latter option may not seem that useful for those used to tools such as perl, in practice it does have important uses, and can be implemented using algorithms that are significantly more efficient than those that report what matched. Indeed for long searches and a limited subset of regular expressions there are algorithms that offer better than O(N) average case performance[22]. Typical uses here include search engines (grep etc), or data input validation (where you just want to compare the string against a regular expression based filter). Note however, that the worst case performance of the "non-reporting" algorithms is still likely to be the same as the "reporting" ones, it's the real world performance that can differ. For this reason the "non-reporting" algorithms represent an opportunity for the vendor to optimize the implementation, there is no enforceable requirement that they do so (indeed Boost does not currently optimize these at present). The regular expression algorithms all also take a series of flags - expressed as a combination of bitmasks - to indicate things like whether the string being searched represents the start/end of a line etc. These flags are all defined in the nested namespace std::regex_constants, and are of type std::regex_constants::match_flag_type.

The first two algorithms are regex_search and regex_match. The former will find an occurrence of a regular expression within a string, while the latter will determine whether a regular expression matches the string exactly and is mainly used for data input validation. Note that the effect of regex_match can be obtained by using regex_search with an expression that is suitably anchored at either end. However, from the programmers point of view, the advantage of using regex_match is that it is obvious from reading the code that it requires an exact match against the expression, and not just an occurrence somewhere within the string.

The names of the algorithms in this proposal are based upon the names Boost uses and all have a "regex_" prefix, which may cause some controversy (as may the naming used in general in this proposal). This proposal does not use overly generic names like match/compare/search/find here, as the possibility for ambiguous overloads occurring should be obvious. An alternative is to use a generic name, but placed in a nested namespace (let's say std::regex). This may be an appropriate option, but ambiguities can still occur either as a result of argument dependent lookup, or a using directive. Many people may see the latter as an evil, but experience suggests that statements such as:

using namespace whatever;

are remarkably common in practice. For this reason the "regex_" prefixed names that Boost regex uses are retained in this proposal. Here are the twelve forms for regex_match and regex_search:

template <class BidirectionalIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(BidirectionalIterator first, BidirectionalIterator last,
                 match_results<BidirectionalIterator, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

template <class BidirectionalIterator, class charT, class traits, class Allocator2>
bool regex_match(BidirectionalIterator first, BidirectionalIterator last,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

template <class charT, class Allocator, class traits, class Allocator2>
bool regex_match(const charT* str, match_results<const charT*, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m, 
                 const reg_expression<charT, traits, Allocator2>& e, 
                 match_flag_type flags = match_default);

template <class charT, class traits, class Allocator2>
bool regex_match(const charT* str,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

template <class ST, class SA, class charT, class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

template <class BidirectionalIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
                  match_results<BidirectionalIterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

template <class charT, class Allocator, class traits, class Allocator2>
bool regex_search(const charT* str, match_results<const charT*, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

template <class BidirectionalIterator, class Allocator, class charT,
          class traits>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);

template <class charT, class Allocator, class traits>
bool regex_search(const charT* str
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);

template <class ST, class SA, class Allocator, class charT,
          class traits>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);

The algorithms regex_match and regex_search both support a feature not commonly seen in regular expression libraries: a partial match. When the flag std::regex_constants::match_partial is set in the flags passed to the algorithm, then a result of true may be returned if one or more characters were matched, and the state machine then reached the end of the character sequence while there were still states to be matched. Partial matches answer the question: "could a match have been found if more input was provided?". They have proven to be extremely useful in two situations: for data input validation, where validation occurs a character at a time as the input is entered, and for long searches through character sequences that are too long to be loaded into memory and therefore have to be buffered (in this case the partial match allows the code to determine whether a match could occur at an overlap between two buffers)[23].

Many tools such as perl or sed offer algorithms for performing search and replace operations on a string. These take an input sequence, find all occurrences of a regular expression within it, and replace whatever matched with the result of merging a format string with what matched. The equivalent in the Boost library is the algorithm regex_merge; there are two main forms, one sends the new text string to an output iterator, while the other returns a basic_string<> object. In this proposal this algorithm has been renamed regex_replace in order to make it a little more memorable:

template <class OutputIterator, class BidirectionalIterator, class traits,
          class Allocator, class charT>
OutputIterator regex_replace(OutputIterator out,
                           BidirectionalIterator first,
                           BidirectionalIterator last,
                           const reg_expression<charT, traits, Allocator>& e,
                           const basic_string<charT>& fmt,
                           unsigned int flags = match_default);

template <class traits, class Allocator, class charT>
basic_string<charT> regex_replace(const basic_string<charT>& s,
                            const reg_expression<charT, traits, Allocator>& e,
                            const basic_string<charT>& fmt,
                            unsigned int flags = match_default);

Repeated regular expression searches

It is common practice to want to find all of the (non-overlapping) occurrences of a regular expression within a character sequence. Boost uses the algorithm regex_grep for this; this algorithm searches through the sequence, and for each match found calls a unary predicate with the results of the match. However there is a sense in which this is the wrong interface; basically it provides a "push" interface, where as the standard library normally uses "pull" interfaces in the form of iterators. Of course, one could just call regex_search repeatedly, however these algorithms often have to do some housekeeping tasks (memory allocation is a typical example) each time they are called. By providing a separate interface for finding all the occurrences of an expression, the implementation can ensure that these housekeeping tasks occur only once, rather than once for each occurrence of the expression. This proposal does not provide the regex_grep interface that Boost does, instead it provides a regex_iterator, whose value_type is an instance of the match_results class template.

template <class BidirectionalIterator, 
          class charT = iterator_traits<BidirectionalIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_iterator;

Another common task is a perl-like split operation; to find all the non-overlapping occurrences of an expression within a character sequence, and for each occurrence spit out one or more new strings (usually one new string for each marked sub-expression). In Boost regex this is accomplished using the regex_split algorithm, however like regex_grep this algorithm has a "push" rather than a "pull" interface. In addition the semantics of the perl split function (on which the Boost algorithm is modeled) are rather strange; giving different behavior depending upon whether there are any marked sub-expressions or not. For these reasons, in this proposal an iterator-based interface (regex_token_iterator) is provided that actually has a much more flexible interface than the Boost regex_split function. The iterator can be constructed with an integer argument indicating the index of the sub-expression to enumerate for each match, in which case the iterator will enumerate one string for each match found. Alternatively the iterator can be constructed from an array literal, containing a list of the sub-expressions to iterate, in which case there will be several strings enumerated for each match found.

template <class BidirectionalIterator, 
          class charT = iterator_traits<BidirectionalIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_token_iterator;

Typical usage would be for example:

extern std::string   text;
extern std::regex    re;
//
// This iterator will enumerate one string for each occurrence of re in text,
// each string will be equivalent to what perl calls $` (i.e. field splitting):
std::regex_token_iterator<std::string::const_iterator> it1(text.begin(), text.end(), re, -1);
//
// This iterator will enumerate one string for each occurrence of re in text,
// each string will be equivalent to what perl calls $& (the whole match):
std::regex_token_iterator<std::string::const_iterator> it1(text.begin(), text.end(), re, 0);
//
// This iterator will enumerate three string's for each occurrence of re in text,
// these will be what perl calls $1, $2 and $5 :
std::regex_token_iterator<std::string::const_iterator> it1(text.begin(), text.end(), re, {1,2,5});

Iterator requirements

While all the algorithms discussed so far are iterator-based, no mention has been made so far of iterator requirements. The following options are available along with their pro's and con's:

InputIterator: this is discounted as an option, input iterators can not indicate what matched (as many of the algorithms must do), nor can they be used to match back-references.

ForwardIterator: this is a real option, forward iterators can be used to indicate what matched. In addition there is in practice a real world use case: multibyte character sequences can be converted to a single wide character "on the fly" using a forward iterator adapter. Note that some multi-byte sequences can not be represented by a bi-directional or random access iterator. There are downsides as well: requiring forward iterator support places a burden on the implementation, in particular matching boundary conditions (such as ^ or $) require that the implementation remembers the last character visited as well as the current position. In addition perl-style backtracking algorithms become much less efficient for single character repeats (a very common construct), since every backtrack-position has to be saved, rather than computed by decrementing an iterator. There are also some perl features (zero width look-behind assertions) that can not be matched using forward iterators, however these features are not required by the ECMAScript standard. It is unclear just how much of a burden forward iterator support would be in practice; some experimental evidence would be of help here.

BidirectionalIterator: This requirement is much easier to implement, and can additionally be used with some kinds of multibyte sequences (Unicode UTF8 encoding for example).

RandomAccessIterator: As with bidirectional iterators, this requirement is easy to implement, it also offers the potential for significant performance gains: for POSIX style matches, computing "leftmost longest" matches is much more efficient, while for perl style matches, skipping through greedy ".*" repeats are much more efficient, and for some expressions Boyer-Moore like search heuristics are possible.

Of the two template libraries discussed here, Boost requires bidirectional iterators, while GRETA doesn't document it's requirements (but appears to also require bidirectional iterators). In practice insisting on ForwardIterator support is probably excessive, while BidirectionIterators are relatively easy to support, in addition the kinds of performance gains that can be obtained from RandomAccessIterators can still be added using compile-time dispatch to different scanners depending upon the iterator type. This proposal currently documents that all iterators must be bidirectional, this can be tightened to a requirement to support forward iterators if practice shows that this is indeed possible (in the meantime implementations should be encouraged to support the widest range of iterator types possible).

 


Proposed Text


Forward

FW.1 Normative References

ECMA-262, ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects.

ECMA-262, ECMAScript Language Specification, Chapter 15 part 5.4.11 String.prototype.replace.

IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Base Definitions and Headers, Section 6.1, Portable Character Set.

IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Base Definitions and Headers, Section 9, Regular Expressions.

IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Shells and Utilities, Section 4, Utilities, awk.

IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Shells and Utilities, Section 4, Utilities, grep.

FW.2 Definitions

Collating element: A sequence of one or more characters within the current locale that collate as if they were a single character.

Finite state machine: An unspecified data structure that is used to represent a regular expression, and which permits efficient matches against the regular expression to be obtained.

Format specifier: A sequence of one or more characters that is to be replaced with some part of a regular expression match.

Matched: A sequence of zero or more characters shall be said to be matched by a regular expression when the characters in the sequence correspond to a sequence of characters defined by the pattern.

Partial match: A match that is obtained by matching one or more characters at the end of a character-container sequence, but only a prefix of the regular expression.

Primary equivalence class: A set of one or more characters which share the same primary sort key: that is the sort key weighting that depends only upon character shape, and not accentation, case, or locale specific tailorings.

Regular expression: A pattern that selects specific strings from a set of character strings.

Sub-expression: A subset of a regular expression that has been marked by parenthesis.

Regular Expressions

This clause describes components used by C++ programs to describe regular expressions, and perform search and replace operations on character strings using those expressions.

Table RE1 - Regular Expression Summary

Clause

Header

Regular Expression Traits

<regex>

Header <regex> synopsis

<regex>

Supporting Types

<regex>

Template class basic_regex

<regex>

Template class sub_match

<regex>

Template class match_results

<regex>

Regular expression algorithms

<regex>

Regular expression Iterators

<regex>

RE.1 Regular Expression Traits

This subclause defines requirements on classes representing regular expression traits, and defines a class template regex_traits<charT> that satisfies those requirements.

The class template basic_regex needs a set of related types and functions to complete the definition of it's semantics. These types and functions are provided as a set of member typedefs and functions in the template parameter `traits' used by the basic_regex template. This subclause defines the semantics guaranteed by these members.

To specialize the basic_regex template to generate a regular expression class to handle a particular character container type CharT, that and its related regular expression traits class Traits is passed as a pair of parameters to the basic_regex template as formal parameters charT and Traits.

RE.1.1 Regular Expression Traits Requirements

In Table RE2 X denotes a traits class defining types and functions for the character container type charT; u is an object of type X; v is an object of type const X; p is a value of type const charT*; I1 and I2 are Input Iterators; c is a value of type const charT; s is an object of type X::string_type; cs is an object of type const X::string_type; b is a value of type bool; I is a value of type int; and loc is an object of type X::locale_type.

Table RE2: regular expression traits class requirements

Expression

Return type

Assertion / Note
Pre / Post condition

X::char_type

charT

The character container type used in the implementation of class template basic_regex.

X::size_type

 

An unsigned integer type, capable of holding the length of a null-terminated string of charT's.

X::string_type

std::basic_string<charT>

 

X::locale_type

Implementation defined

A copy constructible type that represents the locale used by the traits class.

X::char_class_type

Implementation defined

A bitmask type representing a particular character classification. Multiple values of this type can be bitwise-or'ed together to obtain a new valid value.

X::version_tag

Either regex_traits_version_1_tag or a class that publicly inherits from regex_traits_version_1_tag.

A type that signifies the interface to which this traits class conforms.

X::sentry

Implementation defined

A type used to force initialization of an instance of this traits class. An instance of this type must be copy constructed from a traits class instance, and verified that it does not compare equal to a null pointer constant, before any member function of that traits class instance may be called other than length, imbue and getloc.

X::sentry s(u);

void

An object of type X::sentry shall be copy constructible from an instance of X.

X::sentry(u)

Convertible to bool.

Conversion of the sentry class to bool, is used to verify that the traits class has been correctly initialized.

X::length(p)

X::size_type

Yields the smallest i such that p[i] == 0. Complexity is linear in i.

v.syntax_type(c)

regex_constants::syntax_type

Returns a symbolic value of type regex_constants::syntax_type that signifies the meaning of character c within the regular expression grammar.

v.escape_syntax_type(c)

regex_constants::escape_syntax_type

Returns a symbolic value of type regex_constants::escape_syntax_type, that signifies the meaning of character c within the regular expression grammar, when c has been preceded by an escape character. Precondition: if b is the character preceding c in the expression being parsed then: v.syntax_type(b) == syntax_escape

v.translate(c, b)

X::char_type

Returns a character d such that: for any character d that is to be considered equivalent to c then v.translate(c,false)==v.translate(d,false). Likewise for all characters C that are to be considered equivalent to c when comparisons are to be performed without regard to case, then v.translate(c,true)==v.translate(C,true).

v.transform(cs)

X::string_type

Returns a sort key for string cs such that: if string cs sorts before string ct then: v.transform(cs) < v.transform(ct)

v.transform_primary(cs)

X::string_type

Returns a primary sort key for string cs, such that if cs sorts before string ct, when character case is not considered, then v.transform(cs) < v.transform(ct). Returns an empty string on error.

v.lookup_classname(cs)

X::char_class_type

Converts the string cs into a bitmask type that can subsequently be passed to is_class. Values returned from lookup_classname can be safely bitwise or'ed together. Returns an empty string if cs is not the name of a character class recognized by X. At least the names "d", "w", "s", "alnum", "alpha", "blank", "cntrl", "digit", "graph", "lower", "print", "punct", "space", "upper" and "xdigit", shall be recognized. The value returned shall be independent of the case of the characters in cs.

v.lookup_collatename(cs)

X::string_type

Returns a string containing the collating element named by cs, or an empty string if cs is not a recognized name. 

v.is_class(c, v.lookup_classname (cs))

bool

Returns true if character c is a member of the character class cs, false otherwise. 

v.toi(I1, I2, i)

int

Behaves as follows: if p==q or v.is_class(*p, v.lookup_classname("d")) == false then returns -1. Otherwise performs formatted numeric input on the sequence [p,q) and returns the result as an int. Postcondition: either p == q or v.is_class(*p, v.lookup_classname("d")) == false

u.imbue(loc)

X::locale_type

Imbues u with the locale loc, returns the previous locale used by u if any. 

v.getloc()

X::locale_type

Returns the current locale used by v if any. 

v.error_string(i)

std::string

Returns a human readable error string for the error condition i, where i is one of the values enumerated by type regex_constants::error_type.  If the value i is not recognized then returns the string "Unknown error" or a localized equivalent.

 

RE.1.2 Regular expression Traits Classes

The header <regex> defines the class template regex_traits which shall be capable of being specialized for character container types char and wchar_t, and which satisfies the requirements for a regular expression traits class. Class template regex_traits is described in section RE.3.3.

RE.2 Regular Expressions

The header <regex> defines a basic regular expression class template and its traits that can handle all char-like (lib.strings) template arguments.

The header <regex> defines a container-like class template that holds the result of a regular expression match.

The header <regex> defines a series of algorithms that allow an iterator sequence to be operated upon by a regular expression.

The header <regex> defines two specific template classes, regex and wregex and their special traits.

The header <regex> also defines two iterator types for enumerating regular expression matches.

Header <regex> synopsis

namespace std{

namespace regex_constants{

typedef bitmask_type syntax_option_type;
typedef bitmask_type match_flag_type;
typedef implementation defined syntax_type;
typedef implementation defined escape_syntax_type;
typedef implementation defined error_type;

} // namespace regex_constants

class bad_expression;

template <class charT>
struct regex_traits;

template <class charT,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class basic_regex;

template <class charT, class traits, class Allocator>
bool operator == (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);
template <class charT, class traits, class Allocator>
bool operator != (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);
template <class charT, class traits, class Allocator>
bool operator < (const basic_regex<charT, traits, Allocator>& lhs,
                 const basic_regex<charT, traits, Allocator>& rhs);
template <class charT, class traits, class Allocator>
bool operator <= (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);
template <class charT, class traits, class Allocator>
bool operator >= (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);
template <class charT, class traits, class Allocator>
bool operator > (const basic_regex<charT, traits, Allocator>& lhs,
                 const basic_regex<charT, traits, Allocator>& rhs);

template <class charT, class io_traits, class re_traits, class Allocator>
basic_ostream<charT, io_traits>&
   operator << (basic_ostream<charT, io_traits>& os,
                const basic_regex<charT, re_traits, Allocator>& e);

template <class charT, class traits, class Allocator>
void swap(basic_regex<charT, traits, Allocator>& e1,
          basic_regex<charT, traits, Allocator>& e2);

typedef basic_regex<char> regex;
typedef basic_regex<wchar_t> wregex;

template <class BidirectionalIterator>
class sub_match;

template <class BidirectionalIterator>
bool operator == (const sub_match<BidirectionalIterator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);
template <class BidirectionalIterator>
bool operator != (const sub_match<BidirectionalIterator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);
template <class BidirectionalIterator>
bool operator < (const sub_match<BidirectionalIterator>& lhs,
                 const sub_match<BidirectionalIterator>& rhs);
template <class BidirectionalIterator>
bool operator <= (const sub_match<BidirectionalIterator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);
template <class BidirectionalIterator>
bool operator >= (const sub_match<BidirectionalIterator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);
template <class BidirectionalIterator>
bool operator > (const sub_match<BidirectionalIterator>& lhs,
                 const sub_match<BidirectionalIterator>& rhs);


template <class BidirectionalIterator, class traits, class Allocator> 
bool operator == (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);
template <class BidirectionalIterator, class traits, class Allocator> 
bool operator != (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);
template <class BidirectionalIterator, class traits, class Allocator> 
bool operator < (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
                 const sub_match<BidirectionalIterator>& rhs);
template <class BidirectionalIterator, class traits, class Allocator> 
bool operator > (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
                 const sub_match<BidirectionalIterator>& rhs);
template <class BidirectionalIterator, class traits, class Allocator> 
bool operator >= (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);
template <class BidirectionalIterator, class traits, class Allocator> 
bool operator <= (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);

template <class BidirectionalIterator, class traits, class Allocator> 
bool operator == (const sub_match<BidirectionalIterator>& lhs,
                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
template <class BidirectionalIterator, class traits, class Allocator> 
bool operator != (const sub_match<BidirectionalIterator>& lhs,
                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
template <class BidirectionalIterator, class traits, class Allocator> 
bool operator < (const sub_match<BidirectionalIterator>& lhs,
                 const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
template <class BidirectionalIterator, class traits, class Allocator> 
bool operator > (const sub_match<BidirectionalIterator>& lhs,
                 const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
template <class BidirectionalIterator, class traits, class Allocator> 
bool operator >= (const sub_match<BidirectionalIterator>& lhs,
                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
template <class BidirectionalIterator, class traits, class Allocator> 
bool operator <= (const sub_match<BidirectionalIterator>& lhs,
                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);

template <class BidirectionalIterator> 
bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 
template <class BidirectionalIterator> 
bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 
template <class BidirectionalIterator> 
bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                 const sub_match<BidirectionalIterator>& rhs); 
template <class BidirectionalIterator> 
bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                 const sub_match<BidirectionalIterator>& rhs); 
template <class BidirectionalIterator> 
bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 
template <class BidirectionalIterator> 
bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 

template <class BidirectionalIterator> 
bool operator == (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
template <class BidirectionalIterator> 
bool operator != (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
template <class BidirectionalIterator> 
bool operator < (const sub_match<BidirectionalIterator>& lhs, 
                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
template <class BidirectionalIterator> 
bool operator > (const sub_match<BidirectionalIterator>& lhs, 
                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
template <class BidirectionalIterator> 
bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
template <class BidirectionalIterator> 
bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 

template <class BidirectionalIterator> 
bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 
template <class BidirectionalIterator> 
bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 
template <class BidirectionalIterator> 
bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                 const sub_match<BidirectionalIterator>& rhs); 
template <class BidirectionalIterator> 
bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                 const sub_match<BidirectionalIterator>& rhs); 
template <class BidirectionalIterator> 
bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 
template <class BidirectionalIterator> 
bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 

template <class BidirectionalIterator> 
bool operator == (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
template <class BidirectionalIterator> 
bool operator != (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
template <class BidirectionalIterator> 
bool operator < (const sub_match<BidirectionalIterator>& lhs, 
                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
template <class BidirectionalIterator> 
bool operator > (const sub_match<BidirectionalIterator>& lhs, 
                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
template <class BidirectionalIterator> 
bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
template <class BidirectionalIterator> 
bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 

template <class charT, class traits, class BidirectionalIterator>
basic_ostream<charT, traits>&
   operator << (basic_ostream<charT, traits>& os,
                const sub_match<BidirectionalIterator>& m);

template <class BidirectionalIterator,
          class Allocator = allocator<
               typename iterator_traits<BidirectionalIterator>::value_type > >
class match_results;

template <class BidirectionalIterator, class Allocator>
bool operator == (const match_results<BidirectionalIterator, Allocator>& m1,
                  const match_results<BidirectionalIterator, Allocator>& m2);
template <class BidirectionalIterator, class Allocator>
bool operator != (const match_results<BidirectionalIterator, Allocator>& m1,
                  const match_results<BidirectionalIterator, Allocator>& m2);

template <class charT, class traits, class BidirectionalIterator, class Allocator>
basic_ostream<charT, traits>&
   operator << (basic_ostream<charT, traits>& os,
                const match_results<BidirectionalIterator, Allocator>& m);

template <class BidirectionalIterator, class Allocator>
void swap(match_results<BidirectionalIterator, Allocator>& m1,
          match_results<BidirectionalIterator, Allocator>& m2);

typedef match_results<const char*> cmatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<string::const_iterator> smatch;
typedef match_results<wstring::const_iterator> wsmatch;

template <class BidirectionalIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(BidirectionalIterator first, BidirectionalIterator last,
                 match_results<BidirectionalIterator, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class BidirectionalIterator, class charT, class traits, class Allocator2>
bool regex_match(BidirectionalIterator first, BidirectionalIterator last,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class charT, class Allocator, class traits, class Allocator2>
bool regex_match(const charT* str, match_results<const charT*, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m, 
                 const reg_expression<charT, traits, Allocator2>& e, 
                 match_flag_type flags = match_default);
template <class charT, class traits, class Allocator2>
bool regex_match(const charT* str,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);
template <class ST, class SA, class charT, class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

template <class BidirectionalIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
                  match_results<BidirectionalIterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);
template <class charT, class Allocator, class traits, class Allocator2>
bool regex_search(const charT* str, match_results<const charT*, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);
template <class BidirectionalIterator, class Allocator, class charT,
          class traits>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);
template <class charT, class Allocator, class traits>
bool regex_search(const charT* str
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);
template <class ST, class SA, class Allocator, class charT,
          class traits>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);
template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

template <class OutputIterator, class BidirectionalIterator, class traits,
          class Allocator, class charT>
OutputIterator regex_replace(OutputIterator out,
                           BidirectionalIterator first,
                           BidirectionalIterator last,
                           const reg_expression<charT, traits, Allocator>& e,
                           const basic_string<charT>& fmt,
                           match_flag_type flags = match_default);
template <class traits, class Allocator, class charT>
basic_string<charT> regex_replace(const basic_string<charT>& s,
                            const reg_expression<charT, traits, Allocator>& e,
                            const basic_string<charT>& fmt,
                            match_flag_type flags = match_default);

// regular expression iterators:
template <class BidirectionalIterator, 
          class charT = iterator_traits<BidirectionalIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_iterator;
template <class BidirectionalIterator, 
          class charT = iterator_traits<BidirectionalIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_token_iterator;

} // namespace std

 

RE.3 Supporting Types

RE.3.1 Namespace std::regex_constants

The namespace std::regex_constants acts as a repository for the symbolic constants used by the regular expression library.

The namespace std::regex_constants defines five types: syntax_option_type, match_flag_type, syntax_type, escape_syntax_type and error_type, along with a series of constants of these types.

RE.3.1.1 Bitmask Type syntax_option_type
namespace std{ namespace regex_constants{

typedef bitmask_type syntax_option_type;
// these flags are required:
static const syntax_option_type normal;
static const syntax_option_type icase;
static const syntax_option_type nosubs;
static const syntax_option_type optimize;
static const syntax_option_type collate;
static const syntax_option_type ECMAScript = normal;
static const syntax_option_type JavaScript = normal;
static const syntax_option_type JScript = normal;
// these flags are optional, if the functionality is supported
// then the flags shall take these names.
static const syntax_option_type basic;
static const syntax_option_type extended;
static const syntax_option_type awk;
static const syntax_option_type grep;
static const syntax_option_type egrep;
static const syntax_option_type sed = basic;
static const syntax_option_type perl;

} // namespace regex_constants
} // namespace std

The type syntax_option_type is an implementation defined bitmask type (17.3.2.1.2). Setting its elements has the effects listed in table RE3, a valid value of type syntax_option_type will always have exactly one of the elements normal, basic, extended, awk, grep, egrep, sed or perl set:

Table RE3: syntax_option_type effects.

Element

Effect if set

normal

Specifies that the grammar recognized by the regular expression engine uses its normal semantics: that is the same as that given in the ECMA-262, ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects (FWD.1).

icase

Specifies that matching of regular expressions against a character container sequence shall be performed without regard to case.

nosubs

Specifies that when a regular expression is matched against a character container sequence, then no sub-expression matches are to be stored in the supplied match_results structure.

optimize

Specifies that the regular expression engine should pay more attention to the speed with which regular expressions are matched, and less to the speed with which regular expression objects are constructed. Otherwise it has no detectable effect on the program output.

collate

Specifies that character ranges of the form "[a-b]" should be locale sensitive.

ECMAScript

The same as normal.

JavaScript

The same as normal.

JScript

The same as normal.

basic

Specifies that the grammar recognized by the regular expression engine is the same as that used by POSIX basic regular expressions in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Base Definitions and Headers, Section 9, Regular Expressions (FWD.1).

extended

Specifies that the grammar recognized by the regular expression engine is the same as that used by POSIX extended regular expressions in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Base Definitions and Headers, Section 9, Regular Expressions (FWD.1).

awk

Specifies that the grammar recognized by the regular expression engine is the same as that used by POSIX utility awk in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Shells and Utilities, Section 4, awk (FWD.1).

grep

Specifies that the grammar recognized by the regular expression engine is the same as that used by POSIX utility grep in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Shells and Utilities, Section 4, Utilities, grep (FWD.1).

egrep

Specifies that the grammar recognized by the regular expression engine is the same as that used by POSIX utility grep when given the -E option in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Shells and Utilities, Section 4, Utilities, grep (FWD.1).

sed

The same as basic.

perl

Specifies that the grammar recognized by the regular expression is an implementation defined extension of the normal syntax.

RE.3.1.2 Bitmask Type match_flag_type
namespace std{ namespace regex_constants{

typedef bitmask_type match_flag_type;

static const match_flag_type match_default = 0;
static const match_flag_type match_not_bol;
static const match_flag_type match_not_eol;
static const match_flag_type match_not_bow;
static const match_flag_type match_not_eow;
static const match_flag_type match_any;
static const match_flag_type match_not_null;
static const match_flag_type match_continuous;
static const match_flag_type match_partial;
static const match_flag_type match_prev_avail;
static const match_flag_type format_default = 0;
static const match_flag_type format_sed;
static const match_flag_type format_perl;
static const match_flag_type format_no_copy;
static const match_flag_type format_first_only;

} // namespace regex_constants
} // namespace std

The type match_flag_type is an implementation defined bitmask type (17.3.2.1.2). When matching a regular expression against a sequence of characters [first, last) then setting its elements has the effects listed in table RE4:

Table RE4: match_flag_type effects when obtaining a match against a character container sequence [first,last).

Element

Effect if set

match_default

Specifies that matching of regular expressions proceeds without any modification of the normal rules used in ECMA-262, ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects (FWD.1)

match_not_bol

Specifies that the expression "^" should not be matched against the sub-sequence [first,first).

match_not_eol

Specifies that the expression "$" should not be matched against the sub-sequence [last,last).

match_not_bow

Specifies that the expression "\b" should not be matched against the sub-sequence [first,first).

match_not_eow

Specifies that the expression "\b" should not be matched against the sub-sequence [last,last).

match_any

Specifies that if more than one match is possible then any match is an acceptable result.

match_not_null

Specifies that the expression can not be matched against an empty sequence.

match_continuous

Specifies that the expression must match a sub-sequence that begins at first.

match_partial

Specifies that if no match can be found, then it is acceptable to return a match [from, last) where from!=last, if there exists some sequence of characters [from,to) of which [from,last) is a prefix, and which would result in a full match.

match_prev_avail

Specifies that --first is a valid iterator position, when this flag is set then the flags match_not_bol and match_not_bow are ignored by the regular expression algorithms (RE.7) and iterators (RE.8).

format_default

Specifies that when a regular expression match is to be replaced by a new string, that the new string is constructed using the rules used by the ECMAScript replace function in ECMA-262, ECMAScript Language Specification, Chapter 15 part 5.4.11 String.prototype.replace. (FWD.1). In addition during search and replace operations then all non-overlapping occurrences of the regular expression are located and replaced, and sections of the input that did not match the expression, are copied unchanged to the output string.

format_sed

Specifies that when a regular expression match is to be replaced by a new string, that the new string is constructed using the rules used by the Unix sed utility in IEEE Std 1003.1-2001, Portable Operating SystemInterface (POSIX ), Shells and Utilities..

format_perl

Specifies that when a regular expression match is to be replaced by a new string, that the new string is constructed using an implementation defined superset of the rules used by the ECMAScript replace function in ECMA-262, ECMAScript Language Specification, Chapter 15 part 5.4.11 String.prototype.replace (FWD.1).

format_no_copy

When specified during a search and replace operation, then sections of the character container sequence being searched that do match the regular expression, are not copied to the output string.

format_first_only

When specified during a search and replace operation, then only the first occurrence of the regular expression is replaced.

 

RE.3.1.3 Implementation defined syntax_type
namespace std{ namespace regex_constants{

typedef implementation defined syntax_type;

static const syntax_type syntax_char;
static const syntax_type syntax_open_mark;
static const syntax_type syntax_close_mark;
static const syntax_type syntax_dollar;
static const syntax_type syntax_caret;
static const syntax_type syntax_dot;
static const syntax_type syntax_star;
static const syntax_type syntax_plus;
static const syntax_type syntax_question;
static const syntax_type syntax_open_set;
static const syntax_type syntax_close_set;
static const syntax_type syntax_or;
static const syntax_type syntax_escape;
static const syntax_type syntax_dash;
static const syntax_type syntax_open_brace;
static const syntax_type syntax_close_brace;
static const syntax_type syntax_digit;
static const syntax_type syntax_comma;
static const syntax_type syntax_equal;
static const syntax_type syntax_colon;
static const syntax_type syntax_not;

} // namespace regex_constants
} // namespace std

The type syntax_type is an implementation defined enumeration type (17.3.2.1.2). Values of type syntax_type represent how individual characters should be interpreted within a localized regular expression grammar, table RE5 shows which special characters defined in ECMA-262, ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects, are equivalent to which syntax_type values in the C locale:

Table RE5: syntax_type values in the C locale.

Value

Equivalent character(s) in the syntax specified by ECMA-262, ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects.

syntax_char

Any character not listed below.

syntax_open_mark

(

syntax_close_mark

)

syntax_dollar

$

syntax_caret

^

syntax_dot

.

syntax_star

*

syntax_plus

+

syntax_question

?

syntax_open_set

[

syntax_close_set

]

syntax_or

|

syntax_escape

\

syntax_dash

-

syntax_open_brace

{

syntax_close_brace

}

syntax_digit

0123456789

syntax_comma

,

syntax_equal

=

syntax_colon

:

syntax_not

!

 
RE.3.1.4 Implementation defined escape_syntax_type
namespace std{ namespace regex_constants{

typedef implementation defined escape_syntax_type;

static const escape_syntax_type escape_type_word_assert;
static const escape_syntax_type escape_type_not_word_assert;
static const escape_syntax_type escape_type_control_f;
static const escape_syntax_type escape_type_control_n;
static const escape_syntax_type escape_type_control_r;
static const escape_syntax_type escape_type_control_t;
static const escape_syntax_type escape_type_ascii_control;
static const escape_syntax_type escape_type_hex;
static const escape_syntax_type escape_type_unicode;
static const escape_syntax_type escape_type_identity;
static const escape_syntax_type escape_type_backref;
static const escape_syntax_type escape_type_decimal;
static const escape_syntax_type escape_type_class;
static const escape_syntax_type escape_type_not_class;

} // namespace regex_constants
} // namespace std

The type escape_syntax_type is an implementation defined enumeration type (17.3.2.1.2). Values of type escape_syntax_type represent how individual escaped characters should be interpreted within a localized regular expression grammar, table RE6 shows which special characters defined in ECMA-262, ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects, are equivalent to which escape_syntax_type values in the C locale:

Table RE6: escape_syntax_type values in the C locale.

Value

Equivalent character(s) in syntax specified in ECMA-262, ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects

escape_type_word_assert

b

escape_type_not_word_assert

B

escape_type_control_f

f

escape_type_control_n

n

escape_type_control_r

r

escape_type_control_t

t

escape_type_ascii_control

c

escape_type_hex

x

escape_type_unicode

u

escape_type_identity

\ |^$*+?{}.()[]

escape_type_backref

123456789

escape_type_decimal

0

escape_type_not_class

Any upper case character not listed above.

escape_type_class

Any non-upper case character not listed above.

 
RE.3.1.5 Implementation defined error_type
namespace std{ namespace regex_constants{

typedef implementation defined error_type;

static const error_type error_collate;
static const error_type error_ctype;
static const error_type error_escape;
static const error_type error_subreg;
static const error_type error_brack;
static const error_type error_paren;
static const error_type error_brace;
static const error_type error_badbrace;
static const error_type error_range;
static const error_type error_space;
static const error_type error_badrepeat;
static const error_type error_complexity;
static const error_type error_stack;

} // namespace regex_constants
} // namespace std

The type error_type is an implementation defined enumeration type (17.3.2.1.2). Values of type error_type represent the error conditions as described in table RE7:

Table RE7: error_type values.

Value

Error condition

error_collate

The expression contained an invalid collating element name.

error_ctype

The expression contained an invalid character class name.

error_escape

The expression contained an invalid escaped character, or a trailing escape.

error_subreg

The expression contained an invalid backreference.

error_brack

The expression contained mismatched [ and ].

error_paren

The expression contained mismatched ( and ).

error_brace

The expression contained mismatched { and }

error_badbrace

The expression contained an invalid range in a {} expression.

error_range

The expression contained an invalid character range, for example [b-a].

error_space

There was insufficient memory to convert the expression into a finite state machine.

error_badrepeat

One of *?+{ was not preceded by a valid regular expression.

error_complexity

The complexity of an attempted match against a regular expression exceeded a pre-set level.

error_stack

There was insufficient memory to determine whether the regular expression could match the specified character sequence.

 

RE.3.2 Class bad_expression

class bad_expression : public runtime_error
{
public:
   explicit bad_expression(const string& what_arg);
};

The class bad_expression defines the type of objects thrown as exceptions to report errors during the conversion from a string representing a regular expression to a finite state machine.

bad_expression(const string& what_arg); 

Effects: Constructs an object of class bad_expression.

Postcondition: strcmp(what(), what_arg.c_str()) == 0.

RE.3.3 Template class regex_traits

namespace std{

template <class charT>
struct regex_traits
{
public:
   typedef charT                        char_type;
   typedef std::size_t                  size_type;
   typedef std::basic_string<char_type> string_type;
   typedef std::locale                  locale_type;
   typedef bitmask_type                 char_class_type;

   struct sentry
   {
      sentry(regex_traits<charT>&);
      operator void*();
   };

   regex_traits();
   static size_type length(const char_type* p);
   regex_constants::syntax_type syntax_type(charT c)const;
   regex_constants::escape_syntax_type escape_syntax_type(charT c) const;
   charT translate(charT c, bool icase) const;
   string_type transform(const string_type& in) const;
   string_type transform_primary(const string_type& in) const;
   char_class_type lookup_classname(const string_type& name) const;
   string_type lookup_collatename(const string_type& name) const;
   bool is_class(charT c, char_class_type f) const;
   template<class InputIterator>
   int toi(InputIterator& first, InputIterator last, int radix) const;
   locale_type imbue(locale_type l);
   locale_type getloc()const;
   std::string error_string(regex_constants::error_type) const;
};

} // namespace std

The class template regex_traits is capable of being specialized for the types char and wchar_t and satisfies the requirements for a regular expression traits class (RE.1.1).

typedef bitmask_type                 char_class_type;

The type char_class_type is used to represent a character classification and is capable of holding an implementation defined superset of the values held by ctype_base::mask (22.2.1).

struct sentry
{
   sentry(regex_traits<charT>&);
   operator void*();
};

An object of type regex_traits<charT>::sentry shall be constructed from a regex_traits object, and tested to be not equal to null, before any of the member functions of that object other than length, getloc, and imbue shall be called. Type sentry performs implementation defined initialization of the traits class object, and represents an opportunity for the traits class to cache data obtained from the locale object.

static size_type length(const char_type* p);

Effects: returns char_traits<charT>::length(p);

regex_constants::syntax_type syntax_type(charT c) const;

Effects: for a character c in the right hand column of table RE5, returns the corresponding regex_constants::syntax_type value.

regex_constants::escape_syntax_type escape_syntax_type(charT c) const;

Effects: for a character c in the right hand column of table RE6, returns the corresponding regex_constants::escape_syntax_type value.

charT translate(charT c, bool icase) const;

Effects: returns (icase ? use_facet<ctype<charT> >(getloc()).tolower(c) : c)

string_type transform(const string_type& in) const;

Effects: returns use_facet<collate<charT> >(getloc()).transform(in.begin(), in.end()).

string_type transform_primary(const string_type& in) const;

Effects: if typeid(use_facet<collate<charT> >) == typeid(collate_byname<charT>) and the form of the sort key returned by collate_byname<charT>::transform is known and can be converted into a primary sort key, then returns that key, otherwise returns a empty string.

char_class_type lookup_classname(const string_type& name) const;

Effects: returns an implementation defined value that represents the character classification name. If name is not recognized then returns a value that compares equal to 0. At least the names "d", "w", "s", "alnum", "alpha", "blank", "cntrl", "digit", "graph", "lower", "print", "punct", "space", "upper" and "xdigit", shall be recognized. The value returned shall be independent of the case of the characters in name.

string_type lookup_collatename(const string_type& name) const;

Effects: returns the sequence of one or more characters that represents the collating element name. Returns an empty string if name is not recognized. At least the names specified in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Base Definitions and Headers, Section 6.1, Portable Character Set shall be recognized.

bool is_class(charT c, char_class_type f) const;

Effects: determines if the character c is a member of the character classification represented by f.

Returns: converts f into a value m of type ctype_base::mask in an implementation defined manner, and returns true if use_facet<ctype<charT> >(getloc()).is(c, m) is true. Otherwise returns true if f & lookup_classname("w") == lookup_classname("w") and c == '_', otherwise returns false.

template <class InputIterator>
int toi(InputIterator& first, InputIterator last, int radix) const;

Precondition: argument radix shall take one of the values 8, 10 or 16.

Effects: constructs an object result of type int. If first == last or if is_class(*first, lookup_classname("d")) == false then sets result equal to -1. Otherwise constructs a basic_istream<charT> object which uses an implementation defined stream buffer type which represents the character sequence [first,last), and sets the format flags on that object as appropriate for argument radix. Performs unsigned integer formatted input on the basic_istream<charT> object setting the value result to the value obtained, and updates first to point to the first non-digit character in the sequence [first,last).

Returns: the value result.

Postcondition: is_class(*first, lookup_classname("d")) == false.

locale_type imbue(locale_type loc);

Imbues this with a copy of the locale loc. The effect of calling imbue is to invalidate all cached data held by this, no member functions other than length, getloc, and imbue may be called until an object of type sentry has been copy-constructed from *this.

Returns: if no locale has been previously imbued then a copy of the global locale in effect at the time of construction of this, otherwise a copy of the last argument passed to imbue.

Postcondition: getloc() == loc.

locale_type getloc()const;

Effects: if no locale has been imbued then a copy of the global locale in effect at the time of construction of this, otherwise a copy of the last argument passed to imbue.

std::string error_string(regex_constants::error_type e) const;

Effects: returns a human readable error string for error condition e.

RE.4 Template class basic_regex

For a char-like type charT, the template class basic_regex describes objects that represent a regular expression constructed from a sequence of charT's. In the rest of this clause, charT denotes such a given char-like type. Storage for the regular expression is allocated and freed as necessary by the member functions of class basic_regex.

The template class basic_regex conforms to the requirements of a Sequence, as specified in (lib.sequence.reqmts), except that only operations defined for const-qualified Sequences are supported.

Objects of type specialization of basic_regex are responsible for converting the sequence of charT objects to an internal representation. It is not specified what form this representation takes, nor how it is accessed by algorithms that operate on regular expressions (implementations will typically declare some template functions as friends of basic_regex to achieve this).

The regular expression grammar recognized by type specialization of basic_regex is described in ECMA-262, ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects (FWD.1), and is modified according to any syntax_option_type flags specified when constructing the object or assigning a regular expression to it (RE.3.1.1). In addition to the features specified in ECMA-262, the following features shall also be recognized:

The grammar used is localized by interaction with the traits class template parameter as follows:

Objects of type specialization of basic_regex store within themselves a default-constructed instance of their traits template parameter, henceforth referred to as traits_inst. This traits_inst object is used to support localization of the regular expression; no basic_regex object shall call any locale dependent C or C++ API, including the formatted string input functions, instead it shall call the appropriate traits member function to achieve the required effect.

The transformation from a sequence of characters to a finite state machine is accomplished by first by transforming the sequence of characters to the sequence of tokens obtained by calling traits_inst.syntax_type(c) for each input character c. The regular expression grammar is then applied to the sequence of tokens in order to construct the finite state machine (this is to allow the regular expression syntax to be localized to a specific character set; for example to use Far Eastern ideographs, rather than Latin characters). Where traits_inst.syntax_type(c) returns syntax_escape, then the implementation shall call traits_inst.escape_syntax_type(c) for the character following the escape, in order to determine the grammatical meaning of the escape sequence. When traits_inst.escape_syntax_type(c)returns escape_type_class or escape_type_not_class, then the single character following the escape is converted to a string n, and passed to traits_inst.lookup_classname(n), to determine the character classification to be matched against. If traits_inst.lookup_classname(n) returns null, then the escape is treated as an identity escape.

Where the regular expression grammar requires the conversion of a sequence of characters to an integral value, this is accomplished by calling traits::toi.

Where the regular expression grammar requires the conversion of a sequence of characters representing a named collating element to the character(s) it represents, this is accomplished by calling traits::lookup_collatename.

Where the regular expression grammar requires the conversion of a sequence of characters representing a named character class to an internal representation, this is accomplished by calling traits::lookup_classname. The results from subsequent calls to this function can be bitwise OR'ed together and subsequently passed to traist::is_class.

The behavior of the internal finite state machine representation, when used to match a sequence of characters is as described in ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects (FWD.1). The behavior is modified according to any match_flag_type flags specified (RE.3.1.2) when using the regular expression object in one of the regular expression algorithms (RE.7). The behavior is also localized by interaction with the traits class template parameter as follows:

During matching of a regular expression finite state machine against a sequence of characters, two characters c and d are compared using traits_inst.translate(c, getflags() & regex_constants::icase) == traits_inst.translate(d, getflags() & regex_constants::icase).

During matching of a regular expression finite state machine against a sequence of characters, comparison of a collating element range c1-c2 against a character c is conducted as follows: if getflags() & regex_constants::collate is true, then the character c is matched if traits_inst.transform(string_type(1,c1)) <= traits_inst.transform(string_type(1,c)) && traits_inst.transform(string_type(1,c)) <= traits_inst.transform(string_type(1,c2)), otherwise c is matched if c1 <= c && c <= c2.

During matching of a regular expression finite state machine against a sequence of characters, testing whether a collating element is a member of a primary equivalence class is conducted by first converting the collating element and the equivalence class to a sort keys using traits::transform_primary, and then comparing the sort keys for equality.

During matching of a regular expression finite state machine against a sequence of characters, a character c is a member of character class some_name, if traits_inst.is_class(c, traits_inst.lookup_classname("some_name")).

The functions described in this clause can report errors by throwing exceptions of type bad_expression.

namespace std{

template <class charT,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class basic_regex
{
public:
   // types:
   typedef          charT                                value_type;
   typedef          implementation defined               const_iterator;
   typedef          const_iterator                       iterator;
   typedef typename Allocator::reference                 reference;
   typedef typename Allocator::const_reference           const_reference;
   typedef typename Allocator::difference_type           difference_type;
   typedef typename Allocator::size_type                 size_type;
   typedef          Allocator                            allocator_type;
   typedef          regex_constants::syntax_option_type  flag_type;
   typedef typename traits::locale_type                  locale_type;

   // constants:
   static const regex_constants::syntax_option_type normal = regex_constants::normal;
   static const regex_constants::syntax_option_type icase = regex_constants::icase;
   static const regex_constants::syntax_option_type nosubs = regex_constants::nosubs;
   static const regex_constants::syntax_option_type optimize = regex_constants::optimize;
   static const regex_constants::syntax_option_type collate = regex_constants::collate;
   static const regex_constants::syntax_option_type ECMAScript = normal;
   static const regex_constants::syntax_option_type JavaScript = normal;
   static const regex_constants::syntax_option_type JScript = normal;
   // these flags are optional, if the functionality is supported
   // then the flags shall take these names.
   static const regex_constants::syntax_option_type basic = regex_constants::basic;
   static const regex_constants::syntax_option_type extended = regex_constants::extended;
   static const regex_constants::syntax_option_type awk = regex_constants::awk;
   static const regex_constants::syntax_option_type grep = regex_constants::grep;
   static const regex_constants::syntax_option_type egrep = regex_constants::egrep;
   static const regex_constants::syntax_option_type sed = basic = regex_constants::sed;
   static const regex_constants::syntax_option_type perl = regex_constants::perl;

   // construct/copy/destroy:
   explicit basic_regex(const Allocator& a = Allocator());
   explicit basic_regex(const charT* p, flag_type f = regex_constants::normal,
                        const Allocator& a = Allocator());
   basic_regex(const charT* p1, const charT* p2, flag_type f = regex_constants::normal,
               const Allocator& a = Allocator());
   basic_regex(const charT* p, size_type len, flag_type f,
               const Allocator& a = Allocator());
   basic_regex(const basic_regex&);
   template <class ST, class SA>
   explicit basic_regex(const basic_string<charT, ST, SA>& p,
                        flag_type f = regex_constants::normal,
                        const Allocator& a = Allocator());
   template <class InputIterator>
   basic_regex(InputIterator first, inputIterator last,
               flag_type f = regex_constants::normal,
               const Allocator& a = Allocator());

   ~basic_regex();
   basic_regex& operator=(const basic_regex&);
   basic_regex& operator=(const charT* ptr);
   template <class ST, class SA>
   basic_regex& operator=(const basic_string<charT, ST, SA>& p);

   // iterators:
   const_iterator begin() const;
   const_iterator end() const;
   // capacity:
   size_type size() const;
   size_type max_size() const;
   bool empty() const;
   unsigned mark_count() const;

   //
   // modifiers:
   basic_regex& assign(const basic_regex& that);
   basic_regex& assign(const charT* ptr, flag_type f = regex_constants::normal);
   basic_regex& assign(const charT* first, const charT* last,
                       flag_type f = regex_constants::normal);
   template <class string_traits, class A>
   basic_regex& assign(const basic_string<charT, string_traits, A>& s,
                       flag_type f = regex_constants::normal);
   template <class InputIterator>
   basic_regex& assign(InputIterator first, InputIterator last,
                       flag_type f = regex_constants::normal);

   // const operations:
   Allocator get_allocator() const;
   flag_type getflags() const;
   basic_string<charT> str() const;
   int compare(basic_regex&) const;
   // locale:
   locale_type imbue(locale_type loc);
   locale_type getloc() const;
   // swap
   void swap(basic_regex&) throw();
};

} // namespace std

RE.4.1 basic_regex constants

static const regex_constants::syntax_option_type normal = regex_constants::normal;
static const regex_constants::syntax_option_type icase = regex_constants::icase;
static const regex_constants::syntax_option_type nosubs = regex_constants::nosubs;
static const regex_constants::syntax_option_type optimize = regex_constants::optimize;
static const regex_constants::syntax_option_type collate = regex_constants::collate;
static const regex_constants::syntax_option_type ECMAScript = normal;
static const regex_constants::syntax_option_type JavaScript = normal;
static const regex_constants::syntax_option_type JScript = normal;
// these flags are optional, if the functionality is supported
// then the flags shall take these names.
static const regex_constants::syntax_option_type basic = regex_constants::basic;
static const regex_constants::syntax_option_type extended = regex_constants::extended;
static const regex_constants::syntax_option_type awk = regex_constants::awk;
static const regex_constants::syntax_option_type grep = regex_constants::grep;
static const regex_constants::syntax_option_type egrep = regex_constants::egrep;
static const regex_constants::syntax_option_type sed = basic = regex_constants::sed;
static const regex_constants::syntax_option_type perl = regex_constants::perl;

The static constant members are provided as synonyms for the constants declared in namespace std::regex_constants; for each constant of type syntax_option_type declared in namespace std::regex_constants then a constant with the same name, type and value shall be declared within the scope of basic_regex.

RE.4.2 basic_regex constructors

In all basic_regex constructors, a copy of the Allocator argument is used for any memory allocation performed by the constructor or member functions during the lifetime of the object.

basic_regex(const Allocator& a = Allocator());

Effects: Constructs an object of class basic_regex. The postconditions of this function are indicated in Table RE8:

Table RE8--basic_regex(const Allocator&) effects

Element

Value

empty()

true

size()

0

str()

basic_string<charT>()

basic_regex(const charT* p, flag_type f = regex_constants::normal, const Allocator& a = Allocator());

Requires: p shall not be a null pointer.

Throws: bad_expression if p is not a valid regular expression.

Effects: Constructs an object of class basic_regex; the object's internal finite state machine is constructed from the regular expression contained in the null-terminated string p, and interpreted according to the flags specified in f. The postconditions of this function are indicated in Table RE9:

Table RE9--basic_regex(const charT* p, flag_type f, const Allocator&) effects

Element

Value

empty()

false

size()

char_traits<charT>::length(p)

str()

basic_string<charT>(p)

getflags()

f

mark_count()

The number of marked sub-expressions within the expression.

basic_regex(const charT* p1, const charT* p2, flag_type f = regex_constants::normal, const Allocator& a = Allocator());

Requires: p1 and p2 are not null pointers, p1 < p2.

Throws: bad_expression if [p1,p2) is not a valid regular expression.

Effects: Constructs an object of class basic_regex; the object's internal finite state machine is constructed from the regular expression contained in the sequence of characters [p1,p2), and interpreted according the flags specified in f. The postconditions of this function are indicated in Table RE10:

Table RE10--basic_regex(const charT* p1, const charT* p2, flag_type f, const Allocator&) effects

Element

Value

empty()

false

size()

std::distance(p1,p2)

str()

basic_string<charT>(p1,p2)

getflags()

f

mark_count()

The number of marked sub-expressions within the expression.

basic_regex(const charT* p, size_type len, flag_type f, const Allocator& a = Allocator());

Requires: p shall not be a null pointer, len < max_size().

Throws: bad_expression if p is not a valid regular expression.

Effects: Constructs an object of class basic_regex; the object's internal finite state machine is constructed from the regular expression contained in the sequence of characters [p, p+len), and interpreted according the flags specified in f. The postconditions of this function are indicated in Table RE11:

Table RE11--basic_regex(const charT* p1, size_type len, flag_type f, const Allocator&) effects

Element

Value

empty()

false

size()

len

str()

basic_string<charT>(p, len)

getflags()

f

mark_count()

The number of marked sub-expressions within the expression.

basic_regex(const basic_regex& e);

Effects: Constructs an object of class basic_regex as a copy of the object e. The postconditions of this function are indicated in Table RE12:

Table RE12--basic_regex(const basic_regex& e) effects

Element

Value

empty()

e.empty()

size()

e.size()

str()

e.str()

getflags()

e.getflags()

mark_count()

e.mark_count()

template <class ST, class SA>
basic_regex(const basic_string<charT, ST, SA>& s,
            flag_type f = regex_constants::normal, const Allocator& a = Allocator());

Throws: bad_expression if s is not a valid regular expression.

Effects: Constructs an object of class basic_regex; the object's internal finite state machine is constructed from the regular expression contained in the string s, and interpreted according to the flags specified in f. The postconditions of this function are indicated in Table RE13:

Table RE13--basic_regex(const basic_regex& e) effects

Element

Value

empty()

false

size()

s.size()

str()

s

getflags()

f

mark_count()

The number of marked sub-expressions within the expression.

template <class ForwardIterator>
basic_regex(ForwardIterator first, ForwardIterator last,
            flag_type f = regex_constants::normal, const Allocator& a = Allocator());

Throws: bad_expression if the sequence [first, last) is not a valid regular expression.

Effects: Constructs an object of class basic_regex; the object's internal finite state machine is constructed from the regular expression contained in the sequence of characters [first, last), and interpreted according to the flags specified in f. The postconditions of this function are indicated in Table RE14:

Table RE14--basic_regex(ForwardIterator first, ForwardIterator last, flag_type f, const Allocator&) effects

Element

Value

empty()

false

size()

distance(first,last)

str()

basic_string<charT>(first,last)

getflags()

f

mark_count()

The number of marked sub-expressions within the expression.

basic_regex& operator=(const basic_regex& e);

Effects: Returns the result of assign(e.str(), e.getflags()).

basic_regex& operator=(const charT* ptr);

Requires: p shall not be a null pointer.

Effects: Returns the result of assign(ptr).

template <class ST, class SA>
basic_regex& operator=(const basic_string<charT, ST, SA>& p);

Effects: Returns the result of assign(p).

RE.4.3 basic_regex iterators

const_iterator begin() const;

Effects: Returns a starting iterator to a sequence of characters representing the regular expression.

const_iterator end() const;

Effects: Returns termination iterator to a sequence of characters representing the regular expression.

RE.4.4 basic_regex capacity

size_type size() const;

Effects: Returns the length of the sequence of characters representing the regular expression.

size_type max_size() const;

Effects: Returns the maximum length of the sequence of characters representing the regular expression.

bool empty() const;

Effects: Returns true if the object does not contain a valid regular expression, otherwise false.

unsigned mark_count() const;

Effects: Returns the number of marked sub-expressions within the regular expresion.

RE.4.5 basic_regex assign

basic_regex& assign(const basic_regex& that);

Effects: Returns assign(that.str(), that.getflags()).

basic_regex& assign(const charT* ptr, flag_type f = regex_constants::normal);

Effects: Returns assign(string_type(ptr), f).

basic_regex& assign(const charT* first, const charT* last,
                    flag_type f = regex_constants::normal);

Effects: Returns assign(string_type(first, last), f).

template <class string_traits, class A>
basic_regex& assign(const basic_string<charT, string_traits, A>& s,
                    flag_type f = regex_constants::normal);

Throws: bad_expression if s is not a valid regular expression.

Returns: *this.

Effects: Assigns the regular expression contained in the string s, interpreted according the flags specified in f. The postconditions of this function are indicated in Table RE15:

Table RE15-- basic_regex& assign(const basic_string<charT, string_traits, A>& s, flag_type f) effects

Element

Value

empty()

false

size()

s.size()

str()

s

getflags()

f

mark_count()

The number of marked sub-expressions within the expression.

template <class InputIterator>
basic_regex& assign(InputIterator first, InputIterator last,
                    flag_type f = regex_constants::normal);

Requires: The type InputIterator corresponds to the Input Iterator requirements (24.1.1).

Effects: Returns assign(string_type(first, last), f).

RE.4.6 basic_regex constant operations

Allocator get_allocator() const;

Effects: Returns a copy of the Allocator that was passed to the object's constructor.

flag_type getflags() const;

Effects: Returns a copy of the regular expression syntax flags that were passed to the object's constructor, or the last call to assign.

basic_string<charT> str() const;

Effects: Returns a copy of the character sequence passed to the object's constructor, or the last call to assign.

int compare(basic_regex& e)const;

Effects: If getflags() == e.getflags() then returns str().compare(e.str()), otherwise returns getflags() - e.getflags().

RE.4.7 basic_regex locale

locale_type imbue(locale_type l);

Effects: Returns the result of traits_inst.imbue(l) where traits_inst is a (default initialized) instance of the template parameter traits stored within the object. Calls to imbue invalidate any currently contained regular expression.

Postcondition: empty() == true.

locale_type getloc() const;

Effects: Returns the result of traits_inst.getloc() where traits_inst is a (default initialized) instance of the template parameter traits stored within the object.

RE.4.8 basic_regex swap

void swap(basic_regex& e) throw();

Effects: Swaps the contents of the two regular expressions.

Postcondition: *this contains the characters that were in e, e contains the regular expression that was in *this.

Complexity: constant time.

RE.4.9 basic_regex non-member functions

RE.4.9.1 basic_regex non-member comparison operators 
template <class charT, class traits, class Allocator>
bool operator == (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) == 0.

template <class charT, class traits, class Allocator>
bool operator != (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) != 0.

template <class charT, class traits, class Allocator>
bool operator < (const basic_regex<charT, traits, Allocator>& lhs,
                 const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) < 0.

template <class charT, class traits, class Allocator>
bool operator <= (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) <= 0.

template <class charT, class traits, class Allocator>
bool operator >= (const basic_regex<charT, traits, Allocator>& lhs,
                  const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) >= 0.

template <class charT, class traits, class Allocator>
bool operator > (const basic_regex<charT, traits, Allocator>& lhs,
                 const basic_regex<charT, traits, Allocator>& rhs);

Effects: Returns lhs.compare(rhs) > 0.

RE.4.9.2 basic_regex inserter.
template <class charT, class io_traits, class re_traits, class Allocator>
basic_ostream<charT, io_traits>&
   operator << (basic_ostream<charT, io_traits>& os
                const basic_regex<charT, re_traits, Allocator>& e);

Effects: Returns (os << e.str()).

RE.4.9.3 basic_regex non-member swap
template <class charT, class traits, class Allocator>
void swap(basic_regex<charT, traits, Allocator>& lhs,
          basic_regex<charT, traits, Allocator>& rhs);

Effects: calls lhs.swap(rhs).

RE.5 Template class sub_match

Template class sub_match denotes the sequence of characters matched by a particular marked sub-expression.

When the marked sub-expression denoted by an object of type sub_match<> participated in a regular expression match then member matched evaluates to true, and members first and second denote the range of characters [first,second) which formed that match. Otherwise matched is false, and members first and second contained undefined values.

If an object of type sub_match<> represents sub-expression 0 - that is to say the whole match - then member matched is always true, unless a partial match was obtained as a result of the flag match_partial being passed to a regular expression algorithm, in which case member matched is false, and members first and second represent the character range that formed the partial match.

template <class BidirectionalIterator>
class sub_match : public std::pair<BidirectionalIterator, BidirectionalIterator>
{
public:
   typedef typename iterator_traits<BidirectionalIterator>::value_type       value_type;
   typedef typename iterator_traits<BidirectionalIterator>::difference_type  difference_type;
   typedef          BidirectionalIterator                                    iterator;

   bool matched;

   difference_type length()const;
   operator basic_string<value_type>()const;
   basic_string<value_type> str()const;

   int compare(const sub_match& s)const;
   int compare(const basic_string<value_type>& s)const;
   int compare(const value_type* s)const;
};

RE.5.1 sub_match members

Static difference_type length();

Effects: returns (matched ? 0 : distance(first, second)).

operator basic_string<value_type>()const;

Effects: returns (matched ? basic_string<value_type>(first, second) : basic_string<value_type>()).

basic_string<value_type> str()const;

Effects: returns (matched ? basic_string<value_type>(first, second) : basic_string<value_type>()).

int compare(const sub_match& s)const;

Effects: returns str().compare(s.str()).

int compare(const basic_string<value_type>& s)const;

Effects: returns str().compare(s).

int compare(const value_type* s)const;

Effects: returns str().compare(s).

RE.5.2 sub_match non-member operators

template <class BidirectionalIterator>
bool operator == (const sub_match<BidirectionalIterator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);

Effects: returns lhs.compare(rhs) == 0.

template <class BidirectionalIterator>
bool operator != (const sub_match<BidirectionalIterator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);

Effects: returns lhs.compare(rhs) != 0.

template <class BidirectionalIterator>
bool operator < (const sub_match<BidirectionalIterator>& lhs,
                 const sub_match<BidirectionalIterator>& rhs);

Effects: returns lhs.compare(rhs) < 0.

template <class BidirectionalIterator>
bool operator <= (const sub_match<BidirectionalIterator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);

Effects: returns lhs.compare(rhs) <= 0.

template <class BidirectionalIterator>
bool operator >= (const sub_match<BidirectionalIterator>& lhs,
                  const sub_match<BidirectionalIterator>& rhs);

Effects: returns lhs.compare(rhs) >= 0.

template <class BidirectionalIterator>
bool operator > (const sub_match<BidirectionalIterator>& lhs,
                 const sub_match<BidirectionalIterator>& rhs);

Effects: returns lhs.compare(rhs) > 0.

template <class BidirectionalIterator> 
bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs == rhs.str().

template <class BidirectionalIterator> 
bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs != rhs.str().

template <class BidirectionalIterator> 
bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                 const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs < rhs.str().

template <class BidirectionalIterator> 
bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                 const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs > rhs.str().

template <class BidirectionalIterator> 
bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs >= rhs.str().

template <class BidirectionalIterator> 
bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs <= rhs.str().

template <class BidirectionalIterator> 
bool operator == (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 

Effects: returns lhs.str() == rhs.

template <class BidirectionalIterator> 
bool operator != (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 

Effects: returns lhs.str() != rhs.

template <class BidirectionalIterator> 
bool operator < (const sub_match<BidirectionalIterator>& lhs, 
                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 

Effects: returns lhs.str() < rhs.

template <class BidirectionalIterator> 
bool operator > (const sub_match<BidirectionalIterator>& lhs, 
                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 

Effects: returns lhs.str() > rhs.

template <class BidirectionalIterator> 
bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 

Effects: returns lhs.str() >= rhs.

template <class BidirectionalIterator> 
bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 

Effects: returns lhs.str() <= rhs.

template <class BidirectionalIterator> 
bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs == rhs.str().

template <class BidirectionalIterator> 
bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs != rhs.str().

template <class BidirectionalIterator> 
bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                 const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs < rhs.str().

template <class BidirectionalIterator> 
bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                 const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs > rhs.str().

template <class BidirectionalIterator> 
bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs >= rhs.str().

template <class BidirectionalIterator> 
bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
                  const sub_match<BidirectionalIterator>& rhs); 

Effects: returns lhs <= rhs.str().

template <class BidirectionalIterator> 
bool operator == (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 

Effects: returns lhs.str() == rhs.

template <class BidirectionalIterator> 
bool operator != (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 

Effects: returns lhs.str() != rhs.

template <class BidirectionalIterator> 
bool operator < (const sub_match<BidirectionalIterator>& lhs, 
                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 

Effects: returns lhs.str() < rhs.

template <class BidirectionalIterator> 
bool operator > (const sub_match<BidirectionalIterator>& lhs, 
                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 

Effects: returns lhs.str() > rhs.

template <class BidirectionalIterator> 
bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 

Effects: returns lhs.str() >= rhs.

template <class BidirectionalIterator> 
bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 

Effects: returns lhs.str() <= rhs.

template <class charT, class traits, class BidirectionalIterator>
basic_ostream<charT, traits>&
   operator << (basic_ostream<charT, traits>& os
                const sub_match<BidirectionalIterator>& m);

Effects: returns (os << m.str()).

RE.6 Template class match_results

Template class match_results denotes a collection of character sequences representing the result of a regular expression match. Storage for the collection is allocated and freed as necessary by the member functions of class match_results.

The template class match_results conforms to the requirements of a Sequence, as specified in (lib.sequence.reqmts), except that only operations defined for const-qualified Sequences are supported.

template <class BidirectionalIterator,
          class Allocator = allocator<sub_match<BidirectionalIterator> >
class match_results
{ 
public: 
   typedef          sub_match<BidirectionalIterator>                        value_type;
   typedef          const value_type&                                       const_reference;
   typedef          const_reference                                         reference;
   typedef          implementation defined                                  const_iterator;
   typedef          const_iterator                                          iterator;
   typedef typename iterator_traits<BidirectionalIterator>::difference_type difference_type;
   typedef typename Allocator::size_type                                    size_type;
   typedef          Allocator                                               allocator_type;
   typedef typename iterator_traits<BidirectionalIterator>::value_type      char_type;
   typedef          basic_string<char_type>                                 string_type;

   // construct/copy/destroy:
   explicit match_results(const Allocator& a = Allocator());
   match_results(const match_results& m);
   match_results& operator=(const match_results& m); 
   ~match_results();

   // size:
   size_type size() const;
   size_type max_size() const;
   bool empty() const;
   // element access:
   difference_type length(int sub = 0) const;
   difference_type position(unsigned int sub = 0) const;
   string_type str(int sub = 0) const;
   const_reference operator[](int n) const;

   const_reference prefix() const;

   const_reference suffix() const;
   const_iterator begin() const;
   const_iterator end() const;
   // format:
   template <class OutputIterator>
   OutputIterator format(OutputIterator out,
                         const string_type& fmt,
                         match_flag_type flags = format_default) const;
   string_type format(const string_type& fmt,
                      match_flag_type flags = format_default) const;

   allocator_type get_allocator() const;
   void swap(match_results& that);
};

RE.6.1 match_results constructors

In all match_results constructors, a copy of the Allocator argument is used for any memory allocation performed by the constructor or member functions during the lifetime of the object.

match_results(const Allocator& a = Allocator());

Effects: Constructs an object of class match_results. The postconditions of this function are indicated in Table RE16:

Table RE16--match_results(const Allocator&) effects

Element

Value

empty()

true

size()

0

str()

basic_string<charT>()

match_results(const match_results& m);

Effects: Constructs an object of class match_results, as a copy of m.

match_results& operator=(const match_results& m);

Effects: Assigns m to *this. The postconditions of this function are indicated in Table RE17:

Table RE17--match_results(const Allocator&) effects

Element

Value

empty()

m.empty().

size()

m.size().

str(n)

m.str(n) for all integers n < m.size().

prefix()

m.prefix().

suffix()

m.suffix().

(*this)[n]

m[n] for all integers n < m.size().

length(n)

m.length(n) for all integers n < m.size().

position(n)

m.position(n) for all integers n < m.size().

RE.6.2 match_results size

size_type size()const;

Effects: Returns the number of sub_match elements stored in *this.

size_type max_size()const;

Effects: Returns the maximum number of sub_match elements that can be stored in *this.

bool empty()const;

Effects: Returns size() == 0.

RE.6.3 match_results element access

difference_type length(int sub = 0)const;

Effects: Returns (*this)[sub].length().

difference_type position(unsigned int sub = 0)const;

Effects: Returns std::distance(prefix().first, (*this)[sub].first).

string_type str(int sub = 0)const;

Effects: Returns string_type((*this)[sub]).

const_reference operator[](int n) const;

Effects: Returns a reference to the sub_match object representing the character sequence that matched marked sub-expression n. If n == 0 then returns a reference to a sub_match object representing the character sequence that matched the whole regular expression.

const_reference prefix()const;

Effects: Returns a reference to the sub_match object representing the character sequence from the start of the string being matched/searched, to the start of the match found.

const_reference suffix()const;

Effects: Returns a reference to the sub_match object representing the character sequence from the end of the match found to the end of the string being matched/searched.

const_iterator begin()const;

Effects: Returns a starting iterator that enumerates over all the marked sub-expression matches stored in *this.

const_iterator end()const;

Effects: Returns a terminating iterator that enumerates over all the marked sub-expression matches stored in *this.

RE.6.4 match_results reformatting

template <class OutputIterator>
OutputIterator format(OutputIterator out,
                      const string_type& fmt,
                      match_flag_type flags = format_default);

Requires: The type OutputIterator conforms to the Output Iterator requirements (24.1.2).

Effects: Copies the character sequence [fmt.begin(), fmt.end()) to OutputIterator out. For each format specifier or escape sequence in fmt, replace that sequence with either the character(s) it represents, or the sequence of characters within *this to which it refers. The bitmasks specified in flags determines what format specifiers or escape sequences are recognized, by default this is the format used by ECMA-262, ECMAScript Language Specification, Chapter 15 part 5.4.11 String.prototype.replace.

Returns: out.

string_type format(const string_type& fmt,
                   match_flag_type flags = format_default);

Effects: Returns a copy of the string fmt. For each format specifier or escape sequence in fmt, replace that sequence with either the character(s) it represents, or the sequence of characters within *this to which it refers. The bitmasks specified in flags determines what format specifiers or escape sequences are recognized, by default this is the format used by ECMA-262, ECMAScript Language Specification, Chapter 15 part 5.4.11 String.prototype.replace.

allocator_type get_allocator()const;

Effects: Returns a copy of the Allocator that was passed to the object's constructor.

void swap(match_results& that);

Effects: Swaps the contents of the two sequences.

Postcondition: *this contains the sequence of matched sub-expressions that were in that, that contains the sequence of matched sub-expressions that were in *this.

Complexity: constant time.

RE.7 Regular expression algorithms

RE.7.1 regex_match

template <class BidirectionalIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(BidirectionalIterator first, BidirectionalIterator last,
                 match_results<BidirectionalIterator, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

Requires: Type BidirectionalIterator meets the requirements of a Bidirectional Iterator (24.1.4).

Effects: Determines whether there is an exact match between the regular expression e, and all of the character sequence [first, last), parameter flags is used to control how the expression is matched against the character sequence. Returns true if such a match exists, false otherwise.

Postconditions: If the function returns false, then the effect on parameter m is undefined, otherwise the effects on parameter m are given in table RE18:

Table RE18--regex_match(BidirectionalIterator first, BidirectionalIterator last, match_results<BidirectionalIterator, Allocator>& m, const reg_expression<charT, traits, Allocator2>& e, match_flag_type flags) effects

Element

Value

m.size()

e.mark_count()

m.empty()

false

m.prefix().first

first

m.prefix().last

first

m.prefix().matched

false

m.suffix().first

last

m.suffix().last

last

m.suffix().matched

false

m[0].first

first

m[0].second

last

m[0].matched

true if a full match was found, and false if it was a partial match (found as a result of the match_partial flag being set).

m[n].first

For all integers n < m.size(), the start of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.

m[n].second

For all integers n < m.size(), the end of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.

m[n].matched

For all integers n < m.size(), true if sub-expression n participated in the match, false otherwise.

template <class BidirectionalIterator, class charT, class traits, class Allocator2>
bool regex_match(BidirectionalIterator first, BidirectionalIterator last,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

Effects: Behaves "as if" by constructing an instance of match_results<BidirectionalIterator> what, and then returning the result of regex_match(first, last, what, e, flags).

template <class charT, class Allocator, class traits, class Allocator2>
bool regex_match(const charT* str, match_results<const charT*, Allocator>& m,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

Effects: Returns the result of regex_match(str, str + char_traits<charT>::length(str), m, e, flags).

template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m, 
                 const reg_expression<charT, traits, Allocator2>& e, 
                 match_flag_type flags = match_default);

Effects: Returns the result of regex_match(s.begin(), s.end(), m, e, flags).

template <class charT, class traits, class Allocator2>
bool regex_match(const charT* str,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

Effects: Returns the result of regex_match(str, str + char_traits<charT>::length(str), e, flags).

template <class ST, class SA, class charT, class traits, class Allocator2>
bool regex_match(const basic_string<charT, ST, SA>& s,
                 const reg_expression<charT, traits, Allocator2>& e,
                 match_flag_type flags = match_default);

Effects: Returns the result of regex_match(s.begin(), s.end(), e, flags).

RE.7.2 regex_search

template <class BidirectionalIterator, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
                  match_results<BidirectionalIterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

Requires: Type BidirectionalIterator meets the requirements of a Bidirectional Iterator (24.1.4).

Effects: Determines whether there is some sub-sequence within [first,last) that matches the regular expression e, parameter flags is used to control how the expression is matched against the character sequence. Returns true if such a sequence exists, false otherwise.

Postconditions: If the function returns false, then the effect on parameter m is undefined, otherwise the effects on parameter m are given in table RE19:

Table RE19--regex_search(BidirectionalIterator first, BidirectionalIterator last, match_results<BidirectionalIterator, Allocator>& m, const reg_expression<charT, traits, Allocator2>& e, match_flag_type flags) effects

Element

Value

m.size()

e.mark_count()

m.empty()

false

m.prefix().first

first

m.prefix().last

m[0].first

m.prefix().matched

m.prefix().first != m.prefix().second

m.suffix().first

m[0].second

m.suffix().last

last

m.suffix().matched

m.suffix().first != m.suffix().second

m[0].first

The start of the sequence of characters that matched the regular expression

m[0].second

The end of the sequence of characters that matched the regular expression

m[0].matched

true if a full match was found, and false if it was a partial match (found as a result of the match_partial flag being set).

m[n].first

For all integers n < m.size(), the start of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.

m[n].second

For all integers n < m.size(), the end of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.

m[n].matched

For all integers n < m.size(), true if sub-expression n participated in the match, false otherwise.

template <class charT, class Allocator, class traits, class Allocator2>
bool regex_search(const charT* str, match_results<const charT*, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

Effects: Returns the result of regex_search(str, str + char_traits<charT>::length(str), m, e, flags).

template <class ST, class SA, class Allocator, class charT,
          class traits, class Allocator2>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m,
                  const reg_expression<charT, traits, Allocator2>& e,
                  match_flag_type flags = match_default);

Effects: Returns the result of regex_search(s.begin(), s.end(), m, e, flags).

template <class iterator, class Allocator, class charT,
          class traits>
bool regex_search(iterator first, iterator last,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);

Effects: Behaves "as if" by constructing an instance of match_results<BidirectionalIterator> what, and then returning the result of regex_search(first, last, what, e, flags).

template <class charT, class Allocator, class traits>
bool regex_search(const charT* str
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);

Effects: Returns the result of regex_search(str, str + char_traits<charT>::length(str), e, flags).

template <class ST, class SA, class Allocator, class charT,
          class traits>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  const reg_expression<charT, traits, Allocator>& e,
                  match_flag_type flags = match_default);

Effects: Returns the result of regex_search(s.begin(), s.end(), e, flags).

RE.7.3 regex_replace

template <class OutputIterator, class BidirectionalIterator, class traits,
          class Allocator, class charT>
OutputIterator regex_replace(OutputIterator out,
                           BidirectionalIterator first,
                           BidirectionalIterator last,
                           const reg_expression<charT, traits, Allocator>& e,
                           const basic_string<charT>& fmt,
                           match_flag_type flags = match_default);

Effects: Finds all the non-overlapping matches m of type match_results<BidirectionalIterator> that occur within the sequence [first, last). If no such matches are found and !(flags & format_no_copy) then calls std::copy(first, last, out). Otherwise, for each match found, if !(flags & format_no_copy) calls std::copy(m.prefix().first, m.prefix().last, out), and then calls m.format(out, fmt, flags). Finally if !(flags & format_no_copy) calls std::copy(last_m.suffix().first, last_m,suffix().last, out) where last_m is a copy of the last match found. If flags & format_first_only is non-zero then only the first match found is replaced.

Returns: out.

template <class traits, class Allocator, class charT>
basic_string<charT> regex_replace(const basic_string<charT>& s,
                            const reg_expression<charT, traits, Allocator>& e,
                            const basic_string<charT>& fmt,
                            match_flag_type flags = match_default);

Effects: Constructs an object basic_string<charT> result, calls regex_replace(back_inserter(result), s.begin(), s.end(), e, fmt, flags), and then returns result.

RE.8 Regular expression Iterators

RE.8.1 Template class regex_iterator

The class template regex_iterator is an iterator adapter; that is to say it represents a new view of an existing iterator sequence, by enumerating all the occurrences of a regular expression within that sequence. regex_iterator finds (using regex_search) successive regular expression matches within the sequence from which it was constructed. After it is constructed, and every time operator++ is used, the iterator finds and stores a value of match_results<BidirectionalIterator>. If the end of the sequence is reached (regex_search returns false), the iterator becomes equal to the end-of-sequence iterator value. The constructor with no arguments regex_iterator() always constructs an end of sequence iterator object, which is the only legitimate iterator to be used for the end condition. The result of operator* on an end of sequence is not defined. For any other iterator value a const match_results<BidirectionalIterator>& is returned. The result of operator-> on an end of sequence is not defined. For any other iterator value a const match_results<BidirectionalIterator>* is returned. It is impossible to store things into regex_iterator's. Two end-of-sequence iterators are always equal. An end-of-sequence iterator is not equal to a non-end-of-sequence iterator. Two non-end-of-sequence iterators are equal when they are constructed from the same arguments.

namespace std{

template <class BidirectionalIterator, 
          class charT = iterator_traits<BidirectionalIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_iterator 
{
public:
   typedef          basic_regex<charT, traits, Allocator>                   regex_type;
   typedef          match_results<BidirectionalIterator>                    value_type;
   typedef typename iterator_traits<BidirectionalIterator>::difference_type difference_type;
   typedef typename iterator_traits<BidirectionalIterator>::pointer         pointer;
   typedef typename iterator_traits<BidirectionalIterator>::reference       reference;
   typedef          std::forward_iterator_tag                               iterator_category;
   
   regex_iterator();
   regex_iterator(BidirectionalIterator a, BidirectionalIterator b, 
                  const regex_type& re, 
                  match_flag_type m = match_default);
   regex_iterator(const regex_iterator&);
   regex_iterator& operator=(const regex_iterator&);
   bool operator==(const regex_iterator&);
   bool operator!=(const regex_iterator&);
   const value_type& operator*();
   const value_type* operator->();
   regex_iterator& operator++();
   regex_iterator operator++(int);
private:
   match_results<BidirectionalIterator> what;  // for exposition only
   BidirectionalIterator                end;   // for exposition only
   const regex_type*                    pre;   // for exposition only
   match_flag_type                      flags; // for exposition only
};

} // namespace std
RE.8.1.1 regex_iterator constructors
regex_iterator();

Effects: default constructs the members what, end and flags, and sets the member pre equal to the null-pointer constant.

regex_iterator(BidirectionalIterator a, BidirectionalIterator b, 
               const regex_type& re, 
               match_flag_type m = match_default);

Effects: default constructs all data members, and then calls regex_search(a, b, what, re, m), if this returns false then sets *this equal to the end of sequence iterator, otherwise sets end equal to b, pre equal to &re, and flags equal to m.

regex_iterator(const regex_iterator& that);

Effects: constructs all data members as a copy of that.

Postconditions: *this == that.

regex_iterator& operator=(const regex_iterator&);

Effects: sets all data members equal to those in that.

Postconditions: *this == that.

bool operator==(const regex_iterator& that);

Effects: if (pre == 0) && (that.pre == 0) then returns true, otherwise returns the result of (pre == that.pre) && (end == that.end) && (flags == that.flags) && (what[0].first == that.what[0].first) && (what[0].second == that.what[0].second).

bool operator!=(const regex_iterator&);

Effects: returns !(*this == that).

RE.8.1.1 regex_iterator dereference
const value_type& operator*();

Effects: returns what.

const value_type* operator->();

Effects: returns &what.

regex_iterator& operator++();

Effects: if what.prefix().first != what[0].second and if the element match_prev_avail is not set in flags then sets it. Then calls regex_search(what[0].second, end, what, *pre, ((what[0].first == what[0].second) ? flags | match_not_null : flags)), and if this returns false then sets *this equal to the end of sequence iterator.

Returns: *this.

regex_iterator operator++(int);

Effects: constructs a copy result of *this, then calls ++(*this).

Returns: result.

RE.8.2 Template class regex_token_iterator

The template class regex_token_iterator is an iterator adapter; that is to say it represents a new view of an existing iterator sequence, by enumerating all the occurrences of a regular expression within that sequence, and presenting one or more new strings for each match found. Each position enumerated by the iterator is a string that represents what matched a particular sub-expression within the regular expression. When class regex_token_iterator is used to enumerate a single sub-expression with index -1, then the iterator performs field splitting: that is to say it enumerates one string for each section of the character container sequence that does not match the regular expression specified.

After it is constructed, the iterator finds and stores a value of match_results<BidirectionalIterator> what (by calling regex_search) and sets the internal count N to zero. Every time operator++ is used the count N is incremented, if N exceeds or equals this->subs.size(), then the iterator finds and stores the next value of match_results<BidirectionalIterator> and sets count N to zero.

If the end of sequence is reached (regex_search returns false), the iterator becomes equal to the end-of-sequence iterator value, unless the sub-expression being enumerated has index -1: In which case the iterator enumerates one last string that contains all the characters from the end of the last regular expression match to the end of the input sequence being enumerated, provided that this would not be an empty string.

The constructor with no arguments regex_iterator() always constructs an end of sequence iterator object, which is the only legitimate iterator to be used for the end condition. The result of operator* on an end of sequence is not defined. For any other iterator value a const basic_string<charT>& is returned (obtained by calling this->what[subs[N]]). The result of operator-> on an end of sequence is not defined. For any other iterator value a const basic_string<charT>* is returned. It is impossible to store things into regex_iterator's. Two end-of-sequence iterators are always equal. An end-of-sequence iterator is not equal to a non-end-of-sequence iterator. Two non-end-of-sequence iterators are equal when they are constructed from the same arguments.

namespace std{

template <class BidirectionalIterator, 
          class charT = iterator_traits<BidirectionalIterator>::value_type,
          class traits = regex_traits<charT>,
          class Allocator = allocator<charT> >
class regex_token_iterator 
{
public:
   typedef          basic_regex<charT, traits, Allocator>                   regex_type;
   typedef          basic_string<charT>                                     value_type;
   typedef typename iterator_traits<BidirectionalIterator>::difference_type difference_type;
   typedef typename iterator_traits<BidirectionalIterator>::pointer         pointer;
   typedef typename iterator_traits<BidirectionalIterator>::reference       reference;
   typedef          std::forward_iterator_tag                               iterator_category;
   
   regex_token_iterator();
   regex_token_iterator(BidirectionalIterator a, BidirectionalIterator b, const regex_type& re, 
                        int submatch = 0, match_flag_type m = match_default);
   regex_token_iterator(BidirectionalIterator a, BidirectionalIterator b, const regex_type& re, 
                        const std::vector<int>& submatches, match_flag_type m = match_default);
   template <std::size_t N>
   regex_token_iterator(BidirectionalIterator a, BidirectionalIterator b, const regex_type& re, 
                        const int (&submatches)[N], match_flag_type m = match_default);
   regex_token_iterator(const regex_token_iterator&);
   regex_token_iterator& operator=(const regex_token_iterator&);
   bool operator==(const regex_token_iterator&);
   bool operator!=(const regex_token_iterator&);
   const value_type& operator*();
   const value_type* operator->();
   regex_token_iterator& operator++();
   regex_token_iterator operator++(int);
private:
   match_results<BidirectionalIterator> what;   // for exposition only
   BidirectionalIterator                end;    // for exposition only
   const regex_type*                    pre;    // for exposition only
   match_flag_type                      flags;  // for exposition only
   basic_string<charT>                  result; // for exposition only
   std::size_t                          N;      // for exposition only
   std::vector<int>                     subs;   // for exposition only
};
} // namespace std
RE.8.2.1 regex_iterator constructors
regex_token_iterator();

Effects: constructs an end of sequence iterator, by default constructing all data members.

Postconditions: pre == 0.

regex_token_iterator(BidirectionalIterator a, BidirectionalIterator b, const regex_type& re, 
                     int submatch = 0, match_flag_type m = match_default);

Preconditions: !re.empty().

Effects: default constructs all data members and pushes the value submatch onto subs. Then if regex_search(a, b, what, re, m) == true sets N equal to zero, flags equal to m, pre equal to &re, end equal to b, and sets result equal to ((submatch == -1) ? value_type(what.prefix().str()) : value_type(what[submatch].str())). Otherwise if the call to regex_search returns false, then if (submatch == -1) && (a != b) sets result equal to value_type(a, b) and N equal to -1. Otherwise sets *this equal to the end of sequence iterator.

regex_token_iterator(BidirectionalIterator a, BidirectionalIterator b, const regex_type& re, 
                     const std::vector<int>& submatches, match_flag_type m = match_default);

Preconditions: submatches.size() && !re.empty().

Effects: copy constructs member subs from submatches, and default constructs the other data members. Then if regex_search(a, b, what, re, m) == true sets N equal to zero, flags equal to m, pre equal to &re, end equal to b, and sets result equal to ((submatch[N] == -1) ? value_type(what.prefix().str()) : value_type(what[submatch[N]].str())). Otherwise if the call to regex_search returns false, then if (submatch[0] == -1) && (a != b) sets result equal to value_type(a, b) and N equal to -1. Otherwise sets *this equal to the end of sequence iterator.

template <std::size_t N>
regex_token_iterator(BidirectionalIterator a, BidirectionalIterator b, const regex_type& re, 
                     const int (&submatches)[R], match_flag_type m = match_default);

Preconditions: !re.empty().

Effects: copy constructs member subs from the iterator sequence [submatches, submatches+R), and default constructs the other data members. Then if regex_search(a, b, what, re, m) == true sets N equal to zero, flags equal to m, pre equal to &re, end equal to b, and sets result equal to ((subs[N] == -1) ? value_type(what.prefix().str()) : value_type(what[subs[N]].str())). Otherwise if the call to regex_search returns false, then if (subs[0] == -1) && (a != b) sets result equal to value_type(a, b) and N equal to -1. Otherwise sets *this equal to the end of sequence iterator.

regex_token_iterator(const regex_token_iterator& that);

Effects: constructs all data members as a copy of that.

Postconditions: *this == that.

regex_token_iterator& operator=(const regex_token_iterator& that);

Effects: sets all data members equal to those in that.

Postconditions: *this == that.

bool operator==(const regex_token_iterator&);

Effects: if (pre == 0) && (that.pre == 0) then returns true, otherwise returns the result of (pre == that.pre) && (end == that.end) && (flags == that.flags) && (N == that.N) && (what[0].first == that.what[0].first) && (what[0].second == that.what[0].second).

bool operator!=(const regex_token_iterator&);

Effects: returns !(*this == that).

RE.8.2.1 regex_token_iterator dereference
const value_type& operator*();

Effects: returns result.

const value_type* operator->();

Effects: returns &result.

regex_token_iterator& operator++();

Effects: if N == -1 then sets *this equal to the end of sequence iterator. Otherwise if N+1 < subs.size(), then increments N and sets result equal to ((subs[N] == -1) ? value_type(what.prefix().str()) : value_type(what[subs[N]].str())).

Otherwise if what.prefix().first != what[0].second and if the element match_prev_avail is not set in flags then sets it. Then if regex_search(what[0].second, end, what, *pre, ((what[0].first == what[0].second) ? flags | match_not_null : flags)) == true sets N equal to zero, and sets result equal to ((subs[N] == -1) ? value_type(what.prefix().str()) : value_type(what[subs[N]].str())).

Otherwise if the call to regex_search returns false, then let last_end be the value of what[0].second prior to the call to regex_search. Then if last_end != end and subs[0] == -1 sets N equal to -1 and sets result equal to value_type(last_end, end). Otherwise sets *this equal to the end of sequence iterator.

Returns: *this.

regex_token_iterator& operator++(int);

Effects: constructs a copy result of *this, then calls ++(*this).

Returns: result.


References

  1. IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX). See also, The Single UNIX Specification, Version 3, http://www.unix-systems.org/version3/.
  2. The Perl Programming Language, http://www.perl.org/.
  3. Microsoft Corporation, Visual C#, http://msdn.microsoft.com/vcsharp/.
  4. Standard ECMA-262 3rd Edition - December 1999, ECMAScript, http://www.ecma.ch/.
  5. GNU, The GNU C Library, http://www.gnu.org/software/libc/.
  6. Tom Lord, The Hackerlab C Library, http://regexps.com/.
  7. Henry Spencer, BSD regular expression library, ftp://ftp.zoo.toronto.edu/pub/ and http://arglist.com/regex/.
  8. Philip Hazel, Perl Compatible Regular Expressions, http://www.pcre.org/.
  9. Hugo Etchegoyen, A C++ wrapper around pcre-3.2, ftp://ftp.cus.cam.ac.uk/pub/software/programs/pcre/.
  10. Michael Tesch, A C++ Wrapper for the "Perl Compatible Regular Expressions" library, ftp://ftp.cus.cam.ac.uk/pub/software/programs/pcre/.
  11. John Maddock, Boost regular expression library, http://www.boost.org/.
  12. Eric Niebler, The GRETA Regular Expression Template Archive, http://research.microsoft.com/downloads/.
  13. Mark Jason Dominus, Perl Regular Expression Matching is NP-Complete, http://perl.plover.com/NPC/.
  14. Charles L. A. Clarke, Gordon V. Cormack, On the use of Regular Expressions for Searching Text, ACM Trans. Programming Languages and Systems 19, (May 1997), University of Waterloo, Waterloo, Canada, Technical Report CS&shy;95&shy;07, http://plg.uwaterloo.ca/~gvcormac/PAPERS.html.
  15. Larry Wall, Apocalypse 5, http://www.perl.com/pub/a/2002/06/04/apo5.html.
  16. Robert Sedgewick, Algorithms in C++, Addison Wesley.
  17. Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest, Introduction to Algorithms, MIT Press.
  18. Gonzalo Navarro and Mathieu Raffinot, Fast Regular Expression Search, In Proceedings of WAE'99, LNCS 1668. Pages 198-212, 1999, http://www.dcc.uchile.cl/~gnavarro/publ.html.
  19. Sun Wu and Udi Manber, Agrep - A Fast Approximate Pattern Matching Tool, Usenix Winter 1992 Technical Conference, San Francisco (January 1992), pp. 153-162, http://www.tgries.de/agrep/doc/agrep2ps.zip.
  20. Ricardo A. Baeza-Yates and Gaston H. Gonnet, A New Approach to Text Searching, Communications of the ACM, 35:74--82, Oct 1992, http://www.dcc.uchile.cl/~rbaeza/cv/journals.html.
  21. An experimental Boost regex version uses the same stack recursive algorithm that GRETA does by default, and has proven to be just as fast in practice: http://ourworld.compuserve.com/homepages/john_maddock/proposals/exregex.htm.
  22. Gonzalo Navarro. "NR-grep: a Fast and Flexible Pattern Matching Tool". Software Practice and Experience (SPE) 31:1265-1312, 2001. http://www.dcc.uchile.cl/~gnavarro/publ.html.
  23. Partial matches are described in more detail in the Boost documentation at: http://www.boost.org/libs/regex/template_class_ref.htm#partial_matches.

Acknowledgements

I am particularly grateful to GRETA's author Eric Niebler for many long and detailed discussions, to Edward Diener for much helpful feedback over many years, to Bjorn Karlsson, Jeremy Siek, Richard Peters, and Beman Dawes for many helpful comments on a draft of this proposal, and of course to all the Boost-list members and regex users for their continued help with this project.