Regular Expressions: Internationalization and Customization

Document number:   N1500 = 03-0083
Date:   September 22, 2003
Project:   Programming Language C++
Reference:   ISO/IEC IS 14882:1998(E)
Reply to:   Pete Becker
  Dinkumware, Ltd.
  petebecker@acm.org


Give me but one firm spot on which to stand, and I will move the earth.
-- Archimedes on the lever, quoted in The Oxford Dictionary of Quotations, Second Edition, Oxford University Press, London, 1953, p. 14

Regular expressions are a powerful tool that can be used to search text for occurrences of patterns of characters. If the formal language used to define the search pattern is too hard to understand or too hard to apply, however, the user has no firm spot on which to stand, and it will be the fulcrum, not the earth, that moves. Designing this formal language requires recognizing that flexibility and understandability can easily become antagonists.

The regular expression proposal (N1429) is based on the regular expression support in ECMAScript and in POSIX. It offers richer internationalization and customization capabilities than either of these standards. This paper reviews the internationalization support in the two underlying standards as well as the support in the regular expression proposal, and recommends that we be cautious in enhancing the capabilities of regular expressions beyond those provided by these two standards.

Basics

When using regular expressions, character sequences occur in two distinct contexts: the pattern text which defines the regular expression itself and the target text which is the text to be searched for occurrences of the pattern. A regular expression implementation typically compiles the pattern text into an intermediate representation which is used by the search algorithms to find matches in the target text.

A regular expression pattern uses a set of special characters (e.g. the familiar ".*") to invoke special matching rules. All characters that are not special characters are ordinary characters. Ordinary characters in the pattern text match characters in the target text only if the two characters have exactly the same bit pattern. Other than in a POSIX bracket expression there are no special considerations for variable width characters in the pattern text. Consequently, a multi-character sequence in the pattern text matches the target text only if the same sequence of characters occurs in the target text.

Internationalization in POSIX

POSIX regular expressions support more flexible matching through the use of bracket expressions that refer to named classes. A named class can be one of three types: a collating symbol, an equivalence class, or a character class.

A collating symbol, of the form [.ce.], matches a sequence of characters in the target text only if that sequence of characters is the same collating element. Thus, in a locale which defines the sequence "ch" as a collating element, the bracket expression [[.ch.]] matches the entire text sequence "ch", while the ordinary bracket expression [ch] matches the first character of that text sequence.

An equivalence class, of the form [=cl=], matches a sequence of characters in the target text only if that sequence of characters is a collating element in the same equivalence class as cl according to the current locale. Thus, the equivalence class [=a=] will typically match various accented forms of the letter 'a' in a locale that recognizes accented letters.

A character class, of the form [:name:], matches a character in the target text only if that character is in the set of characters named by name. Character class names include, but are not limited to, d, w, s, alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, and xdigit. The contents of each of these named classes depends on the current locale.

Sets of Characters in ECMAScript

ECMAScript does not have explicit internationalization support. For the most part it's not needed, because ECMAScript traffics only in Unicode characters. It also does not have POSIX-style bracket expressions. It does, however, have escape sequences that define several character classes.

The digit class, of the form "\d", matches any character in the set of ASCII characters [0-9].

The word class, of the form "\w", matches any character in the set of ASCII characters [a-zA-Z0-9_].

The space class, of the form "\s", matches any of the characters <TAB>, <VT>, <FF>, <SP>, <NBSP>, <USP>, <LF>, <CR>, <LS>, <PS>.

A word boundary assert, of the form "\b", tests whether the current position in the text is at a word boundary. A word boundary is a transition from a character that is not in the word class to a character that is in the word class or vice versa.

Each of these four escape sequences has a corresponding negation, represented by a backslash followed by the uppercase letter corresponding to the desired escape. Thus not digit is "\D", not word is "\W", not space is "\S", and not word boundary is "\B".

Internationalization in the Regex Proposal

When the user compiles a regular expression, one of the arguments specifies whether to treat the regular expression as POSIX or ECMAScript.

When the user specifies ECMAScript the regular expression can use both POSIX named classes and ECMAScript escape sequences. The three classes named [:d:], [:w:], and [:s:] are used to define the contents of the character sets named by the ECMAScript escape sequences:

ECMAScriptPOSIX
escapenamed class
\d[[:d:]]
\w[[:w:]]
\s[[:s:]]
\D[^[:d:]]
\W[^[:w:]]
\S[^[:s:]]

The ECMAScript escape sequences \b and \B are defined, as in ECMAScript, in terms of transitions between characters in the set defined by \w and by \W.

This support for internationalization is provided through member functions of a regex_traits object.

Collating symbols are supported by the member function lookup_collatename, which determines whether a character sequence is a valid collation element and, if so, converts that sequence of characters into a canonical representation. During a text search, the canonical representation for the collation element in the pattern will be compared to the canonical representation for the current text element to determine whether the match succeeds.

Equivalence classes are supported by the member function transform_primary, which, like strxfrm, converts a sequence of characters into a canonical representation. During a text search, the canonical representation for the collation element in the pattern will be compared to the canonical representation for the current text element to determine whether the match succeeds.

Character classes are supported by the member function lookup_classname, which converts the name of a class to a numeric value that can be passed, along with the current character, to is_class, which returns true if the character is in the class.

Customization in the Regex Proposal

The regular expressions proposal also provides for customizing the rule for matching an ordinary character in the pattern text and a character in the target text, as well as for remapping the special characters used in writing regular expressions. Just as for internationalization, these customizations are provided through member functions of a regex_traits object.

Character matching is handled by the member function translate. It takes two arguments, a character and a boolean flag, and returns a character. The boolean flag indicates whether the translation should be case sensitive. Two characters match if the characters returned by calling translate with each of the two characters are equal.

Remapping special characters is handled by the member functions syntax_type and escape_syntax_type. They are called only when the pattern text is being compiled, and they return enumerated types that tell the compiler what a character means. So, for example, to recognize that any character can match the pattern at the current position, the compiler code is if (traits_inst.syntax_type(ch) == syntax_dot).

Discussion of Enhancements

[:d:], [:w:], [:s:] added to POSIX

This enhancement is a conforming extension to the POSIX regular expression specification.

ECMAScript escape sequences are locale sensitive

ECMAScript does not support locales. That's a reasonable (although a bit limiting) choice for a regular expression specification that uses only Unicode. Since the C++ regular expression proposal supports byte-sized character types, locales are necessary.

Named classes added to ECMAScript

This is technically a non-conforming change to ECMAScript. For example, it quietly changes the meaning of the regular expression "[[:d:]]. Under ECMAScript, that expression matches the full text "[]" -- the bracket expression (which ends at the first ']') matches any of the characters '[', ':', 'd', and the final ']' in the pattern matches the final ']' in the text. With this enhancement, the match fails. Of course, that regular expression is a rather peculiar way to try to match those three characters The usual way to write that match would be "[[:d]", which becomes invalid with the enhancement. In practice this shouldn't be a significant problem.

However, arbitrary named character classes for large character sets can be expensive to use. For example, testing whether a Unicode character is a letter requires large tables and several lookups. This is probably why ECMAScript's escape sequences are defined only for specific sets of ASCII code points. This keeps the lookup tables small. Java's regex package also has the usual POSIX named character classes, but they, too, only hold ASCII code points, not all of the possible Unicode characters. The Java package also provides named classes for the various Unicode character classifications, which do need the large tables and multiple lookups.

Since the regular expression proposal deals with arbitrary character types, it isn't feasible to restrict named character classes to ASCII values (which might not be the same characters under some locales). So the provision for unrestricted named classes in this proposal is reasonable.

Matching of ordinary characters is customizable

Case insensitive matches need to use a locale, so calling translate is a reasonable requirement for such matches. The broader requirement that every character match be done with translate, however, isn't so clearly useful, and can impose a significant performance cost. For case sensitive comparisons, both ECMAScript and POSIX require that the character values be the same. This, in turn, means that comparing long sequences that contain only text can be done with memcmp:

    if (memcmp(cur, pat, nchrs))
        // match failed

Using translate in the comparison requires an explicit loop:

    for (int i = 0; i < nchrs; ++i)
        if (traits_inst.translate(cur[i], false) != traits_inst.translate(pat[i], false))
            // match failed

I've suggested elsewhere that we change the interface to provide two versions of translate, one for case sensitive matches and one for case insensitive matches. With that change, the default implementation for case sensitive matches becomes simple return ch;, which the compiler can easily inline. In effect, the previous loop becomes:

    for (int i = 0; i < nchrs; ++i)
        if (cur[i] != pat[i])
            // match failed

While that's a significant improvement over the previous version, it's not as good as calling memcmp. A compiler might recognize this pattern and generate fast code, but that's not something users can rely on. And, of course, a non-trivial version of translate would be far slower.

The proposal says that translate is needed to support various locale-specific forms of canonical or compatibility equivalence. That's what equivalence classes do. Equivalence classes are a clear signal in the pattern text that a potentially expensive comparison will be made. With a user-defined translate there is no such signal.

Special characters can be remapped

The proposal says that remapping special characters allows users to "use Han-ideographs rather that Latin punctuation symbols in place of the usual ? * + regular expression operators." That's certainly true, but it's not an obvious benefit. The problem is that users can't look at a regular expression and know what it means. As an extreme example, what is the result of matching the pattern "[abcd](efg)" against the target text "abcdefg"? With the usual regular expression operators, the match is "defg", with a subexpression matching "efg". But with a simple change to the function syntax_type, the match would be "abcde", with a subexpression matching "abcd".

A regular expression is a statement in a programming language. Just as we do not support remapping C++ operators, we should not support remapping regular expression operators.

Recommendations

Under the principle that you don't pay for it if you don't use it, character equality should be based on code points, without an intervening function call, unless the user explicitly asks for slower but more flexible comparisons at the time the regular expression is compiled.

Under the princpile that code should say what it means, special characters should not be remapped.