Simplifying Interfaces in basic_regex

Document number:   N1499 = 03-0082
Date:   September 22, 2003
Project:   Programming Language C++
Reference:   ISO/IEC IS 14882:1998(E)
Reply to:   Pete Becker
  Dinkumware, Ltd.
  petebecker@acm.org


basic_regex Should Not Keep a Copy of its Initializer

The basic_regex template has a member function str which returns a string object that holds the text used to initialize the basic_regex object. It also provides a container-like interface to this text through the member functions begin and end, which return const_iterator objects that allow inspection of the initializer text. While it might occasionally be useful to look at the initializer string, we ought to apply the rule that you don't pay for it if you don't use it. Just as fstream objects don't carry around the file name that they were opened with, basic_regex objects should not carry around their initializer text. If someone needs to keep track of that text they can write a class that holds the text and the basic_regex object.

Recommended changes: remove the member functions str, begin, and end.
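
The wrapper class mentioned above might look something like the following. This is only a sketch: the name basic_regex_with_source and its members are hypothetical, and it is written against std::basic_regex from the <regex> header as later standardized, not against the proposal text under discussion.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical wrapper: keeps the pattern text next to the compiled regex,
// so only code that needs the text pays for storing it.
template <class charT>
class basic_regex_with_source {
public:
    using string_type = std::basic_string<charT>;

    explicit basic_regex_with_source(string_type text)
        : text_(std::move(text)), re_(text_) {}

    // The text the regex was built from.
    const string_type& str() const { return text_; }

    // The compiled regular expression.
    const std::basic_regex<charT>& regex() const { return re_; }

private:
    string_type text_;            // declared first, so initialized before re_
    std::basic_regex<charT> re_;  // compiled from text_
};

using regex_with_source = basic_regex_with_source<char>;
```

A caller would construct regex_with_source r("a+b"), pass r.regex() to the matching algorithms, and call r.str() only where the original text is needed.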

basic_regex Should Not Have an Allocator

The basic_regex template takes an argument that defines a type for an allocator object. The template also has several member typedefs and one member function to provide information about the allocator type and the allocator object. The stated rationale is that a basic_regex object "is in effect both a container of characters, and a container of states, as such an allocator parameter is appropriate." Calling it a container doesn't make it one. The allocator in basic_regex is not very useful, and it unduly complicates the implementation.

The cost of using an allocator is high. Every type that the basic_regex object uses internally must have its own allocator type and its own allocator object. A node-based implementation might have a dozen or more node types, requiring a dozen or more allocator objects. There are three ways to provide them. Allocator objects can be created as local objects when needed, which effectively precludes allocators with internal state; they can be ordinary members of the basic_regex object, inflating its size; or they can be implemented as a chain of base classes (to take advantage of the zero-size base optimization), with a high cost in readability and maintainability. None of these options is attractive.
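
The size trade-off between the member and base-class options can be sketched as follows. The types node, regex_impl_member, and regex_impl_base are hypothetical stand-ins, not part of any actual basic_regex implementation, and the sizes noted in the comments are what mainstream compilers typically produce for an empty std::allocator, not something the language guarantees in every case.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical node type standing in for one of the internal state types.
struct node {
    int kind;
    node* next;
};

using node_alloc = std::allocator<node>;

// Allocator held as an ordinary member: even an empty allocator object
// occupies at least one byte, and padding usually rounds that up.
struct regex_impl_member {
    node_alloc alloc;
    node* root;
};

// Allocator held as a base class: an empty base can share its address
// with the first data member, so it typically adds no size.
struct regex_impl_base : node_alloc {
    node* root;
};

constexpr std::size_t size_as_member = sizeof(regex_impl_member);
constexpr std::size_t size_as_base = sizeof(regex_impl_base);
```

With a dozen node types, the member form repeats this overhead a dozen times, while the base-class form needs a dozen base classes threaded through the implementation.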

Further, it's not at all clear how a user can determine that a substitute allocator is appropriate or what characteristics such an allocator should have. The STL containers have clearly spelled out requirements for their memory usage; basic_regex objects have no such requirements (nor should they). The implementor of the basic_regex template knows best what its memory requirements are.

Recommended changes: remove the Allocator argument from basic_regex and remove the members reference, const_reference, difference_type, size_type, allocator_type, get_allocator, and max_size.

The Interface to regex_traits Should Use Iterators, Not Strings

The member functions of the regex_traits template support customization and internationalization for regular expressions. Of these, the member functions transform, transform_primary, lookup_collatename, and lookup_classname take a string as input.

This interface is inherently inefficient -- it requires creating a string object from a sequence in order to pass that string to the function. Further, in the case of transform, the function typically extracts iterators from the string object. Passing the text as a pair of iterators avoids introducing unnecessary string objects.
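
The difference can be illustrated with a pair of lookup functions. Both lookup_by_string and lookup_by_range are hypothetical names invented for this sketch, as is the three-entry name table; the range version takes forward iterators because it walks the range more than once.

```cpp
#include <algorithm>
#include <cstring>
#include <string>

// Hypothetical string-based lookup: a caller holding only a subrange of a
// pattern must first copy that subrange into a string object.
inline bool lookup_by_string(const std::string& name) {
    return name == "alpha" || name == "digit" || name == "space";
}

// Hypothetical iterator-based lookup: compares the range in place, with no
// intermediate string.
template <class FwdIt>
bool lookup_by_range(FwdIt first, FwdIt last) {
    static const char* const names[] = {"alpha", "digit", "space"};
    for (const char* n : names) {
        if (std::equal(first, last, n, n + std::strlen(n)))
            return true;
    }
    return false;
}
```

For a pattern such as "[[:digit:]]+", a parser that has located the name at offsets 3 through 8 can call lookup_by_range(pattern.begin() + 3, pattern.begin() + 8) directly, whereas the string form requires constructing std::string(pattern.begin() + 3, pattern.begin() + 8) just to make the call.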

Recommended changes:

  1. Change the signature of regex_traits::transform to
        template <class InIt>
        string_type transform(InIt first, InIt last) const;

    and change the Effects clause to:

    Effects: returns use_facet<collate<charT> >(getloc()).transform(first, last).

  2. Change the signature of regex_traits::transform_primary to
        template <class InIt>
        string_type transform_primary(InIt first, InIt last) const;

    and change the Effects clause to:

    Effects: if typeid(use_facet<collate<charT> >) == typeid(collate_byname<charT>) and the form of the sort key returned by collate_byname<charT>::transform(first, last) is known and can be converted into a primary sort key, then returns that key, otherwise returns an empty string.

  3. Change the signature of regex_traits::lookup_collatename to
        template <class InIt>
        string_type lookup_collatename(InIt first, InIt last) const;

    and change the Effects clause to:

    Effects: returns the sequence of characters that represents the collation element named by the characters in the half-open range [first, last) if that sequence names a valid collation element under the imbued locale, otherwise returns an empty string.

    Note that in addition to introducing the iterator language, this change to the Effects clause removes the requirement that lookup_collatename recognize the names of characters in the POSIX Portable Character Set. This requirement seems to be the result of a misunderstanding of what constitutes a collation element.

  4. Change the signature of regex_traits::lookup_classname to
        template <class InIt>
        char_class_type lookup_classname(InIt first, InIt last) const;

    and change the Effects clause to:

    Effects: returns an implementation-specific value that represents a character classification named without regard to case by the characters in the half-open range [first, last) if such a character classification exists, otherwise returns 0. The implementation shall provide character classes with the following names: "d", "w", "s", "alnum", "alpha", "blank", "cntrl", "digit", "graph", "lower", "print", "punct", "space", "upper", and "xdigit".
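
For comparison, std::regex_traits as later standardized in the <regex> header exposes iterator-pair forms of these functions, which makes it possible to sketch how calling code looks under such an interface. The helper names in_named_class and sort_key are hypothetical, and the standardized signatures use forward iterators, so this should not be read as the exact wording proposed above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: tests whether c belongs to the character class whose
// name is given by [first, last), without copying the name into a string.
template <class FwdIt>
bool in_named_class(char c, FwdIt first, FwdIt last) {
    std::regex_traits<char> traits;
    const auto cls = traits.lookup_classname(first, last);
    // A value-initialized char_class_type means the name was not recognized.
    const std::regex_traits<char>::char_class_type none{};
    if (cls == none)
        return false;
    return traits.isctype(c, cls);
}

// Hypothetical helper: returns the collation sort key for [first, last)
// under the locale held by a default-constructed traits object.
template <class FwdIt>
std::string sort_key(FwdIt first, FwdIt last) {
    return std::regex_traits<char>().transform(first, last);
}
```

A parser holding the pattern "[[:digit:]]" can pass iterators to the five characters of "digit" straight from its pattern buffer; no temporary string is needed at the call site.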