Proposal for new string algorithms in TR2

Author: Pavol Droba <droba@topmail.sk>
Number:N2059=06-0129
Date: 2006-09-08

Table of Contents

1   Abstract

This proposal introduces several string manipulating algorithms for addition into the C++ standard library.

2   Motivation

String processing is a part of almost every non-trivial program. There is always a need to process some input, extract some information from a text or format a textual output. Although the range of operations that need to be performed with strings is almost unlimited, some of them are very common and they tend to be reinvented continuously by C++ programmers.

The C++ standard library supports string processing mainly by the basic_string class. In addition, the <algorithm> header contains several algorithms that might be applied in string processing tasks, i.e. search, replace, etc. Compared to other programming languages like perl, these options are very limited.

This paper tries to address this deficiency by introducing some of the often used algorithms into the standard library.

2.1   Existing practice

This proposal is based on the Boost String Algorithms Library. The library is in Boost since the version 1.32 and it is well accepted (See references).

The algorithms mentioned here are also present in some form in many general-purpose C++ libraries. Among others CString from MFC or QString from Qt. They are also standard operations in scripting languages like Perl, Python or Ruby.

3   Impact on the standard

The proposal is directly dependent on other two proposed extensions: Range Library (N2068) and rvalue reference proposal (N1770). No other changes in the core language or in existing parts of the standard library are required.

Both requirements can be refined/relaxed to be compatible with C++2003 version of the standard. Range library can be, with some restrictions, redefined to use the current standard. The Boost Range Library implementation is a direct proof the of this claim. Algorithms that use rvalue reference to perform move semantics can also be altered to do simple in-place mutation. The main functionality of algorithms will be still maintained [1].

[1]This is actually how the Boost String Algorithm Library is implemented. Instead of version that takes a && parameter, there is a version that takes a & parameter and mutates the argument directly.

4   Design issues

4.1   Algorithm selection

String processing is a wide application area. Therefore there is also a large quantity of various related algorithms. Obviously, not all of them are suitable for a general purpose library like the C++ Standard Library. Our goal is to provide the essential algorithms that are not available in the standard by other or require nontrivial effort to implemenent.

In the search for the 'essential ones' we have considered some general purpose C++ libraries and standard libraries for mainstream languages comparable to the C++ in the area of application. Each of these representatives have one or more string classes, each with a set of operations bound to them.

Here is the list of representatives:

Scripting languages have naturally been strong in the string processing. We include two representatives in our comparison. We have chosen these two because they are object oriented and widely used.

Algorithms presented in these libraries can be divided into the following categories:

  • String manipulation and construction
  • Substring extraction
  • Search and replace
  • Transformation
  • Parsing and formatting
  • Trimming and padding
  • Predicates
  • Others

4.1.1   String manipulation and construction

This category consists of algorithms that directly modify a string. Operations include construction, concatenation, insertion and erasure.

Representatives:

  • basic_string
    • assign
    • append
    • operator+=
    • insert
    • erase
    • ...
  • CString
    • Append
    • Delete
    • Insert
    • operator+=
  • Ruby.String
    • concat
    • delete
    • insert

All of the mentioned libraries including STL basic_string implement this to a reasonable extend.

4.1.2   Substring extraction

Representatives:

  • basic_string::substr
  • CString
    • Left
    • Mid
    • Right
  • Java.String.substring
  • Ruby.String[]

Strongly-typed languages offer methods that extract a substring between a specified indexes. Some libraries have also methods for simple extraction from the start or the end of a string. Scripting languages like ruby combine all this functionality into the operator[].

basic_string is limited to one extraction method. It can perform all of the required operations at the cost of slightly more verbose code.

4.1.3   Search and replace

Searching for a substring in a string and replacing it with something else is one of the fundamental string operations.

Representatives:

  • basic_string::find
  • QString::indexOf
  • CString::Find
  • Java.String.indexOf

Libraries offer various levels of flexibility for these operations. The basic level allows to search for the exact string. An improvement is the ability to ignore the case. Most powerful is searching using an expression that describes the substring. Regular expressions are the de facto standard in the last category.

Current C++ standard is limited to the basic level of functionality. Fortunately, this state was radically changed by accepting the Regex Library into TR1.

Note: split and its variant can be also considered to be a part of this category.

4.1.4   Transformation

Probably the most frequently used operation from this category is the case conversion.

Representatives:

  • CString
    • MakeUpper
    • MakeLower
  • QString
    • toLower
    • toUpper
  • Java
    • toUpperCase
    • toLowerCase

It is present in all the mentioned libraries, except for STL. Being one of the most basic operation, its absence in the Standard Library is problematic at least. Therefore it is part of this proposal.

Other transformations found in the libraries are usually domain-specific and not universally accepted.

4.1.5   Parsing and formatting

We will omit these two categories from the scope of this proposal. Both are rather complex and require more elaboration to give some reasonable results. Separate proposals should be considered for this domain.

4.1.6   Trimming and padding

Strings are often printed with spaces on either side to improve the readability on the screen. These spaces have usually no semantic value and must be removed prior to processing. This removal is called trimming. The converse operation, when spaces are added to the front or the end of string, is called padding.

Both operations can be found in most of the mentioned libraries.

Representatives:

  • QString
    • trimmed
    • leftJustified
    • rightJustified
  • CString
    • Trim
    • TrimLeft
    • TrimRight
  • C# System.String
    • TrimStart
    • TrimEnd
    • PadLeft
    • PadRight
  • Ruby.String
    • strip
    • center
    • rjust
    • ljust

This functionality is completely missing in the Standard Library, therefore it is part of this proposal.

4.1.7   String predicates

Even if this category is quite broad there is a relatively small number of algorithms that are used frequently. We have focused on predicates that examine relationship between the two strings. Relation operators (<,=,>) that fall in this category are present in all libraries. Additionally, we have containment checks and advanced forms of comparison.

Representatives:

  • STL
    • lexicographical_compare
  • QString
    • compare
    • startsWith
    • endsWith
  • Java.String
    • compareTo
    • startsWith
    • endsWith
  • CString
    • Compare
    • CompareNoCase
    • Collate
    • CollateNoCase

This area is not sufficiently covered in the Standard Library. The predicates presented here are therefore a useful addition.

4.1.8   Others

Algorithms that do not fall to any of the above categories are usually domain-specific or too narrow-defined. Therefore they are not suitable for the context of this proposal.

4.2   String representation

While a 'string' is often represented by basic_string class, this is not the only representation. There are several alternatives that are widely used. Examples are FlexString from the Loki library or rope from the STLPort implementation of the C++ standard library. Algorithms presented here are therefore designed to work with generic structures that satisfy algorithm requirements, rather than with a concrete string class.

To accommodate this design, the Range Library (N2068) is used as the interface for accessing the input strings. The Range Library provides uniform access to a container structures. This allows virtually any container to be used with the algorithms presented here. Yet, the algorithms are designed to work efficiently with string-like structures. An implication is that the elements of a string (i.e. characters) are values that can be copied and moved around without additional overhead.

4.3   Differences from the Boost String Algorithms Library

As mentioned earlier, this proposal is based on Boost String Algorithms Library. However, it includes only a subset of the algorithms contained in the Boost library. The part of the library that deals with find and replace functionality is intentionally omitted due to several reasons:

  • This functionality is quite complex and requires several facilities that should be considered separately.
  • There are several volatile topics that have not been satisfactory solved, therefore they are not ready for standardization.
  • There is a possible overlap with Regex proposal that is a part of TR1.

Therefore we have decided to exclude these algorithms from this proposal. Yet, it is possible that they might be a part of another proposal in the future.

There are a few design differences between the proposal and the Boost library as well;

  • Mutable variants of algorithms were removed in favor of move semantics (using constructs form the proposal N1770). Now the algorithms usually come in two flavors, a copy (using const &) and a move (using &&) variant. We consider this to be another evolutionary step towards functional-oriented programming in C++. During the review of Boost String Algorithms Library this topic was intensely debated. A strong opinion was expressed in favor of functional approach, however without the move semantics support, it is not possible to implement certain algorithms in an efficient manner. Therefore mutable version were retained as a primary interface, accompanied by copy variants.
  • The definitions are tied to the Range Library proposal (N2068) rather than to the Boost Range Library.

5   Character literals handling

The Range library proposal (N2068) does not contain direct support for handling of the character literals. This support was present in Boost Range Library, but it was deprecated due to the inconsistencies in usage of the built-in arrays. Character literals are typed as C-arrays of char (s) or wchar_t (s). The problematic isssue is the zero terminator at the end. For string processing, this terminator should not be a part of the range, on the other hand, generic array usage requires that the entire array is considered to be a part of range.

A hybrid solution that uses literal helpers is therefore suggested. The default behaviour of range is defined in the range proposal. Algorithms, specifically in the domain of string processing, can override this default by applying as_literal to the input arguments.

Please note, that there is no support planned for null terminated C-strings. Unlike literals, these have type char* or wchar_t*. Usage of plain C-strings is discouraged due to their inherently unsafe nature. A user can still convert C-string to a range prior to usage.

5.1   Literal helpers

Both helpers convert a character literal to a range. The difference between the two utilities is in the way they handle Char[] literals. as_literal treats it as an ordinary null-terminated string, while as_array uses the array dimensions to determine the length.

In addition, they behave neutrally to the other ranges. That means they return the same range as the input (modulo range wrapping).

In the following example:

char str[]="Hello";
range<char*> lit=as_literal(str);
range<char*> arr=as_array(str);

variable lit will delimit elements 'H','e','l','l','o'. arr, on the other end will point to elements 'H','e','l','l','o',0 including the terminating zero.

5.1.1   as_literal

as_literal constructs a range that delimits the range specified as the parameter. Char[] (s) are considered to be null-terminated strings. strlen is used to determine the length of the string. Other range type are passed unmodified.

Synopsis:

template<typename Range>
range<typename range_iterator<Range>::type> as_literal(const Range& r);

template<typename Char, size_t sz>
range<Char*> as_literal(Char (&arr)[sz])
Requires:
  • Range must satisfy the range concept requirements (N2068).
  • char_traits<Char> is defined.

Effect: Construct a range that delimits the input string.

5.1.2   as_array

as_array constructs a range that delimits the range specified as parameter. Char[] literals are considered to be C-arrays. The array dimension is used to determine the length of the string. Other range types are passed unmodified.

Note, this is actually the default behaviour for the range library.

Synopsis:

template<typename Range>
range<typename range_iterator<Range>::type> as_array(const Range& r);

template<typename Char, size_t sz>
range<Char*> as_array(Char (&arr)[sz])

Requires: Range must satisfy the range concept requirements (Range library proposal)

Effect: Constructs a range that delimits the input string.

Note: This is actually the default behaviour for the range library. Therefore an implementation
can simply forward call to make_range.

5.2   Implementation

An algorithm that wants to allow a user to pass a character literal in its parameter needs to use as_literal to convert the input parameter.

Example:

template<typename Range>
bool AlgoImpl(const Range& r);

template<typename Range>
bool Algo(const Range& r)
{
  return AlgoImpl(as_literal(r));
}

Alternatively, if a user requires to pass an argument as an array rather then literal, she can override the as_literal conversion in the algorithm by using as_array.

Example:

char arr[6]={ 'h', 'e', 'l', 'l', 'o' };
Algo(as_array(arr));

6   Algorithms

Unless stated otherwise, types of arguments passed to the template functions must satisfy the requirements defined by the concept of the same name as the corresponding template argument. The used concepts are either defined in the Standard Library Reference or they are part of the Range Library Proposal (N2068).

In addition, all parameters of Range type shall be converted using as_literal helper as described above.

6.1   Case Conversion

These algorithms perform case-conversion on all characters in the string.

6.1.1   to_lower

Synopsis:

template<typename CopyableRange>
CopyableRange to_lower(const CopyableRange& s, const locale& loc=locale());

Returns: A modification of s such that each element x == tolower(y, loc) where y is a corresponding element in the original s. [2]

template<typename CopyableRange>
CopyableRange&& to_lower(CopyableRange&& s, const locale& loc=locale());

Returns: to_lower( CopyableRange(s), loc)

6.1.2   to_upper

Synopsis:

template<typename CopyableRange>
CopyableRange to_upper(const CopyableRange& s, const locale& loc=locale());

Returns: A modification of s such that each element x == toupper(y, loc) where y is a corresponding element in the original s. [2]

template<typename Sequence>
CopyableRange&& to_upper(CopyableRange&& s, const locale& loc=locale());

Returns: to_upper( CopyableRange(s), loc)

[2](1, 2) An implicit requirement is implied from using toupper and tolower. ctype must be specialised for range_value<CopyableRange>::type and it must be imbued in the specified locales.

6.2   Predicates

These algorithms perform containment checks of various types and the string comparison.

6.2.1   starts_with

Synopsis:

template<typename Range1, typename Range2>
bool starts_with(const Range1& r1, const Range2& r2);

template<typename Range1, typename Range2, typename Pred>
bool starts_with(const Range1& r1, const Range2& r2, Pred pred);

template<typename Range1, typename Range2>
bool istarts_with(const Range1& r1, const Range2& r2, const locale& loc=locale());
Returns:

The predicate checks if r2 is a prefix of r1. In other words, it holds when size(r1)>=size(r2) and each element from r2 is equal to corresponding element in r1, beginning from the start of r1.

The first form uses operator== for comparison, second one uses the given predicate and the last one is comparing using the expression tolower(x, loc)==tolower(y, loc)

Example: Function IsURL() checks if the path denotes an url.

bool IsURL(const string& strPath)
{
      return
              istarts_with(strPath, "http://") ||
              istarts_with(strPath, "https://") ||
              istarts_with(strPath, "ftp://");
}

6.2.2   ends_with

Synopsis:

template<typename Range1, typename Range2>
bool ends_with(const Range1& r1, const Range2& r2);

template<typename Range1, typename Range2, typename Pred>
bool ends_with(const Range1& r1, const Range2& r2, Pred pred);

template<typename Range1, typename Range2>
bool iends_with(const Range1& r1, const Range2& r2, const locale& loc=locale());
Returns:
The predicate checks if r2 is a suffix of r1. Equals to istarts_with(reverse(r1), reverse(r2)).

Example: Function IsExecutable() checks if the extension of the file name denotes an executable.

bool IsExecutable(const string& strPath)
{
  return
    iends_with(strPath, ".com") ||
    iends_with(strPath, ".exe") ||
    iends_with(strPath, ".bat");
}

6.2.3   contains

Synopsis:

template<typename Range1, typename Range2>
bool contains(const Range1& r1, const Range2& r2);

template<typename Range1, typename Range2, typename Pred>
bool contains(const Range1& r1, const Range2& r2, Pred pred);

template<typename Range1, typename Range2>
bool icontains(const Range1& r1, const Range2& r2, const locale& loc=locale());
Returns:

The predicate checks if r2 is contained within r1. It returns true if size(r1)>=size(r2) and there exists an index n such that r2[i]==r1[i+n] for all indexes i in r1,

The first form uses operator== for comparison, second one uses the given predicate and the last one is comparing using expression tolower(x, loc)==tolower(y, loc)

6.2.4   equals

The equals algorithm is actually a two-way version of equal available in the <algorithm> header. equal does not check if both operands have the same length. Moreover, it requires that the second argument is at least as long as the first one. Algorithm presented here fills this gap. In addition, case-insensitive comparison is provided.

Synopsis:

template<typename Range1, typename Range2>
bool equals(const Range1& r1, const Range2& r2);

template<typename Range1, typename Range2, typename Pred>
bool equals(const Range1& r1, const Range2& r2, Pred pred);

template<typename Range1, typename Range2>
bool iequals(const Range1& r1, const Range2& r2, const locale& loc=locale);
Returns:

The predicate checks if r1 and r2 are equal element wise. It returns true size(r1)==size(r2) and each element r1[n] from r1 is equal to corresponding elements r2[n] from r2.

First form uses operator== for comparison, second one uses the given predicate and the last one is comparing using expression tolower(x, loc)==tolower(y, loc)

6.2.5   lexicographical_compare

This algorithm has functionality equivalent to lexicographical_compare algorithm already present in the standard library. It provides a simpler, range-oriented interface and a case-insensitive version.

Synopsis:

template<typename Range1, typename Range2>
bool lexicographical_compare(const Range1& r1, const Range2& r2);

template<typename Range1, typename Range2, typename Pred>
bool lexicographical_compare(const Range1& r1, const Range2& r2, Pred pred);

template<typename Range1, typename Range2>
bool ilexicographical_compare(const Range1& r1, const Range2& r2, const locale& loc=locale());

Returns: true if r1 is lexicographically less than r2, false otherwise.

Notes:
If the two sequences have the same number of elements and their corresponding elements are equivalent, then neither sequence is lexicographically less than the other. If one sequence is a prefix of the other, then the shorter sequence is lexicographically less than the longer sequence. Otherwise, the lexicographical comparison of the sequences yields the same result as the comparison of the first corresponding pair of elements that are not equivalent.

6.2.6   all [3]

Synopsis:

template<typename Range, typename Pred>
bool all(const Range& r1, Pred pred);

Returns: Returns true if for each element x, pred(x)==true, false otherwise.

6.2.7   none [3]

Synopsis:

template<typename Range, typename Pred>
bool none(const Range& r1, Pred pred);

Returns: Returns true if for each element x, pred(x)==false, false otherwise.

[3](1, 2) The names all and none might be considered too general. Yet they describe the algorithms very well and we haven't found a better variant so far. If this is considered a problem, alternative names might be used.

6.3   Trimming

Trimming algorithms remove unnecessary spaces from the beginning and the end of a string. [5]

6.3.1   trim_left [6]

Synopsis:

template<typename CopyableRange>
CopyableRange trim_left(const CopyableRange& s, const locale& loc=locale());

template<typename CopyableRange, typename Pred>
CopyableRange trim_left_if(const Sequence& s, Pred pred);

Return: A copy of the input range where all leading spaces (characters satisfying pred) have been removed.

template<typename Sequence>
Sequence&& trim_left(Sequence&& s, const locale& loc=locale());

template<typename Sequence>
Sequence&& trim_left_if(Sequence&& s, Pred pred);

Return: A modified version of s where all leading spaces (characters satisfying pred) have been removed.

6.3.2   trim_right [6]

Synopsis:

template<typename CopyableRange>
CopyableRange trim_right(const CopyableRange& s, const locale& loc=locale);

template<typename CopyableRange, typename Pred>
CopyableRange trim_right_if(const Sequence& s, Pred pred);

Returns: A copy of the input range where all trailing spaces (characters satisfying pred) have been removed.

template<typename Sequence>
Sequence&& trim_right(Sequence&& s, const locale& loc=locale);

template<typename Sequence>
Sequence&& trim_right_if(Sequence&& s, Pred pred);

Returns: A modified version of s where all trailing spaces (characters satisfying pred) have been removed.

6.3.3   trim

Synopsis:

template<typename CopyableRange>
CopyableRange trim(const CopyableRange& s, const locale& loc=locale());

template<typename CopyableRange, typename Pred>
CopyableRange trim_if(const Sequence& s, Pred pred);
Returns:
A copy of the input range where all leading and trailing trailing spaces (characters satisfying pred) have been removed. Function have the same effect as trim_left(trim_right(s))
template<typename Sequence>
Sequence&& trim(Sequence&& s, const locale& loc=locale());

template<typename Sequence>
Sequence&& trim_if(Sequence&& s, Pred pred);
Returns:
A modified version of s where all leading and trailing trailing spaces (characters satisfying pred) have been removed. Function have same the effect as trim_left(trim_right(s))
[5]The _if suffix has to be added to the variants that have the predicate argument, to avoid name collisions. Since the predicate is a generic template argument, it cannot be distinguished from loc argument during the template instantiation stage.
[6](1, 2) Alternative names that might be considered are trim_start and trim_end. This would be more consistent with starts_with and ends_with, however 'TrimLeft/Right' is more often used term in currently existing libraries.

6.4   Padding

As the opposite to trimming, padding algorithms add a certain number of spaces to the one or both ends of the string, so that the string has the specified length.

All algorithms presented here can fill the gap with spaces or using the specified pattern. The pattern is used as follows: To fill the gap of n elements, pattern is repeated so that a sequence with at least n elements is created. After that, beginning of this sequence is used to fill the gap.

6.4.1   pad_left

template<typename Sequence>
Sequence pad_left(const Sequence& s, unsigned int len);

template<typename Sequence, typename Range>
Sequence pad_left(const Sequence& s, unsigned int len, const Range& pattern);
Returns:
If size(s)>=len an identical copy of s is returned. Otherwise the sequence is prepended with spaces or the repeated pattern to have the size of len characters.
template<typename Sequence>
Sequence&& pad_left(Sequence&& s, unsigned int len);

template<typename Sequence>
Sequence&& pad_left(Sequence&& s, unsigned int len, const Range& pattern);
Returns:
If size(s)>=len, the original sequence is returned unaltered. Otherwise the sequence is prepended with spaces or the repeated pattern to have the size of len characters.

6.4.2   pad_right

template<typename Sequence>
Sequence pad_right(const Sequence& s, unsigned int len);

template<typename Sequence, typename Range>
Sequence pad_right(const Sequence& s, unsigned int len, const Range& pattern);
Returns:
If size(s)>=len, an identical copy of s is returned. Otherwise the sequence is extended with spaces or the repeated pattern to have the size of len characters.
template<typename Sequence>
Sequence&& pad_right(Sequence&& s, unsigned int len);

template<typename Sequence>
Sequence&& pad_right(Sequence&& s, unsigned int len, const Range& pattern);
Returns:
If size(s)>=len, the original sequence is returned unaltered. Otherwise the sequence is extended with spaces or the repeated pattern to have the size of len characters.

6.4.3   pad

template<typename Sequence>
Sequence pad(const Sequence& s, unsigned int len);

template<typename Sequence, typename Range>
Sequence pad(const Sequence& s, unsigned int len, const Range& pattern);
Returns:
If size(s)>=len, an identical copy of s is returned. Otherwise, the sequence is extended on both sides with spaces or repeated pattern to have the size of len characters [7].
template<typename Sequence>
Sequence&& pad(Sequence&& s, unsigned int len);

template<typename Sequence>
Sequence&& pad(Sequence&& s, unsigned int len, const Range& pattern);
Returns:
If size(s)>=len, the original sequence is returned unaltered. Otherwise, the sequence is extended on both sides with spaces or repeated pattern to have the size of len characters [7].
[7](1, 2) The filling characters are distributed evenly to the front and the back. If the difference is of odd length, the front gap is larger by one character. Both gaps are filled independently as if pad_left and pad_right were applied in sequence.

7   Placement

The algorithms should be placed into the <stringalgorithm> header. Although <algorithm> header might be also considered as an option, we have chosen to use a separate header due to dependency on the locales.

Since the scope of literal helpers is larger than this proposal, they shall be placed together with the Range Library.

8   Discussion and open issues

8.1   C-string support

Even if this issue is abandoned in the proposal, it could be considered. To support char* and wchar_t* null-terminated strings, the following extensions must be incorporated:

  • In the Range Library proposal, range decomposition metafunctions, namely range_iterator, range_reverse_iterator, range_value and range_difference, must be specialized for Char* types.
  • as_literal helper must be extended for Char*. The specialization will work exactly as for Char[].

9   References

  1. Boost String Algorithms library, http://www.boost.org/doc/html/string_algo.html
  2. Regular Expressions proposal, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1429.htm
  3. RValue reference proposal (N1770) and related documents, http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2005/n1770.html
  4. Range Library proposal, http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2006/n2068.html
  5. Boost Range library, http://www.boost.org/libs/range/index.html