Proposing std::split()

ISO/IEC JTC1 SC22 WG21 N3430 = 12-0120 - 2012-09-19

Greg Miller, jgm@google.com

Introduction

Splitting strings into substrings is a common task in most general-purpose programming languages, and C++ is no exception. When the need arises, programmers need to search for an existing solution or write one of their own. A typical solution might look like the following:

    std::vector<std::string> my_split(const std::string& text, const std::string& delimiter);
    

A straightforward implementation of the above function would likely use std::string::find or std::string::find_first_of to identify substrings and move from one to the next, building the vector to return. This is a fine solution for simple needs, but it is deficient in the following ways:

These are real deficiencies that resulted in Google code accumulating more than 50 separate "Split" functions for various needs. For example, the following is a family of related split functions that are used in real Google code:

    SplitStringUsing  // Splits to std::vector<string>
    SplitStringToHashsetUsing
    SplitStringToSetUsing
    SplitStringToHashmapUsing
    SplitStringAllowEmpty
    SplitStringToHashsetAllowEmpty
    SplitStringToSetAllowEmpty
    SplitStringToHashmapAllowEmpty
    

Each of the above functions splits an input string using any of the single-byte delimiters given as the delimiter string. They differ only in the collection type they return and whether or not empty substrings are included in the output. The moment someone needs to split a string into a std::unordered_set, two new split functions will need to be written: one that skips empty substrings and one that allows them.
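
To make the fixed return type and the fixed empty-substring policy concrete, the my_split declaration from the introduction could be implemented roughly as follows. This is a minimal sketch under stated assumptions, not code taken from Google's codebase: it treats the delimiter as one non-empty substring (not as a set of single-byte delimiters) and always keeps empty substrings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Naive splitter: the container type, the kind of delimiter, and the
// treatment of empty substrings are all hard-wired.
std::vector<std::string> my_split(const std::string& text,
                                  const std::string& delimiter) {
  assert(!delimiter.empty());  // Assumption of this sketch: non-empty delimiter.
  std::vector<std::string> result;
  std::string::size_type start = 0;
  for (;;) {
    const std::string::size_type pos = text.find(delimiter, start);
    if (pos == std::string::npos) {
      result.push_back(text.substr(start));  // Final piece, possibly empty.
      return result;
    }
    result.push_back(text.substr(start, pos - start));
    start = pos + delimiter.size();  // Continue after the delimiter.
  }
}
```

Changing the container or the handling of empty substrings means writing another function of the same shape, which is how a family like the one listed above grows.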

To address the above deficiencies, Google has implemented and is internally using a new API for splitting strings. The new API has been very well received by internal engineers writing real code, and it is rapidly replacing the existing assortment of split functions. The following examples demonstrate Google's new string splitting API in a few common usage scenarios.

    using std::set;
    using std::string;
    using std::string_ref;
    using std::vector;

    vector<string> v1 = strings::Split("a<br>b<br>c", "<br>");
    // v1 is {"a", "b", "c"}

    set<string> s1 = strings::Split("a,b,c,a,b,c", ",");
    // s1 is {"a", "b", "c"}

    vector<string> v2 = strings::Split("a,b;c-d", strings::AnyOf(",;-"));
    // v2 is {"a", "b", "c", "d"}

    vector<string> v3 = strings::Split("a,,c", ",", strings::SkipEmpty());
    // v3 is {"a", "c"}

    vector<string_ref> v4 = strings::Split("a,b,c", ",");
    // v4 is {"a", "b", "c"} -- string_refs refer to the data passed in the first arg, avoiding data copies
    

The rest of this paper describes Google's new string splitting API as it might appear in C++ in the std:: namespace.

New API

At a basic level, a string splitting API breaks text into substrings using a separator or delimiter. This simple description combined with real-world programmer needs drawn from the existence and usage of existing split functions has led to the following goals for a new string splitting API:

The above goals are realized in the following API:

    namespace std {

      template <typename Delimiter>
      splitter<Delimiter> split(std::string_ref text, Delimiter d);

      template <typename Delimiter, typename Predicate>
      splitter<Delimiter> split(std::string_ref text, Delimiter d, Predicate p);

    }
    

[Footnote: This API uses the std::string_ref API [string_ref] to minimize string copies. This API could be written in terms of std::string instead if std::string_ref is not available. —end footnote]

The Delimiter template parameter represents various ways to delimit strings, such as substrings, single characters, or even regular expressions. The Predicate, given in the second form, represents various ways to filter the results, such as skipping empty strings. The splitter<T> that is returned from std::split() has a templated conversion operator (operator T()) that allows it to be implicitly converted to the type specified by the caller.

The text to be split is given as a std::string_ref object, which cannot modify the underlying data to which it refers. Thus, the input text to be split is effectively immutable. The split results may also be returned in a collection of std::string_ref objects. In this case, the resultant std::string_ref objects will refer to the text data that was given as input, eliminating all string data copies. Data are only copied if the caller requests to store results in a container of objects that copy the data, such as a container of std::string objects.
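
The aliasing described above can be illustrated with a short sketch. It uses C++17 std::string_view as a stand-in for the proposed std::string_ref (an assumption; they are different types), and the helper refers_into is hypothetical, introduced only for this illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <string_view>

// Returns true if `piece` lies entirely inside the buffer viewed by `text`.
inline bool refers_into(std::string_view piece, std::string_view text) {
  std::less_equal<const char*> le;  // Well-defined order on arbitrary pointers.
  return le(text.data(), piece.data()) &&
         le(piece.data() + piece.size(), text.data() + text.size());
}

void example_views_share_input() {
  const std::string input = "a,b,c";
  const std::string_view text = input;

  // A view of the middle field: no characters are copied.
  const std::string_view b = text.substr(2, 1);
  assert(b == "b");
  assert(refers_into(b, text));

  // Converting to std::string copies the characters into new storage.
  const std::string owned(b);
  assert(owned == "b");
  assert(!refers_into(owned, text));
}
```

Because such views alias the input, they remain valid only while the input buffer is alive and unmodified; a container of std::string does not have that constraint.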

Delimiters

The general notion of a delimiter is not new. A delimiter (little d) marks the boundary between two substrings in a larger string. With this split API comes the formal concept of a Delimiter (big D). A Delimiter is an object with a find() member function that knows how to find the first occurrence of itself in a given std::string_ref. Objects that conform to the Delimiter concept represent specific kinds of delimiters, such as single characters, substrings, and regular expressions.

The following example shows a simple object that models the Delimiter concept. It has a find() member function that is responsible for finding the next occurrence of a char in the given text. The std::string_ref returned from the find() member function must refer to a substring of find()'s argument text, or else it must be an empty std::string_ref.

    struct char_delimiter {
      char c_;
      explicit char_delimiter(char c) : c_(c) {}
      std::string_ref find(std::string_ref text) const {
        // size_type matches the return type of find() and the type of npos.
        std::string_ref::size_type pos = text.find(c_);
        if (pos == std::string_ref::npos)
          return std::string_ref();            // Not found, returns empty std::string_ref.
        return std::string_ref(text, pos, 1);  // Returns a string_ref referring to the c_ that was found in the input string.
      }
    };
    

The following shows how the above delimiter could be used to split a string:

    std::vector<std::string> v = std::split("a,b,c", char_delimiter(','));
    // v is {"a", "b", "c"}
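
One way std::split() might drive a Delimiter is sketched below. The names char_delimiter_sv and split_pieces are hypothetical, and std::string_view (C++17) stands in for std::string_ref. The sketch assumes a Delimiter never matches an empty substring, because an empty result from find() is read as "not found".

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Stand-in for the char_delimiter above, written against std::string_view.
struct char_delimiter_sv {
  char c_;
  explicit char_delimiter_sv(char c) : c_(c) {}
  std::string_view find(std::string_view text) const {
    const std::string_view::size_type pos = text.find(c_);
    if (pos == std::string_view::npos) return std::string_view();
    return text.substr(pos, 1);  // Refers to the delimiter inside `text`.
  }
};

// Hypothetical driver: repeatedly asks the Delimiter for its next occurrence
// in the unconsumed text and emits whatever precedes it.
template <typename Delimiter>
std::vector<std::string_view> split_pieces(std::string_view text, Delimiter d) {
  std::vector<std::string_view> pieces;
  for (;;) {
    const std::string_view found = d.find(text);
    if (found.empty()) {  // No further delimiter: the rest is the last piece.
      pieces.push_back(text);
      return pieces;
    }
    // `found` is a substring of `text`, so the pointer difference is its offset.
    const auto offset =
        static_cast<std::string_view::size_type>(found.data() - text.data());
    pieces.push_back(text.substr(0, offset));
    text.remove_prefix(offset + found.size());
  }
}

void example_split_pieces() {
  const std::vector<std::string_view> v =
      split_pieces("a,,c", char_delimiter_sv(','));
  assert(v.size() == 3);  // "a", "", "c"
  assert(v[0] == "a" && v[1].empty() && v[2] == "c");
}
```

The driver relies only on find() returning a substring of its argument, which is why the Delimiter concept states that requirement.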
    

The following are standard delimiter implementations that will be part of the splitting API.

std::literal
A string delimiter. The default delimiter used if a string is given as the delimiter argument to std::split(). (Alternative name, std::literal_delimiter.)
std::any_of
Each character in the given string is a delimiter. This is different from the std::any_of algorithm [alg.any_of], but overload resolution should disambiguate them. (Alternative name, std::any_of_delimiter.)
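
The find() member functions of these two delimiters might be sketched as follows. The class names literal_sketch and any_of_sketch are hypothetical, std::string_view stands in for std::string_ref, and the sketch does not attempt the empty-delimiter behavior described later under API Usage.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for std::literal: the whole string is one delimiter.
class literal_sketch {
 public:
  explicit literal_sketch(std::string_view sv) : delimiter_(sv) {}
  std::string_view find(std::string_view text) const {
    const std::string_view::size_type pos = text.find(delimiter_);
    if (pos == std::string_view::npos) return std::string_view();
    return text.substr(pos, delimiter_.size());
  }

 private:
  const std::string delimiter_;
};

// Hypothetical stand-in for std::any_of: each character is a delimiter.
class any_of_sketch {
 public:
  explicit any_of_sketch(std::string_view sv) : delimiters_(sv) {}
  std::string_view find(std::string_view text) const {
    const std::string_view::size_type pos = text.find_first_of(delimiters_);
    if (pos == std::string_view::npos) return std::string_view();
    return text.substr(pos, 1);
  }

 private:
  const std::string delimiters_;
};

void example_standard_delimiters() {
  const std::string_view text = "a<br>b;c";
  assert(literal_sketch("<br>").find(text) == "<br>");
  assert(any_of_sketch(",;").find(text) == ";");
  assert(literal_sketch("|").find(text).empty());  // Not found.
}
```

Both classes own a copy of the delimiter text, mirroring the const string members shown in the API synopsis, so a delimiter object stays valid after the string it was built from goes away.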

Predicates

The predicates used in the splitting API are unary function objects that return true or false. These are normal STL predicates [Footnote: C++11[algorithms.general]p8 —end footnote]. They are used to filter the results of a split operation by determining whether or not a resultant element should be included or filtered out. The following example shows a predicate that will omit empty strings from the results of a split.

    struct skip_empty {
      bool operator()(std::string_ref sref) const {
        return !sref.empty();
      }
    };
    

The above predicate could be used when splitting as follows:

    std::vector<std::string> v = std::split("a,,c", ",", skip_empty());
    // v is {"a", "c"}
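
The filtering step itself can be sketched in isolation. In the sketch below, keep_if and skip_empty_sv are hypothetical names, std::string_view stands in for std::string_ref, and the pieces are supplied directly rather than produced by a split.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string_view>
#include <vector>

// Same shape as the skip_empty predicate above, on std::string_view.
struct skip_empty_sv {
  bool operator()(std::string_view sv) const { return !sv.empty(); }
};

// Hypothetical filtering step: keeps the pieces for which `p` returns true.
template <typename Predicate>
std::vector<std::string_view> keep_if(
    const std::vector<std::string_view>& pieces, Predicate p) {
  std::vector<std::string_view> kept;
  std::copy_if(pieces.begin(), pieces.end(), std::back_inserter(kept), p);
  return kept;
}

void example_keep_if() {
  const std::vector<std::string_view> pieces = {"a", "", "c"};  // From "a,,c".
  const std::vector<std::string_view> kept = keep_if(pieces, skip_empty_sv());
  assert(kept.size() == 2);  // "a", "c"
  assert(kept[0] == "a" && kept[1] == "c");
}
```

Note the polarity: the predicate returns true for elements that are kept, which is why skip_empty returns !sref.empty().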
    

The splitter<T>

The std::split() function returns an object of type splitter<T>. This object is responsible for returning the results in the caller-specified container, which can be done using a templated conversion operator. The splitter<T> will also have begin() and end() member functions so it can be used in range-based for loops. The splitter<T> is used to implement the behavior of the std::split() function—it is not part of the public split API. The following example shows what a splitter<T> might look like.

    template <typename Delimiter>
    class splitter {
     public:
      …
      const iterator& begin() const;
      const iterator& end() const;

      template <typename Container>
      operator Container() {
        return Container(begin(), end());
      }
    };
    

The example code above shows a possible splitter interface with support for range-based for loops and implicit conversion to caller-specified containers. The templated conversion operator in the example above shall not participate in overload resolution unless the Container has a constructor taking a begin and end iterator.
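
The constraint on the conversion operator could be expressed with std::enable_if, as in the following sketch. The class splitter_sketch is hypothetical and deliberately simplified: it is not a template on the Delimiter, it stores pieces that were already computed, and it uses std::string_view in place of std::string_ref.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>
#include <vector>

class splitter_sketch {
 public:
  using iterator = std::vector<std::string_view>::const_iterator;

  explicit splitter_sketch(std::vector<std::string_view> pieces)
      : pieces_(std::move(pieces)) {}

  iterator begin() const { return pieces_.begin(); }
  iterator end() const { return pieces_.end(); }

  // Participates in overload resolution only if Container can be
  // constructed from an (iterator, iterator) pair.
  template <typename Container,
            typename = typename std::enable_if<std::is_constructible<
                Container, iterator, iterator>::value>::type>
  operator Container() const {
    return Container(begin(), end());
  }

 private:
  std::vector<std::string_view> pieces_;
};

// int has no (iterator, iterator) constructor, so no conversion exists.
static_assert(!std::is_convertible<splitter_sketch, int>::value,
              "conversion operator should be disabled for int");

void example_conversion() {
  const std::vector<std::string_view> pieces = {"a", "b", "a"};
  const splitter_sketch s(pieces);

  const std::vector<std::string> v = s;    // Copies the characters.
  const std::set<std::string_view> u = s;  // Deduplicates; no copies.
  assert(v.size() == 3 && v[0] == "a");
  assert(u.size() == 2);
}
```

The trait only checks that a two-iterator constructor exists; whether the element type can actually be built from a string view is still checked when the constructor is instantiated.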

API Synopsis

std::split()

The function called to split an input string into a collection of substrings.

    namespace std {

      template <typename Delimiter>
      splitter<Delimiter> split(std::string_ref text, Delimiter d);

      template <typename Delimiter, typename Predicate>
      splitter<Delimiter> split(std::string_ref text, Delimiter d, Predicate p);

    }
    

std::literal (delimiter)

A string delimiter. This is the default delimiter used if a string is given as the delimiter argument to std::split(). Alternatively, this delimiter could be named differently, such as std::literal_delimiter.

    namespace std {

      class literal {
       public:
        explicit literal(string_ref sref);
        string_ref find(string_ref text) const;

       private:
        const string delimiter_;
      };

    }
    

std::any_of (delimiter)

Each character in the given string is a delimiter. This is different from the std::any_of algorithm [alg.any_of], but overload resolution should disambiguate this delimiter. Alternatively, this delimiter could be named differently, such as std::any_of_delimiter.

    namespace std {

      class any_of {
       public:
        explicit any_of(string_ref sref);
        string_ref find(string_ref text) const;

       private:
        const string delimiters_;
      };

    }
    

std::skip_empty (predicate)

Skips empty substrings in the std::split() output collection.

    namespace std {

      struct skip_empty {
        bool operator()(string_ref sref) const {
          return !sref.empty();
        }
      };

    }
    

API Usage

The following using declarations are assumed for brevity:

    using std::deque;
    using std::list;
    using std::set;
    using std::string;
    using std::string_ref;
    using std::vector;
    
  1. The default delimiter when not explicitly specified is std::literal. The following two calls to std::split() are equivalent. The first form is provided for convenience.
        vector<string> v1 = std::split("a,b,c", ",");
        vector<string> v2 = std::split("a,b,c", std::literal(","));
        
  2. Empty substrings are included in the returned collection unless explicitly filtered out using a predicate.
        vector<string> v1 = std::split("a,,c", ",");
        assert(v1.size() == 3);  // "a", "", "c"
    
        vector<string> v2 = std::split("a,,c", ",", std::skip_empty());
        assert(v2.size() == 2);  // "a", "c"
        
  3. Results can be returned in various STL containers as specified by the caller.
        vector<string> v = std::split("a,b,c", ",");
        deque<string> d = std::split("a,b,c", ",");
        set<string> s = std::split("a,b,c", ",");
        list<string> l = std::split("a,b,c", ",");
        
  4. A delimiter of the empty string results in each character in the input string becoming one element in the output collection.
        vector<string> v = std::split("abc", "");
        assert(v.size() == 3);  // "a", "b", "c"
        
  5. Results can also be returned in a container of std::string_ref objects rather than std::strings. The returned std::string_refs will refer to the data that was given as input to the std::split() function. This eliminates all copies of string data while splitting.
        vector<string_ref> v = std::split("a,b,c", ",");  // No data copied.
        assert(v.size() == 3);  // "a", "b", "c"
        
        
  6. Iterating the results of a split in a range-based for loop.
        for (string_ref sref : std::split("a,b,c", ",")) {
          // use sref
        }
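
Support for range-based for loops only requires begin() and end() returning iterators with operator*, operator++ and operator!=. The following sketch shows one lazy way to provide them. The class lazy_split_sketch is hypothetical, splits on a single char only, and uses std::string_view in place of std::string_ref.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

class lazy_split_sketch {
 public:
  class iterator {
   public:
    iterator() = default;  // End iterator.
    iterator(std::string_view rest, char sep)
        : rest_(rest), sep_(sep), done_(false) {
      advance();  // Load the first piece.
    }
    std::string_view operator*() const { return current_; }
    iterator& operator++() {
      advance();
      return *this;
    }
    bool operator!=(const iterator& other) const {
      return done_ != other.done_;
    }

   private:
    void advance() {
      if (exhausted_) {  // The last piece was already handed out.
        done_ = true;
        return;
      }
      const std::string_view::size_type pos = rest_.find(sep_);
      current_ = rest_.substr(0, pos);  // npos means "to the end".
      if (pos == std::string_view::npos) {
        exhausted_ = true;
      } else {
        rest_.remove_prefix(pos + 1);
      }
    }
    std::string_view rest_;
    std::string_view current_;
    char sep_ = '\0';
    bool exhausted_ = false;  // No input left after current_.
    bool done_ = true;        // True only for an end iterator.
  };

  lazy_split_sketch(std::string_view text, char sep) : text_(text), sep_(sep) {}
  iterator begin() const { return iterator(text_, sep_); }
  iterator end() const { return iterator(); }

 private:
  std::string_view text_;
  char sep_;
};

void example_range_for() {
  std::vector<std::string> seen;
  for (std::string_view sv : lazy_split_sketch("a,,c", ',')) {
    seen.emplace_back(sv);
  }
  assert(seen.size() == 3);  // "a", "", "c"
  assert(seen[0] == "a" && seen[1].empty() && seen[2] == "c");
}
```

Each piece is located only when the loop advances, so a caller that stops early does not pay for scanning the rest of the input.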
        

References

[string_ref]
N3442 (previously, N3334)