TextView

Synopsis

#include "swoc/TextView.h"

class TextView

Reference documentation.

This class acts as a view of memory allocated / owned elsewhere and treated as a sequence of 8 bit characters. It is in effect a pointer and should be treated as such (e.g. care must be taken to avoid dangling references by knowing where the memory really is). The purpose is to provide string manipulation that is safer than raw pointers and much faster than duplicating strings.

Usage

TextView is a subclass of std::string_view and inherits all of its methods. The additional functionality of TextView is for easy string manipulation, with an emphasis on fast parsing of string data. As noted, an instance of TextView is a pointer and needs to be handled as such. It does not own the memory and therefore, like a pointer, care must be taken that the memory is not deallocated while the TextView still references it. The advantage of this is creating new views and modifying existing ones is very cheap.

Any place that passes a char * and a size is an excellent candidate for using a TextView, as is code that uses functions such as strtok or tracks pointers and offsets internally.

Because TextView is a subclass of std::string_view it can be unclear which is the better choice. In many cases it doesn’t matter: because of this relationship, converting between the types is at most as expensive as a copy of the same type and, for constant references, can be free. In general, if the string is treated as a block of data, std::string_view is the better choice. If the contents of the string are to be examined or parsed, then TextView is better. For example, if the string is used simply as a key or a hash source, use std::string_view. Conversely, if the string may contain substrings of interest such as key / value pairs, use a TextView. That said, I do sometimes use TextView because std::string_view lacks support for instance reuse, e.g. it has no assign or clear methods.

When passing TextView as an argument, it is debatable whether passing by value or passing by reference is more efficient. The appropriate conclusion is that it’s not likely to matter in production code. My personal heuristic is whether the function will modify the value. If so, pass by value, since that saves a copy to a local variable. If the function simply passes the TextView on to other functions, pass by constant reference. This distinction is irrelevant to the caller; the same code at the call site will work in either case.

As noted, TextView is designed as a pointer style class. Therefore it has an increment operator which is equivalent to std::string_view::remove_prefix. TextView also has a dereference operator, which acts the same way as on a pointer. The difference is the view knows where the end of the view is. This provides a comfortably familiar way of iterating through a view, the main difference being checking the view itself rather than a dereference of it (like a C-style string) or a range limit. E.g. the code to write a simple hash function [1] could be

size_t hasher(TextView v) {
   size_t hash = 0;
   while (v) {
      hash = hash * 13 + * v ++;
   }
   return hash;
}

Alternatively, this can be done in a non-modifying way.

size_t hasher(TextView v) {
   size_t hash = 0;
   for ( auto c : v) {
      hash = hash * 13 + c;
   }
   return hash;
}

Because TextView inherits from std::string_view it can also be used as a container for range for loops.

size_t hasher(TextView const& v) {
   size_t hash = 0;
   for (char c : v) hash = hash * 13 + c;
   return hash;
}

The first approach enables dropping out of the loop on some condition with the view updated to no longer contain processed characters, making restart or other processing simple.
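The restart property can be sketched using only std::string_view, which TextView extends. This is an illustration under assumptions, not libswoc code: v.front() and v.remove_prefix(1) stand in for the TextView operators *v and ++v, and the name take_number is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical helper: consume leading decimal digits from the view and
// return their value. The view is passed by reference, so on return it
// starts at the first character that was not consumed.
std::size_t take_number(std::string_view &v) {
  std::size_t n = 0;
  while (!v.empty() && std::isdigit(static_cast<unsigned char>(v.front()))) {
    n = n * 10 + static_cast<std::size_t>(v.front() - '0');
    v.remove_prefix(1); // stand-in for ++v on a TextView
  }
  return n;
}

void example_take_number() {
  std::string_view src{"8080/index.html"};
  assert(take_number(src) == 8080);
  assert(src == "/index.html"); // the remainder is ready for further parsing
}
```

The loop drops out at the first non-digit and the caller continues with whatever is left in the view.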

The standard functions strcmp, memcmp, memcpy, and strcasecmp are overloaded for TextView so that a TextView can be used as if it were a C-style string. The size is taken from the TextView and doesn’t need to be passed in explicitly.

class CharSet

Reference documentation.

This is a simple class that contains a set of characters. This is intended primarily to make parsing faster and simpler. Rather than checking a list of delimiters the character can be checked with a single std::bitset lookup.

Basic Operations

TextView is essentially a collection of operations which have been found to be common and useful in manipulating contiguous blocks of text.

Construction

Constructing a view means creating a view from another object which owns the memory (for creating views from other views see Extraction). This can be a char const* pointer and size, two pointers, a literal string, a std::string or a std::string_view, although in the last case there is presumably yet another object that actually owns the memory. All of these constructors require only the equivalent of two assignment statements. The one thing to be careful of is that if a literal string or C-string is used, the resulting TextView will drop the terminating nul character from the view. This is almost always the correct behavior, but if it isn’t, an explicit size can be used.
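The forms that TextView shares with its base class can be tried with std::string_view alone. The following is a minimal sketch under that assumption; it does not cover the two pointer form or null pointers, which the standard class does not accept in the same way.

```cpp
#include <cassert>
#include <string>
#include <string_view>

void example_construction() {
  char const *text = "alpha";
  std::string owner{"bravo"};

  std::string_view from_ptr{text, 3};  // pointer and explicit size
  std::string_view from_cstr{text};    // C-string, length found by scanning for nul
  std::string_view from_string{owner}; // views memory owned by `owner`

  assert(from_ptr == "alp");
  assert(from_cstr.size() == 5);              // terminating nul is not in the view
  assert(from_string.data() == owner.data()); // no copy was made

  // An explicit size keeps the terminating nul in the view if that is needed.
  std::string_view with_nul{text, 6};
  assert(with_nul.size() == 6 && with_nul.back() == '\0');
}
```

Each view is valid only as long as text and owner are, which is the pointer-like behavior described in the introduction.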

A TextView can be constructed from a null char const* pointer or a straight nullptr. This will construct an empty TextView identical to one default constructed.

TextView supports a generic constructor that will accept any class that provides the data and size methods that return values convertible to char const * and size_t. This enables greater interoperability with other libraries, as any well written C++ library with its own string class will have these methods implemented sensibly.

Searching

Because TextView is a subclass of std::string_view all of its search methods work on a TextView. The only search methods provided beyond those in std::string_view are TextView::find_if() and TextView::rfind_if() which search the view by a predicate. The predicate takes a single char argument and returns a bool. The search terminates on the first character for which the predicate returns true.
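A predicate search of this kind can be approximated over a plain std::string_view with std::find_if. The free function view_find_if below is a hypothetical stand-in written for this sketch, and returning npos when nothing matches is an assumption modeled on the other std::string_view search methods.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

bool is_space(char c) {
  return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical stand-in: index of the first character for which pred is
// true, or npos if there is no such character.
template <typename Pred>
std::size_t view_find_if(std::string_view v, Pred pred) {
  auto spot = std::find_if(v.begin(), v.end(), pred);
  return spot == v.end() ? std::string_view::npos
                         : static_cast<std::size_t>(spot - v.begin());
}

void example_view_find_if() {
  assert(view_find_if("key = value", is_space) == 3u);
  assert(view_find_if("keyword", is_space) == std::string_view::npos);
}
```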

Extraction

Extraction is creating a new view from an existing view. Because views cannot in general be expanded, new views will be sub-sequences of existing views. This is the primary utility of a TextView. As noted in the general description TextView supports copying or removing prefixes and suffixes of the view. All of this is possible using the underlying std::string_view::substr but this is frequently much clumsier. The development of TextView was driven to a large extent by the desire to make such code much more compact and expressive, while being at least as safe. In particular extraction methods on TextView do useful and well defined things when given out of bounds arguments. This is quite handy when extracting tokens based on separator characters.

The primary distinction is how a character in the view is selected.

  • By index, an offset in to the view. These have plain names, such as TextView::prefix().

  • By character comparison, either a single character or set of characters which is matched against a single character in the view. These are suffixed with “at” such as TextView::prefix_at().

  • By predicate, a function that takes a single character argument and returns a bool to indicate a match. These are suffixed with “if”, such as TextView::prefix_if().

A secondary distinction is what is done to the view by the methods.

  • The base methods make a new view without modifying the existing view.

  • The “split…” methods remove the corresponding part of the view and return it. The selected character is discarded and is left in neither the returned view nor the source view. If the selected character is not in the view, an empty view is returned and the source view is not modified.

  • The “take…” methods remove the corresponding part of the view and return it. The selected character is discarded and is left in neither the returned view nor the source view. If the selected character is not in the view, the entire view is returned and the source view is cleared.

  • The “clip…” methods remove the corresponding part of the view and return it. Only those characters are removed, in contrast to “split…” and “take…” which drop a (presumed) separator. If the first character doesn’t match, the view is not modified and an empty view is returned. These are very similar to the “trim…” methods described below, the difference being what part of the original view is returned.
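The difference between the “split…” and “take…” styles when the separator is missing can be sketched with two free functions over std::string_view. These are hypothetical stand-ins that follow the descriptions in the list above; they are not the libswoc implementations.

```cpp
#include <cassert>
#include <string_view>

// "split" style: if c is absent, return an empty view and leave src alone.
std::string_view split_prefix_at(std::string_view &src, char c) {
  auto idx = src.find(c);
  if (idx == std::string_view::npos) {
    return {};
  }
  std::string_view token = src.substr(0, idx);
  src.remove_prefix(idx + 1); // drop the token and the separator
  return token;
}

// "take" style: if c is absent, return everything and clear src.
std::string_view take_prefix_at(std::string_view &src, char c) {
  auto idx = src.find(c);
  std::string_view token = src.substr(0, idx); // npos means "to the end"
  src.remove_prefix(idx == std::string_view::npos ? src.size() : idx + 1);
  return token;
}

void example_split_vs_take() {
  std::string_view a{"host:port"};
  assert(split_prefix_at(a, ':') == "host" && a == "port");
  assert(split_prefix_at(a, ':').empty() && a == "port"); // no ':', untouched

  std::string_view b{"host:port"};
  assert(take_prefix_at(b, ':') == "host" && b == "port");
  assert(take_prefix_at(b, ':') == "port" && b.empty()); // no ':', all taken
}
```

The “take” style always consumes something from a non-empty source, which is why it suits loops that run until the source is empty.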

This is a table of the affix oriented methods, grouped by the properties of the methods. “Bounded” indicates whether the operation requires the target character, however specified, to be within the bounds of the view. A bounded method does nothing if the target character is not in the view. On this note, remove_prefix and remove_suffix are implemented differently in TextView compared to std::string_view. Rather than being undefined, the methods will clear the view if the size specified is larger than the contents of the view.

Operation        Affix   Bounded  Method
---------------  ------  -------  -------------------------------
Copy             Prefix  No       TextView::prefix()
                         Yes      TextView::prefix_at()
                                  TextView::prefix_if()
                 Suffix  No       TextView::suffix()
                         Yes      TextView::suffix_at()
                                  TextView::suffix_if()
Modify           Prefix  No       std::string_view::remove_prefix
                         Yes      TextView::remove_prefix_at()
                                  TextView::remove_prefix_if()
                 Suffix  No       std::string_view::remove_suffix
                         Yes      TextView::remove_suffix_at()
                                  TextView::remove_suffix_if()
Modify and Copy  Prefix  Yes      TextView::split_prefix()
                                  TextView::split_prefix_at()
                                  TextView::split_prefix_if()
                                  TextView::clip_prefix_of()
                         No       TextView::take_prefix()
                                  TextView::take_prefix_at()
                                  TextView::take_prefix_if()
                 Suffix  Yes      TextView::split_suffix()
                                  TextView::split_suffix_at()
                                  TextView::split_suffix_if()
                                  TextView::clip_suffix_of()
                         No       TextView::take_suffix()
                                  TextView::take_suffix_at()
                                  TextView::take_suffix_if()
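The clamped remove_prefix behavior mentioned before the table can be pictured as follows. This is a sketch over std::string_view with a hypothetical helper name; it only illustrates the described result and is not how TextView is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: clamp the count so an oversized request clears the
// view. Calling std::string_view::remove_prefix with n > size() is undefined.
void remove_prefix_clamped(std::string_view &v, std::size_t n) {
  v.remove_prefix(std::min(n, v.size()));
}

void example_remove_prefix_clamped() {
  std::string_view v{"abc"};
  remove_prefix_clamped(v, 1);
  assert(v == "bc");
  remove_prefix_clamped(v, 100); // larger than the view: cleared, not undefined
  assert(v.empty());
}
```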

Other

The comparison operators for TextView are inherited from std::string_view and therefore use the content of the view to determine the relationship.

TextView provides a collection of “trim” methods which remove leading or trailing characters. These have similar suffixes with the same meaning as the affix methods. This can be done for a single character, one of a set of characters, or a predicate. The most common use is with the predicate isspace which removes leading and/or trailing whitespace as needed.
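The predicate form of trimming can be sketched over std::string_view as below. The free functions ltrim_if and rtrim_if are hypothetical stand-ins that borrow the method names; the real methods are members of TextView and return a reference to the view.

```cpp
#include <cassert>
#include <cctype>
#include <string_view>

bool is_space(char c) {
  return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Remove leading characters for which pred is true.
template <typename Pred> void ltrim_if(std::string_view &v, Pred pred) {
  while (!v.empty() && pred(v.front())) {
    v.remove_prefix(1);
  }
}

// Remove trailing characters for which pred is true.
template <typename Pred> void rtrim_if(std::string_view &v, Pred pred) {
  while (!v.empty() && pred(v.back())) {
    v.remove_suffix(1);
  }
}

void example_trim() {
  std::string_view v{"  alpha  "};
  ltrim_if(v, is_space);
  assert(v == "alpha  ");
  rtrim_if(v, is_space);
  assert(v == "alpha");

  std::string_view blank{"   "};
  ltrim_if(blank, is_space); // a view of only whitespace trims to empty
  assert(blank.empty());
}
```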

While the plethora of view methods can seem a bit much, all of these are useful in different situations and exist because of such use cases.

Numeric conversions are provided, in signed (svtoi()), unsigned (svtou()), and floating point (svtod()) flavors. The integer functions are designed to be “complete” in the sense that any other string to integer conversion can be mapped to one of these functions. The floating point conversion is sufficiently accurate - it will return a floating point value that is within one epsilon of the exact value, but not always the closest. This is fine for general use such as in configurations, but possibly not quite enough for high precision work.

The standard functions strcmp, strcasecmp, and memcmp are overloaded when at least one of the parameters is a TextView. The length is taken from the view, rather than being an explicit parameter as with strncasecmp.

When no other useful result can be returned, TextView methods return a reference to the instance. This makes chaining methods easy. For example, if a list consists of colon separated elements, each of the form “A.B.old”, and just the “A.B” part is needed, sans leading white space, this can be written as

TEST_CASE("TextView misc", "[libswoc][example][textview][misc]") {
  TextView src = "  alpha.bravo.old:charlie.delta.old  :  echo.foxtrot.old  ";
  REQUIRE("alpha.bravo" == src.take_prefix_at(':').remove_suffix_at('.').ltrim_if(&isspace));
  REQUIRE("charlie.delta" == src.take_prefix_at(':').remove_suffix_at('.').ltrim_if(&isspace));
}

Parsing with TextView

Time for some examples demonstrating string parsing using TextView. There are two major reasons for developing TextView parsing.

The first was to minimize the need to allocate memory to hold intermediate results. For this reason, the normal style of use is a streaming / incremental one, where tokens are extracted from a source one by one and placed in TextView instances, with the original source TextView being reduced by each extraction until it is empty.

The second was to minimize cut and paste coding. Typical C or C++ parsing logic consists mostly of very generic code to handle pointer and size updates. The point of TextView is to automate all of that, yielding code focused entirely on the parsing logic rather than boilerplate string or view manipulation. Such hand written code is commonly not exactly correct, leading to hard to track bugs. Use of TextView eliminates those problems.

The minimization of exceptions on sizes beyond the view boundaries was done primarily to help parsing. It noticeably simplifies the logic if excessive removal or advancement yields an empty view rather than an exception.

CSV Example

For example, assume value contains a null terminated string which is expected to be tokens separated by commas. To handle this generically a function could be written which takes a token handler and calls it for each token.

void
parse_csv(TextView src, std::function<void(TextView)> const &f) {
  while (src.ltrim_if(&isspace)) {
    TextView token{src.take_prefix_at(',').rtrim_if(&isspace)};
    if (token) { // skip empty tokens (double separators)
      f(token);
    }
  }
}

If value was "bob  ,dave, sam" then token would be successively bob, dave, sam. Each loop iteration is guaranteed to remove text from src, making the loop eventually terminate when all text has been removed, because an empty TextView is false. This is a recommended style because TextView instances are very cheap to copy. This is essentially the same as having a current pointer and an end pointer and checking for current >= end, except TextView does all the work, leading to simpler and less buggy code.

White space is dropped because of the calls to ltrim_if and rtrim_if. By calling ltrim_if in the loop condition, the loop exits if the remaining text is only whitespace and no token is processed. Alternatively trim_if could be used after extraction. The version shown performs slightly better: although trim_if calls both ltrim_if and rtrim_if, so the trimming work is the same, the version shown avoids a final token extraction on trailing whitespace. In practice it won’t make a difference, so do what’s convenient.

It could be tempting to squeeze the code a bit to be

void
parse_csv_non_empty(TextView src, std::function<void(TextView)> const &f) {
  TextView token;
  while ((token = src.take_prefix_at(',').trim_if(&isspace))) {
    f(token);
  }
}

However this causes a significant behavior difference: the loop terminates on an empty token because that token will be false. That is, this will work only if there is a guarantee of no empty tokens (e.g. no adjacent separators).
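The behavior difference between the two loops can be reproduced with std::string_view alone. In this sketch trim and take_prefix_at are hypothetical free function stand-ins for the TextView methods of similar names, and the tokens are collected into a vector instead of being passed to a callback.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>
#include <vector>

using Tokens = std::vector<std::string>;

// Stand-in for trim_if(&isspace): drop whitespace from both ends.
std::string_view trim(std::string_view v) {
  auto sp = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
  while (!v.empty() && sp(v.front())) v.remove_prefix(1);
  while (!v.empty() && sp(v.back())) v.remove_suffix(1);
  return v;
}

// Stand-in for take_prefix_at: return text before c, drop it and c from src.
// If c is absent, return all of src and clear it.
std::string_view take_prefix_at(std::string_view &src, char c) {
  auto idx = src.find(c);
  std::string_view token = src.substr(0, idx);
  src.remove_prefix(idx == std::string_view::npos ? src.size() : idx + 1);
  return token;
}

// First style: loop on the source, skip empty tokens.
Tokens csv_skip_empty(std::string_view src) {
  Tokens out;
  while (!src.empty()) {
    std::string_view token = trim(take_prefix_at(src, ','));
    if (!token.empty()) out.emplace_back(token);
  }
  return out;
}

// Squeezed style: loop on the token, so an empty token ends the loop.
Tokens csv_stop_on_empty(std::string_view src) {
  Tokens out;
  std::string_view token;
  while (!(token = trim(take_prefix_at(src, ','))).empty()) {
    out.emplace_back(token);
  }
  return out;
}

void example_csv_variants() {
  assert((csv_skip_empty("bob  ,dave, sam") == Tokens{"bob", "dave", "sam"}));
  assert((csv_stop_on_empty("bob  ,dave, sam") == Tokens{"bob", "dave", "sam"}));
  // Adjacent separators: the squeezed loop stops early and loses "b".
  assert((csv_skip_empty("a,,b") == Tokens{"a", "b"}));
  assert((csv_stop_on_empty("a,,b") == Tokens{"a"}));
}
```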

Key / Value Example

A similar case is parsing a list of key / value pairs in a comma separated list. Each pair is “key=value” where white space is ignored. In this case it is also permitted to have just a keyword for values that are boolean.

void
parse_kw(TextView src, std::function<void(TextView, TextView)> const &f) {
  while (src) {
    TextView value{src.take_prefix_at(',').trim_if(&isspace)};
    if (value) {
      TextView key{value.take_prefix_at('=')};
      // Trim any space that might have been around the '='.
      f(key.rtrim_if(&isspace), value.ltrim_if(&isspace));
    }
  }
}

The basic list processing is the same as the previous example, extracting each comma separated element. The resulting element is treated as a “list” with = as the separator. Note if there is no = character then all of the list element is moved to key leaving value empty, which is the desired result. A bit of extra white space trimming is done in case there was space next to the =.
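The same logic can be exercised without libswoc by substituting std::string_view. As before, trim and take_prefix_at are hypothetical stand-ins for the TextView methods, and collecting the pairs into a std::map is a choice made only to keep the sketch easy to check.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <string_view>

std::string_view trim(std::string_view v) {
  auto sp = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
  while (!v.empty() && sp(v.front())) v.remove_prefix(1);
  while (!v.empty() && sp(v.back())) v.remove_suffix(1);
  return v;
}

// Return text before c, dropping it and c from src; take all if c is absent.
std::string_view take_prefix_at(std::string_view &src, char c) {
  auto idx = src.find(c);
  std::string_view token = src.substr(0, idx);
  src.remove_prefix(idx == std::string_view::npos ? src.size() : idx + 1);
  return token;
}

std::map<std::string, std::string> parse_kw(std::string_view src) {
  std::map<std::string, std::string> out;
  while (!src.empty()) {
    std::string_view value = trim(take_prefix_at(src, ','));
    if (!value.empty()) {
      // With no '=', the whole element becomes the key and value is empty.
      std::string_view key = trim(take_prefix_at(value, '='));
      out.emplace(std::string{key}, std::string{trim(value)});
    }
  }
  return out;
}

void example_parse_kw() {
  auto kv = parse_kw("name = bob, verbose ,size=10");
  assert(kv.size() == 3);
  assert(kv["name"] == "bob");
  assert(kv["verbose"].empty()); // bare keyword, treated as a boolean flag
  assert(kv["size"] == "10");
}
```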

Line Processing

TextView works well when parsing lines from a file. For this example, load() will be used. This method, given a path, loads the entire content of the file into a std::string. This will serve as the owner of the string memory. If it is kept around with the configuration, all of the parsed strings can be instances of TextView that reference memory in that std::string. If the density of useful text is sufficiently high, this is a convenient way to handle parsing with minimal memory allocations.

This example counts the number of code lines in the documentation’s conf.py file.

  swoc::file::path path{"doc/conf.py"};
  std::error_code ec;

  auto content   = swoc::file::load(path, ec);
  size_t n_lines = 0;

  TextView src{content};
  while (!src.empty()) {
    auto line = src.take_prefix_at('\n').trim_if(&isspace);
    if (line.empty() || '#' == *line)
      continue;
    ++n_lines;
  }
  // To verify this
  // cat doc/conf.py | grep -v '^ *#' | grep -v '^$' | wc

The TextView src is constructed from the std::string content which contains the file contents. While that view is not empty, a line is taken each loop and leading and trailing whitespace is trimmed. If this results in an empty view or one where the first character is the Python comment character # it is not counted. The newlines are discarded by the prefix extraction. The use of TextView::take_prefix_at() forces the extraction of text even if there is no final newline. If this were a file of key value pairs, then line would be subjected to one of the other examples to extract the values. For all of this, there is only one memory allocation, that needed for content to load the file contents.
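To experiment with the counting loop without a file or libswoc, the same logic can be run on an in-memory string through std::string_view. The helpers trim and take_prefix_at and the function name count_code_lines are hypothetical stand-ins introduced for this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

std::string_view trim(std::string_view v) {
  auto sp = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
  while (!v.empty() && sp(v.front())) v.remove_prefix(1);
  while (!v.empty() && sp(v.back())) v.remove_suffix(1);
  return v;
}

// Return text before c, dropping it and c from src; take all if c is absent.
std::string_view take_prefix_at(std::string_view &src, char c) {
  auto idx = src.find(c);
  std::string_view token = src.substr(0, idx);
  src.remove_prefix(idx == std::string_view::npos ? src.size() : idx + 1);
  return token;
}

// Count lines that are neither blank nor '#' comments.
std::size_t count_code_lines(std::string_view src) {
  std::size_t n_lines = 0;
  while (!src.empty()) {
    std::string_view line = trim(take_prefix_at(src, '\n'));
    if (line.empty() || line.front() == '#') {
      continue;
    }
    ++n_lines;
  }
  return n_lines;
}

void example_count_code_lines() {
  // The last line has no trailing newline and is still counted.
  std::string_view content{"# comment\n\nproject = 'x'\n   # indented\nversion = '1'"};
  assert(count_code_lines(content) == 2);
}
```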

Entity Tag Lists Example

An example from actual production code parses a quoted, comma separated list of values (“CSV”). This is used for parsing entity tags as used for HTTP fields such as “If-Match” (14.24). The input is a CSV where each value is quoted. To make it interesting these quoted strings may contain commas, which do not count as separators. Therefore the simple approach in previous examples will not work in all cases. This example also does not use the callback style of the previous examples. Instead the tokens are pulled off in a streaming style with the source TextView being passed by reference in order to be updated by the tokenizer. Further, some callers want the quotes, and some do not, so a flag to strip quotes from the resulting elements is needed. The final result looks like

    TextView::size_type idx = 0;
    // Characters of interest in a null terminated string.
    char sep_list[3] = {'"', sep, 0};
    bool in_quote_p  = false;
    while (idx < src.size()) {
      // Next character of interest.
      idx = src.find_first_of(sep_list, idx);
      if (TextView::npos == idx) {
        // no more, consume all of @a src.
        break;
      } else if ('"' == src[idx]) {
        // quote, skip it and flip the quote state.
        in_quote_p = !in_quote_p;
        ++idx;
      } else if (sep == src[idx]) { // separator.
        if (in_quote_p) {
          // quoted separator, skip and continue.
          ++idx;
        } else {
          // found token, finish up.
          break;
        }
      }
    }

This takes a TextView& which is the source view which will be updated as tokens are removed (therefore the caller must do the empty view check). The other arguments are the separator character and the “strip quotes” flag. The algorithm is to find the next “interesting” character, which is either a separator or a quote. Quotes flip the “in quote” flag back and forth, and separators terminate the loop if the “in quote” flag is not set. This skips quoted separators. If neither is found then all of the view is returned as the result. Whitespace is always trimmed and then quotes are trimmed if requested, before the view is returned. In this case keeping an offset of the amount of the source view processed is the most convenient mechanism for tracking progress. The result is a fairly compact piece of code that does non-trivial parsing and conversion on a source string, without a lot of complex parsing state, and no memory allocation.
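Since the listing shows only the scanning loop, here is a complete sketch of such a tokenizer over std::string_view. The function name extract_token, its signature, and the clipping and trimming steps after the loop are assumptions based on the description above, not the production code.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical tokenizer: remove and return the next token from src.
std::string_view extract_token(std::string_view &src, char sep, bool strip_quotes) {
  std::size_t idx = 0;
  char const sep_list[3] = {'"', sep, 0}; // characters of interest
  bool in_quote = false;
  while (idx < src.size()) {
    idx = src.find_first_of(sep_list, idx);
    if (idx == std::string_view::npos) {
      break; // nothing of interest left, consume all of src
    }
    if (src[idx] == '"') {
      in_quote = !in_quote; // flip the quote state and skip the quote
      ++idx;
    } else if (in_quote) {
      ++idx; // quoted separator, skip it
    } else {
      break; // unquoted separator ends the token
    }
  }
  // Clip the token and drop it, plus the separator if any, from src.
  std::string_view token = src.substr(0, idx);
  src.remove_prefix(idx >= src.size() ? src.size() : idx + 1);

  auto sp = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
  while (!token.empty() && sp(token.front())) token.remove_prefix(1);
  while (!token.empty() && sp(token.back())) token.remove_suffix(1);
  if (strip_quotes) {
    while (!token.empty() && token.front() == '"') token.remove_prefix(1);
    while (!token.empty() && token.back() == '"') token.remove_suffix(1);
  }
  return token;
}

void example_extract_token() {
  std::string_view src{"\"a,b\", \"c\" ,d"};
  assert(extract_token(src, ',', true) == "a,b"); // quoted comma is kept
  assert(extract_token(src, ',', false) == "\"c\"");
  assert(extract_token(src, ',', true) == "d");
  assert(src.empty());
}
```

As in the original, the caller is responsible for checking that the source view is not empty before asking for another token.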

History

The first attempt at this functionality was in the TSConfig library in the ts::Buffer and ts::ConstBuffer classes. Originally intended just as raw memory views, ts::ConstBuffer in particular was repeatedly enhanced to provide better support for strings. The header was eventually moved from lib/tsconfig to lib/ts and was used in various parts of the Traffic Server core.

There was then a proposal to make these classes available to plugin writers as they proved handy in the core. A suggested alternative was Boost.StringRef which provides a similar functionality using std::string as the base of the pre-allocated memory. A version of the header was ported to Traffic Server (by stripping all the Boost support and cross includes) but in use proved to provide little of the functionality available in ts::ConstBuffer. If extensive reworking was required in any case, it seemed better to start from scratch and build just what was useful in the Traffic Server context.

The next step was the TextView class which turned out reasonably well. About this time std::string_view was officially adopted for C++17, which was a bit of a problem because TextView was extremely similar in functionality but quite different in interface. Further, it had a number of quite useful methods that were not in std::string_view. To simplify the use of TextView (which was actually called “StringView” then) it was made a subclass of std::string_view with user defined conversions so that the two classes could be used almost interchangeably in an efficient way. Passing a TextView to a std::string_view const& is zero marginal cost because of inheritance and passing by value is also no more expensive than just std::string_view.

Footnotes