Ambiguity and Insecurity with User-Defined Literals

ISO/IEC JTC1 SC22 WG21 N2747 = 08-0257 - 2008-08-24

Lawrence Crowl, Lawrence@Crowl.org, crowl@google.com

Introduction

The proposal for user-defined literals, N2378 User-defined Literals (aka. Extensible Literals (revision 3)), has significant ambiguity and insecurity. These problems should be solved before the feature is added to the language to prevent significant harm.

The use cases for user-defined literals are:

The current proposal for user-defined literals may inhibit the use of user-defined literals in exactly those cases.

Lookup Ambiguity

User-defined literals can be defined via literal operators. These literal operators may be defined in namespace scope (N2378 section 5 "Proposed Wording", modifying clause 13.5.8 over.literal, paragraph 1).

The definition of these operators in a nested namespace with a using directive in the global namespace (N2378 section 3.5 "An idiom") effectively provides for conflict-free definition of literal operators.

Unfortunately, unambiguous definition of a literal operator does not imply unambiguous invocation.

This ambiguity will be compounded because suffixes will tend to be very short, and collisions are likely. For example, competition for SI suffixes will likely be fierce and immediate, particularly as the C++ committee is not yet ready to standardize SI types or their literals.

Consider two libraries that export literal operators in the manner suggested by the idiom.


// library ping.h
namespace ping {
     struct X {};
     namespace literals {
         X operator "foo"( unsigned long long );
         X operator "bar"( unsigned long long );
     }
}
using namespace ping::literals;

// library pong.h
namespace pong {
     struct Y {};
     namespace literals {
         Y operator "foo"( unsigned long long );
         Y operator "bar"( unsigned long long );
     }
}
using namespace pong::literals;

// application.cc

auto u = 1foo;
auto v = 2bar;

namespace applic {
    auto u = 1foo;
    auto v = 2bar;
}

int main() {
    auto a = 1foo;
    auto b = 2bar;
}

In this example, any simple use of either suffix will result in an ambiguity.

Existing Solution: Using Declarations

Daveed Vandevoorde notes that some of this ambiguity can be resolved with a using declaration. For example,


int main() {
    using ping::literals::operator "foo";
    using pong::literals::operator "bar";
    auto a = 1foo;
    auto b = 2bar;
}

Unfortunately, this solution does not work at the global namespace because the using declaration is redundant with the existing using directive. The solution also does not work when using the same suffix from two different libraries.
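The block-scope case can be sketched in compilable form. This is an illustration under assumptions, not the proposal's own text: it uses the literal-operator spelling later standardized in C++11 (operator""_foo, with a leading underscore) rather than the operator "foo" form of N2378, and the member value and the function using_declarations_select are hypothetical names added for the example.

```cpp
#include <type_traits>

// library ping.h
namespace ping {
    struct X { unsigned long long value; };
    namespace literals {
        constexpr X operator""_foo(unsigned long long n) { return X{n}; }
        constexpr X operator""_bar(unsigned long long n) { return X{n}; }
    }
}
using namespace ping::literals;  // supplied by the header, as in the idiom

// library pong.h
namespace pong {
    struct Y { unsigned long long value; };
    namespace literals {
        constexpr Y operator""_foo(unsigned long long n) { return Y{n}; }
        constexpr Y operator""_bar(unsigned long long n) { return Y{n}; }
    }
}
using namespace pong::literals;  // supplied by the header, as in the idiom

// application.cc
// auto u = 1_foo;  // ambiguous at namespace scope: both operators are found

// Hypothetical helper: returns true when each literal resolved to the
// library named in its block-scope using declaration.
inline bool using_declarations_select() {
    using ping::literals::operator""_foo;
    using pong::literals::operator""_bar;
    auto a = 1_foo;  // lookup stops at the block-scope declaration: ping::X
    auto b = 2_bar;  // likewise: pong::Y
    return std::is_same<decltype(a), ping::X>::value
        && std::is_same<decltype(b), pong::Y>::value
        && a.value == 1 && b.value == 2;
}
```

The block-scope declarations hide the namespace-scope candidates, which is why the same declarations give no relief when written at global scope.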

Proposed Solution: Modified Idiom

Much of the ambiguity in the existing idiom arises from headers providing the using directive for the namespace of literals. Modifying the idiom so that clients, not libraries, provide the using directive would substantially reduce the number of likely collisions.

In the following example, all literals will be selected from the library ping.


// library ping.h
namespace ping {
     struct X {};
     namespace literals {
         X operator "foo"( unsigned long long );
         X operator "bar"( unsigned long long );
     }
}

// library pong.h
namespace pong {
     struct Y {};
     namespace literals {
         Y operator "foo"( unsigned long long );
         Y operator "bar"( unsigned long long );
     }
}

// application.cc

using namespace ping::literals;
auto u = 1foo;
auto v = 2bar;

namespace applic {
    auto u = 1foo;
    auto v = 2bar;
}

int main() {
    auto a = 1foo;
    auto b = 2bar;
}
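A compilable sketch of the modified idiom follows, again assuming the C++11 spelling operator""_foo rather than the N2378 form. The namespaces client_a and client_b and the member value are hypothetical; they illustrate that, once headers stop exporting the using directive, each client scope can select its own library.

```cpp
#include <type_traits>

// library ping.h (no using directive in the header)
namespace ping {
    struct X { unsigned long long value; };
    namespace literals {
        constexpr X operator""_foo(unsigned long long n) { return X{n}; }
    }
}

// library pong.h (no using directive in the header)
namespace pong {
    struct Y { unsigned long long value; };
    namespace literals {
        constexpr Y operator""_foo(unsigned long long n) { return Y{n}; }
    }
}

// application.cc: each client supplies its own using directive
namespace client_a {
    using namespace ping::literals;
    constexpr auto u = 1_foo;  // selects ping's operator
}
namespace client_b {
    using namespace pong::literals;
    constexpr auto u = 1_foo;  // selects pong's operator
}

static_assert(std::is_same<decltype(client_a::u), const ping::X>::value,
              "client_a sees only ping's literals");
static_assert(std::is_same<decltype(client_b::u), const pong::Y>::value,
              "client_b sees only pong's literals");
```

A scope that needs the same suffix from both libraries remains ambiguous under this idiom as well.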

Existing Solution: Call Syntax

Again, Daveed Vandevoorde notes that one can avoid these ambiguities by using function call syntax, either the operator form or perhaps some constructor. For example,


int main() {
    auto a = ping::literals::operator"foo"(1);
    auto b = pong::Y(2);
}

The first alternative suffers from verbosity. The second alternative suffers from two problems. First, the intent of a literal is lost. Second, there is potential for undesirable function overloading.
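Both call forms can be sketched as follows, assuming the C++11 spelling operator""_foo in place of the N2378 form. The constructor of pong::Y, the member value, and the namespace app are hypothetical additions made so that the example compiles.

```cpp
namespace ping {
    struct X { unsigned long long value; };
    namespace literals {
        constexpr X operator""_foo(unsigned long long n) { return X{n}; }
    }
}

namespace pong {
    struct Y {
        unsigned long long value;
        // Hypothetical constructor standing in for "perhaps some constructor".
        constexpr explicit Y(unsigned long long n) : value(n) {}
    };
}

namespace app {
    // Operator form: unambiguous because fully qualified, but verbose.
    constexpr auto a = ping::literals::operator""_foo(1);
    // Constructor form: nothing marks the expression as a literal.
    constexpr auto b = pong::Y(2);
}
```

In both forms the int arguments 1 and 2 reach the callee through an ordinary implicit conversion to unsigned long long, so the calls take part in normal overload resolution rather than literal-operator selection.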

Proposed Solution: Qualified Literals

The general solution is to provide qualified literals, which enable fine-grained selection of the appropriate literal operator. The evolution subcommittee discussed this option, but ultimately did not choose it. Perhaps that choice should be revisited.

Parse Ambiguity

Any user-defined suffix starting with [A-Fa-f] has a potential conflict with hexadecimal notation. While the proposal defines away ambiguity (N2378 section 5 "Proposed Wording", modifying clause 13.5.8 over.literal, paragraph 0), that definition effectively means that suffixes intended for use with arbitrary integers must avoid more than 22% of the available suffix namespace. Some of these effectively prohibited suffixes would otherwise be the natural choice. The same applies to floating-point values, e.g. "e" as electron charges, though to a lesser degree.

As Daveed Vandevoorde points out, there is also a more subtle ambiguity. Some letters are visually similar to digits, which could lead to misinterpretation by readers on casual reading. Daveed's example is:

I might introduce "units" for memory sizes: B for bytes, KB for kilobytes, MB for megabytes etc. Unfortunately,

size_t memsize = 11B;

and

size_t memsize = 118;

look a lot like each other, ... (It gets worse even if you have suffixes that start with l (letter ell) or O (letter oh).)

Existing Solution: Leading Underscore

The intended solution to these parse ambiguities is to define a suffix with leading underscore, which separates the meaningful part of the suffix from the remainder of the literal. Continuing Daveed's example,

so I prefer to separate the suffix with an underscore:

size_t memsize = 11_B;

Unfortunately, the ambiguity reappears with digit separators. The intent of the current proposal is to be compatible with any future adoption of N2281 Digit Separators (N2378 section 4 "Use cases", paragraph 1). However, it fails to achieve that goal because the syntax of separated digits matches the syntax of user-defined literals starting with an underscore. For example, 0xAB_B is ambiguously 0xABB or operator"_B"(0xAB). (Likewise, 11_B is visually similar to the number 11_8, though that particular construct is less likely.)
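The two readings of 0xAB_B differ in value, as the following sketch shows. It assumes the C++11 spelling operator""_B instead of the N2378 form, and a language with no underscore digit separator, so that the compiler takes the suffix reading. The namespaces memory and demo are hypothetical.

```cpp
#include <cstddef>

namespace memory {
    namespace literals {
        // Hypothetical byte-count suffix, following Daveed's example.
        constexpr std::size_t operator""_B(unsigned long long n) {
            return static_cast<std::size_t>(n);
        }
    }
}

namespace demo {
    using namespace memory::literals;

    constexpr std::size_t memsize = 11_B;    // suffix applied to 11
    constexpr std::size_t hexsize = 0xAB_B;  // suffix applied to 0xAB, i.e. 171
    constexpr std::size_t joined  = 0xABB;   // the digit-separator reading, i.e. 2747
}
```

With an underscore digit separator also in the language, nothing in the token 0xAB_B itself would indicate which of the last two values is meant.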

Possible Solution: Prefer Digit Separator Interpretation

As Daveed Vandevoorde points out, one possible resolution to this problem is to simply require the compiler to disambiguate digit separators and literal suffixes.

Unfortunately, this approach will cause arbitrary interpretation of inherently ambiguous tokens.

The problem will be exacerbated because, in the absence of digit separators, programmers will be well motivated to add user-defined literals for the sole purpose of achieving digit separation. For example, programmers are likely to define

int operator"_000"( unsigned long long n ) { return 1000*n; }

to clarify the magnitude of literals, as in 123_000.

(At the same time, because suffixes must be enumerated, literal operators are an insufficient mechanism for digit separation; they can be used with only a sparse set of values, such as round thousands and millions.)
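The digit-separation trick and its sparseness can be sketched as follows, assuming the C++11 spelling operator""_000 rather than the N2378 form. The namespaces thousands and demo are hypothetical, and the sketch returns unsigned long long rather than int to avoid a narrowing conversion.

```cpp
namespace thousands {
    namespace literals {
        // Scales the value by 1000 so that 123_000 reads like 123000.
        constexpr unsigned long long operator""_000(unsigned long long n) {
            return 1000 * n;
        }
    }
}

namespace demo {
    using namespace thousands::literals;

    constexpr unsigned long long round_value = 123_000;  // operator""_000(123)
    // constexpr unsigned long long other = 123_456;     // error: no operator""_456
}
```

Every group of trailing digits would need its own operator, which is why only round values are practical.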

Any change to the standard to recognize digit separators will invalidate code.

Finally, users who do define literal operators for the purpose of digit separation effectively exclude the use of all other literal operators, because the proposal permits invocation of only one literal operator per literal.

Proposed Solution: Preemptive Adoption of Digit Separators

We can prevent misuse of user-defined literals for digit separation by preemptively adopting digit separators.

Proposed Solution: Double Leading Underscore

Both with and without digit separators, some ambiguity remains. Rather than provide conventions for disambiguation, it is preferable to prevent ambiguity in the first place. We can achieve this by using a double underscore to separate the value from the suffix.

For example, 0xAB__B would unambiguously carry a suffix because N2281 admits only a single underscore between digits.

The standard can enforce this double underscore separation either by introducing it as syntax, or by simply requiring the operator identifier to have two underscores rather than one.

Retained Solution: No Leading Underscore

For compatibility with C, literals must retain the capability for suffixes with no leading underscores. There is no proposal to remove that capability.

Proposed Solution: Qualified Literals

With a double underscore separating value from suffix, we are very close to qualified names. One could instead separate value from suffix with the scope operator. For example,

size_t memsize = 11::B;

Admittedly, this suffix form of qualification would be somewhat unnatural.

Daveed Vandevoorde points out that there is a potential problem with this approach in the code

extern "C++"::X f();

However, in this one special case in the syntax, user-defined literals are inappropriate as well. Normal literals cannot have a following scope operator.

Evolution Insecurity

Because of the high degree of ambiguity, use of user-defined literals will expose programs to long-term instability. Because users of literals must be using the definition, they must extend the set of using declarations and definitions, and are vulnerable to any change in the using environment.

A concrete example of the insecurity is the subsequent addition to the standard of a new suffix. Once C++0x is published and users start to define literal operators, any addition to the standard of a new suffix will potentially invalidate those literal operators. This problem is particularly vexing because the C language has no such problem and is thus free to add suffixes at will. Thus, C++ will have the unfortunate choice of either being incompatible with C at the lexical level, or breaking user code. Since suffixes will tend to be short, that breakage is likely.

Further, the C++ standard is likely to define new literal operators over time. Again, these are likely to be highly ambiguous with user-written literal operators. Again, C++ will have the unfortunate choice of either not introducing new literal operators or breaking user code. Since suffixes will tend to be short, that breakage is likely.

Proposed Solution: Reserve "Jammed" Literals

One solution to this problem is to simply reserve suffixes with no leading underscores (or with no scope operator) to the standard. This approach has several advantages.