#embed - a scannable, tooling-friendly binary resource inclusion mechanism

Published Proposal,

Paper Source:
GitHub ThePhD/future_cxx
GitHub ThePhD/embed
ISO/IEC JTC1/SC22/WG21 14882: Programming Language — C++


Pulling binary data into a program often involves external tools and build system coordination. Many programs need binary data such as images, encoded text, icons and other data in a specific format. Current state of the art for working with such static data in C includes creating files which contain solely string literals, directly invoking the linker to create data blobs to access through carefully named extern variables, or generating large brace-delimited lists of integers to place into arrays. As binary data has grown larger, these approaches have begun to show drawbacks and scaling issues. From parsing 5 megabytes worth of integer literal expressions into AST nodes to arbitrary string literal length limits in compilers, portably putting binary data in a C program has become an arduous task that taxes build infrastructure and compilation memory and time. This proposal provides a flexible preprocessor directive for making this data available to the user in a straightforward manner.

1. Changelog

1.1. Revision 10 - January 15th, 2023

1.2. Revision 9 - October 15th, 2022

1.3. Revision 8 - July 15th, 2022

1.4. Revision 7 - June 23rd, 2022

1.5. Revision 6 - May 12th, 2022

1.6. Revision 5 - April 12th, 2022

1.7. Revision 4 - June 15th, 2021

1.8. Revision 3 - April 15th, 2021

1.9. Revision 2 - October 25th, 2020

1.10. Revision 1 - April 10th, 2020

1.11. Revision 0 - January 5th, 2020

2. Polls & Votes

The votes for the C++ Committee are as follows:

2.1. November 2022 Kona-Hybrid C++ meeting

Forward P1967R9, with both "optional" sections included to CWG for inclusion in C++26. This is as WG14 accepted.

SF  F  N  A  SA
7   5  2  2  2

Forward P1967R9, including section §7.3.6 (__has_embed with return value 2), but not §7.3.7 (prefix/suffix/if_empty) to CWG for inclusion in C++26. This diverges from what was accepted by WG14.

SF  F  N  A  SA
6   6  4  3  2

The first poll had stronger consensus, so it was taken as the option to CWG.

2.2. June 2022 Virtual C++ meeting

"EWG encourages P1967 to define the form of vendor extensions as parameters to #embed?"

SF  F  N  A  SA
4   4  3  1  0

This was the result of consensus. The extensive discussion also made it clear that unrecognized embed parameters, because they change how an initializer may be formed, must be considered ill-formed. Users may get around this by using __has_embed. To dispel the notion that they may be optional, frontmatter wording was added to § 7.3.3 (Add to the control-line production in §15.1 Preamble [cpp.pre] a new grammar production), along with a supporting embed-parameter-seq production, to make the expectations clear.
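As a rough illustration of the __has_embed escape hatch, the following C sketch probes for a vendor parameter before relying on it. The parameter name sketch_vendor::frobnicate, the PROBE_RESULT macro, and the function name are hypothetical and exist only for this example. The nested #if keeps compilers without __has_embed from ever evaluating the probe.

```c
#include <assert.h>

/*
 * Probe result (hypothetical encoding used only in this sketch):
 *   -1  the implementation does not provide __has_embed at all
 *    0  __has_embed exists, but the resource or the parameter is unsupported
 *    1  the resource was found and every parameter was recognized
 */
#if defined(__has_embed)
#  if __has_embed(__FILE__ sketch_vendor::frobnicate(1))
#    define PROBE_RESULT 1
#  else
#    define PROBE_RESULT 0
#  endif
#else
#  define PROBE_RESULT (-1)
#endif

/* Exposes the preprocessor-time answer so ordinary code can branch on it. */
int vendor_parameter_probe(void)
{
    int result = PROBE_RESULT;
    assert(result >= -1 && result <= 1);
    return result;
}
```

Since no implementation is expected to know the made-up parameter, the probe should report 0 or -1, and code can pick a portable fallback instead of hitting an ill-formed #embed directive.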

Part of the discussion during this meeting was also whether or not the case for emptiness was useful. We moved the empty-based parameters to OPTIONAL pieces of wording, and expect to forward each of these on independent votes aside from the base proposal. This captures the sentiment of folks who may not have spoken up a lot during the meeting but nevertheless felt uneasy: we can simply go with whatever the poll says next meeting.

We took the feedback to rename is_empty to if_empty, since it is a better name for a "do-something-if-predicate-is-true" style attribute.

2.3. July 2021 Virtual C++ meeting

No votes were taken at this meeting, since it was mostly directional and about the changing of the syntax to better fit tools and scanners. In particular, it was more or less unanimously encouraged to:

All of these recommendations were incorporated below.

2.4. September 2020 Virtual C++ EWG Meeting

"We want #embed [optional limit] header-name (no type name, no other specification) as a feature."

SF  F   N  A  SA
2   16  3  0  1

This vote gained the most consensus in the Committee. While there were some individuals who wanted to be able to specify a type, there was stronger interest in not specifying a type at all and always producing a list of integer literals suitable to be used anywhere a comma-separated list was valid.

"We want to explore allowing an optional sequence of tokens to specify a type to #embed."

SF  F  N  A  SA
1   9  4  4  3

Further need was also expressed for constexpr of different types of variables, so we would rather focus that ability into a sister feature, std::embed. There was also a desire expressed to augment std::bit_cast<...>(...) to handle arrays of data, which would be a follow-on proposal. There was a great amount of interest in the std::bit_cast direction, which means a paper should be written to follow up on it.

2.5. April 2020 Virtual C Meeting

"We want to have a proper preprocessor #embed ... over a #pragma _STDC embed ...-based directive."

This had UNANIMOUS CONSENT to pursue a proper preprocessor directive and NOT use the #pragma syntax. It is noted that the author deems this to be the best decision!

The following poll was later superseded in the C and C++ Committees.

"We want to specify embed as using #embed [bits-per-element] header-name rather than #embed [pp-tokens-for-type] header-name." (2-way poll.)

Y   N  A
10  2  3

This poll will be a bit harder to accommodate properly. Using a constant-expression that produces a numeric constant means that the max-length specifier is now ambiguous. The syntax of the directive may need to change to accommodate further exploration.

3. Introduction

For well over 40 years, people have been trying to plant data into executables for varying reasons. Whether it is to provide a base image with which to flash hardware in a hard reset, icons that get packaged with an application, or scripts that are intrinsically tied to the program at compilation time, there has always been a strong need to couple and ship binary data with an application.

Neither C nor C++ makes this easy for users to do, resulting in many individuals reaching for utilities such as xxd, writing python scripts, or engaging in highly platform-specific linker calls to set up extern variables pointing at their data. Each of these approaches comes with benefits and drawbacks. For example, while working with the linker directly allows injection of very large amounts of data (5 MB and upwards), it does not allow accessing that data at any other point except runtime. Conversely, doing all of these things portably across systems and additionally maintaining the dependencies of all these resources and files in build systems both like and unlike make is a tedious task.

Thusly, we propose a new preprocessor directive whose sole purpose is to be #include, but for binary data: #embed.
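A minimal sketch of what this looks like in practice, under some assumptions: the directive is written inside an array initializer and expands to a comma-separated list of integer values; the resource embedded here is the source file itself (via __FILE__), so the example needs no external file; and the fallback bytes, the SKETCH_USES_EMBED macro, and the function names are made up for illustration. Compilers without #embed support take the fallback branch.

```c
#include <assert.h>
#include <stddef.h>

/* Use #embed only when the implementation advertises it and can find the
 * resource. The nested #if keeps older preprocessors from evaluating
 * __has_embed at all. */
#if defined(__has_embed)
#  if __has_embed(__FILE__)
#    define SKETCH_USES_EMBED 1
#  endif
#endif

#ifdef SKETCH_USES_EMBED
/* The directive expands to the bytes of the named resource. */
static const unsigned char resource[] = {
#embed __FILE__
};
#else
/* Assumed stand-in data for compilers that lack #embed. */
static const unsigned char resource[] = { 0x89, 'P', 'N', 'G' };
#endif

/* Number of bytes made available, known at compile time. */
size_t resource_size(void)
{
    return sizeof resource;
}

/* Bounds-checked access to a single embedded byte. */
unsigned char resource_byte(size_t index)
{
    assert(index < sizeof resource);
    return resource[index];
}
```

Because the data ends up in an ordinary array, sizeof and constant indexing work on it without any linker-level setup.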

3.1. Motivation

The reason this needs a new language feature is simple: current source-level encodings of "producing binary" to the compiler are incredibly inefficient both ergonomically and mechanically. Creating a brace-delimited list of numbers in C comes with baggage in the form of how numbers and lists are formatted. C’s preprocessor and the forcing of tokenization also impose an unavoidable cost on lexer and parser handling of values.
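To make that cost concrete, here is a small sketch of the kind of generator such workflows rely on. The helper name format_byte_list is hypothetical, and this is not how xxd itself is implemented. Each input byte becomes roughly six characters of source text and two tokens (an integer literal and a comma), all of which the compiler must then lex and parse again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Writes `count` bytes as a comma-separated list of hexadecimal integer
 * literals (for example "0x00, 0x7f, 0xff") into `out`.
 * Returns the number of characters written, excluding the terminating NUL,
 * or 0 if the list is empty or does not fit in `capacity` bytes.
 */
size_t format_byte_list(const unsigned char *data, size_t count,
                        char *out, size_t capacity)
{
    size_t used = 0;
    assert(data != NULL && out != NULL);
    if (capacity == 0) {
        return 0;
    }
    out[0] = '\0';
    for (size_t i = 0; i < count; ++i) {
        /* Every element but the last is followed by a separator. */
        const char *format = (i + 1 < count) ? "0x%02x, " : "0x%02x";
        int written = snprintf(out + used, capacity - used, format,
                               (unsigned)data[i]);
        if (written < 0 || (size_t)written >= capacity - used) {
            return 0; /* output was truncated or failed */
        }
        used += (size_t)written;
    }
    return used;
}
```

Three input bytes already produce sixteen characters of text, and that ratio holds as the data grows into megabytes.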

Therefore, using arrays with specific initialized values of any significant size becomes borderline impossible. One would think this old problem would be work-around-able in a succinct manner and that, given how old this desire is (that comp.std.c thread is not even the oldest recorded feature request), proper solutions would have arisen by now. Unfortunately, that could not be farther from the truth. Even the compilers themselves suffer build time and memory usage degradation: contributors to the LLVM compiler ran the gamut of the biggest problems that motivate this proposal in a matter of a week or two earlier this very year. Luke is not alone in his frustrations: developers all over suffer from the inability to include binary data in their programs quickly, and perform exceptional gymnastics to get around the compiler’s inability to handle these cases.

C developer progress is impeded by the inability to handle this use case, and it leaves both old and new programmers wanting.

Finally, Microsoft has an ABI problem with its maximum string literal size that cannot be solved using string literals or anything treated like string literals, as the LLVM thread and the thread from Claire Xen make clear. It has also frustrated both C and C++ programmers alike, despite their best efforts. It was so frustrating that even extended-C-and-C++-compilers, like Circle, solve this problem with custom directives.

3.2. But How Expensive Is This?

Many different options as opposed to this proposal were seriously evaluated. Implementations were attempted in at least 2 production-use compilers, and more in private. To give an idea of usage and size, here are results for various compilers on a machine with the following specification:

While time and Measure-Command work well for getting accurate timing information and can be run several times in a loop to produce a good average value, tracking memory consumption without intrusive efforts was much harder and thusly relied on OS reporting with fixed-interval probes. Memory usage is therefore approximate and may not represent the actual maximum of consumed memory. All of these are using the latest compiler built from source if available, or the latest technology preview if available. Optimizations at -O2 (GCC & Clang style), /O2 /Ob2 (MSVC style), or equivalent were employed to generate the final executable.

3.2.1. Speed

Strategy              40 kilobytes   400 kilobytes   4 megabytes   40 megabytes
#embed GCC            0.236 s        0.231 s         0.300 s       1.069 s
xxd-generated GCC     0.406 s        2.135 s         23.567 s      225.290 s
xxd-generated Clang   0.366 s        1.063 s         8.309 s       83.250 s
xxd-generated MSVC    0.552 s        3.806 s         52.397 s      Out of Memory

3.2.2. Memory Size

Strategy              40 kilobytes   400 kilobytes   4 megabytes    40 megabytes
#embed GCC            17.26 MB       17.96 MB        53.42 MB       341.72 MB
xxd-generated GCC     24.85 MB       134.34 MB       1,347.00 MB    12,622.00 MB
xxd-generated Clang   41.83 MB       103.76 MB       718.00 MB      7,116.00 MB
xxd-generated MSVC    ~48.60 MB      ~477.30 MB      ~5,280.00 MB   Out of Memory

3.2.3. Analysis

The numbers here give little reassurance that compiler developers can reduce the memory and compilation time burdens of large initializer lists. Furthermore, privately owned compilers and other static analysis tools perform almost exponentially worse here, taking vastly more memory and thrashing CPUs to 100% for several minutes (or sometimes several hours if, e.g., swap is engaged due to lack of main memory). Every compiler must always consume a certain amount of memory in a relationship directly linear to the number of tokens produced. After that, it is largely implementation-dependent what happens to the data.
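For a sense of scale, the sketch below works through the arithmetic for the GCC rows of the 40 megabyte column in the tables above. The function and variable names are illustrative only, and the measurements are copied from those tables rather than taken anew.

```c
#include <assert.h>

/* How many times larger `baseline` is than `candidate`. */
double times_larger(double baseline, double candidate)
{
    assert(candidate > 0.0);
    return baseline / candidate;
}

/* Figures copied from the 40 megabyte column, GCC rows. */
static const double embed_seconds = 1.069;
static const double xxd_seconds   = 225.290;
static const double embed_peak_mb = 341.72;
static const double xxd_peak_mb   = 12622.00;
static const double input_mb      = 40.0;

/* Compile-time ratio: about 211 times faster with #embed. */
double gcc_speedup(void)
{
    return times_larger(xxd_seconds, embed_seconds);
}

/* Peak-memory ratio: about 37 times less memory with #embed. */
double gcc_memory_saving(void)
{
    return times_larger(xxd_peak_mb, embed_peak_mb);
}

/* Peak memory per megabyte of input: about 316 MB for the xxd route. */
double xxd_mb_per_input_mb(void)
{
    return times_larger(xxd_peak_mb, input_mb);
}

/* Peak memory per megabyte of input: about 8.5 MB for #embed. */
double embed_mb_per_input_mb(void)
{
    return times_larger(embed_peak_mb, input_mb);
}
```

Put differently, the generated-initializer route on GCC costs on the order of three hundred bytes of compiler memory for every byte of data at this size.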

The GNU Compiler Collection (GCC) uses a tree representation and has many places where it spawns extra "garbage", as it's called in the various bug reports and work items from implementers. There has been a 16+ year effort on the part of GCC to reduce its memory usage and speed up initializers (C Bug Report and C++ Bug Report). Significant improvements have been made, and there is still plenty of room for GCC to improve here with respect to compile time and memory size. Somewhat unfortunately, one of the current changes in flight for GCC is the removal of all location information beyond the 256th initializer of large arrays in order to save on space. This technique is not viable for static analysis compilers that promise to recreate source code exactly as was written, and therefore discarding location or token information for large initializers is not a viable cross-implementation strategy.

LLVM’s Clang, on the other hand, is much more optimized. It maintains a much better scaling and ratio but still suffers the pain of its token overhead and Abstract Syntax Tree representation, though to a much lesser degree than GCC. A bug report was filed, but talk from two prominent LLVM/Clang developers made it clear that optimizing things any further would require an extremely large refactor of parser internals with a lot of added functionality, with potentially dubious gains. As part of this proposal, the implementation provided does attempt to do some of these optimizations, and follows some of the work done in this post to try and prove memory and file size savings. (The savings in trying to optimize parsing large array literals were "around 10%", compared to the order-of-magnitude gains from #embed and similar techniques.)

Microsoft Visual C (MSVC) scales the worst of all the compilers, even when given the benefit of being on its native operating system. Both Clang and GCC outperform MSVC on Windows 10 or WINE as of the time of writing.

Linker tricks on all platforms perform better on time (though slower than the #embed implementation), but force the data to be optimizer-opaque (even under the most aggressive "Link Time Optimization" or "Whole Program Optimization" modes compilers had). Linker tricks are also exceptionally non-portable: whether it is the incbin assembly command supported by certain compilers, specific invocations of rc.exe/objcopy, or others, non-portability plagues their usefulness in writing cross-platform C (see Appendix for a listing of techniques). This makes C decidedly unlike the "portable assembler" advertised by its proponents (and my Professors and co-workers).

3.3. Support

To say that #embed enjoys broad C Community support is an understatement. In all the years we have written proposals for C and C++, this is the only one where someone physically mailed us a letter - from a different country - directly to the Standards Body to try and make a case for the feature directly, rather than what was already in the paper: