1. Revision history
1.1. Changes since R1
The paper has been seen in Tokyo by SG18 with positive reception,
except for the 
1.2. Changes since R0
- 
     Expand § 4.3 Hardware support, taking more instructions into account, including AVX-512. 
- 
     Minor editorial fixes. 
2. Introduction
The C++ bit manipulation library in 
However, there are still a few operations which are non-trivial to implement in software and have widely available hardware support. Therefore, I propose to expand the bit manipulation library with multiple bit permutation functions described below. Even without hardware support, these functions provide great utility and/or help the developer write more expressive code.
3. Proposed features
I propose to add the following functions to the bit manipulation library (
Note: All constraints are exposition-only. The function bodies contain a naive implementation that merely illustrates the behavior.
Note: See § 4.3 Hardware support for the corresponding hardware instructions.
3.1. bit_reverse 
template < unsigned_integral T > constexpr T bit_reverse ( T x ) noexcept { T result = 0 ; for ( int i = 0 ; i < numeric_limits < T >:: digits ; ++ i ) { result <<= 1 ; result |= x & 1 ; x >>= 1 ; } return result ; } 
Reverses the bits of 
3.2. bit_repeat 
template < unsigned_integral T > constexpr T bit_repeat ( T x , int length ) noexcept ( false) { T result = 0 ; for ( int i = 0 ; i < numeric_limits < T >:: digits ; ++ i ) { result |= (( x >> ( i % length )) & 1 ) << i ; } return result ; } 
Repeats a pattern stored in the least significant 
3.3. bit_compress 
template < unsigned_integral T > constexpr T bit_compressr ( T x , T m ) noexcept { T result = 0 ; for ( int i = 0 , j = 0 ; i < numeric_limits < T >:: digits ; ++ i ) { bool mask_bit = ( m >> i ) & 1 ; result |= ( mask_bit & ( x >> i )) << j ; j += mask_bit ; } return result ; } 
For each one-bit in 
template < unsigned_integral T > constexpr T bit_compressl ( T x , T m ) noexcept { return bit_reverse ( bit_compressr ( bit_reverse ( x )), bit_reverse ( m ))); } 
For each one-bit in 
x: +---+---+---+---+ | a | b | c | d | +---+---+---+---+ m: | | bit_compressl(x, m): bit_compressr(x, m): +---+---+---+---+ +---+---+---+---+ +---+---+---+---+ | 0 | 1 | 0 | 1 | | b | d | 0 | 0 | | 0 | 0 | b | d | +---+---+---+---+ +---+---+---+---+ +---+---+---+---+
Note: Intuitively, 
3.4. bit_expand 
template < unsigned_integral T > constexpr T bit_expandr ( T x , T m ) noexcept { T result = 0 ; for ( int i = 0 , j = 0 ; i < numeric_limits < T >:: digits ; ++ i ) { bool mask_bit = ( m >> i ) & 1 ; result |= ( mask_bit & ( x >> j )) << i ; j += mask_bit ; } return result ; } 
For each one-bit in 
template < unsigned_integral T > constexpr T bit_expandl ( T x , T m ) noexcept { return bit_reverse ( bit_expandr ( bit_reverse ( x )), bit_reverse ( m ))); } 
For each one-bit in 
x: +---+---+---+---+ | a | b | c | d | +---+---+---+---+ m: bit_expandl(x, m): bit_expandr(x, m): +---+---+---+---+ +---+---+---+---+ +---+---+---+---+ | 0 | 1 | 0 | 1 | | 0 | a | 0 | b | | 0 | c | 0 | d | +---+---+---+---+ +---+---+---+---+ +---+---+---+---+
Note: Intuitively, 
4. Motivation and scope
Bit-reversal, repetition, compression, and expansion are fundamental operations that meet multiple criteria which make them suitable for standardization:
- 
     They are common and useful operations. 
- 
     They can be used to implement numerous other operations. 
- 
     At least on some architectures, they have direct hardware support. 
- 
     They are non-trivial to implement efficiently in software. 
- 
     For known masks, numerous optimization opportunities are available. 
4.1. Applications
4.1.1. Applications of bit_reverse 
   Bit-reversal is a common operation with uses in:
- 
     Cryptography: scrambling bits 
- 
     Networking: as part of cyclic redundancy check computation 
- 
     Graphics: mirroring of images with one bit per pixel 
- 
     Random number generation: reversal can counteract low entropy of low-order bits such as in the case of linear congruential generators 
- 
     Digital signal processing: for radix-2 Cooley-Tukey FFT algorithms 
- 
     Code obfuscation: security by obscurity 
4.1.2. Applications of bit_repeat 
   The generation of recurring bit patterns is such a fundamental operation that it’s hard to tie to any specific domain, but here are some use cases:
- 
     Debugging: using obvious recurring bit pattenrs like 0xcccc ... 
- 
     Bit manipulation: generating alternating sequences of 1 0 
- 
     Eliminating integer divison: for x >> ( i % N ) N bit_repeat ( x , N ) >> i 
- 
     Testing: recurring bit patterns can make for good test cases when implementing numerics 
- 
     Error detection: known bit patterns can be introduced to spot failed transmission 
4.1.3. Applications of bit_compress bit_expand 
   Compression and expansion are also common, with uses in:
- 
     Space-filling curves: Morton/Z-Order and Hilbert curves 
- 
     Input/output: especially for variable-length encodings, such as UTF-8 (§ 4.2.3 UTF-8 decoding with bit_compressr) 
- 
     Chess engines: for bitboards; see [ChessProgramming1] 
- 
     Genomics: according to [ARM1] 
A GitHub code search for 
4.2. Motivating examples
4.2.1. Implementing countr_zero bit_repeat 
   [Anderson1] contains a vast amount of algorithms,
many of which involve masks of alternating 
When written as "magic numbers" in code,
these masks can make it quite hard to understand the overall pattern
and to generalize these algorithms. 
unsigned int v ; // 32-bit word input to count zero bits on right unsigned int c = 32 ; // c will be the number of zero bits on the right v &= - v ; if ( v ) c -- ; if ( v & 0x0000FFFF bit_repeat (( 1 << 16 ) - 1 , 32 ) ) c -= 16 ; if ( v & 0x00FF00FF bit_repeat ( ( 1 << 8 ) - 1 , 16 ) ) c -= 8 ; if ( v & 0x0F0F0F0F bit_repeat ( ( 1 << 4 ) - 1 , 8 ) ) c -= 4 ; if ( v & 0x33333333 bit_repeat ( ( 1 << 2 ) - 1 , 4 ) ) c -= 2 ; if ( v & 0x55555555 bit_repeat ( ( 1 << 1 ) - 1 , 2 ) ) c -= 1 ; 
It is now obvious how this can be expressed in a loop:
// ... for ( int i = 16 ; i != 0 ; i /= 2 ) { unsigned mask = bit_repeat (( 1u << i ) - 1 , i * 2 ); if ( v & mask ) c -= i ; } 
4.2.2. Interleaving bits with bit_expand 
   A common use case for expansion is interleaving bits. This translates Cartesian coordinates to the index on a Z-order curve. Space filling curves are a popular technique in compression.
unsigned x = 3 ; // 0b011 unsigned y = 5 ; // 0b101 const auto i = bit_expandr ( x , bit_repeat ( 0b10u , 2 )) // i = 0b01'10'11 | bit_expandr ( y , bit_repeat ( 0b01u , 2 )); 
4.2.3. UTF-8 decoding with bit_compressr 
   
uint_least32_t x = load32_little_endian ( utf8_data_pointer ); switch ( countl_one ( uint8_t ( x ))) { case 0 : return bit_compressr ( x , 0b01111111 ); case 1 : /* error */ ; case 2 : return bit_compressr ( x , 0b00111111'00011111 ); case 3 : return bit_compressr ( x , 0b00111111'00111111'00001111 ); case 4 : return bit_compressr ( x , 0b00111111'00111111'00111111'00000111 ); } 
4.2.4. Other operations based on bit_compress bit_expand 
   Many operations can be built on top of 
// x & 0xf bit_expandr ( x , 0xf ) bit_compressr ( x , 0xf ) // (x & 0xf) << 4 bit_expandr ( x , 0xf0 ) // (x >> 4) & 0xf bit_compressr ( x , 0xf0 ) // Clear the least significant one-bit of x. x ^= bit_expandr ( 1 , x ) // Clear the nth least significant one-bit of x. x ^= bit_expandr ( 1 << n , x ) // Clear the n least significant one-bits of x. x ^= bit_expandr (( 1 << n ) - 1 , x ) // (x >> n) & 1 bit_compressr ( x , 1 << n ) // Get the least significant bit of x. bit_compressr ( x , x ) & 1 // Get the nth least significant bit of x. ( bit_compressr ( x , x ) >> n ) & 1 // popcount(x) countr_one ( bit_compressr ( -1u , x )) countr_one ( bit_compressr ( x , x )) 
4.3. Hardware support
| Operation | x86_64 | ARM | RISC-V | 
|---|---|---|---|
|  | AVX512_BITALG, () | SVE2 | Zbb+Zbkb,(Zbb) | 
|  | AVX512_BITALG | ||
|  | BMI2 | SVE2 | ( V) | 
|  | BMI2 | SVE2 | ( +V) | 
|  | ( BMI2+ABM) | SVE2, (SVE2+SVE) | ( V) | 
|  | ( BMI2+ABM) | ( SVE2+SVE) | ( +V) | 
(Parenthesized) entries signal that the instruction does not directly implement the function, but greatly assists in its implementation.
Note: The AVX-512 
Note: The RISC-V 
4.3.1. Support for bit_reverse 
   This operation is directly implemented in ARM through 
Any architecture with support for 
// assuming a byte is an octet of bits, and assuming the width of x is a power of two x = byteswap ( x ); x = ( x & 0x0F0F0F0F ) << 4 | ( x & 0xF0F0F0F0 ) >> 4 ; // ... quartets of bits x = ( x & 0x33333333 ) << 2 | ( x & 0xCCCCCCCC ) >> 2 ; // ... pairs of bits x = ( x & 0x55555555 ) << 1 | ( x & 0xAAAAAAAA ) >> 1 ; // ... individual bits 
It is worth noting that clang provides a cross-platform family of intrinsics. 
Such an intrinsic has been requested from GCC users a number of times in [GNU1].
4.3.2. Support for bit_repeat 
   Firstly, note that for the pattern length, there are only up to 
While the AVX-512 instruction 
Specific cases like 
However, note that the primary use of 
4.3.3. Support for bit_compress bit_expand 
   Starting with Haswell (2013), Intel CPUs directly implement compression and expansion with
with 
ARM also supports these operations directly with 
Overall, only recent instruction set extensions offer this functionality directly. However, when the mask is a constant, many different strategies for hardware acceleration open up. For example
- 
     interleaving bits can be assisted (though not fully implemented) using ARM zip1 zip2 
- 
     other permutations can be assisted by ARM tbl tbx 
As [Warren1] explains, the cost of computing 
All in all, there are multiple factors that strongly suggest a standard library implementation:
- 
     The strategy for computing bit_compress bit_expand - 
       tzcnt clmul popcnt 
 
- 
       
- 
     ISO C++ does not offer a mechanism through which all of this information can be utilized. Namely, it is not possible to change strategy based on information that only becomes available during optimization passes. Compiler extensions such as __builtin_constant_p 
- 
     ISO C++ does not offer a mechanism through which function implementations can be chosen based on the surrounding context. In a situation where multiple bit_compress m 
Bullets 2. and 3. suggest that 
Even with a complete lack of hardware support, a software implementation of 
unsigned bit_compress_known_mask ( unsigned x ) { return cxx26bp :: bit_compressr ( x , 0xf0f0u ); } 
Clang 18 emits the following (and GCC virtually the same); see [CompilerExplorer1]:
bit_compress_known_mask ( unsigned int ): # bit_compress_known_mask(unsigned int edi) mov eax , edi # { unsigned int eax = edi; shr eax , 4 # eax >>= 4; and eax , 15 # eax &= 0xf; shr edi , 8 # edi >>= 8; and edi , 240 # edi &= 0xf0; or eax , edi # eax |= edi; ret # return eax; } 
Knowing the implementation of 
5. Impact on existing code
This proposal is purely a standard library expansion. No existing code is affected.
6. Design considerations
The design choices in this paper are based on [P0553R4], wherever applicable.
6.1. Signature of bit_repeat 
   
It is also the only function not marked 
6.2. Why the names compress and expand?
The use of 
Furthermore, there are multiple synonymous sets of terminology:
- 
     deposit extract 
- 
     compress expand 
- 
     gather scatter 
I have decided against 
Taking the input 
Extract each second bit from
and densely deposit it into the result.0b10101 
Similarly, taking the input 
Extract each bit from
and sparsely deposit it in the result.0b111 
Both operations can be described with 
Further design choices are consistent with [P0553R4].
The abbreviations 
6.3. Why the lack of generalization?
6.3.1. No generalized bit_compress bit_expand 
   [N3864] originally suggested much more general versions of compression and expansion, which support:
- 
     performing the operation not just on the whole operand, but on "words" of it, in parallel 
- 
     performing the operation not just on bits, but on arbitrarily sized groups of bits 
I don’t propose this generality for the following reasons:
- 
     The utility functions in < bit > 
- 
     These more general form can be built on top of the proposed hardware-oriented versions. This can be done with relative ease and with little to no overhead. 
- 
     The generality falsely suggests hardware support for all forms, despite the function only being accelerated for specific inputs. This makes the performance characteristics unpredictable. 
- 
     The proposed functions have wide contracts and can be noexcept 
- 
     Generality adds complexity to the standardization process, to implementation, and from the perspective of language users. It is unclear whether this added complexity is worth it in this case. 
6.3.2. No generalized bit_reverse 
   Bit reversal can also be generalized to work with any group size:
template < typename T > T bit_reverse ( T x , int group_size = 1 ) noexcept ( false); 
With this generalization, 
However, this general version is much less used, not as consistently supported in
hardware, and has a narrow contract. 
Therefore, a generalized bit-reversal is not proposed in this paper.
6.4. Why does the signature of bit_compress T 
   Initially, I went through a number of different signatures.
template < unsigned_integral T , unsigned_integral X > constexpr T bit_compressr ( X x , T m ) noexcept ; 
This signature is quite clever because the result never has more bits than the mask 
Furthermore, I’ve realized that while the result never has more bits than 
template < unsigned_integral T , unsigned_integral X > constexpr common_type_t < T , X > bit_compressr ( X x , T m ) noexcept ; 
However, it is not trivial to juggle bits between the left and right versions of 
Since this proposal includes low-level bit operations, it is reasonable and safe to require
the user to be explicit.
A call to 
template < unsigned_integral T > constexpr T bit_compressr ( T x , T m ) noexcept ; 
7. Possible implementation
7.1. Reference implementation
All proposed functions have been implemented in [Schultke1]. This reference implementation is compatible with all three major compilers, and leverages hardware support from ARM and x86_64 where possible.
7.2. Other implementations
[Warren1] presents algorithms which are the basis for [Schultke1].
- 
     An O(log n) bit_reverse 
- 
     An O(log2 n) bit_compress bit_expand - 
       can be O(log n) with hardware support for carry-less multiplication aka. GF(2) polynomial multiplication 
 
- 
       
[Zp7] offers fast software implementations for 
[StackOverflow1] contains discussion of various possible software implementations
of 
8. Proposed wording
The proposed changes are relative to the working draft of the standard as of [N4917].
8.1. Feature-testing
In subclause 17.3.2 [version.syn] paragraph 2, update the synopsis as follows:
#define __cpp_lib_bitops 201907L 20XXXXL // freestanding, also in <bit> 
8.2. Header synopsis
In subclause 22.15.2 [bit.syn], update the synopsis as follows:
+ // 22.15.X, permutations + template < class T > + constexpr T bit_reverse ( T x ) noexcept ; + template < class T > + constexpr T bit_repeat ( T x , int l ); + template < class T > + constexpr T bit_compressr ( T x , T m ) noexcept ; + template < class T > + constexpr T bit_expandr ( T x , T m ) noexcept ; + template < class T > + constexpr T bit_compressl ( T x , T m ) noexcept ; + template < class T > + constexpr T bit_expandl ( T x , T m ) noexcept ; 
8.3. New subclause
In subclause 22.15 [bit], add the following subclause:
22.15.X Permutation [bit.permute]
1 In the following descriptions, let N denote
. Let αn denote the n-th least significant bit ofnumeric_limits < T > :: digits , so thatα equalsα template < class T > constexpr T bit_reverse ( T x ) noexcept ; 2 Constraints:
is an unsigned integer type ([basic.fundamental]).T 3 Returns:
[Note:equalsbit_reverse ( bit_reverse ( x )) . — end note]x template < class T > constexpr T bit_repeat ( T x , int l ); 4 Constraints:
is an unsigned integer type ([basic.fundamental]).T 5 Preconditions:
is greater than zero.l 6 Returns:
7 Throws: Nothing.
template < class T > constexpr T bit_compressr ( T x , T m ) noexcept ; 8 Constraints:
is an unsigned integer type ([basic.fundamental]).T 9 Returns:
template < class T > constexpr T bit_expandr ( T x , T m ) noexcept ; 10 Constraints:
is an unsigned integer type ([basic.fundamental]).T 11 Returns:
[Note: Ifequals zero, thenx & ~ m equalsbit_expandr ( bit_compressr ( x , m ), m ) . Ifx equals zero, thenx >> popcount ( m ) equalsbit_compressr ( bit_expandr ( x , m ), m ) . — end note]x template < class T > constexpr T bit_compressl ( T x , T m ) noexcept ; 12 Constraints:
is an unsigned integer type ([basic.fundamental]).T 13 Returns:
.bit_reverse ( bit_compressr ( bit_reverse ( x ), bit_reverse ( m ))) template < class T > constexpr T bit_expandl ( T x , T m ) noexcept ; 14 Constraints:
is an unsigned integer type ([basic.fundamental]).T 15 Returns:
.bit_reverse ( bit_expandr ( bit_reverse ( x ), bit_reverse ( m ))) 
Note: I would have preferred a less mathematical approach to defining these functions.
      However, it is too difficult to precisely define 
9. Acknowledgements
I greatly appreciate the assistance of Stack Overflow users in assisting me with research for this proposal. I especially thank Peter Cordes for his tireless and selfless dedication to sharing knowledge.
I also thank various Discord users from Together C & C++ and #include<C++> who have reviewed drafts of this proposal and shared their thoughts.