1. Motivation
ISO/IEC 19570:2018 introduced data-parallel types to the C++ Extensions
for Parallelism TS. [P1915R0] asked for feedback on 
One of the extra features we suggested was the ability to be able to permute
elements across or within SIMD values. These operations are critical for
supporting formatting of data in preparation for other operations (e.g.,
transpose, strided operations, interleaving, and many more), but are relatively
poorly supported in the current 
In this document we shall start with some background into the current 
2. Revision History
R3 => R4
- 
     Renamed simd simd_mask basic_simd basic_simd_mask 
- 
     Replaced deduced masks with nested typedefs (i.e., use basic_simd < T , ABI >:: mask_type basic_simd_mask < T , MaskABI > 
- 
     Replaced fixed_size types with their new counterparts (e.g., fixed_size_simd < T , 5 > simd < T , 5 > 
- 
     Removed design option to allow compress expand 
- 
     Reordered mask parameter position (it now matches simd_select copy_to 
R2 => R3
- 
     Added more detail about compiler’s ability to optimise a generated sequence 
- 
     Made memory operations (gather/scatter) a class of its own, rather than trying to make them appear like special forms of the other types of permute. 
- 
     Moved special indexes and generator size parameter description into main body of text rather than being a design option. 
- 
     Added fill value parameters for compress and expand functions. 
- 
     Improved wording and examples for memory permutations (gather/scatter). 
- 
     Added wording for static permute, dynamic permute, mask permute and gather/scatter. 
R1 => R2
- 
     Rewrite to bikeshed 
- 
     Add more content for proposal overview. 
R0 => R1
- 
     Add polls from Kona SG1 2022 
3. Background
In ISO/IEC 19570:2018 there were only a few functions which could be used to
achieve any sort of element permutation across or within 
simd < float , 5 > x ; …const auto [ piece0 , piece1 ] = split < 2 , 3 > ( x ); 
Those smaller pieces could be acted on using any of the 
The 
In [P2638R0] Intel suggested extra operations to insert and extract SIMD values to and from other SIMD containers:
const auto t = extract < 2 , 5 > ( v ); // Read out 3 SIMD values, starting from position 2 insert < 5 > ( v , t ); // Put the 3 previously read values back into the container at position 5 
These are convenient, but they still don’t allow arbitrary or expressive permute operations.
Two suggestions were made in [P0918R1]: add an arbitrary compile-time shuffle function, and add a specific type of permutation called an interleave. The compile-time shuffle can be used as follows:
const auto p = shuffle < 3 , 4 , 1 , 1 , 2 , 3 > ( x ); // p::value_type will be the same as x::value_type // p::size will be 6 
The indexes can be arbitrary, and therefore allow any duplication and reordering
of elements. The output 
In addition to the shuffle function, [P0918R1] also proposes a function called
interleave which accepts two SIMD values and interleaves respective elements
into a new SIMD value which is twice the size. For example, given inputs 
- 
     the set of named functions can only include those features available on all potential targets 
- 
     portability is reduce by exposing functions only on those targets which support them. 
Neither of these is desirable, and in our suggestions below we provide an alternative.
4. Extending std :: simd 
   There are four classes of permutation which could be supported by 
- 
     using compile-time computed indexes 
- 
     using another SIMD as a dynamic index 
- 
     using a bit-mask for selecting active elements 
- 
     permuting memory (gather and scatter) 
Modern processor instruction sets typically support all 4 of these classes of instruction. In the following sections we shall examine in more detail each of these classes and the potential APIs needed to expose those features to the user. We shall also examine what it means to permute memory, and whether named functions should be provided for common permutation patterns.
Note that a 
4.1. Using compile-time computed indexes
The first proposal is to provide a 
template < std :: size_t SizeSelector = 0 , typename T , typename Abi , typename PermuteGenerator > constexpr simd < T , output - size > permute ( const basic_simd < T , Abi >& v , PermuteGenerator fn ) 
It can be used as in the following example:
const auto dupEvenIndex = []( size_t idx ) -> size_t { return ( idx / 2 ) * 2 ; }; const auto de = permute ( values , dupEvenIndex ); 
Note that the permutation function maps from the index of each output element to
the index which should be used as the source for that position. So, for example,
the 
By default, the permute function will generate as many elements in the output
simd as there were input values, but the output size can be explicitly specified
if it cannot be inferred or it needs to be modified. For example, the following
permute generates 4 values with a stride of 3 (i.e., indexes 
4.1.1. Special generator indexes
The obvious index representation to return from each generator invocation is to
return an integral value in the range 
- 
     Negative indexes represent reverse indexes, which are intepreted as indexes taken from the end of the simd instead of the beginning. So -1 would be the last element, -2 the penultimate element, and so on. 
- 
     The special index named simd_zero_element 
- 
     The special index named simd_uninit_element 
Here is an example where zeros are inserted into the even-indexed output positions of a SIMD value while the odd-indexed values are copied directly from the input.
4.1.2. Generator function size argument
It can be useful to create libraries of common permutation generator functions
to handle common use cases, but such functions need to be able to know the
extent of the indexes that they are allowed to permute. The generator function
is allowed to accept an optional 
auto topHalf = []( std :: integral auto idx , std :: size_t simd_size ) { return ( simd_size / 2 ) + idx ; }; 
4.1.3. Design option - more special constants
There are other possible special constants that could be given special status in the compile-time permute, including the ability to insert NaN, +ve infinity, -ve infinite, max_value, min_value, and so on. Should these be adopted too?
4.1.4. Implementation Experience
In other simd or vector libraries which provide named specific permute
functions, the reason often given is that it allows specific hardware
instructions to be used to efficiently implement those permutes. For example,
Intel has the 
Using the generator-based approach makes it easier to access a wide range of
underlying special purpose instructions, or to fall back to equivalent
instructions sequences where they are not available, but it is desirable that
such behavior should be efficient. To judge whether the generator-based
permutation API was efficient and useful we implemented the extensions in the
Intel 
The generator function works by creating a list of compile-time constants which are used to perform the permutation. Because the list is compile-time, that list can be generated using completely arbitrary (and potentially very inefficient) code, but the outcome will always be a list of constants. Therefore the compiler’s ability to generate a permutation is dependent upon how well it can convert a list of constants into a permutation, not how well the programmer is at writing clear permute functions.
As a small illustration of the compiler’s ability, the following table shows some compile-time function permute calls and the code that the LLVM 14.0.0 compiler has generated for each:
| Permute call | Purpose | Output from clang 14.0.0 | 
|  | Duplicate even elements |  | 
|  | Swap even/odd elements in each pair (complex-valued IQ swap) |  | 
|  | Extract upper half of a 16-element vector. Note that the instruction sequence accepts a zmm input and returns a ymm output. |  | 
In each case the compiler has accepted the compile-time index constants and converted them into an instruction which efficiently implements the desired permutation. We can safely leave the compiler to determine the best instruction from an arbitrary compile-time function.
4.2. Using another SIMD as the dynamic index
The second proposal for the permute API is to allow the required indexes to be passed in using a second SIMD value containing the run-time indexes:
template < typename T , typename AbiT , std :: integral Idx , typename AbiIdx > constexpr simd < T , basic_simd < Idx , AbiIdx >:: size () > permute ( const basic_simd < T , AbiT >& v , const basic_simd < Idx , AbiIdx >& indexes ) 
This can be used as in the following example:
simd < unsigned , 8 > indexes = getIndexes (); simd < float , 5 > values = getValues (); ... const auto result = permute ( values , indexes ); // result::value_type is float // result::size is 8 
The permute function will return a new SIMD value which has the same element type as the input value being permuted, and the same number of elements as the indexing SIMD (i.e., float and 8 in the example above). The permute may duplicate or arbitrarily reorder the values. The index values must be of integral type, with no implicit conversions from other types permitted. The behavior is undefined if any index is outside the valid index range. Dynamic permutes across multiple input values are handled by concatenating the input values together. The indexes in the selector will only be treated as indexes, and will never have special properties as described above (e.g., no zero element selection)
In addition to the function called permute, a subscript operator will also be provided:
template < std :: integral Idx , typename AbiIdx > constexpr simd < T , basic_simd < Idx , AbiIdx >:: size () > simd :: operator []( const simd < Idx , AbiIdx > & indexes ) const ; 
Here is an example of how this could be used:
simd < int , 8 > indexes = getIndexes (); simd < float , 5 > values = getValues (); ... const auto result = values [ indexes ]; // result::value_type is float // result::size is 8 
4.2.1. Design option - C++23 multi-index subscript
Sometimes the subscripts to use in a permute might be known at compile time, but to use them in a permutation requires them to be put into an index simd first:
constexpr simd < int > indexes = { 3 , 6 , 1 , 6 }; // Note - assumes initialiser list from P2876R0. auto output = values [ indexes ]; 
In C++23 multi-subscript operator was introduced and this could be used to allow a more compact representation:
auto output = values [ 3 , 6 , 1 , 6 ]; // output::value_type is values::value_type // output::size is 4 
It would be good if it could also appear on the left-hand-side;
output_simd [ 2 , 5 , 6 ] = x ; // Overwrite selected elements with a single value. 
Should non-constant indexes be allowed too? Our experience with GCC and LLVM suggests that work would be needed on their code generation for non-constant variable indexes to be improved though, as they are both currently poor at this.
Should 
4.2.2. Implementation experience
We implemented the runtime-indexed permute API in Intel’s example 
4.3. Permutation using simd masks (compress/expand)
A third variant of permutation is where a 
On the left the values in the SIMD are compressed; only those elements which
have their respective 
template < typename T , typename Abi > constexpr basic_simd < T , Abi > compress ( const typename basic_simd < T , Abi >:: mask_type & mask , const basic_simd < T , Abi >& value ); template < typename T , typename Abi > constexpr basic_simd < T , Abi > compress ( const typename basic_simd < T , Abi >:: mask_type & mask , const basic_simd < T , Abi >& value , T fill_value ); template < typename T , typename Abi > constexpr basic_simd < T , Abi > expand ( const typename basic_simd < T , Abi >:: mask_type & mask , const basic_simd < T , Abi >& value , const simd < T , Abi >& original ); 
In both functions all 
A compression operation is typically used to throw away unwanted values (i.e., values which do not satisfy the mask condition). Unwanted positions in the output will be left unspecified, unless the user supplies a third parameter specifying the value which should be used to fill those unwanted positions.
An expansion operation is typically used to overwrite selected elements of some existing SIMD value with new values taken from the contiguous beginning of another SIMD value. The expansion operation therefore has a slightly different meaning for its fill value.
4.3.1. Unused value element initialisation
In the design shown above the 
simd < int , 10 > values = ...; simd_mask < int , 6 > mask = ...; auto compressed = compress ( values , mask ); // compressed::value_type is int // compressed::size is 6 
The output value has the same size as the mask selector, but the actual mask bits won’t be known until run-time. This raises the question of what values should be put into the unused output elements.
The current proposal would leave the unused elements in a valid but unspecified
state. This is comparable to functions like 
The advantage of an unspecified state is that the code doesn’t need to make a special effort to insert values which might not be used anyway, but has the disadvantage that the unspecified state might catch out the unwary user. Using a value initialized state helps the unwary user by providing a default state which is sane and repeatable, but may come with a performance cost.
Intel have an example implementation of 
Given that the compress function is essentially only extracting valid values from the input SIMD and discarding the others, and that there is a potential performance penalty that comes from initialising unused values, then we think that the default should be to leave unused elements in unspecified states.
Related to this is what to do when a programmer does want to initialize unused
values to a known value. With the proposed interface the programmer would be
required to compute a new partial mask from their original selection mask (i.e.,
compute the population count of the original mask, and then turn that into a
mask of the lower elements), and then use that to conditionally blend in their
desired value. This can be inefficient, and for targets like Intel AVX-512, it
doesn’t exploit the hardware’s ability to automatically insert a value anyway.
For this reason we propose that an overload of 
template < typename T , typename Abi > basic_simd < T , Abi > compress ( const basic_simd < T , Abi >& value , const typename basic_simd < T , Abi >:: mask_type & mask , T default_value ); 
This makes the programmer’s intent explicit in the code, it simplifies the code, and it allows those targets which support automatic insertion of a value to efficiently make use of that feature.
4.3.2. Implementation Experience
In Intel’s example implementation of 
4.4. Memory-based permutes
When permuting values which are stored in memory, the operations are normally called gather (reading data from memory), or scatter (writing values to memory). We propose that two functions are provided which implement these operations:
template < std :: contiguous_iterator Iter , std :: integral Idx , typename AbiIdx > constexpr simd < std :: iter_value_t < Iter > , basic_simd < Idx , AbiIdx >:: size () > gather ( Iter in , const simd < Idx , AbiIdx >& indexes ); template < std :: contiguous_iterator Iter , typename T , typename TAbi , std :: integral Idx , typename IdxAbi > constexpr void scatter ( const simd < T , TAbi > , Iter out , const simd < Idx , IdxAbi >& indexes ); 
The gather operation could be used as follows:
int array [ 1024 ]; ... const auto r1 = gather ( array , indexes ); 
The scatter would be used as follows:
int array [ 1024 ]; simd < int > values = ...; simd < int > indexes = ...; ... scatter ( values , array , indexes ); 
4.4.1. Design option - allow subscripting a contiguous iterator
If [DNMSO] is allowed, then a pointer or 
m = v . begin (); auto data = m [ simd_values ]; m [ simd_values ] = inputs ; // Scatter to memory 
4.4.2. Design option - allow gather/scatter to take masks
gather/scatter could be overloaded to accept 
template < std :: contiguous_iterator Iter , std :: integral Idx , typename Abi > constexpr rebind_simd_t < std :: iter_value_t < Iter > , simd_mask < Idx , Abi >> gather ( Iter in , const simd < Idx , Abi >& indexes ); template < std :: contiguous_iterator Iter , typename T , typename TAbi , std :: integral Idx , typename IdxAbi > constexpr void scatter ( const simd < T , TAbi > , Iter out , const simd_mask < Idx , IdxAbi >& indexes ); 
They can be used as follows:
auto data = gather ( ptr , simd_mask ); // Copy_if scatter ( ptr , simd , simd_mask ); 
Note that the masked gather essentially behaves like 
5. Named functions
The APIs discussed in this document are very flexible and can be used to build effectively arbitrary permutations. However, there are inevitably certain patterns that are very common and it could be convenient to provide named functions for those patterns. For example:
auto dupEven ( simdable v ) { return permute ( v , []( size_t idx ) -> size_t { return ( idx / 2 ) * 2 ; }); }; auto dupOdd ( simdable v ) { return permute ( v , []( size_t idx ) -> size_t { return idx | 1 ; }); }; auto swapOddEven ( simdable auto v ) { return permute ( v , []( size_t idx ) -> size_t { return idx ^ 1 ; }); } auto even ( simdable auto v ) { return permute < v :: size () / 2 > ( v , []( size_t idx ) -> size_t { return idx * 2 ; };); } 
In many cases there are already names in C++ that reflect the operation.
Providing overloads in 
std :: rotate std :: shift_left std :: shift_right std :: copy_if std :: remove_if std :: slice // tricky - used for valarray at the moment, but it // also captures concepts like odd and even std :: stable_partition // Use a mask to split the values 
These functions could be added at a later date.
6. Summary
In this document we have described four styles of interface for handling permutes: compile-time generated, simd-indexed, mask-indexed, and memory based. In combination, these interfaces allow all types of common permute to be expressed in a way which is clean, concise, and efficient for the compiler to implement.
7. Wording
7.1. Add new section [simd.static_permute]
�
static permute [simd.static_permute]simd constexpr static int simd_zero_element = /* Implementation defined */ ; constexpr static int simd_uninit_element = /* Implementation defined */ ; Constraints:
These cannot take values which could be mistaken for valid index values, so they must be outside the range
.[ 0. . max - impl - size ) Remarks:
Special constants can be returned by the permutation generator function to indicate special behavior of that element.
// Free function to generate compile-time permute of simd template < std :: size_t SizeSelector = 0 , typename T , typename Abi , typename PermuteGenerator > constexpr simd < T , output - size > permute ( const basic_simd < T , Abi >& v , PermuteGenerator fn ) // Free function to generate compile-time permute of simd_mask template < std :: size_t SizeSelector = 0 , typename T , typename Abi , typename PermuteGenerator > constexpr simd_mask < T , output - size > permute ( const basic_simd_mask < T , Abi >& v , PermuteGenerator fn ) Constraints:
isstd :: integral < std :: invoke_result < PermuteGenerator , std :: size_t > || std :: integral < std :: invoke_result < PermuteGenerator , std :: size_t , std :: size_t > true.Returns:
A
orbasic_simd object of sizebasic_simd_mask (see below) where the ith element is initialized as follows. For each output indexoutput - size the result of the expression[ 0. . output - size ) , orsimd ( gen ( integral_constant < size_t , i > ())) is computed. The result indexsimd ( gen ( integral_constant < size_t , i > (), integral_constant < size_t , output - size )) is used to select a value fromx as follows:v 
A value in the range
will use[ 0. . size ) .v [ x ] 
The value
will usesimd_zero_element .T () 
The value
will use a valid but unspecified value of the compiler’s choosing.simd_uninit_element 
Any other value is undefined.
The
will be the same asoutput - size ifv . size is zero, otherwise it will be set toSizeSelector .SizeSelector Remarks:
The calls to the generator are unsequenced with respect to each other.
7.2. Add new section [simd.dynamic_permute]
�dynamic permute [simd.dynamic_permute]simd // Free function to permute simd by dynamic index template < typename T , typename AbiT , std :: integral Idx , typename AbiIdx > constexpr simd < T , basic_simd < Idx , AbiIdx >:: size () > permute ( const basic_simd < T , AbiT >& v , const basic_simd < Idx , AbiIdx >& indexes ) // Free function to permute simd mask by dynamic index template < typename T , typename AbiT , std :: integral Idx , typename AbiIdx > constexpr simd_mask < T , basic_simd_mask < Idx , AbiIdx >:: size () > permute ( const basic_simd_mask < T , AbiT >& v , const basic_simd_mask < Idx , AbiIdx >& indexes ) // Member function to permute simd by dynamic index using subscript operator template < std :: integral Idx , typename AbiIdx > constexpr simd < T , basic_simd < Idx , AbiIdx >:: size () > basic_simd :: operator []( const basic_simd < Idx , AbiIdx >& indexes ) // Member function to permute a simd mask by dynamic index using subscript operator template < std :: integral Idx , typename AbiIdx > constexpr simd_mask < T , basic_simd < Idx , AbiIdx >:: size () > basic_simd_mask :: operator []( const basic_simd < Idx , AbiIdx >& indexes ) Preconditions:
All values in
must be in the rangeindexes .[ 0. . size ) 
All index elements must be integral types.
Returns:
A
orbasic_simd object of sizebasic_simd_mask where the ith element is a copy ofsimd_size_v < Idx , AbiIdx > if i is in the rangev [ indexes [ i ]] or undefined otherwise.[ 0. . v . size ) 
7.3. Add new section [simd.mask_permute]
�mask permute [simd.mask_permute]simd // Free function to compress simd values using a mask selector ( 1 ) template < typename T , typename Abi > constexpr basic_simd < T , Abi > compress ( const basic_simd < T , Abi >:: mask_type & selector , const basic_simd < T , Abi >& v ) // Free function to compress simd_mask elements using a mask selector ( 2 ) template < typename T , typename Abi > constexpr basic_simd_mask < T , Abi > compress ( const basic_simd_mask < T , Abi >& selector , const basic_simd_mask < T , Abi >& v ) // Free function to compress simd values using a mask selector // with a fill value for unused elements ( 3 ) template < typename T , typename Abi > constexpr basic_simd < T , Abi > compress ( const basic_simd_mask < T , Abi >:: mask_type & selector , const basic_simd < T , Abi >& v , T fill_value ) // Free function to compress simd_mask elements using a mask selecto // with a fill value for unused elements ( 4 ) template < typename T , typename Abi > constexpr basic_simd_mask < T , Abi > compress ( const basic_simd_mask < T , Abi >& selector , const basic_simd_mask < T , Abi >& v , T fill_value ) Returns:
A
orbasic_simd object in which the firstbasic_simd_mask values are copies of those elements in[ 0. . reduce_count ( selector )) whose corresponding mask element is set. The relative order of the values fromv is unchanged.v 
For (1) and (2) output values in the range
will be valid but unspecified values of type T.[ reduce_count ( selector )... v . size )) 
For (3) and (4) output values in the range
will be copies of[ reduce_count ( selector )... v . size )) .fill_value Remarks:
The
operation is used to throw away unwanted values, so thecompress is used to put sane values into unused positions. This is in contrast tofill_value where the operation is typically used to overwrite selected elements of an existing SIMD value and so unused positions come from another SIMD value.expand // Free function to expand simd values using a mask selector template < typename T , typename Abi > constexpr basic_simd < T , Abi > expand ( const typename basic_simd < T , Abi >:: mask_type & selector , const basic_simd < T , Abi >& v , const basic_simd < T , Abi >& original = ()) // Free function to expand simd_mask elements using a mask selector template < typename T , typename Abi > constexpr basic_simd_mask < T , Abi > expand ( const basic_simd < T , Abi >:: mask_type & selector , const basic_simd_mask < T , Abi >& v , const basic_simd < T , Abi >& original = ()) Returns:
A
orbasic_simd object in which the firstbasic_simd_mask values are copied to those output elements whose respective selector elements are set. Any output elements whose respective selector elements are not set will have the corresponding element from[ 0. . reduce_count ( selector )) copied into that position.original Remarks:
is often used to overwrite selected positions in an existing SIMD value, so these functions take a complete SIMD value as the source from which to take unused mask selector elements. This is in contrast toexpand where the compression operation is effectively throwing away unwanted elements and thecompress is being used to put sane values into unused positions.fill_value 
7.4. Add new section [simd.memory_permute]
�memory permute [simd.memory_permute]simd template < std :: contiguous_iterator Iter , std :: integral Idx , typename Abi > constexpr rebind_simd_t < std :: iter_value_t < Iter > , simd < Idx , Abi >> gather ( Iter in , const simd < Idx , Abi >& indexes ); Constraints:
isstd :: integral < Idx > true.Preconditions:
All values in
must refer to elements which are within the range of the contiguous input iterator.indexes Returns:
A
object with the same number of elements asbasic_simd and the type of the element to whichindexes refers. The ith element of the object will be a copy of the valueIter .in [ indexes [ i ]] template < typename T , typename TAbi , std :: contiguous_iterator Iter , std :: integral Idx , typename IdxAbi > constexpr void scatter ( const simd < T , TAbi >& value , Iter out , const simd < Idx , IdxAbi >& indexes ); Constraints:
isstd :: integral < Idx > true.
issimd_size_v < simd < T , TAbi >> == simd_size_v < simd < Idx , IdxAbi >> true.
isstd :: convertible_to < T , std :: iter_value_t < Iter >> true.Preconditions:
All values in
must refer to elements which are within the range of the contiguous output iterator.indexes Effects:
For all i in the range
the value at[ 0. . simd_size_v < Idx , IdxAbi > ) will be a copy ofout [ indexes [ i ]] .value [ i ] Remarks:
The order in which the individual memory writes occur is unspecified.
8. Acknowledgments
Thank you to Matthias Kretz for useful offline discussions regarding the behaviour and contents of the permutation API.
9. Polls
9.1. Varna Meeting - 16th May 2023
POLL: SIMD permute should not support negative indices
| SF | F | N | A | SA | 
|---|---|---|---|---|
| 2 | 10 | 0 | 0 | 0 | 
POLL: SIMD permute should always pass both an index and a size to the invocable.
| SF | F | N | A | SA | 
|---|---|---|---|---|
| 2 | 2 | 6 | 3 | 1 | 
POLL: Rename 
| SF | F | N | A | SA | 
|---|---|---|---|---|
| 6 | 8 | 1 | 0 | 0 | 
9.2. Telecon Meeting - 2nd May 2023
Poll: We should promise more committee time to pursuing P2664R2 (Proposal to extend 
| SF | F | N | A | SA | 
|---|---|---|---|---|
| 6 | 4 | 1 | 0 | 0 | 
Poll: Pursue extended indices in the permute callable (e.g., negative indices, special values for zero element, etc.).
| SF | F | N | A | SA | 
|---|---|---|---|---|
| 5 | 4 | 1 | 0 | 0 | 
Poll: Pursue multi-index subscripting for dynamic permutations in the initial revision.
| SF | F | N | A | SA | 
|---|---|---|---|---|
| 0 | 1 | 8 | 2 | 0 | 
Poll : Pursue 
| SF | F | N | A | SA | 
|---|---|---|---|---|
| 1 | 6 | 2 | 0 | 0 | 
Poll: Pursue 
| SF | F | N | A | SA | 
|---|---|---|---|---|
| 0 | 0 | 7 | 2 | 0 |