1. Motivation
ISO/IEC 19570:2018 introduced data-parallel types to the C++ Extensions for Parallelism TS [P1928R15]. That paper, and several ancillary papers, do an excellent job of setting out the main features of an extension to C++ which allows generic data-parallel programming on arbitrary targets. However, it is inevitable that the programmer will want to make some use of target-specific intrinsics, or call into legacy libraries that accept fixed-size platform-specific SIMD types such as __m256, in order to unlock features of those specific platforms or reuse existing optimized code. This requires that the programmer is able to pass a vec value to a call to a target intrinsic or library function, and that the result of that call can be used to generate a new vec value. This is already permitted for vec values which fit into the expected type, but it is harder to achieve when the vec value is larger.
In this paper we propose a function called simd::chunked_invoke which
makes it easy to repeatedly apply a target-specific intrinsic to native-sized
pieces of large vec value arguments (created using simd::chunk), and
to marshall their individual results back into a vec or basic_mask result (using simd::cat).
The function is named chunked_invoke and placed in the simd namespace to
directly align with related established functions such as simd::chunk and simd::cat. This
terminology was suggested by committee members during prior reviews, and
reflects the internal vocabulary of the SIMD library.
2. Revision History
R3 => R4
- Made arguments to chunked_invoke be const&.
- Clarified that simd::cat is used to glue the results back together.
- Changed the chunk index from a compile-time constant to a runtime simd_size_type value, resolving the template instantiation concern raised during LEWG review.
- Added discussion of the implications of modifiable arguments (as requested by LEWG poll).
- Added "as-if full decomposition" semantics with implementation freedom to chunk incrementally.
R2 => R3
- Small wording changes and rendering fixes.
R1 => R2
- Updated to match the Working Draft.
- Changed the name of the function to avoid confusion with std::invoke.
- Defined callable order to be by ascending index.
- Numerous minor clarifications and wording fixes.
- Changed the Constraints into Mandates to make them hard errors.
- Added description of why prototype-based chunking is not supported.
R0 => R1
- Freshened up the wording to match the current state of the draft proposal.
- Removed invoke_indexed in favour of probing the callable's capabilities.
3. Background
Although std::simd has been carefully crafted to include APIs which access all
of the common or desirable features of SIMD instruction sets, it is inevitable
that the user will sometimes want to take advantage of instructions which are
specific to a particular platform. For example, a DSP target may have a special
type of accumulator instruction, or algorithm specific instruction (e.g., AES
crypto). Clearly, calling these intrinsics results in non-portable code, but the
increase in hardware-accelerated performance on a given target could be a worthwhile
trade-off.
The draft standard wording for std::simd recommends that provision be made for
conversions to and from implementation-defined types. For example:
vec<float> addsub(vec<float> a, vec<float> b) {
    return static_cast<vec<float>>(
        _mm256_addsub_ps(static_cast<__m256>(a), static_cast<__m256>(b)));
}
In this example the inputs, which are native register-sized values, are
explicitly converted into their target-specific typed values. Those target
specific types are used to call the intrinsics, and then the target specific
return value is converted back into the closest representation.
This example is straightforward, since the vec<float> values are explicitly the
correct native size. For targets which support several different register
sizes (e.g., some variants of AVX support 128-, 256- and 512-bit registers) the
code can use a constexpr conditional to select which size to use:
vec<float> addsub(vec<float> a, vec<float> b) {
    if constexpr (vec<float>::size() == 4)
        return vec<float>(_mm_addsub_ps(static_cast<__m128>(a),
                                        static_cast<__m128>(b)));
    else if constexpr (vec<float>::size() == 8)
        return vec<float>(_mm256_addsub_ps(static_cast<__m256>(a),
                                           static_cast<__m256>(b)));
    else
        error(); // Invalid native register
}
Things become more tricky when dealing with vec values which are larger than
their native implementation types. Such types cannot be converted directly into a type which can
be used to call an intrinsic. Instead, the vec must be broken down into smaller
pieces which are the correct size for the call to the intrinsic, and then the
results of those calls glued back together. Here is one way to do that for a vec value which is twice as big as a native register:
// Assumes AVX is in use, and that each native register is therefore 8 x float.
vec<float> addsub(vec<float, 16> a, vec<float, 16> b) {
    // Get register-sized pieces.
    auto [lowA, highA] = chunk<vec<float>>(a);
    auto [lowB, highB] = chunk<vec<float>>(b);

    // Call the intrinsic on each pair of pieces.
    auto resultLow  = vec<float>(_mm256_addsub_ps(static_cast<__m256>(lowA),
                                                  static_cast<__m256>(lowB)));
    auto resultHigh = vec<float>(_mm256_addsub_ps(static_cast<__m256>(highA),
                                                  static_cast<__m256>(highB)));

    // Glue the individual results back together.
    return simd::cat(resultLow, resultHigh);
}
This is now getting verbose, and it only handles vec inputs which are
twice the size of a native register value. To use the intrinsic with larger vec values, or values which don’t map exactly onto native register-sized pieces, more
work is needed (e.g., if vec<float, 20> was used then the pieces would be
of size 8, 8, and 4 respectively, and each would require a suitable call to an
intrinsic of the appropriate size).
The boiler-plate code needed to handle this is technically straightforward, but
verbose. A completely generic solution which could handle arbitrary vec value
sizes would also require additional mechanisms, such as immediately invoked lambdas
or index sequences. Rather than requiring every user to
write their own intrinsic call handlers, we can abstract the general mechanism
into something that is easily reused. In particular, we want to break a set of vec value arguments into smaller pieces, call an intrinsic on each, and then
glue the results of those intrinsic calls back together. We achieve this through
a proposed function called simd::chunked_invoke which we will describe in the remainder
of this paper.
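To make the mechanism concrete, the decompose/invoke/concatenate pattern can be sketched in plain C++ with std::array chunks standing in for the simd types. This is only an illustrative sketch: chunked_invoke_sketch is a hypothetical name, the arrays are stand-ins for vec, and tail chunks are deliberately omitted.

```cpp
#include <array>
#include <cstddef>

// Sketch of the chunked-invoke mechanics: split each argument into
// N-element chunks, invoke fn on corresponding chunk pairs in ascending
// index order, and concatenate the per-chunk results into the output.
template <std::size_t N, typename T, std::size_t Size, typename Fn>
std::array<T, Size> chunked_invoke_sketch(Fn fn,
                                          const std::array<T, Size>& a,
                                          const std::array<T, Size>& b) {
    static_assert(Size % N == 0, "sketch omits tail-chunk handling");
    std::array<T, Size> out{};
    for (std::size_t i = 0; i < Size / N; ++i) {
        std::array<T, N> ca{}, cb{};
        for (std::size_t j = 0; j < N; ++j) {   // like chunk<N>(a), chunk<N>(b)
            ca[j] = a[i * N + j];
            cb[j] = b[i * N + j];
        }
        std::array<T, N> r = fn(ca, cb);        // invoke the callable on chunk i
        for (std::size_t j = 0; j < N; ++j)     // like simd::cat of the results
            out[i * N + j] = r[j];
    }
    return out;
}
```

The proposed facility additionally handles tail chunks and heterogeneous vec/mask arguments, which this sketch omits for brevity.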
4. Description of simd::chunked_invoke
The chunked_invoke function is rather like the standard std::invoke in that it takes a
callable object and a set of arguments. Its basic signature is as follows:
template<typename Fn, typename... Args>
auto simd::chunked_invoke(Fn fn, Args...);
The fn parameter should be a callable that accepts arguments which can be
used to invoke an intrinsic from a native register. For example, to continue the example from above,
we could create a utility function which calls the intrinsic on native-sized values:
inline auto native_addsub(vec<float> lhs, vec<float> rhs) {
    auto nativeLhs = static_cast<__m256>(lhs);
    auto nativeRhs = static_cast<__m256>(rhs);
    return vec<float>(_mm256_addsub_ps(nativeLhs, nativeRhs));
}
Given this wrapper for native-sized intrinsic calls, we can now use chunked_invoke to break down a large vec value into individual register-sized
calls:
auto addsub(vec<float, 32> x, vec<float, 32> y) {
    return simd::chunked_invoke(native_addsub, x, y);
}
The chunked_invoke function accepts any number of vec arguments and will
break each one down into native-sized pieces in a way which is equivalent to calling simd::chunk on each argument. Respective chunked pieces of each argument are
then used to invoke the supplied function argument. On completion, the individual
results are glued back together using simd::cat to produce a single vec/basic_mask value result.
Let’s look at how the example from § 3 Background looks with chunked_invoke applied,
using a before/after table:
| Before | After |
|---|---|
| auto [lowA, highA] = chunk<vec<float>>(a); auto [lowB, highB] = chunk<vec<float>>(b); auto resultLow = vec<float>(_mm256_addsub_ps(static_cast<__m256>(lowA), static_cast<__m256>(lowB))); auto resultHigh = vec<float>(_mm256_addsub_ps(static_cast<__m256>(highA), static_cast<__m256>(highB))); return simd::cat(resultLow, resultHigh); | return simd::chunked_invoke(native_addsub, a, b); |
By default, the chunked_invoke function will use the native size for the element type, with the
aim of calling the intrinsic with the largest permitted builtin type. However, chunked_invoke can also be explicitly given the size of block to use. For
example, the following code breaks the inputs down into pieces of size 4 instead (and also
supplies the callable as a local lambda, to make the function self-contained):
auto addsub(vec<float, 32> x, vec<float, 32> y) {
    auto do_native = [](vec<float, 4> lhs, vec<float, 4> rhs) {
        return vec<float, 4>(_mm_addsub_ps(static_cast<__m128>(lhs),
                                           static_cast<__m128>(rhs)));
    };
    return simd::chunked_invoke<4>(do_native, x, y);
}
Being able to define a different size is useful for two reasons:
- We may wish to process data in smaller input sizes than native size. For example, if the data needs to be upconverted to a larger element size for the operation, then it can be useful to choose a smaller block size to begin with so that the upconverted data is no larger than a native register. This could be more efficient than accepting native register-sized data and then upconverting to several registers.
- If the element types of the arguments are different then the simd::chunked_invoke function cannot determine the appropriate native size for itself. For example, if the first argument has float elements and the second argument has int8_t elements then the number of elements to use in the calls cannot be inferred, and the call to simd::chunked_invoke must be told how many elements to use in each block.
So far we have only considered what happens when chunked_invoke is given vec arguments whose sizes are multiples of some native register size, but as
we saw in the introduction, it might be useful to call chunked_invoke on vec values of arbitrary size. For example, suppose that
the add/sub is being called on a vec value with 19 elements. In that case, chunked_invoke on
a target with a native size of 8 elements would need to break
the calls down into pieces of sizes 8, 8, and 3 respectively, and the callable
would need to be able to handle vec values of arbitrary size. The following
example shows how this might work:
auto addsub(vec<float, 19> x, vec<float, 19> y) {
    // Invoke the most appropriate intrinsic for the given simd types.
    auto do_native = []<typename T, typename ABI>(basic_vec<T, ABI> lhs,
                                                  basic_vec<T, ABI> rhs) {
        if constexpr (basic_vec<T, ABI>::size() <= 4)
            return vec<float, 4>(_mm_addsub_ps(static_cast<__m128>(lhs),
                                               static_cast<__m128>(rhs)));
        else
            return vec<float, 8>(_mm256_addsub_ps(static_cast<__m256>(lhs),
                                                  static_cast<__m256>(rhs)));
    };
    return chunked_invoke(do_native, x, y);
}
In this example the local lambda function can accept inputs of any size,
and will choose the most appropriate intrinsic to use. For example, given the
block of 3 tail elements the lambda will convert the 3 elements to an __m128 register and call _mm_addsub_ps. This lambda function can then be
called by chunked_invoke to enable an arbitrarily sized set of vec arguments to be mapped onto their underlying intrinsics. For the example above,
the following code was generated:
vmovups   ymm0, ymmword ptr [rsi]
vmovups   ymm1, ymmword ptr [rsi+32]
vmovups   xmm2, xmmword ptr [rsi+64]
mov       rax, rdi
vaddsubps ymm0, ymm0, ymmword ptr [rdx]
vaddsubps ymm1, ymm1, ymmword ptr [rdx+32]
vaddsubps xmm2, xmm2, xmmword ptr [rdx+64]
vmovups   ymmword ptr [rdi], ymm0
vmovups   ymmword ptr [rdi+32], ymm1
vmovups   xmmword ptr [rdi+64], xmm2
Notice how the load, addsub and store instructions work respectively in ymm, ymm and xmm registers to cope with the different sizes.
Having now described the basic operation of simd::chunked_invoke we can consider some
of the rules for using it, and a useful extension to it which makes certain
scenarios easier to deal with.
4.1. Using indexed Callable invocations
When a large vec value is broken down into pieces to invoke the callable
function, the size of each piece can be obtained from the callable function
invocation’s parameter type, but it can also be useful to pass in the index of
that piece too. For example, suppose a function is invoking an intrinsic to
perform a special type of memory store operation. Each register-sized sub-piece
of the incoming vec needs to know its offset so that it can be written to the
correct pointer offset. The following example code illustrates how this could
happen:
auto special_memory_store(vec<float, 32> x, float* ptr) {
    // Invoke the most appropriate intrinsic for the given simd types.
    auto do_native = [=]<typename T, typename ABI>(basic_vec<T, ABI> data,
                                                   auto idx) {
        _mm256_special_store_ps(ptr + idx, // NEED TO USE THE OFFSET HERE
                                static_cast<__m256>(data));
    };
    chunked_invoke(do_native, x);
}
The chunked_invoke function can probe the Callable that it is given to determine whether it
will accept an offset as its last parameter. If the extra parameter can be
accepted then the offset of the sub-piece within the parent vec will be
passed in too (i.e., 0, 8, 16 and 24 in this example).
Note in this example that the Callable does not return a value itself as it is writing to memory.
4.1.1. Design option - avoiding probing the index capabilities
In the first revision of this paper we proposed that the function which invokes
a Callable with an explicit offset should be called out with a suffix
(e.g., invoke_indexed). This name makes it clear that it expects
the Callable to have the extra parameter. However, a precedent has been set
elsewhere in the library to allow Callables to be probed for their
capabilities rather than naming the function to call it out (e.g., the Callable can
optionally take an index parameter), so we have now followed suit here.
4.2. Considerations in using simd::chunked_invoke
These are a set of considerations for using simd::chunked_invoke. In the following, a vec-or-mask type is a type that is either a basic_vec or a basic_mask.
Argument types: All the arguments that are passed to the callable function must satisfy simd-vec-or-mask-type. The arguments can be a mixture of vec or mask values. Other non-simd types are disallowed.
Chunk decomposition: The chunk size template parameter is forwarded from chunked_invoke to simd::chunk to decompose each argument into appropriately sized pieces. The
chunk sizes for each argument are exactly the same as those produced by simd::chunk, including the handling of tail (remainder) elements.
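As an illustration of the decomposition rule, these hypothetical helpers compute the chunk count and per-chunk sizes that a chunk<N> decomposition of a Size-element value would produce, including the smaller tail chunk:

```cpp
#include <cstddef>

// Number of chunks a Size-element value decomposes into with chunk size n:
// full chunks of n elements, plus one tail chunk if n does not divide size.
constexpr std::size_t num_chunks(std::size_t size, std::size_t n) {
    return (size + n - 1) / n;                 // ceiling division
}

// Size of chunk i: n for every chunk except possibly the last, which
// holds the size % n remainder elements when the division is inexact.
constexpr std::size_t chunk_size(std::size_t size, std::size_t n,
                                 std::size_t i) {
    return (i + 1 == num_chunks(size, n) && size % n != 0) ? size % n : n;
}
```

For example, 19 elements with a chunk size of 8 decompose into pieces of sizes 8, 8, and 3, matching the vec<float, 19> example above.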
No prototype-based chunking: Unlike simd::chunk, which offers a prototype parameter to control chunking ABI
(e.g., to force chunks into a specific ABI type), chunked_invoke intentionally does not provide this feature. In practice, the chunking performed
by chunked_invoke automatically preserves the ABI of each input
argument already, as the chunk types are formed from the argument's own type, thus retaining their original ABI. Supporting
prototype-based chunking for multiple arguments would require users to specify a
prototype for each argument, which is both complicated and error-prone. As a
convenience utility, chunked_invoke is intentionally limited to common,
safe use cases. For advanced scenarios that demand explicit ABI control or
distinct chunking strategies per argument, users should write custom,
case-specific code rather than rely on this facility.
Matching argument sizes: When multiple arguments are provided, each must have
the same number of elements. This ensures that chunking each argument produces the same number of chunks, with corresponding chunks having matching
sizes.
Native size deduction: For the native block size to be deduced, all the argument types
must have the same native size. This ensures that when the arguments are broken into pieces they will always map to the
same respective sizes. If the size cannot be deduced like this then the user
must explicitly supply the block size.
Callable flexibility: The callable fn provided to chunked_invoke must be able to accept chunk
arguments of any size produced during decomposition, including smaller tail
chunks when the total size is not a multiple of the chunk size. Failure to
support all possible chunk sizes will result in ill-formed code.
Non-modifiable arguments: The arguments passed to chunked_invoke are
not modifiable through the Callable. Each argument is decomposed using simd::chunk, which produces independent values rather than references into the
original argument. The Callable receives these chunk values as its parameters.
Any modifications the Callable makes to its parameters affect only those local
copies and do not propagate back to the original arguments of chunked_invoke.
During LEWG review, the question was raised of whether the Callable should be
permitted to modify its arguments in a way that propagates back to the original arguments. The behaviour of chunked_invoke is
defined as if all arguments are fully decomposed via simd::chunk before any
invocation of the Callable begins. Since the chunks are independent values,
writes to the Callable’s parameters are naturally invisible to the caller and do
not constitute undefined behavior; they simply have no effect. This definition also
means that an implementation is free to decompose and invoke one chunk at a time
rather than materializing all chunks up front, since the observable behaviour is
identical in either case. Supporting true modification of the original arguments
would break this equivalence, as earlier invocations could alter data not yet
chunked, forcing all implementations to materialize every chunk before the first
invocation. This would add both specification complexity and runtime overhead
for a feature that has not been needed in practice. Users who need to modify
SIMD data in place should use simd::chunk and simd::cat directly.
Invocation order: The callable order is defined to be in increasing order of chunk index (i.e., from [0..NumChunks)) to allow the invoked function to accumulate state in a predictable way.
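The ordering guarantee can be illustrated with a stateful callable. In this sketch (std::array chunks standing in for the simd types, for_each_chunk_in_order a hypothetical helper) the accumulated result depends on visiting chunks in ascending index order:

```cpp
#include <array>
#include <cstddef>

// Visits N-element chunks of a strictly in ascending index order, as the
// invocation-order rule requires, so a stateful callable sees chunk 0
// first, then chunk 1, and so on.
template <std::size_t N, typename T, std::size_t Size, typename Fn>
void for_each_chunk_in_order(Fn fn, const std::array<T, Size>& a) {
    static_assert(Size % N == 0, "sketch omits tail-chunk handling");
    for (std::size_t i = 0; i < Size / N; ++i) {   // i = 0, 1, 2, ...
        std::array<T, N> c{};
        for (std::size_t j = 0; j < N; ++j) c[j] = a[i * N + j];
        fn(c);
    }
}
```

A callable that mixes each chunk into an accumulator (e.g., acc = acc * 10 + chunk_sum) produces a different value under any other visitation order, which is why the proposal pins the order down.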
Return type requirement: When the Callable returns a value, it must return a vec or basic_mask object, and all return values must be of the same element type. The results will be merged together using simd::cat. Alternatively, for lambdas which have side-effects, void can be used as the return type, in which case chunked_invoke will also return void.
Return size flexibility: When the Callable returns a vec object, it need not have the same size as
its input arguments. For example, the Callable function could perform an
operation like a partial horizontal add (e.g., the addition of adjacent elements), resulting in fewer output values. The only requirement is that the return type of the Callable must be consistent across all chunk invocations, and that the results can be concatenated back together.
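A size-reducing callable of this kind can be sketched as follows, again with std::array standing in for the simd types (pairwise_add_chunked is a hypothetical name): each chunk of N elements collapses to N/2 adjacent-pair sums, and the per-chunk results concatenate into a half-size output.

```cpp
#include <array>
#include <cstddef>

// Per-chunk partial horizontal add: each N-element chunk produces N/2
// sums of adjacent pairs, and the per-chunk results are concatenated,
// yielding a result half the size of the input.
template <std::size_t N, typename T, std::size_t Size>
std::array<T, Size / 2> pairwise_add_chunked(const std::array<T, Size>& a) {
    static_assert(Size % N == 0 && N % 2 == 0,
                  "sketch assumes even, exactly dividing chunks");
    std::array<T, Size / 2> out{};
    for (std::size_t i = 0; i < Size / N; ++i)
        for (std::size_t j = 0; j < N / 2; ++j)   // callable body for chunk i
            out[i * (N / 2) + j] = a[i * N + 2 * j] + a[i * N + 2 * j + 1];
    return out;
}
```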
5. Wording
The wording diff is against the current C++ Working Draft.
5.1. Modify [simd.syn]
Insert new exposition-only concepts after the existing simd-vec-type concept:
template<class V>
concept simd-mask-type =                        // exposition only
  same_as<V, basic_mask<mask-element-size<V>, typename V::abi_type>> &&
  is_default_constructible_v<V>;

template<class V>
concept simd-vec-or-mask-type =                 // exposition only
  simd-vec-type<V> || simd-mask-type<V>;
5.2. Add a new section to [simd.syn]
template<simd-floating-point V>
  rebind_t<complex<typename V::value_type>, V>
    polar(const V& x, const V& y = {});
template<simd-complex V>
  constexpr V pow(const V& x, const V& y);

// [simd.chunked.invoke] chunked_invoke utility function
template<simd-size-type N = see below, class Fn,
         simd-vec-or-mask-type Arg0, simd-vec-or-mask-type... Args>
  constexpr auto chunked_invoke(Fn fn, const Arg0& first_arg,
                                const Args&... other_args);
5.3. Add new section [simd.chunked.invoke]
chunked_invoke utility function [simd.chunked.invoke]

template<simd-size-type N = see below, class Fn,
         simd-vec-or-mask-type Arg0, simd-vec-or-mask-type... Args>
  constexpr auto chunked_invoke(Fn fn, const Arg0& first_arg,
                                const Args&... other_args);

Let:
- N be set to simd::vec<typename Arg0::value_type>::size() if the caller does not provide a value for N.
- NumChunks be the number of tuple elements in the result of calling chunk<N>(first_arg).
- ArgChunks(A, i) be a function which returns the i-th element of chunk<N>(A), with i in the range [0..NumChunks).

Mandates:
- ((Arg0::size() == Args::size()) && ...) is true.
- The result type of fn is void or satisfies simd-vec-or-mask-type.
- The callable fn shall be invocable for every combination of chunk argument types that may be produced, including tail chunks of sizes less than N.

Effects: The behavior is as if all arguments are fully decomposed via simd::chunk<N> before any invocation of the Callable begins. For each i in the range [0..NumChunks), in increasing order of i and exactly once each, the Callable function fn is called with the following arguments:
- ArgChunks(first_arg, i) for the first argument, and
- ArgChunks(other_args, i) for each of the other arguments.

If fn is a Callable which is well-formed when given an additional parameter of type simd-size-type, then the last parameter will be set to the value i * N, otherwise fn will be called without that extra parameter.

[Note: If the callable fn is invocable both with and without the chunk index, the form accepting the index as an additional trailing argument is selected. Care should be taken to avoid ambiguous overloads or call signatures. — end note]

[Note: The chunk arguments passed to fn are independent values. Any modifications fn makes to its parameters do not propagate back to the original arguments of chunked_invoke. An implementation is permitted to decompose and invoke one chunk at a time rather than materializing all chunks up front, as the observable behavior is identical. — end note]

Returns: If fn returns void, then chunked_invoke returns void; otherwise returns simd::cat(r0, r1, ..., rNumChunks-1), where ri is the result of the i-th call to fn as described above.

Remarks: If the chunk size N does not divide the argument size exactly, the last chunk may be smaller. This mirrors the behavior of simd::chunk.