1. Revision History
1.1. R3 → R4
Updated after SG6 review at Croydon:
- Made customization points part of the proposal, rather than optional extensions.
- Retained the closed-list approach for built-in vectorizable types rather than replacing it with a trait-based definition, to preserve the semantic distinction and avoid unintended consequences.
- Explored how to customise max/min reductions.
- Removed the alignof constraint.
- Added stronger reasoning for banning pointers.
- Clarified the architectural boundary for heterogeneous operations, noting that support for quantities and units libraries should be deferred.
- Made to_integer and to_underlying part of the proposal rather than optional extensions.
- Noted the reason for choosing fixed element sizes.
1.2. R2 → R3
- Fix wording issues
- Fix rendering issues
1.3. R1 → R2
- Changed approach from customization-focused to trait-based constraints
- Moved customization points to design alternative section
- Provided many implementation examples
1.4. R0 → R1
- Incorporated SG1 and SG6 feedback from 2024 Tokyo meeting
- Added restrictions on element types
- Added inferencing as valid method for constructing simd operators
- Changed from opt-in to opt-out mechanism
2. Introduction
The C++ standard library includes data-parallel types in the <simd> header, currently restricting element types to a closed list of built-in vectorizable types: arithmetic types and std::complex specializations. This paper proposes a minimal change to the specification in which this list is extended. Firstly, std::byte is added to the closed list as a standard library type with fixed semantics that the implementation handles directly. Secondly, trait-based constraints define a second category of user-defined vectorizable types as any type that is trivially copyable, of appropriate size, and not explicitly opted out. All existing built-in vectorizable types retain their current semantics unchanged, while support naturally extends to enumerations and user-defined types.
Although the change is fairly minimal, this paper thoroughly explores the implications of the changes, including detailed design of type constraints, operator semantics, conversions, and implementation experience. This comprehensive approach is in response to committee feedback requesting evidence that the approach works in practice and careful consideration of edge cases, particularly around type conversions and compiler optimization capabilities.
2.1. Evolution and Design Foundation
Earlier revisions of this proposal focused on providing explicit customization mechanisms for user-defined types. Committee feedback encouraged us to explore element-wise inference instead, making use of the Working Draft’s wording in which everything is defined in terms of element operations and element-wise application of those operations. This led to a key question: can modern compilers effectively auto-vectorize element-wise operations on user-defined types? Our investigation showed that leading optimizing compilers can indeed do this remarkably well, and this observation became the foundation of our design.
By relying on compiler optimization, we can open simd to user-defined types without requiring customization points for basic operations. This meant we could achieve the desired functionality by simply changing which types are allowed to be elements (i.e., what a vectorizable type is), without modifying operation semantics. The elegance of this approach is that changing only the gate-keeper logic provides the extension we need to support not only user-defined types, but other useful types like enumerations.
During the last committee meeting, concerns were raised about the performance implications of this approach - what if compilers failed to vectorize the code? To address these concerns we implemented our proposal in Intel’s implementation and tested it across multiple generations of Intel architectures with various user-defined types, enumerations, strong typedefs, and specialized DSP types (saturating arithmetic and fixed-point). Implementation experience (§ 7 Implementation Experience) demonstrates that with current leading compilers (Clang and Intel oneAPI), these types can generate assembly identical to built-in arithmetic types for standard operations. This proves the approach is viable. Compilers that don’t yet optimize as well will improve over time.
2.2. What This Proposal Enables
This proposal allows simd to support user-defined types, std::byte, enumerations, and other types beyond the current closed list of vectorizable types. The key requirement is that element-wise application of the scalar operations makes sense for the type.
Examples of types that become vectorizable:
- User-defined types for type safety: strong typedefs that wrap primitives, such as struct Meters { float value; };
- Enumerations: type-safe alternatives to raw integers, such as enum class Color : uint32_t { Red, Green, Blue };
- std::byte: for packet processing and binary data manipulation
- Specialized arithmetic types: saturating integers, fixed-point numbers with custom operators
- Simple aggregates: small value types with element-wise semantics, such as struct RGBA { uint8_t r, g, b, a; };
What these types share is that the desired SIMD behavior is straightforward: if a type T has operator+, then vec<T> should provide element-wise operator+ with the same semantics. The scalar operations on T define what vec<T> should do.
We did find that while element-wise inference works well for most operations (arithmetic, comparisons, permutations, broadcasts), it can occasionally struggle with complex algorithms like reductions or user-defined operators containing branching. To address this, we propose an ADL-based customization mechanism, covering both operations and conversions, that allows users to provide optimized implementations for specific operations while maintaining element-wise inference as the default. This hybrid approach provides a solid foundation that works well in practice while enabling targeted optimization when necessary.
This proposal does NOT address heterogeneous type operations where operands have different types and produce a third type (e.g., dimensional analysis where Meters / Seconds -> Speed). Such operations represent a fundamentally different design space requiring type-level computation and are explicitly out of scope.
2.3. Core Proposal
The core idea of our proposal is to extend the set of vectorizable types by adding a second category of user-defined vectorizable types alongside the existing built-in vectorizable types. A type is a user-defined vectorizable type if:
- std::is_trivially_copyable_v<T> is true
- sizeof(T) is 1, 2, 4, 8, or 16
- std::disable_vectorization<T> is false
The built-in vectorizable list is retained rather than replaced because it carries meaning throughout the working draft. Retaining it also makes visible the semantic distinction between built-in vectorizable types (which have privileged behavior such as math function support and value-preserving conversion rules) and user-defined vectorizable types, which have different rules. The trait-based constraints extend the set of vectorizable types; they do not redefine it.
All existing built-in vectorizable types satisfy the trait-based constraints, so the two categories are consistent. We change only which types are allowed; operator behavior remains element-wise application as currently specified. User-defined vectorizable types work exactly like built-in vectorizable types of the same size, where an operation is available for vec<T> if and only if it exists for T.
We did notice that it will be necessary to tighten the wording of some operator constraints to explicitly require appropriate return types for user-defined types. This prevents certain classes of errors and performance traps. The constraints distinguish between promotable types (arithmetic types and unscoped enumerations, which may undergo integer promotion) and all other vectorizable types (including std::byte, scoped enumerations, std::complex, and user-defined types), which should return the exact type. For example, std::int16_t + std::int16_t produces an int due to integer promotion, requiring lenient checking that allows explicit conversion back. In contrast, user-defined vectorizable type operators must return the correct type directly to prevent subtle bugs. This doesn’t affect existing built-in vectorizable types, but ensures user-defined vectorizable types behave correctly.
Everything else in the proposal stays the same. All operations and their semantics, performance characteristics, ABI selection, and existing code remain unchanged.
2.4. Scope and Future Directions
2.4.1. In Scope: Element-wise Semantics
This proposal maintains exact semantic parity with existing operations. All operators require operands of basic_vec<T, Abi> and return basic_vec<T, Abi> (or basic_mask for comparisons), exactly as simd does today. The only change is expanding which types are permitted as elements in a basic_vec by adding std::byte to the built-in vectorizable types, and introducing trait-based constraints for user-defined vectorizable types.
This design immediately enables important use cases:
- Type-safe dimensional types that maintain scalar semantics
- Enumeration processing
- std::byte for binary data processing
- Domain-specific numeric types (saturating, fixed-point, custom precision)
- Future numeric types (bfloat16, float8, and other emerging formats)
Beyond user-defined types, the trait-based approach future-proofs for numeric type evolution. Compiler builtins, emerging standard types like std::float16_t or std::bfloat16_t, and vendor-specific formats automatically work without requiring standard amendments. As hardware evolves for machine learning and scientific computing, new numeric types integrate seamlessly into simd.
The trait-based gatekeeper change provides substantial value independently, enabling these use cases without requiring the committee to solve significantly harder problems.
2.4.2. Deliberately Out of Scope: Heterogeneous Operations
Heterogeneous type operations, where operands of two different types produce a result of a third type, are deliberately excluded from this proposal. This proposal is purely additive in which types are permitted as elements of basic_vec and doesn’t change their operator semantics. All operations remain homogeneous, where operands and result share a single element type, exactly as they are for simd today.
Heterogeneous operations would require changing simd itself and would introduce fundamentally different design problems that do not arise in this proposal:
- Type-level computation: The result type of simd<Length> / simd<Time> must be computed as simd<Speed>. This requires a type algebra that is defined by a units library, not by simd. Different units libraries may define different type algebras.
- Operator template explosion: Every binary operator would need to be templated over pairs of element types, with constraints determining which combinations are valid. This is a significant specification and implementation burden.
- ABI reconciliation: If the input types have different elements then it isn’t clear what conversions should take place to resolve them (e.g., what result type to produce from a mixed-type operator). This is a policy decision that belongs to the type algebra (e.g., a units library), not to simd. Note that this is consistent with current practice where simd<int> + simd<double> is not directly supported either, but can be achieved through conversion of one of its inputs. This proposal maintains that existing behaviour.
- Ownership of semantics: The meaning of Meters / Seconds -> Speed is defined by a units library such as mp-units ([P3045R4]), not by simd. It would be inappropriate for simd to encode or assume any particular type algebra. The units library should control these semantics, and may even choose to have a quantity whose type was a simd, rather than a simd whose element type is a quantity.
Note: Conversions between types (e.g., Feet to Meters) do not constitute heterogeneous operations. A conversion produces a new simd of the target type by construction, not a binary operation. Conversions use static_cast element-wise, delegating to whatever scalar conversion the type author has defined. This requires no type algebra and no changes to simd’s operation semantics. Users can convert first and then operate, achieving heterogeneous workflows through composition of homogeneous operations.
Heterogeneous operations are a topic that merits its own proposal, ideally authored in collaboration with domain experts in dimensional analysis and units libraries who understand the requirements. Such a proposal would apply to all element types (including arithmetic types such as int and double), not just user-defined ones.
2.4.3. Forward Compatibility
The current design is fully forward-compatible with future heterogeneous operations. Adding template overloads such as:
template <typename T, typename U, typename Abi>
friend basic_vec</* computed result type */, Abi>
operator+(const basic_vec<T, Abi>&, const basic_vec<U, Abi>&);
would not conflict with existing homogeneous operators - it would simply add new overloads to the existing set. The trait-based vectorizable definition in this proposal works unchanged with such future extensions.
2.4.4. Future Extension: Heterogeneous Operations
A future proposal could extend simd to support heterogeneous operations where operand and result types differ. Such a proposal should be developed in collaboration with the authors of quantities and units libraries (e.g., [P3045R4]) and would need to address type algebra, ABI reconciliation, and the question of which library (simd or the units library) owns the semantics of cross-type operations. Similar considerations apply to other standard library types and to linear algebra types, where operations may produce different result types.
3. Motivation
The current restriction to a closed list of vectorizable types prevents several valuable use cases that would naturally benefit from SIMD parallelism, including strong typedefs for physical units, enumerations for state machines and flags, std::byte for low-level data processing, and small compound types for structure-of-arrays patterns. This section presents motivating examples.
3.1. Type Safety and Strong Typedefs
Physical units, identifiers, and other domain-specific types are commonly wrapped in strong typedefs to prevent semantic errors:
struct Meters { float value; };
struct Seconds { float value; };

// Type safety at scalar level
Meters distance{100.0f};
Seconds time{5.0f};
// Meters m = time;          // Error: type mismatch

// Same type safety should extend to parallel code
vec<Meters> distances = {100.0f, 200.0f, 150.0f, 180.0f};
vec<Seconds> times = {5.0f, 10.0f, 7.5f, 9.0f};
// vec<Meters> m = times;    // Should also be error
Currently users who wish to put these strong types into a simd would need to unpack them to float, losing type safety precisely where parallel operations occur. This proposal preserves type safety uniformly.
3.2. Signal and Media Processing Types
Specialized domains use custom numeric types optimized for their workloads:
// Fixed-point arithmetic for digital signal processing
struct fixed_point_16s8 {
    std::int16_t data;
    fixed_point_16s8 operator+(fixed_point_16s8 rhs) const {
        return fixed_point_16s8{saturate_add(data, rhs.data)};
    }
    // Other operators...
};

// Should work with vec
vec<fixed_point_16s8> samples = load_audio_samples();
auto processed = apply_filter(samples); // Element-wise fixed-point operations
The proposal allows simd to provide its parallel infrastructure (loads, stores, masking, permutations, reductions) while deferring arithmetic to the user-defined type’s operators.
3.3. Enumerations
Enumerations are essentially restricted integer types with named values. They are widely used for state machines, flags, and encoded data. Vectorizing enumerations enables batch processing of such data.
enum class Color : std::uint32_t { Red, Green, Blue, Alpha };

vec<Color> pixel_channels = /* ... */;
auto masked = pixel_channels & Color::Alpha; // Bitwise operations on scoped enums
Scoped enums (enum class) only allow operations that are valid for the enum itself (typically bitwise operations, comparisons, and conversions), while unscoped enums allow arithmetic operations through implicit conversion to their underlying type. The element-wise application mechanism automatically respects these restrictions without any special handling.
3.4. std::byte
std::byte is a distinct type representing raw byte data, commonly used in low-level programming. Vectorizing std::byte enables efficient byte-level operations such as encryption, checksums, and encoding.
vec<std::byte> data = /* load from buffer */;
auto encrypted = data ^ vec<std::byte>{std::byte{0xFF}}; // XOR cipher
3.5. Compound Types
Small compound types that fit in 16 bytes can be vectorized as atomic units, enabling structure-of-arrays patterns and packet processing of multiple values simultaneously.
// Coordinate pairs
vec<std::pair<int, int>> coordinates;

// RGBA color pixels
vec<std::array<std::uint8_t, 4>> pixels;
4. Understanding Type Constraints
To ensure user-defined types work correctly with simd, we impose constraints that match hardware capabilities and prevent subtle bugs. In summary the constraints are:
- Trivially copyable
- Size: must be 1, 2, 4, 8, or 16 bytes
- Opt-out mechanism via disable_vectorization
- Banned standard library types and categories (pointers, unions, cv-qualified, empty)
We now look in more detail at each of these constraints.
4.1. Trivially Copyable Constraint
We require std::is_trivially_copyable_v<T> to be true. Many operations move elements bitwise (permutations, broadcasts, gathers, scatters). For these to work correctly, an element’s value must be preserved when its bit pattern is copied. Trivially copyable types have no special copy, move, or destroy logic, so bitwise copying always produces correct results.
4.2. Size Constraint
We require sizeof(T) to be exactly 1, 2, 4, 8, or 16 bytes. All known hardware vector instruction sets support only power-of-2 element sizes. No shipping or announced vector ISA supports non-power-of-2 element widths. The largest current vectorizable type is std::complex<double> at 16 bytes.
An alternative design considered was to define the valid sizes as implementation-defined or derived from the sizes of existing vectorizable types. However, explicitly listing the sizes is simpler, and directly reflects the reality of all current hardware. If future hardware or standard types introduce new element widths, the list can be extended by a future standard revision — a straightforward, non-breaking change.
4.3. Padding and Bit Representation
SIMD operations treat element types as uninterpreted bit patterns of the specified size. If a user-defined type contains padding bytes (e.g., a struct containing a 2-byte member and a 1-byte member typically has sizeof 4 with one padding byte), simd is agnostic to which bits represent data versus padding. All bits are preserved through operations, with semantics determined solely by the element type’s operators. This is consistent with trivially copyable semantics.
4.4. Opt-Out Mechanism
The standard library uses a common pattern for selectively disabling features where a variable template can be specialized. For simd, this proposal adds std::disable_vectorization, which defaults to false but can be specialized to true for types that should not be vectorizable. This mechanism allows vectorization to be disabled for semantically inappropriate types which otherwise appear to permit it.
Users may specialize disable_vectorization for their own types, such as:
namespace my_lib {
    struct InternalType { std::uint64_t data; };
}

template <>
inline constexpr bool std::disable_vectorization<my_lib::InternalType> = true;
Specializations for cv-qualified or reference types are ill-formed.
4.5. Banned Standard Library Types
In addition to allowing the user to opt out of some types, the mechanism can also be used by the implementation to ban specific standard types and categories which have no meaningful vectorization semantics.
Type categories automatically banned:
- Pointer types (is_pointer_v<T> or is_member_pointer_v<T>): The direction of C++ is toward stronger type safety and away from raw pointer manipulation, as reflected in the adoption of std::span ([P0122R7]) and ongoing memory safety work ([P3081R0]). A simd of pointers would hold semantically unrelated addresses where element-wise arithmetic has no coherent meaning. The common use case of non-contiguous memory access is already well-served by storing integer indices in a simd and using gather/scatter operations against a base std::span, which is both safer and maps directly to hardware instructions.
- Union types (is_union_v<T>): Ambiguity about which member is active.
- CV-qualified types (is_const_v<T> or is_volatile_v<T>): Breaks assignment operators. (Note: cv-qualified vec objects like const vec<int> are permitted; the ban applies only to cv-qualified element types.)
- bool: Overlaps semantically with basic_mask, which is the intended mechanism for parallel boolean values.
- Empty types (is_empty_v<T>): Carry no data.
Standard library types:
// Types not caught by category rules
template <>
inline constexpr bool disable_vectorization<std::nullptr_t> = true;
template <>
inline constexpr bool disable_vectorization<std::source_location> = true;
template <class T, class Abi>
inline constexpr bool disable_vectorization<std::basic_vec<T, Abi>> = true;
template <class T, class Abi>
inline constexpr bool disable_vectorization<std::basic_mask<T, Abi>> = true;
Note that under these constraints, arrays (such as std::array), std::pair, and similar small aggregates are not banned, provided they satisfy the constraints. They can all be useful in their own way, such as representing vector-processing of packet processing patterns, structured data, and structure-of-array layouts. Even if these types do not provide arithmetic or mathematical operations, it is still useful to be able to use them for parallel load/store, masking, permutation and bit-level operations. Note also that the category rules catch many of the standard types that are disallowed (e.g., empty tag types are excluded by the empty-type rule).
This list is not exhaustive; implementations may provide additional specializations for other types where vectorization is semantically inappropriate.
4.6. std::byte as a built-in vectorizable type
Although std::byte is technically a scoped enumeration (enum class byte : unsigned char), this proposal adds it to the built-in vectorizable type list rather than treating it as a user-defined vectorizable type. This is justified because:
- It is defined by the standard library, not by users
- It has dedicated language support, including to_integer<>() in [cstddef.syn]
- Its operators (<<, >>, |, &, ^, ~) are explicitly specified by the standard library
- The implementation knows its exact semantics. There is nothing to infer through element-wise application
In these respects std::byte is analogous to std::complex: a standard library type with fixed, known semantics that the implementation can handle directly. Treating it as built-in ensures that implementations use optimized code paths for std::byte operations and never consult ADL customization points, consistent with how other built-in vectorizable types are handled.
4.7. Summary of Constraints
The constraints work together to ensure types are safe and efficient for vectorization:
- Trivially copyable enables bitwise element manipulation
- Power-of-2 size matches hardware vector capabilities
- Opt-out mechanism allows excluding inappropriate types
These enable user-defined types like Meters, Color, and fixed_point_16s8, while excluding pointers, unions, cv-qualified types, empty types, and opted-out types.
5. Operations on User-Defined Types
This section describes how operations work with user-defined element types. The key principle is element-wise application: operations on basic_vec<T> apply the corresponding operation on T to each element independently.
User-defined types are treated as atomic blocks of bits whose internal structure is not modified by simd operations. This proposal does not include struct-of-arrays conversions or layout transformations for user-defined types.
5.1. Operator Constraints
The specification provides operators conditionally using requires clauses. The working draft currently checks only that element-wise operations are valid expressions, without constraining return types. This proposal tightens these constraints to require appropriate return types, with different rules for promotable types versus all other vectorizable types.
For promotable vectorizable types (arithmetic types excluding bool, and unscoped enumerations) operators may return a promoted type (e.g., int16_t operands yield int), which is then explicitly converted back to the element type. This preserves existing behavior for built-in types.
For all other vectorizable types (scoped enumerations, std::byte, std::complex, and user-defined types), operators must return exactly the element type. This prevents subtle bugs where user-defined operators return incorrect types.
The constraints use exposition-only concepts that capture this two-tier checking:
template <typename T, typename BinaryOp>
concept supported-binary-op = // exposition only
    (is_arithmetic_v<T> || (is_enum_v<T> && !is_scoped_enum_v<T>)) // Is it promotable?
        ? requires(T a, T b) { BinaryOp{}(a, b); }
        : requires(T a, T b) { { BinaryOp{}(a, b) } -> same_as<T>; };
Return type requirements:
Arithmetic operators (+, -, *, /, %, &, |, ^, <<, >>, unary -, ~):
- For promotable types (arithmetic types excluding bool, and unscoped enums): Allow promotion with explicit conversion back
- For all other vectorizable types: Must return exactly value_type

Comparison operators (==, !=, <, <=, >, >=):
- Must return bool (no promotion of result type)
These requirements prevent size mismatches, avoid conversions that change semantics, and prevent performance traps from proxy types, while maintaining backward compatibility for arithmetic types.
Note: Comparison operators are not synthesized from each other, maintaining parity with existing behavior for built-in vectorizable types. For example, operator!= is not synthesized from operator==. This avoids introducing inconsistency with current semantics. Synthesis of comparison operators could be proposed separately as an enhancement to all types, not just user-defined ones.
Examples:
struct Meters {
    float value;
    Meters operator+(Meters rhs) const { return Meters{value + rhs.value}; }
    bool operator<(Meters rhs) const { return value < rhs.value; }
};

vec<Meters> a, b;
auto sum = a + b;    // ✅ OK: operator+ returns Meters
auto mask = a < b;   // ✅ OK: operator< returns bool

struct NoAdd { float value; };

vec<NoAdd> x, y;
auto result = x + y; // ❌ Error: operator+ not defined

struct DifferentReturn {
    int16_t value;
    int32_t operator+(DifferentReturn) const; // Changed return type
};

vec<DifferentReturn> v, w;
auto bad = v + w;    // ❌ Error: int32_t is not DifferentReturn
Compound assignments use the same constraints as their corresponding binary operators:
friend constexpr basic_vec& operator+=(basic_vec& lhs, const basic_vec& rhs)
    requires supported-binary-op<value_type, plus<>>; // Same as operator+
All six comparison operators continue to be independently specified.
The mask type is determined by the element type’s size, not its contents. Masks indicate active/inactive lanes for a group of bits of size sizeof(T). For any user-defined type, the mask semantics are identical to those of built-in vectorizable types of the same size - one mask bit per element, regardless of what data the element contains.
5.2. Conversions and Casts
Converting constructors use static_cast for element conversion:
// Element `i` is initialized with `static_cast<T>(v[i])`.
template <typename U>
explicit constexpr basic_vec(const basic_vec<U, Abi>& v)
    requires /* appropriate constraints */;
This naturally supports user-defined conversions:
struct Meters { float value; };
struct Feet {
    float value;
    operator Meters() const { return Meters{value * 0.3048f}; }
};

vec<Feet> feet = {3.0f, 6.0f, 9.0f, 12.0f};
vec<Meters> meters{feet}; // ✅ Works via conversion operator
The existing semantics handle all conversion scenarios without additional specification.
5.2.1. Value-Preserving Conversions
The working draft defines "value-preserving" only for conversions from arithmetic types: "The conversion from an arithmetic type U to a vectorizable type T is value-preserving if all possible values of U can be represented with type T" ([simd.general]). This definition is precise for arithmetic types but does not extend to user-defined types.
For conversions involving user-defined types, this proposal defers to the type author’s judgment as expressed through implicit versus explicit conversions:
For conversions between built-in vectorizable types: Use the existing value-preserving definition (e.g., int to double is value-preserving, but double to int is not).
For conversions involving at least one user-defined type: Use is_convertible_v to determine if the conversion may be implicit:
- If is_convertible_v<From, To> is true, the type author has declared the conversion safe via an implicit constructor, so simd allows it implicitly
- If is_convertible_v<From, To> is false but is_constructible_v<To, From> is true, the type author requires explicit, so simd also requires explicit conversion
Examples:
struct Meters {
    float value;
    Meters(float f) : value(f) {} // Implicit - author says it's safe
};

struct Feet {
    float value;
    explicit Feet(float f) : value(f) {} // Explicit - author says be careful
};

vec<float> vf = {...};
vec<Meters> v0 = vf;          // OK - Meters(float) is implicit
vec<Feet> v1 = vf;            // Error - Feet(float) is explicit
vec<Feet> v2 = vec<Feet>(vf); // OK - explicit construction

std::span<float, 1024> sf;

// OK - implicit conversion from float to Meters
auto m_vec = unchecked_load<vec<Meters, 8>>(sf);

// Error - implicit conversion from float to Feet not allowed
auto f_vec = unchecked_load<vec<Feet, 8>>(sf);

// OK - conversion from float allowed with flag_convert tag
auto f_vec2 = unchecked_load<vec<Feet, 8>>(sf, flag_convert);
This approach:
- Respects the type author’s design decisions about safety
- Maintains consistency with scalar usage patterns
- Avoids second-guessing the type author’s judgment
- Does not require simd to define value-preservation semantics for user-defined types
Same-type operations are unaffected - a broadcast copies, not converts, so these rules don’t apply.
Note that a type author could declare an implicit conversion that loses information (e.g., a wrapper type with narrower storage than its source type). However, this is the type author’s choice at the scalar level, and simd should not override that judgment. If the scalar user-defined type allows implicit lossy conversion, simd does too.
5.3. Maths Functions
The working draft provides mathematical functions such as sin, cos, exp, and sqrt for basic_vec, but constrains them to built-in floating-point and integer types. This proposal does not extend these functions to user-defined types. Unlike operators, which map directly to simple scalar operations that compilers can reliably auto-vectorize, maths functions typically involve internal loops, conditionals, and table lookups that would prevent the compiler from producing efficient vectorized code through element-wise inference. Rather than promising best-effort inference that risks a performance cliff, we disallow these functions for user-defined types entirely. Users who need them should provide their own implementations in their type’s namespace, where they will be found via ADL at unqualified call sites, following the established pattern of swap.
min and max are the exception. They are trivial (typically just a compare and a select) and widely used, making them good candidates for element-wise inference. They are the only mathematical functions for which we provide support for user-defined types. For built-in types, implementations use dedicated hardware instructions (e.g., minps, maxps on x86). For user-defined types, the implementation applies scalar min/max element-wise. If a user needs to provide an optimized vector-level implementation, the same ADL mechanism applies:
namespace my_lib {
    struct MyType { int16_t data; /* ... */ };

    // Found via ADL when called as unqualified min(a, b)
    template <typename Abi>
    basic_vec<MyType, Abi> min(const basic_vec<MyType, Abi>& a,
                               const basic_vec<MyType, Abi>& b) {
        return /* optimized implementation */;
    }
}
As with the familiar swap idiom, users should make unqualified calls to enable ADL. A qualified call will always use the standard element-wise implementation.
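The mechanics can be sketched at the scalar level. The names `my_lib`, `Length`, and `smallest` below are invented for illustration; the point is only that an unqualified call lets ADL find the type author’s overload, while a qualified `std::min` call cannot:

```cpp
#include <algorithm>
#include <cassert>

namespace my_lib {
  struct Length {
    double metres;
    friend bool operator<(Length a, Length b) { return a.metres < b.metres; }
    // Found via ADL when called as unqualified min(a, b):
    friend Length min(Length a, Length b) { return b < a ? b : a; }
  };
}

// Generic code using the two-step: ADL overload preferred, std::min fallback.
template<typename T>
T smallest(T a, T b) {
  using std::min;    // fallback for types with no ADL-visible min
  return min(a, b);  // unqualified: finds my_lib::min for Length
}
```

A qualified `std::min(a, b)` in `smallest` would bypass `my_lib::min` entirely, mirroring the behaviour described above for simd.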
5.4. Reductions
Reduction operations (e.g., reduce, reduce_min, reduce_max) apply an operation pairwise across elements:
// Applies `binary_op` pairwise to elements in unspecified order.
template<typename T, typename Abi, typename BinaryOp = std::plus<>>
constexpr T reduce(const basic_vec<T, Abi>& v, BinaryOp binary_op = {});
For reduce with a binary operation such as std::plus<>, the operation goes through the standard three-tier dispatch: built-in optimized path, customization, or element-wise inference. This means user-defined types benefit from customization automatically — if a user provides simd_operator for std::plus<>, reduce will use it.
Note: Reductions assume associativity. For types with non-associative operations, results may differ from sequential left-to-right reduction. This is consistent with floating-point behavior, where reduce may produce different results than sequential summation due to intermediate rounding. The working draft already specifies this behavior via preconditions on the binary operation.
struct ModularInt {
  int value;
  ModularInt operator+(ModularInt rhs) const {
    return ModularInt{(value + rhs.value) % 100};
  }
};

vec<ModularInt> v = {50, 30, 40, 20};
auto sum = reduce(v, std::plus<>{});  // Result: ModularInt{40}
// Could evaluate as
//   ((50+30)+40)+20 = (80+40)+20 = 20+20 = 40
//   (50+30)+(40+20) = 80+60 = 40
5.4.1. reduce_min and reduce_max
reduce_min and reduce_max differ from other reductions. The working draft specifies them in terms of comparisons ("the value of an element for which the comparison is false for all other elements"), not in terms of min/max. For user-defined types, the comparison can be customized via simd_operator with less<>, providing a path to optimized reductions.
However, this path is less direct than for other reductions. With reduce, the user customizes simd_operator for the binary operation and the reduction benefits immediately. With reduce_min, the user might expect the implementation to use their custom comparison internally, but the specification doesn’t require this connection.
We note that this specification choice conveniently accommodates the masked variants of reduce_min/reduce_max, which operate only on active lanes, without requiring masked min/max overloads on basic_vec.
We considered whether explicit customization points for min/max reductions are needed. The possible approaches are:
- Respecify min/max in terms of compare-and-select, routing through the existing simd_operator customization for comparisons. However, this would be a retrograde step for built-in types, where dedicated hardware min/max instructions outperform a compare-and-blend sequence.
- Introduce tag-based customization for min/max, analogous to how simd_operator uses std::plus<>, std::minus<>, etc. to identify operations. This would require new transparent function object types such as std::simd::min_op and std::simd::max_op, and the simd_operator dispatch would need to be extended to recognise them. The reductions would then also need to be respecified in terms of a masked form of that customized min/max, adding considerable specification complexity.
- Respecify the reductions in terms of an unqualified call to min/max, which would naturally find user-provided overloads via ADL. This is the simplest approach, but it would change how existing reductions on built-in types are specified, which is beyond the intended scope of this proposal.
Given that element-wise inference and ADL-based overloading provide functional (if imperfect) paths today, we defer this question until practical experience demonstrates a need.
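As a sketch of the element-wise path that exists today, a min-reduction over a user-defined type amounts to a per-lane compare-and-select loop. The code below is a toy model (std::array stands in for basic_vec; `reduce_min_elementwise` and `Meters` are names invented for this sketch):

```cpp
#include <array>
#include <cassert>

// Toy element-wise reduce_min fallback: only operator< on the element
// type is required, mirroring what element-wise inference relies on.
template<typename T, std::size_t N>
T reduce_min_elementwise(const std::array<T, N>& v) {
  T result = v[0];
  for (std::size_t i = 1; i < N; ++i)
    if (v[i] < result) result = v[i];  // per-lane compare-and-select
  return result;
}

struct Meters {
  float value;
  friend bool operator<(Meters a, Meters b) { return a.value < b.value; }
};
```

An ADL-visible `min` overload on the element type could replace the inner compare-and-select, which is precisely the imperfect-but-functional path the paragraph above refers to.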
5.5. Load and Store Operations
Load operations already specify element conversion via static_cast:
// Element `i` is initialized with `static_cast<T>(*std::next(first, i))`.
template<typename It>
constexpr basic_vec(It first, It last);
This naturally handles both same-type loads and converting loads via the static_cast mechanism (see § 5.2 Conversions and Casts for examples). Implementations may optimize by using vector loads followed by vector conversions rather than converting each element individually.
Store operations work similarly. No specification changes are needed.
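The element-wise semantics of the quoted constructor can be sketched with plain iterators. This is a toy model: std::array stands in for basic_vec and `load_convert` is an invented name:

```cpp
#include <array>
#include <cassert>
#include <iterator>
#include <vector>

// Toy converting load: element i is initialized with
// static_cast<T>(*std::next(first, i)), as in the quoted constructor.
template<typename T, std::size_t N, typename It>
std::array<T, N> load_convert(It first) {
  std::array<T, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = static_cast<T>(*std::next(first, i));
  return out;
}
```

A real implementation is free to replace this loop with a vector load plus a vector conversion, as noted above, since the observable result is the same.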
5.6. Copy Operations
Operations that move elements without interpreting values work on any trivially copyable type:
- permute, broadcast - rearrange elements
- compress, expand - conditional packing/unpacking
- select - conditional element selection
- chunk, cat - size/shape changes
These operate at the bit level and require no knowledge of element semantics. The trivially copyable constraint ensures they already work correctly for user-defined types.
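A toy permute over a trivially copyable UDT illustrates why no element semantics are needed. `Pixel` and the index-based signature below are invented for this sketch, with std::array standing in for basic_vec:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <type_traits>

struct Pixel { std::uint8_t r, g, b, a; };  // trivially copyable, sizeof == 4
static_assert(std::is_trivially_copyable_v<Pixel>);

// Toy permute: moves whole elements by index; never inspects their values.
template<typename T, std::size_t N>
std::array<T, N> permute(const std::array<T, N>& v,
                         const std::array<std::size_t, N>& idx) {
  std::array<T, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = v[idx[i]];  // a plain trivially-copyable element copy
  return out;
}
```

Because each element copy is just a bit copy, an implementation can lower this to byte-shuffle instructions regardless of what `Pixel` means.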
5.7. Implementation Considerations
This section briefly examines two important implementation considerations when supporting user-defined types in simd: exception safety and ABI selection.
All simd operations are declared noexcept in the working draft. This has important implications for user-defined types: if an element-wise operation throws an exception during a simd operation, std::terminate will be called.
This behavior is appropriate for SIMD code. Detecting and propagating exceptions on individual elements would require serializing the operation, checking each element’s result, and managing partial completion state. This fundamentally contradicts SIMD’s purpose of parallel execution. User-defined types intended for use in simd should have non-throwing operations, or accept that exceptions will terminate the program.
The noexcept specification means:

- Element-wise operations are not required to be noexcept themselves
- If they do throw during simd operations, std::terminate is called
- Users must ensure their types' operations don’t throw in practice
- This is consistent with SIMD being performance-critical code where exceptions are inappropriate
5.8. ABI Selection for User-Defined Types
ABI selection determines the vector width (number of elements) for a basic_vec object. For user-defined types, ABI selection is based solely on sizeof(T). A UDT of size N bytes is treated identically to built-in vectorizable types of size N for ABI purposes. This means:
struct A { int32_t x; };        // sizeof=4 → treated like int32_t for ABI
struct B { float f; };          // sizeof=4 → treated like float for ABI
struct C { uint8_t data[4]; };  // sizeof=4 → treated like int32_t for ABI
Any two types with the same size will receive the same ABI and therefore the same number of elements:
struct MyInt32 { std::int32_t value; };

vec<int> v1;      // Suppose this gets 512-bit vectors = 16 elements
vec<float> v2;    // Also 512-bit vectors = 16 elements (both 4 bytes)
vec<MyInt32> v3;  // Also 512-bit vectors = 16 elements (also 4 bytes)
Implementations select vector width based on element size to match hardware capabilities. This ensures consistent behavior and predictable performance characteristics across types of the same size.
6. Customization Points
In early revisions of this paper, we considered a design where all operations on user-defined types were implemented as customization points discovered via ADL. While allowing very precise user control, it also meant the user always had to provide implementations for every operation, even if the compiler could infer the same operation for itself. In later revisions we switched to a hybrid approach where element-wise inference is the default, and customization points are only consulted when users want to provide optimized implementations for specific operations.
There are two categories of customization: operations (e.g., addition, multiplication) and conversions (e.g., BFloat16 to float). Both are discovered via argument-dependent lookup in the namespace of the element type. Users do not inject declarations into namespace std. Because basic_vec<T, Abi> carries T’s associated namespaces, ADL naturally finds overloads declared alongside the element type definition.
6.1. Operation Customization
A single overloaded function name handles both unary and binary operations, distinguished by arity:
// In user’s namespace, discovered via ADL:
auto simd_operator(vec<T> v, Op op) -> vec<T>;              // Unary
auto simd_operator(vec<T> v1, vec<T> v2, Op op) -> vec<T>;  // Binary
The Op parameter is one of the standard transparent function objects (std::plus<>, std::minus<>, std::multiplies<>, std::divides<>, etc.) identifying which operation is being customized ([P4006] proposes adding transparent shift function objects to help here). This design allows users to customize individual operations by providing overloads for specific function objects while relying on element-wise inference for everything else.
For arithmetic and bitwise operations, the return type of simd_operator must be exactly basic_vec<T, Abi>. For comparison operations, the return type must be exactly basic_mask<T, Abi>. If ADL finds a simd_operator that returns a different type, it is not considered and element-wise inference is used instead.
When a simd operation is performed, the implementation uses a three-tier dispatch:
- Built-in vectorizable types: The implementation uses its own optimized code path. The simd_operator customization point is never checked. This ensures that built-in types always use the most efficient implementation and prevents users from accidentally overriding well-optimized library code.
- ADL simd_operator: For user-defined vectorizable types, if a valid simd_operator overload is found via ADL, it is used.
- Element-wise fallback: If no simd_operator is found, the implementation applies the scalar operator to each element independently, relying on compiler auto-vectorization.
In conceptual terms, the dispatch looks like this:
template<typename T>  // Built-in vectorizable types
  requires /*built-in-vectorizable*/ && supported-binary-op<T, std::plus<>>
friend basic_vec operator+(const basic_vec& lhs, const basic_vec& rhs) {
  return /* implementation-defined optimized implementation */;
}

template<typename T>  // User-defined vectorizable types
  requires /*user-defined-vectorizable*/ && supported-binary-op<T, std::plus<>>
friend basic_vec operator+(const basic_vec& lhs, const basic_vec& rhs) {
  if constexpr (requires { simd_operator(lhs, rhs, std::plus<>{}); }) {
    return simd_operator(lhs, rhs, std::plus<>{});  // Custom via ADL
  } else {
    return /* element-wise application */;          // Default
  }
}
For enumerations and user-defined types without customization, the simd_operator check fails at compile time and element-wise inference is used. Since enumerations without custom operators compile to simple integer arithmetic, element-wise inference produces optimal code.
Example: Providing a custom saturating add that maps directly to hardware instructions:
namespace my_lib {
  struct saturating_int16 {
    std::int16_t data;
    friend saturating_int16 operator+(saturating_int16 lhs, saturating_int16 rhs) {
      auto r = std::int32_t(lhs.data) + std::int32_t(rhs.data);
      return saturating_int16{std::clamp<int32_t>(r, -32768, 32767)};
    }
  };

  // Custom SIMD addition using native saturating instructions
  template<typename Abi>
  basic_vec<saturating_int16, Abi>
  simd_operator(const basic_vec<saturating_int16, Abi>& lhs,
                const basic_vec<saturating_int16, Abi>& rhs,
                std::plus<>) {
    // Implementation can use platform-specific intrinsics
    // e.g., _mm256_adds_epi16 on x86
    return /* optimized saturating add */;
  }
}
Without the customization, element-wise inference would apply the scalar operator+ to each element. As shown in the implementation experience section (§ 7 Implementation Experience), leading compilers can often auto-vectorize such operations into the same hardware instructions. The customization point provides a guarantee of optimal code generation regardless of compiler sophistication.
6.2. Conversion Customization
Conversions between basic_vec types use a separate customization point, simd_convert, with a tag-based dispatch pattern. The convert_to_t tag type and its associated variable template convert_to are provided as part of the public API:
// Provided by the simd library:
template<typename T>
struct convert_to_t {
  using type = T;
  constexpr explicit convert_to_t() noexcept = default;
};

template<class T>
inline constexpr convert_to_t<T> convert_to{};
Users provide overloads of simd_convert in the namespace of their element type:
// User customization point signature:
template<typename Abi>
basic_vec<To, Abi> simd_convert(const basic_vec<From, Abi>& source,
                                convert_to_t<To>);
The tag argument serves two purposes: it enables ADL discovery, and it allows users to write customization points for specific conversion directions.
As with operations, conversion dispatch uses a three-tier strategy:
- Both built-in vectorizable types: The implementation uses its own optimized conversion. The simd_convert customization point is never checked.
- ADL simd_convert: If at least one type is not a built-in vectorizable type and a valid simd_convert overload is found via ADL that returns exactly basic_vec<To, Abi>, it is used.
- Element-wise fallback: If no simd_convert is found, the implementation falls back to element-wise static_cast, which invokes the scalar conversion operators or constructors on each element.
Example: Optimizing BFloat16 to float conversion using hardware instructions:
namespace my_lib {
  struct BFloat16 { uint16_t bits; /* ... */ };

  template<typename Abi>
  basic_vec<float, Abi> simd_convert(const basic_vec<BFloat16, Abi>& source,
                                     convert_to_t<float>) {
#ifdef __AVX512BF16__
    return /* use native bfloat16 conversion instructions */;
#else
    return /* software shift-based implementation */;
#endif
  }
}
7. Implementation Experience
We implemented this approach in Intel’s implementation and tested across multiple Intel architectures. This section presents the technical details: code generation results, assembly analysis, and identified limitations.
7.1. Test Implementation
We experimented with a number of different test types, including an enumeration, a strong type, and a saturating integer type to evaluate code generation quality:
enum Color { Red, Green, Blue };

struct Meters {
  float value;
  Meters operator+(Meters rhs) const { return Meters{value + rhs.value}; }
  bool operator<(Meters rhs) const { return value < rhs.value; }
};

struct saturating_int16 {
  saturating_int16(int v) : data(v) {}
  std::int16_t data;

  // Saturating addition
  friend saturating_int16 operator+(saturating_int16 lhs, saturating_int16 rhs) {
    auto r = std::int32_t(lhs.data) + std::int32_t(rhs.data);
    return saturating_int16(std::clamp<int32_t>(r, -32768, 32767));
  }

  friend bool operator>(saturating_int16 lhs, saturating_int16 rhs) {
    return lhs.data > rhs.data;
  }

  // Other operators defined similarly...
};
7.2. Successful Inference Cases
Testing was performed with Clang 20 and Intel oneAPI 2025.0 targeting Intel Sapphire Rapids. For most operations, these compilers generated excellent code from element-wise operator application. The generated assembly uses native vector instructions throughout, with no scalar fallback or element-by-element processing. The instruction selection matches what hand-written intrinsics would produce, demonstrating that element-wise inference can generate performance-competitive code for common operations.
Important note on compiler variance: Optimization quality for user-defined types varies significantly between compiler vendors and versions. The results presented here reflect what’s possible with current leading implementations - other compilers may produce substantially less optimal code, particularly for complex operations like reductions. This variance is a quality-of-implementation issue, not a fundamental limitation of the design. Clang and oneAPI demonstrate the approach works. Compilers that currently struggle will improve over time as their optimization passes mature. Users should verify code quality with their specific toolchains and consider using the customization mechanisms (§ 6 Customization Points) if their compiler doesn’t yet optimize well.
See § 13 Appendix: Assembly Code Examples for detailed assembly listings showing the code generated for a variety of common patterns.
7.3. Identified Limitation
We did identify one case where element-wise inference produced suboptimal code:
(Table: the C++ code for the reduction alongside the suboptimal generated assembly; the full listings appear in § 13 Appendix: Assembly Code Examples.)
For this reduction, the compiler started with vector operations but then switched to element-by-element scalar execution. The first two instructions are correct (extract and vector add), but subsequent operations process elements individually rather than maintaining vectorization throughout.
7.4. Implications for Customization
This experience demonstrates that:
-
Element-wise inference succeeds for most operations with leading compilers: Permutations, broadcasts, and direct operators generate optimal code with current Clang and Intel oneAPI implementations.
-
Compiler maturity varies significantly: Optimization quality for user-defined types shows substantial differences between compiler vendors and versions. While Clang and oneAPI generate excellent code, other compilers may produce significantly less optimal results - sometimes falling back to scalar operations where vectorization should succeed. This reflects differences in compiler optimization sophistication, not limitations of the design itself.
-
Specific limitations exist: Even with mature compilers, complex algorithms like reductions may not auto-vectorize perfectly from scalar operator definitions.
-
Customization provides value: For cases where compilers struggle, the ADL-based customization mechanism (
andsimd_operator ) enables users to provide optimized implementations, ensuring good performance regardless of compiler optimization quality.simd_convert
The identified limitations motivated the customization design presented in § 6 Customization Points. However, these limitations do not diminish the value of the core proposal’s element-wise inference; the customization mechanism serves as both a performance optimization for complex cases and a portability tool for users working with compilers that have not yet achieved sophisticated UDT vectorization.
7.5. Implementation Impact
Implementations already handle element types generically for many operations (permutations, broadcasts, masking). The trait-based definition formalizes this practice and extends it uniformly.
The following changes are needed:
- Modify type trait checking for the vectorizable concept/trait
- Add the disable_vectorization variable template with standard library specializations
- Update operator constraints to check return types (constraints already present, only need tightening)
- Add the convert_to_t and convert_to tag types
- Add customization point dispatch logic for user-defined vectorizable types
The implementation effort is minimal: the core machinery already exists, and only the gate-keeping and dispatch logic changes. The implementation experience demonstrates that the approach described in this proposal is viable.
Technical details and example implementation are provided in § 12 Appendix: Customization Point Technical Details.
8. Extended Enum and Byte Support
With our proposal, enumerations and std::byte now become vectorizable. Consequently, related utility functions should be extended to work with basic_vec:
// Element-wise to_underlying for enumerations
template<class Enum, class Abi>
constexpr rebind_t<underlying_type_t<Enum>, basic_vec<Enum, Abi>>
to_underlying(const basic_vec<Enum, Abi>& v) noexcept;

// Element-wise to_integer for std::byte
template<class IntegerType, class Abi>
constexpr rebind_t<IntegerType, basic_vec<byte, Abi>>
to_integer(const basic_vec<byte, Abi>& v) noexcept;
These provide consistency with their scalar counterparts and convenience for common conversions.
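The intended element-wise behaviour can be sketched with std::array standing in for basic_vec. The `_elementwise` names are invented for this sketch; only `std::to_integer` and `std::underlying_type_t` are standard:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Toy element-wise to_underlying: what the proposed overload computes per lane.
template<typename Enum, std::size_t N>
std::array<std::underlying_type_t<Enum>, N>
to_underlying_elementwise(const std::array<Enum, N>& v) {
  std::array<std::underlying_type_t<Enum>, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = static_cast<std::underlying_type_t<Enum>>(v[i]);
  return out;
}

// Toy element-wise to_integer for std::byte.
template<typename IntegerType, std::size_t N>
std::array<IntegerType, N>
to_integer_elementwise(const std::array<std::byte, N>& v) {
  std::array<IntegerType, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = std::to_integer<IntegerType>(v[i]);
  return out;
}
```

Because both are plain per-lane casts, an implementation of the proposed overloads can lower them to a no-op reinterpretation of the underlying vector register.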
9. Proposed Wording
The wording in this section is relative to the working draft at https://eel.is/c++draft/simd.
9.1. Modify [simd.syn]
Add the following declarations to [simd.syn]:
// [simd.convert.tag], customization point conversion tag types
template<typename T> struct convert_to_t;
template<class T> inline constexpr convert_to_t<T> convert_to{};

// [simd.disable], disabling customization point vectorization
template<class T> inline constexpr bool disable_vectorization = see below;
9.2. Modify [simd.general]
Modify [simd.general] as follows:
The set of vectorizable types comprises

- all standard integer types, character types, and the types float and double;
- std::float16_t, std::float32_t, and std::float64_t, if defined;
- complex<T> where T is a vectorizable floating-point type;
- std::byte; and
- any type T for which:
  - is_trivially_copyable_v<T> is true,
  - sizeof(T) is 1, 2, 4, 8, or 16, and
  - disable_vectorization<T> ([simd.disable]) is false.

The types in the first four bullets are the built-in vectorizable types. Types that are vectorizable only by virtue of the fifth bullet are user-defined vectorizable types. [Note: All built-in vectorizable types satisfy the trait-based constraints of the fifth bullet. A type that appears in the first four bullets is a built-in vectorizable type regardless of any specialization of disable_vectorization. —end note]
9.3. Add [simd.convert.tag]
Insert a new subclause [simd.convert.tag]:
9.3.1. Conversion tag types [simd.convert.tag]
template<typename T>
struct convert_to_t {
  using type = T;
  constexpr explicit convert_to_t() noexcept = default;
};

template<class T> inline constexpr convert_to_t<T> convert_to{};

The class template convert_to_t and the variable template convert_to serve as tag types for the simd_convert customization point ([simd.cust.convert]).
9.4. Add [simd.disable] after [simd.general]
Insert a new subclause [simd.disable] after [simd.general]:
9.4.1. Disabling vectorization [simd.disable]
template<class T> inline constexpr bool disable_vectorization = see below;

The variable template disable_vectorization<T> evaluates to true if any of the following conditions hold:

- is_pointer_v<T> is true, or
- is_member_pointer_v<T> is true, or
- is_union_v<T> is true, or
- is_same_v<remove_cvref_t<T>, T> is false, or
- is_empty_v<T> is true, or
- is_same_v<T, bool> is true, or
- a program-defined or implementation-provided specialization of disable_vectorization<T> explicitly sets it to true.

Otherwise, disable_vectorization<T> evaluates to false.

A program may provide explicit specializations of disable_vectorization for program-defined types. Such specializations shall be usable in constant expressions and have type const bool.

Specializations of disable_vectorization for cv-qualified types or reference types are ill-formed.

The implementation provides explicit specializations that set disable_vectorization to true for the following standard library types: nullptr_t, source_location, basic_vec<T, Abi>, and basic_mask<T, Abi>.

Implementations may provide additional specializations for other types where vectorization is semantically inappropriate.
9.5. Add exposition-only concepts to [simd.expos]
Add the following to [simd.expos], after the existing exposition-only definitions:
template<typename T>
concept promotable-type =            // exposition only
  (is_arithmetic_v<T> && !is_same_v<T, bool>) ||
  (is_enum_v<T> && !is_scoped_enum_v<T>);

template<typename T, typename UnaryOp>
concept supported-unary-op =         // exposition only
  (promotable-type<T> && requires(T a) { UnaryOp{}(a); }) ||
  (!promotable-type<T> && requires(T a) { { UnaryOp{}(a) } -> same_as<T>; });

template<class T, class BinaryOp>
concept supported-binary-op =        // exposition only
  (promotable-type<T> && requires(T a, T b) { BinaryOp{}(a, b); }) ||
  (!promotable-type<T> && requires(T a, T b) { { BinaryOp{}(a, b) } -> same_as<T>; });

[Note: The promotable-type concept identifies types that participate in C++'s standard implicit conversion and integer promotion rules (arithmetic types excluding bool, and unscoped enumerations). For these types, binary operations may return a promoted type that requires explicit conversion back to value_type (e.g., uint8_t + uint8_t returns int). For all other vectorizable types (scoped enumerations, byte, complex, and user-defined types), operations must return exactly value_type. bool is excluded from promotable-type because it is banned as an element type via disable_vectorization ([simd.disable]). —end note]
9.6. Add [simd.cust] — Customization points
Insert a new subclause [simd.cust]:
9.6.1. Customization points [simd.cust]
Customization points allow users to provide optimized implementations of operations and conversions for user-defined vectorizable types. Customization points are discovered via argument-dependent lookup ([basic.lookup.argdep]) in the namespaces associated with the element type.
[Note: Because basic_vec<T, Abi> carries the associated namespaces of T, ADL naturally finds overloads declared alongside the element type definition. Users do not inject declarations into namespace std. —end note]

Customization points are never consulted for built-in vectorizable types. For built-in vectorizable types, the implementation always uses its own optimized code paths.

[Note: Implementations are encouraged to emit a diagnostic if a simd_operator or simd_convert overload is declared that would never be consulted because it targets a built-in vectorizable type. —end note]

9.6.1.1. Operation customization [simd.cust.op]
For user-defined vectorizable types, when a unary or binary arithmetic or bitwise operation is performed on a basic_vec, the implementation determines the result as follows:

- If the expression simd_operator(v, op) (for unary) or simd_operator(v1, v2, op) (for binary) is well-formed via ADL, where op is an object of the corresponding standard transparent function object type, and the return type is exactly basic_vec<value_type, Abi>, then the result is that expression.
- Otherwise, the result is the element-wise application of the scalar operator.

For user-defined vectorizable types, when a comparison operation is performed on a basic_vec, the implementation determines the result as follows:

- If the expression simd_operator(v1, v2, op) is well-formed via ADL, where op is an object of the corresponding standard transparent function object type (equal_to<>, not_equal_to<>, less<>, less_equal<>, greater<>, greater_equal<>), and the return type is exactly basic_mask<value_type, Abi>, then the result is that expression.
- Otherwise, the result is the element-wise application of the scalar comparison operator.

[Note: The well-formedness checks above are SFINAE-friendly. If ADL finds a simd_operator overload with an incorrect return type, it is not considered and the implementation falls through to element-wise application. —end note]

The Op parameter is one of the following standard transparent function objects:

- Unary: negate<>, bit_not<>
- Binary arithmetic: plus<>, minus<>, multiplies<>, divides<>, modulus<>
- Binary bitwise: bit_and<>, bit_or<>, bit_xor<>
- Comparison: equal_to<>, not_equal_to<>, less<>, less_equal<>, greater<>, greater_equal<>

[Note: Shift operators are not listed because the C++ standard library does not currently provide transparent function objects for them. See [P4006]. —end note]
[Note: When element-wise application is used, scalar operators are invoked with elements as prvalues. If a user-defined type provides both lvalue-reference-qualified and rvalue-reference-qualified overloads of an operator, the rvalue-reference-qualified overload is selected. This is an observable property of the element-wise application. —end note]
9.6.1.2. Conversion customization [simd.cust.convert]
For conversions between basic_vec types where at least one of the source or destination element types is a user-defined vectorizable type, the implementation determines the result as follows:

- If the expression simd_convert(source, convert_to<To>) is well-formed via ADL, where source is of type const basic_vec<From, Abi>& and the return type is exactly basic_vec<To, Abi>, then the result is that expression.
- Otherwise, the result is element-wise static_cast<To>(source[i]) for each i in the range [0, basic_vec<From, Abi>::size()).

For conversions where both the source and destination element types are built-in vectorizable types, the simd_convert customization point is never consulted. The implementation uses its own optimized conversion.

[Note: The well-formedness check above is SFINAE-friendly. If ADL finds a simd_convert overload with an incorrect return type, it is not considered and the implementation falls through to element-wise static_cast. —end note]

[Note: The convert_to_t<To> tag argument ([simd.convert.tag]) enables ADL discovery of the customization point. Without the tag, the destination type To would appear only as a template parameter and would not contribute to ADL. —end note]
9.7. Modify [simd.ctor] broadcasting constructor
Modify the constraints for the broadcasting constructor in [simd.ctor]:
Constraints:

- From is not an arithmetic type, does not satisfy constexpr-wrapper-like, and is_convertible_v<From, value_type> is true, or
- From is an arithmetic type and the conversion from From to value_type is value-preserving ([simd.general]), or
- From satisfies constexpr-wrapper-like, remove_cvref_t<decltype(From::value)> is an arithmetic type, and From::value is representable by value_type.
Drafting note: This ensures that conversions involving user-defined types respect the type author’s design. If the scalar type requires explicit conversion (e.g., ), the simd conversion also requires explicit construction. If the scalar type allows implicit conversion, simd follows suit.
9.8. Modify [simd.ctor] converting constructor
Modify the Remarks and Effects for the converting constructor in [simd.ctor]:
Remarks: The expression inside explicit evaluates to true if any of the following hold:

- both U and value_type are built-in vectorizable types and the conversion from U to value_type is not value-preserving, or
- at least one of U or value_type is not a built-in vectorizable type and is_convertible_v<U, value_type> is false, or

The remaining conditions about integer conversion rank and floating-point conversion rank remain unchanged.

Effects: If both U and value_type are built-in vectorizable types, initializes the ith element with static_cast<value_type>(x[i]) for all i in the range [0, size()) using an implementation-defined optimized conversion. Otherwise, determines the result according to the conversion customization rules ([simd.cust.convert]).
Drafting note: This extends the explicit-ness determination to user-defined vectorizable types. For UDT conversions, we check is_convertible_v rather than value-preserving (which is only defined for built-in vectorizable types).
The phrase "at least one of U or value_type is not a built-in vectorizable type" covers three cases:
- Built-in vectorizable → User-defined vectorizable (e.g., simd<float> → simd<Meters>): the is_convertible_v check respects whether the UDT provides an implicit or explicit constructor from the built-in vectorizable type (e.g., Meters(float)).
- User-defined vectorizable → Built-in vectorizable (e.g., simd<Meters> → simd<float>): the is_convertible_v check respects whether the UDT provides an implicit or explicit conversion operator to the built-in vectorizable type (e.g., operator float()).
- User-defined vectorizable → User-defined vectorizable (e.g., simd<Meters> → simd<Feet>): the is_convertible_v check respects whether conversion is available and whether it is implicit or explicit. The UDT author can provide either a constructor in the target type (Feet(Meters)) or a conversion operator in the source type (Meters::operator Feet()), allowing static_cast to use whichever is available.
The Built-in vectorizable → Built-in vectorizable case is handled by the first condition’s value-preserving check. We use "at least one" (not "both") because we want to respect the type author’s implicit/explicit judgment for any conversion involving a UDT. The is_convertible_v check will fail (requiring explicit construction) if the necessary constructor or conversion operator doesn’t exist or is marked explicit.
9.9. Modify [simd.unary]
Modify the constraints and effects in [simd.unary] as follows:
Let op be the operator.
Constraints:
`supported-unary-op<value_type, Op>` is `true`, where `Op` is the corresponding standard transparent function object (`negate<>`, `bit_not<>`).

Returns:
If `value_type` is a built-in vectorizable type, a `basic_vec` object initialized with the results of applying op to `v` using an implementation-defined optimized implementation. Otherwise, a `basic_vec` object determined according to the operation customization rules ([simd.cust.op]).
9.10. Modify [simd.binary]
Modify the constraints and effects in [simd.binary] as follows:
Let op be the operator.
Constraints:
`supported-binary-op<value_type, Op>` is `true`, where `Op` is the corresponding standard transparent function object (`plus<>`, `minus<>`, `multiplies<>`, `divides<>`, `modulus<>`, `bit_and<>`, `bit_or<>`, `bit_xor<>`).

Returns:
If `value_type` is a built-in vectorizable type, a `basic_vec` object initialized with the results of applying op to `lhs` and `rhs` using an implementation-defined optimized implementation. Otherwise, a `basic_vec` object determined according to the operation customization rules ([simd.cust.op]).
For the shift operators:
Let op be the operator.
Constraints:
`supported-binary-op<value_type, Op>` is `true`, where `Op` is the corresponding standard transparent function object.

Returns:
If `value_type` is a built-in vectorizable type, a `basic_vec` object initialized with the results of applying op to `lhs` and `rhs` using an implementation-defined optimized implementation. Otherwise, a `basic_vec` object initialized with the results of applying op element-wise.
[Note: Shift operators do not consult `simd_operator` because the C++ standard library does not provide transparent function objects for them. See [P4006]. If [P4006] is adopted, shift operators could be added to the customization mechanism. —end note]
9.11. Modify [simd.cassign]
Modify the constraints and effects in [simd.cassign] as follows:
Let op be the operator.
Constraints:
`supported-binary-op<value_type, Op>` is `true`, where `Op` is the standard transparent function object corresponding to the binary operator with the same name (e.g., `operator+=` uses the constraint from `operator+`).

Effects: Equivalent to `lhs = lhs op rhs`, where the binary operation is performed according to the rules in [simd.binary].
For the shift compound assignment operators:
Let op be the operator.
Constraints:
`supported-binary-op<value_type, Op>` is `true`, where `Op` is the standard transparent function object corresponding to the binary operator.

Effects: Equivalent to `lhs = lhs op rhs`, where the binary operation is performed according to the rules in [simd.binary].
9.12. Modify [simd.comparison]
Modify the constraints and effects in [simd.comparison] as follows:
Let op be the operator.
Constraints:
`requires (value_type a, value_type b) { { a op b } -> same_as<bool>; }` is `true`.

Returns:
If `value_type` is a built-in vectorizable type, a `mask_type` object initialized with the results of applying op to `lhs` and `rhs` using an implementation-defined optimized implementation. Otherwise, a `mask_type` object determined according to the comparison customization rules ([simd.cust.op]).
9.13. Add overload for to_underlying
Add to [simd.casts]:
```cpp
template <simd-vec-type V>
constexpr rebind_t<underlying_type_t<typename V::value_type>, V>
to_underlying(const V& v) noexcept;
```

Constraints: `is_enum_v<typename V::value_type>` is `true`.

Returns: A `basic_vec` object where the ith element is initialized to the result of `to_underlying(v[i])` for all i in the range `[0, V::size())`.
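A scalar reference model of this element-wise semantics, using an illustrative `Channel` enum (the proposed overload performs the same conversion across all lanes at once):

```cpp
#include <array>
#include <cstdint>
#include <type_traits>

enum class Channel : std::uint8_t { R = 0, G = 1, B = 2 };

// Scalar model of the proposed simd to_underlying: converts each enum
// element to its underlying integer type, element by element.
template <typename E, std::size_t N>
constexpr std::array<std::underlying_type_t<E>, N>
to_underlying_ref(const std::array<E, N>& v) {
    std::array<std::underlying_type_t<E>, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = static_cast<std::underlying_type_t<E>>(v[i]);
    return out;
}
```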
9.14. Add overload for to_integer
Add to [simd.casts]:
```cpp
template <class IntegerType, class Abi>
constexpr rebind_t<IntegerType, basic_vec<byte, Abi>>
to_integer(const basic_vec<byte, Abi>& v) noexcept;
```

Constraints: `is_integral_v<IntegerType>` is `true`.

Returns: A `basic_vec` object where the ith element is initialized to the result of `to_integer<IntegerType>(v[i])` for all i in the range `[0, basic_vec<byte, Abi>::size())`.
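Likewise, the element-wise semantics mirror the scalar `std::to_integer`; a scalar reference model:

```cpp
#include <array>
#include <cstddef>

// Scalar model of the proposed simd to_integer: converts each std::byte
// element to the requested integer type, element by element.
template <class IntegerType, std::size_t N>
constexpr std::array<IntegerType, N>
to_integer_ref(const std::array<std::byte, N>& v) {
    std::array<IntegerType, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = std::to_integer<IntegerType>(v[i]);
    return out;
}
```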
9.15. Feature test macro [version.syn]
Add to [version.syn]:
#define __cpp_lib_simd_udt YYYYMML // also in <simd>
[Note: This macro covers both the trait-based vectorizable type extension and the customization point mechanism (`simd_operator` and `simd_convert`), as they form a single integrated feature. —end note]
10. Conclusion
This proposal extends `<simd>` to support user-defined element types through a minimal, principled change: the closed list of vectorizable types is extended, with `std::byte` added directly, and trait-based constraints introduced for user-defined vectorizable types.
Earlier revisions explored explicit customization mechanisms, leading to complicated designs. Committee feedback encouraged exploring element-wise inference. The working draft specification already defines all operations through element-wise application, so changing only the definition of which types are allowed provides the extension we need.
Committee discussion raised legitimate concerns about whether compilers could actually optimize user-defined operator calls into efficient vector code. Implementation experience with leading compilers (Clang 20, Intel oneAPI 2025.0) has shown that they can. While compiler maturity varies across vendors and versions, the results demonstrate the fundamental viability of the element-wise inference approach.
By changing only the gate-keeping logic for vectorizable types, we enable type safety for strong typedefs, domain-specific types for signal processing and other specialized domains, enumerations, `std::byte`, and small compound types. This is achieved with no breaking changes to existing code and no modification to any operation semantics.
The proposal includes ADL-based customization points (`simd_operator` for operations, `simd_convert` for conversions) that enable users to provide optimized implementations where compiler inference is insufficient. The hybrid approach — element-wise inference by default, customization when needed — provides a clear path for users to achieve optimal performance regardless of compiler optimization quality.
11. Acknowledgements
We would like to thank Matthias Kretz for his feedback and contributions to discussions throughout the development of this proposal. We also thank the members of SG1 and SG6 who provided feedback during recent meetings, which significantly shaped the direction of the later revisions.
12. Appendix: Customization Point Technical Details
This appendix provides additional technical details for the ADL-based customization mechanism proposed in § 6 Customization Points.
12.1. Dual Dispatch Strategy
The customization design uses separate code paths based on type category:
```cpp
// Built-in vectorizable types: always optimized
template <typename T>
  requires std::is_arithmetic_v<T> || std::is_same_v<T, std::byte> || /* complex */
friend constexpr basic_vec operator+(const basic_vec& lhs, const basic_vec& rhs) {
    return /* implementation-defined optimized implementation */;
}

// User-defined vectorizable types: check for customization via ADL
template <typename T>
  requires (!std::is_arithmetic_v<T> && !std::is_same_v<T, std::byte> && /* not complex */)
friend constexpr basic_vec operator+(const basic_vec& lhs, const basic_vec& rhs)
  requires requires (value_type a, value_type b) { { a + b } -> std::same_as<value_type>; }
{
    if constexpr (requires { simd_operator(lhs, rhs, std::plus<>{}); }) {
        return simd_operator(lhs, rhs, std::plus<>{}); // Custom via ADL
    } else {
        return /* element-wise application */;         // Default
    }
}
```
This ensures:
-
Built-in vectorizable types: Always optimized, never check for customization
-
User-defined vectorizable types (including enumerations): can provide `simd_operator` customization; otherwise element-wise inference is used. For enumerations without custom operators, element-wise inference produces optimal code since enum operations compile to simple integer arithmetic.
-
Performance guarantee: No overhead for standard built-in vectorizable types
Users control their own target-specific optimizations if desired:
```cpp
// User code for target-specific optimization
namespace my_lib {

enum class PackedColor : uint32_t { /* ... */ };

// Custom enum operator
PackedColor operator+(PackedColor a, PackedColor b) {
    return /* custom blending logic */;
}

// SIMD optimization, discoverable by ADL.
auto simd_operator(vec<PackedColor> lhs, vec<PackedColor> rhs, std::plus<>) {
#ifdef __AVX512F__
    return my_avx512_blend(lhs, rhs);
#else
    return my_generic_blend(lhs, rhs);
#endif
}

} // namespace my_lib
```
12.2. Complete Example with Selective Customization
This example shows how users can customize specific operations while relying on element-wise inference for others:
```cpp
namespace my_lib {

struct fixed_point_16s8 {
    std::int16_t data;

    // Basic operators use normal semantics
    // (casts added: int16_t arithmetic promotes to int)
    fixed_point_16s8 operator+(fixed_point_16s8 rhs) const {
        return fixed_point_16s8{static_cast<std::int16_t>(data + rhs.data)};
    }
    fixed_point_16s8 operator-(fixed_point_16s8 rhs) const {
        return fixed_point_16s8{static_cast<std::int16_t>(data - rhs.data)};
    }
    bool operator<(fixed_point_16s8 rhs) const { return data < rhs.data; }
};

// Customize multiply (requires scaling) - binary operation
template <typename Abi>
auto simd_operator(const basic_vec<fixed_point_16s8, Abi>& lhs,
                   const basic_vec<fixed_point_16s8, Abi>& rhs,
                   std::multiplies<>) {
    // Custom implementation with appropriate scaling
    // Could use intrinsics or library functions
    return /* optimized multiply with scaling */;
}

// Customize divide (requires scaling) - binary operation
template <typename Abi>
auto simd_operator(const basic_vec<fixed_point_16s8, Abi>& lhs,
                   const basic_vec<fixed_point_16s8, Abi>& rhs,
                   std::divides<>) {
    // Custom implementation with appropriate scaling
    return /* optimized divide with scaling */;
}

// Addition, subtraction, and comparisons use element-wise inference;
// no customization needed for these simple operations.

} // namespace my_lib

// Usage
vec<my_lib::fixed_point_16s8> a, b;
auto sum      = a + b; // Uses element-wise inference (fast)
auto diff     = a - b; // Uses element-wise inference (fast)
auto product  = a * b; // Uses custom simd_operator (optimal)
auto quotient = a / b; // Uses custom simd_operator (optimal)
auto mask     = a < b; // Uses element-wise inference (fast)
```
Conversion example:
```cpp
namespace my_lib {

struct BFloat16 { uint16_t bits; /* ... */ };

// Optimize conversion to float
template <typename Abi>
basic_vec<float, Abi> simd_convert(const basic_vec<BFloat16, Abi>& source,
                                   convert_to_t<float>) {
    // Use hardware bfloat16 conversion if available
#ifdef __AVX512BF16__
    return /* use vcvtne2ps2bf16 or similar */;
#else
    return /* shift bits implementation */;
#endif
}

} // namespace my_lib
```
This demonstrates the key benefit: users customize only what needs optimization while relying on inference for everything else. The single name `simd_operator` handles unary, binary, and comparison operations through overloading.
13. Appendix: Assembly Code Examples
This section provides detailed assembly listings from the implementation experience, demonstrating how element-wise inference generates optimal vector code. Testing was performed with Clang 20 and Intel oneAPI 2025.0 targeting Intel Sapphire Rapids.
Complex Expression Composition
Element-wise operations compose well across multiple operations in a single expression:
[Table: C++ code for the composed expression alongside the generated assembly; listings omitted.]
The assembly is identical for both the user-defined type and the built-in type, demonstrating that user-defined types achieve zero-overhead abstraction. The compiler successfully fuses multiple operations and optimizes register allocation regardless of whether the element type is user-defined or built-in.
[Table: further C++ code samples and their generated assembly; listings omitted.]
These examples demonstrate optimal code generation with native vector instructions and no scalar fallback.