1. Authors and contributors
1.1. Authors

Mark Hoemmen (mhoemmen@nvidia.com) (NVIDIA)

Daisy Hollman (cpp@dsh.fyi) (Google)

Christian Trott (crtrott@sandia.gov) (Sandia National Laboratories)

Daniel Sunderland (dansunderland@gmail.com)

Nevin Liber (nliber@anl.gov) (Argonne National Laboratory)

Alicia Klinvex (alicia.klinvex@unnpp.gov) (Naval Nuclear Laboratory)

LiTa Lo (ollie@lanl.gov) (Los Alamos National Laboratory)

Damien LebrunGrandie (lebrungrandt@ornl.gov) (Oak Ridge National Laboratories)

Graham Lopez (glopez@nvidia.com) (NVIDIA)

Peter Caday (peter.caday@intel.com) (Intel)

Sarah Knepper (sarah.knepper@intel.com) (Intel)

Piotr Luszczek (luszczek@icl.utk.edu) (University of Tennessee)

Timothy Costa (tcosta@nvidia.com) (NVIDIA)
1.2. Contributors

Chip Freitag (chip.freitag@amd.com) (AMD)

Bryce Adelstein Lelbach (brycelelbach@gmail.com) (NVIDIA)

Srinath Vadlamani (Srinath.Vadlamani@arm.com) (ARM)

Rene Vanoostrum (Rene.Vanoostrum@amd.com) (AMD)
2. Revision history

Revision 0 (preCologne) submitted 20190617

Received feedback in Cologne from SG6, LEWGI, and (???).


Revision 1 (preBelfast) to be submitted 20191007

Account for Cologne 2019 feedback

Make interface more consistent with existing Standard algorithms

Change
,dot
,dotc
, andvector_norm2
to imitatevector_abs_sum
, so that they return their result, instead of taking an output parameter. Users may set the result type via optionalreduce
parameter.init


Minor changes to "expression template" classes, based on implementation experience

Briefly address LEWGI request of exploring concepts for input arguments.

Lazy ranges style API was NOT explored.


Revision 2 (preCologne) to be submitted 20200113

Add "Future work" section.

Remove "Options and votes" section (which were addressed in SG6, SG14, and LEWGI).

Remove
overloads.basic_mdarray 
Remove batched linear algebra operations.

Remove over and underflow requirement for
.vector_norm2 
Mandate any extent compatibility checks that can be done at compile time.

Add missing functions
and{ symmetric , hermitian } _matrix_rank_k_update
.triangular_matrix_ { left , right } _product 
Remove
function.packed_view 
Fix wording for
, so that implementations may optimize the return type. Make sure that{ conjugate , transpose , conjugate_transpose } _view
of atranspose_view
matrix returns alayout_blas_packed
matrix with oppositelayout_blas_packed
andTriangle
.StorageOrder 
Remove second template parameter
fromT
.accessor_conjugate 
Make
andscaled_scalar
exposition only.conjugated_scalar 
Add inplace overloads of
,triangular_matrix_matrix_ { left , right } _solve
, andtriangular_matrix_ { left , right } _product
.triangular_matrix_vector_solve 
Add
overloads toalpha
.{ symmetric , hermitian } _matrix_rank_ { 1 , k } _update 
Add Cholesky factorization and solve examples.


Revision 3 (electronic) to be submitted 20210415

Per LEWG request, add a section on our investigation of constraining template parameters with concepts, in the manner of P1813R0 with the numeric algorithms. We concluded that we disagree with the approach of P1813R0, and that the Standard’s current GENERALIZED_SUM approach better expresses numeric algorithms' behavior.

Update references to the current revision of P0009 (
).mdspan 
Per LEWG request, introduce
namespace and put everything in there.std :: linalg 
Per LEWG request, replace the
prefix with the aforementioned namespace. We renamedlinalg_
tolinalg_add
,add
tolinalg_copy
, andcopy
tolinalg_swap
.swap_elements 
Per LEWG request, do not use
as a suffix, to avoid confusion with "views" in the sense of Ranges. We renamed_view
toconjugate_view
,conjugated
toconjugate_transpose_view
,conjugate_transposed
toscaled_view
, andscaled
totranspose_view
.transposed 
Change wording from "then implementations will use
's precision or greater for intermediate terms in the sum," to "then intermediate terms in the sum useT
's precision or greater." Thanks to Jens Maurer for this suggestion (and many others!).T 
Before, a Note on
said, "We recommend that implementers document their guarantees regarding overflow and underflow ofvector_norm2
for floatingpoint return types." Implementations always document "implementationdefined behavior" per [defs.impl.defined]. (Thanks to Jens Maurer for pointing out that "We recommend..." does not belong in the Standard.) Thus, we changed this from a Note to normative wording in Remarks: "If eithervector_norm2
orin_vector_t :: element_type
are floatingpoint types or complex versions thereof, then any guarantees regarding overflow and underflow ofT
are implementationdefined."vector_norm2 
Define return types of the
,dot
,dotc
, andvector_norm2
overloads withvector_abs_sum
return type.auto 
Remove the explicitly stated constraint on
andadd
that the rank of the array arguments be no more than 2. This is redundant, because we already impose this via the existing constraints on template parameters namedcopy
,in_object * _t
, orinout_object * _t
. If we later wish to relax this restriction, then we only have to do so in one place.out_object * _t 
Add
. First, this gives implementers a path to implementingvector_sum_of_squares
in a way that achieves the over/underflow guarantees intended by the BLAS Standard. Second, this is a useful algorithm in itself for parallelizing vector 2norm computation.vector_norm2 
Add
,matrix_frob_norm
, andmatrix_one_norm
(thanks to coauthor Piotr Luszczek).matrix_inf_norm 
Address LEWG request for us to investigate support for GPU memory. See section "Explicit support for asynchronous return of scalar values."

Add
overloads of the inplace versions ofExecutionPolicy
,triangular_matrix_vector_solve
,triangular_matrix_left_product
,triangular_matrix_right_product
, andtriangular_matrix_matrix_left_solve
.triangular_matrix_matrix_right_solve


Revision 4 (electronic), to be submitted 20210815

Update authors' contact info.

Rebase atop P2299R3, which in turn sits atop P0009R12. Make any needed fixes due to these changes. (P1673R3 was based on P0009R10, without P2299.) Update P0009 references to point to the latest version (R12).

Fix requirements for
.{ symmetric , hermitian } _matrix_ { left , right } _product 
Change
toSemiRegular < Scalar >
.semiregular < Scalar > 
Make
requirements refer to [complex.numbers.general], rather than explicitly listing allowed types. Remove redundant constraints onReal
.Real 
In [linalg.algs.reqs], clarify that "unique layout" for output matrix, vector, or object types means
equalsis_always_unique () true
. 
Change file format from Markdown to Bikeshed.

Impose requirements on the types on which algorithms compute, and on the algorithms themselves (e.g., what rearrangements are permitted). Add a section explaining how we came up with the requirements. Lift the requirements into a new higherlevel section that applies to the entire contents of [linalg], not just to [linalg.algs].

Add "Overview of contents" section.

In the last review, LEWG had asked us to consider using expositiononly concepts and
clauses to express requirements more clearly. We decided not to do so, because we did not think it would add clarity.requires 
Add more examples.


Revision 5 (electronic), to be submitted 20211015

P0009R13 (to be submitted 20211015) changes
to usemdspan
instead ofoperator []
as the array access operator. Revision 5 of P1673 adopts this change, and is "rebased" atop P1673R5.operator ()


Revision 6 (electronic), to be submitted 20211215

Update references to P0009 (P0009R14) and P2128 (P2128R6).

Fix typos in
descriptions.* rank_2k 
Remove references to any
rank greater than 2. (These were left over from earlier versions of the proposal that included "batched" operations.)mdspan 
Fix
name in BLAS comparison table.vector_sum_of_squares 
Replace "Requires" with "Preconditions," per new wording guidelines.

Remove all overloads of
andsymmetric_matrix_rank_k_update
that do not take anhermitian_matrix_rank_k_update
parameter. This prevents ambiguity between overloads that takealpha
but notExecutionPolicy &&
, and overloads that takealpha
but notalpha
.ExecutionPolicy && 
Harmonize with the implementation, by adding
,operator +
, and comparison operators tooperator *
.conjugated_scalar


Revision 7 (electronic), to be submitted 20220415

Update author affiliations and email addresses

Update proposal references

Fix typo observed here

Add missing
overload of inplaceExecutionPolicy
; issue was observed heretriangular_matrix_vector_product 
Fix mixedup order of
aggregate initialization arguments insum_of_squares_result
notevector_norm2 
Fill in missing parts of
andmatrix_frob_norm
specification, addressing this issuevector_norm2


Revision 8 (electronic), to be submitted 20220515

Fix
andTriangle
in Cholesky TSQR exampleR [ 0 , 0 ] 
Explain why we apply
to the possibly transformed input matrix, while the BLAS appliesTriangle
to the original input matrixUPLO 
Optimize
for all known layouts, so as to avoid use oftransposed
when not needed; fix computation of strides for transposed layoutslayout_transpose 
Fix matrix extents in constraints and mandates for
andsymmetric_matrix_rank_k_update
(thanks to Mikołaj Zuzek (NexGen Analytics,hermitian_matrix_rank_k_update
) for reporting the issue)mikolaj . zuzek @ng  analytics . com 
Resolve vagueness in
ness of return type ofconst transposed 
Resolve vagueness in
ness of return type ofconst
, and make its element type the type of the product, rather than forcing it back to the inputscaled
's element typemdspan 
Remove
member function fromdecay
andaccessor_conjugate
, as it is no longer part ofaccessor_scaled
's accessor policy requirementsmdspan 
Make sure
andaccessor_conjugate
work correctly for userdefined complex types, introduceconjugated
_ to simplify wording, and resolve vagueness inconj  if  needed
ness of return type ofconst
. Make sure thatconjugated
_ works for custom types whereconj  if  needed
is not typepreserving. (Thanks to Yu You (NVIDIA, yuyou@nvidia.com) and Phil Miller (Intense Computing,conj
) for helpful discussions.)phil . miller @intensecomputing . com 
Fix typo in
for complex numbers, and other typos (thanks to Phil Miller for reporting the issue)givens_rotation_setup


Revision 9 (electronic), to be submitted 20220615

Apply tobesubmitted P0009R17 changes (see P2553 in particular) to all layouts, accessors, and examples in this proposal.

Improve
"mathematical expression of the algorithm" wording.triangular_matrix_matrix_ { left , right } _solve () 
: Fixlayout_blas_packed
andrequired_span_size ()
wordingoperator () 
Make sure all definitions of lower and upper triangle are consistent.

Changes to
:layout_transpose 
Make
constructorlayout_transpose :: mapping ( const nested_mapping_type & )
, to avoid inadvertent transposition.explicit 
Remove the following Constraint on
: "for all specializationslayout_transpose :: mapping
ofE
withextents
equal to 2,E :: rank ()
istypename Layout :: template mapping < E >:: is_always_unique () true
." (This Constraint was not correct, because the underlying mapping is allowed to be nonunique.) 
Make
wording independent oflayout_transpose :: mapping :: stride
equals 2 constraint, to improve consistency with rest ofrank ()
wording.layout_transpose


Changes to
andscaled
:conjugated 
Include and specify all the needed overloaded arithmetic operators for
andscaled_scalar
, and fixconjugated_scalar
andaccessor_scaled
accordingly.accessor_conjugate 
Simplify
to ensure preservation of order of operations.scaled 
Add missing
tonested_accessor ()
.accessor_scaled 
Add hidden friends
,abs
,real
, andimag
to common subclass ofconj
andscaled_scalar
. Add wording to algorithms that useconjugated_scalar
,abs
, and/orreal
, to indicate that these functions are to be found by unqualified lookup. (Algorithms that use conjugation already useimag
_ in their wording.)conj  if  needed


Changes suggested by SG6 small group review on 2022/06/09

Make existing expositiononly function
useconj  if  needed
if it can find it via unqualified (ADLonly) lookup, otherwise be the identity. Make it a function object instead of a function, to prevent ADL issues.conj 
Algorithms that mathematically need to do division can now distinguish left division and right division (for the case of noncommutative multiplication), by taking an optional
binary function object parameter. If none is given, binaryBinaryDivideOp
is used.operator /


Changes suggested by LEWG review of P1673R8 on 2022/05/24

LEWG asked us to add a section to the paper explaining why we don’t define an interface for customization of the "backend" optimized BLAS operations. This section already existed, but we rewrote it to improve clarity. Please see the section titled "We do not require using the BLAS library or any particular backend."

LEWG asked us to add a section to the paper showing how BLAS 1 and ranges algorithms would coexist. We added this section, titled "Criteria for including BLAS 1 algorithms; coexistence with ranges."

Address LEWG feedback to defer support for custom complex number types (but see above SG6 small group response).


Fix P1674 links to point to R2.


Revision 10 (electronic), to be submitted soon

Revise
andscaled
wording.conjugated 
Make all matrix view functions
.constexpr 
Rebase atop P2642R1. Remove wording saying that we rebase atop P0009R17. (We don’t need that any more, because P0009 was merged into the current C++ draft.)

Remove
, as it has been replaced with the layouts proposed by P2642 (layout_blas_general
andlayout_left_padded
).layout_right_padded 
Update
to match mdspan’s other layout mappings in the current C++ Standard draft.layout_blas_packed 
Update accessors to match mdspan’s other accessors in the current C++ Standard draft.

Update nonwording to reflect current status of
(voted into C++ Standard draft) andmdspan
(P2630).submdspan

3. Purpose of this paper
This paper proposes a C++ Standard Library dense linear algebra interface based on the dense Basic Linear Algebra Subroutines (BLAS). This corresponds to a subset of the BLAS Standard. Our proposal implements the following classes of algorithms on arrays that represent matrices and vectors:

elementwise vector sums;

multiplying all elements of a vector or matrix by a scalar;

2norms and 1norms of vectors;

vectorvector, matrixvector, and matrixmatrix products (contractions);

lowrank updates of a matrix;

triangular solves with one or more "righthand side" vectors; and

generating and applying plane (Givens) rotations.
Our algorithms work with most of the matrix storage formats that the BLAS Standard supports:

"general" dense matrices, in columnmajor or rowmajor format;

symmetric or Hermitian (for complex numbers only) dense matrices, stored either as general dense matrices, or in a packed format; and

dense triangular matrices, stored either as general dense matrices or in a packed format.
Our proposal also has the following distinctive characteristics.

It uses free functions, not arithmetic operator overloading.

The interface is designed in the spirit of the C++ Standard Library’s algorithms.

It uses
(P0009 and followon papers, which have been adopted into the C++23 draft), a multidimensional array view, to represent matrices and vectors.mdspan 
The interface permits optimizations for matrices and vectors with small compiletime dimensions; the standard BLAS interface does not.

Each of our proposed operations supports all element types for which that operation makes sense, unlike the BLAS, which only supports four element types.

Our operations permit "mixedprecision" computation with matrices and vectors that have different element types. This subsumes most functionality of the MixedPrecision BLAS specification, comprising Chapter 4 of the BLAS Standard.

Like the C++ Standard Library’s algorithms, our operations take an optional execution policy argument. This is a hook to support parallel execution and hierarchical parallelism (through the proposed executor extensions to execution policies; see P1019R2).

Unlike the BLAS, our proposal can be expanded to support "batched" operations (see P1417R0) with almost no interface differences. This will support machine learning and other applications that need to do many small matrix or vector operations at once.
Here are some examples of what this proposal offers.
In these examples, we ignore
namespace qualification
for anything in our proposal or for
.
We start with a "hello world" that scales the elements of a 1D
by a constant factor, first sequentially, then in parallel.
constexpr size_t N = 40 ; std :: vector < double > x_vec ( N ); mdspan x ( x_vec . data (), N ); for ( size_t i = 0 ; i < N ; ++ i ) { x [ i ] = double ( i ); } linalg :: scale ( 2.0 , x ); // x = 2.0 * x linalg :: scale ( std :: execution :: par_unseq , 3.0 , x ); for ( size_t i = 0 ; i < N ; ++ i ) { assert ( x [ i ] == 6.0 * double ( i )); }
Here is a matrixvector product example.
It illustrates the
function that makes our interface more concise,
while still permitting the BLAS' performance optimization
of fusing computations with multiplications by a scalar.
It also shows the ability to exploit dimensions known at compile time,
and to mix compiletime and runtime dimensions arbitrarily.
constexpr size_t N = 40 ; constexpr size_t M = 20 ; std :: vector < double > A_vec ( N * M ); std :: vector < double > x_vec ( M ); std :: array < double , N > y_vec ( N ); mdspan A ( A_vec . data (), N , M ); mdspan x ( x_vec . data (), M ); mdspan y ( y_vec . data (), N ); for ( int i = 0 ; i < A . extent ( 0 ); ++ i ) { for ( int j = 0 ; j < A . extent ( 1 ); ++ j ) { A [ i , j ] = 100.0 * i + j ; } } for ( int j = 0 ; j < x . extent ( 0 ); ++ j ) { x [ i ] = 1.0 * j ; } for ( int i = 0 ; i < y . extent ( 0 ); ++ i ) { y [ i ] = 1.0 * i ; } linalg :: matrix_vector_product ( A , x , y ); // y = A * x // y = 0.5 * y + 2 * A * x linalg :: matrix_vector_product ( std :: execution :: par , linalg :: scaled ( 2.0 , A ), x , linalg :: scaled ( 0.5 , y ), y );
This example illustrates the ability to perform mixedprecision computations,
and the ability to compute on subviews of a matrix or vector
by using
(P2630). (
was separated from the rest of P0009
as a way to avoid delaying the hopeful adoption of P0009 for C++23.
The reference implementation of
includes
.)
constexpr size_t M = 40 ; std :: vector < float > A_vec ( M * 8 * 4 ); std :: vector < double > x_vec ( M * 4 ); std :: vector < double > y_vec ( M * 8 ); mdspan < float , extents < size_t , dynamic_extent , 8 , 4 >> A ( A_vec . data (), M ); mdspan < double , extents < size_t , 4 , dynamic_extent >> x ( x_vec . data (), M ); mdspan < double , extents < size_t , dynamic_extent , 8 >> y ( y_vec . data (), M ); for ( size_t m = 0 ; m < A . extent ( 0 ); ++ m ) { for ( size_t i = 0 ; i < A . extent ( 1 ); ++ i ) { for ( size_t j = 0 ; j < A . extent ( 2 ); ++ j ) { A [ m , i , j ] = 1000.0 * m + 100.0 * i + j ; } } } for ( size_t i = 0 ; i < x . extent ( 0 ); ++ i ) { for ( size_t m = 0 ; m < x . extent ( 1 ); ++ m ) { x [ i , m ] = 33. * i + 0.33 * m ; } } for ( size_t m = 0 ; m < y . extent ( 0 ); ++ m ) { for ( size_t i = 0 ; i < y . extent ( 1 ); ++ i ) { y [ m , i ] = 33. * m + 0.33 * i ; } } for ( size_t m = 0 ; m < M ; ++ m ) { auto A_m = submdspan ( A , m , full_extent , full_extent ); auto x_m = submdspan ( x , full_extent , m ); auto y_m = submdspan ( y , m , full_extent ); // y_m = A * x_m linalg :: matrix_vector_product ( A_m , x_m , y_m ); }
4. Overview of contents
Section 5 explains how this proposal is interoperable with other linear algebra proposals currently under WG21 review. In particular, we believe this proposal is complementary to P1385R6, and the authors of P1385R6 have expressed the same view.
Section 6 motivates considering any dense linear algebra proposal for the C++ Standard Library.
Section 7 shows why we chose the BLAS as a starting point for our proposed library. The BLAS is an existing standard with decades of use, a rich set of functions, and many optimized implementations.
Section 8 lists what we consider general criteria for including algorithms in the C++ Standard Library. We rely on these criteria to justify the algorithms in this proposal.
Section 9 describes BLAS notation and conventions in C++ terms. Understanding this will give readers context for algorithms, and show how our proposed algorithms expand on BLAS functionality.
Section 10 lists functionality that we intentionally exclude from our proposal. We imitate the BLAS in aiming to be a set of "performance primitives" on which external libraries or applications may build a more complete linear algebra solution.
Section 11 elaborates on our design justification. This section explains

why we use
to represent matrix and vector parameters;mdspan 
how we translate the BLAS' Fortrancentric idioms into C++;

how the BLAS' different "matrix types" map to different algorithms, rather than different
layouts;mdspan 
how we express qualityofimplementation recommendations about avoiding undue overflow and underflow; and

how we impose requirements on algorithms' behavior and on the various value types that algorithms encounter.
Section 12 lists future work, that is, ways future proposals could build on this one.
Section 13 gives the data structures and utilities from other proposals on which we depend.
In particular, we rely heavily on
(P0009),
and add custom layouts and accessors.
Section 14 credits funding agencies and contributors.
Section 15 is our bibliography.
Section 16 is where readers will find the normative wording we propose.
Finally, Section 17 gives some more elaborate examples of linear algebra algorithms that use our proposal.
The examples show how
's features let users easily describe "submatrices" with
,
proposed in P2630 as a followon to mdspan.
(The reference implementation of
includes
.)
This integrates naturally with "block" factorizations of matrices.
The resulting notation is concise, yet still computes in place,
without unnecessary copies of any part of the matrix.
Here is a table that maps between Reference BLAS function name, and algorithm or function name in our proposal. The mapping is not always one to one. "N/A" in the "BLAS name(s)" field means that the function is not in the BLAS.
BLAS name(s)  Our name(s) 

xLARTG 

xROT 

xSWAP 

xSCAL  ,

xCOPY 

xAXPY  ,

xDOT, xDOTU 

xDOTC 

N/A 

xNRM2 

xASUM 

xIAMAX 

N/A 

N/A 

N/A 

xGEMV 

xSYMV 

xHEMV 

xTRMV 

xTRSV 

xGER, xGERU 

xGERC 

xSYR 

xHER 

xSYR2 

xHER2 

xGEMM 

xSYMM 

xHEMM 

xTRMM 

xSYRK 

xHERK 

xSYR2K 

xHER2K 

xTRSM 

5. Interoperable with other linear algebra proposals
We believe this proposal is complementary to P1385R6, a proposal for a C++ Standard linear algebra library that introduces matrix and vector classes with overloaded arithmetic operators. The P1385 authors and we have expressed together in a joint paper, P1891, that P1673 and P1385 "are orthogonal. They are not competing papers; ... there is no overlap of functionality."
Many of us have had experience using and implementing matrix and vector operator arithmetic libraries like what P1385 proposes. We designed P1673 in part as a natural foundation or implementation layer for such libraries. Our view is that a free function interface like P1673’s clearly separates algorithms from data structures, and more naturally allows for a richer set of operations such as what the BLAS provides. Our paper P1674 explains why we think our proposal is a minimal C++ "respelling" of the BLAS.
A natural extension of the present proposal would include
accepting P1385’s matrix and vector objects as input
for the algorithms proposed here.
A straightforward way to do that would be for P1385’s matrix and vector objects
to make views of their data available as
.
6. Why include dense linear algebra in the C++ Standard Library?

"Direction for ISO C++" (P0939R4) explicitly calls out "Linear Algebra" as a potential priority for C++23.

C++ applications in "important application areas" (see P0939R4) have depended on linear algebra for a long time.

Linear algebra is like
: obvious algorithms are slow, and the fastest implementations call for hardwarespecific tuning.sort 
Dense linear algebra is core functionality for most of linear algebra, and can also serve as a building block for tensor operations.

The C++ Standard Library includes plenty of "mathematical functions." Linear algebra operations like matrixmatrix multiply are at least as broadly useful.

The set of linear algebra operations in this proposal are derived from a wellestablished, standard set of algorithms that has changed very little in decades. It is one of the strongest possible examples of standardizing existing practice that anyone could bring to C++.

This proposal follows in the footsteps of many recent successful incorporations of existing standards into C++, including the UTC and TAI standard definitions from the International Telecommunications Union, the time zone database standard from the International Assigned Numbers Authority, and the ongoing effort to integrate the ISO unicode standard.
Linear algebra has had wide use in C++ applications for nearly three decades (see P1417R0 for a historical survey). For much of that time, many thirdparty C++ libraries for linear algebra have been available. Many different subject areas depend on linear algebra, including machine learning, data mining, web search, statistics, computer graphics, medical imaging, geolocation and mapping, engineering, and physicsbased simulations.
"Directions for ISO C++" (P0939R4) not only lists "Linear Algebra" explicitly as a potential C++23 priority, it also offers the following in support of adding linear algebra to the C++ Standard Library.

P0939R4 calls out "Support for demanding applications" in "important application areas, such as medical, finance, automotive, and games (e.g., key libraries...)" as an "area of general concern" that "we should not ignore." All of these areas depend on linear algebra.

"Is my proposal essential for some important application domain?" Many large and small private companies, science and engineering laboratories, and academics in many different fields all depend on linear algebra.

"We need better support for modern hardware": Modern hardware spends many of its cycles in linear algebra. For decades, hardware vendors, some represented at WG21 meetings, have provided and continue to provide features specifically to accelerate linear algebra operations. Some of them even implement specific linear algebra operations directly in hardware. Examples include NVIDIA’s Tensor Cores and Cerebras' Wafer Scale Engine. Several large computer system vendors offer optimized linear algebra libraries based on or closely resembling the BLAS. These include AMD’s BLIS, ARM’s Performance Libraries, Cray’s LibSci, Intel’s Math Kernel Library (MKL), IBM’s Engineering and Scientific Subroutine Library (ESSL), and NVIDIA’s cuBLAS.
Obvious algorithms for some linear algebra operations like dense
matrixmatrix multiply are asymptotically slower than lessobvious
algorithms. (Please refer to a survey one of us coauthored, "Communication lower bounds and optimal algorithms
for numerical linear algebra." Furthermore, writing the fastest dense matrixmatrix multiply depends
on details of a specific computer architecture. This makes such
operations comparable to
in the C++ Standard Library: worth
standardizing, so that Standard Library implementers can get them
right and hardware vendors can optimize them. In fact, almost all C++
linear algebra libraries end up calling nonC++ implementations of
these algorithms, especially the implementations in optimized BLAS
libraries (see below). In this respect, linear algebra is also
analogous to standard library features like
: often
implemented directly in assembly or even with special hardware, and
thus an essential component of allowing no room for another language
"below" C++ (see P0939R4)
and Stroustrup’s "The Design and Evolution of C++").
Dense linear algebra is the core component of most algorithms and applications that use linear algebra, and the component that is most widely shared over different application areas. For example, tensor computations end up spending most of their time in optimized dense linear algebra functions. Sparse matrix computations get best performance when they spend as much time as possible in dense linear algebra.
The C++ Standard Library includes many "mathematical special functions" ([sf.cmath]), like incomplete elliptic integrals, Bessel functions, and other polynomials and functions named after various mathematicians. Any of them comes with its own theory and set of applications for which robust and accurate implementations are indispensible. We think that linear algebra operations are at least as broadly useful, and in many cases significantly more so.
7. Why base a C++ linear algebra library on the BLAS?

The BLAS is a standard that codifies decades of existing practice.

The BLAS separates "performance primitives" for hardware experts to tune, from mathematical operations that rely on those primitives for good performance.

Benchmarks reward hardware and system vendors for providing optimized BLAS implementations.

Writing a fast BLAS implementation for common element types is nontrivial, but well understood.

Optimized thirdparty BLAS implementations with liberal software licenses exist.

Building a C++ interface on top of the BLAS is a straightforward exercise, but has pitfalls for unaware developers.
Linear algebra has had a crosslanguage standard, the Basic Linear Algebra Subroutines (BLAS), since 2002. The Standard came out of a standardization process that started in 1995 and held meetings three times a year until 1999. Participants in the process came from industry, academia, and government research laboratories. The dense linear algebra subset of the BLAS codifies forty years of evolving practice, and has existed in recognizable form since 1990 (see P1417R0).
The BLAS interface was specifically designed as the distillation of the "computer science" or performanceoriented parts of linear algebra algorithms. It cleanly separates operations most critical for performance, from operations whose implementation takes expertise in mathematics and roundingerror analysis. This gives vendors opportunities to add value, without asking for expertise outside the typical required skill set of a Standard Library implementer.
Wellestablished benchmarks such as the LINPACK benchmark, reward computer hardware vendors for optimizing their BLAS implementations. Thus, many vendors provide an optimized BLAS library for their computer architectures. Writing fast BLASlike operations is not trivial, and depends on computer architecture. However, it is a wellunderstood problem whose solutions could be parameterized for a variety of computer architectures. See, for example, Goto and van de Geijn 2008. There are optimized thirdparty BLAS implementations for common architectures, like ATLAS and GotoBLAS. A (slow but correct) reference implementation of the BLAS exists and it has a liberal software license for easy reuse.
We have experience in the exercise of wrapping a C or Fortran BLAS implementation for use in portable C++ libraries. We describe this exercise in detail in our paper "Evolving a Standard C++ Linear Algebra Library from the BLAS" (P1674). It is straightforward for vendors, but has pitfalls for developers. For example, Fortran’s application binary interface (ABI) differs across platforms in ways that can cause runtime errors (even incorrect results, not just crashing). Historical examples of vendors' C BLAS implementations have also had ABI issues that required workarounds. This dependence on ABI details makes availability in a standard C++ library valuable.
8. Criteria for including algorithms
8.1. Criteria for all the algorithms
We include algorithms in our proposal based on the following criteria, ordered by decreasing importance. Many of our algorithms satisfy multiple criteria.

Getting the desired asymptotic run time is nontrivial

Opportunity for vendors to provide hardwarespecific optimizations

Opportunity for vendors to provide qualityofimplementation improvements, especially relating to accuracy or reproducibility with respect to floatingpoint rounding error

User convenience (familiar name, or tedious to implement)
Regarding (1), "nontrivial" means "at least for novices to the field."
Dense matrixmatrix multiply is a good example.
Getting close to the asymptotic lower bound on the number of memory reads and writes
matters a lot for performance, and calls for a nonintuitive loop reordering.
An analogy to the current C++ Standard Library is
,
where intuitive algorithms that many humans use are not asymptotically optimal.
Regarding (2), a good example is copying multidimensional arrays.
The Kokkos library spends about 2500 lines of code on multidimensional array copy,
yet still relies on system libraries for lowlevel optimizations.
An analogy to the current C++ Standard Library is
or even
.
Regarding (3), accurate floatingpoint summation is nontrivial.
Wellmeaning compiler optimizations might defeat even simple technqiues,
like compensated summation.
The most obvious way to compute a vector’s Euclidean norm
(square root of sum of squares) can cause overflow or underflow,
even when the exact answer is much smaller than the overflow threshold,
or larger than the underflow threshold.
Some users care deeply about sums, even parallel sums,
that always get the same answer, despite rounding error.
This can help debugging, for example.
It is possible to make floatingpoint sums completely independent
of parallel evaluation order.
See e.g., the ReproBLAS effort.
Naming these algorithms and providing
customization hooks
gives vendors a chance to provide these improvements.
An analogy to the current C++ Standard Library is
,
whose language in the C++ Standard alludes to the tighter POSIX requirements.
Regarding (4), the C++ Standard Library is not entirely minimalist.
One example is
.
Existing Standard Library algorithms already offered this functionality,
but a member
function is easy for novices to find and use,
and avoids the tedium of comparing the result of
to
.
The BLAS exists mainly for the first two reasons. It includes functions that were nontrivial for compilers to optimize in its time, like scaled elementwise vector sums, as well as functions that generally require human effort to optimize, like matrixmatrix multiply.
8.2. Criteria for including BLAS 1 algorithms; coexistence with ranges
The BLAS developed in three "levels": 1, 2, and 3. BLAS 1 includes "vectorvector" operations like dot products, norms, and vector addition. BLAS 2 includes "matrixvector" operations like matrixvector products and outer products. BLAS 3 includes "matrixmatrix" operations like matrixmatrix products and triangular solve with multiple "righthand side" vectors. The BLAS level coincides with the number of nested loops in a naïve sequential implementation of the operation. Increasing level also comes with increasing potential for data reuse. For history of the BLAS "levels" and a bibliography, see P1417.
We mention this here because some reviewers have asked how the algorithms in our proposal would coexist with the existing ranges algorithms in the C++ Standard Library. This question actually encloses two questions.

Will our proposed algorithms syntactically collide with existing ranges algorithms?

How much overlap do our proposed algorithms have with the existing ranges algorithms? (That is, do we really need these new algorithms?)
8.2.1. Low risk of synactic collision with ranges
We think there is low risk of our proposal colliding syntactically with existing ranges algorithms, for the following reasons.

We propose our algorithms in a new namespace
.std :: linalg 
None of the algorithms we propose share names with any existing ranges algorithms.

We take care not to use
as a suffix, in order to avoid confusion or name collisions with "views" in the sense of ranges._view 
We specifically do not use the names
ortranspose
, since LEWG has advised us that ranges algorithms may want to claim these names. (One could imagine "transposing" a range of ranges.)transpose_view 
We constrain our algorithms only to take vector and matrix parameters as
.mdspan
is not currently a range, and there are currently no proposals in flight that would make it a range. Changingmdspan
of arbitrary rank to be a range would require a design for multidimensional iterators. P0009’s coauthors have not proposed a design, and it has proven challenging to get compilers to optimize any existing design for multidimensional iterators.mdspan
8.2.2. Minimal overlap with ranges is justified by user convenience
The rest of this section answers the second question. The BLAS 2 and 3 algorithms require multiple nested loops, and highperforming implementations generally need intermediate storage. This would make it unnatural and difficult to express them in terms of ranges. Only the BLAS 1 algorithms in our proposal might have a reasonable translation to ranges algorithms. There, we limit ourselves to the BLAS 1 algorithms in what follows.
Any rank1
can be translated into the following range:
auto x_range = views :: iota ( size_t ( 0 ), x . extent ( 0 ))  views :: transform ([ x ] ( auto k ) { return x [ k ]; });
with specializations possible for
whose layout mapping’s range
is a contiguous or fixedstride set of offsets.
However, just because code _could_ be written in a certain way
doesn’t mean that it _should_ be.
We have ranges even though the language has
loops;
we don’t need to step in the Turing tarpit on purpose
(see Perlis 1982).
Thus, we will analyze the BLAS 1 algorithms in this proposal
in the context of the previous section’s four general criteria.
Our proposal would add 61 new unique names to the C++ Standard Library. Of those, 16 are BLAS 1 algorithms, while 24 are BLAS 2 and 3 algorithms. The 16 BLAS 1 algorithms fall into three categories:

Optimization hooks, like
. As withcopy
, the fastest implementation may depend closely on the computer architecture, and may differ significantly from a straightforward implementation. Some of these algorithms, likememcpy
, can operate on multidimensional arrays as well, though it is traditional to list them as BLAS 1 algorithms.copy 
Floatingpoint qualityofimplementation hooks, like
. These give vendors opportunities to avoid preventable floatingpoint underflow and overflow (as withvector_sum_of_squares
), improve accuracy, and reduce or even avoid parallel nondeterminism and order dependence of floatingpoint sums.hypot 
Uncomplicated elementwise algorithms like
,add
, andidx_abs_max
.scale
We included the first category mainly because of Criterion (2) "Opportunity for vendors to provide hardwarespecific optimizations," and the second mainly because of Criterion (3) ("Opportunity for vendors to provide qualityofimplementation improvements"). Fast implementations of algorithms in the first category are not likely to be simple uses of ranges algorithms.
Algorithms in the second category could be presented as ranges algorithms,
as
algorithms, or as both.
The "iterating over elements" part of those algorithms
is not the most challenging part of their implementation,
nor is it what makes an implementation "high quality."
Algorithms in the third category could be replaced
with a few lines of C++ that use ranges algorithms.
For example, here is a parallel implementation of
,
omitting constraints on template parameters for simplicity.
(Here is a Compiler Explorer link to a working example.)
template < class Element , class Extents , class Layout , class Accessor > typename Extents :: size_type idx_abs_max ( std :: mdspan < Element , Extents , Layout , Accessor > x ) { auto theRange = std :: views :: iota ( size_t ( 0 ), x . extent ( 0 ))  std :: views :: transform ([ = ] ( auto k ) { return std :: abs ( x [ k ]); }); auto iterOfMax = std :: max_element ( std :: execution :: par_unseq , theRange . begin (), theRange . end ()); auto indexOfMax = std :: ranges :: distance ( theRange . begin (), iterOfMax ); // In GCC 12.1, the return type is __int128. return static_cast < typename Extents :: size_type > ( indexOfMax ); }
Even though the algorithms in the third category could be implemented straightforwardly with ranges, we provide them because of Criterion 4 ("User convenience"). Criterion (4) applies to all the algorithms in this proposal, and particularly to the BLAS 1 algorithms. Matrix algorithm developers need BLAS 1 and 2 as well as BLAS 3, because matrix algorithms tend to decompose into vector algorithms. This is true even of socalled "block" matrix algorithms that have been optimized to use matrixmatrix operations wherever possible, in order to improve memory locality. Demmel et al. 1987 (p. 4) explains.
Block algorithms generally require an unblocked version of the same algorithm to be available to operate on a single block. Therefore, the development of the software will fall naturally into two phases: first, develop unblocked versions of the routines, calling the Level 2 BLAS wherever possible; then develop blocked versions where possible, calling the Level 3 BLAS.
Dongarra et al. 1990 (pp. 1215) outlines this development process
for the specific example of Cholesky factorization.
The Cholesky factorization algorithm (on p. 14)
spends most of its time (for a sufficiently large input matrix)
in matrixmatrix multiplies (
),
rankk symmetric matrices updates
(
, a special case of matrixmatrix multiply),
and triangular solves with multiple "righthand side" vectors (
).
However, it still needs an "unblocked" Cholesky factorization
as the blocked factorization’s "base case."
This is called
in Dongarra et al. 1990 (p. 15),
and it uses
,
(both BLAS2), and
(BLAS 2).
In the case of Cholesky factorization,
it’s possible to express the "unblocked" case without using BLAS 1 or 2 operations,
by using recursion.
This is the approach that LAPACK takes with the blocked Cholesky factorization
and its unblocked base case
.
However, even a recursive formulation of most matrix factorizations
needs to use BLAS 1 or 2 operations.
For example, the unblocked base case
of LAPACK’s blocked LU factorization
needs to invoke vectorvector operations like
.
In summary, matrix algorithm developers need vector algorithms,
because matrix algorithms decompose into vector algorithms.
If our proposal lacked BLAS 1 algorithms,
even simple ones like
and
,
matrix algorithm developers would end up writing them anyway.
9. Notation and conventions
9.1. The BLAS uses Fortran terms
The BLAS' "native" language is Fortran. It has a C binding as well, but the BLAS Standard and documentation use Fortran terms. Where applicable, we will call out relevant Fortran terms and highlight possibly confusing differences with corresponding C++ ideas. Our paper P1674 ("Evolving a Standard C++ Linear Algebra Library from the BLAS") goes into more detail on these issues.
9.2. We call "subroutines" functions
Like Fortran, the BLAS distinguishes between functions that return a value, and subroutines that do not return a value. In what follows, we will refer to both as "BLAS functions" or "functions."
9.3. Element types and BLAS function name prefix
The BLAS implements functionality for four different matrix, vector, or scalar element types:

(REAL
in C++ terms)float 
(DOUBLE PRECISION
in C++ terms)double 
(COMPLEX
in C++ terms)complex < float > 
(DOUBLE COMPLEX
in C++ terms)complex < double >
The BLAS' Fortran 77 binding uses a function name prefix to distinguish functions based on element type:

forS
("single")REAL 
forD DOUBLE PRECISION 
forC COMPLEX 
forZ DOUBLE COMPLEX
For example, the four BLAS functions
,
,
, and
all perform the vector update
for vectors
and
and scalar
, but for different vector and scalar
element types.
The convention is to refer to all of these functions together as
. In general, a lowercase
is a placeholder for all data
type prefixes that the BLAS provides. For most functions, the
is
a prefix, but for a few functions like
, the data type
"prefix" is not the first letter of the function name. (
is a
Fortran function that returns
, and therefore follows the old
Fortran implicit naming rule that integers start with
,
, etc.)
Not all BLAS functions exist for all four data types. These come in three categories:

The BLAS provides only realarithmetic (
andS
) versions of the function, since the function only makes mathematical sense in real arithmetic.D 
The complexarithmetic versions perform a slightly different mathematical operation than the realarithmetic versions, so they have a different base name.

The complexarithmetic versions offer a choice between nonconjugated or conjugated operations.
As an example of the second category, the BLAS functions
and
compute the sums of absolute values of a vector’s elements.
Their complex counterparts
and
compute the sums of
absolute values of real and imaginary components of a vector
, that
is, the sum of
for all
in the
domain of
. The latter operation is still useful as a vector norm,
but it requires fewer arithmetic operations.
Examples of the third category include the following:

nonconjugated dot product
and conjugated dot productxDOTU
; andxDOTC 
rank1 symmetric (
) vs. Hermitian (xGERU
) matrix update.xGERC
The conjugate transpose and the (nonconjugated) transpose are the same operation in real arithmetic (if one considers real arithmetic embedded in complex arithmetic), but differ in complex arithmetic. Different applications have different reasons to want either. The C++ Standard includes complex numbers, so a Standard linear algebra library needs to respect the mathematical structures that go along with complex numbers.
10. What we exclude from the design
10.1. Most functions not in the Reference BLAS
The BLAS Standard includes functionality that appears neither in the Reference BLAS library, nor in the classic BLAS "level" 1, 2, and 3 papers. (For history of the BLAS "levels" and a bibliography, see P1417R0. For a paper describing functions not in the Reference BLAS, see "An updated set of basic linear algebra subprograms (BLAS)," listed in "Other references" below.) For example, the BLAS Standard has

several new dense functions, like a fused vector update and dot product;

sparse linear algebra functions, like sparse matrixvector multiply and an interface for constructing sparse matrices; and

extended and mixedprecision dense functions (though we subsume some of their functionality; see below).
Our proposal only includes core Reference BLAS functionality, for the following reasons:

Vendors who implement a new component of the C++ Standard Library will want to see and test against an existing reference implementation.

Many applications that use sparse linear algebra also use dense, but not vice versa.

The Sparse BLAS interface is a stateful interface that is not consistent with the dense BLAS, and would need more extensive redesign to translate into a modern C++ idiom. See discussion in P1417R0.

Our proposal subsumes some dense mixedprecision functionality (see below).
We have included vector sumofsquares and matrix norms as exceptions, for the same reason that we include vector 2norm: to expose hooks for qualityofimplementation improvements that avoid underflow or overflow when computing with floatingpoint values.
10.2. LAPACK or related functionality
The LAPACK Fortran library implements solvers for the following classes of mathematical problems:

linear systems,

linear leastsquares problems, and

eigenvalue and singular value problems.
It also provides matrix factorizations and related linear algebra operations. LAPACK deliberately relies on the BLAS for good performance; in fact, LAPACK and the BLAS were designed together. See history presented in P1417R0.
Several C++ libraries provide slices of LAPACK functionality. Here is a brief, noninclusive list, in alphabetical order, of some libraries actively being maintained:
P1417R0 gives some history of C++ linear algebra libraries. The authors of this proposal have designed, written, and maintained LAPACK wrappers in C++. Some authors have LAPACK founders as PhD advisors. Nevertheless, we have excluded LAPACKlike functionality from this proposal, for the following reasons:

LAPACK is a Fortran library, unlike the BLAS, which is a multilanguage standard.

We intend to support more general element types, beyond the four that LAPACK supports. It’s much more straightforward to make a C++ BLAS work for general element types, than to make LAPACK algorithms work generically.
First, unlike the BLAS, LAPACK is a Fortran library, not a standard. LAPACK was developed concurrently with the "level 3" BLAS functions, and the two projects share contributors. Nevertheless, only the BLAS and not LAPACK got standardized. Some vendors supply LAPACK implementations with some optimized functions, but most implementations likely depend heavily on "reference" LAPACK. There have been a few efforts by LAPACK contributors to develop C++ LAPACK bindings, from Lapack++ in pretemplates C++ circa 1993, to the recent "C++ API for BLAS and LAPACK". (The latter shares coauthors with this proposal.) However, these are still just C++ bindings to a Fortran library. This means that if vendors had to supply C++ functionality equivalent to LAPACK, they would either need to start with a Fortran compiler, or would need to invest a lot of effort in a C++ reimplementation. Mechanical translation from Fortran to C++ introduces risk, because many LAPACK functions depend critically on details of floatingpoint arithmetic behavior.
Second, we intend to permit use of matrix or vector element types other than just the four types that the BLAS and LAPACK support. This includes "short" floatingpoint types, fixedpoint types, integers, and userdefined arithmetic types. Doing this is easier for BLASlike operations than for the much more complicated numerical algorithms in LAPACK. LAPACK strives for a "generic" design (see Jack Dongarra interview summary in P1417R0), but only supports two real floatingpoint types and two complex floatingpoint types. Directly translating LAPACK source code into a "generic" version could lead to pitfalls. Many LAPACK algorithms only make sense for number systems that aim to approximate real numbers (or their complex extentions). Some LAPACK functions output error bounds that rely on properties of floatingpoint arithmetic.
For these reasons, we have left LAPACKlike functionality for future work. It would be natural for a future LAPACKlike C++ library to build on our proposal.
10.3. Extendedprecision BLAS
Our interface subsumes some functionality of the MixedPrecision BLAS
specification (Chapter 4 of the BLAS Standard). For example, users
may multiply two 16bit floatingpoint matrices (assuming that a
16bit floatingpoint type exists) and accumulate into a 32bit
floatingpoint matrix, just by providing a 32bit floatingpoint
matrix as output. Users may specify the precision of a dot product
result. If it is greater than the input vectors' element type
precisions (e.g.,
vs.
), then this effectively
performs accumulation in higher precision. Our proposal imposes
semantic requirements on some functions, like
, to
behave in this way.
However, we do not include the "ExtendedPrecision BLAS" in this proposal. The BLAS Standard lets callers decide at run time whether to use extended precision floatingpoint arithmetic for internal evaluations. We could support this feature at a later time. Implementations of our interface also have the freedom to use more accurate evaluation methods than typical BLAS implementations. For example, it is possible to make floatingpoint sums completely independent of parallel evaluation order.
10.4. Arithmetic operators and associated expression templates
Our proposal omits arithmetic operators on matrices and vectors. We do so for the following reasons:

We propose a lowlevel, minimal interface.

could have multiple meanings for matrices and vectors. Should it mean elementwise product (likeoperator *
) or matrix product? Should libraries reinterpret "vector times vector" as a dot product (row vector times column vector)? We prefer to let a higherlevel library decide this, and make everything explicit at our lower level.valarray 
Arithmetic operators require defining the element type of the vector or matrix returned by an expression. Functions let users specify this explicitly, and even let users use different output types for the same input types in different expressions.

Arithmetic operators may require allocation of temporary matrix or vector storage. This prevents use of nonowning data structures.

Arithmetic operators strongly suggest expression templates. These introduce problems such as dangling references and aliasing.
Our goal is to propose a lowlevel interface. Other libraries, such as that proposed by P1385R6, could use our interface to implement overloaded arithmetic for matrices and vectors. A constrained, functionbased, BLASlike interface builds incrementally on the many years of BLAS experience.
Arithmetic operators on matrices and vectors would require the
library, not necessarily the user, to specify the element type of an
expression’s result. This gets tricky if the terms have mixed element
types. For example, what should the element type of the result of the
vector sum
be, if
has element type
and
has element type
? It’s tempting to use
,
but
is
. This
loses precision. Some users may want
; others may
want
or something else, and others may want to
choose different types in the same program.
P1385R6 lets users customize the return type
of such arithmetic expressions. However, different algorithms may
call for the same expression with the same inputs to have different
output types. For example, iterative refinement of linear systems
can work either with an extendedprecision intermediate
residual vector
, or with a residual vector that has the
same precision as the input linear system. Each choice produces a
different algorithm with different convergence characteristics,
periteration run time, and memory requirements. Thus, our library
lets users specify the result element type of linear algebra
operations explicitly, by calling a named function that takes an
output argument explicitly, rather than an arithmetic operator.
Arithmetic operators on matrices or vectors may also need to allocate
temporary storage. Users may not want that. When LAPACK’s developers
switched from Fortran 77 to a subset of Fortran 90, their users
rejected the option of letting LAPACK functions allocate temporary
storage on their own. Users wanted to control memory allocation.
Also, allocating storage precludes use of nonowning input data
structures like
, that do not know how to allocate.
Arithmetic expressions on matrices or vectors strongly suggest
expression templates, as a way to avoid allocation of temporaries and
to fuse computational kernels. They do not require expression
templates. For example,
offers overloaded operators for
vector arithmetic, but the Standard lets implementers decide whether
to use expression templates. However, all of the current C++ linear
algebra libraries that we mentioned above have some form of expression
templates for overloaded arithmetic operators, so users will expect
this and rely on it for good performance. This was, indeed, one of
the major complaints about initial implementations of
: its
lack of mandate for expression templates meant that initial
implementations were slow, and thus users did not want to rely on it.
(See Josuttis 1999, p. 547, and Vandevoorde and Josuttis 2003, p. 342,
for a summary of the history. Fortran has an analogous issue, in
which (under certain conditions) it is implementation defined whether
the runtime environment needs to copy noncontiguous slices of an
array into contiguous temporary storage.)
Expression templates work well, but have issues. Our papers P1417R0 and
"Evolving a Standard C++ Linear Algebra Library from the BLAS" (P1674) give more detail on these concerns.
A particularly troublesome one is that C++
type deduction
makes it easy for users to capture expressions
before the expression templates system
has the chance to evaluate them and write the result into the output.
For matrices and vectors with container semantics,
this makes it easy to create dangling references.
Users might not realize that they need to assign expressions to named
types before actual work and storage happen. Eigen’s
documentation describes this common problem.
Our
,
,
, and
functions
make use of one aspect of expression templates, namely modifying the
array access operator. However, we intend these
functions for use only as inplace modifications of arguments of a
function call. Also, when modifying
, these functions
merely view the same data that their input
views. They
introduce no more potential for dangling references than
itself. The use of views like
is
selfdocumenting; it tells users that they need to take responsibility
for scope of the viewed data.
10.5. Banded matrix layouts
This proposal omits banded matrix types. It would be easy to add the required layouts and specializations of algorithms later. The packed and unpacked symmetric and triangular layouts in this proposal cover the major concerns that would arise in the banded case, like nonstrided and nonunique layouts, and matrix types that forbid access to some multiindices in the Cartesian product of extents.
10.6. Tensors
We exclude tensors from this proposal, for the following reasons.
First, tensor libraries naturally build on optimized dense linear
algebra libraries like the BLAS, so a linear algebra library is a good
first step. Second,
has natural use as a
lowlevel representation of dense tensors, so we are already partway
there. Third, even simple tensor operations that naturally generalize
the BLAS have infintely many more cases than linear algebra. It’s not
clear to us which to optimize. Fourth, even though linear algebra is
a special case of tensor algebra, users of linear algebra have
different interface expectations than users of tensor algebra. Thus,
it makes sense to have two separate interfaces.
10.7. Explicit support for asynchronous return of scalar values
After we presented revision 2 of this paper,
LEWG asked us to consider support for discrete graphics processing units (GPUs).
GPUs have two features of interest here.
First, they might have memory that is not accessible from ordinary C++ code,
but could be accessed in a standard algorithm
(or one of our proposed algorithms)
with the right implementationspecific
.
(For instance, a policy could say "run this algorithm on the GPU.")
Second, they might execute those algorithms asynchronously.
That is, they might write to output arguments at some later time
after the algorithm invocation returns.
This would imply different interfaces in some cases.
For instance, a hypothetical asynchronous vector 2norm
might write its scalar result via a pointer to GPU memory,
instead of returning the result "on the CPU."
Nothing in principle prevents
from viewing memory
that is inaccessible from ordinary C++ code.
This is a major feature of the
class
from the Kokkos library,
and
directly inspired
.
The C++ Standard does not currently define how such memory behaves,
but implementations could define its behavior and make it work with
.
This would, in turn, let implementations define our algorithms
to operate on such memory efficiently,
if given the right implementationspecific
.
Our proposal excludes algorithms that might write to their output arguments
at some time after after the algorithm returns.
First, LEWG insisted that our proposed algorithms that compute a scalar result,
like
,
return that result in the manner of
,
rather than writing the result to an output reference or pointer.
(Previous revisions of our proposal used the latter interface pattern.)
Second, it’s not clear whether writing a scalar result to a pointer
is the right interface for asynchronous algorithms.
Followon proposals to Executors (P0443R14) include asynchronous algorithms,
but none of these suggest returning results asynchronously by pointer.
Our proposal deliberately imitates the existing standard algorithms.
Right now, we have no standard asynchronous algorithms to imitate.
11. Design justification
We take a stepwise approach. We begin with core BLAS dense linear algebra functionality. We then deviate from that only as much as necessary to get algorithms that behave as much as reasonable like the existing C++ Standard Library algorithms. Future work or collaboration with other proposals could implement a higherlevel interface.
Please refer to our papers "Evolving a Standard C++ Linear Algebra Library from the BLAS" (P1674) and "Historical lessons for C++ linear algebra library standardization" (P1417) They will give details and references for many of the points that we summarize here.
11.1. We do not require using the BLAS library or any particular "backend"
Our proposal is inspired by and extends the dense BLAS interface. A natural implementation might look like this:

wrap an existing C or Fortran BLAS library,

hope that the BLAS library is optimized, and then

extend the wrapper to include straightforward Standard C++ implementations of P1673’s algorithms for matrix and vector value types and data layouts that the BLAS does not support.
P1674 describes the process of writing such an implementation. However, P1673 does not require implementations to wrap the BLAS. In particular, P1673 does not specify a "backend" Cstyle interface that would let users or implementers "swap out" different BLAS libraries. Here are some reasons why we made this choice.
First, it’s possible to write an optimized implementation entirely in Standard C++, without calling external C or Fortran functions. For example, one can write a cacheblocked matrixmatrix multiply implementationen entirely in Standard C++.
Second, different vendors may have their own libraries
that support matrix and vector value types and/or layouts
beyond what the standard dense BLAS supports.
For example, they may have C functions
for mixedprecision matrixmatrix multiply, like
BLIS'
(example here),
or NVIDIA’s
(example here).
Third, just because a C or Fortran BLAS library can be found, doesn’t mean that it’s optimized at all or optimized well. For example, many Linux distributions have a BLAS software package that is built by compiling the Reference BLAS. This will give poor performance for BLAS 3 operations. Even "optimized" vendor BLAS libraries may not optimize all cases. Release notes even for recent versions show performance improvements.
In summary: While a natural way to implement this proposal would be to wrap an existing C or Fortran BLAS library, we do not want to require this. Thus, we do not specify a "backend" Cstyle interface.
11.2. Why use mdspan
?
11.2.1. View of a multidimensional array
The BLAS operates on what C++ programmers might call views of multidimensional arrays. Users of the BLAS can store their data in whatever data structures they like, and handle their allocation and lifetime as they see fit, as long as the data have a BLAScompatible memory layout.
The corresponding C++ data structure is
.
This class encapsulates the large number of pointer and integer arguments
that BLAS functions take, that represent views of matrices and vectors.
Using
in the C++ interface reduce the number of arguments
and avoids common errors, like mixing up the order of arguments.
It supports all the array memory layouts that the BLAS supports,
including row major and column major.
It also expresses the same data ownership model that the BLAS expresses.
Users may manage allocation and deallocation however they wish.
In addition,
lets our algorithms exploit any dimensions known at compile time.
11.2.2. Ease of use
The
class' layout and accessor policies
let us simplify our interfaces,
by encapsulating transpose, conjugate, and scalar arguments.
Features of
make implementing BLASlike algorithms
much less error prone and easier to read.
These include its encapsulation of matrix indexing
and its builtin "slicing" capabilities via
.
11.2.3. BLAS and mdspan
are low level
The BLAS is low level; it imposes no mathematical meaning on multidimensional arrays.
This gives users the freedom to develop mathematical libraries with the semantics they want.
Similarly,
is just a view of a multidimensional array;
it has no mathematical meaning on its own.
We mention this because "matrix," "vector," and "tensor"
are mathematical ideas that mean more than just arrays of numbers.
This is more than just a theoretical concern.
Some BLAS functions operate on "triangular," "symmetric," or "Hermitian" matrices,
but they do not assert that a matrix has any of these mathematical properties.
Rather, they only read only one side of the matrix (the lower or upper triangle),
and compute as if the other side of the matrix satisfies the mathematical property.
A key feature of the BLAS and libraries that build on it, like LAPACK,
is that they can operate on the matrix’s data in place.
These operations change both the matrix’s mathematical properties and its representation in memory.
For example, one might have an N x N array representing a matrix that is symmetric in theory,
but computed and stored in a way that might not result in exactly symmetric data.
In order to solve linear systems with this matrix,
one might give the array to LAPACK’s
to compute an LDL^T factorization,
asking
only to access the array’s lower triangle.
If
finishes successfully,
it has overwritten the lower triangle of its input
with a representation of both the lower triangular factor L
and the block diagonal matrix D,
as computed assuming that the matrix is the sum of the lower triangle
and the transpose of the lower triangle.
The resulting N x N array no longer represents a symmetric matrix.
Rather, it contains part of the representation of a LDL^T factorization of the matrix.
The upper triangle still contains the original input matrix’s data.
One may then solve linear systems by giving
the lower triangle,
along with other output of
.
The point of this example is that a "symmetric matrix class"
is the wrong way to model this situation.
There’s an N x N array, whose mathematical interpretation changes
with each inplace operation performed on it.
The lowlevel
data structure carries no mathematical properties in itself,
so it models this situation better.
11.2.4. Hook for future expansion
The
class treats its layout as an extension point.
This lets our interface support layouts beyond what the BLAS Standard permits.
The accessor extension point offers us a hook for future expansion
to support heterogeneous memory spaces.
(This is a key feature of
,
the data structure that inspired
.)
In addition, using
will make it easier for us to add
an efficient "batched" interface in future proposals.
11.2.5. Generic enough to replace a "multidimensional array concept"
LEWGI requested in the 2019 Cologne meeting that we explore using a
concept instead of
to define the arguments for the
linear algebra functions.
This would mean that instead of having our functions take
parameters,
they would be generic on one or more suitably constrained multidimensional array types.
The constraints would form a "multidimensional array concept."
We investigated this option, and rejected it, for the following reasons.

Our proposal uses enough features of
that any concept generally applicable to all functions we propose would largely replicate the definition ofmdspan
.mdspan 
This proposal could support most multidimensional array types, if the array types just made themselves convertible to
.mdspan 
We could always generalize our algorithms later.

Any multidimensional array concept would need revision in the light of P2128R6.
This proposal refers to almost all of
's features,
including
, layouts, and accessors.
We expect implementations to use all of them for optimizations,
for example to extract the scaling factor from the return value of
in order to call an optimized BLAS library directly.
Suppose that a general customization point
exists,
that takes a reference to a multidimensional array type
and returns an
that views the array.
Then, our proposal could support most multidimensional array types.
"Most" includes all such types that refer to a subset of a contiguous span of memory.
Requiring that a multidimensional array refer to a subset of a contiguous span of memory
would exclude multidimensional array types that have a noncontiguous backing store,
such as a
.
If we later wanted to support such types,
we could always generalize our algorithms later.
Finally, any existing multidimensional array concept would need revision
in the light of P2128R6,
which has been voted into the C++ Standard draft.
P2128 lets
take multiple parameters.
The version of mdspan which was voted into the C++ Standard draft
uses
instead of
as the array access operator.
After further discussion at the 2019 Belfast meeting,
LEWGI accepted our position that having our algorithms take
instead of template parameters constrained by a multidimensional array concept
would be fine for now.
11.3. Function argument aliasing and zero scalar multipliers
Summary:

The BLAS Standard forbids aliasing any input (readonly) argument with any output (writeonly or readandwrite) argument.

The BLAS uses
(readandwrite) arguments to express "updates" to a vector or matrix. By contrast, C++ Standard algorithms likeINTENT ( INOUT )
take input and output iterator ranges as different parameters, but may let input and output ranges be the same.transform 
The BLAS uses the values of scalar multiplier arguments ("alpha" or "beta") of vectors or matrices at run time, to decide whether to treat the vectors or matrices as write only. This matters both for performance and semantically, assuming IEEE floatingpoint arithmetic.

We decide separately, based on the category of BLAS function, how to translate
arguments into a C++ idiom:INTENT ( INOUT )
a. For triangular solve and triangular multiply, inplace behavior
is essential for computing matrix factorizations in place,
without requiring extra storage proportional to the input
matrix’s dimensions. However, inplace functions may hinder
implementations' use of some forms of parallelism.
Thus, we have both notinplace and inplace overloads.
Both take an optional
,
as some forms of parallelism (e.g., vectorization)
may still be effective with inplace operations.
b. Else, if the BLAS function unconditionally updates (like
), we retain readandwrite behavior for that argument.
c. Else, if the BLAS function uses a scalar
argument to
decide whether to read the output argument as well as write to
it (like
), we provide two versions: a writeonly version
(as if
is zero), and a readandwrite version (as if
is nonzero).
For a detailed analysis, please see our paper "Evolving a Standard C++ Linear Algebra Library from the BLAS" (P1674).
11.4. Support for different matrix layouts
Summary:

The dense BLAS supports several different dense matrix "types." Type is a mixture of "storage format" (e.g., packed, banded) and "mathematical property" (e.g., symmetric, Hermitian, triangular).

Some "types" can be expressed as custom
layouts. Other types actually represent algorithmic constraints: for instance, what entries of the matrix the algorithm is allowed to access.mdspan 
Thus, a C++ BLAS wrapper cannot overload on matrix "type" simply by overloading on
specialization. The wrapper must use different function names, tags, or some other way to decide what the matrix type is.mdspan
For more details, including a list and description of the matrix "types" that the dense BLAS supports, please see our paper "Evolving a Standard C++ Linear Algebra Library from the BLAS" (P1674).
A C++ linear algebra library has a few possibilities for distinguishing the matrix "type":

It could imitate the BLAS, by introducing different function names, if the layouts and accessors do not sufficiently describe the arguments.

It could introduce a hierarchy of higherlevel classes for representing linear algebra objects, use
(or something like it) underneath, and write algorithms to those higherlevel classes.mdspan 
It could use the layout and accessor types in
simply as tags to indicate the matrix "type." Algorithms could specialize on those tags.mdspan
We have chosen Approach 1. Our view is that a BLASlike interface
should be as lowlevel as possible. Approach 2 is more like a "Matlab
in C++"; a library that implements this could build on our proposal’s
lowerlevel library. Approach 3 sounds attractive. However, most
BLAS matrix "types" do not have a natural representation as layouts.
Trying to hack them in would pollute
 a simple class
meant to be easy for the compiler to optimize  with extra baggage
for representing what amounts to sparse matrices. We think that BLAS
matrix "type" is better represented with a higherlevel library that
builds on our proposal.
11.5. Interpretation of "lower / upper triangular"
11.5.1. Triangle refers to what part of the matrix is accessed
The triangular, symmetric, and Hermitian algorithms in this proposal
all take a
tag that specifies whether the algorithm
should access the upper or lower triangle of the matrix.
This has the same function as the
argument
of the corresponding BLAS routines.
The upper or lower triangular argument
only refers to what part of the matrix the algorithm will access.
The "other triangle" of the matrix need not contain useful data.
For example, with the symmetric algorithms,
need not equal
for any
and
in the domain of
with
not equal to
.
The algorithm just accesses one triangle and interprets the other triangle
as the result of flipping the accessed triangle over the diagonal.
This "interpretation" approach to representing triangular matrices
is critical for matrix factorizations.
For example, LAPACK’s LU factorization (
)
overwrites a matrix A with both its L (lower triangular,
implicitly represented diagonal of all ones)
and U (upper triangular, explicitly stored diagonal) factors.
Solving linear systems Ax=b with this factorization,
as LAPACK’s
routine does,
requires solving first a linear system with the upper triangular matrix U,
and then solving a linear system with the lower triangular matrix L.
If the BLAS required that the "other triangle"
of a triangular matrix had all zero elements,
then LU factorization would require at least twice the storage.
For symmetric and Hermitian matrices,
only accessing the matrix’s elements nonredundantly
ensures that the matrix remains mathematically symmetric resp. Hermitian,
even in the presence of rounding error.
11.5.2. BLAS applies UPLO to original matrix; we apply Triangle to transformed matrix
The BLAS routines that take an
argument
generally also take a
argument.
The
argument says whether to apply the matrix,
its transpose, or its conjugate transpose.
The BLAS applies the
argument to the "original" matrix,
not to the transposed matrix.
For example, if
or
,
means the routine will access the upper triangle of the matrix,
not the upper triangle of the matrix’s transpose.
Our proposal takes the opposite approach.
It applies
to the input matrix,
which may be the result of a transformation
such as
or
.
For example, if
is
,
the algorithm will always access the matrix
for
in its domain with
≤
(or
strictly less than
,
if the algorithm takes a
tag
and
is
).
If the input matrix is
for a
,
this means that the algorithm will access the upper triangle of
,
which is actually the lower triangle of
.
We took this approach because our interface permits arbitrary layouts,
with possibly arbitrary nesting of layout transformations.
This comes from
's design itself, not even necessarily from our proposal.
For example, users might define
,
that flips indices over the antidiagonal
(the "other diagonal" that goes from the lower left to the upper right of the matrix,
instead of from the upper left to the lower right).
Layout transformations need not even be onetoone,
because layouts themselves need not be (hence
).
Since it’s not possible to "undo" a general layout,
there’s no way to get back to the "original matrix."
Our approach, while not consistent with the BLAS, is internally consistent.
always has a clear meaning,
no matter what transformations users apply to the input.
Layout transformations like
have the same interpretation for all the matrix algorithms,
whether for general, triangular, symmetric, or Hermitian matrices.
This interpretation is consistent with the standard meaning of
layouts.
C BLAS implementations already apply layout transformations like this
so that they can use an existing columnmajor Fortran BLAS
to implement operations on matrices with different layouts.
For example, the transpose of an
x
matrix
is just the same data, viewed as an
x
matrix.
Thus,
is consistent with current practice.
In fact,
need not use a special
,
if it knows how to reinterpret the input layout.
11.5.3. Summary

BLAS applies
to the original matrix, before any transposition. Our proposal appliesUPLO
to the transformed matrix, after any transposition.Triangle 
Our approach is the only reasonable way to handle the full generality of userdefined layouts and layout transformations.
11.6. Over and underflow wording for vector 2norm
SG6 recommended to us at Belfast 2019 to change the special overflow /
underflow wording for
to imitate the BLAS Standard more
closely. The BLAS Standard does say something about overflow and
underflow for vector 2norms. We reviewed this wording and conclude
that it is either a nonbinding quality of implementation (QoI)
recommendation, or too vaguely stated to translate directly into C++
Standard wording.
Thus, we removed our special overflow / underflow wording.
However, the BLAS Standard clearly expresses the intent
that implementations document their underflow and overflow guarantees
for certain functions, like vector 2norms.
The C++ Standard requires documentation of "implementationdefined behavior."
Therefore, we added language to our proposal that makes
"any guarantees regarding overflow and underflow" of those certain functions
"implementationdefined."
Previous versions of this paper asked implementations to compute
vector 2norms "without undue overflow or underflow at intermediate
stages of the computation." "Undue" imitates existing C++ Standard
wording for
. This wording hints at the stricter requirements
in F.9 (normative, but optional) of the C Standard for math library
functions like
, without mandating those requirements. In
particular, paragraph 9 of F.9 says:
Whether or when library functions raise an undeserved "underflow" floatingpoint exception is unspecified. Otherwise, as implied by F.7.6, the <math.h> functions do not raise spurious floatingpoint exceptions (detectable by the user) [including the "overflow" exception discussed in paragraph 6], other than the "inexact" floatingpoint exception.
However, these requirements are for math library functions like
, not for general algorithms that return floatingpoint values.
SG6 did not raise a concern that we should treat
like a
math library function; their concern was that we imitate the BLAS
Standard’s wording.
The BLAS Standard says of several operations, including vector 2norm: "Here are the exceptional routines where we ask for particularly careful implementations to avoid unnecessary over/underflows, that could make the output unnecessarily inaccurate or unreliable" (p. 35).
The BLAS Standard does not define phrases like "unnecessary
over/underflows." The likely intent is to avoid naïve implementations
that simply add up the squares of the vector elements. These would
overflow even if the norm in exact arithmetic is significantly less
than the overflow threshold. The POSIX Standard (IEEE Std
1003.12017) analogously says that
must "take precautions
against overflow during intermediate steps of the computation."
The phrase "precautions against overflow" is too vague for us to translate into a requirement. The authors likely meant to exclude naïve implementations, but not require implementations to know whether a result computed in exact arithmetic would overflow or underflow. The latter is a special case of computing floatingpoint sums exactly, which is costly for vectors of arbitrary length. While it would be a useful feature, it is difficult enough that we do not want to require it, especially since the BLAS Standard itself does not. The implementation of vector 2norms in the Reference BLAS included with LAPACK 3.10.0 partitions the running sum of squares into three different accumulators: one for big values (that might cause the sum to overflow without rescaling), one for small values (that might cause the sum to underflow without rescaling), and one for the remaining "medium" values. (See Anderson 2017.) Earlier implementations merely rescaled by the current maximum absolute value of all the vector entries seen thus far. (See Blue 1978.) Implementations could also just compute the sum of squares in a straightforward loop, then check floatingpoint status flags for underflow or overflow, and recompute if needed.
For all of the functions listed on p. 35 of the BLAS Standard as needing "particularly careful implementations," except vector norm, the BLAS Standard has an "Advice to implementors" section with extra accuracy requirements. The BLAS Standard does have an "Advice to implementors" section for matrix norms (see Section 2.8.7, p. 69), which have similar over and underflow concerns as vector norms. However, the Standard merely states that "[h]ighquality implementations of these routines should be accurate" and should document their accuracy, and gives examples of "accurate implementations" in LAPACK.
The BLAS Standard never defines what "Advice to implementors" means. However, the BLAS Standard shares coauthors and audience with the Message Passing Interface (MPI) Standard, which defines "Advice to implementors" as "primarily commentary to implementors" and permissible to skip (see e.g., MPI 3.0, Section 2.1, p. 9). We thus interpret "Advice to implementors" in the BLAS Standard as a nonbinding quality of implementation (QoI) recommendation.
11.7. Constraining matrix and vector element types and scalars
11.7.1. Introduction
The BLAS only accepts four different types of scalars and matrix and vector elements.
In C++ terms, these correspond to
,
,
, and
.
The algorithms we propose generalize the BLAS
by accepting any matrix, vector, or scalar element types that make sense for each algorithm.
Those may be builtin types, like floatingpoint numbers or integers,
or they may be custom types.
Those custom types might not behave like conventional real or complex numbers.
For example, quaternions have noncommutative multiplication (
might not equal
),
polynomials in one variable over a field lack division,
and some types might not even have subtraction defined.
Nevertheless, many BLAS operations would make sense for all of these types.
"Constraining matrix and vector element types and scalars" means defining how these types must behave in order for our algorithms to make sense. This includes both syntactic and semantic constraints. We have three goals:

to help implementers implement our algorithms correctly;

to give implementers the freedom to make quality of implementation (QoI) enhancements, for both performance and accuracy; and

to help users understand what types they may use with our algorithms.
The whole point of the BLAS was to identify key operations for vendors to optimize. Thus, performance is a major concern. "Accuracy" here refers to either to rounding error or to approximation error (for matrix or vector element types where either makes sense).
11.7.2. Value type constraints do not suffice to describe algorithm behavior
LEWG’s 2020 review of P1673R2 asked us to investigate conceptification of its algorithms.
"Conceptification" here refers to an effort like that of P1813R0
("A Concept Design for the Numeric Algorithms"),
to come up with concepts that could be used to constrain the template parameters
of numeric algorithms like
or
.
(We are not referring to LEWGI’s request for us to consider
generalizing our algorithm’s parameters from
to a hypothetical multidimensional array concept.
We discuss that above, in the "Why use
?" section.)
The numeric algorithms are relevant to P1673 because many of the algorithms proposed in P1673
look like generalizations of
or
.
We intend for our algorithms to be generic on their matrix and vector element types,
so these questions matter a lot to us.
We agree that it is useful to set constraints that make it possible to reason about correctness of algorithms. However, we do not think constraints on value types suffice for this purpose. First, requirements like associativity are too strict to be useful for practical types. Second, what we really want to do is describe the behavior of algorithms, regardless of value types' semantics. "The algorithm may reorder sums" means something different than "addition on the terms in the sum is associative."
11.7.3. Associativity is too strict
P1813R0 requires associative addition for many algorithms, such as
.
However, many practical arithmetic systems that users might like to use
with algorithms like
have nonassociative addition. These include

systems with rounding;

systems with an "infinity": e.g., if 10 is Inf, 3 + 8  7 could be either Inf or 4; and

saturating arithmetic: e.g., if 10 saturates, 3 + 8  7 could be either 3 or 4.
Note that the latter two arithmetic systems have nothing to do with rounding error.
With saturating integer arithmetic, parenthesizing a sum in different ways might give results
that differ by as much as the saturation threshold.
It’s true that many nonassociative arithmetic systems behave
"associatively enough" that users don’t fear parallelizing sums.
However, a concept with an exact property (like "commutative semigroup")
isn’t the right match for "close enough,"
just like
isn’t the right match for describing "nearly the same."
For some number systems, a rounding error bound might be more appropriate,
or guarantees on when underflow or overflow may occur (as in POSIX’s
).
The problem is a mismatch between the requirement we want to express —
and its kin seems more helpful.
If the Standard describes an algorithm in terms of GENERALIZED_SUM,
then that tells the caller what the algorithm might do.
The caller then takes responsibility for interpreting the algorithm’s results.
We think this is important both for adding new algorithms (like those in this proposal)
and for defining behavior of an algorithm with respect to different
arguments.
(For instance,
could imply that the algorithm might change the order of terms in a sum,
while
need not.
Compare to
's
parameter,
that affects the behavior of algorithms like
when used with the resulting userdefined reduction operator.)
11.7.4. Generalizing associativity helps little
Suppose we accept that associativity and related properties are not useful
for describing our proposed algorithms.
Could there be a generalization of associativity that would be useful?
P1813R0’s most general concept is a
.
Mathematically, a magma is a set M with a binary operation ×,
such that if a and b are in M, then a × b is in M.
The operation need not be associative or commutative.
While this seems almost too general to be useful,
there are two reasons why even a magma is too specific for our proposal.

A magma only assumes one set, that is, one type. This does not accurately describe what the algorithms do, and it excludes useful features like mixed precision and types that use expression templates.

A magma is too specific, because algorithms are useful even if the binary operation is not closed.
First, even for simple linear algebra operations that "only" use plus and times, there is no one "set M" over which plus and times operate. There are actually three operations: plus, times, and assignment. Each operation may have completely heterogeneous input(s) and output. The sets (types) that may occur vary from algorithm to algorithm, depending on the input type(s), and the algebraic expression(s) that the algorithm is allowed to use. We might need several different concepts to cover all the expressions that algorithms use, and the concepts would end up being less useful to users than the expressions themselves.
For instance, consider the Level 1 BLAS "AXPY" function.
This computes
elementwise.
What type does the expression
have?
It doesn’t need to have the same type as
;
it just needs to be assignable to
.
The types of
,
, and
could all differ.
As a simple example,
might be
,
might be
, and
might be
.
The types of
and
might be more complicated;
e.g.,
might be a polynomial with
coefficients,
and
a polynomial with
coefficients.
If those polynomials use expression templates,
then slightly different sum expressions involving
and/or
(e.g.,
,
, or
)
might all have different types,
all of which differ from value type of
or
.
All of these types must be assignable and convertible to the output value type.
We could try to describe this with a concept that expresses a sum type. The sum type would include all the types that might show up in the expression. However, we do not think this would improve clarity over just the expression itself. Furthermore, different algorithms may need different expressions, so we would need multiple concepts, one for each expression. Why not just use the expressions to describe what the algorithms can do?
Second, the magma concept is not helpful even if we only had one set M, because our algorithms would still be useful even if binary operations were not closed over that set. For example, consider a hypothetical userdefined rational number type, where plus and times throw if representing the result of the operation would take more than a given fixed amount of memory. Programmers might handle this exception by falling back to different algorithms. Neither plus or times on this type would satisfy the magma requirement, but the algorithms would still be useful for such a type. One could consider the magma requirement satisfied in a purely syntactic sense, because of the return type of plus and times. However, saying that would not accurately express the type’s behavior.
This point returns us to the concerns we expressed earlier about assuming associativity. "Approximately associative" or "usually associative" are not useful concepts without further refinement. The way to refine these concepts usefully is to describe the behavior of a type fully, e.g., the way that IEEE 754 describes the behavior of floatingpoint numbers. However, algorithms rarely depend on all the properties in a specification like IEEE 754. The problem, again, is that we need to describe what algorithms do  e.g., that they can rearrange terms in a sum  not how the types that go into the algorithms behave.
In summary:

Many useful types have nonassociative or even nonclosed arithmetic.

Lack of (e.g.,) associativity is not just a rounding error issue.

It can be useful to let algorithms do things like reparenthesize sums or products, even for types that are not associative.

Permission for an algorithm to reparenthesize sums is not the same as a concept constraining the terms in the sum.

We can and do use existing Standard language, like GENERALIZED_SUM, for expressing permissions that algorithms have.
In the sections that follow, we will describe a different way to constrain the matrix and vector element types and scalars in our algorithms. We will start by categorizing the different quality of implementation (QoI) enhancements that implementers might like to make. These enhancements call for changing algorithms in different ways. We will distinguish textbook from nontextbook ways of changing algorithms, explain that we only permit nontextbook changes for floatingpoint types, then develop constraints on types that permit textbook changes.
11.7.5. Categories of QoI enhancements
An important goal of constraining our algorithms is to give implementers the freedom to make QoI enhancements. We categorize QoI enhancements in three ways:

those that depend entirely on the computer architecture;

those that might have architecturedependent parameters, but could otherwise be written in an architectureindependent way; and

those that diverge from a textbook description of the algorithm, and depend on element types having properties more specific than what that description requires.
An example of Category (1) would be special hardware instructions that perform matrixmatrix multiplications on small, fixedsize blocks. The hardware might only support a few types, such as integers, fixedpoint reals, or floatingpoint types. Implementations might use these instructions for the entire algorithm, if the problem sizes and element types match the instruction’s requirements. They might also use these instructions to solve subproblems. In either case, these instructions might reorder sums or create temporary values.
Examples of Category (2) include blocking to increase cache or translation lookaside buffer (TLB) reuse, or using SIMD instructions (given the Parallelism TS' inclusion of SIMD). Many of these optimizations relate to memory locality or parallelism. For an overview, see (Goto 2008) or Section 2.6 of (Demmel 1997). All such optimizations reorder sums and create temporary values.
Examples of Category (3) include Strassen’s algorithm for matrix multiplication. The textbook formulation of matrix multiplication only uses additions and multiplies, but Strassen’s algorithm also performs subtractions. A common feature of Category (3) enhancements is that their implementation diverges from a "textbook description of the algorithm" in ways beyond just reordering sums. As a "textbook," we recommend either (Strang 2016), or the concise mathematical description of operations in the BLAS Standard. In the next section, we will list properties of textbook descriptions, and explain some ways in which QoI enhancements might fail to adhere to those properties.
11.7.6. Properties of textbook algorithm descriptions
"Textbook descriptions" of the algorithms we propose tend to have the following properties. For each property, we give an example of a "nontextbook" algorithm, and how it assumes something extra about the matrix’s element type.
a. They compute floatingpoint sums straightforwardly (possibly reordered, or with temporary intermediate values), rather than using any of several algorithms that improve accuracy (e.g., compensated summation) or even make the result independent of evaluation order (see Demmel 2013). All such nonstraightforward algorithms depend on properties of floatingpoint arithmetic. We will define below what "possibly reordered, or with temporary intermediate values" means.
b. They use only those arithmetic operations on the matrix and vector element types that the textbook description of the algorithm requires, even if using other kinds of arithmetic operations would improve performance or give an asymptotically faster algorithm.
c. They use exact algorithms (not considering rounding error), rather than approximations (that would not be exact even if computing with real numbers).
d. They do not use parallel algorithms that would give an asymptotically faster parallelization in the theoretical limit of infinitely many available parallel processing units, at the cost of likely unacceptable rounding error in floatingpoint arithmetic.
As an example of (b), the textbook matrix multiplication algorithm only adds or multiplies the matrices' elements. In contrast, Strassen’s algorithm for matrixmatrix multiply subtracts as well as adds and multiplies the matrices' elements. Use of subtraction assumes that arbitrary elements have an additive inverse, but the textbook matrix multiplication algorithm makes sense even for element types that lack additive inverses for all elements. Also, use of subtractions changes floatingpoint rounding behavior in a way that was only fully understood recently (see Demmel 2007).
As an example of (c), the textbook substitution algorithm for solving triangular linear systems is exact. In contrast, one can approximate triangular solve with a stationary iteration. (See, e.g., Section 5 of (Chow 2015). That paper concerns the sparse matrix case; we cite it merely as an example of an approximate algorithm, not as a recommendation for dense triangular solve.) Approximation only makes sense for element types that have enough precision for the approximation to be accurate. If the approximation checks convergence, than the algorithm also requires lessthan comparison of absolute values of differences.
Multiplication by the reciprocal of a number, rather than division by that number, could fit into (b) or (c). As an example of (c), implementations for hardware where floatingpoint division is slow compared with multiplication could use an approximate reciprocal multiplication to implement division.
As an example of (d), the textbook substitution algorithm for solving triangular linear systems has data dependencies that limit its theoretical parallelism. In contrast, one can solve a triangular linear system by building all powers of the matrix in parallel, then solving the linear system as with a Krylov subspace method. This approach is exact for real numbers, but commits too much rounding error to be useful in practice for all but the smallest linear systems. In fact, the algorithm requires that the matrix’s element type have precision exponential in the matrix’s dimension.
Many of these nontextbook algorithms rely on properties of floatingpoint arithmetic. Strassen’s algorithm makes sense for unsigned integer types, but it could lead to unwarranted and unexpected overflow for signed integer types. Thus, we think it best to limit implementers to textbook algorithms, unless all matrix and vector element types are floatingpoint types. We always forbid nontextbook algorithms of type (d). If all matrix and vector element types are floatingpoint types, we permit nontextbook algorithms of Types (a), (b), and (c), under two conditions:

they satisfy the complexity requirements; and

they result in a logarithmically stable algorithm, in the sense of (Demmel 2007).
We believe that Condition (2) is a reasonable interpretation of Section 2.7 of the BLAS Standard.
This says that "no particular computational order is mandated by the function specifications.
In other words, any algorithm that produces results 'close enough'
to the usual algorithms presented in a standard book on matrix computations
is acceptable."
Examples of what the BLAS Standard considers "acceptable"
include Strassen’s algorithm, and implementing matrix multiplication as
,
, or
.
"Textbook algorithms" includes optimizations commonly found in BLAS implementations. This includes any available hardware acceleration, as well as the locality and parallelism optimizations we describe below. Thus, we think restricting generic implementations to textbook algorithms will not overly limit implementers.
The set of floatingpoint types currently has three members:
,
, and
.
If a proposal like P1467R4 ("Extended floatingpoint types and standard names") is accepted,
it will grow to include implementationspecific types,
such as short or extendedprecision floats.
This "futureproofs" our proposal in some sense,
though implementers will need to take care to avoid approximations
if the element type lacks the needed precision.
11.7.7. Reordering sums and creating temporaries
Even textbook descriptions of linear algebra algorithms presume the freedom to reorder sums and create temporary values. Optimizations for memory locality and parallelism depend on this. This freedom imposes requirements on algorithms' matrix and vector element types.
We could get this freedom either by limiting our proposal to the Standard’s current arithmetic types, or by forbidding reordering and temporaries for types other than arithmetic types. However, doing so would unnecessarily prevent straightforward optimizations for small and fast types that act just like arithmetic types. This includes socalled "short floats" such as bfloat16 or binary16, extendedprecision floatingpoint numbers, and fixedpoint reals. Some of these types may be implementation defined, and others may be userspecified. We intend to permit implementers to optimize for these types as well. This motivates us to describe our algorithms' type requirements in a generic way.
11.7.7.1. Special case: Only one element type
We find it easier to think about type requirements by starting with the assumption that all element and scalar types in algorithms are the same. One can then generalize to input element type(s) that might differ from the output element type and/or scalar result type.
Optimizations for memory locality and parallelism
both create temporary values, and change the order of sums.
For example, reorganizing matrix data to reduce stride involves
making a temporary copy of a subset of the matrix,
and accumulating partial sums into the temporary copy.
Thus, both kinds of optimizations impose
a common set of requirements and assumptions on types.
Let
be the output
's
.
Implementations may:

create arbitrarily many objects of type
, valueinitializing them or directinitializing them with any existing object of that type;value_type 
perform sums in any order; or

replace any value with the sum of that value and a valueinitialized
object.value_type
Assumption (1) implies that the output value type is
.
Contrast with [algorithms.parallel.exec]:
"Unless otherwise stated,
implementations may make arbitrary copies of elements of type
,
from sequences where
and
are true."
We omit the trivially constructible and destructible requirements here
and permit any
type.
Linear algebra algorithms assume mathematical properties
that let us impose more specific requirements than general parallel algorithms.
Nevertheless, implementations may want to enable optimizations
that create significant temporary storage
only if the value type is trivially constructible,
trivially destructible, and not too large.
Regarding Assumption (2): The freedom to compute sums in any order is not necessarily a type constraint. Rather, it’s a right that the algorithm claims, regardless of whether the type’s addition is associative or commutative. For example, floatingpoint sums are not associative, yet both parallelization and customary linear algebra optimizations rely on reordering sums. See the above "Value type constraints do not suffice to describe algorithm behavior" section for a more detailed explanation.
Regarding Assumption (3), we do not actually say that valueinitialization produces a twosided additive identity. What matters is what the algorithm’s implementation may do, not whether the type actually behaves in this way.
11.7.7.2. General case: Multiple input element types
An important feature of P1673 is the ability to compute
with mixed matrix or vector element types.
For instance,
implements the operation z = y + alpha*x,
an elementwise scaled vector sum.
The element types of the vectors x, y, and z could be all different,
and could differ from the type of alpha.
11.7.7.2.1. Accumulate into output value type
Generic algorithms would use the output
's
to accumulate partial sums, and for any temporary results.
This is the analog of
's scalar result type
.
Implementations for floatingpoint types might accumulate into higherprecision temporaries,
or use other ways to increase accuracy when accumulating partial sums,
but the output
's
would still control accumulation behavior in general.
11.7.7.2.2. Proxy references or expression templates

Proxy references: The input and/or output
might have an accessor with amdspan
type other thanreference
. For example, the outputelement_type &
might have a value typemdspan
, but itsvalue_type
type might bereference
.atomic_ref < value_type > 
Expression templates: The element types themselves might have arithmetic operations that defer the actual computation until the expression is assigned. These "expression template" types typically hold some kind of reference or pointer to their input arguments.
Neither proxy references nor expression template types are
,
because they behave like references, not like values.
However, we can still require that their underlying value type be
.
For instance, the possiblity of proxy references just means that
we need to use the output
's
when constructing or valueinitializing temporary values,
rather than trying to deduce the value type
from the type of an expression that indexes into the output
.
Expression templates just mean that we need to use the output
's
to construct or valueinitialize temporaries,
rather than trying to deduce the temporaries' type
from the righthand side of the expression.
The
example above shows that some of the algorithms we propose
have multiple terms in a sum on the righthand side of the expression
that defines the algorithm.
If algorithms have permission to rearrange the order of sums,
then they need to be able to break up such expressions into separate terms,
even if some of those expressions are expression templates.
11.7.8. "Textbook" algorithm description in semiring terms
As we explain in the "Value type constraints do not suffice to describe algorithm behavior" section above, we deliberately constrain matrix and vector element types to require associative addition. This means that we do not, for instance, define concepts like "ring" or "group." We cannot even speak of a single set of values that would permit defining things like a "ring" or "group." This is because our algorithms must handle mixed value types, expression templates, and proxy references. However, it may still be helpful to use mathematical language to explain what we mean by "a textbook description of the algorithm."
Most of the algorithms we propose only depend on addition and multiplication. We describe these algorithms in terms of one or more mathematical expressions on elements of a semiring with possibly noncommutative multiplication. The only difference between a semiring and a ring is that a semiring does not require all elements to have an additive inverse. That is, addition is allowed, but not subtraction. Implementers may apply any mathematical transformation to the expressions that would give the same result for any semiring.
11.7.8.1. Why a semiring?
We use a semiring because

we generally want to reorder terms in sums, but we do not want to order terms in products; and

we do not want to assume that subtraction works.
The first is because linear algebra computations are useful for matrix or vector element types with noncommutative multiplication, such as quaternions or matrices. The second is because algebra operations might be useful for signed integers, where a formulation using subtraction risks unexpected undefined behavior.
11.7.8.2. Semirings and testing
It’s important that implementers be able to test our proposed algorithms for custom element types, not just the builtin arithmetic types. We don’t want to require hypothetical "exact real arithmetic" types that take particular expertise to implement. Instead, we propose testing with simple classes built out of unsigned integers. This section is not part of our Standard Library proposal, but we include it to give guidance to implementers and to show that it’s feasible to test our proposal.
11.7.8.3. Commutative multiplication
C++ unsigned integers implement commutative rings.
(Rings always have commutative addition;
a "commutative ring" has commutative multiplication as well.)
We may transform (say)
into a commutative semiring
by wrapping it in a class that does not provide unary or binary
.
Adding a "tag" template parameter to this class
would let implementers build tests for mixed element types.
11.7.8.4. Noncommutative multiplication
The semiring of 2x2 matrices with element type a commutative semiring is itself a semiring, but with noncommutative multiplication. This is a good way to build a noncommutative semiring for testing.
11.7.9. Summary

Constraining the matrix and vector element types and scalar types in our functions gives implementers the freedom to make QoI enhancements without risking correctness.

We think describing algorithms' behavior and implementation freedom is more useful than mathematical concepts like "ring." For example, we permit implementations to reorder sums, but this does not mean that they assume sums are associative. This is why we do not define a hierarchy of number concepts.

We categorize different ways that implementers might like to change algorithms, list categories we exclude and categories we permit, and use the permitted categories to derive constraints on the types of matrix and vector elements and scalar results.

We explain how a semiring is a good way to talk about implementation freedom, even though we do not think it is a good way to constrain types. We then use the semiring description to explain how implementers can test generic algorithms.
11.8. Support for userdefined complex number types
11.8.1. conj
does not have the desired interface
Based on precedence from the BLAS and LAPACK,
generic linear algebra code that works for both complex and real numbers
uses the "conjugate transpose" for both cases,
and intends conjugation to be the identity for real numbers.
P1673 users may thus want to apply
to matrices of arbitrary value types.
The Standard currently offers
as the way to compute the conjugate of a number.
However, there are two issues with
.

For arguments of arithmetic type,
returnsconj
. The resulting value type change is mathematically unexpected, and it can hinder use of an optimized BLAS.complex 
P1673 users have good reasons to define their own complex number types, but users cannot overload
for these types.conj
11.8.2. Why users define their own complex number types

only permitscomplex < R >
=R
,float
, anddouble
.long double 
only hascomplex < R >
alignment, notsizeof ( R )
.2 * sizeof ( R ) 
Some C++ extensions cannot use
, because they require annotations on a type’s member functions.complex < R >
Users define their own complex number types for three reasons.
First, the C++ Standard currently supports
only for
=
,
, and
.
This prevents use of other types, including

signed integers (the resulting complex numbers represent the Gaussian integers);

"short" lowprecision floatingpoint or fixedpoint number types that are critical for performance of machine learning applications;

extendedprecision floatingpoint types like binary128 that can improve the accuracy of floatingpoint sums, reduce parallel nondeterminism, and make sums less dependent on evaluation order; and

custom number types such as "bigfloats" (custom arbitraryprecision floatingpoint types).
Second, the Standard explicitly specifies
that
has the same alignment as
.
That is, it is aligned to
.
Some systems would give better parallel or vectorized performance
if complex numbers were aligned to
.
Some C++ extensions define their own complex number types
partly for this reason.
Software libraries that use these custom complex number types
tempt users to alias between
and these custom types,
which would have the same bit representation except for alignment.
This has led to crashes or worse in software projects
that the authors have worked on.
Third, some C++ extensions cannot use
,
because they require types' member functions to have special annotations,
in order to compile code to make use of accelerator hardware.
These issues have led several software libraries and C++ extensions
to define their own complex number types.
These include CUDA, Kokkos, and Thrust.
The SYCL standard is contemplating adding a custom complex number type.
One of the authors wrote
circa 2014
to make it possible to build and run Trilinos' linear solvers
with complex numbers on NVIDIA graphics processing units (GPUs).
11.8.3. Why users want to "conjugate" matrices of real numbers
It’s possible to describe many linear algebra algorithms
in a way that works for both complex and real numbers,
by treating conjugation as the identity for real numbers.
This makes the "conjugate transpose" just the transpose
for a matrix of real numbers.
Matlab takes this approach, by defining the single quote operator
to take the conjugate transpose if its argument is complex,
and the transpose if its argument is real.
The Fortran BLAS also supports this,
by letting users specify the '
(
) even for real routines like
(doubleprecision general matrixmatrix multiply).
Krylov subspace methods in Trilinos' Anasazi and Belos packages
also follow a Matlablike generic approach.
Even though we think it should be possible to write
"generic" (real or complex) linear algebra code using
,
we still need to distinguish between symmetric and Hermitian matrix algorithms.
This is because symmetric does not mean the same thing as Hermitian
for matrices of complex numbers.
For example, a matrix whose offdiagonal elements are all
is symmetric,
but not Hermitian. Complex symmetric matrices are useful in practice,
for example when modeling damped vibrations (Craven 1969).
11.8.4. Effects of conj
's realtocomplex change
The fact that
for arithmetictype arguments returns
may complicate or prevent implementers
from using an existing optimized BLAS library.
If the user calls
with matrices all of value type
,
use of the (mathematically harmless)
function
would make one matrix have value type
.
Implementations could undo this value type change
for known layouts and accessors,
but would need to revert to generic code otherwise.
For example, suppose that a custom real value type
has arithmetic operators defined to permit
all needed mixedtype expressions with
,
where
times
and
times
both "promote" to
.
Users may then call
with
having value type
,
having value type
,
and
having a value type
.
However,
would not compile,
due to
not being well formed.
11.8.5. LEWG feedback on R8 solution
In R8 of this paper,
we proposed an expositiononly function
_.
For arithmetic types, it would be the identity function.
This would fix Issue (1).
For all other types, it would call
through argumentdependent lookup (ADL),
just like how
calls
.
This would fix Issue (2).
However, it would force users who define custom _real_ number types
to define a trivial
(in their own namespace) for their type.
The alternative would be to make
_ the identity
if it could not find
via ADL lookup.
However, that would cause silently incorrect results
for users who define a custom complex number type,
but forget or misspell
.
When reviewing R8, LEWG expressed a preference for a different solution.

Temporarily change P1673 to permit use of
andconjugated
only for value types that are eitherconjugate_transposed
or arithmetic types. Add a Note that reminds readers to look out for Steps (2) and (3) below.complex 
Write a separate paper which introduces a uservisible customization point, provisionally named
. The paper could use any of various proposed libraryonly customization point mechanisms, such as the customization point objects used by ranges orconjugate
(see P1895R0, with the expectation that LEWG and perhaps also EWG (see e.g., P2547) may express a preference for a different mechanism.tag_invoke 
If LEWG accepts the
customization point, then change P1673 again to useconjugate
(thus replacing R8’sconjugate
_). This would thus reintroduce support for custom complex numbers.conj  if  needed
11.8.6. SG6’s response to LEWG’s R8 feedback
SG6 small group (there was no quorum) reviewed P1673 on 2022/06/09, after LEWG’s R8 review on 2022/05/24. SG6 small group expressed the following:

Being able to write
orconjugated ( A )
for a matrix or vectorconjugate_transposed ( A )
of userdefined types is reasonably integral to the proposal. We generically oppose deferring it based on the hope that we’ll be able to specify it in a nicer way in the future, with some new customization point syntax.A 
A simple, teachable rule: Do ADLONLY lookup (preventing finding
for primitive types) forstd :: conj
(as with ranges); if you find something you use it, and if you don’t, you do nothing (conjugation is the identity). ("Primitives aren’t that special.") Benefit is that custom real types work out of the box.conj 
The alternative: specify that if users choose to use
orconjugated
with a userdefined type, then they MUST supply theconjugate_transposed
ADLfindable thing, else illformed. This is a safety mechanism that may not have been considered previously by LEWG. (Make primitives special, to regain safety. The cost is that custom real types need to have aconj
ADLfindable, if users useconj
orconjugated
.)conjugate_transposed
11.8.7. Current solution
We have adopted SG6 small group’s recommendation, with a slight wording modification to make it obvious that the conjugate of an arithmetic type returns the same type as its input.
We propose an expositiononly function object
_.
For arithmetic types, it would behave as the identity function.
If it can call
through ADLonly (unqualified) lookup,
it does so. Otherwise, it again would behave as the identity function.
This approach has the following advantages.

Most custom number types, noncomplex or complex, will work "out of the box." Custom complex number types would likely already have an ADLfindable
defined, for interface compatibility withconj
.std :: complex 
Conjugation will preserve the type of its argument.

It uses existing C++ idioms and interfaces for complex numbers.

It does not depend on a future customization point syntax or library convention.
11.9. Support for division with noncommutative multiplication
An important feature of this proposal is its support for value types that have noncommutative multiplication. Examples include square matrices with a fixed number of rows and columns, and quaternions and their generalizations. Most of the algorithms in this proposal only add or multiply arbitrary value types, so preserving the order of multiplication arguments is straightforward. The various triangular solve algorithms are an exception, because they need to perform divisions as well.
If multiplication commutes and if a type has division,
then the division x ÷ y is just x times (the multiplicative inverse of y),
assuming that the multiplicative inverse of y exists.
However, if multiplication does not commute,
"x times (the multiplicative inverse of y)"
need not equal
"(the multiplicative inverse of y) times x."
The C++ binary
does not give callers
a way to distinguish between these two cases.
This suggests four ways to express "ordered division."

Explicitly divide one by the quotient:
, orx * ( 1 / y ) ( 1 / y ) * x 
Like (2), but instead of using literal
, get "one" as a1
input to the algorithm:value_type
, orx * ( one / y ) ( one / y ) * x 
as a unary callable input to the algorithm:inverse
, orx * inverse ( y ) inverse ( y ) * x 
as a binary callable input to the algorithm:divide
, ordivide ( x , y ) divide ( y , x )
Both SG6 small group (in its review of this proposal on 2022/06/09)
and the authors prefer Way (4), the
binary callable input.
The binary callable would be optional,
and ordinary binary
would be used as the default.
This would imitate existing Standard Library algorithms like
,
with its optional
that defaults to addition.
For mixedprecision computation,
users would need to avoid
,
as its two parameters and its return type all have the same types.
However, the obvious
would work just fine.
Way (4) also preserves the original rounding behavior
for types with commutative multiplication.
The main disadvantage of the other approaches
is that they would change rounding behavior for floatingpoint types.
They also require two operations —

Way (1) would assume that an overloaded
exists, and that the literaloperator / ( int , value_type )
behaves like a multiplicative identity. In practice, not all custom number types may have defined mixed arithmetic with1
.int 
Way (2) would complicate the interface. Users might make the mistake of passing in literal
(of type1
) orint
(of type1.0
) as the value of one, thus leading to Way (1)'s issues.double 
Way (3) would again complicate the interface. Users would be tempted to use
as the inverse function, thus leading back to Way (1)'s issues.[]( const auto & y ) { return 1 / y ; }
12. Future work
Summary:

Consider generalizing function parameters to take any type that implements a customization point for getting an
viewing its elements. This includesmdspan
(see P1684).mdarray 
Add batched linear algebra overloads.
12.1. Generalize function parameters
Our functions differ from the C++ Standard algorithms, in that they
take a concrete type
with template parameters, rather
than any type that satisfies a concept. We think that the template
parameters of
fully describe the multidimensional
equivalent of a multipass iterator, and that "conceptification" of
multidimensional arrays would unnecessarily delay this proposal.
In a future proposal, we may consider generalizing our function’s template
parameters, to permit any type besides
that implements
the
customization point, as long as the return value of
satisfies the current requirements.
would
return an
that views its argument’s data.
The
class, proposed in P1684, is the
container analog of
. It is a new kind of container,
with the same copy behavior as containers like
.
It will be possible to get an
that views an
.
Previous versions of this proposal included function overloads that
took
directly. The goals were user convenience, and
to avoid any potential overhead of conversion to
,
especially for very small matrices and vectors. In a future revision
of P1684,
will implement a customization point
(final name yet to be decided),
that returns an
viewing the elements of the
.
This would let users use
directly in our functions.
This customization point approach would also simplify using our functions
with other matrix and vector types, such as those proposed by P1385.
Implementations may optionally add direct overloads of our functions
for
or other types.
This would address any concerns about overhead
of converting from
to
.
12.2. Batched linear algebra
We plan to write a separate proposal that will add "batched" versions of linear algebra functions to this proposal. "Batched" linear algebra functions solve many independent problems all at once, in a single function call. For discussion, see Section 6.2 of our background paper P1417R0. Batched interfaces have the following advantages:

They expose more parallelism and vectorization opportunities for many small linear algebra operations.

They are useful for many different fields, including machine learning.

Hardware vendors currently offer both hardware features and optimized software libraries to support batched linear algebra.

There is an ongoing interface standardization effort, in which we participate.
The
data structure makes it easy to represent a batch
of linear algebra objects, and to optimize their data layout.
With few exceptions, the extension of this proposal to support batched operations will not require new functions or interface changes. Only the requirements on functions will change. Output arguments can have an additional rank; if so, then the leftmost extent will refer to the batch dimension. Input arguments may also have an additional rank to match; if they do not, the function will use ("broadcast") the same input argument for all the output arguments in the batch.
13. Data structures and utilities borrowed from other proposals
13.1. mdspan
This proposal depends on P0009 and followon papers
that were voted into the current C++ Standard draft.
P0009’s main class is
, which is a "view"
(in the sense of
) of a multidimensional array. The rank
(number of dimensions) is fixed at compile time. Users may specify
some dimensions at run time and others at compile time; the type of
the
expresses this.
also has two
customization points:

expresses the array’s memory layout: e.g., rowmajor (C++ style), columnmajor (Fortran style), or strided. We use a customLayout
later in this paper to implement a "transpose view" of an existingLayout
.mdspan 
defines the storage handle (Accessor
) stored in thedata_handle_type
, as well as the reference type returned by its access operator. This is an extension point for modifying how access happens, for example by usingmdspan
to get atomic access to every element. We use customatomic_ref
s later in this paper to implement "scaled views" and "conjugated views" of an existingAccessor
.mdspan
13.2. New mdspan
layouts in this proposal
Our proposal uses the layout mapping policy of
in order
to represent different matrix and vector data layouts.
The current C++ Standard draft includes three layouts:
,
, and
. P2642R1 includes
and
.
These two layouts represent exactly the data layout assumed by
the General (GE) matrix type in the BLAS' C and Fortran bindings.
They have has two advantages:

Unlike
andlayout_left
, any "submatrix" (subspan of consecutive rows and consecutive columns) of a matrix withlayout_right
resp.layout_left_padded
layout also haslayout_right_padded
resp.layout_left_padded
layout.layout_right_padded 
Unlike
, the two layouts always have compiletime unit stride in one of the matrix’s two extents.layout_stride
BLAS functions call the possibly nonunit stride of the matrix the
"leading dimension" of that matrix. For example, a BLAS function
argument corresponding to the leading dimension of the matrix
is
called
, for "leading dimension of the matrix A."
This proposal introduces a new layout,
.
This describes the layout used by the BLAS' Symmetric Packed (SP),
Hermitian Packed (HP), and Triangular Packed (TP) "types."
The
class has a "tag" template parameter
that controls its properties; see below.
We do not include layouts for unpacked "types," such as Symmetric
(SY), Hermitian (HE), and Triangular (TR).
Our paper P1674.
explains our reasoning.
In summary: Their actual layout  the arrangement of
matrix elements in memory  is the same as General. The only
differences are constraints on what entries of the matrix algorithms
may access, and assumptions about the matrix’s mathematical
properties. Trying to express those constraints or assumptions as
"layouts" or "accessors" violates the spirit (and sometimes the law)
of
. We address these different matrix types with
different function names.
The packed matrix "types" do describe actual arrangements of matrix
elements in memory that are not the same as in General. This is why
we provide
. Note that
is
the first addition to the existing layouts that is neither always
unique, nor always strided.
Algorithms cannot be written generically if they permit output
arguments with nonunique layouts. Nonunique output arguments require
specialization of the algorithm to the layout, since there’s no way to
know generically at compile time what indices map to the same matrix
element. Thus, we will impose the following rule: Any
output argument to our functions must always have unique layout
(
is true
), unless otherwise specified.
Some of our functions explicitly require outputs with specific nonunique layouts. This includes lowrank updates to symmetric or Hermitian matrices.
14. Acknowledgments
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DENA0003525.
Special thanks to Bob Steagall and Guy Davidson for boldly leading the charge to add linear algebra to the C++ Standard Library, and for many fruitful discussions. Thanks also to Andrew Lumsdaine for his pioneering efforts and history lessons. In addition, I very much appreciate feedback from Davis Herring on constraints wording.
15. References
15.1. References by coathors

G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz, "Communication lower bounds and optimal algorithms for numerical linear algebra,", Acta Numerica, Vol. 23, May 2014, pp. 1155.

C. Trott, D. S. Hollman, D. LebrunGrande, M. Hoemmen, D. Sunderland, H. C. Edwards, B. A. Lelbach, M. Bianco, B. Sander, A. Iliopoulos, J. Michopoulos, and N. Liber, "
," P0009R16, Mar. 2022. The authors anticipate release of R17 in the next mailing.mdspan 
G. Davidson, M. Hoemmen, D. S. Hollman, B. Steagall, and C. Trott, P1891R0, Oct. 2019.

M. Hoemmen, D. S. Hollman, and C. Trott, "Evolving a Standard C++ Linear Algebra Library from the BLAS," P1674R2, May 2022.

M. Hoemmen, J. Badwaik, M. Brucher, A. Iliopoulos, and J. Michopoulos, "Historical lessons for C++ linear algebra library standardization," P1417R0, Jan. 2019.

M. Hoemmen, D. S. Hollman, C. Jabot, I. Muerte, and C. Trott, "Multidimensional subscript operator," P2128R6, Sep. 2021.

C. Trott, D. S. Hollman, M. Hoemmen, and D. Sunderland, "
: An Owning Multidimensional Array Analog ofmdarray
", P1684R1, Mar. 2022.mdspan 
D. S. Hollman, C. Kohlhoff, B. A. Lelbach, J. Hoberock, G. Brown, and M. Dominiak, "A General Property Customization Mechanism," P1393R0, Jan. 2019.
15.2. Other references

E. Anderson, "Algorithm 978: Safe Scaling in the Level 1 BLAS," ACM Transactions on Mathematical Software, Vol. 44, pp. 128, 2017.

"Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard," International Journal of High Performance Applications and Supercomputing, Vol. 16, No. 1, Spring 2002.

L. S. Blackford, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Remington, and R. C. Whaley, "An updated set of basic linear algebra subprograms (BLAS)," ACM Transactions on Mathematical Software, Vol. 28, No. 2, Jun. 2002, pp. 135151.

J. L. Blue, "A Portable Fortran Program to Find the Euclidean Norm of a Vector," ACM Transactions on Mathematical Software, Vol. 4, pp. 1523, 1978.

B. D. Craven, "Complex symmetric matrices", Journal of the Australian Mathematical Society, Vol. 10, No. 34, Nov. 1969, pp. 341354.

E. Chow and A. Patel, "FineGrained Parallel Incomplete LU Factorization", SIAM J. Sci. Comput., Vol. 37, No. 2, C169C193, 2015.

G. Davidson and B. Steagall, "A proposal to add linear algebra support to the C++ standard library," P1385R6, Mar. 2020.

B. Dawes, H. Hinnant, B. Stroustrup, D. Vandevoorde, and M. Wong, "Direction for ISO C++," P0939R4, Oct. 2019.

J. Demmel, "Applied Numerical Linear Algebra," Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997, ISBN 0898713897.

J. Demmel, I. Dumitriu, and O. Holtz, "Fast linear algebra is stable," Numerische Mathematik 108 (5991), 2007.

J. Demmel and H. D. Nguyen, "Fast Reproducible FloatingPoint Summation," 2013 IEEE 21st Symposium on Computer Arithmetic, 2013, pp. 163172, doi: 10.1109/ARITH.2013.9.

J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff, "A set of level 3 basic linear algebra subprograms," ACM Transactions on Mathematical Software, Vol. 16, No. 1, pp. 117, Mar. 1990.

J. Dongarra, R. Pozo, and D. Walker, "LAPACK++: A Design Overview of ObjectOriented Extensions for High Performance Linear Algebra," in Proceedings of Supercomputing '93, IEEE Computer Society Press, 1993, pp. 162171.

M. Gates, P. Luszczek, A. Abdelfattah, J. Kurzak, J. Dongarra, K. Arturov, C. Cecka, and C. Freitag, "C++ API for BLAS and LAPACK," SLATE Working Notes, Innovative Computing Laboratory, University of Tennessee Knoxville, Feb. 2018.

K. Goto and R. A. van de Geijn, "Anatomy of highperformance matrix multiplication,", ACM Transactions on Mathematical Software (TOMS), Vol. 34, No. 3, May 2008.

J. Hoberock, "Integrating Executors with Parallel Algorithms," P1019R2, Jan. 2019.

N. A. Josuttis, "The C++ Standard Library: A Tutorial and Reference," AddisonWesley, 1999.

M. Kretz, "DataParallel Vector Types & Operations," P0214R9, Mar. 2018.

A. J. Perlis, "Epigrams on programming," SIGPLAN Notices, Vol. 17, No. 9, pp. 713, 1982.

G. Strang, "Introduction to Linear Algebra," 5th Edition, Wellesley  Cambridge Press, 2016, ISBN 9780980232776, x+574 pages.

D. Vandevoorde and N. A. Josuttis, "C++ Templates: The Complete Guide," AddisonWesley Professional, 2003.
16. Wording
Text in blockquotes is not proposed wording, but rather instructions for generating proposed wording. The � character is used to denote a placeholder section number which the editor shall determine. First, apply all wording from P2642R1. (This proposal is a "rebase" atop the changes proposed by P2642R1.) At the end of Table � ("Numerics library summary") in [numerics.general], add the following: [linalg], Linear algebra,
. At the end of [numerics], add all the material that follows.
< linalg >
16.1. Header < linalg >
synopsis [linalg.syn]
namespace std :: linalg { // [linalg.tags.order], storage order tags struct column_major_t ; inline constexpr column_major_t column_major ; struct row_major_t ; inline constexpr row_major_t row_major ; // [linalg.tags.triangle], triangle tags struct upper_triangle_t ; inline constexpr upper_triangle_t upper_triangle ; struct lower_triangle_t ; inline constexpr lower_triangle_t lower_triangle ; // [linalg.tags.diagonal], diagonal tags struct implicit_unit_diagonal_t ; inline constexpr implicit_unit_diagonal_t implicit_unit_diagonal ; struct explicit_diagonal_t ; inline constexpr explicit_diagonal_t explicit_diagonal ; // [linalg.layouts.packed], class template layout_blas_packed template < class Triangle , class StorageOrder > class layout_blas_packed ; // [linalg.scaled.conj], expositiononly function object <it>conjifneeded</it> /* seebelow */ < it > conj  if  needed </ it > ; // [linalg.scaled.base], expositiononly function and class templates class proxy_reference_base ; // <it>exposition only</it> template < class Reference , class Value , class Derived > class proxy_reference ; // <it>exposition only</it> // [linalg.scaled.accessor_scaled], class template accessor_scaled template < class ScalingFactor , class Accessor > class accessor_scaled ; // [linalg.scaled.scaled], scaled inplace transformation template < class ScalingFactor , class ElementType , class Extents , class Layout , class Accessor > /* seebelow */ constexpr scaled ( ScalingFactor scaling_factor , const mdspan < ElementType , Extents , Layout , Accessor >& x ); // [linalg.conj.accessor_conjugate], class template accessor_conjugate template < class Accessor > class accessor_conjugate ; // [linalg.conj.conjugated], conjugated inplace transformation template < class ElementType , class Extents , class Layout , class Accessor > constexpr auto conjugated ( mdspan < ElementType , Extents , Layout , Accessor > a ); // [linalg.transp.layout_transpose], class template layout_transpose template < class Layout > class layout_transpose ; // [linalg.transp.transposed], transposed inplace transformation template < class ElementType , class Extents , class Layout , class Accessor > constexpr auto transposed ( mdspan < ElementType , Extents , Layout , Accessor > a ); // [linalg.conj_transp], // conjugated transposed inplace transformation template < class ElementType , class Extents , class Layout , class Accessor > constexpr auto conjugate_transposed ( mdspan < ElementType , Extents , Layout , Accessor > a ); // [linalg.algs.blas1.givens.lartg], compute Givens rotation template < class Real > void givens_rotation_setup ( const Real a , const Real b , Real & c , Real & s , Real & r ); template < class Real > void givens_rotation_setup ( const complex < Real >& a , const complex < Real >& b , Real & c , complex < Real >& s , complex < Real >& r ); // [linalg.algs.blas1.givens.rot], apply computed Givens rotation template < class inout_vector_1_t , class inout_vector_2_t , class Real > void givens_rotation_apply ( inout_vector_1_t x , inout_vector_2_t y , const Real c , const Real s ); template < class ExecutionPolicy , class inout_vector_1_t , class inout_vector_2_t , class Real > void givens_rotation_apply ( ExecutionPolicy && exec , inout_vector_1_t x , inout_vector_2_t y , const Real c , const Real s ); template < class inout_vector_1_t , class inout_vector_2_t , class Real > void givens_rotation_apply ( inout_vector_1_t x , inout_vector_2_t y , const Real c , const complex < Real > s ); template < class ExecutionPolicy , class inout_vector_1_t , class inout_vector_2_t , class Real > void givens_rotation_apply ( ExecutionPolicy && exec , inout_vector_1_t x , inout_vector_2_t y , const Real c , const complex < Real > s ); } // [linalg.algs.blas1.swap], swap elements template < class inout_object_1_t , class inout_object_2_t > void swap_elements ( inout_object_1_t x , inout_object_2_t y ); template < class ExecutionPolicy , class inout_object_1_t , class inout_object_2_t > void swap_elements ( ExecutionPolicy && exec , inout_object_1_t x , inout_object_2_t y ); // [linalg.algs.blas1.scal], multiply elements by scalar template < class Scalar , class inout_object_t > void scale ( const Scalar alpha , inout_object_t obj ); template < class ExecutionPolicy , class Scalar , class inout_object_t > void scale ( ExecutionPolicy && exec , const Scalar alpha , inout_object_t obj ); // [linalg.algs.blas1.copy], copy elements template < class in_object_t , class out_object_t > void copy ( in_object_t x , out_object_t y ); template < class ExecutionPolicy , class in_object_t , class out_object_t > void copy ( ExecutionPolicy && exec , in_object_t x , out_object_t y ); // [linalg.algs.blas1.add], add elementwise template < class in_object_1_t , class in_object_2_t , class out_object_t > void add ( in_object_1_t x , in_object_2_t y , out_object_t z ); template < class ExecutionPolicy , class in_object_1_t , class in_object_2_t , class out_object_t > void add ( ExecutionPolicy && exec , in_object_1_t x , in_object_2_t y , out_object_t z ); // [linalg.algs.blas1.dot], // dot product of two vectors // [linalg.algs.blas1.dot.dotu], // nonconjugated dot product of two vectors template < class in_vector_1_t , class in_vector_2_t , class T > T dot ( in_vector_1_t v1 , in_vector_2_t v2 , T init ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class T > T dot ( ExecutionPolicy && exec , in_vector_1_t v1 , in_vector_2_t v2 , T init ); template < class in_vector_1_t , class in_vector_2_t > auto dot ( in_vector_1_t v1 , in_vector_2_t v2 ) > /* seebelow */ ; template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t > auto dot ( ExecutionPolicy && exec , in_vector_1_t v1 , in_vector_2_t v2 ) > /* seebelow */ ; // [linalg.algs.blas1.dot.dotc], // conjugated dot product of two vectors template < class in_vector_1_t , class in_vector_2_t , class T > T dotc ( in_vector_1_t v1 , in_vector_2_t v2 , T init ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class T > T dotc ( ExecutionPolicy && exec , in_vector_1_t v1 , in_vector_2_t v2 , T init ); template < class in_vector_1_t , class in_vector_2_t > auto dotc ( in_vector_1_t v1 , in_vector_2_t v2 ) > /* seebelow */ ; template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t > auto dotc ( ExecutionPolicy && exec , in_vector_1_t v1 , in_vector_2_t v2 ) > /* seebelow */ ; // [linalg.algs.blas1.ssq], // Scaled sum of squares of a vector’s elements template < class T > struct sum_of_squares_result { T scaling_factor ; T scaled_sum_of_squares ; }; template < class in_vector_t , class T > sum_of_squares_result < T > vector_sum_of_squares ( in_vector_t v , sum_of_squares_result < T > init ); sum_of_squares_result < T > vector_sum_of_squares ( ExecutionPolicy && exec , in_vector_t v , sum_of_squares_result < T > init ); // [linalg.algs.blas1.nrm2], // Euclidean norm of a vector template < class in_vector_t , class T > T vector_norm2 ( in_vector_t v , T init ); template < class ExecutionPolicy , class in_vector_t , class T > T vector_norm2 ( ExecutionPolicy && exec , in_vector_t v , T init ); template < class in_vector_t > auto vector_norm2 ( in_vector_t v ) > /* seebelow */ ; template < class ExecutionPolicy , class in_vector_t > auto vector_norm2 ( ExecutionPolicy && exec , in_vector_t v ) > /* seebelow */ ; // [linalg.algs.blas1.asum], // sum of absolute values of vector elements template < class in_vector_t , class T > T vector_abs_sum ( in_vector_t v , T init ); template < class ExecutionPolicy , class in_vector_t , class T > T vector_abs_sum ( ExecutionPolicy && exec , in_vector_t v , T init ); template < class in_vector_t > auto vector_abs_sum ( in_vector_t v ) > /* seebelow */ ; template < class ExecutionPolicy , class in_vector_t > auto vector_abs_sum ( ExecutionPolicy && exec , in_vector_t v ) > /* seebelow */ ; // [linalg.algs.blas1.iamax], // index of maximum absolute value of vector elements template < class in_vector_t > typename in_vector_t :: extents_type idx_abs_max ( in_vector_t v ); template < class ExecutionPolicy , class in_vector_t > typename in_vector_t :: extents_type idx_abs_max ( ExecutionPolicy && exec , in_vector_t v ); // [linalg.algs.blas1.matfrobnorm], // Frobenius norm of a matrix template < class in_matrix_t , class T > T matrix_frob_norm ( in_matrix_t A , T init ); template < class ExecutionPolicy , class in_matrix_t , class T > T matrix_frob_norm ( ExecutionPolicy && exec , in_matrix_t A , T init ); template < class in_matrix_t > auto matrix_frob_norm ( in_matrix_t A ) > /* seebelow */ ; template < class ExecutionPolicy , class in_matrix_t > auto matrix_frob_norm ( ExecutionPolicy && exec , in_matrix_t A ) > /* seebelow */ ; // [linalg.algs.blas1.matonenorm], // One norm of a matrix template < class in_matrix_t , class T > T matrix_one_norm ( in_matrix_t A , T init ); template < class ExecutionPolicy , class in_matrix_t , class T > T matrix_one_norm ( ExecutionPolicy && exec , in_matrix_t A , T init ); template < class in_matrix_t > auto matrix_one_norm ( in_matrix_t A ) > /* seebelow */ ; template < class ExecutionPolicy , class in_matrix_t > auto matrix_one_norm ( ExecutionPolicy && exec , in_matrix_t A ) > /* seebelow */ ; // [linalg.algs.blas1.matinfnorm], // Infinity norm of a matrix template < class in_matrix_t , class T > T matrix_inf_norm ( in_matrix_t A , T init ); template < class ExecutionPolicy , class in_matrix_t , class T > T matrix_inf_norm ( ExecutionPolicy && exec , in_matrix_t A , T init ); template < class in_matrix_t > auto matrix_inf_norm ( in_matrix_t A ) > /* seebelow */ ; template < class ExecutionPolicy , class in_matrix_t > auto matrix_inf_norm ( ExecutionPolicy && exec , in_matrix_t A ) > /* seebelow */ ; // [linalg.algs.blas2.gemv], // general matrixvector product template < class in_vector_t , class in_matrix_t , class out_vector_t > void matrix_vector_product ( in_matrix_t A , in_vector_t x , out_vector_t y ); template < class ExecutionPolicy , class in_vector_t , class in_matrix_t , class out_vector_t > void matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , in_vector_t x , out_vector_t y ); template < class in_vector_1_t , class in_matrix_t , class in_vector_2_t , class out_vector_t > void matrix_vector_product ( in_matrix_t A , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); template < class ExecutionPolicy , class in_vector_1_t , class in_matrix_t , class in_vector_2_t , class out_vector_t > void matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); // [linalg.algs.blas2.symv], // symmetric matrixvector product template < class in_matrix_t , class Triangle , class in_vector_t , class out_vector_t > void symmetric_matrix_vector_product ( in_matrix_t A , Triangle t , in_vector_t x , out_vector_t y ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class in_vector_t , class out_vector_t > void symmetric_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , in_vector_t x , out_vector_t y ); template < class in_matrix_t , class Triangle , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void symmetric_matrix_vector_product ( in_matrix_t A , Triangle t , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void symmetric_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); // [linalg.algs.blas2.hemv], // Hermitian matrixvector product template < class in_matrix_t , class Triangle , class in_vector_t , class out_vector_t > void hermitian_matrix_vector_product ( in_matrix_t A , Triangle t , in_vector_t x , out_vector_t y ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class in_vector_t , class out_vector_t > void hermitian_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , in_vector_t x , out_vector_t y ); template < class in_matrix_t , class Triangle , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void hermitian_matrix_vector_product ( in_matrix_t A , Triangle t , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void hermitian_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); // [linalg.algs.blas2.trmv], // Triangular matrixvector product // [linalg.algs.blas2.trmv.ov], // Overwriting triangular matrixvector product template < class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t > void triangular_matrix_vector_product ( in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t x , out_vector_t y ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t > void triangular_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t x , out_vector_t y ); // [linalg.algs.blas2.trmv.inplace], // Inplace triangular matrixvector product template < class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t > void triangular_matrix_vector_product ( in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t y ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t > void triangular_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t y ); // [linalg.algs.blas2.trmv.up], // Updating triangular matrixvector product template < class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void triangular_matrix_vector_product ( in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void triangular_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); // [linalg.algs.blas2.trsv], // Solve a triangular linear system // [linalg.algs.blas2.trsv.notinplace], // Solve a triangular linear system, not in place template < class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t , class BinaryDivideOp > void triangular_matrix_vector_solve ( in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t b , out_vector_t x , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t , class BinaryDivideOp > void triangular_matrix_vector_solve ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t b , out_vector_t x , BinaryDivideOp divide ); template < class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t > void triangular_matrix_vector_solve ( in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t b , out_vector_t x ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t > void triangular_matrix_vector_solve ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t b , out_vector_t x ); // [linalg.algs.blas2.trsv.inplace], // Solve a triangular linear system, in place template < class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t , class BinaryDivideOp > void triangular_matrix_vector_solve ( in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t b , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t , class BinaryDivideOp > void triangular_matrix_vector_solve ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t b , BinaryDivideOp divide ); template < class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t > void triangular_matrix_vector_solve ( in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t b ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t > void triangular_matrix_vector_solve ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t b ); // [linalg.algs.blas2.rank1.geru], // nonconjugated rank1 matrix update template < class in_vector_1_t , class in_vector_2_t , class inout_matrix_t > void matrix_rank_1_update ( in_vector_1_t x , in_vector_2_t y , inout_matrix_t A ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class inout_matrix_t > void matrix_rank_1_update ( ExecutionPolicy && exec , in_vector_1_t x , in_vector_2_t y , inout_matrix_t A ); // [linalg.algs.blas2.rank1.gerc], // conjugated rank1 matrix update template < class in_vector_1_t , class in_vector_2_t , class inout_matrix_t > void matrix_rank_1_update_c ( in_vector_1_t x , in_vector_2_t y , inout_matrix_t A ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class inout_matrix_t > void matrix_rank_1_update_c ( ExecutionPolicy && exec , in_vector_1_t x , in_vector_2_t y , inout_matrix_t A ); // [linalg.algs.blas2.rank1.syr], // symmetric rank1 matrix update template < class in_vector_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_1_update ( in_vector_t x , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class in_vector_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_1_update ( ExecutionPolicy && exec , in_vector_t x , inout_matrix_t A , Triangle t ); template < class T , class in_vector_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_1_update ( T alpha , in_vector_t x , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class T , class in_vector_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_1_update ( ExecutionPolicy && exec , T alpha , in_vector_t x , inout_matrix_t A , Triangle t ); // [linalg.algs.blas2.rank1.her], // Hermitian rank1 matrix update template < class in_vector_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_1_update ( in_vector_t x , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class in_vector_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_1_update ( ExecutionPolicy && exec , in_vector_t x , inout_matrix_t A , Triangle t ); template < class T , class in_vector_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_1_update ( T alpha , in_vector_t x , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class T , class in_vector_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_1_update ( ExecutionPolicy && exec , T alpha , in_vector_t x , inout_matrix_t A , Triangle t ); // [linalg.algs.blas2.rank2.syr2], // symmetric rank2 matrix update template < class in_vector_1_t , class in_vector_2_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_2_update ( in_vector_1_t x , in_vector_2_t y , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_2_update ( ExecutionPolicy && exec , in_vector_1_t x , in_vector_2_t y , inout_matrix_t A , Triangle t ); // [linalg.algs.blas2.rank2.her2], // Hermitian rank2 matrix update template < class in_vector_1_t , class in_vector_2_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_2_update ( in_vector_1_t x , in_vector_2_t y , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_2_update ( ExecutionPolicy && exec , in_vector_1_t x , in_vector_2_t y , inout_matrix_t A , Triangle t ); // [linalg.algs.blas3.gemm], // general matrixmatrix product template < class in_matrix_1_t , class in_matrix_2_t , class out_matrix_t > void matrix_product ( in_matrix_1_t A , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class in_matrix_2_t , class out_matrix_t > void matrix_product ( ExecutionPolicy && exec , in_matrix_1_t A , in_matrix_2_t B , out_matrix_t C ); template < class in_matrix_1_t , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void matrix_product ( in_matrix_1_t A , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void matrix_product ( ExecutionPolicy && exec , in_matrix_1_t A , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); // [linalg.algs.blas3.symm], // symmetric matrixmatrix product // [linalg.algs.blas3.symm.ov.left], // overwriting symmetric matrixmatrix left product template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void symmetric_matrix_left_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void symmetric_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); // [linalg.algs.blas3.symm.ov.right], // overwriting symmetric matrixmatrix right product template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void symmetric_matrix_right_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void symmetric_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); // [linalg.algs.blas3.symm.up.left], // updating symmetric matrixmatrix left product template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void symmetric_matrix_left_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void symmetric_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); // [linalg.algs.blas3.symm.up.right], // updating symmetric matrixmatrix right product template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void symmetric_matrix_right_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void symmetric_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); // [linalg.algs.blas3.hemm], // Hermitian matrixmatrix product // [linalg.algs.blas3.hemm.ov.left], // overwriting Hermitian matrixmatrix left product template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void hermitian_matrix_left_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void hermitian_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); // [linalg.algs.blas3.hemm.ov.right], // overwriting Hermitian matrixmatrix right product template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void hermitian_matrix_right_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void hermitian_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); // [linalg.algs.blas3.hemm.up.left], // updating Hermitian matrixmatrix left product template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void hermitian_matrix_left_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void hermitian_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); // [linalg.algs.blas3.hemm.up.right], // updating Hermitian matrixmatrix right product template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void hermitian_matrix_right_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void hermitian_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); // [linalg.algs.blas3.trmm], // triangular matrixmatrix product // [linalg.algs.blas3.trmm.ov.left], // overwriting triangular matrixmatrix left product template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_left_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t C ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_left_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t C ); // [linalg.algs.blas3.trmm.ov.right], // overwriting triangular matrixmatrix right product template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_right_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t C ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_right_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t C ); // [linalg.algs.blas3.trmm.up.left], // updating triangular matrixmatrix left product template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void triangular_matrix_left_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void triangular_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); // [linalg.algs.blas3.trmm.up.right], // updating triangular matrixmatrix right product template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void triangular_matrix_right_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void triangular_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); // [linalg.alg.blas3.rankk.syrk], // rankk symmetric matrix update template < class T , class in_matrix_1_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_k_update ( T alpha , in_matrix_1_t A , inout_matrix_t C , Triangle t ); template < class T , class ExecutionPolicy , class in_matrix_1_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_k_update ( ExecutionPolicy && exec , T alpha , in_matrix_1_t A , inout_matrix_t C , Triangle t ); // [linalg.alg.blas3.rankk.herk], // rankk Hermitian matrix update template < class T , class in_matrix_1_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_k_update ( T alpha , in_matrix_1_t A , inout_matrix_t C , Triangle t ); template < class ExecutionPolicy , class T , class in_matrix_1_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_k_update ( ExecutionPolicy && exec , T alpha , in_matrix_1_t A , inout_matrix_t C , Triangle t ); // [linalg.alg.blas3.rank2k.syr2k], // rank2k symmetric matrix update template < class in_matrix_1_t , class in_matrix_2_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_2k_update ( in_matrix_1_t A , in_matrix_2_t B , inout_matrix_t C , Triangle t ); template < class ExecutionPolicy , class in_matrix_1_t , class in_matrix_2_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_2k_update ( ExecutionPolicy && exec , in_matrix_1_t A , in_matrix_2_t B , inout_matrix_t C , Triangle t ); // [linalg.alg.blas3.rank2k.her2k], // rank2k Hermitian matrix update template < class in_matrix_1_t , class in_matrix_2_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_2k_update ( in_matrix_1_t A , in_matrix_2_t B , inout_matrix_t C , Triangle t ); template < class ExecutionPolicy , class in_matrix_1_t , class in_matrix_2_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_2k_update ( ExecutionPolicy && exec , in_matrix_1_t A , in_matrix_2_t B , inout_matrix_t C , Triangle t ); // [linalg.alg.blas3.trsm], // solve multiple triangular linear systems // [linalg.alg.blas3.trsm.left], // solve multiple triangular linear systems // with triangular matrix on the left template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_left_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_left_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X , BinaryDivideOp divide ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_left_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_left_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B , BinaryDivideOp divide ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_matrix_left_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_matrix_left_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_matrix_left_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_matrix_left_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B ); // [linalg.alg.blas3.trsm.right], // solve multiple triangular linear systems // with triangular matrix on the right template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_right_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_right_solve ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , in_matrix_t B , out_matrix_t X , BinaryDivideOp divide ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_right_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_right_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B , BinaryDivideOp divide ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_matrix_right_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_matrix_right_solve ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , in_matrix_t B , out_matrix_t X ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_matrix_right_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_matrix_right_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B ); }
16.2. Requirements [linalg.reqs]
16.2.1. Value and reference requirements [linalg.reqs.val]
This clause lists the minimum requirements for all algorithms and classes in [linalg], and for the following types:

for any input or output
parameter(s) of any algorithm or method in [linalg], the parameter(s)'mdspan
andvalue_type
type aliases;reference 
the
template parameter (if any) of any algorithm or class in [linalg];Scalar 
the
template parameter of any algorithm in [linalg] with aT
parameter; andT init 
the template parameter of
.sum_of_squares_result
In this clause, we refer to these types as linear algebra value types. These type requirements are parameterized by the algorithm or method using the type. Each algorithm or method using the type has one or more associated mathematical expressions that defines the algorithm’s or method’s behavior. For each algorithm or method, its mathematical expression(s) are either explicitly stated as such, or are implicitly stated in the algorithm’s or method’s description. The requirements below will refer to those mathematical expression(s).
[Note: This notion of parameterizing requirements on a mathematical expression generalizes GENERALIZED_SUM. end note]
All of the following requirements presume that the algorithm’s asymptotic complexity requirements, if any, are satisfied.

Any linear algebra value type meets the requirements of
.semiregular 
The algorithm or method may perform readonly access on any input or output
arbitrarily many times.mdspan 
The algorithm or method may make arbitrary many objects of any linear algebra value type, valueinitializing or directinitializing them with any existing object of that type.

The algorithm or method may assign arbitrarily many times to any reference resulting from a valid output
access.mdspan 
If the algorithm’s or method’s mathematical expression uses division and possibly also addition, subtraction, and multiplication, then the algorithm or method evaluates the mathematical expression using a sequence of evaluations of
,*
,*=
,/
,/=
,+
, unary+=
, binary
,
, and=
operators that would produce the correct result when operating on elements of a field with noncommutative multiplication. (We interpret=
asa / b
times the multiplicative inverse ofa
.) Any addend, any subtrahend, any partial sum of addends in any order (treating any difference as a sum with the second term negated), any factor, any partial product of factors respecting their order in the mathematical expression, any numerator, any denominator, and any assignment in the mathematical expression shall be well formed.b 
Otherwise, if the algorithm’s or method’s mathematical expression uses subtraction and possibly also addition and multiplication, then the algorithm or method evaluates the mathematical expression using a sequence of evaluations of
,*
,*=
,+
, unary+=
, binary
,
, and=
operators that would produce the correct result when operating on elements of a ring with noncommutative multiplication. Any addend, any subtrahend, any partial sum of addends in any order (treating any difference as a sum with the second term negated), any factor, any partial product of factors respecting their order in the mathematical expression, and any assignment in the mathematical expression shall be well formed.= 
Otherwise, if the algorithm’s or method’s mathematical expression uses multiplication and possibly also addition, then the algorithm or method evaluates the mathematical expression using a sequence of evaluations of
,*
,*=
,+
, and+=
operators that would produce the correct result when operating on elements of a semiring with noncommutative multiplication. Any addend, any partial sum of addends in any order, any factor, any partial product of factors respecting their order in the mathematical expression, and any assignment in the mathematical expression shall be well formed.= 
Otherwise, if the algorithm’s or method’s mathematical expression uses addition, then the algorithm or method evaluates the mathematical expression using a sequence of evaluations of
,+
, and+=
operators that would produce the correct result when operating on elements of a commutative semigroup. Any addend, any partial sum of addends in any order, and any assignment in the mathematical expression shall be well formed.= 
If the algorithm’s or method’s mathematical expression includes any of the following:
a.
conj  if  needed
for some expression( z )
,z b.
for some expressionabs ( x )
, orx c.
for some expressionsqrt ( x )
,x but otherwise conforms to case (5), (6), (7), or (8), then the relevant case (5), (6), (7), or (8) applies. In addition, any of the above subexpressions that appear in the algorithm’s or method’s mathematical expression shall be well formed.

If the algorithm or method has an output
, then all addends (or subtrahends, if applicable) in the algorithm’s or method’s mathematical expression are assignable and convertible to the outputmdspan
'smdspan
.value_type 
The algorithm or method may reorder addends and partial sums in its mathematical expression arbitrarily. [Note: Factors in each product are not reordered; multiplication is not necessarily commutative. end note]

The algorithm or method may replace any value with the sum of that value and a valueinitialized object of any input or output
'smdspan
.value_type 
If the algorithm or method has a
parameter, then the algorithm or method may replace any value with the sum of that value and a valueinitialized object of typeT init
.T
16.2.2. Requirements for algorithms and methods on floatingpoint values [linalg.reqs.flpt]
For all algorithms and classes in [linalg], suppose that

all input and output
havemdspan
a floatingpoint type,value_type 
any
template argument has a floatingpoint type, andScalar 
any argument corresponding to the
parameter has a floatingpoint type.T init
Then, algorithms and classes' methods may do the following:

compute floatingpoint sums in any way that improves their accuracy for arbitrary input;

perform arithmetic operations other than those in the algorithm’s or method’s mathematical expression, in order to improve performance or accuracy; and

use approximations (that might not be exact even if computing with real numbers), instead of computations that would be exact if it were possible to compute without rounding error;
as long as

the algorithm or method satisfies the complexity requirements; and

the algorithm or method is logarithmically stable, in the sense of (Demmel 2007).
[Note: Strassen’s algorithm for matrixmatrix multiply is an example of a logarithmically stable algorithm. end note]
16.3. Tag classes [linalg.tags]
16.3.1. Storage order tags [linalg.tags.order]
struct column_major_t { }; inline constexpr column_major_t column_major = { }; struct row_major_t { }; inline constexpr row_major_t row_major = { };
indicates a columnmajor order, and
indicates a rowmajor order. The interpretation of each depends on
the specific layout that uses the tag. See
below.
16.3.2. Triangle tags [linalg.tags.triangle]
Some linear algebra algorithms distinguish between the "upper triangle," "lower triangle," and "diagonal" of a matrix.

The diagonal is the set of all elements of
accessed byA
for 0 ≤A [ i , i ]
< min(i
,A . extent ( 0 )
).A . extent ( 1 ) 
The upper triangle of a matrix
is the set of all elements ofA
accessed byA
withA [ i , j ]
≤i
. It includes the diagonal.j 
The lower triangle of
is the set of all elements ofA
accessed byA
withA [ i , j ]
≥i
. It includes the diagonal.j
struct upper_triangle_t { }; inline constexpr upper_triangle_t upper_triangle = { }; struct lower_triangle_t { }; inline constexpr lower_triangle_t lower_triangle = { };
These tag classes specify whether algorithms and other users of a
matrix (represented as an
) should
access the upper triangle (
) or lower triangle
(
) of the matrix. This is also subject to the
restrictions of
if that tag is also
applied; see below.
[Note: The
and
functions in this section
return a view of an input
with its rightmost two indices effectively reversed.
Regardless, when we refer to the upper triangle of
,
we still mean the elements of
accessed by
with
≤
.
For all algorithms in this section that take a triangle tag parameter,
the triangle tag refers to the actual input matrix, after any transformations like
.
This differs from the BLAS'
argument,
which refers to the pretransformed "original" matrix.
end note]
16.3.3. Diagonal tags [linalg.tags.diagonal]
struct implicit_unit_diagonal_t { }; inline constexpr implicit_unit_diagonal_t implicit_unit_diagonal = { }; struct explicit_diagonal_t { }; inline constexpr explicit_diagonal_t explicit_diagonal = { };
These tag classes specify whether algorithms should access the matrix’s diagonal entries, and if not, then how algorithms should interpret the matrix’s implicitly represented diagonal values.
The
tag indicates two things:

the algorithm will never access the
element of the matrix for anyi , i
; andi 
the algorithm will interpret the matrix as if it has a "unit diagonal," a diagonal each of whose elements behaves as a twosided multiplicative identity (even if the matrix’s value type does not have a twosided multiplicative identity).
The tag
indicates that algorithms
may access the matrix’s diagonal entries directly.
16.4. Layouts for packed matrix types [linalg.layouts]
16.4.1. layout_blas_packed
[linalg.layouts.packed]
is an
layout mapping policy that
represents a square matrix that stores only the entries in one
triangle, in a packed contiguous format. Its
template
parameter determines whether an
with this layout stores
the upper or lower triangle of the matrix. Its
template parameter determines whether the layout packs the matrix’s
elements in columnmajor or rowmajor order.
A
of
indicates columnmajor ordering.
This packs matrix elements starting with the leftmost (least column
index) column, and proceeding column by column, from the top entry
(least row index).
A
of
indicates rowmajor ordering. This
packs matrix elements starting with the topmost (least row index) row,
and proceeding row by row, from the leftmost (least column index)
entry.
[Note:
describes the data layout used by the BLAS'
Symmetric Packed (SP), Hermitian Packed (HP), and Triangular Packed
(TP) matrix types.
end note]
template < class Triangle , class StorageOrder > class layout_blas_packed { public : template < class Extents > struct mapping { public : using extents_type = Extents ; using index_type = typename extents_type :: index_type ; using size_type = typename extents_type :: size_type ; using rank_type = typename extents_type :: rank_type ; using layout_type = layout_blas_packed < Triangle , StorageOrder > ; private : Extents < it > extents_ </ it > {}; // exposition only public : constexpr mapping () noexcept = default ; constexpr mapping ( const mapping & ) noexcept = default ; constexpr mapping ( const extents_type & e ) noexcept ; template < class OtherExtents > constexpr explicit ( ! is_convertible_v < OtherExtents , extents_type > ) mapping ( const mapping < OtherExtents >& other ) noexcept ; constexpr mapping & operator = ( const mapping & ) noexcept = default ; constexpr extents_type extents () const noexcept { return < it > extents_ </ it > ; } constexpr size_type required_span_size () const noexcept ; template < class ... Indices > constexpr index_type operator () ( Indices ...) const noexcept ; static constexpr bool is_always_unique () { return extents_type :: static_extent ( 0 ) != dynamic_extent && extents_type :: static_extent ( 0 ) < 2 ; } static constexpr bool is_always_exhaustive () { return true; } static constexpr bool is_always_strided () { return extents_type :: static_extent ( 0 ) != dynamic_extent && extents_type :: static_extent ( 0 ) < 2 ; } constexpr bool is_unique () const noexcept { return extent ( 0 ) < 2 ; } constexpr bool is_exhaustive () const noexcept { return true; } constexpr bool is_strided () const noexcept { return extent ( 0 ) < 2 ; } constexpr index_type stride ( rank_type ) const noexcept ; template < class OtherExtents > friend constexpr bool operator == ( const mapping & , const mapping < OtherExtents >& ) noexcept ; };

Constraints:

is eitherTriangle
orupper_triangle_t
.lower_triangle_t 
is eitherStorageOrder
orcolumn_major_t
.row_major_t 
is a specialization ofExtents
.extents 
equals 2.Extents :: rank ()

constexpr mapping ( const extents_type & e ) noexcept ;
Preconditions:

The size of the multidimensional index space
is representable as a value of typee
([basic.fundamental]), andindex_type 
equalse . extent ( 0 )
.e . extent ( 1 )

Effects: Directnonlistinitializes
withextents_
.e
template < class OtherExtents > explicit ( ! is_convertible_v < OtherExtents , extents_type > ) constexpr mapping ( const mapping < OtherExtents >& other ) noexcept ;
Constraints:
is true
.
Preconditions:
is representable as a value
of type
([basic.fundamental]).
Effects: Directnonlistinitializes
.
constexpr index_type required_span_size () const noexcept ;
Returns:
* (
+ 1)/2.
[Note: For example, a 5 x 5 packed matrix only stores 15 matrix elements. end note]
template < class ... Indices > constexpr index_type operator () ( Indices ... k ) const noexcept ;
Constraints:

equalssizeof ...( Indices )
,extents_type :: rank () 
is( is_convertible_v < Indices , index_Type > && ...) true
, and 
is( is_nothrow_constructible_v < index_type , Indices > ) true
.
Preconditions:
is
a multidimensional index in
Returns: Let
equal
, let
be the zeroth element of
,
and let
be the first element of
. Then:

If
isStorageOrder
andcolumn_major_t 
if
isTriangle
, thenupper_triangle_t
+i
* (j
+ 1)/2 ifj
≤i
, elsej
+j
* (i
+ 1)/2;i 
else, if
isTriangle
, thenlower_triangle_t
+i
*N
j
* (j
+ 1)/2 ifj
≥i
, elsej
+j
*N
i
* (i
+ 1)/2;i


else, if
isStorageOrder
androw_major_t 
if
isTriangle
, thenupper_triangle_t
+j
*N
i
* (i
+ 1)/2 ifi
≤i
, elsej
+i
*N
j
* (j
+ 1)/2;j 
else, if
isTriangle
, thenlower_triangle_t
+j
* (i
+ 1)/2 ifi
≥i
, elsej
+i
* (j
+ 1)/2.j

[Note:
An
layout mapping must permit access
for all multidimensional indices in the cross product of the extents,
so the above definition cannot exclude indices outside the matrix’s triangle.
Instead, it interprets such indices as if the matrix were symmetric.
end note]
template < class OtherExtents > friend constexpr bool operator == ( const mapping & , const mapping < OtherExtents >& ) noexcept ;
Constraints:
equals
.
Returns: true
if and only if
for 0 ≤
<
,
equals
.
constexpr index_type stride ( rank_type ) const noexcept ;
Returns: 1 if
is less than 2, else 0.
16.5. Scaled inplace transformation [linalg.scaled]
The
function takes a value
and an
, and returns a new readonly
with the same domain
as
, that represents the elementwise product of
with each
element of
.
[Example:
end example]// z = alpha * x + y void z_equals_alpha_times_x_plus_y ( mdspan < double , dextents < size_t , 1 >> z , const double alpha , mdspan < double , dextents < size_t , 1 >> x , mdspan < double , dextents < size_t , 1 >> y ) { add ( scaled ( alpha , x ), y , y ); } // w = alpha * x + beta * y void w_equals_alpha_times_x_plus_beta_times_y ( mdspan < double , dextents < size_t , 1 >> w , const double alpha , mdspan < double , dextents < size_t , 1 >> x , const double beta , mdspan < double , dextents < size_t , 1 >> y ) { add ( scaled ( alpha , x ), scaled ( beta , y ), w ); }
[Note:
An implementation could dispatch to a function in the BLAS library, by
noticing that the first argument has an
type. It could use this information to extract the appropriate
runtime value(s) of the relevant BLAS function arguments (e.g.,
and/or
), by calling
.
end note]
16.5.1. Expositiononly function object conj  if  needed
_ [linalg.scaled.conj]
The name
denotes an expositiononly function object.
The expression
for subexpression
is expressionequivalent to:

, ifE
is an arithmetic type;T 
else,
, if that expression is valid with overload resolution performed in a context that includes the declarationconj ( E )
and does not include a declaration oftemplate < class T > T conj ( const T & ) = delete ; linalg :: impl ::
_;conj_if_needed 
else,
.E
[Note: The special case for arithmetic types
preserves the type of its argument, unlike
.
The
case invokes
via unqualified lookup.
The
case presumes that a type without a
function is noncomplex,
so that the conjugate is the identity.
end note]
16.5.2. Expositiononly class templates proxy_reference_base
and proxy_reference
[linalg.scaled.base]
The expositiononly class
is a tag
to identify whether a class is a specialization of either
[linalg.scaled.accessor_scaled] or
[linalg.scaled.accessor_conjugated].
The expositiononly class template
is part of the implementation of
[linalg.scaled.accessor_scaled] and
[linalg.scaled.accessor_conjugated].
class proxy_reference_base {}; // exposition only template < class Reference , class Value , class Derived > class proxy_reference : public proxy_reference_base { // exposition only private : using this_type = proxy_reference < Reference , Value , Derived > ; Reference reference_ ; public : using reference_type = Reference ; using value_type = Value ; using derived_type = Derived ; constexpr explicit proxy_reference ( Reference reference ) : reference_ ( reference ) {} constexpr operator value_type () const { return static_cast < const Derived &> ( * this ). to_value ( reference_ ); } constexpr friend auto operator  ( const derived_type & cs ) { return  value_type ( cs ); } template < class Rhs , enable_if_t < is_base_of_v < proxy_reference_base , Rhs > , bool > = true> constexpr friend auto operator + ( derived_type lhs , Rhs rhs ) { using rhs_value_type = typename Rhs :: value_type ; // exposition only return value_type ( lhs ) + rhs_value_type ( rhs ); } template < class Rhs , enable_if_t <! is_base_of_v < proxy_reference_base , Rhs > , bool > = true> constexpr friend auto operator + ( derived_type lhs , Rhs rhs ) { return value_type ( lhs ) + rhs ; } template < class Lhs , enable_if_t <! is_base_of_v < proxy_reference_base , Lhs > , bool > = true> constexpr friend auto operator + ( Lhs lhs , derived_type rhs ) { return lhs + value_type ( rhs ); } template < class Rhs , enable_if_t < is_base_of_v < proxy_reference_base , Rhs > , bool > = true> constexpr friend auto operator  ( derived_type lhs , Rhs rhs ) { using rhs_value_type = typename Rhs :: value_type ; // exposition only return value_type ( lhs )  rhs_value_type ( rhs ); } template < class Rhs , enable_if_t <! is_base_of_v < proxy_reference_base , Rhs > , bool > = true> constexpr friend auto operator  ( derived_type lhs , Rhs rhs ) { return value_type ( lhs )  rhs ; } template < class Lhs , enable_if_t <! is_base_of_v < proxy_reference_base , Lhs > , bool > = true> constexpr friend auto operator  ( Lhs lhs , derived_type rhs ) { return lhs  value_type ( rhs ); } template < class Rhs , enable_if_t < is_base_of_v < proxy_reference_base , Rhs > , bool > = true> constexpr friend auto operator * ( derived_type lhs , Rhs rhs ) { using rhs_value_type = typename Rhs :: value_type ; // exposition only return value_type ( lhs ) * rhs_value_type ( rhs ); } template < class Rhs , enable_if_t <! is_base_of_v < proxy_reference_base , Rhs > , bool > = true> constexpr friend auto operator * ( derived_type lhs , Rhs rhs ) { return value_type ( lhs ) * rhs ; } template < class Lhs , enable_if_t <! is_base_of_v < proxy_reference_base , Lhs > , bool > = true> constexpr friend auto operator * ( Lhs lhs , derived_type rhs ) { return lhs * value_type ( rhs ); } template < class Rhs , enable_if_t < is_base_of_v < proxy_reference_base , Rhs > , bool > = true> constexpr friend auto operator / ( derived_type lhs , Rhs rhs ) { using rhs_value_type = typename Rhs :: value_type ; // exposition only return value_type ( lhs ) / rhs_value_type ( rhs ); } template < class Rhs , enable_if_t <! is_base_of_v < proxy_reference_base , Rhs > , bool > = true> constexpr friend auto operator / ( derived_type lhs , Rhs rhs ) { return value_type ( lhs ) / rhs ; } template < class Lhs , enable_if_t <! is_base_of_v < proxy_reference_base , Lhs > , bool > = true> constexpr friend auto operator / ( Lhs lhs , derived_type rhs ) { return lhs / value_type ( rhs ); } constexpr friend auto abs ( const derived_type & x ); constexpr friend auto real ( const derived_type & x ); constexpr friend auto imag ( const derived_type & x ); constexpr friend auto conj ( const derived_type & x ); };
constexpr friend auto abs ( const derived_type & x );
The expression
for subexpression
whose type is
is expressionequivalent to:

, ifvalue_type ( static_cast < const this_type &> ( E ))
is an unsigned integer;value_type 
else,
, if that expression is valid, with overload resolution performed in a context that includes the declarationabs ( value_type ( static_cast < const this_type &> ( E )))
. If the function selected by overload resolution does not return the absolute value of its input, the program is illformed, no diagnostic required.template < class T > T abs ( T ) = delete ;
[Note: This function exists because of
and
.
The unsigned integer special case avoids an ambiguous lookup.
The other case invokes
via unqualified lookup.
end note]
constexpr friend auto real ( const derived_type & x );
The expression
for subexpression
whose type is
is expressionequivalent to:

, ifvalue_type ( static_cast < const this_type &> ( E ))
is an arithmetic type;value_type 
else,
, if that expression is valid, with overload resolution performed in a context that includes the declarationreal ( value_type ( static_cast < const this_type &> ( E )))
; if the function selected by overload resolution does not return the absolute value of its input, the program is illformed, no diagnostic required;template < class T > T real ( T ) = delete ; 
else,
.value_type ( static_cast < const this_type &> ( E ))
[Note: This function exists because of
.
The arithmetic type special case exists because
returns a different type than its input
for some arithmetic types.
All arithmetic types represent a noncomplex number.
Otherwise, if
can be found via unqualified lookup, it is used.
If not, then
is assumed to represent a noncomplex number,
for which the number’s real part is the same as the number.
end note]
constexpr friend auto imag ( const derived_type & x );
The expression
for subexpression
whose type is
is expressionequivalent to:

, ifvalue_type {}
is an arithmetic type;value_type 
else,
, if that expression is valid, with overload resolution performed in a context that includes the declarationimag ( value_type ( static_cast < const this_type &> ( E )))
; if the function selected by overload resolution does not return the absolute value of its input, the program is illformed, no diagnostic required;template < class T > T imag ( T ) = delete ; 
else,
.value_type {}
[Note: This function exists because of
.
The arithmetic type special case exists because
returns a different type than its input
for some arithmetic types.
All arithmetic types represent a noncomplex number.
Otherwise, if
can be found via unqualified lookup, it is used.
If not, then
is assumed to represent a noncomplex number,
for which the number’s imaginary part is zero
(which is constructed by valueinitializing
).
end note]
constexpr friend auto conj ( const derived_type & x );

Effects: Equivalent to
return conj  if  needed
.( x );
16.5.3. Expositiononly class template scaled_scalar
The expositiononly class template
represents a
readonly value, which is the product of a fixed value (the "scaling
factor") and the value of a reference to an element of a
. [Note: The value is read only to avoid confusion
with the definition of "assigning to a scaled scalar." end note]
is part of the implementation of
[linalg.scaled.accessor_scaled].
template < class ScalingFactor , class ReferenceValue > concept < it > scalable </ it > = // exposition only requires { { declval < ScalingFactor > () * declval < ReferenceValue > () }; }; template < class ScalingFactor , class Reference , class ReferenceValue > class scaled_scalar : // exposition only public proxy_reference < Reference , ReferenceValue , scaled_scalar < ScalingFactor , Reference , ReferenceValue >> { private : ScalingFactor scaling_factor_ ; using my_type = scaled_scalar < ScalingFactor , Reference , ReferenceValue > ; // exposition only using base_type = proxy_reference < Reference , ReferenceValue , my_type > ; // exposition only public : explicit scaled_scalar ( const ScalingFactor & scaling_factor , const Reference & reference ); using value_type = decltype ( scaling_factor_ * ReferenceValue ( declval < Reference > ())); value_type to_value ( Reference reference ) const ; };
[Note:
The point of the
template parameter
is so that the input of
can be cast to a value immediately,
before any transformations are applied.
This ensures the original order of operations,
as if computing nonlazily.
It also makes fewer requirements of the reference type.
end note]
Mandates:

scalable
, and< ScalingFactor , ReferenceValue > 
is a complete object type that is neither an abstract class type nor an array type.value_type
models
.
models
.
explicit scaled_scalar ( const ScalingFactor & scaling_factor , const Reference & reference );
Effects:

Direct nonlistinitializes
withstatic_cast < base_type &> ( * this )
, andreference 
direct nonlistinitializes
withscaling_factor_
.scaling_factor
value_type to_value ( Reference reference ) const ;
Effects: Equivalent to
.
The class template
is an
accessor
policy whose reference type represents the product of a scaling factor
and its nested
accessor’s reference.
16.5.4. Class template accessor_scaled
[linalg.scaled.accessor_scaled]
The class template
is an
accessor
policy whose reference type represents the product of a fixed value
(the "scaling factor") and its nested
accessor’s
reference. It is part of the implementation of
[linalg.scaled.scaled].
template < class ScalingFactor , class Accessor > class accessor_scaled { public : using reference = scaled_scalar < ScalingFactor , typename Accessor :: reference , typename Accessor :: element_type > ; using element_type = add_const_t < typename reference :: value_type > ; using data_handle_type = Accessor :: data_handle_type ; using offset_policy = accessor_scaled < ScalingFactor , Accessor :: offset_policy > ; constexpr accessor_scaled ( const ScalingFactor & s , const Accessor & a ); constexpr reference access ( data_handle_type p , size_t i ) const noexcept ; constexpr offset_policy :: data_handle_type offset ( data_handle_type p , size_t i ) const noexcept ; constexpr ScalingFactor scaling_factor () const ; constexpr Accessor nested_accessor () const ; private : ScalingFactor scaling_factor_ ; // exposition only Accessor accessor_ ; // exposition only };
Mandates::
is true
.
models
.
meets the accessor policy requirements [mdspan.accessor.reqmts].
constexpr accessor_scaled ( const ScalingFactor & s , const Accessor & a );
Effects:

Direct nonlistinitializes
withscaling_factor_
, ands 
direct nonlistinitializes
withaccessor_
.a
constexpr reference access ( data_handle_type p , size_t i ) const noexcept ;
Effects: Equivalent to
.
constexpr offset_policy :: data_handle_type offset ( data_handle_type p , size_t i ) const noexcept ;
Effects: Equivalent to
.
constexpr ScalingFactor scaling_factor () const ;
Effects: Equivalent to
.
constexpr Accessor nested_accessor () const ;
Effects: Equivalent to
.
16.5.5. scaled
[linalg.scaled.scaled]
The
function takes a scaling factor
and an
, and returns a new readonly
with the same domain
as
, that represents the elementwise product of
with each
element of
.
[Note: Terms in this product will not be reordered. end note]
Effects: Equivalent totemplate < class ScalingFactor , class ElementType , class Extents , class Layout , class Accessor > constexpr auto scaled ( ScalingFactor scaling_factor , const mdspan < ElementType , Extents , Layout , Accessor >& x );
return mdspan { x . data (), x . mapping (), accessor_scaled { alpha , x . accessor ()}};
.
[Note: The elements of the returned
are read only.
Nested
expressions remain nested.
An expression such as
would not be transformed into
.
This is because such transformations would change at least
the order of operations, and possibly also the result type.
For example, if
is a rank1
whose
is
,
and if
is 4 and
is IEEE 754 binary64,
then
does not overflow
,
but
would overflow
.
end note]
[Example:
end example]void test_scaled ( mdspan < double , extents < int , 10 >> x ) { auto x_scaled = scaled ( 5.0 , x ); for ( int i = 0 ; i < x . extent ( 0 ); ++ i ) { assert ( x_scaled [ i ] == 5.0 * x [ i ]); } }
16.6. Conjugated inplace transformation [linalg.conj]
The
function takes an
, and returns
a new readonly
with the same domain as
, whose
elements are the complex conjugates of the corresponding elements of
.
[Note:
An implementation could dispatch to a function in the BLAS library, by
noticing that the
type of an
input has type
, and that its nested
type is
compatible with the BLAS library. If so, it could set the
corresponding
BLAS function argument accordingly and call the
BLAS function.
end note]
16.6.1. Expositiononly class template conjugated_scalar
The expositiononly class template
represents a
readonly value, which is the complex conjugate of the value of a
reference to an element of an
. [Note: The value is
read only to avoid confusion with the definition of "assigning to the
conjugate of a scalar." end note]
is part of
the implementation of
[linalg.conj.accessor_conjugate].
template < class ReferenceValue > concept < it > conjugatable </ it > = // exposition only requires { { < it > conj  if  needed </ it > ( declval < ReferenceValue > ()) } > same_as < ReferenceValue > ; }; template < class Reference , class ReferenceValue > class conjugated_scalar : // exposition only public proxy_reference < Reference , ReferenceValue , conjugated_scalar < Reference , ReferenceValue >> { private : using my_type = conjugated_scalar < Reference , ReferenceValue > ; // exposition only using base_type = proxy_reference < Reference , ReferenceValue , my_type > ; // exposition only public : constexpr explicit conjugated_scalar ( Reference reference ); using value_type = /* see below */ ; constexpr static auto to_value ( Reference reference ); };
Mandates:
.
models
.
using value_type = /* see below */ ;
names the type
.
constexpr explicit conjugated_scalar ( Reference reference );

Effects: Initializes
withbase_type
.reference
constexpr static auto to_value ( Reference value_type );

Preconditions: This function may be invoked arbitrarily many times.

Effects: Equivalent to
return conj  if  needed
.( ReferenceValue ( reference ));
16.6.2. Class template accessor_conjugate
[linalg.conj.accessor_conjugate]
The class template
is an
accessor
policy whose reference type represents the complex conjugate of its
nested
accessor’s reference.
It is part of the implementation of
[linalg.conj.conjugated].
template < class Accessor > class accessor_conjugate { private : Accessor acc_ ; // exposition only public : using reference = /* see below */ ; using element_type = /* see below */ ; using data_handle_type = typename Accessor :: data_handle_type ; using offset_policy = /* see below */ ; constexpr accessor_conjugate ( Accessor a ); constexpr reference access ( data_handle_type p , size_t i ) const noexcept ( noexcept ( reference ( acc_ . access ( p , i )))); constexpr typename offset_policy :: data_handle_type offset ( data_handle_type p , size_t i ) const noexcept ( noexcept ( acc_ . offset ( p , i ))); constexpr Accessor nested_accessor () const ; };
meets the
accessor policy requirements [mdspan.accessor.reqmts].
using reference = /* see below */ ;
This names

, iftypename Accessor :: reference
is an arithmetic type;remove_cv_t < typename Accessor :: element_type > 
otherwise,
.conjugated_scalar < typename Accessor :: reference , remove_cv_t < typename Accessor :: element_type >>
using element_type = /* see below */
Let the expositiononly type
name

, iftypename Accessor :: element_type
is an arithmetic type;remove_cv < typename Accessor :: element_type > 
otherwise,
.reference :: value_type
Then,
names
.
[Note:
provides readonly access. end note]
using offset_policy = /* see below */ ;
This names

, iftypename Accessor :: offset_policy
is an arithmetic type;remove_cv_t < typename Accessor :: element_type > 
otherwise,
.accessor_conjugate < typename Accessor :: offset_policy >
constexpr accessor_conjugate ( Accessor a );

Effects: Direct nonlistinitializes
withacc_
.a
constexpr reference access ( data_handle_type p , size_t i ) const noexcept ( noexcept ( reference ( acc_ . access ( p , i ))));

Effects: Equivalent to
.return reference ( acc_ . access ( p , i ));
constexpr typename offset_policy :: data_handle_type offset ( data_handle_type p , size_t i ) const noexcept ( noexcept ( acc_ . offset ( p , i )));

Effects: Equivalent to
.return acc_ . offset ( p , i );
constexpr Accessor nested_accessor () const ;

Effects: Equivalent to
.return acc_ ;
16.6.3. conjugated
[linalg.conj.conjugated]
template < class ElementType , class Extents , class Layout , class Accessor > constexpr auto conjugated ( mdspan < ElementType , Extents , Layout , Accessor > a );
Let
name the type
,
where

isReturnElementType 
if
isAccessor
for someaccessor_conjugate < NestedAccessor >
, thenNestedAccessor
[Note: conjugation is selfannihilating end note];typename NestedAccessor :: element_type 
else if
is an arithmetic type, thenremove_cv_t < ElementType >
;ElementType 
else
;accessor_conjugate < Accessor >:: element_type


and
is:ReturnAccessor 
if
isAccessor
for someaccessor_conjugate < NestedAccessor >
, thenNestedAccessor
[Note: conjugation is selfannihilating end note];NestedAccessor 
else if
is an arithmetic type, thenremove_cv_t < ElementType >
[Note: this reduces nesting in cases where conjugation is known to be the identity end note];Accessor 
else
.accessor_conjugate < Accessor >

Effects:

If
isAccessor
, then equivalent toaccessor_conjugate < NestedAccessor >
;return R ( a . data (), a . mapping (), a . nested_accessor ()); 
else if
is an arithmetic type, then equivalent toremove_cv_t < ElementType >
;return R ( a . data (), a . mapping (), a . accessor ()); 
else, equivalent to
;return R ( a . data (), a . mapping (), ReturnAccessor ( a . accessor ()));
[Note: The elements of the returned
are read only.
end note];
[Example:
end example]void test_conjugated_complex ( mdspan < complex < double > , extents < int , 10 >> a ) { auto a_conj = conjugated ( a ); for ( int i = 0 ; i < a . extent ( 0 ); ++ i ) { assert ( a_conj [ i ] == conj ( a [ i ]); } auto a_conj_conj = conjugated ( a_conj ); for ( int i = 0 ; i < a . extent ( 0 ); ++ i ) { assert ( a_conj_conj [ i ] == a [ i ]); } } void test_conjugated_real ( mdspan < double , extents < int , 10 >> a ) { auto a_conj = conjugated ( a ); for ( int i = 0 ; i < a . extent ( 0 ); ++ i ) { assert ( a_conj [ i ] == a [ i ]); } auto a_conj_conj = conjugated ( a_conj ); for ( int i = 0 ; i < a . extent ( 0 ); ++ i ) { assert ( a_conj_conj [ i ] == a [ i ]); } }
16.7. Transpose inplace transformation [linalg.transp]
is an
layout mapping policy that
swaps the rightmost two indices, extents, and strides (if applicable)
of any unique
layout mapping policy.
The
function takes an
representing a matrix, and returns a new readonly
representing the transpose of the input matrix.
16.7.1. layout_transpose
[linalg.transp.layout_transpose]
is an
layout mapping policy that
swaps the rightmost two indices, extents, and strides (if applicable)
of any unique
layout mapping policy.
template < class InputExtents > using transpose_extents_t = /* see below */ ; // exposition only
For
a specialization of
,
names the
type
such that

equalsInputExtents :: static_extent ( InputExtents :: rank () 1 )
,OutputExtents :: static_extent ( OutputExtents :: rank () 2 ) 
equalsInputExtents :: static_extent ( InputExtents :: rank () 2 )
, andOutputExtents :: static_extent ( OutputExtents :: rank () 1 ) 
equalsInputExtents :: static_extent ( r )
for 0 ≤OutputExtents :: static_extent ( r )
<r
.InputExtents :: rank () 2 
Preconditions:
is a specialization ofInputExtents
.extents 
Constraints:
equals 2.InputExtents :: rank ()
template < class InputExtents > constexpr transpose_extents_t < InputExtents > transpose_extents ( const InputExtents & in ); // exposition only

Constraints:
equals 2.InputExtents :: rank () 
Returns: An
objectextents
such thatout 
equalsout . extent ( in . rank () 1 )
,in . extent ( in . rank () 2 ) 
equalsout . extent ( in . rank () 2 )
, andin . extent ( in . rank () 1 ) 
equalsout . extent ( r )
for 0 ≤in . extent ( r )
<r
.in . rank () 2

template < class Layout > class layout_transpose { public : template < class Extents > struct mapping { private : using nested_mapping_type = typename Layout :: template mapping < transpose_extents_t < Extents >> ; // exposition only nested_mapping_type nested_mapping_ ; // exposition only public : using extents_type = Extents ; using size_type = typename extents_type :: size_type ; using layout_type = layout_transpose ; constexpr explicit mapping ( const nested_mapping_type & map ); constexpr extents_type extents () const noexcept ( noexcept ( nested_mapping_ . extents ())); constexpr size_type required_span_size () const noexcept ( noexcept ( nested_mapping_ . required_span_size ())); template < class ... Indices > constexpr size_type operator ()( Indices ... indices ) const noexcept ( noexcept ( nested_mapping_ ( indices ...))); constexpr nested_mapping_type nested_mapping () const ; static constexpr bool is_always_unique (); static constexpr bool is_always_contiguous (); static constexpr bool is_always_strided (); constexpr bool is_unique () const noexcept ( noexcept ( nested_mapping_ . is_unique ())); constexpr bool is_contiguous () const noexcept ( noexcept ( nested_mapping_ . is_contiguous ())); constexpr bool is_strided () const noexcept ( noexcept ( nested_mapping_ . is_strided ())); constexpr size_type stride ( size_t r ) const noexcept ( noexcept ( nested_mapping_ . stride ( r ))); template < class OtherExtents > friend constexpr bool operator == ( const mapping & , const mapping < OtherExtents >& ) noexcept ; }; };

Preconditions:
shall meet theLayout
layout mapping policy requirements.mdspan 
Constraints:
equals 2.Extents :: rank ()
constexpr explicit mapping ( const nested_mapping_type & map );

Effects: Initializes
withnested_mapping_
.map
constexpr extents_type extents () const noexcept ( noexcept ( nested_mapping_ . extents ()));

Effects: Equivalent to
.return transpose_extents ( nested_mapping_ . extents ());
constexpr size_type required_span_size () const noexcept ( noexcept ( nested_mapping_ . required_span_size ()));

Effects: Equivalent to
.return nested_mapping_ . required_span_size ();
template < class ... Indices > constexpr size_type operator ()( Indices ... indices ) const noexcept ( noexcept ( nested_mapping_ ( indices ...)));

Effects: Equivalent to
, wherereturn nested_mapping_ ( last_two_indices_reversed );
is the result of reversing the last two elements oflast_two_indices_reversed
.indices
constexpr nested_mapping_type nested_mapping () const ;

Effects: Equivalent to
.return nested_mapping_ ;
static constexpr bool is_always_unique ();

Effects: Equivalent to
.return nested_mapping_type :: is_always_unique ();
static constexpr bool is_always_contiguous ();

Effects: Equivalent to
.return nested_mapping_type :: is_always_contiguous ();
static constexpr bool is_always_strided ();

Effects: Equivalent to
.return nested_mapping_type :: is_always_strided ();
constexpr bool is_unique () const noexcept ( noexcept ( nested_mapping_ . is_unique ()));

Effects: Equivalent to
.return nested_mapping_ . is_unique ();
constexpr bool is_contiguous () const noexcept ( noexcept ( nested_mapping_ . is_contiguous ()));

Effects: Equivalent to
.return nested_mapping_ . is_contiguous ();
constexpr bool is_strided () const noexcept ( noexcept ( nested_mapping_ . is_strided ()));

Effects: Equivalent to
.return nested_mapping_ . is_strided ();
constexpr size_type stride ( size_t r ) const noexcept ( noexcept ( nested_mapping_ . stride ( r )));

Constraints:
isis_always_strided () true
. 
Effects: Equivalent to
, wherereturn nested_mapping_ . stride ( s ); 
iss
 2 ifrank ()
equalsr
 1,rank () 
iss
 1 ifrank ()
equalsr
 2, andrank () 
iss
for 0 ≤r
<r
.in . rank () 2

template < class OtherExtents > friend constexpr bool operator == ( const mapping & , const mapping < OtherExtents >& ) noexcept ;

Constraints:

andnested_mapping_type
are equality comparable, anddecltype ( m . nested_mapping_ ) 
equalsOtherExtents :: rank ()
.rank ()


Effects: Equivalent to
.nested_mapping_ == m . nested_mapping_ ;
16.7.2. transposed
[linalg.transp.transposed]
The
function takes a rank2
representing a matrix, and returns a new readonly
representing the transpose of the input matrix. The input matrix’s
data are not modified, and the returned
accesses the
input matrix’s data in place.
template < class ElementType , class Extents , class Layout , class Accessor > constexpr auto transposed ( mdspan < ElementType , Extents , Layout , Accessor > a );
Let
name the type
.
Let
name the type
,
where

is:ReturnElementType 
if
isAccessor
, thendefault_accessor < ElementType >
;add_const_t < ElementType > 
else,
; [Note: For general accessors, theElementType
version of the accessor may not exist; Even if it does,const ElementType
has no way to deduce it generically. end note]transposed


is:ReturnLayout 
if
isLayout
, thenlayout_left
;layout_right 
else if
isLayout
, thenlayout_right
;layout_left 
else if
isLayout
, thenlayout_stride
;layout_stride 
else if
isLayout
for somelayout_left_padded < padding_stride >
, thenpadding_stride
;layout_right_padded < padding_stride > 
else if
isLayout
for somelayout_right_padded < padding_stride >
, thenpadding_stride
;layout_left_padded < padding_stride > 
else if
isLayout
for somelayout_blas_packed < Triangle , StorageOrder >
andTriangle
, thenStorageOrder
, wherelayout_blas_packed < OppositeTriangle , OppositeStorageOrder > 
names the typeOppositeTriangle
, andconditional_t < is_same_v < Triangle , upper_triangle_t > , lower_triangle_t , upper_triangle_t > 
names the typeOppositeStorageOrder
;conditional_t < is_same_v < StorageOrder , column_major_t > , row_major_t , column_major_t >


else if
isLayout
for somelayout_transpose < NestedLayout >
, thenNestedLayout
[Note: this optimizes applyingNestedLayout
twice to antransposed
with a custom layout, as long as there are no intervening layout changes end note];mdspan 
else,
;layout_transpose < Layout >


and,
is:ReturnAccessor 
if
isAccessor
, thendefault_accessor < ElementType >
;default_accessor < add_const_t < ElementType >> 
else,
.Accessor


Effects:

If
isLayout
,layout_left
, orlayout_right
for somelayout_blas_packed < Triangle , StorageOrder >
andTriangle
, then equivalent toStorageOrder return R ( a . data (), ReturnMapping ( transpose_extents ( a . mapping (). extents ())), a . accessor ()); 
else, if
isLayout
for somelayout_left_padded < padding_stride >
, then equivalent topadding_stride return R ( a . data (), ReturnMapping ( transpose_extents ( a . mapping (). extents ()), a . mapping (). stride ( 1 )), a . accessor ()); 
else, if
isLayout
for somelayout_right_padded < padding_stride >
, then equivalent topadding_stride return R ( a . data (), ReturnMapping ( transpose_extents ( a . mapping (). extents ()), a . mapping (). stride ( 0 )), a . accessor ()); 
else, if
isLayout
, then equivalent to [Note: reverse the strides as well as the extents end note]layout_stride return R ( a . data (), ReturnMapping ( transpose_extents ( a . mapping (). extents ()), array < decltype (), a . mapping (). rank () > { a . mapping (). stride ( 1 ), a . mapping (). stride ( 0 )}), a . accessor ()); 
else, if
isLayout
for somelayout_transpose < NestedLayout >
, then equivalent to [Note: getting the nested mapping untransposes the extents end note]NestedLayout return R ( a . data (), a . mapping (). nested_mapping (), a . accessor ()); 
else, equivalent to
, wherereturn R ( a . data (), ReturnMapping ( a . mapping ()), a . accessor ());
names the typeReturnMapping
.typename layout_transpose < Layout >:: template mapping < ReturnExtents >


Remarks: The elements of the returned
are read only.mdspan
[Example:
end example]void test_transposed ( mdspan < double , extents < size_t , 3 , 4 >> a ) { const auto num_rows = a . extent ( 0 ); const auto num_cols = a . extent ( 1 ); auto a_t = transposed ( a ); assert ( num_rows == a_t . extent ( 1 )); assert ( num_cols == a_t . extent ( 0 )); assert ( a . stride ( 0 ) == a_t . stride ( 1 )); assert ( a . stride ( 1 ) == a_t . stride ( 0 )); for ( size_t row = 0 ; row < num_rows ; ++ row ) { for ( size_t col = 0 ; col < num_rows ; ++ col ) { assert ( a [ row , col ] == a_t [ col , row ]); } } auto a_t_t = transposed ( a_t ); assert ( num_rows == a_t_t . extent ( 0 )); assert ( num_cols == a_t_t . extent ( 1 )); assert ( a . stride ( 0 ) == a_t_t . stride ( 0 )); assert ( a . stride ( 1 ) == a_t_t . stride ( 1 )); for ( size_t row = 0 ; row < num_rows ; ++ row ) { for ( size_t col = 0 ; col < num_rows ; ++ col ) { assert ( a [ row , col ] == a_t_t [ row , col ]); } } }
16.8. Conjugate transpose transform [linalg.conj_transp]
The
function returns a conjugate transpose
view of an object. This combines the effects of
and
.
template < class ElementType , class Extents , class Layout , class Accessor > constexpr auto conjugate_transposed ( mdspan < ElementType , Extents , Layout , Accessor > a );

Effects: Equivalent to
.return conjugated ( transposed ( a ));
[Example:
end example]void test_conjugate_transposed ( mdspan < complex < double > , extents < size_t , 3 , 4 >> a ) { const auto num_rows = a . extent ( 0 ); const auto num_cols = a . extent ( 1 ); auto a_ct = conjugate_transposed ( a ); assert ( num_rows == a_ct . extent ( 1 )); assert ( num_cols == a_ct . extent ( 0 )); assert ( a . stride ( 0 ) == a_ct . stride ( 1 )); assert ( a . stride ( 1 ) == a_ct . stride ( 0 )); for ( size_t row = 0 ; row < num_rows ; ++ row ) { for ( size_t col = 0 ; col < num_rows ; ++ col ) { assert ( a [ row , col ] == conj ( a_ct [ col , row ])); } } auto a_ct_ct = conjugate_transposed ( a_ct ); assert ( num_rows == a_ct_ct . extent ( 0 )); assert ( num_cols == a_ct_ct . extent ( 1 )); assert ( a . stride ( 0 ) == a_ct_ct . stride ( 0 )); assert ( a . stride ( 1 ) == a_ct_ct . stride ( 1 )); for ( size_t row = 0 ; row < num_rows ; ++ row ) { for ( size_t col = 0 ; col < num_rows ; ++ col ) { assert ( a [ row , col ] == a_ct_ct [ row , col ]); assert ( conj ( a_ct [ col , row ]) == a_ct_ct [ row , col ]); } } }
16.9. Algorithms [linalg.algs]
16.9.1. Requirements based on template parameter name [linalg.algs.reqs]
Throughout this Clause, where the template parameters are not
constrained, the names of template parameters are used to express type
requirements. In the requirements below, we use
in a typename to
denote a "wildcard," that matches zero characters,
,
,
,
or other things as appropriate.

Algorithms that have a template parameter named
are parallel algorithms [algorithms.parallel.defns].ExecutionPolicy 
is any type such thatReal
is specified. [Note: See [complex.numbers.general]. end note]complex < Real > 
is a rank1in_vector * _t
with a potentiallymdspan
element type and a unique layout. If the algorithm accesses the object, it will do so in readonly fashion.const 
is a rank1inout_vector * _t
with a nonmdspan
element type, and whose layout is always unique (const
equalslayout_type :: is_always_unique () true
). 
is a rank1out_vector * _t
with a nonmdspan
element type, and whose layout is always unique (const
equalslayout_type :: is_always_unique () true
). If the algorithm accesses the object, it will do so in writeonly fashion. 
is a rank2in_matrix * _t
with amdspan
element type. If the algorithm accesses the object, it will do so in readonly fashion.const 
is a rank2inout_matrix * _t
with a nonmdspan
element type.const 
is a rank2out_matrix * _t
with a nonmdspan
element type. If the algorithm accesses the object, it will do so in writeonly fashion.const 
is a rank1 or rank2in_object * _t
with a potentiallymdspan
element type, and whose layout is always unique (const
equalslayout_type :: is_always_unique () true
). If the algorithm accesses the object, it will do so in readonly fashion. 
is a rank1 or rank2inout_object * _t
with a nonmdspan
element type, and whose layout is always unique (const
equalslayout_type :: is_always_unique () true
). 
is a rank1 or rank2out_object * _t
with a nonmdspan
element type, and whose layout is always unique (const
equalslayout_type :: is_always_unique () true
). 
is eitherTriangle
orupper_triangle_t
.lower_triangle_t 
is eitherDiagonalStorage
orimplicit_unit_diagonal_t
.explicit_diagonal_t 
template parameters may deduce ain_ * _t
lvalue reference or a (nonconst
) rvalue reference to anconst
.mdspan 
andinout_ * _t
template parameters may deduce aout_ * _t
lvalue reference to anconst
, or a (nonmdspan
) rvalue reference to anconst
.mdspan 
template parameters are a binary function object ([function.objects]).BinaryDivideOp
16.9.2. BLAS 1 functions [linalg.algs.blas1]
[Note:
The BLAS developed in three "levels": 1, 2, and 3. BLAS 1 includes vectorvector operations, BLAS 2 matrixvector operations, and BLAS 3 matrixmatrix operations. The level coincides with the number of nested loops in a naïve sequential implementation of the operation. Increasing level also comes with increasing potential for data reuse. The BLAS traditionally lists computing a Givens rotation among the BLAS 1 operations, even though it only operates on scalars.
end note]
Complexity: All algorithms in this Clause with
parameters
perform a count of
array accesses and arithmetic operations
that is linear in the maximum product of extents of any
parameter.
16.9.2.1. Givens rotations [linalg.algs.blas1.givens]
16.9.2.1.1. Compute Givens rotation [linalg.algs.blas1.givens.lartg]
template < class Real > void givens_rotation_setup ( const Real a , const Real b , Real & c , Real & s , Real & r ); template < class Real > void givens_rotation_setup ( const complex < Real >& a , const complex < Real >& b , Real & c , complex < Real >& s , complex < Real >& r );
This function computes the plane (Givens) rotation represented by the
two values
and
such that the 2x2 system of equations [Note: use of monospaced text in this equation does not indicate C++ code end note]
[ c s ] [ a ] [ r ] [ ] * [ ] = [ ] [  conj ( s ) c ] [ b ] [ 0 ]
holds, where
is always a real scalar, and
equals one.
That is,
and
represent a 2 x 2 matrix, that when multiplied by the
right by the input vector whose components are
and
, produces a
result vector whose first component
is the Euclidean norm of the
input vector, and whose second component as zero.
[Note: This function corresponds to the LAPACK function
.
The BLAS variant
takes four arguments 
,
,
, and
 and overwrites the input
with
. We have chosen
's interface because it separates input and output, and to
encourage following
's more careful implementation.
end note]
[Note:
has an overload for complex numbers,
because the output argument
(cosine) is a signed magnitude.
end note]

Effects: Assigns to
andc
the plane (Givens) rotation corresponding to the inputs
anda
. Assigns tob
the Euclidean norm of the twocomponent vector formed byr
anda
.b 
Throws: Nothing.
16.9.2.1.2. Apply a computed Givens rotation to vectors [linalg.algs.blas1.givens.rot]
template < class inout_vector_1_t , class inout_vector_2_t , class Real > void givens_rotation_apply ( inout_vector_1_t x , inout_vector_2_t y , const Real c , const Real s ); template < class ExecutionPolicy , class inout_vector_1_t , class inout_vector_2_t , class Real > void givens_rotation_apply ( ExecutionPolicy && exec , inout_vector_1_t x , inout_vector_2_t y , const Real c , const Real s ); template < class inout_vector_1_t , class inout_vector_2_t , class Real > void givens_rotation_apply ( inout_vector_1_t x , inout_vector_2_t y , const Real c , const complex < Real > s ); template < class ExecutionPolicy , class inout_vector_1_t , class inout_vector_2_t , class Real > void givens_rotation_apply ( ExecutionPolicy && exec , inout_vector_1_t x , inout_vector_2_t y , const Real c , const complex < Real > s );
[Note: These functions correspond to the BLAS function
.
and
form a plane (Givens) rotation. Users normally would compute
and
using
, but they are not required to do
this.
end note]
For
in the domains of both
and
,
the mathematical expressions for the algorithm are
and
.

Preconditions:
equalsx . extent ( 0 )
.y . extent ( 0 ) 
Mandates: If neither
norx . static_extent ( 0 )
equalsy . static_extent ( 0 )
, thendynamic_extent
equalsx . static_extent ( 0 )
.y . static_extent ( 0 ) 
Effects: Applies the plane (Givens) rotation specified by
andc
to the input vectorss
andx
, as if the rotation were a 2 x 2 matrix and the input vectors were successive rows of a matrix with two rows.y
16.9.2.2. Swap matrix or vector elements [linalg.algs.blas1.swap]
template < class inout_object_1_t , class inout_object_2_t > void swap_elements ( inout_object_1_t x , inout_object_2_t y ); template < class ExecutionPolicy , class inout_object_1_t , class inout_object_2_t > void swap_elements ( ExecutionPolicy && exec , inout_object_1_t x , inout_object_2_t y );
[Note: These functions correspond to the BLAS function
.
end note]

Preconditions: For all
in 0, 1, ...,r
 1,x . rank ()
equalsx . extent ( r )
.y . extent ( r ) 
Constraints:

equalsx . rank ()
.y . rank () 
is no more than 2.x . rank () 
For
in the domain ofi ...
andx
, the expressiony
is well formed.x [ i ...] = y [ i ...]


Mandates: For all
in 0, 1, ...,r
 1, if neitherx . rank ()
norx . static_extent ( r )
equalsy . static_extent ( r )
, thendynamic_extent
equalsx . static_extent ( r )
.y . static_extent ( r ) 
Effects: Swap all corresponding elements of the objects
andx
.y
16.9.2.3. Multiply the elements of an object in place by a scalar [linalg.algs.blas1.scal]
template < class Scalar , class inout_object_t > void scale ( const Scalar alpha , inout_object_t obj ); template < class ExecutionPolicy , class Scalar , class inout_object_t > void scale ( ExecutionPolicy && exec , const Scalar alpha , inout_object_t obj );
[Note: These functions correspond to the BLAS function
.
end note]
For
in the domain of
,
the mathematical expression for the algorithm is
.

Constraints:
is less than or equal to 2.obj . rank () 
Effects: Replace each element of
in place by the product ofobj
and that element.alpha
16.9.2.4. Copy elements of one matrix or vector into another [linalg.algs.blas1.copy]
template < class in_object_t , class out_object_t > void copy ( in_object_t x , out_object_t y ); template < class ExecutionPolicy , class in_object_t , class out_object_t > void copy ( ExecutionPolicy && exec , in_object_t x , out_object_t y );
[Note: These functions correspond to the BLAS function
.
end note]

Preconditions: For all
in 0, 1, ...,r
 1,x . rank ()
equalsx . extent ( r )
.y . extent ( r ) 
Constraints:

equalsx . rank ()
.y . rank () 
For all
in the domain ofi ...
andx
, the expressiony
is well formed.y [ i ...] = x [ i ...]


Mandates: For all
in 0, 1, ...,r
 1, if neitherx . rank ()
norx . static_extent ( r )
equalsy . static_extent ( r )
, thendynamic_extent
equalsx . static_extent ( r )
.y . static_extent ( r ) 
Effects: Overwrite each element of
with the corresponding element ofy
.x
16.9.2.5. Add vectors or matrices elementwise [linalg.algs.blas1.add]
template < class in_object_1_t , class in_object_2_t , class out_object_t > void add ( in_object_1_t x , in_object_2_t y , out_object_t z ); template < class ExecutionPolicy , class in_object_1_t , class in_object_2_t , class out_object_t > void add ( ExecutionPolicy && exec , in_object_1_t x , in_object_2_t y , out_object_t z );
[Note: These functions correspond to the BLAS function
.
end note]
For
in the domains of
,
, and
,
the mathematical expression for the algorithm is
.

Preconditions: For all
in 0, 1, ...,r
 1,x . rank () 
equalsx . extent ( r )
.z . extent ( r ) 
equalsy . extent ( r )
.z . extent ( r )


Constraints:
,x . rank ()
, andy . rank ()
are all equal.z . rank () 
Mandates: For all
in 0, 1, ...,r
 1,x . rank () 
if neither
norx . static_extent ( r )
equalsz . static_extent ( r )
, thendynamic_extent
equalsx . static_extent ( r )
; andz . static_extent ( r ) 
if neither
nory . static_extent ( r )
equalsz . static_extent ( r )
, thendynamic_extent
equalsy . static_extent ( r )
.z . static_extent ( r ) 
if neither
norx . static_extent ( r )
equalsy . static_extent ( r )
, thendynamic_extent
equalsx . static_extent ( r )
;y . static_extent ( r )


Effects: Compute the elementwise sum z = x + y.
16.9.2.6. Dot product of two vectors [linalg.algs.blas1.dot]
16.9.2.6.1. Nonconjugated dot product of two vectors [linalg.algs.blas1.dot.dotu]
[Note: The functions in this section correspond to the BLAS
functions
(for real element types) and
(for complex
element types). end note]
Nonconjugated dot product with specified result type
template < class in_vector_1_t , class in_vector_2_t , class T > T dot ( in_vector_1_t v1 , in_vector_2_t v2 , T init ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class T > T dot ( ExecutionPolicy && exec , in_vector_1_t v1 , in_vector_2_t v2 , T init );
For
equal to
,
the mathematical expression for the algorithm is
plus the sum of
for all
in the domain of
.

Preconditions:
equalsv1 . extent ( 0 )
.v2 . extent ( 0 ) 
Mandates: If neither
norv1 . static_extent ( 0 )
equalsv2 . static_extent ( 0 )
, thendynamic_extent
equalsv1 . static_extent ( 0 )
.v2 . static_extent ( 0 ) 
Effects: Let
beN
. Ifv1 . extent ( 0 )
is zero, returnsN
, else returns GENERALIZED_SUM(init
,plus <> ()
,init
, ...,v1 [ 0 ] * v2 [ 0 ]
).v1 [ N 1 ] * v2 [ N 1 ] 
Remarks: If
andin_vector_t :: element_type
are both floatingpoint types or complex versions thereof, and ifT
has higher precision thanT
, then intermediate terms in the sum usein_vector_type :: element_type
's precision or greater.T
[Note: Like
,
applies binary
in an
unspecified order. This may yield a nondeterministic result for
nonassociative or noncommutative
such as floatingpoint
addition. However, implementations may perform extra work to make the
result deterministic. They may do so for all
overloads, or just
for specific
types. end note]
[Note: Users can get
behavior by giving the first argument
as the result of
. Alternately, they can use the shortcut
below. end note]
Nonconjugated dot product with default result type
template < class in_vector_1_t , class in_vector_2_t > auto dot ( in_vector_1_t v1 , in_vector_2_t v2 ) > /* seebelow */ ; template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t > auto dot ( ExecutionPolicy && exec , in_vector_1_t v1 , in_vector_2_t v2 ) > /* seebelow */ ;

Effects: Let
beT
. Then, the twoparameter overload is equivalent todecltype ( v1 [ 0 ] * v2 [ 0 ])
, and the threeparameter overload is equivalent todot ( v1 , v2 , T {});
.dot ( exec , v1 , v2 , T {});
16.9.2.6.2. Conjugated dot product of two vectors [linalg.algs.blas1.dot.dotc]
[Note:
The functions in this section correspond to the BLAS functions
(for real element types) and
(for complex element types).
exists to give users reasonable default inner product behavior
for both real and complex element types.
end note]
Conjugated dot product with specified result type
template < class in_vector_1_t , class in_vector_2_t , class T > T dotc ( in_vector_1_t v1 , in_vector_2_t v2 , T init ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class T > T dotc ( ExecutionPolicy && exec , in_vector_1_t v1 , in_vector_2_t v2 , T init );

Effects: The threeargument overload is equivalent to
. The fourargument overload is equivalent todot ( conjugated ( v1 ), v2 , init );
.dot ( exec , conjugated ( v1 ), v2 , init );
Conjugated dot product with default result type
template < class in_vector_1_t , class in_vector_2_t > auto dotc ( in_vector_1_t v1 , in_vector_2_t v2 ) > /* seebelow */ ; template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t > auto dotc ( ExecutionPolicy && exec , in_vector_1_t v1 , in_vector_2_t v2 ) > /* seebelow */ ;

Effects: Let
beT decltype ( conj  if  needed
. Then, the twoparameter overload is equivalent to( v1 [ 0 ]) * v2 [ 0 ])
, and the threeparameter overload is equivalent todotc ( v1 , v2 , T {});
.dotc ( exec , v1 , v2 , T {});
16.9.2.7. Scaled sum of squares of a vector’s elements [linalg.algs.blas1.ssq]
template < class T > struct sum_of_squares_result { T scaling_factor ; T scaled_sum_of_squares ; }; template < class in_vector_t , class T > sum_of_squares_result < T > vector_sum_of_squares ( in_vector_t v , sum_of_squares_result < T > init ); template < class ExecutionPolicy , class in_vector_t , class T > sum_of_squares_result < T > vector_sum_of_squares ( ExecutionPolicy && exec , in_vector_t v , sum_of_squares_result < T > init );
[Note: These functions correspond to the LAPACK function
.
end note]

Preconditions:

shall be Cpp17MoveConstructible and Cpp17LessThanComparable, andT 
shall be convertible toabs ( x [ 0 ])
.T


Constraints: For all
in the domain ofi
, and forv
andf
of typessq
, the expressionT
is well formed.ssq = ssq + ( abs ( x [ i ]) / f ) * ( abs ( x [ i ]) / f ) 
Effects: Returns two values:

: the maximum ofscaling_factor
andinit . scaling_factor
for allabs ( x [ i ])
in the domain ofi
; andv 
: a value such thatscaled_sum_of_squares
equals the sum of squares ofscaling_factor * scaling_factor * scaled_sum_of_squares
plusabs ( x [ i ])
.init . scaling_factor * init . scaling_factor * init . scaled_sum_of_squares


Remarks: If
is either a floatingpoint type orin_vector_t :: element_type
for some floatingpoint typecomplex < R >
, and ifR
is a floatingpoint type, thenT 
if
has higher precision thanT
, then intermediate terms in the sum usein_vector_type :: element_type
's precision or greater; andT 
any guarantees regarding overflow and underflow of
are implementationdefined.vector_sum_of_squares

16.9.2.8. Euclidean norm of a vector [linalg.algs.blas1.nrm2]
16.9.2.8.1. Euclidean norm with specified result type
template < class in_vector_t , class T > T vector_norm2 ( in_vector_t v , T init ); template < class ExecutionPolicy , class in_vector_t , class T > T vector_norm2 ( ExecutionPolicy && exec , in_vector_t v , T init );
[Note: These functions correspond to the BLAS function
.
end note]
For
equal to
,
the mathematical expression for the algorithm is
,
where
is the sum of
for all
in the domain of
. [Note: This does not imply a recommended implementation for floatingpoint types.
See Remarks below.
end note]

Preconditions:
shall be convertible toinit + abs ( declval < in_vector_t :: value_type > ()) * abs ( declval < in_vector_t :: value_type > ())
.T 
Effects: Returns the square root of the sum of squares of
and the absolute values of the elements ofinit
. [Note: ForA
equal to zero, this is the Euclidean norm (also called 2norm) of the vectorinit
.v This description does not imply a recommended implementation for floatingpoint types. See Remarks below. end note]

Remarks: If
is a floatingpoint type or a complex version thereof, and ifin_vector_t :: element_type
is a floatingpoint type, thenT 
if
has higher precision thanT
, then intermediate terms in the sum usein_vector_type :: element_type
's precision or greater; andT 
any guarantees regarding overflow and underflow of
are implementationdefined.vector_norm2

[Note: The
function is invoked via unqualified lookup.
A suggested implementation of this function for floatingpoint types
would use the
result
from
.
end note]
16.9.2.8.2. Euclidean norm with default result type
template < class in_vector_t > auto vector_norm2 ( in_vector_t v ) > /* seebelow */ ; template < class ExecutionPolicy , class in_vector_t > auto vector_norm2 ( ExecutionPolicy && exec , in_vector_t v ) > /* seebelow */ ;

Effects: Let
beT
. Then, the oneparameter overload is equivalent todecltype ( abs ( v [ 0 ]) * abs ( v [ 0 ]))
, and the twoparameter overload is equivalent tovector_norm2 ( v , T {});
.vector_norm2 ( exec , v , T {});
16.9.2.9. Sum of absolute values of vector elements [linalg.algs.blas1.asum]
16.9.2.9.1. Sum of absolute values with specified result type
template < class in_vector_t , class T > T vector_abs_sum ( in_vector_t v , T init ); template < class ExecutionPolicy , class in_vector_t , class T > T vector_abs_sum ( ExecutionPolicy && exec , in_vector_t v , T init );
[Note: This function corresponds to the BLAS functions
,
,
, and
. The different behavior for complex
element types is based on the observation that this lowercost
approximation of the onenorm serves just as well as the actual
onenorm for many linear algebra algorithms in practice. end note]
For
equal to
,
the mathematical expression for the algorithm is
plus the sum of
for all
in the domain of
.

Preconditions: For
in the domain ofi
,v
shall be convertible toinit + abs ( v [ i ])
.T 
Effects: Let
beN
:v . extent ( 0 ) 
If
is zero, returnsN
;init 
else, if
isin_vector_t :: element_type
for somecomplex < R >
, then returns GENERALIZED_SUM(R
,plus <> ()
,init
, ...,abs ( real ( v [ 0 ])) + abs ( imag ( v [ 0 ]))
).abs ( real ( v [ N 1 ])) + abs ( imag ( v [ N 1 ])) 
else, returns GENERALIZED_SUM(
,plus <> ()
,init
, ...,abs ( v [ 0 ])
).abs ( v [ N 1 ])


Remarks: If
is a floatingpoint type or a complex version thereof, ifin_vector_t :: element_type
is a floatingpoint type, and ifT
has higher precision thanT
, then intermediate terms in the sum usein_vector_type :: element_type
's precision or greater.T
[Note: The
,
, and
functions are invoked via unqualified lookup.
end note]
16.9.2.9.2. Sum of absolute values with default result type
template < class in_vector_t > auto vector_abs_sum ( in_vector_t v ) > /* seebelow */ ; template < class ExecutionPolicy , class in_vector_t > auto vector_abs_sum ( ExecutionPolicy && exec , in_vector_t v ) > /* seebelow */ ;

Effects: Let
beT
. Then, the oneparameter overload is equivalent todecltype ( abs ( v [ 0 ]))
, and the twoparameter overload is equivalent tovector_abs_sum ( v , T {});
.vector_abs_sum ( exec , v , T {});
16.9.2.10. Index of maximum absolute value of vector elements [linalg.algs.blas1.iamax]
template < class in_vector_t > typename in_vector_t :: size_type idx_abs_max ( in_vector_t v ); template < class ExecutionPolicy , class in_vector_t > typename in_vector_t :: size_type idx_abs_max ( ExecutionPolicy && exec , in_vector_t v );
[Note: These functions correspond to the BLAS function
.
end note]

Preconditions: For
in the domain ofi
, letv
beabs_value_type
; then,decltype ( v [ i ])
shall be Cpp17LessThanComparable.abs_value_type 
Effects: Returns the index (in the domain of
) of the first element ofv
having largest absolute value. Ifv
has zero elements, then returnsv
.numeric_limits < typename in_vector_t :: size_type >:: max ()
[Note: The
function is invoked via unqualified lookup.
end note]
16.9.2.11. Frobenius norm of a matrix [linalg.algs.blas1.matfrobnorm]
16.9.2.11.1. Frobenius norm with specified result type
template < class in_matrix_t , class T > T matrix_frob_norm ( in_matrix_t A , T init ); template < class ExecutionPolicy , class in_matrix_t , class T > T matrix_frob_norm ( ExecutionPolicy && exec , in_matrix_t A , T init );

Preconditions:
shall be convertible toinit + abs ( declval < in_matrix_t :: value_type > ()) * abs ( declval < in_matrix_t :: value_type > ())
.T 
Effects: Returns the square root of the sum of squares of
and the absolute values of the elements ofinit
. [Note: ForA
equal to zero, this is the Frobenius norm of the matrixinit
.A This description does not imply a recommended implementation for floatingpoint types. See Remarks below. end note]

Remarks: If
is a floatingpoint type orin_matrix_t :: element_type
for some floatingpoint typecomplex < R >
, and ifR
is a floatingpoint type, thenT 
if
has higher precision thanT
, then intermediate terms in the sum usein_matrix_type :: element_type
's precision or greater; andT 
any guarantees regarding overflow and underflow of
are implementationdefined.matrix_frob_norm

[Note: The
function is invoked via unqualified lookup.
end note]
16.9.2.11.2. Frobenius norm with default result type
template < class in_matrix_t > auto matrix_frob_norm ( in_matrix_t A ) > /* seebelow */ ; template < class ExecutionPolicy , class in_matrix_t > auto matrix_frob_norm ( ExecutionPolicy && exec , in_matrix_t A ) > /* seebelow */ ;

Effects: Let
beT
. Then, the oneparameter overload is equivalent toabs ( declval < in_matrix_t :: value_type > ()) * abs ( declval < in_matrix_t :: value_type > ())
, and the twoparameter overload is equivalent tomatrix_frob_norm ( A , T {});
.matrix_frob_norm ( exec , A , T {});
16.9.2.12. One norm of a matrix [linalg.algs.blas1.matonenorm]
16.9.2.12.1. One norm with specified result type
template < class in_matrix_t , class T > T matrix_one_norm ( in_matrix_t A , T init ); template < class ExecutionPolicy , class in_matrix_t , class T > T matrix_one_norm ( ExecutionPolicy && exec , in_matrix_t A , T init );

Preconditions:
shall be convertible toabs ( A [ 0 , 0 ])
.T 
Effects:

If
is zero, returnsA . extent ( 1 )
;init 
else, returns the sum of
and the one norm of the matrixinit
. The one norm of the matrixA
is the maximum over all columns ofA
, of the sum of the absolute values of the elements of the column.A


Remarks: If
is a floatingpoint type orin_matrix_t :: element_type
for some floatingpoint typecomplex < R >
, ifR
is a floatingpoint type, and ifT
has higher precision thanT
, then intermediate terms in each sum usein_matrix_type :: element_type
's precision or greater.T
16.9.2.12.2. One norm with default result type
template < class in_matrix_t > auto matrix_one_norm ( in_matrix_t A ) > /* seebelow */ ; template < class ExecutionPolicy , class in_matrix_t > auto matrix_one_norm ( ExecutionPolicy && exec , in_matrix_t A ) > /* seebelow */ ;

Effects: Let
beT
. Then, the oneparameter overload is equivalent todecltype ( abs ( A [ 0 , 0 ])
, and the twoparameter overload is equivalent tomatrix_one_norm ( A , T {});
.matrix_one_norm ( exec , A , T {});
[Note: The
function is invoked via unqualified lookup.
end note]
16.9.2.13. Infinity norm of a matrix [linalg.algs.blas1.matinfnorm]
16.9.2.13.1. Infinity norm with specified result type
template < class in_matrix_t , class T > T matrix_inf_norm ( in_matrix_t A , T init ); template < class ExecutionPolicy , class in_matrix_t , class T > T matrix_inf_norm ( ExecutionPolicy && exec , in_matrix_t A , T init );

Preconditions:
shall be convertible toabs ( A [ 0 , 0 ])
.T 
Effects:

If
is zero, returnsA . extent ( 0 )
;init 
else, returns the sum of
and the infinity norm of the matrixinit
. The infinity norm of the matrixA
is the maximum over all rows ofA
, of the sum of the absolute values of the elements of the row.A


Remarks: If
is a floatingpoint type orin_matrix_t :: element_type
for some floatingpoint typecomplex < R >
, ifR
is a floatingpoint type, and ifT
has higher precision thanT
, then intermediate terms in each sum usein_matrix_type :: element_type
's precision or greater.T
[Note: The
function is invoked via unqualified lookup.
end note]
16.9.2.13.2. Infinity norm with default result type
template < class in_matrix_t > auto matrix_inf_norm ( in_matrix_t A ) > /* seebelow */ ; template < class ExecutionPolicy , class in_matrix_t > auto matrix_inf_norm ( ExecutionPolicy && exec , in_matrix_t A ) > /* seebelow */ ;

Effects: Let
beT
. Then, the oneparameter overload is equivalent todecltype ( abs ( A [ 0 , 0 ])
, and the twoparameter overload is equivalent tomatrix_inf_norm ( A , T {});
.matrix_inf_norm ( exec , A , T {});
16.9.3. BLAS 2 functions [linalg.algs.blas2]
16.9.3.1. General matrixvector product [linalg.algs.blas2.gemv]
[Note: These functions correspond to the BLAS function
. end note]
The following requirements apply to all functions in this section.

Preconditions:

equalsA . extent ( 1 )
,x . extent ( 0 ) 
equalsA . extent ( 0 )
, andy . extent ( 0 ) 
equalsy . extent ( 0 )
(if applicable).z . extent ( 0 )


Constraints: For all functions in this section:

has unique layout; andin_matrix_t 
equals 2,A . rank ()
equals 1,x . rank ()
equals 1, andy . rank ()
equals 1 (if applicable).z . rank ()


Mandates:

If neither
norA . static_extent ( 1 )
equalsx . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.x . static_extent ( 0 ) 
If neither
norA . static_extent ( 0 )
equalsy . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.y . static_extent ( 0 ) 
If neither
nory . static_extent ( 0 )
equalsz . static_extent ( 0 )
, thendynamic_extent
equalsy . static_extent ( 0 )
(if applicable).z . static_extent ( 0 )


Complexity: All algorithms in this Clause with
parameters perform a count ofmdspan
array accesses and arithmetic operations that is linear inmdspan
timesx . extent ( 0 )
.A . extent ( 1 )
16.9.3.1.1. Overwriting matrixvector product
template < class in_vector_t , class in_matrix_t , class out_vector_t > void matrix_vector_product ( in_matrix_t A , in_vector_t x , out_vector_t y ); template < class ExecutionPolicy , class in_vector_t , class in_matrix_t , class out_vector_t > void matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , in_vector_t x , out_vector_t y );
For
in the domain of
and
equal to
,
the mathematical expression for the algorithm is
the sum of
for all
such that
is in the domain of
.

Effects: Assigns to the elements of
the product of the matrixy
with the vectorA
.x
[Example:
end example]constexpr size_t num_rows = 5 ; constexpr size_t num_cols = 6 ; // y = 3.0 * A * x void scaled_matvec_1 ( mdspan < double , extents < size_t , num_rows , num_cols >> A , mdspan < double , extents < size_t , num_cols >> x , mdspan < double , extents < size_t , num_rows >> y ) { matrix_vector_product ( scaled ( 3.0 , A ), x , y ); } // y = 3.0 * A * x + 2.0 * y void scaled_matvec_2 ( mdspan < double , extents < size_t , num_rows , num_cols >> A , mdspan < double , extents < size_t , num_cols >> x , mdspan < double , extents < size_t , num_rows >> y ) { matrix_vector_product ( scaled ( 3.0 , A ), x , scaled ( 2.0 , y ), y ); } // z = 7.0 times the transpose of A, times y void scaled_matvec_2 ( mdspan < double , extents < size_t , num_rows , num_cols >> A , mdspan < double , extents < size_t , num_rows >> y , mdspan < double , extents < size_t , num_cols >> z ) { matrix_vector_product ( scaled ( 7.0 , transposed ( A )), y , z ); }
16.9.3.1.2. Updating matrixvector product
template < class in_vector_1_t , class in_matrix_t , class in_vector_2_t , class out_vector_t > void matrix_vector_product ( in_matrix_t A , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); template < class ExecutionPolicy , class in_vector_1_t , class in_matrix_t , class in_vector_2_t , class out_vector_t > void matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , in_vector_1_t x , in_vector_2_t y , out_vector_t z );
For
in the domain of
and
equal to
,
the mathematical expression for the algorithm is
for all
such that
is in the domain of
.

Effects: Assigns to the elements of
the elementwise sum ofz
, and the product of the matrixy
with the vectorA
.x
16.9.3.2. Symmetric matrixvector product [linalg.algs.blas2.symv]
[Note: These functions correspond to the BLAS functions
and
. end note]
The following requirements apply to all functions in this section.

Preconditions:

equalsA . extent ( 0 )
,A . extent ( 1 ) 
equalsA . extent ( 1 )
,x . extent ( 0 ) 
equalsA . extent ( 0 )
, andy . extent ( 0 ) 
equalsy . extent ( 0 )
(if applicable).z . extent ( 0 )


Constraints:

either has unique layout, orin_matrix_t
layout.layout_blas_packed 
If
hasin_matrix_t
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle 
equals 2,A . rank ()
equals 1,x . rank ()
equals 1, andy . rank ()
equals 1 (if applicable).z . rank ()


Mandates:

If neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.A . static_extent ( 1 ) 
If neither
norA . static_extent ( 1 )
equalsx . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.x . static_extent ( 0 ) 
If neither
norA . static_extent ( 0 )
equalsy . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.y . static_extent ( 0 ) 
If neither
nory . static_extent ( 0 )
equalsz . static_extent ( 0 )
, thendynamic_extent
equalsy . static_extent ( 0 )
(if applicable).z . static_extent ( 0 )


Complexity: All algorithms in this Clause with
parameters perform a count ofmdspan
array accesses and arithmetic operations that is linear inmdspan
timesx . extent ( 0 )
.A . extent ( 1 ) 
Remarks: The functions will only access the triangle of
specified by theA
argumentTriangle
, and will interpret the valuet
as equal toA [ i , j ]
for indicesA [ j , i ]
in the domain ofi , j
but outside the triangle specified byA
.t
16.9.3.2.1. Overwriting symmetric matrixvector product
template < class in_matrix_t , class Triangle , class in_vector_t , class out_vector_t > void symmetric_matrix_vector_product ( in_matrix_t A , Triangle t , in_vector_t x , out_vector_t y ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class in_vector_t , class out_vector_t > void symmetric_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , in_vector_t x , out_vector_t y );
For
in the domain of
,
the mathematical expression for the algorithm is
the sum of
for all
in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of A
but not in the triangle of
specified by
.

Effects: Assigns to the elements of
the product of the matrixy
with the vectorA
.x
16.9.3.2.2. Updating symmetric matrixvector product
template < class in_matrix_t , class Triangle , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void symmetric_matrix_vector_product ( in_matrix_t A , Triangle t , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void symmetric_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , in_vector_1_t x , in_vector_2_t y , out_vector_t z );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of A
but not in the triangle of
specified by
.

Effects: Assigns to the elements of
the elementwise sum ofz
, with the product of the matrixy
with the vectorA
.x
16.9.3.3. Hermitian matrixvector product [linalg.algs.blas2.hemv]
[Note: These functions correspond to the BLAS functions
and
. end note]
The following requirements apply to all functions in this section.

Preconditions:

equalsA . extent ( 0 )
,A . extent ( 1 ) 
equalsA . extent ( 1 )
,x . extent ( 0 ) 
equalsA . extent ( 0 )
, andy . extent ( 0 ) 
equalsy . extent ( 0 )
(if applicable).z . extent ( 0 )


Constraints:

either has unique layout, orin_matrix_t
layout.layout_blas_packed 
If
hasin_matrix_t
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle 
equals 2,A . rank ()
equals 1,x . rank ()
equals 1, andy . rank ()
equals 1.z . rank ()


Mandates:

If neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.A . static_extent ( 1 ) 
If neither
norA . static_extent ( 1 )
equalsx . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.x . static_extent ( 0 ) 
If neither
norA . static_extent ( 0 )
equalsy . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.y . static_extent ( 0 ) 
If neither
nory . static_extent ( 0 )
equalsz . static_extent ( 0 )
, thendynamic_extent
equalsy . static_extent ( 0 )
(if applicable).z . static_extent ( 0 )


Complexity: All algorithms in this Clause with
parameters perform a count ofmdspan
array accesses and arithmetic operations that is linear inmdspan
timesx . extent ( 0 )
.A . extent ( 1 ) 
Remarks: The functions will only access the triangle of
specified by theA
argumentTriangle
, and will interpret the valuet
as equal toA [ i , j ] conj  if  needed
for indicesA [ j , i ]
in the domain ofi , j
but outside the triangle specified byA
.t
16.9.3.3.1. Overwriting Hermitian matrixvector product
template < class in_matrix_t , class Triangle , class in_vector_t , class out_vector_t > void hermitian_matrix_vector_product ( in_matrix_t A , Triangle t , in_vector_t x , out_vector_t y ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class in_vector_t , class out_vector_t > void hermitian_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , in_vector_t x , out_vector_t y );
For
in the domain of
,
the mathematical expression for the algorithm is
the sum of
for all
in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of A
but not in the triangle of
specified by
.

Effects: Assigns to the elements of
the product of the matrixy
with the vectorA
.x
16.9.3.3.2. Updating Hermitian matrixvector product
template < class in_matrix_t , class Triangle , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void hermitian_matrix_vector_product ( in_matrix_t A , Triangle t , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void hermitian_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , in_vector_1_t x , in_vector_2_t y , out_vector_t z );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of A
but not in the triangle of
specified by
.

Effects: Assigns to the elements of
the elementwise sum ofz
, and the product of the matrixy
with the vectorA
.x
16.9.3.4. Triangular matrixvector product [linalg.algs.blas2.trmv]
[Note: These functions correspond to the BLAS functions
and
. end note]
The following requirements apply to all functions in this section.

Preconditions:

equalsA . extent ( 0 )
,A . extent ( 1 ) 
equalsA . extent ( 0 )
,y . extent ( 0 ) 
equalsA . extent ( 1 )
(if applicable), andx . extent ( 0 ) 
equalsy . extent ( 0 )
(if applicable).z . extent ( 0 )


Constraints:

either has unique layout, orin_matrix_t
layout;layout_blas_packed 
if
hasin_matrix_t
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument;Triangle 
equals 2;A . rank () 
equals 1;y . rank () 
equals 1 (if applicable); andx . rank () 
equals 1 (if applicable).z . rank ()


Mandates:

if neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
;A . static_extent ( 1 ) 
if neither
norA . static_extent ( 0 )
equalsy . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
;y . static_extent ( 0 ) 
if neither
norA . static_extent ( 1 )
equalsx . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
(if applicable); andx . static_extent ( 0 ) 
if neither
nory . static_extent ( 0 )
equalsz . static_extent ( 0 )
, thendynamic_extent
equalsy . static_extent ( 0 )
(if applicable).z . static_extent ( 0 )


Complexity: All algorithms in this Clause with
parameters perform a count ofmdspan
array accesses and arithmetic operations that is linear inmdspan
timesx . extent ( 0 )
.A . extent ( 1 ) 
Remarks:

The functions will only access the triangle of
specified by theA
argumentTriangle
.t 
If the
template argument has typeDiagonalStorage
, then the functions will not access the diagonal ofimplicit_unit_diagonal_t
, and will interpretA
as if each of its diagonal elements behaves as a twosided multiplicative identity.A

16.9.3.4.1. Overwriting triangular matrixvector product [linalg.algs.blas2.trmv.ov]
template < class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t > void triangular_matrix_vector_product ( in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t x , out_vector_t y ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t > void triangular_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t x , out_vector_t y );
For
in the domain of
,
the mathematical expression for the algorithm is
the sum of
for all
in the subset of the domain of
specified by
and
.

Effects: Assigns to the elements of
the product of the matrixy
with the vectorA
.x
16.9.3.4.2. Inplace triangular matrixvector product [linalg.algs.blas2.trmv.inplace]
template < class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t > void triangular_matrix_vector_product ( in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t y ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t > void triangular_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t y );
[Note:
Performing this operation in place hinders parallelization.
However, other
specific optimizations,
such as vectorization, are still possible.
This is why the
overload exists.
end note]
For
in the domain of
,
the mathematical expression for the algorithm is
the sum of
for all
in the subset of the domain of
specified by
and
.

Preconditions:
equalsA . extent ( 1 )
.y . extent ( 0 ) 
Mandates: If neither
norA . static_extent ( 1 )
equalsy . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.y . static_extent ( 0 ) 
Effects: Overwrites
(on output) with the product of the matrixy
with the vectorA
(on input).y
16.9.3.4.3. Updating triangular matrixvector product [linalg.algs.blas2.trmv.up]
template < class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void triangular_matrix_vector_product ( in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_1_t x , in_vector_2_t y , out_vector_t z ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_1_t , class in_vector_2_t , class out_vector_t > void triangular_matrix_vector_product ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_1_t x , in_vector_2_t y , out_vector_t z );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
in the subset of the domain of
specified by
and
.

Effects: Assigns to the elements of
the elementwise sum ofz
, with the product of the matrixy
with the vectorA
.x
16.9.3.5. Solve a triangular linear system [linalg.algs.blas2.trsv]
[Note: These functions correspond to the BLAS functions
and
. end note]
The following requirements apply to all functions in this section.

Preconditions:

equalsA . extent ( 0 )
, andA . extent ( 1 ) 
equalsA . extent ( 1 )
.b . extent ( 0 )


Constraints:

equals 2;A . rank () 
equals 1;b . rank () 
either has unique layout, orin_matrix_t
layout; andlayout_blas_packed 
if
hasin_matrix_t
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle


Mandates:

if neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
; andA . static_extent ( 1 ) 
if neither
norA . static_extent ( 1 )
equalsb . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.b . static_extent ( 0 )


Complexity: All algorithms in this Clause with
parameters perform a count ofmdspan
array accesses and arithmetic operations that is linear inmdspan
timesx . extent ( 0 )
.A . extent ( 1 ) 
Remarks:

The functions will only access the triangle of
specified by theA
argumentTriangle
.t 
If the
template argument has typeDiagonalStorage
, then the functions will not access the diagonal ofimplicit_unit_diagonal_t
, and will interpretA
as if each of its diagonal elements behaves as a twosided multiplicative identity.A

16.9.3.5.1. Notinplace triangular solve [linalg.algs.blas2.trsv.notinplace]
template < class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t , class BinaryDivideOp > void triangular_matrix_vector_solve ( in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t b , out_vector_t x , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t , class BinaryDivideOp > void triangular_matrix_vector_solve ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t b , out_vector_t x , BinaryDivideOp divide ); template < class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t > void triangular_matrix_vector_solve ( in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t b , out_vector_t x ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class in_vector_t , class out_vector_t > void triangular_matrix_vector_solve ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , in_vector_t b , out_vector_t x );
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is

for overloads with ax [ i ] = divide ( b [ i ]  s , A [ i , i ])
parameter, anddivide 
for all other overloads,x [ i ] = ( b [ i ]  s ) / A [ i , i ]
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
and
.
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is
,
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
and
.

Preconditions:
equalsA . extent ( 0 )
.x . extent ( 0 ) 
Constraints:
equals 1 andx . rank ()
equals 1.b . rank () 
Mandates: If neither
norA . static_extent ( 0 )
equalsx . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.x . static_extent ( 0 ) 
Effects: Assigns to the elements of
the result of solving the triangular linear system Ax=b.x
16.9.3.5.2. Inplace triangular solve [linalg.algs.blas2.trsv.inplace]
template < class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t , class BinaryDivideOp > void triangular_matrix_vector_solve ( in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t b , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t , class BinaryDivideOp > void triangular_matrix_vector_solve ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t b , BinaryDivideOp divide ); template < class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t > void triangular_matrix_vector_solve ( in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t b ); template < class ExecutionPolicy , class in_matrix_t , class Triangle , class DiagonalStorage , class inout_vector_t > void triangular_matrix_vector_solve ( ExecutionPolicy && exec , in_matrix_t A , Triangle t , DiagonalStorage d , inout_vector_t b );
[Note:
Performing triangular solve in place hinders parallelization.
However, other
specific optimizations,
such as vectorization, are still possible.
This is why the
overload exists.
end note]
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is

for overloads with ab [ i ] = divide ( b [ i ]  s , A [ i , i ])
parameter, anddivide 
for all other overloads,b [ i ] = ( b [ i ]  s ) / A [ i , i ]
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
and
.
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is
,
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
and
.

Preconditions:
equalsA . extent ( 0 )
.b . extent ( 0 ) 
Mandates: If neither
norA . static_extent ( 0 )
equalsb . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.b . static_extent ( 0 ) 
Effects: Overwrites
with the result of solving the triangular linear system Ax=b for x.b
16.9.3.6. Rank1 (outer product) update of a matrix [linalg.algs.blas2.rank1]
16.9.3.6.1. Nonsymmetric nonconjugated rank1 update [linalg.algs.blas2.rank1.geru]
template < class in_vector_1_t , class in_vector_2_t , class inout_matrix_t > void matrix_rank_1_update ( in_vector_1_t x , in_vector_2_t y , inout_matrix_t A ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class inout_matrix_t > void matrix_rank_1_update ( ExecutionPolicy && exec , in_vector_1_t x , in_vector_2_t y , inout_matrix_t A );
[Note: This function corresponds to the BLAS functions
(for
real element types) and
(for complex element types). end
note]
For
in the domain of
,
the mathematical expression for the algorithm is
.

Preconditions:

equalsA . extent ( 0 )
, andx . extent ( 0 ) 
equalsA . extent ( 1 )
.y . extent ( 0 )


Constraints:
equals 2,A . rank ()
equals 1, andx . rank ()
equals 1.y . rank () 
Mandates:

If neither
norA . static_extent ( 0 )
equalsx . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.x . static_extent ( 0 ) 
If neither
norA . static_extent ( 1 )
equalsy . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.y . static_extent ( 0 )


Effects: Assigns to
on output the sum ofA
on input, and the outer product ofA
andx
.y 
Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesx . extent ( 0 )
.y . extent ( 0 )
[Note: Users can get
behavior by giving the second argument
as the result of
. Alternately, they can use the shortcut
below. end note]
16.9.3.6.2. Nonsymmetric conjugated rank1 update [linalg.algs.blas2.rank1.gerc]
template < class in_vector_1_t , class in_vector_2_t , class inout_matrix_t > void matrix_rank_1_update_c ( in_vector_1_t x , in_vector_2_t y , inout_matrix_t A ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class inout_matrix_t > void matrix_rank_1_update_c ( ExecutionPolicy && exec , in_vector_1_t x , in_vector_2_t y , inout_matrix_t A );
[Note: This function corresponds to the BLAS functions
(for
real element types) and
(for complex element types). end
note]

Effects: Equivalent to
.matrix_rank_1_update ( x , conjugated ( y ), A );
16.9.3.6.3. Rank1 update of a Symmetric matrix [linalg.algs.blas2.rank1.syr]
template < class in_vector_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_1_update ( in_vector_t x , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class in_vector_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_1_update ( ExecutionPolicy && exec , in_vector_t x , inout_matrix_t A , Triangle t ); template < class T , class in_vector_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_1_update ( T alpha , in_vector_t x , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class T , class in_vector_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_1_update ( ExecutionPolicy && exec , T alpha , in_vector_t x , inout_matrix_t A , Triangle t );
[Note:
These functions correspond to the BLAS functions
and
.
They have overloads taking a scaling factor
, because it would be
impossible to express updates like C = C  x x^T otherwise.
end note]
For
in the domain of
and in the triangle of
specified by
,
the mathematical expression for the algorithm is
for overloads without an
parameter, and
for overloads with an
parameter.

Preconditions:

equalsA . extent ( 0 )
, andA . extent ( 1 ) 
equalsA . extent ( 0 )
.x . extent ( 0 )


Constraints:

equals 2 andA . rank ()
equals 1.x . rank () 
either has unique layout, orA
layout.layout_blas_packed 
if
hasA
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle


Mandates:

If neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.A . static_extent ( 1 ) 
If neither
norA . static_extent ( 0 )
equalsx . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.x . static_extent ( 0 )


Effects:

Overloads without an
parameter assign toalpha
on output, the elementwise sum ofA
on input, with (the outer product ofA
andx
).x 
Overloads with an
parameter assign toalpha
on output, the elementwise sum ofA
on input, with the product of alpha and (the outer product ofA
andx
).x


Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesx . extent ( 0 )
.x . extent ( 0 ) 
Remarks: The functions will only access the triangle of
specified by theA
argumentTriangle
, and will interpret the valuet
as equal toA [ i , j ]
for indicesA [ j , i ]
in the domaini , j
but outside that triangle.A
16.9.3.6.4. Rank1 update of a Hermitian matrix [linalg.algs.blas2.rank1.her]
template < class in_vector_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_1_update ( in_vector_t x , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class in_vector_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_1_update ( ExecutionPolicy && exec , in_vector_t x , inout_matrix_t A , Triangle t ); template < class T , class in_vector_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_1_update ( T alpha , in_vector_t x , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class T , class in_vector_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_1_update ( ExecutionPolicy && exec , T alpha , in_vector_t x , inout_matrix_t A , Triangle t );
[Note:
These functions correspond to the BLAS functions
and
.
They have overloads taking a scaling factor
, because it would be
impossible to express the update A = A  x x^H otherwise.
end note]
For
in the domain of
and in the triangle of
specified by
,
the mathematical expression for the algorithm is
for overloads without an
parameter, and
for overloads with an
parameter.

Preconditions:

equalsA . extent ( 0 )
; andA . extent ( 1 ) 
equalsA . extent ( 0 )
.x . extent ( 0 )


Constraints:

equals 2 andA . rank ()
equals 1.x . rank () 
either has unique layout, orA
layout.layout_blas_packed 
If
hasA
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle


Mandates:

If neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.A . static_extent ( 1 ) 
If neither
norA . static_extent ( 0 )
equalsx . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.x . static_extent ( 0 )


Effects:

Overloads without
assign toalpha
on output, the elementwise sum ofA
on input, with (the outer product ofA
and the conjugate ofx
).x 
Overloads with
assign toalpha
on output, the elementwise sum ofA
on input, with alpha times (the outer product ofA
and the conjugate ofx
).x


Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesx . extent ( 0 )
.x . extent ( 0 ) 
Remarks:

The functions will only access the triangle of
specified by theA
argumentTriangle
.t 
For indices
in the domain ofi , j
but outside the triangle specified byA
, the functions will interpret the valuet
as equal toA [ i , j ] conj  if  needed
.A [ j , i ]

16.9.3.7. Rank2 update of a symmetric matrix [linalg.algs.blas2.rank2.syr2]
template < class in_vector_1_t , class in_vector_2_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_2_update ( in_vector_1_t x , in_vector_2_t y , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_2_update ( ExecutionPolicy && exec , in_vector_1_t x , in_vector_2_t y , inout_matrix_t A , Triangle t );
[Note: These functions correspond to the BLAS functions
and
. end note]
For
in the domain of
and in the triangle of
specified by
,
the mathematical expression for the algorithm is
.

Preconditions:

equalsA . extent ( 0 )
,A . extent ( 1 ) 
equalsA . extent ( 0 )
, andx . extent ( 0 ) 
equalsA . extent ( 0 )
.y . extent ( 0 )


Constraints:

equals 2,A . rank ()
equals 1, andx . rank ()
equals 1.y . rank () 
either has unique layout, orA
layout.layout_blas_packed 
If
hasA
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle


Mandates:

If neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.A . static_extent ( 1 ) 
If neither
norA . static_extent ( 0 )
equalsx . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.x . static_extent ( 0 ) 
If neither
norA . static_extent ( 0 )
equalsy . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.y . static_extent ( 0 )


Effects: Assigns to
on output the sum ofA
on input, the outer product ofA
andx
, and the outer product ofy
andy
.x 
Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesx . extent ( 0 )
.y . extent ( 0 ) 
Remarks: The functions will only access the triangle of
specified by theA
argumentTriangle
, and will interpret the valuet
as equal toA [ i , j ]
for indicesA [ j , i ]
in the domaini , j
but outside that triangle.A
16.9.3.8. Rank2 update of a Hermitian matrix [linalg.algs.blas2.rank2.her2]
template < class in_vector_1_t , class in_vector_2_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_2_update ( in_vector_1_t x , in_vector_2_t y , inout_matrix_t A , Triangle t ); template < class ExecutionPolicy , class in_vector_1_t , class in_vector_2_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_2_update ( ExecutionPolicy && exec , in_vector_1_t x , in_vector_2_t y , inout_matrix_t A , Triangle t );
[Note: These functions correspond to the BLAS functions
and
. end note]
For
in the domain of
and in the triangle of
specified by
,
the mathematical expression for the algorithm is
.

Preconditions:

equalsA . extent ( 0 )
,A . extent ( 1 ) 
equalsA . extent ( 0 )
, andx . extent ( 0 ) 
equalsA . extent ( 0 )
.y . extent ( 0 )


Constraints:

equals 2,A . rank ()
equals 1, andx . rank ()
equals 1.y . rank () 
either has unique layout, orA
layout.layout_blas_packed 
If
hasA
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle


Mandates:

If neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.A . static_extent ( 1 ) 
If neither
norA . static_extent ( 0 )
equalsx . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.x . static_extent ( 0 ) 
If neither
norA . static_extent ( 0 )
equalsy . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.y . static_extent ( 0 )


Effects: Assigns to
on output the sum ofA
on input, the outer product ofA
and the conjugate ofx
, and the outer product ofy
and the conjugate ofy
.x 
Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesx . extent ( 0 )
.y . extent ( 0 ) 
Remarks: The functions will only access the triangle of
specified by theA
argumentTriangle
, and will interpret the valuet
as equal toA [ i , j ] conj  if  needed
for indicesA [ j , i ]
in the domain ofi , j
but outside the triangle specified byA
.t
16.9.4. BLAS 3 functions [linalg.algs.blas3]
16.9.4.1. General matrixmatrix product [linalg.algs.blas3.gemm]
[Note: These functions correspond to the BLAS function
.
end note]
The following requirements apply to all functions in this section.

Preconditions:

equalsC . extent ( 0 )
(if applicable),E . extent ( 0 ) 
equalsC . extent ( 1 )
(if applicable),E . extent ( 1 ) 
equalsA . extent ( 1 )
,B . extent ( 0 ) 
equalsA . extent ( 0 )
, andC . extent ( 0 ) 
equalsB . extent ( 1 )
.C . extent ( 1 )


Constraints:

,in_matrix_1_t
,in_matrix_2_t
(if applicable), andin_matrix_3_t
have unique layout.out_matrix_t 
equals 2,A . rank ()
equals 2,B . rank ()
equals 2, andC . rank ()
(if applicable) equals 2.E . rank ()


Mandates:

For all
in 0, 1, ...,r
 1, if neitherC . rank ()
norC . static_extent ( r )
equalsE . static_extent ( r )
, thendynamic_extent
equalsC . static_extent ( r )
(if applicable).E . static_extent ( r ) 
If neither
norA . static_extent ( 1 )
equalsB . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.B . static_extent ( 0 ) 
If neither
norA . static_extent ( 0 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.C . static_extent ( 0 ) 
If neither
norB . static_extent ( 1 )
equalsC . static_extent ( 1 )
, thendynamic_extent
equalsB . static_extent ( 1 )
.C . static_extent ( 1 )


Complexity: For all algorithms in this Clause, the count of
array accesses and arithmetic operations is linear inmdspan
timesA . extent ( 0 )
timesB . extent ( 1 )
.A . extent ( 1 )
16.9.4.1.1. Overwriting general matrixmatrix product
template < class in_matrix_1_t , class in_matrix_2_t , class out_matrix_t > void matrix_product ( in_matrix_1_t A , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class in_matrix_2_t , class out_matrix_t > void matrix_product ( ExecutionPolicy && exec , in_matrix_1_t A , in_matrix_2_t B , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
the sum of
for all
such that
is in the domain of
.

Effects: Assigns to the elements of the matrix
the product of the matricesC
andA
.B
16.9.4.1.2. Updating general matrixmatrix product
template < class in_matrix_1_t , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void matrix_product ( in_matrix_1_t A , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void matrix_product ( ExecutionPolicy && exec , in_matrix_1_t A , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
such that
is in the domain of
.

Effects: Assigns to the elements of the matrix
on output, the elementwise sum ofC
and the product of the matricesE
andA
.B 
Remarks:
andC
may refer to the same matrix. If so, then they must have the same layout.E
16.9.4.2. Symmetric matrixmatrix product [linalg.algs.blas3.symm]
[Note:
These functions correspond to the BLAS function
.
Unlike the symmetric rank1 or rankk update functions, these functions assume
that the input matrix
 not the output matrix  is symmetric.
end note]
The following requirements apply to all functions in this section.

Preconditions:

equalsA . extent ( 0 )
,A . extent ( 1 ) 
equalsC . extent ( 0 )
(if applicable), andE . extent ( 0 ) 
equalsC . extent ( 1 )
(if applicable).E . extent ( 1 )


Constraints:

either has unique layout, orin_matrix_1_t
layout.layout_blas_packed 
,in_matrix_2_t
(if applicable), andin_matrix_3_t
have unique layout.out_matrix_t 
If
hasin_matrix_1_t
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle 
equals 2,A . rank ()
equals 2,B . rank ()
equals 2, andC . rank ()
(if applicable) equals 2.E . rank ()


Mandates:

If neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.A . static_extent ( 1 ) 
For all
in 0, 1, ...,r
 1, if neitherC . rank ()
norC . static_extent ( r )
equalsE . static_extent ( r )
, thendynamic_extent
equalsC . static_extent ( r )
(if applicable).E . static_extent ( r )


Remarks:

Remarks: The functions will only access the triangle of
specified by theA
argumentTriangle
, and will interpret the valuet
as equal toA [ i , j ]
for indicesA [ j , i ]
in the domaini , j
but outside that triangle.A 
Remarks:
andC
(if applicable) may refer to the same matrix. If so, then they must have the same layout.E

The following requirements apply to all overloads of
.

Preconditions:

equalsA . extent ( 1 )
,B . extent ( 0 ) 
equalsA . extent ( 0 )
, andC . extent ( 0 ) 
equalsB . extent ( 1 )
.C . extent ( 1 )


Mandates:

if neither
norA . static_extent ( 1 )
equalsB . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
;B . static_extent ( 0 ) 
if neither
norA . static_extent ( 0 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
; andC . static_extent ( 0 ) 
if neither
norB . static_extent ( 1 )
equalsC . static_extent ( 1 )
, thendynamic_extent
equalsB . static_extent ( 1 )
.C . static_extent ( 1 )


Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesA . extent ( 0 )
timesB . extent ( 1 )
.A . extent ( 1 )
The following requirements apply to all overloads of
.

Preconditions:

equalsB . extent ( 1 )
,A . extent ( 0 ) 
equalsB . extent ( 0 )
, andC . extent ( 0 ) 
equalsA . extent ( 1 )
.C . extent ( 1 )


Mandates:

if neither
norB . static_extent ( 1 )
equalsA . static_extent ( 0 )
, thendynamic_extent
equalsB . static_extent ( 1 )
;A . static_extent ( 0 ) 
if neither
norB . static_extent ( 0 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsB . static_extent ( 0 )
; andC . static_extent ( 0 ) 
if neither
norA . static_extent ( 1 )
equalsC . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.C . static_extent ( 1 )


Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesB . extent ( 0 )
timesA . extent ( 1 )
.B . extent ( 1 )
16.9.4.2.1. Overwriting symmetric matrixmatrix left product [linalg.algs.blas3.symm.ov.left]
template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void symmetric_matrix_left_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void symmetric_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
.

Effects: Assigns to the elements of the matrix
the product of the matricesC
andA
.B
16.9.4.2.2. Overwriting symmetric matrixmatrix right product [linalg.algs.blas3.symm.ov.right]
template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void symmetric_matrix_right_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void symmetric_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
.

Effects: Assigns to the elements of the matrix
the product of the matricesC
andB
.A
16.9.4.2.3. Updating symmetric matrixmatrix left product [linalg.algs.blas3.symm.up.left]
template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void symmetric_matrix_left_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void symmetric_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
.

Effects: assigns to the elements of the matrix
on output, the elementwise sum ofC
and the product of the matricesE
andA
.B
16.9.4.2.4. Updating symmetric matrixmatrix right product [linalg.algs.blas3.symm.up.right]
template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void symmetric_matrix_right_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void symmetric_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
.

Effects: assigns to the elements of the matrix
on output, the elementwise sum ofC
and the product of the matricesE
andB
.A
16.9.4.3. Hermitian matrixmatrix product [linalg.algs.blas3.hemm]
[Note:
These functions correspond to the BLAS function
.
Unlike the Hermitian rank1 or rankk update functions, these functions assume that the input matrix  not the output matrix  is Hermitian.
end note]
The following requirements apply to all functions in this section.

Preconditions:

equalsA . extent ( 0 )
,A . extent ( 1 ) 
equalsC . extent ( 0 )
(if applicable), andE . extent ( 0 ) 
equalsC . extent ( 1 )
(if applicable).E . extent ( 1 )


Constraints:

either has unique layout, orin_matrix_1_t
layout.layout_blas_packed 
,in_matrix_2_t
(if applicable), andin_matrix_3_t
have unique layout.out_matrix_t 
If
hasin_matrix_1_t
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle 
equals 2,A . rank ()
equals 2,B . rank ()
equals 2, andC . rank ()
(if applicable) equals 2.E . rank ()


Mandates:

If neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.A . static_extent ( 1 ) 
For all
in 0, 1, ...,r
 1, if neitherC . rank ()
norC . static_extent ( r )
equalsE . static_extent ( r )
, thendynamic_extent
equalsC . static_extent ( r )
(if applicable).E . static_extent ( r )


Remarks:

The functions will only access the triangle of
specified by theA
argumentTriangle
, and will interpret the valuet
as equal toA [ i , j ] conj  if  needed
for indicesA [ j , i ]
in the domain ofi , j
but outside the triangle specified byA
.t 
andC
(if applicable) may refer to the same matrix. If so, then they must have the same layout.E

The following requirements apply to all overloads of
.

Preconditions:

equalsA . extent ( 1 )
,B . extent ( 0 ) 
equalsA . extent ( 0 )
, andC . extent ( 0 ) 
equalsB . extent ( 1 )
.C . extent ( 1 )


Mandates:

If neither
norA . static_extent ( 1 )
equalsB . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
;B . static_extent ( 0 ) 
if neither
norA . static_extent ( 0 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
; andC . static_extent ( 0 ) 
if neither
norB . static_extent ( 1 )
equalsC . static_extent ( 1 )
, thendynamic_extent
equalsB . static_extent ( 1 )
.C . static_extent ( 1 )


Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesA . extent ( 0 )
timesB . extent ( 1 )
.A . extent ( 1 )
The following requirements apply to all overloads of
.

Preconditions:

equalsB . extent ( 1 )
,A . extent ( 0 ) 
equalsB . extent ( 0 )
, andC . extent ( 0 ) 
equalsA . extent ( 1 )
.C . extent ( 1 )


Mandates:

If neither
norB . static_extent ( 1 )
equalsA . static_extent ( 0 )
, thendynamic_extent
equalsB . static_extent ( 1 )
;A . static_extent ( 0 ) 
if neither
norB . static_extent ( 0 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsB . static_extent ( 0 )
; andC . static_extent ( 0 ) 
if neither
norA . static_extent ( 1 )
equalsC . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.C . static_extent ( 1 )


Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesB . extent ( 0 )
timesA . extent ( 1 )
.B . extent ( 1 )
16.9.4.3.1. Overwriting Hermitian matrixmatrix left product [linalg.algs.blas3.hemm.ov.left]
template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void hermitian_matrix_left_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void hermitian_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
.

Effects: Assigns to the elements of the matrix
the product of the matricesC
andA
.B
16.9.4.3.2. Overwriting Hermitian matrixmatrix right product [linalg.algs.blas3.hemm.ov.right]
template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void hermitian_matrix_right_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class out_matrix_t > void hermitian_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
.

Effects: Assigns to the elements of the matrix
the product of the matricesC
andB
.A
16.9.4.3.3. Updating Hermitian matrixmatrix left product [linalg.algs.blas3.hemm.up.left]
template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void hermitian_matrix_left_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void hermitian_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
.

Effects: Assigns to the elements of the matrix
on output, the elementwise sum ofC
and the product of the matricesE
andA
.B
16.9.4.3.4. Updating Hermitian matrixmatrix right product [linalg.algs.blas3.hemm.up.right]
template < class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void hermitian_matrix_right_product ( in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void hermitian_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
,
plus the sum of
for all
not equal to
such that
is in the domain of
and in the triangle of
specified by
.

Effects: Assigns to the elements of the matrix
on output, the elementwise sum ofC
and the product of the matricesE
andB
.A
16.9.4.4. Triangular matrixmatrix product [linalg.algs.blas3.trmm]
[Note: These functions correspond to the BLAS function
.
end note]
The following requirements apply to all functions in this section.

Preconditions:

equalsA . extent ( 0 )
,A . extent ( 1 ) 
equalsC . extent ( 0 )
(if applicable), andE . extent ( 0 ) 
equalsC . extent ( 1 )
(if applicable).E . extent ( 1 )


Constraints:

either has unique layout, orin_matrix_1_t
layout.layout_blas_packed 
,in_matrix_2_t
(if applicable),in_matrix_3_t
, andout_matrix_t
(if applicable) have unique layout.inout_matrix_t 
If
hasin_matrix_1_t
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle 
equals 2,A . rank ()
equals 2,B . rank ()
equals 2, andC . rank ()
(if applicable) equals 2.E . rank ()


Mandates:

If neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.A . static_extent ( 1 ) 
For all
in 0, 1, ...,r
 1, if neitherC . rank ()
norC . static_extent ( r )
equalsE . static_extent ( r )
, thendynamic_extent
equalsC . static_extent ( r )
(if applicable).E . static_extent ( r )


Remarks:

The functions will only access the triangle of
specified by theA
argumentTriangle
.t 
If the
template argument has typeDiagonalStorage
, then the functions will not access the diagonal ofimplicit_unit_diagonal_t
, and will interpretA
as if each of its diagonal elements behaves as a twosided multiplicative identity.A 
andC
(if applicable) may refer to the same matrix. If so, then they must have the same layout.E

The following requirements apply to all overloads of
.

Preconditions:

equalsA . extent ( 1 )
(if applicable),B . extent ( 0 ) 
equalsA . extent ( 0 )
, andC . extent ( 0 ) 
equalsB . extent ( 1 )
(if applicable).C . extent ( 1 )


Mandates:

if neither
norA . static_extent ( 1 )
equalsB . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
(if applicable);B . static_extent ( 0 ) 
if neither
norA . static_extent ( 0 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
; andC . static_extent ( 0 ) 
if neither
norB . static_extent ( 1 )
equalsC . static_extent ( 1 )
, thendynamic_extent
equalsB . static_extent ( 1 )
(if applicable).C . static_extent ( 1 )


Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesA . extent ( 0 )
timesB . extent ( 1 )
.A . extent ( 1 )
The following requirements apply to all overloads of
.

Preconditions:

equalsB . extent ( 1 )
(if applicable),A . extent ( 0 ) 
equalsB . extent ( 0 )
(if applicable), andC . extent ( 0 ) 
equalsA . extent ( 1 )
.C . extent ( 1 )


Mandates:

if neither
norB . static_extent ( 1 )
equalsA . static_extent ( 0 )
, thendynamic_extent
equalsB . static_extent ( 1 )
(if applicable);A . static_extent ( 0 ) 
if neither
norB . static_extent ( 0 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsB . static_extent ( 0 )
(if applicable); andC . static_extent ( 0 ) 
if neither
norA . static_extent ( 1 )
equalsC . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.C . static_extent ( 1 )


Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesB . extent ( 0 )
timesA . extent ( 1 )
.B . extent ( 1 )
16.9.4.4.1. Overwriting triangular matrixmatrix left product [linalg.algs.blas3.trmm.ov.left]
Notinplace overwriting triangular matrixmatrix left product
template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_left_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
the sum of
for all
such that
is in the domain of
and in the triangle of
specified by
and
.

Effects: Assigns to the elements of the matrix
the product of the matricesC
andA
.B
Inplace overwriting triangular matrixmatrix left product
template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_left_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
the sum of
for all
such that
is in the domain of
and in the triangle of
specified by
and
.

Preconditions:
equalsA . extent ( 1 )
.C . extent ( 0 ) 
Mandates: If neither
norA . static_extent ( 1 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.C . static_extent ( 0 ) 
Effects: Overwrites
on output with the product of the matricesC
andA
(on input).C
16.9.4.4.2. Overwriting triangular matrixmatrix right product [linalg.algs.blas3.trmm.ov.right]
Notinplace overwriting triangular matrixmatrix right product
template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_right_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
the sum of
for all
such that
is in the domain of
and in the triangle of
specified by
and
.

Effects: Assigns to the elements of the matrix
the product of the matricesC
andB
.A
Inplace overwriting triangular matrixmatrix right product
template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_right_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
the sum of
for all
such that
is in the domain of
and in the triangle of
specified by
and
.

Preconditions:
equalsC . extent ( 1 )
.A . extent ( 0 ) 
Mandates: If neither
norC . static_extent ( 1 )
equalsA . static_extent ( 0 )
, thendynamic_extent
equalsC . static_extent ( 1 )
.A . static_extent ( 0 ) 
Effects: Overwrites
on output with the product of the matricesC
(on input) andC
.A
16.9.4.4.3. Updating triangular matrixmatrix left product [linalg.algs.blas3.trmm.up.left]
template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void triangular_matrix_left_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void triangular_matrix_left_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
such that
is in the domain of
and in the triangle of
specified by
and
.

Effects: Assigns to the elements of the matrix
on output, the elementwise sum ofC
and the product of the matricesE
andA
.B
16.9.4.4.4. Updating triangular matrixmatrix right product [linalg.algs.blas3.trmm.up.right]
template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void triangular_matrix_right_product ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class in_matrix_3_t , class out_matrix_t > void triangular_matrix_right_product ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , in_matrix_3_t E , out_matrix_t C );
For
in the domain of
,
the mathematical expression for the algorithm is
plus the sum of
for all
such that
is in the domain of
and in the triangle of
specified by
and
.

Effects: Assigns to the elements of the matrix
on output, the elementwise sum ofC
and the product of the matricesE
andB
.A
16.9.4.5. Rankk update of a symmetric or Hermitian matrix [linalg.alg.blas3.rankk]
[Note: Users can achieve the effect of the
argument of these
BLAS functions, by applying
or
to the input matrix. end note]

Complexity: For all algorithms in this Clause, the count of
array accesses and arithmetic operations is linear inmdspan
timesA . extent ( 0 )
timesA . extent ( 1 )
.A . extent ( 0 )
16.9.4.5.1. Rankk symmetric matrix update [linalg.alg.blas3.rankk.syrk]
template < class T , class in_matrix_1_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_k_update ( T alpha , in_matrix_1_t A , inout_matrix_t C , Triangle t ); template < class ExecutionPolicy , class T , class in_matrix_1_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_k_update ( ExecutionPolicy && exec , T alpha , in_matrix_1_t A , inout_matrix_t C , Triangle t );
[Note:
These functions correspond to the BLAS function
.
They take a scaling factor
, because it would be
impossible to express the update C = C  A A^T otherwise.
The scaling factor parameter is required
in order to avoid ambiguity with
.
end note]
For
in the domain of
and in the triangle of
specified by
,
the mathematical expression for the algorithm is
plus the sum of
for all
such that
is in the domain of
.

Preconditions:

equalsA . extent ( 0 )
, andC . extent ( 0 ) 
equalsC . extent ( 0 )
.C . extent ( 1 )


Constraints:

equals 2 andA . rank ()
equals 2.C . rank () 
either has unique layout, orC
layout.layout_blas_packed 
If
hasC
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle


Mandates:

If neither
norA . static_extent ( 0 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.C . static_extent ( 0 ) 
If neither
norC . static_extent ( 0 )
equalsC . static_extent ( 1 )
, thendynamic_extent
equalsC . static_extent ( 0 )
.C . static_extent ( 1 )


Effects:

Assigns to
on output, the elementwise sum ofC
on input withC
times (the matrix product ofalpha
and the nonconjugated transpose ofA
).A


Remarks: The functions will only access the triangle of
specified by theC
argumentTriangle
, and will interpret the valuet
as equal toC [ i , j ]
for indicesC [ j , i ]
in the domain ofi , j
but outside the triangle specified byC
.t
16.9.4.5.2. Rankk Hermitian matrix update [linalg.alg.blas3.rankk.herk]
template < class T , class in_matrix_1_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_k_update ( T alpha , in_matrix_1_t A , inout_matrix_t C , Triangle t ); template < class ExecutionPolicy , class T , class in_matrix_1_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_k_update ( ExecutionPolicy && exec , T alpha , in_matrix_1_t A , inout_matrix_t C , Triangle t );
[Note:
These functions correspond to the BLAS function
.
They take a scaling factor
, because it would be
impossible to express the updates C = C  A A^T or C = C  A A^H
otherwise.
The scaling factor parameter is required
in order to avoid ambiguity with
.
end note]
For
in the domain of
and in the triangle of
specified by
,
the mathematical expression for the algorithm is
plus the sum of
for all
such that
is in the domain of
.

Preconditions:

equalsA . extent ( 0 )
, andC . extent ( 0 ) 
equalsC . extent ( 0 )
.C . extent ( 1 )


Constraints:

equals 2 andA . rank ()
equals 2.C . rank () 
either has unique layout, orC
layout.layout_blas_packed 
If
hasC
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle


Mandates:

If neither
norA . static_extent ( 0 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.C . static_extent ( 0 ) 
If neither
norC . static_extent ( 0 )
equalsC . static_extent ( 1 )
, thendynamic_extent
equalsC . static_extent ( 0 )
.C . static_extent ( 1 )


Effects:

Assigns to
on output, the elementwise sum ofC
on input with alpha times (the matrix product ofC
and the conjugated transpose ofA
).A


Remarks: The functions will only access the triangle of
specified by theC
argumentTriangle
, and will interpret the valuet
as equal toC [ i , j ]
for indicesC [ j , i ]
in the domain ofi , j
but outside the triangle specified byC
.t
16.9.4.6. Rank2k update of a symmetric or Hermitian matrix [linalg.alg.blas3.rank2k]
[Note: Users can achieve the effect of the
argument of these
BLAS functions, by applying
or
to the input matrices. end note]

Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesA . extent ( 0 )
timesA . extent ( 1 )
.B . extent ( 1 )
16.9.4.6.1. Rank2k symmetric matrix update [linalg.alg.blas3.rank2k.syr2k]
template < class in_matrix_1_t , class in_matrix_2_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_2k_update ( in_matrix_1_t A , in_matrix_2_t B , inout_matrix_t C , Triangle t ); template < class ExecutionPolicy , class in_matrix_1_t , class in_matrix_2_t , class inout_matrix_t , class Triangle > void symmetric_matrix_rank_2k_update ( ExecutionPolicy && exec , in_matrix_1_t A , in_matrix_2_t B , inout_matrix_t C , Triangle t );
[Note: These functions correspond to the BLAS function
.
The BLAS "quick reference" has a typo; the "ALPHA" argument of
and
should not be conjugated. end note]
For
in the domain of
and in the triangle of
specified by
,
the mathematical expression for the algorithm is
plus the sum of
for all
such that
is in the domain of
,
plus the sum of
for all
such that
is in the domain of
.

Preconditions:

equalsA . extent ( 0 )
,B . extent ( 0 ) 
equalsA . extent ( 1 )
, andB . extent ( 1 ) 
equalsA . extent ( 0 )
.C . extent ( 0 )


Constraints:

equals 2,A . rank ()
equals 2, andB . rank ()
equals 2.C . rank () 
either has unique layout, orC
layout.layout_blas_packed 
If
hasC
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle


Mandates:

If neither
norA . static_extent ( 0 )
equalsB . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.B . static_extent ( 0 ) 
If neither
norA . static_extent ( 1 )
equalsB . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.B . static_extent ( 1 ) 
If neither
norA . static_extent ( 0 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.C . static_extent ( 0 )


Effects: Assigns to
on output, the elementwise sum ofC
on input with (the matrix product ofC
and the nonconjugated transpose ofA
) and (the matrix product ofB
and the nonconjugated transpose ofB
.)A 
Remarks: The functions will only access the triangle of
specified by theC
argumentTriangle
, and will interpret the valuet
as equal toC [ i , j ]
for indicesC [ j , i ]
in the domain ofi , j
but outside the triangle specified byC
.t
16.9.4.6.2. Rank2k Hermitian matrix update [linalg.alg.blas3.rank2k.her2k]
template < class in_matrix_1_t , class in_matrix_2_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_2k_update ( in_matrix_1_t A , in_matrix_2_t B , inout_matrix_t C , Triangle t ); template < class ExecutionPolicy , class in_matrix_1_t , class in_matrix_2_t , class inout_matrix_t , class Triangle > void hermitian_matrix_rank_2k_update ( ExecutionPolicy && exec , in_matrix_1_t A , in_matrix_2_t B , inout_matrix_t C , Triangle t );
[Note: These functions correspond to the BLAS function
.
end note]
For
in the domain of
and in the triangle of
specified by
,
the mathematical expression for the algorithm is
plus the sum of
for all
such that
is in the domain of
,
plus the sum of
for all
such that
is in the domain of
.

Preconditions:

equalsA . extent ( 0 )
,B . extent ( 0 ) 
equalsA . extent ( 1 )
, andB . extent ( 1 ) 
equalsA . extent ( 0 )
.C . extent ( 0 )


Constraints:

equals 2,A . rank ()
equals 2, andB . rank ()
equals 2.C . rank () 
either has unique layout, orC
layout.layout_blas_packed 
If
hasC
layout, then the layout’slayout_blas_packed
template argument has the same type as the function’sTriangle
template argument.Triangle


Mandates:

If neither
norA . static_extent ( 0 )
equalsB . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.B . static_extent ( 0 ) 
If neither
norA . static_extent ( 1 )
equalsB . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.B . static_extent ( 1 ) 
If neither
norA . static_extent ( 0 )
equalsC . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.C . static_extent ( 0 )


Effects: Assigns to
on output, the elementwise sum ofC
on input with (the matrix product ofC
and the conjugate transpose ofA
) and (the matrix product ofB
and the conjugate transpose ofB
.)A 
Remarks: The functions will only access the triangle of
specified by theC
argumentTriangle
, and will interpret the valuet
as equal toC [ i , j ] conj  if  needed
for indicesC [ j , i ]
in the domain ofi , j
but outside the triangle specified byC
.t
16.9.4.7. Solve multiple triangular linear systems [linalg.alg.blas3.trsm]
[Note: These functions correspond to the BLAS function
. The
Reference BLAS does not have a
function. end note]
The following requirements apply to all functions in this section.

Preconditions:

For all
in 0, 1, ...,r
 1,B . rank ()
equalsX . extent ( r )
(if applicable); andB . extent ( r ) 
equalsA . extent ( 0 )
.A . extent ( 1 )


Constraints:

equals 2 andA . rank ()
equals 2;B . rank () 
equals 2 (if applicable);X . rank () 
either has unique layout, orin_matrix_1_t
layout;layout_blas_packed 
has unique layout (if applicable);in_matrix_2_t 
has unique layout;out_matrix_t 
has unique layout (if applicable);inout_matrix_t 
if
is in the domain ofr , j
andX
, then the expressionB
is well formed (if applicable); andX [ r , j ] = B [ r , j ] 
if
isDiagonalStorage
, andexplicit_diagonal_t
is in the domain ofi , j
, then the expressionX
is well formed (if applicable).X [ i , j ] /= A [ i , i ]


Mandates:

For all
in 0, 1, ...,r
 1, if neitherX . rank ()
norX . static_extent ( r )
equalsB . static_extent ( r )
, thendynamic_extent
equalsX . static_extent ( r )
(if applicable).B . static_extent ( r ) 
If neither
norA . static_extent ( 0 )
equalsA . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 0 )
.A . static_extent ( 1 )


Remarks:

The functions will only access the triangle of
specified by theA
argumentTriangle
.t 
If the
template argument has typeDiagonalStorage
, then the functions will not access the diagonal ofimplicit_unit_diagonal_t
, and will interpretA
as if each of its diagonal elements behaves as a twosided multiplicative identity.A

16.9.4.7.1. Solve multiple triangular linear systems with triangular matrix on the left [linalg.alg.blas3.trsm.left]

Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesA . extent ( 0 )
timesA . extent ( 1 )
.X . extent ( 1 )
Notinplace left solve of multiple triangular systems
template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_left_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_left_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X , BinaryDivideOp divide ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_matrix_left_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_matrix_left_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X );
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is

for overloads with aX [ i , j ] = divide ( B [ i , j ]  s , A [ i , i ])
parameter, anddivide 
for all other overloads,X [ i , j ] = ( B [ i , j ]  s ) / A [ i , i ]
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
.
[Note: Since the triangular matrix is on the left,
the desired
implementation
in the case of noncommutative multiplication
would be mathematically equivalent to
the product of (the inverse of the second argument)
and the first argument, in that order.
end note]
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is
,
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
.

Preconditions:
equalsA . extent ( 1 )
.B . extent ( 0 ) 
Mandates: If neither
norA . static_extent ( 1 )
equalsB . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.B . static_extent ( 0 ) 
Effects: Assigns to the elements of
the result of solving the triangular linear system(s) AX=B for X.X
Inplace left solve of multiple triangular systems
template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_left_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_left_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B , BinaryDivideOp divide ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_matrix_left_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_matrix_left_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B );
[Note:
This algorithm makes it possible to compute factorizations like Cholesky and LU in place.
Performing triangular solve in place hinders parallelization.
However, other
specific optimizations,
such as vectorization, are still possible.
This is why the
overload exists.
end note]
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is

for overloads with aB [ i , j ] = divide ( B [ i , j ]  s , A [ i , i ])
parameter, anddivide 
for all other overloads,B [ i , j ] = ( B [ i , j ]  s ) / A [ i , i ]
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
.
[Note: Since the triangular matrix is on the left,
the desired
implementation
in the case of noncommutative multiplication
would be mathematically equivalent to
the product of (the inverse of the second argument)
and the first argument, in that order.
end note]
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is
,
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
.

Preconditions:
equalsA . extent ( 1 )
.B . extent ( 0 ) 
Mandates: If neither
norA . static_extent ( 1 )
equalsB . static_extent ( 0 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.B . static_extent ( 0 ) 
Effects: Overwrites
with the result of solving the triangular linear system(s) AX=B for X.B
16.9.4.7.2. Solve multiple triangular linear systems with triangular matrix on the right [linalg.alg.blas3.trsm.right]

Complexity: The count of
array accesses and arithmetic operations is linear inmdspan
timesB . extent ( 0 )
timesB . extent ( 1 )
.A . extent ( 1 )
Notinplace right solve of multiple triangular systems
template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_right_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_right_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X , BinaryDivideOp divide ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_matrix_right_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class in_matrix_2_t , class out_matrix_t > void triangular_matrix_matrix_right_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , in_matrix_2_t B , out_matrix_t X );
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is

for overloads with aX [ i , j ] = divide ( B [ i , j ]  s , A [ j , j ])
parameter, anddivide 
for all other overloads,X [ i , j ] = ( B [ i , j ]  s ) / A [ j , j ]
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
.
[Note: Since the triangular matrix is on the right,
the desired
implementation
in the case of noncommutative multiplication
would be mathematically equivalent to
the product of the first argument, and
(the inverse of the second argument),
in that order.
end note]
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is
,
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
.
[Note:
One can derive these mathematical expressions
from the mathematical expressions of the
algorithm,
by swapping the order of indices of each matrix access
(representing the transpose), swapping
and
,
and swapping the order of product terms in the sum
.
This is because
solves AX = B,
taking the transpose of both sides does not change the equation,
and the transpose of a product of two matrices
is the product of the reversed matrices in transposed order.
end note]

Preconditions:
equalsA . extent ( 1 )
.B . extent ( 1 ) 
Mandates: If neither
norA . static_extent ( 1 )
equalsB . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.B . static_extent ( 1 ) 
Effects: Assigns to the elements of
the result of solving the triangular linear system(s) XA=B for X.X
Inplace right solve of multiple triangular systems
template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_right_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B , BinaryDivideOp divide ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t , class BinaryDivideOp > void triangular_matrix_matrix_right_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B , BinaryDivideOp divide ); template < class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_matrix_right_solve ( in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B ); template < class ExecutionPolicy , class in_matrix_1_t , class Triangle , class DiagonalStorage , class inout_matrix_t > void triangular_matrix_matrix_right_solve ( ExecutionPolicy && exec , in_matrix_1_t A , Triangle t , DiagonalStorage d , inout_matrix_t B );
[Note:
This algorithm makes it possible to compute factorizations like Cholesky and LU in place.
Performing triangular solve in place hinders parallelization.
However, other
specific optimizations,
such as vectorization, are still possible.
This is why the
overload exists.
end note]
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is

for overloads with aB [ i , j ] = divide ( B [ i , j ]  s , A [ j , j ])
parameter, anddivide 
for all other overloads,B [ i , j ] = ( B [ i , j ]  s ) / A [ j , j ]
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
.
[Note: Since the triangular matrix is on the right,
the desired
implementation
in the case of noncommutative multiplication
would be mathematically equivalent to
the product of the first argument, and
(the inverse of the second argument),
in that order.
end note]
If
is
,
then for
in the domain of
,
the mathematical expression for the algorithm is
,
where
is the sum of
for all
not equal to
in the subset of the domain of
specified by
.

Preconditions:
equalsA . extent ( 1 )
.B . extent ( 1 ) 
Mandates: If neither
norA . static_extent ( 1 )
equalsB . static_extent ( 1 )
, thendynamic_extent
equalsA . static_extent ( 1 )
.B . static_extent ( 1 ) 
Effects: Overwrites
with the result of solving the triangular linear system(s) XA=B for X.B
17. Examples
17.1. Cholesky factorization
This example shows how to compute the Cholesky factorization of a real
symmetric positive definite matrix A stored as an
with a
unique nonpacked layout.
The algorithm imitates
in LAPACK 3.9.0.
If
is
,
then it computes the factorization A = U^T U in place,
with U stored in the upper triangle of A on output.
Otherwise, it computes the factorization A = L L^T in place,
with L stored in the lower triangle of A on output.
The function returns 0 if success, else k+1
if row/column k has a zero or NaN (not a number) diagonal entry.
#include <linalg>#include <cmath>// Flip upper to lower, and lower to upper lower_triangular_t opposite_triangle ( upper_triangular_t ) { return {} } upper_triangular_t opposite_triangle ( lower_triangular_t ) { return {} } // Returns nullopt if no bad pivots, // else the index of the first bad pivot. // A "bad" pivot is zero or NaN. template < class inout_matrix_t , class Triangle > optional < typename inout_matrix_t :: size_type > cholesky_factor ( inout_matrix_t A , Triangle t ) { using value_type = typename inout_matrix_t :: value_type ; using size_type = typename inout_matrix_t :: size_type ; constexpr value_type ZERO {}; constexpr value_type ONE ( 1.0 ); const size_type n = A . extent ( 0 ); if ( n == 0 ) { return std :: nullopt ; } else if ( n == 1 ) { if ( A [ 0 , 0 ] <= ZERO  isnan ( A [ 0 , 0 ])) { return { size_type ( 1 )}; } A [ 0 , 0 ] = sqrt ( A [ 0 , 0 ]); } else { // Partition A into [A11, A12, // A21, A22], // where A21 is the transpose of A12. const size_type n1 = n / 2 ; const size_type n2 = n  n1 ; auto A11 = submdspan ( A , pair { 0 , n1 }, pair { 0 , n1 }); auto A22 = submdspan ( A , pair { n1 , n }, pair { n1 , n }); // Factor A11 const auto info1 = cholesky_factor ( A11 , t ); if ( info1 . has_value ()) { return info1 ; } using std :: linalg :: symmetric_matrix_rank_k_update ; using std :: linalg :: transposed ; if constexpr ( std :: is_same_v < Triangle , upper_triangle_t > ) { // Update and scale A12 auto A12 = submdspan ( A , pair { 0 , n1 }, pair { n1 , n }); using std :: linalg :: triangular_matrix_matrix_left_solve ; // BLAS would use original triangle; we need to flip it triangular_matrix_matrix_left_solve ( transposed ( A11 ), opposite_triangle ( t ), explicit_diagonal , A12 ); // A22 = A22  A12^T * A12 // // The Triangle argument applies to A22, // not transposed(A12), so we don’t flip it. symmetric_matrix_rank_k_update (  ONE , transposed ( A12 ), A22 , t ); } else { // // Compute the Cholesky factorization A = L * L^T // // Update and scale A21 auto A21 = submdspan ( A , pair { n1 , n }, pair { 0 , n1 }); using std :: linalg :: triangular_matrix_matrix_right_solve ; // BLAS would use original triangle; we need to flip it triangular_matrix_matrix_right_solve ( transposed ( A11 ), opposite_triangle ( t ), explicit_diagonal , A21 ); // A22 = A22  A21 * A21^T symmetric_matrix_rank_k_update (  ONE , A21 , A22 , t ); } // Factor A22 const auto info2 = cholesky_factor ( A22 , t ); if ( info2 . has_value ()) { return { info2 . value () + n1 }; } } return nullopt ; }
17.2. Solve linear system using Cholesky factorization
This example shows how to solve a symmetric positive definite linear
system Ax=b, using the Cholesky factorization computed in the previous
example inplace in the matrix
. The example assumes that
returned
, indicating no zero or NaN pivots.
template < class in_matrix_t , class Triangle , class in_vector_t , class out_vector_t > void cholesky_solve ( in_matrix_t A , Triangle t , in_vector_t b , out_vector_t x ) { using std :: linalg :: transposed ; using std :: linalg :: triangular_matrix_vector_solve ; if constexpr ( std :: is_same_v < Triangle , upper_triangle_t > ) { // Solve Ax=b where A = U^T U // // Solve U^T c = b, using x to store c. triangular_matrix_vector_solve ( transposed ( A ), opposite_triangle ( t ), explicit_diagonal , b , x ); // Solve U x = c, overwriting x with result. triangular_matrix_vector_solve ( A , t , explicit_diagonal , x ); } else { // Solve Ax=b where A = L L^T // // Solve L c = b, using x to store c. triangular_matrix_vector_solve ( A , t , explicit_diagonal , b , x ); // Solve L^T x = c, overwriting x with result. triangular_matrix_vector_solve ( transposed ( A ), opposite_triangle ( t ), explicit_diagonal , x ); } }
17.3. Compute QR factorization of a tall skinny matrix
This example shows how to compute the QR factorization of a "tall and
skinny" matrix
, using a cacheblocked algorithm based on rankk
symmetric matrix update and Cholesky factorization. "Tall and skinny"
means that the matrix has many more rows than columns.
// Compute QR factorization A = Q R, with A storing Q. template < class inout_matrix_t , class out_matrix_t > optional < typename inout_matrix_t :: size_type > cholesky_tsqr_one_step ( inout_matrix_t A , // A on input, Q on output out_matrix_t R ) { using size_type = typename inout_matrix_t :: size_type ; // One might use cache size, sizeof(element_type), and A.extent(1) // to pick the number of rows per block. For now, we just pick // some constant. constexpr size_type max_num_rows_per_block = 500 ; using R_element_type = typename out_matrix_t :: element_type ; constexpr R_element_type ZERO {}; for ( size_type j = 0 ; j < R . extent ( 1 ); ++ j ) { for ( size_type i = 0 ; i < R . extent ( 0 ); ++ i ) { R [ i , j ] = ZERO ; } } // Cacheblocked version of R = R + A^T * A. const auto num_rows = A . extent ( 0 ); auto rest_num_rows = num_rows ; auto A_rest = A ; while ( A_rest . extent ( 0 ) > 0 ) { const size_type num_rows_per_block = min ( A . extent ( 0 ), max_num_rows_per_block ); auto A_cur = submdspan ( A_rest , pair { 0 , num_rows_per_block }, full_extent ); A_rest = submdspan ( A_rest , pair { num_rows_per_block , A_rest . extent ( 0 )}, full_extent ); // R = R + A_cur^T * A_cur using std :: linalg :: symmetric_matrix_rank_k_update ; constexpr R_element_type ONE ( 1.0 ); // The Triangle argument applies to R, // not transposed(A_cur), so we don’t flip it. symmetric_matrix_rank_k_update ( ONE , transposed ( A_cur ), R , upper_triangle ); } const auto info = cholesky_factor ( R , upper_triangle ); if ( info . has_value ()) { return info ; } using std :: linalg :: triangular_matrix_matrix_left_solve ; triangular_matrix_matrix_left_solve ( R , upper_triangle , A ); return nullopt ; } // Compute QR factorization A = Q R. Use R_tmp as temporary R factor // storage for iterative refinement. template < class in_matrix_t , class out_matrix_1_t , class out_matrix_2_t , class out_matrix_3_t > optional < typename out_matrix_1_t :: size_type > cholesky_tsqr ( in_matrix_t A , out_matrix_1_t Q , out_matrix_2_t R_tmp , out_matrix_3_t R ) { assert ( R . extent ( 0 ) == R . extent ( 1 )); assert ( A . extent ( 1 ) == R . extent ( 0 )); assert ( R_tmp . extent ( 0 ) == R_tmp . extent ( 1 )); assert ( A . extent ( 0 ) == Q . extent ( 0 )); assert ( A . extent ( 1 ) == Q . extent ( 1 )); copy ( A , Q ); const auto info1 = cholesky_tsqr_one_step ( Q , R ); if ( info1 . has_value ()) { return info1 ; } // Use one step of iterative refinement to improve accuracy. const auto info2 = cholesky_tsqr_one_step ( Q , R_tmp ); if ( info2 . has_value ()) { return info2 ; } // R = R_tmp * R using std :: linalg :: triangular_matrix_left_product ; triangular_matrix_left_product ( R_tmp , upper_triangle , explicit_diagonal , R ); return nullopt ; }