| Document Number: | |
|---|---|
| Date: | |
| Revises: | |
| Editor: | NVIDIA Corporation | 
Note: this is an early draft. It’s known to be incomplet and incorrekt, and it has lots of bad formatting.
This Technical Specification describes requirements for implementations of an interface that computer programs written in the C++ programming language may use to invoke algorithms with parallel execution. The algorithms described by this Technical Specification are realizable across a broad class of computer architectures.
This Technical Specification is non-normative. Some of the functionality described by this Technical Specification may be considered for standardization in a future version of C++, but it is not currently part of any C++ standard. Some of the functionality in this Technical Specification may never be standardized, and other functionality may be standardized in a substantially changed form.
The goal of this Technical Specification is to build widespread existing practice for parallelism in the C++ standard algorithms library. It gives advice on extensions to those vendors who wish to provide them.
The following referenced document is indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 14882:— is herein called the C++ Standard. The library described in ISO/IEC 14882:— clauses 17-30 is herein called the C++ Standard Library. The C++ Standard Library components described in ISO/IEC 14882:— clauses 25, 26.7 and 20.7.2 are herein called the C++ Standard Algorithms Library.
Unless otherwise specified, the whole of the C++ Standard's Library
    introduction (
Since the extensions described in this Technical Specification are
    experimental and not part of the C++ Standard Library, they should not be
    declared directly within namespace std. Unless otherwise specified, all
    components described in this Technical Specification are declared in namespace 
    std::experimental::parallel::v2.v1
std. 
    
    — end note ]
  Unless otherwise specified, references to such entities described in this
    Technical Specification are assumed to be qualified with
    std::experimental::parallel::v2, and references to entities described in the C++
    Standard Library are assumed to be qualified with v1std::.
Extensions that are expected to eventually be added to an existing header
    <meow> are provided inside the <experimental/meow> header,
    which shall include the standard contents of <meow> as if by
    #include <meow>
For the purposes of this document, the terms and definitions given in the C++ Standard and the following apply.
A parallel algorithm is a function template described by this Technical Specification declared in namespace std::experimental::parallel::v2 with a formal template parameter named v1ExecutionPolicy.
Parallel algorithms access objects indirectly accessible via their arguments by invoking the following functions:
sort function may invoke the following element access functions:
  
        RandomAccessIterator.
          swap function on the elements of the sequence (as per 25.4.1.1 [sort]/2).
          Compare function object.
          An implementation that provides support for this Technical Specification shall define the feature test macro(s) in Table 1.
| Name | Value | Header | 
|---|---|---|
| __cpp_lib_experimental_parallel_algorithm | 201505 | <experimental/algorithm><experimental/exception_list><experimental/execution_policy><experimental/numeric> | 
| __cpp_lib_experimental_parallel_task_block | 201510 | <experimental/task_block> | 
This clause describes classes that are execution policy types. An object of an execution policy type indicates the kinds of parallelism allowed in the execution of an algorithm and expresses the consequent requirements on the element access functions.
std::vector<int> v = ...
// standard sequential sort
std::sort(v.begin(), v.end());
using namespace std::experimental::parallel;
// explicitly sequential sort
sort(seq, v.begin(), v.end());
// permitting parallel execution
sort(par, v.begin(), v.end());
// permitting vectorization as well
sort(par_vec, v.begin(), v.end());
// sort with dynamically-selected execution
size_t threshold = ...
execution_policy exec = seq;
if (v.size() > threshold)
{
  exec = par;
}
sort(exec, v.begin(), v.end());
    
    — end example ]
  <experimental/execution_policy> synopsisnamespace std {
namespace experimental {
namespace parallel {
inline namespace v2v1 {
  // 2.3, Execution policy type trait 
  template<class T> struct is_execution_policy;
  template<class T> constexpr bool is_execution_policy_v = is_execution_policy<T>::value;
  // 2.4, Sequential execution policy 
  class sequential_execution_policy;
  // 2.5, Parallel execution policy 
  class parallel_execution_policy;
  // 2.6, Parallel+Vector execution policy 
  class parallel_vector_execution_policy;
  // 2.7, Dynamic execution policy 
  class execution_policy;
}
}
}
}
  
    template<class T> struct is_execution_policy { see below };
    is_execution_policy
 can be used to detect parallel execution policies for the purpose of 
excluding function signatures from otherwise ambiguous overload 
resolution participation.
is_execution_policy<T> shall be a UnaryTypeTrait with a BaseCharacteristic of true_type if T is the type of a standard or implementation-defined execution policy, otherwise false_type.
    
The behavior of a program that adds specializations for is_execution_policy is undefined.
class sequential_execution_policy{ unspecified };
    The class sequential_execution_policy
 is an execution policy type used as a unique type to disambiguate 
parallel algorithm overloading and require that a parallel algorithm's 
execution may not be parallelized.
class parallel_execution_policy{ unspecified };
    The class parallel_execution_policy
 is an execution policy type used as a unique type to disambiguate 
parallel algorithm overloading and indicate that a parallel algorithm's 
execution may be parallelized.
class parallel_vector_execution_policy{ unspecified };
    The class class parallel_vector_execution_policy
 is an execution policy type used as a unique type to disambiguate 
parallel algorithm overloading and indicate that a parallel algorithm's 
execution may be vectorized and parallelized.
class execution_policy
{
  public:
    // 2.7.1, execution_policy construct/assign 
    template<class T> execution_policy(const T& exec);
    template<class T> execution_policy& operator=(const T& exec);
    // 2.7.2, execution_policy object access 
    const type_info& type() const noexcept;
    template<class T> T* get() noexcept;
    template<class T> const T* get() const noexcept;
};
    The class execution_policy is a container for execution policy objects.
    execution_policy allows dynamic control over standard algorithm execution.
std::vector<float> sort_me = ...
        
using namespace std::experimental::parallel;
execution_policy exec = seq;
if(sort_me.size() > threshold)
{
  exec = std::par;
}
 
std::sort(exec, std::begin(sort_me), std::end(sort_me));
    
    — end example ]
  Objects of type execution_policy shall be constructible and assignable from objects of
    type T for which is_execution_policy<T>::value is true.
execution_policy construct/assigntemplate<class T> execution_policy(const T& exec); execution_policy object with a copy of exec's state.is_execution_policy<T>::value is true.
        template<class T> execution_policy& operator=(const T& exec); exec's state to *this.*this.
      execution_policy object accessconst type_info& type() const noexcept; typeid(T), such that T is the type of the execution policy object contained by *this.template<class T> T* get() noexcept; template<class T> const T* get() const noexcept; target_type() == typeid(T), a pointer to the stored execution policy object; otherwise a null pointer.is_execution_policy<T>::value is true.
        constexpr sequential_execution_policy      seq{};
constexpr parallel_execution_policy        par{};
constexpr parallel_vector_execution_policy par_vec{};
    The header <experimental/execution_policy> declares a global object associated with each type of execution policy defined by this Technical Specification.
          During the execution of a standard parallel algorithm, 
          if temporary memory resources are required and none are available,
          the algorithm throws a std::bad_alloc exception.
        
During the execution of a standard parallel algorithm, if the invocation of an element access function exits via an uncaught exception, the behavior of the program is determined by the type of execution policy used to invoke the algorithm:
class parallel_vector_execution_policy,
              std::terminate shall be called.
            sequential_execution_policy or
              parallel_execution_policy, the execution of the algorithm exits via an
               exception. The exception shall be an exception_list containing all uncaught exceptions thrown during
              the invocations of element access functions, or optionally the uncaught exception if there was only one.
              for_each is executed sequentially,
                if an invocation of the user-provided function object throws an exception, for_each can exit via the uncaught exception, or throw an exception_list containing the original exception.
              
    — end note ]
  std::bad_alloc, all exceptions 
thrown during the execution of
                the algorithm are communicated to the caller. It is 
unspecified whether an algorithm implementation will "forge ahead" after
 
                encountering and capturing a user exception.
              
    — end note ]
  std::bad_alloc
 exception even if one or more
                user-provided function objects have exited via an 
exception. For example, this can happen when an algorithm fails to 
allocate memory while
                creating or adding elements to the exception_list object.
              
    — end note ]
  <experimental/exception_list> synopsisnamespace std {
namespace experimental {
namespace parallel {
inline namespace v2v1 {
  class exception_list : public exception
  {
    public:
      typedef unspecified iterator;
  
      size_t size() const noexcept;
      iterator begin() const noexcept;
      iterator end() const noexcept;
      const char* what() const noexcept override;
  };
}
}
}
}
      
      
        The class exception_list owns a sequence of exception_ptr objects. The parallel
        algorithms may use the exception_list to communicate uncaught exceptions encountered during parallel execution to the
        caller of the algorithm.
      
        The type exception_list::iterator shall fulfill the requirements of
        ForwardIterator.
      
size_t size() const noexcept; exception_ptr objects contained within the exception_list.
        iterator begin() const noexcept; exception_ptr object contained within the exception_list.
        iterator end() const noexcept; const char* what() const noexcept override; 
        Function objects passed into parallel algorithms as objects of type BinaryPredicate,
        Compare, and BinaryOperation shall not directly or indirectly modify
        objects via their arguments.
      
        Parallel algorithms have template parameters named ExecutionPolicy which describe
        the manner in which the execution of these algorithms may be parallelized and the manner in
        which they apply the element access functions.
      
        The invocations of element access functions in parallel algorithms invoked with an execution
        policy object of type sequential_execution_policy execute in sequential order in
        the calling thread.
      
        The invocations of element access functions in parallel algorithms invoked with an execution
        policy object of type parallel_execution_policy are permitted to execute in an
        unordered fashion in either the invoking thread or in a thread implicitly created by the library
        to support parallel algorithm execution. Any such invocations executing in the same thread are
        indeterminately sequenced with respect to each other. 
        
using namespace std::experimental::parallel;
int a[] = {0,1};
std::vector<int> v;
for_each(par, std::begin(a), std::end(a), [&](int i) {
  v.push_back(i*2+1);
});
        The program above has a data race because of the unsynchronized access to the container
        v.
      
    — end example ]
  using namespace std::experimental::parallel;
std::atomic<int> x = 0;
int a[] = {1,2};
for_each(par, std::begin(a), std::end(a), [&](int n) {
  x.fetch_add(1, std::memory_order_relaxed);
  // spin wait for another iteration to change the value of x
  while (x.load(std::memory_order_relaxed) == 1) { }
});
        The above example depends on the order of execution of the iterations, and is therefore
        undefined (may deadlock).
      
    — end example ]
  using namespace std::experimental::parallel;
int x=0;
std::mutex m;
int a[] = {1,2};
for_each(par, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});
        The above example synchronizes access to object x ensuring that it is
        incremented correctly.
      
    — end example ]
  
        The invocations of element access functions in parallel algorithms invoked with an execution
        policy of type parallel_vector_execution_policy
        are permitted to execute in an unordered fashion in unspecified threads, and unsequenced
        with respect to one another within each thread.
        
parallel_vector_execution_policy allows the execution of element access functions to be
          interleaved on a single thread, synchronization, including the use of mutexes, risks deadlock. Thus the
          synchronization with parallel_vector_execution_policy is restricted as follows:
          A standard library function is vectorization-unsafe if it is specified to synchronize with
          another function invocation, or another function invocation is specified to synchronize with it, and if
          it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions
          may not be invoked by user code called from parallel_vector_execution_policy algorithms.
          using namespace std::experimental::parallel;
int x=0;
std::mutex m;
int a[] = {1,2};
for_each(par_vec, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});
        The above program is invalid because the applications of the function object are not
        guaranteed to run on different threads.
      
    — end example ]
  m.lock on the same thread, which may deadlock.
      
    — end note ]
  parallel_execution_policy or the
        parallel_vector_execution_policy invocation allow the implementation to fall back to
        sequential execution if the system cannot parallelize an algorithm invocation due to lack of
        resources.
      
    — end note ]
  
        Algorithms invoked with an execution policy object of type execution_policy
        execute internally as if invoked with the contained execution policy object.
      
The semantics of parallel algorithms invoked with an execution policy object of implementation-defined type are implementation-defined.
ExecutionPolicy algorithm overloads
        The Parallel Algorithms Library provides overloads for each of the algorithms named in
        Table 1, corresponding to the algorithms with the same name in the C++ Standard Algorithms Library.
        
        For each algorithm in ExecutionPolicy, which shall be the first template parameter.
        
        In addition, each such overload shall have the new function parameter as the
        first function parameter of type ExecutionPolicy&&.
      
        Unless otherwise specified, the semantics of ExecutionPolicy algorithm overloads
        are identical to their overloads without.
      
        Parallel algorithms shall not participate in overload resolution unless
        is_execution_policy<decay_t<ExecutionPolicy>>::value is true.
      
| adjacent_difference | adjacent_find | all_of | any_of | 
| copy | copy_if | copy_n | count | 
| count_if | equal | exclusive_scan | fill | 
| fill_n | find | find_end | find_first_of | 
| find_if | find_if_not | for_each | for_each_n | 
| generate | generate_n | includes | inclusive_scan | 
| inner_product | inplace_merge | is_heap | is_heap_until | 
| is_partitioned | is_sorted | is_sorted_until | lexicographical_compare | 
| max_element | merge | min_element | minmax_element | 
| mismatch | move | none_of | nth_element | 
| partial_sort | partial_sort_copy | partition | partition_copy | 
| reduce | remove | remove_copy | remove_copy_if | 
| remove_if | replace | replace_copy | replace_copy_if | 
| replace_if | reverse | reverse_copy | rotate | 
| rotate_copy | search | search_n | set_difference | 
| set_intersection | set_symmetric_difference | set_union | sort | 
| stable_partition | stable_sort | swap_ranges | transform | 
| transform_exclusive_scan | transform_inclusive_scan | transform_reduce | uninitialized_copy | 
| uninitialized_copy_n | uninitialized_fill | uninitialized_fill_n | unique | 
| unique_copy | 
      Define GENERALIZED_SUM(op, a1, ..., aN) as follows:
      
a1 when N is 1op(GENERALIZED_SUM(op, b1, ..., bK), GENERALIZED_SUM(op, bM, ..., bN)) where
           
          b1, ..., bN may be any permutation of a1, ..., aN and1 < K+1 = M ≤ N.
      Define GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aN) as follows:
      
a1 when N is 1op(GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aK), GENERALIZED_NONCOMMUTATIVE_SUM(op, aM, ..., aN) where 1 < K+1 = M ≤ N.
        <experimental/algorithm> synopsisnamespace std {
namespace experimental {
namespace parallel {
inline namespace v2v1 {
  template<class ExecutionPolicy,
           class InputIterator, class Function>
    void for_each(ExecutionPolicy&& exec,
                  InputIterator first, InputIterator last,
                  Function f);
  template<class InputIterator, class Size, class Function>
    InputIterator for_each_n(InputIterator first, Size n,
                             Function f);
  template<class ExecutionPolicy,
           class InputIterator, class Size, class Function>
    InputIterator for_each_n(ExecutionPolicy&& exec,
                             InputIterator first, Size n,
                             Function f);
}
}
}
}
    
    template<class ExecutionPolicy,
      class InputIterator, class Function>
void for_each(ExecutionPolicy&& exec,
              InputIterator first, InputIterator last,
              Function f); f to the result of dereferencing every iterator in the range [first,last).
          first satisfies the requirements of a mutable iterator, f may
            apply nonconstant functions through the dereferenced iterator.
          
    — end note ]
  f exactly last - first times.
        f returns a result, the result is ignored.
        for_each does not return a copy of
          its Function parameter, since parallelization may not permit efficient state
          accumulation.
        for_each requires
          Function to meet the requirements of CopyConstructible.
        template<class InputIterator, class Size, class Function>
InputIterator for_each_n(InputIterator first, Size n,
                         Function f); Function shall meet the requirements of MoveConstructible
          
          Function need not meet the requirements of CopyConstructible.
          
    — end note ]
  f to the result of dereferencing every iterator in the range
          [first,first + n), starting from first and proceeding to first + n - 1.
          first satisfies the requirements of a mutable iterator,
            f may apply nonconstant functions through the dereferenced iterator.
          
    — end note ]
  first + n for non-negative values of n and first for negative values.
        f returns a result, the result is ignored.
        template<class ExecutionPolicy,
      class InputIterator, class Size, class Function>
InputIterator for_each_n(ExecutionPolicy && exec,
                         InputIterator first, Size n,
                         Function f); f to the result of dereferencing every iterator in the range
          [first,first + n), starting from first and proceeding to first + n - 1.
          first satisfies the requirements of a mutable iterator,
            f may apply nonconstant functions through the dereferenced iterator.
          
    — end note ]
  first + n for non-negative values of n and first for negative values.
        f returns a result, the result is ignored.
        for_each_n requires
          Function to meet the requirements of CopyConstructible.
        <experimental/numeric> synopsisnamespace std {
namespace experimental {
namespace parallel {
inline namespace v2v1 {
  template<class InputIterator>
    typename iterator_traits<InputIterator>::value_type
      reduce(InputIterator first, InputIterator last);
  template<class ExecutionPolicy,
           class InputIterator>
    typename iterator_traits<InputIterator>::value_type
      reduce(ExecutionPolicy&& exec,
             InputIterator first, InputIterator last);
  template<class InputIterator, class T>
    T reduce(InputIterator first, InputIterator last, T init);
  template<class ExecutionPolicy,
           class InputIterator, class T>
    T reduce(ExecutionPolicy&& exec,
             InputIterator first, InputIterator last, T init);
  template<class InputIterator, class T, class BinaryOperation>
    T reduce(InputIterator first, InputIterator last, T init,
             BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class T, class BinaryOperation>
    T reduce(ExecutionPolicy&& exec,
             InputIterator first, InputIterator last, T init,
             BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator,
           class T>
    OutputIterator
      exclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     T init);
  template<class ExecutionPolicy,
           class InputIterator, class OutputIterator,
           class T>
    OutputIterator
      exclusive_scan(ExecutionPolicy&& exec,
                     InputIterator first, InputIterator last,
                     OutputIterator result,
                     T init);
  template<class InputIterator, class OutputIterator,
           class T, class BinaryOperation>
    OutputIterator
      exclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     T init, BinaryOperation binary_op);
  template<class ExecutionPolicy,
           class InputIterator, class OutputIterator,
           class T, class BinaryOperation>
    OutputIterator
      exclusive_scan(ExecutionPolicy&& exec,
                     InputIterator first, InputIterator last,
                     OutputIterator result,
                     T init, BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator>
    OutputIterator
      inclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result);
  template<class ExecutionPolicy,
           class InputIterator, class OutputIterator>
    OutputIterator
      inclusive_scan(ExecutionPolicy&& exec,
                     InputIterator first, InputIterator last,
                     OutputIterator result);
  template<class InputIterator, class OutputIterator,
           class BinaryOperation>
    OutputIterator
      inclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     BinaryOperation binary_op);
  template<class ExecutionPolicy,
           class InputIterator, class OutputIterator,
           class BinaryOperation>
    OutputIterator
      inclusive_scan(ExecutionPolicy&& exec,
                     InputIterator first, InputIterator last,
                     OutputIterator result,
                     BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator,
           class BinaryOperation, class T>
    OutputIterator
      inclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     BinaryOperation binary_op, T init);
  template<class ExecutionPolicy,
           class InputIterator, class OutputIterator,
           class BinaryOperation, class T>
    OutputIterator
      inclusive_scan(ExecutionPolicy&& exec,
                     InputIterator first, InputIterator last,
                     OutputIterator result,
                     BinaryOperation binary_op, T init);
  template<class InputIterator, class UnaryOperation,
           class T, class BinaryOperation>
    T transform_reduce(InputIterator first, InputIterator last,
                       UnaryOperation unary_op,
                       T init, BinaryOperation binary_op);
  template<class ExecutionPolicy,
           class InputIterator, class UnaryOperation,
           class T, class BinaryOperation>
    T transform_reduce(ExecutionPolicy&& exec,
                       InputIterator first, InputIterator last,
                       UnaryOperation unary_op,
                       T init, BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator,
           class UnaryOperation, class T, class BinaryOperation>
    OutputIterator
      transform_exclusive_scan(InputIterator first, InputIterator last,
                               OutputIterator result,
                               UnaryOperation unary_op,
                               T init, BinaryOperation binary_op);
  template<class ExecutionPolicy,
           class InputIterator, class OutputIterator,
           class UnaryOperation, class T, class BinaryOperation>
    OutputIterator
      transform_exclusive_scan(ExecutionPolicy&& exec,
                               InputIterator first, InputIterator last,
                               OutputIterator result,
                               UnaryOperation unary_op,
                               T init, BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator,
           class UnaryOperation, class BinaryOperation>
    OutputIterator
      transform_inclusive_scan(InputIterator first, InputIterator last,
                               OutputIterator result,
                               UnaryOperation unary_op,
                               BinaryOperation binary_op);
  template<class ExecutionPolicy,
           class InputIterator, class OutputIterator,
           class UnaryOperation, class BinaryOperation>
    OutputIterator
      transform_inclusive_scan(ExecutionPolicy&& exec,
                               InputIterator first, InputIterator last,
                               OutputIterator result,
                               UnaryOperation unary_op,
                               BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator,
           class UnaryOperation, class BinaryOperation, class T>
    OutputIterator
      transform_inclusive_scan(InputIterator first, InputIterator last,
                               OutputIterator result,
                               UnaryOperation unary_op,
                               BinaryOperation binary_op, T init);
  template<class ExecutionPolicy,
           class InputIterator, class OutputIterator,
           class UnaryOperation, class BinaryOperation, class T>
    OutputIterator
      transform_inclusive_scan(ExecutionPolicy&& exec,
                               InputIterator first, InputIterator last,
                               OutputIterator result,
                               UnaryOperation unary_op,
                               BinaryOperation binary_op, T init);
}
}
}
}
    
    template<class InputIterator>
typename iterator_traits<InputIterator>::value_type
    reduce(InputIterator first, InputIterator last); reduce(first, last, typename iterator_traits<InputIterator>::value_type{}).
        template<class InputIterator, class T>
T reduce(InputIterator first, InputIterator last, T init); reduce(first, last, init, plus<>()).
        template<class InputIterator, class T, class BinaryOperation>
T reduce(InputIterator first, InputIterator last, T init,
         BinaryOperation binary_op); GENERALIZED_SUM(binary_op, init, *first, ..., *(first + (last - first) - 1)).
        binary_op shall not invalidate iterators or subranges, nor modify elements in the
          range [first,last).
        last - first) applications of binary_op.
        reduce and accumulate is that the behavior
          of reduce may be non-deterministic for non-associative or non-commutative binary_op.
        template<class InputIterator, class OutputIterator, class T>
OutputIterator exclusive_scan(InputIterator first, InputIterator last,
                              OutputIterator result,
                              T init); exclusive_scan(first, last, result, init, plus<>()).
        template<class InputIterator, class OutputIterator, class T, class BinaryOperation>
OutputIterator exclusive_scan(InputIterator first, InputIterator last,
                              OutputIterator result,
                              T init, BinaryOperation binary_op); i in [result,result + (last - first)) the
          value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., *(first + (i - result) - 1)).
        result.
        binary_op shall not invalidate iterators or subranges, nor modify elements in the
          ranges [first,last) or [result,result + (last - first)).
        last - first) applications of binary_op.
        exclusive_scan and inclusive_scan is that
          exclusive_scan excludes the ith input element from the ith
          sum. If binary_op is not mathematically associative, the behavior of
          exclusive_scan may be non-deterministic.
        template<class InputIterator, class OutputIterator>
OutputIterator inclusive_scan(InputIterator first, InputIterator last,
                              OutputIterator result); inclusive_scan(first, last, result, plus<>()).
        template<class InputIterator, class OutputIterator, class BinaryOperation>
OutputIterator inclusive_scan(InputIterator first, InputIterator last,
                              OutputIterator result,
                              BinaryOperation binary_op); template<class InputIterator, class OutputIterator, class BinaryOperation>
OutputIterator inclusive_scan(InputIterator first, InputIterator last,
                              OutputIterator result,
                              BinaryOperation binary_op, T init); i in [result,result + (last - first)) the value of
          GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, *first, ..., *(first + (i - result))) or
          GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., *(first + (i - result)))
          if init is provided.
        result.
        binary_op shall not invalidate iterators or subranges, nor modify elements in the 
          ranges [first,last) or [result,result + (last - first)).
        last - first) applications of binary_op.
        exclusive_scan and inclusive_scan is that
          inclusive_scan includes the ith input element in the ith sum.
          If binary_op is not mathematically associative, the behavior of
          inclusive_scan may be non-deterministic.
        template<class InputIterator, class UnaryFunction, class T, class BinaryOperation>
T transform_reduce(InputIterator first, InputIterator last,
                   UnaryOperation unary_op, T init, BinaryOperation binary_op); GENERALIZED_SUM(binary_op, init, unary_op(*first), ..., unary_op(*(first + (last - first) -1))).
        unary_op nor binary_op shall invalidate subranges, or modify elements in the range [first,last)
        last - first) applications each of unary_op and binary_op.
        transform_reduce does not apply unary_op to init.
        template<class InputIterator, class OutputIterator,
      class UnaryOperation,
      class T, class BinaryOperation>
OutputIterator transform_exclusive_scan(InputIterator first, InputIterator last,
                                        OutputIterator result,
                                        UnaryOperation unary_op,
                                        T init, BinaryOperation binary_op); i in [result,result + (last - first)) the value of
          GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, unary_op(*first), ..., unary_op(*(first + (i- result) - 1))).
        result.
        unary_op nor binary_op shall invalidate iterators or subranges, or modify elements in the
          ranges [first,last) or [result,result + (last - first)).
        last - first) applications each of unary_op and binary_op.
        transform_exclusive_scan and transform_inclusive_scan is that transform_exclusive_scan
          excludes the ith input element from the ith sum. If binary_op is not mathematically associative, the behavior of
          transform_exclusive_scan may be non-deterministic. transform_exclusive_scan does not apply unary_op to init.
        template<class InputIterator, class OutputIterator,
      class UnaryOperation,
      class BinaryOperation>
OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
                                        OutputIterator result,
                                        UnaryOperation unary_op,
                                        BinaryOperation binary_op); template<class InputIterator, class OutputIterator,
      class UnaryOperation,
      class BinaryOperation, class T>
OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
                                        OutputIterator result,
                                        UnaryOperation unary_op,
                                        BinaryOperation binary_op, T init); i in [result,result + (last - first)) the value of
          GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, unary_op(*first), ..., unary_op(*(first + (i -result)))) or
          GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, unary_op(*first), ..., unary_op(*(first + (i- result))))
          if init is provided.
        result.
        unary_op nor binary_op shall invalidate iterators or subranges, or modify elements in the ranges [first,last)
          or [result,result + (last - first)).
        last - first) applications each of unary_op and binary_op.
        transform_exclusive_scan and transform_inclusive_scan is that transform_inclusive_scan
          includes the ith input element from the ith sum. If binary_op is not mathematically associative, the behavior of
          transform_inclusive_scan may be non-deterministic. transform_inclusive_scan does not apply unary_op to init.
        <experimental/task_block> synopsisnamespace std {
namespace experimental {
namespace parallel {
inline namespace v2 {
  class task_cancelled_exception;
  class task_block;
  template<class F>
    void define_task_block(F&& f);
  template<class f>
    void define_task_block_restore_thread(F&& f);
}
}
}
}
     
   
    task_cancelled_exceptionnamespace std {
namespace experimental {
namespace parallel
inline namespace v2 {
  class task_cancelled_exception : public exception
  {
    public:
      task_cancelled_exception() noexcept;
      virtual const char* what() const noexcept;
  };
}
}
}
}
     
     
       The class task_cancelled_exception defines the type of objects thrown by
       task_block::run or task_block::wait if they detect than an
       exception is pending within the current parallel block. See 
task_cancelled_exception member function whatvirtual const char* what() const noexcept task_blocknamespace std {
namespace experimental {
namespace parallel {
inline namespace v2 {
  class task_block
  {
    private:
      ~task_block();
    public:
      task_block(const task_block&) = delete;
      task_block& operator=(const task_block&) = delete;
      void operator&() const = delete;
      template<class F>
        void run(F&& f);
      void wait();
  };
}
}
}
}
     
     
       The class task_block defines an interface for forking and joining parallel tasks. The define_task_block and define_task_block_restore_thread function templates create an object of type task_block and pass a reference to that object to a user-provided function object.
     
       An object of class task_block cannot be constructed,
 destroyed, copied, or moved except by the implementation of the task 
block library. Taking the address of a task_block object via operator& is ill-formed. Obtaining its address by any other means (including addressof) results in a pointer with an unspecified value; dereferencing such a pointer results in undefined behavior.
     
       A task_block is active if it was created by the nearest enclosing task block, where “task block” refers to an
       invocation of define_task_block or define_task_block_restore_thread and “nearest enclosing” means the most
       recent invocation that has not yet completed. Code designated for execution in another thread by means other
       than the facilities in this section (e.g., using thread or async) are not enclosed in the task block and a
       task_block passed to (or captured by) such code is not active within that code. Performing any operation on a
       task_block that is not active results in undefined behavior.
     
       When the argument to task_block::run is called, no task_block is active, not even the task_block on which run was called.
       (The function object should not, therefore, capture a task_block from the surrounding block.)
     
define_task_block([&](auto& tb) {
  tb.run([&]{
    tb.run([] { f(); });               // Error: tb is not active within run
    define_task_block([&](auto& tb2) { // Define new task block
      tb2.run(f);
      ...
    });
  });
  ...
});
     
     
    — end example ]
  task_block member function template runtemplate<class F> void run(F&& f); F shall be MoveConstructible. DECAY_COPY(std::forward<F>(f))() shall be a valid expression.
         *this shall be the active task_block.
         DECAY_COPY(std::forward<F>(f))(), where DECAY_COPY(std::forward<F>(f))
           is evaluated synchronously within the current thread. The call to the resulting copy of the function object is
           permitted to run on an unspecified thread created by the implementation in an unordered fashion relative to
           the sequence of operations following the call to run(f) (the continuation), or indeterminately sequenced
           within the same thread as the continuation. The call to run synchronizes with the call to the function
           object. The completion of the call to the function object synchronizes with the next invocation of wait on
           the same task_block or completion of the nearest enclosing task block (i.e., the define_task_block or
           define_task_block_restore_thread that created this task_block).
         task_cancelled_exception, as described in run function may return on a thread other than the one on which it was called; in such cases,
           completion of the call to run synchronizes with the continuation.
           
           run is ordered similarly to an ordinary function call in a single thread.
    — end note ]
  f may be immediate or may be delayed until
           compute resources are available. run might or might not return before the invocation of f completes.
         task_block member function waitvoid wait(); *this shall be the active task_block.task_block have completed.
         task_cancelled_exception, as described in wait function may return on a thread other than the one on which it was called; in such cases, completion of the call to wait synchronizes with subsequent operations.
           
           define_task_block([&](auto& tb) {
  tb.run([&]{ process(a, w, x); }); // Process a[w] through a[x]
  if (y < x) tb.wait();             // Wait if overlap between [w,x) and [y,z)
  process(a, y, z);                 // Process a[y] through a[z]
});
           
    — end example ]
  define_task_blocktemplate<class F>
void define_task_block(F&& f);
        template<class F>
void define_task_block_restore_thread(F&& f);
        tb of type task_block, the expression f(tb) shall be well-formed
       task_block tb and calls f(tb).
       exception_list, as specified in f have finished execution.
       define_task_block function may return on a thread other than the one on which it was called
         unless there are no task blocks active on entry to define_task_block (see define_task_block returns on a different thread,
         it synchronizes with operations following the call. define_task_block_restore_thread
         function always returns on the same thread as the one on which it was called.
       f will (directly or indirectly) call tb.run(function-object).
       
       Every task_block has an associated exception list. When the task block starts, its associated exception list is empty.
     
       When an exception is thrown from the user-provided function object passed to define_task_block or
       define_task_block_restore_thread, it is added to the exception list for that task block. Similarly, when
       an exception is thrown from the user-provided function object passed into task_block::run, the exception
       object is added to the exception list associated with the nearest enclosing task block. In both cases, an
       implementation may discard any pending tasks that have not yet been invoked. Tasks that are already in
       progress are not interrupted except at a call to task_block::run or task_block::wait as described below.
     
       If the implementation is able to detect that an exception has been thrown by another task within
       the same nearest enclosing task block, then task_block::run or task_block::wait may throw
       task_canceled_exception; these instances of task_canceled_exception are not added to the exception
       list of the corresponding task block.
     
       When a task block finishes with a non-empty exception list, the exceptions are aggregated into an exception_list object, which is then thrown from the task block.
     
       The order of the exceptions in the exception_list object is unspecified.