An Asynchronous Call for C++

ISO/IEC JTC1 SC22 WG21 N2889 = 09-0079 - 2009-06-21

Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org

Problem Description
    Solution Domain
    Thread Resources
    Solution Value
    Related Work
Proposed Solution
    Acknowledgements
    The async Function
    Thread Joining
    Execution Policies
    Eager and Lazy Evaluation
    Direct Execution
    New Future Type
Proposed Wording
    30.6.1 Overview [futures.overview]
    30.6.? Function template async [futures.async]
    30.6.? Class template joining_future [futures.joining_future]

Problem Description

One of the simplest methods for exploiting parallelism is to call one subroutine in parallel with another. However, with the current threading facilities, doing so is a difficult task.

There have been repeated requests for a simpler mechanism, all of which were rejected by the committee as not being within the spirit of the Kona compromise. However, there are now national body comments requesting such a facility.

UK-182 30.3.3.2.2

Future, promise and packaged_task provide a framework for creating future values, but a simple function to tie all three components together is missing. Note that we only need a simple facility for C++0x. Advanced thread pools are to be left for TR2.

async( F&& f, Args && ... ); Semantics are similar to creating a thread object with a packaged_task invoking f with forward<Args>(args...) but details are left unspecified to allow different scheduling and thread spawning implementations. It is unspecified whether a task submitted to async is run on its own thread or a thread previously used for another async task. If a call to async succeeds, it shall be safe to wait for it from any thread. The state of thread_local variables shall be preserved during async calls. No two incomplete async tasks shall see the same value of this_thread::get_id(). [Note: this effectively forces new tasks to be run on a new thread, or a fixed-size pool with no queue. If the library is unable to spawn a new thread or there are no free worker threads then the async call should fail.]

Solution Domain

Concurrency and parallism represent broad domain of problems and solutions. Mechanisms are generally appropriate to a limited portion of that domain. So, mechanisms should explicitly state the domain in which they are intended to be useful.

The anticipated domain for the following async solution is extracting a limited amount of concurrency from existing sequential programs. That is, some function calls will be made asynchrounous where appropriate to extract high-level concurrency from program structure, and not from its data structures. The facility is not intended to compete with OpenMP or automatic parallelizers that extract loop-level parallelism. To be concrete, the async facility would be appropriate to the recursive calls to quicksort, but not to the iteration in a partition.

In this domain, the programming model is:

At the highest levels of the program, add async where appropriate.
If enough concurrency has not been achieved, move down a layer.
Repeat until you achieve the desired core utilization.

In this model, nested asynchronous calls are not only supported, but desired, as they provide the implementation the opportunity to reuse threads for many potentially, but not actually, asynchronous calls.

Thread Resources

The central technical problem in providing an asynchronous execution facility is to provide it in a manner that does not require the use of thread pools, while at the same time avoiding problems synchronizing with the destructors for any thread-local variables used by any threads created to perform the asynchronous work. See N2880: C++ object lifetime interactions with the threads API, Hans-J. Boehm, Lawrence Crowl, ISO/IEC JTC1 WG21 N2880, 2009-05-01.

While not explicit, the essential lesson of N2880 is as follows.

Threads have variables in the form of thread-local variables, parameters, and automatic variables. To ensure that the resources held by those variables are released, one must join with the thread so that those variables are destroyed. To ensure that destructors of those variables are well-defined, one must join with the thread before its referenced environment is destroyed.

Some consequences of this observation are:

One should never detach non-trivial threads. (There is probably an opportunity for a formal definition in here.)
All thread pools should be explicitly declared, i.e. implicit thread pools are bad.
One should manage the thread pool as one would manage the resources it will accrete and reference.

Solution Value

In addition to the technical details, the committee must consider the value in any solution that meets the procedural bounds of the Kona compromise and the technical bounds embodied in N2880. In particular, external facilities like Cilk, the Threading Building Blocks, and the Parallel Patterns Library are known to be better able to handle fined-grained parallelism. So, is the solution space of sufficient value, relative to these external facilities, for standardization in C++0x?

The value in a solution is relative not only to external facilities, but also relative to facilities in the current standard. Our concurrency primitive, std::thread, does not return values, and getting a value out through std::packaged_task and std::unique_future may take more training than many programmers are willing to accept. So, is the solution space of sufficient value, relative to these internal facilities, for standardization in C++0x?

In this paper, we presume that the value in the solution comes from its improvement over existing internal facilities. The wording of the UK national body comment implies the same conclusion. On that basis, we propose the following solution.

Related Work

Oliver Kowalke is implementing boost.task (formerly known as boost.threadpool). In this library, launch_in_thread() reuses existing threads. The function returns a returns handle object for both thread and return value. This library also allows task interruption. It is available at the Boost Vault (http://www.boostpro.com/vault/ — section 'Concurrent Programming') or from the Boost sandbox (svn — https://svn.boost.org/svn/boost/sandbox/task/).

Herb Sutter has proposed an alternate solution in draft text, generally taking a different choice for those issues in which consensus has not formed. This paper should appear as N2901.

Proposed Solution

The proposed solution consists of a set of async functions to launch asychronous work and a future to manage the function result.

Acknowledgements

This solution derives from an extensive discussion on the C++ threads standardisation <cpp-threads@decadentplace.org.uk> mailing list. That discusson has not yet reached consensus. We highlight points of disagreement below. Note that the presentation in this paper is substantially expanded from earlier drafts, clarifying several issues, so the disagreements may be weaker than they were in discussion.

Thanks to the following contributors to the discussion on this topic: Hans Boehm, Beman Dawes, Peter Dimov, Pablo Halpern, Howard Hinnant, Oliver Kowalke, Doug Lea, Arch Robison, Bjarne Stroustrup, Alexander Terekhov, and Anthony Williams. In particular, we are extremely grateful to Herb Sutter for forcing a thorough analysis into the issues.

The `async` Function

The async functions use the standard techniques for deferring function execution. The function and its arguments are listed separately as parameters to the async functions, which are later combined at the point of invocation to call the designated work.

For example, consider computing the sum of a very large array. The first task is to not compute asynchronously when the overhead would be significant. The second task is to split the work into two pieces, one executed by the host thread and one executed asynchronously.


int parallel_sum(int* data, int size)
{
  int sum = 0;
  if ( size < 1000 )
    for ( int i = 0; i < size; ++i )
      sum += data[i];
  else {
    auto handle = std::async(parallel_sum, data+size/2, size-size/2);
    sum += parallel_sum(data, size/2);
    sum += handle.get();
  }
  return sum;
}

Thread Joining

Because Kona compromise prohibits thread pools and because we must join with any thread created, any asynchronous execution facility must ensure, at the very least, that any thread created is joined before the resulting handle is destroyed. (And, of course, the programmer must destroy the handle, not abandon it to free store.)

A consequence of the joining is that threads cannot be reused. Otherwise, some section of the program would lose control of the resources accreted in the thread being reused.

This issue has not yet reached consensus.

Given that the thread must join, there are two implementation strategies, intrusively implement async or keep the std::thread within the future for later joining.

In the intrusive async, the implementation within the thread must

capture any return value or exception;
destroy all thread-local variables; and only then
invoke the set_value or set_exception function of the promise corresponding to the future.

That is, the promise effectively joins the thread before the future becomes ready.

When storing the std::thread within the future, the implementation of async is a straightforward composition of packaged_task, unique_future, and std::thread.

One consequence of storing the std::thread within the future is that either unique_future must be substantially modified or that we introduce a new future type.

Another consequence of storing the std::thread within the future is that the waiting function changes from a condition-variable wait to a thread join. The std::thread class does provides neither a timed_join nor a try_join, and so a joining future cannot implement the full interface of unique_future.

Execution Policies

The async functions have a policy parameter. Three policies are defined in this paper.

Always invoke the function in a new thread.: Note that we do not choose "another thread" as a consequence of the discussion above.
Always invoke the function serially.: The value in this policy is primarily in temporarily reducing local concurrency in experiments to achieve higher system performance.
Invoke the function at the discretion of the implementation.: The implementation may use either of the above policies on a per-call basis. The value in this policy is that it enables the implementation to better manage resources. It is the policy used when the programmer does not specify one.

The intent of this proposal is to closely follow the parameters and overloads of the std::thread constructors. We expect this consistency to provide the least surprise to users. Because std::thread has a variadic constructor, the std::async function has a variadic overload. A consequence is that the standard technique for implementing a default policy via a defaulted paramter does not work. Hence the proposal places the policy at the front of the parameter list and implements the default policy with a separate overload that does not have that parameter. This placement seems unnatural to many members of the committee, and they desire to place policy parameter at the end.

The only way to provide the policy parameter at the end and to be consistent with std::thread constructors is to remove the variadic constructor from std::thread. Doing so would not lose a great deal of syntactic consiceness, because the lambda facility can encapsulate many parameters. The async form above can be written as follows.


auto handle = std::async(
    [=]{ return parallel_sum( data+size/2, size-size/2); },
    fully_threaded );

We have no objection to that approach. Indeed, it would make the referencing environment of the executed function quite explicit in the form of the lambda-capture. Should the variadic std::thread constructor be removed, we will modify the proposal to move the policy parameter to the end of the list and default it.

Alternatively, one could have inconsistent parameters for std::thread constructors and std::async overloads.

This choice has not reached consensus.

Eager and Lazy Evaluation

When the work is invoked serially, we propose to do so at the point of value request, rather than at the point of initiation. That is, work is invoked lazily rather than eagerly. This approach may seem surprising, but there are reasons to prefer invocation-on-request.

Exceptions in the hosting code may cause the future to be prematurely destroyed. As the return value cannot be recovered, the only reason to do the work is for its side effects.
Those side effects might not have occured in the original sequential formulation of the algorithm, so there would appear to be little lost in failing to execute those side effects if the value is not retrieved.
Work stealing implementations will be ineffective if the async functions have already committed to an eager serial execution.
Executing the work serially at the call to async might introduce deadlock. In contrast, executing the work serially at the call to get cannot introduce any deadlock that was not already present because the calling thread is necessarily blocked.
Lazy evaluation permits speculative execution. Rather than wait to invoke the function when the result is known to be needed, one can invoke async earlier. When there are sufficient processor resources, the function executes concurrently and speculatively. When there are not sufficient resources, the function will execute only when truly needed.

Eager semantics seem more natural when programmers think of "waiting to use the return value". On the other hand, lazy semantics seem more natural when programmers think of "moving the call earlier". Consider the following examples.


int original( int a, b ) {
    int c = work1( a );
    int d = work2( b );
}

int eager( int a, b ) {
    auto handle = async( work1, a );
    int d = work2( b );
    int c = handle.get();
}

int lazy( int a, b ) {
    auto handle = async( work2, b );
    int c = work1( a );
    int d = handle.get();
}

Note also that in the proposed lazy semantics, any serial execution will be in the context of the thread that executes the get(). While we expect that this thread will nearly always be the same as the thread that executes async() it need not be because a future can be moved.

There are consequences to lazy evaluation. In particular, the future returned from async must either be a modified version of the existing unique_future or the future must be of a new type. The reason is that lazy evaluation requires that the future carry an std::function to represent the computation needed.

Direct Execution

A desirable implementation in the case of synchronous execution is direct execution, in which the call to the std::function representing the work returns its result or exception directly to the caller.

In lazy evaluation, direct execution is straightforward; the implementation of a synchronous get() simply calls the std::function and returns its result. Any exeption is simply propogated as in a normal function call.

In eager evaluation, one must necessarily save the result in a variable for later copy/move at from the get() call. However, propogating the exception at the async() call would introduce a second place in which the programmer must protect against an exception. That burden is undesirable, so the async() call should also save any exception for later propogation by the get() call. All of this means that eager evaluation cannot exploit direct execution.

Direct execution has a consequence however. Since the value or exception status is unknown until the get call, the has_value and has_exception queries cannot provide meaningful results before then. That is, direct execution invalidates part of the interface to unique_future.

Note, however, that the has_value and has_exception queries are meaningful with lazy evaluation so long as the first call to them invokes the std::function in an indirect manner.

New Future Type

As hinted several times earlier, we must make a choice in the future type returned by async:

an unmodified unique_future,
a modified unique_future, or
a new future type.

Based on the discussion above,

A non-intrusive joining implementation implies a modified or new future type.
The loss of interface functionality in a non-intrusive joining implementation suggests a new future type.
Lazy evalutation implies a modified or new future type.
Direct execution implies a modified or new future type.
The loss of interface functionality in direct execution suggests a new future type.

Furthermore, a modified unique_future would necessarily induce more overhead on the original intended uses of unique_future. The direct overhead might be as low as a few words and a couple of tests or virtual calls. Unfortunately, tests and virtual calls tend to introduce pipeline bubbles and virtual calls tend to be barriers to optimization. So, the indirect overhead might be substantially higher. However, we have no measurements comparing that overhead to the normal cost of unique_future. Even so, the original uses of unique_future, such as in coordinating thread pools invocations, are likely to be in more performance-sensitive code than are uses of async. Therefore, avoiding potential performance impact to thread pools implies a new future type.

Modifying unique_future implies revisiting aspects of the working draft that we thought were stable. Introducing a new future type would avoid potentially destabilizing the draft.

On balance, we believe that a new future type is the best overall solution to the conflicting desireable features in the return type of the async function. This choice has not reached consensus.

Given that we have a new future, we remove timed_wait, is_ready, has_value, and has_exception, from the interface. That is, the new future interface, a joining_future, is modeled in part on thread, which has a unique owner and is therefore only movable.

enum async_policy {
    fully_threaded,
    fully_synchronous,
    impl_discretion
}
template<class F>   requires Callable<F>;   joining_future<Callable::result_type>   async(async_policy policy, F f);

template<typename F, typename ... Args>   requires Callable<F, Args...>;   joining_future<Callable::result_type>   async(async_policy policy, F&& f, Args&&...);

template<class F>   requires Callable<F>;   joining_future<Callable::result_type>   async(F f);

template<typename F, typename ... Args>   requires Callable<F, Args...>;   joining_future<Callable::result_type>   async(F&& f, Args&&...);
Requires: F and each type Ti in Args shall be CopyConstructible if an lvalue and otherwise MoveConstructible. INVOKE(f, w1, w2, ..., wN) (20.7.2) shall be a valid expression for some values w1, w2, ..., wN, where N == sizeof...(Args).

Effects: Constructs an object of type joining_future<Callable::result_type> ([futures.joining_future]). If policy is fully_threaded, creates an object of type thread and executes INVOKE(f, t1, t2, ..., tN) in a new thread of execution, where t1, t2, ..., tN are the values in args.... Any return value is captured by the joining_future. Any exception not caught by f is captured by the joining_future. If policy is fully_synchronous, the thread calling joining_future::get() ([future.joining_future]) executes INVOKE(f, t1, t2, ..., tN) in the caller's own thread of execution, where t1, t2, ..., tN are the values in args.... The invocation is said to be deferred. If policy is impl_discretion, the implementation may choose either policy above at any call to async. [Note: Implementations should defer invocations when no more concurrency can be effectively exploited. —end note] If there is no policy parameter, the behavior is as if there was a impl_discretion parameter was specified.

Synchronization: The invocation of the async happens before (1.10 [intro.multithread]) the invocation of f. [Note: This statement applies even when the corresponding joining_future is moved to another thread. —end note]

Throws: std::system_error if policy is fully_threaded and the implementation is unable to start a new thread.

Error conditions: — resource_unavailable_try_again — if policy is fully_threaded and either the system lacked the necessary resources to create another thread, or the system-imposed limit on the number of threads in a process would be exceeded.

[Example: Two items of work can be executed in parallel as below.
extern int work1(int value);
extern int work2(int value);
int work(int value) {
  auto handle = std::async(std::impl_discretion, work2, value);
  int tmp = work1(value);
  return tmp + handle.get();
}
—end example:] [Note: The statement
  return work1(value) + handle.get();
might not result in parallelism because get() may be evaluated before work1(), thus forcing work2 to be evaluated before work1(). —end note:]

30.6.? Class template `joining_future` [futures.joining_future]

Add the following section after the one above.

namespace std {
  template<class R>
  class joining_future {
  public:
    joining_future(joining_future &&);
    joining_future(const joining_future& rhs) = delete;
    ~joining_future();
    joining_future& operator=(const joining_future& rhs) = delete;
    // retrieving the value
    see below get();
    // functions to check state and wait for ready
  };
}
The implementation shall provide the template joining_future and two specializations, joining_future<R&> and joining_future<void>. These differ only in the return type and return value of the member function get, as set out in its description, below.

joining_future(joining_future&& rhs);

Effects: move constructs a joining_future object whose associated state is the same as the state of rhs before. The associated state derives from the async call that provided the original future. The state consists of one or more of any thread created by the call, the function object and its arguments, the return value of its invocation, or the exception of its invocation.

Postcondition: rhs can be safely destroyed.

~joining_future();

Effects: destroys *this and its associated state if no other object refers to that. If the invocation has been deferred, but not yet executed via get, the invocation is not executed.

Synchronization: If the invocation has been deferred, then the associated async call happens before (1.10 [intro.multithread]) the destructor return. Otherwise, as if associated-thread.join().

R&& joining_future::get();

R& joining_future<R&>::get();

void joining_future<void>::get();

Note: as described above, the template and its two required specializations differ only in the return type and return value of the member function get.

Effects: If the invocation has been deferred, then executes INVOKE(f, t1, t2, ..., tN) where t1, t2, ..., tN are the values in args....

Returns: If the invocation has been deferred, then

joining_future::get() returns the rvalue-reference of the result of the invocation.

joining_future<R&>::get() returns the reference of the result of the invocation.

joining_future<void>::get() returns nothing.

Otherwise,

joining_future::get() returns an rvalue-reference to the value stored in the asynchronous result.

joining_future<R&>::get() returns the stored reference.

joining_future<void>::get() returns nothing.

Throws: If the invocation has been deferred, then any exception from the invocation. Otherwise, the stored exception, if an exception was stored and not retrieved before.

Synchronization: If the invocation has been deferred, and the return from the invocation happens before (1.10 [intro.multithread]) the get returns. Otherwise, as if associated-thread.join().

Remark: the effect of calling get() a second time on the same joining_future object is unspecified.