| Document Number: | P0055R00 | 
|---|---|
| Date: | 2015-09-12 | 
| Project: | Programming Language C++, LEWG | 
| Revises: | none | 
| Reply to: | gorn@microsoft.com | 
The proposed Networking Library (N4478) uses the callback-based asynchronous model described in N4045, which has been shown to have lower overhead than asynchronous I/O abstractions based on future.then (N4399). The overhead of the Networking Library abstractions can be made even lower if they can take advantage of coroutines (N4499). This paper suggests altering the completion token transformation class templates described in N4478/[async.reqmts.async] to achieve near-zero-overhead efficiency when used with coroutines. These changes do not alter the interfaces of the asynchronous functions and do not change the performance characteristics of the Networking Library when used with callbacks.
Networking Library asynchronous functions use the class templates 
completion_handler_type_t and async_result to 
transform the CompletionToken passed to an interface function 
whose name starts with the prefix async_ into a callable function object that is 
submitted to unspecified underlying implementation functions. This 
transformation allows the same set of functions to be used with either a callback 
model or a future-based continuation mechanism. For the latter, an 
object of type use_future_t is provided in place of the callback 
parameter (e.g. async_xyz(buf, len, use_future)).
template<class CompletionToken>
auto async_xyz(T1 t1, T2 t2, CompletionToken&& token)
{
  completion_handler_type_t<decay_t<CompletionToken>, void(R1 r1, R2 r2)>
    completion_handler(forward<CompletionToken>(token));
  async_result<decltype(completion_handler)> result(completion_handler);
  async_xyz_impl(t1, t2, completion_handler); // do the work
  return result.get();
}
We propose to use a single completion_token_transform function 
to perform the transformation currently done via 
completion_handler_type_t and async_result. Not only 
does this result in less boilerplate code for the user/library developer to write, 
it also enables a zero-overhead mode when working with coroutines, as described in 
the next section.
template<class CompletionToken>
auto async_xyz(T1 t1, T2 t2, CompletionToken&& token) noexcept(auto)
{
  return completion_token_transform<void(R1 r1, R2 r2)>(
       forward<CompletionToken>(token),
       [=](auto typeErasedHandler) { async_xyz_impl_raw(t1, t2, typeErasedHandler); });
}
Let's explore how a high-level asynchronous function async_xyz 
can be built on top of a low-level os_xyz supplied by the platform. 
First, we will write the callback-based and coroutine-based solutions separately. 
Then, we will show how using completion_token_transform as 
shown in the previous section allows the same API to handle both cases 
efficiently.
Let ParamType be the type representing all the input parameters 
to an asynchronous call, ResultType be the type of the result 
provided asynchronously, and OsContext* be a pointer to a context 
structure OsContext that os_xyz requires to remain 
valid until the asynchronous operation is complete. The general shape of the 
low-level API is assumed to be as shown below.
using CallbackFnPtr = void(*)(ResultType r, OsContext*); // os wants this signature
void os_associate_completion_callback(CallbackFnPtr cb); // usually per handle or per threadpool
void os_xyz(ParamType p, OsContext* o); // initiating routine (per operation)
To transform a call to async_xyz(P, CompletionHandler) into a 
call to os_xyz, we need to type-erase the completion handler and 
pass it to os_xyz as the OsContext* parameter. In the 
completion callback, given an OsContext*, the callback downcasts it to the 
type containing the actual handler class and invokes it. In simplified form it 
can look like this:
template <typename CompletionHandler>
void async_xyz(ParamType p, CompletionHandler && cb) {
    auto o = make_unique<Handler<decay_t<CompletionHandler>>>(forward<CompletionHandler>(cb));
    os_xyz(p, o.get());
    o.release();
}
where Handler and HandlerBase are defined as follows:
struct HandlerBase : OsContext {
    CallbackFnPtr cb;
    explicit HandlerBase(CallbackFnPtr cb) : cb(cb) {}
    static void callback(ResultType r, OsContext* o) { // register this with OS
        static_cast<HandlerBase*>(o)->cb(r, o);
    }
};
template <typename CompletionHandler>
struct Handler : HandlerBase, CompletionHandler {
    template <typename CompletionHandlerFwd>
    explicit Handler(CompletionHandlerFwd&& h)
        : HandlerBase(&Handler::callback) // bases are initialized in declaration order
        , CompletionHandler(forward<CompletionHandlerFwd>(h))
    {}
    static void callback(ResultType r, OsContext* o) {
        auto me = static_cast<Handler*>(o);
        auto handler = move(*static_cast<CompletionHandler*>(me));
        delete me;  // deleting before invoking improves allocator behavior,
        handler(r); // as the handler is likely to request a similar block, which can then be reused immediately
    }
};
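To make the type-erasure path above concrete, here is a minimal, self-contained sketch that runs it against a mock platform layer. The mock queue g_pending, the os_run pump, the "result is p * 2" rule and the run_callback_demo function are assumptions made purely for illustration; they are not part of N4478 or of any real OS API. Handler and HandlerBase follow the definitions above, with std:: qualification added.

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for the platform layer (not a real OS API).
using ParamType = int;
using ResultType = int;
struct OsContext {};
using CallbackFnPtr = void (*)(ResultType, OsContext*);

struct Pending { ParamType p; OsContext* ctx; };
static std::queue<Pending> g_pending;       // mock queue of started operations
static CallbackFnPtr g_callback = nullptr;  // callback registered with the mock "OS"

void os_associate_completion_callback(CallbackFnPtr cb) { g_callback = cb; }
void os_xyz(ParamType p, OsContext* o) { g_pending.push(Pending{p, o}); }

// Mock completion pump: finishes every queued operation with the result p * 2.
void os_run() {
    while (!g_pending.empty()) {
        Pending op = g_pending.front();
        g_pending.pop();
        g_callback(op.p * 2, op.ctx);
    }
}

struct HandlerBase : OsContext {
    CallbackFnPtr cb;
    explicit HandlerBase(CallbackFnPtr cb) : cb(cb) {}
    static void callback(ResultType r, OsContext* o) {  // registered with the mock OS
        static_cast<HandlerBase*>(o)->cb(r, o);
    }
};

template <typename CompletionHandler>
struct Handler : HandlerBase, CompletionHandler {
    template <typename Fwd>
    explicit Handler(Fwd&& h)
        : HandlerBase(&Handler::callback), CompletionHandler(std::forward<Fwd>(h)) {}
    static void callback(ResultType r, OsContext* o) {
        auto me = static_cast<Handler*>(o);  // downcast back to the concrete type
        auto handler = std::move(*static_cast<CompletionHandler*>(me));
        delete me;                           // free the block before invoking the handler
        handler(r);
    }
};

template <typename CompletionHandler>
void async_xyz(ParamType p, CompletionHandler&& cb) {
    auto o = std::make_unique<Handler<std::decay_t<CompletionHandler>>>(
        std::forward<CompletionHandler>(cb));
    os_xyz(p, o.get());
    o.release();  // ownership now belongs to the pending operation
}

// Starts two operations and pumps the mock OS; each one costs a heap allocation.
int run_callback_demo() {
    os_associate_completion_callback(&HandlerBase::callback);
    int sum = 0;
    async_xyz(20, [&sum](ResultType r) { sum += r; });
    async_xyz(1, [&sum](ResultType r) { sum += r; });
    os_run();
    assert(g_pending.empty());
    return sum;
}
```

Calling run_callback_demo() from a main function returns 42 (40 from the first operation plus 2 from the second). Each call to async_xyz performs one make_unique, which is exactly the per-operation cost discussed in the text.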
While sophisticated implementations may utilize specialized allocation / deallocation functions to lessen the overhead of type erasure and memory allocations, the overhead cannot be eliminated completely in a callback model.
However, when an asynchronous API is used in a coroutine, no type erasure or memory allocation needs to be performed at all. Not only does this result in less code and faster execution, it also eliminates the sole source of failure in the async APIs, allowing the library to mark async_xxx functions as noexcept.
Let's compare this with mapping async_xyz to os_xyz when 
used in a coroutine. To be usable in an await expression (N4499/[expr.await]), 
the async_xyz(P, use_await_t) function needs to return an object with 
member functions await_ready, await_suspend and await_resume defined as follows:
auto async_xyz(ParamType p, use_await_t = use_await_t{}) {
    struct Awaiter : AwaitableBase {
        ParamType p;
        explicit Awaiter(ParamType & p) : p(move(p)) {}
        bool await_ready() { return false; } // the operation has not started yet
        auto await_resume() { return move(this->result); } // unpack the result when done
        void await_suspend(coroutine_handle<> h) { // call the OS and setup completion
            this->resume = h;
            os_xyz(p, this);
        }
    };
    return Awaiter{ p };
}
where AwaitableBase is defined as follows:
struct AwaitableBase : HandlerBase {
    coroutine_handle<> resume;
    ResultType result;
    AwaitableBase() : HandlerBase(&AwaitableBase::Callback) {}
    static void Callback(ResultType r, OsContext* o) {
        auto me = static_cast<AwaitableBase*>(o);
        me->result = r;
        me->resume();
    }
};
The following example illustrates how a compiler transforms the expression 
await async_xyz(p).
Note the absence of memory allocations / 
deallocations and of type erasure of any kind. 
ResultType r = await async_xyz(p);
becomes
     async_xyz`Awaiter __tmp{p}; 
     $promise.resume_addr = &__resume_label;   // save the resumption point of the coroutine
     __tmp.resume = $RBP;                      // inlined await_suspend
     os_xyz(p, &__tmp);                        // inlined await_suspend
     jmp Epilogue; // suspends the coroutine
__resume_label:    // will be resumed at this point once the operation is finished
     ResultType r = move(__tmp.result); // inlined await_resume
Given the public async function async_xyz defined as described in the Overview section (and repeated below for the reader's convenience)
template<class CompletionToken>
auto async_xyz(T1 t1, T2 t2, CompletionToken&& token) noexcept(auto)
{
  return completion_token_transform<void(R1 r1, R2 r2)>(
       forward<CompletionToken>(token),
       [=](auto typeErasedHandler) { async_xyz_impl_raw(t1, t2, typeErasedHandler); });
}
with completion_token_transform defined as follows, we can 
achieve the same efficient implementation of the asynchronous function when using 
callbacks:
template <typename Signature, typename CompletionHandler, typename Invoker>
void completion_token_transform(CompletionHandler && fn, Invoker invoker)
{
    auto p = make_unique<Handler<decay_t<CompletionHandler>>>(forward<CompletionHandler>(fn));
    invoker(p.get());
    p.release(); // if we reached this point, handler is owned by async activity and unique_ptr can relinquish the ownership
}
By defining an overload for use_await_t, we get an efficient 
implementation of async_xyz when it is used in coroutines.
template <typename Signature, typename Invoker>
auto completion_token_transform(use_await_t, Invoker invoker)
{
    struct Awaiter : AwaitableBase, Invoker {
        bool await_ready() { return false; }
        ResultType await_resume() { return move(this->result); }
        void await_suspend(coroutine_handle<> h) {
            this->resume = h;
            static_cast<Invoker*>(this)->operator()(this);
        }
        Awaiter(Invoker& invoker) : Invoker(move(invoker)) {}
    };
    return Awaiter{ invoker };
}
And finally, for completeness, here is what the 
completion_token_transform overload for use_future_t 
looks like:
template <typename Signature, typename Invoker>
auto completion_token_transform(use_future_t, Invoker invoker) {
    struct FutHandler {
        promise<ResultType> p;
        void operator()(ResultType r) { p.set_value(move(r)); }
    };
    auto p = make_unique<Handler<FutHandler>>(FutHandler{});
    auto f = p->p.get_future();
    invoker(p.get());
    p.release();
    return f;
}
The proposed changes improve the efficiency of the Networking Library by altering the mechanism by which the high-level public API interprets the CompletionToken when invoking the unspecified internal implementation. If this direction has support, the author of this paper will gladly help the author of the Networking Library proposal to flesh out the relevant details and to test the proposed changes using the coroutines available in the MSVC compiler.
There is an upcoming proposal (see [c++std-ext-17433]) to add a [[nodiscard]] attribute/context-sensitive keyword applicable to classes and functions. If that attribute is applied to the awaiter class returned from completion_token_transform, it becomes safe to make use_await_t the default CompletionToken for all async_xyz APIs.
template<class CompletionToken = use_await_t>
auto async_xyz(T1 t1, T2 t2, CompletionToken&& token  = use_await_t{}) noexcept(auto)
{
  return completion_token_transform<void(R1 r1, R2 r2)>(
       forward<CompletionToken>(token),
       [=](auto typeErasedHandler) { async_xyz_impl_raw(t1, t2, typeErasedHandler); });
}
If a user accidentally writes async_xyz(t1,t2) instead of 
await async_xyz(t1,t2), the mistake will be caught at compile time 
thanks to the nodiscard tag on the awaitable class.
Moreover, given that coroutines combine the coding simplicity of synchronous 
functions with the efficiency and scalability of asynchronous I/O, we may 
choose to give the nicest names (send, receive, accept) to the asynchronous 
functions and use the CompletionToken form of the API to deal with all 
cases. A single API function async_xyz can then be used for all 
flavors of an operation. This shrinks the required API surface by two thirds.
Instead of 3 forms of every API:
   void send(T1,T2);
   void send(T1,T2,error_code&);
   void async_send(T1,T2, CompletionToken);
We can use a single form
   auto send(T1,T2,CompletionToken);
To be used as follows:
   await send(t1,t2); // CompletionToken defaults to use_await_t as being the most efficient and convenient way of using the async API
   send(t1,t2,block); // synchronous version throwing an exception
   send(t1,t2,block[ec]); // synchronous version reporting an error by setting error code into ec
   send(t1,t2,[]{ completion }); // asynchronous call using callback model
   auto fut = send(t1,t2,use_future); // completion via future
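One possible reading of the block token in the list above is yet another completion_token_transform overload that starts the operation and then waits for its completion. The sketch below is only an illustration under stated assumptions: block_t, os_poll_one, send_impl_raw and run_send_demo are hypothetical names, the mock layer "completes" a send with t1 + t2, waiting is modeled by polling a queue, and error reporting (the exception and block[ec] forms) is left out.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <type_traits>
#include <utility>

using ResultType = int;
struct OsContext {};
using CallbackFnPtr = void (*)(ResultType, OsContext*);

struct HandlerBase : OsContext {
    CallbackFnPtr cb;
    explicit HandlerBase(CallbackFnPtr cb) : cb(cb) {}
    static void callback(ResultType r, OsContext* o) { static_cast<HandlerBase*>(o)->cb(r, o); }
};

template <typename CompletionHandler>
struct Handler : HandlerBase, CompletionHandler {
    template <typename Fwd>
    explicit Handler(Fwd&& h)
        : HandlerBase(&Handler::callback), CompletionHandler(std::forward<Fwd>(h)) {}
    static void callback(ResultType r, OsContext* o) {
        auto me = static_cast<Handler*>(o);
        auto handler = std::move(*static_cast<CompletionHandler*>(me));
        delete me;
        handler(r);
    }
};

// Mock platform layer (hypothetical): operations complete in FIFO order.
struct Op { ResultType value; OsContext* ctx; };
static std::deque<Op> g_ops;
void send_impl_raw(int t1, int t2, OsContext* ctx) { g_ops.push_back(Op{t1 + t2, ctx}); }
bool os_poll_one() {  // delivers one completion; returns false if nothing is pending
    if (g_ops.empty()) return false;
    Op op = g_ops.front();
    g_ops.pop_front();
    HandlerBase::callback(op.value, op.ctx);
    return true;
}

struct block_t {};
constexpr block_t block{};

// Callback tokens: heap-allocated, type-erased handler.
template <typename Signature, typename CompletionHandler, typename Invoker>
void completion_token_transform(CompletionHandler&& fn, Invoker invoker) {
    auto p = std::make_unique<Handler<std::decay_t<CompletionHandler>>>(
        std::forward<CompletionHandler>(fn));
    invoker(p.get());
    p.release();
}

// block token: the context lives on the caller's stack until completion.
template <typename Signature, typename Invoker>
ResultType completion_token_transform(block_t, Invoker invoker) {
    struct Waiter : HandlerBase {
        ResultType result{};
        bool done = false;
        Waiter() : HandlerBase(&Waiter::on_complete) {}
        static void on_complete(ResultType r, OsContext* o) {
            auto me = static_cast<Waiter*>(o);
            me->result = r;
            me->done = true;
        }
    };
    Waiter w;
    invoker(&w);
    while (!w.done && os_poll_one()) {}  // a real implementation would block in the OS
    return w.result;
}

template <class CompletionToken>
auto send(int t1, int t2, CompletionToken&& token) {
    return completion_token_transform<void(ResultType)>(
        std::forward<CompletionToken>(token),
        [=](auto typeErasedHandler) { send_impl_raw(t1, t2, typeErasedHandler); });
}

int run_send_demo() {
    int from_callback = 0;
    send(1, 2, [&](ResultType r) { from_callback = r; });  // completes later
    int from_block = send(30, 9, block);                   // returns the result directly
    assert(g_ops.empty());
    return from_callback + from_block;
}
```

Calling run_send_demo() returns 42. While the blocking call polls, the earlier callback-based send also completes, which mirrors how a single completion source would serve both flavors.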
The benefit of this approach extends beyond the Networking Library to other future standard or non-standard libraries that model their APIs on CompletionToken/completion_token_transform.
Great thanks to Christopher Kohlhoff whose N4045 provided the inspiration for this work.
N4045: Library Foundations for Asynchronous Operations, Revision 2
N4399: Technical Specification for C++ Extensions for Concurrency
N4478: Networking Library Proposal (Revision 5)
N4499: Draft Wording For Coroutines (Revision 2)
[c++std-ext-17433] Andrew Tomazos: Draft proposal of [[unused]], [[nodiscard]] and [[fallthrough]] attributes.