Dynamic Initialization and Destruction with Concurrency

ISO/IEC JTC1 SC22 WG21 N2325 = 07-0185 - 2007-06-22

Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org

This document is a revision of N2148 = 07-0008 - 2007-01-13.

Introduction

Concurrency introduces potential deadlock or data races in the dynamic initialization and destruction of static-duration objects. Because the opportunity for programmers to manually synchronize such objects is limited by a lack of syntactic handles, the language must introduce new syntax, define synchronization, or limit programs.

This proposal breaks the problem into three reasonably separable parts: initialization of function-local static-duration objects, initialization of non-local static-duration objects, and destruction of all static-duration objects.

This proposal exploits properties of a new algorithm for fast synchronized initialization by Mike Burrows of Google. (See the appendix for an implementation and an argument for its correctness.) The algorithm has three attributes signficant to its use here.

Function-Local Initialization

The core problem with function-local static-duration object initialization is that the containing function may be invoked concurrently, and thus the definition may execute concurrently. Failing to synchronize may introduce a race condition. The obvious solution is to synchronize. The problem is that such synchronization may introduce deadlock if the synchronization involves holding a lock while the initializer executes.

The proposal relies on Burrows' algorithm to avoid holding a lock while the initializer executes. Any deadlock that occurs as a result of initialization, must necessarily be a race condition in the absence of synchronization.

Implementation

The GNU compiler currently implements synchronized function-local variable initialization, though it holds the lock during initialization.

6.7 Declaration statement [stmt.dcl]

In paragraph 4, edit

Otherwise such an object is initialized the first time control passes through its declaration; such an object is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control enters the declaration concurrently while the object is being initialized, the concurrent execution waits for completion of the initialization. The implementation shall not introduce any locking around execution of the initializer. If control re-enters the declaration (recursively) while the object is being initialized, the behavior is undefined.

Non-Local Initialization

There are two problems with non-local static-duration object initialization. The first problem is that such initialization may occur concurrently. There are two sources to the concurrency, dynamic opening of a shared library in a multi-threaded context and creating of multiple threads in initializers, and hence the potentially implied concurrent execution of other initializers. The second problem is that static-duration object initialization may constitute a large fraction of process time, particularly when users are waiting for program start, and hence may benefit from parallel execution.

The standard currently admits unordered initialization. When initializing one object, reference to a relatively unordered object may result in access before dynamic initialization but after zero initialization. Since the concept of zero initialization has limited utility for many objects, particularly those that require dynamic construction, we consider this feature of the current C++ standard to be of limited utility. Therefore, we propose to make such references undefined.

Making such references undefined would normally imply that std::cout and std::operator new could not be used by any dynamic initializer. To resolve this problem, the standard should provide additional guarantees, and it currently appears to do so.

In the sequential 2003 standard, unspecified order of initialization essentially required some serial order of initialization. In the parallel proposed standard, the unspecified order of initialization admits concurrent initialization. The necessary additional restriction is that a single object may be initialized at most once. This restriction may require locking. With concurrent initialization, any write (or read in the presence of writes) to statically-initialized objects must be properly locked. An alternate approach, not proposed here, is to require a single global order for all initialization.

The implementation must initialize static-duration objects before any of their use within main or the functions it calls.

Implementation

To the best of our knowledge, no compiler synchronizes the initialization of non-local static-duration objects. However, the technical issues are similar to existing function-local initializations, and so present no new challenges.

3.6.2 Initialization of non-local objects [basic.start.init]

In paragraph 1, edit

Dynamic initialization of an a non-function-local static-duration object is either ordered or unordered. Definitions of explicitly specialized class template static data members have ordered initialization. Other class template static data members (i.e., implicitly or explicitly instantiated specializations) have unordered initialization. Other objects defined in namespace scope have ordered initialization. Such objects Objects within a single translation unit and with ordered initialization are relatively ordered, otherwise they are relatively unordered. Relatively ordered objects shall be initialized in the order of their definitions within the translation unit. The order of initialization is unspecified for relatively unordered objects with unordered initialization and for objects defined in different translation units. If the initialization of an object uses a relatively unordered object, the behavior is undefined; no diagnostic is required. [Note: This definition permits concurrent initialization.]

Destruction

The primary problem with destruction of static-duration objects is preventing execution of the destructors before threads finish. In many applications, finishing a thread requires terminating the threads. For various reasons, this shutdown needs to be cooperative. This need in turn implies that all code be able to handle shutdown. This requirement is new; nearly all existing code is not properly written. The difficulty of the code is equivalent to exception safety, which is known to be a hard task. In addition to this problem, there may be unknown dependences between modules, and in fact may be circular dependences between modules. With circular dependences, no cooperative termination is possible. Finally, operating system constraints seem to indicate that programs blocked on I/O are not cancellable and hence cannot participate in cooperative termination. The concern is that few programs will every be able to successfully finish in this environment. Therefore, it would be better to never execute static destructors.

The Committee has clearly stated that it wishes to preserve execution of static destructors. To enable both approaches, we propose an additional function that terminates the program after executing atexit registrations but that does not execute static destructors. A consequence of this proposal is that critical shutdown code, such as flushing I/O buffers, should be accomplished in functions registered with atexit.

As stated before, the primary problem with destruction is threads accessing objects during or after their destruction. To prevent this problem, we require that all user threads terminate before destruction begins. The mechanism to terminate threads is proposed in N2320 Multi-threading Library for Standard C++ and is not yet finalized.

Destruction also introduces the potential to consume a large fraction of process time. Therefore, similar to initialization, we enable concurrent destruction.

Objects defined at namespace scope with relatively ordered initializations must be destructed in reverse order of their initialization.

The complication to this approach is destruction of function-local static-duration objects and calls to functions registered with std::atexit. Since the order of initialization of function-local static-duration objects and of calls to std::atexit is defined by execution, rather than by declaration, we call these execution-ordered objects. (For expository purposes, calls to std::atexit correspond to initialization of a virtual object.) Non-local static-duration objects are called declaration-ordered objects. For execution-ordered objects initialized from within the dynamic context of a declaration-ordered initialization, their destruction shall occur in reverse order of completion of their initialization immediately before the destruction of the context object.

To make this approach viable, initialization and destruction of execution-ordered objects must obey the same restrictions as those on declaration-ordered objects. Specifically, the initialization and destruction of an execution-ordered object must not use an object that is relatively unordered with respect to the context object and must synchronize use of objects with trivial destructors.

Finally, execution-ordered objects may be initialized outside the context of initialization of a declaration-ordered object, that is, they may be initialized from ordinary code. These objects are destructed in reverse order of the completion of their initialization before destruction of any declaration-ordered object.

Implementation

One potential concern is the run-time cost of managing destruction order. To address this issue, we provide an implementation outline below.

3.6.3 Termination [basic.start.term]

In paragraph 1, edit

Destructors (12.4) for initialized objects of static storage duration (declared at block scope or at namespace scope) are called as a result of returning from main and as a result of calling std::exit (18.4). These objects are destroyed in the reverse order of the completion of their constructor or of the completion of their dynamic initialization. Dynamic initializations of local static objects executing inside the dynamic scope of the initialization a non-local static object are relatively ordered to the non-local static object. Dynamic initializations of local static objects executing outside the dynamic scope of the initialization a non-local static object are relatively ordered to main. Objects with relatively ordered initialization shall be destroyed in reverse order of completion of their initialization. Objects with relatively unordered initialization may be destroyed concurrently, except that all objects relatively ordered to main shall be destroyed before any non-local static-duration objects. If an object is initialized statically, the object is destroyed in the same order as if the object was dynamically initialized. For an object of array or class type, all subobjects of that object are destroyed before any local object with static storage duration initialized during the construction of the subobjects is destroyed.

In paragraph 2, edit

If a function contains a local object of static storage duration that has been destroyed and the function is called during the destruction of an object with static storage duration, the program has undefined behavior if the flow of control passes through the definition of the previously destroyed local object. Likewise, the behavior is undefined if the function-local object is used indirectly (i.e. through a pointer) after its destruction.

In paragraph 3, edit

A call to std::atexit from inside the dynamic scope of the initialization of a non-local static object is relatively ordered to the non-local static object. A call to std::atexit from outside the dynamic scope of the initialization of a non-local static object is relatively ordered to main. If a function is registered with std::atexit (see <cstdlib>, 18.4) then following the call to std::exit, any relatively ordered objects with static storage duration initialized prior to the registration of that function shall not be destroyed until the registered function is called from the termination process and has completed. For an a relatively ordered object with static storage duration constructed after a function is registered with std::atexit, then following the call to std::exit, the registered function is not called until the execution of the object's destructor has completed. If std::atexit is called during the construction of an object, the complete object to which it belongs shall be destroyed before the registered function is called.

18.4 Start and termination [lib.support.start.term]

In paragraph 8, table 19, edit

Functions: abort atexit exit quick_exit

In paragraph 8, add a new bullet 0

First, terminate all threads.

In paragraph 8, existing bullet 1, edit

First, Next, objects with static storage duration are destroyed and functions registered by calling atexit are called.219) See 3.6.3 for the order of destructions and calls. Non-local objects with static storage duration are destroyed in the reverse order of the completion of their constructor. (Automatic objects are not destroyed as a result of calling exit().)218) Functions registered with atexit are called in the reverse order of their registration, except that a function is called after any previously registered functions that had already been called at the time it was registered.219) A function registered with atexit before a non-local object obj1 of static storage duration is initialized will not be called until obj1's destruction has completed. A function registered with atexit after a non-local object obj2 of static storage duration is initialized will be called before obj2's destruction starts. A local static object obj3 is destroyed at the same time it would be if a function calling the obj3 destructor were registered with atexit at the completion of the obj3 constructor.

After the existing final paragraph 9, add a new function entry:

quick_exit(int status)

Add a new paragraph 10,

Functions registered by calling atexit are called. Objects are not destroyed as a result of calling quick_exit(). Functions with relatively-ordered atexit registration (3.6.2 [basic.start.init]) are called in the reverse order of their registration, except that a function is called after any previously registered functions that had already been called at the time it was registered.

Appendix: A Fast Implementation of Synchronized Initialization

Mike Burrows, m3b at Google.

Overview

I show a fast form of concurrent synchronized initialization in the form of an alternate implementation of pthread_once.

Licensing

It is our intent to make the following technique freely available. To that end, some licensing appears to be required.

Copyright (c) 2007, Google Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Header fast_pthread_once.h

The header defines the synchronization type, defines a synchronization sentinal values, declares the synchronization function, and defines an inline fast synchronization function.


#ifndef FAST_PTHREAD_ONCE_H
#define FAST_PTHREAD_ONCE_H

#include <signal.h>
#include <stdint.h>

typedef sig_atomic_t fast_pthread_once_t;
#define FAST_PTHREAD_ONCE_INIT SIG_ATOMIC_MAX
extern __thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;

#ifdef __cplusplus
extern "C" {
#endif

extern void fast_pthread_once( pthread_once_t *once, void (*func)(void) );

inline static void fast_pthread_once_inline(
    fast_pthread_once_t *once, void (*func)(void) )
{
    fast_pthread_once_t x = *once;        /* unprotected access */
    if ( x > _fast_pthread_once_per_thread_epoch ) {
        fast_pthread_once( once, func );
    }
}

#ifdef __cplusplus
}
#endif

#endif FAST_PTHREAD_ONCE_H

Source fast_pthread_once.c

The source is written in C. The lines of the primary function are numbered for reference in the subsequent correctness argument.


#include "fast_pthread_once.h"

#include <pthread.h>

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
  /* protects global_epoch and all fast_pthread_once_t writes */

static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
  /* signalled whenever a fast_pthread_once_t is finalized */

#define BEING_INITIALIZED (FAST_PTHREAD_ONCE_INIT - 1)

static fast_pthread_once_t global_epoch = 0; /* under mu */

__thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;

static void check( int x )
{
    if ( x == 0 )
        abort();
}

void fast_pthread_once( fast_pthread_once_t *once, void (*func)(void) )
{
/*01*/    fast_pthread_once_t x = *once;        /* unprotected access */
/*02*/    if ( x > _fast_pthread_once_per_thread_epoch ) {
/*03*/        check( pthread_mutex_lock(&mu) == 0 );
/*04*/        if ( *once == FAST_PTHREAD_ONCE_INIT ) {
/*05*/            *once = BEING_INITIALIZED;
/*06*/            check( pthread_mutex_unlock(&mu) == 0 );
/*07*/            (*func)();
/*08*/            check( pthread_mutex_lock(&mu) == 0 );
/*09*/            global_epoch++;
/*10*/            *once = global_epoch;
/*11*/            check( pthread_cond_broadcast(&cv) == 0 );
/*12*/        } else {
/*13*/            while ( *once == BEING_INITIALIZED ) {
/*14*/                check( pthread_cond_wait(&cv, &mu) == 0 );
/*15*/            }
/*16*/        }
/*17*/        _fast_pthread_once_per_thread_epoch = global_epoch;
/*18*/        check (pthread_mutex_unlock(&mu) == 0);
          }
}

Correctness Argument

The above code implements pthread_once() in a way that is both fast (a few simple instructions, no memory barriers) and portable. Careful use of memory atomicity and monotonicity are used in place of a memory barrier.

The fast-path for the inline function in the header has the following code as generated by gcc 3.2.2 on an x86:

mov%gs:(%esi),%eax # access to thread-local storage
cmp%eax, pthread_once_t_location # accesses the pthread_once_t variable
jgslow_path # decides whether to call the function

This code is one more instruction than a test that is not thread safe. This code touches two cache lines, while an unsafe version would touch only one. However, the additional cache line is thread-specific, and not especially likely to cause misses. As one might expect, repeated calls to the inline version achieve around 1 billion fast-path calls per second on a 2.8GHz processor with a hot cache.

In the sketch proof that follows, assumptions and deductions are given names in square brackets. This sketch will no doubt appear to be overly pedantic to some and sloppy to others. It probably contains errors and misuse of terminology; I hope these faults are not fatal. Its purpose is to convince the reader that with sufficient effort a full proof could be constructed, and perhaps to form the basis for such a proof for anyone eager for such mental exercise.

The portability assumptions beyond those in a straightforward implementation are as follows:

[thread_local]
There is a way to access thread-specific data or thread-local storage. The technique is useful only if such an access is faster than a memory barrier. For the disassembly above, the compiler used its ability to access thread-local storage using variables declared with __thread, so the thread-local access is a normal "mov" instruction with a different segment register.
[atomicity]
There exists an integral type T (such as sig_atomic_t) such that loads and stores of a variable of type T are atomic. It is not possible to read from such a variable any value that was not written to the variable, even if multiple reads and writes occur on many processors. Another way to say this is that variables of type T do not exhibit word tearing when accessed with full-width accesses.
[*once_bound]
The number of pthread_once_t variables in a programme is bounded and can be counted in an instance of type T without overflow. If T is 32-bits, this implementation requires that the number of pthread_once_t variables in a programme not exceed 2**31-3. (Recall that pthread_once_t variables are intended to have static storage class, so it would be remarkable for such a huge number to exist.)
[monotonicity]
Accesses to a single variable V of type T by a single thread X appear to occur in programme order. For example, if V is initially 0, then X writes 1, and then 2 to V, no thread (including but not limited to X) can read a value from V and then subsequently read a lower value from V. (Notice that this does not prevent arbitrary load and store reordering; it constrains ordering only between actions on a single memory location. This assumption is reasonable on all architectures that I currently know about. I suspect that the Java and CLR memory models require this assumption also.)

Let us say that every fast_pthread_once variable *once is "finalized" when the associated initialization function "func" returns (line 7 in the code above). We want to argue [correctness] (the combination of [safety] and [liveness]) of the code above, using the assumptions above, plus a few others given below. For [safety], we require [safety_1] that each variable *once is finalized at most once. We also require [safety_2] that any thread that calls fast_pthread_once on *once does not return until *once has been finalized at least once, and all the modifications made by func are visible to that thread. For [liveness], we require that the code does not deadlock or loop forever.

For the code, all variable accesses are consistent, except the load of *once on line 1 because:

[release_consistency] We assume that we have mutexes that provide at least release consistency. We use pthread_mutex_t in the example.

[fairness] We assume that runnable threads eventually run.

[atomicity2] We assume that fast_pthread_once variables are of the type T, from [atomicity].

[initialization] We assume that fast_pthread_once_t variables are originally set to FAST_PTHREAD_ONCE_INIT.

[global_epoch_value_range] global_epoch counts the number of *once variables that have been finalized; it is incremented (line 9) each time func is called (line 7) and never modified otherwise. From [*once_bound], we know the value of global_epoch is in {0, ..., FAST_PTHREAD_ONCE_INIT-2}, because global_epoch is initially 0, and FAST_PTHREAD_ONCE_INIT is the maximum value for the type of *once.

[*once_value_sequence] Every *once variable is initialized to FAST_PTHREAD_ONCE_INIT [initialization]. There are only two places where *once is changed:

Line 5:
*once is set to BEING_INITIALIZED (==FAST_PTHREAD_ONCE_INIT-1)
Line 10:
*once is set to the value of global_epoch, which is in {0, ..., FAST_PTHREAD_ONCE_INIT-2} (from global_epoch_value_range)

The sets {FAST_PTHREAD_ONCE_INIT}, {BEING_INITIALIZED}, and {0, ..., FAST_PTHREAD_ONCE_INIT-2} have no intersection because BEING_INITIALIZED==FAST_PTHREAD_ONCE_INIT-1, and FAST_PTHREAD_ONCE_INIT is the maximum value for the type of *once. The first assignment (line 5) to BEING_INITIALIZED is predicated on *once being FAST_PTHREAD_ONCE_INIT, and the update is atomic due to the use of a mutex. Therefore, a *once may make the transition from FAST_PTHREAD_ONCE_INIT to BEING_INITIALIZED at most once. The second assignment (line 10) occurs only if the executing thread performed the first assignment; therefore the second assignment occurs at most once. So any given *once location takes on a sequence of values that is a prefix of the sequence:

FAST_PTHREAD_ONCE_INIT, BEING_INITIALIZED, E
where E < FAST_PTHREAD_ONCE_INIT, E < BEING_INITIALIZED, and E is in {0, ..., FAST_PTHREAD_ONCE_INIT-2}.

[safety_1] The function func is called at most once for each *once variable, because it is called only if the executing thread performed the assignment at line 5, which is performed at most once, from [*once_value_sequence]. This shows [safety_1].

[slow_path_safety_2] Any thread W that reaches line 10 has called func; therefore *once is finalized and the modifications from func are visible to W. If W reaches line 18, W either passed through line 10, or passed through line 13 and found *once != BEING_INITIALIZED. We know from [monotonicity] and [*once_value_sequence] that *once cannot be FAST_PTHREAD_ONCE_INIT (line 4) or BEING_INITIALIZED (line 13), so it must be some value E, a value in {0, ..., FAST_PTHREAD_ONCE_INIT-2}. The location *once takes on such a value only after being finalized. Moreover, accesses by a thread after line 13 seeing E are ordered after the assignment of E to *once and the modifications made by the call to func. This is from [release_consistency] and the facts that accesses at line 13 occur with the mutex held that the thread that finalized *once released the same mutex after doing so. Thus, if W reaches line 18, it does so after *once has been finalized, and with all modifications made by func visible to it. This shows [safety_2] for the slow path, which we call [slow_path_safety_2].

[fast_path_safety_2] We now need to show that any thread X that takes the fast path (just lines 1 and 2, with a false predicate at line 2), does so only if *once is finalized and func's modifications are visible to X. The value read from *once at line 1 is not read under mu, so the value read may be inconsistent. However, it is known to be one of the values

FAST_PTHREAD_ONCE_INIT, BEING_INITIALIZED, E

(where E is in {0, ..., FAST_PTHREAD_ONCE_INIT-2} and E < FAST_PTHREAD_ONCE_INIT and E < BEING_INITIALIZED) because of [*once_bound], [atomicity2], and [*once_value_sequence]. Further, we know that X's _fast_pthread_once_per_thread_epoch is in {0, ..., FAST_PTHREAD_ONCE_INIT-2} because it is assigned only from global_epoch (line 17) and we have [global_epoch_value_range]. Therefore, X sees the line 2 predicate as false only if both:

From the first of these plus assumptions [*once_bound], [monotonicity], and [*once_value_sequence], we know that for X to take the fast path, some thread Y executed line 10, and therefore Y finalized *once. From the second of these, we know that X must have acquired mu (mu is held at line 17) after Y finalized *once and subsequently released mu. So from [release_consistency], all the values modified by func are visible to X. This gives us [fast_path_safety_2].

[safety] Combining [safety_1], [slow_path_safety_2], and [fast_path_safety_2], we have [safety].

[mu_deadlock_freedom] From the code, while mu is held, no other mutexes are acquired nor any other blocking call made, so mu cannot participate in a deadlock.

[liveness] Liveness comes from these observations

[correctness] We have [safety] and [liveness].

Test fast_pthread_once_test.c


#include <stdio.h>
#include <assert.h>
#include "fast_pthread_once.h"

fast_pthread_once_t once = FAST_PTHREAD_ONCE_INIT;

void run_once(void)
{
    static int run_count = 0;
    assert( run_count == 0 );
    run_count++;
}

int main( int argc, char *argv[] )
{
    int use_inline = 0;
    int n = 1000 * 1000 * 1000;
    int i;
    for ( i = 1; i < argc; i++ ) {
        const char *arg = argv[i];
        if ( strcmp(arg, "-i") == 0 ) {
            use_inline = 1;
        } else if ( strspn(arg, "0123456789") == strlen(arg) ) {
            n = atoi(arg);
        } else {
            fprintf( stderr, "usage: %s [-i] [number]\n", argv[0] );
            return 2;
        }
    }
    if ( use_inline ) {
        for ( i = 0; i != n; i++ ) {
            fast_pthread_once_inline(&once, run_once);
        }
    } else {
        for ( i = 0; i != n; i++ ) {
            fast_pthread_once(&once, run_once);
        }
    }
    return 0;
}

Improvements

There are two primary sources of inefficiency in code. First, there is a single mutex for maintaining the global epoch. Second, all waiting threads are woken for each completed initialization. We do not believe these inefficiencies will matter in practice, because they become significant only for many threads over many long-running initializations. (Recall that the mutex is not held during the actual initialization.) However, for pathological programs the following improvements should suffice.