Submitters: Tom Scogland
James Beyer
Jeff Larkin
Stephen Olivier
Bronis de Supinski

Submission Date: March 7, 2017
Document: WG14 N2131
Title: Concerns with CPLEX Working Draft (N2017)

# Concerns with CPLEX Working Draft (N2017)


Reply to:

* Tom Scogland scogland1@llnl.gov
* Bronis de Supinski bronis@llnl.gov


## Introduction

We have several concerns about the proposed extensions in the CPLEX working
draft and how it is represented to the wider community.

We write this message from our perspective as leaders of the OpenMP community
in general and the OpenMP Language Committee in particular.  For the technical
reasons detailed below we feel that the proposal fails to address the needs of
users adequately and that its adoption into the C standard without major
revisions would be a significant error.

## Synchronization Requirements

The requirement that every task must have an associated task block, and API or
ABI pollution caused by task spawning functions, means that the draft requires
significant amounts of implied synchronization.  The synchronization cannot be
eliminated without whole-program analysis, and even then may leave unnecessary
synchonization due to conditional spawns.   Since the semantics require that
"execution joins with all child tasks spawned directly or indirectly within
the compound statement," Without the implied synchronization, no mechanism
could ensure that a program could not observe the lack of the synchronization.

OpenMP does not require this synchronization because tasks are not required to
be part of the equivalent of a task block (a task group in OpenMP) but only
that they be created within a parallel region which can be of arbitrary scope
without impacting the signatures of intervening functions.

As a trivial example, an OpenMP function that calls two functions as tasks
which create tasks with only one synchronization at the end could be written
like this, note synchronization points are marked with comments:

```c
void other_task_spawner_1();
void other_task_spawner_2();
void task_spawner() {
     #pragma omp parallel
     #pragma omp single
     {
         #pragma omp task
         other_task_spawner_1();
         #pragma omp task
         other_task_spawner_2();
     }// Task synchronization here
}
```

The CPLEX draft allows this to be written as follows:

```c
void other_task_spawner_1() _Task _Call;
void other_task_spawner_2() _Task _Call;
void task_spawner() {
     _Task _Block {
         _Task _Spawn {
             other_task_spawner_1();
         }
         _Task _Spawn {
             other_task_spawner_2();
         }
     }// Task synchronization here
}
```

This maintains the synchronization properties but pollutes the
signatures of
the two functions.  Removing that pollution requires otherwise
unnecessary
synchronization:


```c
void other_task_spawner_1();
void other_task_spawner_2();
void task_spawner() {
     _Task _Block {
         _Task _Spawn {
             other_task_spawner_1();// Task synchronization here
         }
         _Task _Spawn {
             other_task_spawner_2();// Task synchronization here
         }
     }// Task synchronization here
}
```

In our experience, many people complain that OpenMP is heavy in
synchronization. Sometimes this complaint is well founded, we are quite aware
of the prevalence of barriers in naive OpenMP code, but OpenMP provides
mechanisms, such as the nowait clause, to indicate that the synchronization is
not needed algorithmically. It is a major concern that the CPLEX draft
currently requires as much, or in some cases more, synchronization as an
equivalent OpenMP program, in some cases forcing complete task barriers that
are algorithmically unnecessary. The draft does not provide sufficient
mechanisms to eliminate the implied synchronization, which will limit the
scalability of the proposed extension.

## Concurrency and Interoperability

The draft does not support producer/consumer patterns or any form of explicit
concurrency control where concurrency, or at least apparently concurrent
progress, is required for correctness. These forms, even when written with
other models that do support such guarantees, would become harder to verify
for correctness in the presence of the CPLEX constructs.  This leads to many
potential composability issues with other models, including C11 standard
threads and the backbones of system software on several major platforms
like pthreads, windows threads, OpenMP and GCD.  This might be less of a
concern if it seemed possible to extend the model to support concurrency
constructs in the future, but adding such support would invalidate the
fundamental requirements of the CPLEX model.  As an example, a simple
program requiring a progress guarantee could be written like this:


```c
void producer_consumer() {
    thread_safe_concurrent_queue_of_int q;
     _Task _Block {
         _Task _Spawn {
             produce_until_done(&q);
         }
         _Task _Spawn {
             consume_until_done(&q);
         }
     }// Task synchronization here
}
```

Admittedly this an arbitrary example, but it showcases the type of code that
cannot be written safely with the proposed extensions.  Even if two worker
threads exist, CPLEX makes no guarantee that a program like this will make
progress, it may deadlock, or not.  One of the main reasons given for this is
to allow the program to execute with a single thread.  In a world without
compiler control, and without well-explored bulletproof models for handling
that problem, it's a reasonable contention, but consider the equivalent golang
program:


```golang
func foo() {
    q := make(chan int)
    go produce_until_done(&q)
    consume_until_done(&q)
}
```

This version is shorter, supports multiple threaded or single-threaded
execution, and guarantees progress through both (barring abberant behavior
inside one of the elided functions).  OpenMP supports this kind of parallelism
in the same way that C11 threads does, by allowing the user to determine the
number of concurrent threads and depend that all such threads will make
progress independently.

Since no concurrency is guaranteed by CPLEX tasks, no two tasks can acquire
the same lock, or similar concurrency control mechanism, without risking a
non-deterministic deadlock that may occur in a single-threaded run of the
program.  Since any model that deals with concurrency uses such mechanisms to
ensure correctness, this creates a large body of existing programs that will
present a barrier to adoption due to their existing threaded interactions.

## Viral Annotations

The requirements for task blocks and function type annotations to support
spawning are another major limitation and are impractical for large code
bases. In Section 11.2 on "The task block statement", the draft states
"Determination of the associated task block is a hybrid process: lexically
within a function, and dynamically across calls to spawning functions." A
footnote on this sentence states "In Cilk, this determination can be done
entirely lexically. In OpenMP, this determination can be done entirely
dynamically." However, the draft's wording about associated task blocks, and
the requirement that all task spawn statements have an associated call block,
combine to require any function that contains an escaping task spawn statement
or calls such a function (and so on) must be a task spawning function. That
is, the function definition and all calls to the function must be annotated
with "_Task _Call". The main reason for this is that performant
implementations, based on the task-first task-stealing and stack tagging
approach taken by Cilk, require modification of the construction of the
call-site to function correctly when tasks can escape a function invocation.
This can actually be avoided by restoring CilkPlus's requirement that no tasks
can escape a function, but at the cost of even less control over
synchronization requirements.  Requiring annotations of every declaration and
definition for which tasks may escape significantly limits the ability to use
existing source code within CPLEX tasks, and further to invoke spawning
functions from legacy code that cannot change ABIs.

## Loop Hints

The loop hint parameters, which we believe are the one aspect that takes
inspiration from OpenMP, fail to deliver the capabilities of their OpenMP LOOP
construct counterparts. These parameters are all "recommended" but provide the
programmer with no guarantees as to whether they will be honored.  In OpenMP,
these same "parameters" are required so that the programmer can tune code
performance reliably. The OpenMP community has found that the ability to
reason about the operations performed by the compiler and runtime are critical
to the programmer's success.

Even if the parameters are honored, they will frequently fail to provide the
desired tuning effect. For example, the num_threads parameter would normally
be a mechanism to avoid oversubscription in the presence of other models, but
does not appear to offer any guarantees to set a minimum or maximum. The
chunk_size parameter is similarly anemic -- it sets a maximum, which can be
honored by using one iteration per thread of execution, when the programmer
will frequently want to set a minimum number of iterations per chunk to
control the overhead spent on task creation, except possibly for the final set
of chunks.

## Conclusions

Even if one accepts the rejection of a thread-oriented model that it embodies,
the working draft does not represent a compromise between CilkPlus and OpenMP,
as we have heard some characterize it. The semantics defined in the working
draft are almost entirely CilkPlus or Cilk derived, with a modicum of some
OpenMP inspired syntax but lacking OpenMP's associated semantic requirements.
The proposal should be re-evaluated for its failure to draw more widely on the
large body of existing experience writing parallel programs both in OpenMP and
in broad variety of other models supporting C.

For these reasons, we believe the working draft should not be adopted as it
stands, and alternative mechanisms for exploiting parallelism and concurrency
in C should be sought. We hope you will find these observations helpful in
delivering a solution that meets the needs of wide range of applications and
the long term future of the C standard.