[Cplex] Comments on the straw proposal and clarification of OpenMP
tom at scogland.com
Tue Jun 18 16:41:27 CEST 2013
Hello all. While I was on the call yesterday, I was largely unable to
participate due to a number of technical problems on my end. Below is a
set of comments on the current straw proposal, along with clarifications on
the current understanding of OpenMP and proposals for addressing certain
discrepancies; each is numbered with the section of the straw proposal it
Before I move to comments on the current document, we need to clear up a
couple of things about how OpenMP actually behaves. There seems to be a
belief that specifying "num_threads" always results in that number of
threads being created; this is not accurate. The implementation will try to
create that many threads, but will often fail and simply report the number
that is actually available at that level. In that sense,
over-subscription above the level of "max-threads" is not possible.
Programs which depend on getting the number of threads they request are
non-conforming to the standard and thus experience implementation-defined
behavior. Almost all of the worries I heard in the call yesterday, and
some through this mailing list, with regard to nested parallelism et al.
are based on the thread creation and management part of OpenMP, which does
not offer any guarantees.
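To make the num_threads point concrete, here is a minimal sketch of code
that asks for a team size and then checks what it actually received. The
function name observed_team_size is my own, and the fallback stub for
omp_get_num_threads is only an assumption so that the sketch also builds
with a compiler that ignores the OpenMP pragmas.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Assumed fallback: without OpenMP the region runs on one thread. */
static int omp_get_num_threads(void) { return 1; }
#endif

/* Request `requested` threads and return the team size actually granted.
 * (Hypothetical helper, not part of OpenMP or the straw proposal.) */
int observed_team_size(int requested)
{
    int actual = 0;

    assert(requested > 0);

    #pragma omp parallel num_threads(requested)
    {
        /* Exactly one thread of the team records the team size. */
        #pragma omp single
        actual = omp_get_num_threads();
    }
    return actual;
}
```

A caller should treat the return value as the only reliable figure: it may
be smaller than the request, so work has to be divided using the observed
size rather than the requested one.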
Since I think this is the most important, I will list it first. Section
1.6 describes a proposal for a Cilk-like model in which there is little
control over the nesting or specification of concurrency. It may be
worthwhile to support this, but if there is no way to specify it, many users
will simply use something else (I would, no question). I would propose
that we consider a way to specify a thread pool to be used by the runtime
system for tasks associated with that pool, perhaps tied to the Cilk block
or similar region based design as discussed in the document. If these are
composed of C11 standard threads, there will be interoperability with the
existing spec, extremely low level control for those who want it, and the
option to ignore it for those who do not care.
That said, creating tasks for all constructs, including loops (discussed
more below), is a good idea which I fully support.
1.2) Cilk tasks require function calls whereas OpenMP tasks do not. While
it could be reasonable to go either way, it may be beneficial to
investigate pulling in either the N1451 "blocks" proposal for C or an
adaptation of the C++ N3092 lambdas design for handling of regions. This
project could benefit greatly from their presence, and the additional
functionality would be useful elsewhere in the language as well.
In effect we would gain the scoping benefits of functions, a call site to use
in the runtime implementation, and the capability to (somewhat) transparently
inherit scope where desirable, the last of these being a significant
benefit of the existing OpenMP approach.
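As an illustration of the scope inheritance mentioned in 1.2, the following
is a minimal sketch of OpenMP tasks written as inline blocks rather than
function calls. The function name sum_halves and its parameters are my own
choices for the example. Each task body reads `data`, `mid` and `n` directly
from the enclosing scope, which a spawn that requires a function call would
have to pass as arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array with two tasks, each an inline block that uses variables
 * from the enclosing function. Without OpenMP support the pragmas are
 * ignored and the blocks simply run one after the other. */
long sum_halves(const int *data, size_t n)
{
    long lo = 0, hi = 0;
    size_t mid = n / 2;

    assert(data != NULL || n == 0);

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(lo)
        {
            for (size_t i = 0; i < mid; ++i)
                lo += data[i];
        }
        #pragma omp task shared(hi)
        {
            for (size_t i = mid; i < n; ++i)
                hi += data[i];
        }
        /* Wait for both child tasks before the results are combined. */
        #pragma omp taskwait
    }
    return lo + hi;
}
```

A blocks or lambda facility in the language would let a Cilk-style spawn
capture its surroundings in much the same way.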
1.3) This proposal is reasonable, but there is a significant difference
between omp parallel and a "_Cilk_block" as proposed. The omp parallel
region both creates a contention group and may or may not spawn a team of
threads to execute the region. Given the discussion yesterday, it seems
that we are focused on expressing available parallelism rather than dealing
explicitly with thread creation. In that sense, a construct such as the
_Cilk_block proposal makes sense.
Even so, allowing a user to attach to a region a group of threads created
with the C11 primitives for that purpose would handle the issue of allowing
explicit thread management. It would also tie our work to that standard,
easing interoperability with other libraries which will already be
considering interoperability with C11 threads by the time this
1.5) Both Cilk+ and OpenMP decompose loops into tasks; the difference is
that OpenMP coalesces tasks into groups of at minimum "chunk_size"
iterations and allows users to specify the policy for splitting the loop
into tasks. A loop with the OpenMP static schedule is really just
decomposed into "NUM_THREADS" tasks and handed out, though it is generally
assumed that each thread will execute exactly one of these tasks rather
than having them assigned dynamically.
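The static decomposition described in 1.5 can be sketched as plain
arithmetic. The type chunk_t and the function static_chunk are hypothetical
names for this example, and the even split with the remainder going to the
lowest-numbered tasks is only one common choice: OpenMP leaves the exact
chunk sizes unspecified when no chunk_size is given.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open iteration range [begin, end) assigned to one task. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Split n iterations into num_threads contiguous chunks and return the
 * chunk for task `tid`. The first (n % num_threads) tasks get one extra
 * iteration so that the chunk sizes differ by at most one. */
chunk_t static_chunk(size_t n, size_t num_threads, size_t tid)
{
    assert(num_threads > 0 && tid < num_threads);

    size_t base  = n / num_threads;
    size_t extra = n % num_threads;
    size_t begin = tid * base + (tid < extra ? tid : extra);
    size_t len   = base + (tid < extra ? 1 : 0);

    chunk_t c = { begin, begin + len };
    return c;
}
```

Seen this way, a static loop is a fixed list of num_threads tasks; whether
task `tid` must run on thread `tid`, or may be handed to any idle worker, is
exactly the scheduling policy question.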
1.7) Reductions are definitely a point of difference. It may be worthwhile
to consider both approaches, but remember also that C11 defines atomic
operations. It may be reasonable to offer an ordered reduction primitive
and simply specify that non-ordered or unstructured reductions should be
accomplished through the already standard atomics.
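For the non-ordered case, a reduction built on the C11 atomics might look
like the sketch below. The function name atomic_sum is my own, and combining
<stdatomic.h> with an OpenMP loop is an assumption about how the two would
interact rather than something either standard promises; without OpenMP
support the pragma is ignored and the loop runs serially.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Accumulate into a shared C11 atomic. The additions may happen in any
 * order, which is acceptable for integer sums but is precisely why an
 * ordered reduction would need its own primitive. */
long atomic_sum(const int *data, size_t n)
{
    atomic_long total;

    assert(data != NULL || n == 0);
    atomic_init(&total, 0);

    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i) {
        /* Relaxed ordering is enough: only the final total is read. */
        atomic_fetch_add_explicit(&total, (long)data[i],
                                  memory_order_relaxed);
    }
    return atomic_load(&total);
}
```

Every iteration contends on the same variable here, so this shows the
semantics rather than a fast implementation; a real runtime would probably
combine per-task partial sums first.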
"A little knowledge is a dangerous thing.
So is a lot."