[Cplex] Integrating OpenMP and Cilk into C++

Bronis R. de Supinski bronis at llnl.gov
Fri Jun 21 23:35:59 CEST 2013


Pablo:

OK, I appreciate your question. The answer is that
most implementations do succeed in not oversubscribing
the inner loop, at least in an HPC context. They cap
the total number of threads at the available hardware
concurrency and simply serialize once no more threads
are available.

The real difficulty is retaining some hardware concurrency
for the inner loop. The basic way that is done in OpenMP
is to use the OMP_NUM_THREADS environment variable or the
num_threads clause so that you reserve threads for the
inner context. This solution puts the burden on the user.

Another solution is available if you have closely nested
loops, which is to collapse them into a single iteration
space with the collapse clause. This solution can be
useful for other loop schedules also.

I can imagine ways to provide similar, more general
solutions, although my experience is that the issue is
less significant for real HPC codes than the concern
here suggests. Either the loops tend to be closely
nested, or the programmer has a fairly good idea of
how they want to partition the parallelism and can use
expressions built from omp_get_max_threads to leave
threads available for the inner levels.

Bronis



On Fri, 21 Jun 2013, Pablo Halpern wrote:

> This seems like a good opportunity for me to get educated and, from what
> I've seen, I'm not alone in needing to be educated in this way.  I'm
> wondering how static scheduling can be composable.  Consider the
> following (avoiding any specific syntax for fear that I'll get it wrong):
>
> void f() {
>     parallel_statically_scheduled_loop (int i = 0; i < 1000; ++i)
>         g(i);
> }
>
> void g(int i) {
>     parallel_statically_scheduled_loop (int j = 0; j < 1000; ++j)
>         compute(i, j);
> }
>
> If there are N cores available, how do you avoid having N-squared
> threads in the inner loop?
>
> Anticipating that you might tell me that the scheduler detects that
> there are no more threads available, and therefore runs the inner loop
> serially, I'll ask a follow-up question: do any implementations actually
> do this?  If not, why not? (It's been a well-known problem for a long
> time, so it's surprising to me that no implementation would have fixed
> it, if that's the right thing to do.)
>
> I empathize with the annoyance of people making incorrect assumptions
> about a model they don't understand (they do it with Cilk all the time),
> so I'll try to refrain from doing that with OpenMP in the future.
>
> Thanks,
> -Pablo
>
> On 06/21/2013 04:26 PM, Bronis R. de Supinski wrote:
>>
>> Pablo:
>>
>> Re:
>>> There is no need to get defensive.  I am not attacking OpenMP.
>>> However, it is well known that widely-used OpenMP features,
>>> particularly static scheduling, do not compose well with libraries
>>> that also use parallelism. If the author of a piece of code does not
>>> know whether that code might be called in a parallel context, then he
>>> cannot use parallelism without risking exponential oversubscription.
>>> (I have seen this happen, often). If run on a desktop or mobile system
>>> rather than a dedicated HPC system, static scheduling creates load
>>> imbalances that hurt performance.
>>
>> While I will agree that composability with other parallelism
>> models is a weakness of OpenMP, composability with itself is
>> not. The issue that you raise is not one of the model or the
>> specification but rather one of quality of implementation.
>>
>> Omitting static scheduling would be a mistake. Many, many
>> situations are well suited to it and it is a natural, low
>> overhead concept.
>>
>> Bronis
>>
>
> _______________________________________________
> Cplex mailing list
> Cplex at open-std.org
> http://www.open-std.org/mailman/listinfo/cplex
>

