[Cplex] Integrating OpenMP and Cilk into C++

Geva, Robert robert.geva at intel.com
Sat Jun 22 09:15:42 CEST 2013

I think that, from a compiler vendor's perspective, we (at Intel) have a different impression of how problems with oversubscription occur in practice in the marketplace.
We do have paying customers who run into problems.
In particular, they run into problems when they use parallelism in their own code and also use libraries with parallelism inside.

Turning off parallelism at one of the levels, as was mentioned before, to run the inner loop serially, works for some but not all of the cases. You don't necessarily have enough parallelism at the outer level to get the expected performance.

Collapsing the loops does not work in the scenario of using parallel libraries.

I don't yet see a general solution within this discussion. We are aware that there are cases in which oversubscription can be dealt with, and mostly these do not involve cases where a single programmer writes the whole application code. We also have HPC customers who employ large teams of programmers, and who also use libraries with parallelism inside.


-----Original Message-----
From: cplex-bounces at open-std.org [mailto:cplex-bounces at open-std.org] On Behalf Of Bronis R. de Supinski
Sent: Friday, June 21, 2013 2:36 PM
To: Pablo Halpern
Cc: cplex at open-std.org
Subject: Re: [Cplex] Integrating OpenMP and Cilk into C++


OK, I appreciate your question. So the answer is that most compilers do succeed in not oversubscribing the inner loop, at least in an HPC context. They set the maximum number of threads based on the amount of hardware concurrency and simply serialize once no more threads are available.

The real difficulty is retaining some hardware concurrency for the inner loop. The basic way that is done in OpenMP is to use the OMP_NUM_THREADS environment variable or the num_threads clause so that you reserve threads for the inner context. This solution puts the burden on the user.

Another solution is available if you have closely nested loops, which is to collapse them. This solution can be useful for other loop schedules also.

I can imagine ways to provide similar, more general solutions, although my experience is that the issue is less significant for real HPC codes than seems to be the concern here. Either the loops tend to be closely nested, or the programmer has a fairly good idea of how they want to partition the parallelism and can use expressions built from omp_get_max_threads to leave threads available for the inner levels.


On Fri, 21 Jun 2013, Pablo Halpern wrote:

> This seems like a good opportunity for me to get educated and, from 
> what I've seen, I'm not alone in needing to be educated in this way.  
> I'm wondering how static scheduling can be composable.  Consider the 
> following (avoiding any specific syntax for fear that I'll get it wrong):
> void f() {
>     parallel_statically_scheduled_loop (int i = 0; i < 1000; ++i)
>         g(i);
> }
> void g(int i) {
>     parallel_statically_scheduled_loop (int j = 0; j < 1000; ++j)
>         compute(i, j);
> }
> If there are N cores available, how do you avoid having N-squared 
> threads in the inner loop?
> Anticipating that you might tell me that the scheduler detects that 
> there are no more threads available and therefore runs the inner loop 
> serially, I'll ask a follow-up question: do any implementations 
> actually do this?  If not, why not? (It's been a well-known problem 
> for a long time, so it's surprising to me that no implementation would 
> have fixed it, if that's the right thing to do.)
> I empathize with the annoyance of people making incorrect assumptions 
> about a model they don't understand (they do it with Cilk all the 
> time), so I'll try to refrain from doing that with OpenMP in the future.
> Thanks,
> -Pablo
> On 06/21/2013 04:26 PM, Bronis R. de Supinski wrote:
>> Pablo:
>> Re:
>>> There is no need to get defensive.  I am not attacking OpenMP.
>>> However, it is well known that widely-used OpenMP features, 
>>> particularly static scheduling, do not compose well with libraries 
>>> that also use parallelism. If the author of a piece of code does not 
>>> know whether that code might be called in a parallel context, then 
>>> he cannot use parallelism without risking exponential oversubscription.
>>> (I have seen this happen often.) If run on a desktop or mobile 
>>> system rather than a dedicated HPC system, static scheduling creates 
>>> load imbalances that hurt performance.
>> While I will agree that composability with other parallelism models 
>> is a weakness of OpenMP, composability with itself is not. The issue 
>> that you raise is not one of the model or the specification but 
>> rather one of quality of implementation.
>> Omitting static scheduling would be a mistake. Many, many situations 
>> are well suited to it and it is a natural, low overhead concept.
>> Bronis
> _______________________________________________
> Cplex mailing list
> Cplex at open-std.org
> http://www.open-std.org/mailman/listinfo/cplex