[Cplex] Integrating OpenMP and Cilk into C++
hsutter at microsoft.com
Sun Jun 23 18:14:44 CEST 2013
> I actually agree with you on this point in that expressing parallelism is a
> simpler problem to which we have a couple of reasonable solutions on
> the table already. The big issue is that I don't think we can reasonably
> release such an extension unless it can interoperate sensibly with the low
> level alternatives, especially threading at the existing C11 level.
Agreed. However, the following isn’t quite what’s needed:
> In the current proposal I see no way for even an expert programmer to
> produce a C11 threads, or pthreads etc., library to compose sensibly
> with the extension. There isn't even a way for such a library to query
> the number of threads in use by the system. For that matter, given a
> threaded application that wants to call a library using the new extension,
> how is it meant to convey the amount of parallelism for the library to use,
> let alone give it existing thread resources to run on?
The problem with querying “# HW threads” is that it’s too low-level, and too lumpy.
We definitely don’t want users to have to specify this. Mainstream parallel libraries have generally surfaced this knob only as a stopgap when they couldn’t tune well enough on their own, and they are now moving away from it.
But I don’t think we want libraries to have to specify this either, because it doesn’t load balance well. The width of a computation can and should be able to change during the computation. For example, at the point where I want to begin a parallel computation, N parallel resources may be available, but that doesn’t mean I should take N: if M others end during my computation, I want to scale up to N+M, and if M others try to begin during my computation, I don’t want to starve them but want to share and scale down to, say, N/(N+M).
(It also doesn’t interact well with hypervisors or with systems where cores can be turned on and off to conserve power, where in both cases the HW resources can grow or shrink during execution.)
And Tom then alludes to the right granularity, which is enqueueing tasks:
> This group needs a runtime interface to rely upon so that it can specify
> a standard ABI for enqueueing tasks etc. and the scheduling interface
> would serve that purpose if developed.
Enqueueing tasks is exactly the right model, so that the point-in-time workload can be dynamically scheduled across the hardware.
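
To make the task-enqueueing model concrete, here is a minimal sketch in standard C++11. It is only an illustration: `TaskQueue` and `parallel_sum` are hypothetical names, not part of any proposal on this list, and the fixed worker count stands in for a decision a real runtime would make (and revise) on its own. The point is that the code submitting work only enqueues tasks; which thread runs them, and how many threads exist, is the scheduler's business.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical illustration only: not an interface from any proposal.
class TaskQueue {
public:
    // The worker count is fixed here for simplicity; a real runtime
    // would choose it and could grow or shrink it during execution.
    explicit TaskQueue(std::size_t workers) {
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }

    // Drains every task already enqueued, then joins the workers.
    ~TaskQueue() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

    // Callers hand over work; they never pick a thread.
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // shutting down and drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> threads_;
    bool done_ = false;
};

// Sums 1..n by enqueueing one task per chunk of the range.
long parallel_sum(long n, long chunk, std::size_t workers) {
    assert(chunk > 0 && workers > 0);
    std::atomic<long> total{0};
    {
        TaskQueue q(workers);
        for (long lo = 1; lo <= n; lo += chunk) {
            long hi = std::min(lo + chunk - 1, n);
            q.enqueue([lo, hi, &total] {
                long s = 0;
                for (long i = lo; i <= hi; ++i) s += i;
                total += s;
            });
        }
    }  // the queue's destructor waits for all tasks here
    return total.load();
}
```

Because the work is split into more tasks than workers, the same call gives the same answer whether the queue is backed by one thread or many, which is what leaves the scheduler free to change the width of the computation.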
From: cplex-bounces at open-std.org [mailto:cplex-bounces at open-std.org] On Behalf Of Tom Scogland
Sent: Saturday, June 22, 2013 10:25 PM
To: Nelson, Clark
Cc: cplex at open-std.org
Subject: Re: [Cplex] Integrating OpenMP and Cilk into C++
On Sat, Jun 22, 2013 at 12:10 AM, Nelson, Clark <clark.nelson at intel.com> wrote:
> In my opinion, if we do not specify a scheduler interface as part of this
> effort, we will have done nothing worth doing. Neither Cilk nor OpenMP are
> useful without a runtime managing concurrency, scheduling work and (in some
> fashion) allowing the user to control concurrency. Either one can be used
> without one, but then all that's left is an overly verbose serial program
> (well, with better than average SIMD usage, but I digress). I am personally
> not interested in specifying a parallel language extension with no standard
> way to control its behavior.
I'm sure you're right -- from your perspective. I absolutely believe that you,
personally, would benefit in no way.
However, I'm pretty confident that there are quite a few less sophisticated
programmers in the world who would benefit considerably from having an
easier way to write a parallel program than by using pthreads, and an easier
way of writing a scalable, composable parallel program than by using OpenMP;
and they would benefit even more if that way were in some way standard.
BTW, when you talk about a "standard way to control its behavior", do you mean
"control" in any broader sense than would be covered by "tune"?
"Tune" to me implies an alteration that increases performance by some metric without affecting correctness. For example, changing the number of threads available to the program as a whole, requesting a specific minimum chunk size for a loop, limiting a section of code to a specific number of threads, or specifying an alternative scheduling scheme could all reasonably be called "tuning". In that sense, tuning is sufficient, but I have a feeling that is not the meaning you had in mind.
> In that sense I believe we need to specify *something* as a standard
> scheduler interface. The question at hand is not necessarily whether we
> should attempt to make *a single scheduler* which is perfect for all
> occasions, a nigh impossible task on the best of days, but rather a standard
> interface and mechanism for composing schedulers within an application,
> probably a standard library interface of some sort. In that fashion we
> allow users and runtime implementers to define schedulers which fit their
> needs, but still incorporate into the standard system.
I wholeheartedly support this idea. But I feel it's a separate and deeper
topic (and in a lot of ways more interesting :-) than the simple ability to
express what I'll call opportunistic parallelism, as in a program that can
take advantage of more than one processor, which can still be useful even if
it isn't tuned to within an inch of its life.
Please note that I *didn't* say that this separate and deeper topic should
be given lower priority than the other.
I actually agree with you on this point in that expressing parallelism is a simpler problem to which we have a couple of reasonable solutions on the table already. The big issue is that I don't think we can reasonably release such an extension unless it can interoperate sensibly with the low level alternatives, especially threading at the existing C11 level.
In the current proposal I see no way for even an expert programmer to produce a C11 threads, or pthreads etc., library to compose sensibly with the extension. There isn't even a way for such a library to query the number of threads in use by the system. For that matter, given a threaded application that wants to call a library using the new extension, how is it meant to convey the amount of parallelism for the library to use, let alone give it existing thread resources to run on?
I have no issue with providing a sensible default for when a user does not care, but there has to be a way for library writers and experts to interface with the new runtime intelligently, without being forced to reimplement everything in terms of the new extension. At a minimum that means offering hooks to set properties such as number of threads and binding of threads, and to query those same values. Given that interface, we could at least introduce this without clobbering everyone else; that is acceptable but sub-optimal territory as far as I'm concerned.
Perhaps that's what we should seek to do in this group, and create another group to look into the issue of composing schedulers, although I'm not sure what it would necessarily accomplish. This group needs a runtime interface to rely upon so that it can specify a standard ABI for enqueueing tasks etc. and the scheduling interface would serve that purpose if developed.
For that matter, I'm not sure how much more effort would really be required. There is already an interface defined for passing tasks to a scheduler, and another (even if it's presently hidden) for specifying the resources that scheduler is allowed to use. To use a well-established example, in OpenMP parallelism is managed in a hierarchy, or a scope if you will. If you run a work-sharing loop in a parallel region with 8 threads, it will spread across exactly those 8 threads, even if at an outer parallel region there are 80 available. Effectively each parallel region is a scheduling scope. Would it be so bad to allow the specification of a, potentially user-defined, scheduler and resources (number of threads etc.) for each region? When I say scheduler here, I mean something like the "executor" discussed in another thread on this list. At the present moment, I can't think of any issue I have that could not be solved by some combination or implementation of those options.
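
For readers less familiar with OpenMP, the scoping described above looks roughly like the sketch below. It uses only existing OpenMP constructs (the `num_threads` clause, a work-sharing `for` with a reduction, and `omp_get_num_threads`); the names `RegionResult` and `scoped_sum` are made up for the illustration. Note that `num_threads` is a request, so the team may be smaller than asked for, and the `#ifdef _OPENMP` guards let the same code compile serially when OpenMP is not enabled. It shows the resource half of the idea only; a per-region, user-defined scheduler does not exist in OpenMP today.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper type for this sketch.
struct RegionResult {
    long sum;       // result of the work-sharing loop
    int team_size;  // threads actually in the region's team
};

// Runs a work-sharing loop inside a parallel region that requests
// `threads` threads. The loop is spread across that region's team
// only, regardless of how many threads exist elsewhere.
RegionResult scoped_sum(const std::vector<long>& v, int threads) {
    assert(threads > 0);
    long sum = 0;
    int team = 1;  // stays 1 when compiled without OpenMP

    #pragma omp parallel num_threads(threads)
    {
#ifdef _OPENMP
        // One thread records the size of this region's team.
        #pragma omp single
        team = omp_get_num_threads();
#endif
        // Iterations are divided among the team of this region.
        #pragma omp for reduction(+:sum)
        for (long i = 0; i < static_cast<long>(v.size()); ++i)
            sum += v[static_cast<std::size_t>(i)];
    }
    return {sum, team};
}
```

Calling `scoped_sum(v, 8)` from inside some larger threaded application never uses more than 8 threads for the loop, which is the sense in which each parallel region acts as a scheduling scope.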
"A little knowledge is a dangerous thing.
So is a lot."