[Cplex] Strands teaser

Jeff Hammond jhammond at alcf.anl.gov
Fri Jun 21 04:17:31 CEST 2013


There _was_ an implementation that did something different but I told
IBM that I wanted their OpenMP runtime to behave like GCC's w.r.t.
instantiating an OpenMP thread pool per Pthread (previously it did
something that I found unusable and will therefore not describe).  I
have used this model successfully on a platform that is really bad at
oversubscription (Blue Gene).  I merely do basic arithmetic and make
sure that num_pthreads*num_omp_threads<=64.  If I intended to have
users of an application that was nested in this way, I would have
logic to prevent the user from oversubscription.

While I have not run TBB with OpenMP on any platform yet, I am
involved in the porting effort of TBB to Blue Gene/Q (it's functional
but needs optimization) and we already have something along the lines
of TBB_NUM_THREADS to allow an intelligent user to not oversubscribe.

Because people who oversubscribe aren't my paying customers, I'm happy
to declare this <user error> (I promised Clark I would stop using
pejorative words on this list).

I accept that more dynamic interplay of threading models will not run
optimally with my statically capped approach, but I've yet to see
anyone try to do that with multiple models.  It's not hard to invent a
use case, of course, and I'm sure someone will do so to try to
invalidate my position.

Jeff

On Thu, Jun 20, 2013 at 6:03 PM, Robison, Arch <arch.robison at intel.com> wrote:
> I revise my statement to:
>
>         "The Intel and GNU and Microsoft implementations, by default, fire up P-1 threads for each top-level parallel construct entered by a non-OpenMP statement."
>
> Is there an existing OpenMP implementation that does not do this?  It would be useful to understand what is does, and if not, why OpenMP implementers feel the alternatives are worse.  One difficulty that occurs to me is fairness: if the first non-OpenMP thread is given all P-1 threads, then what does the second thread get?  If it gets nothing for sake of avoiding oversubscription, then it has only itself to execute its parallel region.
>
> - Arch
>
> -----Original Message-----
> From: Bronis R. de Supinski [mailto:bronis at llnl.gov]
> Sent: Thursday, June 20, 2013 5:09 PM
> To: Robison, Arch
> Cc: Mattson, Timothy G; Blaine Garst; cplex at open-std.org; Carl Hewitt
> Subject: Re: [Cplex] Strands teaser
>
>
>
> And I repeat that your statement is lame and inaccurate.
>
> On Thu, 20 Jun 2013, Robison, Arch wrote:
>
>> I should have emphasized the "(by tradition of paying customers)" part.  Yes, the spec offers other alternatives.
>>
>> - Arch
>>
>> From: Mattson, Timothy G
>> Sent: Thursday, June 20, 2013 4:45 PM
>> To: Robison, Arch; Blaine Garst; cplex at open-std.org
>> Cc: Carl Hewitt
>> Subject: RE: [Cplex] Strands teaser
>>
>> Arch
>>
>> You wrote that:
>>
>> ... OpenMP requires (by tradition of paying customers) that it start up a new pool of P-1 threads for each top-level parallel construct entered by a non-OpenMP thread.
>>
>> This is the typical way OpenMP is implemented today, but nothing in the standard says it has to be this way.   This is my complaint with people who get irritated with the way things are done with OpenMP, but instead of fixing implementations they go off and create new languages/APIs (I consider language proliferation to be extremely disruptive).   In dynamic mode an OpenMP implementation is free to give you any number of threads less than or equal to the number indicated by the number-of-threads internal control variable from one parallel region to the next.
>>
>> So let's be careful to distinguish between implementation choices and language definitions.   Just because implementers consistently choose an approach we consider less than optimal, it doesn't always follow that the problem is in the programming model.
>
>>
>> --Tim
>>
>>
>> From: cplex-bounces at open-std.org<mailto:cplex-bounces at open-std.org> [mailto:cplex-bounces at open-std.org] On Behalf Of Robison, Arch
>> Sent: Thursday, June 20, 2013 2:17 PM
>> To: Blaine Garst; cplex at open-std.org<mailto:cplex at open-std.org>
>> Cc: Carl Hewitt
>> Subject: Re: [Cplex] Strands teaser
>>
>>> From first glance, I believe that a Cilk run-time might be implemented on top of your strands abstraction, since we have an abstraction in the run-time that serves a similar purpose.  [Note: Cilk literature uses "strand" to mean something very different.]  A Cilk runtime cannot be implemented efficiently using FIFO queues.  It uses per-thread deques that are implemented using atomic reads, atomic writes, and mutual exclusion.  C11 gives us atomic reads and atomic writes, so if strands support exclusion and some kind of coroutine-like switching (block current strand; resume another specified strand), I think that's enough to implement a Cilk runtime.
>>
>> Heidi Pan's thesis is the best work to date that I know of on the issue of composing OpenMP/Cilk/TBB and other parallel libraries.  See http://lithe.eecs.berkeley.edu/pubs/pan-phd-thesis.pdf  It employs a hierarchical scheme along the lines of what you suggest.
>>
>> When OpenMP code is called from code written using any other threading package, the results are predictably unsatisfactory, because OpenMP requires (by tradition of paying customers) that it start up a new pool of P-1 threads for each top-level parallel construct entered by a non-OpenMP thread.  For example, if P threads from Cilk [or TBB or PPL] call OpenMP, up to P*(P-1) threads may be running.  The other way around is not so bad: OpenMP calling Cilk results in at most 2P-1 threads running  since Cilk shares a common pool of workers.
>>
>> - Arch
>>
>> From: cplex-bounces at open-std.org<mailto:cplex-bounces at open-std.org> [mailto:cplex-bounces at open-std.org] On Behalf Of Blaine Garst
>> Sent: Thursday, June 20, 2013 12:15 PM
>> To: cplex at open-std.org<mailto:cplex at open-std.org>
>> Cc: Carl Hewitt
>> Subject: [Cplex] Strands teaser
>>
>> Over the past several years I have been working, with Carl Hewitt, on a "strand" library built around the notion of provably minimal scheduling operations.  I was alluding to this during the conference call.
>>
>> The idea is that a strand consists of a stack, register save area, and a few minimal accounting hooks.  The basic primitives for contention arbitration use standard compare-and-swap and look, at a very high level, like mutex and condition variables, but operate directly on the strand rather than a separate mutex variable.  These strands cooperate with the heavier-weight pthreads as necessary but otherwise replace them for computation.
>>
>> Venturing out of strand code into general code that calls OS primitives requires a bit of stack fixup that the compiler can arrange (a few instructions).  The key observation here is that threads have been an operating system invention to simulate multi-core and are now mostly overhead.
>>
>> The scheduling primitives (enter exclusion, leave exclusion, defer to queue, resume queue) are used to implement many different policies: we have two forms of reader-writer (which has priority), a low-level gate (blocks until set), etc.  My belief is that these primitives are sufficient to model CSP, Go, Erlang, and probably several other systems (Linda?), and (the question in this forum) the lower level interactions that go on in the Cilk and OpenMP runtime libraries.  Having implemented the primitives, I can attest to their utter simplicity and minimalism, and would love to compare notes with the mechanisms in place for Cilk and OpenMP.  I suspect that OpenMP will require use of real-time scheduling policies for the underlying pthreads, which is straight-forward, which may well be of general interest.
>>
>> A key issue with strands is the mechanisms to provide adaptive scheduling.  The mechanism we see is a top-down hierarchy of scheduling "objects" we call Sponsors.   The highest level Sponsor deals with pthread-pool operating system primitives found on Mac OS X (FreeBSD), and *should* be able to work with Microsoft thread pools as well (not sure about the underlying OS support there).   It subdivides computing resources with its tree of sub-Sponsors, each of which only knows about itself, its parent, and its children.  Work stealing across sponsors is a natural tree operation.
>>
>> Whereas Carl and I are using these strand primitives to implement Actors, we see them as a general purpose mechanism and have been intending to offer them as such to the C committee for several months now.
>>
>>
>> Until I get that paper written, I have a question to ask: what happens when Cilk and OpenMP are used in the same program?   Unless they are built on the same underlying strand mechanism I can't help but believe that results are unpredictable and unsatisfactory.
>>
>>
>> Blaine
>>
>>
>>
>>
> _______________________________________________
> Cplex mailing list
> Cplex at open-std.org
> http://www.open-std.org/mailman/listinfo/cplex



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
ALCF docs: http://www.alcf.anl.gov/user-guides


More information about the Cplex mailing list