[Cplex] Strands teaser
Mattson, Timothy G
timothy.g.mattson at intel.com
Thu Jun 20 23:44:57 CEST 2013
You wrote that:
... OpenMP requires (by tradition of paying customers) that it start up a new pool of P-1 threads for each top-level parallel construct entered by a non-OpenMP thread.
This is the typical way OpenMP is implemented today, but nothing in the standard says it has to be this way. This is my complaint with people who get irritated with the way things are done with OpenMP, but instead of fixing implementations they go off and create new languages/APIs (I consider language proliferation to be extremely disruptive). In dynamic mode an OpenMP implementation is free to give you any number of threads less than or equal to the number indicated by the number-of-threads internal control variable from one parallel region to the next.
So let's be careful to distinguish between implementation choices and language definitions. Just because implementers consistently choose an approach we consider less than optimal, it doesn't always follow that the problem is in the programming model.
From: cplex-bounces at open-std.org [mailto:cplex-bounces at open-std.org] On Behalf Of Robison, Arch
Sent: Thursday, June 20, 2013 2:17 PM
To: Blaine Garst; cplex at open-std.org
Cc: Carl Hewitt
Subject: Re: [Cplex] Strands teaser
>From first glance, I believe that a Cilk run-time might be implemented on top of your strands abstraction, since we have an abstraction in the run-time that serves a similar purpose. [Note: Cilk literature uses "strand" to mean something very different.] A Cilk runtime cannot be implemented efficiently using FIFO queues. It uses per-thread deques that are implemented using atomic reads, atomic writes, and mutual exclusion. C11 gives us atomic reads and atomic writes, so if strands support exclusion and some kind of coroutine-like switching (block current strand; resume another specified strand), I think that's enough to implement a Cilk runtime.
Heidi Pan's thesis is the best work to date that I know of on the issue of composing OpenMP/Cilk/TBB and other parallel libraries. See http://lithe.eecs.berkeley.edu/pubs/pan-phd-thesis.pdf It employs a hierarchical scheme along the lines of what you suggest.
When OpenMP code is called from code written using any other threading package, the results are predictably unsatisfactory, because OpenMP requires (by tradition of paying customers) that it start up a new pool of P-1 threads for each top-level parallel construct entered by a non-OpenMP thread. For example, if P threads from Cilk [or TBB or PPL] call OpenMP, up to P*(P-1) threads may be running. The other way around is not so bad: OpenMP calling Cilk results in at most 2P-1 threads running since Cilk shares a common pool of workers.
From: cplex-bounces at open-std.org<mailto:cplex-bounces at open-std.org> [mailto:cplex-bounces at open-std.org] On Behalf Of Blaine Garst
Sent: Thursday, June 20, 2013 12:15 PM
To: cplex at open-std.org<mailto:cplex at open-std.org>
Cc: Carl Hewitt
Subject: [Cplex] Strands teaser
Over the past several years I have been working, with Carl Hewitt, on a "strand" library built around the notion of provably minimal scheduling operations. I was alluding to this during the conference call.
The idea is that a strand consists of a stack, register save area, and a few minimal accounting hooks. The basic primitives for contention arbitration use standard compare-and-swap and look, at a very high level, like mutex and condition variables, but operate directly on the strand rather than a separate mutex variable. These strands cooperate with the heavier-weight pthreads as necessary but otherwise replace them for computation.
Venturing out of strand code into general code that calls OS primitives requires a bit of stack fixup that the compiler can arrange (a few instructions). The key observation here is that threads have been an operating system invention to simulate multi-core and are now mostly overhead.
The scheduling primitives (enter exclusion, leave exclusion, defer to queue, resume queue) are used to implement many different policies: we have two forms of reader-writer (which has priority), a low-level gate (blocks until set), etc. My belief is that these primitives are sufficient to model CSP, Go, Erlang, and probably several other systems (Linda?), and (the question in this forum) the lower level interactions that go on in the Cilk and OpenMP runtime libraries. Having implemented the primitives, I can attest to their utter simplicity and minimalism, and would love to compare notes with the mechanisms in place for Cilk and OpenMP. I suspect that OpenMP will require use of real-time scheduling policies for the underlying pthreads, which is straight-forward, which may well be of general interest.
A key issue with strands is the mechanisms to provide adaptive scheduling. The mechanism we see is a top-down hierarchy of scheduling "objects" we call Sponsors. The highest level Sponsor deals with pthread-pool operating system primitives found on Mac OS X (FreeBSD), and *should* be able to work with Microsoft thread pools as well (not sure about the underlying OS support there). It subdivides computing resources with its tree of sub-Sponsors, each of which only knows about itself, its parent, and its children. Work stealing across sponsors is a natural tree operation.
Whereas Carl and I are using these strand primitives to implement Actors, we see them as a general purpose mechanism and have been intending to offer them as such to the C committee for several months now.
Until I get that paper written, I have a question to ask: what happens when Cilk and OpenMP are used in the same program? Unless they are built on the same underlying strand mechanism I can't help but believe that results are unpredictable and unsatisfactory.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Cplex