[Cplex] Keywords, compliance, and a parallel UNCOL.
Bronis R. de Supinski
bronis at llnl.gov
Sun Jun 30 00:57:18 CEST 2013
> Yes, I think so:
> 1. OpenMP 4.0 support for offload is distinct from its support for
> shared memory parallelism, it is not the case that they extended the
> same pragmas to also apply to offloading. This is also the case for the
> Intel implementation supporting Xeon Phi as a coprocessor which was used
> prior to the OpenMP 4.0 standard.
This statement is incorrect (or perhaps inaccurate). OpenMP 4.0
adds a few constructs to support "devices" (accelerators, GPUs,
coprocessors, DSPs, or whatever you have or want to call them). The
target construct is the only required one. Once execution has
been initiated on the device, all existing OpenMP constructs
are available to the programmer and work as one would expect.
The teams and distribute constructs were added due to aspects
of GPUs (NVIDIA would say "features"; others might use other
terms), specifically the lack of (reasonable) hardware support
for synchronization across thread blocks. In effect, teams
supports the creation of multiple independent thread blocks;
within each thread block, the existing OpenMP constructs are
available to the programmer and work basically as one would
expect (some caveats exist for the synchronization primitives,
although they are available).
For systems such as the Xeon Phi, the likely implementation
of the teams construct creates a single thread block (a valid
choice since the teams construct ensures that no more than
the requested number of thread blocks are created; an
implementation is free to provide fewer). This choice
effectively creates an OpenMP environment on the device
that is completely consistent with natively running
OpenMP on the device.
The distribute construct supports the parallelization of a
loop across thread blocks (with no synchronization between
them allowed/supported). Again, within the thread blocks,
the existing constructs are available.
So, the existing constructs ("pragmas") do apply to offloading,
although offloading is expressed through the new constructs rather
than through new clauses on the existing ones. I suppose the
clause-based approach would have been possible, but it would have
made it much more difficult to accommodate the desired reduced
support for synchronization between thread blocks.
While adoption of the OpenMP device constructs has not been
proposed, adopting parallelization support that is consistent
with OpenMP provides a clear path to add support for devices
to the language at a later time.
> 2. Managing the heterogeneous HW, the distinct memory and the overhead
> of copying are fundamental to programming these systems, so abstracting
> these away and making them look the same as multi core parallelism is
> unlikely to be the desired direction.
> 3. A system that treats GPU and shared memory multi cores the same does
> not exist, or at least is not being proposed. I agree with others who
> said it before, that the CPLEX is too large of the committee to invent a
> solution. It would be an effective place to endorse and make incremental
> improvements to proposal coming in, hopefully based on implementations.
> By the way, the question of GPUs came up in the WG21 SG1 and it was unanimously agreed upon to not try to standardize GPU programming. I am hoping for the same agreement in CPLEX.
> -----Original Message-----
> From: Matthew Markland [mailto:markland at cray.com]
> Sent: Saturday, June 29, 2013 10:04 AM
> To: Geva, Robert; Nelson, Clark; cplex at open-std.org
> Subject: RE: Keywords, compliance, and a parallel UNCOL.
>> -----Original Message-----
>> From: Geva, Robert [mailto:robert.geva at intel.com]
>> Sent: Saturday, June 29, 2013 2:43 AM
>> To: Matthew Markland; Nelson, Clark; cplex at open-std.org
>> Subject: RE: Keywords, compliance, and a parallel UNCOL.
>> The problem addressed by OpenACC is support for accelerators /
>> coprocessors / etc, mostly with either non shared memory or
>> heterogeneous execution units (or both).
>> This problem is not in the scope of what WG14 asked this study group
>> to work on.
>> We need to solve the "simpler" problem of shared memory parallel
> But I believe we should keep in mind whether the syntax/concepts added to the language can extend to the accelerator/coprocessor model also. In 5 years would another extension be needed (offload), or can we make spawn general enough even in this design so that it could be reused for a future model?