Document Number: N3528
Date: 2013-02-05

Minutes of Feb 5 2013 SG1 Phone Call

Pablo Halpern, Minutes taker

Attendees

Hans Boehm

Herb Sutter

Michael Wong

Tom Plum

Pablo Halpern

Torvald Riegel

Robert Geva

Niklas Gustafsson

Artur Laksberg

Arch Robison

Detlef Vollmann

Anthony Williams

Lawrence Crowl

Olivier Giroux

Clark Nelson

Agenda

N3520 Critical Sections in Vector Loops

CWG1441 Atomics in signal handlers (See reflector discussion)

latch_barrier.html C++ Latches and Barriers - What's the status of this?

Review short term proposals from Portland SG1 summary, particularly slide 3.

N3520 Critical Sections in Vector Loops

Robert: First thing we need to decide is whether this is a problem that needs fixing.

Robert: The specification of a vector loop is defined in terms of what orders of evaluations are allowed, not what order of evaluation is required.

Robert: Do we feel strongly that in the unlikely case where a deadlock could occur, do we want to require.

Clark: It's already understood that locks are not composable, so this is not a new problem created by this vector loop construct.

Pablo: We already have other things in the language that do not compose. You can't dereference a null pointer, you can't acquire a non-reentrant mutex twice, and you can't put loops in constexpr functions. I think that the rules needed to get defined behavior here are fine.

Hans: Do we have a good set of rules for what would be allowed?

Robert: Yes, that's part of the proposal presented in Portland. Note that the specification allows both the scalar and vector order of evaluation.

Anthony: It seems to me slightly odd to have two allowed execution orders. If it's not detectable, then it doesn't matter, and if it is detectable, then the results depend on your compiler. It would seem to me that if you use the vector loop, you should get the vector order of execution, not the scalar order.

Robert: We had this discussion at length within Intel. It is detectable, and we could have proposed that, but that is orthogonal to today's question.

Anthony: But if you required the vector execution, then you would always have the problem, consistently.

Clark: There are already places where the standard allows multiple execution orders. If we say that you must do it in the vector way, then if you don't specify the vector length, you'll still get different behavior on different compilers.

Arch: We will definitely need flexibility in the execution order. The general principle is that your program specifies the possible parallelism and then you need to map the parallelism to the specific machine. There will always be more parallelism than the hardware can support, so some of it must be squashed into sequential execution.

Robert: That was a big part of our thinking. The actual parallelism depends on hardware resources.

Lawrence: This for loop changes the relationship between iterations. My concern is that we are using a syntax that is like a for loop and people will have the expectation that all of their for-loop assumptions hold. If we used a library syntax, then we would not have that confusion. I know that the compiler has to get involved, but the form of it is my concern. If we need to tweak the syntax later, it would be less disruptive to do it as a library proposal.

Robert: We are not attached to the syntax, but rather to the capability. However, this does reflect existing practice.

Lawrence: Yes, it is existing practice, but the existing practice is pushed by vendors that want to make minimal changes, they are not looking at the language 20 years down the line.

Hans: are you talking about a library function where the body of the loop is in the lambda? The thing that I'm concerned about is that the semantics of that are hard to define.

Clark: It's not only vendors that want minimal changes, programmers do too.

Pablo: Minimal changes also allow you to try to vectorize a part of your code, find that it doesn't yield benefits, and undo the changes without major rewrites.

Hans: Is there any difference between scalar execution and vector-length-1 execution?

Robert: I can't think of any

Hans: That makes me feel better. Since you do not require that the compiler use any particular vector length, vector-length-1 is automatically allowed.
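The relationship between the scalar order and the vector order can be sketched with a small model. This is only an illustration, not the proposal's specification: the function `trace_order`, the statement labels, and the assumption of strict statement-by-statement lockstep within a chunk are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: a loop body made of `stmts` statements is run over
// `iters` iterations in chunks of `width` lanes. Within a chunk, each
// statement is executed for every lane before the next statement starts.
std::vector<std::string> trace_order(std::size_t iters, std::size_t stmts,
                                     std::size_t width) {
    assert(width > 0);  // a chunk must contain at least one lane
    std::vector<std::string> trace;
    for (std::size_t base = 0; base < iters; base += width) {
        const std::size_t end = std::min(base + width, iters);
        for (std::size_t s = 0; s < stmts; ++s) {
            for (std::size_t i = base; i < end; ++i) {
                // Statements are labelled A, B, C, ...; the number is the
                // iteration index.
                trace.push_back(std::string(1, static_cast<char>('A' + s)) +
                                std::to_string(i));
            }
        }
    }
    return trace;
}
```

In this model, `trace_order(4, 2, 1)` yields A0 B0 A1 B1 A2 B2 A3 B3, which is the ordinary scalar order, while `trace_order(4, 2, 2)` yields A0 A1 B0 B1 A2 A3 B2 B3.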

Olivier: I think that this case is clear and doesn't cause me much heartburn. Could the compiler help me in this case, if the lock/unlock were annotated? I agree that if you recurse down into another non-vector function, then it's not a problem unless you have a function that acquires a lock and doesn't release it. I'm wondering about std::atomic here, though. It's not a lock, but it is communication.

Anthony: Why don't we say that within a vector for, you can't have anything that isn't a compiler intrinsic or isn't annotated as being OK?

Robert: That is something that we implemented in an earlier version of our implementation. We disallowed calls to anything that wasn't a vector function, but we got requests from customers to relax this restriction.

Clark: That would give predictability absolute priority over composability.

Herb: I think we're talking about two different kinds of vector annotation. 1) vector parameters which is a subset of 2) all things that are safe to call within a vector loop. If we want an annotation for (2), I'm open to that, but if we do that, we should have a more generalized restriction facility, to avoid special cases. I don't think having a "vector-callable" annotation is really related to this question because it's just one of many things.

Anthony: Unless you think through all of the things that could be safe, it won't be enough, because you can have locks and communications issues. It could be a spin-lock with atomics, or a counter communications mechanism.

Arch: Atomic operations per se are not the problem. They vectorize fine. It's the wait-on problem.

Robert: So the only place we have problems is when iterations of the loop must make independent progress.
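The progress problem can be illustrated with a toy model of lockstep execution. This is a sketch under stated assumptions, not a description of any real compiler or hardware: `ModelLock` is a hypothetical stand-in for a non-reentrant lock (it is deliberately not `std::mutex`), and the model assumes that no lane starts the next statement until every lane in the chunk has finished the current one.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-reentrant lock used only for this model.
struct ModelLock {
    bool held = false;
    bool try_acquire() {
        if (held) return false;
        held = true;
        return true;
    }
    void release() { held = false; }
};

// Each iteration runs two statements: (0) acquire the lock, (1) release it.
// Returns true if a chunk of `width` lanes can finish, false if it is stuck.
bool chunk_completes(std::size_t width) {
    assert(width > 0);
    ModelLock lock;
    // Statement 0 for every lane: all lanes must acquire before any releases.
    for (std::size_t lane = 0; lane < width; ++lane) {
        if (!lock.try_acquire()) {
            // This lane would wait forever: the holder cannot reach its
            // release until this lane has finished statement 0.
            return false;
        }
    }
    // Statement 1 for every lane (only reachable when width == 1).
    for (std::size_t lane = 0; lane < width; ++lane) {
        lock.release();
    }
    return true;
}
```

In the model, a chunk of one lane completes, while any wider chunk gets stuck because the second lane waits on a lock that the first lane cannot release.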

Pablo: I think that it is impractical to require every safe function to be annotated, since most functions are safe: they don't grab mutexes or use atomics. Annotating what should NOT be called in a vector context is not as safe, but is much more practical.

Anthony: If you want to take advantage of vectorization, then there is only a restrictive set of operations that make sense: simple arithmetic, etc.

Pablo: I don't agree. Vector units and vectorizers keep getting more capable, so restricting the set today will hurt us tomorrow. Moreover, if you have a small number of non-vectorizable operations in your otherwise-vectorizable loop, you shouldn't lose the vectorization of the rest of your loop.

Herb: If we introduce a block that has scalar execution, would that work?

Robert: This is already part of the proposal.

Herb: I like that it's a clear, greppable annotation. If there were a vector block, rather than a simd_for, then any for would implicitly be restricted.

Arch: It would make it hard to define what a while loop means. It's effectively what OpenMP does, but it's a different model and they still needed to annotate their for loops.

Anthony: I like the region idea. So you could say that you can't use locks unless it's in a scalar block. That strikes me as a good place to start. Allowing any function call in your vector loop seems to me to be the wrong place to start.

Michael: The problem I see is that then you have to define things in terms of what can be lexically observed.

Hans: Seems a lot like the transactional memory discussions. You would have to annotate things all the way down every library.

Robert: We wanted to avoid viral annotations. By having the serial block, we have to annotate only the loop itself.
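One way to picture the serial block is as a hand-written lowering of the loop. The sketch below is hypothetical: it does not use the proposal's syntax, the vector length of 4 is assumed, and `sum_of_squares` is an invented example. It only shows the shape of the idea, with the vectorizable statement run for all lanes of a chunk and the critical section run one lane at a time.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical lowering of a vector loop whose body is:
//   y = x[i] * x[i];                              (vectorizable part)
//   serial block { lock; total += y; unlock; }    (critical section)
long sum_of_squares(const std::vector<int>& x) {
    constexpr std::size_t width = 4;  // assumed vector length
    std::mutex m;
    long total = 0;
    for (std::size_t base = 0; base < x.size(); base += width) {
        const std::size_t lanes = std::min(width, x.size() - base);
        assert(lanes > 0 && lanes <= width);
        std::array<long, width> y{};
        // Vector part: the same statement for every lane in the chunk.
        for (std::size_t lane = 0; lane < lanes; ++lane) {
            y[lane] = static_cast<long>(x[base + lane]) * x[base + lane];
        }
        // Serial block: each lane runs the whole critical section before
        // the next lane starts, so lock and unlock never interleave.
        for (std::size_t lane = 0; lane < lanes; ++lane) {
            std::lock_guard<std::mutex> guard(m);
            total += y[lane];
        }
    }
    return total;
}
```

Only the loop body needs to mark the serial region; the functions called from it need no annotation in this picture.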

Hans: On slide 7, what would happen if the lock/unlock were in an inline function? Would that be OK in a non-inline function but not in an inline function? If we go with the annotation scheme, that would solve the problem.

Robert: Proposed straw poll. Define a vector loop saying that iterations cannot rely on making progress independent of each other, otherwise you get undefined behavior.

Hans: I'm concerned about the inline functions.

Pablo: The compiler can inline the function, but it must not vectorize it unless it can prove that it's safe.

Anthony: I'm uncomfortable with the idea that if you take the body of a vector loop and move it out into a separate function, then you've changed the semantics.

Robert: We can talk about annotations that make a function callable from a vector loop. That solves this problem and creates others. The proposal is currently to allow anything in a vector loop, and you are responsible for avoiding deadlock. That is what is implemented in our commercial compiler.

Herb: Three approaches: 1) Inside a vector loop, call only vector-parameter functions, else ill-formed; 2) Call anything, and unknown functions will be run serially; 3) C++-AMP style restriction (viral) annotation.

Anthony: So there isn't a currently-shipping implementation that has this serial-block construct, right?

Robert: Right. It's part of the proposal, but not implemented yet. Also part of OMP, but not yet implemented.

Robert: The rest of the presentation is that if we want to allow critical sections in vector loops and have them have defined behavior, then we have a solution. But if we go with the undefined behavior, then we don't need to go through the rest of the presentation.

Hans: I think we would need to require non-interleaved execution of inlined function calls in inlined functions. So long as function calls don't get split up, then I'm OK, I think.

Straw Poll: Any code can be called within a Vector loop, but iterations are not required to make progress independent of each other

Anthony: Is there an alternative position?

Hans: If most people find this acceptable, then I don't think we need another alternative.

SF: 4

WF: 6

N: 3

WA: 2

SA: 1

Reasons for strongly against:

Anthony: Looking at the simd_for, if you take the contents and put it into an inline function, you've changed the semantics vs. manually inlining the contents of the function. And you've added the potential for deadlocks. If you want code that must be serial, then you put it in a serial block. I'd rather see a compiler error than undefined behavior.

Hans: We need a follow-up discussion on the exact semantics.

Robert: It would be helpful if people reread the Portland proposal and provide feedback before Bristol, so that we can have a detailed discussion then.

Herb: The reason that the UB doesn't bother me too much this time is that it is likely to be caught during normal testing, unlike many other types of concurrency bugs. I don't have data, but I suspect that this is one of those cases that will fail pretty reliably in testing.

Anthony: but it depends on what your compiler does, so if it doesn't fail on your compiler, that doesn't mean that it will work on another compiler.

Olivier: I'd like to hear more about how sequencing works with control flow.

Clark: We'll make sure to have a document in a mailing by Bristol that gets into these kinds of details.

latch_barrier.html C++ Latches and Barriers - What's the status of this?

Lawrence: On latches and barriers, I forwarded issues to the author and will ping him.

Lawrence: On concurrent queue, no change but will have update for Bristol.

Review short term proposals from Portland SG1 summary, particularly slide 3.

Lawrence: On stream mutex, I have an item to rewrite it to use the internal hash table, but it might be better if the committee had a replacement for streams.

Hans: that's clearly not a C++14 issue. Do we want something for C++14?

Lawrence: The internal hash table should work, but you'll have to lock it to find out which lock you can use, and there might be GC issues. It seems ugly and slow to me, but in the short-term, I don't think we'll get anything better.

Herb: Some of us teach people how to serialize I/O with a simple wrapper. I would be in favor of doing nothing for C++14 and get something better for C++17.
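The kind of wrapper Herb mentions might look roughly like the following. This is a minimal sketch, not the wrapper actually taught: the class name `SerializedOut`, its `write_line` member, and the line-at-a-time granularity are assumptions made for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical wrapper that serializes whole-line writes to one stream.
class SerializedOut {
public:
    explicit SerializedOut(std::ostream& os) : os_(os) {}

    // Writes the line and its newline while holding the lock, so lines
    // from different callers are never interleaved character by character.
    void write_line(const std::string& line) {
        std::lock_guard<std::mutex> guard(m_);
        os_ << line << '\n';
    }

private:
    std::ostream& os_;
    std::mutex m_;
};

// Usage sketch: in real code one SerializedOut would be shared by threads.
std::string demo() {
    std::ostringstream buf;
    SerializedOut out(buf);
    out.write_line("first");
    out.write_line("second");
    assert(buf.good());  // the stream should still be usable afterwards
    return buf.str();
}
```

The protection only holds if every writer goes through the same wrapper object; code that writes to the underlying stream directly is not synchronized with it.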

Hans: My concern is that there would be no way to get the equivalent of C file locking. There would be no way to ensure that everybody uses the same wrapper, so it would be nice to have something.

Lawrence: There is no way to add a lock to a stream without breaking binary compatibility. I don't think there's a way, in general, to use the file system's lock.

Hans: Then maybe Herb is right, that we should not try for C++14.

Lawrence: The proposed changes would allow independent C++ programmers to synchronize on streams, so in that way it is a net improvement. We can specialize the 6 standard streams so that they use the flock primitive, but for a general file, we can't.

Hans: Well stdio is an important case, so even that would be a big improvement.

Lawrence: then I will go forward with my implementation.

Hans: On Howard's proposal, there is a little more work to be done on spurious try-lock failures for reader-writer locks.

Arch: On concurrent containers, we're not sure if we'll have a prototype in time for Bristol.

Herb: For C++14, we need a CD out of either Bristol or Chicago and a DIS out of either Chicago or Rapperswil. We should view it as mostly a defect-report release, with small added things, but not delay the release for features. It would be great to get a CD out of Bristol.

CWG1441 Atomics in signal handlers (See reflector discussion)

Hans: [points people to CWG1441.txt on the Wiki]

Clark: The standard can say almost nothing about asynchronous signals. If you call raise(), that results in a call to the signal handler.

Hans: So the standard treats that as a function call, so we don't need to say anything else? So "interrupted by a signal handler" is sufficient to mean an asynchronous signal?

Clark: Yes. We can add the word "asynchronous" if we want, though.
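The synchronous case Clark describes can be sketched as follows. The names `got_signal`, `on_signal`, and `raise_runs_handler` are hypothetical, and SIGINT is chosen only because it is one of the signals the C standard defines. The handler does nothing beyond setting a `volatile std::sig_atomic_t` flag, which is the conservative, portable pattern.

```cpp
#include <cassert>
#include <csignal>

// Flag written by the handler and read by the main flow of control.
volatile std::sig_atomic_t got_signal = 0;

// The handler only sets the flag.
extern "C" void on_signal(int) { got_signal = 1; }

// Installs the handler, raises the signal, and reports whether the handler
// had already run by the time raise() returned.
bool raise_runs_handler() {
    got_signal = 0;
    auto previous = std::signal(SIGINT, on_signal);
    assert(previous != SIG_ERR);
    std::raise(SIGINT);             // the handler runs before raise returns
    std::signal(SIGINT, previous);  // restore the earlier disposition
    return got_signal == 1;
}
```

Because `raise` does not return until the handler has returned, this behaves like an ordinary function call; it says nothing about a signal that arrives asynchronously.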

Detlef: Do we want to do anything about 18.9? If we defer to the C standard, then a lot of our work will do nothing.

Hans: good point. We'll have to look more into that.

Detlef: For example, the C standard does not allow calls to any library functions except abort, exit, and quick_exit().

Hans: To some degree, we need to leave a lot of this implementation defined.

Detlef: But can I call, for example, is_lock_free() for atomics?

Clark: There's really no prospect of saying what you can portably do and have it work on all implementations.

Detlef: It seems like some functions should be safe to call because there is no real runtime involved, but it gets annoying not to have these available.

Hans: but it says "implementation defined", not "undefined".

Detlef: I agree that we should not try to do anything for 2014. We should send an email to the library reflector to see what they think.

Clark: I agree we can't do much for 2014. The C committee has always maintained that all you can do is set a flag.

Detlef: they've also added atomics in 2011.

Hans: so you can't even set an atomic pointer and subsequently dereference the pointer.

Detlef: and you can't call time() and store that time somewhere.

Hans: the reason that this issue came up was to deal with atomics, and that should be our goal here. [Moves on to next question in document]

Herb: If the two subexpressions occur in different threads, then the programmer can reason that p = &i synchronizes with testing p, but we're not allowing the programmer to reason the same way within a single thread. That does seem bizarre. We should come back to what is sequentially consistent. The programmer knows that he can't depend on the order of subexpression evaluation on the same thread, but he can reason about sequential consistency. If the atomic doesn't exist, you can still reason about it in a single thread. It is indeterminately sequenced, not unsequenced.

Clark: the assignment to i is certainly sequenced before the assignment to p. The get on p and the dereference of p are also sequenced.

Hans: So it appears that by the current rules this is currently legal.

Herb: is _Atomic int in C11 a function call?

Clark: Not really well defined. atomic_int in C++ is clearly a class type, but it is not clear whether, in C, atomic_int is the same as _Atomic int. If you declared p as _Atomic int* in C11, there is no guarantee of anything unless you use the explicit function calls to set and get it.

Hans: So it sounds like we should attempt to resolve this in C++ and then let WG14 know what we decided. I think we are leaning towards saying that this is guaranteed legal in C++.
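The pattern under discussion appears to be a handler that writes a plain variable, publishes its address through an atomic pointer, and is then followed by a test and dereference of that pointer. The sketch below is an assumption about that shape, not the example from CWG1441.txt: the function names are hypothetical, the atomic is assumed to be lock-free, and `raise` is used so the handler runs synchronously, which sidesteps the asynchronous case the issue is really about.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

int i = 0;                     // plain (non-atomic) variable
std::atomic<int*> p{nullptr};  // atomic pointer used to publish &i

extern "C" void publish(int) {
    i = 42;   // plain write, sequenced before the atomic store below
    p = &i;   // atomic store (sequentially consistent by default)
}

// Returns the value read through p, or -1 if p was never set.
int read_through_pointer() {
    assert(p.is_lock_free());  // assumption; checked outside the handler
    auto previous = std::signal(SIGINT, publish);
    assert(previous != SIG_ERR);
    std::raise(SIGINT);
    std::signal(SIGINT, previous);
    int* q = p;                // atomic load
    return q ? *q : -1;        // dereference only if the handler ran
}
```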

Clark: and we should document our reasoning!

Hans: Point 2 on the write-up I think we can ignore, since synchronization operations are already in the realm of things that are implementation-defined for signal handlers.

Hans: For point 3, I updated the text with Lawrence's words.

Clark: I'm not sure that this point 3 is really necessary.

Hans: I think there's already an assumption that a signal handler must run in the same thread because of TLS etc.

Clark: I think you're trying to define too much portably

Detlef: C already allows access to a thread-local flag

Clark: I don't think that's true. If it is, I think it should be broken.

Hans: we don't want to model signal handlers completely running in their own thread because there is a notion of the thread being stopped.

Clark: we are talking about threads being contexts, but the standard definition of thread is different. I don't think that a signal handler can fit into that definition of thread.

Herb: I'm going to propose for Bristol that we add the word "const" in the same paragraph, but I don't think there's overlap with this issue.

Other business

Herb: Question on papers: we've been talking about publishing papers as they're done, rather than waiting for a mailing. Would anybody else, if a place to publish were available, want to publish their papers early?

Hans: The minutes from this meeting could be published like that.

Michael: Same with SG5 minutes.