ISO/IEC JTC1/SC22/WG14 N735

General wording issues (clause 7)
First revision
Clive D.W. Feather

This document is an attempt to identify all the minor issues I can find
in clause 7 of the Standard. This revision is updated to draft 10 pre 1.

=======================================================================

Item 1:

The following code:

    int *a;
    /* ... */
    a [1] = (realloc)(a, a [0]);

is strictly conforming because of the sequence point before realloc is
called and the one before it returns. However, if the parentheses are
removed, the call might be replaced by a macro and the sequence points
are lost. This is surprising to most users of the standard library.

This can either be eliminated or made explicit:

(A) In 7.1.7 paragraph 1, after the sentence:

    Any invocation of a library function that is implemented as a macro
    shall expand to code that evaluates each of its arguments exactly
    once, fully protected by parentheses where necessary, so it is
    generally safe to use arbitrary expressions as arguments.

add:

    In addition, the macro shall expand to code that contains a sequence
    point after the evaluation of all arguments and before any other
    action, and a sequence point at the end of evaluation.

(B) Alternatively, add a footnote referred to by that sentence above:

    Such macros might not contain the sequence points that the
    corresponding function calls do.

====

Item 2:

There was a long discussion some time ago about the following code:

    printf ("%n foo %n", &i, &i);

and whether it is strictly conforming. I would suggest that we need the
following somewhere in 7.1 (either as a new 7.1.8, or add in 7.1.7
after paragraph 2):

    [1] Except where explicitly stated, there are no sequence points
    during the evaluation of a library function.
    Where a function's action is described in sequential terms, or one
    function is defined in terms of calls to another, this is for the
    purpose of describing the final effect, and does not require the
    events to actually occur in that order, or for an actual call to the
    other function to occur.

    [2] Nevertheless, there is a sequence point immediately before the
    function is called (as specified by subclause 6.3.2.2), and
    immediately before it returns.

    [3] Example

    The call:

        int i;
        (printf) ("%n %n", &i, &i)

    invokes undefined behaviour, because it assigns to i twice between
    the same pair of sequence points. Even though printf is defined in
    terms of calls to putc(), it is not required for such a call actually
    to occur, nor for there to be a sequence point before and after
    outputting the space.

====

Item 3:

Add to the end of 7.11 paragraph 3:

    It is permitted to create a pointer to a va_list and pass that
    pointer to another function, in which case the original function may
    make further use of the original list after the other function
    returns.

====

Item 4:

I can see nothing to prevent EOF being defined, on a 16-bit int system,
as -65535, which is, after all, a negative integral constant expression.
However, this does not have a negative value.

More generally, while some symbols have restrictions on them such as
"suitable to be used as the 3rd argument of foo()", and others are never
passed to or returned by functions, there are still some loopholes.

To fix this, change 7.12.1 paragraph 3 (part only) from:

    EOF
    which expands to a negative integral constant expression that is
    returned by several functions to indicate end-of-file, that is, no
    more input from a stream;

to:

    EOF
|   which expands to an integral constant expression, with type /int/ and
|   a negative value, that is returned by several functions to indicate
    end-of-file, that is, no more input from a stream;

There are no doubt several other such cases.
Making a quick check I can find:

* EDOM, ERANGE, SIGABRT, SIGFPE, SIGILL, SIGINT, SIGSEGV, and SIGTERM are
  all positive integral expressions that need to be stored in an /int/,
  but are not restricted in this way.

* CLOCKS_PER_SEC does not have a specified type.

====

Item 5:

Change the first sentence of subclause 7.1.2 paragraph 1 from:

    Each library function is declared in a /header/, [111] ...

to:

    Each library function is declared, with a type that includes a
    prototype, in a /header/, [111] ...

The as-if rule means that this need not be done literally, provided that
the effects of argument assignment rather than default promotion (other
than trailing varargs, of course) will happen to all library function
calls.

====

Item 6:

A careful reading of subclause 7.3.1 shows that, for characters outside
the 95-element minimal execution character set, there are two sets of
classification macros that are significant. For each set, a character
can belong to at most one member of the set. The following table shows
these sets, examples of characters within those sets taken from the
minimal 95, and cases that cannot happen:

                 isprint()    iscntrl()    [neither]
    isalpha()      'A'        forbidden       =1=
    isspace()      ' '          '\n'          =2=
    [neither]      ':'          '\b'          '\0'

The interesting cases are those marked =1= and =2=; any characters with
these properties must be locale-specific.

The question turns on the intended meaning of "printable". The current
definition requires the character to occupy a position on a printing
device.

[A] If so, such characters do make sense - =1= could be a "dead"
character that overprints another one, or =2= could be a hair-thin
space. Then attach a footnote to subclauses 7.3.1.2 (isalpha()), 7.3.1.6
(islower()), and 7.3.1.10 (isupper()):

    [*] The additional characters might not be printing characters; for
    example, they may be "dead" characters that overprint the preceding
    or following character and are thus not "printing".
[B] However, it is questionable whether the term "one printing position"
still has a meaning in this day of proportional-spaced output devices,
and whether there is a need for a better definition of "printable". In
this case, change the definition to:

    The term "printing character" refers to a member of an
    implementation-defined set of characters, each of which has a
    characteristic appearance on a display device and usually occupies
    one printing position;

in subclauses 7.3.1.2 (isalpha()), 7.3.1.6 (islower()), and 7.3.1.10
(isupper()), change:

    ... locale-specific set of characters ...

to:

    ... locale-specific set of printing characters ...

and in 7.3.1.9 (isspace()) change it to:

    ... locale-specific set of printing or control characters ...

====

Item 7:

There are a couple of places where <ctype.h> is less than clear.

Attach a footnote to subclause 7.3.1.2 (isalpha()):

    [*] The functions islower() and isupper() test true or false
    separately for each of these additional characters; all four
    combinations are possible.

In subclause 7.3.2.1 (tolower()), change the text from:

    Description

    The tolower function converts an uppercase letter to the
    corresponding lowercase letter.

    Returns

    If the argument is a character for which isupper is true and there
    is a corresponding character for which islower is true, the tolower
    function returns the corresponding character; otherwise, the
    argument is returned unchanged.

to:

    Description

    The tolower function converts an uppercase letter to
|   a corresponding lowercase letter.

    Returns

    If the argument is a character for which isupper is true and there
|   are one or more corresponding characters for which islower is true,
|   the tolower function returns one of the corresponding characters
|   (always the same one for any given locale); otherwise, the argument
    is returned unchanged.

and make the corresponding changes to subclause 7.3.2.2 (toupper()).

====

Item 8:

In subclause 7.9.1.1 (setjmp()) there is a heading "Environmental
constraint".
This implies that the sentence is a Constraint, and that violation
requires a diagnostic. It is reported that very few implementations
generate such a diagnostic, and that most implementations correctly
handle other contexts. Therefore change the heading to "Environmental
restriction" or just make this part of the semantics. Possibly add at
the end:

    If the invocation appears in any other context, the behaviour is
    undefined.

====

Item 9:

[Withdrawn]

====

Item 10:

Locales are currently treated as extremely opaque. It is not possible to
determine whether two locales are equivalent in a category. It is not
even sensible to compare locale strings for equality; the string
returned need not be the same as the string passed in, even if it was
also the string returned from a previous call. That is:

    char *loc;
    char copy_loc [LARGE_ENOUGH];

    loc = setlocale (LC_COLLATE, "C");
    if (strcmp (loc, "C") != 0)
        do_something ();    // This can happen
    assert (strlen (loc) < LARGE_ENOUGH);
    strcpy (copy_loc, loc);
    loc = setlocale (LC_COLLATE, "C");
    if (strcmp (loc, copy_loc) != 0)
        do_something ();    // This can happen

I realize that most systems store most locales in files, and therefore
comparing for functional equality is not as simple as it might seem.
However, I would recommend the following as a minimum:

(1) Add to 7.5.1.1 (setlocale()) paragraph 8:

    Furthermore, if this string value is passed to the setlocale
    function with the same category, the result shall be the same string
    value.

(2) Add either a function to compare two locale strings for functional
equivalence in a category, or a function to compare a locale string with
the current locale in a category. Functional equivalence is defined as:

    No behaviour defined in clause 7, other than the result of the
    setlocale function, changes as a result of changing the locale.

Note that "strictly conforming" is not a good term to use in any
comparison.
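
As an illustration of recommendation (1) above, the following is a
minimal sketch of how a program could test whether an implementation
already behaves as proposed. The helper name locale_string_is_stable and
its buffer parameters are hypothetical; only setlocale() and the string
functions come from the Standard. The sketch queries the current locale
string, copies it, passes the copy back for the same category, and
compares the two results.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the Standard or of this proposal.
 * Returns 1 if the string returned by setlocale() for `category` is
 * reproduced when passed back, 0 if a different string comes back,
 * and -1 if the query fails or the string does not fit in `buf`.
 */
int locale_string_is_stable(int category, char *buf, size_t bufsize)
{
    assert(buf != NULL);

    /* A null locale argument only queries the current setting. */
    const char *cur = setlocale(category, NULL);
    if (cur == NULL || strlen(cur) >= bufsize)
        return -1;

    /* Copy first: a later setlocale() call may overwrite the string. */
    strcpy(buf, cur);

    const char *again = setlocale(category, buf);
    if (again == NULL)
        return -1;

    return strcmp(again, buf) == 0;
}
```

Under the current wording a result of 0 is permitted; with the addition
proposed in (1) a conforming implementation would always yield 1 here.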
====

Item 11:

The localeconv() function discusses monetary and non-monetary
formatting, especially the former, but provides no easy way to implement
it. The natural place to do this is the printf() family of functions.
Therefore add to 7.12.6.1 (fprintf()):

    Flag , (comma): for d, i, o, u, x, X, f, F, e, E, g, G, a, and A
    conversions, the output shall be grouped in accordance with the
    /thousands_sep/ and /grouping/ fields of the locale. For other
    conversions, the behaviour is undefined.

    Format or flag $ (dollar): [It is unclear whether this is better as
    a flag or a format.] Generate a formatted monetary quantity. If it
    is a format, the argument is a double (or long double if L is
    included). The plus and space flags act as if the output already
    included a sign (even if it does not). The # flag specifies
    international formatting. The minus and zero flags can be used. If
    no precision is specified, the value of /frac_digits/ or
    /int_frac_digits/ from the current locale is used; if that is
    CHAR_MAX, the precision is unspecified. [If it is a flag, this would
    overrule the normal meaning of the precision.]

Issues:

[comma]

Should there be a mechanism to allow the grouping to depend on the
format (e.g. decimal output grouped in threes, hex output grouped in
fours) ?

I am informed that there are circumstances where the /thousands_sep/
character is different for each grouping. For example, a notation
commonly used in Japan (particularly in newspapers) places characters
meaning "myriad", "hundred million", "billion" and so on between the
groups. This would require changing the separator to be a list of
strings, and providing a convention to indicate this (for example, using
CHAR_MAX as the first byte of the string).

[dollar]

I've used the normal rule that the specified precision overrides the
default. An alternative would be that the precision applies only if the
locale-specified value is CHAR_MAX. Which is preferable, or should there
be a way to choose ?
If $ is a flag and is used with %d, should it scale the value to the
appropriate number of fractional digits ? For example, "%$6.2d" might
indicate that the integer is to be printed in /ddd.dd/ form, with 12345
being printed as "123.45". Should %$d and %$i behave differently in this
case ?

If $ is a format, should there be an equivalent for integral types ?

Since this proposal was drafted, it has been pointed out to me that any
use of $ will conflict with the X/Open mechanisms, which use descriptors
of the form "%1$d", "$*2$3$d", and "%*6$.*5$4$d".

====

Item 12:

Change some C locale values in 7.5 (<locale.h>) paragraph 2 from:

    mon_decimal_point   ""
    negative_sign       ""

to:

|   mon_decimal_point   "."
|   negative_sign       "-"

====

Item 13:

Change the last sentence of subclause 7.10.1.1 (signal()) paragraph 2
from:

    Such a function is called a signal handler.

to:

    An invocation of such a function because of a signal, or of any
    further functions called by that invocation (other than functions in
    the standard library), is called a /signal handler/.

Change subclause 7.10.1.1 paragraph 4 from:

    If the signal occurs other than as the result of calling the abort
    or raise function, the behavior is undefined if the signal handler
    calls any function in the standard library other than the signal
    function itself (with a first argument of the signal number
    corresponding to the signal that caused the invocation of the
    handler) or refers to any object with static storage duration other
    than by assigning a value to an object declared as type volatile
    sig_atomic_t. Furthermore, if such a call to the signal function
    results in a SIG_ERR return, the value of errno is indeterminate.
    [161]

(wording change of DR 149 applied)

to:

    If the signal occurs other than as the result of calling the abort
    or raise function, the behaviour is undefined if the signal handler
|   includes a call to any function in the standard library other than:
|   - the abort, exit, or longjmp functions, or
|   - the signal function itself, with the first argument equal to the
|     signal number corresponding to the signal that caused the
|     invocation of the handler,
|   or refers to any object with static storage duration other than
|   - by assigning a value to an object declared as volatile sig_atomic_t,
|   - as part of the first argument to a call to the longjmp
|     function[*], or
|   - as part of the execution of the abort, exit, or longjmp functions
|     (but not as part of any signal handler or function registered with
|     the atexit function called from them).
    Furthermore, if such a call to the signal function results in a
    SIG_ERR return, the value of errno is indeterminate. [161]

|   [*] That is, given:
|       static jmp_buf env;
|   then:
|       longjmp (env, 0);
|   is valid, but:
|       auto jmp_buf e2;
|       memcpy (e2, env, sizeof env);
|       longjmp (e2, 0);
|   is not, even though the value of /env/ will eventually be used in
|   such a call.

I suspect that subclause 7.10.1.1 paragraph 7 also needs to say "... for
the most recent successful call ...".

====

Item 14:

The Standard is somewhat unclear about the details of stdio buffering.
For example, considering output (the analogous situation happens with
input) a call to fputc() can have one of the following effects:

(1) the character is sent to the underlying system;
(2) the character is written to a buffer;
(3) the character is written to a buffer and then a number of characters
    are sent to the underlying system from the buffer;
(4) a number of characters are sent to the underlying system from a
    buffer, and then the character is written to the buffer.

In case (1), failure can be reported in a straightforward manner, and it
can be assumed that case (2) never fails.
The question is: what will happen if cases (3) or (4) have a failure
during the output, but not directly as a result of that character (that
is, the error occurs earlier on in the buffer) ?

The present wording of the Standard implies that an error in outputting
a character can only be reported on that call to fputc(), and not on any
subsequent call. This needs to be changed, or buffering becomes a
nonsense - the implementation would be required to *predict* whether a
write will succeed. A suitable location is 7.12.3, and the wording needs
to say something along the following lines:

    If output is buffered, then it may be transmitted to the host
    environment at any subsequent call to fputc(), and shall be
    transmitted no later than the next fflush() call or when the stream
    is closed. Thus a call to fputc() may fail and set the error
    indicator on the stream because of the earlier output. Similarly, if
    input is buffered, a call to fgetc() may cause the error indicator
    to be set even though the same call on an unbuffered stream would
    not (because the error is associated with a later character in the
    input).

    Even if the data is successfully transmitted to the host
    environment, it is possible for an error to occur within the latter.
    If this happens after the stream has been closed, it cannot be
    reported to the application; if it occurs earlier, it is
    implementation-defined when it is so reported.

A secondary issue is: can the buffer be sent to the underlying system
other than within a call to fputc(); is asynchronous I/O permitted ? If
so, then:

    When a stream is buffered, characters may be transmitted to or from
    the host environment other than as part of a library function, and
    thus the error indicator for the stream may be set outside such a
    function (the indicator can only be cleared as part of a function
    that explicitly states it does so).
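
The practical consequence of cases (2) to (4) can be sketched in code. A
program that wants to know whether its output reached the host
environment cannot rely on the return value of each fputc() alone; it
also has to check the flush. The helper name put_string_checked below is
hypothetical and is only an illustration of the calling pattern, not
proposed wording.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: writes `s` one character at a time and then
 * flushes the stream. Returns 0 on success, -1 if any step reports an
 * error.
 */
int put_string_checked(FILE *fp, const char *s)
{
    assert(fp != NULL && s != NULL);

    for (; *s != '\0'; s++) {
        /* On a buffered stream, a failure reported here may have been
           caused by characters written by earlier calls. */
        if (fputc((unsigned char) *s, fp) == EOF)
            return -1;
    }

    /* Characters still held in the buffer are transmitted now, so a
       write error for any of them can surface only at this point. */
    if (fflush(fp) == EOF)
        return -1;

    return ferror(fp) ? -1 : 0;
}
```

Even a return of 0 says nothing about errors that occur inside the host
environment after the data has been handed over, which is the situation
the second proposed paragraph describes.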
====

Item 15:

Is there a need to provide a way to make the three standard streams be
binary, in the same way that they can already be made wide ? Without it,
there's no strictly-conforming way to write "cat". Even with it there is
the trailing zero byte problem.

====

Item 16:

There is no way to determine whether two fpos_t values represent the
same position in a file. Therefore, it is not possible to do the
following:

    open a file
    read through it, looking for some mark
    note the position using fgetpos()
    rewind
    read through it again to the same position, using calls to fgetpos()
        to determine where you are, rather than recalculating it

I suggest the following function be added to subclause 7.12.9:

    struct fcmppos fcmppos (fpos_t* a, fpos_t* b, FILE *stream)

    Compares two fpos_t values that refer to the given stream; if either
    argument is a null pointer, the result of a call to fgetpos() on the
    stream is used instead. The resulting structure contains at least
    the following fields:

        int before;   // Less than, equal to, or greater than zero according
                      // to whether /a/ is before, at the same location as,
                      // or after /b/ in the file.
        int mbstate;  // Zero if the two positions have the same multibyte
                      // parsing status.

    If the stream has been written to at any point before the later of
    the two positions, the behaviour is undefined.

====

Item 17:

Add to subclause 7.13.4.2 (atexit()) paragraph 2:

    Whether the function is called on abnormal program termination is
    implementation-defined.

or it could be unspecified.

====

Item 18a:

Change the last sentence of subclause 7.13.3 paragraph 1 from:

    The value of a pointer that refers to freed space is indeterminate.

to:

|   The value of a pointer that refers to freed space, or to space that
|   has subsequently been moved, is indeterminate.
In subclause 7.13.3.4 (realloc ()) paragraph 2, change:

    or if the space has been deallocated by a call to the /free/ or
    /realloc/ function,

to:

|   or if the space has been deallocated or moved by a call to the
|   /free/ or /realloc/ function,

====

Item 18b:

The Standard provides no way to determine whether realloc() has moved
the memory; this is something you want to do if you have pointers to
within the block of memory. If it hasn't moved, the returned pointer
will compare equal to the pointer argument. But if it has, you cannot
make the comparison because a pointer to freed memory (and thus to moved
memory) is indeterminate, and the comparison is undefined behaviour
(unless you go through hoops like using memcpy()).

There is a rationale behind this last part (making a legitimate value
suddenly become illegitimate): some implementations may check pointers
for validity whenever they are loaded into a register. However, it is a
problem.

Should the comparison be permitted ? Is it desirable to provide at least
some mechanism to determine if the memory has moved ?

====

Item 19:

The specification of the comparison functions for bsearch() and qsort()
(7.13.5.1 and 7.13.5.2) is insufficient to safely code them. In
particular, it does not address the following issues.

(1) Are the pointers to objects within the base array (or the key
    object), or can they be to copies ?
(2) Can the comparison alter the values of the pointed-to objects ?
(3) If so, does the alteration persist ?
(4) What are the requirements on the consistency of the comparison
    results ?

I propose that comparisons are not allowed to alter the values, and
therefore that the implementation can pass pointers to copies of the
objects. [This, of course, invalidates an item in one of my articles in
CUJ :-]

Therefore add the following immediately after the heading of 7.13.5
(there is currently no text between that and the heading of 7.13.5.1).

    [1] These utilities make use of a comparison function.
    This shall behave in the following way.

    [2] The implementation shall ensure that the second argument (when
    called from /bsearch/), or both arguments (when called from
    /qsort/), shall be pointers to an element of the array, or to a copy
    of such an element. The first argument when called from /bsearch/
    shall equal /key/. The function shall make its comparison based on
    the pointed-to objects, and not the specific addresses passed to it.

    [3] The comparison function shall not alter the contents of the
    array. The implementation may reorder elements of the array between
    calls to the comparison function, but shall not alter the contents
    of any individual element.

    [4] When the same object (consisting of /size/ bytes, irrespective
    of its current position in the array) is passed more than once to
    the comparison function, the results shall be consistent with one
    another. That is, for /qsort/ they shall define a total ordering on
    the array, and for /bsearch/ the same object shall always compare
    the same way with the key.

    [5] A sequence point occurs immediately before and immediately after
    each call to the comparison function, and also between any call to
    the comparison function and any movement of the objects passed as
    arguments to that call.

If it is felt desirable that the pointers *shall* always point into the
array, then replace paragraph [2] above by:

    [2] The implementation shall ensure that the second argument (when
    called from /bsearch/), or both arguments (when called from
    /qsort/), shall be pointers to elements of the array [*]. The first
    argument when called from /bsearch/ shall equal /key/.
    [*] That is, if the value passed is /p/, then the following
    expressions are always non-zero:

        ((char *) p - (char *) base) % size == 0
        (char *) p >= (char *) base
        (char *) p < (char *) base + nmemb * size

====

Item 20:

Modify the definition of a conversion specifier in 7.15.3.5 (strftime())
to allow the following:

* the - flag
* the 0 flag
* a field width
* a precision,

with the meaning that fprintf assigns for %d and %s.

====

Item 21:

The conversion carried out by localtime does not provide any way of
determining the time zone used, and the normalization done by mktime
does not specify how DST changes are handled. Similarly, many systems
are now aware of leap seconds, but the Standard is not clear on how
these are to be handled.

Adding this information is not trivial, because there is no obvious way
to extend /struct tm/ in a compatible manner. This proposal therefore
contains a kludge.

[The following is not final wording; it is first necessary to agree the
semantics.]

Add the following fields to struct tm:

    int tm_version;    /* version number of the structure layout */
    int tm_utcoffset;  /* offset from UTC in minutes - [-1439, +1439] */
    int tm_leapsecs;   /* leap seconds applied */
    int tm_xisdst;     /* daylight saving time flag - [-1, +1439] */

and add the following macros to <time.h>, all constant integral
expressions capable of being stored in an object of type int:

    _EXTENDED_TM
    _NO_LEAP_SECONDS
    _LOCALTIME

The /gmtime/ function shall set tm_utcoffset to 0, while the /localtime/
function shall set it according to the local time zone, including any
DST corrections; a negative value for tm_utcoffset indicates ahead of
UTC, so that PDT is represented by +420. If the implementation is unable
to determine the local zone, /localtime/ shall set this field to
_LOCALTIME and /gmtime/ shall fail.
Both functions shall set tm_isdst to represent whether DST is (believed
to be) in effect at the represented time, and tm_xisdst to -1, 0, or the
(positive) size of the DST offset, in minutes, according as whether
tm_isdst is less than, equal to, or greater than zero.

Both functions shall set tm_leapsecs to indicate the number of leap
seconds that have been applied to the resulting value (if tm_sec == 60,
the relevant leap second is *not* included in the count). If the
implementation is not aware of leap seconds, it shall set tm_leapsecs to
_NO_LEAP_SECONDS.

Both functions shall set tm_version to 1.

The /mktime/ function shall behave as follows. If the tm_isdst field is
equal to _EXTENDED_TM, then the tm_version field shall be 1. The broken
down time is normalized according to the following rules, and also
converted to a /time_t/ representation. If the call is successful, a
second call to /mktime/ with the resulting /struct tm/ value shall
always leave it unchanged and return the same value as the first call.
If the call is successful and the normalized time is exactly
representable as a /time_t/ value, then the normalized broken-down time,
and the broken-down time generated by converting the result of /mktime/
as if by a call to /localtime/, shall be identical except that, if the
tm_isdst member of the former originally had the value _EXTENDED_TM, it
shall remain unchanged.

A time is normalized according to the following rules. The principle
behind normalization is that the date is converted to a number of
seconds past some epoch, and then converted back to the correct
normalized form.

If the tm_isdst member does not equal _EXTENDED_TM, then the rules shall
be applied as if:

- tm_leapsecs is _NO_LEAP_SECONDS;
- tm_utcoffset is _LOCALTIME;
- tm_xisdst is -1, 0, or +60 according to whether tm_isdst is less than,
  equal to, or greater than zero.

All dates are in the Gregorian calendar.
Thus a value of -800 for /tm_year/ represents 1100 CE, while a value of
-2000 represents -100 CE (101 BCE, since year 0 is 1 BCE); neither are
leap years, while -2300 (-400 CE, 401 BCE) is.

The value of /tm_leapsecs/ is the number of leap seconds applied (the
value of UTC-UT0) at the represented time. It should therefore be added
to the value determined by (days*86400 + hours*3600 + mins*60 +
seconds). If the value is _NO_LEAP_SECONDS, then the implementation
should determine the correct number if it can, and use 0 otherwise.

The value of /tm_utcoffset/ is a number of minutes to be added to the
time to convert it to UTC. The value _LOCALTIME is a request for the
implementation to determine this; if it is unknown, it should assume
that local time is UTC plus any DST offset determined from /tm_xisdst/.

If /tm_mon/ is outside the range [0, 11], it shall be converted to that
range by adding or subtracting a multiple of 12 and adjusting the year
accordingly. This shall then be used to determine the number of days in
the year prior to the month. Thus /tm_year/ == 97 and /tm_mon/ == -8
represents May of 1996, a leap year.

Apart from this, the final date can be determined simply by adding
together the various fields, each with a suitable weight, to get the
number of seconds past the epoch.

The normalization should be exact provided that there is no unreasonable
overflow. I would consider reasonable limitations to be that each of the
following expressions is in the range [-1<<30,+1<<30]:

    tm_year * 366
    tm_mon * 31
    tm_mday
    tm_hour * 3600
    tm_min * 60
    tm_sec
    tm_leapsecs
    tm_utcoffset * 60
    tm_xisdst * 60    [if nonnegative, else tm_xisdst must be -1]

This would ensure that separate "seconds in the day" and "days since
epoch" calculations won't overflow in 32 bits.

====

Item 22:

Add the following alternate forms to the conversion specifiers in
7.15.3.5 (strftime()), taking effect when the # flag is included in the
specifier.
    %#w - weekday (1=Monday, 7=Sunday)
    %#W - ISO 8601 week number             ) If %W would be zero, the date is
    %#y - ISO 8601 week number year % 100  ) treated as belonging to week 53
    %#Y - ISO 8601 week number year        ) of the previous year
    %#Z - timezone in "+0800" notation, with + being west of Greenwich.

====

Item 23:

In subclause 7.15.1, change the range of tm_sec to [0,60] and remove
footnote 201. See various WG14 mailing list items (e.g. 3482) or:

# The International Earth Rotation Service periodically uses leap seconds
# to keep UTC to within 0.9 s of TAI (atomic time); see
# Terry J Quinn, The BIPM and the accurate measure of time,
# Proc IEEE 79, 7 (July 1991), 894-905.

====

Item 24:

Subclause 7.15.3.5 (strftime()) is unclear on how the values of the
members of /timeptr/ affect the result, especially if they are outside
the normal range. Add one of the following sets of wording, in each case
after paragraph 4:

Option [A]:

    If the value of any member of the structure pointed to by /timeptr/
    is out of the normal range, or the values are not consistent with
    one another [*], the behaviour is undefined.

    [*] For example, the contents represent "30th Feb", "29th Feb 1997",
    or "Monday 10th May 1997".

Option [B]:

    If the value of any member of the structure pointed to by /timeptr/
    is out of the normal range, or the values are not consistent with
    one another [*], the value returned and the contents of the array
    are unspecified.

    [*] For example, the contents represent "30th Feb", "29th Feb 1997",
    or "Monday 10th May 1997".

Option [C]:

    The characters placed in the array by each conversion specifier
    depend on a member of the structure pointed to by /timeptr/, as
    specified in brackets in the description. If this value is outside
    the normal range, the characters stored are unspecified.
If option [C] is taken, add the following to each specifier in paragraph
3:

    %a [tm_wday]
    %A [tm_wday]
    %b [tm_mon]
    %B [tm_mon]
    %c [all]
    %d [tm_mday]
    %H [tm_hour]
    %I [tm_hour]
    %j [tm_yday]
    %m [tm_mon]
    %M [tm_min]
    %p [tm_hour]
    %S [tm_sec]
    %U [tm_year, tm_mday]
    %w [tm_wday]
    %W [tm_year, tm_mday]
    %x [all]
    %X [all]
    %y [tm_year]
    %Y [tm_year]
    %Z [tm_isdst]

If item 21 is accepted, then %Z becomes [tm_utcoffset, tm_isdst,
tm_xisdst].

====

Item 25:

All the conversion specifiers in subclause 7.15.3.5 (strftime()) should
have values specified for the C locale.
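
Whichever option of Item 24 is chosen, a careful program will want to
test the "normal range" condition itself before calling strftime(). The
following is a minimal sketch of such a test. The helper name
tm_in_normal_range is hypothetical, and the upper limit of 60 for tm_sec
assumes that the change proposed in Item 23 is adopted. The sketch does
not attempt the consistency checks mentioned in options [A] and [B]
(for example, it accepts "30th Feb").

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper: returns 1 if every member of *t that strftime()
 * may consult lies in its normal range, and 0 otherwise.
 * Assumption: tm_sec is limited to [0,60] as proposed in Item 23.
 */
int tm_in_normal_range(const struct tm *t)
{
    assert(t != NULL);

    return t->tm_sec  >= 0 && t->tm_sec  <= 60
        && t->tm_min  >= 0 && t->tm_min  <= 59
        && t->tm_hour >= 0 && t->tm_hour <= 23
        && t->tm_mday >= 1 && t->tm_mday <= 31
        && t->tm_mon  >= 0 && t->tm_mon  <= 11
        && t->tm_wday >= 0 && t->tm_wday <= 6
        && t->tm_yday >= 0 && t->tm_yday <= 365;
}
```

Under option [C] a finer-grained test would be enough: only the members
listed in brackets for the specifiers actually used need to be checked.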