Document Number: N2658
Submitter: Aaron Peter Bachmann
Submission Date: 2021-12-30
Make pointer type casting useful without negatively impacting performance - updates N2484

Summary

Pointer type punning is useful, but according to the standards ISO/IEC 9899:1999 ... ISO/IEC 9899:2018 it is mostly undefined behavior; before that it had implementation-defined aspects. Therefore, as programmers we have to resort to additional, non-portable extensions that compilers normally provide, switch to another programming language (often assembler), or use work-arounds to achieve our intended goals. Strict aliasing rules are also useful: they allow for more efficient code without extra work for the programmer. We can allow type punning in many cases without negatively impacting performance.
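
To illustrate the status quo, here is a minimal sketch of my own (not part of the proposal; like the f1() example below it assumes sizeof(float)==sizeof(unsigned) and an IEEE-754-style sign bit): negating a float by flipping its sign bit. The cast variant is what programmers naturally write, yet it is undefined; the memcpy() variant is the portable work-around the current standards force upon us.

#include <string.h>

float negate_cast(float x){        // undefined behavior under C99 ... C17
    unsigned u = *(unsigned*)&x;   // reads a float object via an unsigned lvalue
    u ^= 1u<<31;                   // flip the sign bit
    return *(float*)&u;            // reads an unsigned object via a float lvalue
}

float negate_memcpy(float x){      // well defined; efficiency depends on the optimizer
    unsigned u;
    memcpy(&u, &x, sizeof u);      // copy the object representation instead
    u ^= 1u<<31;
    memcpy(&x, &u, sizeof u);
    return x;
}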

Changes relative to N2484

Prior work

N2279 [2] proposed to re-allow pointer casts. The proposal is very clear and simple. Its adoption would have made C a much more user-friendly language. It seems it was rejected by the committee in order to keep the benefit of strict pointer aliasing, i.e. more efficient code - undeniably a valid argument.

Why pointer casting is useful

Why strict aliasing rules are useful

Prior art

Problem

If pointers are derived from other pointers via type casting, the strict aliasing rules are often violated. Thus, the programs are incorrect.
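
A minimal sketch of the problem (the function names are mine): under the strict aliasing rules a compiler may assume that an int* and a float* never refer to the same object, so it may cache or reorder accesses across the store.

int touch(int *i, float *f){
    *i = 1;
    *f = 2.0f;  // the compiler may assume this store cannot modify *i ...
    return *i;  // ... and fold this read to the constant 1
}

int broken(void){
    int n = 0;
    return touch(&n, (float*)&n); // the cast itself compiles, but the access
                                  // via *f violates the aliasing rules: undefined
}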

Proposed solution

Make object-pointer casting (casting a pointer type to a pointer to a different type) valid and well defined. Accesses via such a pointer shall be valid as well, provided we honor the other restrictions (const, alignment, ...).
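
A short sketch of my reading of this: the access through the cast pointer becomes well defined, while the const and alignment restrictions keep their force.

unsigned bits_of(float *fp){      // assumes sizeof(float)==sizeof(unsigned)
    return *(unsigned*)fp;        // becomes valid under the proposal
}

void still_restricted(const float *cf, char *buf){
    *(float*)cf = 0.0f;           // still invalid if the referenced object is const
    (void)*(unsigned*)(buf + 1);  // still invalid where buf+1 is misaligned for unsigned
}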

Discussion of the proposed solution and examples

Except as noted in the proposal, the strict aliasing rules shall remain intact.

#include <limits.h> // for CHAR_BIT

void f1(float **f, unsigned **u){
        _Static_assert(sizeof(float)==sizeof(unsigned),"precondition violated");
        _Static_assert(4==sizeof(unsigned),"precondition violated");
        _Static_assert(8==CHAR_BIT,"precondition violated");
        unsigned u32;
    #ifdef DO_SOMETHING_INVALID
        *f=*u; // invalid
        *u=*f; // invalid
        *f=*(float**)u; // invalid
        *u=*(unsigned**)f; // invalid
    #else
        (void)u;
    #endif
        float *fp=*f;
        unsigned *up=*u;
        u32=*(unsigned*)fp; // valid according to the proposal ... reading
        u32^=1u<<31;
        // under the restrictions given above a compiler not seeing
        // the implementation of the function but its prototype only
        // must already assume:
        // **f may be changed (as float); in this case **f=-**f
        *(float*)up=u32; // valid according to the proposal ... writing
}


An example - taken from musl-libc - modified for the proposal:
#include <string.h>
#include <limits.h>
#include <stdint.h>
#define ALIGN (sizeof(size_t))
#define ONES ((size_t)-1/UCHAR_MAX)          // 0x01010101 for a 32-bit size_t
#define HIGHS (ONES * (UCHAR_MAX/2+1))       // 0x80808080 for a 32-bit size_t
#define HASZERO(x) ((x)-ONES & ~(x) & HIGHS) // only a 0-byte has borrow & high bit 0
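// Worked example of HASZERO() for a 32-bit size_t (hand-checked):
//   x = 0x61626300: x-ONES = 0x606161FF, ~x = 0x9E9D9CFF,
//                   (x-ONES) & ~x & HIGHS = 0x00000080 != 0 -> zero byte found
//   x = 0x61626364: x-ONES = 0x60616263, ~x = 0x9E9D9C9B,
//                   (x-ONES) & ~x & HIGHS = 0x00000000      -> no zero byte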

// harmless but undefined, because we eventually read more than the object size!

//#define HELP_THE_COMPILER
#if defined(HELP_THE_COMPILER) && defined(__GNUC__)
    #define AA(p,a) __builtin_assume_aligned(p,a) // non-portable hint
#else
    #define AA(p,a) (p)
#endif

size_t strlen_1(const char *s){
    union U {
        size_t v;
        char c[sizeof(size_t)];
    } u;
    const char *a = s;
    for (; (uintptr_t)s % ALIGN; s++){
        if (!*s)
            return s-a;
    }
    // pray the compiler will remove memcpy() and use a size_t-wide access
    for (;memcpy(&u.c,AA(s,ALIGN),sizeof(size_t)), !HASZERO(u.v); s+=sizeof(size_t))
        ;
    for (; *s; s++)
        ;
    return s-a;
}

size_t strlen_2(const char *s){
    const char *a = s;
    for (; (uintptr_t)s % ALIGN; s++){
        if (!*s)
            return s-a;
    }
    for (; !HASZERO(*(size_t*)s); s+=sizeof(size_t)) // code matching this proposal
        ;
    for (; *s; s++)
        ;
    return s-a;
}
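
A small test driver of my own, for illustration only: it checks both variants against the library strlen(). The string lives in a size_t array so that the word-wide reads of strlen_2() past the terminator stay inside the allocation.

#include <assert.h>

int main(void){
    static const union {
        size_t align[16];               // forces size_t alignment (and size)
        char s[16*sizeof(size_t)];
    } t = { .s = "hello, world" };
    assert(strlen_1(t.s) == strlen(t.s));
    assert(strlen_2(t.s) == strlen(t.s));
    assert(strlen_1(t.s+1) == strlen(t.s+1)); // misaligned starts are handled
    assert(strlen_2(t.s+3) == strlen(t.s+3)); // by the leading byte-wise loop
    return 0;
}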

All compilers tested fail to recognize that the pointer in the 2nd loop of strlen_1() is aligned unless the programmer provides a hint in a non-portable way. The code generated is poor for many CPUs without support for misaligned access. We show examples of code produced for a few targets with a few compilers.
For strlen_2() it is immediately clear, without further analysis, that the pointer must be properly aligned; otherwise the code would be invalid even if this proposal were accepted.

static float *Fp=(float*)&Something;
static unsigned *Uu=(unsigned*)&Something; // invalid

Proposed wording changes


The wording changes are relative to N2596 [3].

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:89)

xxx) The object representation is reinterpreted as an object representation in the type used for the access as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.
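
A tiny example of the effect of the proposed bullet as I understand it: the access yields a reinterpretation of the object representation, not a value conversion (the comments assume IEEE-754 binary32 floats).

#include <stdio.h>

int main(void){
    unsigned u = 0x3F800000u;    // the binary32 representation of 1.0f
    printf("%f\n", *(float*)&u); // type punning:     prints 1.000000
    printf("%f\n", (float)u);    // value conversion: prints 1065353216.000000
    return 0;
}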

Examples of code produced by compilers with and without the changes to the standard proposed here

In the listings below, comments (// ...) mark the inner loops and point out the more interesting code parts.

strlen_1() ARM gcc 11.2 -Os -mthumb -mcpu=cortex-m0 -mfloat-abi=soft

strlen_1:
        movs    r3, r0
        movs    r2, #3
        push    {r4, lr}
.L2:
        tst     r3, r2
        bne     .L5
        ldr     r4, .L11
.L6: // memcpy() is inlined, but we have only byte-loads
        ldrb    r2, [r3, #1]
        ldrb    r1, [r3]
        lsls    r2, r2, #8
        orrs    r2, r1
        ldrb    r1, [r3, #2]
        lsls    r1, r1, #16
        orrs    r1, r2
        ldrb    r2, [r3, #3]
        lsls    r2, r2, #24
        orrs    r2, r1
        ldr     r1, .L11+4
        adds    r1, r2, r1
        bics    r1, r2
        tst     r1, r4
        beq     .L7
.L8:
        ldrb    r2, [r3]
        cmp     r2, #0
        beq     .L10
        adds    r3, r3, #1
        b       .L8
.L5:
        ldrb    r1, [r3]
        cmp     r1, #0
        bne     .L3
.L10:
        subs    r0, r3, r0
        pop     {r4, pc}
.L3:
        adds    r3, r3, #1
        b       .L2
.L7:
        adds    r3, r3, #4
        b       .L6
.L11:
        .word   -2139062144
        .word   -16843009

strlen_1() armv7-a clang 11.0.1 -Os -mthumb -mcpu=cortex-m0 -mfloat-abi=soft

strlen_1:
        push    {r4, r5, r6, lr}
        lsls    r1, r0, #30
        mov     r1, r0
        beq     .LBB0_6
        ldrb    r1, [r0]
        cmp     r1, #0
        beq     .LBB0_10
        adds    r1, r0, #1
.LBB0_3:                                @ =>This Inner Loop Header: Depth=1
        lsls    r2, r1, #30
        beq     .LBB0_6
        adds    r2, r1, #1
        ldrb    r1, [r1]
        cmp     r1, #0
        mov     r1, r2
        bne     .LBB0_3
        subs    r1, r2, #4
        adds    r1, r1, #3
        b       .LBB0_11
.LBB0_6:
        subs    r1, r1, #4
        ldr     r2, .LCPI0_0
        ldr     r3, .LCPI0_1
.LBB0_7: // memcpy() is inlined, but we have only byte-loads    @ =>This Inner Loop Header: Depth=1
        ldrb    r4, [r1, #4]
        ldrb    r5, [r1, #5]
        lsls    r5, r5, #8
        adds    r4, r5, r4
        ldrb    r5, [r1, #6]
        ldrb    r6, [r1, #7]
        lsls    r6, r6, #8
        adds    r5, r6, r5
        lsls    r5, r5, #16
        adds    r4, r5, r4
        adds    r5, r4, r3
        mov     r6, r2
        bics    r6, r4
        adds    r1, r1, #4
        tst     r6, r5
        beq     .LBB0_7
        lsls    r2, r4, #24
        beq     .LBB0_11
.LBB0_9:                                @ =>This Inner Loop Header: Depth=1
        ldrb    r2, [r1, #1]
        adds    r1, r1, #1
        cmp     r2, #0
        bne     .LBB0_9
        b       .LBB0_11
.LBB0_10:
        mov     r1, r0
.LBB0_11:
        subs    r0, r1, r0
        pop     {r4, r5, r6, pc}
.LCPI0_0:
        .long   2155905152                      @ 0x80808080
.LCPI0_1:
        .long   4278124287                      @ 0xfefefeff

strlen_1() MIPS64 gcc 5.4 -Os

strlen_1:
        move    $2,$4
.L2:
        andi    $3,$2,0x7
        beq     $3,$0,.L13
        li      $5,-8454144       # 0xffffffffff7f0000

        lb      $3,0($2)
        beq     $3,$0,.L11
        nop

        b       .L2
        daddiu  $2,$2,1

.L13:
        li      $3,-16711680                        # 0xffffffffff010000
        daddiu  $5,$5,32639
        daddiu  $3,$3,257
        dsll    $5,$5,16
        dsll    $3,$3,16
        daddiu  $5,$5,32639
        daddiu  $3,$3,257
        dsll    $5,$5,17
        dsll    $3,$3,23
        ori     $5,$5,0xfeff
        ori     $3,$3,0x8080
.L6: // memcpy() is inlined 
        ldl     $6,0($2)  // 2 instructions for unaligned access for 8 bytes ... nice
        ldr     $6,7($2)  
        daddu   $7,$6,$5
        nor     $6,$0,$6
        and     $6,$7,$6
        and     $6,$6,$3
        bnel    $6,$0,.L14
        lb      $3,0($2)

        b       .L6
        daddiu  $2,$2,8 // off-topic: instruction in delay-slot

.L8:
        lb      $3,0($2)
.L14:
        beq     $3,$0,.L11
        nop

        b       .L8
        daddiu  $2,$2,1

.L11:
        j       $31
        dsubu   $2,$2,$4


strlen_1() ARM gcc 11.2 -Os -mthumb -mcpu=cortex-m0 -mfloat-abi=soft -DHELP_THE_COMPILER

strlen_1:
        movs    r3, r0
        movs    r2, #3
        push    {r4, lr}
.L2:
        tst     r3, r2
        bne     .L5
        ldr     r4, .L11
.L6: // with help we have word loads even with memcpy()
        ldr     r1, [r3] // off-topic: better use ldm r3!,{r1}
        ldr     r2, .L11+4
// off-topic: bad: loading a constant PC-relative in a loop!
        adds    r2, r1, r2
        bics    r2, r1
        tst     r2, r4
        beq     .L7 // off-topic: 2nd branch in a loop, with the change above jump to .L6
.L8:
        ldrb    r2, [r3]
        cmp     r2, #0
        beq     .L10
        adds    r3, r3, #1
        b       .L8
.L5:
        ldrb    r1, [r3]
        cmp     r1, #0
        bne     .L3
.L10:
        subs    r0, r3, r0
        pop     {r4, pc}
.L3:
        adds    r3, r3, #1
        b       .L2
.L7:
        adds    r3, r3, #4 // off-topic: rather use ldm r3!,{r1} instead of ldr r1, [r3]
        b       .L6
.L11:
        .word   -2139062144
        .word   -16843009


strlen_2() ARM gcc 11.2 -Os -mthumb -mcpu=cortex-m0 -mfloat-abi=soft

strlen_2:
        movs    r3, r0
        movs    r2, #3
        push    {r4, lr}
.L2:
        tst     r3, r2
        bne     .L5
        ldr     r4, .L11
.L6:
        ldr     r1, [r3] // the proposal improved code-quality
        ldr     r2, .L11+4 // off-topic: unrelated problems already outlined above remain
        adds    r2, r1, r2
        bics    r2, r1
        tst     r2, r4
        beq     .L7 // off-topic: unrelated problems already outlined above remain
.L8:
        ldrb    r2, [r3]
        cmp     r2, #0
        beq     .L10
        adds    r3, r3, #1
        b       .L8
.L5:
        ldrb    r1, [r3]
        cmp     r1, #0
        bne     .L3
.L10:
        subs    r0, r3, r0
        pop     {r4, pc}
.L3:
        adds    r3, r3, #1
        b       .L2
.L7:
        adds    r3, r3, #4 // off-topic: unrelated problems already outlined above remain
        b       .L6

.L11:
        .word   -2139062144
        .word   -16843009

strlen_2() armv7-a clang 11.0.1 -Os -mthumb -mcpu=cortex-m0 -mfloat-abi=soft

strlen_2:
        push    {r4, r5, r6, lr}
        lsls    r1, r0, #30
        mov     r1, r0
        beq     .LBB0_6
        ldrb    r1, [r0]
        cmp     r1, #0
        beq     .LBB0_10
        adds    r1, r0, #1
.LBB0_3:                                @ =>This Inner Loop Header: Depth=1
        lsls    r2, r1, #30
        beq     .LBB0_6
        adds    r2, r1, #1
        ldrb    r1, [r1]
        cmp     r1, #0
        mov     r1, r2
        bne     .LBB0_3
        subs    r1, r2, #4
        adds    r1, r1, #3
        b       .LBB0_11
.LBB0_6:
        subs    r1, r1, #4
        ldr     r2, .LCPI0_0
        ldr     r3, .LCPI0_1
.LBB0_7:                                @ =>This Inner Loop Header: Depth=1
        ldr     r4, [r1, #4] // the proposal improved code-quality 
        adds    r5, r4, r3
        mov     r6, r2
        bics    r6, r4
        adds    r1, r1, #4 // off-topic: with ldm this does not need to be in the loop
        tst     r6, r5
        beq     .LBB0_7
        lsls    r2, r4, #24
        beq     .LBB0_11
.LBB0_9:                                @ =>This Inner Loop Header: Depth=1
        ldrb    r2, [r1, #1]
        adds    r1, r1, #1
        cmp     r2, #0
        bne     .LBB0_9
        b       .LBB0_11
.LBB0_10:
        mov     r1, r0
.LBB0_11:
        subs    r0, r1, r0
        pop     {r4, r5, r6, pc}
.LCPI0_0:
        .long   2155905152                      @ 0x80808080
.LCPI0_1:
        .long   4278124287                      @ 0xfefefeff

Brief discussion of the results

For ARM Cortex-M0 a human would use 6 instructions in the inner loop (9 cycles). With the proposal in place, compilers use 7...8 instructions and 10...14 cycles in the inner loop (assuming no additional penalty due to wait-states for the PC-relative load). Without the proposal, compilers use 16...17 instructions (22...25 cycles) in the inner loop.
Using uint64_t instead of size_t could further increase the discrepancy.
While recognizing the alignment is certainly something a compiler could do for the relatively simple examples given, none of the compilers used in the experiments does so. Experiments with other compiler options and other compilers (both older and from other vendors), executed in summer 2021, gave similar results.
The same applies to many other CPUs without hardware support for misaligned memory accesses.
The asm listings given above were produced using godbolt.org [4].

Acknowledgements

I want to thank Martin Uecker for coming up with the idea of the "immediate cast", for providing most of the normative text, and for giving additional valuable advice. Without him the proposal would not exist in its current form.

References

[1] N2484 2020/02/17 Bachmann, Make pointer type casting useful without negatively impacting performance

[2] N2279 2018/07/11 Yodaiken, Proposal to make aliasing consistent

[3] N2596 2020/12/12 Meneide, C2x Working Draft

[4] https://godbolt.org/