Nifty features of the ARM architecture

2019-06-06 | tagged with

memory barriers
ARM
compilers

While sorting out GHC’s support for architectures with weakly-ordered memory subsystems I peeked at GCC’s lowering of the __sync_bool_compare_and_swap operation¹ on ARM. This reminded me of a few features of the architecture that I quite like and felt compelled to write about.

My test involved looking at this program,

$ cat hi.c
#include <stdint.h>
int hello(uint32_t *x) {
	return __sync_bool_compare_and_swap(x, 0, 1);
}
$ arm-linux-gnueabihf-gcc -c hi.c

This produced the following assembler:

Function prelude:
   0: b480        push  {r7}
   2: b083        sub sp, #12
   4: af00        add r7, sp, #0
   6: 6078        str r0, [r7, #4]
   8: 687b        ldr r3, [r7, #4]
   a: 2201        movs  r2, #1
Issue full barrier:
   c: f3bf 8f5b   dmb ish                         (1)
Load value and claim exclusive access:
  10: e853 1f00   ldrex r1, [r3]                  (2)
Compare, loop if necessary:
  14: 2900        cmp r1, #0
  16: d103        bne.n 20 <hello+0x20>
Store value and possibly cede exclusive access:
  18: e843 2000   strex r0, r2, [r3]              (2)
Ensure that store succeeded, loop if necessary:
  1c: 2800        cmp r0, #0
  1e: d1f7        bne.n 10 <hello+0x10>
Issue full barrier:
  20: f3bf 8f5b   dmb ish                         (1)
Function epilogue:
  24: bf0c        ite eq                          (3)
  26: 2301        moveq r3, #1
  28: 2300        movne r3, #0
  2a: 4618        mov r0, r3
  2c: 370c        adds  r7, #12
  2e: 46bd        mov sp, r7
  30: f85d 7b04   ldr.w r7, [sp], #4
  34: 4770        bx  lr

There are a few things here that I found interesting, parenthesized above and discussed in detail below.

1. Partitioned memory barriers

dmb is ARM’s Data Memory Barrier instruction. What is interesting is the options accepted by this instruction:

SY: Full system barrier
ST: Store barrier
ISH: Full barrier within inner shareable domain
ISHST: Store barrier within inner shareable domain
NSH: Full barrier out to point of unification
NSHST: Store barrier out to point of unification
OSH: Full barrier out to outer shareable domain
OSHST: Store barrier out to outer shareable domain

The amount of granularity embodied by these options is quite forward-thinking, especially given how much freedom it leaves with the microarchitecture designer:

inner/outer shareable domain: This allows a large system to be broken into multiple independently-coherent domains. For instance, a large machine may have several paritions, each running its own supervisor (e.g. hypervisor or kernel). The DMB ISH instruction provides an efficient way for a supervisor to issue a barrier within the “system” on which it runs.
point of unification: This is defined to be the point where the processor’s instruction and data caches are guaranteed to be coherent with its TLB walk logic. This depends upon the system design but is often the L2 cache.

This is handy for implementing JIT compilers and other self-modifying code as it provides a flush mechanism which ensures that code modified by the local core can be executed.

2. Exclusive load/store

For someone used to IA-32’s rather boring atomic primitives, ARM is a breath of fresh air. IA-32 provides a handful of instructions implementing a selection of atomic operations (e.g. atomic add, subtract, AND, OR, etc.) via the LOCK prefix. By constrast, ARM provides a far more flexible mechanism consisting of two instructions:

ldrex: This is similar to the typical ldr instruction, loading an address (given in a register) from memory into a destination register. However, it also lays a claim of “exclusive” access to the loaded region (typically a cacheline, I believe).
strex: This is stores a value from a register to a memory location previously claimed with ldrex. If the value is has changed since the location was claimed then the store is aborted and a failure result is returned.

These collectively provide the necessary pieces for a compare-and-swap operation, but allowing significantly more flexibility over a dedicated CAS instruction. For instance, one might do non-trivial computation on the loaded value before storing an updated value.

3. If/then/else instruction

If/then/else flow control is incredibly common in typical programs. ARM has a nifty it instruction which predicates the execution of up to three instructions according to a given condition.

For instance, consider the program:

// Assume:
//   uint64_t x, y, z;
//   uint64_t *p;

if (x == 0) {
  p[0] = x;
  p[1] = y;
} else {
  p[0] = z
}

this can be lowered as the following svelte instruction sequence:

cmp r0, #0         # x == 0?
ittte eq
str r0, [r7]       # p[0] = x;
str r1, [r7, #8]   # p[1] = y;
str r3, [r7]       # p[0] = z;
...

Here the it instruction conditional on the eq condition code and is followed by three condition switches (tte; where t should be read as “then” (the condition holds) and e as “else” (the condition does not hold)). This is a remarkably efficient encoding for this common idiom.

In particular, the documentation does not state with any certainty whether the operation implies a full memory barrier. Unfortunately it seems to be a pattern that memory ordering operations are woefully underdocumented despite their subtlety.↩︎