The Linux-Kernel Memory Model by Example

January 24, 2017

This article was contributed by Jade Alglave, Paul E. McKenney, Alan Stern, Luc Maranget, and Andrea Parri

Introduction

This document gives an example-driven intuitive introduction to the Linux kernel memory model. It is in no way a substitute for the mathematical model, in particular, these examples do not explore the many interesting corner cases of that model. However, these examples should prove useful for developers who would like to write good concurrent Linux-kernel code, but who don't mind giving up those last few percentage points of performance by staying a bit back from the edge of memory-model disaster.

This article assumes that you understand some of the material in the A Formal Model of Linux-Kernel Memory Ordering article, particularly the section entitled Memory Models and The Role of Cycles.

This article is organized as follows:

  1. Communications Relations
  2. Ordering For Free
  3. Local Execution-Based Ordering
  4. Release-Acquire Ordering
  5. Full-Barrier Transitive Ordering
  6. Restoring Sequential Consistency
  7. Special Cases
  8. Summary
These are followed by the inescapable answers to the quick quizzes.

Communications Relations

Before starting the examples, it is worthwhile reviewing the communications relations. These relations are as follows:

  1. rf: “read from”, where a read returns the value stored by an earlier write. This relation links the write to the read.
  2. co: “coherence”, where a write's value is overwritten by another write. This relation links the overwritten write to the overwriting write. Note that the co relation is transitive: If write A links to write B and write B links to write C, then write A will also link to write C.
  3. fr: “from read”, where a write overwrites the variable accessed by a read, but too late to prevent the read from returning the overwritten value. This relation links from the read to the write. Note that the fr relation is extended by the co relation, so that if fr links a read D to a write E, and if co links write E to another write F, then fr also links read D to write F. (More detail on fr may be found here.)

Each of these relations is described in more detail in one of the following sections.

Reads-From (rf)

Of these, the rf relation is the only one that implies a temporal constraint: In the absence of value-speculation optimizations, the read must have executed later than the write that supplied that read's value.

Quick Quiz 1: So what will you do if compilers or CPUs start using value-speculation optimizations?
Answer

A key point is that writing to a variable cannot result in an instantaneously visible change in value, courtesy of the fact that the speed of light is finite and that atoms are non-zero in size. A change will instead propagate through the system, as fancifully depicted below, with time advancing from left to right:

rf.svg

Quick Quiz 2: If it is not possible to make a change instantaneously visible, then how do sequentially consistent systems work?
Answer

The upshot is that if a read from a shared variable returns the value stored by a write, then there will be an rf link from that write to that read, and we know that the read must have executed after the write did.

Coherence (co)

Unlike the rf relation, the co relation does not imply any sort of temporal constraint. In fact, it is possible for an earlier write to overwrite a later write, as shown below:

co.svg

As can be seen in the figure, the write of the value 2 happened earlier in time, but it nevertheless overwrote the later write of the value 1, so that the co relation goes backwards in time. This outcome is due to the fact that the written values take time to propagate through the system, so that the system decides at some later time which of the two values wins.

Quick Quiz 3: Exactly how could the system “decide later” which write won?
Answer

Quick Quiz 4: Wouldn't it be way simpler if the last write always won?
Answer

From-Read (fr)

Like the co relation, the fr relation does not imply any sort of temporal constraint. Again, due to the finite speed with which information propagates through real systems, it is possible for a read to get a value from a write whose value is already destined to be overwritten, as shown below:

fr.svg

In this example, the read returned the initialization value of zero despite the write of the value 1 having already executed. However, this new value had not yet propagated from CPU 0 to CPU 3, so this later read, being unaware of the new value, returned the old one. The fr relation therefore goes backwards in time.

Relation Summary

In short, the rf relation implies a temporal constraint, while the co and fr relations do not. Therefore, as we will see in the remainder of this document, as a rough rule of thumb, the more rf links your code has, the less heavyweight ordering your code will need. This rule of thumb is illustrated by the examples in the remainder of this document.

Ordering For Free

This section describes types of ordering that are provided gratis by most CPUs, even the infamous DEC Alpha. But let's start with an example for which ordering is not guaranteed:

Example Litmus Test #1
  1 C C-LB+o-o+o-o+o-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   int r1;
  9
 10   r1 = READ_ONCE(*a);
 11   WRITE_ONCE(*b, 1);
 12 }
 13
 14 P1(int *b, int *c)
 15 {
 16   int r1;
 17
 18   r1 = READ_ONCE(*b);
 19   WRITE_ONCE(*c, 1);
 20 }
 21
 22 P2(int *a, int *c)
 23 {
 24   int r1;
 25
 26   r1 = READ_ONCE(*c);
 27   WRITE_ONCE(*a, 1);
 28 }
 29
 30 exists
 31 (0:r1=1 /\ 1:r1=1 /\ 2:r1=1)

Because there is no natural ordering nor any sort of ordering constraint on the CPU, the counter-intuitive outcome on the exists clause on the last line can happen. All that is needed is for the CPU to reorder the read and the write from any of the three processes. And the Linux-kernel memory model agrees with this assessment, as confirmed by the Sometimes 1 7:

Outcome for Example Litmus Test #1 (strong model)
 1 Test C-LB+o-o+o-o+o-o Allowed
 2 States 8
 3 0:r1=0; 1:r1=0; 2:r1=0;
 4 0:r1=0; 1:r1=0; 2:r1=1;
 5 0:r1=0; 1:r1=1; 2:r1=0;
 6 0:r1=0; 1:r1=1; 2:r1=1;
 7 0:r1=1; 1:r1=0; 2:r1=0;
 8 0:r1=1; 1:r1=0; 2:r1=1;
 9 0:r1=1; 1:r1=1; 2:r1=0;
10 0:r1=1; 1:r1=1; 2:r1=1;
11 Ok
12 Witnesses
13 Positive: 1 Negative: 7
14 Condition exists (0:r1=1 /\ 1:r1=1 /\ 2:r1=1)
15 Observation C-LB+o-o+o-o+o-o Sometimes 1 7
16 Hash=63d9717e69db4ea05bb2e6840d1d22d4

Quick Quiz 5: Only the CPU? Can't the compiler also reorder the reads and writes in Example Litmus Test #1?
Answer

The key point of this unordered example is that even if your litmus test has nothing but causal rf relations, it still needs at least some ordering to prohibit the counter-intuitive cyclic outcome. For example, if each process provides minimal ordering from its read to its write, then the cyclic outcome will be forbidden, as shown by the following litmus test:

Example Litmus Test #2
  1 C C-LB+ldref-o+o-ctrl-o+o-dep-o.litmus
  2
  3 {
  4   a=x0;
  5   c=y0;
  6   1:r2=b;
  7 }
  8
  9 P0(int **a)
 10 {
 11   int *r1;
 12
 13   r1 = READ_ONCE(*a);
 14   WRITE_ONCE(*r1, 1);
 15 }
 16
 17 P1(int *b, int **c)
 18 {
 19   int r1;
 20
 21   r1 = READ_ONCE(*b);
 22   if (r1)
 23     WRITE_ONCE(*c, r2);
 24 }
 25
 26 P2(int **a, int **c)
 27 {
 28   int *r1;
 29
 30   r1 = READ_ONCE(*c);
 31   WRITE_ONCE(*a, r1);
 32 }
 33
 34 exists
 35 (0:r1=b /\ 1:r1=1 /\ 2:r1=b)

In this litmus test, P0() has an address dependency between its read and write, P1() has a control dependency between its read and write, and P2() has a data dependency between its read and write. The initialization on lines 4-6 interacts with each process's code to set up for the dependency in the next process. Because there is an rf link between each adjacent pair of processes and because each process maintains ordering between its read and write, the counter-intuitive exists clause cannot happen, as confirmed by the strong model:

Outcome for Example Litmus Test #2 (strong model)
 1 Test C-LB+ldref-o+o-ctrl-o+o-dep-o Allowed
 2 States 2
 3 0:r1=x0; 1:r1=0; 2:r1=y0;
 4 0:r1=y0; 1:r1=0; 2:r1=y0;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Condition exists (0:r1=b /\ 1:r1=1 /\ 2:r1=b)
 9 Observation C-LB+ldref-o+o-ctrl-o+o-dep-o Never 0 2
10 Hash=d8909b22b7196d3b91cd78e700c68cc6

Quick Quiz 6: But wait! DEC Alpha does not respect dependency ordering. So wouldn't it fail to order P0() and P2() in Example Litmus Test #2?
Answer

If any of the dependencies is removed, the cyclic outcome is allowed, as shown in the following litmus test, which eliminates P1()'s control dependency:

Example Litmus Test #3
  1 C C-LB+ldref-o+o-o+o-dep-o.litmus
  2
  3 {
  4   a=x0;
  5   c=y0;
  6   1:r2=b;
  7 }
  8
  9 P0(int **a)
 10 {
 11   int *r1;
 12
 13   r1 = READ_ONCE(*a);
 14   WRITE_ONCE(*r1, 1);
 15 }
 16
 17 P1(int *b, int **c)
 18 {
 19   int r1;
 20
 21   r1 = READ_ONCE(*b);
 22   WRITE_ONCE(*c, r2);
 23 }
 24
 25 P2(int **a, int **c)
 26 {
 27   int *r1;
 28
 29   r1 = READ_ONCE(*c);
 30   WRITE_ONCE(*a, r1);
 31 }
 32
 33 exists
 34 (0:r1=b /\ 1:r1=1 /\ 2:r1=b)

The strong model confirms that the cycle is now allowed:

Outcome for Example Litmus Test #3 (strong model)
 1 Test C-LB+ldref-o+o-o+o-dep-o Allowed
 2 States 5
 3 0:r1=b; 1:r1=0; 2:r1=b;
 4 0:r1=b; 1:r1=1; 2:r1=b;
 5 0:r1=x0; 1:r1=0; 2:r1=b;
 6 0:r1=x0; 1:r1=0; 2:r1=y0;
 7 0:r1=y0; 1:r1=0; 2:r1=y0;
 8 Ok
 9 Witnesses
10 Positive: 1 Negative: 4
11 Condition exists (0:r1=b /\ 1:r1=1 /\ 2:r1=b)
12 Observation C-LB+ldref-o+o-o+o-dep-o Sometimes 1 4
13 Hash=0ee6280cfb52bb0721c6e110bbf79597

Note that dependencies enforce ordering only for trailing writes. For example, consider the following trailing-read address-dependency litmus test, which corresponds to RCU-like insertion of an element into a linked list, and is an example of the message-passing pattern:

Example Litmus Test #4
  1 C C-MP+o-assign+o-dep-o.litmus
  2
  3 {
  4   x=y0;
  5   0:r4=y;
  6 }
  7
  8 P0(int *x, int *y)
  9 {
 10   WRITE_ONCE(*y, 1);
 11   rcu_assign_pointer(*x, r4);
 12 }
 13
 14 P1(int *x, int *y)
 15 {
 16   int r1;
 17   int r2;
 18
 19   r1 = READ_ONCE(*x);
 20   r2 = READ_ONCE(*r1);
 21 }
 22
 23 exists
 24 (1:r1=y /\ 1:r2=0)

The strong model shows that the cycle is permitted, and thus that the above litmus test does not demonstrate a reliable way to do RCU-like insertion, courtesy of DEC Alpha:

Outcome for Example Litmus Test #4 (strong model)
 1 Test C-MP+o-assign+o-dep-o Allowed
 2 States 3
 3 1:r1=y; 1:r2=0;
 4 1:r1=y; 1:r2=1;
 5 1:r1=y0; 1:r2=0;
 6 Ok
 7 Witnesses
 8 Positive: 1 Negative: 2
 9 Condition exists (1:r1=y /\ 1:r2=0)
10 Observation C-MP+o-assign+o-dep-o Sometimes 1 2
11 Hash=15cc666886633d7ffe9f7d6129d54145

The following litmus test shows how to reliably carry out such an insertion, using lockless_dereference(), which supplies the memory barrier required by DEC Alpha, but only for DEC Alpha:

Example Litmus Test #5
  1 C C-MP+o-assign+ldref-o.litmus
  2
  3 {
  4   x=y0;
  5   0:r4=y;
  6 }
  7
  8 P0(int *x, int *y)
  9 {
 10   WRITE_ONCE(*y, 1);
 11   rcu_assign_pointer(*x, r4);
 12 }
 13
 14 P1(int *x, int *y)
 15 {
 16   int r1;
 17   int r2;
 18
 19   r1 = lockless_dereference(*x);
 20   r2 = READ_ONCE(*r1);
 21 }
 22
 23 exists
 24 (1:r1=y /\ 1:r2=0)

The strong model confirms that the bad outcome is prohibited:

Outcome for Example Litmus Test #5 (strong model)
 1 Test C-MP+o-assign+ldref-o Allowed
 2 States 2
 3 1:r1=y; 1:r2=1;
 4 1:r1=y0; 1:r2=0;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Condition exists (1:r1=y /\ 1:r2=0)
 9 Observation C-MP+o-assign+ldref-o Never 0 2
10 Hash=6c282e8fa5aa6a72221fa5c9529d087b

This RCU insertion litmus test provides ordering for free on all CPU families other than DEC Alpha.

Local Execution-Based Ordering

Removing the control dependency allowed the cyclic results, as shown by Example Litmus Test #3. However, changing the READ_ONCE() in P1() to the lightweight primitive smp_load_acquire() can prohibit this cycle despite there being no dependency, as shown below:

Example Litmus Test #6
  1 C C-LB+ldref-o+acq-o+o-dep-o.litmus
  2
  3 {
  4   a=x0;
  5   c=y0;
  6   1:r2=b;
  7 }
  8
  9 P0(int **a)
 10 {
 11   int *r1;
 12
 13   r1 = READ_ONCE(*a);
 14   WRITE_ONCE(*r1, 1);
 15 }
 16
 17 P1(int *b, int **c)
 18 {
 19   int r1;
 20
 21   r1 = smp_load_acquire(b);
 22   WRITE_ONCE(*c, r2);
 23 }
 24
 25 P2(int **a, int **c)
 26 {
 27   int *r1;
 28
 29   r1 = READ_ONCE(*c);
 30   WRITE_ONCE(*a, r1);
 31 }
 32
 33 exists
 34 (0:r1=b /\ 1:r1=1 /\ 2:r1=b)

Again, the strong model confirms this:

Outcome for Example Litmus Test #6 (strong model)
 1 Test C-LB+ldref-o+acq-o+o-dep-o Allowed
 2 States 4
 3 0:r1=b; 1:r1=0; 2:r1=b;
 4 0:r1=x0; 1:r1=0; 2:r1=b;
 5 0:r1=x0; 1:r1=0; 2:r1=y0;
 6 0:r1=y0; 1:r1=0; 2:r1=y0;
 7 No
 8 Witnesses
 9 Positive: 0 Negative: 4
10 Condition exists (0:r1=b /\ 1:r1=1 /\ 2:r1=b)
11 Observation C-LB+ldref-o+acq-o+o-dep-o Never 0 4
12 Hash=67f08dc71f9c2a03065bd1c425c24f09

The venerable smp_rmb() barrier also operates via local execution-based ordering, as will be shown later. In the meantime, it is possible to substitute smp_load_acquire() for all of the dependencies, and still prohibit the cycle:

Example Litmus Test #7
  1 C C-LB+acq-o+acq-o+acq-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   int *r1;
  9
 10   r1 = smp_load_acquire(a);
 11   WRITE_ONCE(*b, 1);
 12 }
 13
 14 P1(int *b, int *c)
 15 {
 16   int r1;
 17
 18   r1 = smp_load_acquire(b);
 19   WRITE_ONCE(*c, 1);
 20 }
 21
 22 P2(int *a, int *c)
 23 {
 24   int r1;
 25
 26   r1 = smp_load_acquire(c);
 27   WRITE_ONCE(*a, 1);
 28 }
 29
 30 exists
 31 (0:r1=1 /\ 1:r1=1 /\ 2:r1=1)

Again, the model agrees that the cycle is prohibited:

Outcome for Example Litmus Test #7 (strong model)
 1 Test C-LB+acq-o+acq-o+acq-o Allowed
 2 States 7
 3 0:r1=0; 1:r1=0; 2:r1=0;
 4 0:r1=0; 1:r1=0; 2:r1=1;
 5 0:r1=0; 1:r1=1; 2:r1=0;
 6 0:r1=0; 1:r1=1; 2:r1=1;
 7 0:r1=1; 1:r1=0; 2:r1=0;
 8 0:r1=1; 1:r1=0; 2:r1=1;
 9 0:r1=1; 1:r1=1; 2:r1=0;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (0:r1=1 /\ 1:r1=1 /\ 2:r1=1)
14 Observation C-LB+acq-o+acq-o+acq-o Never 0 7
15 Hash=c7f0f09c322fb880cdd71087857a9071

However, this works only because of the causal rf-only nature of this litmus test. Removing the causality by replacing one of the rf links with a write-to-write coherence or co link allows the cycle, as can be seen from this litmus test:

Example Litmus Test #8
  1 C C-WWC+o+acq-o+acq-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a)
  7 {
  8   WRITE_ONCE(*a, 1);
  9 }
 10
 11 P1(int *a, int *b)
 12 {
 13   int r1;
 14
 15   r1 = smp_load_acquire(a);
 16   WRITE_ONCE(*b, 1);
 17 }
 18
 19 P2(int *a, int *b)
 20 {
 21   int r1;
 22
 23   r1 = smp_load_acquire(b);
 24   WRITE_ONCE(*a, 2);
 25 }
 26
 27 exists
 28 (1:r1=1 /\ 2:r1=1 /\ a=1)

The Linux-kernel model shows that the cycle is allowed.

Outcome for Example Litmus Test #8 (strong model)
 1 Test C-WWC+o+acq-o+acq-o Allowed
 2 States 10
 3 1:r1=0; 2:r1=0; a=1;
 4 1:r1=0; 2:r1=0; a=2;
 5 1:r1=0; 2:r1=1; a=1;
 6 1:r1=0; 2:r1=1; a=2;
 7 1:r1=1; 2:r1=0; a=1;
 8 1:r1=1; 2:r1=0; a=2;
 9 1:r1=1; 2:r1=1; a=1;
10 1:r1=1; 2:r1=1; a=2;
11 1:r1=2; 2:r1=0; a=1;
12 1:r1=2; 2:r1=0; a=2;
13 Ok
14 Witnesses
15 Positive: 1 Negative: 9
16 Condition exists (1:r1=1 /\ 2:r1=1 /\ a=1)
17 Observation C-WWC+o+acq-o+acq-o Sometimes 1 9
18 Hash=e8861be8533cb22ff500cef0e1e446ab

Forbidding this cycle requires transitivity, a weak form of which is provided by release-acquire chains, which are the subject of the next section.

Release-Acquire Ordering

Transitivity is a subtle concept in memory ordering, being provided by a combination of A-cumulativity and B-cumulativity, which is discussed in the Memory barriers section of the strong model presentation.

In the meantime, for causal rf links, transitivity can be obtained by pairing smp_store_release() with smp_load_acquire(), forming a link in a release-acquire chain. As an added bonus, these two primitives generate no code on strongly ordered systems such as x86 and s390. In addition, the first smp_load_acquire() can be removed, as shown below:

Example Litmus Test #9
  1 C C-WWC+o+o-rel+acq-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a)
  7 {
  8   WRITE_ONCE(*a, 1);
  9 }
 10
 11 P1(int *a, int *b)
 12 {
 13   int r1;
 14
 15   r1 = READ_ONCE(*a);
 16   smp_store_release(b, 1);
 17 }
 18
 19 P2(int *a, int *b)
 20 {
 21   int r1;
 22
 23   r1 = smp_load_acquire(b);
 24   WRITE_ONCE(*a, 2);
 25 }
 26
 27 exists
 28 (1:r1=1 /\ 2:r1=1 /\ a=1)

And as the model says:

Outcome for Example Litmus Test #9 (strong model)
 1 Test C-WWC+o+o-rel+acq-o Allowed
 2 States 9
 3 1:r1=0; 2:r1=0; a=1;
 4 1:r1=0; 2:r1=0; a=2;
 5 1:r1=0; 2:r1=1; a=1;
 6 1:r1=0; 2:r1=1; a=2;
 7 1:r1=1; 2:r1=0; a=1;
 8 1:r1=1; 2:r1=0; a=2;
 9 1:r1=1; 2:r1=1; a=2;
10 1:r1=2; 2:r1=0; a=1;
11 1:r1=2; 2:r1=0; a=2;
12 No
13 Witnesses
14 Positive: 0 Negative: 9
15 Condition exists (1:r1=1 /\ 2:r1=1 /\ a=1)
16 Observation C-WWC+o+o-rel+acq-o Never 0 9
17 Hash=d43b7fcba00cf52af389a7a87f51f46e

Note that although smp_store_release() and smp_load_acquire() provide ordering to processes participating in the release-acquire chain, they do not necessarily guarantee ordering from the perspective of processes outside of that chain. For example, in the following litmus test, P0() and P1() form a (rather short) release-acquire chain and P2() is an outside observer:

Example Litmus Test #10
  1 C C-Z6.0+o-rel+acq-o+o-mb-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   WRITE_ONCE(*a, 1);
  9   smp_store_release(b, 1);
 10 }
 11
 12 P1(int *b, int *c)
 13 {
 14   int r1;
 15
 16   r1 = smp_load_acquire(b);
 17   WRITE_ONCE(*c, 1);
 18 }
 19
 20 P2(int *a, int *c)
 21 {
 22   int r1;
 23
 24   WRITE_ONCE(*c, 2);
 25   smp_mb();
 26   r1 = READ_ONCE(*a);
 27 }
 28
 29 exists
 30 (1:r1=1 /\ 2:r1=0 /\ c=2)

Note that P2() must use smp_mb() in order to order the prior write against the later read. However, even that smp_mb() is not sufficient to provide ordering in this case, given the co and fr links:

Outcome for Example Litmus Test #10 (strong model)
 1 Test C-Z6.0+o-rel+acq-o+o-mb-o Allowed
 2 States 8
 3 1:r1=0; 2:r1=0; c=1;
 4 1:r1=0; 2:r1=0; c=2;
 5 1:r1=0; 2:r1=1; c=1;
 6 1:r1=0; 2:r1=1; c=2;
 7 1:r1=1; 2:r1=0; c=1;
 8 1:r1=1; 2:r1=0; c=2;
 9 1:r1=1; 2:r1=1; c=1;
10 1:r1=1; 2:r1=1; c=2;
11 Ok
12 Witnesses
13 Positive: 1 Negative: 7
14 Condition exists (1:r1=1 /\ 2:r1=0 /\ c=2)
15 Observation C-Z6.0+o-rel+acq-o+o-mb-o Sometimes 1 7
16 Hash=30f1277a7ac1097b0934bce41caf6144

In this case, full ordering requires an smp_mb() on either P0() or P1(). The following litmus test chooses P0():

Example Litmus Test #11
  1 C C-Z6.0+o-mb-o+acq-o+o-mb-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   WRITE_ONCE(*a, 1);
  9   smp_mb();
 10   WRITE_ONCE(*b, 1);
 11 }
 12
 13 P1(int *b, int *c)
 14 {
 15   int r1;
 16
 17   r1 = smp_load_acquire(b);
 18   WRITE_ONCE(*c, 1);
 19 }
 20
 21 P2(int *a, int *c)
 22 {
 23   int r1;
 24
 25   WRITE_ONCE(*c, 2);
 26   smp_mb();
 27   r1 = READ_ONCE(*a);
 28 }
 29
 30 exists
 31 (1:r1=1 /\ 2:r1=0 /\ c=2)

The choice of P0() also allows the smp_store_release() to be downgraded to a WRITE_ONCE(), as confirmed by the model:

Outcome for Example Litmus Test #11 (strong model)
 1 Test C-Z6.0+o-mb-o+acq-o+o-mb-o Allowed
 2 States 7
 3 1:r1=0; 2:r1=0; c=1;
 4 1:r1=0; 2:r1=0; c=2;
 5 1:r1=0; 2:r1=1; c=1;
 6 1:r1=0; 2:r1=1; c=2;
 7 1:r1=1; 2:r1=0; c=1;
 8 1:r1=1; 2:r1=1; c=1;
 9 1:r1=1; 2:r1=1; c=2;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (1:r1=1 /\ 2:r1=0 /\ c=2)
14 Observation C-Z6.0+o-mb-o+acq-o+o-mb-o Never 0 7
15 Hash=7cbe17aa8c52275b4b3a6c03658952b6

Because smp_store_release() is always a store and smp_load_acquire() is always a load, a release-acquire chain necessarily links from a store in one process to a load in the next. The following section shows how to enforce ordering for other types of links.

Full-Barrier Transitive Ordering

The following litmus test uses smp_mb() to provide transitive ordering despite the fact that all links from one process to the next are fr links:

Example Litmus Test #12
  1 C C-3.SB+o-mb-o+o-mb-o+o-mb-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x, int *y)
  7 {
  8   int r1;
  9
 10   WRITE_ONCE(*x, 1);
 11   smp_mb();
 12   r1 = READ_ONCE(*y);
 13 }
 14
 15 P1(int *y, int *z)
 16 {
 17   int r2;
 18
 19   WRITE_ONCE(*y, 1);
 20   smp_mb();
 21   r2 = READ_ONCE(*z);
 22 }
 23
 24 P2(int *z, int *x)
 25 {
 26   int r2;
 27
 28   WRITE_ONCE(*z, 1);
 29   smp_mb();
 30   r2 = READ_ONCE(*x);
 31 }
 32
 33 exists
 34 (0:r1=0 /\ 1:r2=0 /\ 2:r2=0)

The strong model says that this litmus test's cycle is forbidden:

Outcome for Example Litmus Test #12 (strong model)
 1 Test C-3.SB+o-mb-o+o-mb-o+o-mb-o Allowed
 2 States 7
 3 0:r1=0; 1:r2=0; 2:r2=1;
 4 0:r1=0; 1:r2=1; 2:r2=0;
 5 0:r1=0; 1:r2=1; 2:r2=1;
 6 0:r1=1; 1:r2=0; 2:r2=0;
 7 0:r1=1; 1:r2=0; 2:r2=1;
 8 0:r1=1; 1:r2=1; 2:r2=0;
 9 0:r1=1; 1:r2=1; 2:r2=1;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (0:r1=0 /\ 1:r2=0 /\ 2:r2=0)
14 Observation C-3.SB+o-mb-o+o-mb-o+o-mb-o Never 0 7
15 Hash=9e78339cf867c77d247d5a43f65645f7

However, omitting even one smp_mb() results in the cycle being allowed:

Example Litmus Test #13
  1 C C-3.SB+o-o+o-mb-o+o-mb-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x, int *y)
  7 {
  8   int r1;
  9
 10   WRITE_ONCE(*x, 1);
 11   r1 = READ_ONCE(*y);
 12 }
 13
 14 P1(int *y, int *z)
 15 {
 16   int r2;
 17
 18   WRITE_ONCE(*y, 1);
 19   smp_mb();
 20   r2 = READ_ONCE(*z);
 21 }
 22
 23 P2(int *z, int *x)
 24 {
 25   int r2;
 26
 27   WRITE_ONCE(*z, 1);
 28   smp_mb();
 29   r2 = READ_ONCE(*x);
 30 }
 31
 32 exists
 33 (0:r1=0 /\ 1:r2=0 /\ 2:r2=0)

This is confirmed by the strong model:

Outcome for Example Litmus Test #13 (strong model)
 1 Test C-3.SB+o-o+o-mb-o+o-mb-o Allowed
 2 States 8
 3 0:r1=0; 1:r2=0; 2:r2=0;
 4 0:r1=0; 1:r2=0; 2:r2=1;
 5 0:r1=0; 1:r2=1; 2:r2=0;
 6 0:r1=0; 1:r2=1; 2:r2=1;
 7 0:r1=1; 1:r2=0; 2:r2=0;
 8 0:r1=1; 1:r2=0; 2:r2=1;
 9 0:r1=1; 1:r2=1; 2:r2=0;
10 0:r1=1; 1:r2=1; 2:r2=1;
11 Ok
12 Witnesses
13 Positive: 1 Negative: 7
14 Condition exists (0:r1=0 /\ 1:r2=0 /\ 2:r2=0)
15 Observation C-3.SB+o-o+o-mb-o+o-mb-o Sometimes 1 7
16 Hash=d42d9489b5750bd71d719a152bd76659

The next section shows how smp_mb() provides even stronger guarantees than mere transitivity.

Restoring Sequential Consistency

The canonical litmus test for sequential consistency is the celebrated (though as far as we know, not used in practice) “independent reads of independent writes”, or IRIW:

Example Litmus Test #14
  1 C C-IRIW+o+o+o-mb-o+o-mb-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x)
  7 {
  8   WRITE_ONCE(*x, 1);
  9 }
 10
 11 P1(int *y)
 12 {
 13   WRITE_ONCE(*y, 1);
 14 }
 15
 16 P2(int *x, int *y)
 17 {
 18   int r1;
 19   int r2;
 20
 21   r1 = READ_ONCE(*x);
 22   smp_mb();
 23   r2 = READ_ONCE(*y);
 24 }
 25
 26 P3(int *x, int *y)
 27 {
 28   int r1;
 29   int r2;
 30
 31   r1 = READ_ONCE(*y);
 32   smp_mb();
 33   r2 = READ_ONCE(*x);
 34 }
 35
 36 exists
 37 (2:r1=1 /\ 2:r2=0 /\ 3:r1=1 /\ 3:r2=0)

If the cycle is permitted, then the two reader threads can disagree on the order of the independent stores. When both reader threads use smp_mb(), this cycle is forbidden:

Outcome for Example Litmus Test #14 (strong model)
 1 Test C-IRIW+o+o+o-mb-o+o-mb-o Allowed
 2 States 15
 3 2:r1=0; 2:r2=0; 3:r1=0; 3:r2=0;
 4 2:r1=0; 2:r2=0; 3:r1=0; 3:r2=1;
 5 2:r1=0; 2:r2=0; 3:r1=1; 3:r2=0;
 6 2:r1=0; 2:r2=0; 3:r1=1; 3:r2=1;
 7 2:r1=0; 2:r2=1; 3:r1=0; 3:r2=0;
 8 2:r1=0; 2:r2=1; 3:r1=0; 3:r2=1;
 9 2:r1=0; 2:r2=1; 3:r1=1; 3:r2=0;
10 2:r1=0; 2:r2=1; 3:r1=1; 3:r2=1;
11 2:r1=1; 2:r2=0; 3:r1=0; 3:r2=0;
12 2:r1=1; 2:r2=0; 3:r1=0; 3:r2=1;
13 2:r1=1; 2:r2=0; 3:r1=1; 3:r2=1;
14 2:r1=1; 2:r2=1; 3:r1=0; 3:r2=0;
15 2:r1=1; 2:r2=1; 3:r1=0; 3:r2=1;
16 2:r1=1; 2:r2=1; 3:r1=1; 3:r2=0;
17 2:r1=1; 2:r2=1; 3:r1=1; 3:r2=1;
18 No
19 Witnesses
20 Positive: 0 Negative: 15
21 Condition exists (2:r1=1 /\ 2:r2=0 /\ 3:r1=1 /\ 3:r2=0)
22 Observation C-IRIW+o+o+o-mb-o+o-mb-o Never 0 15
23 Hash=b0840da8dd0f7482d111dab3f8dde214

The next litmus test replaces smp_mb() with release-acquire chains from each writer to the corresponding reader:

Example Litmus Test #15
  1 C C-IRIW+rel+rel+acq-o+acq-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x)
  7 {
  8   smp_store_release(x, 1);
  9 }
 10
 11 P1(int *y)
 12 {
 13   smp_store_release(y, 1);
 14 }
 15
 16 P2(int *x, int *y)
 17 {
 18   int r1;
 19   int r2;
 20
 21   r1 = smp_load_acquire(x);
 22   r2 = READ_ONCE(*y);
 23 }
 24
 25 P3(int *x, int *y)
 26 {
 27   int r1;
 28   int r2;
 29
 30   r1 = smp_load_acquire(y);
 31   r2 = READ_ONCE(*x);
 32 }
 33
 34 exists
 35 (2:r1=1 /\ 2:r2=0 /\ 3:r1=1 /\ 3:r2=0)

In this case the cycle is allowed:

Outcome for Example Litmus Test #15 (strong model)
 1 Test C-IRIW+rel+rel+acq-o+acq-o Allowed
 2 States 16
 3 2:r1=0; 2:r2=0; 3:r1=0; 3:r2=0;
 4 2:r1=0; 2:r2=0; 3:r1=0; 3:r2=1;
 5 2:r1=0; 2:r2=0; 3:r1=1; 3:r2=0;
 6 2:r1=0; 2:r2=0; 3:r1=1; 3:r2=1;
 7 2:r1=0; 2:r2=1; 3:r1=0; 3:r2=0;
 8 2:r1=0; 2:r2=1; 3:r1=0; 3:r2=1;
 9 2:r1=0; 2:r2=1; 3:r1=1; 3:r2=0;
10 2:r1=0; 2:r2=1; 3:r1=1; 3:r2=1;
11 2:r1=1; 2:r2=0; 3:r1=0; 3:r2=0;
12 2:r1=1; 2:r2=0; 3:r1=0; 3:r2=1;
13 2:r1=1; 2:r2=0; 3:r1=1; 3:r2=0;
14 2:r1=1; 2:r2=0; 3:r1=1; 3:r2=1;
15 2:r1=1; 2:r2=1; 3:r1=0; 3:r2=0;
16 2:r1=1; 2:r2=1; 3:r1=0; 3:r2=1;
17 2:r1=1; 2:r2=1; 3:r1=1; 3:r2=0;
18 2:r1=1; 2:r2=1; 3:r1=1; 3:r2=1;
19 Ok
20 Witnesses
21 Positive: 1 Negative: 15
22 Condition exists (2:r1=1 /\ 2:r2=0 /\ 3:r1=1 /\ 3:r2=0)
23 Observation C-IRIW+rel+rel+acq-o+acq-o Sometimes 1 15
24 Hash=e740a02d7b6130c50c917c38998da1dc

In short, smp_mb() is much stronger than is release-acquire. As well it should be, given that it is usually also much more expensive!

Special Cases

This section covers a some special cases, the first because it is heavily used, the second because it is almost never used, and the last two because they allow you to emulate primitives that are not yet directly handled by the memory model:

  1. Message Passing
  2. Write-Only Scenarios
  3. Emulating Locking
  4. Emulating call_rcu()

Message Passing

The celebrated (and heavily used) message-passing pattern is shown below:

Example Litmus Test #16
  1 C C-MP+o-rel+acq-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   WRITE_ONCE(*a, 1);
  9   smp_store_release(b, 1);
 10 }
 11
 12 P1(int *a, int *b)
 13 {
 14   int r1;
 15   int r2;
 16
 17   r1 = smp_load_acquire(b);
 18   r2 = READ_ONCE(*a);
 19 }
 20
 21 exists
 22 (1:r1=1 /\ 1:r2=0)

This is of course forbidden:

Outcome for Example Litmus Test #16 (strong model)
 1 Test C-MP+o-rel+acq-o Allowed
 2 States 3
 3 1:r1=0; 1:r2=0;
 4 1:r1=0; 1:r2=1;
 5 1:r1=1; 1:r2=1;
 6 No
 7 Witnesses
 8 Positive: 0 Negative: 3
 9 Condition exists (1:r1=1 /\ 1:r2=0)
10 Observation C-MP+o-rel+acq-o Never 0 3
11 Hash=a238b2bd1fd4e5e95a6c2d1095975619

However, smp_store_release and smp_load_acquire() can be replaced with the old-style smp_wmb() and smp_rmb() primitives as shown below:

Example Litmus Test #17
  1 C C-MP+o-wmb-o+o-rmb-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   WRITE_ONCE(*a, 1);
  9   smp_wmb();
 10   WRITE_ONCE(*b, 1);
 11 }
 12
 13 P1(int *a, int *b)
 14 {
 15   int r1;
 16   int r2;
 17
 18   r1 = READ_ONCE(*b);
 19   smp_rmb();
 20   r2 = READ_ONCE(*a);
 21 }
 22
 23 exists
 24 (1:r1=1 /\ 1:r2=0)

This is still forbidden:

Outcome for Example Litmus Test #17 (strong model)
 1 Test C-MP+o-wmb-o+o-rmb-o Allowed
 2 States 3
 3 1:r1=0; 1:r2=0;
 4 1:r1=0; 1:r2=1;
 5 1:r1=1; 1:r2=1;
 6 No
 7 Witnesses
 8 Positive: 0 Negative: 3
 9 Condition exists (1:r1=1 /\ 1:r2=0)
10 Observation C-MP+o-wmb-o+o-rmb-o Never 0 3
11 Hash=14c5bb4f4878756fe2adc3eaed5fd3ce

That said, smp_store_release() and smp_load_acquire() are almost always easier to use and maintain, so they would normally be preferred. However, there are a few less-common cases where smp_wmb() and smp_rmb() can produce more efficient code on some weakly ordered systems:

Example Litmus Test #18
  1 C C-MP2+o-o-wmb-o-o+o-rmb-o+o-rmb-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b, int *c, int *d)
  7 {
  8   WRITE_ONCE(*a, 1);
  9   WRITE_ONCE(*c, 1);
 10   smp_wmb();
 11   WRITE_ONCE(*b, 1);
 12   WRITE_ONCE(*d, 1);
 13 }
 14
 15 P1(int *a, int *b)
 16 {
 17   int r1;
 18   int r2;
 19
 20   r1 = READ_ONCE(*b);
 21   smp_rmb();
 22   r2 = READ_ONCE(*a);
 23 }
 24
 25 P2(int *c, int *d)
 26 {
 27   int r1;
 28   int r2;
 29
 30   r1 = READ_ONCE(*d);
 31   smp_rmb();
 32   r2 = READ_ONCE(*c);
 33 }
 34
 35 exists
 36 (1:r1=1 /\ 1:r2=0) \/ (2:r1=1 /\ 2:r2=0)

The model shows that this is also forbidden:

Outcome for Example Litmus Test #18 (strong model)
 1 Test C-MP2+o-o-wmb-o-o+o-rmb-o+o-rmb-o Allowed
 2 States 9
 3 1:r1=0; 1:r2=0; 2:r1=0; 2:r2=0;
 4 1:r1=0; 1:r2=0; 2:r1=0; 2:r2=1;
 5 1:r1=0; 1:r2=0; 2:r1=1; 2:r2=1;
 6 1:r1=0; 1:r2=1; 2:r1=0; 2:r2=0;
 7 1:r1=0; 1:r2=1; 2:r1=0; 2:r2=1;
 8 1:r1=0; 1:r2=1; 2:r1=1; 2:r2=1;
 9 1:r1=1; 1:r2=1; 2:r1=0; 2:r2=0;
10 1:r1=1; 1:r2=1; 2:r1=0; 2:r2=1;
11 1:r1=1; 1:r2=1; 2:r1=1; 2:r2=1;
12 No
13 Witnesses
14 Positive: 0 Negative: 9
15 Condition exists (1:r1=1 /\ 1:r2=0 \/ 2:r1=1 /\ 2:r2=0)
16 Observation C-MP2+o-o-wmb-o-o+o-rmb-o+o-rmb-o Never 0 9
17 Hash=5e1ec52282c90ed972723d7e044cab3e

Nevertheless, it is worth reiterating that smp_store_release() and smp_load_acquire() are almost always preferable to smp_wmb() and smp_rmb().

Write-Only Scenarios

Write-only scenarios are interesting in that it is not clear that reasonable hardware could fail to provide ordering, but it is also not clear that there are any reasonable use cases. Thus far, any useful write-only scenario has proven to have a more conventional counterpart that is better and easier to use.

Nevertheless, here is the simplest write-only scenario:

Example Litmus Test #19
  1 C C-2+2W+o-wmb-o+o-wmb-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   WRITE_ONCE(*a, 2);
  9   smp_wmb();
 10   WRITE_ONCE(*b, 1);
 11 }
 12
 13 P1(int *a, int *b)
 14 {
 15   WRITE_ONCE(*b, 2);
 16   smp_wmb();
 17   WRITE_ONCE(*a, 1);
 18 }
 19
 20 exists
 21 (b=2 /\ a=2)

The strong model prohibits the cycle:

Outcome for Example Litmus Test #19 (strong model)
 1 Test C-2+2W+o-wmb-o+o-wmb-o Allowed
 2 States 3
 3 a=1; b=1;
 4 a=1; b=2;
 5 a=2; b=1;
 6 No
 7 Witnesses
 8 Positive: 0 Negative: 3
 9 Condition exists (b=2 /\ a=2)
10 Observation C-2+2W+o-wmb-o+o-wmb-o Never 0 3
11 Hash=8264db947f1b73b8be16f98dd6bf1634

However, the weak model allows this same cycle:

Outcome for Example Litmus Test #19 (weak model)
 1 Test C-2+2W+o-wmb-o+o-wmb-o Allowed
 2 States 4
 3 a=1; b=1;
 4 a=1; b=2;
 5 a=2; b=1;
 6 a=2; b=2;
 7 Ok
 8 Witnesses
 9 Positive: 1 Negative: 3
10 Condition exists (b=2 /\ a=2)
11 Observation C-2+2W+o-wmb-o+o-wmb-o Sometimes 1 3
12 Hash=8264db947f1b73b8be16f98dd6bf1634

This disagreement between the strong and the weak model documents the current uncertainty as to whether or not this two-processes write-only scenario should be supported. The same uncertainty applies to the three-process extension of Example Litmus Test #19:

Example Litmus Test #20
  1 C C-3+2W+o-wmb-o+o-wmb-o+o-wmb-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   WRITE_ONCE(*a, 2);
  9   smp_wmb();
 10   WRITE_ONCE(*b, 1);
 11 }
 12
 13 P1(int *b, int *c)
 14 {
 15   WRITE_ONCE(*b, 2);
 16   smp_wmb();
 17   WRITE_ONCE(*c, 1);
 18 }
 19
 20 P2(int *c, int *a)
 21 {
 22   WRITE_ONCE(*c, 2);
 23   smp_wmb();
 24   WRITE_ONCE(*a, 1);
 25 }
 26
 27 exists
 28 (a=2 /\ b=2 /\ c=2)

This is therefore also prohibited by the strong model:

Outcome for Example Litmus Test #20 (strong model)
 1 Test C-3+2W+o-wmb-o+o-wmb-o+o-wmb-o Allowed
 2 States 7
 3 a=1; b=1; c=1;
 4 a=1; b=1; c=2;
 5 a=1; b=2; c=1;
 6 a=1; b=2; c=2;
 7 a=2; b=1; c=1;
 8 a=2; b=1; c=2;
 9 a=2; b=2; c=1;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (a=2 /\ b=2 /\ c=2)
14 Observation C-3+2W+o-wmb-o+o-wmb-o+o-wmb-o Never 0 7
15 Hash=2054ae2b227f6081d169f318556b51a1

And similarly also allowed by the weak model:

Outcome for Example Litmus Test #20 (weak model)
 1 Test C-3+2W+o-wmb-o+o-wmb-o+o-wmb-o Allowed
 2 States 8
 3 a=1; b=1; c=1;
 4 a=1; b=1; c=2;
 5 a=1; b=2; c=1;
 6 a=1; b=2; c=2;
 7 a=2; b=1; c=1;
 8 a=2; b=1; c=2;
 9 a=2; b=2; c=1;
10 a=2; b=2; c=2;
11 Ok
12 Witnesses
13 Positive: 1 Negative: 7
14 Condition exists (a=2 /\ b=2 /\ c=2)
15 Observation C-3+2W+o-wmb-o+o-wmb-o+o-wmb-o Sometimes 1 7
16 Hash=2054ae2b227f6081d169f318556b51a1

Emulating Locking

Locking can be emulated using xchg(), for example as follows:

Example Litmus Test #21
  1 C C-locktest.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x, int *l)
  7 {
  8   int r1;
  9   int r2;
 10
 11   r1 = xchg_acquire(l, 1);
 12   if (r1 == 0) {
 13     r2 = READ_ONCE(*x);
 14     WRITE_ONCE(*x, 1);
 15     WRITE_ONCE(*x, 0);
 16     smp_store_release(l, 0);
 17   }
 18 }
 19
 20 P1(int *x, int *l)
 21 {
 22   int r1;
 23   int r2;
 24
 25   r1 = xchg_acquire(l, 1);
 26   if (r1 == 0) {
 27     r2 = READ_ONCE(*x);
 28     WRITE_ONCE(*x, 1);
 29     WRITE_ONCE(*x, 0);
 30     smp_store_release(l, 0);
 31   }
 32 }
 33
 34 exists
 35 ((0:r1=0 /\ 0:r2=1) \/ (1:r1 = 0 /\ 1:r2=1))

This litmus test contains code that checks that this really does act like a lock, which the model confirms that it does, albeit a bit slowly:

Outcome for Example Litmus Test #21 (strong model)
 1 Test C-locktest Allowed
 2 States 3
 3 0:r1=0; 0:r2=0; 1:r1=0; 1:r2=0;
 4 0:r1=0; 0:r2=0; 1:r1=1; 1:r2=0;
 5 0:r1=1; 0:r2=0; 1:r1=0; 1:r2=0;
 6 No
 7 Witnesses
 8 Positive: 0 Negative: 4
 9 Condition exists (0:r1=0 /\ 0:r2=1 \/ 1:r1=0 /\ 1:r2=1)
10 Observation C-locktest Never 0 4
11 Hash=46161c4de6c633ba172e91fb61ee7d54

However, two of the three states above correspond to rejected executions, where the process went forward despite having failed to acquire the lock. This information is actually quite useful when debugging, but can be very distracting otherwise. Fortunately, there is a filter keyword, which can be used to exclude states corresponding to rejected executions:

Example Litmus Test #22
  1 C C-locktest.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x, int *l)
  7 {
  8   int r1;
  9   int r2;
 10
 11   r1 = xchg_acquire(l, 1);
 12   r2 = READ_ONCE(*x);
 13   WRITE_ONCE(*x, 1);
 14   WRITE_ONCE(*x, 0);
 15   smp_store_release(l, 0);
 16 }
 17
 18 P1(int *x, int *l)
 19 {
 20   int r1;
 21   int r2;
 22
 23   r1 = xchg_acquire(l, 1);
 24   r2 = READ_ONCE(*x);
 25   WRITE_ONCE(*x, 1);
 26   WRITE_ONCE(*x, 0);
 27   smp_store_release(l, 0);
 28 }
 29
 30 locations [0:r1;1:r1]
 31 filter (0:r1=0 /\ 1:r1=0)
 32 exists (0:r2=1 \/ 1:r2=1)

This results in only the single legal state for locking:

Outcome for Example Litmus Test #22 (strong model)
 1 Test C-locktest Allowed
 2 States 1
 3 0:r1=0; 0:r2=0; 1:r1=0; 1:r2=0;
 4 No
 5 Witnesses
 6 Positive: 0 Negative: 2
 7 Condition exists (0:r2=1 \/ 1:r2=1)
 8 Observation C-locktest Never 0 2
 9 Hash=dd03f03dea03d8d2be6f678ffff703dc

The locations keyword causes the two r1 registers to be printed, despite their not appearing in the exists clause.

Emulating call_rcu()

The effect of call_rcu() can be emulated by adding a process, and having that process check a flag, wait for a grace period, then execute the code that would be in the callback function. The location that would execute the call_rcu() instead uses smp_store_release() to set the flag. For example, the following litmus test emulates an RCU callback that sets variable b to one, again using filter to reject executions in which P2() executed before P0() stored to r:

Example Litmus Test #23
  1 C C-LB+o-rel+rl-o-o-rul+o-sync-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *r)
  7 {
  8   int r1;
  9
 10   r1 = READ_ONCE(*a);
 11   smp_store_release(r, 1);
 12 }
 13
 14 P1(int *b, int *a)
 15 {
 16   int r2;
 17
 18   rcu_read_lock();
 19   r2 = READ_ONCE(*b);
 20   WRITE_ONCE(*a, 1);
 21   rcu_read_unlock();
 22 }
 23
 24 P2(int *r, int *b)
 25 {
 26   int r1;
 27
 28   r1 = READ_ONCE(*r);
 29   synchronize_rcu();
 30   WRITE_ONCE(*b, 1);
 31 }
 32
 33 filter (2:r1=1)
 34 exists (0:r1=1 /\ 1:r2=1)

The model confirms that the cycle is forbidden, just as it is without the emulated callback:

Outcome for Example Litmus Test #23 (strong model)
 1 Test C-LB+o-rel+rl-o-o-rul+o-sync-o Allowed
 2 States 3
 3 0:r1=0; 1:r2=0;
 4 0:r1=0; 1:r2=1;
 5 0:r1=1; 1:r2=0;
 6 No
 7 Witnesses
 8 Positive: 0 Negative: 3
 9 Condition exists (0:r1=1 /\ 1:r2=1)
10 Observation C-LB+o-rel+rl-o-o-rul+o-sync-o Never 0 3
11 Hash=190a2b21ba29cb53e597cfff2bfdec4e

Summary

This document has shown a progression of litmus tests requiring increasingly strong memory-ordering primitives, due to increasing numbers of non-causal non-rf links and decreasing numbers of dependencies. This progression give rise to the following rules of thumb:

  1. When the code has only causal rf links, and when each process has properly placed address, control, or data dependencies, no additional memory ordering is required.
  2. Code having only rf links can use either a release (including RCU's rcu_assign_pointer()) or an acquire on any process that lacks a properly placed dependency.
  3. Release-acquire chains (including RCU's rcu_assign_pointer() and rcu_dereference()) can guarantee ordering even if one of the links is a non-causal non-rf link, at long as the remaining links are all rf links.
  4. An acquire leading to a write can be replaced by dependencies, and in many cases a given process's acquire and release can also be replaced by dependencies. However, it is wise to run the model to check a given scenario.
  5. An acquire leading to a dependent read can usually be replaced by rcu_dereference() or lockless_dereference(). Again, it is wise to run the model to check a given scenario.
  6. However, given more than one non-rf link, release-acquire chains are not always able to provide ordering, and smp_mb() might also be required. As a rough rule of thumb, there will need to be at least one smp_mb() between each successive pair of non-rf links, but it is yet again wise to run the model to check a given scenario.
  7. If any process needs to order a prior store against a subsequent load, an smp_mb() is needed.
  8. When it is necessary to fully restore sequential consistency, it is in the worst case necessary to place an smp_mb() between each pair of accesses to shared memory within each process. The canonical sequential-consistency exmaple is shown in Example Litmus Test #14.
  9. Finally, in the Linux kernel, either a synchronous RCU grace-period operation (such as synchronize_rcu()) or a fully ordered value-returning atomic read-modify-write operations (such as xchg()) may be substituted for an smp_mb().

We hope that this provides a useful intuitive approximation to the full model. Those desiring a full understanding of the memory model should look here.

Answers to Quick Quizzes

Quick Quiz 1: So what will you do if compilers or CPUs start using value-speculation optimizations?

Answer: To be useful by concurrent software such as the Linux kernel, such compilers and hardware will need to do one or both of two things:

  1. Permit value speculation only on accesses to non-shared memory. For example, in C11, value speculation would need to be prohibited on atomic accesses.
  2. Make sure that any speculation is rolled back as needed not only to arrive at the correct value, but also to arrive at the correct ordering.

It is also quite possible that those putting forward speculative optimizations will receive considerable quantities of highly colorful feedback from any number of flamboyant personalities.

Back to Quick Quiz 1.

Quick Quiz 2: If it is not possible to make a change instantaneously visible, then how do sequentially consistent systems work?

Answer: First, there have not been many commercially available sequentially consistent systems. Second, sequential consistency does not guarantee simultaneity, but rather ordering. One way to guarantee ordering is to introduce delays, as shown by the red triangle below:

rf-sc.svg

Although there are any number of tricks that CPU designers can use to hide the delay, some would argue that these speed-of-light delays are a big reason why there have been so very few commercially available sequentially consistent systems.

Back to Quick Quiz 2.

Quick Quiz 3: Exactly how could the system “decide later” which write won?

Answer: Consider the following sequence of events:

  1. The variable x is initially zero, and its cacheline is in CPU 0's cache.
  2. CPU 1 writes the value 2 to x, but it does not have the cacheline. It therefore records the write in its write buffer and requests exclusive access to the cacheline.
  3. CPU 0 writes the value 1 to x, and because the cacheline is at CPU 0, the write completes almost immediately, updating the value of x in the cacheline.
  4. CPU 1's request for the cacheline arrives at CPU 0, which sends it (along with the new value of x) to CPU 1.
  5. CPU 1 receives the cacheline, and moves the new value of x to the cacheline. CPU 1's earlier write therefore overwrites CPU 0's later write.
As you can see, it is the later movement of the cacheline that determines which write wins.

Back to Quick Quiz 3.

Quick Quiz 4: Wouldn't it be way simpler if the last write always won?

Answer: It might well be simpler, but it might not be better. For example, the cache state of the system might be such that an earlier write was stuck in its CPU's store buffer for an extended time. Many CPU vendors would like to be able to decide which value wins without needing an expensive inspection of the store buffer of each and every CPU.

Back to Quick Quiz 4.

Quick Quiz 5: Only the CPU? Can't the compiler also reorder the reads and writes in Example Litmus Test #1?

Answer: No. The volatile accesses in READ_ONCE() and WRITE_ONCE() prohibit compiler reordering.

However, if the litmus test were to use C11 non-volatile memory_order_relaxed accesses, then the compiler could reorder these accesses.

Back to Quick Quiz 5.

Quick Quiz 6: But wait! DEC Alpha does not respect dependency ordering. So wouldn't it fail to order P0() and P2() in Example Litmus Test #2?

Answer: No. The Alpha does have weak ordering, but it still avoids speculative writes.

However, if P0()'s and P1()'s final WRITE_ONCE() was instead a READ_ONCE(), speculation would be possible and Alpha would not guarantee ordering, as will be shown later.

Back to Quick Quiz 6.