An earlier article (now deleted, but there is plenty of other material to read instead) covered some concurrency interfaces at the language-abstraction level.

As said above, at the language level we should focus on the abstract machine rather than specific instructions; taking C++ as an example, we only look at the rules defined by the C++ memory model. At the architecture level, however, we have to care about the promises the hardware makes to developers. Otherwise, how did the prehistoric programmers before the C++ memory model was published ever manage concurrent programming?

Once the architecture comes into play, buzzwords such as out-of-order (OoO) execution, store buffers, and cache coherence protocols start getting thrown around, so let's first sort out how they relate to one another.

Q1. Does the cache coherence protocol affect concurrency?

It has no effect on the correctness of concurrent code, because the protocol itself guarantees coherence; "coherent" here should be understood as something transparent to software.

In cache-coherent systems, if the caches hold multiple copies of a given variable, all the copies of that variable must have the same value. (perfbook 15.1.1)

So if this is what comes to mind while you are reasoning about concurrency correctness, you are looking in the wrong direction. It does affect performance, though: the communication cost grows as the core count scales up.

NOTE: the "correctness" here is described from the cache's point of view. Among concurrent stores to the same location, as long as one of them wins ownership of the cache line (and writes some value), all caches can be considered to agree on the result; this is what is meant by saying that stores have a global order.

Q2. How does the store buffer relate to cache coherence?

The MESI(F) cache coherence protocol guarantees that communication between caches, and between a cache and memory, is unambiguous.

The store buffer is a third party added on top of that, and it provides weaker consistency. From here on, the correctness of concurrent code is affected, because a store is not written directly into the cache but into the store buffer first (see the litmus-test sketch after the notes below).

NOTES:

  • As an aside, Memory Barriers: a Hardware View for Software Hackers points out that a store buffer alone is a broken design and store forwarding has to be added as well; but that is a hardware-design concern, don't worry about it.
  • In other words, a load does not fetch data only from the cache, because the store buffer is directly readable by the local CPU.
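To make the effect concrete, below is a minimal sketch of the classic store-buffering (SB) litmus test in C++ (the iteration count and names are my own; whether the outcome actually shows up in a given run depends on the compiler, the machine, and scheduling). Under sequential consistency r1 == 0 && r2 == 0 is impossible, yet on x86 it can be observed precisely because each store may still sit in its core's store buffer while the other core's load executes. Relaxed std::atomic accesses are used so the program stays race-free in C++ terms while still compiling to plain MOVs on x86.

// Store-buffering (SB) litmus test.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

int main() {
    int weird = 0;
    for (int i = 0; i < 100000; ++i) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);

        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);   // may linger in the store buffer
            r1 = y.load(std::memory_order_relaxed);  // may execute before the store drains
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++weird;  // forbidden under SC, allowed under TSO
    }
    std::printf("r1 == 0 && r2 == 0 observed %d times\n", weird);
}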

Q3. OoO? Doesn't that mean everything is out of order?

I honestly don't have the energy to dig up sources explaining how far out-of-order execution actually goes, how it is implemented, whether it only involves the store buffer, or how it relates to in-order commit… and isn't all that a bit off-topic for a software developer anyway?

What we developers really care about boils down to a single question: when is a compiler barrier alone enough?

It doesn't matter that hardware complexity gets in our way; the Intel SDM tells us that even for a hardware architecture we still look at an abstract model, the memory-ordering model, which directly states how the CPU affects memory, i.e. the order of the CPU's store and load instructions as seen from memory's point of view.
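For reference, a compiler barrier only restricts the compiler's code motion and emits no fence instruction. A minimal sketch of the two common forms (std::atomic_signal_fence is standard C++; the empty inline asm with a "memory" clobber is a GCC/Clang idiom and is what the Linux kernel's barrier() macro expands to):

#include <atomic>

void compiler_barrier_examples() {
    // Standard C++: forbids the compiler from moving memory accesses
    // across this point, but generates no CPU fence instruction.
    std::atomic_signal_fence(std::memory_order_seq_cst);

    // GCC/Clang idiom with the same compiler-only effect.
    asm volatile("" ::: "memory");
}

The rest of this article is, in essence, about figuring out when this is all you need on x86.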

The x86 memory model

This memory model is commonly abbreviated as TSO (total store order). The name doesn't matter much; let's focus on the effects it promises.

In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model respects the following principles (Note the memory-ordering principles for single-processor and multiple-processor systems are written from the perspective of software executing on the processor, where the term “processor” refers to a logical processor. For example, a physical processor supporting multiple cores and/or Intel Hyper-Threading Technology is treated as a multi-processor system.):
- Reads are not reordered with other reads.
- Writes are not reordered with older reads.
- Writes to memory are not reordered with other writes, with the following exceptions:
  - streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
  - string operations (see Section 8.2.4.1).
- No write to memory may be reordered with an execution of the CLFLUSH instruction; a write may be reordered with an execution of the CLFLUSHOPT instruction that flushes a cache line other than the one being written. Executions of the CLFLUSH instruction are not reordered with each other. Executions of CLFLUSHOPT that access different cache lines may be reordered with each other. An execution of CLFLUSHOPT may be reordered with an execution of CLFLUSH that accesses a different cache line.
- Reads may be reordered with older writes to different locations but not with older writes to the same location.
- Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
- Reads cannot pass earlier LFENCE and MFENCE instructions.
- Writes and executions of CLFLUSH and CLFLUSHOPT cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
- LFENCE instructions cannot pass earlier reads.
- SFENCE instructions cannot pass earlier writes or executions of CLFLUSH and CLFLUSHOPT.
- MFENCE instructions cannot pass earlier reads, writes, or executions of CLFLUSH and CLFLUSHOPT.

In a multiple-processor system, the following ordering principles apply:
- Individual processors use the same ordering principles as in a single-processor system.
- Writes by a single processor are observed in the same order by all processors.
- Writes from an individual processor are NOT ordered with respect to the writes from other processors.
- Memory ordering obeys causality (memory ordering respects transitive visibility).
- Any two stores are seen in a consistent order by processors other than those performing the stores.
- Locked instructions have a total order.

Setting the special instructions aside and looking only at plain operations, you will notice that not every pair of accesses may be reordered:

  • Loads are not reordered with other loads.
  • Stores are not reordered with other stores.
  • A later store is not reordered with an earlier load.

In short, only the store -> load pair can be reordered: a later load may be reordered with an earlier store to a different location.

Personally I think this design is tied to acquire-release semantics, because allowing only this much reordering does not break them. For example, when the C++ memory model is mapped onto the x86 memory model, both std::memory_order_acquire and std::memory_order_release can be implemented with no fence at all (see the sketch after the notes below).

NOTES:

  • At low -O optimization levels the compiler may not actually produce an empty implementation and could still insert an MFENCE instruction.
  • One can further conclude that LFENCE and SFENCE are never strictly required for ordinary concurrent programming; there have been heated discussions about this online, search for them if you are interested.
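As an illustration, here is a minimal message-passing sketch (the names are mine, and the exact codegen depends on the compiler). Under TSO the producer's two stores cannot be reordered and the consumer's two loads cannot be reordered, which is exactly what release/acquire need, so on x86 both atomics typically compile to plain MOV instructions and the memory orders mostly serve to restrict the compiler.

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain data, published via `ready`
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                   // plain store
    ready.store(true, std::memory_order_release);   // plain MOV on x86, no fence
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))  // plain MOV on x86, no fence
        ;                                           // spin until published
    assert(payload == 42);                          // guaranteed to observe the payload
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}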

How atomicity is provided

On x86, ordinary load and store instructions already satisfy the atomicity requirement. In other words, relaxed semantics are essentially no different from plain variable accesses, except that you still need to keep compiler optimizations at bay. Note that "relaxed" here belongs to the language standard; the memory model is not really measured in terms of optimizations. You only need to make sure you go through something like std::atomic, and then even relaxed accesses will reject the inappropriate compiler optimizations (a short sketch follows the SDM excerpt below).

NOTE: RMW operations are expressed with the LOCK prefix; XCHG is special in that it does not need an explicit LOCK.

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
- Reading or writing a byte
- Reading or writing a word aligned on a 16-bit boundary
- Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
- Reading or writing a quadword aligned on a 64-bit boundary
- 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
- Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
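Tying the above together, a minimal sketch (the names are mine and the exact instructions depend on the compiler): a relaxed load or store of an aligned word typically compiles to a plain MOV, an RMW compiles to a LOCK-prefixed instruction, and exchange uses XCHG with its implicit LOCK.

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> counter{0};

std::uint64_t relaxed_load() {
    // Typically a plain `mov`: the hardware already makes an aligned
    // 64-bit load atomic; std::atomic only restrains the compiler.
    return counter.load(std::memory_order_relaxed);
}

void relaxed_store(std::uint64_t v) {
    // Typically a plain `mov` as well.
    counter.store(v, std::memory_order_relaxed);
}

void relaxed_increment() {
    // An RMW needs the LOCK prefix: typically `lock add` or `lock xadd`.
    counter.fetch_add(1, std::memory_order_relaxed);
}

std::uint64_t swap_in(std::uint64_t v) {
    // Typically `xchg`, whose LOCK semantics are implicit.
    return counter.exchange(v, std::memory_order_relaxed);
}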

A special use of volatile

From the language-abstraction point of view, volatile is not suitable as a correctness guarantee for concurrency; at the very least it does not provide atomicity by itself.

But with the architecture-specific memory model established above, we can use volatile whenever no memory barrier is needed (the architecture supplies the atomicity).

In fact, the Linux kernel also uses volatile extensively, just hidden behind temporary casts. Take READ_ONCE as an example:

#define READ_ONCE(x) __READ_ONCE(x, 1)
#define __READ_ONCE(x, check)                       \
({                                  \
    union { typeof(x) __val; char __c[1]; } __u;            \
    if (check)                          \
        __read_once_size(&(x), __u.__c, sizeof(x));     \
    else                                \
        __read_once_size_nocheck(&(x), __u.__c, sizeof(x)); \
    smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
    __u.__val;                          \
})
// ...
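// (note: __READ_ONCE_SIZE below is expanded inside __read_once_size(p, res, size),
//  whose definition is elided here; that is where the p/res/size identifiers come from)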
#define __READ_ONCE_SIZE                        \
({                                  \
    switch (size) {                         \
    case 1: *(__u8 *)res = *(volatile __u8 *)p; break;      \
    case 2: *(__u16 *)res = *(volatile __u16 *)p; break;        \
    case 4: *(__u32 *)res = *(volatile __u32 *)p; break;        \
    case 8: *(__u64 *)res = *(volatile __u64 *)p; break;        \
    default:                            \
        barrier();                      \
        __builtin_memcpy((void *)res, (const void *)p, size);   \
        barrier();                      \
    }                               \
})

NOTE: another reason volatile is not recommended is that its definition is convoluted and hardly reads like human language. Savor this:

This version guarantees that “Accesses through volatile glvalues are evaluated strictly according to the rules of the abstract machine”, that volatile accesses are side effects, that they are one of the four forward-progress indicators, and that their exact semantics are implementation-defined.

And the reason it can be used for concurrency correctness is that it was originally meant for MMIO, which requires a certain order at the instruction level, and that happens to give us the compiler-barrier guarantee (of course it does not only guarantee ordering; it also suppresses many other compiler optimizations, such as invented loads and store fusing. I may write a separate article to summarize those).
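A minimal sketch of that compiler-level effect (the names are mine; whether a given compiler performs these transformations is its own business). With a plain variable the compiler is free to hoist a load out of a loop or fuse consecutive stores; volatile forbids both, because every volatile access counts as a side effect.

int plain_flag = 0;
volatile int vol_flag = 0;

void spin_plain() {
    // The compiler may hoist the load of plain_flag out of the loop
    // (it sees no way for the value to change), turning this into an
    // infinite loop even if another thread eventually sets the flag.
    while (plain_flag == 0) { }
}

void spin_volatile() {
    // Every iteration must re-read vol_flag: a volatile access cannot
    // be elided, hoisted, or invented by the compiler.
    while (vol_flag == 0) { }
}

void fuse_plain(int a, int b) {
    plain_flag = a;   // the compiler may drop this store entirely
    plain_flag = b;   // and keep only the last one (store fusing)
}

void fuse_volatile(int a, int b) {
    vol_flag = a;     // both stores must be emitted, in this order
    vol_flag = b;
}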

Of course, this only tells you that it can be used (in a very limited context). Linux treats this as a de facto standard because the community has the LKMM (the Linux kernel memory model) documentation as a contract. Now that the language standard already provides atomic and memory orders, there is no reason for us not to use the latter.

Remember the one genuinely necessary use: the volatile sig_atomic_t type in signal handling. Look it up on cppreference for the details.
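A minimal sketch of that classic pattern (the flag and handler names are mine):

#include <csignal>
#include <cstdio>

// volatile std::sig_atomic_t is the one object type the standard
// guarantees can be written from a signal handler and read from the
// interrupted program without undefined behavior.
volatile std::sig_atomic_t got_signal = 0;

extern "C" void handle_sigint(int) {
    got_signal = 1;   // atomic w.r.t. the interrupted thread; volatile
                      // keeps the compiler from caching the old value
}

int main() {
    std::signal(SIGINT, handle_sigint);
    while (!got_signal) {
        // ... do some work; got_signal is re-read on every iteration ...
    }
    std::printf("caught SIGINT, exiting\n");
}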

Summary

This article is merely a set of pointers for concurrent programming on the specific x86 architecture. It points out:

  • Some basic hardware questions behind concurrency: the caches reach agreement easily, but the store buffer spoils things.
  • The gap between the language abstraction and an architecture-specific memory model: x86 actually makes very strong ordering promises.
  • The atomicity guarantees: modern x86 covers plain accesses up to 64 bits, even unaligned ones as long as they stay within a single cache line.
  • The special use of volatile: it can work, but be clear about why you are using it.