The previous post examined the memcpy implementation of a well-known library; this one looks at the Linux kernel's memset. The real motivation, though, is to discuss the x86 REP_GOOD, ERMS and FSRS microarchitectural features. (Even if the original impulse was just a bit of silicon cricket-fighting…)
caturra@BLUEPUNI:~$ cat /proc/cpuinfo | grep flags
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb
rdtscp lm constant_tsc rep_good 👈 nopl xtopology tsc_reliable nonstop_tsc
cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe
popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm
sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp
vmmcall fsgsbase bmi1 avx2 smep bmi2 erms 👈 invpcid rdseed adx smap clflushopt
clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat npt nrip_save
tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm 👈 (the sibling of fsrs)
Because the kernel requires scalar instructions rather than SIMD vector instructions, the implementation is comparatively simple. Picking the Linux kernel also turns out to be a good way to trace the history of x86 microarchitecture.
NOTES:
- Glossary: ERMS = Enhanced REP MOVSB/STOSB, FSRS = Fast Short REP STOSB.
- This article can be read as archaeology. (The patches themselves are fairly recent, though.) I consider this nitpicky trivia; knowing it exists is enough.
- This article also pairs well with the previous one as a contrast. Both are about mem* design: the previous post was flashy software optimization, while this one is hardware optimization that can shrink down to a single instruction. Which to use today, and which to bet on for the future, is a genuine point of contention.
- There is no benchmark section. Different microarchitectures and access patterns behave inconsistently, so running my own numbers would be easy but would prove nothing. Instead, the article provides official, on-paper references.
memset as the backdrop
The Linux kernel's memset implementation lives in ./arch/x86/lib/memset_64.S.
// The Linux kernel v6.3 implementation (2023)
SYM_FUNC_START(__memset)
/*
* Some CPUs support enhanced REP MOVSB/STOSB feature. It is recommended
* to use it when possible. If not available, use fast string instructions.
*
* Otherwise, use original memset function.
*/
// Two alternative patch targets exist: the orig version and the erms version
// - With only the REP_GOOD feature, fall through to the instructions below
// - With the ERMS feature, jump to the memset_erms implementation (omitted)
// - With neither, jump to the memset_orig implementation (omitted)
ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
"jmp memset_erms", X86_FEATURE_ERMS
// The REP_GOOD version
// A Q+B rep stos implementation: the bulk in quadwords first, then the byte tail
movq %rdi,%r9
movq %rdx,%rcx
andl $7,%edx
shrq $3,%rcx
/* expand byte value */
movzbl %sil,%esi
movabs $0x0101010101010101,%rax
imulq %rsi,%rax
// Q (quadword) pass
rep stosq
movl %edx,%ecx
// B (byte) pass
rep stosb
movq %r9,%rax
RET
SYM_FUNC_END(__memset)
EXPORT_SYMBOL(__memset)
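A quick aside on the "expand byte value" part: multiplying the zero-extended byte by 0x0101010101010101 (the movzbl + movabs + imulq sequence above) replicates it into every byte of a 64-bit word, which is exactly the fill pattern rep stosq needs in RAX. A minimal user-space sketch of the same trick in C++, purely for illustration:

```cpp
// Byte-broadcast trick: b * 0x0101010101010101 copies b into all 8 bytes of a word.
#include <cstdint>
#include <cstdio>

std::uint64_t broadcast_byte(std::uint8_t b) {
    return std::uint64_t{b} * 0x0101010101010101ULL;
}

int main() {
    std::printf("%016llx\n", static_cast<unsigned long long>(broadcast_byte(0xAB)));
    // prints: abababababababab
}
```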
From at least 2015 up to 2023, the Linux kernel kept the selection path above:
- With only the REP_GOOD feature, use the Q+B rep stos implementation.
- With the ERMS feature, use the dedicated memset_erms implementation.
- With neither, fall back to the generic memset_orig implementation.
Note that the ALTERNATIVE macro here replaces assembly instructions at runtime.
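To make the ALTERNATIVE_2 semantics concrete, here is a hypothetical C++ sketch of the decision it encodes — the names are illustrative, not kernel APIs, and the real mechanism patches the instruction bytes once at boot instead of branching on every call:

```cpp
// Illustrative only: the selection priority encoded by ALTERNATIVE_2 above.
#include <cstdio>

enum class MemsetImpl { Orig, RepGood, Erms };

MemsetImpl select_memset(bool has_rep_good, bool has_erms) {
    if (has_erms)     return MemsetImpl::Erms;     // "jmp memset_erms" gets patched in
    if (has_rep_good) return MemsetImpl::RepGood;  // keep the inline rep stosq/stosb body
    return MemsetImpl::Orig;                       // default: "jmp memset_orig"
}

int main() {
    // e.g. a CPU with REP_GOOD but without ERMS ends up on the rep stosq/stosb path
    std::printf("%d\n", static_cast<int>(select_memset(true, false)));
}
```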
// The Linux kernel v6.17 implementation (2025)
SYM_TYPED_FUNC_START(__memset)
ALTERNATIVE "jmp memset_orig", "", X86_FEATURE_FSRS
// Taken when the CPU has the FSRS feature
// A rep stos implementation with only the B (byte) pass
movq %rdi,%r9
movb %sil,%al
movq %rdx,%rcx
rep stosb
movq %r9,%rax
RET
SYM_FUNC_END(__memset)
EXPORT_SYMBOL(__memset)
After that person reworked the implementation in 2023, this has remained the selection path up to now (2025): the only thing checked is FSRS support.
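For reference, here is roughly what that reduces to in user-space terms — a minimal GCC/Clang inline-assembly sketch of a rep stosb based memset (x86-64 only; an illustration, not the kernel function):

```cpp
// rep stosb stores AL into [RDI], RCX times; the kernel version above is the same idea.
#include <cstddef>
#include <cstdio>
#include <cstring>

void *rep_stosb_memset(void *dst, int value, std::size_t count) {
    void *ret = dst;
    asm volatile("rep stosb"
                 : "+D"(dst), "+c"(count)  // RDI = destination, RCX = count (both clobbered)
                 : "a"(value)              // AL = byte value
                 : "memory");
    return ret;
}

int main() {
    char buf[8];
    rep_stosb_memset(buf, 0xCD, sizeof buf);
    std::printf("%d\n", std::memcmp(buf, "\xCD\xCD\xCD\xCD\xCD\xCD\xCD\xCD", 8)); // 0
}
```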
As you can see, microarchitectural features give different guidance in different eras. Next, let's dig into the history of REP_GOOD, ERMS and FSRS.
REP_GOOD
// File: arch/x86/include/asm/cpufeatures.h
/* Other features, Linux-defined mapping, word 3 */
#define X86_FEATURE_REP_GOOD ( 3*32+16) /* "rep_good" REP microcode works well */
Note that REP_GOOD is not a hardware feature. It is Linux-defined; you just happen to be able to find it in /proc/cpuinfo.
Timeline:
- X86_FEATURE_K8_C introduced: before commit 1da177e, the earliest commit in the kernel's git history (the 2.6.12 era, 2005); I did not dig further back. Note that the memset rep optimization already existed at this point.
- X86_FEATURE_K8_C removed: this style of string optimization was considered old-fashioned, so it was dropped.
- X86_FEATURE_REP_GOOD formally introduced (2006): in substance the same as K8_C; it was added back after the removal turned out to cost performance (Intel NetBurst). (Same person both times.)
As far as I could trace the git history, REP_GOOD was nominally introduced in 2006, targeting AMD K8 (the C stepping); in practice the support is presumably older. Given its K8-based check, Intel processors would not benefit from it. However, commit 185f3b9 shows that Intel also received REP_GOOD as it entered the 64-bit era. Which vendor was first is hard to say.
NOTES:
- Judging by microcode rather than the REP_GOOD flag, an insider has mentioned that Intel was already experimenting with optimized rep implementations as far back as 1996.
- In any case this is ancient history; once we are in the 64-bit era there is no good reason not to have REP_GOOD.
ERMS
/* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
#define X86_FEATURE_ERMS ( 9*32+ 9) /* "erms" Enhanced REP MOVSB/STOSB instructions */
ERMS is a hardware feature provided by the processor. Note that, for historical reasons, most hardware features are Intel-defined. That does not make them Intel-exclusive: AVX2 is also Intel-defined, and other vendors can still implement it for compatibility…
3.7.6 Enhanced REP MOVSB and STOSB Operation
Beginning with processors based on Ivy Bridge microarchitecture, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for software in common situations like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.[bit 9] = 1.
Source: Intel® 64 and IA-32 Architectures Optimization Reference Manual
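Before going on: if you just want to check your own machine rather than read dumps, the enumeration quoted above can be queried directly. A small sketch using GCC/Clang's <cpuid.h> (x86 only):

```cpp
// Leaf 7, subleaf 0: EBX bit 9 is ERMS, exactly as the manual excerpt above states.
#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        std::printf("ERMS: %s\n", (ebx >> 9) & 1 ? "yes" : "no");
    }
    return 0;
}
```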
Timeline:
- X86_FEATURE_ERMS introduced (2011): essentially groundwork for the upcoming CPU launch.
- The Ivy Bridge microarchitecture officially ships ERMS (2012), as documented by Intel.
- ERMS subsumes the REP_GOOD capability: commit 15b7ddc (2025); this is mostly Intel tidying up a historical leftover.
Note that X86_FEATURE_ERMS is only a paper definition; the kernel does not set it on its own. Judging from the patch description, it is surfaced as a BIOS option (so there is a user-facing switch, and ALTERNATIVE can still pick the flag up via CPUID).
Nothing has been said about AMD so far, because AMD's ERMS support is quite a mess. Details below.
Since AMD's documentation can only be described as tragic, I had to go back and query CPUID instead. Given that CPUID:(EAX=7H, ECX=0H):EBX.[bit 9] = 1 is known to indicate the feature, then…
// Requires C++20; relies on the CPUID dump files provided by InstLatx64
#include <filesystem>
#include <string>
#include <string_view>
#include <charconv>
#include <iostream>
#include <fstream>
#include <ranges>
#include <vector>

namespace stdf = std::filesystem;
namespace stdv = std::views;
namespace stdr = std::ranges;
using namespace std::literals;

// The directory to scan can be passed as an argument; defaults to the current directory
int main(int argc, char **argv) {
    const stdf::path root {argc > 1 ? argv[1] : "."};
    std::cout << "Start: " << root << std::endl;

    // Keep only regular *.txt files whose name contains "CPUID"
    auto cpuid_txt_path_view =
          stdv::filter([](auto &&dentry) { return dentry.is_regular_file(); })
        | stdv::transform([](auto &&dentry) { return dentry.path(); })
        | stdv::filter([](auto &&path) { return path.extension() == ".txt"; })
        | stdv::filter([](auto &&path) {
              auto name = path.filename().string();
              return name.find("CPUID"sv) != std::string::npos;
          });

    // Standard leaves are dumped before the extended ones, so once we reach
    // leaf 80000001 there is nothing left to look for in this file
    auto optimized_out = [](auto &&line) {
        return line.starts_with("CPUID 80000001");
    };
    // Leaf 7 (subleaf 0) carries the ERMS bit in EBX
    auto contains_feature = [](auto &&line) {
        return line.starts_with("CPUID 00000007");
    };
    auto test_bit = [](auto &&line, auto bit_index) {
        // Dump lines look like "CPUID 00000007: <EAX>-<EBX>-<ECX>-<EDX>";
        // offset 25 is where the EBX hex field starts
        constexpr size_t bias = 25;
        const char *features_hex {line.data() + bias};
        unsigned long long bitmap = 0;
        std::from_chars(features_hex, features_hex + 8, bitmap, 16);
        return bitmap >> bit_index & 1;
    };

    std::vector<stdf::path> ok, ng;
    for(auto &&path : stdf::recursive_directory_iterator(root) | cpuid_txt_path_view) {
        std::ifstream file_stream {path};
        std::string line;
        while(std::getline(file_stream, line)) {
            if(optimized_out(line)) break;
            if(!contains_feature(line)) continue;
            auto &vec = test_bit(line, 9 /* ERMS */) ? ok : ng;
            vec.emplace_back(path);
            break;
        }
    }

    std::cout << "OK list:" << std::endl;
    for(auto &&path : ok) {
        std::cout << "  " << path.filename() << std::endl;
    }
    std::cout << std::endl;
    std::cout << "NG list:" << std::endl;
    for(auto &&path : ng) {
        std::cout << "  " << path.filename() << std::endl;
    }
}
You can browse the CPUID dump files others have already collected to confirm which generations support which features. Using the program above, the earliest AMD parts I could find with ERMS are the Vermeer ones (2020, Zen 3), roughly a decade behind Intel.
NOTE: commit ca96b16 mentions that even in 2023 some EPYC servers still lacked ERMS.
FSRM (an aside)
FSRM is likewise Intel-defined. Note that FSRM and FSRS are different features: the former is about move operations, the latter about store operations.
So FSRM does not speed up memset (that instruction is simply not used there), but it can speed up memmove/memcpy.
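As a counterpart to the rep stosb sketch earlier, this is the move-side instruction FSRM is about — again a user-space illustration only, forward copy only (so memcpy-like rather than a full memmove):

```cpp
// rep movsb copies RCX bytes from [RSI] to [RDI]; forward direction assumed here.
#include <cstddef>
#include <cstdio>

void *rep_movsb_memcpy(void *dst, const void *src, std::size_t count) {
    void *ret = dst;
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(count)  // RDI, RSI, RCX all advanced/clobbered
                 :
                 : "memory");
    return ret;
}

int main() {
    const char msg[] = "hello, fsrm";
    char buf[sizeof msg];
    rep_movsb_memcpy(buf, msg, sizeof msg);
    std::printf("%s\n", buf);
}
```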
3.7.6.1 Fast Short REP MOVSB
Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long. Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID [EAX=7H, ECX=0H).EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance.
Source: Intel® 64 and IA-32 Architectures Optimization Reference Manual
The Intel optimization manual says FSRM is supported starting with Ice Lake (2019). And AMD? Nothing at all, naturally.
NOTES:
- For hardware-optimized move operations there is also the FZRM feature (Fast Zero Length REP MOVSB); see the Intel manual if you are curious.
- If you google for AMD-related information, you will most likely find people complaining that Zen 3's FSRM support is poor.
- In 2020 the Linux kernel formally gained FSRM support, and the very first step was improving the memmove implementation.
FSRS
Timeline:
Support for fast-short REP STOSB is enumerated by the CPUID feature flag:
CPUID.07H.01H:EAX.FAST_SHORT_REP_STOSB[bit 11] = 1.
Slightly tweaking the ERMS enumeration program above shows that:
- Intel supports FSRS across the board starting with Alder Lake (2021).
- AMD is a complete washout: nothing up to Zen 5 (2024-2025) supports FSRS.
This suggests that the change made by that person is a net loss for AMD, since AMD falls straight back to the most generic baseline implementation.
So what now? That person stepped in again: if an AMD CPU supports FSRM, it is treated as also supporting FSRS. So as far as the Linux kernel is concerned, this is a non-issue.
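For completeness, the two sibling bits can also be queried directly at runtime instead of via dump files — a small sketch matching the enumerations quoted earlier (FSRM: leaf 7 subleaf 0, EDX bit 4; FSRS: leaf 7 subleaf 1, EAX bit 11), again using GCC/Clang's <cpuid.h>:

```cpp
// FSRM lives in leaf 7 / subleaf 0 (EDX bit 4); FSRS lives in leaf 7 / subleaf 1 (EAX bit 11).
#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        std::printf("FSRM: %s\n", (edx >> 4) & 1 ? "yes" : "no");
    }
    if (__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx)) {
        std::printf("FSRS: %s\n", (eax >> 11) & 1 ? "yes" : "no");
    }
    return 0;
}
```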
Appendix 1: cricket fight, on paper
We will look at Agner's data later; before that, some on-paper references.
17.8 Moving blocks of data
There are several ways of moving large blocks of data. The most common methods are:
- REP MOVS instruction.
- …
The REP MOVS instruction (1) is a simple solution which is useful when optimizing for code size rather than for speed.
…it is not optimal for small blocks of data. For large blocks of data, it may be quite efficient when certain conditions for alignment etc. are met. These conditions depend on the specific CPU (see page 140). On Intel Nehalem and later processors, this is sometimes as fast as the other methods when the memory block is large.
…
Many modern processors have optimized the REP MOVS instruction (method 1) to use the largest available register size and the fastest method, at least in simple cases. But there are still cases where the REP MOVS method is slow, for example for certain misalignment cases and false memory dependence. However, the REP MOVS instruction has the advantage that it will probably use the largest available register size on processors in a more distant future with registers bigger than 512 bits. As instructions with the expected future register sizes cannot yet be coded and tested, the REP MOVS instruction is the only way we can write code today that will take advantage of future extensions to the register size. Therefore, it may be useful to use the REP MOVS instruction for favorable cases of large aligned memory blocks.
Agner's assembly optimization guide (2025) regards rep as a good instruction to bet on for the future, because it automatically benefits as registers grow beyond AVX-512, and it has a code-size advantage all of its own. Note that this passage is mainly about move operations, and it remains negative about misaligned sources.
2.13.6 String Store Optimizations
The AMD Zen5 architecture includes several optimizations to improve the performance of stores produced by rep movs and rep stos instructions (string instructions).
At very large string sizes, sizes that are greater than or equal to the L3 size of the processor, Zen5 performs string stores using streaming-store operations. Streaming stores are non-cacheable stores that bypass the cache hierarchy of the processor and write data directly to the destination after aggregation in Write Combining Buffer. This optimization avoids replacing all cachelines in the cache hierarchy with the string data.
At smaller string sizes, in some system configurations, the AMD Zen5 architecture includes an optimization to eliminate the Read For Ownership (RFO) cache-coherence action for the destination cachelines that are fully overwritten by the string instruction. The destination cachelines are allocated into the cache hierarchy without being read, and are fully overwritten by the stores of the string instruction. The size threshold at which this optimization is active is implementation dependent. The optimization does not require that the size of the string instruction be a multiple of the cacheline size; the processor handles cachelines that are not fully-overwritten using a read-for-ownership.
In both optimizations, the stores produced by the string instruction may become visible to other processors out-of-order with respect to other stores in the string instruction. However, the processor ensures that stores older than the string instruction are visible before any stores in the string instruction, and that all stores from the string instruction are visible before stores younger than the string instruction.
The Zen 5 optimization manual makes a point of stressing how thoroughly optimized rep is on this architecture (every software trick you can think of, AMD has already done for you in microcode), something no earlier AMD microarchitecture optimization manual ever mentioned.
3.7.6.2 Memcpy Considerations
…
For processors supporting enhanced REP MOVSB/STOSB, implementing memcpy with REP MOVSB will provide even more compact benefits in code size and better throughput than using the combination of REP MOVSD+B. For processors based on Ivy Bridge microarchitecture, implementing memcpy using Enhanced REP MOVSB and STOSB might not reach the same level of throughput as using 256-bit or 128-bit AVX alternatives, depending on length and alignment factors.
Using Enhanced REP MOVSB and STOSB always delivers better performance than using REP MOVSD+B. If the length is a multiple of 64, it can produce even higher performance.
This and the following excerpts are from the Intel optimization manual. The memcpy section has nothing to do with store operations; it is included only to show that on modern processors you should simply use the B (byte) form directly.
3.7.6.4 Memset Considerations
The consideration of code size and throughput also applies for memset() implementations. For processors supporting Enhanced REP MOVSB and STOSB, using REP STOSB will again deliver more compact code size and significantly better performance than the combination of STOSD+B technique described in Section 3.7.5.
When the destination buffer is 16-byte aligned, memset() using Enhanced REP MOVSB and STOSB can perform better than SIMD approaches. When the destination buffer is misaligned, memset() performance using Enhanced REP MOVSB and STOSB can degrade about 20% relative to aligned case, for processors based on Ivy Bridge microarchitecture. In contrast, SIMD implementation of memset() will experience smaller degradation when the destination is misaligned.
Memset() implemented with Enhanced REP MOVSB and STOSB can benefit further from the 256-bit data path in Haswell microarchitecture. See Section 15.16.3.3.
15.16.3.3 Memset() Implementation Considerations
…
Using REP STOSB to implement memset() has the code size advantage versus a SIMD implementation, like REP MOVSB for memcpy(). On Haswell microarchitecture, a memset() routine implemented using REP STOSB will also benefit from the 256-bit data path and increased L1 data cache bandwidth to deliver up to 32 bytes per cycle for large count values.
Comparing the performance of memset() implementations using REP STOSB vs. 256-bit AVX2 requires one to consider the pattern of invocation of memset(). The invocation pattern can lead to the necessity of using different performance measurement techniques. There may be side effects affecting the outcome of each measurement technique.
…
When the relevant skew factors of measurement techniques are taken into effect, the performance of memset() using REP STOSB, for count values smaller than a few hundred bytes, is generally faster than the AVX2 version for the common memset() invocation scenarios. Only in the extreme scenarios of hundreds of unrolled memset() calls, all using count values less than a few hundred bytes and with no intervening instruction stream between each pair of memset() can the AVX2 version of memset() take advantage of the training effect of the branch predictor.
In short, the Intel optimization manual says REP STOSB is usually better than a traditional SIMD implementation (e.g. AVX2) on both binary size and performance, especially in the common invocation scenarios. Bear in mind the discussion is generally anchored on the Ivy Bridge node (2012).
3.8.2 Fast Short REP STOSB
REP STOSB performance of short operations is enhanced. The enhancement applies to string lengths between 0 and 128 bytes long. When Fast Short REP STOSB feature is enabled, REP STOSB performance is flat 12 cycles per operation, for all strings 0-128 byte long whose destination operand resides in the processor first level cache.
The Intel optimization manual also states that FSRS targets short strings in the [0, 128] byte range. My personal guess is that Intel scoped FSRS to 128 bytes because, in the Haswell era, rep was clearly beaten by SIMD within this range (above 128 bytes rep tends to win slightly; you will have to dig out the charts yourself), and the Ivy Bridge era behaved the same way (see the last paragraph quoted above).
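A quick back-of-envelope check of that "flat 12 cycles" figure — my own arithmetic, assuming the destination sits in L1 and the cost really is length-independent up to 128 bytes:

```cpp
// Effective bytes/cycle if a whole rep stosb costs a flat 12 cycles for n <= 128.
#include <cstdio>

int main() {
    constexpr double flat_cycles = 12.0;
    for (int n : {16, 64, 128}) {
        std::printf("n = %3d bytes -> %.1f bytes/cycle\n", n, n / flat_cycles);
    }
    // Roughly 1.3, 5.3 and 10.7 bytes/cycle: at the small end a single 32-byte AVX2
    // store easily wins, which fits the speculation above about the 128-byte bound.
}
```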
NOTES:
- The documentation really is scattered; I cannot guarantee I found everything.
- MSVC/GCC have also mentioned performance regressions for misaligned cases; search the relevant issues if you are interested.
Appendix 2: cricket fight, Agner edition
Although uops.info has more convenient side-by-side data, it only offers bare numbers (with a tiny rcx)… The tables compiled by Agner Fog carry far more footnotes, so here is a quick summary.
We consider two things: latency (cycles per instruction when there is a dependency) and the data bandwidth in bytes/cycle obtained by normalizing the reciprocal throughput (cycles per instruction with no dependency), i.e. how many bytes memset can process per clock cycle.
Keeping the accounting honest is a bit tricky: Agner's tables only say rep stos without giving the data width, and although they single out a "small n" case, they never say what n is. For bandwidth we use the widest width the CPU supports, count SIMD as aligned stores ([V]MOVDQA; the 128/256/512 columns), and use the m,r operand forms. The table ignores some incidental overhead to keep the numbers simple (e.g. the extra broadcast instruction the mov/SIMD variants need, or the miscellaneous register setup required before starting).
| Year | Arch | rep | mov | 128 | 256 | 512 | Notes |
|---|---|---|---|---|---|---|---|
| 1999 | K7 | 4 | 8 | | | | r32 only |
| 2003 | K8 | 8 | 16 | | | | r64 supported, far ahead of its time |
| 2007 | K10 | 8 | 16 | 16 | | | |
| 2011 | Bulldozer | 4-5.3 | 8 | 16 | 10.7 | | Bulldozer is widely agreed to be a dud |
| 2012 | Piledriver | 8 | 8 | 16 | 1.9 | | very low AVX throughput |
| 2014 | Steamroller | 8 | 8 | 16 | 32 | | |
| 2015 | Excavator | 8 | 8 | 16 | 16 | | a regression I cannot explain |
| 2017 | Zen | 8-16 | 8 | 16 | 16 | | caught up! |
| 2019 | Zen 2 | 8-32 | 8 | 16 | 32 | | |
| 2020 | Zen 3 | 8-32 | 16 | 16 | 32 | | mov m,r throughput improved |
| 2022 | Zen 4 | >8-32 | 16 | 16 | 32 | 32 | the rumored "glued" AVX-512, confirmed |
| 2024 | Zen 5 | >8-32 | 16 | 32 | 64 | 64 | SIMD still wins |
| 1993 | Pentium | 4 | 4 | | | | reciprocal throughput is really 10+n |
| 1997 | Pentium II / III | | | | | | no data |
| 2003 | Pentium M | 5.7 | 4 | | | | |
| 2006 | Core 2 | 14.5 | 8 | | | | |
| 2008 | Nehalem | <8-16 | 8 | 16 | | | the small case is really 12+n |
| 2011 | Sandy Bridge | 8-16 | 8 | 16 | | | |
| 2012 | Ivy Bridge | 8-16 | 8 | 16 | | | |
| 2013 | Haswell | 16-32 | 8 | 16 | 32 | | a big leap |
| 2014 | Broadwell | 16-32 | 8 | 16 | 32 | | |
| 2015 | Skylake | 16-32 | 8 | 16 | 32 | | |
| 2015 | Skylake-X | 16-32 | 8 | 16 | 32 | 64 | |
| 2017 | Coffee Lake | 16-32 | 8 | 16 | 32 | | a step back: no AVX-512 |
| 2018 | Cannon Lake | 16-32 | 8 | 16 | 32 | 64 | |
| 2019 | Ice Lake | 3.2-64 | 8 | 32 | 64 | 64 | the toothpaste factory's latest stops here |
None of this is rigorous and none of it represents any real workload; treat it as a quick glance. Whether it is Zen or the assorted Bridges/Wells/Lakes, rep has been chasing the contemporary SIMD hardware the whole way, but it is hard to say it has ever fully overtaken it.