Preface
The previous post summarized F2FS in broad strokes by reading the paper; this one goes further and tries to understand F2FS through its source. On one hand, the early F2FS commits are a good way to get on the road; on the other, the mkfs.f2fs tool from the same period shows what a filesystem's initialized state looks like. That code is certainly less stable than today's, but it maps much more directly onto the features described in the paper. This article looks at the on-disk layout of a freshly formatted F2FS volume (i.e. its initialized state), unfolding the details by debugging the userspace mkfs.f2fs tool rather than the kernel.
Overview: On-disk layout
As mentioned in the previous post, the F2FS layout is divided into 6 areas, as shown in the figure above. So what does its initialized state look like? We can find out by stepping through mkfs.f2fs and inspecting the concrete data structures. Some preparation before getting hands-on:
git clone https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
cd f2fs-tools
# find a commit close in time to the kernel code above, and check it out
git checkout -b jintianxiaomidaobilema 036d45e551ca5405c726f8ccb51f446620cd4af4
cd mkfs
# create a 1GB image for debugging
dd if=/dev/zero of=f2fs.img bs=1024 count=1000000
# build mkfs.f2fs; -g3 enables macro-expansion debugging
gcc f2fs_format.c -g3 -o mkfs.f2fs
The "on-paper" F2FS data structures can be read directly here (in the f2fs header); the in-source comments are extremely thorough.
First set a breakpoint right before the main initialization routine f2fs_format_device() returns, then run with the image as argument and print the f2fs_params variable. This variable holds the options parsed before initialization, so it also reveals the tool's default behavior:
(gdb) b f2fs_format.c:1273
(gdb) r f2fs.img
(gdb) set print pretty on
(gdb) p f2fs_params
$1 = {
sector_size = 512,
reserved_segments = 25,
overprovision = 5,
cur_seg = {473, 1, 0, 476, 475, 474},
segs_per_sec = 1,
secs_per_zone = 1,
start_sector = 0,
total_sectors = 2000000,
sectors_per_blk = 8,
blks_per_seg = 512,
vol_label = "F2FS", '\000' <repeats 11 times>,
heap = 1,
fd = 3,
device_name = 0x7fffffffe137 "f2fs.img",
extension_list = 0x0
}
A few takeaways:
- overprovision is the percentage of space reserved for GC, not exposed to the user; reserved_segments is computed from overprovision
- heap-based allocation is enabled by default (heap = 1), which in effect shapes how cur_seg is assigned: as quoted below, data logs are allocated from the start of the main area while node logs come from the end
- fd is only used by writetodisk() for the final writes to the device/image file; it is not itself stored in the filesystem
" overprovision-ratio-percentage" Specify the percentage over the volume size for overprovision area. This area is hidden to users, and utilized by F2FS cleaner. The default percentage is 5%. " heap-based-allocation" Specify 1 or 0 to enable/disable heap based block allocation policy. If the value is equal to 1, each of active log areas are initially assigned separately according to the whole volume size. The default value is 1. ... heap-style segment allocation which finds free segments for data from the beginning of main area, while for node from the end of main area.
Super block initialization
The data structure in f2fs.h looks like this:
/*
* For superblock
*/
struct f2fs_super_block {
__le32 magic; /* Magic Number */
__le16 major_ver; /* Major Version */
__le16 minor_ver; /* Minor Version */
__le32 log_sectorsize; /* log2 sector size in bytes */
__le32 log_sectors_per_block; /* log2 # of sectors per block */
__le32 log_blocksize; /* log2 block size in bytes */
__le32 log_blocks_per_seg; /* log2 # of blocks per segment */
__le32 segs_per_sec; /* # of segments per section */
__le32 secs_per_zone; /* # of sections per zone */
__le32 checksum_offset; /* checksum offset inside super block */
__le64 block_count; /* total # of user blocks */
__le32 section_count; /* total # of sections */
__le32 segment_count; /* total # of segments */
__le32 segment_count_ckpt; /* # of segments for checkpoint */
__le32 segment_count_sit; /* # of segments for SIT */
__le32 segment_count_nat; /* # of segments for NAT */
__le32 segment_count_ssa; /* # of segments for SSA */
__le32 segment_count_main; /* # of segments for main area */
__le32 segment0_blkaddr; /* start block address of segment 0 */
__le32 cp_blkaddr; /* start block address of checkpoint */
__le32 sit_blkaddr; /* start block address of SIT */
__le32 nat_blkaddr; /* start block address of NAT */
__le32 ssa_blkaddr; /* start block address of SSA */
__le32 main_blkaddr; /* start block address of main area */
__le32 root_ino; /* root inode number */
__le32 node_ino; /* node inode number */
__le32 meta_ino; /* meta inode number */
__u8 uuid[16]; /* 128-bit uuid for volume */
__le16 volume_name[512]; /* volume name */
__le32 extension_count; /* # of extensions below */
__u8 extension_list[F2FS_MAX_EXTENSION][8]; /* extension array */
} __packed;
Again we use gdb to dump the initialized state, as shown below:
(gdb) p super_block
$2 = {
magic = 4076150800,
major_ver = 1,
minor_ver = 0,
log_sectorsize = 9,
log_sectors_per_block = 3,
log_blocksize = 12,
log_blocks_per_seg = 9,
segs_per_sec = 1,
secs_per_zone = 1,
checksum_offset = 0,
block_count = 250000,
section_count = 478,
segment_count = 487,
segment_count_ckpt = 2,
segment_count_sit = 2,
segment_count_nat = 4,
segment_count_ssa = 1,
segment_count_main = 478,
failure_safe_block_distance = 0,
segment0_blkaddr = 512,
start_segment_checkpoint = 512,
sit_blkaddr = 1536,
nat_blkaddr = 2560,
ssa_blkaddr = 4608,
main_blkaddr = 5120,
root_ino = 3,
node_ino = 1,
meta_ino = 2,
volume_serial_number = 0,
volume_name = {70, 50, 70, 83, 0 <repeats 508 times>},
extension_count = 22,
extension_list = {"jpg\000\000\000\000", "gif\000\000\000\000", "png\000\000\000\000", "avi\000\000\000\000", "divx\000\000\000", "mp4\000\000\000\000", "mp3\000\000\000\000",
"3gp\000\000\000\000", "wmv\000\000\000\000", "wma\000\000\000\000", "mpeg\000\000\000", "mkv\000\000\000\000", "mov\000\000\000\000", "asx\000\000\000\000", "asf\000\000\000\000",
"wmx\000\000\000\000", "svi\000\000\000\000", "wvx\000\000\000\000", "wm\000\000\000\000\000", "mpg\000\000\000\000", "mpe\000\000\000\000", "rm\000\000\000\000\000",
"ogg\000\000\000\000", "\000\000\000\000\000\000\000" <repeats 41 times>}
}
The super block section is quite easy to follow. Besides the basic block/sector geometry of the device and the segment counts, it reflects the default 1:1:1 segment-section-zone hierarchy and records the start address of each area (the *_blkaddr fields). (Note that the dump from this early commit uses a couple of older field names, e.g. start_segment_checkpoint where the header above says cp_blkaddr.)
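The *_blkaddr values can be cross-checked by hand from the segment counts above: each area starts right where the previous one ends, in units of blocks (blks_per_seg = 512):
cp_blkaddr   = segment0_blkaddr            = 512
sit_blkaddr  = 512  + 2 * 512 (ckpt segs)  = 1536
nat_blkaddr  = 1536 + 2 * 512 (sit segs)   = 2560
ssa_blkaddr  = 2560 + 4 * 512 (nat segs)   = 4608
main_blkaddr = 4608 + 1 * 512 (ssa segs)   = 5120
which matches the dump exactly.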
SIT initialization
/**
* @brief It initialize SIT Data structure
* @param None
* @return 0 if success
*/
static int8_t f2fs_init_sit_area(void)
{
u_int32_t blk_size_bytes;
u_int32_t seg_size_bytes;
u_int32_t index = 0;
u_int64_t sit_seg_blk_offset = 0;
u_int8_t *zero_buf = NULL;
blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
seg_size_bytes = (1 << le32_to_cpu(super_block.log_blocks_per_seg)) *
blk_size_bytes;
zero_buf = calloc(sizeof(u_int8_t), seg_size_bytes);
if(zero_buf == NULL) {
printf("\n\tError: Calloc Failed for sit_zero_buf!!!\n");
return -1;
}
sit_seg_blk_offset = le32_to_cpu(super_block.sit_blkaddr) *
blk_size_bytes;
for (index = 0;
index < (le32_to_cpu(super_block.segment_count_sit) / 2);
index++) {
if (writetodisk(f2fs_params.fd, zero_buf, sit_seg_blk_offset,
seg_size_bytes) < 0) {
printf("\n\tError: While zeroing out the sit area \
on disk!!!\n");
return -1;
}
sit_seg_blk_offset = sit_seg_blk_offset + seg_size_bytes;
}
free(zero_buf);
return 0 ;
}
/**
* @brief It writes buffer to disk or storage meant to be formatted
* with F2FS.
* @param fd File descriptor for device
* @param buf buffer to be written
* @param offset where to bw written on the device
* @param length length of the device
* @return 0 if success
*/
static int writetodisk(int32_t fd, void *buf, u_int64_t offset, size_t length)
{
if (lseek64(fd, offset, SEEK_SET) < 0) {
printf("\n\tError: While lseek to the derised location!!!\n");
return -1;
}
if (write(fd, buf, length) < 0) {
printf("\n\tError: While writing data to the disk!!! Error Num : \
%d\n", errno);
return -1;
}
return 0;
}
The SIT initialization function f2fs_init_sit_area() does not modify any in-memory structures; at its core it just calls writetodisk() to write into the on-disk SIT area.
From the earlier dump we know super_block.segment_count_sit = 2, i.e. the SIT area actually occupies 2 segments, but only half of that (1 segment) is written out to the image here, and the content written is all zeros (zero_buf).
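A quick sanity check on the resulting image (offsets derived from the dump above: the SIT starts at block 1536, and one segment is 512 blocks of 4 KiB):
# dump the first SIT segment; it should be all zeros after mkfs
dd if=f2fs.img bs=4096 skip=1536 count=512 2>/dev/null | hexdump -C
# expected output: one row of zeros, then '*' (hexdump collapses repeated rows)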
NAT 初始化
目前 NAT 的初始化行为与 SIT 一致,区别是 NAT 占用 4 个 segment,其它略
Note: 在后续 root directory 初始化阶段中,NAT 仍要为 root inode 做进一步更新操作
Root directory 初始化
接下来的流程执行 f2fs_create_root_dir()
,简单来说有 3 步:
f2fs_write_root_inode()
:在 main area 创建 root inodef2fs_update_nat_root()
:继续初始化 NAT,即在该区域记录 root inode 的地址映射f2fs_add_default_dentry_root()
后面我们来看这一大段代码
/**
* @brief It initializes and writes root inode on device.
* @param None
* @return 0 if success
*/
static int8_t f2fs_write_root_inode(void)
{
struct f2fs_node *raw_node = NULL;
u_int32_t blk_size_bytes;
u_int64_t data_blk_nor;
u_int64_t main_area_node_seg_blk_offset = 0;
raw_node = calloc(sizeof(struct f2fs_node), 1);
if (raw_node == NULL) {
printf("\n\tError: Calloc Failed for raw_node!!!\n");
return -1;
}
// set the root inode number
raw_node->footer.nid = super_block.root_ino;
raw_node->footer.ino = super_block.root_ino;
raw_node->footer.cp_ver = cpu_to_le64(1);
// point to the next block
raw_node->footer.next_blkaddr = cpu_to_le32(
le32_to_cpu(super_block.main_blkaddr) +
f2fs_params.cur_seg[CURSEG_HOT_NODE] *
f2fs_params.blks_per_seg + 1);
raw_node->i.i_mode = cpu_to_le16(0x41ed);
raw_node->i.i_links = cpu_to_le32(2);
raw_node->i.i_uid = cpu_to_le32(getuid());
raw_node->i.i_gid = cpu_to_le32(getgid());
blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
// file size in different units (bytes / blocks)
raw_node->i.i_size = cpu_to_le64(1 * blk_size_bytes); /* dentry */
raw_node->i.i_blocks = cpu_to_le64(2);
raw_node->i.i_ctime = cpu_to_le32(time(NULL));
raw_node->i.i_ctime_nsec = 0;
raw_node->i.i_mtime = cpu_to_le32(time(NULL));
raw_node->i.i_mtime_nsec = 0;
raw_node->i.i_xattr_nid = 0;
raw_node->i.i_flags = 0;
raw_node->i.current_depth = cpu_to_le32(1);
// location of the HOT DATA log
data_blk_nor = le32_to_cpu(super_block.main_blkaddr) +
f2fs_params.cur_seg[CURSEG_HOT_DATA] * f2fs_params.blks_per_seg;
// i_addr[ADDRS_PER_INODE=927] holds the pointers to data blocks
raw_node->i.i_addr[0] = cpu_to_le32(data_blk_nor);
raw_node->i.i_ext.fofs = 0;
raw_node->i.i_ext.blk_addr = cpu_to_le32(data_blk_nor);
raw_node->i.i_ext.len = cpu_to_le32(1);
// first get the start address of the main area
main_area_node_seg_blk_offset = le32_to_cpu(super_block.main_blkaddr);
// location of the HOT NODE log
main_area_node_seg_blk_offset += f2fs_params.cur_seg[CURSEG_HOT_NODE] *
f2fs_params.blks_per_seg;
main_area_node_seg_blk_offset *= blk_size_bytes;
// write the root inode at the HOT NODE offset
if (writetodisk(f2fs_params.fd, raw_node, main_area_node_seg_blk_offset,
sizeof(struct f2fs_node)) < 0) {
printf("\n\tError: While writing the raw_node to disk!!!\n");
return -1;
}
memset(raw_node, 0xff, sizeof(struct f2fs_node));
// smear the following block entirely with 0xff
if (writetodisk(f2fs_params.fd, raw_node,
main_area_node_seg_blk_offset + 4096,
sizeof(struct f2fs_node)) < 0) {
printf("\n\tError: While writing the raw_node to disk!!!\n");
return -1;
}
free(raw_node);
return 0;
}
/**
* @brief It updates NAT for root Inode
* @param None
* @return 0 if success
*/
static int8_t f2fs_update_nat_root(void)
{
// a block-sized structure holding 4K / sizeof(f2fs_nat_entry) f2fs_nat_entry slots;
// each entry is a triple <version, ino, block_addr>, i.e. an easy-to-read ino-to-address mapping entry
struct f2fs_nat_block *nat_blk = NULL;
u_int32_t blk_size_bytes;
u_int64_t nat_seg_blk_offset = 0;
nat_blk = calloc(sizeof(struct f2fs_nat_block), 1);
if(nat_blk == NULL) {
printf("\n\tError: Calloc Failed for nat_blk!!!\n");
return -1;
}
/* update root */
// record the root inode's ino-to-address mapping; we already derived the address above
nat_blk->entries[super_block.root_ino].block_addr = cpu_to_le32(
le32_to_cpu(super_block.main_blkaddr) +
f2fs_params.cur_seg[CURSEG_HOT_NODE] * f2fs_params.blks_per_seg);
nat_blk->entries[super_block.root_ino].ino = super_block.root_ino;
// the node inode (ino=1) and the meta inode (ino=2) still remain
/* update node nat */
// why 1? presumably just a non-zero placeholder: the node/meta inodes are internal
// and never get a real on-disk node block, so this only marks the nid as in use
nat_blk->entries[super_block.node_ino].block_addr = cpu_to_le32(1);
nat_blk->entries[super_block.node_ino].ino = super_block.node_ino;
/* update meta nat */
nat_blk->entries[super_block.meta_ino].block_addr = cpu_to_le32(1);
nat_blk->entries[super_block.meta_ino].ino = super_block.meta_ino;
blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
// locate the NAT position, then write to it
nat_seg_blk_offset = le32_to_cpu(super_block.nat_blkaddr) *
blk_size_bytes;
if (writetodisk(f2fs_params.fd, nat_blk, nat_seg_blk_offset,
sizeof(struct f2fs_nat_block)) < 0) {
printf("\n\tError: While writing the nat_blk set0 to disk!!!\n");
return -1;
}
free(nat_blk);
return 0;
}
/**
* @brief It updates default dentries in Root Inode
* @param None
* @return 0 if success
*/
static int8_t f2fs_add_default_dentry_root(void)
{
// some background on the dentry block: roughly a bitmap plus a set of dentries; see the Bonus section below for details
struct f2fs_dentry_block *dent_blk = NULL;
u_int32_t blk_size_bytes;
u_int64_t data_blk_offset = 0;
dent_blk = calloc(sizeof(struct f2fs_dentry_block), 1);
if(dent_blk == NULL) {
printf("\n\tError: Calloc Failed for dent_blk!!!\n");
return -1;
}
// fill in the root inode's dentry info ("." and "..")
dent_blk->dentry[0].hash_code = 0;
dent_blk->dentry[0].ino = super_block.root_ino;
dent_blk->dentry[0].name_len = cpu_to_le16(1);
dent_blk->dentry[0].file_type = F2FS_FT_DIR;
memcpy(dent_blk->filename[0], ".", 1);
dent_blk->dentry[1].hash_code = 0;
dent_blk->dentry[1].ino = super_block.root_ino;
dent_blk->dentry[1].name_len = cpu_to_le16(2);
dent_blk->dentry[1].file_type = F2FS_FT_DIR;
memcpy(dent_blk->filename[1], "..", 2);
/* bitmap for . and .. */
// hand-rolled bitmap: slots 0 and 1 are now occupied
dent_blk->dentry_bitmap[0] = (1 << 1) | (1 << 0);
blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
data_blk_offset = (le32_to_cpu(super_block.main_blkaddr) +
f2fs_params.cur_seg[CURSEG_HOT_DATA] *
f2fs_params.blks_per_seg) * blk_size_bytes;
// write to HOT DATA
if (writetodisk(f2fs_params.fd, dent_blk, data_blk_offset,
sizeof(struct f2fs_dentry_block)) < 0) {
printf("\n\tError: While writing the dentry_blk to disk!!!\n");
return -1;
}
free(dent_blk);
return 0;
}
f2fs_write_root_inode() handles much the same fields as a VFS-layer inode. The F2FS-specific touches include:
- the root inode is written to the HOT NODE log, i.e. the starting write position for node blocks
- the root inode's i_addr[0], a pointer to its data, points at the HOT DATA log
The root-inode-related output is shown below (note that i sits in a union with dn/in, so the output of the latter two is meaningless here):
(gdb) b f2fs_format.c:1051
(gdb) info locals
raw_node = 0x4098c0
blk_size_bytes = 4096
data_blk_nor = 247296
main_area_node_seg_blk_offset = 1019215872
# pretty print temporarily turned off here to avoid overly long output
(gdb) p *raw_node
$3 = {{i = {i_mode = 16877, i_advise = 0 '\000', i_reserved = 0 '\000', i_uid = 1000, i_gid = 1000, i_links = 2, i_size = 4096, i_blocks = 2, i_ctime = 1696424848, i_mtime = 1696424848, i_ctime_nsec = 0, i_mtime_nsec = 0, current_depth = 1, i_xattr_nid = 0, i_flags = 0, i_pino = 0, i_namelen = 0, i_name = '\000' <repeats 255 times>, i_ext = {fofs = 0, blk_addr = 247296, len = 1}, i_addr = {247296, 0 <repeats 926 times>}, i_nid = {0, 0, 0, 0, 0}},
  dn = {addr = {16877, 1000, 1000, 2, 4096, 0, 2, 0, 1696424848, 0, 1696424848, 0, 0, 0, 1, 0 <repeats 69 times>, 247296, 1, 247296, 0 <repeats 931 times>}},
  in = {nid = {16877, 1000, 1000, 2, 4096, 0, 2, 0, 1696424848, 0, 1696424848, 0, 0, 0, 1, 0 <repeats 69 times>, 247296, 1, 247296, 0 <repeats 931 times>}}},
 footer = {nid = 3, ino = 3, flag = 0, cp_ver = 1, next_blkaddr = 248833}}
f2fs_update_nat_root() continues the earlier NAT initialization: the NAT still has to record the root inode's mapping. Concretely, one block represents a set of mapping entries; the inode-to-address value is filled in, and the block is then written out to the NAT area.
f2fs_add_default_dentry_root() supplies the dentry information the root inode needs: it fills in "." and "..", sets the corresponding bitmap bits, and writes the block to HOT DATA.
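The result can be inspected directly on disk. 247296 is the HOT DATA block computed above, and the 0x950 offset follows from the layout constants detailed in the Bonus section below (27 bitmap bytes + 3 reserved bytes + 214 × 11 dentry bytes = 2384 = 0x950):
# inspect the root dentry block after running ./mkfs.f2fs f2fs.img
dd if=f2fs.img bs=4096 skip=247296 count=1 2>/dev/null | hexdump -C | head -3
# byte 0 should read 03 (bitmap bits 0 and 1 for "." and "..");
# the names themselves live in the filename pool starting at offset 0x950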
Bonus: some background on dentry
/* 4KB-sized directory entry block */
struct f2fs_dentry_block {
/* validity bitmap for directory entries in each block */
__u8 dentry_bitmap[SIZE_OF_DENTRY_BITMAP];
__u8 reserved[SIZE_OF_RESERVED];
struct f2fs_dir_entry dentry[NR_DENTRY_IN_BLOCK];
__u8 filename[NR_DENTRY_IN_BLOCK][F2FS_NAME_LEN];
} __packed;
/* One directory entry slot representing F2FS_NAME_LEN-sized file name */
struct f2fs_dir_entry {
__le32 hash_code; /* hash code of file name */
__le32 ino; /* inode number */
__le16 name_len; /* lengh of file name */
__u8 file_type; /* file type */
} __packed;
The dentry attributes are quite intuitive: locating a file comes down to little more than the file type and the name hash. A dentry block contains:
- a bitmap for fast lookup and update
- some reserved u8 bits, probably unused
- the dentry array used to locate files
- the filename string pool recording the names
Let's go one level deeper into the concrete layout of a dentry block.
/* the number of dentry in a block */
#define NR_DENTRY_IN_BLOCK 214
#define F2FS_NAME_LEN 8 /* 256 Unicode */
#define SIZE_OF_DIR_ENTRY 11 /* by byte */
#define SIZE_OF_DENTRY_BITMAP ((NR_DENTRY_IN_BLOCK + BITS_PER_BYTE - 1) / \
BITS_PER_BYTE)
#define SIZE_OF_RESERVED (PAGE_SIZE - ((SIZE_OF_DIR_ENTRY + \
F2FS_NAME_LEN) * \
NR_DENTRY_IN_BLOCK + SIZE_OF_DENTRY_BITMAP))
// a bit dizzying to read; let gdb evaluate it for us
// Note: building with -g3 makes the macro values available
(gdb) macro expand SIZE_OF_DENTRY_BITMAP
expands to: ((214 + 8 - 1) / 8)
(gdb) p SIZE_OF_DENTRY_BITMAP
$4 = 27
(gdb) macro expand SIZE_OF_RESERVED
expands to: (4096 - ((11 + 8) * 214 + ((214 + 8 - 1) / 8)))
(gdb) p SIZE_OF_RESERVED
$5 = 3
That is, within one dentry block:
- the 214 dentries occupy \(214 \times 11 = 2354 \ bytes\)
- the string pool occupies \(214 \times 8 = 1712 \ bytes\)
- the bitmap rounds up to \(\lceil \frac{214}{8} \rceil = 27 \ bytes\)
- the remainder is whatever is left of one 4K block: \(4096 - 2354 - 1712 - 27 = 3 \ bytes\)
So 214 is exactly the value that maximizes dentry block utilization: one more slot would cost another \(19\,(+1\,bit) \ bytes\), and only \(3 \ bytes\) remain.
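That claim is easy to verify with a throwaway program, a minimal standalone sketch rather than anything from the f2fs source:
#include <stdio.h>

/* find the largest n such that n dentries + n name slots + bitmap fit in 4096 bytes */
int main(void)
{
	int n = 1;

	while ((n + 1) * (11 + 8) + ((n + 1) + 7) / 8 <= 4096)
		n++;
	printf("max dentries per 4KB block = %d\n", n); /* prints 214 */
	return 0;
}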
Check point initialization
Flow
The check point part of the initialization is another long stretch of code; it involves not only the check point itself but also the summary block data structures.
/**
* @brief It writes check poiint pack on Check point Area
* @param None
* @return 0 if succes
*/
static int8_t f2fs_write_check_point_pack(void)
{
struct f2fs_checkpoint *ckp = NULL;
struct f2fs_summary_block *sum = NULL;
u_int32_t blk_size_bytes;
u_int64_t cp_seg_blk_offset = 0;
u_int32_t crc = 0;
int i;
ckp = calloc(F2FS_CP_BLOCK_SIZE, 1);
if (ckp == NULL) {
printf("\n\tError: Calloc Failed for f2fs_checkpoint!!!\n");
return -1;
}
sum = calloc(sizeof(struct f2fs_summary_block), 1);
if (sum == NULL) {
printf("\n\tError: Calloc Failed for summay_node!!!\n");
return -1;
}
// each check point consists of 2 pages (blocks), placed at the head and the tail;
// on recovery, if the two match, the check point is valid. The CP area also holds
// 2 check points of different versions (check point packs), updated alternately
/* 1. cp page 1 of checkpoint pack 1 */
// the current version is set to 1
ckp->checkpoint_ver = 1;
// record the starting segment number of each active log; the concrete values were shown earlier
ckp->cur_node_segno[0] =
cpu_to_le32(f2fs_params.cur_seg[CURSEG_HOT_NODE]);
ckp->cur_node_segno[1] =
cpu_to_le32(f2fs_params.cur_seg[CURSEG_WARM_NODE]);
ckp->cur_node_segno[2] =
cpu_to_le32(f2fs_params.cur_seg[CURSEG_COLD_NODE]);
ckp->cur_data_segno[0] =
cpu_to_le32(f2fs_params.cur_seg[CURSEG_HOT_DATA]);
ckp->cur_data_segno[1] =
cpu_to_le32(f2fs_params.cur_seg[CURSEG_WARM_DATA]);
ckp->cur_data_segno[2] =
cpu_to_le32(f2fs_params.cur_seg[CURSEG_COLD_DATA]);
// the author says up to 16 logs are provided in theory, but only 6 are used by default (3 node + 3 data); just smear the unused ones with 0xff
for (i = 3; i < MAX_ACTIVE_NODE_LOGS; i++) {
ckp->cur_node_segno[i] = 0xffffffff;
ckp->cur_data_segno[i] = 0xffffffff;
}
ckp->cur_node_blkoff[0] = cpu_to_le16(1);
ckp->nat_upd_blkoff[0] = cpu_to_le16(1);
ckp->cur_data_blkoff[0] = cpu_to_le16(1);
// presumably because HOT DATA and HOT NODE each hold exactly one valid block; see the earlier NAT initialization
ckp->valid_block_count = cpu_to_le64(2);
ckp->rsvd_segment_count = cpu_to_le32(f2fs_params.reserved_segments);
// for these fiddly calculations, just read the dump later
ckp->overprov_segment_count = cpu_to_le32(
(le32_to_cpu(super_block.segment_count_main) -
le32_to_cpu(ckp->rsvd_segment_count)) *
f2fs_params.overprovision / 100);
ckp->overprov_segment_count = cpu_to_le32(
le32_to_cpu(ckp->overprov_segment_count) +
le32_to_cpu(ckp->rsvd_segment_count));
/* main segments - reserved segments - (node + data segments) */
// here free_segment_count means the number of free segments in the main area;
// not obvious why 6 is hard-coded... probably the 6 active logs, each occupying 1 segment
ckp->free_segment_count = cpu_to_le32(
le32_to_cpu(super_block.segment_count_main) - 6);
// minus the overprovision reservation
ckp->user_block_count = cpu_to_le64(
((le32_to_cpu(ckp->free_segment_count) + 6 -
le32_to_cpu(ckp->overprov_segment_count)) *
f2fs_params.blks_per_seg));
// the check point uses 5 blocks in total
ckp->cp_pack_total_block_count = cpu_to_le32(5);
ckp->cp_pack_start_sum = cpu_to_le32(1);
ckp->valid_node_count = cpu_to_le32(1);
ckp->valid_inode_count = cpu_to_le32(1);
ckp->next_free_nid = cpu_to_le32(
le32_to_cpu(super_block.root_ino) + 1);
// as seen above, only half of the SIT gets written, hence the /2; same for the NAT below
ckp->sit_ver_bitmap_bytesize = cpu_to_le32(
((le32_to_cpu(super_block.segment_count_sit) / 2) <<
le32_to_cpu(super_block.log_blocks_per_seg)) / 8);
ckp->nat_ver_bitmap_bytesize = cpu_to_le32(
((le32_to_cpu(super_block.segment_count_nat) / 2) <<
le32_to_cpu(super_block.log_blocks_per_seg)) / 8);
ckp->checksum_offset = cpu_to_le32(4092);
crc = f2fs_cal_crc32(F2FS_SUPER_MAGIC, ckp,
le32_to_cpu(ckp->checksum_offset));
*((u_int32_t *)((unsigned char *)ckp +
le32_to_cpu(ckp->checksum_offset))) = crc;
blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
// currently at the start address of the check point area
cp_seg_blk_offset =
le32_to_cpu(super_block.start_segment_checkpoint) * blk_size_bytes;
// write this part out to disk
if (writetodisk(f2fs_params.fd, ckp, cp_seg_blk_offset,
F2FS_CP_BLOCK_SIZE) < 0) {
printf("\n\tError: While writing the ckp to disk!!!\n");
return -1;
}
/* 2. Prepare and write Segment summary for data blocks */
// the summary data structures come into play from here on; feel free to jump ahead to the summary analysis below first
SET_SUM_TYPE((&sum->footer), SUM_TYPE_DATA);
// the first summary entry is the root inode
sum->entries[0].nid = super_block.root_ino;
sum->entries[0].bidx = 0;
// the second block gets the summary
cp_seg_blk_offset += blk_size_bytes;
if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
sizeof(struct f2fs_summary_block)) < 0) {
printf("\n\tError: While writing the sum_blk to disk!!!\n");
return -1;
}
/* 3. Fill segment summary for data block to zero. */
memset(sum, 0, sizeof(struct f2fs_summary_block));
cp_seg_blk_offset += blk_size_bytes;
// the third block gets zeros
if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
sizeof(struct f2fs_summary_block)) < 0) {
printf("\n\tError: While writing the sum_blk to disk!!!\n");
return -1;
}
/* 4. Fill segment summary for data block to zero. */
memset(sum, 0, sizeof(struct f2fs_summary_block));
/* inode sit for root */
// record the SIT journal info, split between node and data, 3 entries each
sum->n_sits = cpu_to_le16(6);
sum->sit_j.entries[0].segno = ckp->cur_node_segno[0];
sum->sit_j.entries[0].se.vblocks = cpu_to_le16((CURSEG_HOT_NODE << 10) | 1);
f2fs_set_bit(0, sum->sit_j.entries[0].se.valid_map);
sum->sit_j.entries[1].segno = ckp->cur_node_segno[1];
sum->sit_j.entries[1].se.vblocks = cpu_to_le16((CURSEG_WARM_NODE << 10));
sum->sit_j.entries[2].segno = ckp->cur_node_segno[2];
sum->sit_j.entries[2].se.vblocks = cpu_to_le16((CURSEG_COLD_NODE << 10));
/* data sit for root */
sum->sit_j.entries[3].segno = ckp->cur_data_segno[0];
sum->sit_j.entries[3].se.vblocks = cpu_to_le16((CURSEG_HOT_DATA << 10) | 1);
f2fs_set_bit(0, sum->sit_j.entries[3].se.valid_map);
sum->sit_j.entries[4].segno = ckp->cur_data_segno[1];
sum->sit_j.entries[4].se.vblocks = cpu_to_le16((CURSEG_WARM_DATA << 10));
sum->sit_j.entries[5].segno = ckp->cur_data_segno[2];
sum->sit_j.entries[5].se.vblocks = cpu_to_le16((CURSEG_COLD_DATA << 10));
cp_seg_blk_offset += blk_size_bytes;
// the fourth block gets the SIT journal
if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
sizeof(struct f2fs_summary_block)) < 0) {
printf("\n\tError: While writing the sum_blk to disk!!!\n");
return -1;
}
/* 5. cp page2 */
cp_seg_blk_offset += blk_size_bytes;
// the fifth block gets the tail copy of the check point
if (writetodisk(f2fs_params.fd, ckp, cp_seg_blk_offset,
F2FS_CP_BLOCK_SIZE) < 0) {
printf("\n\tError: While writing the ckp to disk!!!\n");
return -1;
}
/* 6. cp page 1 of check point pack 2
* Initiatialize other checkpoint pack with version zero
*/
// here only the check point's version number and CRC change
ckp->checkpoint_ver = 0;
crc = f2fs_cal_crc32(F2FS_SUPER_MAGIC, ckp,
le32_to_cpu(ckp->checksum_offset));
*((u_int32_t *)((unsigned char *)ckp +
le32_to_cpu(ckp->checksum_offset))) = crc;
// offset one section from the check point start address,
// i.e. an offset of 512 blocks (one segment)
cp_seg_blk_offset = (le32_to_cpu(super_block.start_segment_checkpoint) +
f2fs_params.blks_per_seg) *
blk_size_bytes;
// block 512 holds check point #0
if (writetodisk(f2fs_params.fd, ckp,
cp_seg_blk_offset, F2FS_CP_BLOCK_SIZE) < 0) {
printf("\n\tError: While writing the ckp to disk!!!\n");
return -1;
}
/* 7. */
memset(sum, 0, sizeof(struct f2fs_summary_block));
SET_SUM_TYPE((&sum->footer), SUM_TYPE_DATA);
cp_seg_blk_offset += blk_size_bytes;
// block 512 + 1 gets zeros
if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
sizeof(struct f2fs_summary_block)) < 0) {
printf("\n\tError: While writing the sum_blk to disk!!!\n");
return -1;
}
/* 8. */
memset(sum, 0, sizeof(struct f2fs_summary_block));
cp_seg_blk_offset += blk_size_bytes;
// block 512 + 2 gets zeros
if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
sizeof(struct f2fs_summary_block)) < 0) {
printf("\n\tError: While writing the sum_blk to disk!!!\n");
return -1;
}
/* 9. */
memset(sum, 0, sizeof(struct f2fs_summary_block));
cp_seg_blk_offset += blk_size_bytes;
// block 512 + 3 gets zeros
if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
sizeof(struct f2fs_summary_block)) < 0) {
printf("\n\tError: While writing the sum_blk to disk!!!\n");
return -1;
}
/* 10. cp page 2 of check point pack 2 */
cp_seg_blk_offset += blk_size_bytes;
// block 512 + 4 gets the tail copy of the check point
if (writetodisk(f2fs_params.fd, ckp, cp_seg_blk_offset,
F2FS_CP_BLOCK_SIZE) < 0) {
printf("\n\tError: While writing the ckp to disk!!!\n");
return -1;
}
free(sum) ;
free(ckp) ;
return 0;
}
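Putting the ten writes together, the check point area ends up laid out as follows (block numbers relative to start_segment_checkpoint, reconstructed from the write sequence above):
pack #1 (ver = 1)                      pack #2 (ver = 0)
blk 0: cp page 1 (CRC at offset 4092)  blk 512: cp page 1 (CRC at offset 4092)
blk 1: data summary (root dentry)      blk 513: zeroed data summary
blk 2: zeroed summary                  blk 514: zeroed summary
blk 3: summary + 6-entry SIT journal   blk 515: zeroed summary
blk 4: cp page 2 (copy of page 1)      blk 516: cp page 2 (copy of page 1)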
Summary block
The summary block looks fairly cryptic at first sight in the flow above, so let's sort out its data structures.
////////////////// the summary block as a whole
struct f2fs_summary_block {
// #define ENTRIES_IN_SUM 512
struct f2fs_summary entries[ENTRIES_IN_SUM];
union {
__le16 n_nats; // number of valid entries in nat_j.entries[] (the array length itself is fixed)
__le16 n_sits; // number of valid entries in sit_j.entries[] (the array length itself is fixed)
};
// size is 4K - sizeof(f2fs_summary)*512 - sizeof(n_nats) - sizeof(summary_footer) = 505 bytes
union {
struct nat_journal nat_j;
struct sit_journal sit_j;
};
struct summary_footer footer;
} __attribute__((packed));
// records the reverse mapping, i.e. from a block back to its parent node
struct f2fs_summary {
__le32 nid; /* parent node id */
union {
__u8 reserved[3];
struct {
__u8 version; /* node version number */
__le16 bidx; /* block index in parent node */
} __attribute__((packed));
};
} __attribute__((packed));
////////////////// the journal containers
// sizeof(struct nat_journal) = 505
// NAT_JOURNAL_ENTRIES = 38
struct nat_journal {
struct nat_journal_entry entries[NAT_JOURNAL_ENTRIES];
__u8 reserved[NAT_JOURNAL_RESERVED];
} __attribute__((packed));
// sizeof(struct sit_journal) = 505
// SIT_JOURNAL_ENTRIES = 6
struct sit_journal {
struct sit_journal_entry entries[SIT_JOURNAL_ENTRIES];
__u8 reserved[SIT_JOURNAL_RESERVED];
} __attribute__((packed));
////////////////// the outer journal entries
struct nat_journal_entry {
__le32 nid;
struct f2fs_nat_entry ne;
} __attribute__((packed));
// sizeof(struct sit_journal_entry) = 78
struct sit_journal_entry {
__le32 segno;
struct f2fs_sit_entry se;
} __attribute__((packed));
////////////////// the actual entries of the two areas
// we already met the NAT triple above
struct f2fs_nat_entry {
__u8 version;
__le32 ino;
__le32 block_addr;
} __attribute__((packed));
// the SIT had no elaborate initialization flow, which is why this one looks unfamiliar
struct f2fs_sit_entry {
__le16 vblocks; // "valid blocks"; the low 10 bits are the count, the high bits carry the segment type (cf. the <<10 above)
__u8 valid_map[SIT_VBLOCK_MAP_SIZE];
__le64 mtime;
} __attribute__((packed));
////////////////// the footer odds and ends
// the footer sits at the tail of the summary block, distinguishing the summary type and holding a checksum for consistency checks
struct summary_footer {
unsigned char entry_type;
__u32 check_sum;
} __attribute__((packed));
The structures are numerous and not exactly pretty, but none of them is hard to understand:
- f2fs_summary_block: the block-granularity on-disk container for summary entries, with the leftover space holding a journal
- f2fs_summary: the reverse mapping, recording the owning parent node and the block index within that parent
- nat_journal/sit_journal: the journal containers, holding a number of journal entries
- nat_journal_entry/sit_journal_entry: a journal entry, i.e. an id plus the concrete area entry
- f2fs_nat_entry/f2fs_sit_entry: the actual entries the check point needs to record, one structure per area
- summary_footer: stored at the tail of the summary block, distinguishing the journal type and holding a checksum
/* A superb explanatory diagram was drawn here, but you need to pay 998,244,353 CNY to unlock it */
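As a free substitute, here is the byte layout implied by the sizes above for one 4 KiB summary block:
offset 0     entries[512]            512 * 7 = 3584 bytes
offset 3584  n_nats / n_sits         2 bytes
offset 3586  nat_j or sit_j journal  505 bytes (38 NAT or 6 SIT entries + reserved)
offset 4091  footer                  1-byte entry_type + 4-byte check_sum
total        4096 bytes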
Auxiliary debug info
The debug output below is from a breakpoint placed right before the free() calls:
(gdb) b f2fs_format.c:956
(gdb) info locals
ckp = 0x4098c0
sum = 0x40a8d0
blk_size_bytes = 4096
cp_seg_blk_offset = 4210688
crc = 1257347361
i = 8
(gdb) p *ckp
$6 = {
checkpoint_ver = 0,
user_block_count = 220672,
valid_block_count = 2,
rsvd_segment_count = 25,
overprov_segment_count = 47,
free_segment_count = 472,
cur_node_segno = {476, 475, 474, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295},
cur_node_blkoff = {1, 0, 0, 0, 0, 0, 0, 0},
nat_upd_blkoff = {1, 0, 0, 0, 0, 0, 0, 0},
cur_data_segno = {473, 1, 0, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295},
cur_data_blkoff = {1, 0, 0, 0, 0, 0, 0, 0},
ckpt_flags = 0,
cp_pack_total_block_count = 5,
cp_pack_start_sum = 1,
valid_node_count = 1,
valid_inode_count = 1,
next_free_nid = 4,
sit_ver_bitmap_bytesize = 64,
nat_ver_bitmap_bytesize = 128,
checksum_offset = 4092,
elapsed_time = 0,
alloc_type = '\000' <repeats 15 times>,
sit_nat_version_bitmap = ""
}
Algorithm notes
The flow above plus the summary block records are not enough to grasp the full check point picture, since only the initialization path is visible here. What we can see is that check points exist in duplicate: the CP area spans 2 sections (2 segments here), each holding one check point pack, and each pack carries 2 copies of the check point block (head and tail).
A basic algorithmic idea follows: if check point #1 was written last time, write check point #0 next time, which helps with consistency checking. The so-called recovery process also loses some of its mystery: although the complete algorithm cannot be known from here, all that happens at format time is recording the metadata recovery will need, nothing more.
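A hedged sketch of what mount-time check point selection could look like under this scheme. This is purely illustrative, not the kernel's actual code; the cp_pack struct and pick_checkpoint() are inventions for this sketch:
#include <stddef.h>
#include <stdint.h>

/* minimal model of one check point pack: its version, plus whether its head
 * and tail cp pages match and its CRC verifies */
struct cp_pack {
	uint64_t ver;
	int head_tail_match;
	int crc_ok;
};

/* pick the newer of the two valid packs; NULL means no consistent check point */
static const struct cp_pack *pick_checkpoint(const struct cp_pack *a,
					     const struct cp_pack *b)
{
	int va = a->head_tail_match && a->crc_ok;
	int vb = b->head_tail_match && b->crc_ok;

	if (va && vb)
		return a->ver >= b->ver ? a : b;
	return va ? a : (vb ? b : NULL);
}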
THE END
We now have a rough picture of a freshly initialized F2FS. The next article will explore the F2FS runtime; that's it for now.
As per tradition, closing with a picture unrelated to the content.