
F2FS: Understanding a File System Implementation via the mkfs.f2fs Source

Preface

The previous article summarized F2FS at a high level by reading the paper; this article goes a step further and studies F2FS through source-code analysis. On one hand, we can get started by examining F2FS's early commits; on the other, we can use the mkfs.f2fs tool from the same period to understand a file system's initial state. This code is certainly less stable than what exists today, but it makes the features described in the paper easier to understand. This article first looks at the on-disk layout of a freshly formatted F2FS (i.e., its initial state), working through the details by debugging the source of the userspace mkfs.f2fs tool rather than the kernel.

Overview: On-disk layout

[Figure: the F2FS on-disk layout, divided into six areas]

As mentioned in the previous article, the F2FS layout is divided into 6 areas, as shown in the figure above. So what does its initial state look like? We can find out by debugging mkfs.f2fs and inspecting the concrete data structures. Here is some preparation before getting hands-on:

git clone https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
cd f2fs-tools

# Find a commit close in time to the kernel code mentioned above and check it out
git checkout -b jintianxiaomidaobilema 036d45e551ca5405c726f8ccb51f446620cd4af4

cd mkfs

# Create a 1GB image for debugging
dd if=/dev/zero of=f2fs.img bs=1024 count=1000000

# Build mkfs.f2fs; -g3 makes macro expansion available while debugging
gcc f2fs_format.c -g3 -o mkfs.f2fs

The F2FS data structures "on paper" can be read directly here; the author's comments are extremely detailed.

First set a breakpoint just before the main initialization routine f2fs_format_device() returns, run with the startup arguments, then print the f2fs_params variable. This variable holds the parsed options used before initialization, so we can also see what the default behavior looks like.

(gdb) b f2fs_format.c:1273
(gdb) r f2fs.img
(gdb) set print pretty on
(gdb) p f2fs_params
$1 = {
  sector_size = 512,
  reserved_segments = 25,
  overprovision = 5,
  cur_seg = {473, 1, 0, 476, 475, 474},
  segs_per_sec = 1,
  secs_per_zone = 1,
  start_sector = 0,
  total_sectors = 2000000,
  sectors_per_blk = 8,
  blks_per_seg = 512,
  vol_label = "F2FS", '\000' <repeats 11 times>,
  heap = 1,
  fd = 3,
  device_name = 0x7fffffffe137 "f2fs.img",
  extension_list = 0x0
}

A few observations:

  • overprovision is the percentage of space reserved for the garbage collector and is not exposed to users. reserved_segments is derived from overprovision
  • heap-based allocation is enabled by default (heap = 1); what it actually affects is how cur_seg is assigned. As the documentation quoted below puts it, data segments are allocated from the beginning of the main area and node segments from the end (in the dump above, with cur_seg indexed as HOT/WARM/COLD_DATA followed by HOT/WARM/COLD_NODE, the node logs take 476/475/474 at the tail, warm/cold data take 1/0 at the head, and hot data lands at 473 right next to the node logs)
  • fd is only used by writetodisk() for the final writes to the device/image file; it is not stored in the file system itself
" overprovision-ratio-percentage"
Specify the percentage over the volume size for overprovision area. This area
is hidden to users, and utilized by F2FS cleaner. The default percentage is 5%.


" heap-based-allocation"
Specify 1 or 0 to enable/disable heap based block allocation policy.
If the value is equal to 1, each of active log areas are initially
assigned separately according to the whole volume size.
The default value is 1.


... heap-style segment allocation which finds free segments for data
from the beginning of main area, while for node from the end of main
area.
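Before moving on, the basic geometry can be derived from the dumped defaults (a quick sanity check of my own, not output of the tool):

  • one block = \(8 \times 512 = 4096\) bytes (sectors_per_blk × sector_size)
  • one segment = \(512 \times 4096\) bytes = 2 MiB (blks_per_seg × block size)
  • the whole image = \(2000000 \times 512\) bytes ≈ 977 MiB, i.e., \(2000000 / 8 = 250000\) blocks, which matches block_count = 250000 in the super block dump of the next section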

Super block initialization

The data structure in f2fs.h is shown below

/*
 * For superblock
 */
struct f2fs_super_block {
        __le32 magic;                   /* Magic Number */
        __le16 major_ver;               /* Major Version */
        __le16 minor_ver;               /* Minor Version */
        __le32 log_sectorsize;          /* log2 sector size in bytes */
        __le32 log_sectors_per_block;   /* log2 # of sectors per block */
        __le32 log_blocksize;           /* log2 block size in bytes */
        __le32 log_blocks_per_seg;      /* log2 # of blocks per segment */
        __le32 segs_per_sec;            /* # of segments per section */
        __le32 secs_per_zone;           /* # of sections per zone */
        __le32 checksum_offset;         /* checksum offset inside super block */
        __le64 block_count;             /* total # of user blocks */
        __le32 section_count;           /* total # of sections */
        __le32 segment_count;           /* total # of segments */
        __le32 segment_count_ckpt;      /* # of segments for checkpoint */
        __le32 segment_count_sit;       /* # of segments for SIT */
        __le32 segment_count_nat;       /* # of segments for NAT */
        __le32 segment_count_ssa;       /* # of segments for SSA */
        __le32 segment_count_main;      /* # of segments for main area */
        __le32 segment0_blkaddr;        /* start block address of segment 0 */
        __le32 cp_blkaddr;              /* start block address of checkpoint */
        __le32 sit_blkaddr;             /* start block address of SIT */
        __le32 nat_blkaddr;             /* start block address of NAT */
        __le32 ssa_blkaddr;             /* start block address of SSA */
        __le32 main_blkaddr;            /* start block address of main area */
        __le32 root_ino;                /* root inode number */
        __le32 node_ino;                /* node inode number */
        __le32 meta_ino;                /* meta inode number */
        __u8 uuid[16];                  /* 128-bit uuid for volume */
        __le16 volume_name[512];        /* volume name */
        __le32 extension_count;         /* # of extensions below */
        __u8 extension_list[F2FS_MAX_EXTENSION][8];     /* extension array */
} __packed;

Again, use gdb to dump the initialized state, as shown below

(gdb) p super_block
$2 = {
  magic = 4076150800,
  major_ver = 1,
  minor_ver = 0,
  log_sectorsize = 9,
  log_sectors_per_block = 3,
  log_blocksize = 12,
  log_blocks_per_seg = 9,
  segs_per_sec = 1,
  secs_per_zone = 1,
  checksum_offset = 0,
  block_count = 250000,
  section_count = 478,
  segment_count = 487,
  segment_count_ckpt = 2,
  segment_count_sit = 2,
  segment_count_nat = 4,
  segment_count_ssa = 1,
  segment_count_main = 478,
  failure_safe_block_distance = 0,
  segment0_blkaddr = 512,
  start_segment_checkpoint = 512,
  sit_blkaddr = 1536,
  nat_blkaddr = 2560,
  ssa_blkaddr = 4608,
  main_blkaddr = 5120,
  root_ino = 3,
  node_ino = 1,
  meta_ino = 2,
  volume_serial_number = 0,
  volume_name = {70, 50, 70, 83, 0 <repeats 508 times>},
  extension_count = 22,
  extension_list = {"jpg\000\000\000\000", "gif\000\000\000\000", "png\000\000\000\000", "avi\000\000\000\000", "divx\000\000\000", "mp4\000\000\000\000", "mp3\000\000\000\000",
    "3gp\000\000\000\000", "wmv\000\000\000\000", "wma\000\000\000\000", "mpeg\000\000\000", "mkv\000\000\000\000", "mov\000\000\000\000", "asx\000\000\000\000", "asf\000\000\000\000",
    "wmx\000\000\000\000", "svi\000\000\000\000", "wvx\000\000\000\000", "wm\000\000\000\000\000", "mpg\000\000\000\000", "mpe\000\000\000\000", "rm\000\000\000\000\000",
    "ogg\000\000\000\000", "\000\000\000\000\000\000\000" <repeats 41 times>}
}

The super block part is quite easy to understand: besides the basic block-device sector/block configuration and the segment counts, it also reflects the default 1:1:1 segment-section-zone hierarchy and the start block address of each area (*_blkaddr). Note that the tool's struct at this commit uses slightly older field names than the f2fs.h quoted above, e.g. start_segment_checkpoint in place of cp_blkaddr.
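As a cross-check (my own arithmetic, in units of 4 KB blocks with 512 blocks per segment), the start addresses chain together exactly as the segment counts suggest:

  • sit_blkaddr = \(512 + 2 \times 512 = 1536\): the checkpoint area starts at block 512 (the first 512 blocks are reserved for the super block) and spans 2 segments
  • nat_blkaddr = \(1536 + 2 \times 512 = 2560\), after the 2 SIT segments
  • ssa_blkaddr = \(2560 + 4 \times 512 = 4608\), after the 4 NAT segments
  • main_blkaddr = \(4608 + 1 \times 512 = 5120\), after the single SSA segment
  • main area = \(\lfloor (250000 - 5120) / 512 \rfloor = 478\) segments, matching segment_count_main and section_count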

SIT initialization

/**
 * @brief       It initialize SIT Data structure
 * @param       None
 * @return      0 if success
 */
static int8_t f2fs_init_sit_area(void)
{
        u_int32_t blk_size_bytes;
        u_int32_t seg_size_bytes;
        u_int32_t index = 0;
        u_int64_t sit_seg_blk_offset = 0;
        u_int8_t *zero_buf = NULL;

        blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
        seg_size_bytes = (1 << le32_to_cpu(super_block.log_blocks_per_seg)) *
                                blk_size_bytes;

        zero_buf = calloc(sizeof(u_int8_t), seg_size_bytes);
        if(zero_buf == NULL) {
                printf("\n\tError: Calloc Failed for sit_zero_buf!!!\n");
                return -1;
        }

        sit_seg_blk_offset = le32_to_cpu(super_block.sit_blkaddr) *
                                                blk_size_bytes;

        for (index = 0;
                index < (le32_to_cpu(super_block.segment_count_sit) / 2);
                                                                index++) {
                if (writetodisk(f2fs_params.fd, zero_buf, sit_seg_blk_offset,
                                        seg_size_bytes) < 0) {
                        printf("\n\tError: While zeroing out the sit area \
                                        on disk!!!\n");
                        return -1;
                }
                sit_seg_blk_offset = sit_seg_blk_offset + seg_size_bytes;
        }

        free(zero_buf);
        return 0 ;
}

/**
 * @brief       It writes buffer to disk or storage meant to be formatted
 *              with F2FS.
 * @param       fd File descriptor for device
 * @param       buf buffer to be written
 * @param       offset where to bw written on the device
 * @param       length length of the device
 * @return      0 if success
 */
static int writetodisk(int32_t fd, void *buf, u_int64_t offset, size_t length)
{
        if (lseek64(fd, offset, SEEK_SET) < 0) {
                printf("\n\tError: While lseek to the derised location!!!\n");
                return -1;
        }

        if (write(fd, buf, length) < 0) {
                printf("\n\tError: While writing data to the disk!!! Error Num : \
                                %d\n", errno);
                return -1;
        }

        return 0;
}

The SIT initialization function f2fs_init_sit_area() does not modify any in-memory data structure; at its core it just calls writetodisk() to write into the on-disk SIT area.

From the earlier dump we can see super_block.segment_count_sit = 2, i.e., the SIT area actually occupies 2 segments, but only half of it (1 segment) is written to the image here, and the content written is all zeros (zero_buf).
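As a quick sanity check we can read the image back and confirm the zeroing. The following standalone sketch is my own (not part of f2fs-tools) and hard-codes numbers from the dumps above: a 4096-byte block size and sit_blkaddr = 1536.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        unsigned char buf[4096];
        size_t i;
        int fd;

        fd = open("f2fs.img", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* sit_blkaddr = 1536 from the super block dump, block size 4096 */
        if (pread(fd, buf, sizeof(buf), 1536ULL * 4096) != (ssize_t)sizeof(buf)) {
                perror("pread");
                return 1;
        }
        for (i = 0; i < sizeof(buf); i++) {
                if (buf[i]) {
                        printf("non-zero byte at offset %zu\n", i);
                        return 1;
                }
        }
        printf("the first SIT block is all zeros\n");
        close(fd);
        return 0;
}

Compile it with gcc and run it next to the freshly formatted f2fs.img; it should report that the block is all zeros.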

NAT initialization

At this point the NAT initialization behaves just like the SIT's; the difference is that the NAT occupies 4 segments. The rest is omitted.

Note: in the root directory initialization stage that follows, the NAT is updated further for the root inode.

Root directory initialization

The flow then executes f2fs_create_root_dir(), which in short takes 3 steps:

  1. f2fs_write_root_inode(): create the root inode in the main area
  2. f2fs_update_nat_root(): continue initializing the NAT, i.e., record the root inode's address mapping there
  3. f2fs_add_default_dentry_root(): fill in the default dentries (. and ..) for the root inode

Let's walk through this code below

/**
 * @brief       It initializes and writes root inode on device.
 * @param       None
 * @return      0 if success
 */
static int8_t f2fs_write_root_inode(void)
{
        struct f2fs_node *raw_node = NULL;
        u_int32_t blk_size_bytes;
        u_int64_t data_blk_nor;
        u_int64_t main_area_node_seg_blk_offset = 0;

        raw_node = calloc(sizeof(struct f2fs_node), 1);
        if (raw_node == NULL) {
                printf("\n\tError: Calloc Failed for raw_node!!!\n");
                return -1;
        }

        // set the root inode number
        raw_node->footer.nid = super_block.root_ino;
        raw_node->footer.ino = super_block.root_ino;
        raw_node->footer.cp_ver = cpu_to_le64(1);
        // point to the next block
        raw_node->footer.next_blkaddr = cpu_to_le32(
                        le32_to_cpu(super_block.main_blkaddr) +
                        f2fs_params.cur_seg[CURSEG_HOT_NODE] *
                        f2fs_params.blks_per_seg + 1);

        raw_node->i.i_mode = cpu_to_le16(0x41ed);
        raw_node->i.i_links = cpu_to_le32(2);
        raw_node->i.i_uid = cpu_to_le32(getuid());
        raw_node->i.i_gid = cpu_to_le32(getgid());

        blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
        // the file size in different units (bytes / blocks)
        raw_node->i.i_size = cpu_to_le64(1 * blk_size_bytes); /* dentry */
        raw_node->i.i_blocks = cpu_to_le64(2);

        raw_node->i.i_ctime = cpu_to_le32(time(NULL));
        raw_node->i.i_ctime_nsec = 0;
        raw_node->i.i_mtime = cpu_to_le32(time(NULL));
        raw_node->i.i_mtime_nsec = 0;
        raw_node->i.i_xattr_nid = 0;
        raw_node->i.i_flags = 0;
        raw_node->i.current_depth = cpu_to_le32(1);

        // location of HOT DATA
        data_blk_nor = le32_to_cpu(super_block.main_blkaddr) +
                f2fs_params.cur_seg[CURSEG_HOT_DATA] * f2fs_params.blks_per_seg;
        // i_addr[ADDRS_PER_INODE=927] are pointers to data blocks
        raw_node->i.i_addr[0] = cpu_to_le32(data_blk_nor);

        raw_node->i.i_ext.fofs = 0;
        raw_node->i.i_ext.blk_addr = cpu_to_le32(data_blk_nor);
        raw_node->i.i_ext.len = cpu_to_le32(1);

        // first get the start address of the main area
        main_area_node_seg_blk_offset = le32_to_cpu(super_block.main_blkaddr);
        // location of HOT NODE
        main_area_node_seg_blk_offset += f2fs_params.cur_seg[CURSEG_HOT_NODE] *
                                        f2fs_params.blks_per_seg;
        main_area_node_seg_blk_offset *= blk_size_bytes;

        // write the root inode at the HOT NODE offset
        if (writetodisk(f2fs_params.fd, raw_node, main_area_node_seg_blk_offset,
                                sizeof(struct f2fs_node)) < 0) {
                printf("\n\tError: While writing the raw_node to disk!!!\n");
                return -1;
        }

        memset(raw_node, 0xff, sizeof(struct f2fs_node));

        // fill the following block entirely with 0xff
        if (writetodisk(f2fs_params.fd, raw_node,
                                main_area_node_seg_blk_offset + 4096,
                                sizeof(struct f2fs_node)) < 0) {
                printf("\n\tError: While writing the raw_node to disk!!!\n");
                return -1;
        }
        free(raw_node);
        return 0;
}

/**
 * @brief       It updates NAT for root Inode
 * @param       None
 * @return      0 if success
 */
static int8_t f2fs_update_nat_root(void)
{
        // a block-sized structure holding 4k / sizeof(f2fs_nat_entry) f2fs_nat_entry items
        // each entry is a triple <version, ino, block_addr>, which is easy to read: an ino-to-address mapping entry
        struct f2fs_nat_block *nat_blk = NULL;
        u_int32_t blk_size_bytes;
        u_int64_t nat_seg_blk_offset = 0;

        nat_blk = calloc(sizeof(struct f2fs_nat_block), 1);
        if(nat_blk == NULL) {
                printf("\n\tError: Calloc Failed for nat_blk!!!\n");
                return -1;
        }

        /* update root */
        // map the root inode's ino to its address, which we already know from the analysis above
        nat_blk->entries[super_block.root_ino].block_addr = cpu_to_le32(
                le32_to_cpu(super_block.main_blkaddr) +
                f2fs_params.cur_seg[CURSEG_HOT_NODE] * f2fs_params.blks_per_seg);
        nat_blk->entries[super_block.root_ino].ino = super_block.root_ino;

        // the node inode (ino=1) and the meta inode (ino=2) remain

        /* update node nat */
        // not sure why this is 1
        nat_blk->entries[super_block.node_ino].block_addr = cpu_to_le32(1);
        nat_blk->entries[super_block.node_ino].ino = super_block.node_ino;

        /* update meta nat */
        nat_blk->entries[super_block.meta_ino].block_addr = cpu_to_le32(1);
        nat_blk->entries[super_block.meta_ino].ino = super_block.meta_ino;

        blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);

        // locate and then write to the NAT area
        nat_seg_blk_offset = le32_to_cpu(super_block.nat_blkaddr) *
                                                        blk_size_bytes;

        if (writetodisk(f2fs_params.fd, nat_blk, nat_seg_blk_offset,
                                sizeof(struct f2fs_nat_block)) < 0) {
                printf("\n\tError: While writing the nat_blk set0 to disk!!!\n");
                return -1;
        }

        free(nat_blk);
        return 0;
}

/**
 * @brief       It updates default dentries in Root Inode
 * @param       None
 * @return      0 if success
 */
static int8_t f2fs_add_default_dentry_root(void)
{
        // some background on the dentry block: roughly a bitmap plus a number of dentries; see the bonus section below for details
        struct f2fs_dentry_block *dent_blk = NULL;
        u_int32_t blk_size_bytes;
        u_int64_t data_blk_offset = 0;

        dent_blk = calloc(sizeof(struct f2fs_dentry_block), 1);
        if(dent_blk == NULL) {
                printf("\n\tError: Calloc Failed for dent_blk!!!\n");
                return -1;
        }

        // fill in the dentries related to the root inode (. and ..)
        dent_blk->dentry[0].hash_code = 0;
        dent_blk->dentry[0].ino = super_block.root_ino;
        dent_blk->dentry[0].name_len = cpu_to_le16(1);
        dent_blk->dentry[0].file_type = F2FS_FT_DIR;
        memcpy(dent_blk->filename[0], ".", 1);

        dent_blk->dentry[1].hash_code = 0;
        dent_blk->dentry[1].ino = super_block.root_ino;
        dent_blk->dentry[1].name_len = cpu_to_le16(2);
        dent_blk->dentry[1].file_type = F2FS_FT_DIR;
        memcpy(dent_blk->filename[1], "..", 2);

        /* bitmap for . and .. */
        // hand-written bitmap: bits 0 and 1 are now in use
        dent_blk->dentry_bitmap[0] = (1 << 1) | (1 << 0);
        blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
        data_blk_offset = (le32_to_cpu(super_block.main_blkaddr) +
                        f2fs_params.cur_seg[CURSEG_HOT_DATA] *
                        f2fs_params.blks_per_seg) * blk_size_bytes;

        // write to HOT DATA
        if (writetodisk(f2fs_params.fd, dent_blk, data_blk_offset,
                                sizeof(struct f2fs_dentry_block)) < 0) {
                printf("\n\tError: While writing the dentry_blk to disk!!!\n");
                return -1;
        }

        free(dent_blk);
        return 0;
}

f2fs_write_root_inode() handles much the same things as a VFS-layer inode. Some F2FS-specific handling includes:

  • the root inode is written into the HOT NODE log (the current HOT NODE segment of the main area)
  • the root inode's i_addr[0], a data pointer, points into the HOT DATA log

The root-inode output is shown below (note that i forms a union with dn/in, so the output of the latter two is meaningless here); incidentally, i_mode = 16877 is 0x41ED, i.e., S_IFDIR | 0755 (octal 040755):

(gdb) b f2fs_format.c:1051

(gdb) info locals
raw_node = 0x4098c0
blk_size_bytes = 4096
data_blk_nor = 247296
main_area_node_seg_blk_offset = 1019215872

# To keep the output from getting too long, pretty print is temporarily turned off here
(gdb) p *raw_node
$3 = { {i = {i_mode = 16877, i_advise = 0 '\000', i_reserved = 0 '\000', i_uid = 1000,
      i_gid = 1000, i_links = 2, i_size = 4096, i_blocks = 2, i_ctime = 1696424848,
      i_mtime = 1696424848, i_ctime_nsec = 0, i_mtime_nsec = 0, current_depth = 1,
      i_xattr_nid = 0, i_flags = 0, i_pino = 0, i_namelen = 0,
      i_name = '\000' <repeats 255 times>, i_ext = {fofs = 0, blk_addr = 247296, len = 1},
      i_addr = {247296, 0 <repeats 926 times>}, i_nid = {0, 0, 0, 0, 0}}, dn = {addr = {16877,
        1000, 1000, 2, 4096, 0, 2, 0, 1696424848, 0, 1696424848, 0, 0, 0, 1,
        0 <repeats 69 times>, 247296, 1, 247296, 0 <repeats 931 times>}}, in = {nid = {16877,
        1000, 1000, 2, 4096, 0, 2, 0, 1696424848, 0, 1696424848, 0, 0, 0, 1,
        0 <repeats 69 times>, 247296, 1, 247296, 0 <repeats 931 times>}}}, footer = {nid = 3,
    ino = 3, flag = 0, cp_ver = 1, next_blkaddr = 248833}}
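These numbers line up with the super block and f2fs_params dumps from earlier (my own cross-check):

  • data_blk_nor = \(5120 + 473 \times 512 = 247296\), i.e., main_blkaddr + cur_seg[CURSEG_HOT_DATA] × blks_per_seg, which is exactly i_addr[0] and i_ext.blk_addr
  • the HOT NODE block is \(5120 + 476 \times 512 = 248832\), so the byte offset is \(248832 \times 4096 = 1019215872\), matching main_area_node_seg_blk_offset
  • footer.next_blkaddr = \(248832 + 1 = 248833\), the block right after the root inode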

f2fs_update_nat_root() picks up where the earlier NAT initialization left off: the NAT still has to record the mapping for the root inode. Concretely, one block represents a batch of mapping entries, the ino-to-address value is filled in, and the block is written to the NAT area.

f2fs_add_default_dentry_root() supplies the dentry information the root inode needs: it creates the . and .. entries, sets the bitmap bits, and writes the block into HOT DATA

Bonus: some notes on dentries

/* 4KB-sized directory entry block */
struct f2fs_dentry_block {
        /* validity bitmap for directory entries in each block */
        __u8 dentry_bitmap[SIZE_OF_DENTRY_BITMAP];
        __u8 reserved[SIZE_OF_RESERVED];
        struct f2fs_dir_entry dentry[NR_DENTRY_IN_BLOCK];
        __u8 filename[NR_DENTRY_IN_BLOCK][F2FS_NAME_LEN];
} __packed;

/* One directory entry slot representing F2FS_NAME_LEN-sized file name */
struct f2fs_dir_entry {
        __le32 hash_code;     /* hash code of file name */
        __le32 ino;           /* inode number */
        __le16 name_len;      /* lengh of file name */
        __u8 file_type;               /* file type */
} __packed;

The dentry fields are quite intuitive; locating a file comes down to information like the file type and the name hash

A dentry block consists of:

  • a bitmap for fast lookups and updates
  • some reserved bytes that are presumably unused
  • a dentry array used to locate files
  • a filename string pool that stores the names

Let's dig a little deeper into the concrete layout of a dentry block

/* the number of dentry in a block */
#define NR_DENTRY_IN_BLOCK      214
#define F2FS_NAME_LEN           8       /* 256 Unicode */

#define SIZE_OF_DIR_ENTRY       11      /* by byte */
#define SIZE_OF_DENTRY_BITMAP   ((NR_DENTRY_IN_BLOCK + BITS_PER_BYTE - 1) / \
                                        BITS_PER_BYTE)
#define SIZE_OF_RESERVED        (PAGE_SIZE - ((SIZE_OF_DIR_ENTRY + \
                                F2FS_NAME_LEN) * \
                                NR_DENTRY_IN_BLOCK + SIZE_OF_DENTRY_BITMAP))

// this is a bit dizzying to read, so let's just print it
// Note: with -g3 the sizes can be obtained after compilation
(gdb) macro expand SIZE_OF_DENTRY_BITMAP
expands to: ((214 + 8 - 1) / 8)
(gdb) p SIZE_OF_DENTRY_BITMAP
$4 = 27

(gdb) macro expand SIZE_OF_RESERVED
expands to: (4096 - ((11 + 8) * 214 + ((214 + 8 - 1) / 8)))
(gdb) p SIZE_OF_RESERVED
$5 = 3

In other words, inside one dentry block:

  • it stores 214 dentries, taking \(214 \times 11 = 2354\) bytes
  • the filename part takes \(214 \times 8 = 1712\) bytes
  • the bitmap, rounded up, takes \(\lceil \frac{214}{8} \rceil = 27\) bytes
  • the remainder is what is left of a 4K block: \(4096 - 2354 - 1712 - 27 = 3\) bytes

So the value 214 is what maximizes dentry block utilization: one more dentry would need another \(11 + 8 = 19\) bytes plus one bitmap bit, while only 3 reserved bytes are left
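To make the cooperation between the bitmap, the dentry array and the filename pool concrete, here is a hedged sketch of a lookup inside a single dentry block. It is not the kernel's lookup code: it assumes a little-endian host, ignores names longer than one 8-byte slot, and test_bit()/lookup_in_block() are names made up for illustration.

#include <stdint.h>
#include <string.h>

#define NR_DENTRY_IN_BLOCK      214
#define F2FS_NAME_LEN           8
#define SIZE_OF_DENTRY_BITMAP   27
#define SIZE_OF_RESERVED        3

struct f2fs_dir_entry {
        uint32_t hash_code;
        uint32_t ino;
        uint16_t name_len;
        uint8_t file_type;
} __attribute__((packed));

struct f2fs_dentry_block {
        uint8_t dentry_bitmap[SIZE_OF_DENTRY_BITMAP];
        uint8_t reserved[SIZE_OF_RESERVED];
        struct f2fs_dir_entry dentry[NR_DENTRY_IN_BLOCK];
        uint8_t filename[NR_DENTRY_IN_BLOCK][F2FS_NAME_LEN];
} __attribute__((packed));

/* LSB-first bit order, matching dentry_bitmap[0] = (1 << 1) | (1 << 0)
 * written by mkfs above. */
static int test_bit(const uint8_t *map, int nr)
{
        return (map[nr >> 3] >> (nr & 7)) & 1;
}

/* Return the inode number for `name`, or 0 if it is not in this block.
 * `hash` must be the same hash the creator stored (mkfs stores 0 for
 * "." and ".."). */
static uint32_t lookup_in_block(const struct f2fs_dentry_block *blk,
                                const char *name, uint32_t hash)
{
        size_t len = strlen(name);
        int i;

        for (i = 0; i < NR_DENTRY_IN_BLOCK; i++) {
                const struct f2fs_dir_entry *de = &blk->dentry[i];

                if (!test_bit(blk->dentry_bitmap, i))
                        continue;       /* slot not in use */
                if (de->hash_code != hash || de->name_len != len)
                        continue;       /* cheap checks first */
                if (memcmp(blk->filename[i], name, len) == 0)
                        return de->ino;
        }
        return 0;
}

Against the root dentry block written by f2fs_add_default_dentry_root() above, lookup_in_block(blk, ".", 0) would return 3, the root inode number.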

Check point initialization

Flow

The check point initialization is also a multi-step flow; it involves not only the check point itself but also the summary block data structure

/**
 * @brief       It writes check poiint pack on Check point Area
 * @param       None
 * @return      0 if succes
 */
static int8_t f2fs_write_check_point_pack(void)
{
        struct f2fs_checkpoint *ckp = NULL;
        struct f2fs_summary_block *sum = NULL;
        u_int32_t blk_size_bytes;
        u_int64_t cp_seg_blk_offset = 0;
        u_int32_t crc = 0;
        int i;

        ckp = calloc(F2FS_CP_BLOCK_SIZE, 1);
        if (ckp == NULL) {
                printf("\n\tError: Calloc Failed for f2fs_checkpoint!!!\n");
                return -1;
        }

        sum = calloc(sizeof(struct f2fs_summary_block), 1);
        if (sum == NULL) {
                printf("\n\tError: Calloc Failed for summay_node!!!\n");
                return -1;
        }

        // each check point consists of 2 pages (blocks), one at the head and one at the tail of the pack; on recovery, matching copies mean the pack is valid
        // the CP area also holds 2 check points of different versions (check point packs), which are updated alternately

        /* 1. cp page 1 of checkpoint pack 1 */
        // set the current version to 1
        ckp->checkpoint_ver = 1;
        // record the starting segment number of each current log; the concrete values were shown earlier
        ckp->cur_node_segno[0] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_HOT_NODE]);
        ckp->cur_node_segno[1] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_WARM_NODE]);
        ckp->cur_node_segno[2] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_COLD_NODE]);
        ckp->cur_data_segno[0] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_HOT_DATA]);
        ckp->cur_data_segno[1] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_WARM_DATA]);
        ckp->cur_data_segno[2] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_COLD_DATA]);
        // the author says 16 logs are provided in theory, but by default only 6 are used (3 for node, 3 for data); the unused ones are simply filled with 0xff
        for (i = 3; i < MAX_ACTIVE_NODE_LOGS; i++) {
                ckp->cur_node_segno[i] = 0xffffffff;
                ckp->cur_data_segno[i] = 0xffffffff;
        }

        ckp->cur_node_blkoff[0] = cpu_to_le16(1);
        ckp->nat_upd_blkoff[0] = cpu_to_le16(1);
        ckp->cur_data_blkoff[0] = cpu_to_le16(1);
        // presumably only HOT DATA and HOT NODE hold one valid block each; see the NAT initialization above
        ckp->valid_block_count = cpu_to_le64(2);
        ckp->rsvd_segment_count = cpu_to_le32(f2fs_params.reserved_segments);
        // for these fiddly calculations, see the dump later on
        ckp->overprov_segment_count = cpu_to_le32(
                        (le32_to_cpu(super_block.segment_count_main) -
                        le32_to_cpu(ckp->rsvd_segment_count)) *
                        f2fs_params.overprovision / 100);
        ckp->overprov_segment_count = cpu_to_le32(
                        le32_to_cpu(ckp->overprov_segment_count) +
                        le32_to_cpu(ckp->rsvd_segment_count));

        /* main segments - reserved segments - (node + data segments) */
        // free_segment_count here means the number of free segments in the main area
        // not sure why 6 is hard-coded... presumably because the 6 logs each occupy 1 segment?
        ckp->free_segment_count = cpu_to_le32(
                        le32_to_cpu(super_block.segment_count_main) - 6);
        // subtract the overprovision reservation
        ckp->user_block_count = cpu_to_le64(
                        ((le32_to_cpu(ckp->free_segment_count) + 6 -
                        le32_to_cpu(ckp->overprov_segment_count)) *
                         f2fs_params.blks_per_seg));
        // the checkpoint pack uses 5 blocks in total
        ckp->cp_pack_total_block_count = cpu_to_le32(5);
        ckp->cp_pack_start_sum = cpu_to_le32(1);
        ckp->valid_node_count = cpu_to_le32(1);
        ckp->valid_inode_count = cpu_to_le32(1);
        ckp->next_free_nid = cpu_to_le32(
                        le32_to_cpu(super_block.root_ino) + 1);

        // as seen earlier only half of the SIT is written, hence the /2; same for the NAT below
        ckp->sit_ver_bitmap_bytesize = cpu_to_le32(
                        ((le32_to_cpu(super_block.segment_count_sit) / 2) <<
                         le32_to_cpu(super_block.log_blocks_per_seg)) / 8);

        ckp->nat_ver_bitmap_bytesize = cpu_to_le32(
                        ((le32_to_cpu(super_block.segment_count_nat) / 2) <<
                         le32_to_cpu(super_block.log_blocks_per_seg)) / 8);

        ckp->checksum_offset = cpu_to_le32(4092);

        crc = f2fs_cal_crc32(F2FS_SUPER_MAGIC, ckp,
                                        le32_to_cpu(ckp->checksum_offset));
        *((u_int32_t *)((unsigned char *)ckp +
                                le32_to_cpu(ckp->checksum_offset))) = crc;

        blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
        // currently at the start address of the check point area
        cp_seg_blk_offset =
                le32_to_cpu(super_block.start_segment_checkpoint) * blk_size_bytes;

        // write this part to disk
        if (writetodisk(f2fs_params.fd, ckp, cp_seg_blk_offset,
                                F2FS_CP_BLOCK_SIZE) < 0) {
                printf("\n\tError: While writing the ckp to disk!!!\n");
                return -1;
        }

        /* 2. Prepare and write Segment summary for data blocks */
        // from here on the summary data structures come in; feel free to skip ahead to the summary analysis below
        SET_SUM_TYPE((&sum->footer), SUM_TYPE_DATA);

        // the first summary entry is the root inode
        sum->entries[0].nid = super_block.root_ino;
        sum->entries[0].bidx = 0;

        // the second block is filled with this summary
        cp_seg_blk_offset += blk_size_bytes;
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 3. Fill segment summary for data block to zero. */
        memset(sum, 0, sizeof(struct f2fs_summary_block));

        cp_seg_blk_offset += blk_size_bytes;
        // the third block is filled with zeros
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 4. Fill segment summary for data block to zero. */
        memset(sum, 0, sizeof(struct f2fs_summary_block));

        /* inode sit for root */
        // record the SIT journal information, split into node and data, 3 entries each
        sum->n_sits = cpu_to_le16(6);
        sum->sit_j.entries[0].segno = ckp->cur_node_segno[0];
        sum->sit_j.entries[0].se.vblocks = cpu_to_le16((CURSEG_HOT_NODE << 10) | 1);
        f2fs_set_bit(0, sum->sit_j.entries[0].se.valid_map);
        sum->sit_j.entries[1].segno = ckp->cur_node_segno[1];
        sum->sit_j.entries[1].se.vblocks = cpu_to_le16((CURSEG_WARM_NODE << 10));
        sum->sit_j.entries[2].segno = ckp->cur_node_segno[2];
        sum->sit_j.entries[2].se.vblocks = cpu_to_le16((CURSEG_COLD_NODE << 10));

        /* data sit for root */
        sum->sit_j.entries[3].segno = ckp->cur_data_segno[0];
        sum->sit_j.entries[3].se.vblocks = cpu_to_le16((CURSEG_HOT_DATA << 10) | 1);
        f2fs_set_bit(0, sum->sit_j.entries[3].se.valid_map);
        sum->sit_j.entries[4].segno = ckp->cur_data_segno[1];
        sum->sit_j.entries[4].se.vblocks = cpu_to_le16((CURSEG_WARM_DATA << 10));
        sum->sit_j.entries[5].segno = ckp->cur_data_segno[2];
        sum->sit_j.entries[5].se.vblocks = cpu_to_le16((CURSEG_COLD_DATA << 10));

        cp_seg_blk_offset += blk_size_bytes;
        // the fourth block is filled with the SIT journal
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 5. cp page2 */
        cp_seg_blk_offset += blk_size_bytes;
        // the fifth block is filled with the trailing copy of the check point
        if (writetodisk(f2fs_params.fd, ckp, cp_seg_blk_offset,
                                F2FS_CP_BLOCK_SIZE) < 0) {
                printf("\n\tError: While writing the ckp to disk!!!\n");
                return -1;
        }

        /* 6. cp page 1 of check point pack 2
         * Initiatialize other checkpoint pack with version zero
         */
        // for this checkpoint only the version number and the CRC change
        ckp->checkpoint_ver = 0;

        crc = f2fs_cal_crc32(F2FS_SUPER_MAGIC, ckp,
                                        le32_to_cpu(ckp->checksum_offset));
        *((u_int32_t *)((unsigned char *)ckp +
                                le32_to_cpu(ckp->checksum_offset))) = crc;

        // offset one segment (here equal to one section) from the check point start address
        // i.e., an offset of 512 blocks
        cp_seg_blk_offset = (le32_to_cpu(super_block.start_segment_checkpoint) +
                                f2fs_params.blks_per_seg) *
                                blk_size_bytes;
        // 512 blocks in, check point #0 is recorded
        if (writetodisk(f2fs_params.fd, ckp,
                                cp_seg_blk_offset, F2FS_CP_BLOCK_SIZE) < 0) {
                printf("\n\tError: While writing the ckp to disk!!!\n");
                return -1;
        }

        /* 7. */
        memset(sum, 0, sizeof(struct f2fs_summary_block));
        SET_SUM_TYPE((&sum->footer), SUM_TYPE_DATA);
        cp_seg_blk_offset += blk_size_bytes;
        // block 512 + 1 records zeros
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 8. */
        memset(sum, 0, sizeof(struct f2fs_summary_block));
        cp_seg_blk_offset += blk_size_bytes;
        // block 512 + 2 records zeros
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 9. */
        memset(sum, 0, sizeof(struct f2fs_summary_block));
        cp_seg_blk_offset += blk_size_bytes;
        // block 512 + 3 records zeros
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 10. cp page 2 of check point pack 2 */
        cp_seg_blk_offset += blk_size_bytes;
        // block 512 + 4 records the trailing copy of the check point
        if (writetodisk(f2fs_params.fd, ckp, cp_seg_blk_offset,
                                F2FS_CP_BLOCK_SIZE) < 0) {
                printf("\n\tError: While writing the ckp to disk!!!\n");
                return -1;
        }

        free(sum) ;
        free(ckp) ;
        return  0;
}

Summary block

At first glance the summary block is rather obscure in this flow, so the data structures need to be laid out

////////////////// the summary as a whole


struct f2fs_summary_block {
        // #define ENTRIES_IN_SUM               512
        struct f2fs_summary entries[ENTRIES_IN_SUM];
        union {
        __le16 n_nats; // valid length of nat_j.entries[] (the physical length is fixed)
                __le16 n_sits; // valid length of sit_j.entries[] (the physical length is fixed)
        };
        // size is 4k - 512*sizeof(f2fs_summary) - sizeof(n_sits) - sizeof(summary_footer) = 505 bytes
        union {
                struct nat_journal nat_j;
                struct sit_journal sit_j;
        };
        struct summary_footer footer;
} __attribute__((packed));

// records the reverse mapping, i.e., from a block back to its parent node
struct f2fs_summary {
        __le32 nid;             /* parent node id */
        union {
                __u8 reserved[3];
                struct {
                        __u8 version;           /* node version number */
                        __le16 bidx;            /* block index in parent node */
                } __attribute__((packed));
        };
} __attribute__((packed));


////////////////// journal containers

// sizeof(struct nat_journal) = 505
// NAT_JOURNAL_ENTRIES = 38
struct nat_journal {
        struct nat_journal_entry entries[NAT_JOURNAL_ENTRIES];
        __u8 reserved[NAT_JOURNAL_RESERVED];
} __attribute__((packed));

// sizeof(struct sit_journal) = 505
// SIT_JOURNAL_ENTRIES = 6
struct sit_journal {
        struct sit_journal_entry entries[SIT_JOURNAL_ENTRIES];
        __u8 reserved[SIT_JOURNAL_RESERVED];
} __attribute__((packed));


////////////////// journal entries


struct nat_journal_entry {
        __le32 nid;
        struct f2fs_nat_entry ne;
} __attribute__((packed));

// sizeof(struct sit_journal_entry) = 78 (4 + sizeof(struct f2fs_sit_entry))
struct sit_journal_entry {
        __le32 segno;
        struct f2fs_sit_entry se;
} __attribute__((packed));


////////////////// the actual entries for the two areas


// we have already seen the NAT triple above
struct f2fs_nat_entry {
        __u8    version;
        __le32  ino;
        __le32  block_addr;
} __attribute__((packed));

// the SIT has no elaborate initialization flow, which is why this one looks unfamiliar
struct f2fs_sit_entry {
        __le16 vblocks; // i.e., the number of valid blocks
        __u8 valid_map[SIT_VBLOCK_MAP_SIZE];
        __le64 mtime;
} __attribute__((packed));


////////////////// the footer, odds and ends


// the footer sits at the tail of the summary block, distinguishing the summary type and holding a checksum for consistency checks
struct summary_footer {
        unsigned char entry_type;
        __u32 check_sum;
} __attribute__((packed));

Although these structures are numerous and ugly to look at, they are not hard to understand:

  • f2fs_summary_block: describes the summary entries stored on disk in units of one block, with the remaining space used for the journal
  • f2fs_summary: the reverse mapping, recording the owning parent node and the block index within that parent
  • nat_journal / sit_journal: the journal containers, each holding a number of journal entries
  • nat_journal_entry / sit_journal_entry: a journal entry, holding an id plus the concrete area entry
  • f2fs_nat_entry / f2fs_sit_entry: the actual entries the check point needs to record, matching the data structures of the respective areas
  • summary_footer: stored at the tail of the summary block, used to distinguish the journal type and to hold a checksum
/* An excellent auxiliary diagram was drawn here, but you need to pay 998,244,353 CNY to unlock it */
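Adding up the sizes noted in the comments shows that a summary block fills exactly one 4 KB block (the per-entry sizes are my own arithmetic, assuming the usual 64-byte valid_map in f2fs_sit_entry):

  • entries: \(512 \times 7 = 3584\) bytes, each f2fs_summary being \(4 + 3 = 7\) bytes
  • the n_nats/n_sits union: 2 bytes
  • the journal union: 505 bytes, hence NAT_JOURNAL_ENTRIES = 38 (\(38 \times 13 = 494\) plus 11 reserved bytes) and SIT_JOURNAL_ENTRIES = 6 (\(6 \times 78 = 468\) plus 37 reserved bytes)
  • the footer: \(1 + 4 = 5\) bytes
  • total: \(3584 + 2 + 505 + 5 = 4096\) bytes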

Auxiliary debugging information

The debugging output is as follows, with a breakpoint set just before free():

(gdb) b f2fs_format.c:956

(gdb) info locals
ckp = 0x4098c0
sum = 0x40a8d0
blk_size_bytes = 4096
cp_seg_blk_offset = 4210688
crc = 1257347361
i = 8

(gdb) p *ckp
$6 = {
  checkpoint_ver = 0,
  user_block_count = 220672,
  valid_block_count = 2,
  rsvd_segment_count = 25,
  overprov_segment_count = 47,
  free_segment_count = 472,
  cur_node_segno = {476, 475, 474, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295},
  cur_node_blkoff = {1, 0, 0, 0, 0, 0, 0, 0},
  nat_upd_blkoff = {1, 0, 0, 0, 0, 0, 0, 0},
  cur_data_segno = {473, 1, 0, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295},
  cur_data_blkoff = {1, 0, 0, 0, 0, 0, 0, 0},
  ckpt_flags = 0,
  cp_pack_total_block_count = 5,
  cp_pack_start_sum = 1,
  valid_node_count = 1,
  valid_inode_count = 1,
  next_free_nid = 4,
  sit_ver_bitmap_bytesize = 64,
  nat_ver_bitmap_bytesize = 128,
  checksum_offset = 4092,
  elapsed_time = 0,
  alloc_type = '\000' <repeats 15 times>,
  sit_nat_version_bitmap = ""
}
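The fiddly calculations in f2fs_write_check_point_pack() can be reproduced from the numbers we already have:

  • overprov_segment_count = \((478 - 25) \times 5 / 100 + 25 = 22 + 25 = 47\)
  • free_segment_count = \(478 - 6 = 472\)
  • user_block_count = \((472 + 6 - 47) \times 512 = 431 \times 512 = 220672\)
  • sit_ver_bitmap_bytesize = \((2 / 2 \times 512) / 8 = 64\) and nat_ver_bitmap_bytesize = \((4 / 2 \times 512) / 8 = 128\)
  • the final cp_seg_blk_offset = \(4210688 = (512 + 512 + 4) \times 4096\), i.e., the fifth block of the second checkpoint pack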

Algorithm idea

The flow above plus what the summary block records are not enough to grasp the full picture of the check point, since this is only the initialization. But we can already see that the checkpoint is duplicated: the CP area spans 2 segments (equal to sections here), each holding one checkpoint pack whose first and last blocks are identical copies of the checkpoint

A basic algorithmic idea: if check point #1 was written last time, then check point #0 is written next time, which helps consistency checking. And the so-called recovery process no longer looks mysterious either; although the complete algorithm cannot be known from this code alone, all that happens here is recording the required metadata, nothing more

THE END

That gives us a rough picture of F2FS's initialized on-disk view; the next article will explore the F2FS runtime. That's it for now

Finally, as usual, an off-topic picture

[Figure: an off-topic picture]
