Ext3 Journal

1. Ext3 Journal

1.1. The Three Modes

1.1.1. Journal

All filesystem data and metadata changes are logged into the journal. This mode minimizes the chance of losing the updates made to each file, but it requires many additional disk accesses. For example, when a new file is created, all its data blocks must be duplicated as log records. This is the safest and slowest Ext3 journaling mode.

1.1.2. Ordered

Only changes to filesystem metadata are logged into the journal. However, the Ext3 filesystem groups metadata with its related data blocks so that the data blocks are written to disk before the metadata. This reduces the chance of data corruption inside files; for instance, each write access that enlarges a file is guaranteed to be fully protected by the journal. This is the default Ext3 journaling mode.

1.1.3. Writeback

Only changes to filesystem metadata are logged; this is the method found on the other journaling filesystems and is the fastest mode.

The journaling mode of an Ext3 filesystem is selected with an option of the mount command. For instance, to mount an Ext3 filesystem stored in the /dev/sda2 partition on the /jdisk mount point in "writeback" mode, the system administrator can type:

mount -t ext3 -o data=writeback /dev/sda2 /jdisk
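
To make the role of the data= option concrete, here is a minimal sketch of how an option string could be mapped to one of the three modes. It is not the kernel's option parser: parse_data_mode and the enum names are hypothetical, and the only behavior taken from the text above is that "ordered" is the default when no data= option is given.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The three modes described above, plus a marker for unknown values. */
enum ext3_data_mode { DATA_JOURNAL, DATA_ORDERED, DATA_WRITEBACK, DATA_INVALID };

/* Hypothetical helper (not kernel code): scan a comma-separated mount
 * option string such as "rw,data=writeback" for a data=<mode> entry. */
enum ext3_data_mode parse_data_mode(const char *opts)
{
    enum ext3_data_mode mode = DATA_ORDERED;  /* default mode per the text */
    const char *p = opts;

    assert(opts != NULL);
    while (p != NULL && *p != '\0') {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        if (len >= 5 && strncmp(p, "data=", 5) == 0) {
            const char *val = p + 5;
            size_t vlen = len - 5;

            if (vlen == 7 && strncmp(val, "journal", 7) == 0)
                mode = DATA_JOURNAL;
            else if (vlen == 7 && strncmp(val, "ordered", 7) == 0)
                mode = DATA_ORDERED;
            else if (vlen == 9 && strncmp(val, "writeback", 9) == 0)
                mode = DATA_WRITEBACK;
            else
                return DATA_INVALID;          /* unknown data= value */
        }
        p = end ? end + 1 : NULL;             /* move to the next option */
    }
    return mode;
}
```

With this sketch, parse_data_mode("data=writeback") yields DATA_WRITEBACK, and an option string without any data= entry falls back to DATA_ORDERED.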

1.2. How the Journal Works

1.2.1. commit_write

  1. If the Ext3 filesystem has been mounted in "journal" mode, the commit_write method is implemented by the ext3_journalled_commit_write( ) function, which invokes journal_dirty_metadata( ) on every buffer of data (not metadata) in the page. This way, the buffer is included in the proper dirty list of the active transaction and not in the dirty list of the owner inode; moreover, the corresponding log records are written to the journal. Finally, ext3_journalled_commit_write( ) invokes journal_stop( ) to notify the JBD layer that the atomic operation handle is closed.
  2. If the Ext3 filesystem has been mounted in "ordered" mode, the commit_write method is implemented by the ext3_ordered_commit_write( ) function, which invokes the journal_dirty_data( ) function on every buffer of data in the page to insert the buffer in a proper list of the active transaction. The JBD layer ensures that all buffers in this list are written to disk before the metadata buffers of the transaction. No log record is written onto the journal. Next, ext3_ordered_commit_write( ) executes the normal generic_commit_write( ) function described in Chapter 15, which inserts the data buffers in the list of the dirty buffers of the owner inode. Finally, ext3_ordered_commit_write( ) invokes journal_stop( ) to notify the JBD layer that the atomic operation handle is closed.
  3. If the Ext3 filesystem has been mounted in "writeback" mode, the commit_write method is implemented by the ext3_writeback_commit_write( ) function, which executes the normal generic_commit_write( ) function described in Chapter 15, which inserts the data buffers in the list of the dirty buffers of the owner inode. Then, ext3_writeback_commit_write( ) invokes journal_stop( ) to notify the JBD layer that the atomic operation handle is closed.
  4. Once commit_write has finished, the write system call is complete.
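
The three commit_write variants differ mainly in which lists a data buffer ends up on. The sketch below restates that as a small lookup. It is a toy model under the assumptions of the list above, not kernel code: commit_write_lists and the flag names are hypothetical.

```c
#include <assert.h>

enum ext3_data_mode { DATA_JOURNAL, DATA_ORDERED, DATA_WRITEBACK };

/* Where a data buffer is queued after commit_write (flags may combine). */
enum buffer_list {
    ON_JBD_METADATA_LIST = 1 << 0, /* logged to the journal by JBD            */
    ON_JBD_DATA_LIST     = 1 << 1, /* written by JBD before the metadata      */
    ON_INODE_DIRTY_LIST  = 1 << 2  /* dirty buffer list of the owner inode    */
};

/* Hypothetical helper: which lists hold the data buffer in each mode. */
unsigned commit_write_lists(enum ext3_data_mode mode)
{
    switch (mode) {
    case DATA_JOURNAL:
        /* journal_dirty_metadata() is called on every data buffer */
        return ON_JBD_METADATA_LIST;
    case DATA_ORDERED:
        /* journal_dirty_data() followed by generic_commit_write() */
        return ON_JBD_DATA_LIST | ON_INODE_DIRTY_LIST;
    case DATA_WRITEBACK:
        /* generic_commit_write() only; JBD never sees the data */
        return ON_INODE_DIRTY_LIST;
    }
    assert(!"unknown journaling mode");
    return 0;
}
```

Reading the table this way shows why only "journal" mode keeps data off the inode's dirty list: the data has to reach the journal before it may be written in place.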

1.2.2. journal_commit_transaction

  1. kjournald periodically commits the journal transaction.
  2. If the Ext3 filesystem has been mounted in "ordered" mode, the journal_commit_transaction( ) function activates the I/O data transfers for all data buffers included in the list of the transaction and waits until all data transfers terminate.
  3. The journal_commit_transaction( ) function activates the I/O data transfers for all metadata buffers included in the transaction (and also for all data buffers, if Ext3 was mounted in "journal" mode).
  4. Periodically, the kernel activates a checkpoint activity for every complete transaction in the journal. The checkpoint basically involves verifying whether the I/O data transfers triggered by journal_commit_transaction( ) have successfully terminated. If so, the transaction can be deleted from the journal.
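
The ordering constraints in the steps above can be summarized as a per-mode plan. This is only an illustrative sketch: commit_plan, the step names, and MAX_COMMIT_STEPS are hypothetical, and the real journal_commit_transaction( ) has more phases than are modelled here.

```c
#include <assert.h>
#include <stddef.h>

enum ext3_data_mode { DATA_JOURNAL, DATA_ORDERED, DATA_WRITEBACK };

enum commit_step {
    STEP_FLUSH_DATA_IN_PLACE,   /* ordered: write data to its final blocks and wait */
    STEP_LOG_METADATA,          /* ordered, writeback: metadata goes to the journal */
    STEP_LOG_METADATA_AND_DATA, /* journal: metadata and data go to the journal     */
    STEP_WRITE_COMMIT_RECORD    /* all modes: mark the transaction as committed     */
};

#define MAX_COMMIT_STEPS 3

/* Hypothetical helper: fill steps[] with the commit sequence for a mode
 * and return the number of steps written. */
size_t commit_plan(enum ext3_data_mode mode,
                   enum commit_step steps[MAX_COMMIT_STEPS])
{
    size_t n = 0;

    if (mode == DATA_ORDERED)
        steps[n++] = STEP_FLUSH_DATA_IN_PLACE;
    steps[n++] = (mode == DATA_JOURNAL) ? STEP_LOG_METADATA_AND_DATA
                                        : STEP_LOG_METADATA;
    steps[n++] = STEP_WRITE_COMMIT_RECORD;

    assert(n <= MAX_COMMIT_STEPS);
    return n;
}
```

Only the "ordered" plan has three steps; the extra first step is what guarantees that committed metadata never points at data that has not reached the disk.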

1.3. The code

1.3.1. dirty data

1.3.1.1. ext3_journal_dirty_metadata

ext3_journal_dirty_metadata hands a buffer to JBD as metadata; when JBD later commits the transaction, this metadata is written to the journal.

Every mode needs to dirty metadata.

In journal mode, commit_write also passes the file data to JBD disguised as metadata, so that the data is committed to the journal as well.

1.3.1.2. ext3_journal_dirty_data

In ordered mode the data must be written before the journal is committed. ext3_journal_dirty_data hands the data to JBD, but not as metadata, because the data does not need to be written to the journal.

1.3.2. commit_write

1.3.2.1. ext3_writeback_commit_write

ext3_writeback_commit_write:
  // In every mode, the metadata has already been submitted to JBD for
  // journaling by the time commit_write runs, so journal_dirty_metadata
  // does not need to be called again here.
  // writeback simply uses generic_commit_write to mark the page and the
  // inode dirty; pdflush later writes the data to the inode's blocks.
  // In writeback mode, then, data is completely independent of the
  // journal (metadata).
  generic_commit_write(file, page, from, to);

1.3.2.2. ext3_ordered_commit_write

ext3_ordered_commit_write:
  // ext3_journal_dirty_data puts the data on a dedicated JBD data list.
  // When JBD commits the transaction, it first writes the data on this
  // list to the inode's blocks (not to the journal).
  // This does not stop the following generic_commit_write from also
  // marking the data dirty on the inode so that pdflush can write it out.
  ret = walk_page_buffers(handle, page_buffers(page),
                        from, to, NULL, ext3_journal_dirty_data);
  generic_commit_write(file, page, from, to);

1.3.2.3. ext3_journalled_commit_write

ext3_journalled_commit_write:
  // All data is passed to ext3_journal_dirty_metadata as `metadata`, so
  // that the data is committed to the journal.
  ret = walk_page_buffers(handle, page_buffers(page), from,
                        to, &partial, ext3_journal_dirty_metadata);
  // generic_commit_write is not called afterwards, because the data must
  // not be written to the inode's blocks before it has been committed to
  // the journal.

1.3.3. commit_transaction

The actual disk writes happen in journal_commit_transaction.

journal_commit_transaction:
  // phase 1: write everything out: file data first, then metadata and
  // control blocks
  write_out_data
  // start the journal I/O
  start_journal_io()
  // phase 2: wait until the metadata has been written
  wait_for_iobuf
  wait_for_ctlbuf
  // phase 3: write the commit record
  put_bh(bh);
  // checkpoint

1.3.4. checkpoint

log_do_checkpoint:
  __flush_buffer(journal, jh, bhs, &batch_count, &drop_count);
    ll_rw_block(WRITE, *batch_count, bhs);
      submit_bh(rw, bh);
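
The call chain above only shows the write-back half of a checkpoint. As a rough illustration of the whole idea (flush what is still dirty, then reclaim the transaction once nothing is pending), here is a toy model. The struct and both functions are hypothetical stand-ins, not JBD data structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a journaled buffer; not the kernel's journal_head. */
struct toy_buffer {
    bool dirty;   /* true until the block has reached its final location */
};

/* Write back every dirty buffer (modelled by clearing the flag) and
 * return how many writes were issued. */
size_t toy_checkpoint_flush(struct toy_buffer *bufs, size_t n)
{
    size_t issued = 0;

    assert(bufs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (bufs[i].dirty) {
            bufs[i].dirty = false;  /* stands in for submitting the write and waiting */
            issued++;
        }
    }
    return issued;
}

/* A transaction can be removed from the journal once none of its
 * buffers is still waiting to be written in place. */
bool toy_can_drop_transaction(const struct toy_buffer *bufs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (bufs[i].dirty)
            return false;
    return true;
}
```

In this model a transaction with two dirty buffers cannot be dropped until toy_checkpoint_flush has run once; a second flush then finds nothing left to write.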

1.4. Experiments

1.4.1. writeback

# Generate about 100 MB of 0xFF bytes
dd if=/dev/zero ibs=1k count=100000 | tr "\000" "\377" > ~/ff.bin

# Create the image
dd if=/dev/zero of=test.img bs=1024 count=300000
mkfs.ext3 -F test.img
mount -o loop,data=writeback test.img /mnt
sync

# Copy the data
cd /mnt
dd if=~/ff.bin of=./dump.bin

# Wait 8 s, then reset the virtual machine. Ext3 commits the journal every
# 5 s by default, so 8 s ensures that the metadata has been committed while
# the data may not have been written yet.
# After the reboot, mount the image again:
mount -o loop,data=writeback test.img /mnt
cd /mnt

# dump.bin has the same size as ff.bin
du -b ./dump.bin

# But the content of dump.bin differs from ff.bin: od shows that dump.bin
# starts with ff bytes, followed by zeros
od -x ./dump.bin
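
Reading od output by eye works, but the point where the ff bytes stop can also be located programmatically. The following is a small sketch for that check; first_non_ff and scan_file_for_non_ff are hypothetical helper names, and the path passed in is whatever file is being inspected (dump.bin in this experiment).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Return the index of the first byte that is not 0xFF, or n if all are. */
size_t first_non_ff(const unsigned char *buf, size_t n)
{
    size_t i;

    assert(buf != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (buf[i] != 0xFF)
            break;
    return i;
}

/* Scan a file in chunks. Returns the offset of the first non-0xFF byte,
 * the file length if every byte is 0xFF, or -1 on an open or read error. */
long long scan_file_for_non_ff(const char *path)
{
    unsigned char chunk[4096];
    long long offset = 0;
    size_t got;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -1;
    while ((got = fread(chunk, 1, sizeof chunk, f)) > 0) {
        size_t bad = first_non_ff(chunk, got);

        offset += (long long)bad;
        if (bad < got)
            break;                /* found a byte that is not 0xFF */
    }
    if (ferror(f))
        offset = -1;
    fclose(f);
    return offset;
}
```

If the returned offset equals the size reported by du -b, the whole file made it to disk; a smaller value marks where the written data ends.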

1.4.2. ordered

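All steps are the same as for writeback. The observed result:

dump.bin does not exist. Ordered mode commits the metadata only after the data has been written, and the data had not finished writing within the 8 s wait, so the metadata was never committed.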
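
The two observations can be put side by side in a small decision function. This is a simplified model of the experiments only, not of Ext3 itself: outcome_after_crash and its parameters are hypothetical, and "journal" mode is left out because the experiments do not cover it.

```c
#include <assert.h>
#include <stdbool.h>

enum ext3_data_mode { DATA_ORDERED, DATA_WRITEBACK };

enum crash_outcome {
    FILE_MISSING,                 /* metadata was never committed            */
    FILE_PRESENT_DATA_INCOMPLETE, /* metadata committed, data not all on disk */
    FILE_COMPLETE                 /* metadata committed and data on disk     */
};

/* commit_started: the periodic journal commit ran before the crash.
 * data_on_disk:   all file data reached its final blocks before the crash. */
enum crash_outcome outcome_after_crash(enum ext3_data_mode mode,
                                       bool commit_started,
                                       bool data_on_disk)
{
    bool metadata_committed;

    switch (mode) {
    case DATA_WRITEBACK:
        /* The commit does not wait for file data. */
        metadata_committed = commit_started;
        break;
    case DATA_ORDERED:
        /* The commit completes only after the data has been written. */
        metadata_committed = commit_started && data_on_disk;
        break;
    default:
        assert(!"mode not covered by this sketch");
        return FILE_MISSING;
    }

    if (!metadata_committed)
        return FILE_MISSING;
    return data_on_disk ? FILE_COMPLETE : FILE_PRESENT_DATA_INCOMPLETE;
}
```

With the commit started but the data still in flight, the writeback case gives FILE_PRESENT_DATA_INCOMPLETE and the ordered case gives FILE_MISSING, matching the two results observed above.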

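1.5. Summary

In short, the journal works as follows:

  1. At commit_write time, metadata and data are handed to JBD and to the filesystem (marked dirty):
    • metadata is handed only to JBD, because metadata must always be written to the journal first
    • in writeback mode, data is handed to the filesystem
    • in ordered mode, data is handed to the filesystem and also to JBD as plain data (not to be journaled)
    • in journal mode, data is not handed to the filesystem; it is handed to JBD to be journaled
  2. At commit_transaction time (kjournald commits on a timer; the default interval is 5 s, JBD_DEFAULT_MAX_COMMIT_AGE), JBD writes the corresponding data to the journal:
    • in writeback mode, it writes the metadata to the journal
    • in ordered mode, it first writes the data directly to the filesystem, then writes the metadata to the journal
    • in journal mode, it writes both data and metadata to the journal
  3. kjournald's periodic checkpoint clears the journal and writes the journaled data to the filesystem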

Author: [email protected]
Date: 2019-08-21 Wed 00:00
Last updated: 2019-08-22 Thu 20:30

Creative Commons License