Sparse File

1. Sparse File
- 1.1. Working with sparse files
- 1.2. References

1. Sparse File

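Linux has a native notion of sparse files: a sparse file is simply a file with holes in it, created by seeking past a region without writing to it.

Filesystems generally store a file through a multi-level index, so when a program seeks past unwritten space, the filesystem can represent the hole in the index itself instead of actually writing a run of zero bytes to disk (see The Design of the UNIX Operating System).

So on Linux, a sparse file is a file in which a seek, rather than stored data, stands in for a run of consecutive zeros.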
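To make the idea concrete, here is a minimal shell sketch, assuming GNU coreutils (`dd`, `stat -c`) and a filesystem that supports holes. The file name `hole_demo` and the 1 MiB size are made up for illustration. It writes a single byte at the end of a 1 MiB range and compares the logical size with the space actually allocated.

```shell
# Scratch directory so nothing else is touched.
tmpdir=$(mktemp -d)
f="$tmpdir/hole_demo"

# Seek 1048575 bytes into the output, then write one byte ("x").
# Nothing before that byte is ever written, so that range is a hole.
printf 'x' | dd of="$f" bs=1 seek=1048575 2>/dev/null

# %s = logical size in bytes, %b = allocated blocks, %B = bytes per block.
logical=$(stat -c %s "$f")
allocated=$(( $(stat -c %b "$f") * $(stat -c %B "$f") ))

echo "logical size:   $logical bytes"
echo "allocated size: $allocated bytes"
```

On a filesystem with hole support, the allocated size should come out far smaller than the logical size; the exact figure depends on the filesystem's block size.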

1.1. Working with sparse files

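Creating a sparse file always comes down to lseek, and many Linux tools already handle sparse files, such as du, cp, and tar.

By default, du reports the disk space a file actually occupies (holes excluded); with --apparent-size it reports the size including holes (the size the file appears to have), for example: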

$> dd if=/dev/urandom bs=4096 seek=100 count=2 of=file_with_holes
2+0 records in
2+0 records out
8192 bytes (8.2 kB) copied, 0.000890926 s, 9.2 MB/s
$> du --apparent-size -h  file_with_holes
408K    file_with_holes
$> du -h  file_with_holes
8.0K    file_with_holes

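dd if=/dev/urandom bs=4096 seek=100 count=2 of=file_with_holes

This command skips 100 output blocks of 4096 bytes before writing, so the file starts with a 400K hole (read back as zeros), followed by 8K of random data.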
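The sizes reported by du follow directly from the dd parameters. The arithmetic can be checked in the shell (the variable names are only for illustration):

```shell
bs=4096     # block size passed to dd
seek=100    # output blocks skipped before writing
count=2     # blocks actually written

hole_bytes=$(( bs * seek ))                    # bytes skipped, i.e. the hole
data_bytes=$(( bs * count ))                   # bytes of random data written
apparent_bytes=$(( hole_bytes + data_bytes ))  # what du --apparent-size sees

echo "hole:     $(( hole_bytes / 1024 ))K"      # 400K
echo "data:     $(( data_bytes / 1024 ))K"      # 8K
echo "apparent: $(( apparent_bytes / 1024 ))K"  # 408K
```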

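cp's --sparse option controls whether cp, when copying, first tries to work out whether the file being copied can be written out as a sparse file, for example: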

$> cp --sparse=never file_with_holes file_with_holes_2
$> du -h file_with_holes_2 
408K    file_with_holes_2
$> cp --sparse=always file_with_holes file_with_holes_2
$> du -h file_with_holes_2 
8.0K    file_with_holes_2

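Also, cp uses a heuristic to decide whether the destination file can be written sparsely (with lseek): from the data it reads, cp cannot tell whether the source is stored as a sparse file, since all it sees is a byte stream.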

By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well. 
That is the behavior selected by --sparse=auto. Specify --sparse=always to create a sparse DEST file
whenever the SOURCE file contains a long enough sequence of zero bytes. Use --sparse=never to inhibit creation of sparse files.

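From this passage of the cp man page, one can guess that cp (at least with --sparse=always) decides whether to write the destination sparsely by checking whether the source contains runs of consecutive zero bytes.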
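The guess above can be illustrated with a small shell sketch. This is not cp's actual implementation, only an approximation of the idea "find blocks that contain nothing but zero bytes". The helper name `count_zero_blocks` and the 4096-byte block size are assumptions made for the example; it relies on `stat -c`, `dd`, `tr`, `wc`, and `head -c`.

```shell
# Hypothetical helper: count the bs-sized blocks of a file that hold only zero bytes.
count_zero_blocks() {
    file=$1
    bs=$2
    size=$(stat -c %s "$file")
    nblocks=$(( (size + bs - 1) / bs ))   # round up to cover a partial last block
    zero=0
    i=0
    while [ "$i" -lt "$nblocks" ]; do
        # Delete every NUL byte from block i; if nothing is left, it was all zeros.
        rest=$(dd if="$file" bs="$bs" skip="$i" count=1 2>/dev/null | tr -d '\000' | wc -c)
        if [ "$rest" -eq 0 ]; then
            zero=$(( zero + 1 ))
        fi
        i=$(( i + 1 ))
    done
    echo "$zero"
}

# Sample input: two all-zero blocks followed by one block of non-zero data.
sample=$(mktemp)
{ head -c 8192 /dev/zero; head -c 4096 /dev/zero | tr '\000' 'x'; } > "$sample"
zero_blocks=$(count_zero_blocks "$sample" 4096)
echo "$zero_blocks of 3 blocks could become holes"
rm -f "$sample"
```

A copier working this way would seek over each all-zero block in the destination instead of writing it, which is why it does not matter whether the source was stored sparsely in the first place.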

So the next step is to use cp to copy an ordinary, non-sparse file that consists entirely of zeros and see how it behaves:

$> for i in `seq 10000`; do echo -e -n "\0">>all_zero; done
$> du -h all_zero
12K     all_zero
$> du -h --apparent-size all_zero
9.8K    all_zero
$> cp --sparse=always all_zero all_zero_2
$> du -h all_zero_2 
0       all_zero_2
$> du -h --apparent-size all_zero_2
9.8K    all_zero_2
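As a sanity check to go with this experiment: whatever the disk usage, a sparse copy should still read back as exactly the same bytes. The sketch below assumes GNU coreutils; the file names `zeros` and `zeros_sparse` are made up, and `head -c` is used only as a quicker way to produce 10000 zero bytes than the loop.

```shell
tmpdir=$(mktemp -d)

# 10000 zero bytes, written normally (no holes requested).
head -c 10000 /dev/zero > "$tmpdir/zeros"

# Copy, asking cp to turn runs of zero bytes into holes where it can.
cp --sparse=always "$tmpdir/zeros" "$tmpdir/zeros_sparse"

# Logical size and content must be unchanged by the sparse copy.
copy_size=$(stat -c %s "$tmpdir/zeros_sparse")
if cmp -s "$tmpdir/zeros" "$tmpdir/zeros_sparse"; then
    same=yes
else
    same=no
fi
echo "size=$copy_size identical=$same"

# Disk usage of both files, for comparison (the numbers depend on the filesystem).
du -h "$tmpdir/zeros" "$tmpdir/zeros_sparse"
```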

tar can also handle sparse files efficiently, via its -S (--sparse) option:
```
tar -S
```
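A minimal round trip with `tar -S`, assuming GNU tar and GNU coreutils. The names `holey`, `sparse.tar`, and `plain.tar` are made up for the example, and how much smaller the sparse archive ends up depends on whether the filesystem really stores the hole.

```shell
tmpdir=$(mktemp -d)

# A file that is a 1 MiB hole followed by 4 bytes of data.
printf 'data' | dd of="$tmpdir/holey" bs=1 seek=1048576 2>/dev/null

# Archive once with -S (sparse handling) and once without, for comparison.
tar -S -cf "$tmpdir/sparse.tar" -C "$tmpdir" holey
tar -cf "$tmpdir/plain.tar" -C "$tmpdir" holey

# Extract the sparse archive elsewhere and check that the content survived.
mkdir "$tmpdir/out"
tar -xf "$tmpdir/sparse.tar" -C "$tmpdir/out"
if cmp -s "$tmpdir/holey" "$tmpdir/out/holey"; then
    restored=yes
else
    restored=no
fi

sparse_tar_size=$(stat -c %s "$tmpdir/sparse.tar")
plain_tar_size=$(stat -c %s "$tmpdir/plain.tar")
echo "restored=$restored sparse.tar=$sparse_tar_size plain.tar=$plain_tar_size"
```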
