Preface
The mechanisms for allocating and freeing disk blocks and file inode entries are largely the same, so this article covers them together. An earlier article described the overall on-disk layout of the ext2 file system; to keep the explanation simple, assume here that the disk contains only a single block group.
Block group layout
Block group descriptor: similar in purpose to the superblock, it records the basic attributes of this block group.
Data block bitmap / inode bitmap: operate at bit granularity; each bit records the state of one data block or one inode entry, with 0 meaning free and 1 meaning in use.
inode table: the area where the inode entries are stored together.
Data blocks: hold the file data and metadata.
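For orientation, here is a simplified C sketch of what a block group descriptor records. The field names and types are illustrative only and do not match the real ext2 on-disk structure; the FreeBSD code shown later accesses the real structure through helpers such as e2fs_gd_get_nbfree().

#include <stdint.h>

/*
 * Illustrative only: a simplified view of what a block group
 * descriptor records. The real ext2 on-disk structure differs in
 * field names, widths and padding.
 */
struct bg_descriptor_sketch {
	uint32_t block_bitmap_blkno;	/* disk block holding the data block bitmap */
	uint32_t inode_bitmap_blkno;	/* disk block holding the inode bitmap */
	uint32_t inode_table_blkno;	/* first disk block of the inode table */
	uint16_t free_blocks_count;	/* free data blocks in this group */
	uint16_t free_inodes_count;	/* free inode entries in this group */
	uint16_t used_dirs_count;	/* directories allocated in this group */
};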
Question 1: the superblock stores information about the disk as a whole, so why does every block group contain a copy of it?
This is done to keep multiple backups, which helps recover data when part of the disk is damaged. Not every disk file system adopts this design; UFS, for example, stores backup copies of the superblock at a few fixed disk blocks (three copies, as far as I remember).
Question 2: how is the size of each region determined?
Usually by estimation. Based on expected usage, you preset the average number of disk blocks a file will occupy. Since every file corresponds to exactly one inode structure, this yields an approximate count of inode entries, from which the sizes of the inode bitmap and the inode table can be derived. The superblock and the block group descriptor normally occupy one or more whole disk blocks each, and once those are fixed, the sizes of the data block bitmap and the data block region follow. A rough numeric sketch of this estimate is shown below.
Each region's size should preferably be an integer multiple of the disk block size. If storage space is not too tight, it is also worth making the disk block size an integer multiple of the data structure size (this may waste a little space); the benefit is lower complexity in the implementation.
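A back-of-the-envelope version of that estimate, with entirely made-up numbers (1 GiB disk, 4 KB blocks, 256-byte inode entries, an assumed average of 8 blocks per file):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* All numbers below are illustrative assumptions, not ext2 defaults. */
	uint64_t disk_bytes    = 1ULL << 30;	/* 1 GiB disk */
	uint64_t bsize         = 4096;		/* disk block size */
	uint64_t inode_size    = 256;		/* bytes per inode entry */
	uint64_t blks_per_file = 8;		/* presumed average blocks per file */

	uint64_t total_blocks = disk_bytes / bsize;
	uint64_t inode_count  = total_blocks / blks_per_file;

	/* Sizes of the metadata regions, rounded up to whole blocks. */
	uint64_t inode_table_blks  = (inode_count * inode_size + bsize - 1) / bsize;
	uint64_t inode_bitmap_blks = (inode_count / 8 + bsize - 1) / bsize;
	uint64_t block_bitmap_blks = (total_blocks / 8 + bsize - 1) / bsize;

	printf("inodes=%llu itable=%llu blks ibitmap=%llu blks bbitmap=%llu blks\n",
	    (unsigned long long)inode_count,
	    (unsigned long long)inode_table_blks,
	    (unsigned long long)inode_bitmap_blks,
	    (unsigned long long)block_bitmap_blks);
	return (0);
}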
Analysis of the main functions
We will mainly look at two files under freebsd/usr/src/sys/fs/ext2fs: ext2_alloc.c and ext2_balloc.c.
int ext2_alloc(...): allocates a free data block from the disk.
/*
* Allocate a block in the filesystem.
*
* A preference may be optionally specified. If a preference is given
* the following hierarchy is used to allocate a block:
* 1) allocate the requested block.
* 2) allocate a rotationally optimal block in the same cylinder.
* 3) allocate a block in the same cylinder group.
* 4) quadradically rehash into other cylinder groups, until an
* available block is located.
* If no block preference is given the following hierarchy is used
* to allocate a block:
* 1) allocate a block in the cylinder group that contains the
* inode for the file.
* 2) quadradically rehash into other cylinder groups, until an
* available block is located.
*/
int
ext2_alloc(struct inode *ip, daddr_t lbn, e4fs_daddr_t bpref, int size,
struct ucred *cred, e4fs_daddr_t *bnp)
{
struct m_ext2fs *fs;
struct ext2mount *ump;
e4fs_daddr_t bno;
int cg;
*bnp = 0;
fs = ip->i_e2fs;
ump = ip->i_ump;
mtx_assert(EXT2_MTX(ump), MA_OWNED);
#ifdef INVARIANTS
if ((u_int)size > fs->e2fs_bsize || blkoff(fs, size) != 0) {
vn_printf(ip->i_devvp, "bsize = %lu, size = %d, fs = %s\n",
(long unsigned int)fs->e2fs_bsize, size, fs->e2fs_fsmnt);
panic("ext2_alloc: bad size");
}
if (cred == NOCRED)
panic("ext2_alloc: missing credential");
#endif /* INVARIANTS */
if (size == fs->e2fs_bsize && fs->e2fs_fbcount == 0)
goto nospace;
if (cred->cr_uid != 0 &&
fs->e2fs_fbcount < fs->e2fs_rbcount)
goto nospace;
if (bpref >= fs->e2fs_bcount)
bpref = 0;
if (bpref == 0)
cg = ino_to_cg(fs, ip->i_number);
else
cg = dtog(fs, bpref);
bno = (daddr_t)ext2_hashalloc(ip, cg, bpref, fs->e2fs_bsize,
ext2_alloccg);
if (bno > 0) {
/* set next_alloc fields as done in block_getblk */
ip->i_next_alloc_block = lbn;
ip->i_next_alloc_goal = bno;
ip->i_blocks += btodb(fs->e2fs_bsize);
ip->i_flag |= IN_CHANGE | IN_UPDATE;
*bnp = bno;
return (0);
}
nospace:
EXT2_UNLOCK(ump);
ext2_fserr(fs, cred->cr_uid, "filesystem full");
uprintf("\n%s: write failed, filesystem is full\n", fs->e2fs_fsmnt);
return (ENOSPC);
}
Parameter analysis:
daddr_t lbn: the file logical block number. What does that mean? Suppose a text file is 4 KB and the disk block size is 512 bytes, so the file spans 8 blocks. When vim opens the file, its blocks must be read in order from block 0 through block 7, otherwise the content rendered on screen would be out of order; 0-7 are the file's logical block numbers. The physical disk block behind each logical block is allocated more or less at random, but a logical block's offset relative to the start of the file is fixed. If the user now appends some data to the end of the file, a ninth logical block has to be allocated for it, i.e. lbn = 8 (a small numeric illustration follows after this parameter list).
e4fs_daddr_t bpref: the preferred block number. To speed up block allocation, ext2 keeps a likely-usable block number in the i_next_alloc_goal member of struct inode (normally the block number of the previous allocation plus one). The next allocation first checks whether that block is free; if it is, it is used directly, otherwise the allocation routine searches for a free block from scratch.
e4fs_daddr_t *bnp: the disk block number pointer, i.e. where the real disk block number backing the logical block in the example above is returned.
The function also computes the block group number cg, which we will not expand on here. The idea is locality: prefer blocks in the block group that already holds the file, which reduces head movement during reads and writes and improves efficiency.
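As a small illustration of the logical block number (this is not code from ext2fs), mapping a byte offset, or the current file size when appending, to an lbn is a single division:

#include <stdio.h>
#include <stdint.h>

/* Logical block number that contains a given byte offset. */
static uint64_t
offset_to_lbn(uint64_t offset, uint64_t bsize)
{
	return (offset / bsize);
}

int
main(void)
{
	uint64_t bsize = 512;	/* disk block size from the example above */
	uint64_t size = 4096;	/* 4 KB file -> logical blocks 0..7 */

	printf("last lbn of the file = %llu\n",
	    (unsigned long long)offset_to_lbn(size - 1, bsize));	/* 7 */
	printf("lbn needed to append = %llu\n",
	    (unsigned long long)offset_to_lbn(size, bsize));		/* 8 */
	return (0);
}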
Most of this function is pre-allocation checking of the file's attributes and bookkeeping once a block number has been obtained; the real allocation logic lives in ext2_alloccg(...).
/*
* Determine whether a block can be allocated.
*
* Check to see if a block of the appropriate size is available,
* and if it is, allocate it.
*/
static daddr_t
ext2_alloccg(struct inode *ip, int cg, daddr_t bpref, int size)
{
struct m_ext2fs *fs;
struct buf *bp;
struct ext2mount *ump;
daddr_t bno, runstart, runlen;
int bit, loc, end, error, start;
char *bbp;
/* XXX ondisk32 */
fs = ip->i_e2fs;
ump = ip->i_ump;
if (e2fs_gd_get_nbfree(&fs->e2fs_gd[cg]) == 0)
return (0);
EXT2_UNLOCK(ump);
/* read the block bitmap from disk into memory */
error = bread(ip->i_devvp, fsbtodb(fs,
e2fs_gd_get_b_bitmap(&fs->e2fs_gd[cg])),
(int)fs->e2fs_bsize, NOCRED, &bp);
if (error) {
brelse(bp);
EXT2_LOCK(ump);
return (0);
}
if (EXT2_HAS_RO_COMPAT_FEATURE(fs, EXT2F_ROCOMPAT_GDT_CSUM) ||
EXT2_HAS_RO_COMPAT_FEATURE(fs, EXT2F_ROCOMPAT_METADATA_CKSUM)) {
error = ext2_cg_block_bitmap_init(fs, cg, bp);
if (error) {
brelse(bp);
EXT2_LOCK(ump);
return (0);
}
ext2_gd_b_bitmap_csum_set(fs, cg, bp);
}
error = ext2_gd_b_bitmap_csum_verify(fs, cg, bp);
if (error) {
brelse(bp);
EXT2_LOCK(ump);
return (0);
}
if (e2fs_gd_get_nbfree(&fs->e2fs_gd[cg]) == 0) {
/*
* Another thread allocated the last block in this
* group while we were waiting for the buffer.
*/
brelse(bp);
EXT2_LOCK(ump);
return (0);
}
bbp = (char *)bp->b_data;
if (dtog(fs, bpref) != cg)
bpref = 0;
if (bpref != 0) {
bpref = dtogd(fs, bpref);
/*
* if the requested block is available, use it
*/
if (isclr(bbp, bpref)) {
bno = bpref;
goto gotit;
}
}
/*
* no blocks in the requested cylinder, so take next
* available one in this cylinder group.
* first try to get 8 contiguous blocks, then fall back to a single
* block.
*/
if (bpref)
start = dtogd(fs, bpref) / NBBY;
else
start = 0;
end = howmany(fs->e2fs->e2fs_fpg, NBBY) - start;
/* scan the data block bitmap for a free block */
retry:
runlen = 0;
runstart = 0;
for (loc = start; loc < end; loc++) {
if (bbp[loc] == (char)0xff) {
runlen = 0;
continue;
}
/* Start of a run, find the number of high clear bits. */
if (runlen == 0) {
bit = fls(bbp[loc]);
runlen = NBBY - bit;
runstart = loc * NBBY + bit;
} else if (bbp[loc] == 0) {
/* Continue a run. */
runlen += NBBY;
} else {
/*
* Finish the current run. If it isn't long
* enough, start a new one.
*/
bit = ffs(bbp[loc]) - 1;
runlen += bit;
if (runlen >= 8) {
bno = runstart;
goto gotit;
}
/* Run was too short, start a new one. */
bit = fls(bbp[loc]);
runlen = NBBY - bit;
runstart = loc * NBBY + bit;
}
/* If the current run is long enough, use it. */
if (runlen >= 8) {
bno = runstart;
goto gotit;
}
}
if (start != 0) {
end = start;
start = 0;
goto retry;
}
bno = ext2_mapsearch(fs, bbp, bpref);
if (bno < 0) {
brelse(bp);
EXT2_LOCK(ump);
return (0);
}
gotit:
#ifdef INVARIANTS
if (isset(bbp, bno)) {
printf("ext2fs_alloccgblk: cg=%d bno=%jd fs=%s\n",
cg, (intmax_t)bno, fs->e2fs_fsmnt);
panic("ext2fs_alloccg: dup alloc");
}
#endif
setbit(bbp, bno); /* set bit bno to 1: this data block of the group is now in use */
EXT2_LOCK(ump);
ext2_clusteracct(fs, bbp, cg, bno, -1);
/* update the in-memory superblock counters (done under the mount lock) */
fs->e2fs_fbcount--;
e2fs_gd_set_nbfree(&fs->e2fs_gd[cg],
e2fs_gd_get_nbfree(&fs->e2fs_gd[cg]) - 1);
fs->e2fs_fmod = 1;
EXT2_UNLOCK(ump);
ext2_gd_b_bitmap_csum_set(fs, cg, bp);
bdwrite(bp); /* schedule the updated bitmap buffer to be written back to disk */
return (((uint64_t)cg) * fs->e2fs->e2fs_fpg + fs->e2fs->e2fs_first_dblock + bno);
}
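The retry loop above walks the block bitmap a byte at a time, first looking for a run of at least 8 free (clear) bits and only then falling back to any single free bit via ext2_mapsearch(). A stand-alone user-space sketch of the same idea, simplified to a plain bit-by-bit scan with no preferred starting point and no wrap-around:

#include <stdio.h>
#include <string.h>

#define NBBY	8	/* bits per byte */

/*
 * Illustrative only: find the first run of at least `want` clear bits
 * (clear = free) in a bitmap of `nbits` bits, or -1 if none exists.
 * The kernel version works a byte at a time with ffs()/fls(), but the
 * result is the same.
 */
static int
find_free_run(const unsigned char *bmp, int nbits, int want)
{
	int runstart = -1, runlen = 0;

	for (int i = 0; i < nbits; i++) {
		if (bmp[i / NBBY] & (1 << (i % NBBY))) {	/* bit set -> in use */
			runlen = 0;
			runstart = -1;
		} else {
			if (runlen++ == 0)
				runstart = i;
			if (runlen >= want)
				return (runstart);
		}
	}
	return (-1);
}

int
main(void)
{
	unsigned char bmp[4];

	memset(bmp, 0xff, sizeof(bmp));	/* everything in use...  */
	bmp[2] = 0x00;			/* ...except bits 16..23 */
	printf("run starts at bit %d\n", find_free_run(bmp, 32, 8));	/* prints 16 */
	return (0);
}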
bread(...) / bwrite(...) touch on some operating-system fundamentals. The CPU does not operate on disk data in place (that would be far too slow), so data is always read from disk into memory first. Reads from disk have their own optimizations, such as whole-block transfers: even if we only want to modify 4 bytes inside a disk block, the driver still reads the corresponding 512-byte block into memory. In terms of the principle of locality, adjacent data is likely to be needed soon, and combined with other driver-level features the overall efficiency is quite good. Another example is sorting pending block numbers in ascending or descending order so that, within a given pass, the disk head keeps moving in the same direction as much as possible. Disk file systems may also introduce the notion of a cluster, which as I understand it exists for similar reasons.
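The same read-a-whole-block, modify-a-few-bytes, write-the-whole-block-back idea, expressed as a user-space sketch with pread/pwrite; the kernel of course goes through the buffer cache rather than raw system calls, and /tmp/blockfile here just stands in for the device:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BSIZE	512

/*
 * Illustrative only: to change 4 bytes "on disk" we still read and
 * write back a whole block.
 */
int
main(void)
{
	char block[BSIZE] = { 0 };
	int fd = open("/tmp/blockfile", O_RDWR | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return (1);
	}
	(void)pread(fd, block, BSIZE, 0);	/* "bread": pull the whole block into memory */
	memcpy(block + 100, "abcd", 4);		/* modify only 4 bytes of it */
	if (pwrite(fd, block, BSIZE, 0) != BSIZE) {	/* "bwrite": push the whole block back */
		perror("pwrite");
		return (1);
	}
	close(fd);
	return (0);
}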
FreeBSD manages blocks read from disk through buffers (struct buf), and each disk block maps to a single buffer. This is done to guarantee the atomicity of file data reads and writes (roughly speaking, the buffer is protected by a read/write lock). This part touches on memory management, which I have not studied in depth; I will come back and fill that gap some other time.
int ext2_valloc(...): allocates a free inode.
/*
* Allocate an inode in the filesystem.
*
*/
int
ext2_valloc(struct vnode *pvp, int mode, struct ucred *cred, struct vnode **vpp)
{
struct timespec ts;
struct inode *pip;
struct m_ext2fs *fs;
struct inode *ip;
struct ext2mount *ump;
ino_t ino, ipref;
int error, cg;
*vpp = NULL;
pip = VTOI(pvp);
fs = pip->i_e2fs;
ump = pip->i_ump;
EXT2_LOCK(ump);
if (fs->e2fs->e2fs_ficount == 0)
goto noinodes;
/*
* If it is a directory then obtain a cylinder group based on
* ext2_dirpref else obtain it using ino_to_cg. The preferred inode is
* always the next inode.
*/
if ((mode & IFMT) == IFDIR) {
cg = ext2_dirpref(pip);
if (fs->e2fs_contigdirs[cg] < 255)
fs->e2fs_contigdirs[cg]++;
} else {
cg = ino_to_cg(fs, pip->i_number);
if (fs->e2fs_contigdirs[cg] > 0)
fs->e2fs_contigdirs[cg]--;
}
ipref = cg * fs->e2fs->e2fs_ipg + 1;
ino = (ino_t)ext2_hashalloc(pip, cg, (long)ipref, mode, ext2_nodealloccg);
if (ino == 0)
goto noinodes;
error = VFS_VGET(pvp->v_mount, ino, LK_EXCLUSIVE, vpp);
if (error) {
ext2_vfree(pvp, ino, mode);
return (error);
}
ip = VTOI(*vpp);
/*
* The question is whether using VGET was such good idea at all:
* Linux doesn't read the old inode in when it is allocating a
* new one. I will set at least i_size and i_blocks to zero.
*/
ip->i_flag = 0;
ip->i_size = 0;
ip->i_blocks = 0;
ip->i_mode = 0;
ip->i_flags = 0;
if (EXT2_HAS_INCOMPAT_FEATURE(fs, EXT2F_INCOMPAT_EXTENTS)
&& (S_ISREG(mode) || S_ISDIR(mode)))
ext4_ext_tree_init(ip);
else
memset(ip->i_data, 0, sizeof(ip->i_data));
/*
* Set up a new generation number for this inode.
* Avoid zero values.
*/
do {
ip->i_gen = arc4random();
} while (ip->i_gen == 0);
vfs_timestamp(&ts);
ip->i_birthtime = ts.tv_sec;
ip->i_birthnsec = ts.tv_nsec;
/*
printf("ext2_valloc: allocated inode %d\n", ino);
*/
return (0);
noinodes:
EXT2_UNLOCK(ump);
ext2_fserr(fs, cred->cr_uid, "out of inodes");
uprintf("\n%s: create/symlink failed, no inodes free\n", fs->e2fs_fsmnt);
return (ENOSPC);
}
The logic is similar to the functions above, so I will not go through it again. If the inode being allocated is for a directory and the disk has multiple block groups, the allocator tries, as far as possible, to place it in the same block group as its parent directory.
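The mapping from an inode number to its block group, which ino_to_cg performs, is plain integer arithmetic. A sketch (ext2 inode numbers start at 1; the inodes-per-group value below is made up):

#include <stdio.h>
#include <stdint.h>

/* Which block group an inode number lives in (inode numbers start at 1). */
static uint32_t
inode_to_group(uint32_t ino, uint32_t inodes_per_group)
{
	return ((ino - 1) / inodes_per_group);
}

int
main(void)
{
	uint32_t ipg = 2048;	/* illustrative inodes-per-group value */

	printf("inode 1    -> group %u\n", inode_to_group(1, ipg));	/* 0 */
	printf("inode 2049 -> group %u\n", inode_to_group(2049, ipg));	/* 1 */
	return (0);
}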
The inode layout also shows that once a file grows past a certain size, the direct block pointers are no longer enough and indirect index blocks have to be used. The data blocks requested for the file then hold not only content but also the metadata used to index the file's blocks. ext2 provides a routine for this as well:
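Before reading the code, it may help to see how the indirection level can be derived from a logical block number. The sketch below assumes 12 direct slots and 4-byte block pointers; in the real code this classification is done by ext2_getlbns():

#include <stdio.h>
#include <stdint.h>

#define NDADDR	12	/* direct block slots in the inode (assumed) */

/*
 * Illustrative only: return 0 for a direct block, 1/2/3 for single,
 * double and triple indirection, -1 if the lbn is out of range.
 */
static int
indirection_level(uint64_t lbn, uint64_t bsize)
{
	uint64_t aperblock = bsize / 4;	/* pointers per indirect block */

	if (lbn < NDADDR)
		return (0);
	lbn -= NDADDR;
	if (lbn < aperblock)
		return (1);
	lbn -= aperblock;
	if (lbn < aperblock * aperblock)
		return (2);
	lbn -= aperblock * aperblock;
	if (lbn < aperblock * aperblock * aperblock)
		return (3);
	return (-1);
}

int
main(void)
{
	uint64_t bsize = 4096;

	printf("lbn 5    -> level %d\n", indirection_level(5, bsize));		/* 0 */
	printf("lbn 12   -> level %d\n", indirection_level(12, bsize));	/* 1 */
	printf("lbn 2000 -> level %d\n", indirection_level(2000, bsize));	/* 2 */
	return (0);
}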
/*
* Balloc defines the structure of filesystem storage
* by allocating the physical blocks on a device given
* the inode and the logical block number in a file.
*/
int
ext2_balloc(struct inode *ip, e2fs_lbn_t lbn, int size, struct ucred *cred,
struct buf **bpp, int flags)
{
struct m_ext2fs *fs;
struct ext2mount *ump;
struct buf *bp, *nbp;
struct vnode *vp = ITOV(ip);
struct indir indirs[EXT2_NIADDR + 2];
e4fs_daddr_t nb, newb;
e2fs_daddr_t *bap, pref;
int osize, nsize, num, i, error;
*bpp = NULL;
if (lbn < 0)
return (EFBIG);
fs = ip->i_e2fs;
ump = ip->i_ump;
/*
* check if this is a sequential block allocation.
* If so, increment next_alloc fields to allow ext2_blkpref
* to make a good guess
*/
if (lbn == ip->i_next_alloc_block + 1) {
ip->i_next_alloc_block++;
ip->i_next_alloc_goal++;
}
if (ip->i_flag & IN_E4EXTENTS)
return (ext2_ext_balloc(ip, lbn, size, cred, bpp, flags));
/*
* The first EXT2_NDADDR blocks are direct blocks
*/
if (lbn < EXT2_NDADDR) {
/* direct block pointers */
nb = ip->i_db[lbn];
/*
* no new block is to be allocated, and no need to expand
* the file
*/
if (nb != 0 && ip->i_size >= (lbn + 1) * fs->e2fs_bsize) {
error = bread(vp, lbn, fs->e2fs_bsize, NOCRED, &bp);
if (error) {
brelse(bp);
return (error);
}
bp->b_blkno = fsbtodb(fs, nb);
*bpp = bp;
return (0);
}
if (nb != 0) {
/*
* Consider need to reallocate a fragment.
*/
osize = fragroundup(fs, blkoff(fs, ip->i_size));
nsize = fragroundup(fs, size);
if (nsize <= osize) {
error = bread(vp, lbn, osize, NOCRED, &bp);
if (error) {
brelse(bp);
return (error);
}
bp->b_blkno = fsbtodb(fs, nb);
} else {
/*
* Godmar thinks: this shouldn't happen w/o
* fragments
*/
printf("nsize %d(%d) > osize %d(%d) nb %d\n",
(int)nsize, (int)size, (int)osize,
(int)ip->i_size, (int)nb);
panic("ext2_balloc: Something is terribly wrong");
/*
* please note there haven't been any changes from here on -
* FFS seems to work.
*/
}
} else {
if (ip->i_size < (lbn + 1) * fs->e2fs_bsize)
nsize = fragroundup(fs, size);
else
nsize = fs->e2fs_bsize;
EXT2_LOCK(ump);
error = ext2_alloc(ip, lbn,
ext2_blkpref(ip, lbn, (int)lbn, &ip->i_db[0], 0),
nsize, cred, &newb);
if (error)
return (error);
/*
* If the newly allocated block exceeds 32-bit limit,
* we can not use it in file block maps.
*/
if (newb > UINT_MAX)
return (EFBIG);
bp = getblk(vp, lbn, nsize, 0, 0, 0);
bp->b_blkno = fsbtodb(fs, newb);
if (flags & BA_CLRBUF)
vfs_bio_clrbuf(bp);
}
ip->i_db[lbn] = dbtofsb(fs, bp->b_blkno);
ip->i_flag |= IN_CHANGE | IN_UPDATE;
*bpp = bp;
return (0);
}
/*
* Determine the number of levels of indirection.
*/
pref = 0;
if ((error = ext2_getlbns(vp, lbn, indirs, &num)) != 0)
return (error);
#ifdef INVARIANTS
if (num < 1)
panic("ext2_balloc: ext2_getlbns returned indirect block");
#endif
/*
* Fetch the first indirect block allocating if necessary.
*/
--num;
nb = ip->i_ib[indirs[0].in_off];
if (nb == 0) {
EXT2_LOCK(ump);
pref = ext2_blkpref(ip, lbn, indirs[0].in_off +
EXT2_NDIR_BLOCKS, &ip->i_db[0], 0);
if ((error = ext2_alloc(ip, lbn, pref, fs->e2fs_bsize, cred,
&newb)))
return (error);
if (newb > UINT_MAX)
return (EFBIG);
nb = newb;
bp = getblk(vp, indirs[1].in_lbn, fs->e2fs_bsize, 0, 0, 0);
bp->b_blkno = fsbtodb(fs, newb);
vfs_bio_clrbuf(bp);
/*
* Write synchronously so that indirect blocks
* never point at garbage.
*/
if ((error = bwrite(bp)) != 0) {
ext2_blkfree(ip, nb, fs->e2fs_bsize);
return (error);
}
ip->i_ib[indirs[0].in_off] = newb;
ip->i_flag |= IN_CHANGE | IN_UPDATE;
}
/*
* Fetch through the indirect blocks, allocating as necessary.
*/
for (i = 1;;) {
error = bread(vp,
indirs[i].in_lbn, (int)fs->e2fs_bsize, NOCRED, &bp);
if (error) {
brelse(bp);
return (error);
}
bap = (e2fs_daddr_t *)bp->b_data;
nb = bap[indirs[i].in_off];
if (i == num)
break;
i += 1;
if (nb != 0) {
bqrelse(bp);
continue;
}
EXT2_LOCK(ump);
if (pref == 0)
pref = ext2_blkpref(ip, lbn, indirs[i].in_off, bap,
bp->b_lblkno);
error = ext2_alloc(ip, lbn, pref, (int)fs->e2fs_bsize, cred, &newb);
if (error) {
brelse(bp);
return (error);
}
if (newb > UINT_MAX)
return (EFBIG);
nb = newb;
nbp = getblk(vp, indirs[i].in_lbn, fs->e2fs_bsize, 0, 0, 0);
nbp->b_blkno = fsbtodb(fs, nb);
vfs_bio_clrbuf(nbp);
/*
* Write synchronously so that indirect blocks
* never point at garbage.
*/
if ((error = bwrite(nbp)) != 0) {
ext2_blkfree(ip, nb, fs->e2fs_bsize);
EXT2_UNLOCK(ump);
brelse(bp);
return (error);
}
bap[indirs[i - 1].in_off] = nb;
/*
* If required, write synchronously, otherwise use
* delayed write.
*/
if (flags & IO_SYNC) {
bwrite(bp);
} else {
if (bp->b_bufsize == fs->e2fs_bsize)
bp->b_flags |= B_CLUSTEROK;
bdwrite(bp);
}
}
/*
* Get the data block, allocating if necessary.
*/
if (nb == 0) {
EXT2_LOCK(ump);
pref = ext2_blkpref(ip, lbn, indirs[i].in_off, &bap[0],
bp->b_lblkno);
if ((error = ext2_alloc(ip,
lbn, pref, (int)fs->e2fs_bsize, cred, &newb)) != 0) {
brelse(bp);
return (error);
}
if (newb > UINT_MAX)
return (EFBIG);
nb = newb;
nbp = getblk(vp, lbn, fs->e2fs_bsize, 0, 0, 0);
nbp->b_blkno = fsbtodb(fs, nb);
if (flags & BA_CLRBUF)
vfs_bio_clrbuf(nbp);
bap[indirs[i].in_off] = nb;
/*
* If required, write synchronously, otherwise use
* delayed write.
*/
if (flags & IO_SYNC) {
bwrite(bp);
} else {
if (bp->b_bufsize == fs->e2fs_bsize)
bp->b_flags |= B_CLUSTEROK;
bdwrite(bp);
}
*bpp = nbp;
return (0);
}
brelse(bp);
if (flags & BA_CLRBUF) {
int seqcount = (flags & BA_SEQMASK) >> BA_SEQSHIFT;
if (seqcount && (vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0) {
error = cluster_read(vp, ip->i_size, lbn,
(int)fs->e2fs_bsize, NOCRED,
MAXBSIZE, seqcount, 0, &nbp);
} else {
error = bread(vp, lbn, (int)fs->e2fs_bsize, NOCRED, &nbp);
}
if (error) {
brelse(nbp);
return (error);
}
} else {
nbp = getblk(vp, lbn, fs->e2fs_bsize, 0, 0, 0);
nbp->b_blkno = fsbtodb(fs, nb);
}
*bpp = nbp;
return (0);
}
The function is essentially one big if/else that handles the direct-block and indirect-block cases in two separate branches. Note that while data blocks are being allocated, the metadata (indirect) blocks have to be updated at the same time. getblk(...) is the core building block of bread(): getblk() locates or creates the buffer, and bread() additionally fills it from disk when the cached copy is not valid, which is why a freshly allocated block can be obtained with getblk() alone. The indirect-indexing algorithm itself I leave for you to read; I only worked out the author's approach after reading it repeatedly and testing it.
An earlier article raised the question of whether the metadata blocks used for indirect indexing are counted in the file size. My understanding is that they are not. We locate a file's logical block by dividing the file offset (or size) by the disk block size (adding one when counting from one); if the metadata blocks were included, there would be no way to tell whether a given file offset refers to content or to metadata (unless the arrangement of the file's blocks were strictly constrained). So in my view the metadata blocks only assist in locating data and should not be counted in the file size.
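One way to observe this on a live system: stat(2) reports both the logical size (st_size, which excludes metadata blocks) and the allocated storage (st_blocks, in 512-byte units, which on most file systems does include the indirect blocks). A small, file-system-agnostic check:

#include <stdio.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2 || stat(argv[1], &st) != 0) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return (1);
	}
	/*
	 * st_size counts only file content; st_blocks counts allocated
	 * storage, typically including indirect blocks, so for large
	 * files st_blocks * 512 can exceed the block-rounded st_size.
	 */
	printf("st_size   = %lld bytes\n", (long long)st.st_size);
	printf("st_blocks = %lld (x 512 bytes = %lld)\n",
	    (long long)st.st_blocks, (long long)st.st_blocks * 512);
	return (0);
}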
Conclusion
Only some of the functions are covered here; for the remaining details you will have to read the source yourself. This code is fairly dense and hard to follow, so I suggest getting hands-on and testing it, and keeping an eye out for bugs while you are at it (tongue firmly in cheek).