50 years in filesystems: A detour on vnodes

This is part 4 of a series.
- “1974” on the traditional Unix Filesystem.
- “1984” on the BSD Fast File System.
- “1994” on SGI XFS.
Progress is sometimes hard to see, especially when you have been part of it or otherwise lived through it. Often, it is easier to see by comparing modern educational material, and the problems it discusses, with older material. Or one can look for the research papers and sources that fueled the change. So this is what we do.
How to have multiple filesystems
Steve Kleiman wrote “Vnodes: An Architecture for Multiple File System Types in Sun UNIX” in 1986.
This is a short paper, with little text, because most of it is listings of data structures and diagrams of C language structs pointing at each other.
Kleiman wanted to have multiple file systems in Unix, but wanted the file systems to be able to share interfaces and internal memory, if at all possible.
Specifically, he wanted a design that provided
- one common interface with multiple implementations,
- supporting BSD FFS, but also NFS and RFS, two remote filesystems, and non-Unix filesystems, specifically MS-DOS,
- with the operations defined by the interface being atomic.
And the implementations should be able to handle memory and structures dynamically, without performance impact, while being reentrant, multicore capable, and somewhat object-oriented.
Two abstractions
He looked at the various operations, and decided to provide two major abstractions:
The filesystem (vfs, virtual filesystem) and the inode (vnode, virtual inode), representing a filesystem and a file.
In true C++ style (but expressed in C), we get a virtual function table for each type, representing the class:
- For the vfs type, this is a struct vfsops, a collection of function pointers with operations such as mount, unmount, sync and vget. Later in the paper, prototypes and functionality of these functions are explained.
- For the vnode type, likewise, we get struct vnodeops. The functions here are open, rdwr and close, of course, but also create, unlink, and rename. Some functions are specific to a file type, such as readlink, mkdir, readdir and rmdir.
Actual mounts are tracked in vfs objects, which point to the operations applicable to this particular subtree with their struct vfsops * pointer.
Similarly, open files are tracked in vnode instances, which again, among other things, have a struct vnodeops * pointer.
vnodes are part of their vfs, so they also have a struct vfs * pointer to their filesystem instance.
Both the vfs and the vnode need to provide a way for the implementation to store implementation-specific data (“subclass private fields”).
So both structures end with caddr_t ...data pointers.
That is, the private data is not part of the virtual structure, but located elsewhere and pointed to.
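The relationship between the two structures, their operation tables, and the private data pointer can be sketched in a few lines of C. This is a minimal user-space illustration, not the code from the paper: the field names (vfs_op, vfs_data, v_op, v_vfsp, v_data) are modeled loosely on the paper and should be treated as assumptions, void * stands in for caddr_t, the operation signatures are simplified, and toyfs and vn_read are hypothetical names invented for this example.

```c
#include <stddef.h>
#include <string.h>

struct vfs;
struct vnode;

/* "Virtual function table" of a filesystem (signatures simplified). */
struct vfsops {
    int (*vfs_mount)(struct vfs *vfsp, const char *path);
    int (*vfs_unmount)(struct vfs *vfsp);
    int (*vfs_sync)(struct vfs *vfsp);
};

/* "Virtual function table" of a file (only one operation shown). */
struct vnodeops {
    int (*vn_rdwr)(struct vnode *vp, char *buf, size_t len, int writing);
};

struct vfs {
    struct vfsops *vfs_op;   /* operations for this mounted filesystem */
    void *vfs_data;          /* implementation-private data */
};

struct vnode {
    struct vnodeops *v_op;   /* operations for this file */
    struct vfs *v_vfsp;      /* filesystem this vnode belongs to */
    void *v_data;            /* implementation-private data */
};

/* Hypothetical toy implementation: a file is a fixed in-memory string. */
struct toyfs_node {
    const char *contents;
};

static int toyfs_rdwr(struct vnode *vp, char *buf, size_t len, int writing)
{
    struct toyfs_node *np = vp->v_data;  /* recover the "subclass" fields */
    size_t n;

    if (writing || len == 0)
        return -1;                       /* read-only in this sketch */
    n = strlen(np->contents);
    if (n > len - 1)
        n = len - 1;                     /* leave room for the terminator */
    memcpy(buf, np->contents, n);
    buf[n] = '\0';
    return (int)n;
}

static struct vnodeops toyfs_vnodeops = { toyfs_rdwr };

/* Generic code: it only knows the interface, never the implementation. */
int vn_read(struct vnode *vp, char *buf, size_t len)
{
    return vp->v_op->vn_rdwr(vp, buf, len, 0);
}
```

A caller builds a struct vnode whose v_op points at toyfs_vnodeops and whose v_data points at a struct toyfs_node, and then only ever calls vn_read. Swapping in another filesystem means swapping the table and the private data, while the generic code stays the same.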
Vnodes in action
Kleiman sets out to explain how things work using the lookuppn() function, which replaces the older namei() function from traditional Unix.
Analogous to namei(), the function consumes a path name, and returns a struct vnode * to the vnode represented by that pathname.
Pathname traversal starts at the root vnode or the current directory vnode for the current process, depending on whether the first character of the pathname is / or not.
The function then takes the next pathname component, iteratively, and calls the lookup function for the current vnode.
This function takes a pathname component and the current vnode, which is assumed to be a directory.
It then returns the vnode representing that component.
If a directory is a mountpoint, it has vfsmountedhere set.
This is a struct vfs *. lookuppn follows the pointer, and can call the root function for that vfs to get the root vnode for that filesystem, replacing the current vnode being worked on.
The inverse must also be possible:
When resolving a “..” component and the current vnode has a root flag set in its “flags” field, we go from the current vnode to the vfs it belongs to, following the vnode’s own struct vfs * pointer (vfsmountedhere is set only on the covered directory, not on the root of the mounted filesystem).
Then we can use the vnodecovered field in that vfs to get the vnode of the superior filesystem.
In any case, upon successful completion, a struct vnode * representing the consumed pathname is returned.
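The traversal described above, including both directions of mountpoint crossing, can be sketched as a small user-space model. This is a simplified illustration under stated assumptions, not the algorithm as printed in the paper: the root vnode is stored directly in the vfs instead of being obtained through a root operation, the per-filesystem lookup operation is replaced by a hypothetical toy_lookup over a fixed child array, and the names toy_lookuppn, the VROOT flag value, and the v_ and vfs_ field prefixes are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

#define VROOT 0x01  /* vnode is the root of its filesystem (value assumed) */

struct vnode;

struct vfs {
    struct vnode *vfs_root;          /* toy: root vnode stored directly */
    struct vnode *vfs_vnodecovered;  /* directory this fs is mounted on */
};

struct vnode {
    const char *name;
    int v_flag;
    struct vfs *v_vfsp;              /* filesystem this vnode belongs to */
    struct vfs *v_vfsmountedhere;    /* set if something is mounted here */
    struct vnode *parent;            /* toy directory structure */
    struct vnode *children[4];
};

/* Toy stand-in for the per-filesystem lookup operation: resolves one
 * component inside one directory and knows nothing about mounts. */
static struct vnode *toy_lookup(struct vnode *dir, const char *comp)
{
    size_t i;

    if (strcmp(comp, "..") == 0)
        return dir->parent ? dir->parent : dir;
    for (i = 0; i < 4 && dir->children[i] != NULL; i++)
        if (strcmp(dir->children[i]->name, comp) == 0)
            return dir->children[i];
    return NULL;
}

/* Resolve a path one component at a time, crossing mountpoints. */
struct vnode *toy_lookuppn(struct vnode *root, struct vnode *cwd,
                           const char *path)
{
    struct vnode *vp = (path[0] == '/') ? root : cwd;
    char comp[64];

    while (*path != '\0') {
        size_t n;

        while (*path == '/')
            path++;
        n = strcspn(path, "/");
        if (n == 0)
            break;
        if (n >= sizeof comp)
            return NULL;
        memcpy(comp, path, n);
        comp[n] = '\0';
        path += n;

        /* ".." at the root of a mounted filesystem: step over to the
         * covered directory in the superior filesystem first. */
        while (strcmp(comp, "..") == 0 && (vp->v_flag & VROOT) &&
               vp->v_vfsp->vfs_vnodecovered != NULL)
            vp = vp->v_vfsp->vfs_vnodecovered;

        vp = toy_lookup(vp, comp);
        if (vp == NULL)
            return NULL;

        /* Landed on a mountpoint: continue at the mounted fs root. */
        while (vp->v_vfsmountedhere != NULL)
            vp = vp->v_vfsmountedhere->vfs_root;
    }
    return vp;
}
```

With a second filesystem mounted on a directory mnt, resolving /mnt yields the root vnode of the mounted filesystem rather than the covered directory, and resolving .. from that root leads back to the parent of mnt in the superior filesystem.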
New system calls
In order to make things work, and to make things work efficiently, a few new system calls had to be added to round out the interfaces.
It is here in Unix history that we get statfs and fstatfs, to get an interface to filesystems in userland.
We also gain getdirentries (plural) to get multiple directory entries at once (depending on the size of the buffer provided), which makes directory reading a lot faster for remote filesystems.
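As an illustration of what statfs offers to userland today, here is a small sketch that asks for the block size and free space of the filesystem containing a path. It assumes Linux with glibc, where statfs is declared in <sys/vfs.h> (BSD-derived systems declare it in <sys/mount.h>, and field types differ between platforms, hence the casts). The helper name print_fs_usage is made up for this example.

```c
#include <stdio.h>
#include <sys/vfs.h>   /* Linux/glibc; BSD and macOS use <sys/mount.h> */

/* Print block size and free blocks of the filesystem containing `path`.
 * Returns 0 on success and -1 if statfs() fails. */
int print_fs_usage(const char *path)
{
    struct statfs sfs;

    if (statfs(path, &sfs) != 0) {
        perror("statfs");
        return -1;
    }
    printf("%s: block size %ld, %llu of %llu blocks free\n",
           path,
           (long)sfs.f_bsize,
           (unsigned long long)sfs.f_bfree,
           (unsigned long long)sfs.f_blocks);
    return 0;
}
```

Calling print_fs_usage("/") reports on the root filesystem. The point is that the caller names a path, not a device, and the kernel dispatches to whichever filesystem implementation backs that path.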
In Linux
Looking at the Linux kernel source, we can find the general structure of Kleiman’s design, even if the complexity and richness of the Linux kernel obscure most of it. The Linux kernel has a wealth of file system types, and has also added a lot of functionality that wasn’t present in BSD 40 years ago. So we find a lot more structures and system calls, implementing namespaces, quotas, attributes, read-only modes, directory name caches, and other things.
The file
If you squint, the original structure can still be found:
Linux splits the in-memory structures around files in two: the opened file, which is an inode with a current position associated, and the inode, which is the whole file.
We find instances of file objects, struct file, defined here.
Among all the other things a file has, it has most notably a field loff_t f_pos, an offset (in bytes) from the start of the file, the current position.
The file’s class is defined via a virtual function table.
We find the pointer as struct file_operations *f_op, and the definition here.
It shows all the things a file can do, most notably open, close (called release in the table), lseek (called llseek), and then read and write.
The file also contains a pointer to the inode, struct inode *f_inode.
The inode
Operations on files that do not need an offset work on the file as a whole, represented as a struct inode.
Check the definition here. We see other definitions in here that have no equivalent in BSD from 40 years ago, such as ACLs and attributes.
We find the inode’s class is defined via a virtual function table, struct inode_operations *i_op, and the definition here.
Again, a lot of them deal with new features such as ACLs and extended attributes, but we also find those we expect such as link, unlink, rename and so on.
The inode also contains a pointer to a filesystem, the struct super_block *i_sb.
The superblock
A mountpoint is represented as an instance of struct super_block, defined here.
Again, the class is defined via a virtual function table, struct super_operations *s_op, defined here.
As an added complexity, there is no fixed list of filesystems.
It is instead extensible through loadable modules, so we also have a struct file_system_type, here.
This is basically a class with only one class method as a factory for superblocks, mount.
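How these pieces hang together can be shown with a small user-space model. This is deliberately not kernel code: every type and function carries a toy_ or hello_ prefix to mark it as hypothetical, the structures keep only the handful of fields discussed above, and the read signature is simplified. It is a sketch of the shape of the design, under those assumptions.

```c
#include <stddef.h>
#include <string.h>

struct toy_super_block;
struct toy_file;

/* One entry per filesystem type; mount() is the superblock factory. */
struct toy_file_system_type {
    const char *name;
    struct toy_super_block *(*mount)(struct toy_file_system_type *type);
    struct toy_file_system_type *next;
};

struct toy_inode {
    struct toy_super_block *i_sb;  /* filesystem this inode lives on */
    const char *data;              /* toy: whole-file contents */
    size_t i_size;
};

struct toy_super_block {
    struct toy_file_system_type *s_type;
    struct toy_inode *s_root;
};

struct toy_file_operations {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    struct toy_inode *f_inode;               /* the whole file */
    const struct toy_file_operations *f_op;  /* the file's "class" */
    size_t f_pos;                            /* current position */
};

/* Open-ended registry, standing in for loadable modules. */
static struct toy_file_system_type *toy_fs_list;

void toy_register_filesystem(struct toy_file_system_type *t)
{
    t->next = toy_fs_list;
    toy_fs_list = t;
}

/* Look up a type by name and ask its factory for a superblock. */
struct toy_super_block *toy_mount(const char *name)
{
    struct toy_file_system_type *t;

    for (t = toy_fs_list; t != NULL; t = t->next)
        if (strcmp(t->name, name) == 0)
            return t->mount(t);
    return NULL;
}

/* Read from the inode's data at f_pos and advance the position. */
long toy_generic_read(struct toy_file *f, char *buf, size_t len)
{
    struct toy_inode *ino = f->f_inode;
    size_t left;

    if (f->f_pos >= ino->i_size)
        return 0;
    left = ino->i_size - f->f_pos;
    if (len > left)
        len = left;
    memcpy(buf, ino->data + f->f_pos, len);
    f->f_pos += len;
    return (long)len;
}

/* A hypothetical filesystem type with a single fixed file. */
static struct toy_inode hello_root;
static struct toy_super_block hello_sb;

static struct toy_super_block *hello_mount(struct toy_file_system_type *type)
{
    hello_sb.s_type = type;
    hello_sb.s_root = &hello_root;
    hello_root.i_sb = &hello_sb;
    hello_root.data = "hello, vfs";
    hello_root.i_size = strlen(hello_root.data);
    return &hello_sb;
}

struct toy_file_system_type hello_fs_type = { "hellofs", hello_mount, NULL };
const struct toy_file_operations hello_fops = { toy_generic_read };
```

After toy_register_filesystem(&hello_fs_type), a call to toy_mount("hellofs") returns a superblock whose root inode points back at it. An open file is then just a struct toy_file with f_inode, f_op and f_pos set, and two files opened on the same inode keep independent positions.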
Summary
Unix changed. It became a lot more runtime extensible, added a lot of new functionality and gained system calls. Things became more structured.
But the original design and data structures conceived by Kleiman and Joy held up, and can still be found in current Linux, 40 years later. We can point to concrete Linux code which, while looking completely different, structurally mirrors the original design ideas.