File systems
Required reading: Chapter 18, 19, and chapter 20 (from "Resource
Allocation", pp. 20-4)
Overview
Users desire to store their data persistently so that data survives
when the user turns of his computer. The primary media for doing so
are: magnetic disks, flash memory, and tapes. We focus on magnetic
disks (e.g., the RK in v6).
To help users to remember where they stored their data, most
systems allow users to assign their own names to their data.
Typically the data is organized in files and users assign names to
files. To deal with many files, users can organize their files in
directories, in a hierarchical manner. To avoid that users have to
type long abolute names (e.g., in UNIX absolute names start with "/"),
users can change their working directory and use relative names .
In v6, File namespace operations include create, link, unlink, and
chdir; later UNIXes added rename, mkdir, etc. (How is "mv a b"
implemented in v6? Answer: "link a b"; "unlink "a".) To be able to
name the current directory and the parent directory every directory
includes two entries "." and "..".
A user grafts new file systems on a name space using mount. Umount
removes a file system from the name space. (In DOS, a file system is
named by its device letter.)
The data in a file can be organized in a structured way or not.
The structured variant is often called a database. UNIX uses the
unstructured variant: files are streams of bytes. Any particular
structure is likely to be useful to only a small class of
applications, and other applications will have to work hard to fit
their data into one of the pre-defined structures. Besides, if you
want structure, you can easily write a user-mode library program that
imposes that format on any file. The end-to-end argument in action.
A key issue is the semantics of an update. If a program updates a
file, when are the effects of the update persistent? In particular,
what happens if the logical update requires writing multiple disks
blocks and the power fails during the update. Can it happen that an
update happens only partially? To solve this problem correctly one
needs atomic operations with respect to failures.
Another key issue is what the semantics are of concurrent writes to
the same data item. What is the order of two updates that happen at
the same time? To solve this problem correctly one needs atomic
operations with respect to concurrency.
In v6, a file is a byte stream. The v6 file calls are open, read,
write, seek, close, and stat. Dup duplicates a file descriptor. Data
for a file is reclaimed if it doesn't appear in the name space
anymore. Each file descriptor has a file offset. Read/write
move it implicitly, seek moves it explicitly. Two processes that
share a file because one process is the child (and inherited an open
file descriptor from its parent) will share an offset pointer into the
file: their bytes will be ordered serially in some order; in practice,
one write after the other. Two processes that open the same file will
each have their own offset into the file and may overwrite each
other's data.
A final key issue is performance, because writing to magnetic disk
is relatively expensive compared to computing. (This is true for the
PDP11/RK6 too.) Two primary ways to improve performance are: careful
file system layout that induces few seeks and overlap I/O with
computation so that file operations don't have to wait until their
completion and so that that the disk driver has more data to write,
which allows disk scheduling etc. An additional method is to cheat on
the semantics of modifications, and give up on atomicity.
V6 follows the last two methods. It doesn't pay attention to file
system layout. It overlaps computation and I/O, but doesn't do any
disk scheduling. It cheats on semantics: when one closes a file, some
of its data may not have been written to disk yet, while other blocks
may have been written---v6 is more careful with i-nodes (probably,
mostly accidently). Later in some of the papers, we will see how one
can do better.
V6 code examples
On disk files are represented by an inode (ino.h), and blocks.
Small files have 8 direct pointers; in large files the eight pointers
are pointers to a block of pointers (see bmap, 6415). Files are
limited to 2^15 blocks. Directories are files with structure: 14-byte
name plus 2-byte inode number. In memory files are represented by a
file structure (struct file) and an in-core inode (inode.h).
- Name space.
- iget. get an inode.
- 7285: check incore inode table.
- 7292--7297: deal with mounted files systems. For example, "mount
/dev/rk0 /mnt/", will create an inode with IMOUNT for /mnt, which will
point to the inode for /dev/rk.
- 7316: does this lock work on multiprocessors? (Answer: no.)
- 7319: read block that contains the inode.
- 7330: copy the inode on disk into the in-core inode structure.
- namei converts pathnames into a pointer to an in-core inode. does
the function win a prize for elegance?
- 7531. Where is the current working directory maintained? (Answer:
struct user.) Why is the current directory maintained in the kernel?
(Answer: no particular reason, because the current directory could
just be another open file descriptor passed to the process on startup
(like the STDIN, STDOUT and STDERR are).
- 7534: lock inode!
- 7535: ////a////b is allowed!
- 7571: which characters are allowed in a directory name?
- Why are names restricted to 14 (DIRSIZE) bytes? (Answer: 16 bytes
is a convenient directory entry size: 14 bytes name, 2 bytes inode
number. With a block size of 512 bytes directory entries don't span
blocks.)
- 7622: directory block is read with bread(). bmap converts the
logical block number into a physical block number. Another process may
get scheduled now, but bread blocks to wait for the disk read to
complete.
- 7763: call iput to release lock!
- 7664: read the inode for the matched entry.
- The directory tree is acyclic; links to directories aren't
allowed, except for super users! (see line 5921). This allows
reference counters to be used, which is implemented by nlink. Link
increases it. Unlink decreases it. Iput checks it (and frees blocks,
if necessary).
- Link, unlink, chdir, mount, umount could have taken file
descriptors instead of their path argument. In fact, this would get
rid of some possible race conditions (some of which have security
implications, TOCTTOU). However, this would require that the current
working directory be remembered by the process, and UNIX didn't have
good ways of maintaining static state shared among all processes
belonging to a given user. The easiest way is to create shared state
is to place it in the kernel.
Example of TOCTTOU:
Root Attacker
mkdir ("/tmp/etc")
creat ("/tmp/etc/passw")
readdir ("tmp");
lstat ("tmp/etc");
readdir ("tmp/etc");
rename ("tmp/etc", "/tmp/x");
symlink ("etc", "/tmp/etc");
unlink ("tmp/etc/passwd");
Lstat checks whether /tmp/etc is not symbolic link, but by the time it
runs unlink the attacker had time to creat a symbolic link in the
place of /tmp/etc.
- File operations.
- Maintaining the file offset (f_offset in struct file) behind the
read/write interface is a controversial design decision . The
alternative is that the state of a read operation should be maintained
by the process doing the reading (i.e., that the pointer should be
passed as an argument to read). This argument is compelling in view
of the UNIX fork() semantics, which clones a process which shares the
file descriptors of its parent. A read by the parent of a shared file
descriptor (e.g., stdin, changes the read pointer seen by the child).
On the other hand the alternative would make it difficult to get
"(data; ls) > x" right.
- rdwr, 5731. What is the test on line 5746 for? (Answer: check
whether process is reading/writing to pipe; if so, call pipe
read/write functions).
- Writei. On a small write, is the block first read? (Answer: yes,
see 6303).
- What is the test on line 6309 for? Answer: if we have written a
complete block, write it out asynchronously and release it; otherwise,
write it out delayed, because with a bit of luck the next write will
be to the same block.
- Readi. What is the test on line 6255? Answer: if we are reading
sequentially, read a block ahead; otherwise, not.
- closef. Only on last close, call closei, which calls iput, which
writes i-node using bwrite(), which waits for completion. We have to
do so, because the incore inode table is not a true cache; as soon as
no process is using the inode, v6 writes it out to the disk (e.g.,
instead of waiting until the replacement policy throws it out).
- Disk layout. See bmap() and ialloc/alloc. No really smart layout
done.
- Overlapping I/O. Reading is done synchronous, except for read
ahead. Writing is always done asynchronously, except for inodes
(iupdate uses bdwrite).
- Reliability. No semantics that the programmer can rely on. V6
calls Update() periodically to write out all dirty blocks. V6 writes
inodes synchronously when no process is using the inode.
- Suppose a program creates a file "foo" on a rk disk, writes 1
byte (containing "a") to it, calls close, and the power fails right
after the call to closef() on line 5854 completes. What are the
possible states that the disk is in? More specifically:
- Is there an inode for "foo" stored on the disk? If so, what is its
content? (Answer: yes, because fclose will call iupdate, which calls
bwrite. the inode will have the content reflecting the one-byte
write.)
- Is there a data block for "foo" stored on the disk? If so, what is
its content? (Answer: perhaps. the data block may not have been
written out at the time of the power failure, because partially-filled
blocks are written using bdwrite. thus, the inode will list the data
block number, pointing effectively to a block whose content has
perhaps no relation to "foo".)
- Is there a directory entry for "foo"? Why? Why not? (Answer:
perhaps; the inode of the updated directory is on disk, because iput
will find i_count is 1, and write the inode using iupdate. the data
block, however, might not be, because if the data block is partially
filled it will be written using bdwrite. Note, however, that if the
data block just happens to be full, the iwrite call in wdir will use
bawrite, which will enter the write operation in the disk queue; then,
the iput call in wdir will cause a write of the inode using bdwrite;
bdwrite will first wait until the preceding operations in the queue
have completed; thus, in this special case the data block will be on
disk!)
- How does UNIX recover from this mess? When the operating starts
again the sys admin runs programs (INUM, etc., which later were
replaced by fsck) that fix up to disk to some consistent state, with
the help of the sys admin. Of course, some recent changes may be lost
(e.g., the byte in "foo").
- File locking. No file locking at the application
level. Protection bits can be used, but only by processes that don't
run as superuser.