File systems

Required reading: Chapter 18, 19, and chapter 20 (from "Resource Allocation", pp. 20-4)

Overview

Users desire to store their data persistently so that data survives when the user turns of his computer. The primary media for doing so are: magnetic disks, flash memory, and tapes. We focus on magnetic disks (e.g., the RK in v6).

To help users to remember where they stored their data, most systems allow users to assign their own names to their data. Typically the data is organized in files and users assign names to files. To deal with many files, users can organize their files in directories, in a hierarchical manner. To avoid that users have to type long abolute names (e.g., in UNIX absolute names start with "/"), users can change their working directory and use relative names .

In v6, File namespace operations include create, link, unlink, and chdir; later UNIXes added rename, mkdir, etc. (How is "mv a b" implemented in v6? Answer: "link a b"; "unlink "a".) To be able to name the current directory and the parent directory every directory includes two entries "." and "..".

A user grafts new file systems on a name space using mount. Umount removes a file system from the name space. (In DOS, a file system is named by its device letter.)

The data in a file can be organized in a structured way or not. The structured variant is often called a database. UNIX uses the unstructured variant: files are streams of bytes. Any particular structure is likely to be useful to only a small class of applications, and other applications will have to work hard to fit their data into one of the pre-defined structures. Besides, if you want structure, you can easily write a user-mode library program that imposes that format on any file. The end-to-end argument in action.

A key issue is the semantics of an update. If a program updates a file, when are the effects of the update persistent? In particular, what happens if the logical update requires writing multiple disks blocks and the power fails during the update. Can it happen that an update happens only partially? To solve this problem correctly one needs atomic operations with respect to failures.

Another key issue is what the semantics are of concurrent writes to the same data item. What is the order of two updates that happen at the same time? To solve this problem correctly one needs atomic operations with respect to concurrency.

In v6, a file is a byte stream. The v6 file calls are open, read, write, seek, close, and stat. Dup duplicates a file descriptor. Data for a file is reclaimed if it doesn't appear in the name space anymore. Each file descriptor has a file offset. Read/write move it implicitly, seek moves it explicitly. Two processes that share a file because one process is the child (and inherited an open file descriptor from its parent) will share an offset pointer into the file: their bytes will be ordered serially in some order; in practice, one write after the other. Two processes that open the same file will each have their own offset into the file and may overwrite each other's data.

A final key issue is performance, because writing to magnetic disk is relatively expensive compared to computing. (This is true for the PDP11/RK6 too.) Two primary ways to improve performance are: careful file system layout that induces few seeks and overlap I/O with computation so that file operations don't have to wait until their completion and so that that the disk driver has more data to write, which allows disk scheduling etc. An additional method is to cheat on the semantics of modifications, and give up on atomicity.

V6 follows the last two methods. It doesn't pay attention to file system layout. It overlaps computation and I/O, but doesn't do any disk scheduling. It cheats on semantics: when one closes a file, some of its data may not have been written to disk yet, while other blocks may have been written---v6 is more careful with i-nodes (probably, mostly accidently). Later in some of the papers, we will see how one can do better.

V6 code examples

On disk files are represented by an inode (ino.h), and blocks. Small files have 8 direct pointers; in large files the eight pointers are pointers to a block of pointers (see bmap, 6415). Files are limited to 2^15 blocks. Directories are files with structure: 14-byte name plus 2-byte inode number. In memory files are represented by a file structure (struct file) and an in-core inode (inode.h).

Name space.
- iget. get an inode.
- 7285: check incore inode table.
- 7292--7297: deal with mounted files systems. For example, "mount /dev/rk0 /mnt/", will create an inode with IMOUNT for /mnt, which will point to the inode for /dev/rk.
- 7316: does this lock work on multiprocessors? (Answer: no.)
- 7319: read block that contains the inode.
- 7330: copy the inode on disk into the in-core inode structure.
- namei converts pathnames into a pointer to an in-core inode. does the function win a prize for elegance?
- 7531. Where is the current working directory maintained? (Answer: struct user.) Why is the current directory maintained in the kernel? (Answer: no particular reason, because the current directory could just be another open file descriptor passed to the process on startup (like the STDIN, STDOUT and STDERR are).
- 7534: lock inode!
- 7535: ////a////b is allowed!
- 7571: which characters are allowed in a directory name?
- Why are names restricted to 14 (DIRSIZE) bytes? (Answer: 16 bytes is a convenient directory entry size: 14 bytes name, 2 bytes inode number. With a block size of 512 bytes directory entries don't span blocks.)
- 7622: directory block is read with bread(). bmap converts the logical block number into a physical block number. Another process may get scheduled now, but bread blocks to wait for the disk read to complete.
- 7763: call iput to release lock!
- 7664: read the inode for the matched entry.
- The directory tree is acyclic; links to directories aren't allowed, except for super users! (see line 5921). This allows reference counters to be used, which is implemented by nlink. Link increases it. Unlink decreases it. Iput checks it (and frees blocks, if necessary).
- Link, unlink, chdir, mount, umount could have taken file descriptors instead of their path argument. In fact, this would get rid of some possible race conditions (some of which have security implications, TOCTTOU). However, this would require that the current working directory be remembered by the process, and UNIX didn't have good ways of maintaining static state shared among all processes belonging to a given user. The easiest way is to create shared state is to place it in the kernel.
  Example of TOCTTOU:
```
    Root                         Attacker
                                 mkdir ("/tmp/etc")
				 creat ("/tmp/etc/passw")
    readdir ("tmp");
    lstat ("tmp/etc");
    readdir ("tmp/etc");
                                 rename ("tmp/etc", "/tmp/x");
				 symlink ("etc", "/tmp/etc");
    unlink ("tmp/etc/passwd");
```
  Lstat checks whether /tmp/etc is not symbolic link, but by the time it runs unlink the attacker had time to creat a symbolic link in the place of /tmp/etc.
File operations.
- Maintaining the file offset (f_offset in struct file) behind the read/write interface is a controversial design decision . The alternative is that the state of a read operation should be maintained by the process doing the reading (i.e., that the pointer should be passed as an argument to read). This argument is compelling in view of the UNIX fork() semantics, which clones a process which shares the file descriptors of its parent. A read by the parent of a shared file descriptor (e.g., stdin, changes the read pointer seen by the child). On the other hand the alternative would make it difficult to get "(data; ls) > x" right.
- rdwr, 5731. What is the test on line 5746 for? (Answer: check whether process is reading/writing to pipe; if so, call pipe read/write functions).
- Writei. On a small write, is the block first read? (Answer: yes, see 6303).
- What is the test on line 6309 for? Answer: if we have written a complete block, write it out asynchronously and release it; otherwise, write it out delayed, because with a bit of luck the next write will be to the same block.
- Readi. What is the test on line 6255? Answer: if we are reading sequentially, read a block ahead; otherwise, not.
- closef. Only on last close, call closei, which calls iput, which writes i-node using bwrite(), which waits for completion. We have to do so, because the incore inode table is not a true cache; as soon as no process is using the inode, v6 writes it out to the disk (e.g., instead of waiting until the replacement policy throws it out).
Disk layout. See bmap() and ialloc/alloc. No really smart layout done.
Overlapping I/O. Reading is done synchronous, except for read ahead. Writing is always done asynchronously, except for inodes (iupdate uses bdwrite).
Reliability. No semantics that the programmer can rely on. V6 calls Update() periodically to write out all dirty blocks. V6 writes inodes synchronously when no process is using the inode.
- Suppose a program creates a file "foo" on a rk disk, writes 1 byte (containing "a") to it, calls close, and the power fails right after the call to closef() on line 5854 completes. What are the possible states that the disk is in? More specifically:
- Is there an inode for "foo" stored on the disk? If so, what is its content? (Answer: yes, because fclose will call iupdate, which calls bwrite. the inode will have the content reflecting the one-byte write.)
- Is there a data block for "foo" stored on the disk? If so, what is its content? (Answer: perhaps. the data block may not have been written out at the time of the power failure, because partially-filled blocks are written using bdwrite. thus, the inode will list the data block number, pointing effectively to a block whose content has perhaps no relation to "foo".)
- Is there a directory entry for "foo"? Why? Why not? (Answer: perhaps; the inode of the updated directory is on disk, because iput will find i_count is 1, and write the inode using iupdate. the data block, however, might not be, because if the data block is partially filled it will be written using bdwrite. Note, however, that if the data block just happens to be full, the iwrite call in wdir will use bawrite, which will enter the write operation in the disk queue; then, the iput call in wdir will cause a write of the inode using bdwrite; bdwrite will first wait until the preceding operations in the queue have completed; thus, in this special case the data block will be on disk!)
- How does UNIX recover from this mess? When the operating starts again the sys admin runs programs (INUM, etc., which later were replaced by fsck) that fix up to disk to some consistent state, with the help of the sys admin. Of course, some recent changes may be lost (e.g., the byte in "foo").
File locking. No file locking at the application level. Protection bits can be used, but only by processes that don't run as superuser.