Buffer cache
Required reading: I/O-lite.
Overview
The design of a buffer management system impacts the performance of
servers. For example, consider the design of a Web server on V6
(assuming that V6 had networking support). An HTTP get request goes
through the following subsystems:
- The network system demultiplexes the message and delivers it to
user space, copying the message from kernel to user space
(perhaps requiring a copy from network card to kernel).
- The Web server looks in its cache for the requested URL. If it is
there, it sends the data back to the client, which involves copying
the data from user space into kernel buffers.
- The network system in the kernel hands the buffers to the network
card, which may involve a copy. The network system doesn't release
the buffers until the data has been acknowledged by the client.
- If the data is not in the Web server's cache, the server reads
the file from the file system.
- The file system calls into the file system buffer cache, which
copies the data from the disk into the buffer cache. The file system
copies the data into the Web server's address space.
- The Web server then copies the data into kernel buffers so that
the network system can send the data to the client.
If we look at this picture we see that the data is in 3 places:
file system cache, Web server cache, and network buffers. We also see
that the data is copied 3-4 times: disk to kernel, kernel to the
server's address space, back into the kernel, and onto the network
(which may involve a copy to the network card). The two necessary
copies are from disk to kernel and from kernel to network.
IO-lite
IO-lite is a unified buffer and cache system. Its design allows
applications to avoid redundant copies and multiple buffering, and enables
cross-subsystem optimizations (e.g., caching Internet checksum).
The key ideas:
- Buffers are an integral number of virtually contiguous VM pages.
This decision allows pages to be mapped into different address spaces
easily.
- Buffers are immutable. This decision allows buffers to be safely
shared. I/O-lite makes a copy of a buffer when a buffer is written
(perhaps optimizing the copy by creating just a new slice for the new
data). This immutable strategy is also called copy on write (COW).
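The copy-on-write behavior can be sketched in a few lines of C. This is a toy model, not IO-Lite's implementation; the names and the simple refcount scheme are illustrative assumptions:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of an immutable, refcounted buffer.  A "write" never
   mutates a shared buffer in place; if refcnt > 1, it copies first,
   which is exactly copy-on-write. */
typedef struct {
    char  *data;
    size_t len;
    int    refcnt;
} Buf;

Buf *buf_new(const char *src, size_t len) {
    Buf *b = malloc(sizeof *b);
    b->data = malloc(len);
    memcpy(b->data, src, len);
    b->len = len;
    b->refcnt = 1;
    return b;
}

Buf *buf_share(Buf *b) { b->refcnt++; return b; }

/* Write len bytes at offset off; returns the buffer holding the
   result, which is a fresh copy when b was shared. */
Buf *buf_write(Buf *b, size_t off, const char *src, size_t len) {
    if (b->refcnt > 1) {            /* shared: copy before writing */
        b->refcnt--;
        b = buf_new(b->data, b->len);
    }
    memcpy(b->data + off, src, len);
    return b;
}
```

A subsystem that cached the original buffer (or its checksum) is unaffected by the write, because the original bytes never change.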
- Buffers are aggregated into buffer aggregates (see figure 1).
This decision allows buffers to be added (e.g., when a buffer needs to
be modified), concatenated, headers to be appended, etc.
- I/O-lite passes buffer aggregates by value, but buffers by
reference. All buffers are allocated in the IO-lite window, which is
mapped into all address spaces. This decision avoids copying buffers
and allows sharing among subsystems.
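The aggregate/buffer split can be made concrete with a toy C representation (the Slice/Agg names and fixed-size array are assumptions, not IO-Lite's actual data structures): the aggregate is a small value holding (buffer, offset, length) triples, so passing it by value and concatenating never touches the data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A slice names a range of an underlying buffer by reference. */
typedef struct { char *buf; size_t off, len; } Slice;

/* An aggregate is a small value: an array of slices. */
typedef struct {
    Slice slices[8];   /* fixed size keeps the sketch simple */
    int   nslices;
} Agg;

/* Appending copies only the (pointer, offset, length) triple;
   the buffer bytes are untouched. */
void agg_append(Agg *a, char *buf, size_t off, size_t len) {
    a->slices[a->nslices++] = (Slice){ buf, off, len };
}

size_t agg_length(const Agg *a) {
    size_t n = 0;
    for (int i = 0; i < a->nslices; i++) n += a->slices[i].len;
    return n;
}

/* Concatenation builds a new aggregate value; no data is copied. */
Agg agg_concat(const Agg *x, const Agg *y) {
    Agg r = *x;                        /* aggregates are values */
    for (int i = 0; i < y->nslices; i++)
        r.slices[r.nslices++] = y->slices[i];
    return r;
}
```

Prepending a protocol header, as the network system does below, is just a concatenation of a one-slice aggregate with the data aggregate.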
- The cache is divided into pools of buffers with the same access
control list. This decision allows pages to stay mapped with the same
privileges and to be reused safely without having to change VM
mappings.
- A new API for the buffer cache; its calls take an aggregate as
argument. For example, read returns an aggregate and write takes an
aggregate. Other calls include: concatenate, split, length,
duplicate, seek, tell, mmap, etc. (How are v6 read and write
implemented in terms of these calls? Why is this interface better?
Answer: application can avoid copies because IO-lite can choose the
address where the data will appear.)
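One way to see the difference: a v6-style read can be layered on top of an aggregate-returning read by flattening the slices into the caller's buffer. The sketch below assumes the toy Slice/Agg layout from above; the extra memcpy it performs is precisely the copy the aggregate interface lets applications skip, because IO-Lite (not the caller) chooses where the data appears.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { const char *buf; size_t off, len; } Slice;
typedef struct { Slice slices[8]; int nslices; } Agg;

/* v6-style read over an aggregate: copy up to n bytes of the
   aggregate's slices into the caller-supplied buffer dst.
   Returns the number of bytes copied. */
size_t flatten(const Agg *a, char *dst, size_t n) {
    size_t done = 0;
    for (int i = 0; i < a->nslices && done < n; i++) {
        size_t take = a->slices[i].len;
        if (take > n - done) take = n - done;
        memcpy(dst + done, a->slices[i].buf + a->slices[i].off, take);
        done += take;
    }
    return done;
}
```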
Together these ideas allow a buffer to be mapped safely, using the
VM system, into multiple address spaces and shared efficiently among
multiple subsystems.
Pseudocode for read:
read(fd, Agg **agg) {  // assume 8K read from the file system
  (fileid, offset) = fd->(fileid, offset);
  cache_agg = lookup_cache(fileid, offset);
  if (!cache_agg) {
    allocate cache_agg in the cache;
    allocate buffer(s) from the appropriate pool;
    read data into the buffer(s) from the disk;
    insert cache_agg into the cache;
  }
  if (buffers not mapped into process) map buffers into the process's space;
  copy cache_agg into *agg;
}
Pseudocode for write:
write(fd, Agg **agg) {  // assume 8K write to the file system
  (fileid, offset) = fd->(fileid, offset);
  cache_agg = lookup_cache(fileid, offset);
  update(cache_agg, *agg);
}
update(cache_agg, *agg) {
  There are three cases:
  (1) The write modifies every word in the file.
  (2) Only a subset of the file is modified by this write.
  (3) Enough words are modified that making a complete copy of the
      cached file buffer is less expensive than the fragmentation
      overhead IO-Lite would pay by not copying.
  In (1) and (3), allocate new buffer(s) and write the changes to
  them; in (3) also copy over the unchanged data.
  In (2), store the new values in a newly allocated buffer, then
  combine the new and old values by creating a new buffer aggregate
  that reflects the logical layout of the file.
  Finally, decrease the refcnt of any buffers freed up by the update
  of cache_agg.
}
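Case (2) is the interesting one. A sketch in C, again assuming the toy Slice/Agg layout (not IO-Lite's actual structures): overwrite bytes [pos, pos+len) of a file whose cached copy is one old buffer by building a new aggregate of old prefix, new data, old suffix, leaving the old immutable buffer untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { const char *buf; size_t off, len; } Slice;
typedef struct { Slice slices[8]; int nslices; } Agg;

/* Partial update without copying: the returned aggregate reflects
   the file's logical layout; the old buffer's bytes never change. */
Agg partial_update(const char *old, size_t oldlen,
                   const char *new_data, size_t pos, size_t len) {
    Agg a = { .nslices = 0 };
    if (pos > 0)
        a.slices[a.nslices++] = (Slice){ old, 0, pos };          /* old prefix */
    a.slices[a.nslices++] = (Slice){ new_data, 0, len };         /* new bytes  */
    if (pos + len < oldlen)
        a.slices[a.nslices++] = (Slice){ old, pos + len,
                                         oldlen - (pos + len) }; /* old suffix */
    return a;
}
```

The cost of this strategy is fragmentation: many small writes produce aggregates with many slices, which is why case (3) falls back to copying.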
Paper discussion
How is the Web server implemented using IO-lite?
- The network system demultiplexes the message into an aggregate and
delivers it to the Web server. (How does the network subsystem know
from which pool to allocate the buffers? Answer: early
demultiplexing, which is implemented using packet filters.)
- The Web server looks in its cache for the requested URL. If
present, it passes the aggregate to the network subsystem; the
pointers to the data are passed by reference. (What happens if the
server modifies these buffers?)
- If the network system has sent this data earlier, it might have
kept a copy of the checksum of the data. It can safely do that
because buffers are immutable. (Are buffers uniquely named?)
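This optimization can be sketched in C. The one-entry cache and the (address, generation) key are assumptions for illustration; the point is that immutability makes the cached checksum safe to reuse.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cache a checksum under a buffer's identity (address plus a
   generation number).  Because buffers are immutable, a hit means
   the bytes are unchanged and the cached sum is still valid. */
typedef struct { const void *buf; unsigned gen; unsigned sum; int valid; } CkEntry;

static CkEntry cache_entry;   /* one-entry cache for the sketch */

unsigned checksum(const unsigned char *p, size_t n) {
    unsigned s = 0;
    for (size_t i = 0; i < n; i++) s += p[i];   /* toy sum, not Internet checksum */
    return s;
}

unsigned cached_checksum(const void *buf, unsigned gen, size_t n, int *hit) {
    if (cache_entry.valid && cache_entry.buf == buf && cache_entry.gen == gen) {
        *hit = 1;
        return cache_entry.sum;
    }
    *hit = 0;
    cache_entry = (CkEntry){ buf, gen, checksum(buf, n), 1 };
    return cache_entry.sum;
}
```

The generation number matters because a buffer's pages may be reused for new data; the address alone does not name the contents uniquely.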
- The network system prepends a header to the aggregate and DMAs
the aggregate onto the network. (Why is this important? What does
this assume about the network card?)
- If the data is not in the Web server's cache, the server reads
the file from the file system, which issues an IOL_read.
- I/O-lite copies the data from the disk into a buffer in the
appropriate pool. The file system passes the aggregate to the
server, which may involve mapping the buffer into the server's address
space (if there was no free buffer that was already mapped).
- Then the server hands the aggregate to the network system, which
does its job as before.
How many copies? How many bufferings? What cross-subsystem
optimizations?
How do IO-lite and paging interact?
What is the page replacement policy?
Is I/O-lite worth it? (See figures 3 and 4.) Is it worth it for
small pages? What limits the performance of I/O-lite? How about Flash
and Apache?
How are CGI-bin apps implemented? Is it worth it? (See figures 5
and 6.)
Is double buffering a problem? (See figure 7).
Is there any other app than a Web server that benefits? (See figure 8.)