Tue Mar 30: Eliminating Receive Livelock in an Interrupt-driven Kernel, 1995, USENIX.
Suppose we have an NFS client running the following code:
while (1) {
    send NFS_READ RPC;
    wait for response;
}
Is the NFS server likely to livelock? What happens as we increase the number
of clients running this loop?
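One way to make the question concrete is a toy simulation (my own sketch, not from the paper): each client is closed-loop, so it never has more than one RPC outstanding, which bounds the server's backlog by the number of clients.

```python
# Toy closed-loop NFS clients (illustrative sketch only; names are made up).
# Each client sends one NFS_READ, then blocks until the reply arrives,
# so the server never sees more than NUM_CLIENTS pending requests.
import threading, queue

server_queue = queue.Queue()   # pending RPCs at the "server"
outstanding = 0                # requests currently in flight
peak = 0                       # high-water mark of outstanding requests
lock = threading.Lock()

def client(n_requests):
    global outstanding, peak
    for _ in range(n_requests):
        with lock:
            outstanding += 1
            peak = max(peak, outstanding)
        reply = queue.Queue(maxsize=1)
        server_queue.put(reply)        # send NFS_READ RPC
        reply.get()                    # wait for response
        with lock:
            outstanding -= 1

def server():
    while True:
        reply = server_queue.get()
        if reply is None:              # shutdown sentinel
            return
        reply.put("data")              # service the read

NUM_CLIENTS = 8
srv = threading.Thread(target=server)
srv.start()
clients = [threading.Thread(target=client, args=(50,)) for _ in range(NUM_CLIENTS)]
for c in clients:
    c.start()
for c in clients:
    c.join()
server_queue.put(None)
srv.join()
print("peak outstanding RPCs:", peak)
```

Because the clients self-pace, adding clients raises the offered load only linearly, and each client's next request is gated on the server making progress.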
Thu Apr 1: Hypervisor-based
Fault Tolerance, SOSP 1995.
Suppose the primary fails due to a power failure, the backup takes
over, and then power to the primary is restored. You would like the
primary to resume replicated execution so that the system could
tolerate a subsequent failure of the backup. Is this likely to
be easy with hypervisor-based fault-tolerance? Why or why not?
Tue Apr 6: Hive: Fault Containment for
Shared-Memory Multiprocessors, SOSP 1995.
The Hive firewall hardware on a node selectively protects that node's
physical memory from writes by other nodes. The Hive designers could
instead have built firewall hardware associated with each node CPU
that controlled which areas of memory (on other nodes) the node was
allowed to write. Why do you suppose the Hive designers chose to have
the firewall act on incoming writes to memory rather than on outgoing
writes from each CPU?
Thu Apr 8: TreadMarks: Distributed
Shared Memory on Standard Workstations and Operating Systems, USENIX 1994.
Suppose that a simplified version of TreadMarks, called Dreadmarks,
simply sent all modifications of variables between an acquire and a
release to the next processor to acquire the same lock. Dreadmarks
doesn't otherwise communicate any changes to memory. Outline a
specific simple situation in which TreadMarks would provide more
useful or intuitive memory behavior than Dreadmarks.
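As a starting point for thinking about the question, here is a toy model (my own construction, not from either paper) of the Dreadmarks rule: only the writes made inside the previous holder's critical section travel with the lock. The processor names and scenario are hypothetical.

```python
# Toy model of Dreadmarks lock-transfer semantics (illustrative only).
class Lock:
    def __init__(self):
        self.pending = {}      # writes shipped to the next acquirer

class Processor:
    def __init__(self, name):
        self.name = name
        self.mem = {}          # this processor's view of shared memory
        self.writes = {}       # writes made in the current critical section

    def acquire(self, lock):
        # Dreadmarks: apply only the previous holder's critical-section writes.
        self.mem.update(lock.pending)
        self.writes = {}

    def write(self, var, val):
        self.mem[var] = val
        self.writes[var] = val

    def release(self, lock):
        lock.pending = dict(self.writes)  # only *this* section's writes travel

L = Lock()
p1, p2, p3 = Processor("P1"), Processor("P2"), Processor("P3")

p1.acquire(L); p1.write("x", 1); p1.release(L)
p2.acquire(L); p2.release(L)     # P2 sees x=1 but writes nothing
p3.acquire(L)                    # what does P3 see?

print("P2 sees x =", p2.mem.get("x"))
print("P3 sees x =", p3.mem.get("x"))
```

Tracing what each processor observes under this rule, versus what a programmer passing a lock around would expect, is one way to approach the question.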
Tue Apr 13: Optimistic Replication Using
Vector Time Pairs.
The paper's Section 3.3 says that replicas need not keep deletion
records. Suppose that replica R1 creates file F, then synchronizes
with R2, then R1 deletes its copy of the file and discards the
deletion notice, then R2 modifies its copy of the file, and then R1
and R2 synchronize again. What reasoning will R1 use to conclude that
there is a conflict?
Thu Apr 15: Group Communication
in Amoeba and its Applications, 1993.
On page 6 the paper says that node failure events are totally
ordered along with the ordinary multicast messages. The Bullet server
probably relies on this fact: how?
Thu Apr 22: Recovery Management in QuickSilver, 1988.
The paper seems long but that's mostly because of the way it's typeset.
You can skip Section 4.3 and Section 5.
Section 4.1 says that the one-phase commit protocol is used by servers that
maintain only volatile state. What additional action(s) must servers
with non-volatile (persistent) state take that require them to
participate in two-phase commit?
Tue Apr 27: Frangipani: A Scalable Distributed File System, 1997.
Suppose a server modifies an i-node and appends the modification to
its log; then another server modifies the same i-node; then the first
server crashes, and the recovery system must decide whether to replay
the first server's log entry. Replaying it would clearly be bad, since
it would undo the second change. Can this situation arise? What does
Frangipani do to deal with it?
Tue May 4: Freenet:
A Distributed Anonymous Information Retrieval System, 2000.
The second paragraph of Section 3.5 describes a complex mechanism to
choose the key under which a new node is stored in other nodes'
routing tables. Why can't the new node just choose a random key itself
and tell the other nodes that key? What might go wrong if this were
the way that Freenet worked?
Thu May 6: Democratizing
content publication with Coral, 2004.
The paper mentions (Sections 2.3 and 4.2) that Coral's "sloppy"
DHT can store multiple different values for each key. What would
break, or not work as well, if Coral's DHT only stored one
value for each key?
Tue May 11: Untangling
the Web from DNS, 2004.
Section 3.3 says that the SFRTag is a hash of a public key and a
salt. What is the salt for? What would go wrong, or be awkward,
if the SFRTag were the hash of a public key alone?
Thu May 13: Hybrid Global-Local
Indexing for Efficient Peer-to-peer Information Retrieval, 2004.
Suppose the Web contains 3 billion pages, and the average page
contains 1000 words. Roughly how much total storage would eSearch
require to hold an index of the Web? Assuming hosts with 100 GB disks,
how many Engine nodes would eSearch need to hold this data?
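The arithmetic can be organized as a back-of-the-envelope template. The page count, words per page, and disk size come from the question; the bytes-per-term and replication-factor parameters below are assumptions for illustration, not the paper's numbers.

```python
# Back-of-the-envelope index sizing (parameter values are assumptions).
PAGES = 3_000_000_000          # 3 billion pages, from the question
WORDS_PER_PAGE = 1000          # from the question
BYTES_PER_TERM = 8             # assumed: average bytes to store one term
REPLICATION = 20               # assumed: copies kept of each page's term list
DISK_BYTES = 100 * 10**9       # 100 GB disks, from the question

total_bytes = PAGES * WORDS_PER_PAGE * BYTES_PER_TERM * REPLICATION
nodes = total_bytes / DISK_BYTES
print(f"total index: {total_bytes / 10**12:.0f} TB, nodes needed: {nodes:,.0f}")
```

Plugging in the paper's actual per-term cost and replication factor in place of the assumed values gives the answer the question is after.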