Lecture 19: Distributed File Systems

The objective is to have a single file system distributed across several machines.

Models of interactions with files

Read-only

Process P on machine M1 has read-only permission for files on machine M2. The only way P can affect the file system on M2 is indirectly, by sending a message to a process P2 running on M2. P2 can react to this message as it chooses.

(WWW model)

Advantages

Disadvantages

Implementation Issues

Downloading

Stateless vs. Stateful server

A stateless server has the advantages that:

A stateful server has the advantages that:

Notification

What happens if process P starts to read file F and F is modified while P has F open?

Replication

Copies of file F may exist on two or more servers, say M2 and M3. The system chooses which of these interacts with P, based on system load, network state, which machines are up, etc. If the server is stateless, the choice can change in the middle of a session; if the server is stateful, the choice is made when the file is opened.
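
The difference between the two cases can be sketched as follows. This is a minimal illustration, not a real protocol: the classes StatelessServer and StatefulServer, the in-memory file table, and the function pick_replica (which stands in for the system's choice of replica) are all hypothetical names introduced here.

```python
class StatelessServer:
    """Keeps no per-client state: every request names the file, offset, and length."""

    def __init__(self, name, files):
        self.name = name
        self.files = files  # path -> bytes (hypothetical in-memory store)

    def read(self, path, offset, length):
        return self.files[path][offset:offset + length]


class StatefulServer:
    """Remembers each open file and its current position on behalf of the client."""

    def __init__(self, name, files):
        self.name = name
        self.files = files
        self.open_files = {}  # handle -> [path, position]
        self.next_handle = 0

    def open(self, path):
        handle = self.next_handle
        self.next_handle += 1
        self.open_files[handle] = [path, 0]
        return handle

    def read(self, handle, length):
        path, pos = self.open_files[handle]
        data = self.files[path][pos:pos + length]
        self.open_files[handle][1] = pos + len(data)
        return data


def pick_replica(replicas, up):
    """Stand-in for the system's choice: the first replica that is up."""
    for server in replicas:
        if up[server.name]:
            return server
    raise RuntimeError("no replica available")


contents = {"/F": b"hello world"}

# Stateless: the client tracks the offset, so the replica may change between reads.
m2, m3 = StatelessServer("M2", contents), StatelessServer("M3", contents)
up = {"M2": True, "M3": True}
first = pick_replica([m2, m3], up).read("/F", 0, 5)
up["M2"] = False  # M2 goes down in the middle of the file
rest = pick_replica([m2, m3], up).read("/F", 5, 6)
stateless_result = first + rest

# Stateful: the position lives on the server chosen at open time,
# so the handle is meaningless to any other replica.
s2 = StatefulServer("M2", contents)
h = s2.open("/F")
stateful_result = s2.read(h, 5) + s2.read(h, 6)
```

In the stateless case the second read is served by M3 without either server noticing the switch. In the stateful case the handle h exists only in M2's table, which is why the replica has to be fixed when the file is opened.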

Problem of keeping replicas consistent.

Caching

The entire file or individual blocks may be cached in the server RAM, on the client disk, or in the client RAM (kernel).
Caching in the server RAM is simple and the only cost is the standard cost of maintaining a cache.
Caching in the client disk pretty much amounts to replication.
Caching in the client RAM can be useful if processes on the client often need this file (or block), but it raises the problem of currency: the cached copy may be out of date.
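
The currency problem can be illustrated with a small sketch. It assumes, purely for illustration, that the server attaches a version number to each file and that a client may ask for the current version before trusting its cached copy; the notes do not prescribe this scheme, and the names Server and ClientCache are hypothetical.

```python
class Server:
    """Hypothetical file server that bumps a version number on every write."""

    def __init__(self):
        self.files = {}  # path -> (version, data)

    def write(self, path, data):
        version = self.files.get(path, (0, b""))[0] + 1
        self.files[path] = (version, data)

    def version(self, path):
        return self.files[path][0]

    def read(self, path):
        return self.files[path]  # (version, data)


class ClientCache:
    """Client RAM cache; `validate` controls whether currency is checked."""

    def __init__(self, server, validate):
        self.server = server
        self.validate = validate
        self.cache = {}    # path -> (version, data)
        self.fetches = 0   # number of full reads sent to the server

    def read(self, path):
        cached = self.cache.get(path)
        if cached is not None:
            # Without validation the cached copy is used blindly.
            if not self.validate or cached[0] == self.server.version(path):
                return cached[1]
        self.fetches += 1
        self.cache[path] = self.server.read(path)
        return self.cache[path][1]


server = Server()
server.write("/F", b"old")
naive = ClientCache(server, validate=False)
checked = ClientCache(server, validate=True)
naive.read("/F")
checked.read("/F")

server.write("/F", b"new")    # another client modifies F
stale = naive.read("/F")      # served from the cache, now out of date
fresh = checked.read("/F")    # version mismatch forces a refetch
```

The validating client always sees current data, but it pays for a version check on every read, which gives back part of what the cache was supposed to save.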

Database model

Very important, because databases are a major application of distributed file systems. Uses the transaction model: each transaction must be executed atomically. Take a course on databases.

Readers/writers protocol

Each file is either being read by 1 or more readers or being written by exactly one writer.
Problematic in the context of a distributed file system, because if a writer crashes, the file is permanently locked. Of course, this can also happen in a single-processor system (a process can lock a resource and then go into an infinite loop), but there it is easier to detect and correct.
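
A minimal sketch of the protocol and of the crash problem, assuming a hypothetical lock server that keeps one FileLock object per file. For simplicity the acquire calls return True or False immediately rather than blocking.

```python
class FileLock:
    """Readers/writers state for one file: many readers or exactly one writer."""

    def __init__(self):
        self.readers = set()
        self.writer = None

    def acquire_read(self, client):
        if self.writer is not None:
            return False
        self.readers.add(client)
        return True

    def acquire_write(self, client):
        if self.writer is not None or self.readers:
            return False
        self.writer = client
        return True

    def release(self, client):
        self.readers.discard(client)
        if self.writer == client:
            self.writer = None


lock = FileLock()
r1 = lock.acquire_read("A")            # granted
r2 = lock.acquire_read("B")            # granted: readers can share
w_blocked = lock.acquire_write("C")    # refused while readers hold the file
lock.release("A")
lock.release("B")
w_granted = lock.acquire_write("C")    # granted once the readers are gone

# Client C now crashes and never calls release.
after_crash = lock.acquire_read("A")   # refused, and will be refused forever
```

Nothing in the lock state distinguishes a crashed writer from a slow one, so the lock server cannot safely release the file on its own.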

Allow multiple writers

Suppose that two processes on two different clients are writing to the same file simultaneously. The truth is that this kind of issue is quite specific to the kind of file involved, and you're not going to get a general solution at the OS level that always does what you want. For any given type of file, you have to look at what that type of file requires. It's rare that general notions like "sequential consistency", "session semantics", etc. end up being the best way to think about the issue.

On the other hand, the OS does have to have _some_ policy to deal with the case of a new type of file or of a file being accessed in some new kind of way. What's most important is that an application that does have a clear idea of what it wants to do in this regard should not find the OS's policy an unavoidable and intolerable obstacle.

Naming and directories

Method 1: machine-name:path (as in URLs) or /machine-name/path.
Note: The file has the same name on every machine.

Method 2: Different machines have different views of the file system. For example, a machine may have a pseudo-directory "external". The name on M1 of the file with path name P on machine M2 would be "/external/M2/P". When a program on M1 accesses this file, the file server knows that it should request the file from M2.
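
Methods 1 and 2 both amount to splitting a name into a machine and a path on that machine. A minimal sketch, assuming hypothetical function names and assuming that the path P begins with "/" so that "/external/M2" followed by P forms the full name:

```python
def resolve_method1(name):
    """Method 1: 'machine-name:path' -> (machine, path)."""
    machine, sep, path = name.partition(":")
    if not sep or not machine or not path.startswith("/"):
        raise ValueError(f"not a machine:path name: {name!r}")
    return machine, path


def resolve_method2(name, local_machine, prefix="/external/"):
    """Method 2: '/external/M2/P' -> ('M2', '/P'); any other name is local."""
    if name.startswith(prefix):
        machine, sep, rest = name[len(prefix):].partition("/")
        if not machine or not sep:
            raise ValueError(f"no path after machine name: {name!r}")
        return machine, "/" + rest
    return local_machine, name


a = resolve_method1("M2:/usr/data/f.txt")
b = resolve_method2("/external/M2/usr/data/f.txt", local_machine="M1")
c = resolve_method2("/usr/data/f.txt", local_machine="M1")
```

Under Method 1 the result does not depend on where the name is resolved. Under Method 2 it does: the same string "/usr/data/f.txt" refers to a different file on each machine, which is what "different views" means here.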

Method 3: Path name is entirely independent of machine. Supports transparent replication and migration. Directory implementations:

In either case, the same problems of consistency come up as with simple files.
Another question is whether you want to use the same server for directories and files or a different server.
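
Method 3 can be sketched as a directory service that maps a machine-independent path name to the machines currently holding a copy. The class DirectoryService, its methods, and the example path are hypothetical, and a single in-memory table stands in for whatever directory implementation is actually used.

```python
class DirectoryService:
    """Hypothetical directory server: location-independent path -> replica locations."""

    def __init__(self):
        self.locations = {}  # path -> list of machine names

    def register(self, path, machine):
        machines = self.locations.setdefault(path, [])
        if machine not in machines:
            machines.append(machine)

    def migrate(self, path, src, dst):
        """Move one copy from src to dst; the path name does not change."""
        machines = self.locations[path]
        machines[machines.index(src)] = dst

    def lookup(self, path, up):
        """Return the first machine holding the file that is currently up."""
        for machine in self.locations.get(path, []):
            if machine in up:
                return machine
        raise FileNotFoundError(path)


d = DirectoryService()
d.register("/proj/report", "M2")
d.register("/proj/report", "M3")

before = d.lookup("/proj/report", up={"M2", "M3"})
d.migrate("/proj/report", "M2", "M4")
after = d.lookup("/proj/report", up={"M3", "M4"})
failover = d.lookup("/proj/report", up={"M3"})
```

Clients use the name "/proj/report" throughout; only the directory's table changes when a copy migrates or a machine goes down. Keeping this table on its own server separates lookups from file traffic, at the cost of one more service whose contents must be kept consistent.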

Replication/Migration strategies