================ Start Lecture #16 ================

Chapter 4: File Systems

Requirements

Size: Store very large amounts of data.
Persistence: Data survives the creating process.
Access: Multiple processes can access the data concurrently.

Solution: Store data in files that together form a file system.

4.1: Files

4.1.1: File Naming

Very important. A major function of the file system.

Does each file have a unique name?
Answer: Often no. We will discuss this below when we study links.
Extensions, e.g. the ``html'' in ``class-notes.html''.
1. Conventions just for humans: letter.teq (my convention).
2. Conventions giving default behavior for some programs.
  - The emacs editor thinks .html files should be edited in html mode but
    can edit them in any mode and can edit any file in html mode.
  - Netscape thinks .html means an html file, but
    <html> ... </html> works as well
  - Gzip thinks .gz means a compressed file but accepts a --suffix flag
3. Required extensions for programs
  - The GNU C compiler (and probably others) expects C programs to be named *.c and assembler programs to be named *.s, unless the language is given explicitly (gcc's -x option).
4. Required extensions by operating systems
  - MS-DOS treats .com files specially
  - Windows 95 requires (as far as I can tell) shortcuts to end in .lnk.
Case sensitive?
Unix: yes. Windows: no.

4.1.2: File structure

A file is a

Byte stream
- Unix, dos, windows (I think).
- Maximum flexibility.
- Minimum structure.
(fixed size) Record stream: Out of date
- 80-character records for card images.
- 133-character records for line printer files. Column 1 was for control (e.g., new page); the remaining 132 characters were printed.
Varied and complicated beast.
- Indexed sequential.
- B-trees.
- Supports rapidly finding a record with a specific key.
- Supports retrieving (varying size) records in key order.
- Treated in depth in database courses.

4.1.3: File types

Examples

(Regular) files.
Directories: studied below.

Special files (for devices). Uses the naming power of files to unify many actions.

    dir             # prints on screen
    dir > file      # result put in a file
    dir > /dev/tape # results written to tape

``Symbolic'' Links (similar to ``shortcuts''): Also studied below.

``Magic number'': Identifies an executable file.

There can be several different magic numbers for different types of executables.
Unix example: a script whose first line is #!/usr/bin/perl is run by that interpreter.
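The magic-number idea can be sketched as a check of a file's first few bytes. This is only an illustration: the helper name classify_magic and its return strings are made up for these notes, and only two well-known prefixes are tested ("#!" for scripts and 0x7f 'E' 'L' 'F' for ELF executables).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: classify a file by its leading bytes.
 * Returns "script" for a "#!" prefix, "elf" for 0x7f 'E' 'L' 'F',
 * "unknown" otherwise, or NULL if the file cannot be opened. */
const char *classify_magic(const char *path)
{
    unsigned char buf[4] = {0};
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;
    size_t n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);

    if (n >= 2 && buf[0] == '#' && buf[1] == '!')
        return "script";
    /* The literal is split so that 'E' is not read as part of the hex escape. */
    if (n >= 4 && memcmp(buf, "\x7f" "ELF", 4) == 0)
        return "elf";
    return "unknown";
}
```

Note that the decision depends only on the file's contents, not on its name or extension.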

Strongly typed files:

The type of the file determines what you can do with the file.
This makes the easy and (hopefully) common case easier and, more importantly, safer.
It tends to make the unusual case harder. For example, suppose you have a program that produces data (.dat) files and you want to use it to produce a java file. The output has type data and cannot easily be converted to type java.

4.1.4: File access

There are basically two possibilities, sequential access and random access (a.k.a. direct access). Previously, files were declared to be sequential or random. Modern systems do not do this. Instead all files are random and optimizations are applied when the system dynamically determines that a file is (probably) being accessed sequentially.

With sequential access the bytes (or records) are accessed in order (i.e., n-1, n, n+1, ...). Sequential access is the most common and gives the highest performance. For some devices (e.g. tapes) access ``must'' be sequential.
With random access, the bytes are accessed in any order. Thus each access must specify which bytes are desired.
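The difference between the two access styles can be sketched with the POSIX calls read() and lseek(). The wrapper names read_next and read_at are hypothetical, chosen only for this illustration.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Sequential access: each read() continues where the previous one
 * stopped, so the caller never names a position. */
ssize_t read_next(int fd, void *buf, size_t n)
{
    return read(fd, buf, n);
}

/* Random access: the caller says which bytes are wanted by giving an
 * offset; lseek() moves the file position there before the read. */
ssize_t read_at(int fd, off_t offset, void *buf, size_t n)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, n);
}
```

Both functions work on the same open file, which matches the point above that modern systems do not declare a file to be sequential or random.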

4.1.5: File attributes

A laundry list of properties that can be specified for a file. For example:

hidden
do not dump
owner
key length (for keyed files)
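On a POSIX system some attributes can be read with stat(). The sketch below prints a few of them. The function name print_attributes is hypothetical, and attributes such as ``hidden'' or ``do not dump'' are not among the standard stat fields, so they do not appear here.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Print a few attributes the system keeps for a file.
 * Returns 0 on success, -1 if stat() fails. */
int print_attributes(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0) {
        perror(path);
        return -1;
    }
    printf("owner uid: %ld\n", (long)st.st_uid);
    printf("size:      %lld bytes\n", (long long)st.st_size);
    printf("mode:      %o\n", (unsigned)(st.st_mode & 07777));
    printf("regular:   %s\n", S_ISREG(st.st_mode) ? "yes" : "no");
    return 0;
}
```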

4.1.6: File operations

Create: Essential if a system is to add files. Need not be a separate system call (can be merged with open).
Delete: Essential if a system is to delete files.
Open: Not essential. An optimization in which the translation from file name to disk locations is performed only once per file rather than once per access.
Close: Not essential. Free resources.
Read: Essential. Must specify filename, file location, number of bytes, and a buffer into which the data is to be placed. Several of these parameters can be set by other system calls and in many OS's they are.
Write: Essential if updates are to be supported. See read for parameters.
Seek: Not essential (could be in read/write). Specify the offset of the next (read or write) access to this file.
Get attributes: Essential if attributes are to be used.
Set attributes: Essential if attributes are to be user settable.
Rename: Tanenbaum's wording is strange. Copy-and-delete is not acceptable for big files, and it is not atomic. Link-and-delete is not atomic either, so even if link (discussed below) is provided, rename adds functionality.
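One common use of an atomic rename, sketched under the assumption of a POSIX system where rename() replaces an existing target in a single step: write the new contents to a temporary file, then rename it over the original. The helper name replace_contents and its parameters are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: replace the contents of `path` with `text` by
 * writing `tmp_path` and renaming it over `path`.
 * Returns 0 on success, -1 on failure. */
int replace_contents(const char *path, const char *tmp_path, const char *text)
{
    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;
    if (fputs(text, fp) == EOF) {
        fclose(fp);
        remove(tmp_path);
        return -1;
    }
    if (fclose(fp) != 0) {          /* buffered data failed to reach the file */
        remove(tmp_path);
        return -1;
    }
    /* Assumption: POSIX semantics. Other processes see either the old
     * file or the new one under `path`, never a partial copy. */
    if (rename(tmp_path, path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}
```

A copy-then-delete version of the same update would leave a window in which a reader could see a half-written file.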

Homework: 2, 3, 4.
Read and understand ``copyfile'' on page 155.

Notes on copyfile

Normally in unix one wouldn't call read and write directly.
Indeed, for copyfile, getchar() and putchar() would be nice since they take care of the buffering (standard I/O, stdio).
Tanenbaum is correct that the error reporting is atrocious.
The worst is exiting the loop on error and thus generating an exit(0) as if nothing happened.
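A sketch of what the notes above suggest: a copy routine built on stdio (getc/putc handle the buffering) that reports which step failed instead of returning success after an error. This is not Tanenbaum's code; the name copy_file and the numeric status codes are made up for illustration.

```c
#include <stdio.h>

/* Copy `src` to `dst` through stdio buffers.
 * Returns 0 on success, or a nonzero code naming the failing step. */
int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    if (in == NULL)
        return 1;                       /* cannot open input */
    FILE *out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return 2;                       /* cannot create output */
    }

    int status = 0;
    int c;
    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF) {
            status = 3;                 /* write error */
            break;
        }
    }
    if (status == 0 && ferror(in))
        status = 4;                     /* loop ended on a read error, not end of file */

    fclose(in);
    if (fclose(out) != 0 && status == 0)
        status = 3;                     /* buffered data failed to flush */
    return status;
}
```

A caller would pass the returned status to exit(), so a failed copy no longer looks like a successful one.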

4.1.7: Memory mapped files

Conceptually simple and elegant. Associate a segment with each file and then normal memory operations take the place of I/O.

Thus copyfile does not use fgetc/fputc (or read/write). Instead it is just like memcpy:

while ( len-- > 0 ) *(dest++) = *(src++);   /* len = file length in bytes; testing for a 0 byte would stop early */

The implementation is via segmentation with demand paging but the backing store for the pages is the file itself. This all sounds great but ...

How do you tell the length of a newly created file? You know which pages were written but not which words in those pages, so a file of 1 byte or of 10 bytes looks like a full page.
What if the same file is accessed by both I/O and memory mapping?
What if the file is bigger than the size of virtual memory? (This will not be a problem for systems built 3 years from now, as all will have enormous virtual memory sizes.)
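For comparison, here is a sketch of a memory-mapped copy using the POSIX mmap() interface, which is one concrete realization of the idea and not necessarily the design the text has in mind. It sidesteps the length question by asking the file system for the input length (fstat) and setting the output length explicitly (ftruncate) before mapping. The function name mmap_copy is hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy src_path to dst_path by mapping both files and using memcpy.
 * Returns 0 on success, -1 on failure. */
int mmap_copy(const char *src_path, const char *dst_path)
{
    struct stat st;
    void *src, *dst;
    size_t len;
    int rc = -1;

    int in = open(src_path, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst_path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    if (fstat(in, &st) != 0)
        goto done;
    len = (size_t)st.st_size;
    if (len == 0) {                     /* zero-length mappings are not allowed */
        rc = 0;
        goto done;
    }
    /* Stores through a mapping cannot grow a file, so set the length first. */
    if (ftruncate(out, st.st_size) != 0)
        goto done;

    src = mmap(NULL, len, PROT_READ, MAP_PRIVATE, in, 0);
    if (src == MAP_FAILED)
        goto done;
    dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, out, 0);
    if (dst != MAP_FAILED) {
        memcpy(dst, src, len);          /* ordinary memory operations replace I/O */
        rc = (msync(dst, len, MS_SYNC) == 0) ? 0 : -1;
        munmap(dst, len);
    }
    munmap(src, len);
done:
    close(in);
    close(out);
    return rc;
}
```

The sketch maps each file in one piece, so it still runs into the last question above when the file does not fit in the address space.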