\input{myLectHead}
\setlength{\textwidth}{1.2\textwidth}
\newcommand{\dtt}[1]{{\tt #1}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\setcounter{lecture}{10}
\lecture{File Systems}
% from Lect 14 and 15 in 2006
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\lectpart{Introduction to File Systems}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bitem
\item
Computers have different kinds of memory, as typified by the memory
hierarchy.  Among these, secondary memory (popularly known as disk
memory) is the most familiar to users, because disk memory is
logically organized into a file structure that is largely determined
by the users themselves: they can create, modify and delete files in
this structure.  Disk memory is non-volatile, meaning that it retains
its information even after the computer is shut off.  Hence it serves
as the permanent repository of data in the computer.  In contrast,
main memory is volatile and must be loaded with information from disk
when the computer starts up.
\item
Physically, disk memory is distributed over some geometry that
depends on the particular device.  Each disk is a stack of platters
(each shaped like a CD) ranging from 1.8 to 5.25 inches in diameter.
Each platter has two surfaces.  Each surface is divided into discrete
memory units along independent dimensions: into concentric circular
tracks, and into sectors by radial lines.  The set of tracks at the
same radius, taken across all the surfaces, is called a \dt{cylinder}
(a somewhat confusing terminology).  The surfaces are coated with a
magnetic substrate, so that each unit of memory can be magnetized in
one of two orientations, representing a bit of information.  Thus
each memory unit has a physical address given by a triple (cylinder,
head, sector), where the head identifies the surface.  There is a
single arm carrying one read/write head per surface, and all the
heads move together, so they are positioned over the same cylinder
simultaneously.  Thus, if there are 8 surfaces, we can view the 8
bits (one per surface) that are read at an instant.
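The mapping between this geometry and a linear block address can be made concrete; here is a minimal sketch, assuming the conventional (cylinder, head, sector) scheme with invented, fixed geometry constants (real disks vary, and drivers query the device for its actual geometry):

```python
# Sketch: translating a linear sector number (LBA) to and from a
# (cylinder, head, sector) triple.  The geometry constants are
# hypothetical; sectors are numbered from 0 here for simplicity.
HEADS = 8             # surfaces, e.g. two per platter on a 4-platter disk
SECTORS_PER_TRACK = 63

def lba_to_chs(lba):
    """Map a linear sector number to a (cylinder, head, sector) triple."""
    cylinder, rem = divmod(lba, HEADS * SECTORS_PER_TRACK)
    head, sector = divmod(rem, SECTORS_PER_TRACK)
    return (cylinder, head, sector)

def chs_to_lba(cylinder, head, sector):
    """Inverse mapping back to a linear sector number."""
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + sector

assert chs_to_lba(*lba_to_chs(123456)) == 123456  # round trip
```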
\item
We will assume the addressable units of memory are byte-sized (=8
bits).  Instead of bytes, one could assume other units such as
\dt{words} (= 4 bytes).  Conceptually, {\em we regard physical memory
as a linear array of bytes, indexed from $0$ to some maximum value}.
In this linear array, the index of a byte is called its \dt{address}.
The view of disk memory as a linear array is a slight abstraction
above the geometry of the disk device.  There is a corresponding
problem of translating linear addresses into the device address space
(e.g., cylinders, heads, sectors), which is handled by device
drivers.
\item
{\em The main problem of file system implementation is to provide a
logical (or user) view of the physical memory.}  For our purposes,
this logical view will be a collection of memory units called
\dt{files}, which are in turn organized into a hierarchical
structure.  Each file is logically an array of bytes.  To support
this hierarchical structure, we classify files into one of two types:
\dt{regular files} or \dt{directory files}.  The latter represents a
collection of files (regular or directory).
\item
\dt{File attributes}:
\benum
\item File name: human-readable string
\item Identifier: a unique number internal to the file system
\item Type: ascii, binary, graphics, etc
\item Location: physical address
\item Size
\item Protection:
\item Time, Date, Ownership: creation, last modification, last use, etc.
\eenum
These file attributes are useful in implementing functions for
protection, security, monitoring and search.  Files can store data of
various types, e.g., source, binary, graphics, sound, etc.  But for
the present purposes, these distinctions are irrelevant.
Remark: a file name typically has 2 parts; the second part is the
extension, which indicates the file type in some systems.
\item
There is a file directory, also in secondary storage, to hold this
information.
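The attribute list above can be mirrored in a small record type.  This is only an illustrative sketch: the field names and types are invented, not any real system's on-disk layout (real systems keep this information in the directory entry, a file control block, or both):

```python
# Sketch of a per-file attribute record, mirroring the list above.
# All field names and defaults are illustrative inventions.
from dataclasses import dataclass, field
import time

@dataclass
class FileAttributes:
    name: str                 # human-readable string, e.g. "notes.txt"
    ident: int                # unique number internal to the file system
    ftype: str                # "ascii", "binary", "directory", ...
    location: int             # physical (block) address of the data
    size: int = 0             # in bytes
    protection: int = 0o644   # e.g. Unix-style permission bits
    created: float = field(default_factory=time.time)
    modified: float = field(default_factory=time.time)

attr = FileAttributes(name="notes.txt", ident=42, ftype="ascii", location=1000)
assert attr.size == 0 and attr.protection == 0o644
```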
\item
File Operations:
\benum
\item Create: find space, create a directory entry
\item Write: depends on the current location
\item Read: depends on the current location
\item Seek: change the current location
\item Delete:
\item Truncate:
\eenum
\item
Open and Close: Since read/write/seek, etc., have a nontrivial
initialization cost, most OSes require that we first OPEN a file
before we can perform these operations.  Hence we also need to CLOSE
a file when done.  There is a list of OPENED files (per process, and
system-wide).  The per-process list of OPENED files points to the
system-wide list.  We need the following information for the
per-process opened files:
\benum
\item File pointer: current location
\item File-open count: the file is closed when this count reaches 0
\item Memory location: where information about the file is kept in main memory
\item Buffer: this is sometimes used to speed up I/O.
\item Access rights:
\eenum
% \item File Structure:
\eitem
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\lectpart{File Allocation Methods}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
See section 11.4, p.421, Silbers. (Edn.7)
\bitem
\item
We begin with the problem of managing physical memory.  Recall that
this is conceptually a linear sequence of bytes.
\item
Disk memory is partitioned into two parts: the ``free memory pool''
and ``allocated memory''.  The latter is presumably used by the file
system.  Each byte of memory is either \dt{free} or \dt{allocated},
depending on which partition it belongs to.  There are two basic
functions which move bytes from one partition to the other:
\bitem
\item Alloc(n) moves $n$ contiguous bytes from the free memory pool
to allocated memory.  The address of the first of the $n$ bytes is
returned.
\item Free(ADDR,n) moves $n$ contiguous bytes, starting at address
ADDR, from allocated memory back to the free memory pool.
\eitem
There is an asymmetry in these 2 requests: the Alloc(n) request could
fail and return a NULL pointer if there are no $n$ contiguous bytes
in the free memory pool.
On the other hand, Free(ADDR,n) always succeeds.
\item
The \dt{allocation problem} is to implement the Alloc/Free functions.
We have seen this problem before, in main memory allocation.  There
is an obvious solution to this problem, but it leads to serious
problems of fragmentation.  The way out of this dilemma was to
allocate only fixed-size units called pages or frames.  We use the
analogous solution here: we divide the physical memory into
fixed-size units called \dt{blocks}.  Thus blocks are the analogue of
pages and/or frames.  We re-interpret ``Alloc(n)'' to mean a request
for $n$ blocks, and ``Free(ADDR,n)'' as a request to free $n$ blocks.
Now, every block is either \dt{free} or \dt{allocated}.
\item
How should we choose block sizes?  Hardware disk sectors (``disk
blocks'') are typically 512 bytes.  Block sizes tend to be small
multiples of this; most often, the block size ranges from 512B to
4KB.  Sometimes, there is an additional \dt{fragment size}, which
lies between 512 bytes and the block size; a file would then use the
regular block size for all of its blocks but the last one.  Yet
another variation is to allow any power-of-2 multiple of the disk
block size.  This means we can have the sizes $512B, 1KB, 2KB, 4KB,$
etc.
\item
We next give some standard solutions to the Free/Alloc problem.
SOLUTION ONE: FreeList.  The simplest solution to the Free/Alloc
problem is this: keep all the free blocks in a singly linked list
called the \dt{FreeList}.  See figure.
FIGURE: free-list
Alloc and Free are easily implemented: for instance, we can assume
that Alloc() returns a linked list of $n$ blocks, and Free() simply
appends the freed blocks (assumed to form a linked list) to the
FreeList.
\item
Issues: to get $n$ blocks, we need to read $n$ blocks from disk.
This is inefficient, since it is often sufficient to return the
addresses of the $n$ blocks.
\item
SOLUTION TWO: Organize the free blocks into a B-tree.  Thus, all the
internal nodes of the B-tree are ``directory blocks'', meaning that
they hold a number of addresses of blocks.
This is essentially the approach taken by NTFS (New Technology File
System) of Microsoft, replacing the FAT system of previous Windows
systems.  Technically, NTFS uses a B+ tree, meaning that the leaves
are linearly linked.
\item
SOLUTION THREE: Suppose a block can store $A$ addresses.  Then we can
use a free block as an \dt{index block}, meaning that it stores some
number $a \in\set{0,1\dd A}$ of addresses of other free blocks.
E.g., if the block size is 4KB and block addresses are 4 bytes, then
$A=1024$.  In the B-tree solution, $a$ is either $0$ or some number
between $A/2$ and $A$.  In Unix, the last address is used to point to
another index block (or to NULL).
\item
SOLUTION FOUR: The FAT (File Allocation Table) solution.  This table
has as many entries as there are blocks in the system.  The $i$th
entry is $0$ if block $i$ is free; otherwise it holds the index of
the next block of the same file (or an end-of-file marker).  The FAT
solution was used by early Microsoft OSes.  It gives much faster
access to files, but it introduces the problem of large file tables.
For instance, a 120 GB disk with 4KB blocks will have 30M entries.
\eitem
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\lectpart{Unix File Systems}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We will go into the details of the Unix file system.
\bitem
\item
As noted above, there are two views of files: the physical and the
logical (or user) view.  Physically, the disk is a sequence of
blocks.
\item
Consider the physical blocks.
\benum
\item Block 0 is the \dt{boot block} (or boot control or boot
partition block).  If we want to boot an OS from this partition, this
block is used; otherwise it can be empty.
\item Block 1 is called the \dt{superblock}.  It contains the number
of blocks, the size of blocks, the free-block count, free-block
pointers, the i-node count, i-node pointers, etc.  Note: an i-node is
also called a File Control Block (FCB).
\item Blocks 2 to some Max are the blocks for i-nodes (see below).
\item The rest are data blocks.
\eenum
\item
Logically, a file is a sequence of blocks.  There are two main kinds
of files: \dt{regular files} and \dt{directory files}.  There is a
third kind, \dt{non-disk files}.  See below.
%REMARK: In Windows, directories are not treated as a
%special kind of file.
\item
There is an \dt{i-node} (short for \dt{index node}) for each file.
It contains the following information about a (regular) file:
\benum
\item user ID and group ID of the owner of the file,
\item time of last modification and access,
\item number of hard links to the file,
\item mode (read/write/execute permissions),
\item type of file (regular, directory, symbolic link, char/block/socket dev),
\item 12 pointers to data blocks on disk.  Thus, access to the first
48KB ($=12\times 4$KB) of data is very fast.
\item 3 additional indirect block pointers: single indirect, double
indirect, and triple indirect.  Since each block holds 4K bytes, and
each pointer is 4 bytes, the single indirect block can access 4MB of
data, and the double indirect 4GB.  The triple indirect is not used,
since a 32-bit file offset cannot address beyond 4GB.
\eenum
Normal users refer to files by their names and paths; the system uses
i-nodes to refer to files.
Q: As a user, how do you get the i-node number of a file?
A: Type "ls -i file-name"
\item
A directory file is just a sequence of variable-length triples:
(length, inode-ptr, file-name).  Each triple corresponds to a file in
the directory.  The inode-ptr refers to the i-node of the file, and
the file-name can be up to 255 chars long.  The length refers to the
file-name length.  The first two entries in the file refer to "." and
"..".
\item
Given a path, there is the obvious sequential algorithm to search
directories (starting from / or from the current directory).  To
avoid infinite loops, we count the number of symbolic links
encountered and stop when a limit (8) is reached.
REMARK: can't a hard link cause an infinite loop too?  If so, hard
links should be counted as well.
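The capacity arithmetic for the i-node pointers can be checked with a short computation; a sketch assuming 4KB blocks and 4-byte block addresses, as in the text:

```python
# Check of the i-node capacity arithmetic: 12 direct pointers plus
# single/double/triple indirect blocks, with 4KB blocks, 4B pointers.
BLOCK = 4 * 1024                 # bytes per block
PTRS = BLOCK // 4                # pointers per index block = 1024

direct = 12 * BLOCK              # 48 KB reachable via direct pointers
single = PTRS * BLOCK            # 4 MB via the single indirect block
double = PTRS**2 * BLOCK         # 4 GB via the double indirect block
triple = PTRS**3 * BLOCK         # 4 TB -- beyond a 32-bit file offset

assert direct == 48 * 1024
assert single == 4 * 1024**2
assert double == 4 * 1024**3
assert triple == 4 * 1024**4
```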
\item
LINKS: Hard links are directory entries, like ordinary entries.  A
hard link does not create a new i-node: it is simply another
directory entry referring to an existing i-node.  A symbolic link, in
contrast, is a small file of its own (with its own i-node) whose
contents are a path name; creating one also adds an entry in some
directory.  Hard links have no effect on the search algorithm, but
symbolic links affect it as follows: when our search reaches a
symbolic link, we restart the search using the path name stored in
the symbolic link.
Q: How do you verify on your Unix system that a hard link shares the
i-node of its target, while a symbolic link gets its own i-node?
A: Create a symbolic and a hard link to some file "foo" and look at
their i-nodes: \\
@ ln -s foo symfoo \\
@ ln foo hardfoo \\
@ ls -i foo symfoo hardfoo
\item
HOW TO HANDLE NON-DISK FILES.  There are 3 main types: block devices,
char devices, and sockets.  We call the appropriate drivers to handle
them.
\item
Opening and Closing files.  After we have found the i-node of the
file, we allocate an \dt{open-file structure} for this i-node.  An
index into the \dt{open-file (structure) table} is what we call a
\dt{file descriptor}.  This open-file table is PER PROCESS.
\item
A directory name cache can be used to hold recent directory-to-inode
translations, to speed up file access.  There is also an in-core list
of i-nodes: all currently open files have a copy of their i-node
here.
\item
It turns out that we want an intermediate structure between the
open-file structure and the i-node.  The {\em current position} of an
open file is one piece of information that belongs to neither the
i-node nor the per-process open-file structure.  Hence we store this
information in an entry of the \dt{system-wide open-file (structure)
table}.  This entry, in turn, points to the i-node.  It can also keep
track of the number of processes that have opened this file.
(See Fig A.7 of Appendix A)
%ACTUALLY, this example, from Tanenbaum, is not convincing.
% page 467 of Silbershatz seems to contradict this example.
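The three levels of bookkeeping (per-process descriptor table, system-wide open-file table, i-node) can be modeled in a few lines; a minimal sketch with invented names, showing how descriptors inherited across a fork share one current position:

```python
# Sketch: per-process fd tables point into a system-wide open-file
# table; each system-wide entry holds the current position and points
# to the i-node.  All names here are illustrative inventions.
class OpenFileEntry:            # one per open() call, system-wide
    def __init__(self, inode):
        self.inode = inode      # the shared in-core i-node
        self.pos = 0            # current position lives HERE, not per process
        self.ref_count = 1      # processes sharing this entry (e.g. after fork)

def fork_fd_table(parent_table):
    """Child inherits the parent's descriptors; entries are shared, not copied."""
    for entry in parent_table:
        entry.ref_count += 1
    return list(parent_table)   # same OpenFileEntry objects

inode_F1 = object()             # stand-in for an in-core i-node
p0 = [OpenFileEntry(inode_F1)]  # P0 opens F1 -> file descriptor 0
p1 = fork_fd_table(p0)          # P1 inherits the descriptor
p2 = fork_fd_table(p0)          # so does P2

p1[0].pos += 100                # P1 writes 100 bytes of header
assert p2[0].pos == 100         # P2 sees the position and continues there
```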
Let us illustrate a situation where we need this intermediate
structure.  Suppose a process P0 opens a file F1, and then forks two
processes: P1 to write some header info into F1, followed by P2 to
write additional info into F1.  Note that P0's open-file structure
for F1 will be inherited by P1 and P2 through the fork.  Now, after
P1 has done its work, the updated location should be accessible to
P2.  But this information, if stored in P1's open-file table, would
not be accessible to P2.  If, instead, both of them point to the same
entry of the system-wide open-file table, then P2 can continue where
P1 left off!
\item
\dt{Clustering Schemes.}  It is best to allocate an i-node and its
data blocks from the same locality on the disk.  FreeBSD has a
concept called the \dt{cylinder group} (a cylinder refers to a
locality on the disk).  Each cylinder group has a superblock, an
array of i-nodes, and data blocks, just as in a partition.  All
cylinder groups have the same superblock.  Block allocation can use
such information to allocate free blocks so as to preserve locality
(among other considerations).  See p.908 of Appendix A.
\item
\dt{Free Space Management.}  There are three basic methods:
(1) The simplest method is to keep a linked list of all the free
blocks.
(2) We could also use bit vectors to track the free blocks.
(3) Finally, we could let each free block store a sequence of pairs
(ADDR, NUM), where ADDR is the address of NUM free contiguous blocks.
FIGURE illustrating (3).
Let us discuss scheme (3).  Which of these free blocks should
recursively be used to store similar information?  If ALL the free
blocks were used in this way, then we could not easily give away the
current free block to requests for blocks.  An intermediate solution
is to use just the last two free blocks for recursively storing such
free-block information.  Advantage: we have a branching factor of
two, and all but two of the free blocks recorded in the current block
can be given away directly.
If a free block is released (by a file delete), this free block can
often be inserted directly at the root.  This is most efficient.
\item
Solution to thrashing in (3): keep the first two blocks in memory,
writing out a full block only after we have 1.5 blocks' worth of
information in memory!
\item
\dt{File Consistency Problem.}  Suppose the computer crashes.  The
open-file table is generally lost, and we need to check the
consistency of the file system.  The loss of an i-node, a directory
block, or a free-list block is most serious.  To deal with this, most
OSes have a utility program to check the consistency of a file
system.  In Unix it is called "fsck", and in Windows it is called
"scandisk".  We normally run this utility on system reboot,
especially after a crash.
In Unix, we need to check the consistency of two things: the block
structure and the file structure.
Let us first consider block consistency, under the assumption that
free blocks are kept in a free list.  Basically, every block must
appear exactly once: either in the free list, or referenced by some
i-node.  Suppose $b$ is a block number, and $free[b]$ and $inode[b]$
indicate how many times $b$ appears in the free list and in i-node
block lists, respectively.  Then we have the following states:
\bitem
\item $free[b]+inode[b]=1$: $b$ is consistent
\item $free[b]+inode[b]=0$: $b$ is missing
\item $free[b]>1$: $b$ is duplicated in the free list
\item $inode[b]>1$: $b$ is duplicated in the file system list
\eitem
So our algorithm begins by initializing $free[b]=inode[b]=0$ for all
$b$.  Then we go through the free list, and increment $free[b]$ for
each $b$ that we find there.  We also go through the file hierarchy,
and for each i-node, we look at all the blocks pointed to via this
i-node, incrementing $inode[b]$ for each.  PRESUMABLY, there is
enough main memory to store the $free[b]$ and $inode[b]$ arrays.
In case of inconsistency, we need to take some action: if a block is
missing, we add it to the free list.
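The counting pass just described can be sketched in a few lines; a minimal sketch over invented in-memory stand-ins for the free list and the i-node block lists, including the repair for missing blocks:

```python
# Sketch of the block-consistency check: every block should appear
# exactly once, either on the free list or in some i-node's block
# list.  The lists below are toy, in-memory stand-ins.
NBLOCKS = 16
free_list = [3, 5, 7, 9]
inode_blocks = {100: [0, 1, 2], 101: [4, 6, 8]}  # i-node -> data blocks
# blocks 10..15 are deliberately unreferenced ("missing") in this toy setup

free = [0] * NBLOCKS
inode = [0] * NBLOCKS
for b in free_list:
    free[b] += 1
for blocks in inode_blocks.values():
    for b in blocks:
        inode[b] += 1

missing = [b for b in range(NBLOCKS) if free[b] + inode[b] == 0]
dup_free = [b for b in range(NBLOCKS) if free[b] > 1]
dup_inode = [b for b in range(NBLOCKS) if inode[b] > 1]

free_list.extend(missing)   # repair: missing blocks join the free list
```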
If a block is duplicated in the i-node list, we duplicate the data in
the block (by getting blocks from the free list) and modify the
i-nodes that point to these blocks.  If it is duplicated in the free
list (either because $free[b]>1$, or because $free[b]=1$ and
$inode[b]>0$), we can fix the free list.
Next consider checking the consistency of the file structure.  This
is basically a check that the number of hard links recorded in each
i-node is correct: we simply recompute this number for each i-node.
Finally, it is possible to combine the block consistency and file
consistency checks into one single algorithm.
\eitem
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\lectpart{Virtual File Systems}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bitem
\item
A file system must become quite general when it has to deal with
\bitem
\item Raw disk (for database applications, etc)
\item Dual boots of different OSes
\item Different file systems mounted together
\item Network file systems
\item etc
\eitem
\item
This calls for a Virtual File System organization.
\bitem
\item There are 3 layers.
\item Layer 1 has the basic interface of open(), read(), write(),
close().
\item Concept of a vnode: a unique identifier and kernel object for
each directory or file.
\eitem
\eitem
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\lectpart{Network File Systems}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bitem
\item Sect 11.9 of Silbers.
\eitem
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\beginExer
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bexer
What are the trade-off issues in choosing block sizes?
\bsol{
Larger block sizes allow faster access for larger files; but for
smaller files, they cause more internal fragmentation and can be
slower, since more data than needed is transferred per access.
}{}
\eexer
\bexer
Suppose you have a file named \ttt{f0} and you do:
\myprogtt{
\> ln f0 f1 \\
\> ln -s f0 f2 \\
\> ls -i f0 f1 f2
}
What do you expect to see from the last command?
\bsol{
The i-node numbers for f0 and f1 are identical, and differ from the
i-node number for f2.
}{}
\eexer
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\doneExer
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{myTail}