Released Friday, April 10, 2015
Due Friday, April 24, 2015, 9:00 PM
In this lab, you will implement (pieces of) a simple disk-based file system.
Some notes:
This lab is an edited version of a lab written by Isami Romanowski. (He in turn adapted code from MIT's JOS, porting it to the FUSE and Linux environment, adding inodes, etc.)
We recommend doing this lab inside the development platform that you set up in the previous labs (the virtual devbox or the CIMS machines).
From within this platform,
change to your lab directory,
use Git to commit changes you've made since handing in Lab 5 (if any),
obtain
the latest version of the course
repository, and then create a local branch called lab6 based on our
origin/lab6 (which is our lab6 branch).
The commands are as follows, but you
may not need all of them (for example, if git status
indicates that
there are no changes to commit):
$ cd ~/cs202 $ git commit -am 'my solution to lab5' Created commit 254dac5: my solution to lab5 3 files changed, 31 insertions(+), 6 deletions(-) $ git pull Already up-to-date. $ git checkout -b lab6 origin/lab6 Branch lab6 set up to track remote branch refs/remotes/origin/lab6. Switched to a new branch "lab6" $ cd lab6
You are now in the lab6 directory. Your next step depends on whether you are using the CIMS machines or the devbox:
crackle1.cims.nyu.edu crackle2.cims.nyu.edu crackle3.cims.nyu.edu crackle4.cims.nyu.edu crackle5.cims.nyu.eduNote that this requirement supersedes the usual list of machines.
Next, install required library files:
$ pwd # make sure that you are in lab6 directory /home/your-netid/cs202/lab6 $ ./install-lab6-for-cims.sh You only need this on CIMS machines, press <ENTER> to continue, or <Ctrl>-C to cancel <ENTER> Copying fuse library.... Installation successful
$ sudo apt-get update # update the apt catalog ..... $ sudo apt-get install fuse libfuse-dev .....
The code provided in this lab handles the bulk of the work for creating a file system abstraction in Linux; you will implement a few fundamental operations. Stop now to look over the code that we have provided in the following files:
This entire arrangement (file system implemented in user space with arbitrary choice of storage) is due to software called FUSE (Filesystem in Userspace). In order to really understand what FUSE is doing, we need to take a brief detour to describe VFS. Linux (like Unix) has a layer of kernel software called VFS; conceptually, every file system sits below this layer, and exports a uniform interface to VFS. (You can think of any potential file system as being a "driver" for VFS; VFS asks the software below it to do things like "read", "write", etc.; that software fulfills these requests by interacting with a disk driver, and interpreting the contents of disk blocks.) The purpose of this architecture is to make it relatively easy to plug a new file system into the kernel: the file system writer simply implements the interface that VFS is expecting from it, and the rest of the OS uses the interface that VFS exports to it. In this way, we obtain the usual benefits of pluggability and modularity.
FUSE is just another "VFS driver", but it comes with a twist. Instead of FUSE implementing a disk-based file system (the usual picture), it responds to VFS's requests by asking a user-level process (which is called a "FUSE driver") to respond to it. So the FUSE kernel module is an adapter that speaks "fuse" to a user-level process (and you will be writing your code in this user-level process) and "VFS" to the rest of the kernel.
Meanwhile, a FUSE driver can use whatever implementation it wants. It could store its data in memory, across the network, on Jupiter, whatever. In the setup in this lab, the FUSE driver will interact with a traditional Linux file (as noted above), and pretend that this file is a sector-addressable disk.
The FUSE driver registers a set of callbacks with the FUSE system (via libfuse and ultimately the FUSE kernel module); these callbacks are things like read, write, etc. A FUSE driver is associated with a particular directory, or mount point. The concept of mounting was explained in OSTEP 39 (see 39.15). Any I/O operations requested on files and directories under this mount point are dispatched by the kernel (via VFS, the FUSE kernel module, and libfuse) to the callbacks registered by the FUSE driver. To recap all of the above: the file system user interacts with the file system roughly in this fashion:
|
Here's an example from the staff solution to show what this looks like, where testfs.img is a disk image with only the root directory and the file hello on its file system:
$ mkdir mnt # create a directory to serve as a mount point $ df mnt # see what file system mnt is associated with Filesystem 1K-blocks Used Available Use% Mounted on /dev/sda1 7092728 4536616 2172780 68% / $ ls mnt $ build/fsdriver testfs.img mnt # mount testfs.img at mnt $ df mnt # note that mnt's file system is now different Filesystem Size Used Avail Use% Mounted on cs202fs#testfs.img 8.0M 24K 8.0M 1% /home/cs202/cs202/lab6/mnt $ ls mnt # and there's the hello file hello $ cat mnt/hello # which we can read with any program Hello, world! $ fusermount -u mnt # unmount mnt $ df mnt # and its associated file system is back to normal Filesystem 1K-blocks Used Available Use% Mounted on /dev/sda1 7092728 4536616 2172780 68% / $ ls mnt # and hello is gone, but still lives in testfs.img $Note that in the above example, after we run fsdriver, the kernel is actually dispatching the all the open, read, readdir, etc. calls that ls and cat make to our FUSE driver. The FUSE driver takes care of searching for a file when open is called, reading file data when read is called, and so on. When fusermount is run, our file system is unmounted from mnt, and then all I/O operations under mnt return to being serviced normally by the kernel.
Below, we give an overview of the features that our file system will support; along the way, we review some of the file system concepts that we have studied in class and the reading.
Most UNIX file systems divide available disk space into two main types of regions: inode regions and data regions. UNIX file systems assign one inode to each file in the file system; a file's inode holds a file's meta-data (pointers to data blocks, etc.). The data regions are divided into much larger (typically 8KB or more) data blocks, within which the file system stores file data and directory data. Directory entries (the "data" in a directory) contain file names and inode numbers; a file is said to be hard-linked if multiple directory entries in the file system refer to that file's inode. Both files and directories logically consist of a series of data blocks; these blocks can be scattered throughout the disk much as the pages of a process's virtual address space can be scattered throughout physical memory.
Unlike most UNIX file systems, we make a simplification in the layout of the file system: there is only one region on the disk, in which both inode blocks and data blocks reside. Furthermore, each inode is allocated its own disk block instead of being packed alongside other inodes in a single disk block.
Our file system will use a block size of 4096 bytes.
File systems typically place important meta-data at reserved, well-known disk blocks (such as the very start of the disk). This meta-data describes properties of the entire file system (block size, disk size, meta-data required to find the root directory, the time the file system was last mounted, the time the file system was last checked for errors, and so on). These special blocks are called superblocks. Many "real" file systems maintain multiple replicas of superblocks, placing them far apart on the disk; that way, if one of them is corrupted or the disk develops a media error in that region, the other replicas remain accessible.
Our file system will have a single superblock,
which will always be at block 0 on the disk.
Its layout is defined by struct superblock
in fs_types.h.
Block 0 is typically reserved to hold boot loaders and partition tables,
so file systems generally do not use the very first disk block.
Since our file system is not meant to be used on a real disk, we use
block 0 to store the superblock for simplicity.
The superblock in our file system contains a reference to a block
containing the "root" inode (the s_root
field in
struct superblock
).
The "root" inode is the inode for
the file system's root directory. This inode
stores pointers to blocks; these blocks, together,
contain a sequence of dirent
structures. Each structure
includes a file and an inode number (one can think of this as assigning
a given "name" to a given inode); the collection of these structures
forms the content of the file system's root directory.
These contents can include further directories, etc.
In the same way that the kernel must manage the system's physical memory
to ensure that a given physical page is used for only one purpose at a time,
a file system must manage the blocks of storage on a disk
to ensure that a given disk block is used for only one purpose at a time.
In Lab 5, you kept the physical_pageinfo
structures
for all physical pages
in an array, pageinfo
,
to keep track of the free physical pages in kernel.c.
In file systems it is common to keep track of free disk blocks
using a bitmap (essentially, an array of bits, one for each
resource that is being tracked).
A given bit in the bitmap is set if the corresponding block is free,
and clear if the corresponding block is in use.
The bitmap in our file system always starts at disk block 1, immediately after the superblock. For simplicity we will reserve enough bitmap blocks to hold one bit for each block in the entire disk, including the blocks containing the superblock and the bitmap itself. We will simply make sure that the bitmap bits corresponding to these special, "reserved" areas of the disk are always clear (marked in-use). Note that, since our file system uses 4096-byte blocks, each bitmap block contains 4096*8=32768 bits, or enough bits to track 32768 disk blocks.
The layout of the meta-data describing a file in our file system is
described by struct inode
in fs_types.h. This
meta-data includes the file's size, type (regular file, directory,
symbolic link, etc.), time stamps, permission information,
and pointers to the data blocks of the file. Because our file
system supports hard links, one inode may be referred to by more than
one name -- which is why the inode itself does not store the file "name".
Instead, directory entries give names to inodes (as noted
eearlier).
The i_direct
array in struct inode
contains
space to store the block numbers of the first 10
(N_DIRECT
) blocks of the file, which we will call the
file's direct blocks. For small files up to 10*4096 = 40KB in
size, this means that the block numbers of all of the file's blocks
will fit directly within the inode
structure itself. For
larger files, however, we need a place to hold the rest of the file's
block numbers. For any file greater than 40KB,
an additional disk block, called the file's indirect
block, holds up to 4096/4 = 1024 additional block numbers,
pushing the maximum file size up to (10 + 1024)*4096 = 4136KB, or a little
over 4MB.
The file system also supports double-indirect blocks.
A double-indirect block
(i_double
in the inode
structure) stores
4096/4 = 1024 additional
indirect block numbers, which themselves each store 1024
additional direct block numbers. This affords an additional 1024*1024*4096 =
4GB worth of data, bringing the maximum file size to a little over
4GB, in theory.
To support even larger files,
real-world file systems typically support triple-indirect blocks
(and sometimes beyond).
You will implement some components of the FUSE driver (and hence the file system): allocating disk blocks, mapping file offsets to disk blocks, and freeing disk blocks allocated in inodes. Because a lot of the implementation is provided, it is important that you familiarize yourself with the provided code and the various file system interfaces.
The main file system code that we've provided for you resides in fsdriver.c. This file contains all the FUSE callbacks to handle I/O syscalls, as well as the main function. Once the path to the disk image and the path to the mount point (testfs.img and mnt respectively in the supplied example) have been read from the command line arguments, our FUSE driver maps the specified disk image into memory using the map_disk_image function (defined in disk_map.c), which itself initializes some file system metadata. Then, fsdriver.c calls fuse_main, which handles kernel dispatches to our registered callbacks. These callbacks will invoke the functions that you write in the coming exercises.
The FUSE driver code we provide maps the disk image into memory using the mmap syscall. In short, mmap reserves a portion of the running process's virtual address space to provide read and write access to a file as if it were an array in memory. For example, iffile
is the pointer to the first
byte of the file in memory after mmaping,
writing ((char *)file)[5] =
3
is approximately equivalent to the two calls
lseek(fd, 5, SEEK_SET)
and then write(fd,
<location of memory that holds byte 3>, 1)
. To flush any in-memory changes
you've made to a file onto the disk, you would use the
msync function. To unmap the file from virtual memory,
use munmap. As always, you can check the man pages
for more information on these syscalls. For this lab, you will
not need to be intimately familiar with their operation, but you
should have a high-level understanding of what they do.
When you run the FUSE driver ($ fsdriver testfs.img mnt),
the function
map_disk_image
will set the bitmap
pointer.
After this, we
can treat bitmap
as a packed array of bits, one for each
block on the disk. See, for example, block_is_free
,
which simply checks whether a given block is marked free in the
bitmap.
alloc_block
in bitmap.c. It should
find a free disk
block in the bitmap, mark it used, and return the number of that
block. When you allocate a block, you should immediately flush
the changed bitmap block to disk with flush_block
, to
help file system consistency. Under the hood,
flush_block
calls msync
to schedule the disk
block in memory to get flushed to the actual disk image.
Note: You should use free_block
as a model.
Use make grade to test your code. Your code should now pass the "alloc_block" test. All code for "core file system operations" tests is in the fs_test function in fsdriver.c, so if this test or any other test specifically mentioned in the exercises fails, you can check there to see what might be going wrong.
Running our tests may not provide you with enough information to debug a problem. Thus, you may sometimes find it helpful to run the driver directly. First, you'll need to create a disk image to use for testing. The easiest way to do this is to run test/testbasic.bash. As a side effect, this creates a disk image, testfs.img, which is set up properly for the internal tests in the driver. Then, to run the driver, do one of the following three things:$ build/fsdriver testfs.img mnt --test-opsThe command above runs the internal file system tests and then exits; it does not mount the file system.
$ build/fsdriver -d testfs.img mntThis command runs the driver in debugging mode and mounts testfs.img at mnt. In debugging mode, the driver does not exit until you press Ctrl-C. This means that you cannot interact with the file system via the terminal containing the command above; instead, you will need to open up a new terminal and interact with the file system from there. While in debugging mode, fsdriver will print to the console a trace of the syscalls dispatched to the driver; any printfs that you inserted in the driver will also be displayed. Once you run the command above, you should see something like:
FUSE library version: 2.9.2 nullpath_ok: 0 nopath: 0 utime_omit_ok: 0 unique: 1, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0 INIT: 7.23 flags=0x0003f7fb max_readahead=0x00020000 INIT: 7.19 flags=0x00000011 max_readahead=0x00020000 max_write=0x00020000 max_background=0 congestion_threshold=0 unique: 1, success, outsize: 40 ...Now, open up a new terminal, and interact with the file system to see the debugging outputs. Something like:
$ ls mntThis should cause the original terminal to print germane output.
We have provided various functions in dir.c and inode.c
to implement the basic facilities you will need
to interpret and manage inode
structures,
scan and manage the entries of directories,
and walk the file system from the root
to resolve an absolute pathname.
Read through all of the code in these files
and make sure you understand what each function does
before proceeding.
inode_block_walk
and inode_get_block
in inode.c.
They are the
workhorses of the file system. For example, inode_read
and inode_write
are little more than the bookkeeping atop
inode_get_block
necessary to copy bytes between scattered
blocks and a sequential buffer.
int inode_block_walk(struct inode *ino, uint32_t filebno, uint32_t **ppdiskbno, bool alloc); int inode_get_block(struct inode *ino, uint32_t filebno, char **blk);
inode_block_walk
finds the disk block number slot for the 'filebno'th block in inode 'ino',
and sets '*ppdiskbno' to point to that slot.
inode_get_block
goes one step further and
sets *blk
to the start of the block,
such that by using *blk
, we can access the contents of the block.
It also allocates a new block if necessary.
Use make grade to test your code.
Your code should pass the
"inode_open" and "inode_get_block" tests.
After Exercise 3, you should be able to read and write to the file system. Try something like
$ echo "hello" > "mnt/world"; cat "mnt/world"
inode_truncate_blocks
in inode.c.
inode_truncate_blocks
frees data and metadata blocks
that an inode allocated but no longer needs. This function is used, for
instance, when an inode is deleted; the space reserved by the inode
must be freed so that other files can be created on the system.
Use make grade to test your code.
Your code should pass the "inode_flush/inode_truncate/file rewrite" tests.
inode_link
and inode_unlink
in
inode.c.
inode_link
links an inode referenced by one path to
another location, and inode_unlink
removes a reference to
an inode at a specified path. Make sure that you properly increment
the link count in an inode when linking and decrement the link count
when unlinking. Don't forget to free an inode when its link count
reaches zero!
inode_link
and inode_unlink
allow us to
exploit the level of indirection provided by using inodes in our file
system (as opposed to storing all file meta-data inside of
directories, for instance) and manage referencing inodes with multiple
names. The inode_unlink
operation is particularly
important as it allows us to release the space reserved for an inode,
acting as a "remove" operation when an inode's link count is one.
After Exercise 5, you should be able to make hard links. Try something like
$ echo "hello" > "mnt/world"; ln "mnt/world" "mnt/hello"; rm "mnt/world"; cat "mnt/hello"
The tests after "inode_link/inode_unlink" are all effectively stress tests, in some way or another, for the driver. Each of them relies on the core functionality that you implemented; some can fail if you didn't handle certain edge cases correctly. If you fail one of these tests, go back and check the logic in your code to make sure you didn't miss anything.
Exercise 6 (Honors). The file system is likely to be corrupted if it gets interrupted in the middle of an operation (for example, by a crash or a reboot). Implement soft updates or journalling to make the file system crash-resilient and demonstrate some situation where the old file system would get corrupted, but yours doesn't.
Exercise 7 (Honors). Currently, our file system allocates one block (4096 bytes) per inode. However, eachstruct inode
only takes
up 98 bytes. If we were clever with file system design, we
could store Answer the following questions in answers.txt.
If you are working in a pair, only one of the team members shoud submit.
If you have not done so yet, commit your code.
$ git status # see what files you changed $ git diff # examine your changes $ git commit -am "My solutions to lab6"Then, follow these steps:
$ make handinThis command removes binaries from your directory, archives your directory in a tarball called
lab6-handin.tgz
, and invokes a
script that submits this tarball to our submission server.
Assuming that you performed the submission
setup step in lab1,
you should see something like this:
CLEAN CREATE lab6-handin.tgz SUBMITTING ... ... progress ... Submission successful
After a successful submission, you should receive email confirmation (at your NYU email address).
If you submit multiple times, we will take the latest submission and count slack hours accordingly.
This completes the lab.
Acknowledgement: the diagram explaining FUSE is adapted from the diagram displayed on FUSE's homepage.
Last updated: Mon May 04 11:24:46 -0400 2015 [validate xhtml]