Class 19
CS 202
10 November 2025

On the board
------------
1. Last time
2. SSDs
3. Intro to file systems
4. Files
5. Implementing files
    preface
    contiguous
    linked
    indexed

---------------------------------------------------------------------------

1. Last time

    HDDs

2. SSDs

    Today, flash memory is more common in consumer devices and very common
    in data centers

    flash memory
        solid state: no moving parts
        memory: stores charge
        limited number of overwrites possible
            blocks wear out after 10,000 (MLC) to 100,000 (SLC) overwrites
            [SLC = single-level cell, MLC = multi-level cell]
            requires FTL (flash translation layer) for *wear leveling*, so
              repeated writes to a logical block do not wear out a physical
              block
            random writes are thus very expensive [will see this below]
        limited durability
            turn off device for a year, can lose data

    NAND vs NOR

        NAND (most prevalent for storage):
            higher density
            faster erase and write
            more errors internally (so need error correction)

        NOR:
            faster reads in smaller data units
            can execute code right out of NOR flash
            significantly slower erases

    For NAND: SLC vs MLC vs TLC vs QLC
        - MLC encodes multiple (two) bits in a voltage level
        - MLC slower to write than SLC
        - MLC has lower durability (bits decay faster)
        Now, most flash drives are TLC (or even QLC)

    overview of NAND flash:

        2112-byte pages
            2048 bytes for data, 64 bytes for metadata and ECC

        Block contains 64 (SLC) or 128 (MLC) pages

        Blocks grouped into 2-4 planes
            All planes use the same electrical pins
            But can access their blocks in parallel to overlap latency

        Can *read* one page at a time
            25 microseconds + I/O bus time

        Must *erase* whole block before *programming*
            erase sets all bits to 1, very expensive (2 ms)
            programming a pre-erased block requires moving data to an
            internal buffer, then 200 (SLC) to 800 (MLC) microseconds

    --Flash characteristics, from
      http://cseweb.ucsd.edu/~swanson/papers/Asplos2009Gordon.pdf

        Parameter                SLC        MLC
        -----------------------------------------------
        Density Per Die (GB)     4          8
        Page Size (Bytes)        2048+32    2048+64
        Block Size (Pages)       64         128
        Read Latency (us)        25         25
        Write Latency (us)       200        800
        Erase Latency (us)       2000       2000

        40MHz, 16-bit bus:
        Read b/w (MB/s)          75.8       75.8
        Program b/w (MB/s)       20.1       5.0

        133MHz:
        Read b/w (MB/s)          126.4      126.4
        Program b/w (MB/s)       20.1       5.0

    --disk vs. MLC NAND flash vs. regular DRAM: orders of magnitude

                            disk           flash          DRAM
        ------------------------------------------------------------
        Smallest write      sector         sector         byte
        Atomic write        sector         sector         byte/word
        Random read         8 ms           75 us          50 ns
        Random write        8 ms           200 us         50 ns
        Sequential read     100 MB/s       250 MB/s       > 1 GB/s
        Sequential write    100 MB/s       170 MB/s       > 1 GB/s
        Cost                $.01/GB        $.10/GB        $10-25/GB
        Persistence         Non-volatile   Non-volatile   Volatile

    Need FTL: flash translation layer. Maps logical to physical blocks.

        problem is write amplification:
            Small random writes punch holes in many blocks
            If small writes require garbage-collecting 90%-full blocks,
              that means writing 10× more physical than logical data!
              (see the arithmetic sketch at the end of this section)
            Must also periodically re-write even blocks w/o holes
        Wear leveling ensures active blocks don't wear out first

    [credit: David Mazières]
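    [aside: to make the 10× write-amplification arithmetic concrete, here is
    a minimal sketch in C. It is not the code of any real FTL: the
    128-pages-per-block figure comes from the MLC column of the table above,
    and "every garbage-collected victim block is a fraction u full of live
    pages" is a simplifying assumption.]

        /* back-of-the-envelope write amplification under greedy garbage
           collection: the FTL must copy the victim block's live pages out
           before erasing it, so physical writes = copied pages + new data */
        #include <stdio.h>

        int main(void) {
            double pages_per_block = 128.0;   /* MLC block size, from the table */
            for (int pct = 0; pct <= 90; pct += 10) {
                double live = (pct / 100.0) * pages_per_block; /* pages copied out */
                double room = pages_per_block - live;          /* space for new data */
                double wa   = (live + room) / room;            /* physical / logical */
                printf("victim %2d%% full -> write amplification %.1fx\n", pct, wa);
            }
            return 0;
        }

    [at 90% utilization the ratio is 10x, matching the claim above]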
3. Intro to file systems

    --what does a FS do?

        --provide persistence (don't go away ... ever)

        --give a way to "name" a set of bytes on the disk (files)

        --give a way to map from human-friendly names to "names" (directories)

    --where are FSes implemented?

        --can implement them on disk, over network, in memory, in NVRAM
          (non-volatile RAM), on tape, with paper (!!)

        --we are going to focus on the disk and generalize later

    --a few quick notes about disks in the context of FS design

        --disk is the first thing we've seen that (a) doesn't go away; and
          (b) we can modify (BIOS ROM, hardware configuration, etc. don't go
          away, but we weren't able to modify these things).

          two implications here:

          (i) we're going to have to put all of our important state on the
          disk

          (ii) we have to live with what we put on the disk! scribble
          randomly on memory --> reboot and hope it doesn't happen again.
          scribble randomly on the disk --> now what? (answer: in many
          cases, we're hosed.)

4. Files

    --what is a file?

        --answer from user's view: a bunch of named bytes on the disk

        --answer from FS's view: collection of disk blocks

        --big job of a FS: map name and offset to disk blocks

                             FS
            {file, offset} -----> disk address

    --operations are create(file), delete(file), read(), write()

    --(***) goal: operations have as few disk accesses as possible and
      minimal space overhead

        --wait, why do we want minimal space overhead, given that the disk
          is huge?

        --answer: cache space is never enough, and the amount of data that
          can be retrieved in one fetch is never enough. hence, we really
          don't want to waste either.

    [[--note that we have seen translation/indirection before:

        page table:
                             page table
            virtual address ------------> physical address

        per-file metadata:
                     inode
            offset --------> disk block address

        how'd we get the inode?

                        directory
            file name ------------> file #
                                    (file # *is* an inode in Unix)
    ]]

5. Implementing files

    --our task: meet the goal marked (***) above.

    --NOTE: for now we're going to assume that the file's metadata is known
      to the system

        --> when we look at directories in a bit, we'll see where the
            metadata comes from; the picture above should also give a hint

    access patterns we could imagine supporting:

        (i) Sequential:
            --File data processed in sequential order
            --By far the most common mode
            --Example: editor writes out a new file, compiler reads in a
              file, etc.

        (ii) Random access:
            --Address any block in the file directly, without passing
              through the rest of the blocks
            --Examples: large data set, demand paging, databases

        (iii) Keyed access:
            --Search for a block with particular values
            --Examples: associative database, index
            --This is everywhere in databases and search engines, but....
            --...usually not provided by a FS in the OS

    helpful observations:

        * All blocks in a file tend to be used together, sequentially

        * All files in a directory tend to be used together

        * All *names* in a directory tend to be used together

    further design parameters:

        * Most files are small

        * Much of the disk is allocated to large files

        * Many of the I/O operations are made to large files

        * Want good sequential and good random access

    candidate designs........

        A. contiguous
        B. linked files
        C. indexed files

    A. contiguous allocation

        "extent based"

        --when creating a file, make the user pre-specify its length, and
          allocate all of the space at once

        --file metadata contains location and size

        --example: IBM OS/360

            [ a1 a2 a3 <5 free> b1 b2 ]

            what if a file c needs 7 sectors?!

        +: simple
        +: fast access, both sequential and random
        -: fragmentation

    B. linked files

        --keep a linked list of free blocks

        --metadata: pointer to file's first block

        --each block holds a pointer to the next one

        +: no more fragmentation
        +: sequential access easy (and probably mostly fast, assuming decent
           free space management, since the pointers will point close by)
        -: random access is a disaster (see the sketch below)
        -: pointers take up room in blocks; messes up alignment of data
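    [aside: to contrast A and B concretely, here is a minimal sketch in C.
    The block size, the struct layout, and the disk_read() helper are
    illustrative assumptions, not any real file system's code.]

        #include <stdint.h>

        #define BLOCK_SIZE     4096
        #define DATA_PER_BLOCK (BLOCK_SIZE - sizeof(uint32_t))

        /* assumed driver hook: read one disk block into memory */
        extern void disk_read(uint32_t block_num, void *buf);

        /* A. contiguous: metadata is (start, length); lookup is pure
           arithmetic, no disk I/O */
        uint32_t contig_lookup(uint32_t start_block, uint32_t file_offset) {
            return start_block + file_offset / BLOCK_SIZE;
        }

        /* B. linked: each block gives up a few bytes to hold the pointer to
           the next block... */
        struct linked_block {
            uint8_t  data[DATA_PER_BLOCK];
            uint32_t next;              /* disk address of next block in file */
        };

        /* ...and reaching block n of the file costs n disk reads */
        uint32_t linked_lookup(uint32_t first_block, uint32_t file_offset) {
            uint32_t n   = file_offset / DATA_PER_BLOCK;
            uint32_t cur = first_block;
            struct linked_block buf;
            for (uint32_t i = 0; i < n; i++) {
                disk_read(cur, &buf);
                cur = buf.next;
            }
            return cur;
        }

    [note also that DATA_PER_BLOCK is no longer a power of two, which is the
    alignment complaint in the last minus above]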
    C. indexed files

        [DRAW PICTURE]

        --Each file has an array holding all of its block pointers
            --like a page table, so similar issues crop up

        --Allocate this array on file creation

        --Allocate blocks on demand (using the free list)

        +: sequential and random access are both easy

        -: need to somehow store the array
            --large possible file size --> lots of unused entries in the
              block array
            --large actual file size --> huge array --> need a huge
              contiguous chunk of disk to hold it

        --solve the problem the same way we did for page tables:

                      [............]
                [..........]    [.........]
                [ block  block  block ]

            --[above is a drawing of a balanced tree, like the 4-level page
              tables we saw for x86-64.]

        --okay, so now we're not wasting disk blocks, but what's the
          problem?

            (answer: equivalent issues as for page table walking: here, it's
            extra disk accesses to look up the blocks)

    --this motivates the classic Unix file system

        --inode contains:

            permissions
            times for file access, file modification, and inode-change
            link count (# directories containing file)
            ptr 1 --> data block
            ptr 2 --> data block
            ptr 3 --> data block
            .....
            ptr 11 --> indirect block: ptr --> ptr --> ptr --> ptr --> ptr ...
            ptr 12 --> indirect block
            ptr 13 --> double indirect block
            ptr 14 --> triple indirect block

        This is just a tree. [a block-lookup sketch for this layout appears
        at the end of these notes]

        Question: why is this tree intentionally imbalanced? (i.e., uneven
        depth)

            (Answer: optimize for short files. each level of this tree
            requires a disk seek...)

        Pluses/minuses:

        +: Simple, easy to build, fast access to small files
        +: Maximum file length can be enormous, with multiple levels of
           indirection
        -: worst-case # of accesses pretty bad
        -: worst-case overhead (such as an 11-block file) pretty bad
        -: Because you allocate blocks by taking them off an unordered free
           list, metadata and data get strewn across the disk

    Notes about inodes:

        --stored in a fixed-size array

        --Size of array fixed when disk is initialized; can't be changed

        --Multiple inodes in a disk block

        --Lives in a known location; originally at one side of the disk, now
          lives in pieces across the disk (helps keep metadata close to
          data)

        --The index of an inode in the inode array is called an
          ***i-number***

        --Internally, the OS refers to files by i-number

        --When a file is opened, the inode is brought into memory

        --Written back when modified, and when the file is closed or after
          enough time has elapsed
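    [aside: a minimal sketch in C of logical-block --> disk-block lookup for
    the inode layout above. It is simplified to ten direct pointers plus one
    singly-indirect pointer (the picture also has a second indirect, a
    double-indirect, and a triple-indirect pointer); the block size, struct
    layout, and disk_read() helper are illustrative assumptions, not a real
    kernel's code.]

        #include <stdint.h>

        #define BLOCK_SIZE 4096
        #define NDIRECT    10
        #define NINDIRECT  (BLOCK_SIZE / sizeof(uint32_t))  /* ptrs per indirect block */

        struct inode {
            uint32_t direct[NDIRECT];  /* ptr 1..10: point straight at data blocks */
            uint32_t indirect;         /* ptr 11: points at a block full of pointers */
            /* ... permissions, times, link count omitted ... */
        };

        /* assumed driver hook: read one disk block into memory */
        extern void disk_read(uint32_t block_num, void *buf);

        /* return the disk block holding logical block lbn of the file, or 0
           if the offset is beyond what this simplified layout can address */
        uint32_t bmap(const struct inode *ip, uint32_t lbn) {
            if (lbn < NDIRECT)
                return ip->direct[lbn];        /* small files: no extra disk access */

            lbn -= NDIRECT;
            if (lbn < NINDIRECT) {
                uint32_t ptrs[NINDIRECT];
                disk_read(ip->indirect, ptrs); /* one extra disk access */
                return ptrs[lbn];
            }
            return 0;  /* would need the double/triple indirect blocks above */
        }

    [the imbalance shows up directly in the code: offsets within the first
    NDIRECT blocks cost no disk I/O beyond reading the inode itself, while
    larger offsets pay one extra disk read per level of indirection]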