Class 19
CS 202
10 November 2025

On the board
------------
1. Last time
2. SSDs
3. Intro to file systems
4. Files
5. Implementing files
    preface
    contiguous
    linked
    indexed

---------------------------------------------------------------------------

1. Last time

    HDDs

2. SSDs

    Today, flash memory is more common in consumer devices and very common
    in data centers

    flash memory
        solid state: no moving parts
        memory: stores charge
        limited number of overwrites possible
            blocks wear out after 10,000 (MLC) to 100,000 (SLC) overwrites
            [SLC = single-level cell, MLC = multi-level cell]
            requires FTL (flash translation layer) for *wear leveling*, so
              repeated writes to a logical block do not wear out a physical
              block
            random writes are thus very expensive [will see this below]
        limited durability
            turn off device for a year, can lose data

    NAND vs NOR

        NAND (most prevalent for storage):
            higher density
            faster erase and write
            more errors internally (so need error correction)

        NOR:
            faster reads in smaller data units
            can execute code right out of NOR flash
            significantly slower erases

    For NAND: SLC vs MLC vs TLC vs QLC
        - MLC encodes multiple (two) bits in a voltage level
        - MLC slower to write than SLC
        - MLC has lower durability (bits decay faster)
        Now, most flash drives are TLC (or even QLC)

    overview of NAND flash:

        2112-byte pages
            2048 bytes for data, 64 bytes for metadata and ECC

        Block contains 64 (SLC) or 128 (MLC) pages

        Blocks grouped into 2-4 planes
            All planes use the same electrical pins
            But can access their blocks in parallel to overlap latency

        Can *read* one page at a time
            25 microseconds + I/O bus time

        Must *erase* whole block before *programming*
            erase sets all bits to 1, very expensive (2 ms)
            programming a pre-erased block requires moving data to an
            internal buffer, then 200 (SLC) to 800 (MLC) microseconds

    --Flash characteristics, from
      http://cseweb.ucsd.edu/~swanson/papers/Asplos2009Gordon.pdf

        Parameter                SLC        MLC
        -----------------------------------------------
        Density Per Die (GB)     4          8
        Page Size (Bytes)        2048+32    2048+64
        Block Size (Pages)       64         128
        Read Latency (us)        25         25
        Write Latency (us)       200        800
        Erase Latency (us)       2000       2000

        40MHz, 16-bit bus:
        Read b/w (MB/s)          75.8       75.8
        Program b/w (MB/s)       20.1       5.0

        133MHz:
        Read b/w (MB/s)          126.4      126.4
        Program b/w (MB/s)       20.1       5.0

    --disk vs. MLC NAND flash vs. regular DRAM: orders of magnitude

                            disk           flash          DRAM
        ------------------------------------------------------------
        Smallest write      sector         sector         byte
        Atomic write        sector         sector         byte/word
        Random read         8 ms           75 us          50 ns
        Random write        8 ms           200 us         50 ns
        Sequential read     100 MB/s       250 MB/s       > 1 GB/s
        Sequential write    100 MB/s       170 MB/s       > 1 GB/s
        Cost                $.01/GB        $.10/GB        $10-25/GB
        Persistence         Non-volatile   Non-volatile   Volatile

    Need FTL: flash translation layer. Maps logical to physical blocks.

        problem is write amplification:
            Small random writes punch holes in many blocks
            If small writes require garbage-collecting 90%-full blocks,
              that means writing 10× more physical than logical data!
              (see the arithmetic sketch at the end of this section)
            Must also periodically re-write even blocks w/o holes
        Wear leveling ensures active blocks don't wear out first

    [credit: David Mazières]
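    [aside: to make the 10× write-amplification arithmetic concrete, here is
    a minimal sketch in C. It is not the code of any real FTL: the
    128-pages-per-block figure comes from the MLC column of the table above,
    and "every garbage-collected victim block is a fraction u full of live
    pages" is a simplifying assumption.]

        /* back-of-the-envelope write amplification under greedy garbage
           collection: the FTL must copy the victim block's live pages out
           before erasing it, so physical writes = copied pages + new data */
        #include <stdio.h>

        int main(void) {
            double pages_per_block = 128.0;   /* MLC block size, from the table */
            for (int pct = 0; pct <= 90; pct += 10) {
                double live = (pct / 100.0) * pages_per_block; /* pages copied out */
                double room = pages_per_block - live;          /* space for new data */
                double wa   = (live + room) / room;            /* physical / logical */
                printf("victim %2d%% full -> write amplification %.1fx\n", pct, wa);
            }
            return 0;
        }

    [at 90% utilization the ratio is 10x, matching the claim above]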
3. Intro to file systems

    --what does a FS do?

        --provide persistence (don't go away ... ever)

        --give a way to "name" a set of bytes on the disk (files)

        --give a way to map from human-friendly names to "names" (directories)

    --where are FSes implemented?

        --can implement them on disk, over network, in memory, in NVRAM
          (non-volatile RAM), on tape, with paper (!!)

        --we are going to focus on the disk and generalize later

    --a few quick notes about disks in the context of FS design

        --disk is the first thing we've seen that (a) doesn't go away; and
          (b) we can modify (BIOS ROM, hardware configuration, etc. don't go
          away, but we weren't able to modify these things).

          two implications here:

          (i) we're going to have to put all of our important state on the
          disk

          (ii) we have to live with what we put on the disk! scribble
          randomly on memory --> reboot and hope it doesn't happen again.
          scribble randomly on the disk --> now what? (answer: in many
          cases, we're hosed.)

4. Files

    --what is a file?

        --answer from user's view: a bunch of named bytes on the disk

        --answer from FS's view: collection of disk blocks

        --big job of a FS: map name and offset to disk blocks

                             FS
            {file, offset} -----> disk address

    --operations are create(file), delete(file), read(), write()

    --(***) goal: operations have as few disk accesses as possible and
      minimal space overhead

        --wait, why do we want minimal space overhead, given that the disk
          is huge?

        --answer: cache space is never enough, and the amount of data that
          can be retrieved in one fetch is never enough. hence, we really
          don't want to waste either.

    [[--note that we have seen translation/indirection before:

        page table:
                             page table
            virtual address ------------> physical address

        per-file metadata:
                     inode
            offset --------> disk block address

        how'd we get the inode?

                        directory
            file name ------------> file #
                                    (file # *is* an inode in Unix)
    ]]

5. Implementing files

    --our task: meet the goal marked (***) above.

    --NOTE: for now we're going to assume that the file's metadata is known
      to the system

        --> when we look at directories in a bit, we'll see where the
            metadata comes from; the picture above should also give a hint

    access patterns we could imagine supporting:

        (i) Sequential:
            --File data processed in sequential order
            --By far the most common mode
            --Example: editor writes out a new file, compiler reads in a
              file, etc.

        (ii) Random access:
            --Address any block in the file directly, without passing
              through the rest of the blocks
            --Examples: large data set, demand paging, databases

        (iii) Keyed access:
            --Search for a block with particular values
            --Examples: associative database, index
            --This is everywhere in databases and search engines, but....
            --...usually not provided by a FS in the OS

    helpful observations:

        * All blocks in a file tend to be used together, sequentially

        * All files in a directory tend to be used together

        * All *names* in a directory tend to be used together

    further design parameters:

        * Most files are small

        * Much of the disk is allocated to large files

        * Many of the I/O operations are made to large files

        * Want good sequential and good random access

    candidate designs........

        A. contiguous
        B. linked files
        C. indexed files

    A. contiguous allocation

        "extent based"

        --when creating a file, make the user pre-specify its length, and
          allocate all of the space at once

        --file metadata contains location and size

        --example: IBM OS/360

            [ a1 a2 a3 <5 free> b1 b2 ]

            what if a file c needs 7 sectors?!

        +: simple
        +: fast access, both sequential and random
        -: fragmentation

    B. linked files

        --keep a linked list of free blocks

        --metadata: pointer to file's first block

        --each block holds a pointer to the next one

        +: no more fragmentation
        +: sequential access easy (and probably mostly fast, assuming decent
           free space management, since the pointers will point close by)
        -: random access is a disaster (see the sketch below)
        -: pointers take up room in blocks; messes up alignment of data
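    [aside: to contrast A and B concretely, here is a minimal sketch in C.
    The block size, the struct layout, and the disk_read() helper are
    illustrative assumptions, not any real file system's code.]

        #include <stdint.h>

        #define BLOCK_SIZE     4096
        #define DATA_PER_BLOCK (BLOCK_SIZE - sizeof(uint32_t))

        /* assumed driver hook: read one disk block into memory */
        extern void disk_read(uint32_t block_num, void *buf);

        /* A. contiguous: metadata is (start, length); lookup is pure
           arithmetic, no disk I/O */
        uint32_t contig_lookup(uint32_t start_block, uint32_t file_offset) {
            return start_block + file_offset / BLOCK_SIZE;
        }

        /* B. linked: each block gives up a few bytes to hold the pointer to
           the next block... */
        struct linked_block {
            uint8_t  data[DATA_PER_BLOCK];
            uint32_t next;              /* disk address of next block in file */
        };

        /* ...and reaching block n of the file costs n disk reads */
        uint32_t linked_lookup(uint32_t first_block, uint32_t file_offset) {
            uint32_t n   = file_offset / DATA_PER_BLOCK;
            uint32_t cur = first_block;
            struct linked_block buf;
            for (uint32_t i = 0; i < n; i++) {
                disk_read(cur, &buf);
                cur = buf.next;
            }
            return cur;
        }

    [note also that DATA_PER_BLOCK is no longer a power of two, which is the
    alignment complaint in the last minus above]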
    C. indexed files

        [DRAW PICTURE]

        --Each file has an array holding all of its block pointers
            --like a page table, so similar issues crop up

        --Allocate this array on file creation

        --Allocate blocks on demand (using the free list)

        +: sequential and random access are both easy

        -: need to somehow store the array
            --large possible file size --> lots of unused entries in the
              block array
            --large actual file size --> huge array --> need a huge
              contiguous chunk of disk to hold it

        --solve the problem the same way we did for page tables:

                      [............]
                [..........]    [.........]
                [ block  block  block ]

            --[above is a drawing of a balanced tree, like the 4-level page
              tables we saw for x86-64.]

        --okay, so now we're not wasting disk blocks, but what's the
          problem?

            (answer: equivalent issues as for page table walking: here, it's
            extra disk accesses to look up the blocks)

    --this motivates the classic Unix file system

        --inode contains:

            permissions
            times for file access, file modification, and inode-change
            link count (# directories containing file)
            ptr 1 --> data block
            ptr 2 --> data block
            ptr 3 --> data block
            .....
            ptr 11 --> indirect block: ptr --> ptr --> ptr --> ptr --> ptr ...
            ptr 12 --> indirect block
            ptr 13 --> double indirect block
            ptr 14 --> triple indirect block

        This is just a tree. [a block-lookup sketch for this layout appears
        at the end of these notes]

        Question: why is this tree intentionally imbalanced? (i.e., uneven
        depth)

            (Answer: optimize for short files. each level of this tree
            requires a disk seek...)

        Pluses/minuses:

        +: Simple, easy to build, fast access to small files
        +: Maximum file length can be enormous, with multiple levels of
           indirection
        -: worst-case # of accesses pretty bad
        -: worst-case overhead (such as an 11-block file) pretty bad
        -: Because you allocate blocks by taking them off an unordered free
           list, metadata and data get strewn across the disk

    Notes about inodes:

        --stored in a fixed-size array

        --Size of array fixed when disk is initialized; can't be changed

        --Multiple inodes in a disk block

        --Lives in a known location; originally at one side of the disk, now
          lives in pieces across the disk (helps keep metadata close to
          data)

        --The index of an inode in the inode array is called an
          ***i-number***

        --Internally, the OS refers to files by i-number

        --When a file is opened, the inode is brought into memory

        --Written back when modified, and when the file is closed or after
          enough time has elapsed
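    [aside: a minimal sketch in C of logical-block --> disk-block lookup for
    the inode layout above. It is simplified to ten direct pointers plus one
    singly-indirect pointer (the picture also has a second indirect, a
    double-indirect, and a triple-indirect pointer); the block size, struct
    layout, and disk_read() helper are illustrative assumptions, not a real
    kernel's code.]

        #include <stdint.h>

        #define BLOCK_SIZE 4096
        #define NDIRECT    10
        #define NINDIRECT  (BLOCK_SIZE / sizeof(uint32_t))  /* ptrs per indirect block */

        struct inode {
            uint32_t direct[NDIRECT];  /* ptr 1..10: point straight at data blocks */
            uint32_t indirect;         /* ptr 11: points at a block full of pointers */
            /* ... permissions, times, link count omitted ... */
        };

        /* assumed driver hook: read one disk block into memory */
        extern void disk_read(uint32_t block_num, void *buf);

        /* return the disk block holding logical block lbn of the file, or 0
           if the offset is beyond what this simplified layout can address */
        uint32_t bmap(const struct inode *ip, uint32_t lbn) {
            if (lbn < NDIRECT)
                return ip->direct[lbn];        /* small files: no extra disk access */

            lbn -= NDIRECT;
            if (lbn < NINDIRECT) {
                uint32_t ptrs[NINDIRECT];
                disk_read(ip->indirect, ptrs); /* one extra disk access */
                return ptrs[lbn];
            }
            return 0;  /* would need the double/triple indirect blocks above */
        }

    [the imbalance shows up directly in the code: offsets within the first
    NDIRECT blocks cost no disk I/O beyond reading the inode itself, while
    larger offsets pay one extra disk read per level of indirection]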