Class 21
CS 480-008
14 April 2016

On the board
------------
1. Last time
2. Problem
3. FDS
    A. Intro
    B. Design

---------------------------------------------------------------------------

1. Last time

    --MapReduce
    --clarify shuffle versus reduce: where do the RPCs happen? in
      response to the Iterator, or before that? multiple ways to
      implement this

2. Problem

    lots of data
    need to be able to find it
    need to be able to compute over it
    want this to be efficient

    historically hierarchical topologies

    Map-Reduce programming model works
        assume M-->R stage doesn't involve a lot of data
        a bit awkward: stragglers restarted from scratch
        each worker given a large chunk of data (b/c in practice, not
          efficient for any worker to grab any work item; so once
          there's an implicit split, might as well leverage that)

3. FDS

A. Intro

    Assume data center bandwidth is not the bottleneck (because new
    topologies are making this possible)

    What is the ideal interface to the storage system?

        flat blobs, versus structured key space or file space, where,
        say, /[abc]/... is stored on one set of nodes, and /[def]/ is
        on another, etc.
        blobs are very, very small (they won't be able to do this)

    Ideal abstract picture:
        [draw picture: lots of storage and compute nodes]

    What are some problems to solve?
        (random-access reads are slow.)
        (there's a lot of data! how does one find it?)
        (what if a node fails?)
        (how can we reconfigure the topology?)
        (how do we build applications in this world?)

B. Design

    --how do they deal with the random-access read issue?
        (answer: by writing in 8MB chunks.)

    --how do we know that 8MB is enough to amortize a disk seek?
        (answer: Figure 3)

    --before we address how they solve the data location/placement
      problem, let's go through some alternatives:

        (1) a server tracks the mapping from blob id to server. this is
            the GFS solution.
        (2) disseminate the map from file name to server responsible
            for that file's metadata

        (3) list of servers is disseminated, and then use consistent
            hashing (see class 23) to identify which server is
            responsible

      what do they do? hybrid of (2) and (3): send an implicit map to
      clients.

      how does it work? metadata server disseminates a table: the
      server(s) responsible for (blob_id, tract i) can be read as
      follows:

        idx = (hash(blob_id) + i) % table_length;
        TLT[idx] contains the list of tractservers that would have the
        tract

      Example four-entry TLT with no replication:

        0: S1
        1: S2
        2: S3
        3: S4

      suppose hash(27) = 2
      then the tracts of blob 27 are laid out:

        S1: 2 6
        S2: 3 7
        S3: 0 4 8
        S4: 1 5
        ...

      FDS is "striping" blobs over servers at tract granularity

    Q: why have tracts at all? why not store each blob on just one
       server? what kinds of apps will benefit from striping? what
       kinds of apps won't?

    Q: why not the UNIX i-node approach? (store an array per blob,
       indexed by tract #, yielding tractserver)

        so you could make per-tract placement decisions
        e.g. write a new tract to the most lightly loaded server

    why isn't the "i" inside the "hash"?

        then each tract selects a server independently
        like throwing balls at bins; do this, and in expectation, you
        get imbalance: more and less loaded bins
        with their approach, if a blob is very large, there is a more
        uniform spread (but if the blob does not have a lot of tracts,
        then this technique does not deal with imbalance).

    assuming no replication, if there are n servers, are there n table
    entries?

        (answer: no. there are n*m, where m=20 (a made-up parameter).)

    Why?

        (because they want to avoid caravanning. this way, when a
        sequential request is at a node, there will be m possible next
        nodes.)

    why would caravanning happen? answer:
        --assume there's a bottleneck server
        --assume a bunch of clients reading their different blobs
          sequentially.
          that is, they're reading each blob in tract order
        --once the clients "hit" this server (a server holds tracts
          from lots of different blobs), all of the clients slow down
          together (because of the bottleneck)

    metadata: how do they handle *blob metadata* (they also call their
    server a metadata server, which is confusing)?

        (answer: it's the negative first (-1) tract.)

        what are the advantages of this?

            (answer: it distributes the load of handling metadata
            requests. also, metadata is handled/replicated differently
            from data, so it makes sense to identify it separately.)

    dynamic work allocation (section 2.4): what is this?

        --because there is no concept of the data living at a worker
          (or of selecting a worker that lives near a piece of
          storage), there is a very fluid way to assign jobs to
          workers.
        --this means it's feasible to assign work at a finer grain
          (versus: "You, node 23, handle these billion key-pairs that
          are near you.")
        --that in turn means that a node can signal, as it's computing,
          how it's doing, which allows a central work allocator to make
          intelligent decisions about which nodes to give work to.
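The fluid, fine-grained assignment described above can be sketched as a toy allocator. This is a minimal sketch, not the paper's implementation; the worker names, item counts, and per-round "speeds" are illustrative assumptions:

```python
# Sketch of fine-grained dynamic work allocation (section 2.4).
# Because any worker can read any tract at full bandwidth, the
# allocator can hand small work items to whichever worker is idle,
# instead of pre-partitioning data by locality.
from collections import deque

def allocate(work_items, workers, speed):
    """Hand each worker the next small item as it finishes; faster
    workers naturally end up doing more of the job."""
    queue = deque(work_items)
    done = {w: [] for w in workers}
    # Simulate rounds: each round, worker w completes speed[w] items.
    while queue:
        for w in workers:
            for _ in range(speed[w]):
                if not queue:
                    break
                done[w].append(queue.popleft())
    return done

result = allocate(range(12), ["fast", "slow"], {"fast": 3, "slow": 1})
print(result)  # the fast worker ends up with 9 items, the slow one with 3
```

Contrast with the coarse split: if the 12 items had been pre-assigned 6/6, the job would finish at the slow worker's pace (a straggler); here the tail is at most one small item.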
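Returning to the TLT lookup from section B, the four-entry striping example can be reproduced in a few lines. The tractserver names are from the example in the notes, but the hash function here is a stand-in assumption, chosen only so that hash(27) = 2; FDS hashes the real blob GUID:

```python
# Sketch of FDS tract lookup via the TLT (no replication).
TLT = ["S1", "S2", "S3", "S4"]  # four-entry table, one server per entry

def blob_hash(blob_id):
    # Stand-in hash, picked so blob_hash(27) == 2 to match the notes.
    return blob_id % 25

def tractserver(blob_id, tract):
    # idx = (hash(blob_id) + i) % table_length
    idx = (blob_hash(blob_id) + tract) % len(TLT)
    return TLT[idx]

# Tracts of blob 27 stripe round-robin starting at entry 2:
layout = {}
for t in range(9):
    layout.setdefault(tractserver(27, t), []).append(t)
print(layout)
# → {'S3': [0, 4, 8], 'S4': [1, 5], 'S1': [2, 6], 'S2': [3, 7]}
```

Note how consecutive tracts of one blob land on different servers (striping), while the "+ i" outside the hash keeps the spread deterministic rather than balls-in-bins random.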