Class 25
CS 439
16 April 2013

On the board
------------

1. Last time
2. MapReduce

---------------------------------------------------------------------------

0. Last time

Begin distributed systems
Atomic commit (distributed transactions)

Today: discuss a different style of distributed system

0.5 Warmup

--Did people like the paper? Dislike it? Didn't have time to read it?
--Ask what questions people have about MapReduce
  (This stuff is relevant and real, though presumably by the time they
  publish they've moved on a bit.)

1. Description of Google's environment

Why do they have tons and tons of commodity PCs? (Because at their
scale, even reliable machines would sometimes fail, so they need
mechanisms for fault tolerance [replication, etc.]. Once they have
those mechanisms, they can get away with less reliable components.)

[draw picture]

You don't have to be Google to work this way: services like Amazon Web
Services, Hadoop, etc. make this kind of environment available to
everyone.

2. Design problem

Assume:
--can't fit the "Web" or an inverted index on one disk or in memory
--see picture with 'aardvark,' 'zoo', etc. [see picture]
--you're operating in the Google environment mentioned above
--don't use MapReduce as you think about this problem. (The MapReduce
  authors must have started out thinking like this. They probably had
  some hacked solution. Eventually, they realized they could "factor
  out" common functions.)

3. Overview of MapReduce computational model

--Ask: how could you use MapReduce for the above problem? [draw
  picture] (a toy sketch follows this list)
--explain the shuffle
--Are there computations that can't be conveniently expressed?
    --Computations that change data in place or do lots of processing
      of it.
    --Any computation that is not expressible as transformations of
      (k,v) pairs.
    --The not-nice way to say it is that the MapReduce programming
      model pushes work onto the programmer. (This is a general design
      point: if you can constrain the programmer, then the framework
      itself can optimize, because the design space is more
      restricted.)
--How about interactive jobs?
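A minimal, single-process sketch of how the inverted-index problem
above might be expressed in this model. (This is illustrative, not the
paper's code: map_fn, reduce_fn, and the sample documents are made up,
and the shuffle is simulated with an in-memory std::map. A real
MapReduce runs many map and reduce tasks across machines, with the
shuffle in between.)

    // Toy MapReduce-style inverted index, all in one process (C++17).
    #include <iostream>
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // map: for each (docname, contents) pair, emit (word, docname).
    std::vector<std::pair<std::string, std::string>>
    map_fn(const std::string& docname, const std::string& contents) {
        std::vector<std::pair<std::string, std::string>> out;
        std::istringstream words(contents);
        std::string w;
        while (words >> w)
            out.emplace_back(w, docname);
        return out;
    }

    // reduce: for each word, collapse its docnames into a posting list.
    std::string reduce_fn(const std::string& word,
                          const std::set<std::string>& docnames) {
        std::string posting;
        for (const auto& d : docnames)
            posting += d + " ";
        return posting;
    }

    int main() {
        std::vector<std::pair<std::string, std::string>> docs = {
            {"doc1", "aardvark visits zoo"},
            {"doc2", "zoo hires aardvark"},
        };

        // "Shuffle": group intermediate pairs by key. In a real
        // MapReduce, the framework does this, not the programmer.
        std::map<std::string, std::set<std::string>> groups;
        for (const auto& [docname, contents] : docs)
            for (const auto& [word, d] : map_fn(docname, contents))
                groups[word].insert(d);

        for (const auto& [word, docnames] : groups)
            std::cout << word << " -> " << reduce_fn(word, docnames)
                      << "\n";
    }

Note that each map_fn call and each reduce_fn call is independent and
side-effect-free; that is the property that lets the framework run them
in parallel and, as in 5A below, re-execute failed tasks from scratch.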
4. Overview of MapReduce implementation

[draw picture]

--people use MapReduce by writing a few simple C++ functions
--why does the Map worker write its intermediate data to the local
  disk?

5A. Fault tolerance: how does it tolerate worker faults?

--Start the failed task over.
--And why does that work?
  [--b/c the computation is robust to tasks (map tasks or reduce
     tasks) failing
   --means that programmers don't have to reason about what happens if
     their computation starts over.]

5B. How does it tolerate master faults?

--It doesn't.
--The master task is a single point of failure. Why did this otherwise
  fault-tolerant system get designed with a single point of failure?
    --Simplicity!
    --The computation model assumes no side effects, so nothing gets
      messed up if the computation starts over.
--How could they have avoided the single point of failure? (With
  complexity.)

6A. What has the biggest effect on performance? (Answer: stragglers.)

6B. How did the authors solve the problem?

--Can someone explain this stragglers hack?

6C. How did the authors know that stragglers would be a problem?

--They probably didn't!
--Lessons:
    (1) don't optimize until you measure
    (2) tuning may require hacks

7. Performance

--Impressive: they scan a terabyte in 2.5 minutes. Classic example of
  using parallelism to decrease latency. (They throw massive amounts
  of hardware at the problem, but the design challenge is how to
  harness that hardware in a controlled way.)

8. What do you think: will MapReduce turn out to be unnecessary
complexity, rendered obsolete in the future by faster, bigger, better
computers?

9. Pragmatism over idealism

--Note how functional programming turns a hard problem (what to do
  about partial and multiple executions) into an easy problem
  (re-execute partially failed tasks from scratch).
  --> BUT this comes at the cost of a restricted model of computation.
--Corners cut:
    --restricted model of computation (can't handle all problem types)
    --single point of failure (the master)
    --performance hacks (the stragglers fix)
--The authors display what I would call "well-considered pragmatism":
  some corners are cut, but those seem to be the right corners.

---------------------------------------------------------------------------

Observe that the authors built something very powerful out of simple
pieces. That's the essence of great systems design.

Key sentence in the paper:

"MapReduce has been so successful because it makes it possible to
write a simple program and run it efficiently on a thousand machines
in the course of half an hour, greatly speeding up the development and
prototyping cycle. Furthermore, it allows programmers who have no
experience with distributed and/or parallel systems to exploit large
amounts of resources easily."

---------------------------------------------------------------------------