Class 25
CS 439
16 April 2013

On the board
------------

1. Last time
2. MapReduce

---------------------------------------------------------------------------

0. Last time

Begin distributed systems
Atomic commit (distributed transactions)

Today: discuss a different style of distributed system

0.5 Warmup

--Did people like the paper? Dislike it? Didn't have time to read it?
--Ask what questions people have about MapReduce
  (This stuff is relevant and real, though presumably by the time they
  publish they've moved on a bit.)

1. Description of Google's environment

Why do they have tons and tons of commodity PCs? (Because at their
scale, even reliable machines would sometimes fail, so they need
mechanisms for fault tolerance [replication, etc.]. Once they have
those mechanisms, they can get away with less reliable components.)

[draw picture]

You don't have to be Google to work this way: services like Amazon Web
Services, Hadoop, etc. make this kind of environment available to
everyone.

2. Design problem

Assume:
--can't fit the "Web" or an inverted index on one disk or in memory
--see picture with 'aardvark,' 'zoo', etc. [see picture]
--you're operating in the Google environment mentioned above
--don't use MapReduce as you think about this problem. (The MapReduce
  authors must have started out thinking like this. They probably had
  some hacked solution. Eventually, they realized they could "factor
  out" common functions.)

3. Overview of MapReduce computational model

--Ask: how could you use MapReduce for the above problem? [draw
  picture] (a toy sketch follows this list)
--explain the shuffle
--Are there computations that can't be conveniently expressed?
    --Computations that change data in place or do lots of processing
      of it.
    --Any computation that is not expressible as transformations of
      (k,v) pairs.
    --The not-nice way to say it is that the MapReduce programming
      model pushes work onto the programmer. (This is a general design
      point: if you can constrain the programmer, then the framework
      itself can optimize, because the design space is more
      restricted.)
--How about interactive jobs?
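A minimal, single-process sketch of how the inverted-index problem
above might be expressed in this model. (This is illustrative, not the
paper's code: map_fn, reduce_fn, and the sample documents are made up,
and the shuffle is simulated with an in-memory std::map. A real
MapReduce runs many map and reduce tasks across machines, with the
shuffle in between.)

    // Toy MapReduce-style inverted index, all in one process (C++17).
    #include <iostream>
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // map: for each (docname, contents) pair, emit (word, docname).
    std::vector<std::pair<std::string, std::string>>
    map_fn(const std::string& docname, const std::string& contents) {
        std::vector<std::pair<std::string, std::string>> out;
        std::istringstream words(contents);
        std::string w;
        while (words >> w)
            out.emplace_back(w, docname);
        return out;
    }

    // reduce: for each word, collapse its docnames into a posting list.
    std::string reduce_fn(const std::string& word,
                          const std::set<std::string>& docnames) {
        std::string posting;
        for (const auto& d : docnames)
            posting += d + " ";
        return posting;
    }

    int main() {
        std::vector<std::pair<std::string, std::string>> docs = {
            {"doc1", "aardvark visits zoo"},
            {"doc2", "zoo hires aardvark"},
        };

        // "Shuffle": group intermediate pairs by key. In a real
        // MapReduce, the framework does this, not the programmer.
        std::map<std::string, std::set<std::string>> groups;
        for (const auto& [docname, contents] : docs)
            for (const auto& [word, d] : map_fn(docname, contents))
                groups[word].insert(d);

        for (const auto& [word, docnames] : groups)
            std::cout << word << " -> " << reduce_fn(word, docnames)
                      << "\n";
    }

Note that each map_fn call and each reduce_fn call is independent and
side-effect-free; that is the property that lets the framework run them
in parallel and, as in 5A below, re-execute failed tasks from scratch.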
4. Overview of MapReduce implementation

[draw picture]

--people use MapReduce by writing a few simple C++ functions
--why does the Map worker write its intermediate data to the local
  disk?

5A. Fault tolerance: how does it tolerate worker faults?

--Start the failed task over.
--And why does that work?
  [--b/c the computation is robust to tasks (map tasks or reduce
     tasks) failing
   --means that programmers don't have to reason about what happens if
     their computation starts over.]

5B. How does it tolerate master faults?

--It doesn't.
--The master task is a single point of failure. Why did this otherwise
  fault-tolerant system get designed with a single point of failure?
    --Simplicity!
    --The computation model assumes no side effects, so nothing gets
      messed up if the computation starts over.
--How could they have avoided the single point of failure? (With
  complexity.)

6A. What has the biggest effect on performance? (Answer: stragglers.)

6B. How did the authors solve the problem?

--Can someone explain this stragglers hack?

6C. How did the authors know that stragglers would be a problem?

--They probably didn't!
--Lessons:
    (1) don't optimize until you measure
    (2) tuning may require hacks

7. Performance

--Impressive: they scan a terabyte in 2.5 minutes. Classic example of
  using parallelism to decrease latency. (They throw massive amounts
  of hardware at the problem, but the design challenge is how to
  harness that hardware in a controlled way.)

8. What do you think: will MapReduce turn out to be unnecessary
complexity, rendered obsolete in the future by faster, bigger, better
computers?

9. Pragmatism over idealism

--Note how functional programming turns a hard problem (what to do
  about partial and multiple executions) into an easy problem
  (re-execute partially failed tasks from scratch).
  --> BUT this comes at the cost of a restricted model of computation.
--Corners cut:
    --restricted model of computation (can't handle all problem types)
    --single point of failure (the master)
    --performance hacks (the stragglers fix)
--The authors display what I would call "well-considered pragmatism":
  some corners are cut, but those seem to be the right corners.

---------------------------------------------------------------------------

Observe that the authors built something very powerful out of simple
pieces. That's the essence of great systems design.

Key sentence in the paper:

"MapReduce has been so successful because it makes it possible to
write a simple program and run it efficiently on a thousand machines
in the course of half an hour, greatly speeding up the development and
prototyping cycle. Furthermore, it allows programmers who have no
experience with distributed and/or parallel systems to exploit large
amounts of resources easily."

---------------------------------------------------------------------------