Class 21
CS 480-008
14 April 2016

On the board
------------
1. Last time
2. Problem
3. FDS
    A. Intro
    B. Design

---------------------------------------------------------------------------

1. Last time

    --MapReduce
    --clarify shuffle versus reduce: where do the RPCs happen? in
      response to the Iterator, or before that? multiple ways to
      implement this

2. Problem

    lots of data
    need to be able to find it
    need to be able to compute over it
    want this to be efficient

    historically hierarchical topologies

    Map-Reduce programming model works
        assume M-->R stage doesn't involve a lot of data
        a bit awkward: stragglers restarted from scratch
        each worker given a large chunk of data (b/c in practice, not
          efficient for any worker to grab any work item; so once
          there's an implicit split, might as well leverage that)

3. FDS

A. Intro

    Assume data center bandwidth is not the bottleneck (because new
    topologies are making this possible)

    What is the ideal interface to the storage system?

        flat blobs, versus structured key space or file space, where,
        say, /[abc]/... is stored on one set of nodes, and /[def]/ is
        on another, etc.
        blobs are very, very small (they won't be able to do this)

    Ideal abstract picture:
        [draw picture: lots of storage and compute nodes]

    What are some problems to solve?
        (random-access reads are slow.)
        (there's a lot of data! how does one find it?)
        (what if a node fails?)
        (how can we reconfigure the topology?)
        (how do we build applications in this world?)

B. Design

    --how do they deal with the random-access read issue?
        (answer: by writing in 8MB chunks.)

    --how do we know that 8MB is enough to amortize a disk seek?
        (answer: Figure 3)

    --before we address how they solve the data location/placement
      problem, let's go through some alternatives:

        (1) a server tracks the mapping from blob id to server. this is
            the GFS solution.
        (2) disseminate the map from file name to server responsible
            for that file's metadata

        (3) list of servers is disseminated, and then use consistent
            hashing (see class 23) to identify which server is
            responsible

      what do they do? hybrid of (2) and (3): send an implicit map to
      clients.

      how does it work? metadata server disseminates a table: the
      server(s) responsible for (blob_id, tract i) can be read as
      follows:

        idx = (hash(blob_id) + i) % table_length;
        TLT[idx] contains the list of tractservers that would have the
        tract

      Example four-entry TLT with no replication:

        0: S1
        1: S2
        2: S3
        3: S4

      suppose hash(27) = 2
      then the tracts of blob 27 are laid out:

        S1: 2 6
        S2: 3 7
        S3: 0 4 8
        S4: 1 5
        ...

      FDS is "striping" blobs over servers at tract granularity

    Q: why have tracts at all? why not store each blob on just one
       server? what kinds of apps will benefit from striping? what
       kinds of apps won't?

    Q: why not the UNIX i-node approach? (store an array per blob,
       indexed by tract #, yielding tractserver)

        so you could make per-tract placement decisions
        e.g. write a new tract to the most lightly loaded server

    why isn't the "i" inside the "hash"?

        then each tract selects a server independently
        like throwing balls at bins; do this, and in expectation, you
        get imbalance: more and less loaded bins
        with their approach, if a blob is very large, there is a more
        uniform spread (but if the blob does not have a lot of tracts,
        then this technique does not deal with imbalance).

    assuming no replication, if there are n servers, are there n table
    entries?

        (answer: no. there are n*m, where m=20 (a made-up parameter).)

    Why?

        (because they want to avoid caravanning. this way, when a
        sequential request is at a node, there will be m possible next
        nodes.)

    why would caravanning happen? answer:
        --assume there's a bottleneck server
        --assume a bunch of clients reading their different blobs
          sequentially.
          that is, they're reading each blob in tract order
        --once the clients "hit" this server (a server holds tracts
          from lots of different blobs), all of the clients slow down
          together (because of the bottleneck)

    metadata: how do they handle *blob metadata* (they also call their
    server a metadata server, which is confusing)?

        (answer: it's the negative first (-1) tract.)

        what are the advantages of this?

            (answer: it distributes the load of handling metadata
            requests. also, metadata is handled/replicated differently
            from data, so it makes sense to identify it separately.)

    dynamic work allocation (section 2.4): what is this?

        --because there is no concept of the data living at a worker
          (or of selecting a worker that lives near a piece of
          storage), there is a very fluid way to assign jobs to
          workers.
        --this means it's feasible to assign work at a finer grain
          (versus: "You, node 23, handle these billion key-pairs that
          are near you.")
        --that in turn means that a node can signal, as it's computing,
          how it's doing, which allows a central work allocator to make
          intelligent decisions about which nodes to give work to.
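The fluid, fine-grained assignment described above can be sketched as a toy allocator. This is a minimal sketch, not the paper's implementation; the worker names, item counts, and per-round "speeds" are illustrative assumptions:

```python
# Sketch of fine-grained dynamic work allocation (section 2.4).
# Because any worker can read any tract at full bandwidth, the
# allocator can hand small work items to whichever worker is idle,
# instead of pre-partitioning data by locality.
from collections import deque

def allocate(work_items, workers, speed):
    """Hand each worker the next small item as it finishes; faster
    workers naturally end up doing more of the job."""
    queue = deque(work_items)
    done = {w: [] for w in workers}
    # Simulate rounds: each round, worker w completes speed[w] items.
    while queue:
        for w in workers:
            for _ in range(speed[w]):
                if not queue:
                    break
                done[w].append(queue.popleft())
    return done

result = allocate(range(12), ["fast", "slow"], {"fast": 3, "slow": 1})
print(result)  # the fast worker ends up with 9 items, the slow one with 3
```

Contrast with the coarse split: if the 12 items had been pre-assigned 6/6, the job would finish at the slow worker's pace (a straggler); here the tail is at most one small item.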
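Returning to the TLT lookup from section B, the four-entry striping example can be reproduced in a few lines. The tractserver names are from the example in the notes, but the hash function here is a stand-in assumption, chosen only so that hash(27) = 2; FDS hashes the real blob GUID:

```python
# Sketch of FDS tract lookup via the TLT (no replication).
TLT = ["S1", "S2", "S3", "S4"]  # four-entry table, one server per entry

def blob_hash(blob_id):
    # Stand-in hash, picked so blob_hash(27) == 2 to match the notes.
    return blob_id % 25

def tractserver(blob_id, tract):
    # idx = (hash(blob_id) + i) % table_length
    idx = (blob_hash(blob_id) + tract) % len(TLT)
    return TLT[idx]

# Tracts of blob 27 stripe round-robin starting at entry 2:
layout = {}
for t in range(9):
    layout.setdefault(tractserver(27, t), []).append(t)
print(layout)
# → {'S3': [0, 4, 8], 'S4': [1, 5], 'S1': [2, 6], 'S2': [3, 7]}
```

Note how consecutive tracts of one blob land on different servers (striping), while the "+ i" outside the hash keeps the spread deterministic rather than balls-in-bins random.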