CSCI-UA.0480 Spring 2016 Homework 3: MapReduce


Handed out Tuesday, April 12, 2016
Due 3:00 PM, Monday, April 18, 2016 (extended from Thursday, April 14)


A "hello world" of MapReduce is counting the number of unique words of each length in a file. The code below provides a simple Python implementation of this for the text file asimov.txt, which contains a copy of Isaac Asimov's short story "The Last Question".

import re
import sys

def map(value):
  return (len(value),value)

def reduce(pair):
  return (pair[0],len(set(pair[1])))

if __name__ == "__main__":

    # load file's words
    # (the with-block closes the file automatically)
    with open(sys.argv[1], "r") as fp:
        text = re.sub("[^a-zA-Z]+", " ", fp.read()).lower()

    # map: turn each word into a (length, word) pair
    #
    #   Note: If you're unfamiliar with list comprehensions in Python,
    #   read the following line as
    #     "The list of map(v) for each element v of text.split()."
    mapd = [map(v) for v in text.split()]

    # shuffle: group the mapped words by key (their length)
    g = []
    for length in set([v[0] for v in mapd]):
        values_with_length = []
        for v in mapd: 
            if v[0] == length:
                values_with_length.append(v[1])
        g.append( (length, values_with_length) )

    # reduce: count the distinct words in each group
    info = [reduce(pair) for pair in g]

    # output
    # first column (XX) is the word length, second (YYY) the distinct-word count
    print("YYY distinct words with XX letters\n")
    print("XX  YYY")
    print("\n".join(["{0:2d}  {1:3d}".format(r[0], r[1]) for r in info]))

Essentially, the code performs four operations:

  1. Load the file's words into a list.
  2. Map each element of the list to a (key, value) pair, here (word length, word).
  3. Shuffle: group the pairs by key.
  4. Reduce each group to a single result by counting its unique values.
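
The four operations above can also be sketched more compactly. The version below is only an illustration, not part of the assignment: the function name count_unique_by_length and the sample sentence are made up for this example, and the shuffle uses a dictionary of lists instead of the nested loops shown earlier.

```python
import re
from collections import defaultdict

def count_unique_by_length(text):
    """Hypothetical helper: map, shuffle and reduce over the words of `text`."""
    # 1. load: keep letters only, lowercase, split into words
    words = re.sub("[^a-zA-Z]+", " ", text).lower().split()
    # 2. map: each word becomes a (length, word) pair
    pairs = [(len(w), w) for w in words]
    # 3. shuffle: group the words by their key (the length)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # 4. reduce: count the distinct words in each group
    return {key: len(set(values)) for key, values in groups.items()}

# Made-up sample input, used only for illustration
sample = "The last question was asked for the first time"
print(count_unique_by_length(sample))
```

Grouping with a dictionary touches each pair once, whereas the nested loops in the original shuffle rescan the whole list for every distinct length.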

One of the main benefits of MapReduce comes from the fact that the map() and reduce() functions are stateless. Because their output depends only on their input, they can be applied to each element (or each group) independently and in any order, which makes this an embarrassingly parallel workload.
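
As a rough illustration of why statelessness matters, the sketch below applies a stateless map function both serially and through a pool of workers and checks that the results agree. The name map_word and the word list are made up for this example, and a thread pool stands in for the separate processes or machines a real MapReduce framework would use.

```python
from concurrent.futures import ThreadPoolExecutor

def map_word(word):
    # Stateless: the result depends only on the argument
    return (len(word), word)

# Made-up input, used only for illustration
words = ["the", "last", "question", "the"]

# Serial application
serial = [map_word(w) for w in words]

# The same function applied by a pool of workers;
# pool.map returns results in input order
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(map_word, words))

print(serial == parallel)
```

Because map_word never reads or writes shared state, it does not matter which worker handles which word.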

Rather than writing all of the scaffolding code ourselves, we can use an existing Python library that implements MapReduce, such as mrjob. We'll use mrjob for the rest of this homework. You can install it by running

$ sudo pip install mrjob
[sudo] password for httpd: cs480
....
$ _ 

from the terminal of your virtual machine.

Using mrjob, the code we wrote above can be reduced to

from mrjob.job import MRJob
import re

class MRUniqueWords(MRJob):

    def mapper(self, _, line):
        for word in re.sub("[^a-zA-Z]+"," ",line).lower().split():
            yield (len(word), word)

    def reducer(self, key, values):
        yield key, len(set(values))

if __name__ == '__main__':
    MRUniqueWords.run()
    

which is not only much easier to read, but also leaves far less room for error in the implementation of, e.g., the shuffle step.

To run the above code, save it as unique_words.py and run

$ python unique_words.py asimov.txt
No configs found; falling back on auto-configuration
Creating temp directory ....
Running step 1 of 1...
Streaming final output from ....
5       164
6       165
7       142
8       117
9       85
1       8
10      48
11      27
12      13
13      11
14      3
17      1
2       32
3       90
4       201
Removing temp directory ....
$ _



Your job is to use the MapReduce approach to solve the problem of computing mutual friends lists. The input data will be of the form

1 : 2,3,6,7,8
2 : 1,3,4,7
3 : 1,2,5,8
4 : 2,5,7,8
5 : 3,4,6
6 : 1,5,8
7 : 1,2,4,8
8 : 1,3,4,6,7

where the first number on each line corresponds to a particular person, and the list of numbers following the colon : corresponds to their list of friends.
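
For reference, one way to split such a line into its parts is sketched below. The helper name parse_line is hypothetical and not part of the provided template; it assumes every line follows the "person : friend,friend,..." layout shown above.

```python
def parse_line(line):
    """Hypothetical helper: split 'person : f1,f2,...' into (person, [friends])."""
    # Everything before the first colon is the person,
    # everything after it is the comma-separated friend list
    person, _, friends = line.partition(":")
    # int() tolerates surrounding whitespace; skip empty entries
    friend_ids = [int(f) for f in friends.split(",") if f.strip()]
    return int(person), friend_ids

print(parse_line("1 : 2,3,6,7,8"))
```

A person with an empty friend list would yield an empty list rather than an error.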

Your code should output the mutual friends lists in the form

1,2 : 3,7
1,3 : 2,8
1,6 : 8
1,7 : 2,8
1,8 : 3,6,7
2,3 : 1
2,4 : 7
2,7 : 1,4
3,5 : 
3,8 : 1
4,5 : 
4,7 : 2,8
4,8 : 7
5,6 : 
6,8 : 1
7,8 : 1,4

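where the ordered pair preceding the colon is the key and the sorted list of mutual friends follows the colon. Each pair is listed once, with the smaller number first. Do not print pairs of people who are not friends with each other: for example, 1,4 does not appear above because 1 and 4 are not friends, even though they share the friends 2, 7, and 8.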
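
The output layout can be produced with a small formatting helper such as the sketch below. The name format_result is hypothetical and not part of the template; it only builds the text of one line and leaves open how that line is emitted from an mrjob job.

```python
def format_result(pair, mutual):
    """Hypothetical helper: render one output line, e.g. '1,2 : 3,7'."""
    # Put the smaller person id first
    a, b = sorted(pair)
    # Mutual friends are listed in ascending order, comma-separated
    friends = ",".join(str(m) for m in sorted(mutual))
    return "{},{} : {}".format(a, b, friends)

print(format_result((2, 1), {7, 3}))
print(format_result((3, 5), set()))
```

A pair with no mutual friends still produces a line, with nothing after the colon, matching entries such as 3,5 above.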

Your code should be written in Python and should work on the Virtual Machines we provided you (as that is where we will test your solutions). Use the following code template for your solution.

from mrjob.job import MRJob

class MRMutualFriends(MRJob):

    def mapper(self, _, line):
        # Your code here
        pass

    def reducer(self, key, values):
        # Your code here
        pass

if __name__ == '__main__':
    MRMutualFriends.run()


Handing in the homework

Use NYU Classes to submit your Python code; there's an entry for this homework.


Last updated: 2016-04-15 16:24:03 -0400