Handed out Tuesday, April 12, 2016
Due 3:00 PM, Monday, April 18, 2016 (originally Thursday, April 14)
A "hello world" in MapReduce is counting the number of unique words of each length in a file. The code below provides a simple Python implementation of this for the text file asimov.txt, which contains a copy of Isaac Asimov's short story "The Last Question".
import re
import sys

def map(value):
    return (len(value), value)

def reduce(pair):
    return (pair[0], len(set(pair[1])))

if __name__ == "__main__":
    # load file's words
    fp = open(sys.argv[1], "r")
    text = re.sub("[^a-zA-Z]+", " ", " ".join(fp.readlines())).lower()
    fp.close()

    # map
    #
    # Note: If you're unfamiliar with list comprehension in python,
    # read the following line as
    # "The list of map(v) for each element v of text.split()."
    mapd = [map(v) for v in text.split()]

    # shuffle: collect all words that share the same length
    g = []
    for length in set([v[0] for v in mapd]):
        values_with_length = []
        for v in mapd:
            if v[0] == length:
                values_with_length.append(v[1])
        g.append((length, values_with_length))

    # reduce
    info = [reduce(pair) for pair in g]

    # output: first column is the word length, second is the count
    print("YYY distinct words with XX letters\n")
    print("XX YYY")
    print("\n".join(["{1:2d} {0:3d}".format(r[1], r[0]) for r in info]))
Essentially, the code performs 4 operations:
One of the main benefits of MapReduce comes from the fact that the map() and reduce() functions are stateless. Because their output depends only on their input, each call can run independently of the others, so the map and reduce stages constitute embarrassingly parallel workloads.
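To illustrate why statelessness matters, here is a minimal sketch that applies the same map step sequentially and through a thread pool and checks that the results agree. The names map_word and parallel_map are made up for this example, and threads are used only to keep it short; a real MapReduce system would spread the calls across processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

def map_word(word):
    """Stateless map step: the result depends only on the argument."""
    return (len(word), word)

def parallel_map(words, workers=4):
    """Apply map_word to every word using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(map_word, words))

words = "the last question was asked for the first time".split()
sequential = [map_word(w) for w in words]
parallel = parallel_map(words)
print(parallel == sequential)
# prints True
```

Because map_word reads and writes no shared state, the order in which the workers happen to run does not affect the result.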
Rather than writing all of the scaffolding code ourselves, we can use an existing Python library that implements MapReduce, such as mrjob. We'll use mrjob for the rest of this homework. You can install it by running
$ sudo pip install mrjob
[sudo] password for httpd: cs480
....
$ _
from the terminal of your virtual machine.
Using mrjob, the code we wrote above can be reduced to
from mrjob.job import MRJob
import re

class MRUniqueWords(MRJob):

    def mapper(self, _, line):
        for word in re.sub("[^a-zA-Z]+", " ", line).lower().split():
            yield (len(word), word)

    def reducer(self, key, values):
        yield key, len(set(values))

if __name__ == '__main__':
    MRUniqueWords.run()
which is not only much easier to read, but also leaves far less room for error in implementation of, e.g., the shuffle process.
To run the above code, save it as unique_words.py and run
$ python unique_words.py asimov.txt
No configs found; falling back on auto-configuration
Creating temp directory ....
Running step 1 of 1...
Streaming final output from ....
5 164
6 165
7 142
8 117
9 85
1 8
10 48
11 27
12 13
13 11
14 3
17 1
2 32
3 90
4 201
Removing temp directory ....
$ _
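To get a feel for what happens between the mapper and the reducer, the following sketch mimics the map, shuffle, and reduce stages locally without mrjob. The run_local function is a hypothetical stand-in written for this example, not mrjob's actual implementation; it only assumes that equal keys are brought together before the reducer is called.

```python
import re
from itertools import groupby
from operator import itemgetter

def mapper(_, line):
    # Same logic as MRUniqueWords.mapper, written as a plain function.
    for word in re.sub("[^a-zA-Z]+", " ", line).lower().split():
        yield len(word), word

def reducer(key, values):
    # Same logic as MRUniqueWords.reducer.
    yield key, len(set(values))

def run_local(lines):
    """Hypothetical stand-in for the framework: map, sort by key, group, reduce."""
    pairs = [pair for line in lines for pair in mapper(None, line)]
    pairs.sort(key=itemgetter(0))  # the "shuffle": bring equal keys together
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(reducer(key, values))
    return results

print(run_local(["The last question", "the LAST answer"]))
# prints [(3, 1), (4, 1), (6, 1), (8, 1)]
```

Note that the reducer receives its values as an iterator rather than a list, which is why the example consumes them exactly once with set(values).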
Your job is to use the MapReduce mentality to solve the problem of computing mutual friends lists. The input data will be of the form
1 : 2,3,6,7,8
2 : 1,3,4,7
3 : 1,2,5,8
4 : 2,5,7,8
5 : 3,4,6
6 : 1,5,8
7 : 1,2,4,8
8 : 1,3,4,6,7
where the first number on each line corresponds to a particular person, and the list of numbers following the colon : corresponds to their list of friends.
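As a starting point for reading this format, here is one possible way to split an input line into a person and their friends. The helper name parse_line is hypothetical, and the sketch assumes that ids are integers and that every line contains a colon.

```python
def parse_line(line):
    """Split 'person : f1,f2,...' into (person, [friends]) with integer ids."""
    person, _, friends = line.partition(":")
    # Skip empty pieces so a person with no friends yields an empty list.
    friend_ids = [int(f) for f in friends.split(",") if f.strip()]
    return int(person), friend_ids

print(parse_line("1 : 2,3,6,7,8"))
# prints (1, [2, 3, 6, 7, 8])
```

Converting the ids to integers early keeps later sorting numeric, so that 10 sorts after 9 rather than after 1.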
Your code should output the mutual friends lists in the form
1,2 : 3,7
1,3 : 2,8
1,6 : 8
1,7 : 2,8
1,8 : 3,6,7
2,3 : 1
2,4 : 7
2,7 : 1,4
3,5 :
3,8 : 1
4,5 :
4,7 : 2,8
4,8 : 7
5,6 :
6,8 : 1
7,8 : 1,4
where the ordered pair preceding the colon keys the ordered list of mutual friends following the colon. You should avoid printing keys corresponding to pairs of persons who are not friends in the first place.
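The output format can be produced by a small formatting helper such as the one sketched below. The name format_pair is hypothetical, and the exact spacing for an empty mutual friends list is an assumption based on the sample output above.

```python
def format_pair(a, b, mutual):
    """Render one output line with the pair and the mutual friends in order."""
    low, high = sorted((a, b))  # ordered pair: smaller id first
    friends = ",".join(str(f) for f in sorted(mutual))
    return "{},{} : {}".format(low, high, friends)

print(format_pair(2, 1, {7, 3}))
# prints 1,2 : 3,7
print(repr(format_pair(5, 3, set())))
# prints '3,5 : '
```

Sorting the pair before using it as a key also means that (1, 2) and (2, 1) refer to the same friendship.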
Your code should be written in Python and should work on the Virtual Machines we provided you (as that is where we will test your solutions). Use the following code template for your solution.
from mrjob.job import MRJob

class MRMutualFriends(MRJob):

    def mapper(self, _, line):
        # Your code here

    def reducer(self, key, values):
        # Your code here

if __name__ == '__main__':
    MRMutualFriends.run()
Use NYU Classes to submit your Python code; there's an entry for this homework.
Last updated: 2016-04-15 16:24:03 -0400