MapReduce (Python) – How to sort the reducer output for a top-N list?
I’m new to MapReduce and am currently working through the Udacity class on Hadoop MapReduce.
I have a mapper that parses forum nodes and emits the tags associated with each node. My goal is to output the top 10 tags, sorted by count.
    video 1
    cs101 1
    meta 1
    bug 1
    issues 1
    nationalities 1
    cs101 1
    welcome 1
    cs101 1
    cs212 1
    cs262 1
    cs253 1
    discussion 1
    meta 1
It’s easy to sum them up in the reducer:
    #!/usr/bin/python
    import sys

    total = 0
    oldKey = None

    for line in sys.stdin:
        data_mapped = line.strip().split("\t")
        if len(data_mapped) != 2:
            # Malformed line: echo it between markers and skip it
            print "====================="
            print line.strip()
            print "====================="
            continue

        key, value = data_mapped

        # Keys arrive sorted, so a change of key means the previous tag is complete
        if oldKey and oldKey != key:
            print total, "\t", oldKey
            total = 0

        oldKey = key
        total += float(value)

    # Emit the total for the last tag
    if oldKey is not None:
        print total, "\t", oldKey
    1.0 application
    1.0 board
    1.0 browsers
    1.0 bug
    8.0 cs101
    1.0 cs212
    5.0 cs253
    1.0 cs262
    1.0 deadlines
    1.0 digital
    5.0 discussion
    1.0 google-appengine
    2.0 homework
    1.0 html
    1.0 hungarian
    1.0 hw2-1
    3.0 issues
    2.0 jobs
    2.0 lessons
I know the keys arrive sorted from the mapper, so I just test whether the current key is still the same tag. If it is not, I output the number of times the previous tag appeared.
However, the question is how do I sort this list?
If I store all the information in a list or dictionary, I can sort it in Python. However, this seems like a bad design decision, because with a large number of distinct tags you could run out of memory.
I believe you can use collections.Counter here.
Example (modified from your code):
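    #!/usr/bin/python
    import sys
    import collections

    counter = collections.Counter()

    for line in sys.stdin:
        # Split on the first tab only: tag, then count
        k, v = line.strip().split("\t", 1)
        counter[k] += int(v)

    # The ten most frequent tags as (tag, count) pairs, highest count first
    print counter.most_common(10)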
The collections.Counter class implements this exact use case, along with many other common use cases around counting and collecting statistics.
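To make the behaviour concrete, here is a small sketch (written for Python 3, unlike the Python 2 snippets above) that feeds the tags from the sample mapper output in the question into a Counter. The variable names are only illustrative.

```python
from collections import Counter

# Tags taken from the sample mapper output in the question
tags = [
    "video", "cs101", "meta", "bug", "issues", "nationalities", "cs101",
    "welcome", "cs101", "cs212", "cs262", "cs253", "discussion", "meta",
]

# Counter tallies how often each tag occurs
counter = Counter(tags)

# most_common(n) returns the n highest counts as (tag, count) pairs
top_two = counter.most_common(2)
print(top_two)  # [('cs101', 3), ('meta', 2)]
```

Only the top two are shown because every other tag in the sample occurs exactly once.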
From the Python documentation (8.3.1. Counter objects): "A counter tool is provided to support convenient and rapid tallies."
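On the memory concern raised in the question: a Counter still keeps one entry per distinct tag. Below is a minimal sketch of a reducer-side alternative, not a definitive implementation. It assumes Python 3, tab-separated tag/count lines that are already sorted by tag (as the reducer in the question expects), and a single reducer that sees every tag. The helper names `tag_totals` and `top_n` are made up for illustration. Each run of identical tags is totalled as it streams past, and `heapq.nlargest` is meant to hold only about N candidates at a time.

```python
import heapq
from itertools import groupby


def tag_totals(lines):
    """Yield (total, tag) for each run of identical tags in sorted input."""
    # Skip lines without a tab; split the rest into [tag, count]
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in lines
        if "\t" in line
    )
    # groupby relies on equal keys being adjacent, i.e. on sorted input
    for tag, group in groupby(pairs, key=lambda kv: kv[0]):
        yield sum(int(count) for _, count in group), tag


def top_n(lines, n=10):
    """Return the n largest (total, tag) pairs, highest total first."""
    return heapq.nlargest(n, tag_totals(lines))


# Hypothetical sorted mapper output
sample = ["bug\t1\n", "cs101\t1\n", "cs101\t1\n", "cs101\t1\n",
          "meta\t1\n", "meta\t1\n"]
print(top_n(sample, 2))  # [(3, 'cs101'), (2, 'meta')]
```

In a streaming reducer this would be called as `top_n(sys.stdin)`. With more than one reducer, each would only produce the top N of its own share of the tags, so a further merge step would still be needed.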