MapReduce (Python) – How to sort the reducer output of a top-N list?
I’m new to MapReduce and am currently working through the Udacity course on Hadoop MapReduce.
I have a mapper that parses forum nodes and emits the tags associated with each node. My goal is to list the top 10 tags, sorted by count.
Sample mapper output:
video 1
cs101 1
meta 1
bug 1
issues 1
nationalities 1
cs101 1
welcome 1
cs101 1
cs212 1
cs262 1
cs253 1
discussion 1
meta 1
It’s very easy to sum them all up in the reducer:
#!/usr/bin/python
import sys

total = 0
oldKey = None

for line in sys.stdin:
    data_mapped = line.strip().split("\t")
    if len(data_mapped) != 2:
        # Malformed line: print it between markers and skip it
        print("=====================")
        print(line.strip())
        print("=====================")
        continue

    key, value = data_mapped

    # Key changed: emit the total for the previous key and reset
    if oldKey and oldKey != key:
        print("{0}\t{1}".format(total, oldKey))
        total = 0

    oldKey = key
    total += float(value)

# Emit the total for the last key
if oldKey is not None:
    print("{0}\t{1}".format(total, oldKey))
Output:
1.0 application
1.0 board
1.0 browsers
1.0 bug
8.0 cs101
1.0 cs212
5.0 cs253
1.0 cs262
1.0 deadlines
1.0 digital
5.0 discussion
1.0 google-appengine
2.0 homework
1.0 html
1.0 hungarian
1.0 hw2-1
3.0 issues
2.0 jobs
2.0 lessons
I know the keys are sorted in the output of the mapper, so I’m just testing whether the key is still the same tag. If not, I output the number of times the previous tag appeared.
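The key-change check described above can also be expressed with itertools.groupby. This is a minimal sketch, not the asker's reducer: it assumes tab-separated "tag<TAB>count" lines that are already sorted by tag, and `sum_by_key` and `sample` are hypothetical names chosen for illustration.

```python
import itertools

def sum_by_key(lines):
    """Yield (tag, total) pairs from tab-separated lines sorted by tag."""
    pairs = (line.strip().split("\t") for line in lines)
    # Skip lines that do not have exactly two tab-separated fields.
    pairs = (p for p in pairs if len(p) == 2)
    # groupby starts a new group whenever the key changes, which is the
    # same check the reducer performs by comparing oldKey with key.
    for tag, group in itertools.groupby(pairs, key=lambda p: p[0]):
        yield tag, sum(int(count) for _, count in group)

# Assumed sample input, already sorted by tag.
sample = ["bug\t1", "cs101\t1", "cs101\t1", "cs101\t1", "meta\t1", "meta\t1"]
totals = list(sum_by_key(sample))
print(totals)  # [('bug', 1), ('cs101', 3), ('meta', 2)]
```

Because the function is a generator, only one group is held in memory at a time.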
However, the question is how do I sort this list?
If I store all the information in a list or dictionary, I can sort it in Python. However, this seems like a bad design decision, because with a lot of different tags you will run out of memory.
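One way to limit the memory concern is to keep only the largest totals seen so far instead of every tag. The following is a minimal sketch under stated assumptions: the per-tag totals arrive one at a time as (tag, total) pairs, as the reducer above produces them, and `top_n` is a hypothetical helper name, not part of the original code.

```python
import heapq

def top_n(totals, n=10):
    """Return the n (tag, total) pairs with the largest totals."""
    # heapq.nlargest consumes the iterable one item at a time and tracks
    # only the best n candidates, so the full list is never stored.
    return heapq.nlargest(n, totals, key=lambda pair: pair[1])

# Assumed sample totals; iter() mimics a stream that can be read only once.
totals = [("bug", 1), ("cs101", 8), ("cs253", 5), ("issues", 3), ("homework", 2)]
top3 = top_n(iter(totals), n=3)
print(top3)  # [('cs101', 8), ('cs253', 5), ('issues', 3)]
```

This only yields a global top 10 when a single reducer sees all keys. With several reducers, each would produce its own local top 10, and a final merge step would still be needed.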
Solution
I believe you can use the collections.Counter class here.
Example (modified from your code):
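#!/usr/bin/python
import sys
import collections

counter = collections.Counter()

for line in sys.stdin:
    # Each input line is "tag<TAB>count"; split once into two fields
    k, v = line.strip().split("\t", 1)
    counter[k] += int(v)

# Print the ten most frequent tags as (tag, count) pairs
print(counter.most_common(10))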
The collections.Counter() class implements this exact use case, along with many other common use cases around counting and collecting statistics.
8.3.1. Counter objects
A counter tool is provided to support convenient and rapid tallies. For example:
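A minimal sketch of such a tally, using a few tags from the question as assumed input. It is an illustration only, not the example from the Python documentation, and the names `tags`, `counts`, and `top_two` are chosen here for the sketch.

```python
from collections import Counter

# Tally how often each tag appears in a list of tags.
tags = ["cs101", "meta", "cs101", "bug", "meta", "cs101"]
counts = Counter()
for tag in tags:
    counts[tag] += 1  # missing keys start at 0, so no initialisation is needed

# most_common(n) returns the n highest counts, largest first.
top_two = counts.most_common(2)
print(top_two)  # [('cs101', 3), ('meta', 2)]
```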