MapReduce (Python) – How to sort the reducer output of a top-N list?

I’m new to MapReduce and am currently working through the Udacity class on Hadoop MapReduce.

I have a mapper that parses forum nodes and emits the tags associated with each node. My goal is to get a sorted list of the top 10 tags.

Sample mapper output:

video   1
cs101   1
meta    1
bug     1
issues  1
nationalities   1
cs101   1
welcome 1
cs101   1
cs212   1
cs262   1
cs253   1
discussion      1
meta    1

It’s very easy to sum them up in the reducer:

#!/usr/bin/python

import sys

total = 0
oldKey = None

# Input lines arrive sorted by key, so each tag's lines are contiguous.
for line in sys.stdin:
    data_mapped = line.strip().split("\t")

    if len(data_mapped) != 2:
        # Flag malformed lines and skip them
        print("=====================")
        print(line.strip())
        print("=====================")
        continue

    key, value = data_mapped

    # Key changed: emit the finished tag's total and reset
    if oldKey and oldKey != key:
        print("%s\t%s" % (total, oldKey))
        total = 0

    oldKey = key
    total += float(value)

# Emit the total for the last tag
if oldKey is not None:
    print("%s\t%s" % (total, oldKey))

Output:

1.0     application
1.0     board
1.0     browsers
1.0     bug
8.0     cs101
1.0     cs212
5.0     cs253
1.0     cs262
1.0     deadlines
1.0     digital
5.0     discussion
1.0     google-appengine
2.0     homework
1.0     html
1.0     hungarian
1.0     hw2-1
3.0     issues
2.0     jobs
2.0     lessons

I know the keys are ordered in the output of the mapper, so I’m just testing whether the key is still the same tag. If not, I output the number of times that tag occurred.

However, the question is how do I sort this list?

If I store all the information in a list or dictionary, I can sort it in Python. However, this seems like a bad design decision, because with a lot of different tags you could run out of memory.

Solution

I believe you can use the collections.Counter class here:

Example: (modified from your code).

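#!/usr/bin/python

import sys
import collections

counter = collections.Counter()

for line in sys.stdin:
    # Each line is "tag<TAB>count"
    k, v = line.strip().split("\t", 1)

    counter[k] += int(v)

# The 10 most frequent tags as (tag, count) pairs, highest count first
print(counter.most_common(10))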

The collections.Counter class implements this exact use case, along with many other common use cases around counting and collecting statistics.
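To make the behaviour of most_common concrete, here is a small sketch that runs the same tally over a few of the sample tags from the question. The helper name top_tags and the hard-coded sample list are illustrative assumptions, not part of the original answer.

```python
import collections


def top_tags(lines, n=10):
    """Tally "tag<TAB>count" lines and return the n most common (tag, count) pairs."""
    counter = collections.Counter()
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines
        tag, value = parts
        counter[tag] += int(value)
    return counter.most_common(n)


# Hypothetical input built from a few tags in the question's sample output
sample = ["video\t1", "cs101\t1", "meta\t1", "bug\t1",
          "cs101\t1", "cs101\t1", "meta\t1"]

for tag, count in top_tags(sample, n=2):
    print("%s\t%d" % (tag, count))
```

Note that the Counter still holds one entry per distinct tag, so this works well when the set of tags fits comfortably in memory.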

From the Python documentation (8.3.1. Counter objects): "A counter tool is provided to support convenient and rapid tallies."
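If the number of distinct tags is large enough that memory is a real concern, as the question worries, one possible alternative is to rely on the reducer input being sorted by key and keep only the current top N in a heap. The following is a sketch under that assumption. The function names tag_totals and top_n are made up for illustration, and it uses itertools.groupby and heapq.nlargest from the standard library rather than anything from the original answer.

```python
import heapq
import itertools


def tag_totals(lines):
    """Yield (total, tag) for each tag, assuming lines arrive sorted by tag."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line)
    for tag, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        yield sum(int(value) for _, value in group), tag


def top_n(lines, n=10):
    # heapq.nlargest consumes the generator while holding at most n items
    return heapq.nlargest(n, tag_totals(lines))


# Hypothetical reducer input, already sorted by tag
sorted_sample = ["bug\t1", "cs101\t1", "cs101\t1", "cs101\t1",
                 "meta\t1", "meta\t1", "video\t1"]

for total, tag in top_n(sorted_sample, n=2):
    print("%d\t%s" % (total, tag))
```

In a real reducer the lines would come from sys.stdin instead of a list. With more than one reducer, each one only sees its own partition of the keys, so a global top N would likely need a single reducer or a second pass over the per-reducer results.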
