Python – Hadoop and Python: Disable Sorting

I’ve noticed that when I run Hadoop with Python code, my output is sorted before reducer.py prints it; I’m not sure whether the mapper or the reducer does this. At the moment it appears to be sorted alphanumerically. Is there a way to disable the sorting completely? I want the program’s output to follow the order in which mapper.py prints it. I found an answer for Java, but not for Python. Do I need to modify mapper.py or the command-line arguments?

Solution

You should read more about basic MapReduce concepts. Although sorting may not be necessary in some cases, the shuffle part of the Shuffle and Sort phase is an inherent part of the MapReduce model. The MapReduce framework (Hadoop) needs to group the mapper output so that all values for a given key are sent to the same reducer; only then can the reducer truly “reduce” the data. With Hadoop Streaming, the key and value on each line are separated by a tab character by default. From the sample code in your other SO questions, I can see that your mapper does not emit a “key, value” pair, just a single line of text. In that case the whole line is treated as the key, which is why your output comes back sorted by line.
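To make the key/value point concrete, here is a minimal sketch of what a streaming mapper could emit. The function names (`to_key_value`, `map_lines`) and the choice of the first whitespace-separated token as the key are assumptions for illustration only; they are not taken from the asker's code.

```python
def to_key_value(line, sep="\t"):
    """Turn one raw input line into a 'key<TAB>value' record.

    Assumption: the first whitespace-separated token is the key and the
    rest of the line is the value. Adapt this to the real data layout.
    """
    parts = line.strip().split(None, 1)
    key = parts[0] if parts else ""
    value = parts[1] if len(parts) > 1 else ""
    return key + sep + value


def map_lines(lines):
    """Yield one tab-separated record per non-empty input line."""
    for line in lines:
        if line.strip():
            yield to_key_value(line)


# Small local check with in-memory lines instead of real Hadoop input.
for record in map_lines(["apple 3 red\n", "\n", "pear 7\n"]):
    print(repr(record))
```

In an actual mapper.py the same generator would be fed from `sys.stdin` and each record printed on its own line; Hadoop then groups and sorts the records by the part before the tab.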

EDIT: In answer to the follow-up question “How do I sort by number (e.g., 9 before 10)?”:

Alternative 1: Pad your keys with leading zeros so that they all have the same width. The default comparison is on text, so “09” sorts before “10”, whereas an unpadded “9” would sort after it.
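A short sketch of the padding idea, assuming non-negative integer keys and a hypothetical helper named `pad_key`; the width of 3 is only an example and must be at least the digit count of the largest key.

```python
def pad_key(key, width=10):
    """Left-pad a non-negative integer key with zeros to a fixed width."""
    return str(key).zfill(width)


keys = [10, 9, 100, 1]

# Plain string sort: what a text comparison does with unpadded keys.
plain = sorted(str(k) for k in keys)

# Same sort after padding: text order now matches numeric order.
padded = sorted(pad_key(k, width=3) for k in keys)

print(plain)   # ['1', '10', '100', '9']
print(padded)  # ['001', '009', '010', '100']
```

The reducer can recover the original number with `int(key)`, which ignores the leading zeros.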

Alternative 2: Use KeyFieldBasedComparator, as shown in this SO question.
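As a rough sketch of the second alternative, the snippet below only assembles the argument list for a streaming job that compares the first key field numerically. The property names follow the Hadoop 2+ streaming documentation (older releases use `mapred.*` names instead), and the jar name, paths, and helper functions are placeholders, so check them against your installation.

```python
def numeric_sort_options(key_field=1):
    """Generic -D options asking Hadoop to compare one key field as a number."""
    return [
        "-D",
        "mapreduce.job.output.key.comparator.class="
        "org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator",
        "-D",
        # -kN,Nn: sort on field N only, numerically (like Unix sort).
        "mapreduce.partition.keycomparator.options=-k{0},{0}n".format(key_field),
    ]


def streaming_command(jar, input_path, output_path):
    """Build the command as a list; generic -D options precede streaming options."""
    return (
        ["hadoop", "jar", jar]
        + numeric_sort_options()
        + ["-input", input_path, "-output", output_path,
           "-mapper", "mapper.py", "-reducer", "reducer.py"]
    )


# Placeholder jar name and paths, for illustration only.
cmd = streaming_command("hadoop-streaming.jar", "input_dir", "output_dir")
print(" ".join(cmd))
```

Options for shipping mapper.py and reducer.py to the cluster are left out here. The resulting list can be passed to `subprocess.run` or joined and pasted into a shell.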
