Java – Sort from a file by values in Hadoop

I have a file with a string followed by a space and then a number on each line.

Example:

Line1: Word 2
Line2: Word1 8
Line3: Word2 1

I need to sort the lines by their numbers in descending order, write the results to a file, and assign a rank to each number. My output should be a file with the following format:

Line1: Word1 8 1
Line2: Word 2 2
Line3: Word2 1 3

Does anyone have an idea how I can do this in Hadoop?
I am using Java with Hadoop.

Solution

You can organize your map/reduce calculations like this:

Map input: default (TextInputFormat, one line of the file per record)

Map output: “key: number, value: word”

Sort phase (by key)

Here you need to override the default sort comparator so that the keys are sorted in descending order.

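Reduce phase (1 reducer)

Reduce input: “key: number, value: word”

Reduce output: “key: word, value: (number, rank)”

The reducer keeps a running counter. For each key-value pair it receives, it increments the counter and emits the counter's value as the rank. Using a single reducer (for example, job.setNumReduceTasks(1)) sends every pair through the same task, so the ranks are global.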
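To make the data flow concrete, here is a minimal local simulation of the three steps in plain Java. It does not use any Hadoop classes; the class and method names (RankSimulation, map, rank) are made up for illustration, and it assumes every line has exactly the "word number" shape from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Local simulation of the map / sort / reduce plan; no Hadoop involved.
final class RankSimulation {

    // One (number, word) pair, i.e. what the map step would emit.
    static final class Pair {
        final int number;
        final String word;

        Pair(int number, String word) {
            this.number = number;
            this.word = word;
        }
    }

    // "Map": split each "word number" line and emit (number, word).
    static List<Pair> map(List<String> lines) {
        List<Pair> out = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            out.add(new Pair(Integer.parseInt(parts[1]), parts[0]));
        }
        return out;
    }

    // "Sort" and "reduce": order by number descending, then rank with a counter.
    static List<String> rank(List<String> lines) {
        List<Pair> pairs = map(lines);
        // Swapped arguments give descending order, like the custom sort comparator.
        pairs.sort((a, b) -> Integer.compare(b.number, a.number));

        List<String> out = new ArrayList<>();
        int counter = 0; // plays the role of the single reducer's counter
        for (Pair p : pairs) {
            counter++;
            out.add(p.word + " " + p.number + " " + counter);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        input.add("Word 2");
        input.add("Word1 8");
        input.add("Word2 1");
        for (String row : rank(input)) {
            System.out.println(row);
        }
    }
}
```

In the real job, the map step would live in a Mapper, the ordering would come from the sort comparator, and the counter loop would live in the single Reducer.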

EDIT: Here is a snippet of the custom descending sort comparator:

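// Imports go at the top of the enclosing file.
import java.nio.ByteBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

// Sorts IntWritable keys in descending order during the shuffle.
public static class IntComparator extends WritableComparator {

    public IntComparator() {
        super(IntWritable.class);
    }

    // Compares the serialized keys directly, without creating IntWritable objects.
    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        // An IntWritable is serialized as 4 big-endian bytes.
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();

        // Swapping the arguments reverses the natural ascending order.
        return Integer.compare(v2, v1);
    }
}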
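The byte-level comparison can be checked without a cluster. The sketch below repeats the comparator's decode-and-reverse logic in a stand-alone class; DescendingBytesDemo and its encode helper are hypothetical names used only for this illustration, and the helper writes an int as 4 big-endian bytes, which is the layout IntWritable uses when it is serialized.

```java
import java.nio.ByteBuffer;

// Stand-alone check of the byte-level comparison; no Hadoop classes involved.
final class DescendingBytesDemo {

    // Encode an int as 4 big-endian bytes (ByteBuffer's default byte order).
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Decode both ints from their byte ranges, then reverse the natural order.
    static int compareDescending(byte[] b1, int s1, int l1,
                                 byte[] b2, int s2, int l2) {
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(v2, v1);
    }

    public static void main(String[] args) {
        byte[] eight = encode(8);
        byte[] two = encode(2);
        // A negative result means 8 sorts before 2 under the descending order.
        System.out.println(compareDescending(eight, 0, 4, two, 0, 4));
    }
}
```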

Don’t forget to register it as the sort comparator for your job:

job.setSortComparatorClass(IntComparator.class);
