Java – TotalOrderPartitioner and partition files

TotalOrderPartitioner and partition files… here is a solution to the problem.

TotalOrderPartitioner and partition files

I’m learning hadoop mapreduce and I’m using the Java API. I learned that TotalOrderPartitioner is used to sort output in the “global” key in the cluster, and that it requires a partition file (generated using InputSampler):

job.setPartitionerClass(TotalOrderPartitioner.class);
InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler<Text, Text>(0.1, 200);
InputSampler.writePartitionFile(job, sampler);

I

have a few queries and I ask the community for help :

  1. What exactly does the word “global sort” mean here? How exactly is the output sorted, and we also have multiple output partial files distributed across the cluster?

  2. What happens if we don’t provide partition files? Is there a default way to handle this situation?

Solution

Let’s explain it with an example. Suppose your partition file looks like this:

H
T
V

When your keys range from A to Z, this makes up 4 ranges:

1 [A,H)
2 [H,T)
3 [T,V)
4 [V,Z]

When the mapper now sends records to the reducer, the partitioner looks at the keys for the output. Suppose the output of all mappers is as follows:

A,N,C,K,Z,S,U

Now the partitioner checks your partition file and sends the record to the appropriate reducer. Let’s assume that you have defined 4 reducers, so each reducer will handle a range:

Reducer 1 handles A,C
Reducer 2 handles N,K,S
Reducer 3 handles U
Reducer 4 handles Z

This indicates that your partition file must contain at least n-1 elements compared to the number of reducers you use. docs Another important note:

If the keytype is BinaryComparable and
total.order.partitioner.natural.order is not false, a trie of the
first total.order.partitioner.max.trie.depth(2) + 1 bytes will be
built. Otherwise, keys will be located using a binary search of the
partition keyset using the RawComparator defined for this job. The
input file must be sorted with the same comparator and contain
JobContextImpl.getNumReduceTasks() – 1 keys.

Related Problems and Solutions