Java – Unique key generation in Hive/Hadoop


When you select a set of records from a big Hive table, you sometimes need to create a unique key for each record. In sequential operation it is easy to generate a unique id by calling something like max(id) and incrementing it. But since Hive runs tasks in parallel, how can we generate unique keys as part of a select query without hurting Hadoop performance?
Is this really a map-reduce problem, or do we need to fall back to a sequential approach?

Solution

If for some reason you don’t want to deal with UUIDs, this numerically based solution does not require your parallel units to “talk” to each other or do any synchronization, so it is very efficient. It does not, however, guarantee that your integer keys will be contiguous.

Say there are N units of parallel execution, you know N in advance, and each unit is assigned an ID from 0 to N − 1. Then each unit can generate integers that are unique across all units:

Unit #0:   0, N, 2N, 3N, ...
Unit #1:   1, N+1, 2N+1, 3N+1, ...
...
Unit #N-1: N-1, N+(N-1), 2N+(N-1), 3N+(N-1), ...
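The scheme above amounts to each unit keeping a local counter and returning unitId + counter * N. A minimal sketch in Java (the class name UniqueIdGenerator is made up for this illustration; it is not part of any Hadoop API):

```java
// Sketch: per-unit id generation with no coordination between units.
// Unit i produces i, i + N, i + 2N, ... so the streams never overlap.
class UniqueIdGenerator {
    private final int unitId;   // this unit's id, in 0 .. N-1
    private final int numUnits; // N, the total number of parallel units
    private long issued = 0;    // how many ids this unit has handed out

    UniqueIdGenerator(int unitId, int numUnits) {
        if (unitId < 0 || unitId >= numUnits) {
            throw new IllegalArgumentException("unitId must be in 0 .. N-1");
        }
        this.unitId = unitId;
        this.numUnits = numUnits;
    }

    // Next id for this unit: unitId, unitId + N, unitId + 2N, ...
    long nextId() {
        return unitId + (issued++) * (long) numUnits;
    }
}
```

Because each unit only ever emits values congruent to its own unitId modulo N, no two units can collide, even though they never communicate.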

Depending on where you need to generate the key (mapper or reducer), you can get N from the Hadoop configuration:

Mapper:  mapred.map.tasks
Reducer: mapred.reduce.tasks

… and your unit ID, in Java, via:

 context.getTaskAttemptID().getTaskID().getId()

I am not sure about Hive specifically, but the same approach should be possible there too.
