Java – Pig: Group by ranges/binning data

I have a set of integer values and I want to group them into a number of bins.

Example: Let's say I have a thousand points between 1 and 1000 and I want to split them into 20 bins.

Is there a way to group them into a bin/array?

Also, I wouldn't know in advance how wide the range is, so I can't hardcode any particular value.

Solution

If you know the minimum and maximum values, you can divide the range by the number of bins. For example:

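-- foo.pig
-- Parameters: INPUT (path), MIN and MAX (value range), BIN_COUNT (number of bins)
ids = load '$INPUT' as (id: int);
-- Integer arithmetic maps each id to a bin number in [0, BIN_COUNT - 1]
ids_with_key = foreach ids generate (id - $MIN) * $BIN_COUNT / ($MAX - $MIN + 1) as bin_id, id;
group_by_bin = group ids_with_key by bin_id;
-- Emit one (bin_id, id) row per input value
binned = foreach group_by_bin generate group as bin_id, flatten(ids_with_key.id);
dump binned;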

You can then run it with the following command:

pig -f foo.pig -p MIN=1 -p MAX=1000 -p BIN_COUNT=20 -p INPUT=your_input_path

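The idea behind the script is to divide the range [MIN, MAX] into BIN_COUNT bins of equal width, BIN_SIZE = (MAX - MIN + 1) / BIN_COUNT, and then map each id to its bin number, (id - MIN) / BIN_SIZE, before grouping. The script writes this as (id - MIN) * BIN_COUNT / (MAX - MIN + 1), multiplying before dividing, so that integer division does not truncate BIN_SIZE when the range is not an exact multiple of BIN_COUNT. With MIN=1, MAX=1000 and BIN_COUNT=20, each bin is 50 wide: id 1 maps to bin 0 and id 1000 maps to bin 19.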
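
The bin arithmetic can be checked outside Pig. The following is a minimal sketch in plain Python (not Pig) that applies the same integer formula as the script. The helper name assign_bins is hypothetical, and taking the minimum and maximum from the data itself is an assumption made here for the case where the range is not known in advance; it is not part of the original answer.

```python
def assign_bins(values, bin_count, lo=None, hi=None):
    """Group integer values into bin_count equal-width bins.

    Mirrors the script's integer formula:
        (value - lo) * bin_count // (hi - lo + 1)
    If lo or hi is omitted, the observed min or max is used instead
    (an assumption for the case where the range is unknown).
    """
    values = list(values)
    if not values:
        return {}
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    width = hi - lo + 1  # number of distinct integers in [lo, hi]
    bins = {}
    for v in values:
        # Multiply before dividing so integer division does not truncate early.
        bin_id = (v - lo) * bin_count // width
        bins.setdefault(bin_id, []).append(v)
    return bins


bins = assign_bins(range(1, 1001), 20)
print(len(bins))                   # 20
print(bins[0][0], bins[0][-1])     # 1 50
print(bins[19][0], bins[19][-1])   # 951 1000
```

For the example from the question (values 1 to 1000, 20 bins), every bin ends up with 50 values, and the bin numbers run from 0 to 19.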
