Java – How do I choose the optimal key in MapReduce?


I’m working with a stock-trading log file in which each row represents one trade transaction with 20 tab-separated values. I’m using Hadoop to process this file and do some benchmarking of the trades. Each row needs a separate baseline calculation, so I don’t need the reduce phase of MapReduce. To perform the baseline calculation for a row, I have to query a Sybase database to get a standard value corresponding to that row; the database is indexed on two fields per row (trade ID and stock ID). My question is: should I use tradeId and stockId as the key in my MapReduce program, or should I choose some other value, or combination of values, for my key?
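Since the database index is on trade ID and stock ID, those two fields have to be pulled out of each tab-separated row. A minimal sketch of that extraction, assuming (hypothetically) that trade ID is column 0 and stock ID is column 1 — adjust the positions to your real layout:

```java
/**
 * Hypothetical sketch: extract the two indexed fields from one
 * tab-separated trade row. The column positions are assumptions.
 */
public class TradeRow {
    static final int TRADE_ID_COL = 0;  // assumed position of trade ID
    static final int STOCK_ID_COL = 1;  // assumed position of stock ID

    /** Returns {tradeId, stockId} from a 20-column tab-separated line. */
    public static String[] keyFields(String line) {
        // limit -1 keeps trailing empty columns instead of dropping them
        String[] cols = line.split("\t", -1);
        if (cols.length < 20) {
            throw new IllegalArgumentException(
                "expected 20 columns, got " + cols.length);
        }
        return new String[] { cols[TRADE_ID_COL], cols[STOCK_ID_COL] };
    }
}
```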

Solution

So, for each row of input, you query the database and then perform a separate baseline calculation. Once the baseline calculation is complete, you output the row together with its baseline value.

In this case, you can either skip the reducer entirely (a map-only job) or use the identity reducer.
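For the map-only option, the driver just sets the number of reduce tasks to zero. A job-configuration sketch, assuming the `org.apache.hadoop.mapreduce` API and hypothetical class names (`TradeBenchmark`, `TradeBenchmarkMapper`):

```java
// Job-configuration sketch; with zero reduce tasks, map output is
// written directly to HDFS with no shuffle or sort phase.
Job job = Job.getInstance(new Configuration(), "trade-benchmark");
job.setJarByClass(TradeBenchmark.class);        // hypothetical driver class
job.setMapperClass(TradeBenchmarkMapper.class); // hypothetical mapper class
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setNumReduceTasks(0);                       // map-only: no reducer runs
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
```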

So your map function reads a row, issues a query to the Sybase database to get the standard values, and then performs the baseline calculation. Because you want to output the baseline value for each row, have the map function emit the row as the key and the baseline value as the value, i.e. <line, benchmark value>.

Your map function would look like this (assuming the benchmark value is an integer):

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();  // this will be your key in the final output

    /* Parse the tab-separated fields of the line */

    /* int standardValue = <result of the Sybase query>; */

    /* Perform the benchmark calculation to obtain benchmarkValue */

    context.write(new Text(line), new IntWritable(benchmarkValue));
}
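One practical caution: issuing a fresh Sybase query per row can be expensive if trade/stock pairs repeat. A minimal sketch, assuming a hypothetical loader standing in for the real JDBC lookup, of caching standard values by the same (trade ID, stock ID) composite the database is indexed on:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch: cache standard values per (tradeId, stockId)
 * so repeated rows don't trigger repeated database queries.
 * The loader function stands in for the actual Sybase/JDBC call.
 */
public class StandardValueCache {
    private final Map<String, Integer> cache = new HashMap<>();
    private final Function<String, Integer> loader;  // e.g. a JDBC lookup

    public StandardValueCache(Function<String, Integer> loader) {
        this.loader = loader;
    }

    public int get(String tradeId, String stockId) {
        // Composite key mirrors the database index on (trade ID, stock ID)
        String key = tradeId + "|" + stockId;
        return cache.computeIfAbsent(key, loader);  // query only on a miss
    }
}
```

In a mapper, such a cache (or the JDBC connection itself) would typically be created once in `setup()` rather than inside `map()`.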
