Java – How to parse CustomWritable from text in Hadoop

How to parse CustomWritable from text in Hadoop

Let’s say I have timestamped values per user in a text file, e.g.:

#userid; unix-timestamp; value
1; 2010-01-01 00:00:00; 10
2; 2010-01-01 00:00:00; 20
1; 2010-01-01 01:00:00; 11
2; 2010-01-01 01:00:00; 21
1; 2010-01-02 00:00:00; 12
2; 2010-01-02 00:00:00; 22

I have a custom class “SessionSummary” that implements WritableComparable, with readFields and write methods. Its purpose is to summarize all the values for each user for each calendar day.
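
For illustration, a minimal sketch of what such a class might look like (the field names and types are assumptions, not taken from the actual implementation):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class SessionSummary implements WritableComparable<SessionSummary> {

    private int userId;   // assumed fields: user id, calendar day, summarized value
    private String day;
    private long total;

    public void write(DataOutput out) throws IOException {
        out.writeInt(userId);
        out.writeUTF(day);
        out.writeLong(total);
    }

    public void readFields(DataInput in) throws IOException {
        userId = in.readInt();
        day = in.readUTF();
        total = in.readLong();
    }

    public int compareTo(SessionSummary other) {
        if (userId != other.userId) {
            return userId < other.userId ? -1 : 1;
        }
        return day.compareTo(other.day);
    }

    @Override
    public String toString() {
        // human-readable, tab-separated form used by TextOutputFormat
        return userId + "\t" + day + "\t" + total;
    }
}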

So the mapper keys each row by user, and the reducer sums up the values for each user per day and writes a SessionSummary via TextOutputFormat (using SessionSummary’s toString, as a tab-separated UTF-8 string):

1; 2010-01-01; 21
2; 2010-01-01; 41
1; 2010-01-02; 12
2; 2010-01-02; 22

If I need to use these summary entries in a second Map/Reduce stage, how should I parse this summary data to populate the members of SessionSummary? Can I somehow reuse the existing readFields and write methods (the WritableComparable implementation) by treating the text strings as DataInput? This (obviously) doesn’t work:

public void map(...) {
    SessionSummary ssw = new SessionSummary();
    ssw.readFields(new DataInputStream(new ByteArrayInputStream(value.getBytes("UTF-8"))));
}
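
For context: readFields can only consume the exact byte stream that a matching write call produced, so the UTF-8 bytes of the tab-separated text cannot be deserialized this way. An in-memory round trip through write, by contrast, does work (a minimal sketch, with illustrative names):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class RoundTripDemo {
    public static void main(String[] args) throws Exception {
        SessionSummary summary = new SessionSummary(); // assume populated elsewhere

        // Serialize with write(): this produces the binary layout readFields expects
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        summary.write(new DataOutputStream(buffer));

        // Deserialize with readFields(): works, because the bytes came from write()
        SessionSummary copy = new SessionSummary();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));

        // Feeding readFields() the UTF-8 bytes of the toString() output fails instead,
        // since the text layout does not match the binary layout above.
    }
}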

Overall: Is there a best practice to implement custom keys and values in Hadoop and make them easily reusable across multiple M/R stages, while maintaining human-readable text output at each stage?

(Hadoop version is 0.20.2/CDH3u3).

Solution

The output format of your first MR job should be SequenceFileOutputFormat. This stores the key/value pairs output by the reducer in binary form, and your second MR job can read them back using SequenceFileInputFormat. Also make sure that outputKeyClass and outputValueClass are set appropriately on the job.

The mapper in the second job then receives SessionSummary as its input key (along with whatever value type the first job emitted), with no text parsing needed.
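
A minimal sketch of the relevant job wiring (new mapreduce API; the class names, paths, and NullWritable value type are assumptions, and mapper/reducer setup is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SummaryDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // First job: store the reducer output as a binary SequenceFile
        Job first = new Job(conf, "session-summary");
        // ... set mapper, reducer, input path, jar, etc. ...
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        first.setOutputKeyClass(SessionSummary.class);   // the custom WritableComparable
        first.setOutputValueClass(NullWritable.class);   // or whatever value type the reducer emits
        FileOutputFormat.setOutputPath(first, new Path("summary-out"));
        first.waitForCompletion(true);

        // Second job: read the SequenceFile back in; the mapper then receives
        // SessionSummary objects directly, no text parsing required
        Job second = new Job(conf, "second-stage");
        // ... set mapper, reducer, output path, jar, etc. ...
        second.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(second, new Path("summary-out"));
        second.waitForCompletion(true);
    }
}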

If you need to view the output of your first MR job in text form, you can run the following command on the output files in HDFS:

hadoop fs -libjars my-lib.jar -text output-dir/part-r-*

This reads the sequence file’s key/value pairs, calls toString() on each key and value, and writes them to standard output separated by a tab. -libjars tells Hadoop where to find your custom key and value classes.
