Java – Write one file per group in Pig Latin

Question:
I have many files containing Apache web server log entries. The entries are not in date-time order and are scattered across the files. I'm trying to use Pig to read a day's worth of files, group and sort the log entries by date and time, and then write them to files named for the day and hour of the entries they contain.

Setup:
After loading the files, I use a regex to extract the date field and truncate it to the hour. This produces a relation with the original record in one field and the date truncated to the hour in the other. From there, I group on the date-hour field.
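To make the truncation step concrete, here is a minimal sketch in plain Java rather than Pig Latin. It is not the script from the question, which is not shown. It assumes the entries use the Common Log Format timestamp (for example `[10/Oct/2000:13:55:36 -0700]`), and the class name `LogHourKey` and method `hourKey` are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: derives a date-hour key from one Apache log line.
final class LogHourKey {
    // Assumes a Common Log Format timestamp such as [10/Oct/2000:13:55:36 -0700].
    // Group 1 captures the date, group 2 the hour; minutes and seconds are dropped.
    private static final Pattern TIMESTAMP = Pattern.compile(
            "\\[(\\d{2}/[A-Za-z]{3}/\\d{4}):(\\d{2}):\\d{2}:\\d{2} [+-]\\d{4}\\]");

    // Returns a key like "10-Oct-2000-13", or null if no timestamp is found.
    static String hourKey(String logLine) {
        Matcher m = TIMESTAMP.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        // Slashes are replaced so the key can double as a file or directory name.
        return m.group(1).replace('/', '-') + "-" + m.group(2);
    }
}
```

Every entry logged within the same hour maps to the same key, which is what makes the key usable both for grouping and for naming the output.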

First try:
My first thought was to call STORE while iterating over my groups with FOREACH, but I quickly found out that Pig does not allow that.

Second attempt:
My second attempt was to use MultiStorage() from piggybank, and it seemed to work great until I looked at the output files. The problem is that MultiStorage writes every field to the file, including the one I group on. All I really want written to the file is the original record.

Question:
So… am I using Pig for something it isn't suited to, or is there a better way to solve this with Pig? Now that I've asked, I'm going to put together a simple code example to explain my problem further. Once I have it, I'll post it here. Thanks in advance.

Solution

Out of the box, Pig doesn't have a lot of features. It does the basic things, but I often find myself having to write custom UDFs or load/store functions to get from 95% of the way there to 100%. I usually find it worth it, because writing a small store function takes far less Java than writing an entire MapReduce program.

Your second attempt is very close to what I would do. Either copy the source code for MultiStorage or extend it as a starting point. Then modify the putNext method so that it strips out the group value but still uses that value to choose the output file. Unfortunately, Tuple does not have a remove or delete method, so you will have to rebuild the entire tuple. Or, if the only other field is the original string, just pull that field out and write it wrapped in a new Tuple.
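The "rebuild the tuple" step might look roughly like the sketch below. To keep it runnable without Pig on the classpath, it uses a plain `List<Object>` as a stand-in for Pig's `Tuple`; this is an assumption made for illustration, not MultiStorage's actual code. The class name `GroupFieldStripper`, the method `withoutField`, and the sample values are all hypothetical. In a real store function the same logic would sit inside putNext, and the copy would be a new Tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: a List<Object> plays the role of a Pig Tuple here.
final class GroupFieldStripper {
    // Returns a copy of the fields without the one at splitFieldIndex.
    // A copy is needed because the tuple itself offers no remove operation.
    static List<Object> withoutField(List<Object> fields, int splitFieldIndex) {
        if (splitFieldIndex < 0 || splitFieldIndex >= fields.size()) {
            throw new IndexOutOfBoundsException(
                    "split field index " + splitFieldIndex
                    + " out of range for " + fields.size() + " fields");
        }
        List<Object> stripped = new ArrayList<>(fields.size() - 1);
        for (int i = 0; i < fields.size(); i++) {
            if (i != splitFieldIndex) {
                stripped.add(fields.get(i));
            }
        }
        return stripped;
    }

    public static void main(String[] args) {
        // Field 0 is the original log line, field 1 is the date-hour group key.
        List<Object> record = Arrays.<Object>asList("<raw log line>", "10-Oct-2000-13");
        Object fileKey = record.get(1);                  // still selects the output file
        List<Object> toWrite = withoutField(record, 1);  // only this part is written
        System.out.println(fileKey + " <- " + toWrite);
    }
}
```

The point of the design is that the group value is read before it is dropped: it still decides which file the record goes to, but it never reaches the file's contents.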

Some general documentation on writing load/store functions in case you need more help: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions
