Java – JSON objects span multiple lines, how to split input in Hadoop

I need to read a large JSON file whose records may span multiple lines (not files), depending entirely on how the data provider writes it.

Elephant-Bird assumes LZO compression, which I know the data provider will not be using.

The Dzone article http://java.dzone.com/articles/hadoop-practice assumes that each JSON record sits on a single line.

Any ideas, other than compressing the JSON (the files will all be huge), on how to split the file properly so that no JSON record is broken?

Edit: line, not file

Solution

Absent any other suggestions, and depending on how the JSON is formatted, you may have an option.

As noted in the Dzone article, the problem is that JSON has no end element that you can easily locate when jumping to a split point.

Now, if your input JSON has a “pretty” or standard format, you can take advantage of this in your custom input format implementation.

For example, take the sample JSON from the Dzone article:

{
  "results" :
    [
      {
        "created_at" : "Thu, 29 Dec 2011 21:46:01 +0000",
        "from_user" : "grep_alex",
        "text" : "RT @kevinweil: After a lot of hard work by ..."
      },
      {
        "created_at" : "Mon, 26 Dec 2011 21:18:37 +0000",
        "from_user" : "grep_alex",
        "text" : "@miguno pull request has been merged, thanks again!"
      }
    ]
}

With this format, you know (hopefully?) that each new record starts on a line consisting of 6 spaces and an opening brace. A record ends in a similar way: a line with 6 spaces and a closing brace (followed by a comma for every record except the last).

So your logic in this case is: consume lines until you find one with 6 spaces and an opening brace. Then buffer the content until you find the line with 6 spaces and a closing brace. Then use any JSON deserializer you like to convert the buffered text into a Java object (or just pass the multi-line text to your mapper).
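The scanning logic above can be sketched in plain Java, without any Hadoop classes. This is only a rough illustration under the stated assumption that the file is pretty-printed exactly like the sample; `PrettyJsonRecordScanner`, `nextRecord`, `readAll` and `rtrim` are hypothetical names, not part of Hadoop, Elephant-Bird or the Dzone article.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: pulls multi-line records out of pretty-printed JSON. */
final class PrettyJsonRecordScanner {
    // Assumption: a record opens on a line of exactly six spaces plus '{' ...
    private static final String RECORD_START = "      {";
    // ... and closes on six spaces plus '}', optionally followed by a comma.
    private static final String RECORD_END = "      }";

    /** Returns the next complete record, or null if none is left. */
    static String nextRecord(BufferedReader in) throws IOException {
        String line;
        // Discard lines until a record start; this also resynchronises
        // a reader that was dropped into the middle of a record.
        while ((line = in.readLine()) != null) {
            if (rtrim(line).equals(RECORD_START)) {
                break;
            }
        }
        if (line == null) {
            return null; // no further record start in the input
        }
        StringBuilder record = new StringBuilder(RECORD_START);
        // Buffer lines until the matching record end marker.
        while ((line = in.readLine()) != null) {
            String trimmed = rtrim(line);
            if (trimmed.equals(RECORD_END) || trimmed.equals(RECORD_END + ",")) {
                // Drop the trailing comma so the record is standalone JSON.
                return record.append('\n').append(RECORD_END).toString();
            }
            record.append('\n').append(line);
        }
        return null; // input ended before the record was closed
    }

    /** Convenience wrapper: collects every record found in the text. */
    static List<String> readAll(String text) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String record;
            while ((record = nextRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    // Removes trailing whitespace so stray spaces do not hide a marker.
    private static String rtrim(String s) {
        return s.replaceAll("\\s+$", "");
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "{", "  \"results\" :", "    [",
            "      {", "        \"from_user\" : \"grep_alex\"", "      },",
            "      {", "        \"from_user\" : \"grep_alex\"", "      }",
            "    ]", "}");
        for (String record : readAll(sample)) {
            System.out.println(record);
            System.out.println("--");
        }
    }
}
```

In a real custom input format, the same loop would presumably live inside a record reader. That reader would also need to handle split boundaries, for example by finishing a record that starts before the end of its split even when the closing line lies beyond it. That part is not shown here.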
