Java – Hadoop MapReduce with RDF/XML files


So I have ten different files, each of which looks like this:

<DocID1>    <RDF Document>
<DocID2>    <RDF Document>
.
.
.
.
<DocID50000>    <RDF Document>

There are actually about 56,000 lines per file. Each row has a document ID and an RDF document.

My goal is to pass each line to a mapper as an input key-value pair and emit multiple output key-value pairs from each one. In the reduce step, I’ll store these in a Hive table.

I have a few getting-started questions, and I’m completely new to RDF/XML files.

  1. How should I parse the document so that each line is passed to a mapper separately?

  2. Is there an efficient way to control the input size of the mapper?

Solution

1- If you are using TextInputFormat, each call to the mapper automatically receives one line of the file as its value. Convert this line to a String and do the required processing (a minimal mapper sketch follows the usage example below). Alternatively, you can use the Hadoop Streaming API with StreamXmlRecordReader. You must provide the begin and end tags, and everything sandwiched between those tags will be handed to the mapper (in your case, <DocID1> and <RDF Document>).

Usage:

hadoop jar hadoop-streaming.jar -inputreader "StreamXmlRecord,begin=DocID,end=RDF Document" ..... (rest of the command)
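
To make the TextInputFormat route concrete, here is a minimal mapper sketch. It assumes the DocID and the RDF payload on each line are separated by whitespace (an assumption based on the sample lines above; adjust the split to your actual delimiter), and it simply emits DocID -> RDF document pairs:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal sketch: with TextInputFormat, each call to map() receives one line.
public class RdfLineMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split on the first run of whitespace: the left part is the DocID,
        // the remainder is the RDF/XML document (assumed layout).
        String[] parts = line.toString().split("\\s+", 2);
        if (parts.length == 2) {
            // Real RDF/XML parsing (e.g. with Apache Jena) would go here;
            // for now, emit one DocID -> document pair per input line.
            context.write(new Text(parts[0]), new Text(parts[1]));
        }
    }
}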

2- Why do you need this? Your goal is to provide one complete line to the mapper, and that is already the job of the InputFormat you are using. If you still need to control the input size, you’ll have to write custom code for it, which is a bit tricky for this particular case (though one standard alternative is sketched below).
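
That said, if you do want to bound how many lines each mapper sees without writing a fully custom InputFormat, NLineInputFormat from the standard Hadoop library is one option; this is a suggestion beyond the original answer, not part of it. The driver below is a minimal sketch under that assumption; the class name RdfJobDriver, the 1000-line cap, and the argument paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RdfJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "rdf-line-job");
        job.setJarByClass(RdfJobDriver.class);
        job.setMapperClass(RdfLineMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // NLineInputFormat hands each mapper a fixed number of lines
        // instead of a byte-sized split.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1000);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}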
