Python – How do I use avro files as input to MRJob jobs?

I need to use Avro files as input to an mrjob Hadoop job. The only approach I can find is to pass extra arguments to the Hadoop streaming jar; I can't find any other documentation on how to do this. That complicates development, because I've been testing locally with the inline runner.

Can I use the inline runner to read Avro files via mrjob?

Solution

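You need to tell Hadoop which input format the job uses. With Hadoop streaming, AvroAsTextInputFormat converts each Avro record to a JSON string, so the mapper receives plain text: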

# The remaining streaming parameters (-input, -output, -mapper, ...) are omitted here.
hadoop jar hadoop-streaming.jar \
  -inputformat org.apache.avro.mapred.AvroAsTextInputFormat

I'm not sure how you launch your mrjob jobs, though. If you run them through plain Hadoop streaming, the command above will work.
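For local testing, one option is to keep the record-parsing logic independent of Hadoop. The sketch below assumes that AvroAsTextInputFormat delivers each record as a JSON string in the key with an empty value, so a streaming line may carry a trailing tab. The helper names `parse_avro_text_line` and `count_by_field`, and the `user` field, are made up for illustration.

```python
import json


def parse_avro_text_line(line):
    """Decode one input line as produced by AvroAsTextInputFormat.

    Assumption: the JSON record is the key and the value is empty, so the
    line may end with a tab separator before the newline.
    """
    return json.loads(line.rstrip("\t\r\n"))


def count_by_field(lines, field):
    """Toy map+reduce step: count records per value of `field`."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_avro_text_line(line)
        key = record.get(field)
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample: JSON lines shaped like the streaming input.
    sample = [
        '{"user": "a", "n": 1}\t\n',
        '{"user": "b", "n": 2}\t\n',
        '{"user": "a", "n": 3}\t\n',
    ]
    print(count_by_field(sample, "user"))
```

Inside an mrjob job, the same parsing would go in the mapper. As far as I know, mrjob lets a job name its input format through the `HADOOP_INPUT_FORMAT` class attribute, but the inline and local runners ignore it. For inline testing you would therefore feed the job a text file of JSON lines (for example, an Avro file dumped to JSON beforehand) rather than the Avro file itself.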
