Python – Can I use the mrjob python library on the hive table of partitions?

I have user access to a Hadoop cluster (CDH5) where the data is stored only in partitioned Hive tables backed by Avro files. I wonder whether I can use Python's mrjob library to run MapReduce over these tables. So far I have been testing mrjob locally against plain text files, and I have been impressed with how easy development is.

After some research, I found that there is a library called HCatalog, but as far as I can tell it only has a Java API, not a Python one. Unfortunately, I don't have much time to learn Java, and I want to stick with Python.

Do you know how to run mrjob on data stored in Hive?

If this is not possible, is there a way to stream MapReduce code written in Python into Hive? (I'd rather not upload the MapReduce Python files to Hive.)

Solution

As Alex said, mrjob does not currently support Avro-format files. However, there is a way to execute Python code directly against a Hive table, with no mrjob required (although, unfortunately, mrjob's convenience is lost). In the end I managed to do this by running "ADD FILE mapper.py" to register the Python script as a Hive resource, and then using a TRANSFORM ... USING ... clause in a SELECT statement, storing the mapper's results in a separate table. Example Hive query:

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;
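The mapper itself is just a script that reads tab-separated rows on stdin and writes transformed rows to stdout. A minimal sketch of what weekday_mapper.py might look like, with the column layout assumed from the query above (this mirrors the classic transform example from the Hive documentation):

```python
#!/usr/bin/env python
# weekday_mapper.py -- streaming mapper invoked by the Hive TRANSFORM query.
import sys
import datetime


def transform(line):
    """Replace the unixtime column of one tab-separated row with the
    ISO weekday of that timestamp (1 = Monday .. 7 = Sunday)."""
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])


if __name__ == '__main__':
    # Hive streams input rows on stdin and reads transformed rows
    # back from stdout, one row per line.
    for line in sys.stdin:
        print(transform(line))
```

Note that the script runs under whatever Python interpreter 'python' resolves to on the cluster nodes, so it should avoid non-standard dependencies.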

The full example is available here (bottom): link
