Python – Apache Hive error when using a Python UDF

Apache Hive throws an error when using a Python UDF… here is a solution to the problem.

Apache Hive error when using a Python UDF

I use a Python user-defined function in Apache Hive to convert characters from lowercase to uppercase, and I'm getting the error "Hive Runtime Error while closing operators".

Here’s the query I’ve tried:

describe table1;     
OK
item    string  
count   int 
city    string  

select * from table1;
aaa 1   tokyo
aaa 2   london
bbb 3   washington
ccc 4   moscow
ddd 5   bejing

From the table above, the item and city fields should be converted from lowercase to uppercase, and the count should be increased by 10.

Python script used:

cat caseconvert.py
import sys
import string

for line in sys.stdin:
    line = line.strip()
    item,count,city=line.split('\t')

    ITEM1=item.upper()
    COUNT1=count+10
    CITY1=city.upper()
    print '\t'.join([ITEM1,str(COUNT1),FRUIT1])

Insert the data from table1 into table2:

create table table2(ITEM1 string, COUNT1 int, CITY1 string) row format delimited fields terminated by ',';

add FILE caseconvert.py 

insert overwrite table table2 select TRANSFORM(item,count,city) using 'python caseconvert.py' as (ITEM1,COUNT1,CITY1) from table1;

When I run this, I get the following error, and I can't track down the issue. Can someone tell me what's wrong?

Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201508151858_0014, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201508151858_0014
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201508151858_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-08-15 22:24:06,212 Stage-1 map = 0%,  reduce = 0%
2015-08-15 22:25:01,559 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201508151858_0014 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201508151858_0014
Examining task ID: task_201508151858_0014_m_000002 (and more) from job job_201508151858_0014

Task with the most failures(4): 
-----
Task ID:
  task_201508151858_0014_m_000000

URL:
  http://localhost.localdomain:50030/taskdetails.jsp?jobid=job_201508151858_0014&tipid=task_201508151858_0014_m_000000
-----
Diagnostic Messages for this Task:
java.lang.RuntimeException: Hive Runtime Error while closing operators
    at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:224)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error occurred when trying to close the Operator running your custom script.
    at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:488)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:570)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:5

FAILED: Execution Error, return code 20003 from org.apache.hadoop.hive.ql.exec.MapRedTask. An error occurred when trying to close the Operator running your custom script.
MapReduce Jobs Launched: 
Job 0: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

Solution

In the last line of the Python script, where the output is printed to stdout, you reference FRUIT1 without ever defining it; it should be CITY1. Note also that count arrives from stdin as a string, so it must be cast to int before 10 can be added to it. You also imported the string module but never used it. I would write the script differently:

import sys
import string

while True:
    line = sys.stdin.readline()
    if not line:
        break

    line = string.strip(line, '\n ')
    item,count,city=string.split(line, '\t')

    ITEM1=item.upper()
    COUNT1=int(count)+10
    CITY1=city.upper()
    print '\t'.join([ITEM1,str(COUNT1),CITY1])
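
Before registering the script with Hive, it is worth sanity-checking it locally, since a script failure surfaces in Hive only as the opaque [Error 20003] shown above. One way is to pipe a tab-separated sample row into the script's stdin, exactly as Hive would; a minimal sketch (the file name and sample row are taken from the question, the harness itself is my own):

# test_caseconvert.py -- feed one table1 row to the transform script locally.
# Assumes caseconvert.py sits in the current working directory.
import subprocess

sample = 'aaa\t1\ttokyo\n'   # tab-separated, as Hive streams rows to the script
p = subprocess.Popen(['python', 'caseconvert.py'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = p.communicate(sample)
print out    # expected: AAA	11	TOKYO

Run this way, any exception the script raises is printed directly to the terminal instead of being swallowed by Hive.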

I would then use a CREATE TABLE AS SELECT query (assuming table1 and your Python script are both in HDFS):

create table table2
as select transform(item, count, city)
using 'hdfs:///user/username/caseconvert.py'
as (item1 string, count1 string, city1 string)
from table1;

This worked for me. However, there is an easier way: Hive's built-in functions can perform the desired conversion directly:

upper(string A) – returns the string resulting from converting all characters of A to uppercase. For example, upper('fOoBaR') results in 'FOOBAR'.

And for the count, you can simply use (count + 10) AS count1.

Therefore, you can create table2 as follows:

create table table2
as select
    upper(item) as item1,
    count + 10 as count1,
    upper(city) as city1
from table1;
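
You can then confirm the result from outside an interactive session by running a one-off query through the Hive CLI's -e flag (a sketch; it assumes the hive binary is on PATH, and output formatting varies by version):

# check_table2.py -- run a single verification query through the Hive CLI
import subprocess

out = subprocess.check_output(['hive', '-e', 'select * from table2'])
print out    # expected rows such as: AAA	11	TOKYO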

Much less cumbersome than writing a custom transform script.
