Python – Processing multiple files in HDFS through Python

Processing multiple files in HDFS through Python

I have a directory in HDFS that contains about 10,000 .xml files. I have a Python script “processxml.py” that takes a file and does some work on it. Is it possible to run the script on all the files in the HDFS directory, or do I need to copy them locally before I can do so?

For example, when I run the script on the files in my local directory, I have:

cd /path/to/files

for file in *.xml
do
  python /path/processxml.py "$file" > /path2/"$file"
done

So basically, how would I do the same thing, but this time with the files in HDFS?

Solution

You basically have two options:

1) Create a MapReduce job with Hadoop Streaming (you only need the map part here). Use this command from the shell or inside a shell script:

hadoop jar <path to hadoop-streaming.jar> \
        -D mapred.job.name=<a name for the job> \
        -input /hdfs/input/dir \
        -output /hdfs/output/dir \
        -file your_script.py \
        -mapper "python your_script.py" \
        -numReduceTasks 0
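
A note on how the mapper is fed: with Hadoop Streaming, your script is not given a file path as an argument; it receives records (by default, lines of the input files) on standard input and must write its results to standard output. If processxml.py currently expects a filename, a thin wrapper along the following lines can adapt it. This is only a sketch; process_record is a hypothetical placeholder for whatever work processxml.py actually does:

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch.
# It reads input records from stdin and writes results to stdout,
# which is the contract the -mapper option above relies on.
import sys

def process_record(line):
    # Hypothetical placeholder: put the per-record logic from
    # processxml.py here.
    return line.strip()

def main():
    for line in sys.stdin:
        result = process_record(line)
        if result:
            sys.stdout.write(result + "\n")

if __name__ == "__main__":
    main()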

2) Create a Pig script and ship your Python code with it. This is a basic example of such a script:

input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` SHIP('/path/to/your_script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;
STORE updated_data INTO '/hdfs/output/dir';
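
Pig's STREAM operator uses the same stdin/stdout convention as Hadoop Streaming, so the wrapper sketched under option 1 applies here as well. Save the script to a file (any name will do, e.g. process_xml.pig) and submit it with the pig command-line client; the results end up in the directory named in the STORE statement.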
