Python – Opening files on HDFS from a Hadoop MapReduce job

Opening files on HDFS from a Hadoop MapReduce job… here is a solution to the problem.

Opening files on HDFS from a Hadoop MapReduce job

Usually, I can open a new file with something like this:

aDict = {}
with open('WordLists/positive_words.txt', 'r') as f:
    aDict['positive'] = {line.strip() for line in f}

with open('WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

This opens two text files in the WordLists folder and stores each file’s lines as a set in the dictionary, under a 'positive' or 'negative' key.

However, now that I want to run this as a Hadoop MapReduce job, I don’t think it will work. I’m running my program like this:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -file hadoop_map.py -mapper hadoop_reduce.py -input /toBeProcessed -output /Completed

I tried to change the code to:

with open('/mapreduce/WordLists/negative_words.txt', 'r')

where mapreduce is a folder on HDFS and WordLists is a subfolder containing the negative word list. But my program can’t find the file. Is what I’m doing feasible, and if so, what is the correct way to load files from HDFS?
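For reference, a plain open() only sees the local filesystem of the node the task runs on, so an HDFS path such as /mapreduce/WordLists/negative_words.txt will not resolve by itself. One way to read an HDFS file from Python without extra libraries is to shell out to the HDFS command-line client; this is only a sketch of that idea, not something from the original post:

import subprocess

def read_hdfs_lines(hdfs_path):
    # Stream a text file out of HDFS via `hadoop fs -cat`. This assumes
    # the `hadoop` binary is on the PATH inside the task.
    cat = subprocess.Popen(['hadoop', 'fs', '-cat', hdfs_path],
                           stdout=subprocess.PIPE)
    for line in cat.stdout:
        yield line.decode('utf-8').rstrip('\n')
    cat.stdout.close()
    cat.wait()

aDict = {}
aDict['negative'] = {line.strip() for line in
                     read_hdfs_lines('/mapreduce/WordLists/negative_words.txt')}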

Edit

I’ve tried:

with open('hdfs://localhost:9000/mapreduce/WordLists/negative_words.txt', 'r')

This seems to do something, but now I get output like this:

13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 50%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%

Then the job fails. So it’s still not right. Any ideas?

Edit 2:

After re-reading the documentation, I noticed that I can use the -files option on the command line to specify files. From the documentation:

The -files option creates a symlink in the current working directory
of the tasks that points to the local copy of the file.

In this example, Hadoop automatically creates a symlink named
testfile.txt in the current working directory of the tasks. This
symlink points to the local copy of testfile.txt.

-files hdfs://host:fs_port/user/testfile.txt
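In other words, the file distributed with -files shows up under its symlink name in the task’s working directory, and the streaming script opens it like any local file. A small sketch of the mapper side for the documentation’s example (the symlink name testfile.txt comes straight from the quote above):

import sys

# testfile.txt is the symlink Hadoop creates in the task's working
# directory, pointing at the local copy of the distributed file.
with open('testfile.txt', 'r') as f:
    words = {line.strip() for line in f}

for line in sys.stdin:
    # normal streaming mapper logic, using `words` as a lookup set
    pass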

So I run:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -files hdfs://localhost:54310/mapreduce/SentimentWordLists/positive_words.txt#positive_words -files hdfs://localhost:54310/mapreduce/SentimentWordLists/negative_words.txt#negative_words -file hadoop_map.py -mapper hadoop_map.py -input /toBeProcessed -output /Completed

Based on my understanding of the API, this creates symbolic links, so I can use “positive_words” and “negative_words” in my code like this:

with open('negative_words.txt', 'r')

However, this still doesn’t work. Any help anyone can provide would be greatly appreciated as there is nothing I can do until I fix this.
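One detail worth checking (my own observation, not from the original post): the text after the # is the name the symlink gets, so with -files …negative_words.txt#negative_words the task sees a file called negative_words, with no .txt extension. A defensive sketch that logs what is actually visible before opening:

import os
import sys

link_name = 'negative_words'  # the name given after '#' in -files

if not os.path.exists(link_name):
    # Anything written to stderr ends up in the task logs, which makes it
    # easy to see what Hadoop really placed in the working directory.
    sys.stderr.write('files in cwd: %s\n' % os.listdir('.'))

with open(link_name, 'r') as f:
    negative_words = {line.strip() for line in f}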

Edit 3:

I can use this command:

-file ~/Twitter/SentimentWordLists/positive_words.txt

along with the rest of my command to run the Hadoop job. This finds the file on my local system rather than on HDFS. It doesn’t throw any errors, so the file is accepted somewhere; however, I don’t know how to access it from my script.
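Since -file copies the local file into the working directory of every task, the script should be able to open it by its base name; a minimal sketch of that, using the file name from the command above:

# positive_words.txt was shipped with -file, so at run time it sits in
# the task's current working directory and can be opened by base name.
with open('positive_words.txt', 'r') as f:
    positive_words = {line.strip() for line in f}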

Solution

Solution after lots of comments 🙂

Read the data file in Python: ship it with -file and add the following to the script:

import sys

Sometimes it is also necessary to add the following after the import:

sys.path.append('.')

(Related to @DrDee’s comment in Hadoop Streaming – Unable to find file error.)
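Putting it together, a minimal mapper along these lines might be shipped with -file together with both word lists (the file names match the ones used above; the per-line sentiment logic is only a placeholder, since the original post does not show it):

#!/usr/bin/env python
import sys

sys.path.append('.')  # from the solution above; helps find modules shipped with -file

def load_words(filename):
    # The word lists were shipped with -file, so they sit next to the
    # script in the task's working directory.
    with open(filename, 'r') as f:
        return {line.strip() for line in f}

aDict = {
    'positive': load_words('positive_words.txt'),
    'negative': load_words('negative_words.txt'),
}

for line in sys.stdin:
    for word in line.strip().split():
        if word in aDict['positive']:
            print('%s\t%s' % (word, 'positive'))
        elif word in aDict['negative']:
            print('%s\t%s' % (word, 'negative'))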
