Python – Spark/Hadoop file not found on AWS EMR

Spark cannot find a file on Amazon EMR even though it exists on disk; here is a solution to the problem.


I’m trying to read a text file on Amazon EMR using the python spark library. The file is in the home directory (/home/hadoop/wet0), but Spark can’t seem to find it.

Problematic line:

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

Error:

pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-19-121.us-west-2.compute.internal:8020/user/hadoop/wet0; '

Does the file have to be in a specific directory? I can’t find any information on the AWS website.

Solution

If the file is in the local file system, the URL should be file:///home/hadoop/wet0 (note the three slashes: the file:// scheme followed by the absolute path)
If it is in HDFS, that should be a valid path. Use the hadoop fs command to take a look

For example: hadoop fs -ls /home/hadoop

One other thing: the question says the file is in “/home/hadoop”, but the path in the error is “/user/hadoop”. That is because HDFS resolves a scheme-less relative path against the user’s HDFS home directory, /user/&lt;username&gt;, not the local /home directory. Also make sure you don’t use ~ on the command line, as bash will expand it before Spark sees it. It is best to use the full path /home/hadoop.
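To make the path resolution concrete, here is a small sketch (plain Python, no Spark needed) with two hypothetical helpers: one builds the file:// URI that tells Spark to read from the local filesystem, and the other mimics how HDFS resolves a scheme-less relative path against the user’s HDFS home directory:

```python
from pathlib import Path


def to_local_uri(path: str) -> str:
    # Hypothetical helper: build the file:// URI Spark needs to read a
    # local file. Note the three slashes: "file://" + an absolute path.
    return Path(path).absolute().as_uri()


def hdfs_resolve(path: str, user: str = "hadoop") -> str:
    # Hypothetical helper: a scheme-less relative path is resolved by
    # HDFS against /user/<username>, not against the local /home tree.
    if path.startswith("/"):
        return path
    return f"/user/{user}/{path}"


print(to_local_uri("/home/hadoop/wet0"))  # file:///home/hadoop/wet0
print(hdfs_resolve("wet0"))               # /user/hadoop/wet0
```

Passing the first form to spark.read.text() reads the file from the EMR node’s local disk; passing a bare name like wet0 produces exactly the /user/hadoop/wet0 path seen in the error.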
