Use a Python script to get a list of files from the hdfs (Hadoop) directory
How do I get a list of files from an HDFS (Hadoop) directory using a Python script?
I’ve tried the following line:
dir = sc.textFile("hdfs://127.0.0.1:1900/directory").collect()
The directory contains the files "file1, file2, file3, …. fileN". With this line I only get the contents of all the files.
But what I need is the list of filenames.
Can anyone help me figure this out?
Thanks in advance.
Solution
Use a child process:

import subprocess

p = subprocess.Popen("hdfs dfs -ls <HDFS Location> | awk '{print $8}'",
                     shell=True,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
for line in p.stdout.readlines():
    print(line)
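If you prefer to avoid the shell pipeline, the awk '{print $8}' step can also be done in Python. The sketch below is one way to do that, assuming the hdfs CLI is on the PATH and that the path is the last whitespace-separated field of each hdfs dfs -ls output line; list_hdfs_files and parse_ls_output are hypothetical helper names, not part of any library.

```python
import subprocess

def parse_ls_output(output):
    """Extract the path (last field) from each `hdfs dfs -ls` output line,
    skipping the 'Found N items' summary line."""
    paths = []
    for line in output.splitlines():
        fields = line.split()
        # A regular listing line has 8 fields:
        # permissions, replication, owner, group, size, date, time, path
        if len(fields) >= 8:
            paths.append(fields[-1])
    return paths

def list_hdfs_files(hdfs_path):
    """Run `hdfs dfs -ls` and return the listed paths (requires the hdfs CLI)."""
    result = subprocess.run(["hdfs", "dfs", "-ls", hdfs_path],
                            capture_output=True, text=True, check=True)
    return parse_ls_output(result.stdout)
```

Splitting the parsing out into its own function also makes it easy to test without a running cluster.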
EDIT: a non-Python alternative. The first command below also prints all subdirectories recursively. The final redirect can be omitted or changed to suit your requirements.
hdfs dfs -ls -R <HDFS LOCATION> | awk '{print $8}' > output.txt
hdfs dfs -ls <HDFS LOCATION> | awk '{print $8}' > output.txt
Edit: corrected the missing quotation marks in the awk command.