Python – Use a Python script to get a list of files from an HDFS (Hadoop) directory

Use a Python script to get a list of files from an HDFS (Hadoop) directory… here is a solution to the problem.

Use a Python script to get a list of files from an HDFS (Hadoop) directory

How do I get a list of files from an HDFS (Hadoop) directory using a Python script?

I’ve tried the following line:

dir = sc.textFile("hdfs://127.0.0.1:1900/directory").collect()

The directory contains the files "file1, file2, file3, …, fileN". With this line I only get the contents of the files,
but what I need is the list of filenames.

Can anyone help me figure this out?

Thanks in advance.

Solution

Use a child process via the subprocess module

import subprocess

# Run "hdfs dfs -ls" and keep only the 8th column (the file path).
p = subprocess.Popen("hdfs dfs -ls <HDFS Location> | awk '{print $8}'",
                     shell=True,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT,
                     text=True)

for line in p.stdout.readlines():
    print(line.strip())
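
Since the question already has a SparkContext (`sc`) available, another option is to call the Hadoop FileSystem Java API through PySpark's py4j gateway instead of shelling out. This is only a sketch: `sc._jvm` and `sc._jsc` are internal PySpark attributes, and the URI below simply reuses the one from the question.

# Sketch: list filenames via the Hadoop FileSystem API, assuming an
# existing PySpark SparkContext named `sc`.
hadoop = sc._jvm.org.apache.hadoop.fs
conf = sc._jsc.hadoopConfiguration()
path = hadoop.Path("hdfs://127.0.0.1:1900/directory")
fs = path.getFileSystem(conf)
for status in fs.listStatus(path):
    print(status.getPath().getName())   # e.g. file1, file2, ..., fileN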

Edit: a non-Python alternative. The first command below (with -ls -R) also lists all subdirectories recursively. The final redirect to output.txt can be omitted or changed to suit your needs.

hdfs dfs -ls -R <HDFS LOCATION> | awk '{print $8}' > output.txt
hdfs dfs -ls <HDFS LOCATION> | awk '{print $8}' > output.txt

Edit: corrected the missing quotation marks in the awk command.
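
To keep everything in Python rather than relying on the shell redirect, the same recursive listing can be captured with subprocess.check_output. A minimal sketch, reusing the <HDFS LOCATION> placeholder and the output.txt filename from the commands above:

import subprocess

# Capture the recursive listing, keep only the path column, and write it
# to output.txt from Python instead of using a shell redirect.
out = subprocess.check_output(
    "hdfs dfs -ls -R <HDFS LOCATION> | awk '{print $8}'",
    shell=True, text=True)

paths = [line for line in out.splitlines() if line]
with open("output.txt", "w") as f:
    f.write("\n".join(paths))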
