Python – Delete files from Hadoop using pyspark (queries)


Delete files from Hadoop using pyspark (queries)

I’m using Hadoop to store my data; some of it is partitioned and some is not.
I save the data in parquet format with the pyspark DataFrame API as follows:

# read the existing data and append it to the target parquet path
df = sql_context.read.parquet('/some_path')
df.write.mode("append").parquet(parquet_path)

I want to write a pyspark script that deletes old data in a similar way (I need to filter and query this old data on the DataFrame first). I didn’t find anything in the pyspark docs….

Is there any way to achieve this?

Solution

Pyspark is primarily a processing engine; the DataFrame API does not offer a way to delete files. Deletions can be handled with Python’s built-in subprocess module, which can invoke the hadoop fs command line:

import subprocess

some_path = ...  # HDFS path to remove
# "-r" is needed when the path is a directory, which a parquet dataset
# written by Spark is; "-f" suppresses the error if the path is missing
subprocess.call(["hadoop", "fs", "-rm", "-r", "-f", some_path])
