Delete files from Hadoop using pyspark (queries)
I'm using Hadoop to store my data. For some of the data I use partitions, and for some I don't. I save the data in Parquet format using the pyspark DataFrame class as follows:
df = sql_context.read.parquet('/some_path')
df.write.mode("append").parquet(parquet_path)
I want to write a pyspark script that deletes old data in a similar way (I need to filter and query this old data on the DataFrame). I didn't find anything in the pyspark docs.
Is there any way to achieve this?
Solution
Pyspark is primarily a processing engine; it does not provide an API of its own for deleting files from HDFS. Deletions can be handled with the subprocess module from Python's standard library, which can invoke the Hadoop CLI:
import subprocess

some_path = ...  # path to delete (left unspecified in the original)

# Shell out to the Hadoop CLI. -f suppresses the error if the path is missing;
# for a directory (e.g. a Parquet output folder), add -r to delete recursively.
subprocess.call(["hadoop", "fs", "-rm", "-f", some_path])
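
For the partitioned case from the question, the same idea can be driven by a filter over the partition values. Below is a minimal sketch, assuming the data is partitioned by a date column so that each partition lives under a directory like /some_path/date=2021-01-01; the base path, the partition column name, and the 30-day retention window are all illustrative, not from the original question.

import subprocess
from datetime import date, timedelta

# Hypothetical layout: partitions written as <base_path>/date=YYYY-MM-DD.
base_path = "/some_path"
retention_days = 30
cutoff = date.today() - timedelta(days=retention_days)

# List the partition directories with the Hadoop CLI (-C prints paths only).
listing = subprocess.run(
    ["hadoop", "fs", "-ls", "-C", base_path],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

for path in listing:
    if "date=" not in path:
        continue  # skip anything that isn't a date partition
    partition_date = date.fromisoformat(path.rsplit("date=", 1)[1])
    if partition_date < cutoff:
        # -r removes the directory recursively; -f ignores missing paths.
        subprocess.call(["hadoop", "fs", "-rm", "-r", "-f", path])

Deleting whole partition directories this way avoids having to read and rewrite the data you want to keep, which is why a partitioned layout is convenient for this kind of retention cleanup.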