Python – How to copy a Parquet file and convert it to CSV

How to copy a Parquet file and convert it to CSV

I have access to the HDFS file system and can view the Parquet files:

hadoop fs -ls /user/foo

How do I copy these Parquet files to my local system and convert them to CSV so I can use them? They should be simple text files containing multiple fields per line.

Solution

Try:

# Read the Parquet data from HDFS into a DataFrame
df = spark.read.parquet("/path/to/infile.parquet")
# Write the same data back to HDFS in CSV format
df.write.csv("/path/to/outfile.csv")
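
Note that df.write.csv does not create a single file: it creates a directory named /path/to/outfile.csv containing one part file per partition. If you want a single part file with a header row, a minimal variation (using the standard DataFrame coalesce and DataFrameWriter option calls) is:

# Collapse the DataFrame to one partition so the output directory
# contains a single part file, and write the column names as a header row
df.coalesce(1).write.option("header", True).csv("/path/to/outfile.csv")

Bear in mind that coalesce(1) funnels all the data through a single task, so it is only sensible for small result sets.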

Related API documentation:

pyspark.sql.DataFrameReader.parquet
pyspark.sql.DataFrameWriter.csv

/path/to/infile.parquet and /path/to/outfile.csv should be locations on the HDFS file system. You can specify them explicitly with an hdfs://... prefix, or omit it, since HDFS is usually the default filesystem.
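
For example, the same read with a fully qualified URI looks like this (the namenode host and port are hypothetical placeholders):

# Equivalent read with an explicit HDFS scheme; replace "namenode:8020"
# with your cluster's actual namenode address
df = spark.read.parquet("hdfs://namenode:8020/path/to/infile.parquet")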

You should avoid file://... paths, because a local path refers to a different filesystem on each machine in the cluster: the driver and every executor would read and write their own local disk. Write the output to HDFS instead, then transfer the result to your local disk from the command line:

hdfs dfs -get /path/to/outfile.csv /path/to/localfile.csv
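
Since the CSV output is a directory of part files, hdfs dfs -getmerge is often more convenient: it concatenates every part file in the directory into a single local file:

hdfs dfs -getmerge /path/to/outfile.csv /path/to/localfile.csv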

Or display it directly from HDFS (the output path is a directory, so cat the part files inside it):

hdfs dfs -cat /path/to/outfile.csv/part-*
