How to copy a Parquet file and convert it to csv
I have access to the HDFS file system and can view Parquet files with:
hadoop fs -ls /user/foo
How do I copy these Parquet files to my local system and convert them to CSV so I can use them? They should be simple text files containing multiple fields per line.
Solution
Try
# Read the Parquet file from HDFS
df = spark.read.parquet("/path/to/infile.parquet")
# Write it out as CSV
df.write.csv("/path/to/outfile.csv")
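Note that df.write.csv produces a directory of part files on HDFS rather than a single CSV file. If you want a header row and a single output file, here is a minimal sketch, assuming the data is small enough to pass through one task:
df = spark.read.parquet("/path/to/infile.parquet")
# coalesce(1) funnels all rows into one partition, so Spark writes a single part file
df.coalesce(1).write.option("header", "true").csv("/path/to/outfile.csv")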
Related API documentation: DataFrameReader.parquet and DataFrameWriter.csv.
/path/to/infile.parquet and /path/to/outfile.csv should be locations on the HDFS file system. You can specify them explicitly with the hdfs://... scheme, or omit it, since HDFS is usually the default file system.
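For example, with an explicit scheme (the NameNode host and port below are placeholders; substitute your cluster's values):
df = spark.read.parquet("hdfs://namenode:8020/path/to/infile.parquet")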
You should avoid file://... paths, because a local path refers to a different file on each machine in the cluster. Write the output to HDFS instead, and then transfer the result to your local disk from the command line:
hdfs dfs -get /path/to/outfile.csv /path/to/localfile.csv
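Because the CSV output is a directory of part files, hdfs dfs -getmerge is often more convenient; it concatenates the parts into one local file:
hdfs dfs -getmerge /path/to/outfile.csv /path/to/localfile.csv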
Or display directly from HDFS:
hdfs dfs -cat /path/to/outfile.csv
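Since the output path is actually a directory, you may need a glob to display the part files:
hdfs dfs -cat /path/to/outfile.csv/part-*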