Python – can pyarrow write multiple Parquet files to a folder like the file_scheme='hive' option in fastparquet?

I have a SQL table with millions of records that I intend to write to many Parquet files in a folder using the pyarrow library. The data appears to be too large to store in a single Parquet file.

However, I can't seem to find an API or parameter in the pyarrow library that allows me to specify something like:

file_scheme="hive"

which is supported by the fastparquet Python library.
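
For reference, a minimal sketch of how that option is used in fastparquet, assuming a pandas dataframe df like the one built in the sample code below:

from fastparquet import write

# With file_scheme='hive', fastparquet writes each row group as a
# separate part file (part.0.parquet, part.1.parquet, ...) plus a
# _metadata file under the folder, instead of one single Parquet file.
write('./clients/', df, file_scheme='hive')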

Here is my sample code:

#!/usr/bin/python

import pyodbc
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

conn_str = ('UID=username; PWD=passwordHere; '
            'DRIVER=FreeTDS; SERVERNAME=myConfig;DATABASE=myDB')

#----> Query the SQL database into a Pandas dataframe
conn = pyodbc.connect( conn_str, autocommit=False)
sql = "SELECT * FROM ClientAccount (NOLOCK)"
df = pd.io.sql.read_sql(sql, conn)

#----> Convert the dataframe to a pyarrow table and write it out
table = pa.Table.from_pandas(df)
pq.write_table(table, './clients/')

This throws an error:

File "/usr/local/lib/python2.7/dist-packages/pyarrow/parquet.py", line 912, in write_table
    os.remove(where)
OSError: [Errno 21] Is a directory: './clients/'

If I replace the last line with the following, it works fine but writes only a single large file:

pq.write_table(table, './clients.parquet')

Any ideas on how to use pyarrow for multi-file output?

Solution

Try pyarrow.parquet.write_to_dataset (https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L938).
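
A minimal sketch of that approach, applied to the question's table (the partition_cols argument is optional, and the account_type column name here is hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)

# write_to_dataset writes one or more Parquet part files under root_path.
# The optional partition_cols argument splits rows into Hive-style
# key=value subdirectories, e.g. ./clients/account_type=retail/
pq.write_to_dataset(table, root_path='./clients/',
                    partition_cols=['account_type'])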

I opened https://issues.apache.org/jira/browse/ARROW-1858 about adding more documentation on this.

I recommend seeking support for Apache Arrow on the dev@arrow.apache.org mailing list. Thanks!
