Python – How do I specify a logical type when writing a Parquet file from PyArrow?


I'm writing a Parquet file from a Pandas DataFrame in Python using PyArrow.

Is there a way to specify the logical type written to the parquet file?

For example, writing an np.uint32 column with PyArrow produces an INT64 column in the parquet file, whereas writing the same column with the fastparquet module produces an INT32 column with logical type UINT_32 (which is the behavior I'd like to get from PyArrow).

For example:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import fastparquet as fp
import numpy as np

df = pd.DataFrame.from_records(data=[(1, 'foo'), (2, 'bar')], columns=['id', 'name'])
df['id'] = df['id'].astype(np.uint32)

# write parquet file using PyArrow
pq.write_table(pa.Table.from_pandas(df, preserve_index=False), 'pyarrow.parquet')

# write parquet file using fastparquet
fp.write('fastparquet.parquet', df)

# print schemas of both written files
print('PyArrow:', pq.ParquetFile('pyarrow.parquet').schema)
print('fastparquet:', pq.ParquetFile('fastparquet.parquet').schema)

This outputs:

PyArrow: <pyarrow._parquet.ParquetSchema object at 0x10ecf9048>
id: INT64
name: BYTE_ARRAY UTF8

fastparquet: <pyarrow._parquet.ParquetSchema object at 0x10f322848>
id: INT32 UINT_32
name: BYTE_ARRAY UTF8

I’m having similar issues with other column types, so I’m really looking for a generic way to specify the logical type to use when writing with PyArrow.

Solution

PyArrow writes Parquet format version 1.0 files by default, and version 2.0 is required to use the UINT_32 logical type.

The workaround is to specify the format version when writing the table, i.e.

pq.write_table(pa.Table.from_pandas(df, preserve_index=False), 'pyarrow.parquet', version='2.0')

This causes the expected Parquet schema to be written.
