Java reads the Parquet file to JSON output
I can read the Parquet file, but the records are printed in an indented text format instead of the JSON output I need. Any ideas? I suspect I need to change GroupRecordConverter, but I couldn't find much documentation on it; a pointer to relevant documentation would also help. Thank you very much for your help.
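import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;
// Fragment: conf (Configuration), path (Path) and numLines (long) are defined elsewhere.
long num = numLines;
try {
    ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = readFooter.getFileMetaData().getSchema();
    try (ParquetFileReader r = new ParquetFileReader(conf, path, readFooter)) {
        PageReadStore pages;
        while ((pages = r.readNextRowGroup()) != null) {
            final long rows = pages.getRowCount();
            System.out.println("Number of rows: " + rows);
            final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
            final RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
            // Print at most numLines records in total; Group.toString() gives the indented format, not JSON.
            for (int i = 0; i < rows && num-- > 0; i++) {
                System.out.println(recordReader.read().toString());
            }
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}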
Current indented output:
data1: value1
data2: value2
models
  map
    key: data3
    value
      array: value3
  map
    key: data4
    value
      array: value4
data5: value5
...
Required JSON output:
"data1": "value1",
"data2": "value2",
"models": {
  "data3": [
    "value3"
  ],
  "data4": [
    "value4"
  ]
},
"data5": "value5"
...
Solution
The source code of the cat command in parquet-tools, the command-line tools for the Java Parquet library, may serve as an example.
It contains this line:
org.apache.parquet.tools.json.JsonRecordFormatter.JsonGroupFormatter formatter = JsonRecordFormatter.fromSchema(metadata.getFileMetaData().getSchema());
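See the full source code of that command (CatCommand in parquet-tools) for the rest: it reads each record as a SimpleRecord through a ParquetReader built with SimpleReadSupport and prints the result of formatter.formatRecord(record).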
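If depending on parquet-tools is not practical, the required shape can also be produced by hand. The following is only a minimal sketch of that idea, not what parquet-tools does internally. It assumes each record has already been copied into plain Java maps and lists, with the map/key/value/array wrapper groups collapsed. The class name JsonSketch and its methods are hypothetical, and no Parquet classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: plain collections stand in for Parquet records here.
public class JsonSketch {

    // Serialize nested maps, lists and scalars as compact JSON text.
    // Non-finite doubles (NaN, Infinity) are not handled in this sketch.
    public static String toJson(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Map) {
            List<String> parts = new ArrayList<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                parts.add(quote(String.valueOf(e.getKey())) + ":" + toJson(e.getValue()));
            }
            return "{" + String.join(",", parts) + "}";
        }
        if (value instanceof List) {
            List<String> parts = new ArrayList<>();
            for (Object item : (List<?>) value) {
                parts.add(toJson(item));
            }
            return "[" + String.join(",", parts) + "]";
        }
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        return quote(value.toString());
    }

    // Escape the characters that JSON does not allow raw inside a string.
    public static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the field order of the original record.
        Map<String, Object> models = new LinkedHashMap<>();
        models.put("data3", Arrays.asList("value3"));
        models.put("data4", Arrays.asList("value4"));

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("data1", "value1");
        record.put("data2", "value2");
        record.put("models", models);
        record.put("data5", "value5");

        System.out.println(toJson(record));
    }
}
```

Running main prints the sample record from the question as a single JSON object on one line. Copying a Group into such maps is left out here; the JsonRecordFormatter approach above avoids that step entirely.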