Java reads the Parquet file to JSON output

Java reads the Parquet file to JSON output … here is a solution to the problem.

Java reads the Parquet file to JSON output

The Parquet file is being read, but you get the indented format instead of the desired JSON output format. Any ideas? I was thinking I might need to change GroupRecordConverter but couldn’t find much documentation. It would also be helpful if this could be pointed out. Thank you very much for your help.

long num = numLines;
try {
  ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
  MessageType schema = readFooter.getFileMetaData().getSchema();
  ParquetFileReader r = new ParquetFileReader(conf,path,readFooter);

PageReadStore pages = null;
  try{
    while(null != (pages = r.readNextRowGroup())) {
      final long rows = pages.getRowCount();
      System.out.println("Number of rows: " + rows);

final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
      final RecordReader recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
      String sTemp = "";
      for(int i=0; i<rows && num-->0; i++) {
        System.out.println(recordReader.read().toString())
      }
    }
  }
}

Current indent output:

data1: value1
data2: value2
models
  map
    key: data3
    value
      array: value3
  map
    key: data4
    value
      array: value4
data5: value5
...

Required JSON output:

"data1": "value1",
"data2": "value2",
"models": {
    "data3": [
        "value3"
    ],
    "data4": [
        "value4"
    ]
},
"data5": "value5"
...

Solution

The cat command tool code for java parquet lib might serve as an example for you…
Contains rows:

org.apache.parquet.tools.json.JsonRecordFormatter.JsonGroupFormatter formatter = JsonRecordFormatter.fromSchema(metadata.getFileMetaData().getSchema());

See also Get the full source code here.

Related Problems and Solutions