Java: How to parse multi-line JSON into an Apache Spark Dataset in Java

Is there a way to parse multi-line JSON files using the Dataset API?
Here is the sample code:

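import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// The class name is arbitrary; it only wraps main so the snippet compiles.
public class MultiLineJsonExample {
    public static void main(String[] args) {
        // Create the Spark session
        SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example")
                .config("spark.some.config.option", "some-value").getOrCreate();

        // Read the JSON file into a Dataset and print its contents
        Dataset<Row> df = spark.read().json("D:/sparktestio/input.json");
        df.show();
    }
}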

If the JSON is on a single line, this works fine, but I need it to work when a JSON object spans multiple lines.

My JSON file:

{
  "name": "superman",
  "age": "unknown",
  "height": "6.2",
  "weight": "flexible"
}

Solution

The Apache Spark documentation makes this clear:

For regular multiline JSON files, set the multiLine option to true.

So the solution is:

Dataset<Row> df = spark.read().option("multiLine", true).json("file:/a/b/c.json");
df.show();

I tried this with JSON in the same format (a JSON object that spans multiple lines). After adding that option, the output no longer contains the _corrupt_record column that Spark shows when it cannot parse a record.
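
The question notes that the same JSON loads correctly when it sits on a single line, so where the reader option cannot be changed, another route is to collapse the file to one line before reading it. The following is a minimal sketch of that idea, assuming the file holds exactly one valid JSON document. The class and method names are hypothetical, and the code uses only the Java standard library (Java 11 or later), not any Spark API.

```java
import java.util.stream.Collectors;

// Hypothetical helper, not part of Spark: rewrites one multi-line JSON
// document so that it occupies a single line.
public class JsonLineCollapser {

    // Assumes the input is valid JSON. Raw line breaks cannot occur inside
    // JSON strings, so every line break (and the indentation around it) lies
    // between tokens and can be replaced by a single space safely.
    static String collapseToSingleLine(String multiLineJson) {
        return multiLineJson.lines()
                .map(String::strip)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The sample document from the question
        String json = String.join("\n",
                "{",
                "  \"name\": \"superman\",",
                "  \"age\": \"unknown\",",
                "  \"height\": \"6.2\",",
                "  \"weight\": \"flexible\"",
                "}");
        System.out.println(collapseToSingleLine(json));
    }
}
```

The collapsed text could then be written to a file and read with the plain spark.read().json(...) call from the question. When the option is available, setting multiLine as shown above is the simpler choice, since it needs no preprocessing step.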
