Use Apache Spark to read JSON files
I’m trying to read a JSON file using Spark v2.0.0. Works very well in the case of simple data codes. If the data is a bit complicated, when I print df.show(), the data is not displayed in the correct way.
Here is my code :
SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.show();
Here is my sample data:
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
}
}
My output is this:
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "glossary": {|
| "title": ...|
| "GlossDiv": {|
| "titl...|
| "GlossList": {|
| "...|
| ...|
| "SortAs": "S...|
| "GlossTerm":...|
| "Acronym": "...|
| "Abbrev": "I...|
| "GlossDef": {|
| ...|
| "GlossSeeAl...|
| ...|
| "GlossSee": ...|
| }|
| }|
| }|
+--------------------+
only showing top 20 rows
Solution
If you must read this JSON, you need to format the JSON as a single line. This is a multi-line JSON, so it is not read and loaded correctly (One Object one Row).
Reference JSON API:
Loads a JSON file (one object per line) and returns the result as a
DataFrame.
{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML"," XML"]},"GlossSee":"markup"}}}}}
I
just tried it on the shell and it should work the same way from the code (I get the same broken record error when I read multiple lines of JSON).
scala> val df = spark.read.json("C:/DevelopmentTools/data.json")
df: org.apache.spark.sql.DataFrame = [glossary: struct<GlossDiv: struct<GlossList: struct<GlossEntry: struct<Abbrev: string, Acronym: string ... 5 more fields>>, title: string>, title: string>]
scala>
Edit:
For example, you can use any operation to get values from that data frame
scala> df.select(df("glossary. GlossDiv.GlossList.GlossEntry.GlossTerm")).show()
+--------------------+
| GlossTerm|
+--------------------+
| Standard Generali...|
+--------------------+
scala>
You should also be able to do this from your code