Java – How to read HDFS sequence files in Spark

How to read HDFS sequence files in Spark… here is a solution to the problem.

How to read HDFS sequence files in Spark

I’m trying to read a file from HDFS (in this case, S3) into Spark as an RDD. The file is in Hadoop’s SequenceFile format (SequenceFileInputFormat), but I can’t decode the contents of the file into strings. I have the following code:

package com.spark.example.ExampleSpark;

import java.util.List;
import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class RawEventDump 
{
    public static void main( String[] args )
    {

        SparkConf conf = new SparkConf().setAppName("atlas_raw_events").setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        JavaPairRDD<String, Byte> file = jsc.sequenceFile("s3n://key_id:secret_key@<file>", String.class, Byte.class);
        List<String> values = file.map(
            new Function<Tuple2<String, Byte>, String>() {
            public String call(Tuple2 row) {
                return "Value: " + row._2.toString() + "\n";
            }
        }).collect();
        System.out.println(values);
    }
}

But I get the following output:

Value: 7b 22 65 76 65 6e ...
, Value: 7b 22 65 76 65 6e 74 22 3a ...
, Value: 7b 22 65 76 65 6...
...

How do I read file contents in Spark?

Solution

Sequence files usually store Hadoop writable types such as Text, BytesWritable, LongWritable, etc., so the RDD type should be JavaPairRDD<LongWritable, BytesWritable>, read with jsc.sequenceFile(path, LongWritable.class, BytesWritable.class).

Then, to convert each value to a String, call org.apache.hadoop.io.Text.decode(row._2.getBytes()). Note that BytesWritable.getBytes() returns the full backing buffer, which may be longer than the valid data, so the safer form is Text.decode(row._2.getBytes(), 0, row._2.getLength()).
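The hex values printed in the question are simply the raw UTF-8 bytes of JSON text, and Text.decode performs a UTF-8 decode. As a quick illustration (plain Java, no Spark or Hadoop dependencies; the byte values are taken from the question’s output):

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) {
        // The first bytes shown in the question's output: 7b 22 65 76 65 6e 74 22 3a
        byte[] raw = {0x7b, 0x22, 0x65, 0x76, 0x65, 0x6e, 0x74, 0x22, 0x3a};

        // org.apache.hadoop.io.Text.decode(byte[]) does a UTF-8 decode,
        // equivalent to the following for well-formed input:
        String decoded = new String(raw, StandardCharsets.UTF_8);

        System.out.println(decoded); // prints {"event":
    }
}
```

So the values in the file are JSON records; once the pair RDD is typed as <LongWritable, BytesWritable> and the bytes are decoded as above, the map function will return readable strings instead of hex.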
