How do I read Hadoop files using Apache Beam?
I’m trying to use Apache Beam to read files stored on a Hadoop server (not on-premises). How should I go about it? I read some articles about using Beam’s Hadoop I/O format:
https://beam.apache.org/documentation/io/built-in/hadoop/
I don’t quite understand this part:
Configuration myHadoopConfiguration = new Configuration(false);
THIS --> // Set Hadoop InputFormat, key and value class in configuration <-- THIS
myHadoopConfiguration.setClass("mapreduce.job.inputformat.class",
InputFormatClass,
InputFormat.class);
myHadoopConfiguration.setClass("key.class", InputFormatKeyClass, Object.class);
myHadoopConfiguration.setClass("value.class", InputFormatValueClass, Object.class);
How do I set this format? Do I need to create a class? If I copy and paste this code, it doesn’t work. Thanks
Solution
The standard default InputFormat is TextInputFormat, which extends FileInputFormat<LongWritable, Text>.
Its keys are LongWritable values holding the byte offset of each line in the file (import org.apache.hadoop.io.LongWritable), and its values are Text values holding the individual lines (import org.apache.hadoop.io.Text).
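The code does not work because InputFormatClass, InputFormatKeyClass, and InputFormatValueClass are not actual variables. They are placeholders that you need to replace with real classes, for example TextInputFormat.class, LongWritable.class, and Text.class.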
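As a rough sketch of what filling in those placeholders could look like for plain text files: the snippet below assumes the beam-sdks-java-io-hadoop-format module (which provides HadoopFormatIO) and the Hadoop client libraries are on the classpath. The class name ReadHadoopText and the HDFS path are hypothetical and should be replaced with your own values.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.format.HadoopFormatIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical class name, used only for illustration.
public class ReadHadoopText {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);

    // Replace the documentation's placeholders with concrete classes.
    conf.setClass("mapreduce.job.inputformat.class", TextInputFormat.class, InputFormat.class);
    conf.setClass("key.class", LongWritable.class, Object.class); // byte offset of each line
    conf.setClass("value.class", Text.class, Object.class);       // contents of each line

    // Assumption: placeholder path; point this at your own cluster and directory.
    conf.set("mapreduce.input.fileinputformat.inputdir",
        "hdfs://namenode.example.com:8020/path/to/input");

    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Each element is a KV of (byte offset, line of text).
    PCollection<KV<LongWritable, Text>> lines =
        pipeline.apply("ReadFromHadoop",
            HadoopFormatIO.<LongWritable, Text>read().withConfiguration(conf));

    pipeline.run().waitUntilFinish();
  }
}
```

So you do not need to write your own class for plain text input; you only need one if none of the existing InputFormat implementations fits your file format. Because the Configuration is created with new Configuration(false), no default Hadoop settings are loaded, so any cluster-specific properties your setup requires would have to be set explicitly.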