Use MapReduce to read large files in Hadoop
I have code that reads a file from an FTP server and writes it to HDFS. I've implemented a custom InputFormat that sets the isSplitable property of the input to false. But this gives me the following error:
INFO mapred.MapTask: Record too large for in-memory buffer
The code I use to read the data is:
// contents must be large enough to hold the whole (unsplit) file
byte[] contents = new byte[(int) fileSplit.getLength()];
Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try {
    in = fs.open(file);
    // reads the entire file into the in-memory buffer
    IOUtils.readFully(in, contents, 0, contents.length);
    value.set(contents, 0, contents.length);
} finally {
    IOUtils.closeStream(in);
}
Any ideas on how to avoid the Java heap space error without splitting the input file? Or, if I set isSplitable to true, how do I read the file?
Solution
If I understand you correctly, you are loading the entire file into memory. That has nothing to do with Hadoop: you cannot hold a file that large in a single buffer in Java unless the heap is big enough, full stop.
Instead, I suggest defining blocks of some reasonable fixed size and making each block a "record": your RecordReader then hands the file to the mapper chunk by chunk, so no single record has to fit the whole file in memory. The file can still be read by a single mapper with isSplitable set to false, as in the sketch below.
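Here is a minimal sketch of such a chunked reader, written against the old org.apache.hadoop.mapred API to match the mapred.MapTask log line above. The class name ChunkRecordReader and the 4 MB CHUNK_SIZE are my own illustrative choices, not anything from your code; tune the chunk size to whatever your mapper can comfortably process.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class ChunkRecordReader implements RecordReader<LongWritable, BytesWritable> {
    // Illustrative record size: 4 MB per record instead of the whole file.
    private static final int CHUNK_SIZE = 4 * 1024 * 1024;

    private final FSDataInputStream in;
    private final long fileLength;
    private long pos = 0;

    public ChunkRecordReader(JobConf conf, FileSplit split) throws IOException {
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        this.fileLength = fs.getFileStatus(file).getLen();
        this.in = fs.open(file);
    }

    @Override
    public boolean next(LongWritable key, BytesWritable value) throws IOException {
        if (pos >= fileLength) {
            return false; // whole file consumed
        }
        // Read at most one chunk; the last record may be shorter.
        int toRead = (int) Math.min(CHUNK_SIZE, fileLength - pos);
        byte[] buffer = new byte[toRead];
        IOUtils.readFully(in, buffer, 0, toRead);
        key.set(pos);                  // key = byte offset of this chunk
        value.set(buffer, 0, toRead);  // value = the chunk itself
        pos += toRead;
        return true;
    }

    @Override
    public LongWritable createKey() { return new LongWritable(); }

    @Override
    public BytesWritable createValue() { return new BytesWritable(); }

    @Override
    public long getPos() { return pos; }

    @Override
    public float getProgress() {
        return fileLength == 0 ? 1.0f : (float) pos / fileLength;
    }

    @Override
    public void close() throws IOException {
        IOUtils.closeStream(in);
    }
}

Your custom InputFormat would return this reader from getRecordReader() while still overriding isSplitable() to return false, so the whole file is read sequentially by one mapper, just as a stream of manageable records rather than one record that blows the in-memory buffer.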