Java – Hadoop writes a new file from the mapper

I’m trying to write a program that takes a huge dataset and then runs some queries on it using MapReduce. I have code like this:

public static class MRMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private long max = 0;

    private String output2 = "hdfs://master:9000/user/xxxx/indexln.txt";
    private BufferedWriter out;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // FileSystem.get() throws a checked IOException, so the index file has to
        // be opened here (or inside a try/catch) rather than in a field initializer.
        FileSystem Phdfs = FileSystem.get(context.getConfiguration());
        Path fname1 = new Path(output2);
        out = new BufferedWriter(new OutputStreamWriter(Phdfs.create(fname1, true)));
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Log the line's byte offset to the index file as a fixed-width binary string.
        String binln = Long.toBinaryString(0x8000000000000000L | key.get()).substring(1);
        out.write(binln + "\n");
        out.flush();

        String line = value.toString();
        String[] ST = line.split(",");
        long val = Math.abs(Long.parseLong(ST[2]));
        if (max < val) {
            max = val;
        } else {
            word.set(line);
            // The declared output value type is IntWritable, so the long is cast down.
            context.write(word, new IntWritable((int) val));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}

What I want to do is build an index file in the mapper. The mapper would then use that index to access a specific region of the input file: it reads the portion of the input indicated by the index and emits the portion read and the number of lines read. I’m using one mapper with 9 reducers.

  • My question is: is it possible to create and write to a file other than the output file inside the map function, and can the reducers read the file opened in the mapper? If so, am I on the right track, or is this completely wrong, or is MapReduce simply not the right tool for this problem? I apologize if this sounds like a beginner question, but I’m new to Hadoop. Thanks

Solution

Are you sure you are using a single mapper? The number of map tasks Hadoop creates is essentially the number of input splits, so it is not something you choose directly.
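
For example, the split size, and with it the number of map tasks, can be influenced from the driver. The following is only a sketch using the new (mapreduce) API; the class name, the argument paths and the 128 MB figure are placeholders, not anything from your job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "index-job");
        job.setJarByClass(IndexJobDriver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Cap each split at 128 MB: a 1 GB input file then yields roughly
        // eight splits, and Hadoop launches one map task per split.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}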

The concept of input splits is also very important here: it means that very large data files are divided into several chunks, each of which is assigned to its own mapper. So unless you are completely sure that only one mapper is being used, you cannot control which part of the file a given mapper processes, and you cannot maintain any kind of global index either.
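
As a concrete illustration of why this matters: the snippet above opens one fixed HDFS path (indexln.txt) from the mapper, so if more than one map task runs, they will all try to create the same file. If each map task should write its own side file, a common workaround is to derive the file name from the task attempt ID. This is only a sketch of that idea with an assumed base directory, not your original code:

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideFileMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private BufferedWriter indexOut;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // One index file per task attempt, so parallel mappers never collide
        // on a single HDFS path. The base directory is an assumption.
        Path indexFile = new Path("/user/xxxx/index/" + context.getTaskAttemptID());
        FileSystem fs = FileSystem.get(context.getConfiguration());
        indexOut = new BufferedWriter(new OutputStreamWriter(fs.create(indexFile, true)));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Record the byte offset of every line this particular task sees.
        indexOut.write(key.get() + "\n");
        // Normal map output still goes through the framework as usual.
        context.write(value, new IntWritable(1));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        indexOut.close();
    }
}

Note that reduce() is only invoked after every map task has finished, so files closed in the mapper’s cleanup() should be readable from a reducer’s setup(), but this kind of side channel sits outside the normal MapReduce data flow and is easy to get wrong.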

That said, running a MapReduce job with a single mapper is pretty much the same as not using MapReduce at all :) Perhaps I’m mistaken, but I’m assuming you only have one file to analyze. Is that the case?

If you have several large data files, things change and it may make sense to have one mapper per file, but for that you must write your own InputFormat (typically by extending FileInputFormat) and override its isSplitable method so that it always returns false, as sketched below.
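
A minimal sketch of such an input format, assuming plain text input (the class name is just an example):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Every input file becomes exactly one split, so exactly one mapper
// processes each file from start to end.
public class WholeFileTextInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

In the driver you would then plug it in with job.setInputFormatClass(WholeFileTextInputFormat.class).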
