Java – Map Reduce inverted list in Hadoop


I’m trying to modify this code to generate a full inverted index. I mean, record the position of every word in every file it occurs in. For example, if we have two files containing these words:

  abc.txt = I am coming to the park to play, yes i am.

  def.txt = Please come on over, i will be waiting for you

I should get output like this:

i /home/abc.txt: 1 10 /home/def.txt: 5

This means that the word i is the 1st and 10th word in the file abc.txt and the 5th word in the file def.txt.

I modified the code to provide “word position and word frequency” as follows:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCountByFile extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        String[] argsLocal = {
            "input#2", "output#2"
        };
        int res = ToolRunner.run(new WordCountByFile(), argsLocal);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        Configuration conf = getConf();
        Job job = new Job(conf, this.getClass().toString());

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setJobName("WordCountByFile");
        job.setJarByClass(WordCountByFile.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // The key is "word filePath : ", so occurrences are
                // counted per word per file.
                String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();
                word.set(tokenizer.nextToken() + " " + filePathString + " : ");
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the per-file occurrences of each word.
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

I know it has to use some kind of index, like in plain Java, but I’m trying to figure out how to do that in Hadoop MapReduce. Any helpful ideas?

Solution

Some thoughts on your question.

Input format:

TextInputFormat uses each line of the input file as an input record, so word positions would restart on every line. You should instead use an input format that provides access to the entire file as one input record. You can use a WholeFileRecordReader, for example.
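To see why the input format matters, here is a minimal plain-Java illustration (not Hadoop code; the class and method names are mine) comparing per-line tokenization, which is what each map() call sees under TextInputFormat, with tokenizing the whole file as one record:

```java
import java.util.StringTokenizer;

// Illustration of why line-based records lose the global word position:
// under TextInputFormat each map() call sees one line, so positions
// restart at 1 on every line unless the whole file is one record.
public class LineVsWholeFile {
    static String fileContent = "I am coming to the park\nto play, yes i am.";

    // Position of the last token when counting per line (restarts each line).
    public static int lastPositionPerLine() {
        int last = 0;
        for (String line : fileContent.split("\n")) {
            int pos = 0;
            StringTokenizer t = new StringTokenizer(line);
            while (t.hasMoreTokens()) { t.nextToken(); pos++; }
            last = pos;
        }
        return last;
    }

    // Position of the last token when the whole file is one record.
    public static int lastPositionWholeFile() {
        int pos = 0;
        StringTokenizer t = new StringTokenizer(fileContent);
        while (t.hasMoreTokens()) { t.nextToken(); pos++; }
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(lastPositionPerLine());   // 5  -- per-line index, wrong globally
        System.out.println(lastPositionWholeFile()); // 11 -- correct global index
    }
}
```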

Mapper:

The Mapper should emit information about each word in the input record. The output key is the word, and the output value is any structure that contains the input file path and the position of the current word in that file. You can write your own Writable class, or pack this information into a string and emit a Text value as you do now.
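As a hedged sketch, the per-word logic that Mapper would run could look like the plain Java below (written outside Hadoop so it can be tried directly; the class name, method name, and "filePath position" value format are my own assumptions, not from the post):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Sketch of the Mapper's per-record logic: for every token, emit a
// (word, "filePath position") pair, where position is the 1-based
// index of the word within the whole file record.
public class PositionEmitter {
    public static List<String[]> emitWordPositions(String filePath, String content) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(content);
        int position = 1;
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            pairs.add(new String[] { word, filePath + " " + position });
            position++;
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] p : emitWordPositions("/home/abc.txt",
                "I am coming to the park to play, yes i am.")) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

In the real Mapper you would call context.write() for each pair instead of collecting them in a list; note that StringTokenizer keeps punctuation attached to tokens ("play," and "am."), so you may want to strip it before emitting.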

Reducer:

The Reducer should merge the information for each word: loop through all the values passed to the reducer for a key and build the resulting string in the format you described.
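The merge step can be sketched in plain Java as follows (a hypothetical helper, assuming the Mapper emitted values in the "filePath position" format sketched above; the class and method names are mine):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the Reducer's merge logic for one word: group the
// "filePath position" values by file and format the result as
// "file1: p1 p2 file2: p3", matching the output the question asks for.
public class PositionMerger {
    public static String merge(Iterable<String> values) {
        Map<String, StringBuilder> byFile = new LinkedHashMap<>();
        for (String value : values) {
            // The position is the token after the last space; everything
            // before it is the file path.
            int space = value.lastIndexOf(' ');
            String file = value.substring(0, space);
            String pos = value.substring(space + 1);
            byFile.computeIfAbsent(file, f -> new StringBuilder())
                  .append(' ').append(pos);
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, StringBuilder> e : byFile.entrySet()) {
            if (out.length() > 0) out.append(' ');
            out.append(e.getKey()).append(':').append(e.getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(merge(java.util.List.of(
            "/home/abc.txt 1", "/home/abc.txt 10", "/home/def.txt 5")));
        // prints: /home/abc.txt: 1 10 /home/def.txt: 5
    }
}
```

In the real Reducer the key would be the word and you would context.write() the merged string; note this layout also means the word-position pairs can no longer be summed by a combiner, so the setCombinerClass(Reduce.class) call from the original job setup would have to go.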
