Java – Hadoop uses one instance per mapper


I’m using Hadoop’s MapReduce to parse XML files. I have a Parser class with a parse() method, and I call it from the Mapper’s map() function.

However, this means that every invocation of map() creates a new Parser instance, even though the instance should be the same for every map task. So I wonder: is it possible to instantiate this Parser just once?

There is also an additional question: why is the Mapper class always declared static?

Solution

To ensure one parser instance per Mapper, instantiate your parser in the Mapper’s setup() method and release it in the cleanup() method.

We applied the same approach to our protobuf parser; just make sure your parser instance is thread-safe and holds no shared state.
Note: each mapper calls the setup() and cleanup() methods only once, so private fields can safely be initialized there.
Clarification from cricket_007: “In a distributed computing environment, sharing variable instances is not possible…”

We have a practice of reusing Writable objects instead of creating new ones for every record: instantiate the Writable once and re-set its value as many times as needed, as described by Tip 6 and sketched just below.
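For illustration, here is a minimal sketch of that Writable-reuse pattern in a word-count-style mapper (the class and field names are made up for this example):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text word = new Text();               // reused for every record
    private final IntWritable one = new IntWritable(1); // constant value, created once

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token); // re-set the same instance instead of new Text(token)
            context.write(word, one);
        }
    }
}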
Similarly, parser objects can be reused in the same Tip-6 style. For example (the code is wrapped in a made-up XmlParserMapper class so it compiles on its own):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlParserMapper
        extends Mapper<ImmutableBytesWritable, Result, NullWritable, Put> {

    // One parser per mapper: created once in setup(), released once in cleanup()
    private YourXMLParser xmlParser = null;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        xmlParser = new YourXMLParser();
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        super.cleanup(context);
        xmlParser = null;
    }
}
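With the parser held as a field, the map() method inside the same class can then reuse that single instance for every record. A minimal sketch, assuming a hypothetical parse() method that turns each scanned Result into a Put:

    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
        // Reuses the single parser created in setup(); parse() here is a placeholder
        // for whatever your Parser class actually exposes.
        Put put = xmlParser.parse(value);
        context.write(NullWritable.get(), put);
    }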
