Java – Hadoop : set a variable like hashSet only once so that it can be utilized multiple times in each map task

Hadoop : set a variable like hashSet only once so that it can be utilized multiple times in each map task

Hello, I have a HashSet that needs to be used in every map task in Hadoop, and I don’t want to initialize it multiple times. I’ve heard this can be achieved by setting variables in the configuration function. Any suggestions are welcome.

Solution

It seems that you don’t really understand Hadoop’s execution strategy.

If you are running in distributed mode, you cannot share a HashSet across multiple map tasks. Each task runs in its own JVM, and even with JVM reuse it is not guaranteed that your collection will still be there after the JVM is restarted.

What you can do is initialize the HashSet once per task, at the beginning of the computation.

To do that, override the setup(Context ctx) method, which is called once before any calls to the map() method.

But each task needs enough memory to hold its own copy of the HashSet.
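A minimal sketch of that approach is below. The class name FilterMapper, the Configuration property hashset.values, and the key/value types are assumptions for illustration, not part of the original question; the point is that the HashSet is built once in setup() and then reused by every map() call in that task.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Built once per task in setup(), then reused for every call to map().
    private final Set<String> allowedKeys = new HashSet<>();
    private final IntWritable one = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the values that the driver stored in the job Configuration
        // (the property name "hashset.values" is an assumption of this sketch).
        Configuration conf = context.getConfiguration();
        String serialized = conf.get("hashset.values", "");
        for (String value : serialized.split(",")) {
            if (!value.isEmpty()) {
                allowedKeys.add(value);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The HashSet is already populated; no per-record initialization needed.
        String token = value.toString().trim();
        if (allowedKeys.contains(token)) {
            outKey.set(token);
            context.write(outKey, one);
        }
    }
}
```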

If you don’t have that much memory, you can consider a distributed caching scheme instead, but it adds overhead because every query has to be serialized and deserialized, and there is no guarantee that the data will be available locally, so it may end up slower than keeping the collection inside the task.
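For the common case where the set is small enough for memory, the driver side of the setup() approach above could look roughly like this. FilterJobDriver, the placeholder values, the hashset.values property, and the argument-based input/output paths are again assumptions of this sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilterJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Serialize the (small) set into the job Configuration so every task
        // can rebuild it locally in setup(). The property name matches the
        // one assumed in the FilterMapper sketch above; the values are placeholders.
        conf.set("hashset.values", "alpha,beta,gamma");

        Job job = Job.getInstance(conf, "hashset filter");
        job.setJarByClass(FilterJobDriver.class);
        job.setMapperClass(FilterMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are taken from the command line in this sketch.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```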
