Hadoop cluster: map tasks run on only one machine and not all


I have a Hadoop cluster of three machines, one of which acts as both master and slave.

When I run the wordcount example, the map tasks run on two machines, worker1 and worker2. But when I run my own code, they run only on worker1. How can I get map tasks to run on all machines?

Input Split Locations

/default-rack/master
/default-rack/worker1
/default-rack/worker2  

Fixed!

I added the following to mapred-site.xml, which fixed it:

<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
</property>
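Note that mapred.map.tasks is only a hint to the framework; the actual number of map tasks is driven by the number of input splits. An alternative way to force more splits, sketched here assuming the older mapred.* property names, is to cap the maximum split size:

```xml
<!-- Sketch, assuming the older mapred.* property names:
     cap each input split at 32 MB so the same file
     is divided into more splits, and hence more map tasks. -->
<property>
  <name>mapred.max.split.size</name>
  <value>33554432</value>
</property>
```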

Solution

How big is your input? Hadoop divides the input into input splits, and if your file is too small, there will be only one split, and therefore only one map task.

Try a larger file, say about 1 GB in size, and see how many mappers you get.
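The rule of thumb above can be sketched with a quick calculation. This is a sketch, assuming the classic FileInputFormat behavior of one split per HDFS block and the old 64 MB default block size:

```python
import math

# Assumption: one input split per HDFS block (classic FileInputFormat
# default), with a 64 MB block size.
def estimated_splits(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Rough estimate of the number of input splits (and map tasks)."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / block_size_bytes)

# A 10 MB file fits in a single 64 MB block, so it produces one split:
# one map task, which runs on just one machine.
print(estimated_splits(10 * 1024 * 1024))    # 1

# A 1 GB file spans 16 blocks, so up to 16 map tasks can be
# scheduled in parallel across the workers.
print(estimated_splits(1024 * 1024 * 1024))  # 16
```

This is why the wordcount example (run on a sizable input) fans out across worker1 and worker2, while a job over a small file stays on one node.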


You can also check that each TaskTracker is reporting correctly to the JobTracker. A TaskTracker that is not properly connected will not be assigned any tasks:

   $ hadoop job -list-active-trackers

This command should output all 3 hosts.
