Java – Configure EMR nodes with a customization file

I’m trying to run a jar with Apache Nutch dependencies on an AWS EMR Hadoop cluster. The problem is that Nutch can’t find the plugin classes (I specify the plugin location with -Dplugin.folders).
I tested this option locally and it worked fine: java -cp app.jar -Dplugin.folders=./nutch-plugins

I’m getting this error:

19/07/24 15:42:26 INFO mapreduce.Job: Task Id : attempt_1563980669003_0005_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
        at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:146)
        at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:796)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

I tried copying the plugins to the /tmp folder (just guessing it might be shared across nodes) – that didn’t help:

hadoop jar app.jar -Dplugin.folders=/tmp/nutch-plugins

Then I tried copying them to HDFS – didn’t help:

hadoop fs -cp file:///tmp/nutch-plugins hdfs:///tmp/
hadoop jar app.jar -Dplugin.folders=hdfs:///tmp/nutch-plugins

I also tried uploading them to an S3 bucket – didn’t help:

hadoop fs -cp file:///tmp/nutch-plugins s3a:///mybucket/
hadoop jar app.jar -Dplugin.folders=s3a:///mybucket/nutch-plugins

How do I configure the Hadoop nodes with the Nutch plugins? All I need is to copy the plugin folder somewhere so that it can be accessed from every node in the cluster.

Solution

In distributed mode (on a Hadoop cluster), the plugins are contained in the job file (runtime/deploy/apache-nutch-1.x.job):

  1. Start from a source package or a git clone of the Nutch source code.
  2. Adjust the configuration in conf/ – note: the configuration files are also packed into the job file.
  3. Build Nutch (ant runtime).
  4. Run runtime/deploy/bin/nutch or runtime/deploy/bin/crawl: these scripts call hadoop jar <jobfile> to launch the Nutch job, so the hadoop executable must be on PATH.
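The point of the steps above is that the job file is an ordinary zip/jar archive that carries the plugins (and configuration) with it, so every node gets them automatically; you can verify this on your own build with jar tf runtime/deploy/apache-nutch-1.x.job. The sketch below builds a miniature stand-in archive to illustrate that idea – the exact classes/plugins/... layout and the urlnormalizer-basic entry names are assumptions for illustration, not the output of a real Nutch build.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class JobFileDemo {

    // Build a tiny stand-in for the Nutch job file (the real one is produced
    // by `ant runtime`). The classes/plugins/... layout is assumed here.
    static File buildFakeJobFile() throws Exception {
        File job = File.createTempFile("apache-nutch-demo", ".job");
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(job))) {
            String[] entries = {
                "classes/plugins/urlnormalizer-basic/plugin.xml",
                "classes/plugins/urlnormalizer-basic/urlnormalizer-basic.jar",
                "classes/nutch-site.xml"
            };
            for (String name : entries) {
                out.putNextEntry(new ZipEntry(name));
                out.closeEntry();
            }
        }
        return job;
    }

    // List the plugin entries travelling inside the archive,
    // the way `jar tf <jobfile>` would show them.
    static List<String> listPluginEntries(File jobFile) throws Exception {
        List<String> plugins = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jobFile)) {
            Enumeration<? extends ZipEntry> e = zip.entries();
            while (e.hasMoreElements()) {
                String name = e.nextElement().getName();
                if (name.startsWith("classes/plugins/")) {
                    plugins.add(name);
                }
            }
        }
        return plugins;
    }

    public static void main(String[] args) throws Exception {
        File job = buildFakeJobFile();
        for (String entry : listPluginEntries(job)) {
            System.out.println(entry);
        }
        Files.delete(job.toPath());
    }
}
```

Because the plugins ride inside the archive that hadoop jar ships to the cluster, there is nothing to copy to /tmp, HDFS, or S3 – rebuilding the job file after changing conf/ or the plugin set is all that is needed.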
