Configure EMR nodes with a customization file
I’m trying to run a jar with Apache Nutch dependencies on an AWS EMR Hadoop cluster. The problem is that Nutch can’t find the plugin classes (I specify the plugin location with -Dplugin.folders).
I tested this option locally and it worked fine: java -cp app.jar -Dplugin.folders=./nutch-plugins.
On the cluster I’m getting this error:
19/07/24 15:42:26 INFO mapreduce.Job: Task Id : attempt_1563980669003_0005_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:146)
at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:796)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
I tried copying the plugins to the /tmp folder (just guessing that it is shared between nodes), but that didn’t help:
hadoop jar app.jar -Dplugin.folders=/tmp/nutch-plugins
Then I tried copying them to HDFS, which didn’t help either:
hadoop fs -cp file:///tmp/nutch-plugins hdfs:///tmp/
hadoop jar app.jar -Dplugin.folders=hdfs:///tmp/nutch-plugins
I also tried uploading them to an S3 bucket, which didn’t help either:
hadoop fs -cp file:///tmp/nutch-plugins s3a:///mybucket/
hadoop jar app.jar -Dplugin.folders=s3a:///mybucket/nutch-plugins
How do I configure the Hadoop nodes so that Nutch can find its plugins? All I need is to copy the plugin files somewhere that every node in the cluster can access.
Solution
In distributed mode (on a Hadoop cluster), the plugins are contained in the job file (runtime/deploy/apache-nutch-1.x.job).
- Start with a source package or a git clone of the Nutch source code
- Modify the configuration in conf/ (note: the configuration files are also included in the job file)
- Build Nutch (ant runtime)
- Run runtime/deploy/bin/nutch or runtime/deploy/bin/crawl: these scripts call hadoop jar <jobfile> to start the Nutch jobs, so the hadoop executable must be on the PATH
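
The two preconditions in the steps above (a built job file and hadoop on the PATH) can be checked before launching anything. This is only a rough sketch: the function names find_job_file and require_hadoop are made up for illustration and are not part of Nutch, and it assumes ant runtime leaves a single *.job file under runtime/deploy of the source checkout.

```shell
#!/bin/sh
# Sketch only: pre-flight checks for running Nutch in deploy mode.
# find_job_file and require_hadoop are hypothetical helper names.

# Print the path of the first *.job file under <nutch_src>/runtime/deploy.
# Returns non-zero if none exists (for example, "ant runtime" was not run).
find_job_file() {
    for f in "$1"/runtime/deploy/*.job; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    echo "no .job file under $1/runtime/deploy; run 'ant runtime' first" >&2
    return 1
}

# Succeed only if the hadoop executable can be found on the PATH,
# since bin/nutch and bin/crawl call "hadoop jar <jobfile>" internally.
require_hadoop() {
    command -v hadoop >/dev/null 2>&1 || {
        echo "hadoop not found on PATH" >&2
        return 1
    }
}
```

On the EMR master node, one would call require_hadoop and find_job_file with the path of the Nutch checkout, and only if both succeed run runtime/deploy/bin/nutch or runtime/deploy/bin/crawl from that checkout. Because the plugins travel inside the job file, no separate copy to /tmp, HDFS or S3 should be needed.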