Java – Set the external jar to the hadoop classpath

Setting an external jar on the Hadoop classpath… here is the question and a solution to the problem.

Set the external jar to the hadoop classpath

I’m trying to add an external jar to the Hadoop classpath, but so far without success.

I have the following setup:

$ hadoop version
Hadoop 2.0.6-alpha
Subversion https://git-wip-us.apache.org/repos/asf/bigtop.git -r ca4c88898f95aaab3fd85b5e9c194ffd647c2109
Compiled by jenkins on 2013-10-31T07:55Z
From source with checksum 95e88b2a9589fa69d6d5c1dbd48d4e
This command was run using /usr/lib/hadoop/hadoop-common-2.0.6-alpha.jar

Classpath

$ echo $HADOOP_CLASSPATH
/home/tom/workspace/libs/opencsv-2.3.jar

I can see that the HADOOP_CLASSPATH above is picked up by hadoop:

$ hadoop classpath
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/home/tom/workspace/libs/opencsv-2.3.jar:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*

Command

$ sudo hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/result

I also tried the -libjars option

$ sudo hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/result -libjars /home/tom/workspace/libs/opencsv-2.3.jar

Stack trace

14/11/04 16:43:23 INFO mapreduce.Job: Running job: job_1415115532989_0001
14/11/04 16:43:55 INFO mapreduce.Job: Job job_1415115532989_0001 running in uber mode : false
14/11/04 16:43:56 INFO mapreduce.Job: map 0% reduce 0%
14/11/04 16:45:27 INFO mapreduce.Job: map 50% reduce 0%
14/11/04 16:45:27 INFO mapreduce.Job: Task Id : attempt_1415115532989_0001_m_000001_0, Status : FAILED
Error: java.lang.ClassNotFoundException: au.com.bytecode.opencsv.CSVParser
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at FlightsByCarrierMapper.map(FlightsByCarrierMapper.java:19)
at FlightsByCarrierMapper.map(FlightsByCarrierMapper.java:10)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:757)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153)

Any help is greatly appreciated.

Solution

Your external jar is missing on the nodes that run the map tasks. HADOOP_CLASSPATH only affects the client JVM that submits the job, not the task JVMs, so you must add the jar to the distributed cache to make it available to the tasks. Try:

DistributedCache.addFileToClassPath(new Path("pathToJar"), conf);
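Note that the path handed to DistributedCache is resolved against the default filesystem (normally HDFS), so the jar has to be uploaded there first, for example with hdfs dfs -put.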

Not sure in which version DistributedCache is deprecated, but starting with Hadoop 2.2.0 you can use:

job.addFileToClassPath(new Path("pathToJar")); 
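For reference, here is a minimal driver sketch of the second approach (Hadoop 2.2.0 or later). It assumes the FlightsByCarrier mapper class from the question, and the HDFS location /user/root/libs/opencsv-2.3.jar as well as the output key/value types are hypothetical and need to be adjusted to your job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FlightsByCarrier extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "FlightsByCarrier");
        job.setJarByClass(FlightsByCarrier.class);

        // Ship the external jar to the task nodes and put it on their classpath.
        // The HDFS location below is hypothetical; upload the jar there first,
        // e.g. with: hdfs dfs -put opencsv-2.3.jar /user/root/libs/
        job.addFileToClassPath(new Path("/user/root/libs/opencsv-2.3.jar"));

        // Mapper class taken from the question; the output types here are
        // assumptions and must match whatever the mapper actually emits.
        job.setMapperClass(FlightsByCarrierMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options such as -libjars, -files and -D
        // before the remaining arguments reach run().
        System.exit(ToolRunner.run(new Configuration(), new FlightsByCarrier(), args));
    }
}

With a ToolRunner-based driver like this, the -libjars option from the question should also work, provided it is placed before the job arguments:

$ hadoop jar FlightsByCarrier.jar FlightsByCarrier -libjars /home/tom/workspace/libs/opencsv-2.3.jar /user/root/1987.csv /user/root/result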
