Java – How to distribute jars to Hadoop before job submission


How to distribute jars to Hadoop before job submission

I want to implement a REST API that submits a Hadoop job for execution. This is done entirely through Java code. If I compile a jar file and execute it via "hadoop jar", everything works as expected. But when I submit the job from my REST API via Java code, the job is submitted but fails with a ClassNotFoundException.
Is it possible to somehow deploy the jar file (containing my job code) to Hadoop (the NodeManagers and their containers) so that Hadoop can locate the jar by class name? Or should I copy the jar file to each NodeManager and set up HADOOP_CLASSPATH there?
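For context, the failing path is usually a plain client-API submission from the server JVM. A minimal sketch of such a submission (job name and MyMapper are hypothetical placeholders) is below; it tends to fail because Job.setJarByClass can only locate a jar when the class was actually loaded from one, which is often not the case inside an application server, so no jar is shipped to the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = new Job(conf, "rest-submitted-job");  // hypothetical job name

// Finds no jar when classes come from an exploded classpath, so nothing
// is uploaded and the cluster-side tasks later throw ClassNotFoundException.
job.setJarByClass(MyMapper.class);              // MyMapper is a placeholder

job.waitForCompletion(true);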

Solution

You can create a method that adds a jar file to Hadoop's distributed cache, so that it is available to the TaskTrackers (or NodeManagers on YARN) when needed.

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

private static void addJarToDistributedCache(
        String jarPath, Configuration conf) throws IOException {

    File jarFile = new File(jarPath);

    // Declare the new HDFS location
    Path hdfsJar = new Path(jarFile.getName());

    // Mount HDFS
    FileSystem hdfs = FileSystem.get(conf);

    // Copy (overwrite) the jar file to HDFS
    hdfs.copyFromLocalFile(false, true,
        new Path(jarPath), hdfsJar);

    // Add the jar to the distributed classpath
    DistributedCache.addFileToClassPath(hdfsJar, conf);
}
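Note that DistributedCache is deprecated as of Hadoop 2.x. If you are on a newer release, the same effect can be achieved through the Job API; a minimal sketch (reusing the hdfsJar path produced by the helper above) looks like this:

import org.apache.hadoop.mapreduce.Job;

// Sketch for Hadoop 2.x+: the Job object carries the classpath entry itself.
Job job = Job.getInstance(conf, "Hadoop-classpath");
job.addFileToClassPath(hdfsJar);  // the same HDFS path the helper uploaded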

Then, in your application, call addJarToDistributedCache before submitting the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public static void main(String[] args) throws Exception {

    // Create the Hadoop configuration
    Configuration conf = new Configuration();

    // Add 3rd-party libraries
    addJarToDistributedCache("/tmp/hadoop_app/file.jar", conf);

    // Create the job
    Job job = new Job(conf, "Hadoop-classpath");
    .../...
}
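The elided part of main is the usual job wiring and submission. A minimal sketch of what typically follows (mapper, reducer, and input/output paths are hypothetical placeholders, and the standard mapreduce.lib imports are assumed) might be:

// Hypothetical continuation: class wiring, I/O paths, submission.
job.setJarByClass(MyMapper.class);            // MyMapper is a placeholder
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);         // MyReducer is a placeholder
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("/tmp/hadoop_app/input"));    // assumed input
FileOutputFormat.setOutputPath(job, new Path("/tmp/hadoop_app/output")); // assumed output
System.exit(job.waitForCompletion(true) ? 0 : 1);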

You can find more details in the original blog post.
