Import external libraries in Hadoop MapReduce scripts
I run a Python MapReduce script on top of Amazon’s EMR Hadoop implementation. The main script produces similarities between project artifacts. In a post-processing step, I want to split this output into a separate S3 bucket per project, so that each artifact bucket contains a list of similar artifacts. To do this, I want to use Amazon’s boto Python library in the reduce function of the post-processing step.
- How do I import external (Python) libraries into Hadoop so that they can be used in the Reduce step written in Python?
- Can I access S3 this way in a Hadoop environment?
Thanks in advance,
Thomas
Solution
When you start the Hadoop process, you can specify external files that should be made available. This is done using the -files parameter.
$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat
I don’t know if these files have to be put on HDFS, but if it’s a job that runs frequently, it’s not a bad idea to put them there.
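For a Python Streaming job, the files shipped with -files end up in the task’s working directory, so a zipped Python library can simply be put on sys.path and imported from the mapper or reducer. Here is a minimal, self-contained sketch of that pattern; the library name mylib and the zip name mylib.zip are hypothetical stand-ins for whatever you actually ship:

```python
import os
import sys
import zipfile

# Stand-in for the library zip you would ship with -files; in a real job
# this file already sits in the task's working directory.
with zipfile.ZipFile("mylib.zip", "w") as zf:
    zf.writestr("mylib/__init__.py", "VERSION = '1.0'\n")

# In the reducer, this is the only step needed: add the shipped zip to
# sys.path so Python's zip importer can load modules from it.
sys.path.insert(0, os.path.join(os.getcwd(), "mylib.zip"))

import mylib
print(mylib.VERSION)
```

This relies on Python’s built-in ability to import packages directly from zip archives, so no unpacking step is needed on the task nodes.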
From the code, you can do something like
if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
    List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
    for (Path localFile : localFiles) {
        if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
            // convert the cached Path into a local java.io.File
            File file = new File(localFile.toUri().getPath());
        }
    }
}
This is almost directly copied and pasted from working code in our multiple mappers.
I don’t know about the second part of your question. Hopefully, the answer to the first part gets you started. 🙂
In addition to -files, there is -libjars for including additional jars; I have some information about that here: If I have a constructor that requires a path to a file, how can I “fake” that if it is packaged into a jar?