Import external libraries in Hadoop MapReduce scripts
I run a Python MapReduce script on top of Amazon’s EMR Hadoop implementation. The main script produces similarities between project artifacts. In a post-processing step, I want to split this output into a separate S3 bucket per project, so that each artifact bucket contains a list of similar artifacts. To do this, I want to use Amazon’s boto Python library in the reduce function of the post-processing step.
- How do I import external (Python) libraries into Hadoop so that they can be used in the Reduce step written in Python?
- Can I access S3 this way in a Hadoop environment?
Thanks in advance,
Thomas
Solution
When you start the Hadoop process, you can specify external files that should be made available. This is done using the -files parameter.
$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat
I don’t know if these files have to be put on HDFS, but if it’s a job that runs frequently, it’s not a bad idea to put them there.
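For a Python Streaming job, the files shipped with -files end up in the task’s working directory, so a zipped Python library can simply be put on sys.path and imported from the mapper or reducer. Here is a minimal, self-contained sketch of that pattern; the library name mylib and the zip name mylib.zip are hypothetical stand-ins for whatever you actually ship:

```python
import os
import sys
import zipfile

# Stand-in for the library zip you would ship with -files; in a real job
# this file already sits in the task's working directory.
with zipfile.ZipFile("mylib.zip", "w") as zf:
    zf.writestr("mylib/__init__.py", "VERSION = '1.0'\n")

# In the reducer, this is the only step needed: add the shipped zip to
# sys.path so Python's zip importer can load modules from it.
sys.path.insert(0, os.path.join(os.getcwd(), "mylib.zip"))

import mylib
print(mylib.VERSION)
```

This relies on Python’s built-in ability to import packages directly from zip archives, so no unpacking step is needed on the task nodes.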
From the code, you can do something like
if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
    List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
    for (Path localFile : localFiles) {
        if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
            // convert the cached Path into a local java.io.File
            File file = new File(localFile.toUri().getPath());
        }
    }
}
This is almost directly copied and pasted from working code in our multiple mappers.
I don’t know about the second part of your question. Hopefully, the answer to the first part gets you started. 🙂
In addition to -files, there is -libjars for including additional jars; I have some information about that here: If I have a constructor that requires a path to a file, how can I “fake” that if it is packaged into a jar?