Python modules are not visible to the Pig/Spark job

I have a recurring issue with my Hadoop cluster: occasionally, code that was working stops finding Python modules in the right places. I’m looking for tips from anyone who has had the same issue.

When I first started programming and the code stopped working, I asked a question here on SO, and someone told me to go to sleep and it should work in the morning, or left some other comment like “you’re a fool, you must have changed something”.

I ran the code multiple times and it worked. I went to sleep, and in the morning I tried to run it again, but it failed. Sometimes I use CTRL+C to terminate jobs, and sometimes CTRL+Z. But that only eats up resources and causes no other problems; the code still runs. I have never seen this issue immediately after the code has run. It usually happens the next morning, when I come back to code that worked when I left it 10 hours earlier. Restarting the cluster usually resolves the issue.

I’m currently checking whether the cluster is rebooting itself for some reason, or whether some part of it is failing, but so far the Ambari screen shows everything green. I’m not sure if there is some automatic maintenance or something else known to mess things up.

I’m still struggling through the elephant book, so sorry if this topic is clearly addressed on page XXXX; I just haven’t gotten to that page yet.

I looked at all the error logs, but the only thing I see that makes sense is in stderr:

      File "/data5/hadoop/yarn/local/usercache/melvyn/appcache/application_1470668235545_0029/container_e80_1470668235545_0029_01_000002/format_text.py", line 3, in <module>
        from formatting_functions import *
    ImportError: No module named formatting_functions

Solution

So we solved the problem, and it turned out to be specific to our setup. All of our data nodes are NFS mounted. Occasionally a node fails, and someone has to recover and remount it.

Our script specifies the path to the library, for example:

    pig -Dmapred.child.env="PYTHONPATH=$path_to_mnt$hdfs_library_path" ...

So Pig can’t find these libraries because $path_to_mnt is not valid on one of the nodes.
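If you suspect the same cause, one way to confirm it is to have the task report which PYTHONPATH entries are missing on the node it lands on. The following is only a minimal sketch: the function names `missing_pythonpath_entries` and `report_missing` are hypothetical helpers made up for illustration, not part of Pig or Hadoop, and it assumes PYTHONPATH actually reaches the task environment as in the command above.

```python
import os
import socket
import sys


def missing_pythonpath_entries(pythonpath=None):
    """Return the PYTHONPATH entries that do not exist on this host."""
    if pythonpath is None:
        # Assumption: the job passes PYTHONPATH into the task environment.
        pythonpath = os.environ.get("PYTHONPATH", "")
    # Skip empty entries produced by leading, trailing, or doubled separators.
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    return [p for p in entries if not os.path.exists(p)]


def report_missing(stream=sys.stderr):
    """Write the host name and any missing entries to the given stream."""
    missing = missing_pythonpath_entries()
    if missing:
        stream.write("%s: missing PYTHONPATH entries: %s\n"
                     % (socket.gethostname(), ", ".join(missing)))
    return missing
```

Calling `report_missing()` at the top of the streaming script (here, format_text.py), before the `from formatting_functions import *` line, should put the name of the affected node into that task's stderr log, next to the ImportError, which makes a stale mount easier to spot than the traceback alone.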
