Python – How to specifically determine the input for each map step in MRJob?

I’m working on a map-reduce job with multiple steps. With mrjob, each step receives the output of the previous step. The problem is that I don’t want that.

What I want is to extract some information in the first step and then use it in the second step, which should again run over the full original input, and so on. Is it possible to do this using mrjob?

Note: Since I don’t want to use EMR, this question didn’t help me much.

Update: If it’s not possible to do this in one job, I need to do it in two different jobs. In this case, is there any way to wrap these two jobs and manage intermediate outputs, etc.?

Solution

You can use mrjob runners.

Define each job in its own module, then run them one after another from a separate Python script:

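import sys

from NumLines import NumLines
from WordsPerLine import WordsPerLine


def run_jobs(input_file):
    # First job: extract the information needed by the second job.
    first_job = NumLines(args=[input_file])
    with first_job.make_runner() as first_runner:
        first_runner.run()
        # Directory holding the first job's output files.
        intermediate = first_runner.get_output_dir()

        # Second job: reads the intermediate output plus the original input.
        # It runs inside the first runner's `with` block because, by default,
        # a runner cleans up its temporary directories when the block exits.
        second_job = WordsPerLine(args=[intermediate, input_file])
        with second_job.make_runner() as second_runner:
            second_runner.run()


if __name__ == '__main__':
    run_jobs(sys.argv[1])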

The script can then be run as:

python main_script.py input.txt
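
If the second job only needs a small summary from the first one (for example, a total line count), one option is to read the intermediate output back in the driver script instead of passing the whole directory as input. The following is a minimal sketch, assuming the first job writes with mrjob's default JSON output protocol (one tab-separated JSON key and JSON value per line, in part-* files). load_intermediate is a hypothetical helper, not part of mrjob, and it has to be called while the first runner's output directory still exists.

```python
import glob
import json
import os


def load_intermediate(output_dir):
    """Collect key/value pairs from every part-* file in output_dir.

    Assumption: each non-empty line holds a JSON key and a JSON value
    separated by a single tab, and every key is hashable (a string or
    a number, not a list).
    """
    results = {}
    for path in sorted(glob.glob(os.path.join(output_dir, 'part-*'))):
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.rstrip('\n')
                if not line:
                    continue  # skip blank lines
                raw_key, raw_value = line.split('\t', 1)
                results[json.loads(raw_key)] = json.loads(raw_value)
    return results
```

In the driver script this could be called as load_intermediate(first_runner.get_output_dir()) right after first_runner.run(), and the resulting values could then be handed to the second job in whatever way that job expects.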
