Python – Automated Hadoop batch commands


I’m a beginner in this field, so I don’t know the exact terminology, sorry.

Problem: I want to automate the processing of the batch layer.

Question: I can’t understand how people manage to run large Hadoop commands like:

"hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar \
    -mapper mapper.py \
    -reducer reducer.py \
    -input nfldata/stadiums \
    -output nfldata/pythonoutput \
    -file simple/mapper.py \
    -file simple/reducer.py" 

Is there any way to automate a process, like a cron job, that runs a MapReduce job whenever needed? Please let me know if there are any resources for scheduling Hadoop commands or related Python or Bash scripts.

What I searched for: Luigi (here) to build the jar or run the command, but there is no documentation or examples for it.

Note: I don’t know Java, so I haven’t looked into Java-based options and can’t use them.

Solution

You can choose between Hadoop and non-Hadoop solutions.

Hadoop solutions

The Hadoop ecosystem provides three main workflow frameworks for this scenario: Oozie, Luigi, and Azkaban.

Each framework has its advantages and disadvantages. Oozie, for example, is XML-based (which many people don’t like); you write jobs and add them to the Oozie workflow engine. What people usually love about Oozie is that it ships with a GUI for designing workflows.

For more information about Hadoop workflow solutions, search for a comparison of these tools; plenty of comparisons are available online.

Non-Hadoop solutions

Write your workflow code in any language (a scripting language such as Python, Bash, or Perl is most likely better suited to this use case than a compiled one). Add the application to a cron job so that it runs periodically.

Invoke all the required commands from within the application (for example, hdfs dfs or hadoop jar). You then have the full flexibility of the language to catch exceptions and add whatever logic you need; see the sketch below.
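As a minimal sketch (reusing the streaming jar path and the mapper/reducer files from the question; the script name run_batch.py and the log path in the cron line are hypothetical), such a wrapper could look like this in Python:

#!/usr/bin/env python
# run_batch.py - minimal sketch of a cron-driven wrapper around the
# streaming job from the question; all paths are placeholders.
import subprocess
import sys

STREAMING_JAR = ("/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/"
                 "hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")

def run_job():
    # Streaming jobs fail if the output directory already exists,
    # so remove the previous run's output first (ignoring errors).
    subprocess.call(["hadoop", "fs", "-rm", "-r", "nfldata/pythonoutput"])
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-input", "nfldata/stadiums",
        "-output", "nfldata/pythonoutput",
        "-file", "simple/mapper.py",
        "-file", "simple/reducer.py",
    ]
    try:
        # check_call raises CalledProcessError on a non-zero exit code,
        # so a failed job does not go unnoticed.
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as err:
        # Room for retries, alerting, cleanup, and so on.
        sys.exit("Hadoop job failed with exit code %d" % err.returncode)

if __name__ == "__main__":
    run_job()

A crontab entry then runs the wrapper on a schedule, for example every night at 2 a.m.:

0 2 * * * python /path/to/run_batch.py >> /tmp/batch.log 2>&1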

How

If you’re using Hue, using Oozie means you have a GUI to create your workflows by default, as shown in the screenshot below.
[Screenshot: the Oozie workflow editor in Hue]

In your case, you can enter the Hadoop command from your example in the GUI, filling in the Mapper, Reducer, and other fields. You can then schedule the workflow.

[Screenshot: Oozie job templates in Hue]

As you can see, there are many job templates for Oozie, such as a MapReduce job. If there is no template that fits your job, you can implement your own Oozie job in Java. Under the hood, Oozie stores everything in XML files, so you can also edit your workflows and jobs directly in XML:

<workflow-app xmlns="uri:oozie:workflow:0.2"
name="whitehouse-workflow">
<start to="transform_input"/>
<action name="transform_sample_pig">
  <pig>
    <job-tracker>${resourceManager}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
      <delete path="pig_store"/>
    </prepare>
    <script>mypig.pig</script>
  </pig> 
  <ok to="end"/>
  <error to="fail"/>
</action>
<kill name="fail">
  <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]
  </message>
</kill>
<end name="end"/>

Scheduling/running
Once you’ve designed your workflow, you can either run it immediately or schedule it. Scheduling is done through a wizard that lets you define details such as the frequency and the input data, as well as more advanced settings such as concurrency.
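Behind the scenes, scheduling in Oozie is expressed as a coordinator, which is again plain XML. A minimal sketch that runs the workflow above once a day could look like this (the coordinator name, the dates, and the HDFS app path are placeholder assumptions):

<coordinator-app name="batch-coordinator" frequency="${coord:days(1)}"
                 start="2014-01-01T00:00Z" end="2015-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <!-- hypothetical HDFS path where the workflow is deployed -->
      <app-path>${nameNode}/user/me/workflows/whitehouse-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>

The scheduling wizard in Hue generates this kind of definition for you, so you rarely have to write it by hand.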

This shows another advantage of Oozie over scripted implementations: if you collaborate with users who should occasionally be allowed to trigger workflows, a script needs extra work to be exposed through a GUI, while with Oozie it’s just a click away.

[Screenshot: the Oozie scheduling wizard]

Trade-offs

As usual, there is no silver bullet, no single best tool that solves every problem. With a Hadoop-specific solution, you will have to learn dedicated tools: understanding how Oozie, Luigi, or Azkaban works adds a learning curve.

If you are already proficient in a programming language, you don’t need this learning curve. Use a scripting language and add the script to a scheduler such as cron. You keep all the power of the language to react to exceptions and customize your workflow, but you give up the UI.

All in all, if you’re just scheduling simple jobs, any of the Hadoop-specific solutions will suffice. Beyond a certain level of complexity and customization, a Python implementation adds agility at the expense of maintenance.

There is also a third option: the market offers many specialized ETL solutions, such as Informatica, Talend, or OSI.
