Python – How to run XGBoost on a Hadoop cluster for distributed model training?

I’m trying to build a CTR prediction model for 100 million impressions of contextual ads using XGBoost. Since all of the impression data is already in HDFS, I’d like to run XGBoost on Hadoop.

Can someone point me to a working tutorial for doing this in Python?

Solution

There are several ways to do this:

  1. If your data has a natural lower-level grouping, such as CTRs for individual project departments, and you want localized models per department, you can partition the work with a MapReduce-style job: it ensures that all data belonging to a single department ends up in a single YARN container, where you can train that department’s model. NLineInputFormat is a clever trick that turns this into a map-only job rather than a full map-reduce job, which gives a significant speed boost (a sketch of this pattern follows the list).

  2. You can use the Spark version of XGBoost for genuinely distributed training. For more information, see http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html (a PySpark sketch also follows the list).

  3. If you’re still deciding on your infrastructure, you can also try AWS by following the instructions here: https://xgboost.readthedocs.io/en/latest/tutorials/aws_yarn.html. It doesn’t reuse your existing Hadoop cluster; instead it stands up a YARN cluster on AWS and runs XGBoost in distributed mode there (see the last sketch below).
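
For option 1, below is a minimal sketch of the map-only pattern as a Hadoop Streaming mapper. The idea: a small driver file lists one department ID per line, NLineInputFormat splits it so each YARN container receives exactly one line, and the mapper pulls that department’s impressions from HDFS and trains a local model. The HDFS layout, the feature columns, and the `clicked` label are illustrative assumptions, not anything from the original question:

```python
#!/usr/bin/env python
# Map-only Hadoop Streaming mapper: NLineInputFormat hands each mapper one
# department ID, and the mapper trains a local CTR model for that department.
import io
import subprocess
import sys

import pandas as pd
import xgboost as xgb

FEATURES = ["ad_slot", "hour", "site_ctr"]  # assumed feature columns

for line in sys.stdin:
    # With NLineInputFormat the mapper input is "<byte offset>\t<line>";
    # keep only the line content (the department ID).
    dept_id = line.rstrip("\n").split("\t")[-1]

    # Stream this department's impressions out of HDFS. The partitioned path
    # layout is an assumption; files are assumed to be headerless CSVs.
    raw = subprocess.run(
        ["hdfs", "dfs", "-cat", f"/data/impressions/dept={dept_id}/*"],
        capture_output=True, check=True,
    ).stdout
    df = pd.read_csv(io.BytesIO(raw), header=None, names=FEATURES + ["clicked"])

    # Train a localized model on this department's data only.
    dtrain = xgb.DMatrix(df[FEATURES], label=df["clicked"])
    booster = xgb.train(
        {"objective": "binary:logistic", "eval_metric": "auc"},
        dtrain,
        num_boost_round=200,
    )

    # Persist the per-department model back to HDFS.
    local_path = f"model_{dept_id}.json"
    booster.save_model(local_path)
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", local_path, f"/models/ctr/{local_path}"],
        check=True,
    )
    print(f"{dept_id}\ttrained")  # marker record for job bookkeeping
```

You would submit this with the Hadoop Streaming jar, configured to use NLineInputFormat with one line per map and zero reducers, so each container runs the mapper exactly once for its department.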
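For option 2, the linked post covers XGBoost4J on the JVM; if you want to stay in Python, recent XGBoost releases (1.7 and later) also ship a native PySpark estimator. A minimal sketch, assuming the impressions sit in Parquet with a 0/1 `clicked` label (the paths and column names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier  # requires xgboost >= 1.7

spark = SparkSession.builder.appName("ctr-xgboost").getOrCreate()

df = spark.read.parquet("hdfs:///data/impressions")  # placeholder path

# Spark ML estimators expect the features as a single vector column.
assembler = VectorAssembler(
    inputCols=["ad_slot", "hour", "site_ctr"],  # assumed feature columns
    outputCol="features",
)
train_df = assembler.transform(df)

clf = SparkXGBClassifier(
    features_col="features",
    label_col="clicked",   # assumed 0/1 click label
    num_workers=8,         # number of distributed XGBoost workers
    max_depth=8,
    n_estimators=200,
)
model = clf.fit(train_df)  # training is distributed across the executors
model.write().overwrite().save("hdfs:///models/ctr_xgb_spark")
```

Because the estimator reads the training data as a Spark DataFrame, the 100 million impressions never have to leave the cluster or fit on a single machine.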
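For option 3, the AWS tutorial launches one copy of a training script per worker via dmlc-submit. The sketch below follows that era’s approach and assumes an XGBoost build with HDFS and rabit (distributed communication) support compiled in; note that the rabit Python API has since been replaced by xgboost.collective in newer releases, and the paths here are placeholders:

```python
# train_ctr.py -- run on every worker, launched for example with:
#   dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=4 python train_ctr.py
import xgboost as xgb

xgb.rabit.init()  # connect this worker to the tracker started by dmlc-submit

# With an HDFS-enabled build, each worker loads its own shard of the input.
dtrain = xgb.DMatrix("hdfs:///data/ctr/train")

params = {"objective": "binary:logistic", "eval_metric": "auc"}
booster = xgb.train(params, dtrain, num_boost_round=200)

if xgb.rabit.get_rank() == 0:  # only one worker needs to persist the model
    booster.save_model("ctr.model")

xgb.rabit.finalize()
```

The workers synchronize gradients through rabit on every boosting round, so the resulting model is the same as if it had been trained on one huge machine.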
