Python – Logging in Hadoop


I’m trying to run a MapReduce job as a Hadoop streaming job, with the mapper written in Python, and I use Python’s logging module to log messages. When I run the mapper locally by piping a file into it with “cat”, the log file is created as expected:

cat file | ./mapper.py 

But when I run this job via Hadoop, I can’t find the log file.

import logging

logging.basicConfig(filename="myApp.log", level=logging.INFO)
logging.info("app start")

##
## logic with log messages
##

logging.info("app complete")

But I can’t find the myApp.log file anywhere. Is the log data stored somewhere, or does Hadoop ignore application logging completely? I also searched for my log entries in the userlogs folder, but they don’t appear to be there.

I’m dealing with a lot of data, and random items don’t make it to the next stage, which is a big problem for us, so I’m trying to find a way to debug my application with logging.

Thanks for any help.

Solution

I believe you are logging to stdout? If so, you should definitely log to stderr instead, or set up your own custom stream.

With hadoop-streaming, stdout is the stream dedicated to passing key/value pairs between mappers and reducers and to emitting results, so you shouldn’t log anything to it.
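A minimal sketch of that fix: point the logging module at stderr so log lines never mix with the key/value records Hadoop streaming reads from the mapper's stdout. (The map_line function and its tab-separated record format are illustrative, not taken from the original mapper.)

```python
import logging
import sys

# Route log output to stderr; stdout stays reserved for key/value records.
logging.basicConfig(
    stream=sys.stderr,                  # stderr, not stdout or a local file
    level=logging.INFO,
    format="%(levelname)s %(message)s",
)

def map_line(line):
    """Turn one input line into a tab-separated (key, count) record."""
    key = line.strip()
    if not key:
        logging.warning("skipping empty input line")  # lands in the task's stderr log
        return None                     # emit nothing on stdout for this line
    return f"{key}\t1"

# In the real mapper the records would be printed for each line of sys.stdin;
# here a couple of sample lines stand in for the input stream.
for sample in ["apple\n", "   \n", "banana\n"]:
    record = map_line(sample)
    if record is not None:
        print(record)                   # mapper output still goes to stdout
```

With this setup, anything written via logging shows up in the task attempt's stderr log rather than corrupting the job's output, and diagnostics for dropped records stay visible without interfering with the data.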
