Python – hadoop streaming – How to join two diff files internally using python


I want to find the number of page views of popular websites for users in the 18 to 25 age group.
I have two files: one contains username and age, the other contains username and website name. Example:

users.txt

John, 22

pages.txt

John, google.com

I wrote the following Python script, and it works the way I expected outside of Hadoop.

import os
os.chdir("/home/pythonlab")

# Top sites visited by users aged 18 to 25

# read the users file: "user name, age" (e.g. "john, 22")
with open("users.txt") as lines:
    users = [line.split(",") for line in lines]
userlist = [(u[0], int(u[1])) for u in users]     # (name, age) pairs

# read the page-visit file: "user name, website" (e.g. "john,google.com")
with open("pages.txt") as pages:
    page = [p.split(",") for p in pages]
pagelist = [(p[0], p[1]) for p in page]

# join users to page visits and keep the 18-25 age group
usrpage = [[p[1], u[0]] for u in userlist for p in pagelist
           if u[0] == p[0] and 18 <= u[1] <= 25]

for z in usrpage:
    print(z[0].strip('\r\n') + ",1")    # emit "website,1"

Sample output:

yahoo.com,1
google.com,1

Now I want to use Hadoop Streaming to solve this problem.

My question is: how do I handle these two named files (users.txt, pages.txt) in my mapper? We usually pass only the input directory to Hadoop Streaming.
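For context, a common pattern with Hadoop Streaming is a map-side join: ship the smaller file (here users.txt) to every mapper with the `-files` option, load it into a dictionary, and stream pages.txt as the job input on stdin. The sketch below assumes that setup; the file names and helper functions are illustrative, not part of the original question:

```python
#!/usr/bin/env python3
# mapper.py -- hypothetical map-side join for Hadoop Streaming.
# Assumes the job is launched with "-files users.txt" so users.txt is
# available in the mapper's working directory, while pages.txt lines
# arrive on stdin.
import sys


def load_users(path):
    """Return {username: age} from lines like 'john, 22'."""
    users = {}
    with open(path) as f:
        for line in f:
            name, age = line.split(",")
            users[name.strip()] = int(age)
    return users


def map_pages(users, lines, lo=18, hi=25):
    """Yield 'website,1' for visits by users in the [lo, hi] age group."""
    for line in lines:
        name, site = line.split(",")
        age = users.get(name.strip())
        if age is not None and lo <= age <= hi:
            yield site.strip() + ",1"


if __name__ == "__main__":
    import os
    # Only run when users.txt was actually shipped alongside the task.
    if os.path.exists("users.txt"):
        for record in map_pages(load_users("users.txt"), sys.stdin):
            print(record)
```

A reducer would then sum the `,1` counts per website, as in the classic word-count example.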

Solution

Consider using Hive. It allows you to combine multiple source files, just as you need to here: you can join the two data sources almost as you would in SQL, and then push the results into your mapper and reducer.
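As a sketch, the join could look something like this in HiveQL. The table and column names are assumptions; you would first declare `users` and `pages` as (external) tables over the two files:

```sql
-- Hypothetical HiveQL: page views per site for users aged 18 to 25.
-- Assumes tables users(name, age) and pages(name, site) created over
-- users.txt and pages.txt.
SELECT p.site, COUNT(*) AS views
FROM users u
JOIN pages p ON u.name = p.name
WHERE u.age BETWEEN 18 AND 25
GROUP BY p.site;
```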
