Python – How to save files in Hadoop using Python

How to save files in Hadoop using Python… here is a solution to the problem.


Question:

I started learning Hadoop, and I need to save a lot of files into it using Python. I can’t seem to figure out what I’m doing wrong. Can anyone help me with this?

Below is my code.
I think HDFS_PATH is correct because I didn’t change it in the settings when installing. pythonfile.txt is on my desktop (as is the Python code I’m running from the command line).

Code:

import hadoopy
import os
hdfs_path = 'hdfs://localhost:9000/python'

def main():
    hadoopy.writetb(hdfs_path, [('pythonfile.txt',open('pythonfile.txt').read())])

main()

Output
When I run the code above and then list /python, all I get is this:

iMac-van-Brian:desktop Brian$ $HADOOP_HOME/bin/hadoop dfs -ls /python

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/10/28 11:30:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 Brian supergroup        236 2014-10-28 11:30 /python

Solution

This is a very typical task for the subprocess module. (The listing above is actually expected behavior: hadoopy.writetb writes its key/value pairs as a single SequenceFile at exactly the path you give it, so /python ends up as a file rather than a directory of uploaded files.) The solution is as follows:

from subprocess import PIPE, Popen

put = Popen(["hadoop", "fs", "-put", <path/to/file>, <path/to/hdfs/file>], stdin=PIPE, bufsize=-1)
put.communicate()
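
For reference, a minimal runnable version of that call might look like the sketch below. The local and HDFS paths here are hypothetical placeholders, and the exit-code check is an addition worth making, since Popen will not raise an error on its own if the put fails:

from subprocess import PIPE, Popen

# hypothetical paths - replace with your own local file and HDFS destination
local_path = "pythonfile.txt"
hdfs_path = "/user/brian/pythonfile.txt"

# "hadoop fs -put" copies the local file into HDFS
put = Popen(["hadoop", "fs", "-put", local_path, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()

# returncode is populated once communicate() returns; non-zero means the put failed
if put.returncode != 0:
    raise RuntimeError("hadoop fs -put failed with exit code %s" % put.returncode)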

Complete example

Suppose you have an authenticated connection to HDFS on the server (for example, you have already run kinit with your keytab).

Suppose you have just saved a CSV from a pandas DataFrame and want to put it into HDFS.

Then you can upload the file to HDFS as follows:

import os
import pandas as pd
from subprocess import PIPE, Popen

# define the name of the local file to save
file_name = "saved_file.csv"

# create a pandas DataFrame
sales = {'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'], 'Jan': [150, 200, 50]}
df = pd.DataFrame.from_dict(sales)

# save your pandas DataFrame to csv (this could be anything, not necessarily a pandas DataFrame)
df.to_csv(file_name)

# build the destination path under your username on hdfs
hdfs_path = os.path.join(os.sep, 'user', '<your-user-name>', file_name)

# put the csv into hdfs
put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()

The csv file will then exist at /user/<your-user-name>/saved_file.csv.
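
If you want to confirm the upload from the same script, one option (a sketch, assuming the hadoop client is on your PATH as above, and reusing hdfs_path from the script) is to shell out to hadoop fs -ls in the same way:

from subprocess import Popen

# list the destination to confirm the file landed where we expect
ls = Popen(["hadoop", "fs", "-ls", hdfs_path])
ls.communicate()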

Note – If you create this file from a Python script invoked by Hadoop, the intermediate CSV file may be stored on some random node. Since this file is (probably) no longer needed, it is best to delete it so as not to pollute the node every time the script is invoked. You can simply add os.remove(file_name) as the last line of the script above, as sketched below.
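
For example, a minimal sketch of that cleanup, reusing file_name and put from the script above and deleting the local copy only if the upload succeeded:

# returncode is set after put.communicate() returns;
# delete the local copy only if "hadoop fs -put" reported success
if put.returncode == 0:
    os.remove(file_name)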
