Python – Hive data to a Pandas DataFrame


I'm new to Python.

How can I load data from Hive into a Pandas DataFrame?

import pandas as pd
import pyhs2

with pyhs2.connect(host=host, port=20000, authMechanism="PLAIN",
                   user=user, password=password,
                   database=database) as conn:
    with conn.cursor() as cur:
        # Show databases
        print(cur.getDatabases())

        # Execute query
        cur.execute(query)

        # Return column info from query
        print(cur.getSchema())

        # Fetch table results
        for i in cur.fetch():
            print(i)

        # Attempt to build a DataFrame using the column names
        columnNames = [a['columnName'] for a in cur.getSchema()]
        print(columnNames)
        df1 = pd.DataFrame(cur.fetch(), columnNames)

I tried passing the column names to the DataFrame, but it did not work.

Any suggestions would be appreciated.

Solution

pd.read_sql() (pandas 0.24.0) accepts a database connection. A PyHive connection can be passed directly to pandas.read_sql(), as follows:

from pyhive import hive
import pandas as pd

# open connection
conn = hive.Connection(host=host, port=20000, ...)

# query the table to a new dataframe
dataframe = pd.read_sql("SELECT id, name FROM test.example_table", conn)
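If pd.read_sql() is not an option, the manual approach from the question can probably be made to work too. Two things look wrong in that attempt: the second positional argument of pd.DataFrame() is index, not columns, and the rows had already been consumed by the printing loop before the second fetch() call (the latter is an assumption about how the pyhs2 cursor behaves). A minimal sketch, with hard-coded stand-in values in place of cur.fetch() and cur.getSchema(), and a hypothetical helper name rows_to_dataframe:

```python
import pandas as pd


def rows_to_dataframe(rows, schema):
    """Build a DataFrame from fetched rows and a pyhs2-style schema.

    `schema` is assumed to be a list of dicts that each carry a
    'columnName' key, as in the question's cur.getSchema() usage.
    """
    column_names = [col["columnName"] for col in schema]
    # Pass the names by keyword: the second positional argument is `index`.
    return pd.DataFrame(rows, columns=column_names)


# Stand-in values. With pyhs2 these would come from a single
# rows = cur.fetch() call and from cur.getSchema().
rows = [[1, "alice"], [2, "bob"]]
schema = [{"columnName": "id"}, {"columnName": "name"}]

df1 = rows_to_dataframe(rows, schema)
print(df1)
```

The key point is to fetch the rows once, keep them in a variable, and pass the names as columns=.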

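The DataFrame's columns take their names from the columns returned by the query. If needed, you can rename them during or after DataFrame creation:

  • In HiveQL, with an alias: SELECT id AS new_column_name ...
  • In Pandas, once pd.read_sql() has returned, by assigning to the DataFrame's columns attribute or calling rename(columns=...). (The columns parameter of pd.read_sql() only selects columns when reading a whole table; it does not rename them.)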
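Both renaming options can be tried without a Hive server. The sketch below uses an in-memory sqlite3 database as a stand-in for the Hive connection (an assumption made only so the example runs anywhere). The table contents and the name user_id are made up for illustration.

```python
import sqlite3

import pandas as pd

# sqlite3 stands in for the Hive connection so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO example_table VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)
conn.commit()

# Option 1: rename in the query itself with an alias.
aliased = pd.read_sql("SELECT id AS user_id, name FROM example_table", conn)

# Option 2: rename after the DataFrame has been created.
plain = pd.read_sql("SELECT id, name FROM example_table", conn)
renamed = plain.rename(columns={"id": "user_id"})

print(list(aliased.columns))
print(list(renamed.columns))

conn.close()
```

With a real PyHive connection the two pd.read_sql() calls should look the same, with the HiveQL query in place of the SQLite one.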
