Hive data to a Pandas data frame
I'm new to Python.
How can I load data from Hive into a Pandas data frame?
import pyhs2
import pandas as pd

with pyhs2.connect(host=host, port=20000, authMechanism="PLAIN", user=user,
                   password=password, database=database) as conn:
    with conn.cursor() as cur:
        # Show databases
        print cur.getDatabases()
        # Execute query
        cur.execute(query)
        # Return column info from query
        print cur.getSchema()
        # Fetch table results
        for i in cur.fetch():
            print i
        # Attempt to build the data frame using the column names
        columnNames = [a['columnName'] for a in cur.getSchema()]
        print columnNames
        df1 = pd.DataFrame(cur.fetch(), columnNames)
I tried passing the column names to the DataFrame constructor (the last three lines), but that did not work.
Any suggestions would be appreciated.
Solution
pd.read_sql() (pandas 0.24.0) takes a database connection as an argument. With PyHive, a Hive connection can be passed directly to pandas.read_sql(),
as follows:
from pyhive import hive
import pandas as pd

# open connection
conn = hive.Connection(host=host, port=20000, ...)

# query the table into a new dataframe
dataframe = pd.read_sql("SELECT id, name FROM test.example_table", conn)
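For comparison, the cursor-based attempt in the question appears to fail for two reasons: the second positional argument of pd.DataFrame is the index, not the column names, and the rows were likely already consumed by the earlier loop over cur.fetch(). The following is a minimal sketch of the fix, not tested against a real Hive server. The rows and schema values are made up for illustration and only mimic the shape of what cur.fetch() and cur.getSchema() return in the question.

```python
import pandas as pd

# Hypothetical stand-ins for cur.getSchema() and cur.fetch();
# the values are invented so the sketch runs without a Hive server.
schema = [{"columnName": "id"}, {"columnName": "name"}]
rows = [[1, "alice"], [2, "bob"]]  # fetch once and keep the result in a variable

column_names = [col["columnName"] for col in schema]

# Pass the names with the columns= keyword; the second positional
# argument of pd.DataFrame is the index, not the column names.
df1 = pd.DataFrame(rows, columns=column_names)
print(df1)
```

With a real cursor, the same idea would be to call the fetch method once, store the rows, and reuse that variable for both printing and building the data frame.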
With pd.read_sql(), the columns of the DataFrame are named after the columns of the Hive table. If needed, you can change them during or after data frame creation:
- During creation, via HiveQL:
SELECT id AS new_column_name....
- After creation, through the columns attribute of the DataFrame returned by
pd.read_sql().
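The two renaming options can be sketched as follows. An in-memory SQLite database is used here purely as a stand-in for the Hive connection so that the example runs without a cluster. The table contents and the new column names (user_id, user_name) are assumptions for illustration, and behavior with a PyHive connection may differ in details.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the Hive connection (assumption for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO example_table VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)

# Option 1: rename during creation, using AS in the query
df_sql = pd.read_sql(
    "SELECT id AS user_id, name AS user_name FROM example_table", conn
)

# Option 2: rename after creation, on the returned DataFrame
df = pd.read_sql("SELECT id, name FROM example_table", conn)
df_renamed = df.rename(columns={"id": "user_id", "name": "user_name"})

conn.close()
print(df_sql.columns.tolist())
print(df_renamed.columns.tolist())
```

Renaming in the query keeps the logic in one place, while renaming on the DataFrame avoids touching the SQL when the same query is shared elsewhere.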