Why is fetching PostgreSQL tables in Python much slower than in R?
I have to do some statistical processing on the data stored in PostgreSQL tables. I’ve been hesitating between using R and Python.
For R, I use the following code:
require("RPostgreSQL")
(...) #connection to the database, etc
my_table <- dbGetQuery(con, "SELECT * FROM some_table;" )
Very fast: it only takes 5 seconds to fetch a table with about 200,000 rows and 15 columns, with almost no NULLs.
For Python, I use the following code:
import psycopg2
conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_table;" )
my_table = cursor.fetchall()
Surprisingly, it caused my Python session to get stuck and my computer to crash.
When I used these libraries as a “black box”, I didn’t understand why something so fast in R was so slow in Python (and therefore almost impossible to use in practice).
Can anyone explain this performance difference and can anyone tell me if there is a more efficient way to get pgSQL tables in Python?
Solution
I’m not an expert on R, but it’s clear that dbGetQuery() (actually: dbFetch()) returns a lazy object that doesn’t load all the results into memory – otherwise it would of course take a long time and eat all your RAM.
With Python/psycopg2, you definitely don’t want to fetchall() a huge data set. The correct solution here is to use a server-side cursor and iterate over it.
Edit – Answer the question in your comment:
so the option cursor_factory=psycopg2.extras.DictCursor when executing fetchall() does the trick, right?
Not at all. As written in the example I linked to, what helps is using server-side cursors, which in psycopg are created by naming the cursor:
HERE IS THE IMPORTANT PART, by specifying a name for the cursor
psycopg2 creates a server-side cursor, which prevents all of the
records from being downloaded at once from the server.
cursor = conn.cursor('cursor_unique_name')
The DictCursor stuff is actually irrelevant (shouldn’t be mentioned in this example, as it obviously confuses newbies).
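Putting the pieces together, a minimal sketch of streaming a large table with a named (server-side) cursor might look like this – the function name, the cursor name "stream_cursor", and the batch size are illustrative placeholders, not part of the psycopg2 API:

```python
def stream_table(conn_string, query, batch_size=2000):
    """Lazily yield rows via a psycopg2 server-side (named) cursor."""
    import psycopg2  # imported here so the sketch can be defined without a live DB

    conn = psycopg2.connect(conn_string)
    try:
        # Passing a name makes psycopg2 create a server-side cursor:
        # rows are fetched in batches instead of being downloaded all at once.
        cur = conn.cursor(name="stream_cursor")
        cur.itersize = batch_size  # rows transferred per network round trip
        cur.execute(query)
        for row in cur:
            yield row
    finally:
        conn.close()
```

Iterating over `stream_table(conn_string, "SELECT * FROM some_table;")` then keeps only one batch of rows in memory at a time, instead of the whole result set.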
I have a side question regarding the concept of lazy object (the one returned in R). How is it possible to return the object as a data-frame without storing it in my RAM? I find it a bit magical.
As I mentioned, I’m not an expert on R and its implementation – I deduce from the behavior you described that whatever dbFetch returns is a lazy object – but there’s nothing magical about an object that gets its values from an external source on demand. Python’s file object is a well-known example:
with open("/some/huge/file.txt") as f:
    for line in f:
        print(line)
In the code snippet above, the file object f reads data from disk only when needed. All that needs to be kept in memory is the file pointer position (and the last N bytes of buffered data read from disk, but that’s an implementation detail).

To learn more, read about Python’s iterable and iterator protocols.
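As an illustrative sketch (not R’s actual mechanism, just the same lazy principle in Python), a generator produces values only when they are requested, so no large structure ever sits in RAM:

```python
def count_up(limit):
    # A generator function: nothing is computed until a value is requested.
    n = 0
    while n < limit:
        yield n
        n += 1

gen = count_up(10**12)  # returns instantly; no trillion-element list in memory
first_three = [next(gen) for _ in range(3)]
print(first_three)  # [0, 1, 2]
```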