Why is fetching PostgreSQL tables in Python much slower than in R?
I have to do some statistical processing on the data stored in PostgreSQL tables. I’ve been hesitating between using R and Python.
For R, I use the following code:
require("RPostgreSQL")
(...) #connection to the database, etc
my_table <- dbGetQuery(con, "SELECT * FROM some_table;" )
Very fast: it only takes 5 seconds to fetch a table with about 200,000 rows and 15 columns, with almost no NULLs.
For Python, I use the following code:
import psycopg2
conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_table;" )
my_table = cursor.fetchall()
Surprisingly, it caused my Python session to get stuck and my computer to crash.
When I used these libraries as a “black box”, I didn’t understand why something so fast in R was so slow in Python (and therefore almost impossible to use in practice).
Can anyone explain this performance difference and can anyone tell me if there is a more efficient way to get pgSQL tables in Python?
Solution
I’m not an expert on R, but it’s clear that dbGetQuery() (actually: dbFetch()) returns a lazy object that doesn’t load all the results into memory – otherwise it would of course take a long time and eat all your RAM.
With Python/psycopg2, you definitely don’t want to fetchall() a huge data set. The correct solution here is to use a server-side cursor and iterate over it.
Edit – Answer the question in your comment:
so the option cursor_factory=psycopg2.extras.DictCursor when executing fetchall() does the trick, right?
Not at all. As written in the example I linked to, what helps is using server-side cursors, which in psycopg are created by naming the cursor:
HERE IS THE IMPORTANT PART, by specifying a name for the cursor
psycopg2 creates a server-side cursor, which prevents all of the
records from being downloaded at once from the server.
cursor = conn.cursor('cursor_unique_name')
The DictCursor stuff is actually irrelevant (shouldn’t be mentioned in this example, as it obviously confuses newbies).
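Putting the pieces together, a minimal sketch of streaming a large table with a named (server-side) cursor might look like this – the function name, the cursor name "stream_cursor", and the batch size are illustrative placeholders, not part of the psycopg2 API:

```python
def stream_table(conn_string, query, batch_size=2000):
    """Lazily yield rows via a psycopg2 server-side (named) cursor."""
    import psycopg2  # imported here so the sketch can be defined without a live DB

    conn = psycopg2.connect(conn_string)
    try:
        # Passing a name makes psycopg2 create a server-side cursor:
        # rows are fetched in batches instead of being downloaded all at once.
        cur = conn.cursor(name="stream_cursor")
        cur.itersize = batch_size  # rows transferred per network round trip
        cur.execute(query)
        for row in cur:
            yield row
    finally:
        conn.close()
```

Iterating over `stream_table(conn_string, "SELECT * FROM some_table;")` then keeps only one batch of rows in memory at a time, instead of the whole result set.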
I have a side question regarding the concept of lazy object (the one returned in R). How is it possible to return the object as a data-frame without storing it in my RAM? I find it a bit magical.
As I mentioned, I’m not an expert on R and its implementation – I deduce from the behavior you described that whatever dbFetch returns is a lazy object – but there’s nothing magical about an object that gets its values from an external source on demand. Python’s file object is a well-known example:
with open("/some/huge/file.txt") as f:
    for line in f:
        print(line)
In the code snippet above, the file object f reads data from disk only when needed. All that needs to be kept in memory is the file pointer position (and the last N bytes of buffered data read from disk, but that’s an implementation detail).

To learn more, read about Python’s iterable and iterator protocols.
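As an illustrative sketch (not R’s actual mechanism, just the same lazy principle in Python), a generator produces values only when they are requested, so no large structure ever sits in RAM:

```python
def count_up(limit):
    # A generator function: nothing is computed until a value is requested.
    n = 0
    while n < limit:
        yield n
        n += 1

gen = count_up(10**12)  # returns instantly; no trillion-element list in memory
first_three = [next(gen) for _ in range(3)]
print(first_three)  # [0, 1, 2]
```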