Python – Low memory and fastest query database for Python projects


I’m migrating a GAE/Java application to Python (non-GAE) due to the new pricing, so I got a small server and I wanted to find a database that met the following requirements:

  • Low memory usage (either adjustable or predictable).
  • Fast queries over simple document/tree data identified by a key (I don’t care about write performance; I assume it will have indexes).
  • Python bindings compatible with PyPy 1.6 (or at least Python 2.7).

My data is like this:

  • Id: A short key string.
  • Position
  • Creators: An array of another data structure, each with an id (used as a key), a name, a site address, and so on.
  • Labels: An array of tags. Each tag can have multiple parent tags, names, ids, etc.
  • License: A data structure describing the entry’s license (CC, GPL, etc.) along with its name, associated URL, and so on.
  • Add time: The time the entry was added to our website.
  • Translation: A pointer to another entry that is an authored translation of this one.
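
For concreteness, here is a rough sketch of one entry as a Python dict; the field names and example values are illustrative guesses from the description above, not a fixed schema:

    entry = {
        "id": "a3xk9",                  # short key string
        "position": 42,
        "creators": [
            {"id": "c17", "name": "Alice", "site": "http://example.com"},
        ],
        "labels": [
            {"id": "gpl", "name": "GPL", "parents": ["software", "license"]},
        ],
        "license": {"name": "CC-BY", "url": "http://creativecommons.org/licenses/by/3.0/"},
        "add_time": 1317427200,         # when it was added to the website
        "translation": "b7yq2",         # id of the entry that translates this one
    }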

My query is very simple. The usual situation is:

  • Filter by tag, sorted by add time.
  • Select a page of results (pagination), also sorted by add time.
  • (Maybe, not done yet) filter by creator.
  • (Not done but scheduled) some autocomplete functionality in a form, so I will need substring searches on some fields (“LIKE” queries).
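
Expressed naively in plain Python over an in-memory list of entries shaped like the sketch above, these queries would look something like the following (purely illustrative; a real database would serve them from indexes rather than linear scans):

    def filter_by_tag(entries, tag_id):
        """Filter by tag, newest first."""
        hits = [e for e in entries
                if any(label["id"] == tag_id for label in e["labels"])]
        return sorted(hits, key=lambda e: e["add_time"], reverse=True)

    def page(results, page_num, per_page=20):
        """Pagination over an already-sorted result list."""
        start = (page_num - 1) * per_page
        return results[start:start + per_page]

    def autocomplete(entries, field, substring):
        """The planned “LIKE”-style substring search."""
        return [e for e in entries if substring in e.get(field, "")]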

The amount of data is not large. Right now I have about 50MB of data, but I plan to have a huge dataset of about 10GB.

Also, I want to rebuild it from scratch, so I’m open to any option. What database do you think will meet my requirements?

EDIT: I want to benchmark the different options and share the results. So far, I’ve chosen MongoDB, PostgreSQL, MySQL, Drizzle, Riak, and Kyoto Cabinet.

Solution

The path of least resistance for migrating an App Engine application is probably AppScale, which implements the bulk of the App Engine API. In particular, you may want to use the HyperTable data store, which closely mirrors the Google App Engine datastore.

EDIT: Okay, so you’re going to redesign anyway. I’d like to revisit some of the points you made in the question.

Low memory usage

This is almost the opposite of what you want from a database: you want as much of the dataset as possible in core memory. That could mean trimming the dataset itself so that it fits efficiently, or adding memcached nodes so you can spread the dataset across multiple hosts, with each host holding a portion small enough to fit in core.
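
As a minimal sketch of that second approach, here is how entries might be spread across several memcached nodes with the python-memcached client (the host names and the load_from_db callback are placeholders); the client hashes each key to one node, so each host only holds a slice of the data:

    import memcache

    # The client shards keys across these nodes automatically.
    mc = memcache.Client(["cache1:11211", "cache2:11211", "cache3:11211"])

    def get_entry(entry_id, load_from_db):
        entry = mc.get("entry:" + entry_id)     # try RAM first
        if entry is None:
            entry = load_from_db(entry_id)      # fall back to the persistent store
            mc.set("entry:" + entry_id, entry)  # cache it for next time
        return entry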

To illustrate this, consider that reading a value from RAM is about 1,000 times faster than reading it from disk: a database that can satisfy every query from core can handle roughly 10 times the workload of one that has to go to disk for even 1% of its queries.
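
A quick back-of-the-envelope check of that factor of 10, using the numbers above:

    # Relative time units: a RAM hit costs 1, a disk read costs ~1000.
    ram_cost, disk_cost = 1, 1000

    all_in_core = 1.00 * ram_cost                        # every query from RAM
    mostly_in_core = 0.99 * ram_cost + 0.01 * disk_cost  # 1% of queries hit disk

    print(mostly_in_core / all_in_core)  # ~10.99, i.e. roughly 10x slower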

I’m planning to have a huge dataset around 10GB.

I don’t think you can call 10GB a “huge dataset”; in fact, it could fit entirely in the memory of a fairly large database server. You wouldn’t need more than one memcached node, let alone additional persistence nodes (typical disk sizes are measured in terabytes, roughly 100 times larger than this expected dataset).


Based on this, I would definitely recommend a mature, full-featured database product like PostgreSQL: it provides plenty of performance for the data you describe and easily covers all the features you mention. If the point ever comes when you need to go beyond what PostgreSQL offers, you will by then have a real workload to analyze, so you can work out where the actual bottleneck is.
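
To make this concrete, here is a hedged sketch of the two main queries against PostgreSQL using psycopg2; the table and column names are assumptions drawn from the data description above, not an actual schema:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()

    # Filter by tag, newest first, with pagination. An index covering
    # entry_labels(label_id) and entries(add_time) keeps this cheap.
    cur.execute(
        """
        SELECT e.id, e.add_time
        FROM entries e
        JOIN entry_labels el ON el.entry_id = e.id
        WHERE el.label_id = %s
        ORDER BY e.add_time DESC
        LIMIT %s OFFSET %s
        """,
        ("gpl", 20, 40),  # tag id, page size, offset for page 3
    )
    rows = cur.fetchall()

    # The planned autocomplete: a case-insensitive LIKE on a name field.
    cur.execute("SELECT id, name FROM creators WHERE name ILIKE %s", ("ali%",))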
