GAE text search with quadratic sorting
Instead of full-text search on GAE, I use the solution below to return a sorted result set, first by keyword relevance and then by date (although the second sort can actually be anything). It feels a bit clunky, and I’m worried about performance at scale, so I’m looking for optimization suggestions or a completely different approach.
Quadratic sorting is important for my use case because a given search may have multiple results of the same relevance (measured by the number of keyword matches), but keeping the original query order now adds a lot of complexity. Any ideas?
Step 1: Get a list of keys that match each search term
results_key_list = []
search_terms = ['a','b','c'] #User's search query, split into a list of strings
#query each search term and add the results to a list
#yields a list of keys with frequency of occurance indicating relevance
for item in search_terms:
subquery = SomeEntity.all(keys_only=True)
subquery.filter('SearchIndex = ', item) #SearchIndex is a StringListProperty
#more filters...
subquery.order('-DateCreated')
for returned_item in subquery:
results_key_list.append(str(returned_item))
Step 2: Group the list by frequency while maintaining the original order
#return a dictionary of keys, with their frequency of occurrence
grouped_results = defaultdict(int)
for key in results_key_list:
grouped_results[key] += 1
sorted_results = []
known = set()
#creates an empty list for each match frequency
for i in range(len(search_terms)):
sorted_results.append([])
#using the original results ordering,
#construct an array of results grouped and ordered by descending frequency
for key in results_key_list:
if key in known: continue
frequency = grouped_results[key]
sorted_results[len(search_terms) - frequency].append(key)
known.add(key)
#combine into a single list
ordered_key_list = []
for l in sorted_results:
ordered_key_list.extend(l)
del ordered_key_list[:offset]
del ordered_key_list[limit:]
result = SomeEntity.get(ordered_key_list)
Solution
search_terms = ['a','b','c'] #User's search query, split into a list of strings
Keys can be accumulated in the order in which they appear
And all key frequencies can be established at once.
With sorting stability, reduce the frequency first, and then sort in the order in which they appear:
keys_in_order_of_appearance = []
key_frequency = defaultdict(int)
for item in search_terms:
subquery = SomeEntity.all(keys_only=True)
subquery.filter('SearchIndex = ', item) #SearchIndex is a StringListProperty
#more filters...
subquery.order('-DateCreated')
for returned_item in subquery:
key = str(returned_item)
if key not in key_frequency:
key_order_of_appearance.append(key)
key_frequency[key] += 1
keys = keys_in_order_of_appearance[:] # order of appearance kept as secondary sort
keys.sort(key=key_frequency.__getitem__, reverse=True) # descending freq as primary sort
result = SomeEntity.get(ordered_key_list)