Python – GAE text search with quadratic sorting

GAE text search with quadratic sorting… here is a solution to the problem.

GAE text search with quadratic sorting

Instead of full-text search on GAE, I use the solution below to return a sorted result set, first by keyword relevance and then by date (although the second sort can actually be anything). It feels a bit clunky, and I’m worried about performance at scale, so I’m looking for optimization suggestions or a completely different approach.

Quadratic sorting is important for my use case because a given search may have multiple results of the same relevance (measured by the number of keyword matches), but keeping the original query order now adds a lot of complexity. Any ideas?

Step 1: Get a list of keys that match each search term

results_key_list = []
search_terms = ['a','b','c'] #User's search query, split into a list of strings

#query each search term and add the results to a list
#yields a list of keys with frequency of occurance indicating relevance     
for item in search_terms:
    subquery = SomeEntity.all(keys_only=True)                   
    subquery.filter('SearchIndex = ', item) #SearchIndex is a StringListProperty
    #more filters...            
    subquery.order('-DateCreated')                  
    for returned_item in subquery:
        results_key_list.append(str(returned_item))     

Step 2: Group the list by frequency while maintaining the original order

#return a dictionary of keys, with their frequency of occurrence            
grouped_results = defaultdict(int)              
for key in results_key_list:
    grouped_results[key] += 1               

sorted_results = []
known = set()

#creates an empty list for each match frequency 
for i in range(len(search_terms)):
    sorted_results.append([])

#using the original results ordering, 
#construct an array of results grouped and ordered by descending frequency  
for key in results_key_list:
    if key in known: continue
    frequency = grouped_results[key]
    sorted_results[len(search_terms) - frequency].append(key)
    known.add(key)          

#combine into a single list
ordered_key_list = []   
for l in sorted_results:
    ordered_key_list.extend(l)  

del ordered_key_list[:offset]
del ordered_key_list[limit:]    
result = SomeEntity.get(ordered_key_list)

Solution

search_terms = ['a','b','c'] #User's search query, split into a list of strings

Keys can be accumulated in the order in which they appear
And all key frequencies can be established at once.
With sorting stability, reduce the frequency first, and then sort in the order in which they appear:

keys_in_order_of_appearance = []
key_frequency = defaultdict(int)

for item in search_terms:
    subquery = SomeEntity.all(keys_only=True)                   
    subquery.filter('SearchIndex = ', item) #SearchIndex is a StringListProperty
    #more filters...            
    subquery.order('-DateCreated')                  
    for returned_item in subquery:
        key = str(returned_item)
        if key not in key_frequency:
            key_order_of_appearance.append(key)
        key_frequency[key] += 1

keys = keys_in_order_of_appearance[:]   # order of appearance kept as secondary sort
keys.sort(key=key_frequency.__getitem__, reverse=True) # descending freq as primary sort
result = SomeEntity.get(ordered_key_list)

Related Problems and Solutions