The sorting algorithm used by pandas' sort_values when the kind parameter is not applied
In pandas' sort_values method, the kind parameter is applied only when sorting on a single column or label. Why is this? Which sorting algorithm is used in the cases where the kind parameter is not applied? Is it a stable sort?
(For the documentation, see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html )
Solution
This is the docstring of get_group_index_sorter(group_index, ngroups) in the pandas source:
algos.groupsort_indexer implements `counting sort` and it is at least
O(ngroups), where
    ngroups = prod(shape)
    shape = map(len, keys)
that is, linear in the number of combinations (cartesian product) of unique
values of groupby keys. This can be huge when doing multi-key groupby.
np.argsort(kind='mergesort') is O(count x log(count)) where count is the
length of the data-frame;
Both algorithms are `stable` sort and that is necessary for correctness of
groupby operations. e.g. consider:
    df.groupby(key)[col].transform('first')
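To make the trade-off in the docstring concrete, here is a minimal pure-Python sketch of the two strategies. It is only an illustration, not the pandas implementation: the helper names (group_index, counting_sort_indexer, comparison_sort_indexer) are made up for this example, and Python's built-in sorted stands in for np.argsort(kind='mergesort') because both are stable.

```python
from math import prod


def group_index(keys):
    """Map each row's tuple of key values to one integer in [0, ngroups)."""
    levels = [sorted(set(col)) for col in keys]   # unique values of each key
    shape = [len(lv) for lv in levels]            # number of uniques per key
    ngroups = prod(shape)                         # size of the cartesian product
    index = [0] * len(keys[0])
    for lv, size, col in zip(levels, shape, keys):
        code = {value: i for i, value in enumerate(lv)}
        # First key is the most significant "digit" of the combined index.
        index = [g * size + code[v] for g, v in zip(index, col)]
    return index, ngroups


def counting_sort_indexer(index, ngroups):
    """Stable counting sort: O(len(index) + ngroups) time and memory."""
    starts = [0] * (ngroups + 1)
    for g in index:
        starts[g + 1] += 1
    for g in range(ngroups):
        starts[g + 1] += starts[g]    # starts[g] is now the first slot of group g
    out = [0] * len(index)
    for row, g in enumerate(index):   # scanning rows in order keeps ties stable
        out[starts[g]] = row
        starts[g] += 1
    return out


def comparison_sort_indexer(index):
    """Stable comparison sort: O(count * log(count)), independent of ngroups."""
    return sorted(range(len(index)), key=index.__getitem__)


# Hypothetical two-key data with repeated (a, b) combinations.
a = [2, 1, 2, 1, 2]
b = ["x", "y", "x", "y", "x"]
index, ngroups = group_index([a, b])
by_counting = counting_sort_indexer(index, ngroups)
by_comparison = comparison_sort_indexer(index)
print(ngroups, by_counting, by_comparison)
```

Both indexers return the same row order, with tied rows kept in their original sequence. The counting sort allocates an array proportional to ngroups, which is why it becomes unattractive when the product of unique values across many keys is much larger than the number of rows.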
P.S. Here is the call chain:
pandas.core.frame.DataFrame.sort_values() -> \
pandas.core.sorting.lexsort_indexer() -> \
pandas.core.sorting.indexer_from_factorized() -> \
pandas.core.sorting.get_group_index_sorter()
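The stability claim can also be checked empirically. The sketch below assumes pandas is installed; the column names and data are invented for illustration. It sorts on two columns whose values contain many ties and looks at the order of a third column that records the original row position. A small example like this cannot prove stability in general, but it shows the behavior the docstring describes.

```python
import pandas as pd

# Hypothetical data: "tag" records each row's original position.
df = pd.DataFrame({
    "a": [2, 1, 2, 1, 2],
    "b": [0, 0, 0, 0, 0],
    "tag": ["r0", "r1", "r2", "r3", "r4"],
})

# Sorting by more than one column takes the multi-key path,
# where the kind parameter is not applied.
out = df.sort_values(["a", "b"])
order = out["tag"].tolist()
print(order)
```

Rows that tie on both a and b come out in the same relative order they had in the input, which is what a stable sort guarantees.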