Python: the Pandas .min() method doesn’t seem to be the fastest

The Pandas .min() method doesn’t seem to be the fastest; here is an explanation of the behavior and a solution to the problem.


I’m trying to get the min, max, mean, etc. (some kind of aggregate value) of a Pandas DataFrame column, and the Pandas methods don’t seem to be the fastest way to do it. It seems that if I first call .values on the column, the run time of these operations improves considerably. Is this expected behavior? That is, is Pandas doing something wasteful here, or doing it intentionally? Maybe .values avoids the overhead by using extra memory, or by making assumptions and/or taking shortcuts that aren’t guaranteed in general.

“Evidence” of unexpected behavior:

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 1000, size=(100000000, 4)), columns=list('ABCD'))

start = time.time()
print(df['A'].min())
print(time.time()-start)

# 0
# 1.35876178741

start = time.time()
print(df['A'].values.min())
print(time.time()-start)

# 0
# 0.225932121277

start = time.time()
print(np.mean(df['A']))
print(time.time()-start)

# 499.49969672
# 1.58990907669

start = time.time()
print(df['A'].values.mean())
print(time.time()-start)

# 499.49969672
# 0.244406938553

Solution

When you select a single column, you get back a pandas Series, which is built on top of a NumPy array but carries more machinery (an index, dtype metadata, and so on). Pandas objects are optimized for spreadsheet- or database-style operations such as joins, lookups, and so on.

When you call .values on the column, you get the underlying NumPy ndarray, a type optimized for math and vector operations implemented in C. Even accounting for the cost of pulling out the ndarray, mathematical operations on it easily beat the same operations on the Series type.
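A minimal sketch of that distinction, using a tiny DataFrame built the same way as above (the small size and the variable name small are only for illustration):

import numpy as np
import pandas as pd

small = pd.DataFrame(np.random.randint(0, 1000, size=(5, 4)), columns=list('ABCD'))

col = small['A']         # selecting one column returns a pandas Series
print(type(col))         # <class 'pandas.core.series.Series'>
print(type(col.index))   # the Series also carries an index object
print(type(col.values))  # <class 'numpy.ndarray'>, the underlying array

arr = col.to_numpy()     # in newer pandas versions, to_numpy() is the recommended way to get the same array
print(arr.min(), arr.max(), arr.mean())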

As a side note, Python has a dedicated module, timeit, for exactly these kinds of timing comparisons, and IPython’s %timeit magic is a convenient wrapper around it. The measurements below use %timeit on a separate, smaller DataFrame with a column named 'a', so the absolute numbers differ from the example above, but the relative speed-up is similar:

type(df['a'])
# pandas.core.series.Series

%timeit df['a'].min()
# 6.68 ms ± 121 µs per loop

type(df['a'].values)
# numpy.ndarray

%timeit df['a'].values.min()
# 696 µs ± 18 µs per loop
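Outside of IPython, the standard-library timeit module can run the same comparison from a plain script. A minimal sketch, assuming the df defined in the question above already exists (the number and repeat values here are arbitrary):

import timeit

# Each entry returned by repeat() is the total time for `number` calls;
# take the best repetition and divide to get seconds per call.
pandas_time = min(timeit.repeat(lambda: df['A'].min(), number=10, repeat=3)) / 10
numpy_time = min(timeit.repeat(lambda: df['A'].values.min(), number=10, repeat=3)) / 10

print(f"df['A'].min():        {pandas_time:.4f} s per call")
print(f"df['A'].values.min(): {numpy_time:.4f} s per call")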
