Python – What is the most efficient way to calculate the square Euclidean distance between N samples and clustered centroids?

I’m looking for an efficient way (without a for loop) to calculate the squared Euclidean distance between a set of samples and a set of clustered centroids.

Example:

import numpy as np
X = np.array([[1,2,3],[1, 1, 1],[0, 2, 0]])
y = np.array([[1,2,3], [0, 1, 0]])

Expected output:

array([[ 0., 11.],
       [ 5.,  2.],
       [10.,  1.]])

This is the squared Euclidean distance between each sample in X and each centroid in y.
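As a sanity check on the expected output: the distance between X[1] = [1, 1, 1] and y[0] = [1, 2, 3] is 0 + 1 + 4 = 5, matching row 1, column 0. A brute-force broadcast reproduces the whole matrix:

```python
import numpy as np

X = np.array([[1, 2, 3], [1, 1, 1], [0, 2, 0]])
y = np.array([[1, 2, 3], [0, 1, 0]])

# Expand X to (3, 1, 3) against y's (2, 3), square the differences,
# and sum over the coordinate axis to get a (3, 2) distance matrix.
dist = ((X[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
print(dist)
```

This allocates an intermediate (N, M, D) array, so it only makes sense for small inputs, but it is a convenient reference implementation.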

I came up with two solutions:

Solution 1:

def dist_2(X, y):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y
    X_square_sum = np.sum(np.square(X), axis=1)
    y_square_sum = np.sum(np.square(y), axis=1)
    dot_xy = np.dot(X, y.T)
    X_square_sum_tile = np.tile(X_square_sum.reshape(-1, 1), (1, y.shape[0]))
    y_square_sum_tile = np.tile(y_square_sum.reshape(1, -1), (X.shape[0], 1))
    dist = X_square_sum_tile + y_square_sum_tile - (2 * dot_xy)
    return dist

dist = dist_2(X, y)
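The two `np.tile` calls in Solution 1 each materialise a full (N, M) array. As a sketch, plain broadcasting produces the same result without those temporaries (`dist_2_broadcast` is a name introduced here for illustration):

```python
import numpy as np

def dist_2_broadcast(X, y):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y, with broadcasting:
    # (N, 1) + (1, M) broadcasts to (N, M) without np.tile
    X_square_sum = np.sum(np.square(X), axis=1)[:, None]  # shape (N, 1)
    y_square_sum = np.sum(np.square(y), axis=1)[None, :]  # shape (1, M)
    return X_square_sum + y_square_sum - 2 * np.dot(X, y.T)

X = np.array([[1, 2, 3], [1, 1, 1], [0, 2, 0]])
y = np.array([[1, 2, 3], [0, 1, 0]])
print(dist_2_broadcast(X, y))
```

Only the (N, M) result and the dot product get allocated, so peak memory drops compared with the tiled version while the arithmetic stays identical.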

Solution 2:

import scipy.spatial.distance
dist = scipy.spatial.distance.cdist(X, y) ** 2
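Note that `cdist` also accepts `metric='sqeuclidean'`, which returns the squared distances directly and avoids both the square root inside `cdist` and the extra `**2` pass:

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.random.random((100, 5))
y = np.random.random((10, 5))

dist = cdist(X, y, metric='sqeuclidean')  # squared Euclidean, no sqrt taken
```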

Performance of both solutions (wall-clock time):

import time
X = np.random.random((100000, 50))
y = np.random.random((100, 50))

start = time.time()
dist = scipy.spatial.distance.cdist(X,y)**2
end = time.time()
print (end - start)

Average elapsed wall clock time = 0.7 seconds

start = time.time()
dist = dist_2(X,y)
end = time.time()
print (end - start)

Average elapsed wall clock time = 0.3 seconds

Testing with a larger number of centroids:

X = np.random.random((100000, 50))
y = np.random.random((1000, 50))

Average elapsed wall clock time for Solution 1 = 50 seconds (and it runs into memory problems).

Average elapsed wall clock time for Solution 2 = 6 seconds!

Conclusion

In terms of average elapsed wall clock time, Solution 1 is faster than Solution 2 on small datasets, but it is memory-inefficient because of the tiled temporary arrays, and it falls behind once the number of centroids grows.

Any suggestions?

Solution

This kind of computation often comes up in nearest-neighbor search. If that is your use case, look into a k-d tree approach: it avoids building the full distance matrix and is more efficient than computing all pairwise Euclidean distances, both in memory consumption and in performance.
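A minimal sketch of that approach with `scipy.spatial.cKDTree`, assuming only the nearest centroid per sample is needed (note that k-d trees lose their edge as dimensionality grows, so at 50 dimensions it is worth benchmarking against the brute-force variants):

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.random((1000, 50))   # samples
y = np.random.random((100, 50))    # centroids

tree = cKDTree(y)             # build the tree once over the centroids
d, idx = tree.query(X, k=1)   # nearest centroid for every sample
sq_dist = d ** 2              # query() returns Euclidean distances
```

`idx[i]` is the index of the closest centroid for sample `i`, which is exactly what the assignment step of k-means needs, without ever storing the (N, M) matrix.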

If this is not the case, here are some possible methods. The first two come from the Stack Overflow answer linked below (credited to Paul Panzer and Divakar in the comments). The main difference in the third, numba-compiled version is that it avoids large temporary arrays.

Three methods for calculating the squared Euclidean distance

import numpy as np
import numba as nb

# @Paul Panzer
#https://stackoverflow.com/a/42994680/4045774
def outer_sum_dot_app(A,B):
    return np.add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*np.dot(A,B.T)

# @Divakar
#https://stackoverflow.com/a/42994680/4045774
def outer_einsum_dot_app(A,B):
    return np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - 2*np.dot(A,B.T)

@nb.njit()
def calc_dist(A, B, sqrt=False):
    dist = np.dot(A, B.T)

    # Row-wise squared norms of A
    TMP_A = np.empty(A.shape[0], dtype=A.dtype)
    for i in range(A.shape[0]):
        sum_ = 0.
        for j in range(A.shape[1]):
            sum_ += A[i, j] ** 2
        TMP_A[i] = sum_

    # Row-wise squared norms of B
    TMP_B = np.empty(B.shape[0], dtype=A.dtype)
    for i in range(B.shape[0]):
        sum_ = 0.
        for j in range(B.shape[1]):
            sum_ += B[i, j] ** 2
        TMP_B[i] = sum_

    # Combine in place: ||a||^2 + ||b||^2 - 2 a.b
    if sqrt:
        for i in range(A.shape[0]):
            for j in range(B.shape[0]):
                dist[i, j] = np.sqrt(-2. * dist[i, j] + TMP_A[i] + TMP_B[j])
    else:
        for i in range(A.shape[0]):
            for j in range(B.shape[0]):
                dist[i, j] = -2. * dist[i, j] + TMP_A[i] + TMP_B[j]
    return dist
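The two pure-NumPy variants can be cross-checked against a brute-force broadcast reference (the numba version is left out of this sketch so that it runs without numba installed):

```python
import numpy as np

def outer_sum_dot_app(A, B):
    return np.add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*np.dot(A, B.T)

def outer_einsum_dot_app(A, B):
    return np.einsum('ij,ij->i', A, A)[:, None] + np.einsum('ij,ij->i', B, B) - 2*np.dot(A, B.T)

A = np.random.randn(200, 30)
B = np.random.randn(50, 30)

# Brute-force reference: explicit differences, squared and summed
ref = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
assert np.allclose(outer_sum_dot_app(A, B), ref)
assert np.allclose(outer_einsum_dot_app(A, B), ref)
```

The dot-product formulation can go slightly negative for near-identical rows due to floating-point cancellation, which is why `allclose` rather than exact equality is the right check here.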

Time

A = np.random.randn(10000,3)
B = np.random.randn(10000,3)

#calc_dist:                      360ms first call excluded due to compilation overhead
#outer_einsum_dot_app (Divakar): 1150ms
#outer_sum_dot_app (Paul Panzer):1590ms
#dist_2:                         1840ms

A = np.random.randn(1000,100)
B = np.random.randn(1000,100)

#calc_dist:                      4.3  ms first call excluded due to compilation overhead
#outer_einsum_dot_app (Divakar): 13.12ms
#outer_sum_dot_app (Paul Panzer):13.2 ms
#dist_2:                         21.3ms
