Python – K-Means: Assign clusters to new data points

I have implemented the k-means clustering algorithm in Python, and now I want to label new data with the clusters obtained by my algorithm. My approach is to iterate over each data point and each centroid to find the minimum distance and the centroid associated with it, but I wonder whether there is an easier or shorter way to do this.

import numpy as np

def assign_cluster(clusterDict, data):
    label = []
    cen = list(clusterDict.values())
    keys = list(clusterDict.keys())

    for i in range(len(data)):
        # cen[j] with the minimum distance to data[i] determines the label
        dists = [np.sum(np.square(np.asarray(data[i]) - np.asarray(c))) for c in cen]
        label.append(keys[int(np.argmin(dists))])
    return label

where clusterDict is a dictionary whose keys are the cluster labels (0, 1, 2, …) and whose values are the centroid coordinates.
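For concreteness, a hypothetical clusterDict in this format (the values here are made up for illustration) might look like:

```python
import numpy as np

# Hypothetical example: keys are cluster labels, values are centroid coordinates.
clusterDict = {
    0: np.array([0.0, 0.0]),
    1: np.array([5.0, 5.0]),
    2: np.array([0.0, 5.0]),
}
```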

Can someone help me achieve this?

Solution

This is a good use case for numba, because it lets you express this as a simple double loop without a big performance penalty, which in turn lets you avoid the extra memory cost of np.tile: the vectorized version copies the data into 3D arrays before computing distances.

Borrowing the standard vectorized NumPy implementation from the other answers, here are the two versions:

import numba
import numpy as np

def kmeans_assignment(centroids, points):
    num_centroids, dim = centroids.shape
    num_points, _ = points.shape

    # Tile and reshape both arrays into [num_points, num_centroids, dim].
    centroids = np.tile(centroids, [num_points, 1]).reshape([num_points, num_centroids, dim])
    points = np.tile(points, [1, num_centroids]).reshape([num_points, num_centroids, dim])

    # Compute all distances (for all points and all centroids) at once and
    # select the min centroid for each point.
    distances = np.sum(np.square(centroids - points), axis=2)
    return np.argmin(distances, axis=1)

@numba.jit
def kmeans_assignment2(centroids, points):
    P, C = points.shape[0], centroids.shape[0]
    distances = np.zeros((P, C), dtype=np.float32)
    for p in range(P):
        for c in range(C):
            distances[p, c] = np.sum(np.square(centroids[c] - points[p]))
    return np.argmin(distances, axis=1)

Then, for some sample data, I ran some timing experiments:

In [12]: points = np.random.rand(10000, 50)

In [13]: centroids = np.random.rand(30, 50)

In [14]: %timeit kmeans_assignment(centroids, points)
196 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit kmeans_assignment2(centroids, points)
127 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I wouldn’t say the numba version is definitely faster than the np.tile version, but it’s clearly competitive, and it doesn’t incur the extra memory cost of np.tile.

In fact, I noticed on my laptop that when I make the shapes larger and use (10000, 1000) as the shape of points and (200, 1000) as the shape of centroids, np.tile generates a MemoryError, while the numba function runs within about 5 seconds with no memory errors.
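The MemoryError is easy to explain with a back-of-the-envelope calculation: each tiled float64 array in the np.tile version has shape [num_points, num_centroids, dim], which at the larger problem size comes to:

```python
# Approximate memory for ONE tiled float64 array of shape
# [num_points, num_centroids, dim] at the larger problem size.
num_points, num_centroids, dim = 10000, 200, 1000
size_bytes = num_points * num_centroids * dim * 8  # 8 bytes per float64
print(size_bytes / 1e9)  # prints 16.0, i.e. ~16 GB per tiled array
```

and the tiled function materializes two such arrays (plus their difference), so the allocation easily exceeds typical laptop RAM.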

Also, I actually noticed a slowdown when applying numba.jit to the first version (the one using np.tile), probably due to the extra arrays created inside the jitted function, coupled with the fact that there isn’t much left for numba to optimize once everything is already expressed as vectorized NumPy calls.

I also didn’t notice any significant improvement when using broadcasting to shorten the code in the second version. For example, shortening the double loop to

for p in range(P):
    distances[p, :] = np.sum(np.square(centroids - points[p, :]), axis=1)

doesn’t really help (and it uses more memory, since centroids is repeatedly broadcast against points[p, :] for every point).
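For completeness, a fully broadcast pure-NumPy variant (a sketch, not from the original answer) avoids np.tile syntactically, although NumPy still materializes the full [num_points, num_centroids, dim] difference array, so its memory behavior is essentially the same as the tiled version:

```python
import numpy as np

def kmeans_assignment_broadcast(centroids, points):
    # points: [num_points, dim], centroids: [num_centroids, dim].
    # Broadcasting builds the [num_points, num_centroids, dim] differences
    # without an explicit np.tile.
    diffs = points[:, None, :] - centroids[None, :, :]
    distances = np.sum(np.square(diffs), axis=2)
    return np.argmin(distances, axis=1)

points = np.array([[0.0, 0.0], [5.0, 5.0]])
centroids = np.array([[0.1, 0.1], [4.9, 4.9]])
print(kmeans_assignment_broadcast(centroids, points))  # prints [0 1]
```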

This is one of the real benefits of numba: you can write the algorithm in a straightforward, loop-based way that matches the standard description of the algorithm, while keeping much finer control over how the computation translates into memory consumption and broadcasting, all without sacrificing runtime performance.
