Python – Why numpy vectorization is slower than for loops

Why numpy vectorization is slower than for loops… here is a solution to the problem.

Why numpy vectorization is slower than for loops

The following code has two functions to do the same thing: check if the line between two points intersects the circle.


from line_profiler import LineProfiler
from math import sqrt
import numpy as np

class Point:
    x: float
    y: float

def __init__(self, x: float, y: float):
        self.x = x
        self.y = y

def __repr__(self):
        return f"Point(x={self.x}, y={self.y})"

class Circle:
    ctr: Point
    r: float

def __init__(self, ctr: Point, r: float):
        self.ctr = ctr
        self.r = r

def __repr__(self):
        return f"Circle(r={self.r}, ctr={self.ctr})"

def loop(p1: Point, p2: Point, circles: list[Circle]):
    m = (p1.y - p2.y) / (p1.x - p2.x)
    n = p1.y - m * p1.x

max_x = max(p1.x, p2.x)
    min_x = min(p1.x, p2.x)

for circle in circles:
        if sqrt((circle.ctr.x - p1.x) ** 2 + (circle.ctr.y - p1.y) ** 2) < circle.r \
                or sqrt((circle.ctr.x - p2.x) ** 2 + (circle.ctr.y - p2.y) ** 2) < circle.r:
            return False

a = m ** 2 + 1
        b = 2 * (m * n - m * circle.ctr.y - circle.ctr.x)
        c = circle.ctr.x ** 2 + circle.ctr.y ** 2 + n ** 2 - circle.r ** 2 - 2 * n * circle.ctr.y

# compute the intersection points
        discriminant = b ** 2 - 4 * a * c
        if discriminant <= 0:
            # no real roots, the line does not intersect the circle
            continue

# two real roots, the line intersects the circle at two points
        x1 = (-b + sqrt(discriminant)) / (2 * a)
        x2 = (-b - sqrt(discriminant)) / (2 * a)

# check if both points in range
        first = min_x <= x1 <= max_x
        second = min_x <= x2 <= max_x
        if first and second:
            return False

return True

def vectorized(p1: Point, p2: Point, circles):
    m = (p1.y - p2.y) / (p1.x - p2.x)
    n = p1.y - m * p1.x

max_x = max(p1.x, p2.x)
    min_x = min(p1.x, p2.x)

circle_ctr_x = circles['x']
    circle_ctr_y = circles['y']
    circle_radius = circles['r']

# Pt 1 inside circle
    if np.any(np.sqrt((circle_ctr_x - p1.x) ** 2 + (circle_ctr_y - p1.y) ** 2) < circle_radius):
        return False
    # Pt 2 inside circle
    if np.any(np.sqrt((circle_ctr_x - p2.x) ** 2 + (circle_ctr_y - p2.y) ** 2) < circle_radius):
        return False
    # Line intersects with circle in range
    a = m ** 2 + 1
    b = 2 * (m * n - m * circle_ctr_y - circle_ctr_x)
    c = circle_ctr_x ** 2 + circle_ctr_y ** 2 + n ** 2 - circle_radius ** 2 - 2 * n * circle_ctr_y

# compute the intersection points
    discriminant = b**2 - 4*a*c
    discriminant_bigger_than_zero = discriminant > 0
    discriminant = discriminant[discriminant_bigger_than_zero]

if discriminant.size == 0:
        return True

b = b[discriminant_bigger_than_zero]

# two real roots, the line intersects the circle at two points
    x1 = (-b + np.sqrt(discriminant)) / (2 * a)
    x2 = (-b - np.sqrt(discriminant)) / (2 * a)

# check if both points in range
    in_range = (min_x <= x1) & (x1 <= max_x) & (min_x <= x2) & (x2 <= max_x)
    return not np.any(in_range)

a = Point(x=-2.47496075130008, y=1.3609840363748935)
b = Point(x=3.4637947060471084, y=-3.7779123453298817)
c = [Circle(r=1.2587063082677084, ctr=Point(x=3.618533781361757, y=2.179925931180058)), Circle(r=0.7625751871124099, ctr=Point(x=-0.3173290200183132, y=4.256206636932641)), Circle(r=0.4926043225930364, ctr=Point(x=-4.626312261120341, y=-1.5754603504419196)), Circle(r=0.6026364956540792, ctr=Point(x=3.775240278691819, y=1.7381168262343072)), Circle(r=1.2804597877349562, ctr=Point(x=4.403273380178893, y=-1.6890127555343681)), Circle(r=1.1562415624767421, ctr=Point(x=-1.0675000352105801, y=-0.23952113329203994)) , Circle(r=1.112718432321835, ctr=Point(x=2.500137075066017, y=-2.77748519509295)), Circle(r=0.979889574640609, ctr=Point(x=4.494971251199753, y=-1.0530995423779388)), Circle(r=0.7817624050358268, ctr=Point(x=3.2419454348696544, y=4.3303373486692465)), Circle(r=1.0271176198616367, ctr=Point(x=-0.9740272820753071, y=-4.282195116754338)), Circle(r=1.1585218836700681, ctr=Point(x=-0.42096876790888915, y=2.135161027254492)), Circle(r=1.0242603387003988, ctr=Point(x=2.2617850544260767, y=-4.59942951839469)), Circle(r=1.5704233297828027, ctr=Point(x=-1.1182365440831088, y=4.2411408333943506)), Circle(r=0.37137272043983655, ctr=Point(x=3.280499587987774, y=-4.87871834733383)), Circle(r=1.1829610109115543, ctr=Point(x=-0.27755604766113606, y=-3.68429580935016)), Circle(r=1.0993567600839198, ctr=Point(x=0.23602306761027925, y=0.47530122196024704)) , Circle(r=1.3865045367147553, ctr=Point(x=-2.537565761732492, y=4.719766182202855)), Circle(r=0.9492796511909753, ctr=Point(x=-3.7047245796551973, y=-2.501817905967274)), Circle(r=0.9866916911482386, ctr=Point(x=1.3021813533479742, y=4.754952371169189)), Circle(r=0.9053004331885084, ctr=Point(x=-3.4912157984801784, y=-0.5269727600532836)), Circle(r=1.3058987272565075, ctr=Point(x=-1.6983878085276427, y=-2.2910189455221053)), Circle(r=0.5342716756987732, ctr=Point(x=4.948676886704507, y=-1.2467089784975183)), Circle(r=1.0603926633240575, ctr=Point(x=-4.390462974765324, y=0.785568745976325)), Circle(r=0.3448422804513971, ctr=Point(x=-1.6459756952994697, y=2.7608629057950362)), Circle(r=0.8521457455807724, ctr=Point(x=-4.503217369041699, y=3.93796926957188)), Circle(r=0.602438849989669, ctr=Point(x=-2.0703406576157493, y=0.6142570312870999)), Circle(r=0.6453692950682722, ctr=Point(x=-0.14802220452893144, y=4.08189682338989)), Circle(r=0.6983361689325062, ctr=Point(x=0.09362196694661651, y=-1.0953438275586391)), Circle(r=1.880331563921456, ctr=Point(x=0.23481661751521776, y=-4.09217120864087)), Circle(r=0.5766225363413416, ctr=Point(x=3.149434524126505, y=-4.639582956406762)), Circle(r=0.6177559628867022, ctr=Point(x=-1.6758918144661683, y=-0.7954935787503492)), Circle(r=0.7347952666955615, ctr=Point(x=-3.1907522890427575, y=0.7048509241855683)) , Circle(r=1.2795003337464894, ctr=Point(x=-1.777244415863577, y=2.936422879898364)), Circle(r=0.9181024765780231, ctr=Point(x=4.212544425778317, y=-1.953546993038261)), Circle(r=1.7681384709020282, ctr=Point(x=-1.3702722387909405, y=-1.7013020424154368)), Circle(r=0.5420789771729688, ctr=Point(x=4.063803796292818, y=-3.7159871611415065)), Circle(r=1.3863651881788939, ctr=Point(x=0.7685002210812408, y=-3.994230705171357)), Circle(r=0.5739750223225826, ctr=Point(x=0.08779554290638258, y=4.879912451441914)), Circle(r=1.2019825386919343, ctr=Point(x=-4.206623233886995, y=-1.1617382464768689))]

circle_dt = np.dtype('float,float,float')
circle_dt.names = ['x', 'y', 'r']
np_c = np.array([(x.ctr.x, x.ctr.y, x.r) for x in c], dtype=circle_dt)

lp1 = LineProfiler()
loop_wrapper = lp1(loop)
loop_wrapper(a, b, c)
lp1.print_stats()

lp2 = LineProfiler()
vectorized_wrapper = lp2(vectorized)
vectorized_wrapper(a, b, np_c)
lp2.print_stats()

One implementation is the regular for loop implementation, and the other is implemented with numpy vectorization.
Based on what I know about vectorization, I’m guessing that the vectorization function produces better results, but as you can see below, that’s not the case:

Total time: 4.36e-05 s
Function: loop at line 31

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    31                                           def loop(p1: Point, p2: Point, circles: list[Circle]):
    32         1          9.0      9.0      2.1      m = (p1.y - p2.y) / (p1.x - p2.x)
    33         1          5.0      5.0      1.1      n = p1.y - m * p1.x
    34                                           
    35         1         19.0     19.0      4.4      max_x = max(p1.x, p2.x)
    36         1          5.0      5.0      1.1      min_x = min(p1.x, p2.x)
    37                                           
    38         6         30.0      5.0      6.9      for circle in circles:
    39         6         73.0     12.2     16.7          if sqrt((circle.ctr.x - p1.x) ** 2 + (circle.ctr.y - p1.y) ** 2) < circle.r \
    40         6         62.0     10.3     14.2                  or sqrt((circle.ctr.x - p2.x) ** 2 + (circle.ctr.y - p2.y) ** 2) < circle.r:
    41                                                       return False
    42                                           
    43         6         29.0      4.8      6.7          a = m ** 2 + 1
    44         6         32.0      5.3      7.3          b = 2 * (m * n - m * circle.ctr.y - circle.ctr.x)
    45         6         82.0     13.7     18.8          c = circle.ctr.x ** 2 + circle.ctr.y ** 2 + n ** 2 - circle.r ** 2 - 2 * n * circle.ctr.y
    46                                           
    47                                                   # compute the intersection points
    48         6         33.0      5.5      7.6          discriminant = b ** 2 - 4 * a * c
    49         5         11.0      2.2      2.5          if discriminant <= 0:
    50                                                       # no real roots, the line does not intersect the circle
    51         5         22.0      4.4      5.0              continue
    52                                           
    53                                                   # two real roots, the line intersects the circle at two points
    54         1          7.0      7.0      1.6          x1 = (-b + sqrt(discriminant)) / (2 * a)
    55         1          4.0      4.0      0.9          x2 = (-b - sqrt(discriminant)) / (2 * a)
    56                                           
    57                                                   # check if one point in range
    58         1          5.0      5.0      1.1          first = min_x < x1 < max_x
    59         1          3.0      3.0      0.7          second = min_x < x2 < max_x
    60         1          2.0      2.0      0.5          if first and second:
    61         1          3.0      3.0      0.7              return False
    62                                           
    63                                                   return True

Total time: 0.0001534 s
Function: vectorized at line 66

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    66                                           def vectorized(p1: Point, p2: Point, circles):
    67         1         10.0     10.0      0.7      m = (p1.y - p2.y) / (p1.x - p2.x)
    68         1          5.0      5.0      0.3      n = p1.y - m * p1.x
    69                                           
    70         1          7.0      7.0      0.5      max_x = max(p1.x, p2.x)
    71         1          4.0      4.0      0.3      min_x = min(p1.x, p2.x)
    72                                           
    73         1         10.0     10.0      0.7      circle_ctr_x = circles['x']
    74         1          3.0      3.0      0.2      circle_ctr_y = circles['y']
    75         1          3.0      3.0      0.2      circle_radius = circles['r']
    76                                           
    77                                               # Pt 1 inside circle
    78         1        652.0    652.0     42.5      if np.any(np.sqrt((circle_ctr_x - p1.x) ** 2 + (circle_ctr_y - p1.y) ** 2) < circle_radius):
    79                                                   return False
    80                                               # Pt 2 inside circle
    81         1        161.0    161.0     10.5      if np.any(np.sqrt((circle_ctr_x - p2.x) ** 2 + (circle_ctr_y - p2.y) ** 2) < circle_radius):
    82                                                   return False
    83                                               # Line intersects with circle in range
    84         1         13.0     13.0      0.8      a = m ** 2 + 1
    85         1        120.0    120.0      7.8      b = 2 * (m * n - m * circle_ctr_y - circle_ctr_x)
    86         1         77.0     77.0      5.0      c = circle_ctr_x ** 2 + circle_ctr_y ** 2 + n ** 2 - circle_radius ** 2 - 2 * n * circle_ctr_y
    87                                           
    88                                               # compute the intersection points
    89         1         25.0     25.0      1.6      discriminant = b**2 - 4*a*c
    90         1         46.0     46.0      3.0      discriminant_bigger_than_zero = discriminant > 0
    91         1         56.0     56.0      3.7      discriminant = discriminant[discriminant_bigger_than_zero]
    92                                           
    93         1          6.0      6.0      0.4      if discriminant.size == 0:
    94                                                   return True
    95                                           
    96         1         12.0     12.0      0.8      b = b[discriminant_bigger_than_zero]
    97                                           
    98                                               # two real roots, the line intersects the circle at two points
    99         1         77.0     77.0      5.0      x1 = (-b + np.sqrt(discriminant)) / (2 * a)
   100         1         28.0     28.0      1.8      x2 = (-b - np.sqrt(discriminant)) / (2 * a)
   101                                           
   102                                               # check if both points in range
   103         1         96.0     96.0      6.3      in_range = (min_x <= x1) & (x1 <= max_x) & (min_x <= x2) & (x2 <= max_x)
   104         1        123.0    123.0      8.0      return not np.any(in_range)

For some reason, non-vectorized functions run faster.

My simple guess is because the vectorization function traverses the entire array every time, and the non-vectorized function stops in the middle when it finds the intersection of the circles.

So my question is:

  1. Is there a numpy function that doesn’t iterate over the entire array, but stops when the result is false?
  2. What causes vectorization functions to take longer to run?
  3. Any general optimization suggestions would be appreciated

Solution

Is there a numpy function which doesn’t iterate over the whole array but stops when the results are false?

No. This is a long-standing feature requested by Numpy users, but it will certainly never be added to Numpy. For simple cases, such as returning the first index of a bool array, Numpy can implement it, but the problem is that the bool array needs to be created completely first. To support the general case, Numpy should combine multiple operations and do some kind of lazy calculation. This basically means completely rewriting Numpy from scratch for an efficient implementation (which is a huge job).

If you need to do this, there are two main solutions:

  • operate on blocks to stop the calculation early (up to len(chunk) additional items are counted simultaneously);
  • Write your own fast compilation implementation using Numba or Cython (with View).

What is the reason the vectorized function takes longer to run?

The input is very small and Numpy is not optimized for small arrays. In fact, on mainstream processors (like my i5-9600KF), it usually takes 0.4-4 microseconds per call to the Numpy function. This is because Numpy does a lot of checks, allocates new arrays, builds generic internal iterators, etc. As a result, it takes about 8 us to make 8 Numpy calls on my machine and create 7 temporary arrays on a line like np.any(np.sqrt((circle_ctr_x - p1.x) ** 2 + (circle_ctr_y - p1.y) ** 2) < circle_radius). A second similar line takes the same amount of time. They already add up to be slower than the non-vectorized version.

As noted in the question and comments, the non-vectorized function can be stopped early, which also helps the non-vectorized version faster than another.

Any general optimization suggestions would be appreciated

Regarding your code, using numba (with normal loops and numpy arrays) is certainly a good idea for improving performance. Note that the first call may be slower due to compile time (you can do this by providing a signature on load, or just use an AOT compiler including Cython).

Note that arrays of structures are generally inefficient because they prevent the efficient use of SIMD instructions. Numpy certainly can’t compute them efficiently either, because data types are created dynamically and the Numpy code is already compiled ahead of time (so it can’t implement functions for this particular data type and has to use generic dynamic operations for each item of the array much slower than basic data types). Consider using an array structure. For more information, read This post more generally this post

Related Problems and Solutions