Why numpy vectorization is slower than for loops
The following code has two functions to do the same thing: check if the line between two points intersects the circle.
from line_profiler import LineProfiler
from math import sqrt
import numpy as np
class Point:
x: float
y: float
def __init__(self, x: float, y: float):
self.x = x
self.y = y
def __repr__(self):
return f"Point(x={self.x}, y={self.y})"
class Circle:
ctr: Point
r: float
def __init__(self, ctr: Point, r: float):
self.ctr = ctr
self.r = r
def __repr__(self):
return f"Circle(r={self.r}, ctr={self.ctr})"
def loop(p1: Point, p2: Point, circles: list[Circle]):
m = (p1.y - p2.y) / (p1.x - p2.x)
n = p1.y - m * p1.x
max_x = max(p1.x, p2.x)
min_x = min(p1.x, p2.x)
for circle in circles:
if sqrt((circle.ctr.x - p1.x) ** 2 + (circle.ctr.y - p1.y) ** 2) < circle.r \
or sqrt((circle.ctr.x - p2.x) ** 2 + (circle.ctr.y - p2.y) ** 2) < circle.r:
return False
a = m ** 2 + 1
b = 2 * (m * n - m * circle.ctr.y - circle.ctr.x)
c = circle.ctr.x ** 2 + circle.ctr.y ** 2 + n ** 2 - circle.r ** 2 - 2 * n * circle.ctr.y
# compute the intersection points
discriminant = b ** 2 - 4 * a * c
if discriminant <= 0:
# no real roots, the line does not intersect the circle
continue
# two real roots, the line intersects the circle at two points
x1 = (-b + sqrt(discriminant)) / (2 * a)
x2 = (-b - sqrt(discriminant)) / (2 * a)
# check if both points in range
first = min_x <= x1 <= max_x
second = min_x <= x2 <= max_x
if first and second:
return False
return True
def vectorized(p1: Point, p2: Point, circles):
m = (p1.y - p2.y) / (p1.x - p2.x)
n = p1.y - m * p1.x
max_x = max(p1.x, p2.x)
min_x = min(p1.x, p2.x)
circle_ctr_x = circles['x']
circle_ctr_y = circles['y']
circle_radius = circles['r']
# Pt 1 inside circle
if np.any(np.sqrt((circle_ctr_x - p1.x) ** 2 + (circle_ctr_y - p1.y) ** 2) < circle_radius):
return False
# Pt 2 inside circle
if np.any(np.sqrt((circle_ctr_x - p2.x) ** 2 + (circle_ctr_y - p2.y) ** 2) < circle_radius):
return False
# Line intersects with circle in range
a = m ** 2 + 1
b = 2 * (m * n - m * circle_ctr_y - circle_ctr_x)
c = circle_ctr_x ** 2 + circle_ctr_y ** 2 + n ** 2 - circle_radius ** 2 - 2 * n * circle_ctr_y
# compute the intersection points
discriminant = b**2 - 4*a*c
discriminant_bigger_than_zero = discriminant > 0
discriminant = discriminant[discriminant_bigger_than_zero]
if discriminant.size == 0:
return True
b = b[discriminant_bigger_than_zero]
# two real roots, the line intersects the circle at two points
x1 = (-b + np.sqrt(discriminant)) / (2 * a)
x2 = (-b - np.sqrt(discriminant)) / (2 * a)
# check if both points in range
in_range = (min_x <= x1) & (x1 <= max_x) & (min_x <= x2) & (x2 <= max_x)
return not np.any(in_range)
a = Point(x=-2.47496075130008, y=1.3609840363748935)
b = Point(x=3.4637947060471084, y=-3.7779123453298817)
c = [Circle(r=1.2587063082677084, ctr=Point(x=3.618533781361757, y=2.179925931180058)), Circle(r=0.7625751871124099, ctr=Point(x=-0.3173290200183132, y=4.256206636932641)), Circle(r=0.4926043225930364, ctr=Point(x=-4.626312261120341, y=-1.5754603504419196)), Circle(r=0.6026364956540792, ctr=Point(x=3.775240278691819, y=1.7381168262343072)), Circle(r=1.2804597877349562, ctr=Point(x=4.403273380178893, y=-1.6890127555343681)), Circle(r=1.1562415624767421, ctr=Point(x=-1.0675000352105801, y=-0.23952113329203994)) , Circle(r=1.112718432321835, ctr=Point(x=2.500137075066017, y=-2.77748519509295)), Circle(r=0.979889574640609, ctr=Point(x=4.494971251199753, y=-1.0530995423779388)), Circle(r=0.7817624050358268, ctr=Point(x=3.2419454348696544, y=4.3303373486692465)), Circle(r=1.0271176198616367, ctr=Point(x=-0.9740272820753071, y=-4.282195116754338)), Circle(r=1.1585218836700681, ctr=Point(x=-0.42096876790888915, y=2.135161027254492)), Circle(r=1.0242603387003988, ctr=Point(x=2.2617850544260767, y=-4.59942951839469)), Circle(r=1.5704233297828027, ctr=Point(x=-1.1182365440831088, y=4.2411408333943506)), Circle(r=0.37137272043983655, ctr=Point(x=3.280499587987774, y=-4.87871834733383)), Circle(r=1.1829610109115543, ctr=Point(x=-0.27755604766113606, y=-3.68429580935016)), Circle(r=1.0993567600839198, ctr=Point(x=0.23602306761027925, y=0.47530122196024704)) , Circle(r=1.3865045367147553, ctr=Point(x=-2.537565761732492, y=4.719766182202855)), Circle(r=0.9492796511909753, ctr=Point(x=-3.7047245796551973, y=-2.501817905967274)), Circle(r=0.9866916911482386, ctr=Point(x=1.3021813533479742, y=4.754952371169189)), Circle(r=0.9053004331885084, ctr=Point(x=-3.4912157984801784, y=-0.5269727600532836)), Circle(r=1.3058987272565075, ctr=Point(x=-1.6983878085276427, y=-2.2910189455221053)), Circle(r=0.5342716756987732, ctr=Point(x=4.948676886704507, y=-1.2467089784975183)), Circle(r=1.0603926633240575, ctr=Point(x=-4.390462974765324, y=0.785568745976325)), Circle(r=0.3448422804513971, ctr=Point(x=-1.6459756952994697, y=2.7608629057950362)), Circle(r=0.8521457455807724, ctr=Point(x=-4.503217369041699, y=3.93796926957188)), Circle(r=0.602438849989669, ctr=Point(x=-2.0703406576157493, y=0.6142570312870999)), Circle(r=0.6453692950682722, ctr=Point(x=-0.14802220452893144, y=4.08189682338989)), Circle(r=0.6983361689325062, ctr=Point(x=0.09362196694661651, y=-1.0953438275586391)), Circle(r=1.880331563921456, ctr=Point(x=0.23481661751521776, y=-4.09217120864087)), Circle(r=0.5766225363413416, ctr=Point(x=3.149434524126505, y=-4.639582956406762)), Circle(r=0.6177559628867022, ctr=Point(x=-1.6758918144661683, y=-0.7954935787503492)), Circle(r=0.7347952666955615, ctr=Point(x=-3.1907522890427575, y=0.7048509241855683)) , Circle(r=1.2795003337464894, ctr=Point(x=-1.777244415863577, y=2.936422879898364)), Circle(r=0.9181024765780231, ctr=Point(x=4.212544425778317, y=-1.953546993038261)), Circle(r=1.7681384709020282, ctr=Point(x=-1.3702722387909405, y=-1.7013020424154368)), Circle(r=0.5420789771729688, ctr=Point(x=4.063803796292818, y=-3.7159871611415065)), Circle(r=1.3863651881788939, ctr=Point(x=0.7685002210812408, y=-3.994230705171357)), Circle(r=0.5739750223225826, ctr=Point(x=0.08779554290638258, y=4.879912451441914)), Circle(r=1.2019825386919343, ctr=Point(x=-4.206623233886995, y=-1.1617382464768689))]
circle_dt = np.dtype('float,float,float')
circle_dt.names = ['x', 'y', 'r']
np_c = np.array([(x.ctr.x, x.ctr.y, x.r) for x in c], dtype=circle_dt)
lp1 = LineProfiler()
loop_wrapper = lp1(loop)
loop_wrapper(a, b, c)
lp1.print_stats()
lp2 = LineProfiler()
vectorized_wrapper = lp2(vectorized)
vectorized_wrapper(a, b, np_c)
lp2.print_stats()
One implementation is the regular for loop implementation, and the other is implemented with numpy vectorization.
Based on what I know about vectorization, I’m guessing that the vectorization function produces better results, but as you can see below, that’s not the case:
Total time: 4.36e-05 s
Function: loop at line 31
Line # Hits Time Per Hit % Time Line Contents
==============================================================
31 def loop(p1: Point, p2: Point, circles: list[Circle]):
32 1 9.0 9.0 2.1 m = (p1.y - p2.y) / (p1.x - p2.x)
33 1 5.0 5.0 1.1 n = p1.y - m * p1.x
34
35 1 19.0 19.0 4.4 max_x = max(p1.x, p2.x)
36 1 5.0 5.0 1.1 min_x = min(p1.x, p2.x)
37
38 6 30.0 5.0 6.9 for circle in circles:
39 6 73.0 12.2 16.7 if sqrt((circle.ctr.x - p1.x) ** 2 + (circle.ctr.y - p1.y) ** 2) < circle.r \
40 6 62.0 10.3 14.2 or sqrt((circle.ctr.x - p2.x) ** 2 + (circle.ctr.y - p2.y) ** 2) < circle.r:
41 return False
42
43 6 29.0 4.8 6.7 a = m ** 2 + 1
44 6 32.0 5.3 7.3 b = 2 * (m * n - m * circle.ctr.y - circle.ctr.x)
45 6 82.0 13.7 18.8 c = circle.ctr.x ** 2 + circle.ctr.y ** 2 + n ** 2 - circle.r ** 2 - 2 * n * circle.ctr.y
46
47 # compute the intersection points
48 6 33.0 5.5 7.6 discriminant = b ** 2 - 4 * a * c
49 5 11.0 2.2 2.5 if discriminant <= 0:
50 # no real roots, the line does not intersect the circle
51 5 22.0 4.4 5.0 continue
52
53 # two real roots, the line intersects the circle at two points
54 1 7.0 7.0 1.6 x1 = (-b + sqrt(discriminant)) / (2 * a)
55 1 4.0 4.0 0.9 x2 = (-b - sqrt(discriminant)) / (2 * a)
56
57 # check if one point in range
58 1 5.0 5.0 1.1 first = min_x < x1 < max_x
59 1 3.0 3.0 0.7 second = min_x < x2 < max_x
60 1 2.0 2.0 0.5 if first and second:
61 1 3.0 3.0 0.7 return False
62
63 return True
Total time: 0.0001534 s
Function: vectorized at line 66
Line # Hits Time Per Hit % Time Line Contents
==============================================================
66 def vectorized(p1: Point, p2: Point, circles):
67 1 10.0 10.0 0.7 m = (p1.y - p2.y) / (p1.x - p2.x)
68 1 5.0 5.0 0.3 n = p1.y - m * p1.x
69
70 1 7.0 7.0 0.5 max_x = max(p1.x, p2.x)
71 1 4.0 4.0 0.3 min_x = min(p1.x, p2.x)
72
73 1 10.0 10.0 0.7 circle_ctr_x = circles['x']
74 1 3.0 3.0 0.2 circle_ctr_y = circles['y']
75 1 3.0 3.0 0.2 circle_radius = circles['r']
76
77 # Pt 1 inside circle
78 1 652.0 652.0 42.5 if np.any(np.sqrt((circle_ctr_x - p1.x) ** 2 + (circle_ctr_y - p1.y) ** 2) < circle_radius):
79 return False
80 # Pt 2 inside circle
81 1 161.0 161.0 10.5 if np.any(np.sqrt((circle_ctr_x - p2.x) ** 2 + (circle_ctr_y - p2.y) ** 2) < circle_radius):
82 return False
83 # Line intersects with circle in range
84 1 13.0 13.0 0.8 a = m ** 2 + 1
85 1 120.0 120.0 7.8 b = 2 * (m * n - m * circle_ctr_y - circle_ctr_x)
86 1 77.0 77.0 5.0 c = circle_ctr_x ** 2 + circle_ctr_y ** 2 + n ** 2 - circle_radius ** 2 - 2 * n * circle_ctr_y
87
88 # compute the intersection points
89 1 25.0 25.0 1.6 discriminant = b**2 - 4*a*c
90 1 46.0 46.0 3.0 discriminant_bigger_than_zero = discriminant > 0
91 1 56.0 56.0 3.7 discriminant = discriminant[discriminant_bigger_than_zero]
92
93 1 6.0 6.0 0.4 if discriminant.size == 0:
94 return True
95
96 1 12.0 12.0 0.8 b = b[discriminant_bigger_than_zero]
97
98 # two real roots, the line intersects the circle at two points
99 1 77.0 77.0 5.0 x1 = (-b + np.sqrt(discriminant)) / (2 * a)
100 1 28.0 28.0 1.8 x2 = (-b - np.sqrt(discriminant)) / (2 * a)
101
102 # check if both points in range
103 1 96.0 96.0 6.3 in_range = (min_x <= x1) & (x1 <= max_x) & (min_x <= x2) & (x2 <= max_x)
104 1 123.0 123.0 8.0 return not np.any(in_range)
For some reason, non-vectorized functions run faster.
My simple guess is because the vectorization function traverses the entire array every time, and the non-vectorized function stops in the middle when it finds the intersection of the circles.
So my question is:
- Is there a numpy function that doesn’t iterate over the entire array, but stops when the result is false?
- What causes vectorization functions to take longer to run?
- Any general optimization suggestions would be appreciated
Solution
Is there a numpy function which doesn’t iterate over the whole array but stops when the results are false?
No. This is a long-standing feature requested by Numpy users, but it will certainly never be added to Numpy. For simple cases, such as returning the first index of a bool array, Numpy can implement it, but the problem is that the bool array needs to be created completely first. To support the general case, Numpy should combine multiple operations and do some kind of lazy calculation. This basically means completely rewriting Numpy from scratch for an efficient implementation (which is a huge job).
If you need to do this, there are two main solutions:
- operate on blocks to stop the calculation early (up to
len(chunk)
additional items are counted simultaneously); - Write your own fast compilation implementation using Numba or Cython (with View).
What is the reason the vectorized function takes longer to run?
The input is very small and Numpy is not optimized for small arrays. In fact, on mainstream processors (like my i5-9600KF), it usually takes 0.4-4 microseconds per call to the Numpy function. This is because Numpy does a lot of checks, allocates new arrays, builds generic internal iterators, etc. As a result, it takes about 8 us to make 8 Numpy calls on my machine and create 7 temporary arrays on a line like np.any(np.sqrt((circle_ctr_x - p1.x) ** 2 + (circle_ctr_y - p1.y
) ** 2) < circle_radius). A second similar line takes the same amount of time. They already add up to be slower than the non-vectorized version.
As noted in the question and comments, the non-vectorized function can be stopped early, which also helps the non-vectorized version faster than another.
Any general optimization suggestions would be appreciated
Regarding your code, using numba (with normal loops and numpy arrays) is certainly a good idea for improving performance. Note that the first call may be slower due to compile time (you can do this by providing a signature on load, or just use an AOT compiler including Cython).
Note that arrays of structures are generally inefficient because they prevent the efficient use of SIMD instructions. Numpy certainly can’t compute them efficiently either, because data types are created dynamically and the Numpy code is already compiled ahead of time (so it can’t implement functions for this particular data type and has to use generic dynamic operations for each item of the array much slower than basic data types). Consider using an array structure. For more information, read This post more generally this post