NumPy for Numerical Performance

The most important rule in performance optimisation is also the most frequently ignored: never optimise without measuring first. It is tempting to read code and guess where the slowness is, but intuition is unreliable; experienced developers routinely guess wrong about where the time actually goes. Optimising the wrong thing wastes time, adds complexity, and often makes no measurable difference.

The decision process below is a structured workflow that ensures you always fix the right thing in the right order — starting with the highest-impact changes and escalating only when necessary:

Is it actually slow?
  → Profile first (cProfile / timeit)

Found the bottleneck?
├── Algorithm or data structure wrong?
│   → Fix this first — biggest gains
│     list → set for O(1) lookups
│     O(n²) → O(n log n) sorting
├── Python interpreter overhead?
│   → built-ins, local variables, generators
├── Numerical computation?
│   → NumPy / SciPy
├── Memory usage too high?
│   → __slots__, generators, lazy evaluation
└── Still not fast enough?
    ├── Cython — compile Python to C
    ├── Numba — JIT compile numerical code
    └── C extension — write the hot path in C

The order matters — each level delivers less improvement than the one above it but requires more effort. Always exhaust the higher levels before descending.


Before doing anything, establish that there is actually a problem worth solving. A function that takes 0.1s and is called once at startup is not worth optimising. A function that takes 0.001s but is called 10 million times absolutely is.

import timeit

def process(data):
    return [x**2 for x in data if x % 2 == 0]

data = list(range(100_000))

# step 1 — is it slow? measure first
t = timeit.timeit(lambda: process(data), number=100)
print(f"total for 100 runs: {t:.3f}s")
print(f"average per call: {t/100*1000:.3f}ms")
# if average per call is acceptable → stop here, do not optimise
# if average per call is a problem → proceed to step 2

Once you have confirmed slowness, profile to find where the time is going. Never guess:

import cProfile
import pstats
import io

def outer(data):
    step1 = process(data)   # filtering and squaring
    step2 = sorted(step1)   # sorting
    step3 = sum(step2)      # summing
    return step3

# profile the full call
profiler = cProfile.Profile()
profiler.enable()
outer(data)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative")
stats.print_stats(5)
print(stream.getvalue())
# output tells you exactly which function is the bottleneck
# do not guess — read the profiler output
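When you do not need programmatic control over start and stop points, the standard library also offers a one-line shortcut. A minimal sketch (the `work` function here is a hypothetical stand-in for whatever you are profiling); `cProfile.runctx` takes explicit namespaces, so it behaves the same whether or not the code lives in a top-level script, and `cProfile.run("work()")` is the same idea for scripts:

```python
import cProfile

def work():
    # deliberately cheap stand-in for the real pipeline
    return sum(x ** 2 for x in range(100_000))

# one-line shortcut: profile a statement and print a report
# sorted by cumulative time
cProfile.runctx("work()", globals(), locals(), sort="cumulative")
```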

Algorithm and data structure first — this is always the highest-impact fix:

import timeit

# ❌ O(n²) — checking membership in a list
def find_common(list1, list2):
    return [x for x in list1 if x in list2]    # O(n²)

# ✅ O(n) — convert to set first
def find_common_fast(list1, list2):
    set2 = set(list2)                          # O(n) to build
    return [x for x in list1 if x in set2]     # O(1) per lookup

data1 = list(range(10_000))
data2 = list(range(5_000, 15_000))

t1 = timeit.timeit(lambda: find_common(data1, data2), number=10)
t2 = timeit.timeit(lambda: find_common_fast(data1, data2), number=10)
print(f"O(n²) : {t1:.3f}s")
print(f"O(n)  : {t2:.3f}s")
print(f"speedup: {t1/t2:.1f}x")
# a 10,000 element list → ~100x speedup
# no numpy, no cython, no C — just the right data structure

Python overhead second — once the algorithm is right, reduce interpreter overhead:

import math

# ❌ global lookup on every iteration
def slow(data):
    return [math.sqrt(x) for x in data]

# ✅ cache as local variable
def fast(data):
    sqrt = math.sqrt
    return [sqrt(x) for x in data]
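The other interpreter-overhead win listed in the decision tree is preferring built-ins to hand-written loops: a manual accumulation pays bytecode dispatch on every element, while `sum` runs the loop in C. A minimal comparison sketch:

```python
import timeit

data = list(range(100_000))

# ❌ manual accumulation: interpreter dispatch per element
def manual_sum(values):
    total = 0
    for x in values:
        total += x
    return total

# ✅ built-in: the loop itself runs in C
t_manual = timeit.timeit(lambda: manual_sum(data), number=100)
t_builtin = timeit.timeit(lambda: sum(data), number=100)
print(f"manual loop : {t_manual:.3f}s")
print(f"built-in sum: {t_builtin:.3f}s")
```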

NumPy third — when the operation is numerical and the data is large:

import numpy as np
# ✅ vectorized — bypasses interpreter entirely
arr = np.array(data, dtype=np.float64)
result = np.sqrt(arr) # single C call
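The gain is easy to measure against the equivalent list comprehension; on arrays of this size the vectorised call is typically an order of magnitude faster or more, though exact ratios vary by machine. A minimal benchmark sketch:

```python
import timeit
import numpy as np

data = list(range(100_000))
arr = np.array(data, dtype=np.float64)

# same computation, two ways: per-element bytecode vs one C call
t_list = timeit.timeit(lambda: [x ** 2 for x in data], number=100)
t_np = timeit.timeit(lambda: arr ** 2, number=100)
print(f"list comprehension: {t_list:.3f}s")
print(f"numpy vectorised  : {t_np:.3f}s")
```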

Escalate only if still not enough — Cython, Numba, and C extensions are powerful but add significant complexity. They should only be reached when the above steps have been exhausted and measured:

# Numba — JIT compile a numerical function
import numpy as np
from numba import jit

@jit(nopython=True)
def numba_squares(arr):
    result = np.empty(len(arr))
    for i in range(len(arr)):
        result[i] = arr[i] ** 2
    return result
# first call compiles — subsequent calls run at C speed
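A self-contained variant of the sketch above shows the usage and lets you check the result against NumPy's vectorised answer. The try/except fallback is an assumption for portability: it degrades to plain Python on machines where Numba is not installed, losing the speedup but keeping the code runnable:

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # no numba installed: fall back to a no-op decorator
    def jit(*args, **kwargs):
        return lambda fn: fn

@jit(nopython=True)
def numba_squares(arr):
    result = np.empty(len(arr))
    for i in range(len(arr)):
        result[i] = arr[i] ** 2
    return result

arr = np.arange(10.0)
out = numba_squares(arr)   # same values as arr ** 2
print(out)
```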

A typical end-to-end session shows how rarely the lower levels are needed:

symptom       : "the report generation takes 45 seconds"
timeit        → confirmed: report() takes 45s per call
cProfile      → bottleneck: find_duplicates() called 10,000 times
line_profiler → bottleneck line: if item in seen_list ← O(n) lookup
fix           : change seen_list to seen_set ← O(1) lookup
timeit        → report() now takes 0.8s ← 56x faster
done — no numpy, no cython needed

The most common outcome is that fixing the algorithm or data structure is all that is needed. The escalation path to Cython and C extensions exists for the rare cases where Python genuinely cannot go fast enough — numerical simulations, real-time signal processing, game engines — not for typical application code.