NumPy for Numerical Performance
The Optimisation Decision Process
Measure First, Fix Second
The most important rule in performance optimisation is also the most frequently ignored — never optimise without measuring first. It is tempting to look at code and guess where the slowness is, but intuition is unreliable: developers' guesses about where the bottleneck lies turn out to be wrong more often than they are right. Optimising the wrong thing wastes time, adds complexity, and often makes no measurable difference.
The decision process below is a structured workflow that ensures you always fix the right thing in the right order — starting with the highest-impact changes and escalating only when necessary:
```
Is it actually slow?
  │
  ▼
Profile first (cProfile / timeit)
  │
  ▼
Found the bottleneck?
  │
  ├── Algorithm or data structure wrong?
  │     → Fix this first — biggest gains
  │       list → set for O(1) lookups
  │       O(n²) → O(n log n) sorting
  │
  ├── Python interpreter overhead?
  │     → built-ins, local variables, generators
  │
  ├── Numerical computation?
  │     → NumPy / SciPy
  │
  ├── Memory usage too high?
  │     → __slots__, generators, lazy evaluation
  │
  └── Still not fast enough?
        ├── Cython — compile Python to C
        ├── Numba — JIT compile numerical code
        └── C extension — write the hot path in C
```

The order matters — each level delivers less improvement than the one above it but requires more effort. Always exhaust the higher levels before descending.
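The memory level of the chart never gets its own example later, so here is a minimal sketch of its two cheapest fixes — generator expressions instead of materialised lists, and `__slots__` instead of per-instance dictionaries. The `Point` class is illustrative, not from any library:

```python
import sys

# a generator expression holds one element at a time instead of
# materialising the whole result in memory
as_list = [x * x for x in range(1_000_000)]  # millions of bytes of pointers
as_gen = (x * x for x in range(1_000_000))   # a few hundred bytes

print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))  # True

# __slots__ removes the per-instance __dict__, shrinking every object
class Point:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1.0, 2.0)
print(hasattr(p, "__dict__"))  # False — no dict, less memory per instance
```

Both changes are drop-in for most code, which is why they sit above Cython and C in the escalation order.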
Step 1 — Is It Actually Slow?
Before doing anything, establish that there is actually a problem worth solving. A function that takes 0.1s and is called once at startup is not worth optimising. A function that takes 0.001s but is called 10 million times absolutely is.
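The arithmetic behind that claim is worth making explicit — what matters is per-call time multiplied by call count, using the same numbers as above:

```python
# a "slow" function called once is cheap
startup_cost = 0.1 * 1  # 0.1 s total — ignore it

# a "fast" function called 10 million times dominates everything
hot_path_cost = 0.001 * 10_000_000  # ~10,000 s — nearly 3 hours

print(f"startup: {startup_cost:.1f}s, hot path: {hot_path_cost:,.0f}s")
```
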
```python
import timeit
import cProfile

def process(data):
    return [x**2 for x in data if x % 2 == 0]

data = list(range(100_000))

# step 1 — is it slow? measure first
t = timeit.timeit(lambda: process(data), number=100)
print(f"total for 100 runs: {t:.3f}s")
print(f"average per call: {t/100*1000:.3f}ms")

# if average per call is acceptable → stop here, do not optimise
# if average per call is a problem → proceed to step 2
```

Step 2 — Profile to Find the Bottleneck
Once you have confirmed slowness, profile to find where the time is going. Never guess:
```python
import cProfile
import pstats
import io

def outer(data):
    step1 = process(data)    # filtering and squaring
    step2 = sorted(step1)    # sorting
    step3 = sum(step2)       # summing
    return step3

# profile the full call
profiler = cProfile.Profile()
profiler.enable()
outer(data)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative")
stats.print_stats(5)
print(stream.getvalue())

# output tells you exactly which function is the bottleneck
# do not guess — read the profiler output
```

Step 3 — Fix in the Right Order
Algorithm and data structure first — this is always the highest-impact fix:
```python
# ❌ O(n²) — checking membership in a list
def find_common(list1, list2):
    return [x for x in list1 if x in list2]   # O(n²)

# ✅ O(n) — convert to set first
def find_common_fast(list1, list2):
    set2 = set(list2)                         # O(n) to build
    return [x for x in list1 if x in set2]    # O(1) per lookup
```
```python
data1 = list(range(10_000))
data2 = list(range(5_000, 15_000))

import timeit
t1 = timeit.timeit(lambda: find_common(data1, data2), number=10)
t2 = timeit.timeit(lambda: find_common_fast(data1, data2), number=10)

print(f"O(n²) : {t1:.3f}s")
print(f"O(n)  : {t2:.3f}s")
print(f"speedup: {t1/t2:.1f}x")
# a 10,000 element list → ~100x speedup
# no numpy, no cython, no C — just the right data structure
```

Python overhead second — once the algorithm is right, reduce interpreter overhead:
```python
import math

# ❌ global lookup on every iteration
def slow(data):
    return [math.sqrt(x) for x in data]

# ✅ cache as local variable
def fast(data):
    sqrt = math.sqrt
    return [sqrt(x) for x in data]
```

NumPy third — when the operation is numerical and the data is large:
```python
import numpy as np

# ✅ vectorized — bypasses interpreter entirely
arr = np.array(data, dtype=np.float64)
result = np.sqrt(arr)  # single C call
```

Escalate only if still not enough — Cython, Numba, and C extensions are powerful but add significant complexity. They should only be reached when the above steps have been exhausted and measured:
```python
# Numba — JIT compile a numerical function
from numba import jit

@jit(nopython=True)
def numba_squares(arr):
    result = np.empty(len(arr))
    for i in range(len(arr)):
        result[i] = arr[i] ** 2
    return result

# first call compiles — subsequent calls run at C speed
```

The Complete Workflow in Practice
```
symptom: "the report generation takes 45 seconds"
  │
  ▼
timeit        → confirmed: report() takes 45s per call
  │
  ▼
cProfile      → bottleneck: find_duplicates() called 10,000 times
  │
  ▼
line_profiler → bottleneck line: if item in seen_list   ← O(n) lookup
  │
  ▼
fix: change seen_list to seen_set                       ← O(1) lookup
  │
  ▼
timeit        → report() now takes 0.8s                 ← 56x faster
  │
  ▼
done — no numpy, no cython needed
```

The most common outcome is that fixing the algorithm or data structure is all that is needed. The escalation path to Cython and C extensions exists for the rare cases where Python genuinely cannot go fast enough — numerical simulations, real-time signal processing, game engines — not for typical application code.
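As a concrete version of that workflow's fix — the `find_duplicates` name comes from the diagram, and both implementations below are illustrative sketches rather than the actual report code — this is what the `seen_list` → `seen_set` change looks like:

```python
# ❌ before — list membership test is O(n), so the loop is O(n²)
def find_duplicates_slow(items):
    seen = []
    dupes = []
    for item in items:
        if item in seen:       # scans the whole list every time
            dupes.append(item)
        else:
            seen.append(item)
    return dupes

# ✅ after — set membership test is O(1), so the loop is O(n)
def find_duplicates_fast(items):
    seen = set()
    dupes = []
    for item in items:
        if item in seen:       # constant-time hash lookup
            dupes.append(item)
        else:
            seen.add(item)
    return dupes

print(find_duplicates_fast([1, 2, 2, 3, 3, 3]))  # → [2, 3, 3]
```

The two functions return identical results; only the membership test changes, which is exactly why the fix is safe to make once the profiler has pointed at that line.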