The optimisation decision process
Measure First, Fix Second
The decision process below is a structured workflow that ensures you always fix the right thing in the right order, starting with the highest-impact changes and escalating only when necessary:
```
Is it actually slow?
  │
  ▼
Profile first (cProfile / timeit)
  │
  ▼
Found the bottleneck?
  │
  ├── Algorithm or data structure wrong?
  │     → Fix this first — biggest gains
  │       list → set for O(1) lookups
  │       O(n²) → O(n log n) sorting
  │
  ├── Python interpreter overhead?
  │     → built-ins, local variables, generators
  │
  ├── Numerical computation?
  │     → NumPy / SciPy
  │
  ├── Memory usage too high?
  │     → __slots__, generators, lazy evaluation
  │
  └── Still not fast enough?
        ├── Cython — compile Python to C
        ├── Numba — JIT compile numerical code
        └── C extension — write the hot path in C
```

The order matters: each level delivers less improvement than the one above it but requires more effort. Always exhaust the higher levels before descending.
Is It Actually Slow?
Before doing anything, establish that there is actually a problem worth solving. A function that takes 0.1s and is called once at startup is not worth optimising. A function that takes 0.001s but is called 10 million times absolutely is.
```python
import timeit

def process(data):
    return [x**2 for x in data if x % 2 == 0]

data = list(range(100_000))

# step 1 — is it slow? measure first
t = timeit.timeit(lambda: process(data), number=100)
print(f"total for 100 runs: {t:.3f}s")
print(f"average per call: {t/100*1000:.3f}ms")

# if average per call is acceptable → stop here, do not optimise
# if average per call is a problem → proceed to step 2
```
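A single `timeit` measurement can be skewed by whatever else the machine is doing. `timeit.repeat` runs several independent measurements; the minimum is usually the most stable estimate. A minimal sketch (the run counts here are arbitrary):

```python
import timeit

def process(data):
    return [x**2 for x in data if x % 2 == 0]

data = list(range(100_000))

# three independent measurements of 20 runs each
times = timeit.repeat(lambda: process(data), number=20, repeat=3)

# the minimum suffers the least interference; the spread hints at noise
best = min(times)
print(f"best of 3: {best/20*1000:.3f}ms per call")
print(f"spread   : {max(times) - best:.4f}s")
```

If the spread is large relative to the best time, distrust the numbers and re-run on a quieter machine before drawing conclusions.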
Profile to Find the Bottleneck
Once you have confirmed slowness, profile to find where the time is going. Never guess:
```python
import cProfile
import pstats
import io

def outer(data):
    step1 = process(data)   # filtering and squaring
    step2 = sorted(step1)   # sorting
    step3 = sum(step2)      # summing
    return step3

# profile the full call
profiler = cProfile.Profile()
profiler.enable()
outer(data)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative")
stats.print_stats(5)
print(stream.getvalue())

# output tells you exactly which function is the bottleneck
# do not guess — read the profiler output
```
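When cProfile has narrowed things to one function, a line-level view narrows it further. The third-party `line_profiler` package does this properly; a dependency-free sketch of the same idea uses `time.perf_counter` around each step (the `outer_timed` name is invented for this illustration):

```python
import time

def process(data):
    return [x**2 for x in data if x % 2 == 0]

def outer_timed(data):
    timings = {}

    t0 = time.perf_counter()
    step1 = process(data)            # filtering and squaring
    timings["process"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    step2 = sorted(step1)            # sorting
    timings["sorted"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    step3 = sum(step2)               # summing
    timings["sum"] = time.perf_counter() - t0

    return step3, timings

data = list(range(100_000))
result, timings = outer_timed(data)
# print the steps from most to least expensive
for name, t in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {t*1000:.2f}ms")
```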
Fix in the Right Order
Algorithm and data structure first — this is always the highest-impact fix:
```python
# ❌ O(n²) — checking membership in a list
def find_common(list1, list2):
    return [x for x in list1 if x in list2]   # O(n²)

# ✅ O(n) — convert to set first
def find_common_fast(list1, list2):
    set2 = set(list2)                         # O(n) to build
    return [x for x in list1 if x in set2]    # O(1) per lookup

data1 = list(range(10_000))
data2 = list(range(5_000, 15_000))

import timeit
t1 = timeit.timeit(lambda: find_common(data1, data2), number=10)
t2 = timeit.timeit(lambda: find_common_fast(data1, data2), number=10)
print(f"O(n²) : {t1:.3f}s")
print(f"O(n)  : {t2:.3f}s")
print(f"speedup: {t1/t2:.1f}x")

# a 10,000 element list → ~100x speedup
# no numpy, no cython, no C — just the right data structure
```

Python overhead second — once the algorithm is right, reduce interpreter overhead:
```python
import math

# ❌ global lookup on every iteration
def slow(data):
    return [math.sqrt(x) for x in data]

# ✅ cache as local variable
def fast(data):
    sqrt = math.sqrt
    return [sqrt(x) for x in data]
```

NumPy third — when the operation is numerical and the data is large:
```python
import numpy as np

# ✅ vectorized — bypasses interpreter entirely
arr = np.array(data, dtype=np.float64)
result = np.sqrt(arr)   # single C call
```

Escalate only if still not enough — Cython, Numba, and C extensions are powerful but add significant complexity. They should only be reached when the above steps have been exhausted and measured:
```python
# Numba — JIT compile a numerical function
import numpy as np
from numba import jit

@jit(nopython=True)
def numba_squares(arr):
    result = np.empty(len(arr))
    for i in range(len(arr)):
        result[i] = arr[i] ** 2
    return result

# first call compiles — subsequent calls run at C speed
```
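The decision tree at the top also has a memory branch (`__slots__`, generators, lazy evaluation) that deserves its own illustration. A sketch of both techniques; the `PointDict` and `PointSlots` classes are invented for this example:

```python
import sys

# generator vs list: a generator yields one item at a time instead of
# materialising the whole sequence in memory
squares_list = [x**2 for x in range(100_000)]
squares_gen = (x**2 for x in range(100_000))
print(f"list:      {sys.getsizeof(squares_list):>9,} bytes")
print(f"generator: {sys.getsizeof(squares_gen):>9,} bytes")

# __slots__: removes the per-instance __dict__, shrinking every instance
class PointDict:
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointSlots:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x, self.y = x, y

p, q = PointDict(1, 2), PointSlots(1, 2)
print(f"with __dict__:  {sys.getsizeof(p) + sys.getsizeof(p.__dict__)} bytes")
print(f"with __slots__: {sys.getsizeof(q)} bytes")
```

Note the trade-off: `__slots__` forbids adding attributes not listed in the tuple, and a generator can only be consumed once.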
The Complete Workflow in Practice
```
symptom: "the report generation takes 45 seconds"
  │
  ▼
timeit → confirmed: report() takes 45s per call
  │
  ▼
cProfile → bottleneck: find_duplicates() called 10,000 times
  │
  ▼
line_profiler → bottleneck line: if item in seen_list   ← O(n) lookup
  │
  ▼
fix: change seen_list to seen_set                       ← O(1) lookup
  │
  ▼
timeit → report() now takes 0.8s                        ← 56x faster
  │
  ▼
done — no numpy, no cython needed
```

The most common outcome is that fixing the algorithm or data structure is all that is needed. The escalation path to Cython and C extensions exists for the rare cases where Python genuinely cannot go fast enough: numerical simulations, real-time signal processing, game engines. It is not for typical application code.
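The fix in the walkthrough above can be sketched concretely. `find_duplicates` is a hypothetical function from the narrative, not a real API; both versions below return the same result, only the lookup structure differs:

```python
# ❌ seen is a list: `item in seen` scans the whole list, O(n) per check
def find_duplicates_slow(items):
    seen, dupes = [], []
    for item in items:
        if item in seen:        # O(n) lookup
            dupes.append(item)
        else:
            seen.append(item)
    return dupes

# ✅ seen is a set: `item in seen` is a hash lookup, O(1) per check
def find_duplicates_fast(items):
    seen, dupes = set(), []
    for item in items:
        if item in seen:        # O(1) lookup
            dupes.append(item)
        else:
            seen.add(item)
    return dupes

items = list(range(5_000)) + list(range(2_500))
assert find_duplicates_slow(items) == find_duplicates_fast(items)
print(find_duplicates_fast(items)[:3])  # → [0, 1, 2]
```

The one-line change from `seen.append(item)` on a list to `seen.add(item)` on a set turns a quadratic pass into a linear one, which is exactly the kind of fix the workflow above is designed to surface.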