The optimisation decision process

The decision process below is a structured workflow that ensures you always fix the right thing in the right order, starting with the highest-impact changes and escalating only when necessary:

Is it actually slow?
└── Profile first (cProfile / timeit)
    Found the bottleneck?
    ├── Algorithm or data structure wrong?
    │   → Fix this first: biggest gains
    │     list → set for O(1) lookups
    │     O(n²) → O(n log n) sorting
    ├── Python interpreter overhead?
    │   → built-ins, local variables, generators
    ├── Numerical computation?
    │   → NumPy / SciPy
    ├── Memory usage too high?
    │   → __slots__, generators, lazy evaluation
    └── Still not fast enough?
        ├── Cython: compile Python to C
        ├── Numba: JIT compile numerical code
        └── C extension: write the hot path in C

The order matters: each level delivers less improvement than the one above it but requires more effort. Always exhaust the higher levels before descending.
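The memory level of the tree (`__slots__`, generators, lazy evaluation) can be sketched in a few lines. The `PointDict`/`PointSlots` class names here are illustrative, not from any particular codebase:

```python
import sys

# regular class: every instance carries a per-instance __dict__
class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y

# __slots__: fixed attribute layout, no per-instance __dict__
class PointSlots:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y

p1, p2 = PointDict(1, 2), PointSlots(1, 2)
print(sys.getsizeof(p1) + sys.getsizeof(p1.__dict__))  # instance + its dict
print(sys.getsizeof(p2))                               # slotted instance is smaller

# generators: produce values lazily instead of materialising a list
squares = (x * x for x in range(10**6))  # O(1) memory until consumed
print(next(squares))  # 0
```

Exact byte counts vary by Python version, but the slotted instance is consistently smaller because it never allocates a `__dict__`.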

  1. Before doing anything, establish that there is actually a problem worth solving. A function that takes 0.1s and is called once at startup is not worth optimising. A function that takes 0.001s but is called 10 million times absolutely is.

    import timeit

    def process(data):
        return [x**2 for x in data if x % 2 == 0]

    data = list(range(100_000))

    # step 1 — is it slow? measure first
    t = timeit.timeit(lambda: process(data), number=100)
    print(f"total for 100 runs: {t:.3f}s")
    print(f"average per call: {t/100*1000:.3f}ms")

    # if average per call is acceptable → stop here, do not optimise
    # if average per call is a problem → proceed to step 2
  2. Once you have confirmed slowness, profile to find where the time is going. Never guess:

    import cProfile
    import pstats
    import io

    def outer(data):
        step1 = process(data)   # filtering and squaring
        step2 = sorted(step1)   # sorting
        step3 = sum(step2)      # summing
        return step3

    # profile the full call
    profiler = cProfile.Profile()
    profiler.enable()
    outer(data)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative")
    stats.print_stats(5)
    print(stream.getvalue())

    # output tells you exactly which function is the bottleneck
    # do not guess — read the profiler output
  3. Fix in the right order:

    Algorithm and data structure first — this is always the highest-impact fix:

    # ❌ O(n²) — checking membership in a list
    def find_common(list1, list2):
        return [x for x in list1 if x in list2]    # O(n) scan per element → O(n²)

    # ✅ O(n) — convert to set first
    def find_common_fast(list1, list2):
        set2 = set(list2)                          # O(n) to build
        return [x for x in list1 if x in set2]     # O(1) per lookup

    data1 = list(range(10_000))
    data2 = list(range(5_000, 15_000))

    import timeit
    t1 = timeit.timeit(lambda: find_common(data1, data2), number=10)
    t2 = timeit.timeit(lambda: find_common_fast(data1, data2), number=10)
    print(f"O(n²): {t1:.3f}s")
    print(f"O(n) : {t2:.3f}s")
    print(f"speedup: {t1/t2:.1f}x")

    # a 10,000-element list → ~100x speedup
    # no numpy, no cython, no C — just the right data structure

    Python overhead second — once the algorithm is right, reduce interpreter overhead:

    import math

    # ❌ global lookup on every iteration
    def slow(data):
        return [math.sqrt(x) for x in data]

    # ✅ cache as a local variable
    def fast(data):
        sqrt = math.sqrt
        return [sqrt(x) for x in data]
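    The decision tree also lists built-ins and generators at this level. A minimal sketch (the function names are illustrative): built-ins run their loops in C, and generator expressions avoid materialising intermediate lists.

    ```python
    # ❌ manual accumulation in interpreted bytecode
    def manual_sum(data):
        total = 0
        for x in data:
            total += x
        return total

    # ✅ built-in sum runs the loop in C
    def builtin_sum(data):
        return sum(data)

    # ✅ generator expression: no intermediate list is built
    def sum_of_squares(data):
        return sum(x * x for x in data)

    data = list(range(1_000))
    print(manual_sum(data), builtin_sum(data))  # 499500 499500
    print(sum_of_squares(data))                 # 332833500
    ```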

    NumPy third — when the operation is numerical and the data is large:

    import numpy as np
    # ✅ vectorized — bypasses interpreter entirely
    arr = np.array(data, dtype=np.float64)
    result = np.sqrt(arr) # single C call
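    When switching to NumPy it is worth confirming the vectorized result matches the pure-Python version; a quick self-contained check (the data here is a fresh range, not the earlier `data` variable):

    ```python
    import math
    import numpy as np

    data = list(range(1, 10_001))

    # pure-Python baseline
    py_result = [math.sqrt(x) for x in data]

    # vectorized version: one call into NumPy's C loop
    arr = np.array(data, dtype=np.float64)
    np_result = np.sqrt(arr)

    # same answers, element for element
    assert np.allclose(np_result, py_result)
    print(float(np_result[0]))  # 1.0
    ```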

    Escalate only if still not enough — Cython, Numba, and C extensions are powerful but add significant complexity. They should only be reached when the above steps have been exhausted and measured:

    # Numba — JIT compile a numerical function
    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def numba_squares(arr):
        result = np.empty(len(arr))
        for i in range(len(arr)):
            result[i] = arr[i] ** 2
        return result

    # first call compiles — subsequent calls run at C speed
A typical investigation, end to end:

    symptom: "the report generation takes 45 seconds"
    timeit        → confirmed: report() takes 45s per call
    cProfile      → bottleneck: find_duplicates() called 10,000 times
    line_profiler → bottleneck line: if item in seen_list  ← O(n) lookup
    fix: change seen_list to seen_set                      ← O(1) lookup
    timeit        → report() now takes 0.8s                ← 56x faster
    done — no numpy, no cython needed
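The fix in that trace can be sketched in isolation. This `find_duplicates` is a reconstruction for illustration, not the original code from the report generator:

```python
# ❌ before: list membership is O(n), so the loop is O(n²)
def find_duplicates_slow(items):
    seen_list = []
    dupes = []
    for item in items:
        if item in seen_list:   # scans the whole list every time
            dupes.append(item)
        else:
            seen_list.append(item)
    return dupes

# ✅ after: set membership is O(1), so the loop is O(n)
def find_duplicates_fast(items):
    seen_set = set()
    dupes = []
    for item in items:
        if item in seen_set:    # hash lookup, constant time
            dupes.append(item)
        else:
            seen_set.add(item)
    return dupes

items = [1, 2, 3, 2, 4, 1]
print(find_duplicates_slow(items))  # [2, 1]
print(find_duplicates_fast(items))  # [2, 1]
```

The two functions return identical results; only the cost of the membership test changes, which is exactly the one-line fix the trace above describes.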

The most common outcome is that fixing the algorithm or data structure is all that is needed. The escalation path to Cython and C extensions exists for the rare cases where Python genuinely cannot go fast enough: numerical simulations, real-time signal processing, game engines. It is not for typical application code.