Performance optimisation patterns
What is Performance Optimisation?
Profiling tells you where the problem is; these patterns tell you how to fix it. They are not micro-optimisations to apply blindly everywhere, but well-established techniques that consistently make a measurable difference in Python.
Each one has a clear reason behind it; understanding the why makes you reach for the right tool naturally rather than guessing.
Use Built-ins — They Are Implemented in C
Python's built-in functions are not written in Python; they are implemented directly in C, which means they execute without the overhead of the Python interpreter loop. A manual for loop has to interpret each bytecode instruction one by one, while sum(), min(), max() and friends run as a single native C call:
```
manual loop                       built-in sum()
───────────                       ──────────────
Python bytecode                   single C call
───────────────                   ─────────────
LOAD total                        sum(data) ──► [C code runs at native speed]
LOAD x                                          returns result
BINARY_ADD
STORE total
JUMP_BACKWARD
... × 10,000 iterations

interpreter overhead              no interpreter overhead
on every iteration                one call, done
```
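The gap is easy to verify with timeit; a quick sketch (absolute timings vary by machine and Python version):

```python
import timeit

data = list(range(10_000))

def manual_sum(values):
    # interpreted bytecode: several instructions dispatched per iteration
    total = 0
    for x in values:
        total += x
    return total

# sum() performs the same loop inside a single native C call
print(timeit.timeit(lambda: manual_sum(data), number=1_000))
print(timeit.timeit(lambda: sum(data), number=1_000))
```

Both compute the same result; only the dispatch cost differs.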
f-string is faster than manual concatenation
An f-string is not just a style preference; it is also faster than manual concatenation with +, because Python evaluates it in a single pass, formatting and building the string in one operation.
With + concatenation Python has to:
"Name: " + name + ", Age: " + str(age)
step 1: "Name: " + "Alice" → "Name: Alice" ← new string step 2: "Name: Alice" + ", Age: " → "Name: Alice, Age: " ← new string step 3: "Name: Alice, Age: " + "30" → "Name: Alice, Age: 30" ← new string
3 intermediate allocations note: str(age) is also an extra call to convert int to stringWith an f-string Python does it in one shot:
f"Name: {name}, Age: {age}"
step 1: scan the template once step 2: measure total length needed step 3: allocate once step 4: fill in values directly
1 allocation — no intermediate strings
note: {age} converts int to string automatically — no str() needed

```python
name = "Alice"
age = 30

# ❌ manual concatenation — 3 allocations, requires explicit str()
s = "Name: " + name + ", Age: " + str(age)

# ✅ f-string — 1 allocation, converts automatically, reads naturally
s = f"Name: {name}, Age: {age}"
```

The readability gain alone is worth it, but the performance improvement is a bonus that compounds when f-strings are used inside loops or called frequently.
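The claim is easy to check with timeit; a minimal sketch (timings are indicative, and both expressions build the exact same string):

```python
import timeit

name = "Alice"
age = 30

t_concat = timeit.timeit(lambda: "Name: " + name + ", Age: " + str(age),
                         number=500_000)
t_fstring = timeit.timeit(lambda: f"Name: {name}, Age: {age}",
                          number=500_000)

print(f"concat:   {t_concat:.3f}s")
print(f"f-string: {t_fstring:.3f}s")
```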
Data Structure Choice — List vs Set vs Dict
The choice of data structure has a dramatic effect on lookup performance. The difference comes from how each structure finds an element.
The diagram below shows why the choice matters so dramatically; it is not a matter of implementation quality but of fundamental algorithmic difference. A list has no choice but to scan every element one by one until it finds a match. A set and a dict jump directly to the answer using a hash, so the size of the collection is irrelevant:
```
list — O(n) lookup
──────────────────
needle in [0, 1, 2, 3, ..., 999_999]

checks 0       → not it
checks 1       → not it
checks 2       → not it
...
checks 999_999 → found!   ← scanned every element in worst case

set — O(1) lookup
─────────────────
needle in {0, 1, 2, 3, ..., 999_999}

hash(needle) → bucket 47382 → found!   ← direct jump, no scanning

dict — O(1) lookup by key
─────────────────────────
needle in {0: 0, 1: 1, 2: 2, ..., 999_999: 999_999}

hash(needle) → bucket 47382 → found!   ← same as set
```

The code below confirms what the diagram shows with actual measurements. needle = 999_999 is deliberately chosen as the worst case for the list: it is the last element, forcing a full scan of all one million items before finding it. The set and dict are unaffected by where the value sits; their lookup time is the same whether the value is the first or the last element:
```python
import timeit

needle = 999_999  # worst case for list — last element

lst = list(range(1_000_000))
st = set(range(1_000_000))
d = {i: i for i in range(1_000_000)}

print(timeit.timeit(lambda: needle in lst, number=100))  # ~2.5s    O(n)
print(timeit.timeit(lambda: needle in st, number=100))   # ~0.0001s O(1)
print(timeit.timeit(lambda: needle in d, number=100))    # ~0.0001s O(1)
```
Local Variables Are Faster Than Global Lookups
Python resolves names by searching through a chain of namespaces in order: locals first, then globals, then built-ins. A local variable is found immediately at the first stop. A global lookup like math.sqrt requires two steps: find math in the global namespace, then find sqrt inside the math module.
In normal code this difference is negligible, but inside a tight loop that runs thousands of times the cost of those extra lookups adds up on every single iteration.
The fix is simple: cache the function reference in a local variable before the loop. Python finds local variables faster than globals because they are stored in a fixed-size array accessed by index, while globals are stored in a dictionary that requires a hash lookup every time.
The diagram below shows exactly what Python is doing on each iteration in both cases:
```
❌ global lookup — two namespace searches on every iteration
────────────────────────────────────────────────────────────
for i in range(10_000):
    result += math.sqrt(i)

iteration 1:
  1. search locals  → "math" not found
  2. search globals → "math" found ✅   [ math module ]
  3. search math    → "sqrt" found ✅   [ sqrt function ]
  4. call sqrt(i)
  5. add to result

iteration 2:
  1. search locals  → "math" not found
  2. search globals → "math" found ✅   [ math module ]
  3. search math    → "sqrt" found ✅   [ sqrt function ]
  4. call sqrt(i)
  5. add to result

... × 10,000 iterations

every iteration pays the cost of steps 1, 2, 3
10,000 × 3 lookups = 30,000 namespace searches

✅ local variable — one lookup before the loop, fast access inside
──────────────────────────────────────────────────────────────────
sqrt = math.sqrt        ← steps 1, 2, 3 happen ONCE here
                          sqrt now lives in locals as a direct reference

for i in range(10_000):
    result += sqrt(i)

iteration 1:
  1. search locals → "sqrt" found immediately ✅
  2. call sqrt(i)
  3. add to result

iteration 2:
  1. search locals → "sqrt" found immediately ✅
  2. call sqrt(i)
  3. add to result

... × 10,000 iterations

every iteration pays the cost of step 1 only
10,000 × 1 lookup = 10,000 namespace searches

──────────────────────────────────────────────────────────────────
global: 30,000 namespace searches
local:  10,000 namespace searches
        + 3 searches once upfront
        ──────────────────────────
        10,003 total   ← ~3x fewer lookups
```

Measuring the execution time:
```python
import math
import timeit

# ❌ global lookup on every iteration
def slow():
    result = 0
    for i in range(10_000):
        result += math.sqrt(i)
    return result

# ✅ cache as local variable — one lookup stored locally
def fast():
    sqrt = math.sqrt  # one lookup
    result = 0
    for i in range(10_000):
        result += sqrt(i)  # local lookup — faster
    return result

print(timeit.timeit(slow, number=1_000))  # ~1.2s
print(timeit.timeit(fast, number=1_000))  # ~0.8s  ← ~33% faster
```
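The same caching trick applies to attribute lookups on objects, not just module globals. Caching a bound method such as a list's append before a hot loop skips the attribute lookup on every iteration; a sketch along the same lines:

```python
import timeit

def slow():
    out = []
    for i in range(10_000):
        out.append(i)  # attribute lookup of "append" on every iteration
    return out

def fast():
    out = []
    append = out.append  # one attribute lookup, cached as a local
    for i in range(10_000):
        append(i)        # plain local-variable call
    return out

print(timeit.timeit(slow, number=1_000))
print(timeit.timeit(fast, number=1_000))
```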
Generator vs List — Iterate once, allocate once
When you only need to consume the results once — passing them directly into sum(), max(), or a for loop — building a full list in memory first is pure waste. You are paying the cost of storing every value just to read each one once and discard it.
A generator produces one value at a time and discards it immediately:
```
list comprehension                  generator expression
──────────────────                  ────────────────────
[x**2 for x in data]                (x**2 for x in data)

allocates all 1M values             allocates nothing upfront
before sum() even starts            produces one value at a time

Heap                                Heap
────                                ────
[ 0, 1, 4, 9, ..., 999_999² ]       [ generator object ] ~200 bytes
~8 MB                               computes each x**2 on demand

sum() reads all 1M values           sum() pulls one value at a time
from the list                       value is used and discarded
```

In Python code:
```python
import sys

data = range(1_000_000)

# ❌ list — builds 1M values in memory before summing
total_list = sum([x**2 for x in data])

# ✅ generator — produces one value at a time, no list needed
total_gen = sum(x**2 for x in data)

# same result — very different memory cost
lst = [x**2 for x in data]
gen = (x**2 for x in data)

print(sys.getsizeof(lst))  # ~8 MB
print(sys.getsizeof(gen))  # ~200 bytes
```

Note that both produce identical results; the generator is not an approximation. The only difference is that the list holds all values in memory simultaneously while the generator produces them one at a time and immediately discards each one after use.
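Generator expressions also chain into pipelines without materialising intermediate lists; each stage pulls one value at a time from the previous one. A small sketch:

```python
data = range(1_000_000)

# nothing is computed yet: these are lazy pipelines, not lists
squares = (x * x for x in data)
even_squares = (s for s in squares if s % 2 == 0)

# max() drives the whole pipeline one value at a time
print(max(even_squares))
```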
Summary
| Pattern | Why it helps | Typical gain |
|---|---|---|
| Built-ins over loops | C implementation, no interpreter overhead | 5–10x |
| join() over += | Single allocation vs O(n²) copies | 10–100x |
| Set/dict over list | O(1) vs O(n) lookup | 1000x+ for large collections |
| Local over global | Fewer namespace lookups per iteration | 10–33% |
| Generator over list | One value at a time vs full allocation | 40x less memory |
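One row in the table, join() over +=, is not demonstrated above; a sketch of the pattern (timings indicative). Each += in a loop can copy the entire string built so far, while "".join() measures the final length once, allocates once, and fills it in a single pass:

```python
import timeit

parts = [str(i) for i in range(10_000)]

# ❌ += in a loop: may copy the whole accumulated string on each step
def build_with_concat():
    s = ""
    for p in parts:
        s += p
    return s

# ✅ join: one pass to measure, one allocation, one pass to fill
def build_with_join():
    return "".join(parts)

print(timeit.timeit(build_with_concat, number=100))
print(timeit.timeit(build_with_join, number=100))
```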