
Performance optimisation patterns

Profiling tells you where the problem is; these patterns tell you how to fix it. They are not micro-optimisations to apply blindly everywhere, but well-established techniques that consistently make a measurable difference in Python.

Each one has a clear reason behind it, and understanding the why makes you reach for the right tool naturally rather than guessing.

Use Built-ins — They Are Implemented in C


Python’s built-in functions are not written in Python; they are implemented directly in C, which means they execute without the overhead of the Python interpreter loop. A manual for loop in Python has to interpret each bytecode instruction one by one, while sum(), min(), max() and friends run as a single native C call:

manual loop                          built-in sum()
───────────                          ──────────────
Python bytecode                      single C call

LOAD total                           sum(data) ──► [C code runs at native speed]
LOAD x                                         ◄── returns result
BINARY_ADD
STORE total
JUMP_BACKWARD
... × 10,000 iterations

interpreter overhead                 no interpreter overhead
on every iteration                   one call, done
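The claim is easy to check with timeit; here is a minimal sketch (absolute timings vary by machine, but sum() comes out several times faster on any CPython build):

```python
import timeit

data = list(range(10_000))

def manual_sum(values):
    # every iteration runs several bytecode instructions in the interpreter
    total = 0
    for x in values:
        total += x
    return total

# same result either way: sum() just runs the same loop in C
assert manual_sum(data) == sum(data)

print(f"manual loop: {timeit.timeit(lambda: manual_sum(data), number=1_000):.3f}s")
print(f"sum():       {timeit.timeit(lambda: sum(data), number=1_000):.3f}s")
```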

f-string is faster than manual concatenation


An f-string is not just a style preference; it is also faster than manual concatenation with + because Python evaluates it in a single pass, formatting and building the string in one operation.

With + concatenation Python has to:

"Name: " + name + ", Age: " + str(age)

step 1: "Name: " + "Alice"           → "Name: Alice"             (new string)
step 2: "Name: Alice" + ", Age: "    → "Name: Alice, Age: "      (new string)
step 3: "Name: Alice, Age: " + "30"  → "Name: Alice, Age: 30"    (new string)

3 allocations in total; the first two are intermediate strings built only to be discarded
note: str(age) is also an extra call to convert the int to a string

With an f-string Python does it in one shot:

f"Name: {name}, Age: {age}"

step 1: scan the template once
step 2: measure the total length needed
step 3: allocate once
step 4: fill in the values directly

1 allocation, no intermediate strings
note: {age} converts the int to a string automatically, no str() needed
Manual concatenation vs. f-string
name = "Alice"
age = 30
# ❌ manual concatenation — 3 allocations, requires explicit str()
s = "Name: " + name + ", Age: " + str(age)
# ✅ f-string — 1 allocation, converts automatically, reads naturally
s = f"Name: {name}, Age: {age}"

The readability gain alone is worth it, but the performance improvement is a bonus that compounds when f-strings are used inside loops or called frequently.
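That compounding effect can be sketched with timeit; the exact ratio depends on the Python version and machine, but both lambdas build exactly the same string:

```python
import timeit

name, age = "Alice", 30

# ❌ three concatenations plus an explicit str() call per run
concat = timeit.timeit(lambda: "Name: " + name + ", Age: " + str(age),
                       number=1_000_000)
# ✅ one pass, one allocation per run
fstring = timeit.timeit(lambda: f"Name: {name}, Age: {age}",
                        number=1_000_000)

print(f"concatenation: {concat:.3f}s")
print(f"f-string:      {fstring:.3f}s")

# identical output, different cost
assert "Name: " + name + ", Age: " + str(age) == f"Name: {name}, Age: {age}"
```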

Data Structure Choice — List vs Set vs Dict


The choice of data structure has a dramatic effect on lookup performance. The difference comes from how each structure finds an element.

The diagram below shows why the data structure choice matters so dramatically: it is not a matter of implementation quality but of fundamental algorithmic difference. A list has no choice but to scan every element one by one until it finds a match.

A set and a dict jump directly to the answer using a hash; the size of the collection is irrelevant:

list: O(n) lookup
──────────────────
needle in [0, 1, 2, 3, ..., 999_999]

checks 0        → not it
checks 1        → not it
checks 2        → not it
...
checks 999_999  → found!   (scanned every element in the worst case)

set: O(1) lookup
──────────────────
needle in {0, 1, 2, 3, ..., 999_999}

hash(needle) → bucket 47382 → found!   (direct jump, no scanning)

dict: O(1) lookup by key
──────────────────────────
needle in {0: 0, 1: 1, 2: 2, ..., 999_999: 999_999}

hash(needle) → bucket 47382 → found!   (same mechanism as the set)

The code below confirms what the diagram shows with actual measurements. needle = 999_999 is deliberately chosen as the worst case for the list: it is the last element, forcing a full scan of all one million items before finding it.

The set and dict are unaffected by where the value sits; their lookup time is the same whether the value is the first or the last element:

import timeit
needle = 999_999 # worst case for list — last element
lst = list(range(1_000_000))
st = set(range(1_000_000))
d = {i: i for i in range(1_000_000)}
print(timeit.timeit(lambda: needle in lst, number=100)) # ~2.5s O(n)
print(timeit.timeit(lambda: needle in st, number=100)) # ~0.0001s O(1)
print(timeit.timeit(lambda: needle in d, number=100)) # ~0.0001s O(1)

Local Variables Are Faster Than Global Lookups


Python resolves names by searching a chain of namespaces in order: locals first, then globals, then built-ins. A local variable is found immediately at the first stop. A global lookup like math.sqrt requires two steps: find math in the global namespace, then find sqrt inside the math module.

In normal code this difference is negligible, but inside a tight loop that runs thousands of times the cost of those extra lookups adds up on every single iteration.

The fix is simple: cache the function reference in a local variable before the loop. Python finds local variables faster than globals because they are stored in a fixed-size array accessed by index, while globals are stored in a dictionary that requires a hash lookup every time.

The diagram below shows exactly what Python is doing on each iteration in both cases:

Global vs. local lookups

❌ global lookup on every iteration

    for i in range(10_000):
        result += math.sqrt(i)

    iteration 1: find "math" in globals → find "sqrt" in math → call
    iteration 2: find "math" in globals → find "sqrt" in math → call
    iteration 3: find "math" in globals → find "sqrt" in math → call
    ... × 10,000

✅ local variable — one lookup, stored once

    sqrt = math.sqrt              ← one lookup here
    for i in range(10_000):
        result += sqrt(i)

    iteration 1: find "sqrt" in locals → call
    iteration 2: find "sqrt" in locals → call
    ... × 10,000                  (local lookup is faster than global lookup)
What Python does on each iteration

global lookup: three namespace searches on every iteration
──────────────────────────────────────────────────────────────
for i in range(10_000):
    result += math.sqrt(i)

iteration 1:
  1. search locals   → "math" not found
  2. search globals  → "math" found    [ math module ]
  3. search math     → "sqrt" found    [ sqrt function ]
  4. call sqrt(i)
  5. add to result
iteration 2:
  1. search locals   → "math" not found
  2. search globals  → "math" found    [ math module ]
  3. search math     → "sqrt" found    [ sqrt function ]
  4. call sqrt(i)
  5. add to result
... × 10,000 iterations

every iteration pays the cost of steps 1, 2, 3
10,000 × 3 lookups = 30,000 namespace searches

local variable: one lookup before the loop, fast access inside
──────────────────────────────────────────────────────────────────
sqrt = math.sqrt              ← steps 1, 2, 3 happen ONCE here;
                                sqrt now lives in locals as a direct reference
for i in range(10_000):
    result += sqrt(i)

iteration 1:
  1. search locals → "sqrt" found immediately
  2. call sqrt(i)
  3. add to result
iteration 2:
  1. search locals → "sqrt" found immediately
  2. call sqrt(i)
  3. add to result
... × 10,000 iterations

every iteration pays the cost of step 1 only
10,000 × 1 lookup = 10,000 namespace searches
──────────────────────────────────────────────────────────────────
global: 30,000 namespace searches
local:  10,000 namespace searches + 3 searches once upfront
        = 10,003 total            → ~3× fewer lookups

Measuring the execution time:

import math
import timeit

# ❌ global lookup on every iteration
def slow():
    result = 0
    for i in range(10_000):
        result += math.sqrt(i)
    return result

# ✅ cache as local variable — one lookup stored locally
def fast():
    sqrt = math.sqrt  # one lookup
    result = 0
    for i in range(10_000):
        result += sqrt(i)  # local lookup — faster
    return result

print(timeit.timeit(slow, number=1_000))  # ~1.2s
print(timeit.timeit(fast, number=1_000))  # ~0.8s ← ~33% faster

Generator vs List — Iterate once, allocate once

Section titled “Generator vs List — Iterate once, allocate once”

When you only need to consume the results once (passing them directly into sum(), max(), or a for loop), building a full list in memory first is pure waste. You are paying the cost of storing every value just to read each one once and discard it.

A generator produces one value at a time and discards it immediately:

list comprehension                     generator expression
──────────────────                     ───────────────────
[x**2 for x in data]                   (x**2 for x in data)

allocates all 1M values                allocates nothing upfront,
before sum() even starts               produces one value at a time

Heap                                   Heap
────                                   ────
[ 0, 1, 4, 9, ..., 999_999² ]          [ generator object ]   ~200 bytes
~8 MB                                  computes each x**2 on demand

sum() reads all 1M values              sum() pulls one value at a time;
from the list                          each value is used and discarded

In Python code:

import sys
data = range(1_000_000)
# ❌ list — builds 1M values in memory before summing
total_list = sum([x**2 for x in data])
# ✅ generator — produces one value at a time, no list needed
total_gen = sum(x**2 for x in data)
# same result — very different memory cost
lst = [x**2 for x in data]
gen = (x**2 for x in data)
print(sys.getsizeof(lst)) # ~8 MB
print(sys.getsizeof(gen)) # ~200 bytes

Note that both produce identical results; the generator is not an approximation. The only difference is that the list holds all values in memory simultaneously while the generator produces them one at a time and immediately discards each one after use.
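The "iterate once" part of the heading is worth making concrete: a generator is exhausted after a single pass, so the list remains the right choice when you need the values more than once. A minimal sketch:

```python
data = range(10)

gen = (x**2 for x in data)
assert sum(gen) == 285   # first pass consumes every value
assert sum(gen) == 0     # second pass: the generator is already exhausted

lst = [x**2 for x in data]
assert sum(lst) == 285   # a list can be read as many times as needed
assert sum(lst) == 285
```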

Pattern                 Why it helps                                  Typical gain
────────────────────    ──────────────────────────────────────────    ────────────────────────────
Built-ins over loops    C implementation, no interpreter overhead     5–10x
join() over +=          Single allocation vs O(n²) copies             10–100x
Set/dict over list      O(1) vs O(n) lookup                           1000x+ for large collections
Local over global       Fewer namespace lookups per iteration         10–33%
Generator over list     One value at a time vs full allocation        40x less memory
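The join() row is the only pattern in the table not demonstrated above; a minimal sketch of the idea looks like this (the quadratic cost appears when a string is built up piece by piece with + in a loop):

```python
parts = [str(i) for i in range(10_000)]

# ❌ concatenation in a loop: each step can copy everything built so far,
#    which is O(n²) work in the general case (CPython can sometimes
#    optimise this in place, but you should not rely on it)
slow = ""
for p in parts:
    slow = slow + p + ","

# ✅ join(): measures the total length once, allocates once, copies once
fast = ",".join(parts) + ","

# both build exactly the same string
assert slow == fast
```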