
Performance optimisation patterns

Profiling tells you where the problem is; these patterns tell you how to fix it. They are not micro-optimisations to apply blindly everywhere, but well-established techniques that consistently make a measurable difference in Python.

Each one has a clear reason behind it, and understanding the why makes you reach for the right tool naturally rather than guessing.

Use Built-ins — They Are Implemented in C


Python’s built-in functions are not written in Python; they are implemented directly in C, which means they execute without the overhead of the Python interpreter loop. A manual for loop in Python has to interpret each bytecode instruction one by one, while sum(), min(), max() and friends run as a single native C call:

manual loop                          built-in sum()
───────────                          ──────────────
Python bytecode                      single C call

LOAD total                           sum(data) ──► [C code runs at native speed]
LOAD x                                         ◄── returns result
BINARY_ADD
STORE total
JUMP_BACKWARD
... × 10,000 iterations

interpreter overhead                 no interpreter overhead
on every iteration                   one call, done
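The claim is easy to check with timeit; here is a minimal sketch (absolute timings vary by machine, but sum() comes out several times faster on any CPython build):

```python
import timeit

data = list(range(10_000))

def manual_sum(values):
    # every iteration runs several bytecode instructions in the interpreter
    total = 0
    for x in values:
        total += x
    return total

# same result either way: sum() just runs the same loop in C
assert manual_sum(data) == sum(data)

print(f"manual loop: {timeit.timeit(lambda: manual_sum(data), number=1_000):.3f}s")
print(f"sum():       {timeit.timeit(lambda: sum(data), number=1_000):.3f}s")
```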

f-string is faster than manual concatenation


An f-string is not just a style preference; it is also faster than manual concatenation with + because Python evaluates it in a single pass, formatting and building the string in one operation.

With + concatenation Python has to:

"Name: " + name + ", Age: " + str(age)

step 1: "Name: " + "Alice"           → "Name: Alice"             (new string)
step 2: "Name: Alice" + ", Age: "    → "Name: Alice, Age: "      (new string)
step 3: "Name: Alice, Age: " + "30"  → "Name: Alice, Age: 30"    (new string)

3 allocations in total; the first two are intermediate strings built only to be discarded
note: str(age) is also an extra call to convert the int to a string

With an f-string Python does it in one shot:

f"Name: {name}, Age: {age}"

step 1: scan the template once
step 2: measure the total length needed
step 3: allocate once
step 4: fill in the values directly

1 allocation, no intermediate strings
note: {age} converts the int to a string automatically, no str() needed
Manual concatenation vs. f-string
name = "Alice"
age = 30
# ❌ manual concatenation — 3 allocations, requires explicit str()
s = "Name: " + name + ", Age: " + str(age)
# ✅ f-string — 1 allocation, converts automatically, reads naturally
s = f"Name: {name}, Age: {age}"

The readability gain alone is worth it, but the performance improvement is a bonus that compounds when f-strings are used inside loops or called frequently.
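That compounding effect can be sketched with timeit; the exact ratio depends on the Python version and machine, but both lambdas build exactly the same string:

```python
import timeit

name, age = "Alice", 30

# ❌ three concatenations plus an explicit str() call per run
concat = timeit.timeit(lambda: "Name: " + name + ", Age: " + str(age),
                       number=1_000_000)
# ✅ one pass, one allocation per run
fstring = timeit.timeit(lambda: f"Name: {name}, Age: {age}",
                        number=1_000_000)

print(f"concatenation: {concat:.3f}s")
print(f"f-string:      {fstring:.3f}s")

# identical output, different cost
assert "Name: " + name + ", Age: " + str(age) == f"Name: {name}, Age: {age}"
```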

Data Structure Choice — List vs Set vs Dict


The choice of data structure has a dramatic effect on lookup performance. The difference comes from how each structure finds an element.

The diagram below shows why the data structure choice matters so dramatically: it is not a matter of implementation quality but of fundamental algorithmic difference. A list has no choice but to scan every element one by one until it finds a match.

A set and a dict jump directly to the answer using a hash; the size of the collection is irrelevant:

list: O(n) lookup
──────────────────
needle in [0, 1, 2, 3, ..., 999_999]

checks 0        → not it
checks 1        → not it
checks 2        → not it
...
checks 999_999  → found!   (scanned every element in the worst case)

set: O(1) lookup
──────────────────
needle in {0, 1, 2, 3, ..., 999_999}

hash(needle) → bucket 47382 → found!   (direct jump, no scanning)

dict: O(1) lookup by key
──────────────────────────
needle in {0: 0, 1: 1, 2: 2, ..., 999_999: 999_999}

hash(needle) → bucket 47382 → found!   (same mechanism as the set)

The code below confirms what the diagram shows with actual measurements. needle = 999_999 is deliberately chosen as the worst case for the list: it is the last element, forcing a full scan of all one million items before finding it.

The set and dict are unaffected by where the value sits; their lookup time is the same whether the value is the first or the last element:

import timeit
needle = 999_999 # worst case for list — last element
lst = list(range(1_000_000))
st = set(range(1_000_000))
d = {i: i for i in range(1_000_000)}
print(timeit.timeit(lambda: needle in lst, number=100)) # ~2.5s O(n)
print(timeit.timeit(lambda: needle in st, number=100)) # ~0.0001s O(1)
print(timeit.timeit(lambda: needle in d, number=100)) # ~0.0001s O(1)

Local Variables Are Faster Than Global Lookups


Python resolves names by searching a chain of namespaces in order: locals first, then globals, then built-ins. A local variable is found immediately at the first stop. A global lookup like math.sqrt requires two steps: find math in the global namespace, then find sqrt inside the math module.

In normal code this difference is negligible, but inside a tight loop that runs thousands of times the cost of those extra lookups adds up on every single iteration.

The fix is simple: cache the function reference in a local variable before the loop. Python finds local variables faster than globals because they are stored in a fixed-size array accessed by index, while globals are stored in a dictionary that requires a hash lookup every time.

The diagram below shows exactly what Python is doing on each iteration in both cases:

Global vs. local lookups

❌ global lookup on every iteration

    for i in range(10_000):
        result += math.sqrt(i)

    iteration 1: find "math" in globals → find "sqrt" in math → call
    iteration 2: find "math" in globals → find "sqrt" in math → call
    iteration 3: find "math" in globals → find "sqrt" in math → call
    ... × 10,000

✅ local variable — one lookup, stored once

    sqrt = math.sqrt              ← one lookup here
    for i in range(10_000):
        result += sqrt(i)

    iteration 1: find "sqrt" in locals → call
    iteration 2: find "sqrt" in locals → call
    ... × 10,000                  (local lookup is faster than global lookup)
What Python does on each iteration

global lookup: three namespace searches on every iteration
──────────────────────────────────────────────────────────────
for i in range(10_000):
    result += math.sqrt(i)

iteration 1:
  1. search locals   → "math" not found
  2. search globals  → "math" found    [ math module ]
  3. search math     → "sqrt" found    [ sqrt function ]
  4. call sqrt(i)
  5. add to result
iteration 2:
  1. search locals   → "math" not found
  2. search globals  → "math" found    [ math module ]
  3. search math     → "sqrt" found    [ sqrt function ]
  4. call sqrt(i)
  5. add to result
... × 10,000 iterations

every iteration pays the cost of steps 1, 2, 3
10,000 × 3 lookups = 30,000 namespace searches

local variable: one lookup before the loop, fast access inside
──────────────────────────────────────────────────────────────────
sqrt = math.sqrt              ← steps 1, 2, 3 happen ONCE here;
                                sqrt now lives in locals as a direct reference
for i in range(10_000):
    result += sqrt(i)

iteration 1:
  1. search locals → "sqrt" found immediately
  2. call sqrt(i)
  3. add to result
iteration 2:
  1. search locals → "sqrt" found immediately
  2. call sqrt(i)
  3. add to result
... × 10,000 iterations

every iteration pays the cost of step 1 only
10,000 × 1 lookup = 10,000 namespace searches
──────────────────────────────────────────────────────────────────
global: 30,000 namespace searches
local:  10,000 namespace searches + 3 searches once upfront
        = 10,003 total            → ~3× fewer lookups

Measuring the execution time:

import math
import timeit

# ❌ global lookup on every iteration
def slow():
    result = 0
    for i in range(10_000):
        result += math.sqrt(i)
    return result

# ✅ cache as local variable — one lookup stored locally
def fast():
    sqrt = math.sqrt  # one lookup
    result = 0
    for i in range(10_000):
        result += sqrt(i)  # local lookup — faster
    return result

print(timeit.timeit(slow, number=1_000))  # ~1.2s
print(timeit.timeit(fast, number=1_000))  # ~0.8s ← ~33% faster

Generator vs List — Iterate once, allocate once

Section titled “Generator vs List — Iterate once, allocate once”

When you only need to consume the results once (passing them directly into sum(), max(), or a for loop), building a full list in memory first is pure waste. You are paying the cost of storing every value just to read each one once and discard it.

A generator produces one value at a time and discards it immediately:

list comprehension                     generator expression
──────────────────                     ───────────────────
[x**2 for x in data]                   (x**2 for x in data)

allocates all 1M values                allocates nothing upfront,
before sum() even starts               produces one value at a time

Heap                                   Heap
────                                   ────
[ 0, 1, 4, 9, ..., 999_999² ]          [ generator object ]   ~200 bytes
~8 MB                                  computes each x**2 on demand

sum() reads all 1M values              sum() pulls one value at a time;
from the list                          each value is used and discarded

In Python code:

import sys
data = range(1_000_000)
# ❌ list — builds 1M values in memory before summing
total_list = sum([x**2 for x in data])
# ✅ generator — produces one value at a time, no list needed
total_gen = sum(x**2 for x in data)
# same result — very different memory cost
lst = [x**2 for x in data]
gen = (x**2 for x in data)
print(sys.getsizeof(lst)) # ~8 MB
print(sys.getsizeof(gen)) # ~200 bytes

Note that both produce identical results; the generator is not an approximation. The only difference is that the list holds all values in memory simultaneously while the generator produces them one at a time and immediately discards each one after use.
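The "iterate once" part of the heading is worth making concrete: a generator is exhausted after a single pass, so the list remains the right choice when you need the values more than once. A minimal sketch:

```python
data = range(10)

gen = (x**2 for x in data)
assert sum(gen) == 285   # first pass consumes every value
assert sum(gen) == 0     # second pass: the generator is already exhausted

lst = [x**2 for x in data]
assert sum(lst) == 285   # a list can be read as many times as needed
assert sum(lst) == 285
```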

Pattern                 Why it helps                                  Typical gain
────────────────────    ──────────────────────────────────────────    ────────────────────────────
Built-ins over loops    C implementation, no interpreter overhead     5–10x
join() over +=          Single allocation vs O(n²) copies             10–100x
Set/dict over list      O(1) vs O(n) lookup                           1000x+ for large collections
Local over global       Fewer namespace lookups per iteration         10–33%
Generator over list     One value at a time vs full allocation        40x less memory
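The join() row is the only pattern in the table not demonstrated above; a minimal sketch of the idea looks like this (the quadratic cost appears when a string is built up piece by piece with + in a loop):

```python
parts = [str(i) for i in range(10_000)]

# ❌ concatenation in a loop: each step can copy everything built so far,
#    which is O(n²) work in the general case (CPython can sometimes
#    optimise this in place, but you should not rely on it)
slow = ""
for p in parts:
    slow = slow + p + ","

# ✅ join(): measures the total length once, allocates once, copies once
fast = ",".join(parts) + ","

# both build exactly the same string
assert slow == fast
```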