
Profiling - Finding bottlenecks

timeit is excellent for comparing small, isolated snippets, but real performance problems rarely live in a single line. They hide inside functions, loops, and data structures spread across your codebase.

Python offers three levels of profiling:

  • cProfile — function level, built into the standard library, measures how long each function takes and how many times it is called
  • line_profiler — line level, third party, shows exactly which line inside a function is the bottleneck
  • tracemalloc — memory level, built into the standard library, shows which lines allocate the most memory

cProfile instruments your code and records every function call, how many times each function was called, how long it took in total, and how long it took excluding calls to other functions:

A slow, inefficient function:

import cProfile
import pstats
import io

def slow_function():
    total = 0
    # add the same number 10,000 times
    for i in range(10_000):
        # i is never used inside the loop!
        total += sum(x**2 for x in range(100))
    return total

# create a profiler and wrap the code to measure
profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()

# format and print the results
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative")  # sort by cumulative time — most useful
stats.print_stats(10)           # show only the top 10 functions
print(stream.getvalue())
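Since Python 3.8, Profile also works as a context manager, which makes the enable()/disable() pair harder to get wrong. A minimal sketch (the work function here is just a stand-in workload):

```python
import cProfile
import io
import pstats

def work():
    # stand-in workload, not part of the original example
    return sum(x**2 for x in range(1_000))

# entering the block calls enable(), leaving it calls disable()
with cProfile.Profile() as profiler:
    work()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```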

The output columns explained:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
──────  ───────  ───────  ───────  ───────  ─────────────────────────
 10000    0.123    0.000    0.456    0.000  <genexpr>
     1    0.001    0.001    0.457    0.457  slow_function

ncalls   how many times the function was called
tottime  time spent in this function alone (excluding sub-calls)
percall  tottime / ncalls
cumtime  total time including all functions called from here
percall  cumtime / ncalls

tottime tells you where the CPU actually is. cumtime tells you the total cost of calling that function including everything it calls. The gap between them reveals how much time is spent in sub-calls.
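To see the gap in action, here is a sketch with a hypothetical outer/inner pair: outer does almost no work of its own, so its tottime stays tiny while its cumtime includes everything inner does:

```python
import cProfile
import io
import pstats

def inner():
    # the actual CPU work lives here
    return sum(i * i for i in range(50_000))

def outer():
    # does almost nothing itself, just delegates
    return inner()

profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
print(stream.getvalue())
# expect: outer shows a tiny tottime but a large cumtime,
# while inner accounts for nearly all the tottime
```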

The sort_stats options worth knowing:

stats.sort_stats("cumulative") # total time including sub-calls — best starting point
stats.sort_stats("tottime") # time in function only — find the actual hotspot
stats.sort_stats("ncalls") # most called functions — find unexpected call counts
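print_stats also accepts restrictions beyond a plain count: a float keeps only that fraction of the sorted entries, and a string is treated as a regular expression matched against filename:lineno(function). A sketch (parse_data is a made-up example function):

```python
import cProfile
import io
import pstats

def parse_data():
    # made-up workload so there is something to profile
    return [str(i) for i in range(1_000)]

profiler = cProfile.Profile()
profiler.enable()
parse_data()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime")
stats.print_stats(0.5)      # keep only the top half of the sorted entries
stats.print_stats("parse")  # keep only entries matching the regex "parse"
print(stream.getvalue())
```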

Once cProfile tells you which function is slow, line_profiler tells you which line inside that function is the bottleneck. It requires installation:

pip install line_profiler

Consider the following text normalisation and filtering pipeline function:

from line_profiler import LineProfiler

def process_data(data):
    result = []
    for item in data:
        cleaned = item.strip().lower()  # line A — string cleaning
        words = cleaned.split()         # line B — splitting
        filtered = [w for w in words    # line C — filtering
                    if len(w) > 3]
        result.extend(filtered)         # line D — extending result
    return result

Attach the profiler to the function:

# attach the profiler to the function
profiler = LineProfiler()
profiler.add_function(process_data)
# run the function through the profiler
data = [" Hello World "] * 10_000
profiler.runcall(process_data, data)
profiler.print_stats()

The output shows time per line:

Line Hits Time Per Hit % Time Contents
──── ──── ──── ─────── ────── ────────
5 10000 12500 1.25 10.2 cleaned = item.strip().lower()
6 10000 8200 0.82 6.7 words = cleaned.split()
7 10000 45300 4.53 37.0 filtered = [w for w in words ...
8 10000 56800 5.68 46.1 result.extend(filtered)

% Time is the most useful column: it immediately shows where the function is spending its time. In this example result.extend() is the bottleneck, not the list comprehension as you might have guessed.
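If extend() really is the hotspot, one candidate fix is to fold the whole pipeline into a single comprehension that builds the output list in one pass, then re-profile to confirm. process_data_flat below is a hypothetical rewrite, not part of the original example:

```python
def process_data_flat(data):
    # one pass, one output list: no per-item extend() calls
    return [w
            for item in data
            for w in item.strip().lower().split()
            if len(w) > 3]

data = ["  Hello World  "] * 10_000
print(process_data_flat(data)[:2])  # → ['hello', 'world']
```

The output is identical to process_data, so the rewrite can be verified with a simple equality check before measuring it.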

tracemalloc tracks memory allocations line by line, useful when your program uses more memory than expected and you need to find what is allocating it:

import tracemalloc

tracemalloc.start()

# code to measure
data = {i: [j**2 for j in range(100)] for i in range(1000)}

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")

print("top memory consumers:")
for stat in top_stats[:5]:
    print(stat)

The output shows memory per line:

file.py:5: size=8.5 MiB, count=101000, average=88 B
───────────────────────────────────────────────────
size     total memory allocated by this line
count    number of individual allocations
average  average size per allocation
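In larger programs the top of this list is often noise from the import machinery. A snapshot can be narrowed with filter_traces before computing statistics; a minimal sketch:

```python
import tracemalloc

tracemalloc.start()
data = [bytes(1000) for _ in range(100)]  # stand-in allocation to measure
snapshot = tracemalloc.take_snapshot()

# drop allocations made by the import machinery and by tracemalloc itself,
# which otherwise crowd the top of the list
snapshot = snapshot.filter_traces([
    tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
    tracemalloc.Filter(False, tracemalloc.__file__),
])

for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```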

You can also compare two snapshots to see what was allocated between them, which is useful for finding memory leaks: take a photograph of memory before, another after, then compare the two to see exactly what was allocated in between.

Think of it like a bank statement: you don't need to know every transaction ever made, just what changed between two dates.

Here is the step by step:

tracemalloc.start()
───────────────────
tells Python to start watching
every memory allocation from this point

snapshot1 = tracemalloc.take_snapshot()
───────────────────────────────────────
photograph of memory RIGHT NOW
records what is currently allocated
and which line of code allocated it

    ┌─────────────────────────────────┐
    │ snapshot 1                      │
    │   line 5:  1.2 MB               │
    │   line 8:  0.3 MB               │
    │   total:   1.5 MB               │
    └─────────────────────────────────┘

data = {i: [j**2 for j in range(100)] for i in range(1000)}
───────────────────────────────────────────────────────────
this line allocates a large amount of memory
1000 keys × 100 integers each = 100,000 new objects

snapshot2 = tracemalloc.take_snapshot()
───────────────────────────────────────
second photograph after the allocation

    ┌─────────────────────────────────┐
    │ snapshot 2                      │
    │   line 5:  1.2 MB               │
    │   line 8:  0.3 MB               │
    │   line 12: 8.5 MB   NEW         │
    │   total:   10.0 MB              │
    └─────────────────────────────────┘

snapshot2.compare_to(snapshot1, "lineno")
─────────────────────────────────────────
subtracts snapshot1 from snapshot2
shows only what CHANGED between the two

    ┌─────────────────────────────────┐
    │ difference                      │
    │   line 5:  0 MB      no change  │
    │   line 8:  0 MB      no change  │
    │   line 12: +8.5 MB   allocated! │
    └─────────────────────────────────┘

In code:

import tracemalloc

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()  # before — baseline

# the code we want to measure
data = {i: [j**2 for j in range(100)] for i in range(1000)}

snapshot2 = tracemalloc.take_snapshot()  # after

# compare — shows only what changed
top_stats = snapshot2.compare_to(snapshot1, "lineno")
print("memory added between snapshots:")
for stat in top_stats[:5]:
    print(stat)

# output:
# file.py:8: size=8.5 MiB, count=101000, average=88 B
# line 8 allocated 8.5 MB across 101,000 objects
# nothing else changed — the diff isolates exactly this line
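compare_to returns StatisticDiff objects, so the numbers are also available as attributes (size_diff, count_diff) when you want to act on them in code rather than just print them. A sketch:

```python
import tracemalloc

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()
data = {i: [j**2 for j in range(100)] for i in range(1000)}
snapshot2 = tracemalloc.take_snapshot()

# entries come back sorted largest change first
biggest = snapshot2.compare_to(snapshot1, "lineno")[0]
print(f"grew by {biggest.size_diff / 1024:.0f} KiB "
      f"in {biggest.count_diff} new allocations")
```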
something is slow
  → cProfile: which function?
  → line_profiler: which line?
  → fix it, measure again with timeit to confirm the improvement

something uses too much memory
  → tracemalloc: which line allocates the most?
  → fix it, measure again with tracemalloc to confirm the improvement
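The last step of both flows is re-measuring. A minimal sketch of the timeit confirmation, with a hypothetical before/after pair:

```python
import timeit

def before():
    result = []
    for i in range(1_000):
        result.extend([i, i + 1])
    return result

def after():
    return [x for i in range(1_000) for x in (i, i + 1)]

# confirm the rewrite is correct before timing it
assert before() == after()

t_before = timeit.timeit(before, number=200)
t_after = timeit.timeit(after, number=200)
print(f"before: {t_before:.4f}s  after: {t_after:.4f}s")
```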
Tool           Level     Built-in        Best for
─────────────  ────────  ──────────────  ──────────────────────────
timeit         snippet   ✅              comparing two approaches
cProfile       function  ✅              finding the slow function
line_profiler  line      ❌ pip install  finding the slow line
tracemalloc    memory    ✅              finding memory allocations