
Profiling - Finding bottlenecks

timeit is excellent for comparing small, isolated snippets, but real performance problems rarely live in a single line. They hide inside functions, loops, and data structures spread across your codebase.

Python offers three levels of profiling:

  • cProfile — function level, built into the standard library, measures how long each function takes and how many times it is called
  • line_profiler — line level, third party, shows exactly which line inside a function is the bottleneck
  • tracemalloc — memory level, built into the standard library, shows which lines allocate the most memory

cProfile instruments your code and records every function call, how many times each function was called, how long it took in total, and how long it took excluding calls to other functions:

A slow, inefficient function:

import cProfile
import pstats
import io

def slow_function():
    total = 0
    # add the same number 10,000 times
    for i in range(10_000):
        # i is never used inside the loop!
        total += sum(x**2 for x in range(100))
    return total

# create a profiler and wrap the code to measure
profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()

# format and print the results
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative")  # sort by cumulative time — most useful
stats.print_stats(10)           # show only the top 10 functions
print(stream.getvalue())
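Since Python 3.8, Profile also works as a context manager, which makes the enable()/disable() pair harder to get wrong. A minimal sketch (the work function here is just a stand-in workload):

```python
import cProfile
import io
import pstats

def work():
    # stand-in workload, not part of the original example
    return sum(x**2 for x in range(1_000))

# entering the block calls enable(), leaving it calls disable()
with cProfile.Profile() as profiler:
    work()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```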

The output columns explained:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
──────  ───────  ───────  ───────  ───────  ─────────────────────────
 10000    0.123    0.000    0.456    0.000  <genexpr>
     1    0.001    0.001    0.457    0.457  slow_function

ncalls   how many times the function was called
tottime  time spent in this function alone (excluding sub-calls)
percall  tottime / ncalls
cumtime  total time including all functions called from here
percall  cumtime / ncalls

tottime tells you where the CPU actually is. cumtime tells you the total cost of calling that function including everything it calls. The gap between them reveals how much time is spent in sub-calls.
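To see the gap in action, here is a sketch with a hypothetical outer/inner pair: outer does almost no work of its own, so its tottime stays tiny while its cumtime includes everything inner does:

```python
import cProfile
import io
import pstats

def inner():
    # the actual CPU work lives here
    return sum(i * i for i in range(50_000))

def outer():
    # does almost nothing itself, just delegates
    return inner()

profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
print(stream.getvalue())
# expect: outer shows a tiny tottime but a large cumtime,
# while inner accounts for nearly all the tottime
```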

The sort_stats options worth knowing:

stats.sort_stats("cumulative") # total time including sub-calls — best starting point
stats.sort_stats("tottime") # time in function only — find the actual hotspot
stats.sort_stats("ncalls") # most called functions — find unexpected call counts
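print_stats also accepts restrictions beyond a plain count: a float keeps only that fraction of the sorted entries, and a string is treated as a regular expression matched against filename:lineno(function). A sketch (parse_data is a made-up example function):

```python
import cProfile
import io
import pstats

def parse_data():
    # made-up workload so there is something to profile
    return [str(i) for i in range(1_000)]

profiler = cProfile.Profile()
profiler.enable()
parse_data()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime")
stats.print_stats(0.5)      # keep only the top half of the sorted entries
stats.print_stats("parse")  # keep only entries matching the regex "parse"
print(stream.getvalue())
```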

Once cProfile tells you which function is slow, line_profiler tells you which line inside that function is the bottleneck. It requires installation:

pip install line_profiler

Consider the following text normalisation and filtering pipeline function:

from line_profiler import LineProfiler

def process_data(data):
    result = []
    for item in data:
        cleaned = item.strip().lower()  # line A — string cleaning
        words = cleaned.split()         # line B — splitting
        filtered = [w for w in words    # line C — filtering
                    if len(w) > 3]
        result.extend(filtered)         # line D — extending result
    return result

Attach the profiler to the function:

# attach the profiler to the function
profiler = LineProfiler()
profiler.add_function(process_data)
# run the function through the profiler
data = [" Hello World "] * 10_000
profiler.runcall(process_data, data)
profiler.print_stats()

The output shows time per line:

Line Hits Time Per Hit % Time Contents
──── ──── ──── ─────── ────── ────────
5 10000 12500 1.25 10.2 cleaned = item.strip().lower()
6 10000 8200 0.82 6.7 words = cleaned.split()
7 10000 45300 4.53 37.0 filtered = [w for w in words ...
8 10000 56800 5.68 46.1 result.extend(filtered)

% Time is the most useful column: it immediately shows where the function is spending its time. In this example result.extend() is the bottleneck, not the list comprehension as you might have guessed.
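If extend() really is the hotspot, one candidate fix is to fold the whole pipeline into a single comprehension that builds the output list in one pass, then re-profile to confirm. process_data_flat below is a hypothetical rewrite, not part of the original example:

```python
def process_data_flat(data):
    # one pass, one output list: no per-item extend() calls
    return [w
            for item in data
            for w in item.strip().lower().split()
            if len(w) > 3]

data = ["  Hello World  "] * 10_000
print(process_data_flat(data)[:2])  # → ['hello', 'world']
```

The output is identical to process_data, so the rewrite can be verified with a simple equality check before measuring it.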

tracemalloc tracks memory allocations line by line, useful when your program uses more memory than expected and you need to find what is allocating it:

import tracemalloc

tracemalloc.start()

# code to measure
data = {i: [j**2 for j in range(100)] for i in range(1000)}

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")

print("top memory consumers:")
for stat in top_stats[:5]:
    print(stat)

The output shows memory per line:

file.py:5: size=8.5 MiB, count=101000, average=88 B
───────────────────────────────────────────────────
size     total memory allocated by this line
count    number of individual allocations
average  average size per allocation
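In larger programs the top of this list is often noise from the import machinery. A snapshot can be narrowed with filter_traces before computing statistics; a minimal sketch:

```python
import tracemalloc

tracemalloc.start()
data = [bytes(1000) for _ in range(100)]  # stand-in allocation to measure
snapshot = tracemalloc.take_snapshot()

# drop allocations made by the import machinery and by tracemalloc itself,
# which otherwise crowd the top of the list
snapshot = snapshot.filter_traces([
    tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
    tracemalloc.Filter(False, tracemalloc.__file__),
])

for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```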

You can also compare two snapshots to see what was allocated between them, which is useful for finding memory leaks: take a photograph of memory before, another after, then compare the two to see exactly what was allocated in between.

Think of it like a bank statement: you don't need to know every transaction ever made, just what changed between two dates.

Here is the step by step:

tracemalloc.start()
───────────────────
tells Python to start watching
every memory allocation from this point

snapshot1 = tracemalloc.take_snapshot()
───────────────────────────────────────
photograph of memory RIGHT NOW
records what is currently allocated
and which line of code allocated it

    ┌─────────────────────────────────┐
    │ snapshot 1                      │
    │   line 5:  1.2 MB               │
    │   line 8:  0.3 MB               │
    │   total:   1.5 MB               │
    └─────────────────────────────────┘

data = {i: [j**2 for j in range(100)] for i in range(1000)}
───────────────────────────────────────────────────────────
this line allocates a large amount of memory
1000 keys × 100 integers each = 100,000 new objects

snapshot2 = tracemalloc.take_snapshot()
───────────────────────────────────────
second photograph after the allocation

    ┌─────────────────────────────────┐
    │ snapshot 2                      │
    │   line 5:  1.2 MB               │
    │   line 8:  0.3 MB               │
    │   line 12: 8.5 MB   NEW         │
    │   total:   10.0 MB              │
    └─────────────────────────────────┘

snapshot2.compare_to(snapshot1, "lineno")
─────────────────────────────────────────
subtracts snapshot1 from snapshot2
shows only what CHANGED between the two

    ┌─────────────────────────────────┐
    │ difference                      │
    │   line 5:  0 MB      no change  │
    │   line 8:  0 MB      no change  │
    │   line 12: +8.5 MB   allocated! │
    └─────────────────────────────────┘

In code:

import tracemalloc

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()  # before — baseline

# the code we want to measure
data = {i: [j**2 for j in range(100)] for i in range(1000)}

snapshot2 = tracemalloc.take_snapshot()  # after

# compare — shows only what changed
top_stats = snapshot2.compare_to(snapshot1, "lineno")
print("memory added between snapshots:")
for stat in top_stats[:5]:
    print(stat)

# output:
# file.py:8: size=8.5 MiB, count=101000, average=88 B
# line 8 allocated 8.5 MB across 101,000 objects
# nothing else changed — the diff isolates exactly this line
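compare_to returns StatisticDiff objects, so the numbers are also available as attributes (size_diff, count_diff) when you want to act on them in code rather than just print them. A sketch:

```python
import tracemalloc

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()
data = {i: [j**2 for j in range(100)] for i in range(1000)}
snapshot2 = tracemalloc.take_snapshot()

# entries come back sorted largest change first
biggest = snapshot2.compare_to(snapshot1, "lineno")[0]
print(f"grew by {biggest.size_diff / 1024:.0f} KiB "
      f"in {biggest.count_diff} new allocations")
```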
something is slow
  → cProfile: which function?
  → line_profiler: which line?
  → fix it, measure again with timeit to confirm the improvement

something uses too much memory
  → tracemalloc: which line allocates the most?
  → fix it, measure again with tracemalloc to confirm the improvement
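The last step of both flows is re-measuring. A minimal sketch of the timeit confirmation, with a hypothetical before/after pair:

```python
import timeit

def before():
    result = []
    for i in range(1_000):
        result.extend([i, i + 1])
    return result

def after():
    return [x for i in range(1_000) for x in (i, i + 1)]

# confirm the rewrite is correct before timing it
assert before() == after()

t_before = timeit.timeit(before, number=200)
t_after = timeit.timeit(after, number=200)
print(f"before: {t_before:.4f}s  after: {t_after:.4f}s")
```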
Tool           Level     Built-in        Best for
─────────────  ────────  ──────────────  ──────────────────────────
timeit         snippet   ✅              comparing two approaches
cProfile       function  ✅              finding the slow function
line_profiler  line      ❌ pip install  finding the slow line
tracemalloc    memory    ✅              finding memory allocations