
How to measure performance

When optimising Python code, intuition about which approach is faster is often wrong. timeit is Python’s standard library module for measuring the execution time of small code snippets reliably.

A single time.time() measurement is unreliable because it captures everything happening on your machine at that moment: OS scheduling, garbage collection, cache effects. timeit mitigates this by running the code many times, and offers repeat to take multiple independent measurements.
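For contrast, the naive approach described above might look like the following sketch (using time.perf_counter, the clock intended for measuring intervals):

```python
import time

start = time.perf_counter()
[x**2 for x in range(1000)]
elapsed = time.perf_counter() - start

# a single measurement like this can vary wildly from run to run,
# because it includes whatever else the machine was doing at the time
print(f"one naive measurement: {elapsed*1000:.3f}ms")
```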

The simplest form runs a statement a fixed number of times and returns the total elapsed time:

import timeit

# runs the list comprehension 10,000 times and
# returns the total time in seconds for all 10,000 runs
t = timeit.timeit(
    stmt="[x**2 for x in range(1000)]",
    number=10_000,
)
print(f"total for 10,000 runs: {t:.3f}s")
print(f"average per run: {t/10_000*1000:.3f}ms")

The number parameter controls how many times the statement runs. The return value is the total time for all runs, not the time for a single run; divide by number to get the average per execution.
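The same measurement can also be taken from the command line; python -m timeit picks a suitable repetition count automatically, or you can fix it with -n:

```shell
# let timeit choose the number of loops automatically
python -m timeit "[x**2 for x in range(1000)]"

# or fix the count explicitly with -n
python -m timeit -n 10000 "[x**2 for x in range(1000)]"
```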

The most common use case is measuring which of two approaches is faster:

import timeit

number = 10_000
list_comp = timeit.timeit(
    stmt="[x**2 for x in range(1000)]",
    number=number,
)
map_func = timeit.timeit(
    stmt="list(map(lambda x: x**2, range(1000)))",
    number=number,
)
gen_expr = timeit.timeit(
    stmt="list(x**2 for x in range(1000))",
    number=number,
)
print(f"list comprehension : {list_comp:.3f}s")
print(f"map()              : {map_func:.3f}s")
print(f"generator expr     : {gen_expr:.3f}s")
times = {"list comp": list_comp, "map()": map_func, "generator expr": gen_expr}
print(f"fastest: {min(times, key=times.get)}")

A single timing can be skewed by a background process, a garbage collection pause, or a CPU cache miss. timeit.repeat runs the entire timing multiple times and returns a list of results; the minimum is the most reliable indicator of true performance:

import timeit

results = timeit.repeat(
    stmt="sorted(range(1000, 0, -1))",
    repeat=5,       # 5 independent timing runs
    number=10_000,  # each run executes 10,000 times
)
print(f"all results : {[f'{r:.3f}' for r in results]}")
print(f"min     : {min(results):.3f}s")  # most reliable
print(f"max     : {max(results):.3f}s")  # worst case
print(f"average : {sum(results)/len(results):.3f}s")

The minimum is preferred over the average because it represents the run least affected by external noise. If your minimum and average are very different, your machine was under load during some runs.
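As a sketch of that sanity check, you can compare the minimum and the average after a repeat run; the 10% threshold here is an arbitrary assumption, not part of timeit:

```python
import timeit

results = timeit.repeat(
    stmt="sorted(range(1000, 0, -1))",
    repeat=5,
    number=10_000,
)
best = min(results)
avg = sum(results) / len(results)

# if the average is much higher than the minimum, some runs were noisy
# (the 10% threshold is an arbitrary choice for illustration)
if avg > best * 1.10:
    print(f"noisy measurement: min={best:.3f}s avg={avg:.3f}s")
else:
    print(f"stable measurement: min={best:.3f}s avg={avg:.3f}s")
```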

At first glance repeat may seem redundant: timeit() already runs the statement multiple times via number, so what does repeat add?

Here is the distinction:

timeit(stmt, number=10_000)
───────────────────────────
runs stmt 10,000 times
returns ONE total time
e.g. 1.243s

repeat(stmt, number=10_000, repeat=5)
─────────────────────────────────────
runs stmt 10,000 times → records run 1
runs stmt 10,000 times → records run 2
runs stmt 10,000 times → records run 3
runs stmt 10,000 times → records run 4
runs stmt 10,000 times → records run 5
returns FIVE separate times
e.g. [1.243s, 1.251s, 1.198s, 1.312s, 1.205s]

So timeit() gives you one measurement, while repeat() gives you multiple independent measurements of the same thing. The reason you want multiple measurements is that any single run can be skewed by external noise — a background process, garbage collection, a CPU cache miss:

import timeit

# one measurement — could be unlucky
t = timeit.timeit(
    stmt="sorted(range(1000, 0, -1))",
    number=10_000,
)
print(t)  # 1.312s ← was GC running during this? was the CPU busy?

# five independent measurements — much more reliable
results = timeit.repeat(
    stmt="sorted(range(1000, 0, -1))",
    number=10_000,
    repeat=5,
)
print(results)       # [1.243, 1.251, 1.198, 1.312, 1.205]
print(min(results))  # 1.198s ← the minimum is the most reliable:
                     # it represents the run with the least noise

Think of it like timing a race:

timeit(): one attempt
    runner runs 10,000m
    total time: 51.3s
    was 51.3s the true performance, or was it a bad run?

repeat(): five attempts
    runner runs 10,000m   49.2s
    runner runs 10,000m   48.8s
    runner runs 10,000m   53.1s   (bad day)
    runner runs 10,000m   49.0s
    runner runs 10,000m   48.9s

    min = 48.8s   true capability
    avg = 49.8s
    max = 53.1s   noise/bad conditions

The minimum across all repeat runs is the most reliable number — it represents the execution least affected by external factors, closest to the true performance of the code itself.

When you need more control or want to time the same statement multiple times with different number values, use the Timer class directly:

import timeit

# create a reusable timer
timer = timeit.Timer(
    stmt="sum(x**2 for x in range(1000))",
)

# run with different numbers to verify linear scaling
print(timer.timeit(1_000))    # 1,000 runs
print(timer.timeit(10_000))   # 10,000 runs — should be ~10x the above
print(timer.timeit(100_000))  # 100,000 runs — should be ~100x the first
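If you don't want to guess number yourself, Timer.autorange (available since Python 3.6) picks it for you: it keeps increasing the count until one timing run takes at least 0.2 seconds, then returns the (count, total_time) pair:

```python
import timeit

timer = timeit.Timer(stmt="sum(x**2 for x in range(1000))")

# autorange picks a count so that the total run takes >= 0.2s
count, total = timer.autorange()
print(f"{count} runs took {total:.3f}s")
print(f"average per run: {total / count * 1e6:.1f}µs")
```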

This example

  • wraps timeit.repeat in a helper function to make benchmarking cleaner and reusable,
  • then uses it to compare four different ways of building a list of squared numbers:
    • list comprehension,
    • map(),
    • generator expression,
    • and a for loop

printing the best and average time for each:

import timeit

def compare(label, stmt, number=10_000, repeat=5):
    results = timeit.repeat(stmt=stmt, repeat=repeat, number=number)
    best = min(results)
    avg = sum(results) / len(results)
    print(f"{label:<25} best: {best:.3f}s  avg: {avg:.3f}s")

compare("list comprehension", "[x**2 for x in range(1000)]")
compare("map()", "list(map(lambda x: x**2, range(1000)))")
compare("generator expr", "list(x**2 for x in range(1000))")
compare("for loop", """
r = []
for x in range(1000):
    r.append(x**2)
""")

Expected output:

list comprehension        best: 0.312s  avg: 0.318s
map()                     best: 0.342s  avg: 0.351s
generator expr            best: 0.318s  avg: 0.325s
for loop                  best: 0.445s  avg: 0.462s
Parameter   timeit()      repeat()        Timer
─────────   ──────────    ─────────────   ──────────────────
stmt        yes           yes             yes
number      yes           yes             via .timeit(n)
repeat      no            yes             via .repeat(r, n)
setup       yes           yes             yes
Returns     total time    list of times   total time

The setup parameter is useful when your statement depends on imported modules or pre-built data; it runs once before the timing starts and is not included in the measurement:

import timeit

# setup runs once — not included in the timing
t = timeit.timeit(
    stmt="bisect.bisect_left(data, 500)",
    setup="import bisect; data = list(range(1000))",
    number=100_000,
)
print(f"bisect lookup: {t:.3f}s")
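An alternative to setup strings (available since Python 3.5) is the globals parameter: passing globals=globals() lets the timed statement see names already defined in your script, so the data is built once, outside the timing, in ordinary code:

```python
import bisect
import timeit

data = list(range(1000))  # built once, outside the timing

# globals=globals() exposes `bisect` and `data` to the timed statement
t = timeit.timeit(
    stmt="bisect.bisect_left(data, 500)",
    globals=globals(),
    number=100_000,
)
print(f"bisect lookup: {t:.3f}s")
```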