
Multiprocessing — CPU-Bound Tasks

For CPU-bound work — heavy computation, data processing, mathematical operations — threading provides no benefit because the GIL (Global Interpreter Lock) prevents more than one thread from executing Python bytecode at a time. No matter how many threads you create, they take turns rather than running in parallel.
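A quick way to see this is to time the same CPU-bound function run once sequentially and once split across threads — a minimal sketch (the task size and thread count are illustrative): under the GIL, the threaded version takes roughly as long as the sequential one.

```python
import threading
import time

def burn(n):
    """CPU-bound work — holds the GIL for the whole computation."""
    return sum(x * x for x in range(n))

N, TASKS = 2_000_000, 4

# sequential: one thread does all the work
start = time.perf_counter()
for _ in range(TASKS):
    burn(N)
sequential = time.perf_counter() - start

# threaded: four threads, but only one executes bytecode at a time
start = time.perf_counter()
threads = [threading.Thread(target=burn, args=(N,)) for _ in range(TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```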

Multiprocessing solves this by spawning separate Python interpreters, each with its own GIL, each running on its own CPU core. The work is truly parallel — if you have 4 cores and 4 processes, all four run simultaneously.

The trade-off is overhead — spawning a process is more expensive than spawning a thread, and data must be serialised to pass between processes. This means multiprocessing is only worth it when the tasks are heavy enough to justify the startup cost.

Sequential vs Multiprocessing — The Core Difference


The example below runs 8 CPU-intensive computations — each summing the squares of 5 million numbers. Sequentially this takes ~8 seconds. With multiprocessing, the work is distributed across all available CPU cores and completes in ~2 seconds:

from multiprocessing import Pool, cpu_count
import time

def heavy_computation(n):
    """CPU-intensive task — sum of squares of n numbers."""
    return sum(x**2 for x in range(n))

numbers = [5_000_000] * 8  # 8 identical heavy tasks

Without multiprocessing — sequential, ~8 seconds:

start = time.perf_counter()
results = [heavy_computation(n) for n in numbers]
print(f"sequential: {time.perf_counter() - start:.1f}s") # ~8.0s
timeline without multiprocessing (1 core doing all work)
t=0s task 1 ──────────► done at t=1s
t=1s task 2 ──────────► done at t=2s
t=2s task 3 ──────────► done at t=3s
t=3s task 4 ──────────► done at t=4s
t=4s task 5 ──────────► done at t=5s
t=5s task 6 ──────────► done at t=6s
t=6s task 7 ──────────► done at t=7s
t=7s task 8 ──────────► done at t=8s
total: ~8s

With multiprocessing — parallel, ~2 seconds:

print(f"CPUs available: {cpu_count()}")  # e.g. 4

start = time.perf_counter()
with Pool(processes=cpu_count()) as pool:
    results = pool.map(heavy_computation, numbers)
print(f"multiprocessing: {time.perf_counter() - start:.1f}s")  # ~2.0s ✅
timeline with multiprocessing (4 cores, 8 tasks)
core 1: task 1 ──────────► done at t=1s
        task 5 ──────────► done at t=2s
core 2: task 2 ──────────► done at t=1s
        task 6 ──────────► done at t=2s
core 3: task 3 ──────────► done at t=1s
        task 7 ──────────► done at t=2s
core 4: task 4 ──────────► done at t=1s
        task 8 ──────────► done at t=2s
total: ~2s ✅

pool.map() distributes the tasks across the available processes, collects the results, and returns them in the same order as the input — equivalent to a parallel map(). One caveat: on platforms where new processes are spawned rather than forked (Windows, and macOS since Python 3.8), multiprocessing code must sit under an if __name__ == "__main__": guard so that child processes do not re-execute it when they import the module.

Just as ThreadPoolExecutor is the modern API for threading, ProcessPoolExecutor from concurrent.futures is the preferred API for multiprocessing. It is cleaner, handles process lifecycle automatically, and shares the same interface as ThreadPoolExecutor — making it easy to switch between the two:

from concurrent.futures import ProcessPoolExecutor, as_completed

def compute(n):
    """CPU-intensive task."""
    return sum(x**2 for x in range(n))

numbers = [1_000_000, 2_000_000, 3_000_000, 4_000_000]

Processing results as they complete — useful when tasks have different durations and you want to handle each result as soon as it is ready:

with ProcessPoolExecutor() as executor:
    futures = {executor.submit(compute, n): n for n in numbers}
    for future in as_completed(futures):
        n = futures[future]
        result = future.result()
        print(f"compute({n:,}) = {result:,}")
# output arrives out of order — smaller tasks finish first
# compute(1,000,000) = ...
# compute(2,000,000) = ...

Processing results in order — useful when the order of results matters:

with ProcessPoolExecutor() as executor:
    results = list(executor.map(compute, numbers))
print(results)  # results in same order as numbers

Both achieve the same goal — the choice is largely stylistic:

                                  Pool                  ProcessPoolExecutor
Module                            multiprocessing       concurrent.futures
API style                         older, more explicit  modern, cleaner
map()                             ✅                     ✅
as_completed()                    ❌                     ✅
Shared interface with threading   ❌                     ✅
Best for                          fine-grained control  most use cases
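The shared interface means the pool type can even be a parameter — the same code runs on threads or processes. A minimal sketch (run_with is a hypothetical helper, not part of either API):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def compute(n):
    """CPU-intensive task."""
    return sum(x * x for x in range(n))

def run_with(executor_cls, numbers):
    """Identical code path for both pool types — only the class differs."""
    with executor_cls() as executor:
        return list(executor.map(compute, numbers))

if __name__ == "__main__":
    numbers = [100_000, 200_000]
    # same results either way; only the parallelism model changes
    assert run_with(ThreadPoolExecutor, numbers) == run_with(ProcessPoolExecutor, numbers)
    print("both executors produce identical results")
```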

Multiprocessing is not free — every process must be spawned, memory must be copied, and data must be serialised to pass between processes. For small tasks the overhead can exceed the computation time itself:

small task — overhead dominates
────────────────────────────────
spawn process: ~50ms
serialise data: ~1ms
compute: ~1ms ← actual work
deserialise: ~1ms
total overhead: ~52ms for 1ms of work ❌
large task — computation dominates
────────────────────────────────────
spawn process: ~50ms
serialise data: ~1ms
compute: ~1000ms ← actual work
deserialise: ~1ms
total overhead: ~52ms for 1000ms of work ✅

The rule of thumb — if each task takes less than a few hundred milliseconds, the overhead of multiprocessing likely outweighs the benefit. Profile first, then decide.
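A profiling sketch along these lines (the task sizes, iteration counts, and pool size are illustrative) makes the crossover visible — for the tiny task the pool is slower than a plain loop, because spawning the workers costs more than the work itself:

```python
import time
from multiprocessing import Pool

def tiny(n):
    return n * n  # microseconds of work

def big(n):
    return sum(x * x for x in range(n))  # hundreds of milliseconds of work

def timed(fn, args, use_pool):
    """Return elapsed seconds for running fn over args, pooled or sequential."""
    start = time.perf_counter()
    if use_pool:
        with Pool(processes=4) as pool:
            pool.map(fn, args)
    else:
        for a in args:
            fn(a)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"tiny sequential: {timed(tiny, range(100), False):.3f}s")
    print(f"tiny pooled:     {timed(tiny, range(100), True):.3f}s  <- overhead dominates")
    print(f"big sequential:  {timed(big, [2_000_000] * 4, False):.3f}s")
    print(f"big pooled:      {timed(big, [2_000_000] * 4, True):.3f}s")
```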