NumPy for Numerical Performance

The optimisation patterns from the previous section (built-ins, local variables, generators) all operate within the Python interpreter. They reduce overhead, but they cannot escape the fundamental cost of Python’s dynamic type system: every operation on every value requires type checking, reference counting, and interpreter dispatch.

For numerical work on large datasets, this overhead is prohibitive. NumPy solves this by storing data in typed, contiguous blocks of memory and executing operations in C, bypassing the Python interpreter entirely for the heavy lifting. The result is not a 2x or 3x improvement: it is routinely 10x to 100x faster for numerical operations.

Why NumPy is Faster — The Core Difference

Memory overhead — How much space is used to store the data

The diagram below shows the fundamental difference in how Python and NumPy store data in memory. This is the root cause of the performance gap, before a single operation is even performed.

The gap is not about how fast the CPU runs; it is the layout of the data in memory that causes both the memory overhead and the execution overhead.

Python vs. NumPy int list

lst = [0, 1, 2, 3]                 arr = np.array([0, 1, 2, 3])

Python list                        NumPy array
───────────                        ───────────
Heap                               Heap
────                               ────
[ list object ]                    [ ndarray object ]
 ├──► [ PyObject: int 0 ]           └──► [ contiguous memory block ]
 ├──► [ PyObject: int 1 ]                  0 | 1 | 2 | 3
 ├──► [ PyObject: int 2 ]                  raw int64 values, no boxing
 └──► [ PyObject: int 3 ]

each integer is a separate         all integers packed together
Python object with:                as raw C values:
 - type info                        - no type info per element
 - reference count                  - no reference counting
 - value                            - no PyObject overhead

~36 bytes per integer              ~8 bytes per integer
(PyObject + list pointer)          (raw int64 only)

Same values, same logical content, completely different physical layout in memory. The Python list has 4 separate objects scattered on the heap, each requiring a pointer to find. The NumPy array has 4 raw integers sitting next to each other in a single block, no pointers, no chasing, no overhead per element.

The same comparison, for a Python list vs. a NumPy array of 1 million integers:

Python list                    NumPy array
───────────                    ───────────
~36 MB for 1M integers         ~8 MB for 1M integers
(PyObjects + list pointers)    (raw C values, contiguous)
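These figures can be checked directly. A minimal sketch, using `sys.getsizeof` to measure the list (container plus each boxed int) and `ndarray.nbytes` for the array's raw buffer; the dtype is pinned to `int64` here because the default integer dtype can differ by platform:

```python
import sys
import numpy as np

n = 1_000_000
lst = list(range(n))
arr = np.arange(n, dtype=np.int64)  # explicit dtype: default may be int32 on Windows

# list: the container (one pointer per element) plus each boxed int object
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)

# array: the raw data buffer only (the ndarray header is a small fixed cost)
array_bytes = arr.nbytes

print(f"list : ~{list_bytes / 1e6:.0f} MB")   # roughly 36 MB
print(f"array: ~{array_bytes / 1e6:.0f} MB")  # 8 MB
```

Note that `sys.getsizeof` on the list only counts the pointer array, which is why the boxed ints must be summed separately.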

Execution overhead — What Python has to do to process each value

Consider the following comparison to better understand the execution overhead: what the interpreter must do for every single element of a plain Python loop.
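One way to make those per-element interpreter steps visible is to disassemble a squaring loop with the standard library's `dis` module (a small illustrative sketch; the exact opcodes vary between Python versions):

```python
import dis

def square_all(lst):
    # every bytecode instruction in this comprehension goes through the
    # interpreter's dispatch loop once per element
    return [x ** 2 for x in lst]

dis.dis(square_all)

print(square_all([0, 1, 2, 3]))  # [0, 1, 4, 9]
```

Each printed instruction is dispatched, type-checked, and executed separately for every one of the million elements; NumPy replaces that entire loop with one C call.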

Basic Comparison — Squaring 1 Million Numbers

The example below makes the performance difference concrete: it runs the same operation on the same data, stored in two different ways, and measures the gap directly.

Both python_squares and numpy_squares compute the square of every number from 0 to 999,999 and produce identical results. The only difference is how they get there:

  • one element at a time through the Python interpreter,
  • or the entire array in a single C call:
import numpy as np
import timeit

# creates a Python list containing 1 million integers from 0 to 999,999
data_list = list(range(1_000_000))

# numpy array — same values, different storage
data_np = np.arange(1_000_000)

# ❌ python — interpreter loop, one element at a time
def python_squares(lst):
    return [x**2 for x in lst]

# ✅ numpy — single C call, entire array at once
def numpy_squares(arr):
    return arr ** 2

t_python = timeit.timeit(lambda: python_squares(data_list), number=10)
t_numpy = timeit.timeit(lambda: numpy_squares(data_np), number=10)

print(f"python : {t_python:.3f}s")
print(f"numpy  : {t_numpy:.3f}s")
print(f"speedup: {t_python / t_numpy:.1f}x faster")

# expected output:
# python : 0.800s
# numpy  : 0.010s
# speedup: 80.0x faster

What happens under the hood:

python_squares([0, 1, 2, ..., 999_999])
────────────────────────────────────────
iteration 1: unbox int(0) → compute 0**2 → box result → store
iteration 2: unbox int(1) → compute 1**2 → box result → store
iteration 3: unbox int(2) → compute 2**2 → box result → store
...          × 1,000,000 interpreter steps

numpy_squares(np.arange(1_000_000))
────────────────────────────────────
single C call: [ 0, 1, 2, ..., 999_999 ] ** 2
             → [ 0, 1, 4, ..., 999_998_000_001 ]

no boxing, no unboxing, no interpreter loop
one operation on the entire array

NumPy operations apply to the entire array at once; this is called vectorization. You describe what you want done, not how to loop over it:

import numpy as np

data = np.arange(1_000_000)

# filtering — returns only elements matching the condition
result = data[data > 500_000]
print(result)        # [500_001, 500_002, ..., 999_999]
print(result.shape)  # (499_999,)

# aggregation — C-level computation, no Python loop
mean  = data.mean()  # average
total = data.sum()   # sum
std   = data.std()   # standard deviation
top   = data.max()   # maximum

print(f"mean : {mean}")     # 499_999.5
print(f"total: {total}")    # 499_999_500_000
print(f"std  : {std:.2f}")  # 288_675.14
print(f"max  : {top}")      # 999_999

How vectorized filtering works:

data[data > 500_000]

step 1: create boolean mask (single C pass)
─────────────────────────────────────────────
data: [ 0,     1,     2,     ...  500_000  500_001  ...  999_999 ]
mask: [ False  False  False  ...  False    True     ...  True    ]

step 2: apply mask (single C pass)
─────────────────────────────────────
result: [ 500_001, 500_002, ..., 999_999 ]

two C passes, no Python interpreter loop at any point
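The same two steps can be written out explicitly on a small array, which makes the intermediate mask visible (a sketch; in practice you would write the filter in one expression):

```python
import numpy as np

data = np.arange(10)

# step 1: one C pass produces a boolean mask, element-for-element
mask = data > 5
print(mask)

# step 2: a second C pass copies only the elements where the mask is True
result = data[mask]
print(result)  # [6 7 8 9]
```

The mask is itself a NumPy array (`dtype=bool`), so it costs one byte per element and can be combined with `&`, `|`, and `~` to build compound conditions without any Python loop.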

NumPy is not always the right tool. It has a setup cost and works best in specific scenarios:

import numpy as np

# ✅ good fit — large numerical arrays, repeated operations
data = np.arange(1_000_000, dtype=np.float64)
result = np.sqrt(data) + np.log(data + 1)  # vectorized math

# ✅ good fit — matrix operations
matrix = np.random.rand(1000, 1000)
result = matrix @ matrix.T  # matrix multiplication

# ❌ poor fit — small data, not worth the overhead
small = np.array([1, 2, 3])  # overhead exceeds benefit
total = np.sum(small)        # sum([1, 2, 3]) is faster here

# ❌ poor fit — non-numerical data
names = np.array(["Alice", "Bob"])  # use a list instead
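The small-data claim is easy to verify. A quick sketch timing the built-in `sum` against `np.sum` on a three-element input (exact timings will differ by machine, but the ordering is the point):

```python
import timeit
import numpy as np

small_list = [1, 2, 3]
small_arr = np.array([1, 2, 3])

# at this size, per-call overhead dominates: np.sum must dispatch into
# C machinery on every call, while sum() starts working immediately
t_builtin = timeit.timeit(lambda: sum(small_list), number=100_000)
t_numpy = timeit.timeit(lambda: np.sum(small_arr), number=100_000)

print(f"sum(list)    : {t_builtin:.4f}s")
print(f"np.sum(array): {t_numpy:.4f}s")  # typically slower for tiny inputs
```

The crossover point where NumPy starts winning depends on the operation, but it is usually in the hundreds to thousands of elements.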
                 Pure Python               NumPy
                 ───────────               ─────
storage          PyObject per value        raw C values, contiguous
                 + pointer per element
memory           ~28 bytes (PyObject)      ~8 bytes/int (raw int64)
                 + ~8 bytes (pointer)
                 = ~36 bytes/int
operation        interpreter loop          single C call
type checking    every element             once at array creation
speed            baseline                  10x to 100x faster
best for         general purpose           large numerical arrays
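The "type checking once at array creation" row is worth seeing in action. A short sketch: the dtype is fixed when the array is built, and anything assigned later is coerced to it rather than type-checked per element:

```python
import numpy as np

# the dtype is decided once, here at creation; no per-element checks afterwards
arr = np.array([1, 2, 3], dtype=np.int64)
print(arr.dtype)  # int64

# a value assigned later is coerced to the array's fixed type
arr[0] = 2.9
print(arr[0])     # 2 (the float is truncated to int64)
```

This is the trade behind NumPy's speed: the flexibility of storing mixed types in one container is given up in exchange for skipping a million type checks.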

The memory comparison:

1 million integers

Pure Python                        NumPy
───────────                        ─────
28 bytes × 1M (PyObjects)          8 bytes × 1M (raw int64)
+ 8 bytes × 1M (list pointers)
─────────────────────────          ──────────────────────
~36 MB total                       ~8 MB total
                                   (4.5x less memory)
Operation             Pure Python    NumPy      Speedup
Square 1M numbers     ~0.8s          ~0.01s     ~80x
Sum 1M numbers        ~0.05s         ~0.001s    ~50x
Filter 1M numbers     ~0.1s          ~0.002s    ~50x
Mean of 1M numbers    ~0.05s         ~0.001s    ~50x
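A sketch for reproducing these four measurements on your own machine (the absolute numbers will differ; only the ratios are expected to hold):

```python
import timeit
import numpy as np

n = 1_000_000
lst = list(range(n))
arr = np.arange(n)

# (pure-Python version, NumPy version) for each operation in the table
cases = {
    "square": (lambda: [x ** 2 for x in lst],           lambda: arr ** 2),
    "sum":    (lambda: sum(lst),                        lambda: arr.sum()),
    "filter": (lambda: [x for x in lst if x > n // 2],  lambda: arr[arr > n // 2]),
    "mean":   (lambda: sum(lst) / n,                    lambda: arr.mean()),
}

for name, (py_fn, np_fn) in cases.items():
    t_py = timeit.timeit(py_fn, number=5)
    t_np = timeit.timeit(np_fn, number=5)
    print(f"{name:<7} python {t_py:.3f}s   numpy {t_np:.3f}s   ~{t_py / t_np:.0f}x")
```

Both versions of each operation compute the same result, so the timing comparison is apples to apples.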

The numbers will vary by machine, but the relative ordering is consistent: NumPy wins decisively for any large numerical dataset, and the advantage grows with the size of the data.