NumPy for Numerical Performance
When Pure Python Is Not Fast Enough
The optimisation patterns from the previous section (built-ins, local variables, generators) all operate within the Python interpreter. They reduce overhead, but they cannot escape the fundamental cost of Python's dynamic type system: every operation on every value requires type checking, reference counting, and interpreter dispatch.
For numerical work on large datasets, this overhead is prohibitive. NumPy solves it by storing data in typed, contiguous blocks of memory and executing operations in C, bypassing the Python interpreter entirely for the heavy lifting. The result is not a 2x or 3x improvement; it is routinely 10x to 100x faster for numerical operations.
Why NumPy is Faster — The Core Difference
Memory overhead — How much space is used to store the data
The diagram below shows the fundamental difference in how Python and NumPy store data in memory. This is the root cause of the performance gap, before a single operation is even performed.
The gap is not about how fast the CPU runs; it is the way the data is laid out in memory that causes both the memory overhead and the execution overhead.
```
lst = [0, 1, 2, 3]                   arr = np.array([0, 1, 2, 3])

Python list                          NumPy array
───────────                          ───────────

Heap                                 Heap
────                                 ────
[ list object ]                      [ ndarray object ]
  │                                    │
  ├──► [ PyObject: int 0 ]             └──► [ contiguous memory block ]
  ├──► [ PyObject: int 1 ]                    0 | 1 | 2 | 3
  ├──► [ PyObject: int 2 ]                    ↑   ↑   ↑   ↑
  └──► [ PyObject: int 3 ]                    raw int64 values, no boxing

each integer is a separate           all integers packed together
Python object with:                  as raw C values:
  - type info                          - no type info per element
  - reference count                    - no reference counting
  - value                              - no PyObject overhead
~36 bytes per integer                ~8 bytes per integer
(PyObject + list pointer)            (raw int64 only)
```

Same values, same logical content, completely different physical layout in memory. The Python list has 4 separate objects scattered on the heap, each requiring a pointer to find. The NumPy array has 4 raw integers sitting next to each other in a single block: no pointers, no chasing, no per-element overhead.
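You can check the per-element overhead directly with `sys.getsizeof` and `ndarray.nbytes`. A minimal sketch, assuming a 64-bit CPython build (exact byte counts vary by version and platform):

```python
import sys
import numpy as np

lst = list(range(1_000_000))
arr = np.arange(1_000_000)

# list: the container's pointer array plus one full PyObject per element
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)

# array: one contiguous buffer of raw machine integers
array_bytes = arr.nbytes

print(f"list : {list_bytes / 1e6:.1f} MB")
print(f"array: {array_bytes / 1e6:.1f} MB")
```

On a typical 64-bit machine this reports roughly 36 MB for the list and 8 MB for the array, matching the diagram above.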
A Python list vs. a NumPy array holding 1 million integers:
```
Python list                        NumPy array
───────────                        ───────────
~36 MB for 1M integers             ~8 MB for 1M integers
(PyObjects + list pointers)        (raw C values, contiguous)
```

Execution overhead — What Python has to do to process each value
To understand the execution overhead, consider what the interpreter must do for every element of a plain Python for loop: check the operand types, dispatch to the right operation, unbox the values, compute, box the result, and update reference counts.
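This per-element machinery is visible with the standard `dis` module, which prints the bytecode the interpreter dispatches on. A small sketch (exact opcode names vary across CPython versions):

```python
import dis

def square_all(lst):
    # one pass through the interpreter's dispatch loop per element
    return [x ** 2 for x in lst]

dis.dis(square_all)
```

Every instruction in the loop body of that listing runs once per element; NumPy replaces the entire loop with a single call into compiled C.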
Basic Comparison — Squaring 1 Million Numbers
The example below makes the performance difference concrete: it runs the same operation on the same data, stored in two different ways, and measures the gap directly.
Both python_squares and numpy_squares compute the square of every number from 0 to 999,999 and produce identical results.
The only difference is how they get there:
- one element at a time through the Python interpreter,
- or the entire array in a single C call:
```python
import numpy as np
import timeit

# a Python list containing 1 million integers from 0 to 999,999
data_list = list(range(1_000_000))

# numpy array — same values, different storage
data_np = np.arange(1_000_000)

# ❌ python — interpreter loop, one element at a time
def python_squares(lst):
    return [x**2 for x in lst]

# ✅ numpy — single C call, entire array at once
def numpy_squares(arr):
    return arr ** 2

t_python = timeit.timeit(lambda: python_squares(data_list), number=10)
t_numpy = timeit.timeit(lambda: numpy_squares(data_np), number=10)

print(f"python : {t_python:.3f}s")
print(f"numpy  : {t_numpy:.3f}s")
print(f"speedup: {t_python / t_numpy:.1f}x faster")

# expected output (timings vary by machine):
# python : 0.800s
# numpy  : 0.010s
# speedup: 80.0x faster
```

What happens under the hood:
```
python_squares([0, 1, 2, ..., 999_999])
────────────────────────────────────────
iteration 1: unbox int(0) → compute 0**2 → box result → store
iteration 2: unbox int(1) → compute 1**2 → box result → store
iteration 3: unbox int(2) → compute 2**2 → box result → store
...
× 1,000,000 interpreter steps

numpy_squares(np.arange(1_000_000))
────────────────────────────────────
single C call: [ 0, 1, 2, ..., 999_999 ] ** 2
             → [ 0, 1, 4, ..., 999_998_000_001 ]
no boxing, no unboxing, no interpreter loop
one operation on the entire array
```

Vectorized Operations — No Loops Needed
NumPy operations apply to the entire array at once; this is called vectorization. You describe what you want done, not how to loop over it:
```python
import numpy as np

data = np.arange(1_000_000)

# filtering — returns only elements matching the condition
result = data[data > 500_000]
print(result)        # [500001 500002 ... 999999]
print(result.shape)  # (499999,)

# aggregation — C-level computation, no Python loop
mean = data.mean()   # average
total = data.sum()   # sum
std = data.std()     # standard deviation
top = data.max()     # maximum

print(f"mean : {mean}")     # 499999.5
print(f"total: {total}")    # 499999500000
print(f"std  : {std:.2f}")  # 288675.13
print(f"max  : {top}")      # 999999
```

How vectorized filtering works:
```
data[data > 500_000]

step 1 — create boolean mask (single C pass)
─────────────────────────────────────────────
data: [ 0,     1,     2,     ...  500_000  500_001  ...  999_999 ]
mask: [ False  False  False  ...  False    True     ...  True    ]

step 2 — apply mask (single C pass)
─────────────────────────────────────
result: [ 500_001, 500_002, ..., 999_999 ]

two C passes — no Python interpreter loop at any point
```

When to Use NumPy
NumPy is not always the right tool; it has a per-call setup cost and works best in specific scenarios.
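That setup cost is easy to observe. A sketch comparing the built-in `sum` against NumPy on 3 elements and on 1 million (the timings are illustrative, not guaranteed):

```python
import timeit
import numpy as np

small_list, small_arr = [1, 2, 3], np.array([1, 2, 3])
big_list, big_arr = list(range(1_000_000)), np.arange(1_000_000)

# tiny input: numpy's per-call machinery dominates, the built-in wins
t_small_py = timeit.timeit(lambda: sum(small_list), number=100_000)
t_small_np = timeit.timeit(lambda: np.sum(small_arr), number=100_000)

# large input: the C loop dominates, numpy wins
t_big_py = timeit.timeit(lambda: sum(big_list), number=10)
t_big_np = timeit.timeit(lambda: big_arr.sum(), number=10)

print(f"3 elements : sum {t_small_py:.4f}s   np.sum {t_small_np:.4f}s")
print(f"1M elements: sum {t_big_py:.4f}s   np.sum {t_big_np:.4f}s")
```

The crossover point depends on the machine and the operation, but the pattern holds: below a few hundred elements the plain Python built-in is usually faster.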
```python
import numpy as np

# ✅ good fit — large numerical arrays, repeated operations
data = np.arange(1_000_000, dtype=np.float64)
result = np.sqrt(data) + np.log(data + 1)  # vectorized math

# ✅ good fit — matrix operations
matrix = np.random.rand(1000, 1000)
result = matrix @ matrix.T  # matrix multiplication

# ❌ poor fit — small data, not worth the overhead
small = np.array([1, 2, 3])  # overhead exceeds benefit
total = np.sum(small)        # sum([1, 2, 3]) is faster here

# ❌ poor fit — non-numerical data
names = np.array(["Alice", "Bob"])  # use a list instead
```

NumPy vs Pure Python — Summary
```
                Pure Python                NumPy
                ───────────                ─────
storage         PyObject per value         raw C values, contiguous
                + pointer per element
memory          ~28 bytes (PyObject)       ~8 bytes/int (raw int64)
                + ~8 bytes (pointer)
                = ~36 bytes/int
operation       interpreter loop           single C call
type checking   every element              once at array creation
speed           baseline                   10x – 100x faster
best for       general purpose            large numerical arrays
```

The memory comparison:

```
1 million integers

Pure Python                        NumPy
───────────                        ─────
28 bytes × 1M (PyObjects)          8 bytes × 1M (raw int64)
+ 8 bytes × 1M (list pointers)
──────────────────────────        ──────────────────────
~36 MB total                       ~8 MB total
                                   4.5x less memory
```

| Operation | Pure Python | NumPy | Speedup |
|---|---|---|---|
| Square 1M numbers | ~0.8s | ~0.01s | ~80x |
| Sum 1M numbers | ~0.05s | ~0.001s | ~50x |
| Filter 1M numbers | ~0.1s | ~0.002s | ~50x |
| Mean of 1M numbers | ~0.05s | ~0.001s | ~50x |
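The rows above can be reproduced with a sketch like the following (absolute times will differ on your machine; only the ratios matter):

```python
import timeit
import numpy as np

data_list = list(range(1_000_000))
data_np = np.arange(1_000_000)

# pure-Python and NumPy versions of each operation from the table
benchmarks = {
    "square": (lambda: [x ** 2 for x in data_list], lambda: data_np ** 2),
    "sum":    (lambda: sum(data_list),              lambda: data_np.sum()),
    "filter": (lambda: [x for x in data_list if x > 500_000],
               lambda: data_np[data_np > 500_000]),
    "mean":   (lambda: sum(data_list) / len(data_list),
               lambda: data_np.mean()),
}

for name, (py_fn, np_fn) in benchmarks.items():
    t_py = timeit.timeit(py_fn, number=10)
    t_np = timeit.timeit(np_fn, number=10)
    print(f"{name:7}: python {t_py:.3f}s  numpy {t_np:.3f}s  ({t_py / t_np:.0f}x)")
```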
The numbers will vary by machine, but the relative ordering is consistent: NumPy wins decisively for any large numerical dataset, and the advantage grows with the size of the data.