NumPy for Numerical Performance
When Pure Python Is Not Fast Enough
The optimisation patterns from the previous section (built-ins, local variables, generators) all operate within the Python interpreter. They reduce overhead, but they cannot escape the fundamental cost of Python's dynamic type system: every operation on every value requires type checking, reference counting, and interpreter dispatch.
For numerical work on large datasets, this overhead is prohibitive. NumPy solves it by storing data in typed, contiguous blocks of memory and executing operations in C, bypassing the Python interpreter entirely for the heavy lifting. The result is not a 2x or 3x improvement; it is routinely 10x to 100x faster for numerical operations.
Why NumPy is Faster — The Core Difference
Memory overhead — How much space is used to store the data
The diagram below shows the fundamental difference in how Python and NumPy store data in memory. This is the root cause of the performance gap, before a single operation is even performed.
The gap is not about how fast the CPU runs; it is the way the data is laid out in memory that causes both the memory overhead and the execution overhead.
```
lst = [0, 1, 2, 3]                   arr = np.array([0, 1, 2, 3])

Python list                          NumPy array
───────────                          ───────────

Heap                                 Heap
────                                 ────
[ list object ]                      [ ndarray object ]
  │                                    │
  ├──► [ PyObject: int 0 ]             └──► [ contiguous memory block ]
  ├──► [ PyObject: int 1 ]                    0 | 1 | 2 | 3
  ├──► [ PyObject: int 2 ]                    ↑   ↑   ↑   ↑
  └──► [ PyObject: int 3 ]                    raw int64 values, no boxing

each integer is a separate           all integers packed together
Python object with:                  as raw C values:
  - type info                          - no type info per element
  - reference count                    - no reference counting
  - value                              - no PyObject overhead
~36 bytes per integer                ~8 bytes per integer
(PyObject + list pointer)            (raw int64 only)
```

Same values, same logical content, completely different physical layout in memory. The Python list has 4 separate objects scattered on the heap, each requiring a pointer to find. The NumPy array has 4 raw integers sitting next to each other in a single block: no pointers, no chasing, no per-element overhead.
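You can check the per-element overhead directly with `sys.getsizeof` and `ndarray.nbytes`. A minimal sketch, assuming a 64-bit CPython build (exact byte counts vary by version and platform):

```python
import sys
import numpy as np

lst = list(range(1_000_000))
arr = np.arange(1_000_000)

# list: the container's pointer array plus one full PyObject per element
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)

# array: one contiguous buffer of raw machine integers
array_bytes = arr.nbytes

print(f"list : {list_bytes / 1e6:.1f} MB")
print(f"array: {array_bytes / 1e6:.1f} MB")
```

On a typical 64-bit machine this reports roughly 36 MB for the list and 8 MB for the array, matching the diagram above.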
A Python list vs. a NumPy array holding 1 million integers:
```
Python list                        NumPy array
───────────                        ───────────
~36 MB for 1M integers             ~8 MB for 1M integers
(PyObjects + list pointers)        (raw C values, contiguous)
```

Execution overhead — What Python has to do to process each value
To understand the execution overhead, consider what the interpreter must do for every element of a plain Python for loop: check the operand types, dispatch to the right operation, unbox the values, compute, box the result, and update reference counts.
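This per-element machinery is visible with the standard `dis` module, which prints the bytecode the interpreter dispatches on. A small sketch (exact opcode names vary across CPython versions):

```python
import dis

def square_all(lst):
    # one pass through the interpreter's dispatch loop per element
    return [x ** 2 for x in lst]

dis.dis(square_all)
```

Every instruction in the loop body of that listing runs once per element; NumPy replaces the entire loop with a single call into compiled C.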
Basic Comparison — Squaring 1 Million Numbers
The example below makes the performance difference concrete: it runs the same operation on the same data, stored in two different ways, and measures the gap directly.
Both python_squares and numpy_squares compute the square of every number from 0 to 999,999 and produce identical results.
The only difference is how they get there:
- one element at a time through the Python interpreter,
- or the entire array in a single C call:
```python
import numpy as np
import timeit

# a Python list containing 1 million integers from 0 to 999,999
data_list = list(range(1_000_000))

# numpy array — same values, different storage
data_np = np.arange(1_000_000)

# ❌ python — interpreter loop, one element at a time
def python_squares(lst):
    return [x**2 for x in lst]

# ✅ numpy — single C call, entire array at once
def numpy_squares(arr):
    return arr ** 2

t_python = timeit.timeit(lambda: python_squares(data_list), number=10)
t_numpy = timeit.timeit(lambda: numpy_squares(data_np), number=10)

print(f"python : {t_python:.3f}s")
print(f"numpy  : {t_numpy:.3f}s")
print(f"speedup: {t_python / t_numpy:.1f}x faster")

# expected output (timings vary by machine):
# python : 0.800s
# numpy  : 0.010s
# speedup: 80.0x faster
```

What happens under the hood:
```
python_squares([0, 1, 2, ..., 999_999])
────────────────────────────────────────
iteration 1: unbox int(0) → compute 0**2 → box result → store
iteration 2: unbox int(1) → compute 1**2 → box result → store
iteration 3: unbox int(2) → compute 2**2 → box result → store
...
× 1,000,000 interpreter steps

numpy_squares(np.arange(1_000_000))
────────────────────────────────────
single C call: [ 0, 1, 2, ..., 999_999 ] ** 2
             → [ 0, 1, 4, ..., 999_998_000_001 ]
no boxing, no unboxing, no interpreter loop
one operation on the entire array
```

Vectorized Operations — No Loops Needed
NumPy operations apply to the entire array at once; this is called vectorization. You describe what you want done, not how to loop over it:
```python
import numpy as np

data = np.arange(1_000_000)

# filtering — returns only elements matching the condition
result = data[data > 500_000]
print(result)        # [500001 500002 ... 999999]
print(result.shape)  # (499999,)

# aggregation — C-level computation, no Python loop
mean = data.mean()   # average
total = data.sum()   # sum
std = data.std()     # standard deviation
top = data.max()     # maximum

print(f"mean : {mean}")     # 499999.5
print(f"total: {total}")    # 499999500000
print(f"std  : {std:.2f}")  # 288675.13
print(f"max  : {top}")      # 999999
```

How vectorized filtering works:
```
data[data > 500_000]

step 1 — create boolean mask (single C pass)
─────────────────────────────────────────────
data: [ 0,     1,     2,     ...  500_000  500_001  ...  999_999 ]
mask: [ False  False  False  ...  False    True     ...  True    ]

step 2 — apply mask (single C pass)
─────────────────────────────────────
result: [ 500_001, 500_002, ..., 999_999 ]

two C passes — no Python interpreter loop at any point
```

When to Use NumPy
NumPy is not always the right tool; it has a per-call setup cost and works best in specific scenarios.
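That setup cost is easy to observe. A sketch comparing the built-in `sum` against NumPy on 3 elements and on 1 million (the timings are illustrative, not guaranteed):

```python
import timeit
import numpy as np

small_list, small_arr = [1, 2, 3], np.array([1, 2, 3])
big_list, big_arr = list(range(1_000_000)), np.arange(1_000_000)

# tiny input: numpy's per-call machinery dominates, the built-in wins
t_small_py = timeit.timeit(lambda: sum(small_list), number=100_000)
t_small_np = timeit.timeit(lambda: np.sum(small_arr), number=100_000)

# large input: the C loop dominates, numpy wins
t_big_py = timeit.timeit(lambda: sum(big_list), number=10)
t_big_np = timeit.timeit(lambda: big_arr.sum(), number=10)

print(f"3 elements : sum {t_small_py:.4f}s   np.sum {t_small_np:.4f}s")
print(f"1M elements: sum {t_big_py:.4f}s   np.sum {t_big_np:.4f}s")
```

The crossover point depends on the machine and the operation, but the pattern holds: below a few hundred elements the plain Python built-in is usually faster.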
```python
import numpy as np

# ✅ good fit — large numerical arrays, repeated operations
data = np.arange(1_000_000, dtype=np.float64)
result = np.sqrt(data) + np.log(data + 1)  # vectorized math

# ✅ good fit — matrix operations
matrix = np.random.rand(1000, 1000)
result = matrix @ matrix.T  # matrix multiplication

# ❌ poor fit — small data, not worth the overhead
small = np.array([1, 2, 3])  # overhead exceeds benefit
total = np.sum(small)        # sum([1, 2, 3]) is faster here

# ❌ poor fit — non-numerical data
names = np.array(["Alice", "Bob"])  # use a list instead
```

NumPy vs Pure Python — Summary
```
                Pure Python                NumPy
                ───────────                ─────
storage         PyObject per value         raw C values, contiguous
                + pointer per element
memory          ~28 bytes (PyObject)       ~8 bytes/int (raw int64)
                + ~8 bytes (pointer)
                = ~36 bytes/int
operation       interpreter loop           single C call
type checking   every element              once at array creation
speed           baseline                   10x – 100x faster
best for       general purpose            large numerical arrays
```

The memory comparison:

```
1 million integers

Pure Python                        NumPy
───────────                        ─────
28 bytes × 1M (PyObjects)          8 bytes × 1M (raw int64)
+ 8 bytes × 1M (list pointers)
──────────────────────────        ──────────────────────
~36 MB total                       ~8 MB total
                                   4.5x less memory
```

| Operation | Pure Python | NumPy | Speedup |
|---|---|---|---|
| Square 1M numbers | ~0.8s | ~0.01s | ~80x |
| Sum 1M numbers | ~0.05s | ~0.001s | ~50x |
| Filter 1M numbers | ~0.1s | ~0.002s | ~50x |
| Mean of 1M numbers | ~0.05s | ~0.001s | ~50x |
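The rows above can be reproduced with a sketch like the following (absolute times will differ on your machine; only the ratios matter):

```python
import timeit
import numpy as np

data_list = list(range(1_000_000))
data_np = np.arange(1_000_000)

# pure-Python and NumPy versions of each operation from the table
benchmarks = {
    "square": (lambda: [x ** 2 for x in data_list], lambda: data_np ** 2),
    "sum":    (lambda: sum(data_list),              lambda: data_np.sum()),
    "filter": (lambda: [x for x in data_list if x > 500_000],
               lambda: data_np[data_np > 500_000]),
    "mean":   (lambda: sum(data_list) / len(data_list),
               lambda: data_np.mean()),
}

for name, (py_fn, np_fn) in benchmarks.items():
    t_py = timeit.timeit(py_fn, number=10)
    t_np = timeit.timeit(np_fn, number=10)
    print(f"{name:7}: python {t_py:.3f}s  numpy {t_np:.3f}s  ({t_py / t_np:.0f}x)")
```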
The numbers will vary by machine, but the relative ordering is consistent: NumPy wins decisively for any large numerical dataset, and the advantage grows with the size of the data.