High Performance Python

Table of Contents

1 Hardware Resources

1.1 Computing Unit

The main properties of interest in a computing unit are the number of operations it can do in one cycle and the number of cycles it can complete in one second. The first value is measured by its instructions per cycle (IPC), while the latter is measured by its clock speed.

  • SIMD: single instruction, multiple data. A CPU vectorization feature that applies one instruction to multiple pieces of data at once, yielding a high effective IPC.
  • Hyperthreading presents a virtual second CPU to the host OS, and clever hardware logic tries to interleave two threads of instructions into the execution units on a single CPU. When successful, gains of up to 30% over a single thread can be achieved.

1.2 Memory Unit

When trying to optimize the memory patterns of a program, we are simply optimizing which data is placed where, how it is laid out (in order to increase the number of sequential reads), and how many times it is moved among the various locations.

1.3 Communications Layers

  • The frontside bus is the connection between the RAM and the L1/L2 cache
  • Two main properties of a bus:
    • how much data can be moved in one transfer (bus width)
    • how many transfers the bus can do per second (bus frequency).

1.4 Performance Issues of Python

  • Memory fragmentation: garbage-collected memory management means data is not laid out optimally, hurting cache locality
  • Dynamic types: use Cython to mitigate this problem
  • GIL: use Cython or foreign functions to side-step it

1.5 GPU

  • When to use: the task mainly requires linear algebra and matrix manipulations.
    1. Ensure that the memory use of the problem fits within the GPU (GPU memory is limited).
    2. Evaluate whether the algorithm requires a lot of branching conditions (GPUs handle branching poorly).
    3. Evaluate how much data needs to be moved between the GPU and the CPU.
  • Drawback: the GPU communicates through the PCI bus, which is much slower than the frontside bus

2 Profiling

2.1 Tools

2.1.1 /usr/bin/time

  • Use the --verbose flag to get more details (see the example below).
  • The most useful indicator is "Major (requiring I/O) page faults", which shows that the operating system had to load pages of data from disk because the data no longer resided in RAM.
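
For example, to time a script verbosely (test.py as elsewhere in these notes):

/usr/bin/time --verbose python test.py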

2.1.2 timefn decorator

import time
from functools import wraps


def timefn(fn):
    @wraps(fn)
    def measure_time(*args, **kwargs):
        t1 = time.time()
        result = fn(*args, **kwargs)
        t2 = time.time()
        print(f"@timefn: {fn.__name__} took {t2 - t1} seconds")
        return result

    return measure_time
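
A minimal usage sketch (the decorated function and workload are illustrative):

@timefn
def slow_sum(n):
    return sum(range(n))

slow_sum(10_000_000)
# prints: @timefn: slow_sum took <elapsed> seconds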

2.1.3 built-in timeit module

  • Documentation
  • The timeit module temporarily disables the garbage collector.
  • More useful approach: the IPython magic %timeit (see the sketch below)
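
A quick sketch of both interfaces (the timed statement is arbitrary):

import timeit

timeit.timeit("sum(range(1000))", number=10_000)  # total seconds for 10,000 runs

# IPython equivalent:
# %timeit sum(range(1000))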

2.1.4 cProfile

# shell: run with profiling, sorted by cumulative time
python -m cProfile -s cumulative test.py
# shell: dump the statistics to a file for later analysis
python -m cProfile -o profile.stats test.py

# Python: analyze the dumped statistics
import pstats

p = pstats.Stats("profile.stats")
p.sort_stats("cumulative")
p.print_stats()
p.print_callers()
p.print_callees()
  • snakeviz for visualization
  • alternative: pyinstrument, which has lower overhead and filters out trivial information

2.1.5 line_profiler

  1. pip install line_profiler
  2. Use @profile to mark the chosen function
  3. kernprof -l -v test.py
  4. See report: python -m line_profiler test.py.lprof
  • Object Interface
    from line_profiler import LineProfiler
    
    lp = LineProfiler(some_func)
    lp.run("some_func(args)")
    lp.print_stats()
    

2.1.6 pyinstrument

2.1.7 VizTracer

2.1.8 memory_profiler

  1. pip install psutil (recommended)
  2. pip install memory_profiler
  3. Use @profile to mark the chosen function
  4. python -m memory_profiler test.py or mprof run test.py

Other Hints:

  • use with profile.timestamp("scope1"): to label a block of code (see the sketch after this list)
  • memory_profiler offers an interesting aid to debugging a large process via the --pdb-mmem=XXX flag
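
A minimal sketch of labeling a scope (the function body is illustrative; run under mprof run, then view with mprof plot):

from memory_profiler import profile

def process_data():
    with profile.timestamp("scope1"):  # the labeled interval shows up in the mprof plot
        big = [0] * 10_000_000
    del big

# mprof run test.py
# mprof plot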

2.1.9 perf

  1. sudo apt install linux-tools-generic
  2. Set /proc/sys/kernel/perf_event_paranoid to -1 (allow use of almost all events by all users, ignoring the mlock limit)
  3. perf stat -e cycles,instructions,cache-references,cache-misses,... python test.py

Performance Counter Descriptions:

  • task-clock tells us how many clock cycles our task took (summed over all CPUs).
  • The difference between instructions and cycles gives us an indication of how well our code is vectorizing and pipelining.
  • cs (context switches): the program is halted so that a kernel operation (such as I/O) can finish.
  • migrations: the program is halted and resumed on a different CPU than the one it was on before.
  • faults: page faults.
  • cache-references increases whenever we reference data that is in our cache (L1/L2/L3). If we do not already have this data in the cache and need to fetch it from RAM, this counts as a cache-miss.
  • A branch is a point in the code where the execution flow changes.
  • branch-misses: the CPU tries to predict which direction a branch will take and preloads the relevant instructions; a branch-miss is counted when that prediction is wrong.
  • instructions per cycle tells us the total speed boost from pipelining, out-of-order execution, and hyperthreading.
  • Run perf list to get the list of currently supported metrics on your system.

Terms:

Pipelining
With pipelining, the CPU is able to run the current operation while fetching and preparing the next one.
Minor Page Fault Interrupt
When memory is allocated, the kernel doesn't do much except give the program a reference to memory. Later, however, when the memory is first used, the operating system throws a minor page fault interrupt, which pauses the program that is being run and properly allocates the memory. This is called a lazy allocation system.
Major Page Fault
A major page fault happens when the program requests data from a device (disk, network, etc.) that hasn't been read yet.

2.1.10 ipython magic %memit

  • %load_ext memory_profiler
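
Then measure the peak memory of a single statement (the allocation below is illustrative):

%memit [0] * int(1e7)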

2.1.11 ipython %%timeit

Allows us to specify code to set up the experiment that doesn't get timed.

%%timeit array1, array2 = np.random.random((2, 100, 100))
array1 = array1 + array2

2.1.12 No-op @profile

Add this no-op @profile decorator to the start of the module so that code marked with @profile still runs during unit testing.

# Provide a pass-through @profile when no profiler has injected one
if 'line_profiler' not in dir() and 'profile' not in dir():
    def profile(func):
        return func

2.1.13 Introspecting an Existing Process with py-spy

  • pip install py-spy
  • sudo py-spy top --pid 2046: top-like view.
  • py-spy record -o profile.svg -- python test.py

2.1.14 Bytecode: dis module

import dis
dis.dis(func)  # disassemble func's CPython bytecode

2.1.15 vmprof

vmprof is a lightweight sampling profiler that provides a web-based user interface.

  1. sudo apt install libunwind-dev
  2. pip install vmprof
  3. python -m vmprof <your program> <your program args>

2.1.16 GPU profiling

  • nvidia-smi
  • gpustat

2.1.17 PyTorch profiling

python -m torch.utils.bottleneck test.py

2.1.18 For Web Servers

  • dowser
  • dozer

2.2 Practical Points

  • Disable Turbo Boost in the BIOS.
  • Disable the operating system's ability to override SpeedStep (in the BIOS).
  • Use only AC power (never battery power).
  • Disable background tools like backups and Dropbox while running experiments.
  • Run the experiments many times to obtain a stable measurement.
  • Possibly drop to run level 1 (Unix) so that no other tasks are running.
  • Reboot and rerun the experiments to double-confirm the results.
  • Unit testing a complicated section of code that generates a large numerical output may be difficult. Do not be afraid to output a text file of results to run through diff or to use a pickled object.

3 Lists and Tuples

  • Python lists and tuples store references to their data in buckets, as opposed to numpy arrays, which store the data itself contiguously.

3.1 Lists

  • lists also store how large they are, so of the six allocated blocks, only five are usable.
  • bisect gives easy methods to add elements into a list while maintaining its sorting (see the sketch after this list)
  • List over-allocation in Python 3.7: a list grown to size N is allocated M = N + (N >> 3) + (3 if N < 9 else 6) slots
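
A quick bisect sketch (values are arbitrary):

import bisect

nums = [1, 3, 7]
bisect.insort(nums, 5)               # insert while keeping the list sorted
print(nums)                          # [1, 3, 5, 7]
print(bisect.bisect_left(nums, 5))   # index where 5 lives: 2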

3.1.1 Built-in Timsort

Python lists have a built-in sorting algorithm, Timsort: O(n) in the best case, O(n log n) in the worst case. It hybridizes insertion sort and merge sort.

3.2 Tuples

A Python process keeps some extra memory overhead for resource caching. For tuples of sizes 1–20, when they are no longer in use, the space isn't immediately given back to the system; it is reserved for future tuples of the same size, which reduces system calls for memory allocation.

In [1]: %timeit l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
62 ns ± 0.714 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [2]: %timeit t = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
9.41 ns ± 0.113 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

4 Dictionaries and Sets

4.1 Hashable Type

  • should implement __hash__ and __eq__ (Python 2 also allowed __cmp__)
  • User-defined classes have default hash and comparison functions based on the object's placement in memory (as given by the id function)

4.2 Key to Array Index

  1. hashing: turn key into an integer number
  2. masking: fits the allocated number of buckets
  3. probing: find a new place if a collision happens (see the demonstration after the code)
# simplified Python model of CPython's index-finding sequence
def index_sequence(key, mask=0b111, PERTURB_SHIFT=5):
    perturb = hash(key)  # hashing
    i = perturb & mask  # masking
    yield i
    # probing
    while True:
        perturb >>= PERTURB_SHIFT  # use high-order bits
        i = (i * 5 + perturb + 1) & mask  # simple linear function and masking again
        yield i
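
A minimal demonstration of walking the probe sequence for one key (the key and the 8-bucket mask are illustrative; actual indices vary with hash randomization):

from itertools import islice

probes = index_sequence("a_key", mask=0b111)
print(list(islice(probes, 4)))  # first four candidate bucket indices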

4.2.1 Finding an Element

If we hit an empty bucket, we can conclude that the data does not exist in the table.

4.2.2 Deleting an Element

We will write a special value that signifies that the bucket is empty, but there still may be values after it to consider when resolving a hash collision. These empty slots can be written to in the future and are removed when the hash table is resized.

4.2.3 Entropy of a Hash Function

\[S = -\sum_i p(i)\cdot\log(p(i))\]

  • \(p(i)\) is the probability that the hash function gives hash i
  • It is maximized when every hash value has equal probability of being chosen
import math
p1 = [0.25, 0.25, 0.25, 0.25]
-sum(i * math.log(i) for i in p1)  # => 1.3862943611198906

p2 = [0.1, 0.3, 0.5, 0.1]
-sum(i * math.log(i) for i in p2)  # => 1.1682824501765625
  • Knowing up front what range of values will be used and how large the dictionary will be helps in making a good selection

4.3 Dictionary

  • Optimization: Python first appends the key/value data into a standard array and then stores only the index into this array in the hash table. The array also helps keep the insertion order of items.
  • How well the data is distributed throughout the hash table is called the load factor; it is related to the entropy of the hash function.
  • By default, the smallest size of a dictionary or set is 8, and it is resized by 3x when it becomes more than two-thirds full (possible sizes: 8 -> 18 -> 39 -> 81 -> 165 -> …).

4.4 Namespace Management

Namespace Management heavily uses dictionaries to do its lookups.

The steps to look for a variable/function/module

  1. Searching locals(), which has entries for all local variables; this is the only part of the chain that doesn't require a dictionary lookup
  2. Searching globals()
  3. Searching the __builtin__ object; __builtin__ is technically a module object
import math


def test1(x):
    """
    >>> %timeit test1(123456)
    94 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    18 LOAD_GLOBAL              1 (math)
    20 LOAD_METHOD              2 (sin)
    22 LOAD_FAST                0 (x)
    24 CALL_METHOD              1

    """
    res = 1
    for _ in range(1000):
        res += math.sin(x)
    return res


def test2(x):
    """
    >>> %timeit test2(123456)
    72.5 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    22 LOAD_FAST                2 (res)
    24 LOAD_FAST                1 (sin)
    26 LOAD_FAST                0 (x)
    28 CALL_FUNCTION            1

    """
    sin = math.sin
    res = 1
    for _ in range(1000):
        res += sin(x)
    return res

5 Iterators and Generators

5.1 Iterator

  1. First, get an iterator through iter(iterable)
  2. Call next(iterator) to get new values until a StopIteration is raised
# The Python loop
for i in object:
    do_work(i)

# Is equivalent to
object_iterator = iter(object)
while True:
    try:
        i = next(object_iterator)
    except StopIteration:
        break
    else:
        do_work(i)

5.2 Lazy Generator Evaluation

from datetime import datetime
from itertools import count, filterfalse, groupby, islice
from random import normalvariate, randint
from typing import Generator, Iterable, List, Tuple

from scipy.stats import normaltest

_ENTRY_TYPE = Tuple[datetime, int]


def read_fake_data(filename: str = "fake") -> Generator[_ENTRY_TYPE, None, None]:
    for timestamp in count():
        if randint(0, 7 * 60 * 60 * 24 - 1) == 1:
            value = normalvariate(0, 1)
        else:
            value = 100
        yield datetime.fromtimestamp(timestamp), value


def groupby_day(iterable: Iterable[_ENTRY_TYPE]) -> Generator[List[_ENTRY_TYPE], None, None]:
    for day, data_group in groupby(iterable, lambda row: row[0].day):
        yield list(data_group)


def is_normal(data: List[_ENTRY_TYPE], threshold: float = 1e-3) -> bool:
    _, values = zip(*data)
    k2, p_value = normaltest(values)
    if p_value < threshold:
        return False
    return True


def filter_anomalous_groups(
    data: Iterable[List[_ENTRY_TYPE]],
) -> Generator[List[_ENTRY_TYPE], None, None]:
    yield from filterfalse(is_normal, data)


def filter_anomalous_data(data: Iterable[_ENTRY_TYPE]) -> Generator[List[_ENTRY_TYPE], None, None]:
    data_group = groupby_day(data)
    yield from filter_anomalous_groups(data_group)


if __name__ == "__main__":
    data = read_fake_data("fake")
    anomaly_generator = filter_anomalous_data(data)
    first_five_anomalies = islice(anomaly_generator, 5)

    for data_anomaly in first_five_anomalies:
        start_date = data_anomaly[0][0]
        end_date = data_anomaly[-1][0]
        print(f"Anomaly from {start_date} - {end_date}")

5.3 Useful Itertools

  • cycle, repeat, chain, groupby, islice
  • compress: like boolean-index in pandas
  • accumulate: like reduce, but yields the running results (defaults to summation); quick examples of several of these tools follow this list
  • takewhile, dropwhile: add a condition that will end a generator.
  • starmap: Used instead of map() when argument parameters are already grouped in tuples from a single iterable
  • tee(iterable, n): Return n independent iterators from a single iterable. Useful for splitting one generator into n generators.
  • zip_longest: zip_longest('ABCD', 'xy', fillvalue='-') --> Ax By C- D-
  • Combinatoric iterators: product (cartesian product), permutations, combinations
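
Quick illustrations of a few of these (values are arbitrary):

from itertools import compress, starmap, takewhile, tee, zip_longest

list(compress("ABCDEF", [1, 0, 1, 0, 1, 1]))       # ['A', 'C', 'E', 'F']
list(takewhile(lambda x: x < 5, [1, 4, 6, 4, 1]))  # [1, 4]
list(starmap(pow, [(2, 5), (3, 2)]))               # [32, 9]
a, b = tee(range(3), 2)                            # two independent iterators over the same data
list(zip_longest("ABCD", "xy", fillvalue="-"))     # [('A','x'), ('B','y'), ('C','-'), ('D','-')]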

6 Vectorization

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Vectorization of computations can occur only if we can fill the CPU cache with all the relevant data. Modern CPUs provide direct support for vector operations, where a single instruction is applied to multiple data (SIMD). Python doesn't natively support vectorization for two reasons:

  • Python lists store pointers to the actual data.
  • Python bytecode is not optimized for vectorization. (raw machine code uses nonvectorized operations)

6.1 array module

Objects in an array are laid out sequentially in memory. However, using the array type when creating lists of data that must be iterated on is actually slower than simply creating a list, because the array object stores a very low-level representation of its numbers, which must be converted into a Python-compatible object before being returned to the user. This extra overhead happens every time you index an array type. That implementation decision makes the array object less suitable for math and more suitable for storing fixed-type data more efficiently in memory.
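
For instance (a minimal sketch; the typecode and size are arbitrary):

from array import array

arr = array('l', range(1_000_000))  # typecode 'l': signed long, stored contiguously
arr[1000]  # each access boxes the raw value back into a Python int object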

6.2 numpy

  • numpy gives us memory locality and vectorized operations.

6.2.1 Comparison with Built-in Module

from array import array
import numpy


def norm_square_list(vector):
    """
    >>> vector = list(range(1_000_000))
    >>> %timeit norm_square_list(vector)
    85.5 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    """
    norm = 0
    for v in vector:
        norm += v * v
    return norm


def norm_square_list_comprehension(vector):
    """
    >>> vector = list(range(1_000_000))
    >>> %timeit norm_square_list_comprehension(vector)
    80.3 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    """
    return sum([v * v for v in vector])


def norm_square_array(vector):
    """
    >>> vector_array = array('l', range(1_000_000))
    >>> %timeit norm_square_array(vector_array)
    101 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    """
    norm = 0
    for v in vector:
        norm += v * v
    return norm


def norm_square_numpy(vector):
    """
    >>> vector_np = numpy.arange(1_000_000)
    >>> %timeit norm_square_numpy(vector_np)
    3.22 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    """
    return numpy.sum(vector * vector)


def norm_square_numpy_dot(vector):
    """
    >>> vector_np = numpy.arange(1_000_000)
    >>> %timeit norm_square_numpy_dot(vector_np)
    960 µs ± 41.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    """
    # we don't need to store the intermediate value of vector * vector as in norm_square_numpy
    return numpy.dot(vector, vector)

6.2.2 In-Place Operations

Using in-place operations can help avoid the memory allocations. It is important to note that this effect happens only when the array sizes are bigger than the CPU cache! When the arrays are smaller and the two inputs and the output can all fit into cache, the out-of-place operation is faster because it can benefit from vectorization.
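
A sketch of the difference (array sizes chosen to exceed a typical cache; timings left to the reader):

import numpy as np

array1, array2 = np.random.random((2, 1000, 1000))  # large enough to spill out of cache

# %timeit array1 + array2     # out-of-place: allocates a new output array each run
# %timeit array1 += array2    # in-place: reuses array1's memory, no allocation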

6.2.3 Numba

Numba is a just-in-time compiler that specializes in numpy code, which it compiles via the LLVM compiler at runtime. The beauty is that you provide a decorator telling it which functions to focus on and then you let Numba take over. It can automatically generate code for GPUs.

import numpy as np
from numba import jit, prange

input_array = np.array(range(100000))
output_array = np.zeros(len(input_array))

def double_all(array, output):
    for idx, value in enumerate(array):
        output[idx] = value * 2

# %timeit double_all(input_array, output_array)
# 25.3 ms ± 774 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

double_all_numba = jit(double_all, nopython=True)
# %timeit double_all_numba(input_array, output_array)
# 43.8 µs ± 481 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@jit()
def double_all_jit(array, output):
    for idx, value in enumerate(array):
        output[idx] = value * 2

# %timeit double_all_jit(input_array, output_array)
# 44.2 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# The nopython specifier means that if Numba cannot compile all of the code, it will fail.
# Adding parallel enables support for prange
@jit(nopython=True, parallel=True, nogil=True)
def double_all_jit_in_parallel(array, output):
    len_array = len(array)
    for idx in prange(len_array):
        output[idx] = array[idx] * 2

# %timeit double_all_jit_in_parallel(input_array, output_array)
# 14.4 µs ± 238 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  • Debugging with Numba: use double_all_jit.inspect_types()
  • Try to make your code compile in nopython mode.
  • Your best approach is to break your current code into small (<10 line) functions and tackle them one at a time.

6.3 numexpr

One downfall of numpy's optimization of vector operations is that it occurs on only one operation at a time. numexpr can help take an entire vector expression and compile it into very efficient code that is optimized to minimize cache misses and temporary space used. In addition, the expressions can utilize multiple CPU cores (via OpenMP).

6.3.1 How to Use numexpr

Simply rewrite the expressions as strings with references to local variables. The expressions are compiled behind the scenes and run using optimized code.

import numpy as np
from numexpr import evaluate

data = np.array(range(1000000))
%timeit data + data * 5 + 4
# 1.95 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit evaluate("data + data * 5 + 4")
# 787 µs ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

6.4 pandas

  • Operations on columns often generate temporary intermediate arrays, which consume RAM.
  • Make pandas parallel and scalable with dask module.
  • Columns of the same dtype are grouped together by a BlockManager.
  • df.apply(..., raw=True) stops the creation of an intermediate Series object, uses raw numpy array instead.
  • Install the optional dependencies numexpr and bottleneck for additional performance improvements
  • Use bulwark to check the schema of dataframes up front.
  • Convert a Series to the category dtype when dealing with low-cardinality data: df["col"].astype("category") (see the sketch after this list)
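
A small sketch of raw=True and the category dtype (the frame and data are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((100_000, 2)), columns=["a", "b"])

# raw=True hands plain numpy arrays to the function instead of building Series objects
df.apply(np.sum, raw=True)

# Low-cardinality strings: category stores integer codes plus a small lookup table
labels = pd.Series(["low", "mid", "high"] * 10_000).astype("category")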

6.4.1 Modin

Parallelizing with dask/ray.

  1. pip install modin[dask] or pip install modin[ray]
  2. export MODIN_ENGINE=dask or os.environ["MODIN_ENGINE"] = "dask"
  3. import modin.pandas as pd

6.4.2 cuDF

GPU DataFrame Library

6.4.3 vaex

  • Vaex specializes in both larger datasets and string-heavy operations
  • Vaex offers a slew of built-in visualization functions.

6.5 Summary

Two main routes:

  • reducing the time taken to get data to the CPU
  • reducing the amount of work the CPU has to do

7 Compiling to C

  • Using the Python C API directly
  • C-based compiling: Cython
  • LLVM-based compiling: Numba
  • just-in-time compiler: PyPy

7.1 TODO Python API

7.2 Cython

# In setup.py
from distutils.core import setup

from Cython.Build import cythonize

setup(ext_modules=cythonize("cythonfn.pyx", compiler_directives={"language_level": "3"}))

When we run python setup.py build_ext --inplace, Cython will look for cythonfn.pyx and build cythonfn[…].so.

7.2.1 pyximport

Simplifies the build system; no need to use setup.py.

# In client_code.py
import pyximport

pyximport.install(language_level=3)
import cythonfn

# followed by the usual code

7.2.2 Analyzing Generated C Code

cython -a cythonfn.pyx -> cythonfn.html

  • More yellow means "more calls into the Python virtual machine"
  • More white means "more non-Python C code"

7.2.3 Type Annotation

def add(int x, int y):
    # parameters are typed in the signature; locals are typed with cdef
    cdef unsigned int result
    result = x + y
    return result

7.2.4 Cython Flags

#cython: boundscheck=False
# (a file-level directive: place it at the top of the .pyx file)
def test(...):
    ...

7.2.5 Parallelizing on One Machine

With Cython, OpenMP can be added by using the prange (parallel range) operator and adding the -fopenmp compiler directive to setup.py. Work in a prange loop can be performed in parallel because we disable the GIL.

7.2.6 Example

# cythonfn.pyx
#cython: boundscheck=False
# (disables bounds checking on array accesses)
from cython.parallel import prange
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""
    # using [] to annotate the buffer protocol, single colon indicates one-dimensional data
    cdef unsigned int i, length
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32) # annotate output
    length = len(zs)

    with nogil: # disable GIL
        for i in prange(length, schedule="guided"): # guided scheduling
            # parallelizing in prange block
            z = zs[i]
            c = cs[i]
            output[i] = 0
            while output[i] < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                output[i] += 1
    return output

In setup.py

# setup.py
from distutils.core import setup
from distutils.extension import Extension

import numpy as np
from Cython.Build import cythonize

ext_modules = [
    Extension(
        "cythonfn", ["cythonfn.pyx"], extra_compile_args=["-fopenmp"], extra_link_args=["-fopenmp"]
    )  # Adding the OpenMP compiler flags and linker flags to setup.py for Cython
]


setup(
    ext_modules=cythonize(
        ext_modules,
        compiler_directives={"language_level": "3"},
    ),
    include_dirs=[np.get_include()],
)
  • Choosing scheduling approaches: static, dynamic, guided

7.3 PyPy

  • Different GC strategy (mark-and-sweep) than CPython (reference counting)
  • PyPy supports projects like numpy that require C bindings through the CPython extension compatibility layer cpyext, but it has an overhead of 4–6×, which generally makes numpy too slow.
  • May use more RAM than CPython.

7.4 Transonic

Transonic attempts to unify Cython, Pythran, and Numba, and potentially other compilers, behind one interface to enable quick evaluation of multiple compilers without having to rewrite code.

7.6 Foreign Function Interfaces

7.6.1 C .so

// libtest.c
int add_two(int a, int b) {
  return a + b;
}
  1. gcc -O3 -std=gnu11 -c libtest.c
  2. gcc -shared -o libtest.so libtest.o
  3. Put libtest.so in /usr/lib, /usr/local/lib, or somewhere else accessible to the Python code (e.g., on LD_LIBRARY_PATH).

7.6.2 ctypes

The most basic foreign function interface in CPython is through the ctypes module.

import ctypes

_libtest = ctypes.CDLL("libtest.so")

# Create references to the C types that we will need to simplify future code
TYPE_INT = ctypes.c_int

# Initialize the signature of the add_two function to:
# int add_two(int, int)
_libtest.add_two.argtypes = [TYPE_INT, TYPE_INT]
_libtest.add_two.restype = TYPE_INT


def add_two_in_python(a: int, b: int) -> int:
    # First we convert the Python types into the relevant C types
    a = TYPE_INT(a)
    b = TYPE_INT(b)
    res = _libtest.add_two(a, b)
    return res
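
A quick usage check (values are arbitrary):

print(add_two_in_python(5, 6))  # 11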
  • C structure

    from ctypes import Structure, c_int
    
    class cPoint(Structure):
        _fields_ = ("x", c_int), ("y", c_int)
    
    point = cPoint(10, 5)
    

7.7 Other Tools