High Performance Python
Table of Contents
1 Hardware Resources
1.1 Computing Unit
The main properties of interest in a computing unit are the number of operations it can do in one cycle and the number of cycles it can do in one second. The first value is measured by its instructions per cycle (IPC), while the latter value is measured by its clock speed.
- SIMD (single instruction, multiple data): CPU vectorization instructions; a single instruction is applied to multiple pieces of data at once, which is efficient when IPC is high.
- Hyperthreading presents a virtual second CPU to the host OS, and clever hardware logic tries
to interleave two threads of instructions into the execution units on a single CPU. When successful, gains of up to 30% over a single thread can be achieved.
1.2 Memory Unit
When trying to optimize the memory patterns of a program, we are simply optimizing which data is placed where, how it is laid out (in order to increase the number of sequential reads), and how many times it is moved among the various locations.
1.3 Communications Layers
- The frontside bus is the connection between the RAM and the L1/L2 cache
- Two main properties of a bus:
- how much data can be moved in one transfer (bus width)
- how many transfers the bus can do per second (bus frequency).
1.4 Performance Issues of Python
- Memory fragmentation: a consequence of being a garbage-collected language; data is not necessarily laid out contiguously in memory.
- Dynamic types: use Cython to mitigate this problem.
- GIL: use Cython or foreign functions.
1.5 GPU
- When to use: The task requires mainly linear algebra and matrix manipulations.
- Ensure that the memory use of the problem will fit within the GPU.
- Evaluate whether the algorithm requires a lot of branching conditions. (GPU threads execute in lockstep, so divergent branches are costly.)
- Evaluate how much data needs to be moved between the GPU and the CPU.
- Drawbacks: Communicates through the PCI bus, which is much slower than the frontside bus
2 Profiling
2.1 Tools
2.1.1 /usr/bin/time
- Use the `--verbose` flag to get more details.
- The most useful indicator is `Major (requiring I/O) page faults`, which indicates that the operating system has to load pages of data from disk because the data no longer resides in RAM.
2.1.2 timefn decorator
```python
import time
from functools import wraps


def timefn(fn):
    @wraps(fn)
    def measure_time(*args, **kwargs):
        t1 = time.time()
        result = fn(*args, **kwargs)
        t2 = time.time()
        print(f"@timefn: {fn.__name__} took {t2 - t1} seconds")
        return result
    return measure_time
```
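A usage sketch (the function and the printed timing are illustrative):

```python
@timefn
def slow_sum(n):
    return sum(range(n))

slow_sum(10_000_000)  # prints e.g. "@timefn: slow_sum took 0.25... seconds"
```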
2.1.3 built-in timeit module
- Documentation
- The `timeit` module temporarily disables the garbage collector.
- A more useful approach: the IPython magic `%timeit`. A direct-call sketch of the module follows.
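A minimal sketch of calling the module directly, for when IPython is unavailable (the statement and trial counts are illustrative):

```python
import timeit

# 5 trials of 1_000_000 runs each; the minimum is the least-interfered trial
times = timeit.repeat("l.append(1)", setup="l = []", repeat=5, number=1_000_000)
print(min(times))
```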
2.1.4 CProfile
```bash
python -m cProfile -s cumulative test.py
python -m cProfile -o profile.stats test.py
```

```python
import pstats

p = pstats.Stats("profile.stats")
p.sort_stats("cumulative")
p.print_stats()
p.print_callers()
p.print_callees()
```
- `snakeviz` for visualization
- Alternative: `pyinstrument`, which has lower overhead and omits trivial information
2.1.5 line_profiler
- `pip install line_profiler`
- Use `@profile` to mark the chosen function, then run `kernprof -l -v test.py`
- See the saved report: `python -m line_profiler test.py.lprof`
2.1.6 pyinstrument
2.1.7 VizTracer
2.1.8 memory_profiler
- `pip install psutil` (recommended), then `pip install memory_profiler`
- Use `@profile` to mark the chosen function, then run `python -m memory_profiler test.py` or `mprof run test.py`
Other Hints:
- Use `with profile.timestamp("scope1"):` to add a label to a scope on the `mprof` plot; a sketch follows this list.
- `memory_profiler` offers an interesting aid to debugging a large process via the `--pdb-mmem=XXX` flag.
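A minimal sketch of the labeling hint above, assuming the script is launched with `mprof run` (which injects the `profile` decorator; the function name is illustrative):

```python
@profile  # injected by mprof run; no import required
def build_report():
    with profile.timestamp("create_list"):  # labeled bracket shows up on the mprof plot
        data = [0] * 10_000_000
    return sum(data)
```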
2.1.9 perf
- `sudo apt install linux-tools-generic`
- Tweak `/proc/sys/kernel/perf_event_paranoid` to -1 (allow use of almost all events by all users; ignore the mlock limit)
- `perf stat -e cycles,instructions,cache-references,cache-misses,... python test.py`
Performance Counter Descriptions:
- `task-clock` tells us how many clock cycles our task took (summed across all CPUs).
- The difference between `instructions` and `cycles` gives us an indication of how well our code is vectorizing and pipelining.
- `cs` (context switches) and `migrations` tell us how the program is halted in order to wait for a kernel operation to finish (such as I/O); `migrations` happen when the program is resumed on a different CPU than the one it was on before.
- `faults`: page faults.
- `cache-references` increases whenever we reference data that is in our cache (L1/L2/L3). If we do not already have the data in the cache and need to fetch it from RAM, this counts as a `cache-miss`.
- A `branch` is a point in the code where the execution flow changes. For `branch-misses`, the CPU tries to guess which direction the branch will take and preload the relevant instructions; this counter records how often it guesses wrong.
- `instructions per cycle` tells us the total speed boost from pipelining, out-of-order execution, and hyperthreading.
- Run `perf list` to get the list of currently supported metrics on your system.
Terms:
- Pipelining
- With pipelining, the CPU is able to run the current operation while fetching and preparing the next one.
- Minor Page Fault Interrupt
- When memory is allocated, the kernel doesn't do much except give the program a reference to memory. Later, however, when the memory is first used, the operating system throws a minor page fault interrupt, which pauses the program that is being run and properly allocates the memory. This is called a lazy allocation system.
- Major Page Fault
- Happens when the program requests data from a device (disk, network, etc.) that hasn't been read yet.
2.1.10 ipython magic %memit
`%load_ext memory_profiler`, then `%memit <statement>` reports the peak memory used by that statement.
2.1.11 ipython %%timeit
Allows us to specify code to set up the experiment that doesn't get timed.
```
%%timeit array1, array2 = np.random.random((2, 100, 100))
array1 = array1 + array2
```
2.1.12 No-op @profile
Add it to the start of our module while unit testing
```python
if 'line_profiler' not in dir() and 'profile' not in dir():
    def profile(func):
        return func
```
2.1.13 Introspecting an Existing Process with py-spy
- `pip install py-spy`
- `sudo py-spy top --pid 2046`: top-like live view
- `py-spy record -o profile.svg -- python test.py`: record a flame graph
2.1.14 Bytecode: dis module
dis.dis(func)
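For example (exact opcodes vary across CPython versions):

```python
import dis

def add(a, b):
    return a + b

dis.dis(add)
# On CPython 3.8 this prints roughly:
#   0 LOAD_FAST      0 (a)
#   2 LOAD_FAST      1 (b)
#   4 BINARY_ADD
#   6 RETURN_VALUE
```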
2.1.15 vmprof
vmprof is a lightweight sampling profiler that supports a web-based user interface.
- `sudo apt install libunwind-dev`
- `pip install vmprof`
- `python -m vmprof <your program> <your program args>`
2.1.16 GPU profiling
- `nvidia-smi`
- `gpustat`
2.1.17 Pytorch profiling
python -m torch.utils.bottleneck test.py
2.1.18 For Web Servers
- `dowser`
- `dozer`
2.2 Practical Points
- Disable Turbo Boost in the BIOS.
- Disable the operating system's ability to override SpeedStep (in the BIOS).
- Use only AC power (never battery power).
- Disable background tools like backups and Dropbox while running experiments.
- Run the experiments many times to obtain a stable measurement.
- Possibly drop to run level 1 (Unix) so that no other tasks are running.
- Reboot and rerun the experiments to double-confirm the results.
- Unit testing a complicated section of code that generates a large numerical output may be difficult. Do not be afraid to output a text file of results to run through `diff` or to use a pickled object.
3 Lists and Tuples
- Python lists store references to the actual data in their buckets, as opposed to numpy arrays, which store the data itself contiguously.
3.1 Lists
- Lists also store how large they are, so of the six allocated blocks, only five are usable.
- `bisect` gives easy methods to add elements into a list while maintaining its sorting.
- List pre-allocation equation in Python 3.7: `M = (N >> 3) + (3 if N < 9 else 6)`; a sketch of the resulting growth follows.
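A quick sketch of that growth rule, assuming `M` is the extra headroom added on top of the requested length `N` (as in CPython's `list_resize`):

```python
def allocated(N):
    # total slots = requested length N + over-allocation M
    return N + (N >> 3) + (3 if N < 9 else 6)

# growing one element past each capacity reproduces the familiar
# CPython 3.7 growth sequence
print([allocated(n) for n in (1, 5, 9, 17, 26)])  # => [4, 8, 16, 25, 35]
```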
3.1.1 Built-in Timsort
Python lists have a built-in sorting algorithm that uses Timsort: O(n) in the best case and O(n log n) in the worst case. It hybridizes the insertion and merge sort algorithms.
3.2 Tuples
A Python process keeps some extra memory overhead for resource caching: for tuples of sizes 1–20, when they are no longer in use, the space isn't immediately given back to the system, which reduces system calls for future memory allocations.
```
In [1]: %timeit l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
62 ns ± 0.714 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [2]: %timeit t = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
9.41 ns ± 0.113 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
```
4 Dictionaries and Sets
4.1 Hashable Type
- Should implement `__hash__` and `__eq__` (Python 2 also consulted `__cmp__`).
- User-defined classes have default hash and comparison functions based on the object's placement in memory (as given by the `id` function); a sketch follows.
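A minimal sketch of the defaults versus user-defined hashing (class and field names are illustrative):

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    # Without these, hashing falls back to the object's placement in
    # memory (its id), so two equal-looking Points land in different buckets.
    def __hash__(self):
        return hash((self.x, self.y))

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

print(len({Point(1, 1), Point(1, 1)}))  # => 1; with the defaults it would be 2
```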
4.2 Key to Array Index
- hashing: turn the key into an integer
- masking: fit that integer into the allocated number of buckets
- probing: find a new bucket if a collision happens
```python
# pseudocode of finding an index
def index_sequence(key, mask=0b111, PERTURB_SHIFT=5):
    perturb = hash(key)  # hashing
    i = perturb & mask   # masking
    yield i              # probing continues below on collision
    while True:
        perturb >>= PERTURB_SHIFT         # use high-order bits
        i = (i * 5 + perturb + 1) & mask  # simple linear function, masked again
        yield i
```
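Driving the generator shows the sequence of candidate buckets probed for one key:

```python
probe = index_sequence("some key")
print([next(probe) for _ in range(4)])  # e.g. [3, 6, 5, 0]; values depend on hash()
```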
4.2.1 Finding an Element
If we hit an empty bucket, we can conclude that the data does not exist in the table.
4.2.2 Deleting an Element
We will write a special value that signifies that the bucket is empty, but there still may be values after it to consider when resolving a hash collision. These empty slots can be written to in the future and are removed when the hash table is resized.
4.2.3 Entropy of a Hash Function
\[S = -\sum_i p(i)\cdot\log(p(i))\]
- \(p(i)\) is the probability that the hash function gives hash i
- It is maximized when every hash value has equal probability of being chosen
```python
import math

p1 = [0.25, 0.25, 0.25, 0.25]
-sum(i * math.log(i) for i in p1)  # => 1.3862943611198906

p2 = [0.1, 0.3, 0.5, 0.1]
-sum(i * math.log(i) for i in p2)  # => 1.1682824501765625
```
- Knowing up front what range of values will be used and how large the dictionary will be helps in making a good selection
4.3 Dictionary
- Optimization: Python first appends the key/value data into a standard array and then stores only the index into this array in the hash table. The array also helps keep the insertion order of items.
- How well distributed the data is throughout the hash table is called the load factor; it is related to the entropy of the hash function.
- By default, the smallest size of a dictionary or set is 8, and it will resize by 3x if the dictionary is more than two-thirds full. (possible sizes: 8->18->39->81->165->…)
4.4 Namespace Management
Namespace management heavily uses dictionaries to do its lookups.
The steps to look up a variable/function/module:
- Search `locals()`, which has entries for all local variables; this is the only part of the chain that doesn't require a dictionary lookup.
- Search `globals()`.
- Search the `__builtin__` object (`__builtin__` is technically a module object).
```python
import math


def test1(x):
    """
    >>> %timeit test1(123456)
    94 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    18 LOAD_GLOBAL   1 (math)
    20 LOAD_METHOD   2 (sin)
    22 LOAD_FAST     0 (x)
    24 CALL_METHOD   1
    """
    res = 1
    for _ in range(1000):
        res += math.sin(x)
    return res


def test2(x):
    """
    >>> %timeit test2(123456)
    72.5 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    22 LOAD_FAST     2 (res)
    24 LOAD_FAST     1 (sin)
    26 LOAD_FAST     0 (x)
    28 CALL_FUNCTION 1
    """
    sin = math.sin  # local alias skips the global + attribute lookup per iteration
    res = 1
    for _ in range(1000):
        res += sin(x)
    return res
```
5 Iterators and Generators
5.1 Iterator
- First, get an iterator through `iter(iterable)`.
- Call `next(iterator)` (which invokes the iterator's `__next__` method) to get new values until a `StopIteration` is raised.
```python
# The Python loop
for i in object:
    do_work(i)

# is equivalent to
object_iterator = iter(object)
while True:
    try:
        i = next(object_iterator)
    except StopIteration:
        break
    else:
        do_work(i)
```
5.2 Lazy Generator Evaluation
```python
from datetime import datetime
from itertools import count, filterfalse, groupby, islice
from random import normalvariate, randint
from typing import Generator, Iterable, List, Tuple

from scipy.stats import normaltest

_ENTRY_TYPE = Tuple[datetime, int]


def read_fake_data(filename: str = "fake") -> Generator[_ENTRY_TYPE, None, None]:
    for timestamp in count():
        if randint(0, 7 * 60 * 60 * 24 - 1) == 1:
            value = normalvariate(0, 1)
        else:
            value = 100
        yield datetime.fromtimestamp(timestamp), value


def groupby_day(iterable: Iterable[_ENTRY_TYPE]) -> Generator[List[_ENTRY_TYPE], None, None]:
    for day, data_group in groupby(iterable, lambda row: row[0].day):
        yield list(data_group)


def is_normal(data: List[_ENTRY_TYPE], threshold: float = 1e-3) -> bool:
    _, values = zip(*data)
    k2, p_value = normaltest(values)
    if p_value < threshold:
        return False
    return True


def filter_anomalous_groups(
    data: Iterable[_ENTRY_TYPE],
) -> Generator[List[_ENTRY_TYPE], None, None]:
    yield from filterfalse(is_normal, data)


def filter_anomalous_data(data: Iterable[_ENTRY_TYPE]) -> Generator[List[_ENTRY_TYPE], None, None]:
    data_group = groupby_day(data)
    yield from filter_anomalous_groups(data_group)


if __name__ == "__main__":
    data = read_fake_data("fake")
    anomaly_generator = filter_anomalous_data(data)
    first_five_anomalies = islice(anomaly_generator, 5)

    for data_anomaly in first_five_anomalies:
        start_date = data_anomaly[0][0]
        end_date = data_anomaly[-1][0]
        print(f"Anomaly from {start_date} - {end_date}")
```
5.3 Useful Itertools
- `cycle`, `repeat`, `chain`, `groupby`, `islice`
- `compress`: like boolean indexing in pandas
- `accumulate`: reduce through summation
- `takewhile`, `dropwhile`: add a condition that will end a generator
- `starmap`: used instead of `map()` when argument parameters are already grouped in tuples from a single iterable
- `tee(iterable, n)`: return n independent iterators from a single iterable; useful for splitting one generator into n generators (see the sketch after this list)
- `zip_longest`: `zip_longest('ABCD', 'xy', fillvalue='-') --> Ax By C- D-`
- Combinatoric iterators: `product` (Cartesian product), `permutations`, `combinations`
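A few of the above in action (a minimal sketch):

```python
from itertools import accumulate, takewhile, tee, zip_longest

print(list(accumulate([1, 2, 3, 4])))                  # running sums: [1, 3, 6, 10]
print(list(takewhile(lambda x: x < 3, [1, 2, 3, 1])))  # [1, 2]; stops at first failure

gen_a, gen_b = tee(iter(range(3)), 2)                  # two independent iterators
print(list(gen_a), list(gen_b))                        # [0, 1, 2] [0, 1, 2]

print(list(zip_longest("ABCD", "xy", fillvalue="-")))
# [('A', 'x'), ('B', 'y'), ('C', '-'), ('D', '-')]
```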
6 Vectorization
Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Vectorization of computations can occur only if we can fill the CPU cache with all the relevant data. Modern CPUs provide direct support for vector operations, where a single instruction is applied to multiple data (SIMD). Python doesn't natively support vectorization for two reasons:
- Python lists store pointers to the actual data.
- Python bytecode is not optimized for vectorization. (raw machine code uses nonvectorized operations)
6.1 array module
Objects in an array are stored sequentially in memory. Using the array type when creating lists of data that must be iterated on is actually slower than simply creating a list, because the array object stores a very low-level representation of its numbers, which must be converted into a Python-compatible object before being returned to the user. This extra overhead happens every time you index an array type. That implementation decision makes the array object less suitable for math and more suitable for storing fixed-type data more efficiently in memory.
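A minimal sketch (the typecode `'l'` stores C signed longs):

```python
from array import array

# one contiguous block of machine longs, instead of a list of pointers
# to full Python int objects
arr = array('l', range(1_000_000))
print(arr.itemsize, arr[0], arr[-1])
```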
6.2 numpy
numpygives us memory locality and vectorized operations.
6.2.1 Comparison with Built-in Module
```python
from array import array

import numpy


def norm_square_list(vector):
    """
    >>> vector = list(range(1_000_000))
    >>> %timeit norm_square_list(vector)
    85.5 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    """
    norm = 0
    for v in vector:
        norm += v * v
    return norm


def norm_square_list_comprehension(vector):
    """
    >>> vector = list(range(1_000_000))
    >>> %timeit norm_square_list_comprehension(vector)
    80.3 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    """
    return sum([v * v for v in vector])


def norm_square_array(vector):
    """
    >>> vector_array = array('l', range(1_000_000))
    >>> %timeit norm_square_array(vector_array)
    101 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    """
    norm = 0
    for v in vector:
        norm += v * v
    return norm


def norm_square_numpy(vector):
    """
    >>> vector_np = numpy.arange(1_000_000)
    >>> %timeit norm_square_numpy(vector_np)
    3.22 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    """
    return numpy.sum(vector * vector)


def norm_square_numpy_dot(vector):
    """
    >>> vector_np = numpy.arange(1_000_000)
    >>> %timeit norm_square_numpy_dot(vector_np)
    960 µs ± 41.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    """
    # we don't need to store the intermediate value of vector * vector
    # as in norm_square_numpy
    return numpy.dot(vector, vector)
```
6.2.2 In-Place Operations
Using in-place operations can help avoid the memory allocations. It is important to note that this effect happens only when the array sizes are bigger than the CPU cache! When the arrays are smaller and the two inputs and the output can all fit into cache, the out-of-place operation is faster because it can benefit from vectorization.
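A sketch of the two forms (the array size is illustrative, chosen to be larger than a typical CPU cache):

```python
import numpy as np

a = np.random.random(10_000_000)
b = np.random.random(10_000_000)

a = a + b  # out-of-place: allocates a fresh result array, then rebinds a
a += b     # in-place: writes into a's existing buffer; no new allocation
```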
6.2.3 Numba
Numba is a just-in-time compiler that specializes in numpy code, which it compiles via the LLVM compiler at runtime. The beauty is that you provide a decorator telling it which functions to focus on and then you let Numba take over. It can automatically generate code for GPUs.
```python
import numpy as np
from numba import jit, prange

input_array = np.array(range(100000))
output_array = np.zeros(len(input_array))


def double_all(array, output):
    for idx, value in enumerate(array):
        output[idx] = value * 2
# %timeit double_all(input_array, output_array)
# 25.3 ms ± 774 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

double_all_numba = jit(double_all, nopython=True)
# %timeit double_all_numba(input_array, output_array)
# 43.8 µs ± 481 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


@jit()
def double_all_jit(array, output):
    for idx, value in enumerate(array):
        output[idx] = value * 2
# %timeit double_all_jit(input_array, output_array)
# 44.2 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


# The nopython specifier means that if Numba cannot compile all of the code,
# it will fail. Adding parallel enables support for prange.
@jit(nopython=True, parallel=True, nogil=True)
def double_all_jit_in_parallel(array, output):
    len_array = len(array)
    for idx in prange(len_array):
        output[idx] = array[idx] * 2
# %timeit double_all_jit_in_parallel(input_array, output_array)
# 14.4 µs ± 238 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
- Debugging with Numba: use `double_all_jit.inspect_types()`.
- Try to make your code compile in nopython mode.
- Your best approach is to break your current code into small (<10 line) functions and tackle these one at a time.
6.3 numexpr
One downfall of numpy's optimization of vector operations is that it occurs on only one
operation at a time. numexpr can help take an entire vector expression and compile it into
very efficient code that is optimized to minimize cache misses and temporary space used. In
addition, the expressions can utilize multiple CPU cores (with OpenMP).
6.3.1 How to Use numexpr
Simply rewrite the expressions as strings with references to local variables. The expressions are compiled behind the scenes and run using optimized code.
```python
import numpy as np
from numexpr import evaluate

data = np.array(range(1000000))

%timeit data + data * 5 + 4
# 1.95 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit evaluate("data + data * 5 + 4")
# 787 µs ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
6.4 pandas
- Operations on columns often generate temporary intermediate arrays, which consume RAM.
- Make `pandas` parallel and scalable with the `dask` module.
- Columns of the same dtype are grouped together by a BlockManager.
- `df.apply(..., raw=True)` stops the creation of an intermediate Series object and hands the function raw numpy arrays instead.
- Install the optional dependencies `numexpr` and `bottleneck` for additional performance improvements.
- Use `bulwark` to check the schema of dataframes up front.
- Convert a Series to the Category dtype when dealing with low-cardinality data: `df["col"].astype("category")`. A sketch of the first and last hints follows.
6.4.1 Modin
Parallelizing with dask/ray.
- `pip install modin[dask]` or `pip install modin[ray]`
- `export MODIN_ENGINE=dask` or `os.environ["MODIN_ENGINE"] = "dask"`
- `import modin.pandas as pd`
6.4.2 cuDF
GPU DataFrame Library
6.4.3 vaex
- Vaex specializes in both larger datasets and string-heavy operations
- Vaex offers a slew of built-in visualization functions.
6.5 Summary
Two main routes:
- reducing the time taken to get data to the CPU.
- reducing the amount of work that the CPU has to do.
7 Compiling to Bytecode
- Using Python API for C
- C-based compiling: Cython
- LLVM-based compiling: Numba
- just-in-time compiler: PyPy
7.1 TODO Python API
7.2 Cython
```python
# In setup.py
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("cythonfn.pyx",
                            compiler_directives={"language_level": "3"}))
```
When we run `python setup.py build_ext --inplace`, Cython will look for `cythonfn.pyx` and build `cythonfn[…].so`.
7.2.1 pyximport
Simplifies the build system; no need to write a setup.py.
```python
# In client_code.py
import pyximport
pyximport.install(language_level=3)

import cythonfn
# followed by the usual code
```
7.2.2 Analyzing Generated C Code
`cython -a cythonfn.pyx` generates `cythonfn.html`
- More yellow means "more calls into the Python virtual machine"
- More white means "more non-Python C code"
7.2.3 Type Annotation
```cython
def add(int x, int y):
    # parameters are typed in the signature; cdef declares extra C locals
    cdef unsigned int result
    result = x + y
    return result
```
7.2.4 Cython Flags
```cython
#cython: boundscheck=False
def test(...):
    ...
```
7.2.5 Parallelizing on One Machine
With Cython, OpenMP can be added by using the `prange` (parallel range) operator and adding the `-fopenmp` compiler flag to setup.py. Work in a `prange` loop can be performed in parallel because we disable the GIL.
7.2.6 Example
```cython
# cythonfn.pyx
#cython: boundscheck=False  # disable bounds checking on arrays
from cython.parallel import prange
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""
    # [:] annotates the buffer protocol; a single colon means one-dimensional data
    cdef unsigned int i, length
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)  # annotate output
    length = len(zs)
    with nogil:  # disable the GIL
        # parallelizing in the prange block
        for i in prange(length, schedule="guided"):  # guided scheduling
            z = zs[i]
            c = cs[i]
            output[i] = 0
            while output[i] < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                output[i] += 1
    return output
```
In setup.py
```python
# setup.py
from distutils.core import setup
from distutils.extension import Extension

import numpy as np
from Cython.Build import cythonize

ext_modules = [
    Extension(
        "cythonfn",
        ["cythonfn.pyx"],
        extra_compile_args=["-fopenmp"],  # OpenMP compiler flag
        extra_link_args=["-fopenmp"],     # OpenMP linker flag
    )
]

setup(
    ext_modules=cythonize(
        ext_modules,
        compiler_directives={"language_level": "3"},
    ),
    include_dirs=[np.get_include()],
)
```
- Available scheduling approaches: `static`, `dynamic`, `guided`
7.3 PyPy
- Different GC strategy (mark-and-sweep) than CPython (reference counting)
- PyPy supports projects like numpy that require C bindings through the CPython extension compatibility layer cpyext, but it has an overhead of 4–6×, which generally makes numpy too slow.
- May use more RAM than CPython.
7.4 Transonic
Transonic attempts to unify Cython, Pythran, and Numba, and potentially other compilers, behind one interface to enable quick evaluation of multiple compilers without having to rewrite code.
7.5 Foreign Function Interfaces
7.5.1 C .so
```c
int add_two(int a, int b) {
    return a + b;
}
```
- `gcc -O3 -std=gnu11 -c libtest.c`
- `gcc -shared -o libtest.so libtest.o`
- Put `libtest.so` in `/usr/lib`, `/usr/local/lib`, or somewhere else that is accessible to Python code.
7.5.2 ctypes
The most basic foreign function interface in CPython is through the ctypes module.
```python
import ctypes

_libtest = ctypes.CDLL("libtest.so")

# Create references to the C types that we will need, to simplify future code
TYPE_INT = ctypes.c_int

# Initialize the signature of the add_two function to:
# int add_two(int, int)
_libtest.add_two.argtypes = [TYPE_INT, TYPE_INT]
_libtest.add_two.restype = TYPE_INT


def add_two_in_python(a: int, b: int) -> int:
    # First we convert the Python types into the relevant C types
    a = TYPE_INT(a)
    b = TYPE_INT(b)
    return _libtest.add_two(a, b)
```
C structure:

```python
from ctypes import Structure, c_int

class cPoint(Structure):
    _fields_ = [("x", c_int), ("y", c_int)]

point = cPoint(10, 5)
```
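The fields then behave like ordinary attributes backed by the struct's memory:

```python
print(point.x, point.y)  # => 10 5
point.x = 42             # writes straight into the underlying C struct
```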