Data Analysis for Python

1 numpy

1.1 ndarray

import numpy as np
py_array = [1, 2, 3, 4, 5]
np_array = np.array(py_array)
np_array.shape #=> (5,)
np_array.dtype #=> dtype('int64')

py_array = [[1, 2, 3, 4], [5, 6, 7, 8]]
np_array = np.array(py_array)
np_array.shape #=> (2, 4)
np_array.ndim #=> 2

np.zeros(10)
#=> array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
np.ones((2, 3))
#=> array([[ 1.,  1.,  1.], [ 1.,  1.,  1.]])
np.empty((2, 3, 2))
#=>
# array([[[ 0.,  0.],
#         [ 0.,  0.],
#         [ 0.,  0.]],

#        [[ 0.,  0.],
#         [ 0.,  0.],
#         [ 0.,  0.]]])

array = np.arange(5)
#=> array([0, 1, 2, 3, 4])
masked_array = np.ma.masked_array(array, array<3)
#=> [-- -- -- 3 4]
np.random.randn(2, 3)

np_array = np.array([1, 2, 3], dtype=np.int32)
np_array = np_array.astype(np.float64)
np_array.dtype #=> dtype('float64')

x, y = np.meshgrid(np.arange(2), np.arange(2))
#x=>
# array([[0, 1],
#        [0, 1]])
#y=>
# array([[0, 0],
#        [1, 1]])

1.1.1 Strides

Strides indicate the number of bytes to "step" in memory in order to advance one element along each dimension. A \(3 \times 4 \times 5\) array of float64 (8 bytes per element) has strides (160, 40, 8).
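
A quick check (a minimal sketch; the shape matches the numbers above):

arr = np.zeros((3, 4, 5), dtype=np.float64)
arr.strides #=> (160, 40, 8): 4*5*8 bytes to the next block, 5*8 to the next row, 8 to the next element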

1.1.2 Supported dtype

  • int8, uint8, int16, uint16, int32, uint32, int64, uint64
  • float16, float32, float64, float128
  • complex64, complex128, complex256
  • bool
  • object
  • string_
  • unicode_

1.1.3 Creating

Function Description
array Convert input data (list, tuple, array, or other sequence type) to an ndarray
asarray Convert input to ndarray, but do not copy if the input is already an ndarray
arange Like the built-in range but returns an ndarray instead of a list.
ones, ones_like Produce an array of all 1’s with the given shape and dtype.
zeros, zeros_like Produce an array of all 0’s with the given shape and dtype.
empty, empty_like Create new arrays by allocating new memory
eye, identity Create a square N x N identity matrix (1’s on the diagonal and 0’s elsewhere)
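
A few of these in action (a small illustration):

a = np.array([1, 2, 3])
np.asarray(a) is a #=> True, no copy for an existing ndarray
np.array(a) is a   #=> False, array copies by default
np.zeros_like(a)   #=> array([0, 0, 0])
np.eye(3)          # 3 x 3 identity matrix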

1.1.4 Indexing

basic indexing
arr = np.array(range(12)).reshape(3,4)
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11]])

arr[1:, :3]
# array([[ 4,  5,  6],
#        [ 8,  9, 10]])
boolean indexing
flags = np.array([1, 2, 3, 2, 2, 1])
data = np.random.randn(6,2)
# array([[ 2.11684529,  1.24861544],
#        [-0.34817586, -0.59905366],
#        [-0.84976431,  0.11840417],
#        [ 1.36648373,  1.33416664],
#        [ 0.37616856, -0.0032112 ],
#        [-0.7749904 , -0.60457688]])
flags == 2
# array([False,  True, False,  True,  True, False], dtype=bool)
data[flags == 2]
# array([[-0.34817586, -0.59905366],
#        [ 1.36648373,  1.33416664],
#        [ 0.37616856, -0.0032112 ]])
data[(flags == 3) | (flags == 1), 1:]
# array([[ 1.24861544],
#        [ 0.11840417],
#        [-0.60457688]])
data[data<0] = 0
# array([[ 2.11684529,  1.24861544],
#        [ 0.        ,  0.        ],
#        [ 0.        ,  0.11840417],
#        [ 1.36648373,  1.33416664],
#        [ 0.37616856,  0.        ],
#        [ 0.        ,  0.        ]])
fancy indexing

Fancy indexing is the term adopted by NumPy for indexing with integer arrays.
Unlike slicing, fancy indexing always copies the data into a new array.

arr
# array([[ 0.,  0.,  0.,  0.],
#        [ 1.,  1.,  1.,  1.],
#        [ 2.,  2.,  2.,  2.]])
arr[[2, 1]]
# array([[ 2.,  2.,  2.,  2.],
#        [ 1.,  1.,  1.,  1.]])
arr = np.array(range(12)).reshape(3,4)
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11]])
arr[[1,2], [2,3]]
# array([ 6, 11]) # choose location (1, 2) (2, 3)

arr[[1,2]][:, [2,3]]
# array([[ 6,  7],
#        [10, 11]])
arr[np.ix_([1,2], [2,3])] # same effect
# array([[ 6,  7],
#        [10, 11]])

1.1.5 Reshape 1D to 2D

arr = np.array([1,2,3])
arr.reshape((-1, 1))
array([[1],
       [2],
       [3]])

1.2 matrix

np.matrix is always two-dimensional, and its * operator performs row-column matrix multiplication (unlike ndarray, where * is element-wise)
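
A small sketch of the difference from a plain ndarray (np.matrix is discouraged nowadays in favor of 2-D ndarrays with the @ operator):

m = np.matrix([[1, 2], [3, 4]])
m * m   # row-column matrix product
#=> matrix([[ 7, 10],
#           [15, 22]])

a = np.array([[1, 2], [3, 4]])
a * a   # ndarray: element-wise product
#=> array([[ 1,  4],
#          [ 9, 16]])
a @ a   # matrix product for ndarrays
#=> array([[ 7, 10],
#          [15, 22]])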

1.3 Conditional Logic

1.3.1 any, all

1.3.2 numpy.where

The numpy.where function is a vectorized version of the ternary expression x if condition else y, and is typically much faster than a list comprehension.

xarr = np.array([1,1,1,1,1])
yarr = np.array([2,2,2,2,2])
cond = np.array([True, False, True, False, False])
np.where(cond, xarr, yarr)
# array([1, 2, 1, 2, 2])
np.where(cond, 4, 3)
# array([4, 3, 4, 3, 3])

1.4 Transpose

Simple transposing with .T is just a special case of swapaxes
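
For example:

arr = np.arange(6).reshape(2, 3)
arr.T   # same as arr.swapaxes(0, 1) or arr.transpose()
#=> array([[0, 3],
#          [1, 4],
#          [2, 5]])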

1.5 Useful functions

1.5.1 Math

Method Description
sign Returns an array of 1, 0, or -1 depending on the sign of each value
sum Sum of all the elements in the array or along an axis. Zero-length arrays have sum 0
mean Arithmetic mean. Zero-length arrays have NaN mean
std, var Standard deviation and variance, respectively, with optional degrees of freedom adjustment
min, max Minimum and maximum
argmin, argmax Indices of minimum and maximum elements, respectively
cumsum Cumulative sum of elements starting from 0
cumprod Cumulative product of elements starting from 1
abs, fabs Compute the absolute value; fabs is faster for non-complex-valued data
modf Return fractional and integral parts of the array as separate arrays
rint Round elements to the nearest integer, preserving the dtype
average Compute the weighted average along the specified axis.
exp Calculate the exponential of all elements in the input array.

1.5.2 Linear Algebra

Function Description
diag Return the diagonal (or off-diagonal) elements of a square matrix
dot Matrix multiplication
trace Compute the sum of the diagonal elements
det Compute the matrix determinant
eig Compute the eigenvalues and eigenvectors of a square matrix
inv Compute the inverse of a square matrix
pinv Compute the Moore-Penrose pseudo-inverse of a matrix
qr Compute the QR decomposition
svd Compute the singular value decomposition (SVD)
solve Solve the linear system Ax = b for x, where A is a square matrix
lstsq Compute the least-squares solution to y = Xb

1.5.3 Random Number Generation

Function Description
seed Seed the random number generator
permutation Return a random permutation of a sequence, or return a permuted range
shuffle Randomly permute a sequence in place
rand Draw samples from a uniform distribution
randint Draw random integers from a given low-to-high range
randn Draw samples from a normal distribution with mean 0 and standard deviation 1 (MATLAB-like interface)
binomial Draw samples from a binomial distribution
normal Draw samples from a normal (Gaussian) distribution
beta Draw samples from a beta distribution
chisquare Draw samples from a chi-square distribution
gamma Draw samples from a gamma distribution
uniform Draw samples from a uniform [0, 1) distribution

1.5.4 Set operations

Function Description
unique(x) Compute the sorted, unique elements in x
intersect1d(x, y) Compute the sorted, common elements in x and y
union1d(x, y) Compute the sorted union of elements
in1d(x, y) Compute a boolean array indicating whether each element of x is contained in y
setdiff1d(x, y) Set difference, elements in x that are not in y
setxor1d(x, y) Set symmetric differences; elements that are in either of the arrays, but not both

2 pandas

2.1 Series

A Series behaves like a fixed-length, ordered dict mapping index values to data values.
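
A minimal sketch of the dict-like behaviour:

import pandas as pd

s = pd.Series([4, 7, -5], index=['a', 'b', 'c'])
s['b']       #=> 7, lookup by index label
'b' in s     #=> True
pd.Series({'a': 4, 'b': 7, 'c': -5})   # build directly from a dict
s.to_dict()  #=> {'a': 4, 'b': 7, 'c': -5}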

2.2 DataFrame

df = pd.DataFrame(np.arange(8).reshape(4,2),
		  columns=['c1', 'c2'], index=['r1', 'r2', 'r3', 'r4'])

df.loc['r1'] # retrieve row
# c1    0
# c2    1
# Name: r1, dtype: int64

df.loc[['r1', 'r2']]
#     c1  c2
# r1   0   1
# r2   2   3

df.T
#     r1  r2  r3  r4
# c1   0   2   4   6
# c2   1   3   5   7

del df['c2']

df.columns
# Index([u'c1'], dtype='object')

2.3 Index

Index objects are immutable and function like a fixed-size set.

  • main type: Index, Int64Index, MultiIndex, DatetimeIndex, PeriodIndex
Method Description
append Concatenate with additional Index objects, producing a new Index
diff Compute set difference as an Index
intersection Compute set intersection
union Compute set union
isin Compute boolean array indicating whether each value is contained in the passed collection
delete Compute new Index with element at index i deleted
drop Compute new index by deleting passed values
insert Compute new Index by inserting element at index i
is_monotonic Returns True if each element is greater than or equal to the previous element
is_unique Returns True if the Index has no duplicate values
unique Compute the array of unique values in the Index

2.3.1 rename with functions

  • data.rename(index=str.title, columns=str.upper)

2.4 Functionality

2.4.1 Reindexing

Example:

frame.reindex(columns=['c1', 'c2'])
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=['c1', 'c2'])
frame.reindex(frame2.index, method='ffill')
# is similar to frame.loc[['a', 'b', 'c', 'd'], ['c1', 'c2']]

reindex args: index, method, fill_value, limit, level, copy

2.4.2 Drop

frame.drop(['r1', 'r2'])
frame.drop(['c1', 'c2'], axis=1)

2.4.3 Selection

Indexing options:

Type Notes
df[val] Select column
df.loc[val] Select row
df.loc[:, val] Select column
df.loc[val1, val2] Select both column and row
df.iloc[where] Select row by int position
df.iloc[:, where] Select column by int position
df.iloc[where_i, where_j] Select both column and row by int position
df.at[label_i, label_j] Select a single scalar value by row and column label
df.iat[i, j] Select a single scalar value by int position
get_value, set_value Select single value by row and column label
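
A brief illustration, re-creating the small frame from 2.2 (before c2 was deleted):

df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  columns=['c1', 'c2'], index=['r1', 'r2', 'r3', 'r4'])
df['c1']            # column by name
df.loc['r2']        # row by label
df.loc['r2', 'c2']  #=> 3
df.iloc[1, 1]       #=> 3, same cell by integer position
df.at['r2', 'c2']   #=> 3, fast scalar access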

2.4.4 Arithmetic

  • Basic: df1 + df2
  • use the add method to fill NA values: df1.add(df2, fill_value=0)
  • Operations between DataFrame and Series
df = pd.DataFrame(np.arange(6).reshape(3,2),
		  columns=['c1', 'c2'], index=['r1', 'r2', 'r3'])

#     c1  c2
# r1   0   1
# r2   2   3
# r3   4   5

s = pd.Series([4,5], index=['c1', 'c2'])

# c1    4
# c2    5
# dtype: int64

df + s
#     c1  c2
# r1   4   6
# r2   6   8
# r3   8  10


s2 = pd.Series([1,2,3], index=['r1', 'r2', 'r3'])
# r1    1
# r2    2
# r3    3
# dtype: int64

df.add(s2, axis=0, fill_value=0)
#     c1  c2
# r1   1   2
# r2   4   5
# r3   7   8

df['sum_c'] = df.eval('c1+c2')
#     c1  c2  sum_c
# r1   0   1      1
# r2   2   3      5
# r3   4   5      9

2.4.5 Broadcasting

frame = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'), index=list('1234'))
series = frame.iloc[0]
frame - series
#=>
#    b  d  e
# 1  0  0  0
# 2  3  3  3
# 3  6  6  6
# 4  9  9  9

series2 = pd.Series(range(3), index=list('bef'))
frame + series2
#=>
#      b   d     e   f
# 1  0.0 NaN   3.0 NaN
# 2  3.0 NaN   6.0 NaN
# 3  6.0 NaN   9.0 NaN
# 4  9.0 NaN  12.0 NaN

2.4.6 Applying

f = lambda x: x.max()
frame.apply(f)
frame.apply(f, axis=1)

apply can also return a Series

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
#=>      b    d    e
# min  xxx  xxx  xxx
# max  xxx  xxx  xxx
  • map method for applying an element-wise function on a Series
  • applymap for applying an element-wise function on a DataFrame

2.4.7 Sorting

df.sort_index()

# by column(s)
df.sort_values(by='c1')
df.sort_values(by=['c1', 'c2'])

series.sort_values()

2.4.8 Ranking

By default, rank breaks ties by assigning each group the mean rank

  • method arg: 'average' (default), 'min', 'max', 'first'
obj = pd.Series([7, -5, 7, 4, 2, 0, 4, 7])
obj.rank()
#=>
# 0    7.0
# 1    1.0
# 2    7.0
# 3    4.5
# 4    3.0
# 5    2.0
# 6    4.5
# 7    7.0
obj.rank(method='first', ascending=False)
#=>
# 0    1.0
# 1    8.0
# 2    2.0
# 3    4.0
# 4    6.0
# 5    7.0
# 6    5.0
# 7    3.0

2.4.9 Binning

  • cut
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]  # example data
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats # Categories object [(18,25], (25, 35], ...]

useful options: labels, precision

  • qcut
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]) # pass quantiles

2.4.10 Other functions

  • numpy ufuncs work fine with pandas objects
  • isnull, notnull, dropna, fillna
  • stack, unstack, swaplevel, sortlevel
  • set_index, reset_index
  • unique(series based), value_counts(series based), isin(element-wise)
  • all, any
  • replace: data.replace(-999, np.nan)
  • cut, qcut

2.5 Statistical methods

Basic: count, describe, min, max, quantile, sum, pct_change, diff, corr, cov, corrwith

2.5.1 mean, median, mad, var, std

2.5.2 argmin, argmax, idxmin, idxmax

argmin/argmax: compute the integer index locations at which the minimum or maximum value is obtained; idxmin/idxmax return the corresponding index labels

2.5.3 cumsum, cummin, cummax, cumprod

2.5.4 skew

Sample skewness (3rd moment) of values

2.5.5 kurt

Sample kurtosis (4th moment) of values

2.5.6 diff

Compute 1st arithmetic difference (useful for time series)

2.5.7 corr, cov, corrwith

import pandas_datareader as pdr

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = pdr.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')

price = pd.DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
returns = price.pct_change()

returns.MSFT.corr(returns.IBM)
returns.MSFT.cov(returns.IBM)

returns.corr()

factors_df.corrwith(prices)
  • DataFrame.corrwith: Compute pairwise correlation.

2.5.8 common args

Method Description
axis Axis to reduce over. 0 for DataFrame’s rows and 1 for columns
skipna Exclude missing values, True by default
level Reduce grouped by level if the axis is hierarchically-indexed (MultiIndex)

skipna option:

  • True(default): NA values are excluded unless the entire slice (row or column in this case) is NA
  • False: if any value is NA, then return NA

2.6 Hierarchical Indexing

2.6.1 Indexing

data[index_level1]
data[index_level1 : index_level1]
data[[index_level1, index_level1]]

select by inner level:
data[:, index_level2]
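
A concrete sketch with a two-level index:

data = pd.Series(np.arange(6.),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'], [1, 2, 1, 2, 1, 2]])
data['b']       # outer level
# 1    2.0
# 2    3.0
data['b':'c']   # slice on the outer level
data[:, 2]      # select by the inner level
# a    1.0
# b    3.0
# c    5.0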

2.6.2 stack, unstack

2.6.3 swaplevel, sortlevel

2.7 Panel

Panel can be thought of as a three-dimensional analogue of DataFrame, although hierarchical indexing makes truly N-dimensional arrays unnecessary in many cases.

pdata = pd.Panel({stk: pdr.get_data_yahoo(stk, '1/1/2009', '6/1/2012')
		  for stk in ['AAPL', 'GOOG']})

2.7.1 Useful functions

  • ix

    pdata.ix[:, '6/1/2012', :]
    
  • swapaxes

    pdata.swapaxes('items', 'minor')['Adj Close']
    
  • to_frame

    index will be the [major, minor] axis, items will be the columns

    stacked = pdata.to_frame()
    
  • to_panel

    stacked.to_panel()
    

2.8 config options

pd.options categories:

  • compute
  • display
  • io
  • mode
  • plotting
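
Options can be read and set through attribute access or get_option/set_option, for example:

pd.options.display.max_rows        #=> 60 (the default)
pd.options.display.max_rows = 20   # same as pd.set_option('display.max_rows', 20)
pd.reset_option('display.max_rows')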

2.9 Functional Method Chaining

# Usual non-functional way
df2 = df.copy()
df2["k"] = v

# Functional assign way
df2 = df.assign(k=v)

result = df2.assign(col1_demeaned=df2.col1 - df2.col2.mean()).groupby("key").col1_demeaned.std()
  • Assigning in-place may execute faster than using assign, but assign enables easier method chaining

2.9.1 Fancy Method Chaining

df = load_data()
df2 = df[df['col2'] < 0]
# This can be rewritten as:
df = (load_data()
      [lambda x: x['col2'] < 0])

# write the entire sequence as a single chained expression
result = (load_data()
	  [lambda x: x.col2 < 0]
	  .assign(col1_demeaned=lambda x: x.col1 - x.col1.mean())
	  .groupby('key')
	  .col1_demeaned.std())

2.9.2 pipe method

The statements f(df) and df.pipe(f) are equivalent.

A useful pattern for pipe:
def group_demean(df, by, cols):
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
	result[c] = df[c] - g[c].transform("mean")
    return result


result = df[df.col1 < 0].pipe(group_demean, ["key1", "key2"], ["col1"])

3 Wrangling

3.1 Evaluate Data

3.1.1 Quality

Low-quality data is called dirty data.

  • missing values
  • invalid values
  • inaccurate values
  • inconsistent values, e.g. different units (cm vs. inch)
Evaluation methods

head, tail, info, value_counts, plot

3.2 Dealing with Missing Data

3.2.1 Imputation

Why impute
  • Not much data
  • Removing data could affect representativeness
Methods (a fillna-based sketch follows the list)
Mean Imputation
Drawbacks: lessens correlations between variables
Linear Regression
Drawbacks: overemphasizes trends; existing values suggest too much certainty.
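
A minimal mean-imputation sketch, assuming a DataFrame df with a numeric column 'col' that contains NaN:

df['col'] = df['col'].fillna(df['col'].mean())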

3.3 Data Input/Output

3.3.1 Reading option categories

Indexing

Can treat one or more columns as the index of the returned DataFrame, and choose whether to get column names from the file, from the user, or not at all.

Type inference and data conversion

This includes user-defined value conversions and a custom list of missing value markers.

Datetime parsing

Includes the capability to combine date and time information spread over multiple columns into a single column in the result.

Iterating

Support for iterating over chunks of very large files.

Unclean data issues

Skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

3.3.2 Hints

  • from_csv: a convenience method simpler than read_csv
  • pickle related: load, save

3.3.3 Useful read_csv parameters

  • nrows
  • chunksize
chunker = pd.read_csv("data.csv", chunksize=1000)
for piece_df in chunker:
    # do something with piece_df
    ...

3.4 Concatenation

  • pd.merge, merge method
  • join method: performs a left join on the join keys
  • pd.concat
  • combine_first: patching missing data from another df
  • align: align two object on their axes with the specified join method for each axis Index

3.4.1 concat args

Argument Description
objs List or dict of pandas objects to be concatenated. The only required argument
axis Axis to concatenate along; defaults to 0
join One of 'inner', 'outer' , defaulting to 'outer'
join_axes Specific indexes to use for the other n-1 axes instead of performing union/intersection logic
keys Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis
levels Specific indexes to use as hierarchical index level or levels if keys passed
names Names for created hierarchical levels if keys and / or levels passed
verify_integrity Check the new axis in the concatenated object for duplicates and raise an exception if so. By default (False), duplicates are allowed
ignore_index Do not preserve indexes along concatenation axis , instead producing a new range(total_length) index
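
For example:

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3], index=['c', 'd'])
pd.concat([s1, s2])                        # stack along axis 0
pd.concat([s1, s2], keys=['one', 'two'])   # hierarchical index on the concat axis
pd.concat([s1, s2], axis=1)                # align as columns, producing a DataFrame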

3.5 Reshaping and Pivoting

3.5.1 stack

Pivots from the columns in the data to the rows. Stacking filters out missing data by default.

data = DataFrame(np.arange(6).reshape((2, 3)), columns=['a', 'b', 'c'])
data
#=>
#    a  b  c
# 0  0  1  2
# 1  3  4  5
data.stack()
#=>
# 0  a    0
#    b    1
#    c    2
# 1  a    3
#    b    4
#    c    5
# dtype: int64

3.5.2 unstack

Pivots from the rows into the columns

data.stack().unstack()
#=>
#    a  b  c
# 0  0  1  2
# 1  3  4  5

# can specific level number or name
data.stack().unstack(0)
#=>
#    0  1
# a  0  3
# b  1  4
# c  2  5

3.5.3 pivot & melt

pivot is a shortcut for creating a hierarchical index using set_index and reshaping with unstack

quotes.head()
#=>
#                   Open        High         Low       Close    Volume  \
# Date
# 2010-01-04  626.951088  629.511067  624.241073  626.751061   3927000
# 2010-01-05  627.181073  627.841071  621.541045  623.991055   6031900
# 2010-01-06  625.861078  625.861078  606.361042  608.261023   7987100
# 2010-01-07  609.401025  610.001045  592.651008  594.101005  12876600
# 2010-01-08  592.000997  603.251034  589.110988  602.021036   9483900

#              Adj Close symbol
# Date
# 2010-01-04  313.062468   GOOG
# 2010-01-05  311.683844   GOOG
# 2010-01-06  303.826685   GOOG
# 2010-01-07  296.753749   GOOG
# 2010-01-08  300.709808   GOOG
quotes.pivot(columns='symbol', values='Close')
#=>
# symbol            AAPL        GOOG         IBM       MSFT
# Date
# 2010-01-04  214.009998  626.751061  132.449997  30.950001
# 2010-01-05  214.379993  623.991055  130.850006  30.959999
# 2010-01-06  210.969995  608.261023  130.000000  30.770000
# 2010-01-07  210.580000  594.101005  129.550003  30.450001
# 2010-01-08  211.980005  602.021036  130.850006  30.660000
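
melt is the inverse operation, pivoting columns into rows. A small self-contained sketch:

df = pd.DataFrame({'key': ['foo', 'bar'], 'A': [1, 2], 'B': [3, 4]})
melted = pd.melt(df, id_vars=['key'])
#    key variable  value
# 0  foo        A      1
# 1  bar        A      2
# 2  foo        B      3
# 3  bar        B      4
melted.pivot(index='key', columns='variable', values='value')  # back to wide form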

3.6 Random Sampling

df.take, df.sample

df = DataFrame(np.arange(5 * 4).reshape(5, 4))
df
#=>
#     0   1   2   3
# 0   0   1   2   3
# 1   4   5   6   7
# 2   8   9  10  11
# 3  12  13  14  15
# 4  16  17  18  19

# simple sampling
df.sample(n=3)

sampler = np.random.permutation(5)
sampler
#=> array([1, 3, 4, 0, 2])
df.take(sampler)
#=>
#     0   1   2   3
# 1   4   5   6   7
# 3  12  13  14  15
# 4  16  17  18  19
# 0   0   1   2   3
# 2   8   9  10  11

3.7 String Manipulation

series.str.XXX  # vectorized string methods are accessed through the .str attribute of a Series or Index

3.7.1 Vectorized string methods

cat, contains, count, endswith/startswith, findall, get, join, len, lower, upper, match, pad, center, repeat, replace, slice, split, strip/rstrip/lstrip

3.8 Check Duplicates

pandas.Index.is_unique, pandas.Series.is_unique
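
A short illustration:

s = pd.Series(['a', 'b', 'a'])
s.is_unique          #=> False
s.index.is_unique    #=> True
s.duplicated()       # boolean Series, True for repeated occurrences
s.drop_duplicates()  # keep only the first occurrence of each value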

3.9 Indicator/Dummy

To convert a categorical variable into a "dummy" matrix, use pd.get_dummies

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b']})
df
#   key
# 0   b
# 1   b
# 2   a
# 3   c
# 4   a
# 5   b

pd.get_dummies(df['key'])
#    a  b  c
# 0  0  1  0
# 1  0  1  0
# 2  1  0  0
# 3  0  0  1
# 4  1  0  0
# 5  0  1  0

4 Plotting

4.1 Matplotlib Basic

4.1.1 Figures and Subplots

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
# get a reference to the active figure
plt.gcf()

# all in one subplots
fig, axes = plt.subplots(2,3, figsize=(14, 8))

4.1.2 subplots options

Argument Description
figsize Size of figure
nrows Number of rows of subplots
ncols Number of columns of subplots
sharex All subplots use the same X-axis ticks
sharey All subplots use the same Y-axis ticks
subplot_kw Dict of keywords passed to the add_subplot call used to create each subplot
**fig_kw Additional keywords used when creating the figure, e.g. figsize

4.1.3 Adjusting Size

subplots_adjust

  • args: left, right, bottom, top, wspace, hspace

4.1.4 global configuration

plt.rc

plt.rc('figure', figsize=(10, 10))
font_options = {'family': 'monospace',
		'weight': 'bold',
		'size': 'small'}
plt.rc('font', **font_options)

4.1.5 Choosing Plotting Range

plt.xlim, plt.ylim, ax.set_xlim, ax.set_ylim

4.1.6 title, label, tick, ticklabel, legend

set_title, set_xlabel, set_xticks, set_xticklabels, set

ticks = ax.set_xticks([0, 250, 500, 750, 1000])
labels = ax.set_xticklabels(["one", "two", "three", "four", "five"])

# ax.set
props = {
    "title": "My first matplotlib plot",
    "xlabel": "Stages"
}
ax.set(**props)

# legend
ax.legend(loc='best') #show the label

4.1.7 Annotations

ax.text, ax.arrow, ax.annotate

ax.annotate(label, xy=(date, spx.asof(date) + 50),
	    xytext=(date, spx.asof(date) + 200),
	    arrowprops=dict(facecolor='black'),
	    horizontalalignment='left', verticalalignment='top')
  • util function: asof

4.2 pandas Plotting

4.2.1 Line Plots

Series
Argument Description
label Label for plot legend
ax matplotlib subplot object to plot on. If nothing passed, uses active matplotlib subplot
style Style string, like 'ko--', to be passed to matplotlib
alpha The plot fill opacity (from 0 to 1)
DataFrame
Argument Description
kind Can be 'line', 'bar', 'barh', 'kde'
logy Use logarithmic scaling on the Y axis
use_index Use the object index for tick labels
rot Rotation of tick labels (0 through 360)
xticks Values to use for X axis ticks
yticks Values to use for Y axis ticks
xlim X axis limits (e.g. [0, 10] )
ylim Y axis limits
grid Display axis grid (on by default)
subplots Plot each DataFrame column in a separate subplot
sharex, sharey If subplots=True, share the same X/Y axis
figsize Size of figure to create as tuple
title Plot title as string
legend Add a subplot legend ( True by default)
sort_columns Plot columns in alphabetical order
steps

plt.plot(data, drawstyle="steps-post")

4.2.2 Bar Plots

  • kind='bar': for vertical bars, 'barh' for horizontal bars
  • stacked=True: stacked bar plots
  • useful recipe: s.value_counts().plot(kind='bar')

4.2.3 Histogram & Density Plots

  • A kind of bar plot that gives a discretized display of value frequency
comp1 = np.random.normal(0, 1, size=200) # N(0, 1)
comp2 = np.random.normal(10, 2, size=200) # N(10, 4)
values = Series(np.concatenate([comp1, comp2]))
fig = plt.figure(figsize=(10, 5))
values.hist(bins=100, alpha=0.3, color='k', density=True)
values.plot(kind='kde', style='k--')

hist_kde_plot.png

4.2.4 Scatter Plots

pairs plot or scatter plot matrix: pd.plotting.scatter_matrix

4.3 Saving Plots to File

Figure.savefig args

  • fname, dpi, facecolor, edgecolor, format
  • bbox_inches: The portion of the figure to save

4.4 interactive mode

4.5 add-ons

  • mplot3d
  • basemap/cartopy: projections and mapping (plotting 2D data on maps)
  • seaborn/holoviews/ggplot: higher-level plotting interfaces
  • axes_grid: axes and axis helpers

5 Group

5.1 group by

# series groupby
df['data1'].groupby(df['key1'])
df.groupby(df['key1'])['data1'] # syntactic sugar
dict(list(df.groupby('key1')))

# df groupby
df.groupby(['key1', 'key2']) # options: as_index, axis

# get group size
df.groupby('key1').size()

# iterations
for name, group in df.groupby('key1'):
    print(name, group)

5.1.1 Using Mapping Dict(or Series)

# using mapping dict
mapping = {'a': 'group1', 'b': 'group1', 'c': 'group2'}
df.groupby(mapping, axis=1)

5.1.2 Using a Function

Any function passed as a group key will be called once per index value, with the return values being used as the group names

# using function
df.groupby(len).sum()

5.1.3 Using Mixing

Mixing functions with arrays, dicts or Series is not a problem as everything gets converted to arrays internally

key_list = ['one', 'one', 'two']
df.groupby([len, key_list]).min()

5.2 groupby aggregation

Table 1: Optimized Groupby Methods
Name Description
count Number of non-NA values in the group
sum Sum of non-NA values
mean Mean of non-NA values
median Arithmetic median of non-NA values
std, var Unbiased (n - 1 denominator) standard deviation and variance
min, max Minimum and maximum of non-NA values
prod Product of non-NA values
first, last First and last non-NA values

5.2.1 agg

Series
grouped = series.groupby(group_key)
grouped.agg(['mean', 'std', transform_function])
grouped.agg([('foo', 'mean'), ('bar', np.std)]) # foo, bar will be the column names of the result df
grouped.agg({"tip": np.max, "size": "sum"})
DataFrame
grouped = df.groupby("column")
grouped.agg({'col1': 'mean', 'col2': 'std', 'col3': np.max})
grouped.agg({'col1': ['mean', 'std']})

5.2.2 apply

The function passed to apply can take additional positional and keyword arguments:

df.groupby(['key1', 'key2']).apply(top, n=1)  # top is a user-defined function, e.g. one returning the top n rows

5.2.3 transform

Like apply, transform works with functions that return Series, but the result must be the same size as the input

5.2.4 fillna with group value

grouped.apply(lambda g: g.fillna(g.mean()))

fill_values = {'a': 5, 'b': 4}
grouped.apply(lambda g: g.fillna(fill_values[g.name]))

5.3 pivot_table

pivot_table provides a convenience interface to groupby plus reshaping with hierarchical indexing; it can also add partial totals (margins=True).

pivot_table.jpg

  • crosstab is a special case of a pivot table that computes group frequencies.

    data
    #  Sample Nationality Handedness
    # 0 1 USA Right-handed
    # 1 2 Japan  Left-handed
    pd.crosstab(data.Nationality, data.Handedness, margins=True)
    # Handedness Left-handed Right-handed All
    # Nationality
    # Japan      2 3 5
    # USA        1 4 5
    # All        3 7 10
    

5.4 transform

similar to apply but imposes more constraints:

  • It can produce a scalar value to be broadcast to the shape of the group
  • It can produce an object of the same shape as the input group
  • It must not mutate its input

5.4.1 examples

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 2, 'value': np.arange(6.)})
# =>
#   key  value
# 0   a    0.0
# 1   b    1.0
# 2   c    2.0
# 3   a    3.0
# 4   b    4.0
# 5   c    5.0
df.groupby("key").transform(lambda x: x.mean()) # or df.groupby("key").transform("mean")
# => same shape
#    value
# 0    1.5
# 1    2.5
# 2    3.5
# 3    1.5
# 4    2.5
# 5    3.5

5.4.2 unwrapped group operation

g = df.groupby('key')['value']
normalized = (df['value'] - g.transform('mean')) / g.transform('std')
# faster than
g.apply(lambda x: (x - x.mean()) / x.std())

5.5 Examples

5.5.1 Group Weighted Average

df = pd.DataFrame({'category': ['a', 'a', 'a', 'a',
				'b', 'b', 'b', 'b'],
		   'data': np.random.randn(8),
		   'weights': np.random.rand(8)})
df.groupby("category").apply(lambda g: np.average(g['data'], weights=g['weights']))

5.5.2 Group Correlation

prices.columns
# AAPL,MSFT,XOM,SPX

spx_corr = lambda x: x.corrwith(x['SPX'])
rets = prices.pct_change().dropna()

get_year = lambda x: x.year
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)
#    AAPL MSFT XOM SPX
# 2003 0.541124 0.745174 0.661265 1.0
# 2004 0.374283 0.588531 0.557742 1.0

5.5.3 Group-Wise Linear Regression

prices.columns
# AAPL,MSFT,XOM,SPX
rets = prices.pct_change().dropna()

get_year = lambda x: x.year
by_year = rets.groupby(get_year)

import statsmodels.api as sm
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

by_year.apply(regress, 'AAPL', ['SPX'])

5.5.4 Grouped Time Resampling

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df = pd.DataFrame({'time': times.repeat(3),
		   'key': np.tile(['a', 'b', 'c'], N),
		   'value': np.arange(N * 3.)})
time_key = pd.Grouper(freq='5min')
resampled = (df.set_index('time').groupby(['key', time_key]).sum())

6 Timeseries

6.1 Indexing

from datetime import time

ts[time(10, 0)]
# same as
ts.at_time(time(10, 0))

ts.between_time(time(10, 0), time(10, 30))
ts.asof(pd.date_range('2016-05-01 10:00', periods=4, freq='B'))

6.1.1 asof

By passing an array of timestamps to the asof method, you obtain an array of the last valid (non-NA) values at or before each timestamp.
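
For example:

ts = pd.Series(np.arange(4.),
               index=pd.date_range('2016-05-01 10:00', periods=4, freq='T'))
ts.asof(pd.Timestamp('2016-05-01 10:02:30'))  #=> 2.0, the last value at or before 10:02:30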

6.2 Special Frequencies

Table 2: Base Time Series Frequencies
Alias Offset Type
D Day
B BusinessDay
H Hour
T or min Minute
S Second
L or ms Milli
U Micro
M MonthEnd
BM BusinessMonthEnd
MS MonthBegin
BMS BusinessMonthBegin
W-MON, W-TUE, … Week
WOM-1MON, … WeekOfMonth
Q-JAN, Q-FEB, … QuarterEnd
BQ-JAN, BQ-FEB, … BusinessQuarterEnd
QS-JAN, … QuarterBegin
BQS-JAN, … BusinessQuarterBegin
A-JAN, … YearEnd
BA-JAN, … BusinessYearEnd
AS-JAN, … YearBegin
BAS-JAN, … BusinessYearBegin

6.2.1 examples

  • "1h30min"
  • "WOM-3FRI": the third Friday of each month:

6.3 Offset Objects

from datetime import datetime
from pandas.tseries.offsets import Day, MonthEnd

now = datetime(2011, 11, 17)
now + 3 * Day()  # -> Timestamp('2011-11-20 00:00:00')
now + MonthEnd()  # -> Timestamp('2011-11-30 00:00:00') "roll forward"
# same as
offset = MonthEnd()
offset.rollforward(now)

offset.rollback(now)

from pandas.tseries.frequencies import to_offset
offset = to_offset("WOM-3FRI")
  • rollforward/rollback is useful with groupby(like resample but slower): ts.groupby(offset.rollforward).mean()

6.4 Time Zone Localization

  • ts.tz_localize("UTC"): attaches a time zone to naive timestamps, much like datetime.replace(tzinfo=...); use tz_convert to convert between zones (sketch below)
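
A short sketch:

ts = pd.Series(np.arange(3.),
               index=pd.date_range('2012-03-09 09:30', periods=3, freq='D'))
ts_utc = ts.tz_localize('UTC')          # attach a time zone to naive timestamps
ts_utc.tz_convert('America/New_York')   # convert to another zone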

6.5 Periods

Periods represent timespans, like days, months, quarters, or years.

p = pd.Period(2007, freq="A-DEC")  # represents the full timespan from 2007-01-01 to 2007-12-31
p + 5  # Period('2012', 'A-DEC')

rng = pd.period_range("2000-01-01", "2000-06-30", freq="M")
# rng -> PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')

m = p.asfreq("M", how="start")  # converting from low to high frequency
m.asfreq("A-DEC")  # converting from high to low frequency depending on where the subperiod “belongs.”

6.5.1 Frequencies

  • A-DEC: annual frequency ending in December
  • Q-SEP: quarterly frequency for a fiscal year ending in September (Q4 ends in SEP)

6.5.2 Converting from/to Timestamps

  • ts.to_period
  • ts.to_timestamp(how="end")

6.5.3 From Different Columns

pd.PeriodIndex(year=data.year, quarter=data.quarter, freq='Q-DEC')

6.6 Resampling

6.6.1 Downsampling

Aggregates data to a lower frequency, so an aggregation function (sum, mean, ...) is needed.

  • Label: resample(label="right") labels each bin with its right edge; e.g. for "T" bins, the 9:01 label represents the interval 9:00~9:01 (see the sketch below)
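
A minimal sketch of the label option:

ts = pd.Series(np.arange(10),
               index=pd.date_range('2000-01-01 09:00', periods=10, freq='T'))
ts.resample('5min').sum()                  # bins labelled by their left edge: 09:00, 09:05
ts.resample('5min', label='right').sum()   # labelled by the right edge: 09:05, 09:10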

6.6.2 Upsampling

Converts to a higher frequency, so no aggregation is needed; a fill or interpolation method is usually wanted instead (see the sketch below).

  • without any filling, asfreq simply introduces missing (NaN) values
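
A sketch, upsampling a daily series to 12-hourly:

ts = pd.Series([1.0, 2.0], index=pd.date_range('2000-01-01', periods=2, freq='D'))
ts.resample('12H').asfreq()   # the new 12:00 timestamp becomes NaN
ts.resample('12H').ffill()    # or fill forward instead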

6.7 Moving Window

Can be useful for smoothing noisy or gappy data.

Function Description
rolling_count Returns number of non-NA observations in each trailing window.
rolling_sum Moving window sum.
rolling_mean Moving window mean.
rolling_median Moving window median.
rolling_var, rolling_std Moving window variance and standard deviation, respectively. Uses n - 1 denominator.
rolling_skew, rolling_kurt Moving window skewness (3rd moment) and kurtosis (4th moment), respectively.
rolling_min, rolling_max Moving window minimum and maximum.
rolling_quantile Moving window score at percentile/sample quantile.
rolling_corr, rolling_cov Moving window correlation and covariance.
rolling_apply Apply generic array function over a moving window.
ewma Exponentially-weighted moving average.
ewmvar, ewmstd Exponentially-weighted moving variance and standard deviation.
ewmcorr, ewmcov Exponentially-weighted moving correlation and covariance.

6.7.1 rolling

  • rolling by time offset: close_px.rolling('20D').mean()

6.7.2 expanding

The expanding mean starts the time window from the beginning of the time series and increases the size of the window until it encompasses the whole series.
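
For example:

s = pd.Series([1.0, 2.0, 3.0, 4.0])
s.expanding().mean()   #=> 1.0, 1.5, 2.0, 2.5: mean of everything seen so far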

6.7.3 ewm

The idea is to specify a constant decay factor to give more weight to more recent observations. There are a couple of ways to specify the decay factor. A popular one is using a span, which makes the result comparable to a simple moving window function with window size equal to the span.

  • with decay factor span: aapl_px.ewm(span=30).mean()

6.8 Binary Moving Window

Some statistical operators need to operate on two time series, like correlation and covariance

aapl_rets.rolling(125, min_periods=100).corr(spx_rets)
rets_df.rolling(125, min_periods=100).corr(spx_rets)

6.9 User-Defined Moving Window Functions

The only requirement is that the function produce a single value (a reduction) from each piece of the array.

from scipy.stats import percentileofscore
score_at_2percent = lambda x: percentileofscore(x, 0.02)
result = returns.AAPL.rolling(250).apply(score_at_2percent)

6.10 Tips

  • normalized date range: pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)
  • shift by freq: ts.shift(2, freq="M")
  • shortcut for OHLC resampling: ts.resample("5min").ohlc()
  • rolling with min_periods: aapl_px.rolling(125, min_periods=100)

7 Categorical Data

7.1 Motivation

values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])
# dim
# 0 apple
# 1 orange
# dtype: object
dim.take(values)
# 0 apple
# 1 orange
# 0 apple
# 0 apple
# ...
  • The representation of values as integers is called the categorical or dictionary-encoded representation.
  • The array dim of distinct values can be called the categories, dictionary, or levels of the data.
  • The integer values that reference the categories are called the category codes or simply codes.

7.2 Categorical Type

  • pd.Categorical(data: Iterable[str])
  • converting string series to categorical: s.astype("category")
  • converting integer series to categorical: pd.Categorical.from_codes(codes: pd.Series[int], categories: Iterable[str])
  • numeric data to categorical: bins = pd.qcut(data, 4, labels=["Q1", "Q2", "Q3", "Q4"]); often followed by groupby(bins)

7.2.1 Better Performance

  • memory_usage() is much smaller for a categorical Series than for the equivalent object Series
  • GroupBy operations can be significantly faster with categoricals
  • In large datasets, categoricals are often used as a convenient tool for memory savings and better performance

7.3 Categorical Methods

Series.cat is similar to Series.str

  • change the set of categories (e.g. to include unobserved ones): cat_s.cat.set_categories(actual_categories)
  • methods: add_categories, as_ordered, as_unordered, remove_categories, remove_unused_categories, rename_categories, reorder_categories, set_categories
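
A short sketch of the cat accessor:

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
cat_s.cat.codes        #=> 0, 1, 2, 3, 0, 1, 2, 3
cat_s.cat.categories   #=> Index(['a', 'b', 'c', 'd'], dtype='object')
cat_s.cat.set_categories(['a', 'b', 'c', 'd', 'e'])   # extend the known categories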

7.3.1 Creating Dummy Variables for Modeling

When you're using statistics or machine learning tools, you'll often transform categorical data into dummy variables, also known as one-hot encoding.

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
pd.get_dummies(cat_s)
#    a  b  c  d
# 0  1  0  0  0
# 1  0  1  0  0
# 2  0  0  1  0
# 3  0  0  0  1
# 4  1  0  0  0
# 5  0  1  0  0
# 6  0  0  1  0
# 7  0  0  0  1

8 Modeling Libraries

  • patsy for describing statistical models
  • statsmodels
  • scikit-learn (and other scikits: scikit-image, scikits-cuda)
  • TensorFlow, PyTorch

8.0.1 Books

  • Introduction to Machine Learning with Python by Andreas Mueller and Sarah Guido (O’Reilly)
  • Python Data Science Handbook by Jake VanderPlas (O’Reilly)
  • Python Machine Learning by Sebastian Raschka (Packt Publishing)
  • Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (O’Reilly)

9 Other Tools

9.1 ipyparallel

# shell: ipcluster start --n=4
import os
from ipyparallel import Client
rc = Client()
dview = rc[:]
# => <DirectView [0, 1, 2, 3]>
dview.apply_sync(os.getpid)

dview.scatter('a', range(10)) # distribute data (map)
dview.execute('print(a)').display_outputs()
dview.execute('b=sum(a)')
dview.gather('b').r # reduce

9.2 Scipy

9.3 Sympy

Main computer algebra module in Python

import sympy as S
from sympy import init_printing
init_printing() # for latex printing

x = S.symbols('x')

p=sum(x**i for i in range(3)) # 2nd order polynomial
# => x**2 + x + 1

S.solve(p) # solves p == 0
# => [-1/2 - sqrt(3)*I/2, -1/2 + sqrt(3)*I/2]

S.roots(p)
# => {-1/2 - sqrt(3)*I/2: 1, -1/2 + sqrt(3)*I/2: 1}

9.3.1 lambdify

y = S.tan(x) * x + x**2
yf = S.lambdify(x, y, 'numpy')
#=> <function numpy.<lambda>>
yf(np.arange(3))

9.3.2 Alternative SAGE

  • open source mathematics software package
  • not a pure-Python module.

9.4 seaborn

Beautifully formatted plots

9.5 plotly

9.6 Blaze

  • Big data(too big to fit in RAM)
  • tight integration with Pandas dataframes

https://github.com/blaze/blaze

9.7 bcolz

columnar, compressed data containers (either in-memory or on-disk)

  • based on numpy
  • support for PyTables and pandas dataframes
  • takes advantage of multi-core architecture

https://github.com/Blosc/bcolz

9.8 Enthought Tool-Suite (ETS)

ETS is a collection of components developed by Enthought and its partners for constructing custom scientific applications.

9.8.1 Chaco

2-Dimensional Plotting, interactive visualization.

9.8.2 Mayavi

Another Enthought-supported 3D visualization package that sits on VTK.

9.9 bottleneck

provides an alternate implementation of NaN-friendly moving window functions.

9.10 SWIG

Wrapper for c, c++ libraries.

  • generates Python extension modules (e.g. .pyd files on Windows)

9.11 zipline

10 hdf

10.1 dataset

import h5py

f = h5py.File("testfile.hdf5", "a")  # open for read/write, creating the file if necessary
dset = f.create_dataset("big dataset", (1024**3, ), dtype=np.float32)
dset[0:1024] = np.arange(1024)
f.flush() # dump cache on disk

with dset.astype('float64'):
    out = dset[0:1024]
out.dtype #=> dtype('float64')

10.1.1 Ellipsis

dset = f.create_dataset('4d', shape=(100, 80, 50, 20))
dset[0,...,0].shape
# => (80, 50)
dset[...].shape
# => (100, 80, 50, 20)

10.1.2 chunks

dset = f.create_dataset('chunked', (100,480,640), dtype='i1', chunks=(1, 64, 64))

10.1.3 float data saving space

  • it’s common practice to store these data points on disk as single-precision

To read float32 back in as double precision:
  • one way
big_out = np.empty((100, 1000), dtype=np.float64)
dset.read_direct(big_out)
  • another way
with dset.astype('float64'):
    out = dset[0,:]

10.2 lookup tools

10.2.1 HDFView

10.2.2 h5ls

h5ls -vlr

10.3 notice

  • slicing an HDF5 dataset reads from / writes to disk directly
  • use np.s_ to get a slice object

    dset.read_direct(out, source_sel=np.s_[0,:], dest_sel=np.s_[50,:])