Data Analysis for Python

1 numpy

1.1 ndarray

import numpy as np
py_array = [1, 2, 3, 4, 5]
np_array = np.array(py_array)
np_array.shape #=> (5,)
np_array.dtype #=> dtype('int64')

py_array = [[1, 2, 3, 4], [5, 6, 7, 8]]
np_array = np.array(py_array)
np_array.shape #=> (2, 4)
np_array.ndim #=> 2

np.zeros(10)
#=> array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
np.ones((2, 3))
#=> array([[ 1.,  1.,  1.], [ 1.,  1.,  1.]])
np.empty((2, 3, 2))
#=>
# array([[[ 0.,  0.],
#         [ 0.,  0.],
#         [ 0.,  0.]],

#        [[ 0.,  0.],
#         [ 0.,  0.],
#         [ 0.,  0.]]])

array = np.arange(5)
#=> array([0, 1, 2, 3, 4])
masked_array = np.ma.masked_array(array, array<3)
#=> [-- -- -- 3 4]
np.random.randn(2, 3)

np_array = np.array([1, 2, 3], dtype=np.int32)
np_array = np_array.astype(np.float64)
np_array.dtype #=> dtype('float64')

x, y = np.meshgrid(np.arange(2), np.arange(2))
#x=>
# array([[0, 1],
#        [0, 1]])
#y=>
# array([[0, 0],
#        [1, 1]])

1.1.1 Strides

Strides indicate the number of bytes to "step" in memory in order to advance one element along each dimension. A \(3 \times 4 \times 5\) array of float64 (8 bytes per element) has strides (160, 40, 8).
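
A quick check (a minimal sketch; the shape matches the numbers above):

arr = np.zeros((3, 4, 5), dtype=np.float64)
arr.strides #=> (160, 40, 8): 4*5*8 bytes to the next block, 5*8 to the next row, 8 to the next element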

1.1.2 Supported dtype

  • int8, uint8, int16, uint16, int32, uint32, int64, uint64
  • float16, float32, float64, float128
  • complex64, complex128, complex256
  • bool
  • object
  • string_
  • unicode_

1.1.3 Creating

Function Description
array Convert input data (list, tuple, array, or other sequence type) to an ndarray
asarray Convert input to ndarray, but do not copy if the input is already an ndarray
arange Like the built-in range but returns an ndarray instead of a list.
ones, ones_like Produce an array of all 1’s with the given shape and dtype.
zeros, zeros_like Produce an array of all 0’s with the given shape and dtype.
empty, empty_like Create new arrays by allocating new memory
eye, identity Create a square N x N identity matrix (1’s on the diagonal and 0’s elsewhere)
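
A few of these in action (a small illustration):

a = np.array([1, 2, 3])
np.asarray(a) is a #=> True, no copy for an existing ndarray
np.array(a) is a   #=> False, array copies by default
np.zeros_like(a)   #=> array([0, 0, 0])
np.eye(3)          # 3 x 3 identity matrix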

1.1.4 Indexing

basic indexing
arr = np.array(range(12)).reshape(3,4)
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11]])

arr[1:, :3]
# array([[ 4,  5,  6],
#        [ 8,  9, 10]])
boolean indexing
flags = np.array([1, 2, 3, 2, 2, 1])
data = np.random.randn(6,2)
# array([[ 2.11684529,  1.24861544],
#        [-0.34817586, -0.59905366],
#        [-0.84976431,  0.11840417],
#        [ 1.36648373,  1.33416664],
#        [ 0.37616856, -0.0032112 ],
#        [-0.7749904 , -0.60457688]])
flags == 2
# array([False,  True, False,  True,  True, False], dtype=bool)
data[flags == 2]
# array([[-0.34817586, -0.59905366],
#        [ 1.36648373,  1.33416664],
#        [ 0.37616856, -0.0032112 ]])
data[(flags == 3) | (flags == 1), 1:]
# array([[ 1.24861544],
#        [ 0.11840417],
#        [-0.60457688]])
data[data<0] = 0
# array([[ 2.11684529,  1.24861544],
#        [ 0.        ,  0.        ],
#        [ 0.        ,  0.11840417],
#        [ 1.36648373,  1.33416664],
#        [ 0.37616856,  0.        ],
#        [ 0.        ,  0.        ]])
fancy indexing

Fancy indexing is the term adopted by NumPy for indexing with integer arrays.
Unlike slicing, fancy indexing always copies the data into a new array.

arr
# array([[ 0.,  0.,  0.,  0.],
#        [ 1.,  1.,  1.,  1.],
#        [ 2.,  2.,  2.,  2.]])
arr[[2, 1]]
# array([[ 2.,  2.,  2.,  2.],
#        [ 1.,  1.,  1.,  1.]])
arr = np.array(range(12)).reshape(3,4)
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11]])
arr[[1,2], [2,3]]
# array([ 6, 11]) # choose location (1, 2) (2, 3)

arr[[1,2]][:, [2,3]]
# array([[ 6,  7],
#        [10, 11]])
arr[np.ix_([1,2], [2,3])] # same effect
# array([[ 6,  7],
#        [10, 11]])

1.1.5 Reshape 1D to 2D

arr = np.array([1,2,3])
arr.reshape((-1, 1))
array([[1],
       [2],
       [3]])

1.2 matrix

np.matrix is always two-dimensional, and its * operator performs row-column matrix multiplication (unlike ndarray, where * is element-wise)
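
A small sketch of the difference from a plain ndarray (np.matrix is discouraged nowadays in favor of 2-D ndarrays with the @ operator):

m = np.matrix([[1, 2], [3, 4]])
m * m   # row-column matrix product
#=> matrix([[ 7, 10],
#           [15, 22]])

a = np.array([[1, 2], [3, 4]])
a * a   # ndarray: element-wise product
#=> array([[ 1,  4],
#          [ 9, 16]])
a @ a   # matrix product for ndarrays
#=> array([[ 7, 10],
#          [15, 22]])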

1.3 Conditional Logic

1.3.1 any, all

1.3.2 numpy.where

The numpy.where function is a vectorized version of the ternary expression x if condition else y, and is typically much faster than a list comprehension.

xarr = np.array([1,1,1,1,1])
yarr = np.array([2,2,2,2,2])
cond = np.array([True, False, True, False, False])
np.where(cond, xarr, yarr)
# array([1, 2, 1, 2, 2])
np.where(cond, 4, 3)
# array([4, 3, 4, 3, 3])

1.4 Transpose

Simple transposing with .T is just a special case of swapaxes
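
For example:

arr = np.arange(6).reshape(2, 3)
arr.T   # same as arr.swapaxes(0, 1) or arr.transpose()
#=> array([[0, 3],
#          [1, 4],
#          [2, 5]])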

1.5 Useful functions

1.5.1 Math

Method Description
sign Returns an array of 1, 0, or -1 depending on the sign of each value
sum Sum of all the elements in the array or along an axis. Zero-length arrays have sum 0
mean Arithmetic mean. Zero-length arrays have NaN mean
std, var Standard deviation and variance, respectively, with optional degrees of freedom adjustment
min, max Minimum and maximum
argmin, argmax Indices of minimum and maximum elements, respectively
cumsum Cumulative sum of elements starting from 0
cumprod Cumulative product of elements starting from 1
abs, fabs Compute the absolute value; fabs is faster for non-complex-valued data
modf Return fractional and integral parts of the array as separate arrays
rint Round elements to the nearest integer, preserving the dtype
average Compute the weighted average along the specified axis.
exp Calculate the exponential of all elements in the input array.

1.5.2 Linear Algebra

Function Description
diag Return the diagonal (or off-diagonal) elements of a square matrix
dot Matrix multiplication
trace Compute the sum of the diagonal elements
det Compute the matrix determinant
eig Compute the eigenvalues and eigenvectors of a square matrix
inv Compute the inverse of a square matrix
pinv Compute the Moore-Penrose pseudo-inverse of a matrix
qr Compute the QR decomposition
svd Compute the singular value decomposition (SVD)
solve Solve the linear system Ax = b for x, where A is a square matrix
lstsq Compute the least-squares solution to y = Xb

1.5.3 Random Number Generation

Function Description
seed Seed the random number generator
permutation Return a random permutation of a sequence, or return a permuted range
shuffle Randomly permute a sequence in place
rand Draw samples from a uniform distribution
randint Draw random integers from a given low-to-high range
randn Draw samples from a normal distribution with mean 0 and standard deviation 1 (MATLAB-like interface)
binomial Draw samples from a binomial distribution
normal Draw samples from a normal (Gaussian) distribution
beta Draw samples from a beta distribution
chisquare Draw samples from a chi-square distribution
gamma Draw samples from a gamma distribution
uniform Draw samples from a uniform [0, 1) distribution

1.5.4 Set operations

Function Description
unique(x) Compute the sorted, unique elements in x
intersect1d(x, y) Compute the sorted, common elements in x and y
union1d(x, y) Compute the sorted union of elements
in1d(x, y) Compute a boolean array indicating whether each element of x is contained in y
setdiff1d(x, y) Set difference, elements in x that are not in y
setxor1d(x, y) Set symmetric differences; elements that are in either of the arrays, but not both

2 pandas

2.1 Series

A Series behaves like a fixed-length, ordered dict mapping index values to data values.
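
A minimal sketch of the dict-like behaviour:

import pandas as pd

s = pd.Series([4, 7, -5], index=['a', 'b', 'c'])
s['b']       #=> 7, lookup by index label
'b' in s     #=> True
pd.Series({'a': 4, 'b': 7, 'c': -5})   # build directly from a dict
s.to_dict()  #=> {'a': 4, 'b': 7, 'c': -5}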

2.2 DataFrame

df = pd.DataFrame(np.arange(8).reshape(4,2),
		  columns=['c1', 'c2'], index=['r1', 'r2', 'r3', 'r4'])

df.loc['r1'] # retrieve row
# c1    0
# c2    1
# Name: r1, dtype: int64

df.loc[['r1', 'r2']]
#     c1  c2
# r1   0   1
# r2   2   3

df.T
#     r1  r2  r3  r4
# c1   0   2   4   6
# c2   1   3   5   7

del df['c2']

df.columns
# Index([u'c1'], dtype='object')

2.3 Index

Index objects are immutable and function like a fixed-size set.

  • main type: Index, Int64Index, MultiIndex, DatetimeIndex, PeriodIndex
Method Description
append Concatenate with additional Index objects, producing a new Index
diff Compute set difference as an Index
intersection Compute set intersection
union Compute set union
isin Compute boolean array indicating whether each value is contained in the passed collection
delete Compute new Index with element at index i deleted
drop Compute new index by deleting passed values
insert Compute new Index by inserting element at index i
is_monotonic Returns True if each element is greater than or equal to the previous element
is_unique Returns True if the Index has no duplicate values
unique Compute the array of unique values in the Index

2.3.1 rename with functions

  • data.rename(index=str.title, columns=str.upper)

2.4 Functionality

2.4.1 Reindexing

Example:

frame.reindex(columns=['c1', 'c2'])
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=['c1', 'c2'])
frame.reindex(frame2.index, method='ffill')
# is similar to frame.loc[['a', 'b', 'c', 'd'], ['c1', 'c2']]

reindex args: index, method, fill_value, limit, level, copy

2.4.2 Drop

frame.drop(['r1', 'r2'])
frame.drop(['c1', 'c2'], axis=1)

2.4.3 Selection

Indexing options:

Type Notes
df[val] Select column
df.loc[val] Select row
df.loc[:, val] Select column
df.loc[val1, val2] Select both column and row
df.iloc[where] Select row by int position
df.iloc[:, where] Select column by int position
df.iloc[where_i, where_j] Select both column and row by int position
df.at[label_i, label_j] Select a single scalar value by row and column label
df.iat[i, j] Select a single scalar value by int position
get_value, set_value Select single value by row and column label
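
A brief illustration, re-creating the small frame from 2.2 (before c2 was deleted):

df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  columns=['c1', 'c2'], index=['r1', 'r2', 'r3', 'r4'])
df['c1']            # column by name
df.loc['r2']        # row by label
df.loc['r2', 'c2']  #=> 3
df.iloc[1, 1]       #=> 3, same cell by integer position
df.at['r2', 'c2']   #=> 3, fast scalar access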

2.4.4 Arithmetic

  • Basic: df1 + df2
  • use the add method to fill NA values: df1.add(df2, fill_value=0)
  • Operations between DataFrame and Series
df = pd.DataFrame(np.arange(6).reshape(3,2),
		  columns=['c1', 'c2'], index=['r1', 'r2', 'r3'])

#     c1  c2
# r1   0   1
# r2   2   3
# r3   4   5

s = pd.Series([4,5], index=['c1', 'c2'])

# c1    4
# c2    5
# dtype: int64

df + s
#     c1  c2
# r1   4   6
# r2   6   8
# r3   8  10


s2 = pd.Series([1,2,3], index=['r1', 'r2', 'r3'])
# r1    1
# r2    2
# r3    3
# dtype: int64

df.add(s2, axis=0, fill_value=0)
#     c1  c2
# r1   1   2
# r2   4   5
# r3   7   8

df['sum_c'] = df.eval('c1+c2')
#     c1  c2  sum_c
# r1   0   1      1
# r2   2   3      5
# r3   4   5      9

2.4.5 Broadcasting

frame = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'), index=list('1234'))
series = frame.iloc[0]
frame - series
#=>
#    b  d  e
# 1  0  0  0
# 2  3  3  3
# 3  6  6  6
# 4  9  9  9

series2 = pd.Series(range(3), index=list('bef'))
frame + series2
#=>
#      b   d     e   f
# 1  0.0 NaN   3.0 NaN
# 2  3.0 NaN   6.0 NaN
# 3  6.0 NaN   9.0 NaN
# 4  9.0 NaN  12.0 NaN

2.4.6 Applying

f = lambda x: x.max()
frame.apply(f)
frame.apply(f, axis=1)

apply can also return a Series

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
#=>      b    d    e
# min  xxx  xxx  xxx
# max  xxx  xxx  xxx
  • map method for applying an element-wise function on a Series
  • applymap for applying an element-wise function on a DataFrame

2.4.7 Sorting

df.sort_index()

# by column(s)
df.sort_values(by='c1')
df.sort_values(by=['c1', 'c2'])

series.sort_values()

2.4.8 Ranking

By default, rank breaks ties by assigning each group the mean rank

  • method arg: 'average' (default), 'min', 'max', 'first'
obj = pd.Series([7, -5, 7, 4, 2, 0, 4, 7])
obj.rank()
#=>
# 0    7.0
# 1    1.0
# 2    7.0
# 3    4.5
# 4    3.0
# 5    2.0
# 6    4.5
# 7    7.0
obj.rank(method='first', ascending=False)
#=>
# 0    1.0
# 1    8.0
# 2    2.0
# 3    4.0
# 4    6.0
# 5    7.0
# 6    5.0
# 7    3.0

2.4.9 Binning

  • cut
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]  # example data
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats # Categories object [(18,25], (25, 35], ...]

useful options: labels, precision

  • qcut
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]) # pass quantiles

2.4.10 Other functions

  • numpy ufuncs work fine with pandas objects
  • isnull, notnull, dropna, fillna
  • stack, unstack, swaplevel, sortlevel
  • set_index, reset_index
  • unique(series based), value_counts(series based), isin(element-wise)
  • all, any
  • replace: data.replace(-999, np.nan)
  • cut, qcut

2.5 Statistical methods

Basic: count, describe, min, max, quantile, sum, pct_change, diff, corr, cov, corrwith

2.5.1 mean, median, mad, var, std

2.5.2 argmin, argmax, idxmin, idxmax

argmin/argmax: compute the integer index locations at which the minimum or maximum value is obtained; idxmin/idxmax return the corresponding index labels

2.5.3 cumsum, cummin, cummax, cumprod

2.5.4 skew

Sample skewness (3rd moment) of values

2.5.5 kurt

Sample kurtosis (4th moment) of values

2.5.6 diff

Compute 1st arithmetic difference (useful for time series)

2.5.7 corr, cov, corrwith

import pandas_datareader as pdr

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = pdr.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')

price = pd.DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
returns = price.pct_change()

returns.MSFT.corr(returns.IBM)
returns.MSFT.cov(returns.IBM)

returns.corr()

factors_df.corrwith(prices)
  • DataFrame.corrwith: Compute pairwise correlation.

2.5.8 common args

Method Description
axis Axis to reduce over. 0 for DataFrame’s rows and 1 for columns
skipna Exclude missing values, True by default
level Reduce grouped by level if the axis is hierarchically-indexed (MultiIndex)

skipna option:

  • True(default): NA values are excluded unless the entire slice (row or column in this case) is NA
  • False: if any value is NA, then return NA

2.6 Hierarchical Indexing

2.6.1 Indexing

data[index_level1]
data[index_level1 : index_level1]
data[[index_level1, index_level1]]

select by inner level:
data[:, index_level2]
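
A concrete sketch with a two-level index:

data = pd.Series(np.arange(6.),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'], [1, 2, 1, 2, 1, 2]])
data['b']       # outer level
# 1    2.0
# 2    3.0
data['b':'c']   # slice on the outer level
data[:, 2]      # select by the inner level
# a    1.0
# b    3.0
# c    5.0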

2.6.2 stack, unstack

2.6.3 swaplevel, sortlevel

2.7 Panel

Panel can be thought of as a three-dimensional analogue of DataFrame, although hierarchical indexing makes truly N-dimensional arrays unnecessary in many cases.

pdata = pd.Panel({stk: pdr.get_data_yahoo(stk, '1/1/2009', '6/1/2012')
		  for stk in ['AAPL', 'GOOG']})

2.7.1 Useful functions

  • ix

    pdata.ix[:, '6/1/2012', :]
    
  • swapaxes

    pdata.swapaxes('items', 'minor')['Adj Close']
    
  • to_frame

    index will be the [major, minor] axis, items will be the columns

    stacked = pdata.to_frame()
    
  • to_panel

    stacked.to_panel()
    

2.8 config options

pd.options categories:

  • compute
  • display
  • io
  • mode
  • plotting
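
Options can be read and set through attribute access or get_option/set_option, for example:

pd.options.display.max_rows        #=> 60 (the default)
pd.options.display.max_rows = 20   # same as pd.set_option('display.max_rows', 20)
pd.reset_option('display.max_rows')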

2.9 Functional Method Chaining

# Usual non-functional way
df2 = df.copy()
df2["k"] = v

# Functional assign way
df2 = df.assign(k=v)

result = df2.assign(col1_demeaned=df2.col1 - df2.col2.mean()).groupby("key").col1_demeaned.std()
  • Assigning in-place may execute faster than using assign, but assign enables easier method chaining

2.9.1 Fancy Method Chaining

df = load_data()
df2 = df[df['col2'] < 0]
# This can be rewritten as:
df = (load_data()
      [lambda x: x['col2'] < 0])

# write the entire sequence as a single chained expression
result = (load_data()
	  [lambda x: x.col2 < 0]
	  .assign(col1_demeaned=lambda x: x.col1 - x.col1.mean())
	  .groupby('key')
	  .col1_demeaned.std())

2.9.2 pipe method

The statements f(df) and df.pipe(f) are equivalent.

A useful pattern for pipe:
def group_demean(df, by, cols):
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
	result[c] = df[c] - g[c].transform("mean")
    return result


result = df[df.col1 < 0].pipe(group_demean, ["key1", "key2"], ["col1"])

3 Wrangling

3.1 Evaluate Data

3.1.1 Quality

Low-quality data is called dirty data.

  • missing values
  • invalid values
  • inaccurate values
  • inconsistent values, e.g. different units (cm vs. inch)
Evaluation methods

head, tail, info, value_counts, plot

3.2 Dealing with Missing Data

3.2.1 Imputation

Why impute
  • Not much data
  • Removing data could affect representativeness
Methods (a fillna-based sketch follows the list)
Mean Imputation
Drawbacks: lessens correlations between variables
Linear Regression
Drawbacks: overemphasizes trends; existing values suggest too much certainty.
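
A minimal mean-imputation sketch, assuming a DataFrame df with a numeric column 'col' that contains NaN:

df['col'] = df['col'].fillna(df['col'].mean())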

3.3 Data Input/Output

3.3.1 Reading option categories

Indexing

Can treat one or more columns as the index of the returned DataFrame, and choose whether to get column names from the file, from the user, or not at all.

Type inference and data conversion

This includes user-defined value conversions and a custom list of missing value markers.

Datetime parsing

Includes the capability to combine date and time information spread over multiple columns into a single column in the result.

Iterating

Support for iterating over chunks of very large files.

Unclean data issues

Skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

3.3.2 Hints

  • from_csv: a convenience method simpler than read_csv
  • pickle related: load, save

3.3.3 Useful read_csv parameters

  • nrows
  • chunksize
chunker = pd.read_csv("data.csv", chunksize=1000)
for piece_df in chunker:
    # do something with piece_df
    ...

3.4 Concatenation

  • pd.merge, merge method
  • join method: performs a left join on the join keys
  • pd.concat
  • combine_first: patching missing data from another df
  • align: align two object on their axes with the specified join method for each axis Index

3.4.1 concat args

Argument Description
objs List or dict of pandas objects to be concatenated. The only required argument
axis Axis to concatenate along; defaults to 0
join One of 'inner', 'outer' , defaulting to 'outer'
join_axes Specific indexes to use for the other n-1 axes instead of performing union/intersection logic
keys Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis
levels Specific indexes to use as hierarchical index level or levels if keys passed
names Names for created hierarchical levels if keys and / or levels passed
verify_integrity Check the new axis in the concatenated object for duplicates and raise an exception if so. By default (False), duplicates are allowed
ignore_index Do not preserve indexes along concatenation axis , instead producing a new range(total_length) index
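
For example:

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3], index=['c', 'd'])
pd.concat([s1, s2])                        # stack along axis 0
pd.concat([s1, s2], keys=['one', 'two'])   # hierarchical index on the concat axis
pd.concat([s1, s2], axis=1)                # align as columns, producing a DataFrame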

3.5 Reshaping and Pivoting

3.5.1 stack

Pivots from the columns in the data to the rows. Stacking filters out missing data by default.

data = DataFrame(np.arange(6).reshape((2, 3)), columns=['a', 'b', 'c'])
data
#=>
#    a  b  c
# 0  0  1  2
# 1  3  4  5
data.stack()
#=>
# 0  a    0
#    b    1
#    c    2
# 1  a    3
#    b    4
#    c    5
# dtype: int64

3.5.2 unstack

Pivots from the rows into the columns

data.stack().unstack()
#=>
#    a  b  c
# 0  0  1  2
# 1  3  4  5

# can specific level number or name
data.stack().unstack(0)
#=>
#    0  1
# a  0  3
# b  1  4
# c  2  5

3.5.3 pivot & melt

pivot is a shortcut for creating a hierarchical index using set_index and reshaping with unstack

quotes.head()
#=>
#                   Open        High         Low       Close    Volume  \
# Date
# 2010-01-04  626.951088  629.511067  624.241073  626.751061   3927000
# 2010-01-05  627.181073  627.841071  621.541045  623.991055   6031900
# 2010-01-06  625.861078  625.861078  606.361042  608.261023   7987100
# 2010-01-07  609.401025  610.001045  592.651008  594.101005  12876600
# 2010-01-08  592.000997  603.251034  589.110988  602.021036   9483900

#              Adj Close symbol
# Date
# 2010-01-04  313.062468   GOOG
# 2010-01-05  311.683844   GOOG
# 2010-01-06  303.826685   GOOG
# 2010-01-07  296.753749   GOOG
# 2010-01-08  300.709808   GOOG
quotes.pivot(columns='symbol', values='Close')
#=>
# symbol            AAPL        GOOG         IBM       MSFT
# Date
# 2010-01-04  214.009998  626.751061  132.449997  30.950001
# 2010-01-05  214.379993  623.991055  130.850006  30.959999
# 2010-01-06  210.969995  608.261023  130.000000  30.770000
# 2010-01-07  210.580000  594.101005  129.550003  30.450001
# 2010-01-08  211.980005  602.021036  130.850006  30.660000
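
melt is the inverse operation, pivoting columns into rows. A small self-contained sketch:

df = pd.DataFrame({'key': ['foo', 'bar'], 'A': [1, 2], 'B': [3, 4]})
melted = pd.melt(df, id_vars=['key'])
#    key variable  value
# 0  foo        A      1
# 1  bar        A      2
# 2  foo        B      3
# 3  bar        B      4
melted.pivot(index='key', columns='variable', values='value')  # back to wide form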

3.6 Random Sampling

df.take, df.sample

df = DataFrame(np.arange(5 * 4).reshape(5, 4))
df
#=>
#     0   1   2   3
# 0   0   1   2   3
# 1   4   5   6   7
# 2   8   9  10  11
# 3  12  13  14  15
# 4  16  17  18  19

# simple sampling
df.sample(n=3)

sampler = np.random.permutation(5)
sampler
#=> array([1, 3, 4, 0, 2])
df.take(sampler)
#=>
#     0   1   2   3
# 1   4   5   6   7
# 3  12  13  14  15
# 4  16  17  18  19
# 0   0   1   2   3
# 2   8   9  10  11

3.7 String Manipulation

series.str.XXX  # vectorized string methods are accessed through the .str attribute of a Series or Index

3.7.1 Vectorized string methods

cat, contains, count, endswith/startswith, findall, get, join, len, lower, upper, match, pad, center, repeat, replace, slice, split, strip/rstrip/lstrip

3.8 Check Duplicates

pandas.Index.is_unique, pandas.Series.is_unique
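
A short illustration:

s = pd.Series(['a', 'b', 'a'])
s.is_unique          #=> False
s.index.is_unique    #=> True
s.duplicated()       # boolean Series, True for repeated occurrences
s.drop_duplicates()  # keep only the first occurrence of each value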

3.9 Indicator/Dummy

To convert a categorical variable into a "dummy" matrix, use pd.get_dummies

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b']})
df
#   key
# 0   b
# 1   b
# 2   a
# 3   c
# 4   a
# 5   b

pd.get_dummies(df['key'])
#    a  b  c
# 0  0  1  0
# 1  0  1  0
# 2  1  0  0
# 3  0  0  1
# 4  1  0  0
# 5  0  1  0

4 Plotting

4.1 Matplotlib Basic

4.1.1 Figures and Subplots

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
# get a reference to the active figure
plt.gcf()

# all in one subplots
fig, axes = plt.subplots(2,3, figsize=(14, 8))

4.1.2 subplots options

Argument Description
figsize Size of figure
nrows Number of rows of subplots
ncols Number of columns of subplots
sharex All subplots use the same X-axis ticks
sharey All subplots use the same Y-axis ticks
subplot_kw Dict of keywords passed to the add_subplot call used to create each subplot
**fig_kw Additional keywords used when creating the figure, e.g. figsize

4.1.3 Adjusting Size

subplots_adjust

  • args: left, right, bottom, top, wspace, hspace

4.1.4 global configuration

plt.rc

plt.rc('figure', figsize=(10, 10))
font_options = {'family': 'monospace',
		'weight': 'bold',
		'size': 'small'}
plt.rc('font', **font_options)

4.1.5 Choosing Plotting Range

plt.xlim, plt.ylim, ax.set_xlim, ax.set_ylim

4.1.6 title, label, tick, ticklabel, legend

set_title, set_xlabel, set_xticks, set_xticklabels, set

ticks = ax.set_xticks([0, 250, 500, 750, 1000])
labels = ax.set_xticklabels(["one", "two", "three", "four", "five"])

# ax.set
props = {
    "title": "My first matplotlib plot",
    "xlabel": "Stages"
}
ax.set(**props)

# legend
ax.legend(loc='best') #show the label

4.1.7 Annotations

ax.text, ax.arrow, ax.annotate

ax.annotate(label, xy=(date, spx.asof(date) + 50),
	    xytext=(date, spx.asof(date) + 200),
	    arrowprops=dict(facecolor='black'),
	    horizontalalignment='left', verticalalignment='top')
  • util function: asof

4.2 pandas Plotting

4.2.1 Line Plots

Series
Argument Description
label Label for plot legend
ax matplotlib subplot object to plot on. If nothing passed, uses active matplotlib subplot
style Style string, like 'ko--', to be passed to matplotlib
alpha The plot fill opacity (from 0 to 1)
DataFrame
Argument Description
kind Can be 'line', 'bar', 'barh', 'kde'
logy Use logarithmic scaling on the Y axis
use_index Use the object index for tick labels
rot Rotation of tick labels (0 through 360)
xticks Values to use for X axis ticks
yticks Values to use for Y axis ticks
xlim X axis limits (e.g. [0, 10] )
ylim Y axis limits
grid Display axis grid (on by default)
subplots Plot each DataFrame column in a separate subplot
sharex, sharey If subplots=True, share the same X/Y axis
figsize Size of figure to create as tuple
title Plot title as string
legend Add a subplot legend ( True by default)
sort_columns Plot columns in alphabetical order
steps

plt.plot(data, drawstyle="steps-post")

4.2.2 Bar Plots

  • kind='bar': for vertical bars, 'barh' for horizontal bars
  • stacked=True: stacked bar plots
  • useful recipe: s.value_counts().plot(kind='bar')

4.2.3 Histogram & Density Plots

  • A kind of bar plot that gives a discretized display of value frequency
comp1 = np.random.normal(0, 1, size=200) # N(0, 1)
comp2 = np.random.normal(10, 2, size=200) # N(10, 4)
values = Series(np.concatenate([comp1, comp2]))
fig = plt.figure(figsize=(10, 5))
values.hist(bins=100, alpha=0.3, color='k', density=True)
values.plot(kind='kde', style='k--')

hist_kde_plot.png

4.2.4 Scatter Plots

pairs plot or scatter plot matrix: pd.plotting.scatter_matrix

4.3 Saving Plots to File

Figure.savefig args

  • fname, dpi, facecolor, edgecolor, format
  • bbox_inches: The portion of the figure to save

4.4 interactive mode

4.5 add-ons

  • mplot3d
  • basemap/cartopy: projections and mapping (plotting 2D data on maps)
  • seaborn/holoviews/ggplot: higher-level plotting interfaces
  • axes_grid: axes and axis helpers

5 Group

5.1 group by

# series groupby
df['data1'].groupby(df['key1'])
df.groupby(df['key1'])['data1'] # syntactic sugar
dict(list(df.groupby('key1')))

# df groupby
df.groupby(['key1', 'key2']) # options: as_index, axis

# get group size
df.groupby('key1').size()

# iterations
for name, group in df.groupby('key1'):
    print(name, group)

5.1.1 Using Mapping Dict(or Series)

# using mapping dict
mapping = {'a': 'group1', 'b': 'group1', 'c': 'group2'}
df.groupby(mapping, axis=1)

5.1.2 Using a Function

Any function passed as a group key will be called once per index value, with the return values being used as the group names

# using function
df.groupby(len).sum()

5.1.3 Using Mixing

Mixing functions with arrays, dicts or Series is not a problem as everything gets converted to arrays internally

key_list = ['one', 'one', 'two']
df.groupby([len, key_list]).min()

5.2 groupby aggregation

Table 1: Optimized Groupby Methods
Name Description
count Number of non-NA values in the group
sum Sum of non-NA values
mean Mean of non-NA values
median Arithmetic median of non-NA values
std, var Unbiased (n - 1 denominator) standard deviation and variance
min, max Minimum and maximum of non-NA values
prod Product of non-NA values
first, last First and last non-NA values

5.2.1 agg

Series
grouped = series.groupby(group_key)
grouped.agg(['mean', 'std', transform_function])
grouped.agg([('foo', 'mean'), ('bar', np.std)]) # foo, bar will be the column names of the result df
grouped.agg({"tip": np.max, "size": "sum"})
DataFrame
grouped = df.groupby("column")
grouped.agg({'col1': 'mean', 'col2': 'std', 'col3': np.max})
grouped.agg({'col1': ['mean', 'std']})

5.2.2 apply

The function passed to apply can take additional positional and keyword arguments:

df.groupby(['key1', 'key2']).apply(top, n=1)  # top is a user-defined function, e.g. one returning the top n rows

5.2.3 transform

Like apply, transform works with functions that return Series, but the result must be the same size as the input

5.2.4 fillna with group value

grouped.apply(lambda g: g.fillna(g.mean()))

fill_values = {'a': 5, 'b': 4}
grouped.apply(lambda g: g.fillna(fill_values[g.name]))

5.3 pivot_table

pivot_table provides a convenience interface to groupby plus reshaping with hierarchical indexing; it can also add partial totals (margins=True).

pivot_table.jpg

  • crosstab is a special case of a pivot table that computes group frequencies.

    data
    #  Sample Nationality Handedness
    # 0 1 USA Right-handed
    # 1 2 Japan  Left-handed
    pd.crosstab(data.Nationality, data.Handedness, margins=True)
    # Handedness Left-handed Right-handed All
    # Nationality
    # Japan      2 3 5
    # USA        1 4 5
    # All        3 7 10
    

5.4 transform

similar to apply but imposes more constraints:

  • It can produce a scalar value to be broadcast to the shape of the group
  • It can produce an object of the same shape as the input group
  • It must not mutate its input

5.4.1 examples

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 2, 'value': np.arange(6.)})
# =>
#   key  value
# 0   a    0.0
# 1   b    1.0
# 2   c    2.0
# 3   a    3.0
# 4   b    4.0
# 5   c    5.0
df.groupby("key").transform(lambda x: x.mean()) # or df.groupby("key").transform("mean")
# => same shape
#    value
# 0    1.5
# 1    2.5
# 2    3.5
# 3    1.5
# 4    2.5
# 5    3.5

5.4.2 unwrapped group operation

g = df.groupby('key')['value']
normalized = (df['value'] - g.transform('mean')) / g.transform('std')
# faster than
g.apply(lambda x: (x - x.mean()) / x.std())

5.5 Examples

5.5.1 Group Weighted Average

df = pd.DataFrame({'category': ['a', 'a', 'a', 'a',
				'b', 'b', 'b', 'b'],
		   'data': np.random.randn(8),
		   'weights': np.random.rand(8)})
df.groupby("category").apply(lambda g: np.average(g['data'], weights=g['weights']))

5.5.2 Group Correlation

prices.columns
# AAPL,MSFT,XOM,SPX

spx_corr = lambda x: x.corrwith(x['SPX'])
rets = prices.pct_change().dropna()

get_year = lambda x: x.year
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)
#    AAPL MSFT XOM SPX
# 2003 0.541124 0.745174 0.661265 1.0
# 2004 0.374283 0.588531 0.557742 1.0

5.5.3 Group-Wise Linear Regression

prices.columns
# AAPL,MSFT,XOM,SPX
rets = prices.pct_change().dropna()

get_year = lambda x: x.year
by_year = rets.groupby(get_year)

import statsmodels.api as sm
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

by_year.apply(regress, 'AAPL', ['SPX'])

5.5.4 Grouped Time Resampling

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df = pd.DataFrame({'time': times.repeat(3),
		   'key': np.tile(['a', 'b', 'c'], N),
		   'value': np.arange(N * 3.)})
time_key = pd.Grouper(freq='5min')
resampled = (df.set_index('time').groupby(['key', time_key]).sum())

6 Timeseries

6.1 Indexing

from datetime import time

ts[time(10, 0)]
# same as
ts.at_time(time(10, 0))

ts.between_time(time(10, 0), time(10, 30))
ts.asof(pd.date_range('2016-05-01 10:00', periods=4, freq='B'))

6.1.1 asof

By passing an array of timestamps to the asof method, you obtain an array of the last valid (non-NA) values at or before each timestamp.
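
For example:

ts = pd.Series(np.arange(4.),
               index=pd.date_range('2016-05-01 10:00', periods=4, freq='T'))
ts.asof(pd.Timestamp('2016-05-01 10:02:30'))  #=> 2.0, the last value at or before 10:02:30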

6.2 Special Frequencies

Table 2: Base Time Series Frequencies
Alias Offset Type
D Day
B BusinessDay
H Hour
T or min Minute
S Second
L or ms Milli
U Micro
M MonthEnd
BM BusinessMonthEnd
MS MonthBegin
BMS BusinessMonthBegin
W-MON, W-TUE, … Week
WOM-1MON, … WeekOfMonth
Q-JAN, Q-FEB, … QuarterEnd
BQ-JAN, BQ-FEB, … BusinessQuarterEnd
QS-JAN, … QuarterBegin
BQS-JAN, … BusinessQuarterBegin
A-JAN, … YearEnd
BA-JAN, … BusinessYearEnd
AS-JAN, … YearBegin
BAS-JAN, … BusinessYearBegin

6.2.1 examples

  • "1h30min"
  • "WOM-3FRI": the third Friday of each month:

6.3 Offset Objects

from datetime import datetime
from pandas.tseries.offsets import Day, MonthEnd

now = datetime(2011, 11, 17)
now + 3 * Day()  # -> Timestamp('2011-11-20 00:00:00')
now + MonthEnd()  # -> Timestamp('2011-11-30 00:00:00') "roll forward"
# same as
offset = MonthEnd()
offset.rollforward(now)

offset.rollback(now)

from pandas.tseries.frequencies import to_offset
offset = to_offset("WOM-3FRI")
  • rollforward/rollback is useful with groupby(like resample but slower): ts.groupby(offset.rollforward).mean()

6.4 Time Zone Localization

  • ts.tz_localize("UTC"): attaches a time zone to naive timestamps, much like datetime.replace(tzinfo=...); use tz_convert to convert between zones (sketch below)
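
A short sketch:

ts = pd.Series(np.arange(3.),
               index=pd.date_range('2012-03-09 09:30', periods=3, freq='D'))
ts_utc = ts.tz_localize('UTC')          # attach a time zone to naive timestamps
ts_utc.tz_convert('America/New_York')   # convert to another zone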

6.5 Periods

Periods represent timespans, like days, months, quarters, or years.

p = pd.Period(2007, freq="A-DEC")  # represents the full timespan from 2007-01-01 to 2007-12-31
p + 5  # Period('2012', 'A-DEC')

rng = pd.period_range("2000-01-01", "2000-06-30", freq="M")
# rng -> PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')

m = p.asfreq("M", how="start")  # converting from low to high frequency
m.asfreq("A-DEC")  # converting from high to low frequency depending on where the subperiod “belongs.”

6.5.1 Frequencies

  • A-DEC: annual frequency ending in December
  • Q-SEP: quarterly frequency for a fiscal year ending in September (Q4 ends in SEP)

6.5.2 Converting from/to Timestamps

  • ts.to_period
  • ts.to_timestamp(how="end")

6.5.3 From Different Columns

pd.PeriodIndex(year=data.year, quarter=data.quarter, freq='Q-DEC')

6.6 Resampling

6.6.1 Downsampling

Aggregates data to a lower frequency, so an aggregation function (sum, mean, ...) is needed.

  • Label: resample(label="right") labels each bin with its right edge; e.g. for "T" bins, the 9:01 label represents the interval 9:00~9:01 (see the sketch below)
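
A minimal sketch of the label option:

ts = pd.Series(np.arange(10),
               index=pd.date_range('2000-01-01 09:00', periods=10, freq='T'))
ts.resample('5min').sum()                  # bins labelled by their left edge: 09:00, 09:05
ts.resample('5min', label='right').sum()   # labelled by the right edge: 09:05, 09:10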

6.6.2 Upsampling

Converts to a higher frequency, so no aggregation is needed; a fill or interpolation method is usually wanted instead (see the sketch below).

  • without any filling, asfreq simply introduces missing (NaN) values
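
A sketch, upsampling a daily series to 12-hourly:

ts = pd.Series([1.0, 2.0], index=pd.date_range('2000-01-01', periods=2, freq='D'))
ts.resample('12H').asfreq()   # the new 12:00 timestamp becomes NaN
ts.resample('12H').ffill()    # or fill forward instead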

6.7 Moving Window

Can be useful for smoothing noisy or gappy data.

Function Description
rolling_count Returns number of non-NA observations in each trailing window.
rolling_sum Moving window sum.
rolling_mean Moving window mean.
rolling_median Moving window median.
rolling_var, rolling_std Moving window variance and standard deviation, respectively. Uses n - 1 denominator.
rolling_skew, rolling_kurt Moving window skewness (3rd moment) and kurtosis (4th moment), respectively.
rolling_min, rolling_max Moving window minimum and maximum.
rolling_quantile Moving window score at percentile/sample quantile.
rolling_corr, rolling_cov Moving window correlation and covariance.
rolling_apply Apply generic array function over a moving window.
ewma Exponentially-weighted moving average.
ewmvar, ewmstd Exponentially-weighted moving variance and standard deviation.
ewmcorr, ewmcov Exponentially-weighted moving correlation and covariance.

6.7.1 rolling

  • rolling by time offset: close_px.rolling('20D').mean()

6.7.2 expanding

The expanding mean starts the time window from the beginning of the time series and increases the size of the window until it encompasses the whole series.
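
For example:

s = pd.Series([1.0, 2.0, 3.0, 4.0])
s.expanding().mean()   #=> 1.0, 1.5, 2.0, 2.5: mean of everything seen so far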

6.7.3 ewm

The idea is to specify a constant decay factor to give more weight to more recent observations. There are a couple of ways to specify the decay factor. A popular one is using a span, which makes the result comparable to a simple moving window function with window size equal to the span.

  • with decay factor span: aapl_px.ewm(span=30).mean()

6.8 Binary Moving Window

Some statistical operators need to operate on two time series, like correlation and covariance

aapl_rets.rolling(125, min_periods=100).corr(spx_rets)
rets_df.rolling(125, min_periods=100).corr(spx_rets)

6.9 User-Defined Moving Window Functions

The only requirement is that the function produce a single value (a reduction) from each piece of the array.

from scipy.stats import percentileofscore
score_at_2percent = lambda x: percentileofscore(x, 0.02)
result = returns.AAPL.rolling(250).apply(score_at_2percent)

6.10 Tips

  • normalized date range: pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)
  • shift by freq: ts.shift(2, freq="M")
  • shortcut for OHLC resampling: ts.resample("5min").ohlc()
  • rolling with min_periods: aapl_px.rolling(125, min_periods=100)

7 Categorical Data

7.1 Motivation

values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])
# dim
# 0 apple
# 1 orange
# dtype: object
dim.take(values)
# 0 apple
# 1 orange
# 0 apple
# 0 apple
# ...
  • The representation of values as integers is called the categorical or dictionary-encoded representation.
  • The array dim of distinct values can be called the categories, dictionary, or levels of the data.
  • The integer values that reference the categories are called the category codes or simply codes.

7.2 Categorical Type

  • pd.Categorical(data: Iterable[str])
  • converting string series to categorical: s.astype("category")
  • converting integer series to categorical: pd.Categorical.from_codes(codes: pd.Series[int], categories: Iterable[str])
  • numeric data to categorical: bins = pd.qcut(data, 4, labels=["Q1", "Q2", "Q3", "Q4"]); often followed by groupby(bins)

7.2.1 Better Performance

  • memory_usage() is much smaller for a categorical Series than for the equivalent object Series
  • GroupBy operations can be significantly faster with categoricals
  • In large datasets, categoricals are often used as a convenient tool for memory savings and better performance

7.3 Categorical Methods

Series.cat is similar to Series.str

  • change the set of categories (e.g. to include unobserved ones): cat_s.cat.set_categories(actual_categories)
  • methods: add_categories, as_ordered, as_unordered, remove_categories, remove_unused_categories, rename_categories, reorder_categories, set_categories
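
A short sketch of the cat accessor:

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
cat_s.cat.codes        #=> 0, 1, 2, 3, 0, 1, 2, 3
cat_s.cat.categories   #=> Index(['a', 'b', 'c', 'd'], dtype='object')
cat_s.cat.set_categories(['a', 'b', 'c', 'd', 'e'])   # extend the known categories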

7.3.1 Creating Dummy Variables for Modeling

When you're using statistics or machine learning tools, you'll often transform categorical data into dummy variables, also known as one-hot encoding.

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
pd.get_dummies(cat_s)
#    a  b  c  d
# 0  1  0  0  0
# 1  0  1  0  0
# 2  0  0  1  0
# 3  0  0  0  1
# 4  1  0  0  0
# 5  0  1  0  0
# 6  0  0  1  0
# 7  0  0  0  1

8 Modeling Libraries

  • patsy for describing statistical models
  • statsmodels
  • scikit-learn (and other scikits: scikit-image, scikits-cuda)
  • TensorFlow, PyTorch

8.0.1 Books

  • Introduction to Machine Learning with Python by Andreas Mueller and Sarah Guido (O’Reilly)
  • Python Data Science Handbook by Jake VanderPlas (O’Reilly)
  • Python Machine Learning by Sebastian Raschka (Packt Publishing)
  • Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (O’Reilly)

9 Other Tools

9.1 ipyparallel

# shell: ipcluster start --n=4
import os
from ipyparallel import Client
rc = Client()
dview = rc[:]
# => <DirectView [0, 1, 2, 3]>
dview.apply_sync(os.getpid)

dview.scatter('a', range(10)) # distribute data (map)
dview.execute('print(a)').display_outputs()
dview.execute('b=sum(a)')
dview.gather('b').r # reduce

9.2 Scipy

9.3 Sympy

Main computer algebra module in Python

import sympy as S
from sympy import init_printing
init_printing() # for latex printing

x = S.symbols('x')

p=sum(x**i for i in range(3)) # 2nd order polynomial
# => x**2 + x + 1

S.solve(p) # solves p == 0
# => [-1/2 - sqrt(3)*I/2, -1/2 + sqrt(3)*I/2]

S.roots(p)
# => {-1/2 - sqrt(3)*I/2: 1, -1/2 + sqrt(3)*I/2: 1}

9.3.1 lambdify

y = S.tan(x) * x + x**2
yf = S.lambdify(x, y, 'numpy')
#=> <function numpy.<lambda>>
yf(np.arange(3))

9.3.2 Alternative SAGE

  • open source mathematics software package
  • not a pure-Python module.

9.4 seaborn

Beautifully formatted plots

9.5 plotly

9.6 Blaze

  • Big data(too big to fit in RAM)
  • tight integration with Pandas dataframes

https://github.com/blaze/blaze

9.7 bcolz

columnar, compressed data containers (either in-memory or on-disk)

  • based on numpy
  • support for PyTables and pandas dataframes
  • takes advantage of multi-core architecture

https://github.com/Blosc/bcolz

9.8 Enthought Tool-Suite (ETS)

ETS is a collection of components developed by Enthought and its partners for constructing custom scientific applications.

9.8.1 Chaco

2-Dimensional Plotting, interactive visualization.

9.8.2 Mayavi

Another Enthought-supported 3D visualization package that sits on VTK.

9.9 bottleneck

provides an alternate implementation of NaN-friendly moving window functions.

9.10 SWIG

Wrapper for c, c++ libraries.

  • generates Python extension modules (e.g. .pyd files on Windows)

9.11 zipline

10 hdf

10.1 dataset

import h5py

f = h5py.File("testfile.hdf5", "a")  # open for read/write, creating the file if necessary
dset = f.create_dataset("big dataset", (1024**3, ), dtype=np.float32)
dset[0:1024] = np.arange(1024)
f.flush() # dump cache on disk

with dset.astype('float64'):
    out = dset[0:1024]
out.dtype #=> dtype('float64')

10.1.1 Ellipsis

dset = f.create_dataset('4d', shape=(100, 80, 50, 20))
dset[0,...,0].shape
# => (80, 50)
dset[...].shape
# => (100, 80, 50, 20)

10.1.2 chunks

dset = f.create_dataset('chunked', (100,480,640), dtype='i1', chunks=(1, 64, 64))

10.1.3 float data saving space

  • it’s common practice to store these data points on disk as single-precision

To read float32 back in as double precision:
  • one way
big_out = np.empty((100, 1000), dtype=np.float64)
dset.read_direct(big_out)
  • another way
with dset.astype('float64'):
    out = dset[0,:]

10.2 lookup tools

10.2.1 HDFView

10.2.2 h5ls

h5ls -vlr

10.3 notice

  • slicing an HDF5 dataset reads from / writes to disk directly
  • use np.s_ to get a slice object

    dset.read_direct(out, source_sel=np.s_[0,:], dest_sel=np.s_[50,:])