Machine Learning

1 Definition

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. – Tom Mitchell (1998)

2 Roadmap

roadmap.png

3 Supervised

Teach the computer how to do something

  • "right answers" given

3.1 Regression problem

Predict a continuous-valued output

3.1.1 Linear Regression

Hypothesis

\[h_\theta(x)=\theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n \quad (x_0 = 1)\]

  • vectorized

    \[h_\theta(x)=\theta^T \cdot x\]

    • \(\theta\) and \(x\) are \((n+1)\times 1\) column vectors (with \(x_0 = 1\))
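
A minimal NumPy sketch of this vectorized hypothesis (the helper name `hypothesis` is illustrative; `x` is assumed to already contain the bias entry \(x_0 = 1\)):

  import numpy as np

  def hypothesis(theta, x):
      # h_theta(x) = theta^T x, with theta and x as 1-D arrays of length n+1
      return theta @ x

  # example: h(x) = 2 + 3*x1 evaluated at x1 = 4  ->  14.0
  theta = np.array([2.0, 3.0])
  x = np.array([1.0, 4.0])
  print(hypothesis(theta, x))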
Cost Function

Called the squared error function or mean squared error: \[J(\theta_0, \theta_1)=\frac{1}{2m}\sum_{i=1}^m(\hat{y}^i-y^i)^2=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^i)-y^i)^2\]

  • Goal: minimize \(J(\theta_0, \theta_1)\)
  • \(\frac{1}{2}\) is a convenience for computing gradient descent, as the derivative of the squared term cancels out the \(\frac{1}{2}\).

  • vectorized

    \[J(\theta)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^i)-y^i)^2=\frac{1}{2m}\sum_{i=1}^m (\theta^T x^i-y^i)^2=\frac{1}{2m}(X\theta-y)^T(X\theta-y)\]

    where \(X\) is the \(m\times(n+1)\) design matrix whose \(i\)-th row is \((x^i)^T\), and \(y\) is the vector of targets.
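
A sketch of the vectorized cost, assuming `X` is the design matrix described above and `y` the target vector (names are illustrative):

  import numpy as np

  def cost(theta, X, y):
      # J(theta) = 1/(2m) * (X theta - y)^T (X theta - y)
      m = len(y)
      residuals = X @ theta - y        # h_theta(x^i) - y^i for every example
      return residuals @ residuals / (2 * m)

  # toy data generated by y = 1 + 2x, so the cost at theta = [1, 2] is 0
  X = np.array([[1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])
  y = np.array([3.0, 5.0, 7.0])
  print(cost(np.array([1.0, 2.0]), X, y))   # 0.0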

3.1.2 Polynomial Regression

  • Let \(x_1=x, x_2=x^2\) \[h_\theta(x)=\theta_0+\theta_1 x_1+\theta_2 x_2\]
  • need to use feature scaling, since \(x\) and \(x^2\) (and higher powers) have very different ranges
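
A small sketch of building the polynomial features by hand (the function name is illustrative); once the columns exist, the problem is ordinary linear regression in \(x_1, x_2\):

  import numpy as np

  def polynomial_design_matrix(x, degree=2):
      # map a single feature x to columns [1, x, x^2, ..., x^degree]
      x = np.asarray(x, dtype=float)
      return np.column_stack([x ** d for d in range(degree + 1)])

  x = np.array([1.0, 2.0, 3.0])
  print(polynomial_design_matrix(x))
  # [[1. 1. 1.]
  #  [1. 2. 4.]
  #  [1. 3. 9.]]
  # the x and x^2 columns grow at very different rates, hence feature scaling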

3.1.3 Gradient Descent

To find the minimum value of the cost function.

  1. Start with some parameters \(\theta_0, \theta_1, \dots\)
  2. Keep changing parameters to reduce \(J(\theta_0, \theta_1, \dots)\)
Idea

repeat until convergence \[\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1, \dots)\] (update all \(\theta_j\) simultaneously)

  • \(\alpha\) too small: gradient descent can be slow
  • \(\alpha\) too large: gradient descent can overshoot the minimum
  • Working out the derivative terms (single-feature case): \[\theta_0 := \theta_0-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^i)-y^i)\] \[\theta_1 := \theta_1-\alpha\frac{1}{m}\sum_{i=1}^m((h_\theta(x^i)-y^i)x^i)\]
  • vectorized

    repeat until convergence \[\begin{align*} \theta_j & := \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta) \\ & = \theta_j-\alpha\frac{1}{m}\sum_{i=1}^m(\theta^T x^i - y^i)x_j^i \end{align*}\]

Batch Gradient Descent

Each step of gradient descent uses all the training examples.
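
A minimal batch gradient descent sketch under the same assumptions as before (`X` has a leading column of ones); `alpha` and `num_iters` are illustrative values, not recommendations:

  import numpy as np

  def batch_gradient_descent(X, y, alpha=0.1, num_iters=1000):
      m, n = X.shape
      theta = np.zeros(n)
      for _ in range(num_iters):
          # gradient uses all m training examples: (1/m) * sum_i (h(x^i) - y^i) x^i
          gradient = X.T @ (X @ theta - y) / m
          theta = theta - alpha * gradient      # simultaneous update of every theta_j
      return theta

  X = np.array([[1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])
  y = np.array([3.0, 5.0, 7.0])
  print(batch_gradient_descent(X, y))           # approximately [1. 2.]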

Other methods to find the minimum of the cost function

Normal Equations Method

  • Gradient Descent scales better to large data sets
Feature Scaling
  • mean normalization \[\frac{x_i - \mu_i}{s_i}\] where \(s_i\) is the range (max - min); the standard deviation can be used instead
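
A sketch of mean normalization applied column-wise (it is assumed the bias column of ones is added after scaling):

  import numpy as np

  def mean_normalize(X, use_std=False):
      # (x - mean) / range per column; pass use_std=True to divide by the std instead
      mu = X.mean(axis=0)
      s = X.std(axis=0) if use_std else X.max(axis=0) - X.min(axis=0)
      return (X - mu) / s, mu, s

  X = np.array([[2104.0, 3.0],
                [1600.0, 2.0],
                [2400.0, 4.0]])
  X_scaled, mu, s = mean_normalize(X)
  print(X_scaled)   # each column now has mean 0 and a comparable range

Keep `mu` and `s` so that new inputs can be scaled the same way before prediction.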
Learning Rate
  • Debugging gradient descent: plot \(J(\theta)\) against the number of iterations; it should decrease on every iteration
  • Automatic convergence test: declare convergence if \(J(\theta)\) decreases by less than \(\epsilon\) (some small value, e.g. \(10^{-3}\)) in one iteration
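
A sketch of the automatic convergence test, reusing the batch update above and recording \(J(\theta)\) per iteration (`epsilon` corresponds to \(\epsilon\); plotting the returned history against the iteration number gives the debugging curve):

  import numpy as np

  def gradient_descent_with_history(X, y, alpha=0.1, epsilon=1e-3, max_iters=10000):
      m, n = X.shape
      theta = np.zeros(n)
      cost = lambda t: ((X @ t - y) @ (X @ t - y)) / (2 * m)
      history = [cost(theta)]
      for _ in range(max_iters):
          theta = theta - alpha * X.T @ (X @ theta - y) / m
          history.append(cost(theta))
          if history[-2] - history[-1] < epsilon:   # J decreased by less than epsilon
              break
      return theta, history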

3.1.4 Normal Equation

Minimize \(J\) by taking its derivatives with respect to the \(\theta_j\) and setting them to zero. \[\theta=(X^T X)^{-1}X^T y\]

vs Gradient Descent

  Gradient Descent                  Normal Equation
  --------------------------------  ------------------------------------------
  Need to choose \(\alpha\)         No need to choose \(\alpha\)
  Needs many iterations             No need to iterate
  \(O(k n^2)\)                      \(O(n^3)\) to calculate \((X^T X)^{-1}\)
  Works well when \(n\) is large    Slow if \(n\) is large (e.g. \(n > 10000\))
What if \(X^T X\) is noninvertible?

  • Redundant features (they are linearly dependent)
  • Too many features (\(m\le n\)); in this case, delete some features or use regularization
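
A normal equation sketch; `np.linalg.pinv` (the pseudo-inverse) is used so the computation also behaves sensibly when \(X^T X\) is noninvertible:

  import numpy as np

  def normal_equation(X, y):
      # theta = (X^T X)^{-1} X^T y, computed with the pseudo-inverse for robustness
      return np.linalg.pinv(X.T @ X) @ X.T @ y

  X = np.array([[1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])
  y = np.array([3.0, 5.0, 7.0])
  print(normal_equation(X, y))   # [1. 2.], with no learning rate and no iterations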

3.2 Classification problem

Predict a discrete-valued output

3.2.1 Logistic Regression

Hypothesis
  • Logistic (sigmoid) function

\[h_\theta(x)=g(\theta^T x)\] where \[g(z)=\frac{1}{1+e^{-z}}\]

LogisticFunc.png

Combining the two: \[h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}\]

  • predict \(y=1\) if \(g(z)\ge 0.5\), i.e. \(z\ge 0\)
  • predict \(y=0\) if \(g(z) < 0.5\), i.e. \(z < 0\)
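
A sketch of the sigmoid hypothesis and the 0.5 threshold rule (the parameter values are made up for illustration):

  import numpy as np

  def sigmoid(z):
      # g(z) = 1 / (1 + e^{-z}), maps any real number into (0, 1)
      return 1.0 / (1.0 + np.exp(-z))

  def predict(theta, X):
      # predict y = 1 when g(theta^T x) >= 0.5, i.e. when theta^T x >= 0
      return (sigmoid(X @ theta) >= 0.5).astype(int)

  theta = np.array([-3.0, 1.0, 1.0])    # decision boundary: x1 + x2 = 3
  X = np.array([[1.0, 1.0, 1.0],        # x1 + x2 = 2  -> predict 0
                [1.0, 2.5, 2.5]])       # x1 + x2 = 5  -> predict 1
  print(predict(theta, X))              # [0 1]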
Cost Function

\[Cost(h_\theta(x), y)= \begin{cases} -\log(h_\theta(x)), & \mbox{if }y=1 \\ -\log(1-h_\theta(x)), & \mbox{if }y=0 \end{cases}\]
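
The two cases combine into a single expression, \(Cost(h_\theta(x), y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))\), and \(J(\theta)\) is its average over the \(m\) training examples. A sketch of that cost (toy data reused from the prediction example above):

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def logistic_cost(theta, X, y):
      # J(theta) = (1/m) * sum_i [ -y^i log(h^i) - (1 - y^i) log(1 - h^i) ]
      m = len(y)
      h = sigmoid(X @ theta)
      return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

  X = np.array([[1.0, 1.0, 1.0],
                [1.0, 2.5, 2.5]])
  y = np.array([0.0, 1.0])
  print(logistic_cost(np.array([-3.0, 1.0, 1.0]), X, y))   # roughly 0.22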

4 Unsupervised

Let the computer learn by itself: no "right answers" (labels) are given

4.1 Clustering problem

4.1.1 Examples

  • Social network analysis
  • Market segmentation (categorize customers)
  • Astronomical data analysis

4.2 Non-clustering problem

4.2.1 Cocktail party algorithm

Separate mixed audio sources, e.g. two speakers recorded simultaneously by two microphones.