Machine Learning
1 Definition
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. – Tom Mitchell (1998)
2 Roadmap
3 Supervised
Teach the computer how to do something
- "right answers" given
3.1 Regression problem
Predict a continuous-valued output.
3.1.1 Linear Regression
Hypothesis
\[h_\theta(x)=\theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n \quad (x_0 = 1)\]
Cost Function
Also called the squared error function or mean squared error: \[J(\theta_0, \theta_1)=\frac{1}{2m}\sum_{i=1}^m(\hat{y}^{i}-y^{i})^2=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{i})-y^{i})^2\]
- Goal: minimize \(J(\theta_0, \theta_1)\)
- The \(\frac{1}{2}\) is a convenience for computing gradient descent: the derivative of the squared term cancels the \(\frac{1}{2}\).
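A minimal NumPy sketch of this cost computation; the names `compute_cost`, `X`, `y`, and `theta` are illustrative, and `X` is assumed to carry a leading column of ones for \(x_0 = 1\):

```python
import numpy as np

def compute_cost(theta, X, y):
    """Mean squared error cost J(theta). X is the m x (n+1) design matrix
    with a leading column of ones (x0 = 1), y is the vector of targets."""
    m = len(y)
    predictions = X @ theta                      # h_theta(x^i) for every training example
    return (1 / (2 * m)) * np.sum((predictions - y) ** 2)
```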
3.1.2 Polynomial Regression
- Let \(x_1=x, x_2=x^2\) \[h_\theta(x)=\theta_0+\theta_1 x_1+\theta_2 x_2\]
- Feature scaling is needed, since \(x\) and \(x^2\) take very different ranges of values.
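A small sketch, assuming a single raw feature, of how the polynomial terms become ordinary columns of the design matrix (variable names are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])                    # original feature
X = np.column_stack([np.ones_like(x), x, x ** 2])     # columns: x0 = 1, x1 = x, x2 = x^2
# After feature scaling, X can be fed to the same linear-regression machinery.
```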
3.1.3 Gradient Descent
An iterative algorithm for finding the minimum of the cost function.
- Start with some parameters \(\theta_0, \theta_1, \dots\)
- Keep changing parameters to reduce \(J(\theta_0, \theta_1, \dots)\)
Idea
repeat until convergence, updating all \(\theta_j\) simultaneously \[\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1, \dots)\]
- \(\alpha\) too small: gradient descent can be slow
- \(\alpha\) too large: gradient descent can overshoot the minimum
- For linear regression with one variable, the updates become: \[\theta_0 := \theta_0-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^i)-y^i)\] \[\theta_1 := \theta_1-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^i)-y^i)x^i\]
Batch Gradient Descent
Each step of gradient descent uses all the training examples.
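A minimal sketch of batch gradient descent in NumPy, assuming `X` already contains the \(x_0 = 1\) column (function and variable names are illustrative):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent: every update uses all m training examples."""
    m = len(y)
    for _ in range(num_iters):
        gradient = (1 / m) * (X.T @ (X @ theta - y))   # partial derivatives of J
        theta = theta - alpha * gradient               # simultaneous update of all theta_j
    return theta
```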
Other method to find the minimum of the cost function
Normal Equation method (see 3.1.4)
- Gradient Descent scales better to large data sets
Feature Scaling
- Mean normalization: \[x_i \leftarrow \frac{x_i - \mu_i}{s_i}\] where \(s_i\) is the range (max − min) of the feature; the standard deviation can be used instead.
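A sketch of mean normalization applied per feature (the helper name is an assumption; dividing by the standard deviation instead of the range would also match the note above):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature to (x - mu) / range, column by column."""
    mu = X.mean(axis=0)
    feature_range = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / feature_range
```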
Learning Rate
- Debugging gradient descent: plot the cost function against the number of iterations (iterations on the x-axis); \(J(\theta)\) should decrease on every iteration
- Automatic convergence test: declare convergence if \(J(\theta)\) decreases by less than some small threshold \(\epsilon\) (e.g. \(10^{-3}\)) in one iteration, as in the sketch below
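A sketch of the automatic convergence test wrapped around the gradient-descent update; the threshold name `epsilon`, the iteration cap, and the function name are assumptions:

```python
import numpy as np

def train_until_converged(X, y, theta, alpha, epsilon=1e-3, max_iters=10000):
    """Stop once J(theta) decreases by less than epsilon in one iteration."""
    m = len(y)
    cost = lambda t: (1 / (2 * m)) * np.sum((X @ t - y) ** 2)
    prev = cost(theta)
    for _ in range(max_iters):
        theta = theta - alpha * (1 / m) * (X.T @ (X @ theta - y))
        curr = cost(theta)
        if prev - curr < epsilon:      # converged: improvement smaller than epsilon
            break
        prev = curr
    return theta
```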
3.1.4 Normal Equation
Minimize \(J\) by taking its derivatives with respect to each \(\theta_j\) and setting them to zero. \[\theta=(X^T X)^{-1}X^T y\]
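A direct NumPy transcription of the formula (a sketch; in practice `np.linalg.solve` or the pseudo-inverse is preferable to forming the inverse explicitly, and `X` is assumed to include the \(x_0 = 1\) column):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(X.T @ X) @ X.T @ y
```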
vs Gradient Descent
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose \(\alpha\) | No need to choose \(\alpha\) |
| Needs many iterations | No need to iterate |
| \(O(k n^2)\) | \(O(n^3)\) to calculate \((X^T X)^{-1}\) |
| Works well even when \(n\) is large | Slow if \(n\) is large (e.g. \(n > 10000\)) |
What if \(X^T X\) is noninvertible?
- Redundant features (they are linearly dependent)
- Too many features (\(m\le n\)): delete some features or use regularization
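When \(X^T X\) is singular, the pseudo-inverse still produces a usable \(\theta\); a minimal sketch (the function name is illustrative):

```python
import numpy as np

def normal_equation_pinv(X, y):
    """Pseudo-inverse variant of the normal equation for noninvertible X^T X."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```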
3.2 Classification problem
Predict a discrete-valued output.
3.2.1 Logistic Regression
Hypothesis
- Logistic (sigmoid) function
\[h_\theta(x)=g(\theta^T x)\] where \[g(z)=\frac{1}{1+e^{-z}}\]
combine together: \[h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}\]
- Predict \(y=1\) if \(h_\theta(x)=g(z)\ge 0.5\), i.e. \(z=\theta^T x\ge 0\)
- Predict \(y=0\) if \(h_\theta(x)=g(z) < 0.5\), i.e. \(z=\theta^T x < 0\)
Cost Function
\[Cost(h_\theta(x), y)= \begin{cases} -\log(h_\theta(x)), & \mbox{if }y=1 \\ -\log(1-h_\theta(x)), & \mbox{if }y=0 \end{cases}\]
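A sketch of the sigmoid hypothesis, the per-example cost above averaged over all \(m\) training examples, and the 0.5 decision rule; the averaged form and the function names are assumptions beyond what the notes state:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1 / (1 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Average of Cost(h_theta(x), y) over the m examples; y is a 0/1 vector."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def predict(theta, X):
    """Predict y = 1 when h_theta(x) >= 0.5, i.e. when theta^T x >= 0."""
    return (sigmoid(X @ theta) >= 0.5).astype(int)
```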
4 Unsupervised
Let the computer learn by itself; no "right answers" are given.
4.1 Clustering problem
4.1.1 Examples
- Social network analysis
- Market segmentation (categorize customers)
- Astronomical data analysis