Probability

Table of Contents

1 Basic conceptions

1.1 Terms

  • experiment

    Every probabilistic model involves an underlying process, called the experiment.

  • sample space

    The set of all possible outcomes is called the sample space of the experiment, denoted by \(\Omega\).

  • event

    A subset of the sample space (that is, a collection of possible outcomes) is called an event.

  • observation

    The outcome observed in a single run of the experiment is called an observation.

1.2 Set

A set is a collection of objects, which are the elements of the set.

  • countable \(S=\{x_1, x_2, \cdots\}\)
  • uncountable (more generally, a set described by a property) \(S=\{x|x\ satisfies\ P\}\)

    ex: \[\{k|k/2\ is\ integer\}\] \[\{x|0\le x\le 1\}\]

  • complement

    The complement of a set S, with respect to the universe \(\Omega\), is the set \(\{x\in\Omega|x\notin S\}\) of all elements of \(\Omega\) that do not belong to S, and is denoted by \(S^c\). Note that \(\Omega^c = \emptyset\).

  • union

    \[S\cup T = \{x|x\in S\ or\ x\in T\}\] \[\bigcup_{n=1}^{\infty} S_n = S_1\cup S_2 \cup \cdots = \{x|x\in S_n\ for\ some\ n\}\]

  • intersection

    \[S\cap T = \{x|x\in S\ and\ x\in T\}\] \[\bigcap_{n=1}^{\infty} S_n = S_1\cap S_2 \cap \cdots = \{x|x\in S_n\ for\ all\ n\}\]

  • difference

    \[S\setminus T=S\cap T^c\] \[S\setminus T=S\setminus (S\cap T)\]

  • scalar set

    The set of scalars (real numbers) is denoted by \(\Re\); the two-dimensional plane is denoted by \(\Re^2\)

  • algebra of sets (the identities below are verified in the sketch after this list)
    • Commutative

      \[S\cap T=T\cap S\] \[S\cup T=T\cup S\]

    • Associative

      \[S\cap(T\cap U)=(S\cap T)\cap U\] \[S\cup(T\cup U)=(S\cup T)\cup U\]

    • Distributive

      \[S\cap(T\cup U)=(S\cap T)\cup(S\cap U)\] \[S\cup(T\cap U)=(S\cup T)\cap(S\cup U)\]

    • de Morgan’s laws:

      \[(\bigcup_n S_n)^c=\bigcap_n S_n^c\] \[(\bigcap_n S_n)^c=\bigcup_n S_n^c\]

    • others

      \[(S^c)^c=S\] \[S\cap S^c=\emptyset\] \[S\cup\Omega=\Omega\] \[S\cap\Omega=S\] \[(S\setminus T)\cup T = S\cup T\]
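
The identities above are easy to sanity-check with Python's built-in set type. A minimal sketch, assuming nothing beyond the standard library; the concrete sets Omega, S, T, U are arbitrary examples chosen for the check:

    # Verify a few of the set-algebra identities on small concrete sets.
    Omega = set(range(10))                      # the universe
    S, T, U = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}   # arbitrary example sets

    def c(A):
        """Complement of A with respect to Omega."""
        return Omega - A

    assert c(S | T) == c(S) & c(T)               # de Morgan
    assert c(S & T) == c(S) | c(T)               # de Morgan
    assert S & (T | U) == (S & T) | (S & U)      # distributive
    assert S | (T & U) == (S | T) & (S | U)      # distributive
    assert (S - T) | T == S | T                  # difference identity
    print("all identities hold on this example")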

1.3 Axioms

  • Nonnegativity

    For every event A, \[P(A) \ge 0\]

  • Normalization

    The probability of the entire sample space \(\Omega\) is equal to 1. \[P(\Omega) = 1\]

  • Additivity

    If A and B are two disjoint events, then the probability of their union satisfies \[P(A\cup B)=P(A)+P(B)\] More generally, if \(A_1, A_2, \cdots\) is a sequence of disjoint events, then \[P(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i)\]

1.4 Consequences

  • The probability of the empty set

    \[1=P(\Omega)=P(\Omega\cup\emptyset)=P(\Omega)+P(\emptyset)=1+P(\emptyset)\] \[\therefore P(\emptyset)=0\]

  • Monotonicity

    If \(A\subseteq B\), then \(P(A)\le P(B)\)

  • Addition law

    \[P(A\cup B)=P(A)+P(B)-P(A\cap B)\]

    • proof

      \[P(A) = P(A\cap B) + P(A\setminus B)\] \[P(B) = P(B\cap A) + P(B\setminus A)\] \[P(A)+P(B) = 2P(A\cap B) + P(A\setminus B) + P(B\setminus A)\] \[P(A)+P(B) = P(A\cap B) + P((A\setminus B)\cup(B\setminus A)\cup(A\cap B))\ (axiom 3)\] \[P(A)+P(B) = P(A\cap B) + P(A\cup B)\]

  • others
    • \(P(A\cup B)\le P(A)+P(B)\) (addition law & axiom 1)
    • \(P(A\cup B\cup C) = P(A) + P(A^c\cap B) + P(A^c\cap B^c\cap C)\) (axiom 3)
    • if A and B are mutually exclusive, then \(P(A\cap B)=0\) (the converse does not hold in general: an event of probability 0 need not be empty)
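
As a concrete check of the addition law and the union bound above, here is a minimal sketch on one roll of a fair die; the events A ("even") and B ("at least 4") are made-up examples:

    from fractions import Fraction

    Omega = set(range(1, 7))                       # one roll of a fair die
    P = lambda E: Fraction(len(E), len(Omega))     # uniform probability law

    A = {2, 4, 6}   # "roll is even"
    B = {4, 5, 6}   # "roll is at least 4"

    # Addition law: P(A u B) = P(A) + P(B) - P(A n B)
    assert P(A | B) == P(A) + P(B) - P(A & B)
    # Union bound: P(A u B) <= P(A) + P(B)
    assert P(A | B) <= P(A) + P(B)
    print(P(A | B))   # 2/3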

1.5 Random Variable

A random variable is a variable that can take on any one of a set of possible values at random.

1.6 Discrete Variable

discrete: a variable is discrete if it can take only particular, separate values (such as integer counts), rather than any value in a continuous range.

  • PMF

    Probability mass function is the probability distribution of a discrete random variable. \[PMF_X(x)=P(X=x)\]

1.7 Continuous Variable

  • PDF

    Probability density function: plays the same role for a continuous variable that the PMF plays for a discrete variable

  • CDF

    Cumulative distribution function, \(CDF_X(x)=P(X\le x)\)

1.8 Expectation

  • Discrete

    \[E[X]=\sum_{x\in R_X}xPMF_X(x)\]

  • Continuous

    \[E[X]=\int_{-\infty}^{\infty}xPDF_X(x)dx=\int_{-\infty}^{\infty}xd(CDF_X(x))\]

  • Transformation

    Let \(Y = g(X)\), then

    • Discrete

      \[E[Y]=\sum_{x\in R_X}g(x)PMF_X(x)\]

    • Continuous

      \[E[Y]=\int_{-\infty}^{\infty}g(x)PDF_X(x)dx=\int_{-\infty}^{\infty}g(x)d(CDF_X(x))\]

1.9 Variance

\[\begin{align*} Var[X] & = E[(X-E[X])^2] \\ & = E[X^2-2E[X] X + E[X]^2] \\ & = E[X^2]-2E[X]\cdot E[X] + E[X]^2 \\ & = E[X^2]-E[X]^2 \end{align*}\]

  • More details

    \[Var[aX+b]=a^2Var[X]\]
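
The expectation and variance formulas in 1.8 and 1.9 can be checked on a small discrete example. A minimal sketch; the fair-die PMF and the constants a = 3, b = 5 are arbitrary choices:

    from fractions import Fraction

    # PMF of a fair six-sided die: P(X = x) = 1/6 for x = 1..6
    pmf = {x: Fraction(1, 6) for x in range(1, 7)}

    E  = sum(x * p for x, p in pmf.items())                    # E[X] = 7/2
    E2 = sum(x * x * p for x, p in pmf.items())                # E[X^2] via the transformation rule with g(x) = x^2
    var_def   = sum((x - E) ** 2 * p for x, p in pmf.items())  # definition: E[(X - E[X])^2]
    var_short = E2 - E ** 2                                    # shortcut: E[X^2] - E[X]^2
    assert var_def == var_short == Fraction(35, 12)

    # Var[aX + b] = a^2 Var[X], checked for a = 3, b = 5
    a, b = 3, 5
    E_y   = sum((a * x + b) * p for x, p in pmf.items())
    var_y = sum((a * x + b - E_y) ** 2 * p for x, p in pmf.items())
    assert var_y == a ** 2 * var_def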

1.10 Standard Deviation

\[\sigma=\sqrt{Var[X]}\]

1.11 Moment

The \(n\)-th moment of a random variable is the expected value of its \(n\)-th power \[\mu_X(n)=E[X^n]\]

1.12 Central Moment

The \(n\)-th central moment of a random variable X is the expected value of the \(n\)-th power of the deviation of X from its expected value. \[\bar\mu_X(n)=E[(X-E[X])^n]\]

  • Variance: 2nd central moment
  • Skewness: 3rd central moment
  • Kurtosis: 4th central moment

2 Conditional probabilities

2.1 Conceptions

  • \(P(A|B)\)

    The conditional probability of A given B. Example: in the probability tree shown in probabilityTree.png, reading the branch probabilities gives \(P(B|A)=0.3\).

  • \(P(A|B) = \frac{P(A\cap B)}{P(B)}\), defined when \(P(B)>0\)
    • useful restatement: \(P(A\cap B)=P(A|B)P(B)\)

2.2 Axioms

  • Nonnegativity
  • Normalization

    \[P(\Omega|B)=\frac{P(\Omega\cap B)}{P(B)}=\frac{P(B)}{P(B)}=1\]

  • Additivity

    If \(A_1, A_2, \cdots\) are disjoint events, \[P(\bigcup_{i=1}^n A_i|B) = \sum_{i=1}^n P(A_i|B)\]

2.3 Consequences

  • Multiplication Rule

    \[P(\cap_{i=1}^{n}A_i)=P(A_1)P(A_2|A_1)P(A_3|A_1\cap A_2)\cdots P(A_n|\cap_{i=1}^{n-1}A_i)=\prod_{i=1}^n P(A_i|\cap_{j=1}^{i-1}A_j)\] (with the convention that the \(i=1\) factor is just \(P(A_1)\))

    • proof \[P(\cap_{i=1}^n A_i)=P(A_1)\frac{P(A_1\cap A_2)}{P(A_1)}\cdots\frac{P(\cap_{i=1}^n A_i)}{P(\cap_{i=1}^{n-1} A_i)}\] \[=P(A_1)P(A_2|A_1)\cdots P(A_n|\cap_{i=1}^{n-1} A_i)\]

2.4 Total Probability Theorem

Let \(A_1, A_2,\cdots, A_n\) be disjoint events that form a partition of the sample space, then for any event B: \[P(B)=P(A_1\cap B)+\cdots+P(A_n\cap B)\] \[=P(A_1)P(B|A_1)+\cdots+P(A_n)P(B|A_n)\]

2.5 Bayes’ Rule

  • Useful for finding reverse conditional probabilities.

Let \(A_1, A_2,\cdots, A_n\) be disjoint events that form a partition of the sample space, then for any event B: \[P(A_i\cap B)=P(A_i|B)P(B)=P(A_i)P(B|A_i)\] \[P(A_i|B)=\frac{P(A_i)P(B|A_i)}{P(B)}\]

  • using the total probability theorem, we have:

\[P(A_i|B)=\frac{P(A_i)P(B|A_i)}{P(A_1)P(B|A_1)+\cdots+P(A_n)P(B|A_n)}\]

  • two events

    \[P(A|B)=\frac{P(B|A)P(A)}{P(B)}\]
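
A standard numeric illustration of Bayes' rule combined with the total probability theorem; the scenario and the numbers below (a 1% prior, 95% sensitivity, 10% false-positive rate) are invented for the sketch:

    # Hypothetical test example: D = "has the condition", T = "test is positive".
    P_D      = 0.01   # prior P(D)
    P_T_D    = 0.95   # P(T | D)
    P_T_notD = 0.10   # P(T | D^c)

    # Total probability theorem: P(T) = P(D)P(T|D) + P(D^c)P(T|D^c)
    P_T = P_D * P_T_D + (1 - P_D) * P_T_notD

    # Bayes' rule: P(D | T) = P(D)P(T|D) / P(T)
    P_D_T = P_D * P_T_D / P_T
    print(round(P_T, 4), round(P_D_T, 4))   # 0.1085 0.0876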

2.6 Independence

A and B are independent events if \[P(A\cap B)=P(A)P(B)\] When \(P(B)>0\), this is equivalent to \[P(A|B)=P(A)\]

  • If \(A\) and \(B\) are independent, so are \(A\) and \(B^c\)
  • more events

    \[P(\bigcap_{i=1}^n A_i)=\prod_{i=1}^n P(A_i)\] (mutual independence actually requires this product rule to hold for every sub-collection of the \(A_i\), not just the full intersection)

2.7 Conditional Independence

Two events A and B are said to be conditionally independent given C if \[P(A\cap B|C)=P(A|C)P(B|C)\] which is equivalent to (hint: Bayes' rule) \[P(A|B\cap C)=P(A|C)\] provided \(P(B\cap C)>0\)

3 Random Variables

  • definition: A random variable is a real-valued function of the outcome of the experiment
  • A function of a random variable defines another random variable

3.1 Discrete RV

  • A (discrete) random variable has an associated probability mass function(PMF)
  • PMF

    Probability mass function, \[PMF_X(x)=P(X=x)\] Note that: \[\sum_x PMF_X(x)=1\]

  • CDF

    \[CDF_X(x)=P(X\le x)=\sum_{k\le x}PMF_X(k)\]

3.2 Continuous RV

  • PDF

    \[\begin{align*} PDF_X(x) & =\lim_{\varDelta x\to 0}\frac{P(x\le X \le x+\varDelta x)}{\varDelta x}\\ & =\lim_{\varDelta x\to 0}\frac{CDF_X(x+\varDelta x)-CDF_X(x)}{\varDelta x}\\ & = CDF^\prime_X(x) \end{align*}\]

    \[\begin{align*} P(a < X \le b) & = CDF_X(b)-CDF_X(a)\\ & = \int_{-\infty}^b PDF_X(x)\mathrm{d}x - \int_{-\infty}^a PDF_X(x)\mathrm{d}x\\ & = \int_a^b PDF_X(x)\mathrm{d}x \end{align*}\]

    • \(\int_{-\infty}^{\infty}PDF_X(x)\mathrm{d}x = 1\)
    • \(PDF_X(x)\ge 0\) for all \(x\)
    • If \(\varDelta x\) is very small, then \(P(x\le X \le x+\varDelta x) \approx PDF_X(x)\cdot \varDelta x\)
  • CDF

    \[CDF_X(x)=P(X\le x)=\int_{-\infty}^x PDF_X(u)\mathrm{d}u\]
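
The PDF/CDF relations above can be checked numerically. A rough sketch using the density \(f(x)=2x\) on \([0,1]\) (so \(F(x)=x^2\) there); the density, the evaluation point x = 0.3, and the step sizes are arbitrary choices:

    # PDF f(x) = 2x on [0, 1] and its CDF F(x) = x^2 on [0, 1].
    f = lambda x: 2 * x if 0 <= x <= 1 else 0.0
    F = lambda x: 0.0 if x < 0 else (x * x if x <= 1 else 1.0)

    x, h = 0.3, 1e-6
    # PDF as the limit of (F(x + dx) - F(x)) / dx
    assert abs((F(x + h) - F(x)) / h - f(x)) < 1e-4

    # CDF as the integral of the PDF (crude left Riemann sum)
    n = 100_000
    integral = sum(f(k * x / n) for k in range(n)) * (x / n)
    assert abs(integral - F(x)) < 1e-3
    print(f(x), F(x))   # 0.6 0.09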

4 Counting

  • Permutations of n objects: \(n!\)
  • k-permutations of n objects: \(\frac{n!}{(n-k)!}\)
  • Combinations of k out of n objects: \({n\choose k}=\frac{n!}{k!(n-k)!}\)
  • Partitions of \(n\) objects into \(r\) groups, with the \(i\)-th group having \(n_i\) objects:

    \[{n \choose n_1,n_2,\cdots,n_r} = \frac{n!}{n_1!n_2!\cdots n_r!}\] this is called the multinomial coefficient
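
These counting formulas map directly onto Python's math module (math.comb and math.perm exist from Python 3.8 on); the multinomial helper below is written for this sketch, not a library function, and the values n = 5, k = 2 are arbitrary:

    from math import comb, factorial, perm

    n, k = 5, 2
    assert perm(n) == factorial(n)                          # permutations of n objects
    assert perm(n, k) == factorial(n) // factorial(n - k)   # k-permutations of n objects
    assert comb(n, k) == factorial(n) // (factorial(k) * factorial(n - k))

    def multinomial(*groups):
        """n! / (n1! n2! ... nr!) for a partition of n = sum(groups) objects."""
        out = factorial(sum(groups))
        for g in groups:
            out //= factorial(g)
        return out

    assert multinomial(2, 2, 1) == 30   # 5 objects into groups of sizes 2, 2, 1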

4.1 n balls into m boxes

  • identical balls, identical boxes -> enumerate the possibilities (integer partitions of n into at most m parts)
  • identical balls, distinct boxes -> stars and bars
    • no box empty: \({n-1 \choose m-1}\)
    • boxes may be empty: \({n+m-1 \choose m-1}\)


5 Linear Transforms

A linear transform turns a variable X into aX + b, where a and b are constants; each value of aX + b occurs with the same probability as the corresponding value of X. \[E(aX+b)=aE[X]+b\] \[Var(aX+b)=a^2Var[X]\]

5.1 Independent observations

If \(X_1, X_2, \cdots, X_n\) are independent observations of \(X\): \[E(X_1+X_2+\cdots+X_n) = nE[X]\] \[Var(X_1+X_2+\cdots+X_n) = nVar[X]\]

5.2 Independent Variables

X and Y are independent random variables \[E(X+Y)=E[X]+E[Y]\] \[E(X-Y)=E[X]-E[Y]\] \[Var(X+Y)=Var[X]+Var(Y)\] \[Var(X-Y)=Var[X]+Var(Y)\]

  • linear transforms \[E(aX+bY)=aE[X]+bE[Y]\] \[E(aX-bY)=aE[X]-bE[Y]\] \[Var(aX+bY)=a^2Var[X]+b^2Var(Y)\] \[Var(aX-bY)=a^2Var[X]+b^2Var(Y)\]
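
A quick simulation check of these rules for independent variables; the distributions (a die roll and a Bernoulli(0.3)), the constants a = 2, b = 3, and the sample size are arbitrary, and the simulated moments only match the formulas approximately:

    import random

    random.seed(0)
    N = 200_000
    a, b = 2, 3

    xs = [random.randint(1, 6) for _ in range(N)]               # X: fair die roll
    ys = [1 if random.random() < 0.3 else 0 for _ in range(N)]  # Y: Bernoulli(0.3), independent of X

    mean = lambda v: sum(v) / len(v)
    var  = lambda v: mean([t * t for t in v]) - mean(v) ** 2

    EX, EY, VX, VY = mean(xs), mean(ys), var(xs), var(ys)
    z = [a * x - b * y for x, y in zip(xs, ys)]

    print(mean(z), a * EX - b * EY)            # close to each other
    print(var(z),  a * a * VX + b * b * VY)    # note the plus sign in the variance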

6 Discrete distributions

6.1 Binomial distribution

  1. You’re running a series of independent trials.
  2. There can be either a success or failure for each trial, and the probability of success is the same for each trial.
  3. There are a finite number of trials.
  4. The main thing you’re interested in is the number of successes in the \(n\) independent trials.

Let:

  • \(X\) be the number of successful outcomes out of \(n\) trials
  • \(p\) be the probability of success in a trial

\[X\sim B(n, p)\]

  • PMF

    \[PMF_X(x)=\dbinom{n}{x} p^x (1-p)^{n-x}\]

  • E[X]

    \[E[X]=np\]

  • Var[X]

    \[Var[X]=np(1-p)\]

  • CDF

    \[CDF_X(x)=\sum_{m=0}^{\lfloor x \rfloor}{n \choose m}p^m(1-p)^{n-m}\]
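
A minimal pure-Python sketch of the binomial PMF and a check of its mean and variance against the formulas; the parameters n = 10, p = 0.3 are arbitrary:

    from math import comb

    n, p = 10, 0.3
    pmf = lambda x: comb(n, x) * p ** x * (1 - p) ** (n - x)

    assert abs(sum(pmf(x) for x in range(n + 1)) - 1) < 1e-12   # PMF sums to 1

    mean = sum(x * pmf(x) for x in range(n + 1))
    var  = sum(x * x * pmf(x) for x in range(n + 1)) - mean ** 2
    print(mean, n * p)             # both ~= 3.0
    print(var,  n * p * (1 - p))   # both ~= 2.1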

6.2 Geometric distribution

  1. You run a series of independent trials.
  2. There can be either a success or failure for each trial, and the probability of success is the same for each trial.
  3. The main thing you’re interested in is how many trials are needed in order to get the first successful outcome.

Let:

  • \(X\) be the number of trials needed to get the first successful outcome
  • \(p\) be the probability of success in a trial

\[X\sim Geo(p)\]

  • PMF

    To find the probability that \(X\) takes a particular value \(x\), use: \[PMF_X(x)=p(1-p)^{x-1}\]

  • E[X]

    \[E[X]=\frac{1}{p}\]

  • Var[X]

    \[Var[X]=\frac{1-p}{p^2}\]

  • CDF

    \[CDF_X(x)=1-(1-p)^{\lfloor x \rfloor}\]

  • memorylessness: \(P(X>s+t\ |\ X>s)=P(X>t)\) for integers \(s,t\ge 0\) (checked numerically in the sketch below)
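
A small numeric check of the geometric formulas and of memorylessness, using the tail \(P(X>x)=(1-p)^x\); p = 0.2, the shifts s = 3, t = 5, and the truncation point of the infinite sums are arbitrary choices:

    p = 0.2
    tail = lambda x: (1 - p) ** x      # P(X > x) for integer x >= 0

    s, t = 3, 5
    assert abs(tail(s + t) / tail(s) - tail(t)) < 1e-12   # P(X > s+t | X > s) = P(X > t)

    pmf = lambda x: p * (1 - p) ** (x - 1)
    mean = sum(x * pmf(x) for x in range(1, 2000))          # truncated infinite sum
    var  = sum(x * x * pmf(x) for x in range(1, 2000)) - mean ** 2
    print(mean, 1 / p)               # both ~= 5
    print(var, (1 - p) / p ** 2)     # both ~= 20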

6.3 Pascal distribution

also known as the Negative Binomial Distribution

  1. You run a series of independent trials.
  2. There can be either a success or failure for each trial, and the probability of success is the same for each trial.
  3. The main thing you’re interested in is how many trials are needed in order to get \(k\) successful outcomes.

Let:

  • \(X\) be the number of trials needed to get \(k\) successful outcomes
  • \(p\) be the probability of success in a trial

\[X\sim Pascal(k, p)\]

  • PMF

    \[PMF_X(x)=\begin{cases} {x-1 \choose k-1}(1-p)^{x-k}p^k, & x\ge k\\ 0, & otherwise\\ \end{cases}\]

  • E[X]

    \[E[X]=\frac{k}{p}\]

  • Var[X]

    \[Var[X]=\frac{k(1-p)}{p^2}\]

  • CDF

    \[CDF_X(x)=\begin{cases} I_p(k, \lfloor x \rfloor - k + 1), & x\ge k\\ 0, & otherwise\\ \end{cases}\] \(I_z(a,b)\) is the regularized incomplete beta function

    • more

      \[\begin{align*} CDF_X(x) & =P(X\le x)\\ & =P(\mbox{the k-th success occurs on or before the x-th trial})\\ & =P(\mbox{at least k successes in x trials})\\ & =P(Y\ge k),\ Y\sim B(\lfloor x \rfloor, p) \end{align*}\] This close relationship to the binomial is why the Pascal distribution is also called the negative binomial (see the numeric check below).
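
The binomial-tail identity above can be checked numerically without the incomplete beta function, by summing the Pascal PMF up to x and comparing it with \(P(Y\ge k)\) for \(Y\sim B(x,p)\); the values k = 3, p = 0.4, x = 8 are arbitrary:

    from math import comb

    k, p, x = 3, 0.4, 8

    # P(X <= x) for X ~ Pascal(k, p): sum the PMF over m = k..x
    pascal_cdf = sum(comb(m - 1, k - 1) * (1 - p) ** (m - k) * p ** k
                     for m in range(k, x + 1))

    # P(Y >= k) for Y ~ B(x, p): at least k successes in x trials
    binom_tail = sum(comb(x, j) * p ** j * (1 - p) ** (x - j)
                     for j in range(k, x + 1))

    assert abs(pascal_cdf - binom_tail) < 1e-12
    print(pascal_cdf)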

6.4 Poisson distribution

  1. Individual events occur at random and independently in a given interval.
  2. You know the mean number of occurrences in the interval or the rate of occurrences, and it’s finite.
  3. The purpose is to know the number of occurrences in another particular interval.

Let:

  • \(X\) be the number of occurrences in a particular interval \(T\)
  • \(\lambda\) be the mean number of occurrences in the interval, \(\lambda = uT\), where \(u\) is the rate of occurrences per unit time

\[X\sim Po(\lambda)\]

  • PMF

    \[PMF_X(x)=\frac{e^{-\lambda}\lambda^{x}}{x!}\]

  • E[X]

    \[E[X]=\lambda\]

  • Var[X]

    \[Var[X]=\lambda\]

  • CDF

    \[CDF_X(x)= \begin{cases} \sum_{n=0}^{\lfloor x \rfloor}e^{-\lambda}\cdot \frac{\lambda^n}{n!}, & \mbox{if }x\ge 0 \\ 0, & \mbox{otherwise} \\ \end{cases}\]

  • linear transforms

    If \(X\sim Po(\lambda_x)\) and \(Y\sim Po(\lambda_y)\), and \(X\) and \(Y\) are independent, \[X+Y\sim Po(\lambda_x + \lambda_y)\]

  • approximating a special Binomial distribution case

    If \(X\sim B(n, p)\) where \(n\) is large and \(p\) is small, you can approximate it with \(X \sim Po(np)\).
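
A quick look at how well \(Po(np)\) tracks \(B(n,p)\) for large n and small p; the values n = 1000, p = 0.003 are arbitrary:

    from math import comb, exp, factorial

    n, p = 1000, 0.003
    lam = n * p

    binom_pmf   = lambda x: comb(n, x) * p ** x * (1 - p) ** (n - x)
    poisson_pmf = lambda x: exp(-lam) * lam ** x / factorial(x)

    for x in range(7):
        print(x, round(binom_pmf(x), 5), round(poisson_pmf(x), 5))
    # the two columns agree to several decimal places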

7 Continuous distributions

7.1 Relationship between PDF & CDF

\[PDF_X(x)=\frac{d}{dx}CDF_X(x)\] \[CDF_X(x)=\int_{-\infty}^{x} PDF_X(u)\mathrm{d}u\]

7.2 Exponential distribution

How long do we need to wait before a customer enters our shop? How long will it take before a call center receives the next phone call? All these questions concern the time we need to wait before a given event occurs. If this waiting time is unknown, it is often appropriate to think of it as a random variable having an exponential distribution. \[X\sim Exponential(\lambda)\]

  • most commonly used to model waiting times
  • \(\lambda\) is called rate parameter
  • PDF

    \[PDF_X(x)=\begin{cases} \lambda e^{-\lambda x}, & \mbox{if } x \ge 0\\ 0, & otherwise \end{cases}\]

  • E[X]

    \[E[X]=\frac{1}{\lambda}\]

  • Var[X]

    \[Var[X]=\frac{1}{\lambda^2}\]

  • CDF

    \[CDF_X(x)=\begin{cases} 1-e^{-\lambda x}, & \mbox{if } x \ge 0\\ 0, & otherwise \end{cases}\]

  • memorylessness: \(P(X>s+t\ |\ X>s)=P(X>t)\) for all \(s,t\ge 0\)

7.3 Erlang distribution

Given a Poisson process with rate \(\lambda\), we use the Erlang distribution to model the waiting time until the \(n\)-th Poisson event occurs.

\[X\sim Erlang(n, \lambda)\]

  • PDF

    \[PDF_X(x)=\begin{cases} \frac{1}{(n-1)!}\lambda^n x^{n-1} e^{-\lambda x}, & x\ge 0\\ 0, & otherwise \end{cases}\]

  • E[X]

    \[E[X]=\frac{n}{\lambda}\]

  • Var[X]

    \[Var[X]=\frac{n}{\lambda^2}\]

  • CDF

    \[CDF_X(x)=\begin{cases} 1-\sum_{k=0}^{n-1}\frac{(\lambda x)^k}{k!}e^{-\lambda x}, & x\ge 0\\ 0, & otherwise \end{cases}\]

7.4 Normal distribution

\[X\sim N(\mu, \sigma^2)\] or \[X\sim Gaussian(\mu, \sigma)\]

  • PDF

    \[PDF_X(x)=\frac{1}{\sigma\sqrt{2\pi}}\cdot e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

  • CDF

    \[CDF_X(x) = \int_{-\infty}^{x}PDF_X(u)du=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{x}e^{-\frac{(u-\mu)^2}{2\sigma^2}}du\]

    • Use z-table
  • approximate Binomial distribution

    if \(X\sim B(n, p)\) and \(np>5\) and \(nq>5\) (where \(q=1-p\)), use \(X\sim N(np, npq)\) to approximate it.

    • need to apply a continuity correction (a continuous distribution is approximating a discrete one), e.g. the interval [5.5, 6.5) corresponds to the discrete value 6
  • approximate Poisson distribution

    if \(X\sim Po(\lambda)\) and \(\lambda>15\), use \(X\sim N(\lambda, \lambda)\) to approximate it.

    • need to apply a continuity correction (round to discrete value)
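
A sketch of the normal approximation to the binomial with the continuity correction applied; math.erf gives the standard normal CDF, and the parameters n = 100, p = 0.4 and the target value 45 are arbitrary:

    from math import comb, erf, sqrt

    n, p = 100, 0.4
    mu, sigma = n * p, sqrt(n * p * (1 - p))

    normal_cdf = lambda x: 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

    # Exact P(X <= 45) for X ~ B(n, p)
    exact = sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(46))

    # Normal approximation with continuity correction: evaluate at 45.5, not 45
    approx = normal_cdf(45.5)
    print(round(exact, 4), round(approx, 4))   # the two values are close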

8 Binomial Theorem

\[(x+y)^n = \sum_{k=0}^{n} \dbinom{n}{k}x^{n-k}y^{k}\]

9 Function of Random Variable

Let \(Y=g(X)\)

9.1 Discrete

\[PMF_Y(y)=\sum_{\forall x|g(x)=y}PMF_X(x)\]

9.2 Continuous

  1. Calculate CDF: \(CDF_Y(y)=P[Y\le y]\)
  2. Calculate PDF: \(PDF_Y(y)=\frac{d}{dy}CDF_Y(y)\)
  • Example

    \(Y=g(X)=aX+b\) with \(a>0\), then \[CDF_Y(y)=P(X\le \frac{y-b}{a})=CDF_X(\frac{y-b}{a})\] and, for any \(a\ne 0\), \[PDF_Y(y)=\frac{d}{dy}CDF_Y(y)=\frac{1}{|a|}PDF_X(\frac{y-b}{a})\]

10 Conditional probability distributions

10.1 Discrete

  • Conditional PMF

    \[PMF_{X,Y}(x, y)=PMF_Y(y)PMF_{X|Y}(x|y)\]

    • extension \[PMF_{X,Y,Z}(x, y, z)=PMF_Z(z)PMF_{Y|Z}(y|z)PMF_{X|Y,Z}(x|y,z)\]
  • Expectation

    \[E[X|Y=y]=\sum_x x\cdot PMF_{X|Y}(x|y)\]

10.2 Continuous

  • Conditional PDF

    \[PDF_{X,Y}(x, y)=PDF_Y(y)PDF_{X|Y}(x|y)\]

  • Expectation

    \[E[X|Y=y]=\int_{-\infty}^{\infty}x\cdot PDF_{X|Y}(x|y)dx\]

  • Marginal PDF

    \[PDF_X(x)=\int_{-\infty}^{\infty} PDF_{X,Y}(x, y)dy\]

10.3 Memoryless

A random variable X is memoryless if the time still to wait does not depend on how long we have already waited: \[P(X > s+t\ |\ X > s) = P(X > t)\] e.g.

  • Geometric distribution
  • Exponential distribution

11 Joint probability distributions

11.1 PMF/PDF

  • Discrete

    \[PMF_{X,Y}(x, y) = P(X=x\ and\ Y=y)\]

    • \(0\le PMF_{X,Y}(x, y) \le 1\)
    • \(\sum_{x=-\infty}^{\infty}\sum_{y=-\infty}^{\infty}PMF_{X,Y}(x, y)=1\)
    • if \(X,Y\) are independent, \(PMF_{X,Y}(x, y)=PMF_X(x)PMF_Y(y)\)
  • Continuous

    \[PDF_{X,Y}(x, y) = \frac{\partial^2 CDF_{X,Y}(x, y)}{\partial x\,\partial y}\] \[CDF_{X,Y}(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y}PDF_{X,Y}(u, v)dvdu\]

11.2 CDF

\[CDF_{X,Y}(x, y) = P(X\le x\ and\ Y\le y)\]

  • \(CDF_{X,Y}(x, \infty) = CDF_X(x)\)
  • \(P(x_1 < X \le x_2, y_1 < Y \le y_2)\)

    \[=CDF_{X,Y}(x_2, y_2)-CDF_{X,Y}(x_2, y_1)-CDF_{X,Y}(x_1, y_2)+CDF_{X,Y}(x_1, y_1)\]

11.3 E[h(X, Y)]

  • Discrete

    \[E[h(X, Y)]=\sum_{x=-\infty}^{\infty}\sum_{y=-\infty}^{\infty}h(x, y)\cdot PMF_{X,Y}(x, y)\]

  • Continuous

    \[E[h(X, Y)]=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}h(x, y)\cdot PDF_{X,Y}(x, y)dxdy\]

  • Transformation
    • \(E[\alpha h_1(X, Y)+\beta h_2(X, Y)]=\alpha E[h_1(X, Y)]+\beta E[h_2(X, Y)]\)
    • If \(X, Y\) are independent, \(E[g(X)h(Y)]=E[g(X)]\cdot E[h(Y)]\)

11.4 Var(X+Y)

\[\begin{align*} Var(X+Y) & = E[(X+Y-E[X+Y])^2] \\ & = E[(X-\mu_X+Y-\mu_Y)^2] \\ & = E[(X-\mu_X)^2 + (Y-\mu_Y)^2 + 2(X-\mu_X)(Y-\mu_Y)] \\ & = Var(X) + Var(Y) + 2Cov(X, Y) \end{align*}\]
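
A simulation check of this decomposition, using two variables built from a shared Gaussian component so that they are correlated; all distributions, constants, and the sample size below are made up for the sketch:

    import random

    random.seed(1)
    N = 200_000

    zs = [random.gauss(0, 1) for _ in range(N)]        # shared component
    xs = [z + random.gauss(0, 1) for z in zs]          # X = Z + noise
    ys = [z + random.gauss(0, 1) for z in zs]          # Y = Z + other noise

    mean = lambda v: sum(v) / len(v)

    def var(v):
        m = mean(v)
        return sum((t - m) ** 2 for t in v) / len(v)

    def cov(u, v):
        mu, mv = mean(u), mean(v)
        return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

    lhs = var([x + y for x, y in zip(xs, ys)])
    rhs = var(xs) + var(ys) + 2 * cov(xs, ys)
    print(round(lhs, 3), round(rhs, 3))   # both ~= 6: Var(X) = Var(Y) = 2, Cov(X, Y) = 1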