In the last section we computed a distribution given the data, whereas now we generate data given the distribution.
Random variables are vectors that are generated from a known Cumulative Distribution Function (CDF) or Probability Distribution Function, which describes the long-run frequencies of all possible outcomes. Random variables have a
sample space, which refers to the set of all possible outcomes, and a
probability for each particular set of outcomes, which is the proportion of times those outcomes occur in the long run.
There are only two basic types of sample spaces: discrete (encompassing cardinal-discrete, factor-ordered, and factor-unordered data) and continuous, which lead to two types of random variables. In either case, the probabilities of all outcomes must sum to 1.
However, each type of random variable has many different probability distributions. The most common ones are easily accessible and can be described using the Cumulative Distribution Function \[\begin{eqnarray}
F(x) &=& Prob(X_{i} \leq x).
\end{eqnarray}\] Note that this is just like the ECDF, \(\widehat{F}(x)\), except that it is now theoretically known instead of estimated.1
After introducing different random variables, we will also cover some basic implications of their CDFs. Generally, we have \(Prob(X_{i} > x) = 1- F(x)\). We also have "in" and "out" of range probabilities.
The probability of falling into a range, \(a < X_{i} \leq b\), can be written as \(Prob(X_{i} \in (a,b])=Prob(a < X_{i} \leq b) = F(b) - F(a)\).
The opposite probability, of \(X_{i} \leq a\) or \(X_{i} > b\), is \(Prob(X_{i} \leq a \text{ or } X_{i} > b) = F(a) + [1- F(b)]\). Notice that this opposite probability \(F(a) + [1- F(b)] = 1 - [F(b) - F(a)]\), so that \(Prob(X_{i} \text{ out of } (a,b]) = 1 - Prob( X_{i} \in (a,b])\).
4.1 Discrete
The random variable can take one of several values in a set. E.g., any number in \(\{1,2,3,...\}\) or any letter in \(\{A,B,C,...\}\).
Bernoulli.
Think of a Coin Flip: Heads=1 or Tails=0, with equal probability. In general, the probability of heads can vary. \[\begin{eqnarray}
X_{i} &\in& \{0,1\} \\
Prob(X_{i} =0) &=& 1-p \\
Prob(X_{i} =1) &=& p \\
F(x) &=& \begin{cases}
0 & x<0 \\
1-p & x \in [0,1) \\
1 & x\geq 1
\end{cases}
\end{eqnarray}\]
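For instance, here is a minimal simulation sketch (with \(p=0.3\), an arbitrary choice for illustration) comparing the long-run proportions of draws to the theoretical probabilities.
Code
# Bernoulli draws with probability p of heads (p=0.3 assumed for illustration)
p <- 0.3
X1 <- sample(c(0,1), 2000, replace=TRUE, prob=c(1-p, p))
# Long-run proportions vs. theoretical probabilities
table(X1)/length(X1)
c(1-p, p)
# Equivalently, via the built-in binomial sampler: rbinom(2000, size=1, prob=p)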
Discrete Uniform.
Discrete numbers with equal probability. \[\begin{eqnarray}
X_{i} &\in& \{1,2,...,N\} \\
Prob(X_{i} =1) &=& Prob(X_{i} =2) = ... = 1/N\\
F(x) &=& \begin{cases}
0 & x<1 \\
1/N & x \in [1,2) \\
2/N & x \in [2,3) \\
\vdots & \\
1 & x\geq N
\end{cases}
\end{eqnarray}\]
Here is an example with \(N=4\).
The probability of a value smaller than or equal to \(3\) is \(Prob(X_{i} \leq 3)=1/4 + 1/4 + 1/4 = 3/4\).
The probability of a value larger than \(3\) is \(Prob(X_{i} > 3) = 1-Prob(X_{i} \leq 3)=1/4\).
The probability of a value \(> 1\) and \(\leq 3\) is \(Prob(1 < X_{i} \leq 3) = Prob(X_{i} \leq 3) - Prob(X_{i} \leq 1) = 3/4 - 1/4 = 2/4\).2
The probability of a value \(\leq 1\) or \(> 3\) is \(Prob(X_{i} \leq 1 \text{ or } X_{i} > 3) = Prob(X_{i} \leq 1) + \left[ 1- Prob(X_{i} \leq 3) \right] = 1/4 + [1 - 3/4]=2/4\).
Code
x <- 1:4
x_probs <- rep(1/4, 4)
# sample(x, 1, replace=T, prob=x_probs) # sample of 1
X1 <- sample(x, 2000, replace=T, prob=x_probs)
# Plot long-run proportions
proportions <- table(X1)/length(X1)
plot(proportions, col=grey(0,.5),
    xlab='Outcome', ylab='Proportion', main=NA)
points(x, x_probs, pch=16, col='blue') # Theoretical values
We can also replace numbers with letters (A,…,Z) or names (John, Jamie, …), although we must be careful with the CDF when there is no longer a natural ordering. Here is an empirical example with three outcomes.
Code
x <- c('A', 'B', 'C')
x_probs <- c(3,6,1)/10
sum(x_probs)
## [1] 1
X1 <- sample(x, 2000, replace=T, prob=x_probs) # sample of 2000
# Plot long-run proportions
proportions <- table(X1)/length(X1)
plot(proportions, col=grey(0,.5),
    xlab='Outcome', ylab='Proportion', main=NA)
points(x_probs, pch=16, col='blue') # Theoretical values
An experiment with three possible outcomes (E1, E2, E3) has been repeated 50 times, and it was learned that E1 occurred 10 times, E2 occurred 13 times, and E3 occurred 27 times. Assign probabilities to the outcomes.
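As a computational hint, here is a minimal sketch of the relative-frequency approach, the same calculation as the empirical proportions above:
Code
# Observed counts for each outcome
counts <- c(E1=10, E2=13, E3=27)
# Estimated probabilities are long-run proportions
counts/sum(counts)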
4.2 Continuous
The random variable can take one value out of an uncountably infinite number. We describe these variables with the cumulative distribution function \(F\), or the probability density function \(f\). With a continuous random variable, the probability of any individual point is zero.
Continuous Uniform.
Any number on a unit interval allowing for any number of decimal points, with every number having the same probability. \[\begin{eqnarray}
X_{i} &\in& [0,1] \\
f(x) &=& \begin{cases}
1 & x \in [0,1] \\
0 & \text{Otherwise}
\end{cases}\\
F(x) &=& \begin{cases}
0 & x < 0 \\
x & x \in [0,1] \\
1 & x > 1.
\end{cases}
\end{eqnarray}\]
The probability of a value being exactly \(0.25\) is \(Prob(X_{i} =0.25)=0\).
The probability of a value smaller than \(0.25\) is \(F(0.25)=0.25\).
The probability of a value larger than \(0.25\) is \(1-F(0.25)=0.75\).
The probability of a value in \((0.25,0.75]\) is \(Prob(0.25 < X_{i} \leq 0.75) = Prob(X_{i} \leq 0.75) - Prob(X_{i} \leq 0.25) = 0.75 - 0.25 = 0.5\).
The probability of a value outside of \((0.25,0.75]\) is \(Prob(X_{i} \leq 0.25 \text{ or } X_{i} > 0.75) = 0.25 + [1-0.75]=0.5\).
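You can verify these calculations in R; a minimal sketch using the built-in Uniform functions punif (the CDF) and runif (random draws):
Code
# Theoretical probabilities from the CDF
punif(0.25)               # Prob(X <= 0.25) = 0.25
1 - punif(0.25)           # Prob(X > 0.25) = 0.75
punif(0.75) - punif(0.25) # Prob(0.25 < X <= 0.75) = 0.5
# Long-run proportion from simulated draws
X1 <- runif(2000)
mean(X1 > 0.25 & X1 <= 0.75)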
Note that the Continuous Uniform distribution generalizes to an arbitrary interval, \(X_{i} \in [a,b]\). What is the probability of a value larger than \(0.25\) when \(-a=b=2\), i.e., \(X_{i} \in [-2,2]\)? First use the computer to suggest an answer and then show the answer mathematically.
Suppose the flight time (in minutes) between Calgary and Kamloops has a Uniform distribution with parameters \(a = 68\) and \(b = 78\). According to Air Canada, the flight takes 70 minutes. What is the probability that the flight will be late? What is the probability that a flight takes between 65 and 70 minutes?
Beta.
Any number on the unit interval, \(X_{i} \in [0,1]\), but with unequal probabilities.
The density function of the Beta distribution is mathematically complicated to write, and so we omit it. But the distribution is often used, as its two parameters allow the density to take many different shapes.
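For instance, here is a minimal sketch of a few of those shapes (the parameter values are arbitrary choices for illustration), using R's built-in dbeta:
Code
# Plot the Beta density for several parameter pairs
x <- seq(.01, .99, by=.01)
plot(x, dbeta(x, 2, 2), type='l', ylim=c(0,3),
    xlab='x', ylab='f(x)')        # symmetric hump
lines(x, dbeta(x, 5, 2), lty=2)   # right-shifted hump
lines(x, dbeta(x, .5, .5), lty=3) # U-shaped
# Random draws: rbeta(2000, 2, 2)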
Exponential.
Any positive number.3 The Exponential distribution has a single parameter, \(\lambda>0\), that governs its shape \[\begin{eqnarray}
X_{i} &\in& [0,\infty) \\
f(x) &=& \begin{cases}
\lambda \exp\left\{ -\lambda x \right\} & x \geq 0 \\
0 & \text{Otherwise}
\end{cases}\\
F(x) &=& \begin{cases}
0 & x < 0 \\
1- \exp\left\{ -\lambda x \right\} & x \geq 0.
\end{cases}
\end{eqnarray}\]
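For instance, a minimal sketch using R's built-in Exponential functions (with \(\lambda = 1/2\), an arbitrary choice for illustration; note that R calls \(\lambda\) the rate):
Code
lambda <- 1/2 # arbitrary choice for illustration
# CDF: Prob(X <= 1)
pexp(1, rate=lambda)
1 - exp(-lambda*1) # same value, computed from the formula
# Long-run proportion from simulated draws
X1 <- rexp(2000, rate=lambda)
mean(X1 <= 1)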
Suppose the lifetime of a battery is an exponential random variable with \(\lambda=1/50\). Find the probability that the lifetime is \(\geq 100\) hours. Find the probability that the lifetime is between \(50\) and \(100\) hours.
Normal (Gaussian).
This distribution is for any number on the real line, with bell-shaped probabilities. The Normal distribution is mathematically complex and sometimes called the Gaussian distribution. We call it “Normal” because we will encounter it again and again and again. The density function \(f\) has two parameters: \(\mu \in (-\infty,\infty)\) and \(\sigma > 0\). \[\begin{eqnarray}
X_{i} &\in& (-\infty,\infty) \\
f(x) &=& \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left\{ \frac{-(x-\mu)^2}{2\sigma^2} \right\}
\end{eqnarray}\]
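For instance, a minimal sketch using R's built-in Normal functions (with \(\mu=0\) and \(\sigma=1\), an arbitrary choice known as the “standard” Normal):
Code
# Bell-shaped density around mu
x <- seq(-4, 4, by=.01)
plot(x, dnorm(x, mean=0, sd=1), type='l', xlab='x', ylab='f(x)')
# CDF: Prob(X <= 1.96)
pnorm(1.96, mean=0, sd=1)
# Long-run proportion from simulated draws
X1 <- rnorm(2000, mean=0, sd=1)
mean(X1 <= 1.96)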
Suppose scores in math class are approximately normally distributed with \(\mu=50, \sigma=1\). If you selected one student randomly, what is the probability their score is higher than \(90\)? Is \(Prob(X_{i}\geq 90)\) higher if \(\mu=25, \sigma=2\)?
4.3 Drawing Samples
To generate a random variable from known distributions, you can use some type of physical machine. E.g., you can roll a fair die to generate Discrete Uniform data or you can roll a weighted die to generate Categorical data.
There are also several ways to computationally generate random variables from a probability distribution. Perhaps the most common one is “inverse sampling”.
Random variables have an associated quantile function, which is the inverse of the CDF: \(Q(p)\) is the \(x\) value below which a proportion \(p\) of outcomes fall. \[\begin{eqnarray}
Q(p) = F^{-1}(p), \quad p\in [0,1]
\end{eqnarray}\] (Recall that the median is the value \(x\) where \(50\%\) of the data fall below \(x\), for example.) To generate a random variable using inverse sampling, first sample \(p\) from a uniform distribution and then find the associated quantile.4
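For instance, here is a minimal sketch of inverse sampling from the Exponential distribution (with \(\lambda=1\), an arbitrary choice), using R's built-in quantile function qexp:
Code
set.seed(1)
p <- runif(2000)      # step 1: sample probabilities uniformly
X1 <- qexp(p, rate=1) # step 2: find the associated quantiles
# Compare with R's direct sampler: similar long-run quantiles
X2 <- rexp(2000, rate=1)
quantile(X1, c(.25,.5,.75))
quantile(X2, c(.25,.5,.75))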
Using Data.
You can generate a random variable from a known empirical distribution. Inverse sampling randomly selects observations from the dataset with equal probabilities. To implement this, we
order the data and associate each observation with an ECDF value,
draw an ECDF probability \(p\) as a random variable with equal probabilities, and
find the associated data point.
Code
# Empirical Distribution
X <- USArrests$Murder
FX_hat <- ecdf(X)
plot(FX_hat, lwd=2, xlim=c(0,20),
    pch=16, col=grey(0,.5), main='')

# Two examples of generating a random variable
p <- c(.25, .9) # pretend these were drawn randomly
cols <- c(2,4)
QX_hat <- quantile(X, p, type=1)
segments(QX_hat, p, -10, p, col=cols)
segments(QX_hat, p, QX_hat, 0, col=cols)
mtext(round(QX_hat,2), 1, at=QX_hat, col=cols)
If you know the distribution function that generates the data, then you can derive the quantile function and do inverse sampling. Here is an in-depth example with the Dagum distribution. The distribution function is \(F(x)=(1+(x/b)^{-a})^{-c}\). Setting \(p=F(x)\) and solving for \(x\), we get the quantile function \(Q(p)=\frac{ b p^{\frac{1}{ac}} }{(1-p^{1/c})^{1/a}}\). We then sample \(p\) from a uniform distribution and find the associated quantile.
Code
# Theoretical quantile function (from VGAM::qdagum)
qdagum <- function(p, scale.b=1, shape1.a, shape2.c) {
    # Quantile function (theoretically derived from the CDF)
    ans <- scale.b * (expm1(-log(p) / shape2.c))^(-1/shape1.a)
    # Special known cases
    ans[p == 0] <- 0
    ans[p == 1] <- Inf
    # Safety checks
    ans[p < 0] <- NaN
    ans[p > 1] <- NaN
    if (scale.b <= 0 | shape1.a <= 0 | shape2.c <= 0) {
        ans <- ans*NaN
    }
    # Return
    return(ans)
}

# Generate random variables (VGAM::rdagum)
rdagum <- function(n, scale.b=1, shape1.a, shape2.c) {
    p <- runif(n) # generate random probabilities
    x <- qdagum(p, scale.b=scale.b, shape1.a=shape1.a, shape2.c=shape2.c) # find the inverses
    return(x)
}

# Example
set.seed(123)
X <- rdagum(3000, 1, 3, 1)
X[1:5]
## [1] 0.7390476 1.5499868 0.8845006 1.9616251 2.5091656
Alternatively, you can think of \(F(x)\) as the ECDF for a dataset with an infinite number of observations.↩︎
This is the general formula using CDFs, and you can verify it works in this instance by directly adding the probabilities of the individual events: \(Prob(X_{i} = 2) + Prob(X_{i} = 3) = 1/4 + 1/4 = 2/4\).↩︎
In other classes, you may further distinguish types of random variables based on whether their maximum value is theoretically finite or infinite.↩︎
Drawing random uniform samples with computers is actually quite complex and beyond the scope of this course.↩︎