9  Misc. Univariate Topics

9.1 Data Transformations

Transformations can stabilize variance, reduce skewness, and make model errors closer to Gaussian.

Perhaps the most common examples are power transformations, \(y= x^\lambda\), which include \(\sqrt{x}\) (\(\lambda=1/2\)) and \(x^2\) (\(\lambda=2\)).

Other examples include the exponential transformation, \(y=\exp(x)\) for any \(x\in (-\infty, \infty)\), and the logarithmic transformation, \(y=\log(x)\) for any \(x>0\).

The Box–Cox Transform nests many cases. For \(x>0\) and parameter \(\lambda\), \[\begin{eqnarray} y=\begin{cases} \dfrac{x^\lambda-1}{\lambda}, & \lambda\neq 0,\\ \log\left(x\right), & \lambda=0. \end{cases} \end{eqnarray}\] This function is continuous in \(\lambda\): as \(\lambda \to 0\), \((x^\lambda-1)/\lambda \to \log(x)\).

Code
# Box–Cox transform and inverse
bc_transform <- function(x, lambda) {
  if (any(x <= 0)) stop("Box-Cox requires x > 0")
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1)/lambda
}
bc_inverse <- function(t, lambda) {
  if (abs(lambda) < 1e-8) exp(t) else (lambda*t + 1)^(1/lambda)
}

X <- USArrests[,'Murder']
hist(X, main='', border=NA, freq=F)

Code

par(mfrow=c(1,3))
for(lambda in c(-1,0,1)){
    Y <- bc_transform(X, lambda)
    hist(Y, 
        main=bquote(paste(lambda,'=',.(lambda))),
        border=NA, freq=F)
}

Law of the Unconscious Statistician (LOTUS).

As before, we will represent the random variable as \(X_{i}\), which can take on values \(x\) from the sample space. If \(X_{i}\) is a discrete random variable (a random variable with a discrete sample space) and \(g\) is a function, then \(\mathbb E[g(X_{i})] = \sum_x g(x)Prob(X_{i}=x)\).

Let \(X_{i}\) take values \(\{1,2,3\}\) with \[\begin{eqnarray} Prob(X_{i}=1)=0.2,\quad Prob(X_{i}=2)=0.5,\quad Prob(X_{i}=3)=0.3. \end{eqnarray}\] Let \(g(x)=x^2+1\). Then \(g(1)=1^2+1=2\), \(g(2)=2^2+1=5\), \(g(3)=3^2+1=10\).

Then, by LOTUS, \[\begin{eqnarray} \mathbb E[g(X_{i})]=\sum_x g(x)Prob(X_{i}=x) &=& g(1)\cdot 0.2 + g(2)\cdot 0.5 + g(3)\cdot 0.3 \\ &=& 2 \cdot 0.2 + 5 \cdot 0.5 + 10 \cdot 0.3 \\ &=& 0.4 + 2.5 + 3 = 5.9. \end{eqnarray}\]

Code
x  <- c(1,2,3)
x_probs <- c(0.2,0.5,0.3)
g  <- function(x) x^2 + 1
sum(g(x) * x_probs) 
## [1] 5.9
Code
g  <- function(x) x^2 + 1

# A theoretical example
x <- c(1,2,3,4)
x_probs <- c(1/4, 1/4, 1/4, 1/4)
sum(g(x) * x_probs) 
## [1] 8.5

# A simulation example
X <- sample(x, prob=x_probs, size=1000, replace=T)
mean(g(X))
## [1] 8.241

If \(X_{i}\) is a continuous random variable (a random variable with a continuous sample space) and \(g\) is a function, then \(\mathbb E[g(X_{i})] = \int_{-\infty}^{\infty} g(x)f(x) dx\).

Code
x <- rexp(5e5, rate = 1)           # X ~ Exp(1)
mean(sqrt(x))                      # LOTUS Simulation
## [1] 0.8865247
sqrt(pi) / 2                       # Exact via LOTUS integral
## [1] 0.8862269

Note that you have already seen the special case where \(g(X_{i})=\left(X_{i}-\mathbb{E}[X_{i}]\right)^2\).
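
As a quick check of that special case, here is a minimal sketch that reuses the discrete example above: the variance is simply the LOTUS expectation of \(g(x)=\left(x-\mathbb{E}[X_{i}]\right)^2\).

Code
x  <- c(1,2,3)
x_probs <- c(0.2,0.5,0.3)
EX <- sum(x * x_probs)        # E[X] = 2.1
sum( (x - EX)^2 * x_probs )   # variance via LOTUS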

Jensen’s inequality.

Concave functions curve inwards, like the inside of a cave. Convex functions curve outward, the opposite of concave.

If \(g\) is a concave function, then \(g(\mathbb E[X_{i}]) \geq \mathbb E[g(X_{i})]\).

Code
# Continuous Example 1
mean( sqrt(x) )
## [1] 0.8865247
sqrt( mean(x) ) 
## [1] 1.000263

# Continuous Example 2
mean( log(x) )
## [1] -0.5761747
log( mean(x) ) 
## [1] 0.0005266897
Code
## Discrete Example
x  <- c(1,2,3)
px <- c(0.2,0.5,0.3)
EX <- sum(x * px)
EX
## [1] 2.1

g  <- sqrt
gEX <- g(EX)
EgX <- sum(g(x) * px)
c(gEX, EgX)
## [1] 1.449138 1.426722

If \(g\) is a convex function, then the inequality reverses: \(g(\mathbb E[X_{i}]) \leq \mathbb E[g(X_{i})]\).

Code
# Discrete example: x = (1, 2, 3) with equal weights
mean( exp(x) )
## [1] 10.06429
exp( mean(x) )  
## [1] 7.389056

9.2 Drawing Samples

To generate a random variable from known distributions, you can use some type of physical machine. For example, you can roll a fair die to generate Discrete Uniform data or roll a weighted die to generate Categorical data.

There are also several ways to computationally generate random variables from a probability distribution. Perhaps the most common one is “inverse sampling”. To generate a random variable using inverse sampling, first sample \(p\) from a uniform distribution and then find the associated quantile using the quantile function \(F^{-1}(p)\).1

Using Data.

You can generate a random variable from a known empirical distribution. Inverse sampling randomly selects observations from the dataset with equal probabilities. To implement this, we

  • order the data and associate each observation with an ECDF value
  • draw \(p \in [0,1]\) as a uniform random variable
  • find the associated quantile via the ECDF

Here is an example of generating random murder rates for US states.

Code
# Empirical Distribution
X <- USArrests[,'Murder']
FX_hat <- ecdf(X)
plot(FX_hat, lwd=2, xlim=c(0,20),
    pch=16, col=grey(0,.5), main='')

Code

# Generating random variables via inverse ECDF
p <- runif(3000) ## Multiple Draws
QX_hat <- quantile(FX_hat, p, type=1)
QX_hat[c(1,2,3)]
## 83.248211956% 50.070770434% 55.293406080% 
##          12.7           7.3           7.9

## Can also do directly from the data
QX_hat <- quantile(X, p, type=1)
QX_hat[c(1,2,3)]
## 83.248211956% 50.070770434% 55.293406080% 
##          12.7           7.3           7.9

Using Math.

If you know the distribution function that generates the data, then you can derive the quantile function and do inverse sampling. That is how computers generate random data from a distribution.

Code
# 4 random data points from 3 different distributions
p <- runif(4)  # draw uniform probabilities
qunif(p)       # Uniform(0,1) draws
qexp(p)        # Exponential(1) draws
qnorm(p)       # Normal(0,1) draws

Here is an in-depth example of the Dagum distribution. The distribution function is \(F(x)=(1+(x/b)^{-a})^{-c}\). For a given probability \(p\), we can then solve for the quantile as \(F^{-1}(p)=\frac{ b p^{\frac{1}{ac}} }{(1-p^{1/c})^{1/a}}\). We then sample \(p\) from a uniform distribution and find the associated quantile.

Code
# Theoretical Quantile Function (from VGAM::qdagum)
qdagum <- function(p, scale.b=1, shape1.a, shape2.c) {
  # Quantile function (theoretically derived from the CDF)
  ans <- scale.b * (expm1(-log(p) / shape2.c))^(-1 / shape1.a)
  # Special known cases
  ans[p == 0] <- 0
  ans[p == 1] <- Inf
  # Safety Checks
  ans[p < 0] <- NaN
  ans[p > 1] <- NaN
  if(scale.b <= 0 | shape1.a <= 0 | shape2.c <= 0){ ans <- ans*NaN }
  # Return
  return(ans)
}

# Generate Random Variables (VGAM::rdagum)
rdagum <-function(n, scale.b=1, shape1.a, shape2.c){
    p <- runif(n) # generate random probabilities
    x <- qdagum(p, scale.b=scale.b, shape1.a=shape1.a, shape2.c=shape2.c) #find the inverses
    return(x)
}

# Example
set.seed(123)
X <- rdagum(3000,1,3,1)
X[c(1,2,3)]
## [1] 0.7390476 1.5499868 0.8845006

9.3 Set Theory

Let the sample space contain events \(A\) and \(B\), with probabilities denoted \(Prob(A)\) and \(Prob(B)\). Then

  • the union: \(A\cup B\), refers to either event occurring
  • the intersection: \(A\cap B\), refers to both events occurring
  • the events \(A\) and \(B\) are mutually exclusive if and only if \(A\cap B\) is empty
  • the inclusion–exclusion rule is: \(Prob(A\cup B)=Prob(A)+Prob(B)-Prob(A\cap B)\)
  • the events \(A\) and \(B\) are mutually independent if and only if \(Prob(A\cap B)=Prob(A) Prob(B)\)
  • the conditional probability is \(Prob(A \mid B) = Prob(A \cap B)/ Prob(B)\)

For example, consider a six-sided die, where event \(A\) is that the number is \(>2\) and event \(B\) is that the number is even.

  • The sets are \(A=\{3,4,5,6\}\) and \(B=\{2,4,6\}\)
  • To check mutually exclusive: notice \(A\cap B=\{4,6\}\) which is not empty. So event \(A\) and event \(B\) are not mutually exclusive

Further assuming the die is fair, each side has an equal \(1/6\) probability. This lets us compute

  • \(Prob(A)=4/6=2/3\) and \(Prob(B)=3/6=1/2\)
  • \(A\cup B=\{2,3,4,5,6\}\), which implies \(Prob(A\cup B)=5/6\)
  • \(A\cap B=\{4,6\}\), which implies \(Prob(A\cap B)=2/6=1/3\)
  • To check independence: notice \(Prob(A\cap B)=1/3 = (2/3) (1/2) = Prob(A)Prob(B)\). So event \(A\) and event \(B\) are mutually independent
  • To check inclusion–exclusion: notice \(Prob(A) + Prob(B) - Prob(A\cap B) = 2/3 + 1/2 - 1/3 = 4/6 + 3/6 - 2/6 = 5/6 = Prob(A\cup B)\)
  • Finally, the conditional probability is \(Prob(\text{die is} >2 \mid \text{die shows even number}) = Prob(A \mid B) = \frac{Prob(A \cap B)}{Prob(B)} = \frac{1/3}{1/2} = 2/3\)
Code
# Simulation verification
set.seed(123)
x <- seq(1,6)
rolls <- sample(x, 1e6, replace = TRUE)
A <- rolls > 2
B <- rolls %% 2 == 0
P_AcondB <- mean(A & B) / mean(B)
round(P_AcondB, 3)
## [1] 0.667

Consider a six-sided die with sample space \(\{1,2,3,4,5,6\}\), where each outcome is equally likely. What is \(Prob(\text{die is odd or }<5)\)? Denote the odd events as \(A=\{1,3,5\}\) and the less than five events as \(B=\{1,2,3,4\}\). Then

  • \(A\cap B=\{1,3\}\), which implies \(Prob(A\cap B)=2/6=1/3\).
  • \(Prob(A)=3/6=1/2\) and \(Prob(B)=4/6=2/3\).
  • By inclusion–exclusion: \(Prob(A\cup B)=Prob(A)+Prob(B)-Prob(A\cap B)=1/2+2/3-1/3=5/6\).

Now find \(Prob(\text{die is even and }<5)\). Verify your answer with a computer simulation.
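
For the simulation part, here is a minimal sketch you can adapt (the analytic calculation is left to you).

Code
# Simulate Prob(die is even and < 5)
rolls <- sample(1:6, 1e6, replace=TRUE)
mean(rolls %% 2 == 0 & rolls < 5)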

Law of Total Probability.

This law states that \(Prob(A)=Prob(A \mid B) Prob(B) + Prob(A \mid \not B) Prob(\not B)\), where \(\not B\) is the event “not \(B\)”.

For example, consider again the six-sided die where event \(A\) is that the number is \(>2\) and event \(B\) is that the number is even. The sets are \(A=\{3,4,5,6\}\), \(B=\{2,4,6\}\), and \(\not B = \{1,3,5\}\). From before, we know that \(Prob(A)=2/3\), \(Prob(B)=1/2\), and \(Prob(A \mid B) = 2/3\). Then see that \(Prob(A \mid B) Prob(B) = \frac{2}{3} \frac{1}{2} = 2/6\).

We can also directly compute \(Prob(\not B)=1/2\), or infer it from \(Prob(\not B)=1-Prob(B)\). The conditional probability is \(Prob(\text{die is} >2 \mid \text{die shows odd number}) = Prob(A \mid \not B) = \frac{Prob(A \cap \not B)}{Prob(\not B)} = \frac{1/3}{1/2} = 2/3\).

Then we can see that \(Prob(A \mid B) Prob(B) + Prob(A \mid \not B) Prob(\not B) = 2/6 + 2/6 = 2/3 = Prob(A)\).
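
As with the conditional probability above, a minimal simulation sketch can verify this numerically.

Code
# Simulation check of the Law of Total Probability
set.seed(123)
rolls <- sample(1:6, 1e6, replace=TRUE)
A <- rolls > 2
B <- rolls %% 2 == 0
P_A_total <- mean(A & B)/mean(B)*mean(B) + mean(A & !B)/mean(!B)*mean(!B)
round(c(direct = mean(A), total_prob = P_A_total), 3)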

9.4 Factorial Distributions

Binomial.

The sum of \(n\) Bernoulli trials (number of successes).

Code
## Minimal example
n <- 15
p <- 0.3

# PMF plot
x <- seq(0, n)
f_x <- dbinom(x, n, p)
plot(x, f_x, type = "h", col = "blue",
    main='',  xlab = "x", ylab = "Prob(X = x)")
title(bquote(paste('Binom(',.(n),', ',.(p), ')' ) ))
points(x, f_x, pch = 16, col = "blue")

Code

# Simulate:  Compare empirical vs theoretical
#X <- rbinom(1e4, size = n, prob = p)
#c(emp_mean = mean(X), th_mean = n*p)
#c(emp_var  = var(X),  th_var  = n*p*(1-p))

Suppose that employees at a company are \(70\%\) female and \(30\%\) male. If we select a random sample of eight employees, what is the probability that more than \(2\) in the sample are female?
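
If you want to check your hand calculation, here is a minimal sketch, assuming \(X_{i}\) counts the number of females in the sample so that \(X_{i} \sim \text{Binom}(8, 0.7)\).

Code
# Prob(X > 2) = 1 - Prob(X <= 2) for X ~ Binom(8, 0.7)
1 - pbinom(2, size=8, prob=0.7)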

Show that \(\mathbb{E}[X_{i}]=np\) and \(\mathbb{V}[X_{i}]=np(1-p)\). Start with the case \(n=1\), then the case \(n=2\), then case \(n=3\), then the general case of any \(n\).
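
As a hint, here is the \(n=1\) case (a single Bernoulli trial with success probability \(p\)): \[\begin{eqnarray} \mathbb{E}[X_{i}] &=& 1\cdot p + 0\cdot (1-p) = p, \\ \mathbb{V}[X_{i}] &=& \mathbb{E}[X_{i}^2] - \mathbb{E}[X_{i}]^2 = p - p^2 = p(1-p). \end{eqnarray}\]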

The Binomial Limit Theorem (de Moivre–Laplace theorem) says that as \(n\) grows large, with \(p \in (0,1)\) staying fixed, the Binomial distribution is approximately normal with mean \(np\) and variance \(np(1-p)\).

The unemployment rate is \(10\%\). Suppose that \(100\) employable people are selected randomly. What is the probability that this sample contains between \(9\) and \(12\) unemployed people? Use the normal approximation to binomial probabilities.

The normal approximation has parameters \(\mu = np = 100(0.1) = 10\), \(\sigma^2 = np(1-p) = 100(0.1)(0.9) = 9\), and \(\sigma = \sqrt{9} = 3\).
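
Here is a minimal sketch of that computation (without a continuity correction), alongside the exact binomial probability for comparison.

Code
# Normal approximation: Prob(9 <= X <= 12) with mu = 10, sigma = 3
pnorm(12, mean=10, sd=3) - pnorm(9, mean=10, sd=3)
# Exact binomial probability for comparison
pbinom(12, size=100, prob=0.1) - pbinom(8, size=100, prob=0.1)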

Poisson.

The number of events in a fixed interval.

Code
## Minimal example
lambda <- 3.5

# PMF plot
x <- seq(0, 15)
f_x <- dpois(x, lambda)
plot(x, f_x, type="h", col="blue",
     xlab = "x", ylab = "Prob(X = x)")
points(x, f_x, pch = 16, col = "blue")
title(bquote(paste('Pois(',.(lambda), ')')))

Code


# Simulate: Compare empirical vs theoretical
#X <- rpois(1e4, lambda)
#c(emp_mean = mean(X), th_mean = lambda)
#c(emp_var  = var(X),  th_var  = lambda)

Show that \(\mathbb{E}[X_{i}] = \mathbb{V}[X_{i}]= \lambda\).

Irwin–Hall.

The sum of \(n\) i.i.d. \(\text{Uniform}(0,1)\).

  • Continuous, support \([0,n]\)
  • Probability Density Function: \(f(x) = \dfrac{1}{(n-1)!} \displaystyle\sum_{k=0}^{\lfloor x \rfloor} (-1)^k \binom{n}{k} (x - k)^{n-1}\) for \(x \in [0,n]\), and \(0\) otherwise
  • See https://en.wikipedia.org/wiki/Irwin%E2%80%93Hall_distribution
  • A common use case: representing the sum of many small independent financial shocks
Code
## Minimal example (base R)
# Irwin–Hall PDF function
dirwinhall <- function(x, n) {
    f_x <- vector(length=length(x))
    for(i in seq(x)) {
        xx <- x[i]
        if(xx < 0 | xx > n){
            f_x[i] <- 0 
        } else {
            k <- seq(0, floor(xx))
            f_k <- sum((-1)^k*choose(n, k)*(xx-k)^(n-1))/factorial(n-1)
            f_x[i] <- f_k
        }
    }
    return(f_x)
}

# Parameters
n <- 2
x <- seq(0, n, length.out = 500)

# Compute and plot PDF
f_x <- dirwinhall(x, n)
plot(x, f_x, type="l", col="blue",
    main='', xlab = "x", ylab = "f(x)")
title(bquote(paste('IrwinHall(',.(n), ')')))

See also the Bates distribution, which is for the mean of \(n\) i.i.d. \(\text{Uniform}(0,1)\) random variables. It essentially rescales the Irwin-Hall distribution to range between \(0\) and \(1\).
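
Below is a minimal sketch of that rescaling (it assumes the dirwinhall() function defined above): if \(X_{i}\) follows an Irwin–Hall distribution with parameter \(n\), then \(X_{i}/n\) follows a Bates distribution, so the Bates density at \(x\) is \(n\) times the Irwin–Hall density at \(nx\).

Code
# Bates(n) draws are means of n Uniform(0,1) variables
n <- 5
X_bates <- rowMeans(matrix(runif(2000*n), ncol=n))
hist(X_bates, breaks=30, border=NA, freq=F, main='', xlab='x')
# Overlay the rescaled Irwin–Hall density (uses dirwinhall from above)
x <- seq(0, 1, length.out=500)
lines(x, n*dirwinhall(n*x, n), col='blue')
title(bquote(paste('Bates(',.(n),')')))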

Beta.

The sample space is any number on the unit interval, \(X_{i} \in [0,1]\), but with non-uniform probabilities.

Code
X4 <- rbeta(2000,2,2) ## two shape parameters
hist(X4, breaks=20, border=NA, main=NA, freq=F)

#See the underlying probabilities
#f_25 <- dbeta(.25, 2, 2)

x <- seq(0,1,by=.01)
fx <- dbeta(x, 2, 2)
lines(x, fx, col='blue')

The probability density function of the Beta distribution is mathematically complicated to write down, so we omit it. However, we can find probabilities graphically using either the probability density function or the cumulative distribution function.

Suppose \(X_{i}\) is a random variable with a beta distribution. Intuitively depict \(Prob(X_{i} \in [0.2, 0.8])\) by drawing an area under the density function. Numerically estimate that same probability using the CDF.

Code
plot( ecdf(X4), main=NA) # Empirical

x <- seq(0,1,by=.01) # Theoretical
Fx <- pbeta(x, 2, 2)
lines(x, Fx, col='blue')

# Middle Interval Example 
F2 <- pbeta(0.2, 2, 2)
F8 <- pbeta(0.8, 2, 2)
F_2_8 <- F8 - F2
F_2_8
## [1] 0.792

# Visualize
title('Middle between 0.2 and 0.8')
segments( 0.2, F2, -1, F2, col='red')
segments( 0.8, F8, -1, F8, col='red')

This distribution is often used, as the probability density function has two parameters that allow it to take many different shapes.

For each example below, intuitively depict \(Prob(X_{i} \leq 0.5)\) using the PDF. Repeat the exercise using a CDF instead of a PDF to calculate a numerical value.

Code
op <- par(no.readonly = TRUE); on.exit(par(op), add = TRUE)
x <- seq(0,1,by=.01)
pars <- expand.grid( c(.5,1,2), c(.5,1,2) )
par(mfrow=c(3,3))
apply(pars, 1, function(p){
    fx <- dbeta( x,p[1], p[2])
    plot(x, fx, type='l', xlim=c(0,1), ylim=c(0,4), lwd=2, col='blue')
    #hist(rbeta(2000, p[1], p[2]), breaks=50, border=NA, main=NA, freq=F)
})
title('Beta densities', outer=T, line=-1)
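
For the CDF part of the exercise, here is a minimal sketch: \(Prob(X_{i} \leq 0.5)\) is the CDF evaluated at \(0.5\), which pbeta computes for each parameter pair.

Code
# Prob(X <= 0.5) via the CDF, for each of the nine parameter pairs above
pars <- expand.grid( c(.5,1,2), c(.5,1,2) )
cbind(pars, P_half = pbeta(0.5, pars[,1], pars[,2]))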

9.5 Beyond Basic Programming

Packages.

Expansion “packages” allow for more functionality.

Most packages can be easily installed from the internet via CRAN.

Code
# common packages for tables/figures
install.packages("plotly")
install.packages("reactable")
install.packages("stargazer")

# common packages for data/teaching
install.packages("AER")
install.packages("Ecdat")
install.packages('wooldridge')

# other packages for statistics and data handling
install.packages("extraDistr")
install.packages("twosamples")
install.packages("data.table")

You only need to install packages once. But you need to load the package via library every time you want the extended functionality. For example, to generate exotic probability distributions

Code
# install.packages("extraDistr"), once ever
library(extraDistr) # every time you need it

par(mfrow=c(1,2))
for(p in c(-.5,0)){
    x <- rgev(2000, mu=0, sigma=1, xi=p)
    hist(x, breaks=50, border=NA, main=NA, freq=F)
}
title('GEV densities', outer=T, line=-1)

Code
library(extraDistr)

par(mfrow=c(1,3))
for(p in c(-1, 0,2)){
    x <- rtlambda(2000, p)
    hist(x, breaks=100, border=NA, main=NA, freq=F)
}
title('Tukey-Lambda densities', outer=T, line=-1)

The most common packages also have cheatsheets you can use.

Updating.

Make sure R and your packages are up to date. The current version of R and any packages used can be found (and recorded) with

Code
sessionInfo()

To update your R packages, use

Code
update.packages()

After updating R, you can update all packages stored on your computer (in all .libPaths())

Code
update.packages(checkBuilt=T, ask=F)
# install.packages(old.packages(checkBuilt=T)[,"Package"])

Base.

While additional packages can make your code faster, they also create dependencies that can lead to problems. So learn base R well before becoming dependent on other packages.

Github Packages.

Advanced and Optional

Sometimes you will want to install a package from GitHub. For this, you can use devtools or its lightweight version, remotes.

To color terminal output on Linux systems, for example, you can use the colorout package.

Code
# install.packages("remotes")
library(remotes)
# Install https://github.com/jalvesaq/colorout
# to .libPaths()[1]
install_github('jalvesaq/colorout')
library(colorout)

To use devtools instead, you also need to have software developer tools installed on your computer.
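
For reference, here is a minimal sketch of the same installation using devtools (assuming the developer tools are already installed).

Code
# install.packages("devtools"), once ever
library(devtools) # every time you need it
install_github('jalvesaq/colorout')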


  1. Drawing random uniform samples with computers is actually quite complex and beyond the scope of this course.↩︎