5  Statistics


We often summarize distributions with statistics: functions of data. The most basic way to do this in R is with summary, whose values can each be calculated individually. (E.g., the “mean” is the sum of all values divided by the number of values.) There are many other statistics.

Code
summary( runif(1000) )
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0001859 0.2577175 0.4893537 0.4972023 0.7411376 0.9993870
summary( rnorm(1000) )
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -4.01814 -0.62094  0.05204  0.02909  0.71266  2.73722
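
Each value that summary reports can also be computed individually; a minimal sketch, storing the sample first:

Code
X <- runif(1000)
summary(X)
# the same statistics, one at a time
min(X)
quantile(X, probs=.25) #1st quartile
median(X)
mean(X)
quantile(X, probs=.75) #3rd quartile
max(X)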

5.1 Mean and Variance

The most basic statistics summarize the center of a distribution and how spread out its values are.

Mean.

Perhaps the most common statistic is the mean; \[\bar{X}=\frac{\sum_{i=1}^{N}X_{i}}{N},\] where \(X_{i}\) denotes the value of the \(i\)th observation.

Code
# compute the mean of a random sample
X <- runif(100)
hist(X, border=NA, main=NA)
x_bar <- mean(X)  #sum(X)/length(X)
abline(v=x_bar, col=2, lwd=2)
title(paste0('mean= ', round(x_bar,2)), font.main=1)

Variance.

Perhaps the second most common statistic is the variance: the average squared deviation from the mean \[V_{X} =\frac{\sum_{i=1}^{N} [X_{i} - \bar{X}]^2}{N}.\] The standard deviation is simply \(s_{X} = \sqrt{V_{X}}\).

Code
s <- sd(X) # sqrt(var(X))
hist(X, border=NA, main=NA, freq=F)
s_lh <- c(x_bar - s,  x_bar + s)
abline(v=s_lh, col=4)
text(s_lh, -.02,
    c( expression(bar(X)-s[X]), expression(bar(X)+s[X])),
    col=4, adj=0)
title(paste0('sd= ', round(s,2)), font.main=1)

Note that a “corrected version” is used by R and many statisticians: \(V_{X} =\frac{\sum_{i=1}^{N} [X_{i} - \bar{X}]^2}{N-1}\).

Code
var(X)
## [1] 0.09101678
x_bar <- mean(X)
mean( (X - x_bar)^2 )
## [1] 0.09010662
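
The two versions differ only by the factor \((N-1)/N\), which vanishes as \(N\) grows:

Code
N <- length(X)
var(X) * (N-1)/N # same as mean( (X - x_bar)^2 )
## [1] 0.09010662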

Together, these statistics summarize the central tendency and dispersion of a distribution. In some special cases, such as with the normal distribution, they completely describe the distribution. Other distributions are easier to describe with other statistics.

5.2 Other Center/Spread Statistics

Absolute Deviations.

We can use the Median as a “robust alternative” to the mean. Recall that the \(q\)th quantile is the value where a proportion \(q\) of the data are below and (\(1-q\)) are above. The median (\(q=.5\)) is the point where half of the data take lower values and the other half take higher values. This means the median is not sensitive to extreme values (whereas the mean is).

Code
X <- rgeom(50, .4)
X
##  [1] 6 1 4 2 0 3 0 0 2 2 2 1 2 3 0 0 1 0 2 0 0 0 0 2 3 2 2 3 1 0 0 4 1 0 7 1 0 7
## [39] 3 2 0 4 8 1 3 0 4 1 0 4

proportions <- table(X)/length(X)
plot(proportions, ylab='proportion')

Code

# measures of central tendency
mean(X)
## [1] 1.88
median(X)
## [1] 1.5

# robustness to an extreme value
X_extreme <- c(X, 1000)
mean( X_extreme )
## [1] 21.45098
median( X_extreme )
## [1] 2
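
Other quantiles are computed the same way; for example, the quartiles of the original sample:

Code
quantile(X, probs=c(.25, .5, .75))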

We can also use the Interquartile Range or Median Absolute Deviation as an alternative to the variance. The first and third quartiles (\(q=.25\) and \(q=.75\)) together bound the middle 50 percent of the data. The size of that range (the interquartile range: the difference between the quartiles) represents the “spread” or “dispersion” of the data. The median absolute deviation also measures spread \[\begin{eqnarray} \tilde{X} &=& Med(X_{i}) \\ MAD_{X} &=& Med\left( | X_{i} - \tilde{X} | \right). \end{eqnarray}\]

Code
mad(X, constant=1) # median( abs(X - median(X)) )
## [1] 1.5

#IQR(X) # another alternative: diff( quantile(X, probs=c(.25,.75)) )
#sd(X)  # compare to the standard deviation

Note that there are other absolute deviations:

Code
mean( abs(X - mean(X)) )   # mean absolute deviation from the mean
mean( abs(X - median(X)) ) # mean absolute deviation from the median
median( abs(X - mean(X)) ) # median absolute deviation from the mean

Mode and Share Concentration.

Sometimes, none of the above work well. With categorical data, for example, distributions are easier to describe with other statistics. The mode is the most common observation: the value with the highest observed frequency. We can also measure how spread out the frequencies are, or how concentrated they are at the mode versus elsewhere.

Code
# Draw 3 Random Letters
K <- length(LETTERS)
X_id <- rmultinom(3, 1, prob=rep(1/K,K))
X_id
##       [,1] [,2] [,3]
##  [1,]    0    0    0
##  [2,]    0    0    1
##  [3,]    0    0    0
##  [4,]    0    0    0
##  [5,]    0    0    0
##  [6,]    0    0    0
##  [7,]    0    0    0
##  [8,]    0    0    0
##  [9,]    0    0    0
## [10,]    0    0    0
## [11,]    0    0    0
## [12,]    0    0    0
## [13,]    0    0    0
## [14,]    0    0    0
## [15,]    0    0    0
## [16,]    0    0    0
## [17,]    0    0    0
## [18,]    1    0    0
## [19,]    0    0    0
## [20,]    0    0    0
## [21,]    0    0    0
## [22,]    0    0    0
## [23,]    0    0    0
## [24,]    0    1    0
## [25,]    0    0    0
## [26,]    0    0    0

# Draw Random Letters 100 Times
X_id <- rowSums(rmultinom(100, 1, prob=rep(1/K,K)))
X <- lapply(1:K, function(k){
    rep(LETTERS[k], X_id[k])
})
X <- factor(unlist(X), levels=LETTERS)

proportions <- table(X)/length(X)
plot(proportions)

Code

# mode(s)
names(proportions)[proportions==max(proportions)]
## [1] "L" "Q"

# spread of the frequencies
sd(proportions)
## [1] 0.01826304
sum(proportions^2) # sum of squared shares (a concentration index)
## [1] 0.0468
# concentration at the mode, relative to the average share
max(proportions)/mean(proportions)
## [1] 1.82

5.3 Shape Statistics

Central tendency and dispersion are often insufficient to describe a distribution. To further describe shape, we can compute the “standard moments” skew and kurtosis, as well as other statistics.

Skewness.

This captures how asymmetric the distribution is: it is zero for symmetric distributions, positive when there is a long right tail, and negative when there is a long left tail. \[W_{X} =\frac{\sum_{i=1}^{N} [X_{i} - \bar{X}]^3 / N}{ [s_{X}]^3 }\]

Code
X <- rweibull(1000, shape=1)
hist(X, border=NA, main=NA, freq=F, breaks=20)

Code

skewness <- function(X){
    X_bar <- mean(X)
    m3 <- mean((X - X_bar)^3) #third central moment
    s3 <- sd(X)^3 #note: sd() uses the N-1 correction
    skew <- m3/s3
    return(skew)
}

skewness( rweibull(1000, shape=1))
## [1] 1.655013
skewness( rweibull(1000, shape=10) )
## [1] -0.6449534

Kurtosis.

This captures how many “outliers” there are, i.e. how heavy the tails of the distribution are. \[K_{X} =\frac{\sum_{i=1}^{N} [X_{i} - \bar{X}]^4 / N}{ [s_{X}]^4 }.\] A normal distribution has \(K_{X}=3\), so the excess kurtosis \(K_{X}-3\) is often reported instead (as in the function below).

Code
X <- rweibull(1000, shape=1)
boxplot(X, main=NA)

Code

kurtosis <- function(X){
    X_bar <- mean(X)
    m4 <- mean((X - X_bar)^4) #fourth central moment
    s4 <- sd(X)^4
    kurt <- m4/s4 - 3 #excess kurtosis (0 for a normal distribution)
    return(kurt)
}

kurtosis( rweibull(1000, shape=1) )
## [1] 4.114241
kurtosis( rweibull(1000, shape=10) )
## [1] 0.786354
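
As a sanity check, the excess kurtosis of a normal sample should be near zero:

Code
kurtosis( rnorm(1000) )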

Clusters/Gaps.

You can also describe distributions in terms of how clustered the values are. Remember: a picture is worth a thousand words.

Code
# Number of Modes
X <- rbeta(1000, .6, .6)
hist(X, border=NA, main=NA, freq=F, breaks=20)

Code

# Random Number Generator: mixes a uniform and a normal component
r_ugly1 <- function(n, theta1=c(-8,-1), theta2=c(-2,2), rho=.25){
    omega   <- rbinom(n, size=1, rho) #mixture indicator
    epsilon <- omega * runif(n, theta1[1], theta2[1]) + #uniform component
        (1-omega) * rnorm(n, theta1[2], theta2[2]) #normal component
    return(epsilon)
}
# Large Sample
par(mfrow=c(1,1))
x <- seq(-12,6,by=.001)
rX <- r_ugly1(1000000)
hist(rX, breaks=1000,  freq=F, border=NA,
    xlab="x", main='')

Code
# Show True Density
d_ugly1 <- function(x, theta1=c(-8,-1), theta2=c(-2,2), rho=.25){
    rho     * dunif(x, theta1[1], theta2[1]) +
    (1-rho) * dnorm(x, theta1[2], theta2[2]) }
dx <- d_ugly1(x)
lines(x, dx, col=1)

5.4 Probability Theory

You were already introduced to this with random variables (https://jadamso.github.io/Rbooks/random-variables.html) and probability distributions. In this section, we dig a little deeper theoretically into the statistics we are most likely to use in practice.

The mean and variance are probably the two most basic statistics we might compute, and the most often used. To understand them theoretically, we analyze separately how they are computed for discrete and continuous random variables.

Discrete Random Variables.

If the sample space is discrete, we can compute the theoretical mean (or expected value) as \[ \mathbb{E}[X_{i}] = \sum_{x} x Prob(X_{i}=x), \] where \(Prob(X_{i}=x)\) is the probability the random variable \(X_{i}\) takes the particular value \(x\). Similarly, we can compute the theoretical variance as \[ \mathbb{V}[X_{i}] = \sum_{x} \left(x - \mathbb{E}[X_{i}] \right)^2 Prob(X_{i}=x). \]

For example, an unfair coin with a \(.75\) probability of heads (\(x=1\)) and a \(.25\) probability of tails (\(x=0\)) has a theoretical mean of \[ \mathbb{E}[X_{i}] = 1\times.75 + 0 \times .25 = .75 \] and a theoretical variance of \[ \mathbb{V}[X_{i}] = [1 - .75]^2 \times.75 + [0 - .75]^2 \times.25 = 0.1875. \]

Code
X <- rbinom(10000, size=1, prob=.75)

round( mean(X), 4)
## [1] 0.746

round( var(X), 4)
## [1] 0.1895
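
We can also type the theoretical calculations directly into R:

Code
1*.75 + 0*.25 # theoretical mean
## [1] 0.75
(1-.75)^2 * .75 + (0-.75)^2 * .25 # theoretical variance
## [1] 0.1875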

Weighted Data.

Sometimes, you may have a dataset of values and probability weights. Other times, you can calculate the weights yourself. In either case, you can do the computations explicitly for discrete data. Given data on unique outcomes \(x\) and their frequencies \(\widehat{p}_{x}=\sum_{i=1}^{N}\mathbf{1}\left(X_{i}=x\right)/N\), we compute \(\bar{X}=\sum_{x} x \widehat{p}_{x}\).

Code
# Compute probability weights for unique values
P  <- table(X) #table of counts
p <- c(P)/length(X) #frequencies (must sum to 1)
x <- as.numeric(names(p)) #unique values
cbind(x,p)
##   x     p
## 0 0 0.254
## 1 1 0.746
# Mean
X_mean <- sum(x*p)
X_mean
## [1] 0.746

Try computing the mean both ways for another random sample.

Code
X  <-  sample(c(0,1,2), 1000, replace=T)
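
One way to check your work, following the code above (a sketch; p2 and x2 are new names so as not to overwrite the earlier objects):

Code
# simple mean
mean(X)
# weighted mean over unique values
p2 <- c(table(X))/length(X) #frequencies
x2 <- as.numeric(names(p2)) #unique values
sum(x2*p2)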

This idea generalizes to a weighted mean: an average where different values contribute to the final result with varying levels of importance. For each outcome \(x\) we have a weight \(W_{x}\) and compute \[\begin{eqnarray} \bar{X} &=& \frac{\sum_{x} x W_{x}}{\sum_{x} W_{x}} = \sum_{x} x w_{x}, \end{eqnarray}\] where \(w_{x}=\frac{W_{x}}{\sum_{x} W_{x}}\) is a normalized version of \(W_{x}\) that implies \(\sum_{x}w_{x}=1\).

Code
# Example: compute the final grade for this student
Homework1 <- c(score=88, weight=0.25)
Homework2 <- c(score=92, weight=0.25)
Exam1 <- c(score=67, weight=0.2)
Exam2 <- c(score=90, weight=0.3)

Grades <- rbind(Homework1, Homework2, Exam1, Exam2)
Grades
##           score weight
## Homework1    88   0.25
## Homework2    92   0.25
## Exam1        67   0.20
## Exam2        90   0.30

# Weighted Mean
Values <- Grades[,'score'] * Grades[,'weight']
FinalGrade <- sum(Values)
FinalGrade
## [1] 85.4

Note that if there are \(K\) unique outcomes and \(W_{x}=1\) then \(\sum_{x}W_{x}=K\) and \(w_{x}=1/K\). This means \(\bar{X} = \sum_{x} x w_{x} = \sum_{x} x /K\), which is just a simple mean.
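
R's built-in weighted.mean function does this computation for you; with equal weights it reduces to the simple mean. A small sketch with hypothetical values z:

Code
z <- c(0, 5, 10)
weighted.mean(z, w=c(1,1,1)) # equal weights give the simple mean
## [1] 5
weighted.mean(z, w=c(2,1,1)) # extra weight on the first value
## [1] 3.75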

Bonus: provide an example of computing a weighted variance, building on the code below.

Code
x_diff <- (x - X_mean)^2 #squared deviations from the weighted mean
p <- c(P)/sum(P) #frequencies of each unique value
x_var <- sum(p * x_diff)

Continuous Random Variables (Advanced).

If the sample space is continuous, we can compute the theoretical mean (or expected value) as \[ \mathbb{E}[X_{i}] = \int x f(x) d x, \] where \(f(x)\) is the probability density of the random variable at the particular value \(x\). Similarly, we can compute the theoretical variance as \[ \mathbb{V}[X_{i}]= \int \left(x - \mathbb{E}[X_{i}] \right)^2 f(x) d x. \]

For example, consider a random variable with a continuous uniform distribution over [-1, 1]. In this case, \(f(x)=1/[1 - (-1)]=1/2\) for each \(x\) in [-1, 1] and \[ \mathbb{E}[X_{i}] = \int_{-1}^{1} \frac{x}{2} d x = \int_{-1}^{0} \frac{x}{2} d x + \int_{0}^{1} \frac{x}{2} d x = 0 \] and \[ \mathbb{V}[X_{i}]= \int_{-1}^{1} x^2 \frac{1}{2} d x = \frac{1}{2} \frac{x^3}{3}|_{-1}^{1} = \frac{1}{6}[1 - (-1)] = 2/6 =1/3 \]

Code
X <- runif(10000, -1,1)
round( mean(X), 4)
## [1] -0.0038
round( var(X), 4)
## [1] 0.3317
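
We can also verify the integrals numerically with R's integrate function:

Code
integrate(function(x){ x * dunif(x, -1, 1) }, lower=-1, upper=1) # mean
integrate(function(x){ x^2 * dunif(x, -1, 1) }, lower=-1, upper=1) # variance, since the mean is 0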

You can again do the computations explicitly with weighted data, but here there is an additional approximation error from discretizing the continuous values into bins:

Code
# values and probabilities from a histogram
h  <- hist(X, plot=F)
wt <- h[['counts']]/length(X) #bin frequencies
xt <- h[['mids']] #bin midpoints
# Weighted mean
X_mean <- sum(wt*xt)
X_mean

# Compare to "mean(X)"

5.5 Further Reading

Probability Theory.

For weighted statistics, see

Note that many random variables are related to each other.

Also note that numbers randomly generated on your computer cannot be truly random; they are “pseudorandom”.