We often summarize distributions with statistics: functions of data. The most basic way to do this in R is with summary, each of whose values can also be calculated individually. (E.g., the “mean” is the [sum of all values] divided by the [number of values].) There are many other statistics.
Code
summary( runif(1000) )
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0001567 0.2566212 0.5050232 0.5043039 0.7467155 0.9995065
summary( rnorm(1000) )
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -2.785726 -0.636147  0.005909  0.005797  0.632699  3.001677
5.1 Mean and Variance
The most basic statistics summarize the center of a distribution and how far apart the values are spread.
Mean.
Perhaps the most common statistic is the mean: \[\overline{X}=\frac{\sum_{i=1}^{N}X_{i}}{N},\] where \(X_{i}\) denotes the value of the \(i\)th observation.
Code
# compute the mean of a random sample
x <- runif(100)
hist(x, border=NA, main=NA)
m <- mean(x) # sum(x)/length(x)
abline(v=m, col=2, lwd=2)
title(paste0('mean= ', round(m,2)), font.main=1)
Variance.
Perhaps the second most common statistic is the variance: the average squared deviation from the mean \[V_{X} =\frac{\sum_{i=1}^{N} [X_{i} - \overline{X}]^2}{N}.\] The standard deviation is simply \(s_{X} = \sqrt{V_{X}}\).
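As a sketch, both can be computed directly from their formulas (note that base R's built-in var and sd divide by \(N-1\) rather than \(N\)):

Code
# variance as the average squared deviation from the mean
x <- runif(100)
v <- mean( (x - mean(x))^2 )  # denominator N, as in the formula above
s <- sqrt(v)                  # standard deviation
# base R's var(x) and sd(x) divide by N-1 instead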
Together, these statistics summarize the central tendency and dispersion of a distribution. In some special cases, such as with the normal distribution, they completely describe the distribution. Other distributions are easier to describe with other statistics.
5.2 Other Center/Spread Statistics
Absolute Deviations.
We can use the Median as a “robust alternative” to the mean. Recall that the \(q\)th quantile is the value below which a proportion \(q\) of the data fall (with a proportion \(1-q\) above). The median (\(q=.5\)) is the point where half of the data take lower values and the other half take higher values.
We can also use the Interquartile Range or Median Absolute Deviation as alternatives to the variance. The first and third quartiles (\(q=.25\) and \(q=.75\)) together bound the middle 50 percent of the data. The size of that range (the interquartile range: the difference between the two quartiles) represents the “spread” or “dispersion” of the data. The median absolute deviation also measures spread: \[
\tilde{X} = Med(X_{i}), \qquad
MAD_{X} = Med\left( | X_{i} - \tilde{X} | \right).
\]
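All of these robust statistics are available in base R. A minimal sketch (note that mad scales by roughly 1.4826 by default, so constant=1 matches the raw definition above):

Code
x <- rnorm(100)
median(x)                # the .5 quantile
quantile(x, c(.25,.75))  # first and third quartiles
IQR(x)                   # interquartile range
mad(x, constant=1)       # median absolute deviation, unscaled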
Sometimes, none of the above work well. With categorical data, for example, distributions are easier to describe with other statistics. The mode is the most common observation: the value with the highest observed frequency. We can also measure the spread/dispersion of the frequencies, or compare the highest frequency to the average frequency to measure concentration at the mode.
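For example, here is a hypothetical categorical sample where we find the mode and compare the highest frequency to the average frequency:

Code
# mode of a categorical sample
x <- sample(c('a','b','c'), 100, replace=TRUE, prob=c(.5,.3,.2))
h <- table(x)           # observed frequencies
names(h)[which.max(h)]  # the mode: value with the highest frequency
max(h)/mean(h)          # concentration at the mode (1 = uniform frequencies)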
Central tendency and dispersion are often insufficient to describe a distribution. To further describe shape, we can compute the “standard moments” skew and kurtosis, as well as other statistics.
Skewness.
This captures the asymmetry of a distribution: \[W_{X} =\frac{\sum_{i=1}^{N} [X_{i} - \overline{X}]^3 / N}{ [s_{X}]^3 }\]
Code
x <- rweibull(1000, shape=1)
hist(x, border=NA, main=NA, freq=F, breaks=20)
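As a sketch, the sample skewness can be computed directly from the formula (using the \(N\)-denominator standard deviation so it matches the definition above):

Code
# sample skewness, following the formula above
s <- sqrt( mean( (x - mean(x))^2 ) )  # standard deviation with denominator N
W <- mean( (x - mean(x))^3 ) / s^3
W  # positive values indicate a right skew (about 2 for this distribution)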
5.3 Theoretical Statistics
You were already introduced to these ideas in https://jadamso.github.io/Rbooks/random-variables.html and with probability distributions. In this section, we will dig a little deeper theoretically into the statistics we are most likely to use in practice.
The mean and variance are probably the two most basic statistics we might compute, and among the most often used. To understand them theoretically, we separately analyze how they are computed for discrete and continuous random variables.
Discrete.
If the sample space is discrete, we can compute the theoretical mean (or expected value) as \[
\mu = \sum_{i} x_{i} Prob(X=x_{i}),
\] where \(Prob(X=x_{i})\) is the probability that the random variable \(X\) takes the particular value \(x_{i}\). Similarly, we can compute the theoretical variance as \[
\sigma^2 = \sum_{i} [x_{i} - \mu]^2 Prob(X=x_{i}).
\]
Example. An unfair coin with a \(.75\) probability of heads (\(x_{i}=1\)) and a \(.25\) probability of tails (\(x_{i}=0\)) has a theoretical mean of \[
\mu = 1\times.75 + 0 \times .25 = .75
\] and a theoretical variance of \[
\sigma^2 = [1 - .75]^2 \times.75 + [0 - .75]^2 \times.25 = 0.1875
\]
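These calculations are easy to verify directly in R:

Code
# theoretical mean and variance of the unfair coin
x <- c(1, 0)       # heads, tails
p <- c(.75, .25)   # probabilities
mu <- sum(x*p)     # 0.75
sigma2 <- sum( (x - mu)^2 * p )  # 0.1875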
Weighted Data. Sometimes you may have a dataset of values along with probability weights; other times, you can calculate the weights yourself. In either case, you can do the computations explicitly (below, we assume x is a large simulated sample of the unfair coin from above):
Code
# simulate a large sample of the unfair coin (assumed setup)
x <- rbinom(5000, 1, prob=.75)

# Compute probability weights for unique values
h <- table(x)               # table of counts
wt <- c(h)/length(x)        # probabilities (must sum to 1)
xt <- as.numeric(names(h))  # values

# Weighted Mean
xm <- sum(wt*xt)
xm
## [1] 0.7454
Try computing the mean both ways for another random sample:
Code
x <- sample(c(0,1,2), 1000, replace=T)
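One possible sketch, if you want to check your answers:

Code
mean(x)                     # direct mean
h <- table(x)               # weighted mean, as above
wt <- c(h)/length(x)
xt <- as.numeric(names(h))
sum(wt*xt)                  # should match mean(x)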
Try also computing a weighted variance:
Code
# xv <- sum(wt * (xt - xm)^2)  # note: xt, not x; wt already sums to 1
Continuous.
If the sample space is continuous, we can compute the theoretical mean (or expected value) as \[
\mu = \int x f(x) d x,
\] where \(f(x)\) is the probability density function of the random variable at the particular value \(x\) (for continuous variables, the probability of any single value is zero). Similarly, we can compute the theoretical variance as \[
\sigma^2 = \int [x - \mu]^2 f(x) d x.
\]
Example. Consider a random variable with a continuous uniform distribution over [-1, 1]. In this case, \(f(x)=1/[1 - (-1)]=1/2\) for each \(x\) in [-1, 1] and \[
\mu = \int_{-1}^{1} \frac{x}{2} d x = \int_{-1}^{0} \frac{x}{2} d x + \int_{0}^{1} \frac{x}{2} d x = 0
\] and \[
\sigma^2 = \int_{-1}^{1} x^2 \frac{1}{2} d x = \frac{1}{2} \frac{x^3}{3}\Big|_{-1}^{1} = \frac{1}{6}[1 - (-1)] = \frac{2}{6} = \frac{1}{3}
\]
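As a sketch, these integrals can also be checked numerically with R's integrate function:

Code
# numerical check of the uniform[-1,1] mean and variance
f <- function(x) dunif(x, min=-1, max=1)  # density: 1/2 on [-1,1]
mu <- integrate(function(x) x*f(x), lower=-1, upper=1)$value
mu                                        # approximately 0
integrate(function(x) (x - mu)^2 * f(x), lower=-1, upper=1)$value  # approximately 1/3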