1  First Steps


1.1 Why Program in R?

You should program your statistical analysis, and we will cover some of the basics of how to do this in R. You also want your work to be replicable and reproducible:

  • Replicable: someone collecting new data comes to the same results.
  • Reproducible: someone reusing your data comes to the same results.

You can read more about the distinction in many places, including

We focus on R because it is good for complex stats, concise figures, and coherent organization. It is built and developed by applied statisticians for statistics, and used by many in academia and industry. For students, think about labor demand and what may be good for getting a job. Do some of your own research to best understand how much to invest.

My main sell to you is that being reproducible is in your own self-interest.

An example workflow.

First Steps…

Step 1: Some ideas and data about how variable \(X_{1}\) affects variable \(Y_{1}\), which we denote as \(X_{1}\to Y_{1}\)

  • You copy some data into a spreadsheet, manually aggregate
  • do some calculations and tables in the same spreadsheet
  • some other analysis from here and there, using this software and that.

Step 2: Pursuing the lead for a week or two

  • you extend your dataset with more observations
  • copy the data into a spreadsheet, manually aggregate
  • do some more calculations and tables, same as before

A Little Way Down the Road …

1 month later: someone asks about another factor: \(X_{2} \to Y_{1}\).

  • you download some other type of data
  • You repeat Step 2 with some data on \(X_{2}\).
  • The details from your “point and click” method are a bit fuzzy.
  • It takes a little time, but you successfully redo the analysis.

4 months later: someone asks about yet another factor: \(X_{3}\to Y_{1}\).

  • You again repeat Step 2 with some data on \(X_{3}\).
  • You’re pretty sure none of the tables you tried messed up the order of the rows or columns.
  • It takes more time and effort. The data processing was not transparent, but you eventually redo the analysis.

6 months later: you want to explore another outcome: \(X_{2} \to Y_{2}\).

2 years later: your boss wants you to replicate your work: \(X_{1}, X_{2}, X_{3} \to Y_{1}\).

  • A rival has proposed something new. Their idea doesn’t actually make any sense, but their figures and statistics look better.
  • You don’t even use that computer anymore and a collaborator who handled the data on \(X_{2}\) has moved on.

An alternative workflow.

Suppose you decided to code what you did beginning with Step 2.

It does not take much time to update or replicate your results.

  • Your computer runs for 2 hours and reproduces the figures and tables.
  • You also rewrote your big calculations to use multiple cores. This took two hours to do but saves 6 hours each time you rerun your code.
  • You add some more data. It adds almost no time to see whether much has changed.

Your results are transparent and easier to build on.

  • You see the exact steps you took and find an error
  • You try out a new plot you found in The Visual Display of Quantitative Information, by Edward Tufte.
  • You try out an obscure statistical approach that’s hot in your field.
    • it doesn’t make the report, but you have some confidence that the candidate issue isn’t a big problem

1.2 First Steps

Install R.

First install R. Then install RStudio.

For Fedora (Linux) users, note that you need to first enable the repo and then install:

Code
sudo dnf install 'dnf-command(copr)'
sudo dnf copr enable iucar/rstudio
sudo dnf install rstudio-desktop

Make sure you have the latest version of R and RStudio for class. If not, then reinstall.

Interfacing with RStudio.

RStudio is perhaps the easiest to get going with. (There are other GUIs.)

In RStudio, there are 4 panes. (If you do not see 4, click “File > New File > R Script” on the top left of the toolbar.)

The top left pane is where you write your code. For example, type

1+1

The pane below is where your code is executed. Keep your cursor on the same line as your code, and then click “Run”. You should see

> 1+1
[1] 2

If you click “Run” again, you should see that same output printed again.

You should add comments to your code, and you do this with the hash symbol (#). For example

# This is my first comment!
1+1 # The simplest calculation I could think of

You can execute each line one at a time. Or you can highlight them both and click “Run”, since R executes commands line by line.

Reading This Textbook.

As we proceed, you can see both my source code and output like this:

Code
1+1
## [1] 2

There are also special boxes

This box contains need-to-know examples, such as

Code
1+1
## [1] 2

This box contains test-yourself examples and questions, such as

Code
2+7
## [1] 9
2/7
## [1] 0.2857143

Assignment.

You can create “variables” that store values. For example,

Code
x <- 1 # Make your first variable
x + 1 # The simplest calculation I could think of
## [1] 2
Code
x <- 23 #Another example
x + 1
## [1] 24
Code
y <- x + 1 #Another example
y
## [1] 24

Your variables must be defined in order to use them. Otherwise you get an error. For example,

Code
X + 1 # notice that R is sensitive to capitalization
## Error: object 'X' not found

Your variable names do not matter technically, but they should be informative.

Code
one <- 1 # good variable name
one
## [1] 1

one <- 43 # bad variable name
one
## [1] 43

Good names avoid confusion later

Code
x <- 43
x_plus_two <- x + 2 # better
x_plus_two
## [1] 45

Scripting.

  • Create a folder on your computer to save your scripts
  • Save your R Script file as My_First_Script.R in your folder
  • Close RStudio
  • Open your script and re-run it

As you work through the material, make sure to both execute and save your scripts. Add lots of commentary to your scripts. Name your scripts systematically.

There are often many ways to accomplish the same goal. Your first scripts will be very basic and rough, but you can edit them later based on what you learn. And you can always ask R for help.

Code
sum(x, 2) # x + 2
?sum
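Besides `?sum`, a few other base R functions can help you explore what a function does. The short sketch below assumes `x` is 43, as in the earlier example.

```r
x <- 43
sum(x, 2)        # same result as x + 2

args(sum)        # show the arguments that sum() accepts
apropos("^sum")  # list objects whose names start with "sum"

# These open the documentation, so they are commented out here
# help("sum")     # same as ?sum
# example("sum")  # runs the examples from the help page
```

If you only remember part of a function name, `apropos()` is often a quick way to find it.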

We write scripts in the top-left pane so that we can easily go back and fix common mistakes.

Code
# Mistake 1: using undefined objects
Y

#  Mistake 2: spelling and spacing
Y < - 43
Y_plus_z <- Y + z

# Mistake 3: half-completed code
x + y + 
x_plus_y_plus_z <- x + y + z
# Seeing "+" in the bottom console?
# press "Escape" and try again
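For comparison, here is one way the three mistakes above could be fixed. The values assigned to `x`, `y`, and `z` are made up for illustration. Note that `Y < - 43` is not an assignment: R reads it as the comparison "is Y less than -43?".

```r
# Fix 1: define an object before using it
Y <- 43
Y

# Fix 2: write the arrow "<-" without a space, and define z before using it
z <- 2
Y_plus_z <- Y + z
Y_plus_z

# Fix 3: finish each expression before starting the next one
x <- 1
y <- 2
x_plus_y_plus_z <- x + y + z
x_plus_y_plus_z
```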

1.3 Mathematical Objects

In R, scalars, vectors, and matrices are different kinds of “objects”.

These objects are used extensively in data analysis

  • scalars: summary statistics (average household income).
  • vectors: single variables in data sets (the household income of each family in Vancouver).
  • matrices: two variables in data sets (the age and education level of every person in class).
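As a rough sketch of the three kinds of objects listed above, the code below builds one of each. All of the numbers and names (`income`, `age`, `educ`, `people`) are hypothetical and only meant to mirror the examples in the list.

```r
# Hypothetical incomes (in thousands of dollars) for four households
income <- c(52, 87, 40, 61)   # a vector: one variable
mean_income <- mean(income)   # a scalar: one summary statistic
mean_income

# Hypothetical age and years of education for three people
age <- c(21, 25, 23)
educ <- c(14, 16, 15)
people <- cbind(age, educ)    # a matrix: two variables side by side
people
dim(people)                   # number of rows (people) and columns (variables)
```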

Vectors are probably your most common object in R, but we will start with scalars.

Scalars.

Make your first scalar

Code
xs <- 2 # Make your first scalar
xs  # Print the scalar
## [1] 2

Perform simple calculations and see how R is doing the math for you

Code
xs + 2
## [1] 4
xs*2 # Perform and print a simple calculation
## [1] 4
(xs+1)^2 # Perform and print a simple calculation
## [1] 9
xs + NA # often used for missing values
## [1] NA

Now change xs, predict what will happen, then re-run the code.

Vectors.

Make your first vector

Code
x <- c(0,1,3,10,6) # Your First Vector
x # Print the vector
## [1]  0  1  3 10  6
x[2] # Print the 2nd Element; 1
## [1] 1
x+2 # Print simple calculation; 2,3,5,12,8
## [1]  2  3  5 12  8
x*2
## [1]  0  2  6 20 12
x^2
## [1]   0   1   9 100  36

Apply mathematical calculations elementwise

Code
x+x
## [1]  0  2  6 20 12
x*x
## [1]   0   1   9 100  36
x^x
## [1] 1.0000e+00 1.0000e+00 2.7000e+01 1.0000e+10 4.6656e+04

In R, scalars are treated as a vector with one element.

Code
c(1)
## [1] 1
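You can check this for yourself. The sketch below uses the scalar `xs` from before and a few base R functions to show that it behaves like any other vector.

```r
xs <- 2
length(xs)           # a "scalar" has length 1
is.vector(xs)        # and R considers it a vector
identical(xs, c(2))  # writing c(2) gives the same object as writing 2
xs[1]                # so you can index it like any other vector
```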

Sometimes, we will use vectors that are entirely ordered.

Code
seq(1,7,by=1) #1:7
## [1] 1 2 3 4 5 6 7
seq(1,7,by=0.5)
##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0

# Ordering data
sort(x)
## [1]  0  1  3  6 10
x[order(x)]
## [1]  0  1  3  6 10
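The difference between `sort()` and `order()` is easy to miss: `sort()` returns the sorted values, while `order()` returns the positions that would sort them. A small sketch, where `x_names` is a hypothetical vector of labels added for illustration:

```r
x <- c(0, 1, 3, 10, 6)
order(x)                    # positions of the elements, from smallest to largest
sort(x, decreasing = TRUE)  # values from largest to smallest

# order() lets you rearrange one vector according to another
x_names <- c("a", "b", "c", "d", "e")  # hypothetical labels, one per element of x
x_names[order(x)]
```

This is why `x[order(x)]` gives the same result as `sort(x)`.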

Matrices.

Matrices are also common objects

Code
x1 <- c(1,4,9)
x2 <- c(3,0,2)
x_mat <- rbind(x1, x2)

x_mat       # Print full matrix
##    [,1] [,2] [,3]
## x1    1    4    9
## x2    3    0    2
x_mat[2,]   # Print Second Row
## [1] 3 0 2
x_mat[,2]   # Print Second Column
## x1 x2 
##  4  0
x_mat[2,2]  # Print Element in Second Column and Second Row
## x2 
##  0
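A few related operations are worth trying at this point. The sketch below reuses `x1` and `x2`; the name `x_mat_c` is introduced here only for illustration.

```r
x1 <- c(1, 4, 9)
x2 <- c(3, 0, 2)

x_mat <- rbind(x1, x2)    # stack the vectors as rows
dim(x_mat)                # 2 rows and 3 columns

x_mat_c <- cbind(x1, x2)  # stack the vectors as columns
dim(x_mat_c)              # 3 rows and 2 columns

t(x_mat)                  # transpose: rows become columns
x_mat[, c(1, 3)]          # print the first and third columns
```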

There are elementwise calculations

Code
x_mat+2
##    [,1] [,2] [,3]
## x1    3    6   11
## x2    5    2    4
x_mat*2
##    [,1] [,2] [,3]
## x1    2    8   18
## x2    6    0    4
x_mat^2
##    [,1] [,2] [,3]
## x1    1   16   81
## x2    9    0    4

x_mat + x_mat
##    [,1] [,2] [,3]
## x1    2    8   18
## x2    6    0    4
x_mat*x_mat #NOT classical matrix multiplication
##    [,1] [,2] [,3]
## x1    1   16   81
## x2    9    0    4
x_mat^x_mat
##    [,1] [,2]      [,3]
## x1    1  256 387420489
## x2   27    1         4

And you can also use matrix algebra

Code
x_mat1 <- matrix( seq(2,7), 2, 3)
x_mat1
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7

x_mat2 <- matrix( seq(4,-1), 2, 3)
x_mat2
##      [,1] [,2] [,3]
## [1,]    4    2    0
## [2,]    3    1   -1

tcrossprod(x_mat1, x_mat2) #x_mat1 %*% t(x_mat2)
##      [,1] [,2]
## [1,]   16    4
## [2,]   22    7

crossprod(x_mat1, x_mat2) #t(x_mat1) %*% x_mat2
##      [,1] [,2] [,3]
## [1,]   17    7   -3
## [2,]   31   13   -5
## [3,]   45   19   -7
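The same products can be written with the matrix multiplication operator `%*%` and the transpose function `t()`. The sketch below checks that both ways agree.

```r
x_mat1 <- matrix(seq(2, 7), 2, 3)
x_mat2 <- matrix(seq(4, -1), 2, 3)

# tcrossprod(A, B) computes A %*% t(B)
x_mat1 %*% t(x_mat2)
all(x_mat1 %*% t(x_mat2) == tcrossprod(x_mat1, x_mat2))

# crossprod(A, B) computes t(A) %*% B
t(x_mat1) %*% x_mat2
all(t(x_mat1) %*% x_mat2 == crossprod(x_mat1, x_mat2))

# The dimensions must conform: a 2x3 matrix cannot be
# matrix-multiplied by another 2x3 matrix, so this gives an error
# x_mat1 %*% x_mat2
```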

1.4 Mathematical Functions

Simple Functions.

Functions are applied to objects

Code
# Define a function that adds two to any vector
add_two <- function(input_vector) { #input_vector is a placeholder
    output_vector <- input_vector + 2 # new object defined locally 
    return(output_vector) # return new object 
}
# Apply that function to a vector
x <- c(0,1,3,10,6)
add_two(input_vector=x) #same as add_two(x)
## [1]  2  3  5 12  8

Common mistakes:

Code
print(output_vector)
# This is not available globally

# Double check your spelling
x < - add_two(input_vector=X) 

# Seeing "+" in the bottom console
# often means you forgot to close the function with "}" 
# press "Escape" and try again
add_two <- function(input_vector) { 
    output_vector <- input_vector + 2 
    return(output_vector)
x <- c(0,1,3,10,6)
add_two(x)
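A corrected version of the code above might look like the following sketch. The name `x_plus_two` is just an example of where you could store the result.

```r
add_two <- function(input_vector) {
    output_vector <- input_vector + 2  # exists only inside the function
    return(output_vector)
}   # the closing brace ends the function definition

x <- c(0, 1, 3, 10, 6)
x_plus_two <- add_two(input_vector = x)  # store the result to use it later
x_plus_two

# output_vector was never created outside the function,
# so this is FALSE unless you defined it yourself elsewhere
exists("output_vector")
```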

There are many different functions

Code
add_vec <- function(input_vector1, input_vector2) {
    output_vector <- input_vector1 + input_vector2
    return(output_vector)
}
add_vec(x,3)
## [1]  3  4  6 13  9
add_vec(x,x)
## [1]  0  2  6 20 12

sum_squared <- function(x1, x2) {
    y <- (x1 + x2)^2
    return(y)
}

sum_squared(1, 3)
## [1] 16
sum_squared(x, 2)
## [1]   4   9  25 144  64
sum_squared(x, NA) 
## [1] NA NA NA NA NA
sum_squared(x, x)
## [1]   0   4  36 400 144
sum_squared(x, 2*x)
## [1]   0   9  81 900 324

Functions can take functions as arguments. Note that a statistic is defined as a function of data.

Code
statistic <- function(x, f){
    y <- f(x)
    return(y)
}
statistic(x, sum)
## [1] 20
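To see why this is useful, try passing different functions to `statistic`. The sketch below uses a few base R functions and one function written on the spot.

```r
statistic <- function(x, f) {
    y <- f(x)
    return(y)
}
x <- c(0, 1, 3, 10, 6)

statistic(x, mean)    # the average
statistic(x, max)     # the largest value
statistic(x, length)  # the number of observations

# You can also pass a function that you write on the spot
statistic(x, function(v) max(v) - min(v))  # the range of the data
```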

There are many possible functions you can make and use. More complicated functions often have defaults.

Code
fun_of_seq <- function(f, constant=2){
    x1 <- seq(1,3, length.out=12)
    x2 <- x1+constant
    x <- cbind(x1,x2)
    y <- f(x)
    return(y)
}
fun_of_seq(sum)
## [1] 72
fun_of_seq(sum, 3)
## [1] 84
fun_of_seq(prod)
## [1] 30799645993
fun_of_seq(prod, 3)
## [1] 473621744988
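Defaults are easier to follow when you name the arguments. A small sketch using the same `fun_of_seq` function:

```r
fun_of_seq <- function(f, constant = 2) {
    x1 <- seq(1, 3, length.out = 12)
    x2 <- x1 + constant
    x <- cbind(x1, x2)
    y <- f(x)
    return(y)
}

fun_of_seq(sum)                    # uses the default, constant = 2
fun_of_seq(sum, constant = 2)      # the same call with the default written out
fun_of_seq(constant = 3, f = sum)  # named arguments can come in any order
```

Naming arguments takes a little more typing, but it makes it clear which value goes where.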

You can also apply functions to matrices

Code
sum_squared(x_mat, x_mat)
##    [,1] [,2] [,3]
## x1    4   64  324
## x2   36    0   16

# Apply function to each matrix row
y <- apply(x_mat, 1, sum)^2 
# ?apply  #checks the function details
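The second argument of `apply` chooses the direction: 1 for rows and 2 for columns. The sketch below compares the two and also shows the base R shortcuts `rowSums` and `colSums`.

```r
x1 <- c(1, 4, 9)
x2 <- c(3, 0, 2)
x_mat <- rbind(x1, x2)

apply(x_mat, 1, sum)    # one sum per row
apply(x_mat, 2, sum)    # one sum per column
apply(x_mat, 1, sum)^2  # the row sums, squared

# Built-in shortcuts for the most common cases
rowSums(x_mat)
colSums(x_mat)
```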

Loops.

Applying the same function over and over again

Code
# Example 1: simple division
#Create empty vector
x <- vector(length=3)
#Fill empty vector
for(i in seq(1,3)){
    x[i] <- i/2
}
# Compare
x
## [1] 0.5 1.0 1.5
c( 1/2, 2/2, 3/2)
## [1] 0.5 1.0 1.5

# Example 2: exponential
#Create empty vector
x <- vector(length=3)
#Fill empty vector
for(i in seq(1,3)){
    x[i] <- exp(i)
}
# Compare
x
## [1]  2.718282  7.389056 20.085537
c( exp(1), exp(2), exp(3))
## [1]  2.718282  7.389056 20.085537

# Example 3: using existing data
x <- c(1,3,9,2)
y <- vector(length=length(x))
for(i in seq_along(x) ){
    y[i] <- x[i] + 1
}
y
## [1]  2  4 10  3
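Because R applies calculations elementwise, these three simple loops can also be written without a loop. The sketch below shows the vectorized versions and checks one of them against its loop. It uses `numeric()` to create the empty vector, which is a small variation on `vector(length=...)` used above.

```r
# Example 1 without a loop: divide every element at once
seq(1, 3) / 2

# Example 2 without a loop: exp() works elementwise
exp(seq(1, 3))

# Example 3: check that the loop and the vectorized version agree
x <- c(1, 3, 9, 2)
y <- numeric(length(x))  # a vector of zeros to fill in
for (i in seq_along(x)) {
    y[i] <- x[i] + 1
}
all(y == x + 1)
```

Loops are still needed when each step depends on the result of the previous step.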

A more complicated example

Code
complicated_fun <- function(i, j=0){
    x <- i^(i-1)
    y <- x + mean( seq(j,i) )
    z <- log(y)/i
    return(z)
}
complicated_vector <- vector(length=10)
for(i in seq(1,10) ){
    complicated_vector[i] <- complicated_fun(i)
}
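A common alternative to this kind of loop is the base R function `sapply`, which applies a function to each element and collects the results. The sketch below repeats the loop and compares the two; the name `complicated_vector2` is only for illustration.

```r
complicated_fun <- function(i, j = 0) {
    x <- i^(i - 1)
    y <- x + mean(seq(j, i))
    z <- log(y) / i
    return(z)
}

# The loop version
complicated_vector <- vector(length = 10)
for (i in seq(1, 10)) {
    complicated_vector[i] <- complicated_fun(i)
}

# The sapply version
complicated_vector2 <- sapply(seq(1, 10), complicated_fun)
all.equal(complicated_vector, complicated_vector2)

# Extra arguments are passed along to the function
sapply(seq(1, 10), complicated_fun, j = 1)
```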

A recursive example

Code
x <- vector(length=4)
x[1] <- 1
for(i in seq(2,4) ){
    x[i] <- (x[i-1]+1)^2
}
x
## [1]   1   4  25 676
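The same sequence can also be written as a function that calls itself. This is only a sketch: the name `recursive_fun` is made up, and it uses `if`/`else`, which is covered under Basic Logic.

```r
recursive_fun <- function(i) {
    if (i == 1) {
        return(1)  # starting value, same as x[1] <- 1
    } else {
        return((recursive_fun(i - 1) + 1)^2)  # same rule as in the loop
    }
}

recursive_fun(4)
sapply(seq(1, 4), recursive_fun)
```

The loop stores every intermediate value as it goes, whereas the function recomputes earlier values each time it is called.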

Basic Logic.

TRUE/FALSE

Code
x <- c(1,2,3,NA)
x > 2
## [1] FALSE FALSE  TRUE    NA
x==2
## [1] FALSE  TRUE FALSE    NA

any(x==2)
## [1] TRUE
all(x==2)
## [1] FALSE
2 %in% x
## [1] TRUE

2==TRUE
## [1] FALSE
2==FALSE
## [1] FALSE
 
is.numeric(x)
## [1] TRUE
is.na(x)
## [1] FALSE FALSE FALSE  TRUE
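Missing values deserve some care in logical comparisons, because a comparison with `NA` is itself `NA`. The sketch below shows a few common ways to handle this with base R, and why `2==TRUE` is `FALSE`.

```r
x <- c(1, 2, 3, NA)

all(x > 0)                # NA: R cannot tell without knowing the missing value
all(x > 0, na.rm = TRUE)  # ignore the missing value
x[!is.na(x) & x > 1]      # keep the non-missing elements larger than 1
which(x > 1)              # positions of the elements larger than 1
sum(x > 1, na.rm = TRUE)  # count them: TRUE counts as 1 and FALSE as 0

# In comparisons with numbers, TRUE is converted to 1 and FALSE to 0
1 == TRUE
0 == FALSE
```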

The “&” (and) and “|” (or) operators compare the logical vectors to their left and right, element by element.

Code
x <- seq(1,3)
(x >= 1) & (x < 2)
## [1]  TRUE FALSE FALSE
(x >= 1) | (x < 2)
## [1] TRUE TRUE TRUE

if( all(x >= 1) ){
    print("ok")
} else {
    print("not ok")
}
## [1] "ok"

logic_fun <- function(x){
    if( all(x >= 1) ){
        print("ok")
    } else {
        print("not ok")
    }
}
logic_fun( seq(1,3) )
## [1] "ok"
logic_fun( seq(0,2) )
## [1] "not ok"
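Note that `if` expects a single TRUE or FALSE, which is why the examples above wrap the comparison in `all()`. To make a choice for each element of a vector, base R has `ifelse`. A short sketch, where the labels "big" and "small" are arbitrary:

```r
x <- seq(1, 3)

# ifelse() makes a choice for each element
ifelse(x >= 2, "big", "small")

# "!" flips TRUE and FALSE
!(x >= 2)

# "&&" combines two single TRUE/FALSE values inside if()
if (any(x >= 3) && all(x >= 1)) {
    print("ok")
} else {
    print("not ok")
}
```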

1.5 Further Reading

There are many good and free programming materials online. For help setting up, see any of the following links

The most common tasks can be found at https://github.com/rstudio/cheatsheets/blob/main/rstudio-ide.pdf

Some of my programming examples originally come from https://r4ds.had.co.nz/ and I recommend https://intro2r.com.

I have also used online material from many places over the years, as there are many good and free online tutorials and courses specifically on R programming. See e.g.,

For more on why to program in R, see