Unlike spreadsheet applications (like Excel) or point-and-click statistical analysis software (like SPSS), statistical programming software is based around a script file where the user writes a series of commands to be performed.
Advantages of statistical programming software
Data analysis process is reproducible and transparent.
Due to the open-ended nature of language-based programming, there is far more versatility and customizability in what you can do with data.
Typically statistical programming software has a much more comprehensive range of built-in analysis functions than spreadsheets etc.
R is an open-source language specifically designed for statistical computing (and it’s the most popular choice among statisticians)
Because of its popularity and open-source nature, the R community’s package development means it has the most prewritten functionality of any data analysis software.
R differs from software like Stata, however, in that while you can use prewritten functions, it is equally adept at programming solutions for yourself.
Because its usage is broader, R also has a steeper learning curve than Stata.
SAS: Similar to Stata, but more commonly used in business & the private sector, in part because it’s typically more convenient for massive datasets. Otherwise, I think it’s seen as a bit older and less user-friendly.
Python: Another option based more on programming from scratch and with fewer prewritten commands. Python isn't specific to math & statistics, but instead is a general programming language used across a range of fields.
Probably the most similar software choice to R at this point, with better general use (and programming ease) at the cost of less package development specific to econometrics/data analysis.
Matlab: Popular in macroeconomics and theory work, not so much in empirical work. Matlab is powerful, but is much more based on programming “from scratch” using matrices and mathematical expressions.
RStudio Cheat Sheets: Very helpful 1-2 page overviews of common tasks and packages in R.
Quick-R: Website with short example-driven overviews of R functionality.
StackOverflow: Part of the Stack Exchange network, StackOverflow is a Q&A community website for people who work in programming. Tons of incredibly good R users and developers interact on StackExchange, so it’s a great place to search for answers to your questions.
R-Bloggers: Blog aggregator for posts about R. Great place to learn really cool things you can do in R.
R for Data Science: Online version of the book by Hadley Wickham, who has written many of the best packages for R, including the Tidyverse, which we will cover.
RStudio is an integrated development environment (IDE).
This means that in addition to a script editor, it also lets you view your environments, data objects, plots, help files, etc. directly within the application.
To execute a section of code, highlight the code and click “Run” or use Ctrl-Enter.
For a single line of code, you don’t need to highlight, just click into that line.
To execute the whole document, the hotkey is Ctrl-Shift-Enter.
Unlike Stata, with R you don't need any special code to write multiline code - it's already the default (functions are written with parentheses, so it's clear when the line actually ends).
So there's no excuse for really long lines. Accepted style suggests using an 80-character limit for your lines.
You can also write multiple expressions on the same line by separating them with a semicolon (;), for example:
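A minimal illustration of two expressions on one line (z is just a throwaway name here):
# Two expressions on one line, separated by a semicolon
z <- 10; z * 2
## [1] 20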
You can access the help file for any given function using the help function. You can call it a few different ways:
? is shorter, so that’s the most frequent method.
# Help on the lm (linear regression) function
?lm
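The help() function opens the same page, just with more typing:
# Equivalent call using help()
help(lm)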
To set the working directory, use the function setwd(). The argument for the function is simply the path to the working directory, in quotes.
However: be sure that the slashes in the path are forward slashes (/). For Windows, this is not the case if you copy the path from File Explorer, so you'll need to change them.
# Set Working Directory
setwd("C:/Users/Andrew/Documents/MyProject")
Examples of basic mathematical operations in R:
# Addition and Subtraction
2 + 2
## [1] 4
# Multiplication and Division
2*2 + 2/2
## [1] 5
# Exponentiation and Logarithms
2^2 + log(2)
## [1] 4.693147
You can also evaluate logical expressions in R
## Less than or equal to
5 <= 6
## [1] TRUE
## Greater than or equal to
5 >= 6
## [1] FALSE
## Equality
5 == 6
## [1] FALSE
## Negation
5 != 6
## [1] TRUE
You can also use AND (&) and OR (|) operations with logical expressions:
## Is 5 equal to 5 OR 5 is equal to 6
(5 == 5) | (5 == 6)
## [1] TRUE
## Is 5 less than 6 AND 7 less than 6
(5 < 6) & (7 < 6)
## [1] FALSE
To define an object, use <-. For example:
# Assign 2 + 2 to the variable x
x <- 2 + 2
Note: In R, there is no distinction between defining and redefining an object (a la gen/replace in Stata).
y <- 4 # Define y
y <- y^2 # Redefine y
y #Print y
## [1] 16
Data elements in R are categorized into a few separate classes (ie types)
numeric: Data that should be interpreted as a number.
character: Data that should be interpreted as text.
logical: Data that should be interpreted as a logical statement, ie TRUE or FALSE.
factor: In effect, a categorical variable. Values may be text, but R interprets the variable as taking on one of a limited number of possible values (e.g. sex, municipality, industry, etc.)
a <- 2; class(a)
## [1] "numeric"
b <- "2"; class(b)
## [1] "character"
c <- TRUE; class(c)
## [1] "logical"
d <- "True"; class(d)
## [1] "character"
The basic data structure containing multiple elements in R is the vector.
An R vector is much like the typical view of a vector in mathematics, ie it’s basically a 1D array of elements.
Typical vectors are of a single type (these are called atomic vectors).
A list vector can also have elements of different types.
To create a vector, use the function c().
# Create `days` vector
days <- c("Mon","Tues","Wed",
"Thurs", "Fri")
# Create `temps` vector
temps <- c(13,18,17,20,21)
# Display `temps` vector
temps
## [1] 13 18 17 20 21
You can name the elements of a vector by assigning a vector of names with names(), where the vector to be named goes in the parentheses.
# Assign `days` as names for `temps` vector
names(temps) <- days
# Display `temps` vector
temps
## Mon Tues Wed Thurs Fri
## 13 18 17 20 21
There are multiple ways of subsetting data in R. One of the easiest methods for vectors is to put the subset condition in brackets:
# Subset temps greater than or equal to 18
temps[temps>=18]
## Tues Thurs Fri
## 18 20 21
Operations on vectors are element-wise. So if 2 vectors are added together, each element of the \(2^{nd}\) vector would be added to the corresponding element from the \(1^{st}\) vector.
# Temp vector for week 2
temps2 <- c(8,10,10,15,16)
names(temps2) <- days
# Create average temperature by day
avg_temp <- (temps + temps2) / 2
# Display `avg_temp`
avg_temp
## Mon Tues Wed Thurs Fri
## 10.5 14.0 13.5 17.5 18.5
Data in a 2-dimensional structure can be represented in two formats, as a matrix or as a data frame.
To create a matrix, use the matrix() command.
The syntax of matrix() is:
matrix(x, nrow=a, ncol=b, byrow=FALSE/TRUE)
- x is the data that will populate the matrix.
- nrow and ncol specify the number of rows and columns, respectively.
You generally need to specify just one, since the number of elements together with one dimension determines the other.
- byrow specifies whether to fill in the elements by row or by column. The default is byrow=FALSE, ie the data is filled in by column.
A simple example of creating a matrix would be:
matrix(1:6, nrow=2, ncol=3, byrow=FALSE)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Note the difference in appearance if we instead set byrow=TRUE:
matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
Using the same c() function as in the creation of a vector, we can specify the values of a matrix:
matrix(c(13,18,17,20,21,
8,10,10,15,16),
nrow=2, byrow=TRUE)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 13 18 17 20 21
## [2,] 8 10 10 15 16
Instead of entering in the matrix data yourself, you may want to make a matrix from existing data vectors:
# Create temps matrix
temps.matrix <- matrix(c(temps,temps2), nrow=2,
ncol=5, byrow=TRUE)
# Display matrix
temps.matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 13 18 17 20 21
## [2,] 8 10 10 15 16
- Naming rows and columns of a matrix is pretty similar to naming vectors.
- Only here, instead of using names(), we use rownames() and colnames().
# Name the rows and columns of the temps matrix
rownames(temps.matrix) <- c("Week1", "week2")
colnames(temps.matrix) <- days
# Display matrix
temps.matrix
## Mon Tues Wed Thurs Fri
## Week1 13 18 17 20 21
## week2 8 10 10 15 16
In R, matrix multiplication is denoted by %*%, as in A %*% B.
A * B instead performs element-wise (Hadamard) multiplication of matrices, so that A * B has the entries \(a_1 b_1\), \(a_2 b_2\) etc.
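The vecA and matB objects used below are not defined in the text; a minimal sketch consistent with the printed output would be:
# Assumed definitions, inferred from the output shown below
vecA <- c(1, 2)
matB <- matrix(1:6, nrow = 2, byrow = TRUE)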
vecA; matB
## [1] 1 2
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
vecA * matB
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 8 10 12
Most of the time you’ll probably be working with datasets that are recognized as data frames when imported into R.
But you can also easily create your own data frames.
This might be as simple as converting a matrix to a data frame:
mydf <- as.data.frame(matB)
mydf
## V1 V2 V3
## 1 1 2 3
## 2 4 5 6
mydf <- data.frame(vecA,matB)
mydf
## vecA X1 X2 X3
## 1 1 1 2 3
## 2 2 4 5 6
Once you have a multidimensional data object, you will usually want to create or manipulate particular columns of the object.
The default way of invoking a named column in R is by appending a dollar sign and the column name to the data object.
wages # View wages data frame
## wage schooling sex exper
## 1 134.23058 13 female 8
## 2 249.67744 13 female 11
## 3 53.56478 10 female 11
wages$expersq <- wages$exper^2; wages # Add expersq
## wage schooling sex exper expersq
## 1 134.23058 13 female 8 64
## 2 249.67744 13 female 11 121
## 3 53.56478 10 female 11 121
Like viewing the class of a homogeneous data object, it's often helpful to view the structure of data frames (or other 2D objects).
# View the structure of the wages data frame
str(wages)
## 'data.frame': 3 obs. of 5 variables:
## $ wage : num 134.2 249.7 53.6
## $ schooling: int 13 13 10
## $ sex : Factor w/ 2 levels "female","male": 1 1 1
## $ exper : int 8 11 11
## $ expersq : num 64 121 121
A common task is to redefine the classes of columns in a data frame.
Common commands can help you with this when the data is formatted suitably:
as.numeric() will take data that looks like numbers but are formatted as characters/factors and change their formatting to numeric.
as.character() will take data formatted as numbers/factors and change their class to character.
as.factor() will reformat data as factors, taking by default the unique values of each column as the possible factor levels.
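For example, a quick sketch of converting a character vector of digits to numeric and back:
# Characters that look like numbers -> numeric
as.numeric(c("1", "2", "3"))
## [1] 1 2 3
# Numbers -> character
as.character(1:3)
## [1] "1" "2" "3"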
Although as.factor() will suggest factors from the data, you may want more control over how factors are specified.
With the factor() function, you supply the possible values of the factor and you can also specify ordering of factor values if your data is ordinal.
A dataset on number of extramarital affairs from Fair (Econometrica 1977) has the following variables: number of affairs, years married, presence of children, and a self-rated (Likert scale) 1-5 measure of marital happiness.
str(affairs) # view structure
## 'data.frame': 3 obs. of 4 variables:
## $ affairs: num 0 0 1
## $ yrsmarr: num 15 1.5 7
## $ child : Factor w/ 2 levels "no","yes": 2 1 2
## $ mrating: int 1 5 3
## Format mrating as ordered factor
affairs$mrating <- factor(affairs$mrating,
levels=c(1,2,3,4,5), ordered=TRUE)
str(affairs)
## 'data.frame': 3 obs. of 4 variables:
## $ affairs: num 0 0 1
## $ yrsmarr: num 15 1.5 7
## $ child : Factor w/ 2 levels "no","yes": 2 1 2
## $ mrating: Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 5 3
Note that the marital rating (mrating) initially was stored as an integer, which is incorrect. Using factors preserves the ordering while not asserting a numerical relationship between values.
Similar to subsetting a vector, matrices & data frames can also be subsetted for both rows and columns by placing the selection arguments in brackets after the name of the data object:
Arguments can be:
Row or column numbers (eg mydf[1,3])
Row or column names
Rows (ie observations) that meet a given condition
# Subset of wages df with schooling > 10, exper > 10
wages[(wages$schooling > 10) & (wages$exper > 10),]
## wage schooling sex exper expersq
## 2 249.6774 13 female 11 121
Notice that the column argument was left empty, so all columns are returned by default.
Packages in R are similar to user-written commands (think ssc install) in Stata.
But most things you do in Stata probably use core Stata commands.
In R, most of your analysis will probably be done using packages.
To install a package, use the function (preferably in the console) install.packages()
install.packages("tidyverse") # Install tidyverse
install.packages("rio") # Install rio
Unlike Stata, in R you need to declare what packages you will be using at the beginning of each R document.
To do this, use the library() function.
-require() also works, but its use is discouraged for this purpose.
library("tidyverse") # Load tidyverse
library("rio") # Load rio
Previously, importing and exporting data was a mess, with a lot of different functions for different file formats.
The rio package simplifies this by reducing all of this to just one function, import()
PISA_2015 <- import("data/PISA2015.sas7bdat")
PISA_2015[1:5,1:6]
## CNTRYID CNT CNTSCHID CYC NatCen Region
## 1 8 ALB 800001 06MS 000800 800
## 2 8 ALB 800002 06MS 000800 800
## 3 8 ALB 800003 06MS 000800 800
## 4 8 ALB 800004 06MS 000800 800
## 5 8 ALB 800005 06MS 000800 800
export(PISA_2015, "PISA_2015.rds")
Last class, we covered data frames—the most basic data object class for data sets with a mix of data classes.
Today, we introduce one final data object: the tibble!
The tibble can be thought of as an update to the data frame—and it’s the first part of the tidyverse package that we’ll look at.
There are three main benefits to the tibble:
The syntax for creating tibbles exactly parallels the syntax for data frames:
tibble() creates a tibble from underlying data or vectors.
as_tibble() coerces an existing data object into a tibble.
PISA_2015 <- as_tibble(PISA_2015); PISA_2015[1:5,1:5]
## # A tibble: 5 x 5
## CNTRYID CNT CNTSCHID CYC NatCen
## <dbl> <chr> <dbl> <chr> <chr>
## 1 8 ALB 800001 06MS 000800
## 2 8 ALB 800002 06MS 000800
## 3 8 ALB 800003 06MS 000800
## 4 8 ALB 800004 06MS 000800
## 5 8 ALB 800005 06MS 000800
Another tidyverse function that's very useful is glimpse(), a function very similar to str().
Both functions display information about the structure of a data object.
str() provides more information, such as column (variable) attributes embedded from external data formats, but consequently is much less readable for complex data objects.
glimpse() provides only column names, classes, and some data values (much more readable)
I will often use str() when I want more detailed information about data structure, but use glimpse() for quicker glances at the data.
Another major convenience enhancement from the tidyverse is pipes, denoted %>%.
Pipes allow you to combine multiple steps into a single piece of code.
Specifically, after performing a function in one step, a pipe takes the data generated from the first step and uses it as the data input to a second step.
barro.lee.data <- import("data/BL2013_MF1599_v2.1.dta") %>%
as_tibble() %>% glimpse(width = 50)
## Observations: 1,898
## Variables: 20
## $ BLcode <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ country <chr> "Algeria", "Algeria", "Alge…
## $ year <dbl> 1950, 1955, 1960, 1965, 197…
## $ sex <chr> "MF", "MF", "MF", "MF", "MF…
## $ agefrom <dbl> 15, 15, 15, 15, 15, 15, 15,…
## $ ageto <dbl> 999, 999, 999, 999, 999, 99…
## $ lu <dbl> 80.68459, 81.05096, 82.6111…
## $ lp <dbl> 17.563400, 17.018442, 14.31…
## $ lpc <dbl> 3.745905, 3.464397, 3.06939…
## $ ls <dbl> 1.454129, 1.639253, 2.75251…
## $ lsc <dbl> 0.4595877, 0.4952279, 1.049…
## $ lh <dbl> 0.2978759, 0.2594140, 0.322…
## $ lhc <dbl> 0.16479027, 0.14177565, 0.1…
## $ yr_sch <dbl> 0.8464569, 0.8350149, 0.880…
## $ yr_sch_pri <dbl> 0.7443995, 0.7284052, 0.706…
## $ yr_sch_sec <dbl> 0.09280412, 0.09858591, 0.1…
## $ yr_sch_ter <dbl> 0.009253278, 0.008023791, 0…
## $ pop <dbl> 5241, 5699, 6073, 6374, 710…
## $ WBcode <chr> "DZA", "DZA", "DZA", "DZA",…
## $ region_code <chr> "Middle East and North Afri…
A motivating principle behind the creation of the tidyverse was that the language of programming should really behave like a language.
Data manipulation in the tidyverse is oriented around a few key “verbs” that perform common types of data manipulation.
Note: the first argument for each of these functions is the data object (so pipe!).
Filtering keeps observations (rows) based on conditions.
# Using brackets
wages[(wages$schooling > 10) & (wages$exper > 10),]
## wage schooling sex exper
## 2 249.6774 13 female 11
# Using filter
wages %>% filter(schooling > 10,exper > 10)
## wage schooling sex exper
## 1 249.6774 13 female 11
Notice a couple of things about the output:
filter(wages, schooling > 10,exper > 10)
## wage schooling sex exper
## 1 249.6774 13 female 11
Just like filter is in many ways a more convenient form of writing out bracketed row subset conditions, the verb select() is largely a more convenient method for writing column arguments.
# Using brackets
wages_row1[,c("wage","schooling","exper")]
## wage schooling exper
## 1 134.2306 13 8
# Using select
wages_row1 %>% select(wage,schooling,exper)
## wage schooling exper
## 1 134.2306 13 8
One option we have not covered so far in creating subsets is dropping rows or columns.
R has a specific notation for this, easily used with select():
wages_row1 # What wages_row1 looks like:
## wage schooling sex exper
## 1 134.2306 13 female 8
wages_row1 %>% select(-exper) #drop exper
## wage schooling sex
## 1 134.2306 13 female
Dropping columns (or rows) using the - notation also works with brackets, but only when using the number location of the row or column to be dropped.
wages_row1[,-4] # works
## wage schooling sex
## 1 134.2306 13 female
# wages_row1[,-"exper"] does not work
Because of the ability to use name arguments, dropping with select() is generally easier.
Creating new variables that are functions of existing variables in a data set can be done with mutate().
mutate() takes as its first argument the data set to be used and the equation for the new variable:
wages <- wages %>%
mutate(expsq = exper^2) # Create expersq
wages # Display wages
## wage schooling sex exper expsq
## 1 134.23058 13 female 8 64
## 2 249.67744 13 female 11 121
## 3 53.56478 10 female 11 121
Summary statistics can also be easily created using the tidyverse function summarize()
The summarize() function uses summary statistic functions in R to create a new summary tibble, with syntax largely identical to mutate().
Let’s try summarizing with the mean() summary statistic.
wages %>%
summarize(avg_wage = mean(wage))
## avg_wage
## 1 145.8243
There are a number of summary statistics available in R, which can be used either with the summarize() command or outside of it:
Measures of central tendency and spread: e.g. mean(), median(), sd(), var(), IQR()
Position: e.g. min(), max(), quantile()
Count: e.g. n(), n_distinct()
Let's look at an example of using multiple summary variables with a larger 50-observation sample for the wages data set.
wages %>%
summarize(avg.wage = mean(wage), sd.wage = sd(wage),
avg.exper = mean(exper), sd.exper = sd(exper))
## # A tibble: 1 x 4
## avg.wage sd.wage avg.exper sd.exper
## <dbl> <dbl> <dbl> <dbl>
## 1 5942. 17526. 7.47 2.08
Creating summary statistics by group is another routine task. This is accommodated in the tidyverse using the group_by() function.
wages %>% group_by(sex) %>%
summarize(avg.wage = mean(wage), sd.wage = sd(wage))
## # A tibble: 2 x 3
## sex avg.wage sd.wage
## <fct> <dbl> <dbl>
## 1 female 5473. 18883.
## 2 male 6410. 16711.
If you want to sort your data by the values of a particular variable, you can easily do so as well with the arrange() function.
wages[1:3,] %>% arrange(exper)
## # A tibble: 3 x 4
## wage schooling sex exper
## <dbl> <int> <fct> <int>
## 1 175. 12 female 5
## 2 103. 11 male 7
## 3 1411. 14 female 8
Note: arrange() sorts values in ascending order by default. If you want to sort in descending order, wrap the variable name inside desc() in the function.
Creating a sample from a data set in R is made easy by two main functions: sample_n() and sample_frac().
Syntax:
sample_n(data, size, replace = FALSE/TRUE)
sample_frac(data, size = 1, replace = FALSE/TRUE)
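For example, assuming the wages data frame from earlier is available (the draws are random, so the output will vary):
# Draw 5 random observations from wages
wages %>% sample_n(5)
# Draw a 10% sample, with replacement
wages %>% sample_frac(0.1, replace = TRUE)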
Let’s use tidyverse data manipulation verbs to work through a practical data prep problem from start to finish.
For the problem, let's use the fuel economy data again, but with half of the data set. The data comes from the vehicles data set in the fueleconomy package.
# install.packages("fueleconomy") # Run only once
library(fueleconomy)
Now let’s look at how fuel efficiency has changed over time in the data set. Specifically, let’s create descriptive statistics of fuel efficiency by year for “normal” passenger vehicles (4-8 cylinders).
glimpse(vehicles[2:12], width=50)
## Observations: 33,442
## Variables: 11
## $ make <chr> "AM General", "AM General", "AM G…
## $ model <chr> "DJ Po Vehicle 2WD", "DJ Po Vehic…
## $ year <int> 1984, 1984, 1984, 1984, 1985, 198…
## $ class <chr> "Special Purpose Vehicle 2WD", "S…
## $ trans <chr> "Automatic 3-spd", "Automatic 3-s…
## $ drive <chr> "2-Wheel Drive", "2-Wheel Drive",…
## $ cyl <int> 4, 4, 6, 6, 4, 6, 6, 4, 4, 6, 4, …
## $ displ <dbl> 2.5, 2.5, 4.2, 4.2, 2.5, 4.2, 3.8…
## $ fuel <chr> "Regular", "Regular", "Regular", …
## $ hwy <int> 17, 17, 13, 13, 17, 13, 21, 26, 2…
## $ cty <int> 18, 18, 13, 13, 16, 13, 14, 20, 2…
annual.mpg <- vehicles %>% sample_frac(0.5) %>%
filter(cyl %in% 4:8) %>% group_by(year) %>%
summarize(hwy.avg = mean(hwy), hwy.sd = sd(hwy),
city.avg = mean(cty), city.sd = sd(cty)) %>%
arrange(desc(city.avg))
Note: Here I used %in%, which works like inrange in Stata. You could alternately write two inequalities to achieve the same thing.
# Print annual.mpg
annual.mpg
## # A tibble: 32 x 5
## year hwy.avg hwy.sd city.avg city.sd
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 2015 28.6 5.42 20.6 4.78
## 2 2014 27.8 6.44 20.4 5.76
## 3 2013 27.3 6.02 20.0 5.78
## 4 2012 26.3 5.83 19.2 5.21
## 5 2011 25.7 5.36 18.8 4.86
## 6 2010 25.3 4.90 18.4 4.38
## 7 1985 22.8 6.21 17.7 4.70
## 8 2009 24.2 4.54 17.6 3.92
## 9 1986 22.4 5.82 17.5 4.48
## 10 1997 23.6 5.11 17.2 3.76
## # … with 22 more rows
Although the tidyverse summarize() function is more powerful, often you just want a quick look at summary statistics for the whole data set.
summary(wages)
## wage schooling sex exper
## Min. : 1.69 Min. : 8 female:15 Min. : 3.000
## 1st Qu.: 44.07 1st Qu.:11 male :15 1st Qu.: 6.250
## Median : 160.96 Median :12 Median : 7.500
## Mean : 5941.66 Mean :12 Mean : 7.467
## 3rd Qu.: 1519.01 3rd Qu.:13 3rd Qu.: 8.000
## Max. :73659.06 Max. :15 Max. :13.000
There are a few data cleaning tasks that are pervasive in empirical work:
Renaming columns is easily accommodated with the tidyverse rename() command.
Syntax:
mydataframe <- mydataframe %>% rename(NewVarName = OldVarName)
To see rename() in action, let's go back to the barro.lee.data educational data set we imported earlier. Let's look at columns 1 and 7 through 9:
glimpse(barro.lee.data[,c(1,7:9)], width = 50)
## Observations: 1,898
## Variables: 4
## $ BLcode <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lu <dbl> 80.68459, 81.05096, 82.61115, 80…
## $ lp <dbl> 17.563400, 17.018442, 14.313748,…
## $ lpc <dbl> 3.745905, 3.464397, 3.069391, 4.…
See how these variable names are uninformative? Let’s use str() now.
str(barro.lee.data[,c(1,7:9)])
## Classes 'tbl_df', 'tbl' and 'data.frame': 1898 obs. of 4 variables:
## $ BLcode: num 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "Country Code"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ lu : num 80.7 81.1 82.6 80.9 73.6 ...
## ..- attr(*, "label")= chr "Percentage of No Schooling"
## ..- attr(*, "format.stata")= chr "%7.2f"
## $ lp : num 17.6 17 14.3 14.4 19.2 ...
## ..- attr(*, "label")= chr "Percentage of Primary"
## ..- attr(*, "format.stata")= chr "%7.2f"
## $ lpc : num 3.75 3.46 3.07 4.01 5.23 ...
## ..- attr(*, "label")= chr "Percentage of Primary Complete"
## ..- attr(*, "format.stata")= chr "%7.2f"
Although the column names here are uninformative, the imported Stata file included variable labels saved as attributes, which can be seen using the more verbose str() function.
barro.lee.data <- barro.lee.data %>%
rename(countrycode = BLcode,
perc.noschool = lu,
perc.primary = lp,
perc.primary.complete = lpc)
Now let’s look at the variable names again:
glimpse(barro.lee.data[,c(1,7:9)], width = 50)
## Observations: 1,898
## Variables: 4
## $ countrycode <dbl> 1, 1, 1, 1, 1, 1,…
## $ perc.noschool <dbl> 80.68459, 81.0509…
## $ perc.primary <dbl> 17.563400, 17.018…
## $ perc.primary.complete <dbl> 3.745905, 3.46439…
Along with renaming variables, recoding variables is another integral part of data wrangling.
wages[1:4,"sex"] # Look at sex column
## # A tibble: 4 x 1
## sex
## <fct>
## 1 female
## 2 female
## 3 male
## 4 male
wages$sex <- wages$sex %>% recode("male"=0,
"female"=1) # recode
wages[1:4,"sex"] # Look at sex column
## # A tibble: 4 x 1
## sex
## <dbl>
## 1 1
## 2 1
## 3 0
## 4 0
Another problem characteristic of observational data is missing data. In R, the way to represent missing data is with the value NA.
## Replace 99-denoted missing data with NA
# bracket method
wages$schooling[wages$schooling == 99] <- NA
# tidyverse method
wages$schooling <- wages$schooling %>% na_if(99)
You can check for (correctly-coded) missing-values using the is.na() function.
## Missing
wages[is.na(wages$wage),]
## # A tibble: 3 x 4
## wage schooling sex exper
## <dbl> <int> <dbl> <int>
## 1 NA 14 1 8
## 2 NA 11 0 8
## 3 NA 10 0 8
Note: R does not naturally support multiple types of missingness like other languages, although it’s possible to use the sjmisc package to do this.
Rules for tidy data (from R for Data Science): each variable has its own column, each observation has its own row, and each value has its own cell.
The two main tidyverse verbs for making data tidy are:
gather(): collects variable values that are spread over multiple columns into a single column.
spread(): when multiple variables' values are stored in the same column, moves each variable into its own column.
If values for a single variable are spread across multiple columns (e.g. income for different years), gather() moves these into a single "value" column, with a "key" column identifying which original column each value came from.
Syntax:
gather(data, key, value, columnstocombine)
earnings.panel
## # A tibble: 7 x 3
## person y1999 y2000
## <chr> <dbl> <dbl>
## 1 Elsa 10 15
## 2 Mickey 20 28
## 3 Ariel 17 21
## 4 Gaston 19 19
## 5 Jasmine 32 35
## 6 Peter 22 29
## 7 Alice 11 15
earnings.panel <- earnings.panel %>%
gather(key="year", value="wage",y1999:y2000)
earnings.panel
## # A tibble: 14 x 3
## person year wage
## <chr> <chr> <dbl>
## 1 Elsa y1999 10
## 2 Mickey y1999 20
## 3 Ariel y1999 17
## 4 Gaston y1999 19
## 5 Jasmine y1999 32
## 6 Peter y1999 22
## 7 Alice y1999 11
## 8 Elsa y2000 15
## 9 Mickey y2000 28
## 10 Ariel y2000 21
## 11 Gaston y2000 19
## 12 Jasmine y2000 35
## 13 Peter y2000 29
## 14 Alice y2000 15
Spread tackles the other major problem - that oftentimes (particularly in longitudinal data) many variables are condensed into just a "key" (or indicator) column and a value column.
wages2
## person indicator values
## 1 Elsa wage NA
## 2 Mickey wage 174.932480
## 3 Ariel wage 102.668810
## 4 Gaston wage 1.690623
## 5 Jasmine wage 2.166231
## 6 Peter wage 1192.371925
## 7 Alice wage 83.363705
## 8 Elsa wage NA
## 9 Mickey wage 174.932480
## 10 Ariel wage 102.668810
## 11 Gaston wage 1.690623
## 12 Jasmine wage 2.166231
## 13 Peter wage 1192.371925
## 14 Alice wage 83.363705
## 15 Elsa schooling 14.000000
## 16 Mickey schooling 12.000000
## 17 Ariel schooling 11.000000
## 18 Gaston schooling 11.000000
## 19 Jasmine schooling 14.000000
## 20 Peter schooling 12.000000
## 21 Alice schooling 11.000000
## 22 Elsa schooling 14.000000
## 23 Mickey schooling 12.000000
## 24 Ariel schooling 11.000000
## 25 Gaston schooling 11.000000
## 26 Jasmine schooling 14.000000
## 27 Peter schooling 12.000000
## 28 Alice schooling 11.000000
## 29 Elsa exper 8.000000
## 30 Mickey exper 5.000000
## 31 Ariel exper 7.000000
## 32 Gaston exper 8.000000
## 33 Jasmine exper 10.000000
## 34 Peter exper 8.000000
## 35 Alice exper 6.000000
## 36 Elsa exper 8.000000
## 37 Mickey exper 5.000000
## 38 Ariel exper 7.000000
## 39 Gaston exper 8.000000
## 40 Jasmine exper 10.000000
## 41 Peter exper 8.000000
## 42 Alice exper 6.000000
wages2 %>% spread("indicator", "values")
## person wage schooling exper
## 1 Elsa NA 14 8
## 2 Mickey 174.932480 12 5
## 3 Ariel 102.668810 11 7
## 4 Gaston 1.690623 11 8
## 5 Jasmine 2.166231 14 10
## 6 Peter 1192.371925 12 8
## 7 Alice 83.363705 11 6
## 8 Elsa NA 14 8
## 9 Mickey 174.932480 12 5
## 10 Ariel 102.668810 11 7
## 11 Gaston 1.690623 11 8
## 12 Jasmine 2.166231 14 10
## 13 Peter 1192.371925 12 8
## 14 Alice 83.363705 11 6
So far, when changing a data object, we have always been a bit repetitive:
mydataframe <- mydataframe %>% rename(NewVarName = OldVarName)
Along with the standard pipe (%>%), by loading the magrittr package, you can also use the so-called “assignment pipe” (%<>%).
mydataframe %<>% rename(NewVarName = OldVarName)
Another subtlety glossed over so far is lists.
As mentioned in module 1, vectors come in two forms: atomic vectors (with a single data type) and lists (with heterogeneous data types).
Lists can take as inputs not only single-valued elements, but also vectors or data frames.
Creating a list from other objects is done with the list() function. The syntax is list(object1, object2, ...), optionally with names, as in list(name1 = object1, name2 = object2).
wages_df; date_df; description
## wage schooling sex exper
## 1 134.23058 13 female 8
## 2 249.67744 13 female 11
## 3 53.56478 10 female 11
## month year
## 1 Jan 2017
## 2 Feb 2018
## 3 March 2019
## [1] "Data on wages and date information."
mylist <- list(wages = wages_df, dates = date_df, description)
Where wages_df and date_df are data frames and description is a single character element.
To subset a vector/matrix/data frame, one uses single brackets, eg mydf[,].
To refer to an object of a list, use double brackets.
mylist[[3]]
## [1] "Data on wages and date information."
Note: The function list() does not transfer the names of the data frames, so you will need to either subset by position or assign names to the list objects.
An easy way of extracting an object from a list is with the extract2() function from magrittr. This allows you to extract a given list object by name or position.
wage_data <- mylist %>% extract2(1)
wage_data
## wage schooling sex exper
## 1 134.23058 13 female 8
## 2 249.67744 13 female 11
## 3 53.56478 10 female 11
Instead of creating more complicated data objects, sometimes you want to flatten data formatted as a list into a simple (atomic) vector. The unlist() function does this.
Example:
simple_list <-list(1,2,3,4)
str(simple_list)
## List of 4
## $ : num 1
## $ : num 2
## $ : num 3
## $ : num 4
simple_list %<>% unlist(); str(simple_list)
## num [1:4] 1 2 3 4
For tasks that you want to iterate over multiple data frames/variables/elements, you may want to think about creating a loop.
Syntax:
for(indexname in range) {
do stuff
}
for (i in 1:4){
print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
You can also loop over elements instead of values.
nlsy97 <- import("./data/nlsy97.rds")
factor.vars <- c("personid","year","sex","race","region","schooltype")
for (i in factor.vars){
nlsy97[,i] %<>% unlist() %>% as.factor()
}
For iterations over vectors and dataframes, the map() function is a great alternative to the for loop.
Map functions take a user-supplied function and iterate it over:
Elements of a vector
Objects of a list
Columns of a data frame
Map functions are much simpler to write than loops and are also generally a good bit faster.
Syntax:
map(data, fxn, option1, option2...)
Example:
nlsy97[,factor.vars] %<>% map(as.factor)
There are multiple map variants that enforce a given data type on results. You should use these whenever you want output of a certain class.
map_lgl for logical vector
map_dbl for numeric vector
map_chr for character vector
map_df for a data frame
nlsy.sub <- nlsy97 %>% select(parentincome, motheredyrs, gpa)
nlsy.sub %>% map_dbl(IQR, na.rm=TRUE)
## parentincome motheredyrs gpa
## 55947.2500 3.0000 2.8075
nlsy.sub %>% map(IQR, na.rm=TRUE)
## $parentincome
## [1] 55947.25
##
## $motheredyrs
## [1] 3
##
## $gpa
## [1] 2.8075
map() works with not only predefined functions, but also “anonymous functions”— unnamed functions defined inside of map().
# Create Z Transform
ztransform <- map_df(nlsy.sub, function(x)
(x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
)
### Did it work?
# Means
map_dbl(ztransform, function(x)
round(mean(x, na.rm=TRUE),10))
## parentincome motheredyrs gpa
## 0 0 0
# Standard deviations
map_dbl(ztransform, function(x)
round(sd(x, na.rm=TRUE),10))
## parentincome motheredyrs gpa
## 1 1 1
"If statements" are also a useful part of programming, either in conjunction with iteration or separately.
Syntax
if(condition){
do stuff
}
In the for loop example, the loop was indexed over only the columns to be converted to factors.
Equally, the loop could be done over all columns, with an if statement to change only those columns.
for (j in colnames(nlsy97)){
if(j %in% factor.vars){
nlsy97[,j] %<>% unlist() %>% as.factor()
}
}
You can encompass several conditions using the else if and catch-all else control statements.
if (condition1) {
do stuff
} else if (condition2) {
do other stuff
} else {
do other other stuff
}
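A minimal sketch with a single (made-up) value:
score <- 85
if (score >= 90) {
  print("A")
} else if (score >= 80) {
  print("B")
} else {
  print("C")
}
## [1] "B"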
As alluded to earlier, if statements can’t test-and-do for vectors, but only single-valued objects.
Most of the time, you probably want to use conditional statements on vectors. The vector equivalent to the if statement is ifelse()
Syntax:
ifelse(condition, true_statement, false_statement)
The statements returned can be simple values, but they can also be functions or even further conditions. You can easily nest multiple ifelses if desired.
numbers <- sample(1:30, 7); numbers
## [1] 29 11 13 22 27 12 30
ifelse(numbers %% 2 == 0,"even","odd")
## [1] "odd" "odd" "odd" "even" "odd" "even" "even"
Note: What if we tried a normal if statement instead?
if(numbers %% 2 == 0){
print("even")} else{
print("odd")}
## [1] "odd"
A better alternative to multiple nested ifelse statements is the tidyverse case_when function.
Syntax:
case_when(
condition1 ~ statement1,
condition2 ~ statement2,
condition3 ~ statement3,
)
nums_df <- numbers %>% as.tibble() %>%
mutate(interval = case_when(
(numbers > 0 & numbers <= 10) ~ "1-10",
(numbers > 10 & numbers <= 20) ~ "10-20",
(numbers > 20 & numbers <= 30) ~ "20-30"))
nums_df[1:4,]
## # A tibble: 4 x 2
## value interval
## <int> <chr>
## 1 29 20-30
## 2 11 10-20
## 3 13 10-20
## 4 22 20-30
If you find yourself performing the same specific steps more than a couple of times (perhaps with slight variations), then you should consider writing a function.
A function can serve essentially as a wrapper for a series of steps, where you define generalized inputs/arguments.
Ingredients:
Function name
Arguments
Function body
Syntax:
function_name <- function(arg1, arg2, ...){
do stuff
}
Let’s turn the calculation of even or odd that was completed earlier into a function:
# Make odd function
odd <- function(obj){
ifelse(obj %% 2 == 0,"even","odd")
}
Notice that obj here is a descriptive placeholder name for the data object to be supplied as an argument for the function.
odd(numbers)
## [1] "odd" "odd" "odd" "even" "odd" "even" "even"
A useful way of writing simple functions when you’ve already written the code for a specific instance is to use RStudio’s Extract Function option, which is available from the code menu.
Shifting gears from programming…
Another staple task in applied work is combining data from multiple data sets. The tidyverse set of packages includes several useful types of merges (or “joins”):
left_join() Appends columns from dataset B to dataset A, keeping all observations in dataset A.
inner_join() Appends columns together, keeping only observations that appear in both dataset A and B.
semi_join() Keeps only columns of dataset A for observations that appear in both dataset A and B.
anti_join() Keeps only columns of dataset A for observations that do not appear in dataset B.
The starting point for any merge is to enumerate the column or columns that uniquely identify observations in the dataset.
For cross-sectional data, this might be a personal identifier or (for aggregate data) something like municipality, state, country, etc.
For panel data, this will typically be both the personal/group identifier and a timing variable, for example Sweden in 2015 in a cross-country analysis.
Sometimes the names of the key variables are different across datasets.
You could of course rename the key variables to be consistent.
But mismatched key names are easily handled by the tidyverse join functions.
Syntax:
join_function(x, y, by = c("x_name" = "y_name"))
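For example, with hypothetical data frames dfA (identifier person_id) and dfB (identifier id):
# dfA uses 'person_id' and dfB uses 'id' for the same identifier
combined <- left_join(dfA, dfB, by = c("person_id" = "id"))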
The left_join() is the most frequent type of join, corresponding to a standard merge in Stata.
Syntax:
left_join(x, y, by = "key")
If the key is multiple columns, use c() to list them.
# Look at the datasets
earnings
## person_id wage
## 1 001 150
## 2 002 90
## 3 003 270
educ
## person_id schooling
## 1 001 12
## 2 003 8
## 3 004 16
# Combine data
combined_data <- left_join(earnings, educ,
by="person_id")
# Print data
combined_data
## person_id wage schooling
## 1 001 150 12
## 2 002 90 NA
## 3 003 270 8
Notice that schooling is equal to NA for person ‘002’ because that person does not appear in the educ dataset.
If you want to combine the variables of two data sets, but only keep the observations present in both datasets, use the inner_join() function.
combined_data <- inner_join(earnings, educ,
by="person_id")
combined_data
## person_id wage schooling
## 1 001 150 12
## 2 003 270 8
To keep only the variables in the first dataset, and only the observations in the first dataset that are matched in the second dataset, use semi_join().
filtered_data <- semi_join(earnings, educ, by="person_id")
filtered_data
## person_id wage
## 1 001 150
## 2 003 270
Another filtering join is anti_join(), which filters for observations that are not matched in a second dataset.
filtered_data <- anti_join(earnings, educ,
by="person_id")
filtered_data
## person_id wage
## 1 002 90
There are still other join types, which you can read about in the dplyr documentation.
Finally, instead of joining different datasets for the same individuals, sometimes you want to join together files that are for different individuals within the same dataset.
The function for appending data in the tidyverse is:
bind_rows(list(dataframe1,dataframe2,...))
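A minimal sketch, using two made-up data frames for different individuals:
# Hypothetical: stack two waves of the same survey
wave1 <- data.frame(id = 1:2, wage = c(100, 120))
wave2 <- data.frame(id = 3:4, wage = c(90, 150))
bind_rows(list(wave1, wave2))
##   id wage
## 1  1  100
## 2  2  120
## 3  3   90
## 4  4  150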
The last type of data preparation that we will cover in this course is manipulating string data.
The simplest string manipulation may be concatenating (ie combining) strings.
The glue function lets you reference variable values inside of text strings by writing the variable in curly brackets {} inside of the string.
date_df %<>% mutate(
say.month = glue("The month is {month}"),
mo.yr = glue("{month} {year}")
)
date_df
## month year say.month mo.yr
## 1 Jan 2017 The month is Jan Jan 2017
## 2 Feb 2018 The month is Feb Feb 2018
## 3 March 2019 The month is March March 2019
numbers <- c(1,2,3)
for (i in numbers){
print(glue("The magic number is {i}"))
}
## The magic number is 1
## The magic number is 2
## The magic number is 3
Other common string manipulating tasks include extracting or replacing parts of a string. These are accomplished via the str_extract() and str_replace() from the Tidyverse stringr package.
The arguments for each function are:
str_extract(string_object, "pattern_to_match")
str_replace(string_object, "pattern_to_match","replacement_text")
By default, both functions operate on the first match of the specified pattern. To operate on all matches, add "_all" to the function name, as in:
str_extract_all(string_object, "pattern_to_match")
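For example, with a made-up string containing several digit runs:
str_extract("room 12, floor 3", "[0-9]+") # first match only
## [1] "12"
str_extract_all("room 12, floor 3", "[0-9]+") # all matches, returned as a list
## [[1]]
## [1] "12" "3"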
In the last seminar, we created a “year” column from years indicated in the “variable” column text via the expression:
nlsy97$year <- str_extract(nlsy97$variable, "[0-9]+")
After creating the “year” column, we then removed the year values from the values of the “variable” column by replacing these numbers with an empty string.
nlsy97$variable <- str_replace(nlsy97$variable, "[0-9]+","")
When working with formatted text, a third common task is to remove extra spaces before or after the string text.
str_trim(string, side = "left"/"right"/"both")
Note, when printing a string, any formatting characters are shown. To view how the string looks formatted, use the writeLines() function.
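For example, trimming a made-up padded string:
str_trim("   tidy data   ", side = "both")
## [1] "tidy data"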
Often we want to modify strings based on a pattern rather than an exact expression, as seen with the str_extract() and str_replace() examples.
Patterns are specified in R (as in many other languages) using a syntax known as “regular expressions” or regex.
Today, we will very briefly introduce some regular expressions.
There are also abbreviations for specific types of characters,
eg: [:digit:] for numbers, [:alpha:] for letters, [:punct:] for punctuation, and . (the period) for any character.
See the RStudio cheat sheet on stringr for more examples (and in general, as a brilliant reference to regex)
Aside from specifying the characters to match, such as "[0-9]", another important component of regular expressions is how many times the characters should appear.
Suppose we want to extract year data that is mixed in with other data as well.
messy_var <- c(1,1987,2006,2010,307,2018)
str_extract(messy_var, "[0-9]")
## [1] "1" "1" "2" "2" "3" "2"
str_extract(messy_var, "[0-9]+")
## [1] "1" "1987" "2006" "2010" "307" "2018"
str_extract(messy_var, "[0-9]{4}")
## [1] NA "1987" "2006" "2010" NA "2018"
Often, special characters can cause problems when working with strings. For example, trying to add a quote can result in R thinking you are trying to close the string.
For most characters, you can “escape” (cause R to read as part of the string) special characters by prepending them with a backslash.
Example:
quote <- "\"Without data, you're just another person with an opinion.\"
- W. Edwards Deming."
writeLines(quote)
## "Without data, you're just another person with an opinion."
## - W. Edwards Deming.
To match part of a string that occurs before or after a specific other pattern, you can also specify “lookarounds”, the pattern the match should precede or follow:
To match a string pattern x, preceded or followed by y:
y precedes x: “(?<=y)x”
y follows x: “x(?=y)”
price_info <-c("The price is 5 dollars")
str_extract(price_info, "(?<=(The price is )).+")
## [1] "5 dollars"
str_extract(price_info, ".+(?=( dollars))")
## [1] "The price is 5"
"Scraping" data from the web - that is, automating the retrieval of data displayed online (other than through an API) - is an increasingly common data analysis task.
Today, we will briefly explore very rudimentary web scraping, using the rvest package.
The specific focus today is only on scraping data structured as a table on a webpage. The basic method highlighted will work much of the time - but does not work for every table.
The starting point for scraping a web table with rvest is the read_html() function, where the URL to the page with data should go.
After reading the webpage, the table should be parsed. For many tables, the output of read_html() can be piped directly into the html_table() function.
If html_table() does not work, a more robust option is to first pipe read_html into html_nodes(xpath = "//table") and then into html_table(fill=TRUE)
tech_stock_names <- c("MSFT","AMZN","GOOGL","AAPL","FB","INTC","CSCO")
tech_stocks <- list()
for(j in 1:length(tech_stock_names)){
tech_stocks[[j]] <-read_html(
glue("https://finance.yahoo.com/quote/{tech_stock_names[j]}/history")) %>%
html_table() %>% as.data.frame() %>% mutate(stock = tech_stock_names[j])
}
tech_stocks %<>% bind_rows()
tech_stocks[1:5,c(1,6:8)]
## Date Adj.Close.. Volume stock
## 1 Mar 07, 2019 110.39 25,321,300 MSFT
## 2 Mar 06, 2019 111.75 17,687,000 MSFT
## 3 Mar 05, 2019 111.70 19,538,300 MSFT
## 4 Mar 04, 2019 112.26 26,608,000 MSFT
## 5 Mar 01, 2019 112.53 23,501,200 MSFT
gini_list <-read_html("http://wdi.worldbank.org/table/1.3") %>%
html_nodes(xpath ="//table") %>% html_table(fill=TRUE)
gini_data <- gini_list %>% extract2(3) %>%
as.data.frame() %>% select(1:3)
gini_data[1:5,]
## X1 X2 X3
## 1 Afghanistan .. ..
## 2 Albania 2012 29.0
## 3 Algeria 2011 27.6
## 4 American Samoa .. ..
## 5 Andorra .. ..
So far, we have been working purely with basic “R Script” files, which are very similar to Stata do-files.
But thanks largely to the knitr package, you can easily create reports that interweave text and R code in a neatly structured manner.
Output can be structured as PDF documents, HTML webpages, Word documents, or various presentation formats including Beamer (LaTeX) presentations.
The course website, lecture slides, and exercise instructions have all been generated in R.
Reports of different file formats are generated using the knitr package.
Before installing knitr, make sure that you have a LaTeX distribution installed.
Then install the knitr package and initialize it in the usual manner.
# Run only once (ie in the console)
# install.packages("knitr")
# Initialize library
library("knitr")
Knitr allows for the creation of documents structured using two different typesetting languages:
LaTeX, with the .Rnw file format
Markdown (specifically RMarkdown), which was originally created as a simple language for structuring HTML markup.
For this course, we will focus on the RMarkdown format, which has become the dominant method for "knitting" documents because of its light weight and flexibility.
After installing knitr, to create an RMarkdown document, go to File—New File—R Markdown.
A popup appears asking you to enter the document Title and Author, as well as what type of document you want to create.
In RMarkdown, expository writing and R code are differentiated by writing code in specific code "chunks".
```{r}
# Here is an example of a code chunk
2 +2
```
When you create the document, that statement gets evaluated as:
# Here is an example of a code chunk
2 +2
## [1] 4
You can also include inline code by initializing with a backtick and the letter r (no space between), writing the code, then closing the chunk with another backtick.
`r 2+2`
To generate a document in the desired output format from a RMarkdown document, you need to “Knit” the document, which appears as a clickable icon on the menu atop the script pane.
You do not need to Knit a document after every change, however. You can just as easily run the code chunks. There are specific buttons to run either the current chunk or all of the chunks above a given chunk.
Anything not written inside of these backticked sections is interpreted as normal writing.
RMarkdown makes styling your writing particularly easy. Some common formatting options include:
**Bold Statement**
*Italics Statement*
LaTeX input: Most LaTeX commands (except for generally longer multi-line structures) can be included in RMarkdown documents just as you'd write them in a TeX document.
Lists/Bullet Points: Like the bullet points here, you will often want to structure output using lists.
To create a bulleted list, start each bulleted line with a dash (-).
Make sure to leave an empty line between the start of the list and any other text.
To make an indented "sub-list", start the sub-list with a plus sign (+) and use tab to indent the line twice for each of the sub-list items.
Ordered lists use the same indent rules as unordered lists, but with numbers instead of dashes or plus signs (see the example below).
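For example, an RMarkdown source snippet like the following (a small sketch) produces a bulleted list with a sub-list, followed by an ordered list:
Some introductory text, followed by a blank line.

- First bullet point
- Second bullet point
    + A sub-list item under the second bullet

1. First ordered item
2. Second ordered item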
Something you might wonder is how to obey the RStudio 80-character margins while allowing your text to wrap normally in the generated documents.
The answer lies in how new lines are treated in RMarkdown documents: a single line break in the source does not create a new line or paragraph in the output, so you can hard-wrap your text at 80 characters and it will still flow as one paragraph (a blank line is needed to start a new paragraph).
There are several output options you can specify for how R code and the code output are expressed in reports. These options are expressed as options in the {r} declaration at the top of the chunk.
echo=FALSE: do not show the R code itself (but potentially the code output depending on other chunk options).
include=FALSE: do not show the R code or the output in the document.
eval=FALSE: do not actually execute the code inside the chunk, only display it.
results="hide": run the code but do not show the results output.
warning=FALSE / message=FALSE: do not show warnings or messages associated with the R code.
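For example, a chunk declared like this (a sketch, assuming the wages data frame exists) runs the code and shows its output in the report, but hides the code itself along with any messages and warnings:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
summary(wages)
```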
When a RMarkdown document is generated in a given output format, there are several common things you can do to customize the appearance of the document.
To add options to your document, put the name of the output type on an indented new line and add a colon after it. Then, indented under the output type, add the desired options.
Here are a few common options:
table of contents: to include a table of contents in your document, use the toc: yes option.
To change the way data frame output is printed, use the df_print option. Good options are kable or tibble.
To add code highlighting for R chunks, use the highlight: option.
You can also specify output themes for html documents and beamer presentations. For html documents, possible themes are listed in the R Markdown documentation, while beamer themes are typically supplied by .sty files in your project folder.
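Putting a few of these together, the YAML header of an HTML document might look like this (a sketch; the specific theme and highlight names are just examples):
---
title: "My Report"
author: "Author Name"
output:
  html_document:
    toc: yes
    df_print: kable
    highlight: tango
    theme: flatly
---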
In RMarkdown documents, the working directory is automatically set to the folder in which the RMarkdown document is saved.
From there, you can use relative file paths. If data etc is in the root of the project folder, then just refer to the file name directly.
If data is in a subfolder, eg data, use a relative path like:
import("./data/mydata.rds")
Aside from the standard RMarkdown documents that we’ve covered so far, another format worth mentioning is the R Notebook format.
R Notebooks essentially adapt html RMarkdown documents to be even more similar to something like Jupyter Notebooks.
With R Notebooks, you can Preview documents without knitting them over again.
The document also generally has the Notebook-style code-output-code layout.
In addition to using RMarkdown documents to make your scripts and reports more compelling, another process upgrade is using RStudio Projects.
Projects are useful because they:
A project root folder is not only preferable to the need to use setwd(), but also to the default working directory used in RMarkdown documents outside of R Projects. Why?
Because for substantial research projects, you likely will have a lot of files that you split into different subfolders, one of which is probably something like code.
To create a RStudio Project, go to File – New Project. From there, you can choose whether to create a new directory, use an existing directory, or use a Version Control repository.
In practice, I’d suggest you use either a New Directory or Version Control depending on whether or not you want to sync your scripts to GitHub.
Once you have created a Project, you can either open it from the File Menu or by opening the .RProj file in the project directory root.
Sample Folder Structure:
Some workflow management packages:
A concern with any type of analysis project is that over time, the analysis environment can change – making it harder to reproduce results.
Version control is a means to track changes made to files
Version control allows you to:
The most popular software for managing version control is Git.
There’s a good chance you’ve at least seen GitHub before, which is an online remote location for storing version control repositories.
The Git software, however, is equally adept at managing version control on your local computer.
Once Git is installed (and recognized in RStudio), you can use Projects to perform local version control.
In File – New Project – New Directory, once you have Git installed there is a checkbox that can be selected to enable to “Create a Git repository”.
A repository is a location that Git uses to track the changes of a file or folder, here the project folder. The Git repository is stored as a folder named “.git” in the project root.
Creating a Project in this manner with a Git repository will enable version control on your local computer.
In addition to local version control, you can also back up the history of file changes to online repositories using GitHub.
GitHub connects to the local Git repository on your computer, “pushing” and “pulling” changes between the local and remote repositories.
This also allows for easy collaboration on coding projects, as multiple people can sync files by connecting to the remote repository.
With a GitHub account, you can create a new online repository by clicking the “+” icon in the top right of a GitHub page, and then clicking “New Repository”.
From there, you need to:
Supply GitHub with a repository name (think folder name)
Choose whether or not the repository should be public or private (ie whether or not you want other people to be able to visit your GitHub page and view the repository).
Click on the checkbox to enable “Initialize this repository with a README”.
Once you’ve created an online repository, Projects once again allows you to easily connect RStudio with the repository.
To setup a project for use with GitHub, create a New Project and select Version Control instead of New Directory.
From there, simply choose “Git” and then copy the url of the repository from GitHub into RStudio.
Once you have a Project setup with version control, the first key component of tracking changes is “Committing” them to the repository
You can commit changes by going to the “Git” tab in the upper right-hand side of the RStudio IDE.
In the Git tab, any files that have changed since the last commit are listed. From there, click on the files you’d like to commit and click on the commit button.
A Commit box appears which shows you the changes since the last revision and asks for a commit message, where you should very briefly describe the changes.
If you are just tracking changes with a local repository, commit is sufficient to manage version control.
But if you are using version control with a remote (ie online) repository, you will need two other steps to make sure changes are synced between the local repository and the online one.
To send changes made locally to the online repository, after committing changes click on "Push."
To sync changes from the online repository to local files, click on “Pull”.
To view previous versions of the files (along with annotations supplied with the commit message), click on the clock icon in the Git pane.
From there, you can see not only a “difference” view of the file changes, but you can also open the document exactly how it was written in a previous commit.
From there, if you wanted to revert changes, you could explicitly revert the file with Git, or simply copy over the file with code from the previous commit — my preferred method of reverting changes.
The basic method of performing a linear regression in R is to use the lm() function. But many more results are available if you save the results to a regression output object, which can then be accessed using the summary() function. Syntax:
myregobject <- lm(y ~ x1 + x2 + x3 + x4,
data = mydataset)
lm(expenditures ~ educ_ref, data=cex_data)
##
## Call:
## lm(formula = expenditures ~ educ_ref, data = cex_data)
##
## Coefficients:
## (Intercept) educ_ref
## -641.1 109.3
cex_linreg <- lm(expenditures ~ educ_ref,
data=cex_data)
summary(cex_linreg)
##
## Call:
## lm(formula = expenditures ~ educ_ref, data = cex_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -541109 -899 -690 -506 965001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -641.062 97.866 -6.55 5.75e-11 ***
## educ_ref 109.350 7.137 15.32 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7024 on 305970 degrees of freedom
## (75769 observations deleted due to missingness)
## Multiple R-squared: 0.0007666, Adjusted R-squared: 0.0007634
## F-statistic: 234.7 on 1 and 305970 DF, p-value: < 2.2e-16
With the tidy() function from the broom package, you can easily create standard regression output tables.
library(broom)
tidy(cex_linreg)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -641. 97.9 -6.55 5.75e-11
## 2 educ_ref 109. 7.14 15.3 5.75e-53
Another really good option for creating compelling regression and summary output tables is the stargazer package.
# From console: install.packages("stargazer")
library(stargazer)
stargazer(cex_linreg, header=FALSE, type='html')
|                     | Dependent variable:          |
|---------------------|------------------------------|
|                     | expenditures                 |
| educ_ref            | 109.350***                   |
|                     | (7.137)                      |
| Constant            | -641.062***                  |
|                     | (97.866)                     |
| Observations        | 305,972                      |
| R2                  | 0.001                        |
| Adjusted R2         | 0.001                        |
| Residual Std. Error | 7,024.151 (df = 305970)      |
| F Statistic         | 234.746*** (df = 1; 305970)  |
| Note:               | *p<0.1; **p<0.05; ***p<0.01  |
Including interaction terms and indicator variables in R is very easy.
Including any variables coded as factors (ie categorical variables) will automatically include indicators for each value of the factor.
To specify interaction terms, just specify varX1*varX2.
To specify higher order terms, write it mathematically inside of I().
Example:
wages_reg <- lm(wage ~ schooling + sex +
schooling*sex + I(exper^2), data=wages)
tidy(wages_reg)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.05 0.611 -3.36 7.88e- 4
## 2 schooling 0.567 0.0501 11.3 3.30e-29
## 3 sexmale -0.326 0.779 -0.418 6.76e- 1
## 4 I(exper^2) 0.00752 0.00144 5.21 2.03e- 7
## 5 schooling:sexmale 0.143 0.0660 2.17 3.01e- 2
By default, when including factors in an R regression, the first level of the factor is treated as the omitted reference group; you can change the reference level with relevel().
Example:
wages$sex <- wages$sex %>% relevel(ref="male")
wagereg2 <- lm(wage ~ sex, data=wages); tidy(wagereg2)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.31 0.0775 81.5 0.
## 2 sexfemale -1.17 0.112 -10.4 6.71e-25
A couple of useful data elements that are created with a regression output object are fitted values and residuals. You can easily access them as follows:
myresiduals <- residuals(myreg)
myfittedvalues <- fitted(myreg)
The main package for specification testing of linear regressions in R is the lmtest package. With it, you can test for (among other things) heteroskedasticity, functional form misspecification, and serial correlation.
All of the tests covered here are from the lmtest package. As usual, you need to install and initialize the package:
## In the console: install.packages("lmtest")
library(lmtest)
Testing for heteroskedasticity in R can be done by applying the bptest() function from the lmtest package to the model object.
bptest(wages_reg)
##
## studentized Breusch-Pagan test
##
## data: wages_reg
## BP = 22.974, df = 4, p-value = 0.0001282
# For the original (non-studentized) Breusch-Pagan test:
bptest(myreg, studentize = FALSE)
You can also perform the White Test of Heteroskedasticity using bptest() by manually specifying the regressors of the auxiliary regression inside of bptest():
bptest(myreg, ~ x1 + x2 + x1*x2 + I(x1^2) +
I(x2^2), data=mydata)
The Ramsey RESET Test tests functional form by evaluating if higher order terms have any explanatory value.
resettest(wages_reg)
##
## RESET test
##
## data: wages_reg
## RESET = 7.1486, df1 = 2, df2 = 3287, p-value = 0.0007983
The Breusch-Godfrey test [bgtest()] and the Durbin-Watson test [dwtest()] test for serial correlation in the errors:
bgtest(wages_reg)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: wages_reg
## LM test = 7.0938, df = 1, p-value = 0.007735
dwtest(wages_reg)
##
## Durbin-Watson test
##
## data: wages_reg
## DW = 1.9073, p-value = 0.003489
## alternative hypothesis: true autocorrelation is greater than 0
In practice, errors should almost always be specified in a manner that is heteroskedasticity and autocorrelation consistent.
In R, you do this by explicitly specifying the variance structure.
The sandwich package allows for specification of heteroskedasticity-robust, cluster-robust, and heteroskedasticity- and autocorrelation-robust (HAC) error structures.
These can then be used with t-tests [coeftest()] and F-tests [waldtest()] from lmtest.
\(HC_1\) Errors (MacKinnon and White, 1985): \(\Sigma = \frac{n}{n-k}\,\mathrm{diag}\{\hat{u}_i^2\}\)
\(HC_3\) Errors (Davidson and MacKinnon, 1993): \(\Sigma = \mathrm{diag}\Big\{\big(\frac{\hat{u}_i}{1-h_i}\big)^2\Big\}\)
Approximation of the jackknife covariance estimator
Recommended in some studies over \(HC_1\) because it is better at keeping nominal size with only a small loss of power in the presence of heteroskedasticity.
library(sandwich)
cex_reg <- lm(expenditures ~ hh_size + educ_ref +
                region, data=cex_data)
tidy(coeftest(cex_reg, vcov =
                vcovHC(cex_reg, type="HC1")))
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -553. 94.1 -5.88 4.13e- 9
## 2 hh_size -298. 14.3 -20.9 4.33e-97
## 3 educ_ref 109. 7.19 15.2 2.57e-52
## 4 region 83.2 15.3 5.44 5.21e- 8
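If you prefer the \(HC_3\) errors described above, a minimal sketch just swaps the type argument for the same cex_reg model:
# HC3 (approximate jackknife) standard errors for the same model
tidy(coeftest(cex_reg, vcov = vcovHC(cex_reg, type = "HC3")))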
In linear regressions where the regressand and regressors are in “levels”, the coefficients are of course equal to the marginal effects.
But if the regression is nonlinear, or a regressor enters in logs or quadratics, then marginal effects may be more relevant than the raw coefficients.
You can use the package margins to get marginal effects.
# install.packages("margins")
library(margins)
We can get the Average Marginal Effects by using summary with margins:
summary(margins(wages_reg))
## factor AME SE z p lower upper
## exper 0.1209 0.0232 5.2072 0.0000 0.0754 0.1664
## schooling 0.6422 0.0334 19.2256 0.0000 0.5768 0.7077
## sexfemale -1.3391 0.1077 -12.4298 0.0000 -1.5503 -1.1279
The package plm provides a wide variety of estimation methods and diagnostics for panel data.
We will cover two common panel data estimators, first-differences regression and fixed effects regression.
To estimate the first-differences estimator, use the plm() function in the plm package.
library(plm)
Syntax:
myreg <- plm(y ~ x1 + x2 + x3, data = mydata,
index=c("groupvar", "timevar"), model="fd")
Of course, in most cases fixed effects regression is a more efficient alternative to first-difference regression.
To use fixed effects regression, instead specify the argument model = “within”.
myreg <- plm(y ~ x1 + x2 + x3, data = mydata,
index=c("groupvar", "timevar"),
model="within", effect = "twoway")
crime_NC <- Crime %>% as.tibble() %>%
select(county, year, crmrte, polpc, region, smsa,
taxpc) %>% rename(crimerate=crmrte,
police_pc = polpc, urban=smsa, tax_pc=taxpc)
crime_NC[1:2,]
## # A tibble: 2 x 7
## county year crimerate police_pc region urban tax_pc
## <int> <int> <dbl> <dbl> <fct> <fct> <dbl>
## 1 1 81 0.0399 0.00179 central no 25.7
## 2 1 82 0.0383 0.00177 central no 24.9
crime_reg <- plm(crimerate ~ police_pc + tax_pc +
region + urban, data=crime_NC,
index=c("county", "year"), model="fd")
tidy(crime_reg)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.000129 0.000374 0.345 7.30e- 1
## 2 police_pc 2.06 0.200 10.3 6.84e-23
## 3 tax_pc 0.00000108 0.0000515 0.0210 9.83e- 1
crime_reg <- plm(crimerate ~ police_pc +
tax_pc + urban, data=crime_NC,
index=c("county", "year"),
model="within", effect="twoway")
tidy(crime_reg)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 police_pc 1.78 0.144 12.4 4.71e-31
## 2 tax_pc 0.0000627 0.0000450 1.39 1.65e- 1
The most popular function for doing IV regression is ivreg() in the AER package.
library(AER)
Syntax:
myivreg <- ivreg(y ~ x1 + x2 | z1 + z2 + z3,
data = mydata)
Three common diagnostic tests are available with the summary output for regression objects from ivreg().
Durbin-Wu-Hausman Test of Endogeneity: Tests for endogeneity of suspected endogenous regressor under assumption that instruments are exogenous.
F-Test of Weak Instruments: Typical rule-of-thumb value of 10 to avoid weak instruments, although you can compare against Stock and Yogo (2005) critical values for more precise guidance concerning statistical size and relative bias.
Sargan-Hansen Test of Overidentifying Restrictions: In overidentified case, tests if some instruments are endogenous under the initial assumption that all instruments are exogenous.
Let’s look at an IV regression from the seminal paper “The Colonial Origins of Comparative Development” by Acemoglu, Johnson, and Robinson (AER 2001).
col_origins <- import("./data/maketable5.dta") %>%
as.tibble() %>% filter(baseco==1) %>%
select(logpgp95, avexpr, logem4, shortnam) %>%
rename(logGDP95 = logpgp95, country = shortnam,
legalprotect = avexpr, log.settler.mort = logem4)
col_origins_iv <- ivreg(logGDP95 ~ legalprotect |
log.settler.mort, data = col_origins)
Estimates:
IVsummary <- summary(col_origins_iv, diagnostics = TRUE)
IVsummary["coefficients"]
## $coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9096665 1.0267273 1.859955 6.763720e-02
## legalprotect 0.9442794 0.1565255 6.032753 9.798645e-08
Diagnostics:
IVsummary["diagnostics"]
## $diagnostics
## df1 df2 statistic p-value
## Weak instruments 1 62 22.94680 1.076546e-05
## Wu-Hausman 1 61 24.21962 6.852437e-06
## Sargan 0 NA NA NA
Some useful functions for nonlinear regression include:
Limited Dependent Variable Models:
These models, such as logit and probit (binary choice) or Poisson (count models), are incorporated in R as specific cases of a generalized linear model (GLM); see the glm() sketch after this list.
Regression Discontinuity:
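Here is a minimal sketch of fitting these GLMs with base R’s glm() function, assuming a data frame mydata with a binary outcome y, a count outcome counts, and regressors x1 and x2 (all hypothetical names):
# Logit model for a binary outcome
mylogit <- glm(y ~ x1 + x2, data = mydata, family = binomial(link = "logit"))
summary(mylogit)
# Probit: same call, different link
myprobit <- glm(y ~ x1 + x2, data = mydata, family = binomial(link = "probit"))
# Poisson count model
mypoisson <- glm(counts ~ x1 + x2, data = mydata, family = poisson)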
One of the strong points of R is creating very high-quality data visualization.
R is very good at both “static” data visualization and interactive data visualization designed for web use.
Today, I will be covering static data visualization, but here are a couple of good resources for interactive visualization: [1], [2]
The main package for publication-quality static data visualization in R is ggplot2, which is part of the tidyverse collection of packages.
The workhorse function of ggplot2 is ggplot(), responsible for creating a very wide variety of graphs.
The “gg” stands for “grammar of graphics”. In each ggplot() call, the appearance of the graph is determined by specifying:
mygraph <- ggplot(mydata, aes(...)) + geom(...) + ...
First, let’s look at a simple scatterplot, which is defined by using the geometry geom_point().
ggplot(col_origins, aes(x=legalprotect,
y = logGDP95)) + geom_point()
Graphs can be extensively customized using additional arguments inside of elements:
ggplot(col_origins, aes(x=legalprotect,y = logGDP95)) +
geom_point(aes(size=logGDP95))
Instead of using a scatter plot, we could use the names of the data points in place of the dots.
ggplot(col_origins,
aes(x=legalprotect, y = logGDP95,
label=country)) + geom_text()
A line graph uses the geometry geom_line().
ggplot(col_origins, aes(x=legalprotect, y = logGDP95)) +
geom_line()
A more useful line is the fitted values from the regression. Here’s a plot of that line with the points from the scatterplot for the Acemoglu IV:
col_origins_fitted <- tibble(col_origins$legalprotect,
fitted(col_origins_iv))
colnames(col_origins_fitted) <- c("legalprotect", "hat")
ggplot(col_origins, aes(x=legalprotect, y = logGDP95)) +
geom_point(color="red") +
geom_line(data = col_origins_fitted, aes(x=legalprotect, y=hat))
A standard task in making the graph is specifying graph titles (main and axes), as well as potentially specifying the scale of the axes.
ggplot(col_origins, aes(x=legalprotect, y = logGDP95)) +
geom_point(color="red") +
geom_line(data = col_origins_fitted, aes(x=legalprotect, y=hat)) +
ggtitle("GDP and Legal Protection") +
xlab("Legal Protection Index [0-10]") +
ylab("Log of 1995 GDP") +
xlim(0, 10) + ylim(5,10)
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 5 rows containing missing values (geom_path).
The geometry for a histogram is geom_histogram().
ggplot(col_origins, aes(x=legalprotect)) +
geom_histogram() +
ggtitle("Histogram of Legal Protection Scores") +
xlab("Legal Protection Index [0-10]") +
ylab("# of Countries")
The geometry for a bar plot is geom_bar(). By default, a bar plot uses frequencies for its values, but you can use values from a column by specifying stat = “identity” inside geom_bar().
coeffs_IV <- tidy(col_origins_iv)
ggplot(coeffs_IV,
aes(x=term, y=estimate)) +
geom_bar(stat = "identity") +
ggtitle("Parameter Estimates for Colonial Origins") +
xlab("Parameter") + ylab("Estimate")
You can easily add error bars by specifying the values for the error bar inside of geom_errorbar().
ggplot(coeffs_IV,
aes(x=term, y=estimate)) +
geom_bar(stat = "identity") +
ggtitle("Parameter Estimates for Colonial Origins") +
xlab("Parameter") + ylab("Estimate") +
geom_errorbar(aes(ymin=estimate - 1.96 * std.error,
ymax=estimate + 1.96 * std.error),
size=.75, width=.3, color="red3")
You can easily add color to graph points as well. There are a lot of aesthetic options to do that — here I demonstrate adding a color scale to the graph.
ggplot(col_origins, aes(x=legalprotect,
y = logGDP95 , col= log.settler.mort)) + geom_point() +
ggtitle("GDP and Legal Protection") +
xlab("Legal Protection Index [0-10]") +
ylab("Log of 1995 GDP") +
scale_color_gradient(low="green",high="red",
name="Log Settler Mortality")
ggplot(col_origins, aes(x=legalprotect)) +
geom_histogram(col="black", fill="red2") +
ggtitle("Histogram of Legal Protection Scores") +
xlab("Legal Protection Index [0-10]") +
ylab("# of Countries")
Another option to affect the appearance of the graph is to use themes, which affect a number of general aspects concerning how graphs are displayed.
library(ggthemes)
ggplot(col_origins, aes(x=legalprotect)) +
geom_histogram(col="white", fill="red2") +
ggtitle("Histogram of Legal Protection Scores") +
xlab("Legal Protection Index [0-10]") +
ylab("# of Countries") +
theme_economist()
This has been just a small overview of things you can do with ggplot2. To learn more about it, here are some useful references:
RStudio’s ggplot2 cheat sheet:
Imagine I’m conducting a raffle. I have a large bowl, and inside of it are many raffle entries, each with different numbers on them.
Let the variable x represent the number drawn from the bowl.
Before reaching into the bowl, I want to know:
“What is the probability that I’m going to pick up a number less than 10 from the bowl?”
That is, what is \(p(x \leq 10)\)?
Classical Statistics: No.
Probability is defined as: \(\lim_{n \rightarrow \infty} \frac{n_S}{n}\), where \(n\) is the number of times I repeat the experiment and \(n_S\) is the number of times a number \(x \leq 10\) is drawn.
\(p(x \leq 10)\) was always equal to zero, all that changed was your knowledge of the experiment.
Bayesian Statistics: Almost certainly.
Probability is a measure of subjective belief about how likely an event is, based on prior understanding and new information.
Prior \(\rightarrow\) Information \(\rightarrow\) Posterior
Bayesian statistics integrates the epistemological uncertainty of statistical estimation into its core procedures. Its fundamental goal is to assess and improve the accuracy of one’s beliefs based on a set of identifying statistical assumptions.
Classical statistics attempts to instead conduct inference on the (unknown) underlying reality, based on its own set of identifying assumptions.
NOT mutually exclusive
The cornerstone of the Bayesian approach (and the source of its name) is the conditional likelihood theorem known as Bayes’ rule.
In its simplest form, Bayes’ Rule states that for two events A and B (with \(P(B) \neq 0\)): \[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
Or, if A can take on multiple values, we have the extended form:
\[ p(A_i|B) = \frac{p(B | A_i) P(A_i)}{\sum_j P(B|A_j)P(A_j)} \]
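As a quick numeric illustration (all probabilities here are made up for the example), the extended form can be computed directly in R:
# Hypothetical prior probabilities for three events A_1, A_2, A_3
prior <- c(0.5, 0.3, 0.2)
# Hypothetical likelihoods P(B | A_i)
likelihood <- c(0.1, 0.4, 0.7)
# Posterior P(A_i | B) from the extended form of Bayes' Rule
posterior <- likelihood * prior / sum(likelihood * prior)
posterior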
Adapting Bayes’ Rule to the case of a parameter value, \(\theta\) and observed data y, we have:
\[ p(\theta \mid y) = \frac{f(\theta, y)}{f(y)} = \frac{f(y\mid\theta)p(\theta)}{ \int f(y \mid \theta) p(\theta) d\theta} \underbrace{\propto}_{\text{proportional to}} f(y|\theta)p(\theta) \]
Adding a bit of terminology:
\(p(\theta)\) is our prior (pre-experiment) belief about the probability distribution of \(\theta\).
\(f(y|\theta)\) you may recognize from Maximum Likelihood estimation as the likelihood function.
Finally, \(p(\theta \mid y)\) is our posterior (post-experiment) belief about the probability distribution of \(\theta\).
\[ \overbrace{p(\theta \mid y)}^{\text{posterior}} = \frac{\overbrace{f(y\mid\theta)}^\text{likelihood} \overbrace{p(\theta)}^\text{prior}}{\underbrace{\int f(y \mid \theta) p(\theta) d\theta}_\text{average likelihood}} \underbrace{\propto}_{\text{proportional to}} \overbrace{f(y|\theta)}^{\text{likelihood}} \overbrace{p(\theta)}^{\text{prior}}\]
Hence we have the basic statement:
\[ Posterior \propto Likelihood \times Prior \]
Because posterior estimates depend partly on prior beliefs, Bayesian analysis is often attacked as subjective, and a lot of emphasis is placed on the role of priors. But fear not!
First, as the data sample increases, the data becomes the determining factor of the posterior probability.
Moreover, if desired, we can easily specify priors that have no effect on the posterior estimates.
Different types of priors include:
Uninformative (or “flat”) priors: Priors that have no impact on posterior values (ie assuming total ignorance about the possible parameter value).
Informative priors: Priors where we use prior knowledge to specify a prior with a best-guess of the prior mean and distribution for a parameter value.
Weakly informative or regularizing priors: Priors which are deliberately less informative than our actual knowledge, affecting estimates less than informative priors but at least incorporating very conservative information into the production of posterior estimates.
Think for instance of inference about possible life expectancy:
We could specify an uninformative prior that allows for any possible lifespan (even living to 1 million years old).
We could specify some skewed normal distribution that approximates our current estimates of the distribution of current lifespans.
Or we could specify some weaker prior that, for instance, allows life expectancy to be more dispersed and ages to reach, say, 150 or 200 years old.
In almost all circumstances, a weak prior should be preferred.
Hypothesis Testing in Bayesian Analysis?
Another point of divergence for Bayesian vs. frequentist data analysis is even more dramatic:
Largely, there is no place for null-hypothesis significance testing (NHST) in Bayesian analysis
Bayesian analysis has something similar, called a Bayes factor, which essentially assigns a prior probability to the likelihood ratio of a null and alternative model and then estimates its posterior probability.
But Bayes factors are heavily criticized by leading Bayesians like Andrew Gelman and Donald Rubin, because they are highly sensitive to prior probabilities and model structures, among other issues.
Instead, analysis is oriented around estimation of the posterior distribution of parameters (or predictions).
Without NHST to tell us if our results are significant, does that mean we just get point estimates and no way to assess how reliable they are? No!
Recall that we are estimating the posterior distribution of a parameter (or predicted outcome).
Hence, we can easily produce a 95% interval for the parameter, simply using the quantiles of the posterior CDF.
In Bayesian analysis, we replace the 100(\(1-\alpha\))% frequentist confidence interval with the 100(\(1-\alpha\))% credible interval.
A credible interval, however, has a much more appealing interpretation than a confidence interval.
A confidence interval has the interpretation that, in repeated samples, the true parameter lies within the confidence region 95% of the time.
A credible interval is what people think a confidence interval should mean: there is a 95% chance that the true value lies within the 95% credible interval.
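As a minimal sketch (using simulated draws as a stand-in for real MCMC output), the 95% credible interval is just the 2.5% and 97.5% quantiles of the posterior draws:
# Stand-in for posterior draws of a parameter (hypothetical values)
theta_draws <- rnorm(4000, mean = 1.2, sd = 0.3)
# 95% credible interval from posterior quantiles
quantile(theta_draws, probs = c(0.025, 0.975))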
One major feature of Bayesian inference that I haven’t mentioned so far is the intractability of analytic solutions for estimating posterior distributions in most circumstances.
Recall: \[ p(\theta \mid y) = \frac{f(\theta, y)}{f(y)} = \frac{f(y\mid\theta)p(\theta)}{ \int f(y \mid \theta) p(\theta) d\theta} \]
For models that are more complex or that involve high-dimensional data, closed-form solutions are not available for the integral in the denominator.
Hence, Bayesian analysis instead typically relies on numerical methods, usually Markov Chain Monte Carlo (MCMC) methods.
This method relies on sequentially sampling values of \(\theta\) from an approximation of the posterior distribution and then correcting the approximation to create better subsequent samples.
Because the approximated distribution used in one step relies on the sample from the previous step, the simulation forms a Markov chain.
A critical property then is convergence: Have our simulations converged to the real target distribution?
Probably the best approach to doing Bayesian analysis in any software environment is with rstan, which is an R interface to the Stan programming language designed for Bayesian analysis.
To use rstan, you will first need to install RTools from this link.
Then install the package rstan from RStudio (make sure to set dependencies=TRUE when installing).
Once you’ve set up rstan, there is one more thing you should typically do: tell it to run on multiple cores.
library(rstan)
rstan_options (auto_write=TRUE)
# Run on multiple cores
options (mc.cores=parallel::detectCores ())
Now that you’re hopefully excited about rstan, let’s look at an example of a rstan regression from the package documentation:
# Create schools.stan --------------------------
data {
int<lower=0> J; // number of schools
real y[J]; // estimated treatment effects
real<lower=0> sigma[J]; // s.e. of effect estimates
}
parameters {
real mu;
real<lower=0> tau;
vector[J] eta;
}
# Still creating schools.stan --------------------------
transformed parameters {
vector[J] theta;
theta = mu + tau * eta;
}
model {
target += normal_lpdf(eta | 0, 1);
target += normal_lpdf(y | theta, sigma);
}
# End of schools.stan --------------------------
## Run Stan regression using schools.stan
fit1 <- stan(
file = "schools.stan", # Stan program
data = schools_data, # named list of data
chains = 4, # number of Markov chains
warmup = 1000, # number of warmup iterations per chain
iter = 2000, # total number of iterations per chain
cores = 2, # number of cores (using 2 just for the vignette)
refresh = 1000 # show progress every 'refresh' iterations
)
Oops! That was a lot of code! And there’s still a bit more left out. Luckily, there are a couple of packages that exist to make your life easier:
First, there is rstanarm, which was created by the developers of Stan and rstan to make running a Bayesian regression with rstan much more like you would run a normal frequentist regression.
Another very similar package to rstanarm is brms, which also makes running Bayesian regression much simpler and ‘R-like’.
rstanarm is faster, has better posterior checking, and is a bit simpler to use.
brms is generally a bit more flexible, with support for some regression types missing in rstanarm, more flexible specification of priors, and support for more types of error correlational structures.
With rstanarm, you can run most regressions using the function stan_glm().
Syntax:
mybayesreg <- stan_glm(y ~ x1 + x2 + x3 ...,
                       family = myfamily, data = mydata,
                       prior = myprior)
The family argument (with a possible link argument needed as well) defines the type of regression you want:
family = gaussian (linear regression)
family = binomial(link = "logit") (logit model)
family = binomial(link = "probit") (probit model)
family = poisson (Poisson count model)
Prior distributions are specified with the prior argument:
prior = NULL gives flat (uninformative) priors.
[Weakly] informative priors can be specified by setting prior = to one of rstanarm’s prior distribution functions, such as normal().
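For instance, here is a sketch with hypothetical variable names that puts weakly informative normal priors on the slopes and the intercept:
# Hypothetical example: weakly informative priors in stan_glm
mybayesreg <- stan_glm(y ~ x1 + x2, data = mydata,
                       family = gaussian,
                       prior = normal(0, 2.5),         # slopes
                       prior_intercept = normal(0, 5)) # intercept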
data(Titanic)
# Display titanic data
glimpse(Titanic,width = 50)
## Observations: 1,313
## Variables: 6
## $ Name <fct> "Allen, Miss Elisabeth Walton"…
## $ PClass <fct> 1st, 1st, 1st, 1st, 1st, 1st, …
## $ Age <dbl> 29.00, 2.00, 30.00, 25.00, 0.9…
## $ Sex <fct> female, female, male, female, …
## $ Survived <int> 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, …
## $ SexCode <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, …
# Reformat Class
Titanic$class <- str_extract(Titanic$PClass, "[0-9]")
TitanicLinear <- stan_glm(Survived ~ Age +
SexCode + as.factor(class),
data = Titanic, family = gaussian)
summary(TitanicLinear)
##
## Model Info:
##
## function: stan_glm
## family: gaussian [identity]
## formula: Survived ~ Age + SexCode + as.factor(class)
## algorithm: sampling
## priors: see help('prior_summary')
## sample: 4000 (posterior sample size)
## observations: 756
## predictors: 5
##
## Estimates:
## mean sd 2.5% 25% 50% 75% 97.5%
## (Intercept) 0.6 0.1 0.5 0.6 0.6 0.7 0.7
## Age 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## SexCode 0.5 0.0 0.4 0.5 0.5 0.5 0.6
## as.factor(class)2 -0.2 0.0 -0.3 -0.2 -0.2 -0.2 -0.1
## as.factor(class)3 -0.4 0.0 -0.5 -0.4 -0.4 -0.4 -0.3
## sigma 0.4 0.0 0.4 0.4 0.4 0.4 0.4
## mean_PPD 0.4 0.0 0.4 0.4 0.4 0.4 0.5
## log-posterior -364.8 1.7 -369.0 -365.8 -364.4 -363.5 -362.4
##
## Diagnostics:
## mcse Rhat n_eff
## (Intercept) 0.0 1.0 2196
## Age 0.0 1.0 3240
## SexCode 0.0 1.0 3687
## as.factor(class)2 0.0 1.0 2403
## as.factor(class)3 0.0 1.0 2451
## sigma 0.0 1.0 3833
## mean_PPD 0.0 1.0 3875
## log-posterior 0.0 1.0 1852
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
You can also easily print the credible intervals with the function posterior_interval():
posterior_interval(TitanicLinear, prob=0.95)
## 2.5% 97.5%
## (Intercept) 0.522632399 0.731806400
## Age -0.008119899 -0.003881513
## SexCode 0.443890662 0.559454186
## as.factor(class)2 -0.281056766 -0.130564031
## as.factor(class)3 -0.466080809 -0.322007081
## sigma 0.369957877 0.409149110
plot(TitanicLinear)
You can also easily plot the posterior distribution of a parameter in R.
Titanic_posterior <- TitanicLinear %>% as_tibble() %>%
rename(sec.class = "as.factor(class)2",
third.class = "as.factor(class)3")
ggplot(Titanic_posterior, aes(x=third.class)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
posterior_vs_prior(TitanicLinear)
##
## Drawing from prior...
There are a number of different regression diagnostics after performing Bayesian regression to help infer if the model converged, how well it performs, and even compare between models.
Today, we’ll cover some of them included with rstanarm as well as the very useful shinystan package.
To check the predictive accuracy of the posterior distribution, you can use the function pp_check(), which plots simulated y values from the posterior distribution against the actual values of y.
pp_check(TitanicLinear)
A critical issue in both Bayesian and frequentist estimation is how to balance predictive accuracy with parsimony. Put another way, the researcher should be concerned with not overfitting the data while still creating a compelling model.
The basic approach in frequentist methods is to use the Akaike information criterion (AIC):
Expected Log Predictive Density: \[\hat{elpd}_{AIC} = \log p(y | \hat{\theta}_{MLE}) - k\]
The most basic Bayesian adaptation of the AIC is the Deviance information criterion (DIC):
\[\hat{elpd}_{DIC} = \log p(y | \hat{\theta}_{Bayes}) - p_{DIC}\]
An improvement over the DIC is the Watanabe-Akaike information criterion:
\[\hat{elpd}_{WAIC} = \sum_{i=1}^{n} \log E_{post}\big[p(y_i \mid \theta)\big] - \sum_{i=1}^{n} V_{post} \big[\log p(y_i \mid \theta)\big]\]
The WAIC has the advantages of:
It averages the likelihood over the posterior distribution rather than evaluating it at the posterior mean.
It does not assume a multivariate Gaussian posterior distribution, as the DIC (and AIC) do.
waic(TitanicLinear)
##
## Computed from 4000 by 756 log-likelihood matrix
##
## Estimate SE
## elpd_waic -360.4 19.4
## p_waic 6.0 0.3
## waic 720.9 38.7
Another method alongside WAIC for comparing out-of-sample predictive ability is to apply leave-one-out cross-validation (LOO).
Expected Log Predictive Density:
\[\hat{elpd}_{LOO} = \sum_{i=1}^{n} \log p(y_i | y_{-i})\]
loo(TitanicLinear)
##
## Computed from 4000 by 756 log-likelihood matrix
##
## Estimate SE
## elpd_loo -360.5 19.4
## p_loo 6.0 0.3
## looic 720.9 38.7
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
Titanic_probit <- stan_glm(Survived ~ Age +
SexCode + as.factor(class),
data = Titanic, family = binomial(link=probit))
Loo_probit <- loo(Titanic_probit)
Titanic_logit <- stan_glm(Survived ~ Age +
SexCode + as.factor(class),
data = Titanic, family = binomial(link=logit))
Loo_logit <- loo(Titanic_logit)
# ELPD_diff>0 indicates more support for 2nd model
compare_models(Loo_probit, Loo_logit)
##
## Model comparison:
## (negative 'elpd_diff' favors 1st model, positive favors 2nd)
##
## elpd_diff se
## 1.4 0.8
Probably the most popular diagnostic for Bayesian regression in R is the functionality from the shinystan package.
# Do in console not RMarkdown
launch_shinystan(TitanicLinear)
Comments
To create a comment in R, use a hash (#). For example (a minimal illustration):
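# This is a comment: R ignores everything on the line after the #
x <- 5  # comments can also follow code on the same line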