Introduction to Data Analysis in R

Course Description

This course is an introduction to the R statistical programming language, focusing on the essential skills needed for data analysis, from data entry through preparation and analysis to presentation. During the course, you will learn not only basic R functionality, but also how to leverage the extensive community-driven package ecosystem and how to write your own functions in R.

Course content is divided into 7 seminars. Each covers one content module, with the final seminar serving as a review. Seminar length may vary from module to module but should generally be under 3 hours. Roughly the first hour will be used to introduce new material, while the remainder of the time will be spent on hands-on practice. The content for each module will be posted the day before the seminar, so that you can familiarize yourself with the material ahead of time if you like (though this isn’t required).

Course Content

Module 1: Introduction to Base R Environment

This module introduces the R programming language and the RStudio software. Topics include basic operations and data object types, especially vectors, matrices, and data frames.
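
As a rough illustration of the three object types named above, here is a minimal base R sketch. The variable names and values are invented for this example and are not taken from the course materials.

```r
# A numeric vector; arithmetic is applied element by element
heights_cm <- c(172, 165, 180)
heights_m <- heights_cm / 100

# A 2 x 3 matrix, filled column by column (the default)
m <- matrix(1:6, nrow = 2)

# A data frame: columns of equal length that may differ in type
# (illustrative data only)
people <- data.frame(
  name = c("Ann", "Bo", "Cy"),
  height_cm = heights_cm,
  stringsAsFactors = FALSE
)

# Select rows with a logical condition and a single column by name
tall <- people[people$height_cm > 170, "name"]
print(tall)
```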

Lecture Notes: Webpage Slides
Seminar Exercises: Setup Instructions DataCamp Exercises

Module 2: Data Preparation Using the Tidyverse

This module introduces a series of tools for data manipulation/preparation collectively known as the “Tidyverse.” Specifically, this module covers how to subset data, arrange it, transform it, and aggregate it. Students will also learn convenient tools to import and export data.
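
The four operations listed above map roughly onto dplyr verbs. The sketch below assumes the dplyr package is installed and uses R's built-in `mtcars` data rather than the course data sets; the column `kpl` and the conversion factor are illustrative choices.

```r
library(dplyr)

summary_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                     # subset rows
  select(mpg, cyl, hp) %>%                 # subset columns
  mutate(kpl = mpg * 0.425) %>%            # transform: approx. miles/gallon to km/litre
  group_by(cyl) %>%                        # aggregate by number of cylinders
  summarise(mean_kpl = mean(kpl), n = n()) %>%
  arrange(desc(mean_kpl))                  # arrange, highest average first

print(summary_by_cyl)
```

Importing and exporting would typically use functions such as `readr::read_csv()` and `readr::write_csv()`, which are omitted here to keep the example free of external files.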

Lecture Notes: Webpage Slides
Seminar Exercises: Exercise nlsy97.zip Solutions (PDF) Solutions (R Script)
Seminar Exercises 2018: Exercise Solutions (PDF) Solutions (R Script)

Module 3: Programming, Joining Data, and More

This module introduces more advanced programming techniques for adapting R functionality to your own specific problems. Contents include how to perform loops, use conditional statements, and write basic functions. In addition, this module will cover how to join data sets in R using Tidyverse functions, manipulate strings, and scrape tables from the web.
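
To give a flavor of the programming constructs mentioned above, here is a small base R sketch combining a function, a conditional statement, and a loop. The function name `parity` is hypothetical and chosen only for this illustration.

```r
# Hypothetical helper: label a whole number as "even" or "odd"
parity <- function(x) {
  if (x %% 2 == 0) {
    "even"
  } else {
    "odd"
  }
}

# Loop over 1 to 4, storing each label in a preallocated character vector
labels <- character(4)
for (i in 1:4) {
  labels[i] <- parity(i)
}
print(labels)
```

In practice, a loop like this is often replaced by a vectorized call such as `ifelse()` or by `vapply()`, but the explicit loop shows the control flow most directly.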

Lecture Notes: Webpage Slides
Seminar Exercises: Exercise Solutions (PDF) Solutions (R Script)
Seminar Exercises 2018: Exercise Solutions (PDF) Solutions (R Script)

Module 4: Project Management and Dynamic Documents

This module provides a few major enhancements to the workflow of data analysis in R. First, knitr and R Markdown are introduced as a means of creating dynamic reports from R in a variety of formats, such as HTML pages, PDF documents, and Beamer presentations. Then, RStudio Projects are introduced as a means of organizing folders for empirical projects. Finally, Git and GitHub are introduced for version control.

Lecture Notes: Webpage Slides
Seminar Exercises: Exercise Solutions (PDF) Solutions (Beamer) Solutions (HTML) Solutions (Rmd)

Module 5: Regression Analysis and Data Visualization in R

This module introduces standard linear regression in R, along with common diagnostics and post-estimation procedures. In addition, further methods of regression analysis are covered, with special emphasis on methods for panel data and instrumental variables. Finally, the ggplot2 package is introduced as a means of creating compelling graphs in R.
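
A minimal sketch of a linear regression workflow in base R follows. It uses the built-in `mtcars` data and an arbitrary model specification, neither of which comes from the course exercises.

```r
# Fit a linear model of fuel economy on weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table, standard errors, and R-squared
print(summary(fit))

# Post-estimation: confidence intervals and a prediction for a
# hypothetical car (wt is in 1000 lbs, hp in gross horsepower)
ci <- confint(fit)
new_car <- data.frame(wt = 3, hp = 120)
pred <- predict(fit, newdata = new_car)
print(pred)

# Simple diagnostic input: residuals against fitted values
diag_df <- data.frame(fitted = fitted(fit), resid = residuals(fit))
```

Plotting `diag_df$resid` against `diag_df$fitted` gives a basic residual check. Panel and instrumental variables methods typically rely on add-on packages and are not shown here.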

Lecture Notes: Webpage Slides
Seminar Exercises: Exercise Exercise - Part B nlsy97.rds Solutions (PDF) Solutions (Rmd)
Seminar Exercises 2018: Exercise wdi_data.rds Solutions (PDF) Solutions (Rmd)

Module 6: Introduction to Bayesian Methods in R

This module introduces the basic intuition of Bayesian statistical methods and how to perform Bayesian analysis in R, primarily using the rstanarm package.

Lecture Notes: Webpage Slides
Seminar Exercises: Exercise

Module 7: Review Seminar for Capstone Project

In this module, an extended empirical exercise is used to review the skills developed over the preceding seminars. The review will function as preparation for the capstone project, in which students individually replicate results from a recent economics paper.

Seminar Exercises: Exercise datanames.csv Solutions (PDF) Solutions (Rmd)

Capstone Project

For the capstone project, you will replicate results from “Intergenerational Mobility and Preferences for Redistribution” (AER 2018) by Alberto Alesina, Stefanie Stantcheva, and Edoardo Teso. The capstone project is due on March 28th.

Files: Instructions actual_probabilities.csv Solutions (HTML) Solutions (Rmd)

Contact:

Course Teacher: Andrew Proctor

Office: A 711 (Arrange by email or stop in if I’m there.)

Email: andrew.proctor@phdstudent.hhs.se