So far, when changing a data object, we have always been a bit repetitive:
mydataframe <- mydataframe %>% rename(NewVarName = OldVarName)
Along with the standard pipe (%>%), by loading the magrittr package, you can also use the so-called “assignment pipe” (%<>%).
mydataframe %<>% rename(NewVarName = OldVarName)
Another subtlety glossed over so far are lists.
As mentioned in module 1, vectors come in two forms: atomic vectors (with a single data type) and lists (with heterogenous data types).
Lists can take as inputs not only single-valued elements, but also vectors or data frames.
Creating a list from other objects is done with the list() function. The syntax of list is:
wages_df; date_df; description
## wage schooling sex exper
## 1 134.23058 13 female 8
## 2 249.67744 13 female 11
## 3 53.56478 10 female 11
## month year
## 1 Jan 2017
## 2 Feb 2018
## 3 March 2019
## [1] "Data on wages and date information."
mylist <- list(wages = wages_df, dates = date_df, description)
Where wages_df and date_df are data frames and description is a single character element.
To subset a vector/matrix/data frame, one uses single brackets, eg mydf[,].
To refer to an object of a list, use double brackets.
mylist[[3]]
## [1] "Data on wages and date information."
Note: The function list() does not take transfer the names of the data frames, so you will need to either subset by position or assign names to the list objects.
An easy way of extracting an object from a list is with the extract2() function from magrittr. This allows you to extra a given list object by name or position.
wage_data <- mylist %>% extract2(1)
wage_data
## wage schooling sex exper
## 1 134.23058 13 female 8
## 2 249.67744 13 female 11
## 3 53.56478 10 female 11
Instead of creating more complicated data objects, sometimes formatted as list into a simple (atomic) vector. The unlist() function does this.
Example:
simple_list <-list(1,2,3,4)
str(simple_list)
## List of 4
## $ : num 1
## $ : num 2
## $ : num 3
## $ : num 4
simple_list %<>% unlist() %>% str()
## num [1:4] 1 2 3 4
For tasks that you want to iterate over multiple data frames/variables/elements, you may want to think about creating a loop.
Syntax:
for(indexname in range) {
do stuff
}
for (i in 1:4){
print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
You can also loop over elements instead of values.
nlsy97 <- import("./data/nlsy97.rds")
factor.vars <- c("personid","year","sex","race","region","schooltype")
for (i in factor.vars){
nlsy97[,i] %<>% unlist() %>% as.factor()
}
For iterations over vectors and dataframes, the map() function is a great alternative to the for loop.
Map functions take a user-supplied function and iterate it over:
Elements for a vector
Objects of a list
Columns of a data frame
Map functions are much simpler to write than loops and are also generally a good bit faster.
Syntax:
map(data, fxn, option1, option2...)
Example:
nlsy97[,factor.vars] %<>% map(as.factor)
There are multiple map variants that enforce a given data type on results. You should use these whenever you want output of a certain class.
map_lgl for logical vector
map_dbl for numeric vector
map_chr for character vector
map_df for a data frame
nlsy.sub <- nlsy97 %>% select(parentincome, motheredyrs, gpa)
nlsy.sub %>% map_dbl(IQR, na.rm=TRUE)
## parentincome motheredyrs gpa
## 55947.2500 3.0000 2.8075
nlsy.sub %>% map(IQR, na.rm=TRUE)
## $parentincome
## [1] 55947.25
##
## $motheredyrs
## [1] 3
##
## $gpa
## [1] 2.8075
map() works with not only predefined functions, but also “anonymous functions”— unnamed functions defined inside of map().
# Create Z Transform
ztransform <- map_df(nlsy.sub, function(x)
(x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
)
### Did it work?
# Means
map_dbl(ztransform, function(x)
round(mean(x, na.rm=TRUE),10))
## parentincome motheredyrs gpa
## 0 0 0
# Standard deviations
map_dbl(ztransform, function(x)
round(sd(x, na.rm=TRUE),10))
## parentincome motheredyrs gpa
## 1 1 1
“If statements” are also a useful part of programming, either in conjunction with iteration or seperately.
Syntax
if(condition){
do stuff
}
In the for loop example, the loop was indexed over only the columns of indicator codes.
Equally, the loop could be done over all columns with an if-statement to change only the indicator codes.
for (j in colnames(nlsy97)){
if(j %in% factor.vars){
nlsy97[,j] %<>% unlist() %>% as.factor()
}
}
You can encompass several conditions using the else if and catch-all else control statements.
if (condition1) {
do stuff
} else if (condition2) {
do other stuff
} else {
do other other stuff
}
As alluded to earlier, if statements can’t test-and-do for vectors, but only single-valued objects.
Most of the time, you probably want to use conditional statements on vectors. The vector equivalent to the if statement is ifelse()
Syntax:
ifelse(condition, true_statement, false_statement)
The statements returned can be simple values, but they can also be functions or even further conditions. You can easily nest multiple ifelses if desired.
numbers <- sample(1:30, 7); numbers
## [1] 29 11 13 22 27 12 30
ifelse(numbers %% 2 == 0,"even","odd")
## [1] "odd" "odd" "odd" "even" "odd" "even" "even"
Note: What if we tried a normal if statement instead?
if(numbers %% 2 == 0){
print("even")} else{
print("odd")}
## [1] "odd"
A better alternative to multiple nested ifelse
statements is the tidyverse case_when function.
Syntax:
case_when(
condition1 ~ statement1,
condition2 ~ statement2,
condition3 ~ statement3,
)
nums_df <- numbers %>% as.tibble() %>%
mutate(interval = case_when(
(numbers > 0 & numbers <= 10) ~ "1-10",
(numbers > 10 & numbers <= 20) ~ "10-20",
(numbers > 20 & numbers <= 30) ~ "20-30"))
nums_df[1:4,]
## # A tibble: 4 x 2
## value interval
## <int> <chr>
## 1 29 20-30
## 2 11 10-20
## 3 13 10-20
## 4 22 20-30
If you find yourself performing the same specific steps more than a couple of times (perhaps with slight variations), then you should consider writing a function.
A function can serve essentially as a wrapper for a series of steps, where you define generalized inputs/arguments.
Ingredients:
Function name
Arguments
Function body
Syntax:
function_name <- function(arg1, arg2, ...){
do stuff
}
Let’s turn the calculation of even or odd that was completed earlier into a function:
# Make odd function
odd <- function(obj){
ifelse(obj %% 2 == 0,"even","odd")
}
Notice that obj here is a descriptive placeholder name for the data object to be supplied as an argument for the function.
odd(numbers)
## [1] "odd" "odd" "odd" "even" "odd" "even" "even"
A useful way of writing simple functions when you’ve already written the code for a specific instance is to use RStudio’s Extract Function option, which is available from the code menu.
Shifting gears from programming…
Another staple task in applied work is combining data from multiple data sets. The tidyverse set of packages includes several useful types of merges (or “joins”):
left_join() Appends columns from dataset B to dataset A, keeping all observations in dataset A.
inner_join() Appends columns together, keeping only observations that appear in both dataset A and B.
semi_join() Keeps only columns of dataset A for observations that appear in both dataset A and B.
anti_join() Keeps only columns of dataset A for observations that do not appear in both dataset A and B.
The starting point for any merge is to enumerate the column or columns that uniquely identify observations in the dataset.
For cross-sectional data, this might be a personal identifier or (for aggregate data) something like municipality, state, country, etc.
For panel data, this will typically be both the personal/group identifier and a timing variable, for example Sweden in 2015 in a cross-country analysis.
Sometimes the names of the key variables are different across datasets.
You could of course rename the key variables to be consistent.
But mismatched key names are easily handled by the tidyverse join functions.
Syntax:
join_function(x, y, by = c("x_name" = "y_name"))
The left_join() is the most frequent type of join, corresponding to a standard merge in Stata.
Syntax:
left_join(x, y, by = "key")
If the key is muliple columns, use c() to list them.
# Look at the datasets
earnings
## person_id wage
## 1 001 150
## 2 002 90
## 3 003 270
educ
## person_id schooling
## 1 001 12
## 2 003 8
## 3 004 16
# Combine data
combined_data <- left_join(earnings, educ,
by="person_id")
# Print data
combined_data
## person_id wage schooling
## 1 001 150 12
## 2 002 90 NA
## 3 003 270 8
Notice that schooling is equal to NA for person ‘002’ because that person does not appear in the educ dataset.
If you want to combine the variables of two data sets, but only keep the observations present in both datasets, use the inner_join() function.
combined_data <- inner_join(earnings, educ,
by="person_id")
combined_data
## person_id wage schooling
## 1 001 150 12
## 2 003 270 8
To keep using only the variables in the first dataset, but where observations in the first dataset are matched in the second dataset, use semi_join().
filtered_data <- semi_join(earnings, educ, by="person_id")
filtered_data
## person_id wage
## 1 001 150
## 2 003 270
Another filtering join is anti_join(), which filters for observations that are not matched in a second dataset.
filtered_data <- anti_join(earnings, educ,
by="person_id")
filtered_data
## person_id wage
## 1 002 90
There are still other join types, which you can read about here.
Finally, instead of joining different datasets for the same individuals, sometimes you want to join together files that are for different individuals within the same dataset.
The function for appending data in the tidyverse is:
bind_rows(list(dataframe1,dataframe2,...))
The last type of data preparation that we will cover in this course is manipulating string data.
The simplest string manipulation may be concatenating (ie combining) strings.
The glue function lets you reference variable values inside of text strings by writing the variable in curly brackets {} inside of the string.
date_df %<>% mutate(
say.month = glue("The month is {month}"),
mo.yr = glue("{month} {year}")
)
date_df
## month year say.month mo.yr
## 1 Jan 2017 The month is Jan Jan 2017
## 2 Feb 2018 The month is Feb Feb 2018
## 3 March 2019 The month is March March 2019
numbers <- c(1,2,3)
for (i in numbers){
print(glue("The magic number is {i}"))
}
## The magic number is 1
## The magic number is 2
## The magic number is 3
Other common string manipulating tasks include extracting or replacing parts of a string. These are accomplished via the str_extract() and str_replace() from the Tidyverse stringr package.
The arguments for each function are:
str_extract(string_object, "pattern_to_match")
str_replace(string_object, "pattern_to_match","replacement_text")
By default, both function operate on the first match of the specified pattern. To operate on all matchs, add “_all" to the function name, as in:
str_extract_all(string_object, "pattern_to_match")
In the last seminar, we created a “year” column from years indicated in the “variable” column text via the expression:
nlsy97$year <- str_extract(nlsy97$variable, "[0-9]+")
After creating the “year” column, we then removed the year values from the values of the “variable” column by replacing these numbers with an empty string.
nlsy97$variable <- str_replace(nlsy97$variable, "[0-9]+","")
When working with formatted text, a third common task is to remove extra spaces before or after the string text.
str_trim(string, side = "left"/"right"/"both")
Note, when printing a string, any formatting characters are shown. To view how the string looks formatted, use the ViewLines() function.
Often we want to modify strings based on a pattern rather than an exact expression, as seen with the str_extract() and str_replace() examples.
Patterns are specified in R (as in many other languages) using a syntax known as “regular expressions” or regex.
Today, we will very briefly introduce some regular expressions.
There are also abbreviation for one of specific types of characters
eg: [:digit:] for numbers, [:alpha:] for letters, [:punct:] for punctuation, and \(\textbf{.}\) for every character.
See the RStudio cheat sheet on stringr for more examples (and in general, as a brilliant reference to regex)
Aside from specifiying the characters to match, such as “[0-9]”, another important component of regular expressions is how many time should the characters appear.
Suppose we want to extract year data that is mixed in with other data as well.
messy_var <- c(1,1987,2006,2010,307,2018)
str_extract(messy_var, "[0-9]")
## [1] "1" "1" "2" "2" "3" "2"
str_extract(messy_var, "[0-9]+")
## [1] "1" "1987" "2006" "2010" "307" "2018"
str_extract(messy_var, "[0-9]{4}")
## [1] NA "1987" "2006" "2010" NA "2018"
Often, special characters can cause problems when working with strings. For example, trying to add a quote can result in R thinking you are trying to close the string.
For most characters, you can “escape” (cause R to read as part of the string) special characters by prepending them with a backslash.
Example:
quote <- "\"Without data, you're just another person with an opinion.\"
- W. Edwards Deming."
writeLines(quote)
## "Without data, you're just another person with an opinion."
## - W. Edwards Deming.
To match part of a string that occurs before or after a specific other pattern, you can also specify “lookarounds”, the pattern the match should precede or follow:
To match a string pattern x, preceded or followed by y:
y precedes x: “(?<=y)x”
y follows x: “x(?=y)”
price_info <-c("The price is 5 dollars")
str_extract(price_info, "(?<=(The price is )).+")
## [1] "5 dollars"
str_extract(price_info, ".+(?=( dollars))")
## [1] "The price is 5"
“Scraping” data from the web - that is, automating the retrieval of data displayed online (other than through API) is an increasingly common data analysis task.
Today, we will briefly explore very rudimentary web scraping, using the rvest package.
The specific focus today is only on scraping data structued as a table on a webpage. The basic method highlighted will work much of the time - but does not work for every table.
The starting point for scraping a web table with rvest is the read_html() function, where the URL to the page with data should go.
After reading the webpage, the table should be parsed. For many tables, the read_html can be piped directly into the html_table() function.
If html_table() does not work, a more robust option is to first pipe read_html into html_nodes(xpath = “//table”) and then into html_table(fill=TRUE)
tech_stock_names <- c("MSFT","AMZN","GOOGL","AAPL","FB","INTC","CSCO")
tech_stocks <- list()
for(j in 1:length(tech_stock_names)){
tech_stocks[[j]] <-read_html(
glue("https://finance.yahoo.com/quote/{tech_stock_names[j]}/history")) %>%
html_table() %>% as.data.frame() %>% mutate(stock = tech_stock_names[j])
}
tech_stocks %<>% bind_rows()
tech_stocks[1:5,c(1,6:8)]
## Date Adj.Close.. Volume stock
## 1 Mar 07, 2019 110.39 25,321,300 MSFT
## 2 Mar 06, 2019 111.75 17,687,000 MSFT
## 3 Mar 05, 2019 111.70 19,538,300 MSFT
## 4 Mar 04, 2019 112.26 26,608,000 MSFT
## 5 Mar 01, 2019 112.53 23,501,200 MSFT
gini_list <-read_html("http://wdi.worldbank.org/table/1.3") %>%
html_nodes(xpath ="//table") %>% html_table(fill=TRUE)
gini_data <- gini_list %>% extract2(3) %>%
as.data.frame() %>% select(1:3)
gini_data[1:5,]
## X1 X2 X3
## 1 Afghanistan .. ..
## 2 Albania 2012 29.0
## 3 Algeria 2011 27.6
## 4 American Samoa .. ..
## 5 Andorra .. ..