Exercise 3

knitr::opts_chunk$set(include = TRUE)
knitr::opts_chunk$set(comment = NA)
library(rvest)
library(httr)
library(magrittr)
library(tidyverse)
library(rio)
library(glue)

For this exercise, you will not be working with a single dataset, but will instead practice new skills using both your NLSY data from last seminar, as well as online data concerning movies.

Revisit your NLSY97 dataset from last week
1. Create an indicator for sex using a vectorized conditional statement.
2. Recode the schooltype variable into text values, corresponding to:
  - “Public” if the value is 1
  - “Private, religious” if the value is 2
  - “Private, non-religious” if the value is 3
  - “Other” if the value is 4.
```
nlsy97 <-import("nlsy97.rds")
```

Load the IMDB Top 250 Movies
1. Scrape the data from the “Top 250 Movies as rated by IMDb users” from https://www.imdb.com/chart/top
```
# IMDB Top 250 Movies
top250_basic <- read_html("https://www.imdb.com/chart/top/") %>%
   html_table() %>% as.data.frame()
```
1. Notice that IMDB scrapes the data in Swedish by fault. To get the data in English, use html_session() in place of read_html(), adding the option:
```
add_headers("Accept-Language"="en-US, en;q=0.5")
```
- You may need to load the httr package to use add_headers().
```
top250_eng.pre <- html_session("https://www.imdb.com/chart/top/", 
                               add_headers("Accept-Language"="en-US, en;q=0.5")) %>% 
                    html_table %>% as.data.frame()
```
1. Keep only the columns “Rank…Title” and “IMDb.Rating”, suitably renaming them.
```
top250_eng <- top250_eng.pre %>% select("Rank...Title", "IMDb.Rating") %>%
  rename(Title = Rank...Title,
         Rating = IMDb.Rating) # Keep only Title and Year Columns
```
1. Create a ranking variable by extracting the values that appear before the dot in the title column.
```
top250_eng$Ranking <- top250_eng$Title %>% str_extract("[0-9]+(?=(.\n))")
```
1. Create a year variable, by extracting the numbers inside a parenthesis from the title column.
```
top250_eng$Year <- str_extract(top250_eng$Title, "(?<=[:punct:])[:digit:]+") 
```
1. Redefine the title variable by extracting the string information that appear after the dot in the title column.
```
top250_eng$Title <- top250_eng$Title %>% str_extract("(?<=(.\n)).+")
```
1. Trim the white space on both sides of the title.
```
top250_eng$Title %<>% str_trim(side = "both")
```
1. View data frame
```
head(top250_eng)
```

Get the box office statistics for the top 500 all-time US box office earners
1. Using a loop, create a list of the top 500 box office hits taking advantage of the fact that each 100 movies is listed on the following pages:
- https://www.boxofficemojo.com/alltime/domestic.htm?page=1
- https://www.boxofficemojo.com/alltime/domestic.htm?page=2
- https://www.boxofficemojo.com/alltime/domestic.htm?page=3
- https://www.boxofficemojo.com/alltime/domestic.htm?page=4
- https://www.boxofficemojo.com/alltime/domestic.htm?page=5
- You may need to experiment with one of the pages first to ensure that you get the right dataframe from each iteration of the loop.
```
domesticgross <- list()
for(i in 1:5){
  domesticgross[[i]] <-read_html(
    glue("https://www.boxofficemojo.com/alltime/domestic.htm?page={i}")) %>% 
    html_nodes(xpath = "//table") %>% html_table(fill=TRUE)  %>% 
    extract2(6)
}
```
1. Form a single dataframe out of all the observations from the list you created.
```
topearners <- domesticgross %>% bind_rows()
```
1. Ensure that the column names are correctly treated as column names and not as observations.
```
colnames(topearners) <-topearners[1,]
topearners %<>%  filter(Rank !="Rank")
```
1. Rename the Title and Lifetime earnings appropriately and keep only the title, studio, and gross earnings variables.
```
topearners %<>% rename("Title" = "Title(click to view)",
                       "Gross"="Lifetime Gross") %>%
  select(Title,Studio, Gross)
```
1. View the box office earnings dataset
```
head(topearners)
```

Create a dataset with both IMDB performance and earnings
1. Join the box office earnings and IMDB top 250 datasets, keeping all variables and only the observations that found in both datasets.
```
expensive_movies <-inner_join(top250_eng, topearners, by="Title")
```
1. Remove the dollar sign and commas from the gross earnings variable.
```
expensive_movies$Gross %<>% str_replace_all("[$,]+","")
```
1. Vectorize the as.numeric() function to convert the ranking, gross earnings, and year variables into numeric.
```
expensive_movies[,c("Ranking","Gross","Year")] %<>% map(as.numeric)
```
1. Create a new variable equal to the log of gross earnings
```
expensive_movies %<>%  mutate(logearnings = log(Gross))
```
1. Write your own OLS function (producing coefficients) and use it to run a regression of log earnings on rating and year (with a constant).
- In matrix notation, the formula for $\hat{\beta}_{OLS}$ is: \[ \hat{\beta}_{OLS} = (X'X)^{-1}(X'y) \]
- You will need some more matrix multiplication operators for this:
  - solve(A) yields the inverse of matrix A.
  - t(A) provides the transpose of matrix A.

myols <- function(depvar,indvars) {
  x <- indvars %>% as_tibble()
  x %<>% as_tibble() %>% mutate(Constant = 1) %>% as.matrix()
  y <- depvar %>% as.matrix()
  beta_myols <- t(solve(t(x) %*% x) %*% (t(x) %*% y))
  colnames(beta_myols) <-colnames(x)
  rownames(beta_myols) <- "Estimate"
  beta_myols <- t(beta_myols)
  return(beta_myols)
}

myols(expensive_movies$logearnings,expensive_movies[,c("Rating","Year")])

            Estimate
Rating   0.197996254
Year     0.006279776
Constant 5.035941990