Gender (gender_r)

. *** Explore Gender Variable
. codebook gender_r  // View storage format of variable 'gender_r'

--------------------------------------------------------------------------------
gender_r                                                                GENDER_R
--------------------------------------------------------------------------------

                  type:  numeric (byte)

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/4,469

            tabulation:  Freq.  Value
                         2,253  1
                         2,216  2

. 
. *** Create a "Female" Indicator Variable
. gen female = (gender_r == 2) if !missing(gender_r)
. /* For individuals whose gender is listed in gender_r, assigns a
> value of 1 for female if gender is equal to 2, 0 if not. Missing
> values in gender_r would also appear as missing in the female
> variable. */
. 
. tabulate female  // Displays the freq/percent of each value of "female."

     female |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,253       50.41       50.41
          1 |      2,216       49.59      100.00
------------+-----------------------------------
      Total |      4,469      100.00

. 
. /*
> Discussion: The variable "gender_r" represents the gender listed
> for each individual. When the CSV file was read into Stata, the variable
> was interpreted as a 'numeric' type variable. 50.41% of observations
> are male, 49.59% female, and there are no missing observations.
> */
. 
. *** Note: Another way to create the female indicator variable would be:
. // gen female_alt = 0 if !missing(gender_r)
. // replace female_alt = 1 if ( gender_r == 2 & !missing(gender_r))
. // tabulate female_alt
. 
. **** 2b. Years of Schooling (yrsqual)
. *** Explore 'Years of Schooling' Variable
. codebook yrsqual  // View storage format of variable 'yrsqual'

--------------------------------------------------------------------------------
yrsqual                                                                  YRSQUAL
--------------------------------------------------------------------------------

                  type:  string (str2)

         unique values:  11                       missing "":  0/4,469

              examples:  "11"
                         "12"
                         "15"
                         "16"

. tabulate yrsqual

    YRSQUAL |      Freq.     Percent        Cum.
------------+-----------------------------------
         10 |        196        4.39        4.39
         11 |        793       17.74       22.13
         12 |        837       18.73       40.86
         13 |        395        8.84       49.70
         14 |        440        9.85       59.54
         15 |        518       11.59       71.13
         16 |        500       11.19       82.32
         20 |         53        1.19       83.51
          6 |        134        3.00       86.51
          9 |        601       13.45       99.96
          D |          2        0.04      100.00
------------+-----------------------------------
      Total |      4,469      100.00

. /* Since 'yrsqual' is a string variable with 11 unique values (more
> than the 9 that codebook will tabulate), codebook shows only a few
> examples. Using tabulate, we see some of the observations have a
> missing value "D", which means "Don't Know" according to the
> downloaded codebook. */
. 
. ***** Format- Years of Schooling Variable
. replace yrsqual = ".d" if yrsqual == "D"
(2 real changes made)
. /* Since we need to format the variable as a numeric (quantitative)
> variable, we need Stata to interpret the missing values
> correctly. Missing values in Stata are denoted by ".", where
> letters can follow the "." to indicate what type of missing data we
> have. So we change "D" to ".d". */
. 
. destring(yrsqual), gen(yearsch)
yrsqual: all characters numeric; yearsch generated as byte
(2 missing values generated)
. /* Now, we need Stata to convert the variable to numeric,
> by parsing the text (string) values as numbers. */
. 
. tabulate yearsch  // Check converted values (missings hidden by default).

    YRSQUAL |      Freq.     Percent        Cum.
------------+-----------------------------------
          6 |        134        3.00        3.00
          9 |        601       13.45       16.45
         10 |        196        4.39       20.84
         11 |        793       17.75       38.59
         12 |        837       18.74       57.33
         13 |        395        8.84       66.17
         14 |        440        9.85       76.02
         15 |        518       11.60       87.62
         16 |        500       11.19       98.81
         20 |         53        1.19      100.00
------------+-----------------------------------
      Total |      4,467      100.00

. 
. tabulate yearsch, missing  /* Note: You can see missing values again in
>                               tabulate by using option, ", missing" */

    YRSQUAL |      Freq.     Percent        Cum.
------------+-----------------------------------
          6 |        134        3.00        3.00
          9 |        601       13.45       16.45
         10 |        196        4.39       20.83
         11 |        793       17.74       38.58
         12 |        837       18.73       57.31
         13 |        395        8.84       66.14
         14 |        440        9.85       75.99
         15 |        518       11.59       87.58
         16 |        500       11.19       98.77
         20 |         53        1.19       99.96
         .d |          2        0.04      100.00
------------+-----------------------------------
      Total |      4,469      100.00

. 
. summarize yearsch  // Produces basic descriptive statistics for 'yearsch'

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     yearsch |      4,467    12.32707    2.571902          6         20

. 
. /*
> Discussion: The variable "yrsqual" is a derived measure of years
> of schooling. The variable was stored in Stata as a "string" type of
> variable (Why? Because some observations take on the non-numeric "D"
> value). After converting the variable to numeric, we see the mean is
> 12.33, with std. dev. of 2.57, min of 6 and max of 20. There are 2
> missing observations.
> */
. 
. **** 2c. Age (age_r)
. *** Explore Age Variable
. codebook age_r  // View storage format of variable 'age_r'

--------------------------------------------------------------------------------
age_r                                                                      AGE_R
--------------------------------------------------------------------------------

                  type:  numeric (byte)

                 range:  [16,65]                      units:  1
         unique values:  50                       missing .:  0/4,469

                  mean:   40.9009
              std. dev:   14.7528

           percentiles:        10%       25%       50%       75%       90%
                                20        28        42        54        61

. rename age_r age  /* Rename 'age_r' to 'age' (not necessary,
>                      but makes regression more understandable later) */
. 
. *** Generate 'Potential Experience' Variable
. gen potent_exper = max(0,age - 19)  /* Generates a 'Potential Experience'
>                                        variable, equal to age - 19 for
>                                        individuals who are at least 19,
>                                        0 otherwise. */
. 
. summarize potent_exper, detail

                        potent_exper
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            1              0       Obs               4,469
25%            9              0       Sum of Wgt.       4,469

50%           23                      Mean           22.03424
                        Largest       Std. Dev.      14.54202
75%           35             46
90%           42             46       Variance       211.4704
95%           44             46       Skewness      -.0069625
99%           46             46       Kurtosis       1.745389

. 
. /*
> Discussion: The variable "age_r" is a derived measure of age (in years)
> of the individual. The variable is stored in Stata as numeric and there
> are no missing observations. Using "summarize, detail" on the derived
> potential experience variable, we see that its mean is 22.03 years and
> its median (50th percentile) is 23 years.
> */
. 
. **** 2d. Cognitive Ability (using pvpsl1)
. *** Explore 'Problem-solving scale score' Variable
. codebook pvpsl1  // View storage format of variable 'pvpsl1'

--------------------------------------------------------------------------------
pvpsl1                                                                    PVPSL1
--------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [107.30074,429.56497]        units:  .00001
         unique values:  3,959                    missing .:  506/4,469

                  mean:   290.709
              std. dev:   43.3638

           percentiles:        10%       25%       50%       75%       90%
                           232.211   263.069   294.227   320.793   342.671

. 
. *** Generate Quantile of Cognitive Ability
. egen cogn_rank = rank(pvpsl1) if !missing(pvpsl1)  /* Rank of individuals'
>                                                       pvpsl1 if known. */
(506 missing values generated)
. 
. egen count_cogn = count(pvpsl1) if !missing(pvpsl1)  /* Total number of
>                                                         nonmissing observations
>                                                         for pvpsl1. */
(506 missing values generated)
. 
. *** Percentile Rank
. gen cogn_samp_pctile = ((cogn_rank -1) / (count_cogn - 1)) * 100
(506 missing values generated)
. 
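The percentile-rank formula above maps the lowest observed score to 0 and the highest to 100. A minimal sketch on a toy dataset, assuming made-up scores and hypothetical variable names (score, score_rank, n_scores, score_pctile) that are not part of the assignment data:

```stata
* Toy data (illustrative only): four known scores and one missing score.
clear
input float score
250
310
.
275
290
end

* Rank and count are computed over nonmissing scores only, as in the log.
egen score_rank = rank(score) if !missing(score)
egen n_scores   = count(score) if !missing(score)

* (rank - 1) / (N - 1) * 100: lowest score -> 0, highest score -> 100.
gen score_pctile = ((score_rank - 1) / (n_scores - 1)) * 100

list score score_rank n_scores score_pctile
```

The observation with a missing score keeps a missing percentile, which mirrors the 506 missing values reported for cogn_samp_pctile.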
. /*
> Discussion: The variable "pvpsl1" is a derived measure of an
> individual's problem-solving ability. The variable is stored as a
> numeric variable in Stata and there are 506 missing observations.
> */
. 
. **** Question 3: Dependent Variable (Monthly Earnings Quintile)
. codebook monthlyincpr  // View storage format of variable 'monthlyincpr'

--------------------------------------------------------------------------------
monthlyincpr                                                        MONTHLYINCPR
--------------------------------------------------------------------------------

                  type:  numeric (byte)

                 range:  [1,6]                        units:  1
         unique values:  6                        missing .:  1,236/4,469

            tabulation:  Freq.  Value
                           449  1
                            84  2
                           879  3
                           749  4
                           613  5
                           459  6
                         1,236  .

. 
. recode monthlyincpr (1 = 5) (2 = 17.5) (3 = 37.5) (4 = 62.5) (5 = 82.5) ///
> (6 = 95), gen(income_pctile)
(3233 differences between monthlyincpr and income_pctile)
. 
. * Alternate recode
. // gen income_pctile = .
. // replace income_pctile = 5 if (monthlyincpr == 1 & !missing(monthlyincpr))
. // replace income_pctile = 17.5 if (monthlyincpr == 2 & !missing(monthlyincpr))
. // replace income_pctile = 37.5 if (monthlyincpr == 3 & !missing(monthlyincpr))
. // replace income_pctile = 62.5 if (monthlyincpr == 4 & !missing(monthlyincpr))
. // replace income_pctile = 82.5 if (monthlyincpr == 5 & !missing(monthlyincpr))
. // replace income_pctile = 95 if (monthlyincpr == 6 & !missing(monthlyincpr))
. 
. replace income_pctile = 0 if c_d05 ==2  // Assign value of 0 for unemployed.
(209 real changes made)
. 
. drop if c_d05 ==3 | c_d05 == 4  // Drop if not in labor market or unknown.
(905 observations deleted)
. 
. codebook income_pctile  // Check number of missing values of new var.

--------------------------------------------------------------------------------
income_pctile                                RECODE of monthlyincpr (MONTHLYINCPR)
--------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [0,95]                       units:  .1
         unique values:  7                        missing .:  122/3,564

            tabulation:  Freq.  Value
                           209  0
                           449  5
                            84  17.5
                           879  37.5
                           749  62.5
                           613  82.5
                           459  95
                           122  .

. 
. /*
> Discussion: The number of missing observations for "monthlyincpr" is 1,236.
> The number of missing observations for the revised measure is 122.
> */
. 
. *** Question 4: Regression Analysis
. *** 4a: Regress Income Rank on Cognitive Ability, Potential Experience, and Female Gender
. reg income_pctile cogn_samp_pctile potent_exper i.female if ///
> inrange(age, 30, 65)

Source |       SS           df       MS      Number of obs   =     2,451
-------------+----------------------------------   F(3, 2447)      =    149.75
Model |  314365.533         3  104788.511   Prob > F        =    0.0000
Residual |  1712304.16     2,447  699.756503   R-squared       =    0.1551
-------------+----------------------------------   Adj R-squared   =    0.1541
Total |   2026669.7     2,450  827.212121   Root MSE        =    26.453

----------------------------------------------------------------------------------
income_pctile |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
cogn_samp_pctile |   .3391429   .0199983    16.96   0.000     .2999276    .3783583
potent_exper |   .4132204   .0588681     7.02   0.000     .2977839    .5286569
1.female |  -12.38118   1.071982   -11.55   0.000   -14.48327   -10.27909
_cons |   37.81954   2.311408    16.36   0.000    33.28702    42.35206
----------------------------------------------------------------------------------

/* 
Discussion: 

The coefficient on cogn_samp_pctile implies that a one percentile
increase in cognitive ability is estimated to raise an individual's percentile
of earnings by .3391429 percentile points (both percentiles are
measured on a 0-100 scale).

The coefficient on potent_exper implies that a one year increase in
potential experience is estimated to increase one's percentile
of earnings by .4132204 percentile points.

The coefficient on female suggests that being female is estimated to
decrease the percentile of income by 12.38118 percentile points,
compared to being male.

The constant estimate suggests that the predicted percentile
of income for a male (female = 0) with 0 years of potential experience
and in the 0th percentile of cognitive ability is 37.82, i.e. roughly
the 38th percentile.
*/

. 
. *** 4b: Add Exper^2 and Age
. reg income_pctile cogn_samp_pctile c.potent_exper##c.potent_exper ///
> i.female age if inrange(age, 30, 65)
note: age omitted because of collinearity

Source |       SS           df       MS      Number of obs   =     2,451
-------------+----------------------------------   F(4, 2446)      =    122.42
Model |  338057.281         4  84514.3203   Prob > F        =    0.0000
Residual |  1688612.41     2,446   690.35667   R-squared       =    0.1668
-------------+----------------------------------   Adj R-squared   =    0.1654
Total |   2026669.7     2,450  827.212121   Root MSE        =    26.275

-----------------------------------------------------------------------------------------------
income_pctile |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------------------+----------------------------------------------------------------
cogn_samp_pctile |   .3331509   .0198898    16.75   0.000     .2941482    .3721535
potent_exper |   2.324822   .3315113     7.01   0.000     1.674751    2.974894
|
c.potent_exper#c.potent_exper |  -.0343134   .0058574    -5.86   0.000    -.0457993   -.0228275
|
1.female |   -12.5878   1.065342   -11.82   0.000    -14.67687   -10.49874
age |          0  (omitted)
_cons |   14.84099   4.544964     3.27   0.001     5.928615    23.75337
-----------------------------------------------------------------------------------------------

. 
. /*
> Discussion:
>
> Age: The age variable is omitted. If you look at the top of the
> regression output, it notes that age is omitted because of
> collinearity (Stata automatically detects perfect collinearity and drops
> one of the collinear variables). Age here is a linear function of potential
> experience and the constant: everyone in the estimation sample is at least
> 30, so age = potent_exper + 19 holds exactly. This is a violation
> of MLR Assumption 3, which is simply "no perfect collinearity."
>
> Square of Potential Experience: The coefficient on the square of
> experience is negative and significant. This indicates that the benefit
> of an additional year of experience diminishes as the years of
> experience one already has increase. Omission of a relevant quadratic
> term like this is a common example of the misspecification of functional
> form that is a violation of MLR Assumption 4 (zero conditional mean) for
> estimating the true model.
>
> R^2: The R^2 in the second model is higher than the first (0.1668 as
> opposed to 0.1551), indicating that adding the square of experience increases
> the total amount of explained variation in income percentile. R^2 will
> never decrease with the addition of subsequent variables. To see this,
> note that R^2 = 1 - (Sum of Squared Residuals / Total Sum of Squares).
> The Total Sum of Squares is the same across the two models, and since
> the second model contains all predictors from the first model, its sum
> of squared residuals will be no greater than in the first model.
>
> */
. 
. *** 4c: Compare School Years vs Cognitive Ability
. reg income_pctile cogn_samp_pctile potent_exper i.female if inrange(age, 30, 65)

Source |       SS           df       MS      Number of obs   =     2,451
-------------+----------------------------------   F(3, 2447)      =    149.75
Model |  314365.533         3  104788.511   Prob > F        =    0.0000
Residual |  1712304.16     2,447  699.756503   R-squared       =    0.1551
-------------+----------------------------------   Adj R-squared   =    0.1541
Total |   2026669.7     2,450  827.212121   Root MSE        =    26.453

----------------------------------------------------------------------------------
income_pctile |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
cogn_samp_pctile |   .3391429   .0199983    16.96   0.000     .2999276    .3783583
potent_exper |   .4132204   .0588681     7.02   0.000     .2977839    .5286569
1.female |  -12.38118   1.071982   -11.55   0.000   -14.48327   -10.27909
_cons |   37.81954   2.311408    16.36   0.000    33.28702    42.35206
----------------------------------------------------------------------------------

. scalar R2model4a = e(r2_a)  // Save adjusted R^2 as a scalar. (Also in reg output)
. 
. reg income_pctile yearsch potent_exper i.female if inrange(age, 30, 65)

Source |       SS           df       MS      Number of obs   =     2,727
-------------+----------------------------------   F(3, 2723)      =    165.66
Model |  366723.701         3  122241.234   Prob > F        =    0.0000
Residual |   2009262.6     2,723  737.885642   R-squared       =    0.1543
-------------+----------------------------------   Adj R-squared   =    0.1534
Total |   2375986.3     2,726  871.601725   Root MSE        =    27.164

------------------------------------------------------------------------------
income_pct~e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
yearsch |   4.009879   .2173342    18.45   0.000     3.583722    4.436035
potent_exper |   .1575075   .0536649     2.94   0.003     .0522795    .2627354
1.female |  -15.24822   1.048829   -14.54   0.000     -17.3048   -13.19164
_cons |   8.157888   3.436421     2.37   0.018     1.419632    14.89614
------------------------------------------------------------------------------

. scalar R2model4c = e(r2_a)  // Save adjusted R^2 as a scalar. (Also in reg output)
. 
. display R2model4a - R2model4c  /* Displays difference in adjusted R^2. Note: For
>                                   the assignment, you could just compare them
>                                   from the regression output of each model. */
.00066432
. 
. /*
> Discussion:
>
> The two models perform nearly identically: the adjusted R^2 of the
> regression model from 4(a) is higher by .00066432, i.e. it explains
> about .066 percentage points more of the variation in income
> percentile. Note that the two regressions are estimated on different
> samples (2,451 vs. 2,727 observations).
>
> (Not graded) Potential Problems with Either Model:
> The two models preview common challenges in applied econometrics we will
> discuss in subsequent lectures. As you can see from the covariance matrix
> below, Cov(cogn_samp_pctile, yearsch) is not equal to zero, and both appear
> likely to affect incomes, implying omitted variable bias (i.e. a violation
> of MLR Assumption 4). One response would be to control for both cognitive
> ability and schooling. But this brings up an issue from Ch.3: endogeneity.
> The basic idea is that OLS is biased if you include explanatory variables
> that are caused by other variables in the model. If cognitive ability
> increases years of schooling, then years of schooling is endogenous when
> both are in the model. Equally, one might imagine that, as an individual
> gains more years of schooling, their cognitive ability increases. If this is
> true, cognitive ability is also endogenous to schooling (when two variables
> causally influence each other, this is a particular type of endogeneity called
> simultaneity).
>
> */
. 
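The omitted variable bias argument above can be written out for the simplest case. As a sketch only, assume a model with just two regressors, x_1 = cogn_samp_pctile and x_2 = yearsch, and ignore the other controls used in 4(a) and 4(c); the symbols below are illustrative notation, not output from this log. The slope from regressing income percentile on x_1 alone (tilde) then relates to the estimates from the regression on both regressors (hats) by:

```latex
\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2\,\tilde{\delta}_1,
\qquad
\tilde{\delta}_1 = \frac{\widehat{\mathrm{Cov}}(x_1, x_2)}{\widehat{\mathrm{Var}}(x_1)}
```

Under that assumption, the short-regression slope differs from the two-regressor estimate whenever both the coefficient on schooling and the sample covariance between ability and schooling are nonzero.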
. correlate cogn_samp_pctile yearsch, covariance
(obs=3,236)

             | cogn_s~e  yearsch
-------------+------------------
cogn_samp_~e |  831.174
     yearsch |  23.2517  5.44831

. 
. **** Extra Question for three person groups
. 
. **** Question 5(a) Explore Structure of the variable "g_q03h" - which is
. ** 'Skill use work - Numeracy - How often - Use advanced math or statistics'
. codebook g_q03h

--------------------------------------------------------------------------------
g_q03h                                                                    G_Q03h
--------------------------------------------------------------------------------

                  type:  string (str1)

         unique values:  9                        missing "":  0/3,564

            tabulation:  Freq.  Value
                         2,851  "1"
                           372  "2"
                           131  "3"
                            77  "4"
                            33  "5"
                             7  "D"
                             1  "N"
                             1  "R"
                            91  "V"

. 
. /*
> From looking at 'math use at work' with the codebook command, we
> see that this variable takes on only 9 unique values, meaning that
> all values are displayed by codebook. From this, we can see right
> away that we have the following 'Missing value' indicators that need
> to be relabelled: 'D', 'N', 'R', and 'V'.
> */
. 
. **** Question 5(b) Suitably reformat g_q03h and provide the mean and
. **** standard deviation using the original value scheme.
. 
. *** Recode Missing Values for g_q03h
. replace g_q03h = ".d" if g_q03h=="D"
variable g_q03h was str1 now str2
(7 real changes made)
. replace g_q03h = ".n" if g_q03h=="N"
(1 real change made)
. replace g_q03h = ".r" if g_q03h=="R"
(1 real change made)
. replace g_q03h = ".v" if g_q03h=="V"
(91 real changes made)
. 
. *** Convert g_q03h to a numeric variable by destringing
. destring g_q03h, replace
g_q03h: all characters numeric; replaced as byte
(100 missing values generated)
. 
. *** Produce summary statistics for g_q03h using original coding of
. *** use frequencies
. summarize g_q03h

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      g_q03h |      3,464    1.287818    .7269503          1          5

. 
. /*
> The mean (pre-transformation) of this variable is 1.287818 and the
> standard deviation is 0.7269503.
> */
. 
. **** Question 5(c) - Recode g_q03h so that the values represent the number of
. *** times each month an individual uses advanced math or statistics at work
. recode g_q03h (1 = 0) (1 = 0.5) (3 = 2.5) (4 = 12) (5 = 20) ///
> , gen(mathuseatwork)
(3092 differences between g_q03h and mathuseatwork)
. 
. /*
> This question highlights a common problem in applied work, which is
> that survey data often uses an ordinal or interval approach to
> asking for retrospective information. You as the researcher must then
> decide how to make that interpretable numerically and justify it.
>
> In assigning values here myself, I assume that individuals work
> four 5-day work weeks per month, for a total of 20 work days. So if an
> individual reports they use math at work "every day" (5 in the old
> schema), that equates to 20 days per month.
>
> "Never" (1 in original coding) is straightforwardly represented as
> 0 times per month.
>
> For less than once a month (2), I code this
> as the midpoint between 0 and 1, i.e. 0.5 days per month.
>
> For less than once a week but at least once a month (3), this
> should be fewer than four days according to my assumption of a
> 4 week work month, but at least 1. I again use the midpoint, here of
> the interval from 1 to 4, that is 2.5 days per month.
>
> For at least once a week but not every day (4), this again should be
> less than 20 but at least 4. So once again taking the midpoint of
> (4,20), I code this as 12 days per month.
> */
. 
. **** Summarize recoded math use at work variable
. summarize mathuseatwork

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
mathuseatw~k |      3,464    .7665993    2.663051          0         20

. 
. /*
> The mean of the variable after transforming it to be more directly
> interpretable is .7665993 and the standard deviation is 2.663051.
> */
. 
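The reported mean can be cross-checked against the category counts shown in the codebook output for g_q03h. A minimal sketch, assuming a five-row frequency table typed in by hand (the names code, freq, days_as_described and days_as_logged are illustrative and not part of the assignment data): it applies both the mapping described in the discussion (category 2 coded as 0.5) and the rule list exactly as typed in the log, where category 1 appears twice and category 2 has no rule of its own.

```stata
* Category counts copied from the codebook tabulation of g_q03h.
clear
input byte code int freq
1 2851
2  372
3  131
4   77
5   33
end

* Mapping as described in the discussion: category 2 -> 0.5 days per month.
recode code (1 = 0) (2 = 0.5) (3 = 2.5) (4 = 12) (5 = 20), gen(days_as_described)

* Rule list exactly as typed in the log: no rule matches 2, so 2 stays 2.
recode code (1 = 0) (1 = 0.5) (3 = 2.5) (4 = 12) (5 = 20), gen(days_as_logged)

* Frequency-weighted means over the 3,464 nonmissing responses.
summarize days_as_described [fweight = freq]
summarize days_as_logged [fweight = freq]
```

Only the second mapping reproduces the logged mean of .7665993 (the described mapping gives a mean of roughly 0.61). This suggests that the logged statistics, and the regressions that use mathuseatwork, leave category 2 at its original value of 2 rather than 0.5.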
. **** Question 5(d) - Regressions relating to a math use at work -> cognitive
. **** ability -> income pathway.
. 
. *** Question 5(d)(i) Regression of Cognitive Ability on math use at work
. reg cogn_samp_pctile mathuseatwork if (inrange(age, 30, 65) & (c_d05==1))

Source |       SS           df       MS      Number of obs   =     2,436
-------------+----------------------------------   F(1, 2434)      =     66.80
Model |  54148.3048         1  54148.3048   Prob > F        =    0.0000
Residual |     1973109     2,434  810.644616   R-squared       =    0.0267
-------------+----------------------------------   Adj R-squared   =    0.0263
Total |   2027257.3     2,435  832.549199   Root MSE        =    28.472

-------------------------------------------------------------------------------
cogn_samp_p~e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
mathuseatwork |   1.723182   .2108404     8.17   0.000     1.309736    2.136627
_cons |   46.61091    .604462    77.11   0.000      45.4256    47.79623
-------------------------------------------------------------------------------

. 
. *** Question 5(d)(ii) Regression of Earnings Pctile on Cognitive Ability
. reg income_pctile cogn_samp_pctile if (inrange(age, 30, 65) & (c_d05==1))

Source |       SS           df       MS      Number of obs   =     2,372
-------------+----------------------------------   F(1, 2370)      =    221.39
Model |  148408.939         1  148408.939   Prob > F        =    0.0000
Residual |  1588709.09     2,370   670.34139   R-squared       =    0.0854
-------------+----------------------------------   Adj R-squared   =    0.0850
Total |  1737118.03     2,371   732.65206   Root MSE        =    25.891

----------------------------------------------------------------------------------
income_pctile |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
cogn_samp_pctile |   .2753122   .0185031    14.88   0.000     .2390283     .311596
_cons |   48.22926   1.040668    46.34   0.000     46.18855    50.26998
----------------------------------------------------------------------------------

. 
. *** Question 5(d)(iii) Regression of Earnings Pctile on math use at work
. reg income_pctile mathuseatwork if (inrange(age, 30, 65) & (c_d05==1))

Source |       SS           df       MS      Number of obs   =     2,609
-------------+----------------------------------   F(1, 2607)      =     82.45
Model |  60710.6835         1  60710.6835   Prob > F        =    0.0000
Residual |  1919661.01     2,607  736.348681   R-squared       =    0.0307
-------------+----------------------------------   Adj R-squared   =    0.0303
Total |  1980371.69     2,608  759.344975   Root MSE        =    27.136

-------------------------------------------------------------------------------
income_pctile |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
mathuseatwork |   1.811131   .1994614     9.08   0.000     1.420012    2.202249
_cons |   58.28769   .5558673   104.86   0.000     57.19771    59.37768
-------------------------------------------------------------------------------

. 
. /*
> Discussion:
>
> Regression 5(d)(i) suggests that for each additional day per month
> that an individual uses advanced math at work, their percentile of
> cognitive ability increases by 1.723182, which is statistically
> significant (p-value < 0.01). It's not immediately required for
> this question, but you may note that these estimates seem almost
> implausibly high - as we will discuss further in 5(f).
>
> Regression 5(d)(ii), like the analysis in question 4, suggests that
> cognitive ability has a positive impact on earnings, with a
> 1 percentile increase in cognitive ability estimated to increase
> earnings percentile by 0.2753122, which is statistically
> significant (p-value < 0.01). If both this relationship and the
> relationship from 5(d)(i) are indeed correct, then math use at
> work should affect earnings percentile via this pathway.
>
> Regression 5(d)(iii) estimates that math use at work does indeed
> have an effect on earnings percentile - in fact even larger than the
> estimated effect through the cognitive ability - earnings pathway.
> An increase in math use at work of one day per month is estimated to
> increase earnings percentile by 1.811131, which is statistically
> significant (p-value < 0.01). Again, these results are implausibly
> high - raising the specter of reverse causality / endogeneity and
> foreshadowing 5(f).
> */
. 
. **** Question 5(e) - Regressions relating to an erroneous math use at work
. **** -> years of schooling -> income pathway.
. 
. *** Question 5(e)(i) Regression of years of schooling on math use at work
. reg yearsch mathuseatwork if (inrange(age, 30, 65) & (c_d05==1))

Source |       SS           df       MS      Number of obs   =     2,699
-------------+----------------------------------   F(1, 2697)      =     92.68
Model |  531.014066         1  531.014066   Prob > F        =    0.0000
Residual |  15452.2219     2,697  5.72941118   R-squared       =    0.0332
-------------+----------------------------------   Adj R-squared   =    0.0329
Total |   15983.236     2,698  5.92410527   Root MSE        =    2.3936

-------------------------------------------------------------------------------
yearsch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
mathuseatwork |   .1647676   .0171149     9.63   0.000      .131208    .1983273
_cons |   12.83879   .0481801   266.47   0.000     12.74431    12.93326
-------------------------------------------------------------------------------

. 
. *** Question 5(e)(ii) Regression of income percentile on years of schooling
. reg income_pctile yearsch if (inrange(age, 30, 65) & (c_d05==1))

Source |       SS           df       MS      Number of obs   =     2,612
-------------+----------------------------------   F(1, 2610)      =    241.93
Model |  168142.737         1  168142.737   Prob > F        =    0.0000
Residual |  1813992.45     2,610  695.016266   R-squared       =    0.0848
-------------+----------------------------------   Adj R-squared   =    0.0845
Total |  1982135.19     2,611  759.147909   Root MSE        =    26.363

------------------------------------------------------------------------------
income_pct~e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
yearsch |   3.305584   .2125233    15.55   0.000     2.888852    3.722315
_cons |   16.83114   2.810066     5.99   0.000     11.32095    22.34132
------------------------------------------------------------------------------

. 
. /*
> Discussion:
>
> Regression 5(e)(i) estimates that math use at work
> has a positive, statistically significant effect on years of
> schooling. Regression 5(e)(ii) then suggests that years
> of schooling has a positive, statistically significant effect on
> earnings percentile.
>
> This would point to a second causal pathway
> for math use at work to affect earnings, but thinking about
> regression 5(e)(i) - it doesn't make any sense under our assumptions.
> If schooling strictly predates math use at work, then math use at
> work cannot affect schooling. Instead, what we very likely have is
> reverse causality - an individual's schooling instead affects their
> math use at work. To see that a coefficient will be different from
> zero when the true relationship runs in reverse of what is estimated,
> consider the expression for Beta in terms of the sample correlation
> and standard deviations:
>
> - For regression of y on x, the coefficient on x is:
>       beta_x = Corr(x,y) * (StdDev_y / StdDev_x)
> - And for the regression of x on y, the coefficient on y is:
>       beta_y = Corr(x,y) * (StdDev_x / StdDev_y)
>
> Since the fraction (StdDev_y / StdDev_x) and its inverse are always
> strictly positive, then for nonzero Corr(x,y), running the regression
> in the 'wrong' direction (from y to x) will always yield a nonzero
> coefficient with the same sign as the effect in the right direction
> (from x to y).
>
> To demonstrate this argument, we run a regression interchanging
> our dependent and independent variables in 5(e)(i).
>
> */
. 
. *** Demonstrating that regression can't tell us the direction of causality
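The link between a regression slope, the sample correlation and the two standard deviations can also be checked numerically. A minimal sketch, assuming Stata's shipped auto dataset (not part of this assignment's data), with price playing the role of y and weight the role of x; the scalar names rho_xy, sd_price and sd_weight are illustrative:

```stata
* Illustration only: uses Stata's built-in auto data, not the survey data.
sysuse auto, clear

quietly correlate price weight
scalar rho_xy = r(rho)
quietly summarize price
scalar sd_price = r(sd)
quietly summarize weight
scalar sd_weight = r(sd)

* Regression of y (price) on x (weight): slope = Corr(x,y) * (SD_y / SD_x).
quietly regress price weight
display _b[weight] "   " scalar(rho_xy) * scalar(sd_price) / scalar(sd_weight)

* Reverse regression of x on y: slope = Corr(x,y) * (SD_x / SD_y).
quietly regress weight price
display _b[price] "   " scalar(rho_xy) * scalar(sd_weight) / scalar(sd_price)
```

Each display line prints the estimated slope next to the value implied by the correlation and standard deviations. Both slopes share the sign of the correlation, so a nonzero slope alone says nothing about which variable is the cause.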
. reg mathuseatwork yearsch

Source |       SS           df       MS      Number of obs   =     3,463
-------------+----------------------------------   F(1, 3461)      =    128.98
Model |  882.317455         1  882.317455   Prob > F        =    0.0000
Residual |  23676.1402     3,461  6.84083798   R-squared       =    0.0359
-------------+----------------------------------   Adj R-squared   =    0.0356
Total |  24558.4577     3,462  7.09371973   Root MSE        =    2.6155

------------------------------------------------------------------------------
mathuseatw~k |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
yearsch |   .2095326   .0184499    11.36   0.000     .1733588    .2457063
_cons |  -1.910755   .2399203    -7.96   0.000    -2.381155   -1.440355
------------------------------------------------------------------------------

. 
. **** Question 5(f) - Inference from 5(d) in light of 5(e)
. 
. /*
> Discussion:
> In 5(e), we see a rather stark case where causality cannot run in
> the direction estimated by OLS, where math use at work is estimated
> to increase years of schooling that predates work.
>
> This same concern is likely to extend to the relationship in 5(d).
> Individuals with higher cognitive ability are probably more likely
> to work in jobs with greater use of advanced math. In general,
> there is likely to be the same issue of simultaneity in the
> relationship between math use at work and cognitive ability.
>
> Generally, this question highlights the difficulty in finding good
> variables where there is no concern about OVB or reverse causality.
>
> Specifically, extending the logic from 5(d), it seems reasonable to
> believe that higher paying jobs may often require greater use of
> mathematics - irrespective of someone's aptitude or qualifications.
> Hence, rather than higher math use 'causing' higher earnings, higher
> earnings in these situations would be 'causing' more math use.
> But since more math use might actually have the effect we originally
> hypothesized - increasing cognitive ability and thereby leading to
> greater earnings - it's hard to disentangle these two effects.
>
> The potentially problematic nature of the relationship between math
> use at work and cognitive ability highlights another possible
> challenge to the regression we have specified in 4(a): while
> cognitive ability is likely to influence earnings, earnings may also
> be affecting the measurement of cognitive ability through higher
> math use at better paid jobs.
>
> Note: Question 5 is meant to get at the questions of
> reverse causality and simultaneity more in-depth. The timing of
> effects problem in 5(e) is meant especially to highlight that
> causality can't run in the direction specified. But it is also
> possible to make a critique centered entirely around more typical
> omitted variable bias (OVB).