# Introduction to Quantitative Methods

## 6. Assumptions and Violations of Assumptions

### 6.2 Solutions

If you're using a UCL computer, please make sure you're running R version 3.2.0. Some of the seminar tasks and exercises will not work with older versions of R. Click here for help on how to start the new version of R on UCL computers.

#### Exercise 1

Use the World Development Indicators and Polity IV datasets from 2012 to estimate a model explaining the effects of the following variables on levels of democracy. Make sure to test and correct for heteroskedasticity and Omitted Variable Bias.

• Unemployment Rate
• Inflation Rate
• Military Expenditure (% GDP)
• Healthcare Expenditure (% GDP)
• Mobile Subscribers (per 100)

NOTE: This exercise will require you to merge the two datasets together. Use iso3n country code as the matching columns for merging.

#### Solution

Let's start by clearing the environment

rm(list = ls())


Next, we load the texreg, lmtest, sandwich and dplyr packages

library(texreg)

Warning: package 'texreg' was built under R version 3.2.3

library(lmtest)
library(sandwich)
library(dplyr)


Load the two datasets, and merge using the iso3n column.

wdi <- read.csv("https://uclspp.github.io/PUBLG100/data/WDI_2012.csv")

# The Polity IV dataset must be loaded as well before merging; the original
# snippet omits this line, so point read.csv() at wherever your copy of the
# Polity IV 2012 data lives
polity <- read.csv("polity.csv")

merged_data <- merge(wdi, polity, by = "iso3n")


A number of observations have missing values, so we need to remove them. Using na.omit() is problematic because it removes an observation if it finds an NA in any column. Since we're only interested in the UnemploymentRate, Inflation, MilitaryExpenditure, HealthExpenditure, and MobileSubscribers columns, we use the function complete.cases() instead and get a subset of observations that have no missing values in the columns we care about.

merged_data <- subset(merged_data,
                      complete.cases(merged_data[, c("UnemploymentRate", "Inflation",
                                                     "MilitaryExpenditure", "HealthExpenditure",
                                                     "MobileSubscribers")]))
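
To see the difference between na.omit() and column-wise complete.cases(), consider a small hypothetical data frame (the data and names here are made up purely for illustration):

```r
toy <- data.frame(
  country = c("A", "B", "C"),
  gdp     = c(1.2, NA, 3.4),   # NA in a column we do NOT model
  polity  = c(10, 8, NA)       # NA in the column we care about
)

# na.omit() drops any row containing an NA anywhere: only country A is left
nrow(na.omit(toy))

# complete.cases() restricted to the columns of interest keeps countries A and B
nrow(subset(toy, complete.cases(toy[, "polity"])))
```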


Since we have 5 explanatory variables, we need to find the combination of variables that yields the best model. We will use Adjusted R-squared as the criterion for determining the best model; recall that it penalizes additional regressors: Adj. R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. Next we need a systematic way to find and test models with different combinations of explanatory variables. There are three possible approaches we can use:

1. Exhaustive search: Try every possible combination of explanatory variables available to us. With 5 variables, an exhaustive search requires (2^5) - 1, or 31, different models. As the number of variables increases, an exhaustive search quickly becomes impractical: with 10 variables we would need 1,023 models, and with 15 variables 32,767 models to compare.

2. Forward selection: The next option is to start with the simplest model of one variable and start adding variables one at a time until the addition of new variables no longer improves the model.

3. Backward elimination: The last option is to start with the most general model that includes every explanatory variable and remove one variable at a time as long as removing variables results in an improved model.
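
For a sense of what the exhaustive-search option involves, here is a minimal sketch that fits all 31 combinations and keeps the one with the highest Adjusted R-squared. It uses R's built-in mtcars data, with wt, hp, disp, qsec and drat purely as stand-ins for the exercise variables (an assumption for illustration, not part of the solution):

```r
# Exhaustive search over all non-empty subsets of 5 candidate predictors,
# scored by Adjusted R-squared (mtcars used as illustrative data)
predictors <- c("wt", "hp", "disp", "qsec", "drat")

best <- list(adj_r2 = -Inf, formula = NULL)
for (k in seq_along(predictors)) {
  for (combo in combn(predictors, k, simplify = FALSE)) {
    f <- reformulate(combo, response = "mpg")        # e.g. mpg ~ wt + hp
    adj <- summary(lm(f, data = mtcars))$adj.r.squared
    if (adj > best$adj_r2) best <- list(adj_r2 = adj, formula = f)
  }
}

best$formula   # the combination with the highest Adjusted R-squared
```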

For this exercise we will use the forward selection method and start with simple models of one variable each:

model_1 <- lm(polity2 ~ UnemploymentRate, data = merged_data)
model_2 <- lm(polity2 ~ Inflation, data = merged_data)
model_3 <- lm(polity2 ~ MilitaryExpenditure, data = merged_data)
model_4 <- lm(polity2 ~ HealthExpenditure, data = merged_data)
model_5 <- lm(polity2 ~ MobileSubscribers, data = merged_data)


Next we compare the models using screenreg(). Notice how we're providing custom model names for screenreg() to display:

screenreg(list(model_1, model_2, model_3, model_4, model_5),
          digits = 3,
          custom.model.names = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"))

===========================================================================
                     Model 1     Model 2     Model 3     Model 4     Model 5
---------------------------------------------------------------------------
(Intercept)           6.691 ***   8.031 ***  10.700 ***   3.359 *     4.889
                     (0.942)     (0.875)     (0.692)     (1.629)     (2.515)
UnemploymentRate      0.092
                     (0.084)
Inflation                        -0.139
                                 (0.206)
MilitaryExpenditure                          -0.478 ***
                                             (0.084)
HealthExpenditure                                         0.528 **
                                                         (0.196)
MobileSubscribers                                                     0.023
                                                                     (0.021)
---------------------------------------------------------------------------
R^2                   0.019       0.007       0.340       0.104       0.018
Adj. R^2              0.003      -0.009       0.330       0.089       0.003
Num. obs.            65          65          65          65          65
RMSE                  4.110       4.133       3.369       3.927       4.110
===========================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
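
Rather than reading the Adjusted R-squared values off the printed table, they can also be pulled out of the fitted models directly. A small convenience sketch using R's built-in mtcars data (an illustration only, not the exercise data):

```r
# Fit a few candidate models on R's built-in mtcars data
models <- list(
  lm(mpg ~ wt, data = mtcars),
  lm(mpg ~ hp, data = mtcars),
  lm(mpg ~ wt + hp, data = mtcars)
)

# Extract the Adjusted R-squared of every model in one step
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
adj_r2
which.max(adj_r2)   # index of the best model by this criterion
```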


Model 3 with MilitaryExpenditure as the explanatory variable has the highest Adjusted R-squared, so let's use that as the baseline and add the remaining variables one at a time and do another comparison.

model_6 <- lm(polity2 ~ MilitaryExpenditure + UnemploymentRate, data = merged_data)
model_7 <- lm(polity2 ~ MilitaryExpenditure + Inflation, data = merged_data)
model_8 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure, data = merged_data)
model_9 <- lm(polity2 ~ MilitaryExpenditure + MobileSubscribers, data = merged_data)

screenreg(list(model_3, model_6, model_7, model_8, model_9),
          digits = 3,
          custom.model.names = c("Model 3", "Model 6", "Model 7", "Model 8", "Model 9"))

===============================================================================
                     Model 3     Model 6     Model 7     Model 8     Model 9
-------------------------------------------------------------------------------
(Intercept)          10.700 ***  10.977 ***  11.064 ***   7.902 ***   8.703 ***
                     (0.692)     (1.099)     (0.895)     (1.624)     (2.171)
MilitaryExpenditure  -0.478 ***  -0.486 ***  -0.476 ***  -0.440 ***  -0.473 ***
                     (0.084)     (0.088)     (0.084)     (0.085)     (0.084)
UnemploymentRate                 -0.024
                                 (0.073)
Inflation                                    -0.109
                                             (0.169)
HealthExpenditure                                         0.321
                                                         (0.169)
MobileSubscribers                                                     0.017
                                                                     (0.017)
-------------------------------------------------------------------------------
R^2                   0.340       0.342       0.345       0.377       0.350
Adj. R^2              0.330       0.320       0.324       0.356       0.329
Num. obs.            65          65          65          65          65
RMSE                  3.369       3.393       3.385       3.302       3.370
===============================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


Adding HealthExpenditure improves the model slightly so we'll use MilitaryExpenditure + HealthExpenditure as the new baseline and repeat the process of forward selection.

model_10 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + UnemploymentRate, data = merged_data)
model_11 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + Inflation, data = merged_data)
model_12 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + MobileSubscribers, data = merged_data)

screenreg(list(model_8, model_10, model_11, model_12),
          digits = 3,
          custom.model.names = c("Model 8", "Model 10", "Model 11", "Model 12"))

===================================================================
                     Model 8     Model 10    Model 11    Model 12
-------------------------------------------------------------------
(Intercept)           7.902 ***   8.116 ***   7.914 ***   5.222
                     (1.624)     (1.873)     (1.991)     (2.723)
MilitaryExpenditure  -0.440 ***  -0.446 ***  -0.440 ***  -0.431 ***
                     (0.085)     (0.089)     (0.085)     (0.084)
HealthExpenditure     0.321       0.319       0.321       0.345 *
                     (0.169)     (0.171)     (0.182)     (0.170)
UnemploymentRate                 -0.017
                                 (0.071)
Inflation                                    -0.002
                                             (0.177)
MobileSubscribers                                         0.021
                                                         (0.017)
-------------------------------------------------------------------
R^2                   0.377       0.377       0.377       0.392
Adj. R^2              0.356       0.347       0.346       0.362
Num. obs.            65          65          65          65
RMSE                  3.302       3.327       3.328       3.288
===================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


The addition of MobileSubscribers as a control variable has improved the model slightly, so now we're only left with two variables, UnemploymentRate and Inflation, to test.

model_13 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + MobileSubscribers + UnemploymentRate, data = merged_data)
model_14 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + MobileSubscribers + Inflation, data = merged_data)

screenreg(list(model_12, model_13, model_14),
          digits = 3,
          custom.model.names = c("Model 12", "Model 13", "Model 14"))

=======================================================
                     Model 12    Model 13    Model 14
-------------------------------------------------------
(Intercept)           5.222       5.272       4.957
                     (2.723)     (3.032)     (3.122)
MilitaryExpenditure  -0.431 ***  -0.432 ***  -0.430 ***
                     (0.084)     (0.090)     (0.085)
HealthExpenditure     0.345 *     0.345 *     0.357
                     (0.170)     (0.172)     (0.183)
MobileSubscribers     0.021       0.020       0.021
                     (0.017)     (0.017)     (0.017)
UnemploymentRate                 -0.003
                                 (0.072)
Inflation                                     0.032
                                             (0.178)
-------------------------------------------------------
R^2                   0.392       0.392       0.392
Num. obs.            65          65          65
RMSE                  3.288       3.316       3.315
=======================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


Adding UnemploymentRate or Inflation does not improve our model at all, so we keep Model 12 as the best performing model based on Adjusted R-squared.

Now, let's test for heteroskedasticity in model_12 using the Breusch-Pagan test from the lmtest package.

bptest(model_12)

    studentized Breusch-Pagan test

data:  model_12
BP = 13.168, df = 3, p-value = 0.004288
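
For intuition, the (studentized) Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the original explanatory variables; n times the R-squared of that auxiliary regression is the test statistic, compared against a chi-squared distribution with as many degrees of freedom as there are regressors. A sketch on simulated data (all variable names here are hypothetical, not the exercise data):

```r
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = x)   # error variance grows with x: heteroskedastic

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the original regressor(s)
aux <- lm(residuals(fit)^2 ~ x)
bp  <- n * summary(aux)$r.squared              # the BP statistic
p   <- pchisq(bp, df = 1, lower.tail = FALSE)  # compare against chi-squared(1)

c(BP = bp, p.value = p)   # should match lmtest::bptest(fit) for this model
```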


The small p-value from bptest() tells us we're dealing with heteroskedastic errors, so we correct the standard errors using the heteroskedasticity-consistent estimator from vcovHC() and pass the corrected values to coeftest():

corrected_errors <- coeftest(model_12, vcov = vcovHC(model_12))

screenreg(model_12,
          digits = 3,
          custom.model.names = c("Model 12"),
          override.se = corrected_errors[, "Std. Error"],
          override.pval = corrected_errors[, "Pr(>|t|)"])

==============================
                     Model 12
------------------------------
(Intercept)           5.222
                     (3.752)
MilitaryExpenditure  -0.431 **
                     (0.128)
HealthExpenditure     0.345
                     (0.261)
MobileSubscribers     0.021
                     (0.026)
------------------------------
R^2                   0.392
Num. obs.            65
RMSE                  3.288
==============================
*** p < 0.001, ** p < 0.01, * p < 0.05
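
Under the hood, vcovHC() computes a "sandwich" estimator. The simplest flavour (HC0) is (X'X)^-1 X' diag(e^2) X (X'X)^-1, where e are the OLS residuals; vcovHC()'s default (HC3) only rescales the squared residuals. A sketch on simulated data (not the exercise data):

```r
set.seed(42)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 1 + 2 * x)   # simulated heteroskedastic errors

fit <- lm(y ~ x)
X <- model.matrix(fit)    # design matrix (intercept + x)
e <- residuals(fit)

bread  <- solve(t(X) %*% X)
meat   <- t(X) %*% diag(e^2) %*% X
vc_hc0 <- bread %*% meat %*% bread   # HC0 robust variance-covariance matrix

sqrt(diag(vc_hc0))   # robust standard errors, comparable to vcovHC(fit, type = "HC0")
```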


#### Exercise 2

Use the Massachusetts Test Scores dataset from Stock and Watson to fit a model and explain the effects of the following variables on test scores.

• Student-teacher ratio (stratio),
• % English learners (english),
• % Receiving Lunch subsidy (lunch),
• Average District income (income)

NOTE: Use the 8th grade test scores from the variable score8. The names of the explanatory variables are given in parentheses above.

Dataset: Massachusetts Test Scores

#### Solution

Let's clear the workspace to ensure that we don't accidentally use variables from the previous exercise.

rm(list = ls())


ma_schools <- read.csv("https://uclspp.github.io/PUBLG100/data/MA_Schools.csv")


Let's look at the summary of our dependent and independent variables:

summary(ma_schools$score8)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  641.0   685.0   698.0   698.4   712.0   747.0      40

summary(ma_schools$income)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  9.686  15.220  17.130  18.750  20.380  46.860

summary(ma_schools$lunch)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.40    5.30   10.55   15.32   20.02   76.20

summary(ma_schools$english)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.0000  0.0000  1.1180  0.8859 24.4900

summary(ma_schools$stratio)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  11.40   15.80   17.10   17.34   19.03   27.00


In the first exercise we used the forward selection method for finding the model that best explained the relationship between our independent variables and the dependent variable. In this exercise, we'll use the backward elimination method and start with the most general model that includes all variables of interest (i.e. stratio, english, lunch, and income).

model_1 <- lm(score8 ~ stratio + english + lunch + income, data = ma_schools)
summary(model_1)

Call:
lm(formula = score8 ~ stratio + english + lunch + income, data = ma_schools)

Residuals:
    Min      1Q  Median      3Q     Max
-21.557  -5.754  -1.120   4.835  31.856

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 684.79554    6.63042 103.281   <2e-16 ***
stratio      -0.34361    0.31353  -1.096    0.275
english      -0.23289    0.30761  -0.757    0.450
lunch        -0.71728    0.07021 -10.216   <2e-16 ***
income        1.67359    0.14882  11.245   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.794 on 175 degrees of freedom
  (40 observations deleted due to missingness)
Multiple R-squared:  0.8294,    Adjusted R-squared:  0.8255
F-statistic: 212.7 on 4 and 175 DF,  p-value: < 2.2e-16


Next we remove one variable at a time to see the difference it makes in our model.
model_2 <- lm(score8 ~ stratio + english + lunch, data = ma_schools)
model_3 <- lm(score8 ~ stratio + english + income, data = ma_schools)
model_4 <- lm(score8 ~ stratio + lunch + income, data = ma_schools)
model_5 <- lm(score8 ~ english + lunch + income, data = ma_schools)

screenreg(list(model_1, model_2, model_3, model_4, model_5),
          digits = 3,
          custom.model.names = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"))

============================================================================
             Model 1      Model 2      Model 3      Model 4      Model 5
----------------------------------------------------------------------------
(Intercept)  684.796 ***  730.125 ***  665.233 ***  685.837 ***  678.590 ***
             (6.630)      (6.890)      (7.998)      (6.478)      (3.452)
stratio       -0.344       -0.810 *     -0.635       -0.356
              (0.314)      (0.407)      (0.393)      (0.313)
english       -0.233        0.648       -2.389 ***                -0.250
              (0.308)      (0.389)      (0.282)                   (0.307)
lunch         -0.717 ***   -1.155 ***                -0.754 ***   -0.724 ***
              (0.070)      (0.076)                   (0.051)      (0.070)
income         1.674 ***                 2.516 ***    1.645 ***    1.695 ***
              (0.149)                   (0.156)      (0.144)      (0.148)
----------------------------------------------------------------------------
R^2            0.829        0.706        0.728        0.829        0.828
Adj. R^2       0.826        0.701        0.723        0.826        0.825
Num. obs.    180          180          180          180          180
RMSE           8.794       11.509       11.079        8.783        8.799
============================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


Removing the variable that measures the percent of English learners gives us the best model based on Adjusted R-squared. Let's see whether we can get better results by simplifying the model further.
model_6 <- lm(score8 ~ stratio + lunch, data = ma_schools)
model_7 <- lm(score8 ~ stratio + income, data = ma_schools)
model_8 <- lm(score8 ~ lunch + income, data = ma_schools)

screenreg(list(model_4, model_6, model_7, model_8),
          digits = 3,
          custom.model.names = c("Model 4", "Model 6", "Model 7", "Model 8"))

===============================================================
             Model 4      Model 6      Model 7      Model 8
---------------------------------------------------------------
(Intercept)  685.837 ***  729.338 ***  666.144 ***  679.474 ***
             (6.478)      (6.908)      (9.461)      (3.274)
stratio       -0.356       -0.797       -1.159 *
              (0.313)      (0.409)      (0.460)
lunch         -0.754 ***   -1.069 ***                -0.764 ***
              (0.051)      (0.057)                   (0.050)
income         1.645 ***                 2.790 ***    1.665 ***
              (0.144)                   (0.181)      (0.143)
---------------------------------------------------------------
R^2            0.829        0.702        0.617        0.828
Adj. R^2       0.826        0.698        0.612        0.826
Num. obs.    180          180          180          180
RMSE           8.783       11.566       13.107        8.790
===============================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


The results tell us that when we control for lunch subsidies and district income, the student-teacher ratio has no effect on test scores. Can we simplify the model further?

model_9 <- lm(score8 ~ lunch, data = ma_schools)
model_10 <- lm(score8 ~ income, data = ma_schools)

screenreg(list(model_8, model_9, model_10),
          digits = 3,
          custom.model.names = c("Model 8", "Model 9", "Model 10"))

==================================================
             Model 8      Model 9      Model 10
--------------------------------------------------
(Intercept)  679.474 ***  716.080 ***  643.894 ***
             (3.274)      (1.235)      (3.461)
lunch         -0.764 ***   -1.100 ***
              (0.050)      (0.055)
income         1.665 ***                 2.909 ***
              (0.143)                   (0.177)
--------------------------------------------------
R^2            0.828        0.695        0.603
Adj. R^2       0.826        0.693        0.601
Num. obs.    180          180          180
RMSE           8.790       11.657       13.303
==================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


It turns out that the model that includes lunch subsidies and district income best captures the relationship between our independent variables and the dependent variable. Now let's try the Breusch-Pagan test for heteroskedasticity.

bptest(model_8)

    studentized Breusch-Pagan test

data:  model_8
BP = 1.8209, df = 2, p-value = 0.4023


The Breusch-Pagan test tells us that we don't have heteroskedastic errors, so we don't need to correct for them.

#### Exercise 3

Can the same model be used for California? Why or why not? What assumptions would it violate and how would you correct for them?

Dataset: California Test Scores

#### Solution

Let's clear the workspace again to ensure that we don't accidentally use variables from the previous exercise.

rm(list = ls())


Load the California test score dataset and see how it compares to Massachusetts.

ca_schools <- read.csv("https://uclspp.github.io/PUBLG100/data/CA_Schools.csv")


Filter the results to only include grade 8 and calculate the student-teacher ratio and average score.

ca_schools <- ca_schools %>%
  filter(grades == "KK-08") %>%
  mutate(stratio = students / teachers,
         avg_score = (read + math) / 2)


Let's look at the summary of our dependent and independent variables:

summary(ca_schools$avg_score)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
605.6   639.4   653.5   652.9   665.8   706.8

summary(ca_schools$income)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  5.335  10.470  13.440  14.840  16.340  55.330

summary(ca_schools$lunch)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   25.35   44.68   45.93   67.09  100.00

summary(ca_schools$english)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.530   7.857  15.360  22.320  85.540

summary(ca_schools$stratio)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  14.00   18.67   19.78   19.71   20.89   25.80


Now let's estimate a model with all 4 explanatory variables (stratio, english, lunch, and income) and compare it against 4 other models, each with one variable removed.

model_1 <- lm(avg_score ~ stratio + english + lunch + income, data = ca_schools)
model_2 <- lm(avg_score ~ stratio + english + lunch, data = ca_schools)
model_3 <- lm(avg_score ~ stratio + english + income, data = ca_schools)
model_4 <- lm(avg_score ~ stratio + lunch + income, data = ca_schools)
model_5 <- lm(avg_score ~ english + lunch + income, data = ca_schools)

screenreg(list(model_1, model_2, model_3, model_4, model_5),
          digits = 3,
          custom.model.names = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"))

============================================================================
             Model 1      Model 2      Model 3      Model 4      Model 5
----------------------------------------------------------------------------
(Intercept)  673.416 ***  696.957 ***  638.237 ***  684.309 ***  663.166 ***
             (5.900)      (5.206)      (6.292)      (5.947)      (2.321)
stratio       -0.479       -0.894 ***    0.025       -0.791 **
              (0.253)      (0.263)      (0.302)      (0.262)
english       -0.214 ***   -0.147 ***   -0.489 ***                -0.227 ***
              (0.034)      (0.035)      (0.031)                   (0.033)
lunch         -0.383 ***   -0.527 ***                -0.504 ***   -0.374 ***
              (0.030)      (0.024)                   (0.024)      (0.030)
income         0.659 ***                 1.461 ***    0.493 ***    0.700 ***
              (0.093)                   (0.083)      (0.094)      (0.091)
----------------------------------------------------------------------------
R^2            0.789        0.759        0.692        0.765        0.787
Adj. R^2       0.786        0.757        0.689        0.763        0.785
Num. obs.    359          359          359          359          359
RMSE           8.637        9.217       10.420        9.101        8.668
============================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


Comparing the models above, we can see that the model with all 4 explanatory variables (stratio, english, lunch, and income) gives us the highest Adjusted R-squared. This differs from the Massachusetts study, where lunch subsidy and district income resulted in the best model.

Whether the model from Massachusetts can be applied to California is a matter of external validity and depends a great deal on whether the two populations are similar. Both datasets provide measurements at the district level using standardized test scores. However, the mean test scores from Massachusetts and California differ because of differences in how the tests are conducted in each state.

The student-teacher ratio also differs slightly between the two datasets. The income level in California has a larger spread than in Massachusetts, but the average income in Massachusetts is 20% higher. The number of students receiving lunch subsidies and the percentage of English language learners are both higher in California.