# Introduction to Quantitative Methods

## 6. Assumptions and Violations of Assumptions

### 6.2 Solutions

If you're using a UCL computer, please make sure you're running R version 3.2.0. Some of the seminar tasks and exercises will not work with older versions of R. Click here for help on how to start the new version of R on UCL computers.

#### Exercise 1

Use the World Development Indicators and Polity IV datasets from 2012 to estimate a model explaining the effects of the following variables on levels of democracy. Make sure to test and correct for heteroskedasticity and Omitted Variable Bias.

• Unemployment Rate
• Inflation Rate
• Military Expenditure (% GDP)
• Healthcare Expenditure (% GDP)
• Mobile Subscribers (per 100)

NOTE: This exercise will require you to merge the two datasets together. Use iso3n country code as the matching columns for merging.

#### Solution

Let's start by clearing the environment

rm(list = ls())


Next, we load the texreg, lmtest, sandwich and dplyr packages

library(texreg)

Warning: package 'texreg' was built under R version 3.2.3

library(lmtest)
library(sandwich)
library(dplyr)


Load the two datasets, and merge using the iso3n column.

wdi <- read.csv("https://uclspp.github.io/PUBLG100/data/WDI_2012.csv")

# The Polity IV dataset must be loaded as well before merging; the original
# snippet omits this line, so point read.csv() at wherever your copy of the
# Polity IV 2012 data lives
polity <- read.csv("polity.csv")

merged_data <- merge(wdi, polity, by = "iso3n")


A number of observations have missing values, so we need to remove them. Using na.omit() is problematic because it removes an observation if it finds an NA in any column. Since we're only interested in the UnemploymentRate, Inflation, MilitaryExpenditure, HealthExpenditure, and MobileSubscribers columns, we use the function complete.cases() instead and get a subset of observations that have no missing values in the columns we care about.

merged_data <- subset(merged_data,
                      complete.cases(merged_data[, c("UnemploymentRate", "Inflation",
                                                     "MilitaryExpenditure", "HealthExpenditure",
                                                     "MobileSubscribers")]))
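
To see the difference between na.omit() and column-wise complete.cases(), consider a small hypothetical data frame (the data and names here are made up purely for illustration):

```r
toy <- data.frame(
  country = c("A", "B", "C"),
  gdp     = c(1.2, NA, 3.4),   # NA in a column we do NOT model
  polity  = c(10, 8, NA)       # NA in the column we care about
)

# na.omit() drops any row containing an NA anywhere: only country A is left
nrow(na.omit(toy))

# complete.cases() restricted to the columns of interest keeps countries A and B
nrow(subset(toy, complete.cases(toy[, "polity"])))
```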


Since we have 5 explanatory variables, we need to find the combination of variables that yields the best model. We will use Adjusted R-squared as the criterion for determining the best model; recall that it penalizes additional regressors: Adj. R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. Next we need a systematic way to find and test models with different combinations of explanatory variables. There are three possible approaches we can use:

1. Exhaustive search: Try every possible combination of explanatory variables available to us. With 5 variables, an exhaustive search requires (2^5) - 1, or 31, different models. As the number of variables increases, an exhaustive search quickly becomes impractical: with 10 variables we would need 1,023 models, and with 15 variables 32,767 models to compare.

2. Forward selection: The next option is to start with the simplest model of one variable and start adding variables one at a time until the addition of new variables no longer improves the model.

3. Backward elimination: The last option is to start with the most general model that includes every explanatory variable and remove one variable at a time as long as removing variables results in an improved model.
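
For a sense of what the exhaustive-search option involves, here is a minimal sketch that fits all 31 combinations and keeps the one with the highest Adjusted R-squared. It uses R's built-in mtcars data, with wt, hp, disp, qsec and drat purely as stand-ins for the exercise variables (an assumption for illustration, not part of the solution):

```r
# Exhaustive search over all non-empty subsets of 5 candidate predictors,
# scored by Adjusted R-squared (mtcars used as illustrative data)
predictors <- c("wt", "hp", "disp", "qsec", "drat")

best <- list(adj_r2 = -Inf, formula = NULL)
for (k in seq_along(predictors)) {
  for (combo in combn(predictors, k, simplify = FALSE)) {
    f <- reformulate(combo, response = "mpg")        # e.g. mpg ~ wt + hp
    adj <- summary(lm(f, data = mtcars))$adj.r.squared
    if (adj > best$adj_r2) best <- list(adj_r2 = adj, formula = f)
  }
}

best$formula   # the combination with the highest Adjusted R-squared
```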

For this exercise we will use the forward selection method and start with simple models of one variable each:

model_1 <- lm(polity2 ~ UnemploymentRate, data = merged_data)
model_2 <- lm(polity2 ~ Inflation, data = merged_data)
model_3 <- lm(polity2 ~ MilitaryExpenditure, data = merged_data)
model_4 <- lm(polity2 ~ HealthExpenditure, data = merged_data)
model_5 <- lm(polity2 ~ MobileSubscribers, data = merged_data)


Next we compare the models using screenreg(). Notice how we're providing custom model names for screenreg() to display:

screenreg(list(model_1, model_2, model_3, model_4, model_5),
          digits = 3,
          custom.model.names = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"))

===========================================================================
                     Model 1     Model 2     Model 3     Model 4     Model 5
---------------------------------------------------------------------------
(Intercept)           6.691 ***   8.031 ***  10.700 ***   3.359 *     4.889
                     (0.942)     (0.875)     (0.692)     (1.629)     (2.515)
UnemploymentRate      0.092
                     (0.084)
Inflation                        -0.139
                                 (0.206)
MilitaryExpenditure                          -0.478 ***
                                             (0.084)
HealthExpenditure                                         0.528 **
                                                         (0.196)
MobileSubscribers                                                     0.023
                                                                     (0.021)
---------------------------------------------------------------------------
R^2                   0.019       0.007       0.340       0.104       0.018
Adj. R^2              0.003      -0.009       0.330       0.089       0.003
Num. obs.            65          65          65          65          65
RMSE                  4.110       4.133       3.369       3.927       4.110
===========================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
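
Rather than reading the Adjusted R-squared values off the printed table, they can also be pulled out of the fitted models directly. A small convenience sketch using R's built-in mtcars data (an illustration only, not the exercise data):

```r
# Fit a few candidate models on R's built-in mtcars data
models <- list(
  lm(mpg ~ wt, data = mtcars),
  lm(mpg ~ hp, data = mtcars),
  lm(mpg ~ wt + hp, data = mtcars)
)

# Extract the Adjusted R-squared of every model in one step
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
adj_r2
which.max(adj_r2)   # index of the best model by this criterion
```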


Model 3 with MilitaryExpenditure as the explanatory variable has the highest Adjusted R-squared, so let's use that as the baseline and add the remaining variables one at a time and do another comparison.

model_6 <- lm(polity2 ~ MilitaryExpenditure + UnemploymentRate, data = merged_data)
model_7 <- lm(polity2 ~ MilitaryExpenditure + Inflation, data = merged_data)
model_8 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure, data = merged_data)
model_9 <- lm(polity2 ~ MilitaryExpenditure + MobileSubscribers, data = merged_data)

screenreg(list(model_3, model_6, model_7, model_8, model_9),
          digits = 3,
          custom.model.names = c("Model 3", "Model 6", "Model 7", "Model 8", "Model 9"))

===============================================================================
                     Model 3     Model 6     Model 7     Model 8     Model 9
-------------------------------------------------------------------------------
(Intercept)          10.700 ***  10.977 ***  11.064 ***   7.902 ***   8.703 ***
                     (0.692)     (1.099)     (0.895)     (1.624)     (2.171)
MilitaryExpenditure  -0.478 ***  -0.486 ***  -0.476 ***  -0.440 ***  -0.473 ***
                     (0.084)     (0.088)     (0.084)     (0.085)     (0.084)
UnemploymentRate                 -0.024
                                 (0.073)
Inflation                                    -0.109
                                             (0.169)
HealthExpenditure                                         0.321
                                                         (0.169)
MobileSubscribers                                                     0.017
                                                                     (0.017)
-------------------------------------------------------------------------------
R^2                   0.340       0.342       0.345       0.377       0.350
Adj. R^2              0.330       0.320       0.324       0.356       0.329
Num. obs.            65          65          65          65          65
RMSE                  3.369       3.393       3.385       3.302       3.370
===============================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


Adding HealthExpenditure improves the model slightly so we'll use MilitaryExpenditure + HealthExpenditure as the new baseline and repeat the process of forward selection.

model_10 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + UnemploymentRate, data = merged_data)
model_11 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + Inflation, data = merged_data)
model_12 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + MobileSubscribers, data = merged_data)

screenreg(list(model_8, model_10, model_11, model_12),
          digits = 3,
          custom.model.names = c("Model 8", "Model 10", "Model 11", "Model 12"))

===================================================================
                     Model 8     Model 10    Model 11    Model 12
-------------------------------------------------------------------
(Intercept)           7.902 ***   8.116 ***   7.914 ***   5.222
                     (1.624)     (1.873)     (1.991)     (2.723)
MilitaryExpenditure  -0.440 ***  -0.446 ***  -0.440 ***  -0.431 ***
                     (0.085)     (0.089)     (0.085)     (0.084)
HealthExpenditure     0.321       0.319       0.321       0.345 *
                     (0.169)     (0.171)     (0.182)     (0.170)
UnemploymentRate                 -0.017
                                 (0.071)
Inflation                                    -0.002
                                             (0.177)
MobileSubscribers                                         0.021
                                                         (0.017)
-------------------------------------------------------------------
R^2                   0.377       0.377       0.377       0.392
Adj. R^2              0.356       0.347       0.346       0.362
Num. obs.            65          65          65          65
RMSE                  3.302       3.327       3.328       3.288
===================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


The addition of MobileSubscribers as a control variable has improved the model slightly, so now we're only left with two variables, UnemploymentRate and Inflation, to test.

model_13 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + MobileSubscribers + UnemploymentRate, data = merged_data)
model_14 <- lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + MobileSubscribers + Inflation, data = merged_data)

screenreg(list(model_12, model_13, model_14),
          digits = 3,
          custom.model.names = c("Model 12", "Model 13", "Model 14"))

=======================================================
                     Model 12    Model 13    Model 14
-------------------------------------------------------
(Intercept)           5.222       5.272       4.957
                     (2.723)     (3.032)     (3.122)
MilitaryExpenditure  -0.431 ***  -0.432 ***  -0.430 ***
                     (0.084)     (0.090)     (0.085)
HealthExpenditure     0.345 *     0.345 *     0.357
                     (0.170)     (0.172)     (0.183)
MobileSubscribers     0.021       0.020       0.021
                     (0.017)     (0.017)     (0.017)
UnemploymentRate                 -0.003
                                 (0.072)
Inflation                                     0.032
                                             (0.178)
-------------------------------------------------------
R^2                   0.392       0.392       0.392
Num. obs.            65          65          65
RMSE                  3.288       3.316       3.315
=======================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


Adding UnemploymentRate or Inflation does not improve our model at all, so we keep Model 12 as the best performing model based on Adjusted R-squared.

Now, let's test for heteroskedasticity in model_12 using the Breusch-Pagan test from the lmtest package.

bptest(model_12)

    studentized Breusch-Pagan test

data:  model_12
BP = 13.168, df = 3, p-value = 0.004288
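
For intuition, the (studentized) Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the original explanatory variables; n times the R-squared of that auxiliary regression is the test statistic, compared against a chi-squared distribution with as many degrees of freedom as there are regressors. A sketch on simulated data (all variable names here are hypothetical, not the exercise data):

```r
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = x)   # error variance grows with x: heteroskedastic

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the original regressor(s)
aux <- lm(residuals(fit)^2 ~ x)
bp  <- n * summary(aux)$r.squared              # the BP statistic
p   <- pchisq(bp, df = 1, lower.tail = FALSE)  # compare against chi-squared(1)

c(BP = bp, p.value = p)   # should match lmtest::bptest(fit) for this model
```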


The small p-value from bptest() tells us we're dealing with heteroskedastic errors, so we correct the standard errors using the heteroskedasticity-consistent estimator from vcovHC() and pass the corrected values to coeftest():

corrected_errors <- coeftest(model_12, vcov = vcovHC(model_12))

screenreg(model_12,
          digits = 3,
          custom.model.names = c("Model 12"),
          override.se = corrected_errors[, "Std. Error"],
          override.pval = corrected_errors[, "Pr(>|t|)"])

==============================
                     Model 12
------------------------------
(Intercept)           5.222
                     (3.752)
MilitaryExpenditure  -0.431 **
                     (0.128)
HealthExpenditure     0.345
                     (0.261)
MobileSubscribers     0.021
                     (0.026)
------------------------------
R^2                   0.392
Num. obs.            65
RMSE                  3.288
==============================
*** p < 0.001, ** p < 0.01, * p < 0.05
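
Under the hood, vcovHC() computes a "sandwich" estimator. The simplest flavour (HC0) is (X'X)^-1 X' diag(e^2) X (X'X)^-1, where e are the OLS residuals; vcovHC()'s default (HC3) only rescales the squared residuals. A sketch on simulated data (not the exercise data):

```r
set.seed(42)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 1 + 2 * x)   # simulated heteroskedastic errors

fit <- lm(y ~ x)
X <- model.matrix(fit)    # design matrix (intercept + x)
e <- residuals(fit)

bread  <- solve(t(X) %*% X)
meat   <- t(X) %*% diag(e^2) %*% X
vc_hc0 <- bread %*% meat %*% bread   # HC0 robust variance-covariance matrix

sqrt(diag(vc_hc0))   # robust standard errors, comparable to vcovHC(fit, type = "HC0")
```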


#### Exercise 2

Use the Massachusetts Test Scores dataset from Stock and Watson to fit a model and explain the effects of the following variables on test scores.

• Student-teacher ratio (stratio),
• % English learners (english),
• % Receiving Lunch subsidy (lunch),
• Average District income (income)

NOTE: Use the 8th grade test scores from the variable score8. The names of the explanatory variables are given in parentheses above.

Dataset: Massachusetts Test Scores

#### Solution

Let's clear the workspace to ensure that we don't accidentally use variables from the previous exercise.

rm(list = ls())


ma_schools <- read.csv("https://uclspp.github.io/PUBLG100/data/MA_Schools.csv")


Let's look at the summary of our dependent and independent variables:

summary(ma_schools$score8)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  641.0   685.0   698.0   698.4   712.0   747.0      40

summary(ma_schools$income)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  9.686  15.220  17.130  18.750  20.380  46.860

summary(ma_schools$lunch)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.40    5.30   10.55   15.32   20.02   76.20

summary(ma_schools$english)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.0000  0.0000  1.1180  0.8859 24.4900

summary(ma_schools$stratio)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  11.40   15.80   17.10   17.34   19.03   27.00


In the first exercise we used the forward selection method for finding the model that best explained the relationship between our independent variables and the dependent variable. In this exercise, we'll use the backward elimination method and start with the most general model that includes all variables of interest (i.e. stratio, english, lunch, and income).

model_1 <- lm(score8 ~ stratio + english + lunch + income, data = ma_schools)
summary(model_1)

Call:
lm(formula = score8 ~ stratio + english + lunch + income, data = ma_schools)

Residuals:
    Min      1Q  Median      3Q     Max
-21.557  -5.754  -1.120   4.835  31.856

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 684.79554    6.63042 103.281   <2e-16 ***
stratio      -0.34361    0.31353  -1.096    0.275
english      -0.23289    0.30761  -0.757    0.450
lunch        -0.71728    0.07021 -10.216   <2e-16 ***
income        1.67359    0.14882  11.245   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.794 on 175 degrees of freedom
  (40 observations deleted due to missingness)
Multiple R-squared:  0.8294,    Adjusted R-squared:  0.8255
F-statistic: 212.7 on 4 and 175 DF,  p-value: < 2.2e-16


Next we remove one variable at a time to see the difference it makes in our model.
model_2 <- lm(score8 ~ stratio + english + lunch, data = ma_schools)
model_3 <- lm(score8 ~ stratio + english + income, data = ma_schools)
model_4 <- lm(score8 ~ stratio + lunch + income, data = ma_schools)
model_5 <- lm(score8 ~ english + lunch + income, data = ma_schools)

screenreg(list(model_1, model_2, model_3, model_4, model_5),
          digits = 3,
          custom.model.names = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"))

============================================================================
             Model 1      Model 2      Model 3      Model 4      Model 5
----------------------------------------------------------------------------
(Intercept)  684.796 ***  730.125 ***  665.233 ***  685.837 ***  678.590 ***
             (6.630)      (6.890)      (7.998)      (6.478)      (3.452)
stratio       -0.344       -0.810 *     -0.635       -0.356
              (0.314)      (0.407)      (0.393)      (0.313)
english       -0.233        0.648       -2.389 ***                -0.250
              (0.308)      (0.389)      (0.282)                   (0.307)
lunch         -0.717 ***   -1.155 ***                -0.754 ***   -0.724 ***
              (0.070)      (0.076)                   (0.051)      (0.070)
income         1.674 ***                 2.516 ***    1.645 ***    1.695 ***
              (0.149)                   (0.156)      (0.144)      (0.148)
----------------------------------------------------------------------------
R^2            0.829        0.706        0.728        0.829        0.828
Adj. R^2       0.826        0.701        0.723        0.826        0.825
Num. obs.    180          180          180          180          180
RMSE           8.794       11.509       11.079        8.783        8.799
============================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


Removing the variable that measures the percent of English learners gives us the best model based on Adjusted R-squared. Let's see whether we can get better results by simplifying the model further.
model_6 <- lm(score8 ~ stratio + lunch, data = ma_schools)
model_7 <- lm(score8 ~ stratio + income, data = ma_schools)
model_8 <- lm(score8 ~ lunch + income, data = ma_schools)

screenreg(list(model_4, model_6, model_7, model_8),
          digits = 3,
          custom.model.names = c("Model 4", "Model 6", "Model 7", "Model 8"))

===============================================================
             Model 4      Model 6      Model 7      Model 8
---------------------------------------------------------------
(Intercept)  685.837 ***  729.338 ***  666.144 ***  679.474 ***
             (6.478)      (6.908)      (9.461)      (3.274)
stratio       -0.356       -0.797       -1.159 *
              (0.313)      (0.409)      (0.460)
lunch         -0.754 ***   -1.069 ***                -0.764 ***
              (0.051)      (0.057)                   (0.050)
income         1.645 ***                 2.790 ***    1.665 ***
              (0.144)                   (0.181)      (0.143)
---------------------------------------------------------------
R^2            0.829        0.702        0.617        0.828
Adj. R^2       0.826        0.698        0.612        0.826
Num. obs.    180          180          180          180
RMSE           8.783       11.566       13.107        8.790
===============================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


The results tell us that when we control for lunch subsidies and district income, the student-teacher ratio has no effect on test scores. Can we simplify the model further?

model_9 <- lm(score8 ~ lunch, data = ma_schools)
model_10 <- lm(score8 ~ income, data = ma_schools)

screenreg(list(model_8, model_9, model_10),
          digits = 3,
          custom.model.names = c("Model 8", "Model 9", "Model 10"))

==================================================
             Model 8      Model 9      Model 10
--------------------------------------------------
(Intercept)  679.474 ***  716.080 ***  643.894 ***
             (3.274)      (1.235)      (3.461)
lunch         -0.764 ***   -1.100 ***
              (0.050)      (0.055)
income         1.665 ***                 2.909 ***
              (0.143)                   (0.177)
--------------------------------------------------
R^2            0.828        0.695        0.603
Adj. R^2       0.826        0.693        0.601
Num. obs.    180          180          180
RMSE           8.790       11.657       13.303
==================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


It turns out that the model that includes lunch subsidies and district income best captures the relationship between our independent variables and the dependent variable. Now let's try the Breusch-Pagan test for heteroskedasticity.

bptest(model_8)

    studentized Breusch-Pagan test

data:  model_8
BP = 1.8209, df = 2, p-value = 0.4023


The Breusch-Pagan test tells us that we don't have heteroskedastic errors, so we don't need to correct for them.

#### Exercise 3

Can the same model be used for California? Why or why not? What assumptions would it violate and how would you correct for them?

Dataset: California Test Scores

#### Solution

Let's clear the workspace again to ensure that we don't accidentally use variables from the previous exercise.

rm(list = ls())


Load the California test score dataset and see how it compares to Massachusetts.

ca_schools <- read.csv("https://uclspp.github.io/PUBLG100/data/CA_Schools.csv")


Filter the results to only include grade 8 and calculate the student-teacher ratio and average score.

ca_schools <- ca_schools %>%
  filter(grades == "KK-08") %>%
  mutate(stratio = students / teachers,
         avg_score = (read + math) / 2)


Let's look at the summary of our dependent and independent variables:

summary(ca_schools$avg_score)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
605.6   639.4   653.5   652.9   665.8   706.8

summary(ca_schools$income)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  5.335  10.470  13.440  14.840  16.340  55.330

summary(ca_schools$lunch)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   25.35   44.68   45.93   67.09  100.00

summary(ca_schools$english)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.530   7.857  15.360  22.320  85.540

summary(ca_schools$stratio)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  14.00   18.67   19.78   19.71   20.89   25.80


Now let's estimate a model with all 4 explanatory variables (stratio, english, lunch, and income) and compare it against 4 other models, each with one variable removed.

model_1 <- lm(avg_score ~ stratio + english + lunch + income, data = ca_schools)
model_2 <- lm(avg_score ~ stratio + english + lunch, data = ca_schools)
model_3 <- lm(avg_score ~ stratio + english + income, data = ca_schools)
model_4 <- lm(avg_score ~ stratio + lunch + income, data = ca_schools)
model_5 <- lm(avg_score ~ english + lunch + income, data = ca_schools)

screenreg(list(model_1, model_2, model_3, model_4, model_5),
          digits = 3,
          custom.model.names = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"))

============================================================================
             Model 1      Model 2      Model 3      Model 4      Model 5
----------------------------------------------------------------------------
(Intercept)  673.416 ***  696.957 ***  638.237 ***  684.309 ***  663.166 ***
             (5.900)      (5.206)      (6.292)      (5.947)      (2.321)
stratio       -0.479       -0.894 ***    0.025       -0.791 **
              (0.253)      (0.263)      (0.302)      (0.262)
english       -0.214 ***   -0.147 ***   -0.489 ***                -0.227 ***
              (0.034)      (0.035)      (0.031)                   (0.033)
lunch         -0.383 ***   -0.527 ***                -0.504 ***   -0.374 ***
              (0.030)      (0.024)                   (0.024)      (0.030)
income         0.659 ***                 1.461 ***    0.493 ***    0.700 ***
              (0.093)                   (0.083)      (0.094)      (0.091)
----------------------------------------------------------------------------
R^2            0.789        0.759        0.692        0.765        0.787
Adj. R^2       0.786        0.757        0.689        0.763        0.785
Num. obs.    359          359          359          359          359
RMSE           8.637        9.217       10.420        9.101        8.668
============================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05


Comparing the models above, we can see that the model with all 4 explanatory variables (stratio, english, lunch, and income) gives us the highest Adjusted R-squared. This differs from the Massachusetts study, where lunch subsidy and district income resulted in the best model.

Whether the model from Massachusetts can be applied to California is a matter of external validity and depends a great deal on whether the two populations are similar. Both datasets provide measurements at the district level using standardized test scores. However, the mean test scores from Massachusetts and California differ because of differences in how the tests are conducted in each state.

The student-teacher ratio also differs slightly between the two datasets. The income level in California has a larger spread than in Massachusetts, but the average income in Massachusetts is 20% higher. The number of students receiving lunch subsidies and the percentage of English language learners are both higher in California.