Introduction to Quantitative Methods

9. Binary models: Logit

9.2 Solutions

If you're using a UCL computer, please make sure you're running R version 3.3.1 or newer. Some of the seminar tasks and exercises will not work with older versions of R.

Exercise 1

Estimate a model similar to model 2 in the seminar and add Civic Duty (CivicDutyScores) and party identification (PartyID). Show and discuss descriptive statistics of the variables in a meaningful way. E.g.: what does a unit change in a variable mean, are the variables comparable. Do this for all variables but concentrate your discussion on the interesting ones.

Solution

We begin by clLoading all the packages we need.

library(foreign) 
library(Zelig) 
library(texreg) 
library(lmtest)
library(dplyr)

Clear the environment

rm(list = ls())

Load the data and remove observations with missing values.

bes <- read.dta("bes.dta")

bes <- filter(bes, 
              !is.na(Turnout), 
              !is.na(Income), 
              !is.na(polinfoindex), 
              !is.na(CivicDutyScores),
              !is.na(Influence),
              !is.na(PartyID),
              !is.na(Gender), 
              !is.na(edu15), 
              !is.na(edu17), 
              !is.na(edu18), 
              !is.na(edu19plus), 
              !is.na(in_school), 
              !is.na(in_uni))

Let's convert the Gender variable to a factor just like we did in the seminar.

bes$Gender <- factor(bes$Gender, levels = c(0, 1), labels = c("Female", "Male"))
table(bes$PartyID)
   0    1 
1681 2480 
hist(bes$CivicDutyScores, 
     main = "Distribution of Civic Duty",
     xlab = "Level of Civic Duty",
     ylab = "Frequency")

The summary function tells us that CivicDutyScores is a continuous variable. It ranges from a minimum of -3.5 to a maximum of 2.7, has an average value of 0 and is slightly left-skewed. A 'one-unit-change' in civic duty corresponds to a hefty 16% percentage point change in the sense of duty (1/(-3.5 - 2.7)). That means a one-unit-change in civic duty is a lot.

PartyID is binary, 60% of the respondents feel attached to a political party. A unit increase in PartyID corresponds to a change from a person who does not identify with a political party to someone who does. Thus, the variable captures two distinct groups of people.

As the two independent variables (CivicDutyScores and PartyID) are not on the same scale - the former ranges from -3.5 to 2.7 and the latter takes only two values: 0 or 1 - the coefficient sizes will not be directly comparable. That means we cannot say which has a bigger effect just by looking at the absolute size of the coefficients. You can only compare effect sizes of different coefficients when variables are on the same scale (if you are interested, have a look at Stock & Watson Chapter 2.4 to see how to standardize variables and make them comparable)

The resource theory claims that people turn out to vote when they possess more material and cognitive resources. The corresponding variables are Income and polinfoindex.

Income is interval scaled with a minimum of 1 (poorest) and a maximum of 13 (richest). Average income is 5.3. The variable is heavily left-skewed, meaning that there are many more poorer than richer people. polinfoindex measures knowledge about politics and ranges from 0 (fewest) to 8 (most). The mean is 5.4 and the distribution is right-skewed, fewer people have little knowledge than have a lot. For both variables the theoretical range corresponds to the range observed in the sample. A one-unit-change in Income corresponds to 8% of the range of Income. A one-unit-change in knowledge about politics corresponds to 13% of the range of the variable.

hist(bes$Income, main = "Distribution of Income",
     xlab = "Income Level",
     ylab = "Frequency")

hist(bes$polinfoindex, main = "Distribution of Knowledge of Politics",
     xlab = "Level of Knowledge",
     ylab = "Frequency",
     breaks = seq(0, 8, 1))

Influence measures the subjective feeling that one can influence politics. The range is from 1 to 11. The average value is 3.61. The distribution is heavily left-skewed reflecting that most people feel they have little influence over politics. A unit increase corresponds to 10% of the range of the variable.

hist(bes$Influence, main = "Distribution of Influence",
     xlab = "Influence Level",
     ylab = "Frequency")

Education, gender and age are socioeconomic controls (we do not really care about them but excluding them could lead to omitted variable bias). The modal category of education is 15 years of schooling (33%), the average age is 51 and 44% of respondents are male.

Our dependent variable is Turnout, 74% of respondents reported that they turned out to vote. Considering actual participation rates, this value seems rather high and suggests some bias in reporting the result that is socially desirable. 74% sets the baseline for our model. We need to correctly categorize more than 74% of the respondents to add value with our model.

Note: This discussion of the variables is more than you would usually do but it is good to go through these motions even if you do not write about them in order to better understand your data and avoid miskakes later on. What is more, this data description is independent of the subsequent analysis, i.e. we would do the same for linear models.

Finally, estimate the model that includes CivicDutyScores and PartyID.

full_model <- glm(Turnout ~ Income + polinfoindex + Influence + 
                    CivicDutyScores + PartyID + Gender + Age +
                    edu15 + edu17 + edu18 + edu19plus + 
                    in_school + in_uni, 
                  family = binomial(link = "logit"), 
                  data = bes)

Exercise 2

Do joint hypothesis testing and explain your results. You have added two variables, so the comparison should be to the model without those two new variables.

Solution

For the joint hypothesis test, we need to also re-estimate model 2 from the seminar. Then, we will do the likelihood ratio test. We put the restricted model first (the model that restricts CivicDutyScores and PartyID to be zero) and we put the unrestricted full model second. To compare models they need to be nested and based on the sample.

model2 <- glm(Turnout ~ Income + polinfoindex + Influence + 
                Gender + Age + edu15 + edu17 + edu18 + 
                edu19plus + in_school + in_uni, 
              family = binomial(link = "logit"), 
              data = bes)

# likelihood ratio test
lrtest(model2, full_model)
Likelihood ratio test

Model 1: Turnout ~ Income + polinfoindex + Influence + Gender + Age + 
    edu15 + edu17 + edu18 + edu19plus + in_school + in_uni
Model 2: Turnout ~ Income + polinfoindex + Influence + CivicDutyScores + 
    PartyID + Gender + Age + edu15 + edu17 + edu18 + edu19plus + 
    in_school + in_uni
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1  12 -1990.0                         
2  14 -1610.1  2 759.73  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All we need to look at is the p-value of the test statistic. The full model improves upon model 2. It does better at categorizing respondents into voters and non-voters.

Exercise 3

What is the model's predictive power? How well does it do compared to the naive guess and model 2, that we estimated in class?

Solution

We need to calculate the percentage of correctly predicted cases for the new model and for model 2 to answer this. We then compare the percentage of correctly predicted cases of our models and the naive guess.

predicted_probs <- predict.glm(full_model, type = "response")

expected <- as.numeric(predicted_probs > 0.5)

observed <- bes$Turnout

outcome <- table(observed, expected)
outcome
        expected
observed    0    1
       0  608  471
       1  216 2866
(outcome[1,1] + outcome[2,2]) / sum(outcome)
[1] 0.8348955

Our new model correctly categorizes 83% of respondents. That is a clear improvement on the naive guess 74% and model 2 from the seminar.

Exercise 4

How many false positives and how many false negatives do we get in our full final model?

Solution

We have already produced the relevant table that compares the predictions of our full model to the observed outcomes. All we need to do is to look at the off-diagonal positions for type I and II errors.

False positives are in row 1, column 2:

outcome[1,2]
[1] 471

False negatives are in row 2, column 1:

outcome[2,1]
[1] 216

Type I errors are false positives. We predict that people turn out to vote but they did not. We have 471 such cases. Type II errors are false negatives. We predict that people did not vote, when they actually did. We have 216 such cases.

Exercise 5

Choose an average voter who does not identify with a party and show how likely that voter is to vote. Compare that to a voter who identifies with a political party. Interpret your results.

Solution

We need to re-estimate our full model using Zelig to set the respective scenarios and simulate the results.

z.out <- zelig(Turnout ~ Income + polinfoindex + Influence + 
                 CivicDutyScores + PartyID + Gender + Age +
                 edu15 + edu17 + edu18 + edu19plus + in_school + in_uni, 
               model = "logit", 
               data = bes)
How to cite this model in Zelig:
  R Core Team. 2007.
  logit: Logistic Regression for Dichotomous Dependent Variables
  in Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
  "Zelig: Everyone's Statistical Software," http://zeligproject.org/
# set covariates: no identification with a party
x.partyid.no <- setx(z.out, PartyID = 0, 
                     Income = mean(bes$Income), 
                     polinfoindex = mean(bes$polinfoindex), 
                     Influence = mean(bes$Influence),
                     CivicDutyScores = mean(bes$CivicDutyScores), 
                     Gender = "Female", 
                     Age = mean(bes$Age), 
                     edu15 = 1, 
                     edu17 = 0, 
                     edu18 = 0, 
                     edu19plus = 0, 
                     in_school = 0,
                     in_uni = 0)

# set covariates: person who identifies with a party
x.partyid.yes <- setx(z.out, 
                      PartyID = 1, 
                      Income = mean(bes$Income), 
                      polinfoindex = mean(bes$polinfoindex), 
                      Influence = mean(bes$Influence),
                      CivicDutyScores = mean(bes$CivicDutyScores), 
                      Gender = "Female", 
                      Age = mean(bes$Age), 
                      edu15 = 1, 
                      edu17 = 0, 
                      edu18 = 0, 
                      edu19plus = 0, 
                      in_school = 0, 
                      in_uni = 0)

# make results replicable
set.seed(123)

s.out <- sim(z.out, x = x.partyid.no, x1 = x.partyid.yes)

summary(s.out)
 sim x :
 -----
ev
          mean         sd       50%      2.5%     97.5%
[1,] 0.7678392 0.02023501 0.7678416 0.7268195 0.8050944
pv
         0     1
[1,] 0.224 0.776

 sim x1 :
 -----
ev
          mean         sd       50%      2.5%     97.5%
[1,] 0.8105673 0.01592236 0.8109989 0.7813206 0.8408141
pv
         0     1
[1,] 0.187 0.813
fd
           mean         sd        50%       2.5%      97.5%
[1,] 0.04272805 0.01650547 0.04227008 0.01171188 0.07635594

In our interpretation, the reported uncertainty will be based on the 95% confidence level. For the average person from our sample who does not identify with any party, we predict a 77% turnout probability in a range from 73% to 81% turnout probability. In line with what we would expect, the average person who does identify with a party is more likely to vote. We predict a probability of 81% (78% to 84%). In addition, the difference between our two hypothetical women is statistically significant. The difference between them is 4 percentage points (between 1 to 7 percentage points).

Exercise 7

Thoroughly interpret the full final model in the light of all three theories in substantial terms.

Solution

We will finally look at regression tables. The model that we interpret is the final full model. However, we will also print model 2 in order to describe changes in the coefficients that were due to the different model specifications.

screenreg(list (model2, full_model), 
          custom.model.names = c("Model 2", "Full Final Model"))
===============================================
                 Model 2       Full Final Model
-----------------------------------------------
(Intercept)         -3.90 ***     -1.55 ***    
                    (0.22)        (0.26)       
Income               0.15 ***      0.17 ***    
                    (0.02)        (0.02)       
polinfoindex         0.25 ***      0.14 ***    
                    (0.02)        (0.03)       
Influence            0.21 ***      0.01        
                    (0.02)        (0.02)       
GenderMale          -0.36 ***     -0.20 *      
                    (0.08)        (0.09)       
Age                  0.05 ***      0.03 ***    
                    (0.00)        (0.00)       
edu15               -0.34 **      -0.32 *      
                    (0.11)        (0.13)       
edu17                0.36 *        0.42 *      
                    (0.16)        (0.18)       
edu18                0.14          0.11        
                    (0.15)        (0.17)       
edu19plus            0.01         -0.05        
                    (0.13)        (0.15)       
in_school            1.13 **       0.86        
                    (0.40)        (0.46)       
in_uni              -0.05         -0.48        
                    (0.27)        (0.31)       
CivicDutyScores                    1.25 ***    
                                  (0.06)       
PartyID                            0.26 **     
                                  (0.09)       
-----------------------------------------------
AIC               4003.90       3248.17        
BIC               4079.90       3336.84        
Log Likelihood   -1989.95      -1610.08        
Deviance          3979.90       3220.17        
Num. obs.         4161          4161           
===============================================
*** p < 0.001, ** p < 0.01, * p < 0.05

We have three competing theories. The resource theory, the rational voter theory and sociological theory. We start by looking at the rational voter theory. In model 2 the corresponding variable Influence was statistically significant. However, it is not anymore in the final model. Therefore, the bottom line is that we do not find support for the rational voter theory.

The reason why the variable lost significance may be due to omitted variable bias. Both PartyID and CivicDutyScores are prime candidates for having been omitted in model 2. It seems plausible that people who identify with a political party also feel that they have more influence over politics. Similarly, people with a higher sense of civic duty might also feel that they can actually influence politics.

We illustrate the effect of the variables by estimating the predicted probabilities of voting for a move from the first to the third quartile of the variable.

summary(bes$Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   4.000   5.000   5.322   7.000  13.000 
# Income for 1st quartile: 4
x.low <- setx(z.out, 
              Income = 4, 
              polinfoindex = mean(bes$polinfoindex), 
              Influence = mean(bes$Influence), 
              CivicDutyScores = mean(bes$CivicDutyScores), 
              PartyID = 1, 
              Gender = "Female", 
              Age = mean(bes$Age), 
              edu15 = 1, 
              edu17 = 0, 
              edu18 = 0, 
              edu19plus = 0, 
              in_school = 0, 
              in_uni = 0)

# Income for 3rd quartile: 7
x.high <- setx(z.out, Income = 7, 
               polinfoindex = mean(bes$polinfoindex), 
               Influence = mean(bes$Influence), 
               CivicDutyScores = mean(bes$CivicDutyScores), 
               PartyID = 1, 
               Gender = "Female", 
               Age = mean(bes$Age), 
               edu15 = 1, 
               edu17 = 0, 
               edu18 = 0, 
               edu19plus = 0, 
               in_school = 0, 
               in_uni = 0)

s.out <- sim(z.out, x = x.low, x1 = x.high)

summary(s.out)
 sim x :
 -----
ev
          mean         sd      50%      2.5%     97.5%
[1,] 0.7736383 0.01803416 0.774257 0.7378011 0.8076003
pv
         0     1
[1,] 0.217 0.783

 sim x1 :
 -----
ev
          mean         sd      50%      2.5%     97.5%
[1,] 0.8496579 0.01630346 0.850662 0.8164663 0.8784971
pv
         0     1
[1,] 0.162 0.838
fd
           mean         sd        50%       2.5%      97.5%
[1,] 0.07601957 0.01077462 0.07603654 0.05520903 0.09825411

We predict that a person at the 3rd Income quartile is 8 percentage points more likely to vote than a person at the 1st Income quartile (from 6 percentage points to 10 percentage points). Therefore, the effect of moving up substantially in terms of Income (while remaining average in other respects) does increase the probability of voting but not very much.

Below we do essentially the same for the variable polinfoindex.

summary(bes$polinfoindex)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   4.000   6.000   5.409   7.000   8.000 
# polinfoindex for 1st quartile: 4
x.low <- setx(z.out, 
              Income = mean(bes$Income), 
              polinfoindex = 4, 
              Influence = mean(bes$Influence), 
              CivicDutyScores = mean(bes$CivicDutyScores), 
              PartyID = 1, 
              Gender = "Female", 
              Age = mean(bes$Age), 
              edu15 = 1, 
              edu17 = 0, 
              edu18 = 0, 
              edu19plus = 0, 
              in_school = 0, 
              in_uni = 0)

# polinfoindex for 3rd quartile: 7
x.high <- setx(z.out, 
               Income = mean(bes$Income), 
               polinfoindex = 7, 
               Influence = mean(bes$Influence), 
               CivicDutyScores = mean(bes$CivicDutyScores), 
               PartyID = 1, 
               Gender = "Female", 
               Age = mean(bes$Age), 
               edu15 = 1, 
               edu17 = 0, 
               edu18 = 0, 
               edu19plus = 0, 
               in_school = 0, 
               in_uni = 0)

s.out <- sim(z.out, x = x.low, x1 = x.high)

summary(s.out)
 sim x :
 -----
ev
          mean         sd       50%      2.5%     97.5%
[1,] 0.7781651 0.01847318 0.7782053 0.7412802 0.8125923
pv
         0     1
[1,] 0.213 0.787

 sim x1 :
 -----
ev
          mean         sd       50%      2.5%     97.5%
[1,] 0.8413793 0.01651062 0.8420822 0.8076507 0.8709405
pv
         0     1
[1,] 0.163 0.837
fd
           mean         sd        50%       2.5%      97.5%
[1,] 0.06321417 0.01240853 0.06297236 0.04076795 0.08832803

We predict that our hypothetical average person with high knowledge about politics is 6 percentage points more likely to turnout than the one with low knowledge. We predict the difference to be in the interval of 4 to 9 percentage points. Considering that general knowledge about politics is easily attainable by e.g. watching the news or reading a newspaper now and then, the effect of knowledge is more impressive than that of Income although the nominal effect of Income is larger than that of knowledge. All in all we do find that more resources make voters more likely to turnout.

We illustrate both effects below. First, with Income varying from lowest to highest (all other variables constant)

x.out <- setx(z.out, 
              Income = 1:13, 
              polinfoindex = mean(bes$polinfoindex), 
              Influence = mean(bes$Influence), 
              CivicDutyScores = mean(bes$CivicDutyScores), 
              PartyID = 1, 
              Gender = "Female", 
              Age = mean(bes$Age), 
              edu15 = 1, 
              edu17 = 0, 
              edu18 = 0, 
              edu19plus = 0, 
              in_school = 0, 
              in_uni = 0)

s.out <- sim(z.out, x.out)

ci.plot(s.out, 
        ci = 95, 
        xlab = "Income", 
        ylab = "Predicted probability of Voting",
        main = "Effect of Income on turnout")

Next, we set polinfoindex from lowest to highest (all other variables constant)

x.out <- setx(z.out, 
              Income = mean(bes$Income), 
              polinfoindex = 0:8, 
              Influence = mean(bes$Influence), 
              CivicDutyScores = mean(bes$CivicDutyScores), 
              PartyID = 1, 
              Gender = "Female", 
              Age = mean(bes$Age), 
              edu15 = 1, 
              edu17 = 0,
              edu18 = 0, 
              edu19plus = 0, 
              in_school = 0, 
              in_uni = 0)

s.out <- sim(z.out, x.out)

ci.plot(s.out, 
        ci = 95, 
         xlab = "Knowledge about politics", 
         ylab = "Predicted probability of Voting",
         main = "Effect of knowledge about politics on turnout")

Our final theory is the sociological theory. We proceed by illustrating the effect of civic duty for an otherwise average voter.

To illustrate that, we vary civic duty with all else at appropriate measure of central tendency:

x.out <- setx(z.out, 
              Income = mean(bes$Income), 
              polinfoindex = mean(bes$polinfoindex), 
              Influence = mean(bes$Influence), 
              CivicDutyScores = seq(min(bes$CivicDutyScores), 
                                    max(bes$CivicDutyScores), 
                                    0.01), 
              PartyID = 1, 
              Gender = "Female", 
              Age = mean(bes$Age), 
              edu15 = 1, 
              edu17 = 0, 
              edu18 = 0, 
              edu19plus = 0, 
              in_school = 0, 
              in_uni = 0)

s.out <- sim(z.out, x.out)

ci.plot(s.out, 
        ci = 95,
        ylab = "Predicted probability of Voting",
        main = "Effect of civic duty on turnout")

As we can see the effect of civic duty is extremely large. Remember, all other variables are held at their means or modes. So, we are talking about a voter who is average other than her sense of duty. At extremely low values of civic duty, voters are very unlikely to turnout while at very large values they are very likely to turnout. Uncertainty levels are very small. The effect of CivicDutyScores dwarfs the other effects. Overall, the sociological theory receives overwhelming support in terms of uncertainty (very small) and substantial effect (very large).