Introduction to Quantitative Methods

9. Binary models: Logit

9.1 Seminar

Setting a Working Directory

Before you begin, make sure to set your working directory to a folder where your course related files such as R scripts and datasets are kept. We recommend that you create a PUBLG100 folder for all your work. Create this folder on N: drive if you're using a UCL computer or somewhere on your local disk if you're using a personal laptop.

Once the folder is created, use the setwd() function in R to set your working directory.

Recommended Folder Location R Function
UCL Computers N: Drive setwd("N:/PUBLG100")
Personal Laptop (Windows) C: Drive setwd("C:/PUBLG100")
Personal Laptop (Mac) Home Folder setwd("~/PUBLG100")

After you've set the working directory, verify it by calling the getwd() function.

getwd()

Now download the R script for this seminar from the "Download .R Script" button above, and save it to your PUBLG100 folder.

If you're using a UCL computer, please make sure you're running R version 3.3.1 or newer. Some of the seminar tasks and exercises will not work with older versions of R.
Let's start by loading the required packages:

library(foreign) 
library(Zelig) 
library(texreg) 
library(lmtest)
library(dplyr)

Clear the environment

rm(list = ls())

Loading Data

We use a subset of the 2005 face-to-face British post election study to explain turnout.

Disclaimer: We use a slightly modified version of the data set. For your own research please visit the British Election Study website and download the original dataset.

Download British Election Study Dataset Codebook

You can download the codebook from the link above, but most of the variable names are self explanatory. The ones that need some explanation are listed below:

Variable Description Range
Turnout Turnout at 2005 Brit election No (0); Yes (1)
Gender R's gender 1 (male); 0 (female)
LeftRightSelf R's self placement on left-right 1 (left) - 11 (right)
CivicDutyIndex Sum of scores on CivicDuty1-7 high values mean high civic duty
polinfoindex Correctly answered knowledge questions 0 (low) - 8 (high)
edu* yrs of education binary
in.school R still in school binary
in.uni R attends university binary
bes <- read.dta("bes.dta")

Let's convert the Gender variable to a factor to help us with the interpretation of the results.

bes$Gender <- factor(bes$Gender, levels = c(0, 1), labels = c("Female", "Male"))

Now take a look at the first few observations to see what the dataset looks like

head(bes)
  cs_id Turnout Vote2001 Income Age Gender PartyID Influence Attention
1     1       0        1      4  76 Female       1         1         8
2     2       1        1      5  32   Male       0         3         8
3     3       1       NA     NA  NA   <NA>      NA        NA        NA
4     4       0        1      1  35 Female       0         1         1
5     5       1        1      7  56   Male       0         1         9
6     6       1        1      4  76 Female       1         4         8
  Telephone LeftrightSelf CivicDutyIndex polinfoindex edu15 edu16 edu17
1         1             7             20            7     1     0     0
2         1             6             15            5     0     1     0
3        NA            NA             NA           NA    NA    NA    NA
4         0             5             26            1     0     1     0
5         1             9             16            7     0     0     1
6         1             8             16            4     0     1     0
  edu18 edu19plus in_school in_uni CivicDutyScores
1     0         0         0      0      -0.6331136
2     0         0         0      0       1.4794579
3    NA        NA        NA     NA              NA
4     0         0         0      0      -2.1466281
5     0         0         0      0       1.0324940
6     0         0         0      0       0.3658024

We have a number of missing values. Let's remove them from the dataset but we want to make sure we only remove observations when the variables we are interested in are missing. We'll follow the same procedure we used in week 5 for removing NA's.

While this method might seem tedious, it ensures that we're not dropping observations unnecessarily.

bes <- filter(bes, 
              !is.na(Turnout), 
              !is.na(Income), 
              !is.na(polinfoindex), 
              !is.na(Gender), 
              !is.na(edu15), 
              !is.na(edu17), 
              !is.na(edu18), 
              !is.na(edu19plus), 
              !is.na(in_school), 
              !is.na(in_uni))

Regression with a Binary Dependent Variable

We use the generalized linear model function glm() to estimate a logistic regression. The syntax is very similar to other regression functions we're already familiar with, for example lm() and plm(). The glm() function can be used to estimate many different models. We tell glm() that we've binary dependent variable and we want to use the cumulative logistic link function using the family = binomial(link = "logit") argument:

model1 <- glm(Turnout ~ Income + polinfoindex + Gender + 
                edu15 + edu17 + edu18 + edu19plus + in_school + in_uni, 
              family = binomial(link = "logit"),
              data = bes)

screenreg(model1)
============================
                Model 1     
----------------------------
(Intercept)        -1.14 ***
                   (0.15)   
Income              0.03    
                   (0.02)   
polinfoindex        0.38 ***
                   (0.02)   
GenderMale         -0.35 ***
                   (0.08)   
edu15               0.38 ***
                   (0.10)   
edu17               0.46 ** 
                   (0.15)   
edu18               0.11    
                   (0.14)   
edu19plus           0.24 *  
                   (0.12)   
in_school           0.15    
                   (0.39)   
in_uni             -0.72 ** 
                   (0.25)   
----------------------------
AIC              4401.20    
BIC              4464.53    
Log Likelihood  -2190.60    
Deviance         4381.20    
Num. obs.        4161       
============================
*** p < 0.001, ** p < 0.01, * p < 0.05

Predicted Probabilities and Predictive Power

To assess the predictive power of our model we will check the percentage of cases that it correctly predicts. Let's look the turnout data first. According to the codebook, 0 means the person did not vote, and 1 means they voted.

table(bes$Turnout)
   0    1 
1079 3082 

If we look at the mean of Turnout we will see that is 0.74. That means 74.07% of the respondents said that they turned out to vote. If you predict for every respondent that they voted, you will be right for 74.07% of the people. That is the naive guess and the benchmark for our model. If we predict more than 74.07% of cases correctly our model adds value.

Below we will estimate predicted probabilities for each observation. That is, the probability our model assigns that a respondent will turn out to vote for every person in our data. To do so we use the predict() function with the following the usage syntax:

predict(model, type = "response")

Type help(predict.glm) for information on all arguments.

predicted_probs <- predict(model1, type = "response")

Now that we have assigned probabilities to people turning out to vote, we have to translate those into outcomes. Turnout is binary. A straightforward way would be to say: we predict that all people with a predicted probability above 50% vote and all with predicted probabilities below or equal to 50% abstain.

Using the 50% threshold, we create a new variable expected_values that is 1 if our predicted probability is larger than 0.5 and 0 otherwise.

expected <- as.numeric(predicted_probs > 0.5)

NOTE: The condition predicted_probs > 0.5 gives us FASLE/TRUE, but since our dataset uses 0/1 for the Turnout variable, we convert it to a numeric value using as.numeric() function.

All we have to do now is to compare our expected values of turnout against the actually observed values of turnout. We want our predictions to coincide with the actually observed outcomes. But we need to be aware of two types of errors we can make:

- A type I error is a false positive. It is like crying wolf when there is nothing there (we mistakenly reject a true null hypothesis).

- A type II error is a false negative. It is like not noticing the wolf and finding all your sheep dead the next morning (we fail to reject a false null hypothesis).

The threshold that we set is directly related to the proportion of type I to type II errors. Increasing the threshold will reduce the number of false positives but increase the number of false negatives and vice versa. The default choice is a threshold of 0.5. However, you may imagine situations where one type of error is more costly than the other. For example, given the cost of civil wars, a model predicting civil war may focus on reducing false negatives at the cost of a larger fraction of false positives.

We proceed by producing a table of predictions against actual outcomes. With that we will calculate the percentage of correctly predicted cases and compare that to the naive guess. We have the actually observed cases in our dependent variable (Turnout). The table is just a frequency table. The percentage of correctly predicted cases is simply the sum of correctly predicted cases over the number of cases.

observed <- bes$Turnout

outcome <- table(observed,expected)
outcome
        expected
observed    0    1
       0  160  919
       1  108 2974

Now let's find out how many we correctly predicted and how many we got wrong. You can manually add the numbers if you want, but it's simple enough to take the values out of the 2x2 table:

  • Correct negatives are in row 1, column 1
  • Correct positives are in row 2, column 2
  • All others are incorrect
(outcome[1,1] + outcome[2,2]) / sum(outcome)
[1] 0.7531843

Now let's remember what the actual turnout was:

mean(bes$Turnout)
[1] 0.7406873

You can see that our model outperforms the naive guess slightly. The more lopsided the distribution of your binary dependent variable, the harder it is to build a successful model.

Joint hypothesis testing

We will add two more explanatory variables to our model: Influence and Age. Influence corresponds to a theory we want to test while Age is a socio-economic control variable.

We want to test two theories: the rational voter model and the resource theory.

  • Influence operationalises a part of the rational voter theory. In that theory citizens are more likely to turnout the more they belive that their vote matters. The variable Influence measures the subjective feeling of being able to influence politics.

  • The variables Income and polinfoindex correspond to a second theory which states that people who have more cognitive and material resources are more likely to participate in politics.

Depending on whether the coefficients corresponding to the respective theories are significant or not, we can say something about whether these theories help to explain turnout.

The remaining variables for gender, education and the added variable Age are socio-economic controls. We added age because previous research has shown that political participation changes with the life cycle.

model2 <- glm(Turnout ~ Income + polinfoindex + Influence + Gender + Age + 
                edu15 + edu17 + edu18 + edu19plus + in_school + in_uni, 
              family = binomial(link = "logit"), data = bes)

screenreg(list(model1, model2))
==========================================
                Model 1       Model 2     
------------------------------------------
(Intercept)        -1.14 ***     -3.90 ***
                   (0.15)        (0.22)   
Income              0.03          0.15 ***
                   (0.02)        (0.02)   
polinfoindex        0.38 ***      0.25 ***
                   (0.02)        (0.02)   
GenderMale         -0.35 ***     -0.36 ***
                   (0.08)        (0.08)   
edu15               0.38 ***     -0.34 ** 
                   (0.10)        (0.11)   
edu17               0.46 **       0.36 *  
                   (0.15)        (0.16)   
edu18               0.11          0.14    
                   (0.14)        (0.15)   
edu19plus           0.24 *        0.01    
                   (0.12)        (0.13)   
in_school           0.15          1.13 ** 
                   (0.39)        (0.40)   
in_uni             -0.72 **      -0.05    
                   (0.25)        (0.27)   
Influence                         0.21 ***
                                 (0.02)   
Age                               0.05 ***
                                 (0.00)   
------------------------------------------
AIC              4401.20       4003.90    
BIC              4464.53       4079.90    
Log Likelihood  -2190.60      -1989.95    
Deviance         4381.20       3979.90    
Num. obs.        4161          4161       
==========================================
*** p < 0.001, ** p < 0.01, * p < 0.05

The variables corresponding to the resource theory are Income and polinfoindex. In our model 2 (the one we interpret), both more income and more interest in politics are related to a higher probability to turn out to vote. That is in line with the theory. Interestingly, income was previously insignificant in model 1. It is quite likely that model 1 suffered from omitted variable bias.

As the rational voter model predicts, a higher subjective probability to cast a decisive vote, which we crudly proxy by the feeling to be able to influence politics, does correspond to a higher probability to vote as indicated by the positive significant effect of our Influence variable.

We will test if model 2 does better at predicting turnout than model 1. We use the likelihood ratio test. This is warranted because we added two variables and corresponds to the F-test in the OLS models estimated before. You can see values for the logged likelihood in the regression table. You cannot in general say whether a value for the likelihood is small or large but you can use it to compare models that are based on the same sample. A larger log-likelihood is better. Looking at the regression table we see that the log-likelihood is larger in model 2 than in model 1.

We use the lrtest() function from the lmtest package to test whether that difference is statistically significant. The syntax is the following:

lrtest(model1, model2)
Likelihood ratio test

Model 1: Turnout ~ Income + polinfoindex + Gender + edu15 + edu17 + edu18 + 
    edu19plus + in_school + in_uni
Model 2: Turnout ~ Income + polinfoindex + Influence + Gender + Age + 
    edu15 + edu17 + edu18 + edu19plus + in_school + in_uni
  #Df  LogLik Df Chisq Pr(>Chisq)    
1  10 -2190.6                        
2  12 -1990.0  2 401.3  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value in the likelihood ratio test is smaller than 0.05. Therefore, the difference between the two log-likelihoods is statistically significant which means that model 2 is better at explaining turnout.

Now let's see if AIC and BIC agree.

AIC(model1, model2)
       df      AIC
model1 10 4401.200
model2 12 4003.901
BIC(model1, model2)
       df      BIC
model1 10 4464.535
model2 12 4079.903

According to both AIC and BIC, model 2 has a better fit.

Interpretation with Zelig

Since we cannot directly estimate the marginal effects of the independent variables from a logistic regression model, we use Zelig to help us with the interpretation. We pick meaningful scenarios and predict the probability that a person will vote based on the scenarios. What are meaningful scenarios? A meaningful scenario is a set of covariate values that corresponds to some case that is interesting in reality. For example, you may want to compare a women with 18 years of education to a man with 18 years of education while keeping all other variables at their means, medians or modes.

First, we re-estimate our previous model using Zelig.

NOTE: Since we're estimating a logistic regression model, we use model = "logit" instead of model = "ls" like we've done in the past.

z.out <- zelig(Turnout ~ Income + polinfoindex + Influence + Gender + Age + 
                 edu15 + edu17 + edu18 + edu19plus + in_school + in_uni, 
               model = "logit", 
               data = bes)
How to cite this model in Zelig:
  R Core Team. 2007.
  logit: Logistic Regression for Dichotomous Dependent Variables
  in Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
  "Zelig: Everyone's Statistical Software," http://zeligproject.org/

In order to compare the average women with 18 years of education to the average man with 18 years of education, Gender to either Male or Female, edu18 to 1 and all other binary variables to their modes, ordinally scaled variables to their medians and interval scaled variables to their means.

x.male <- setx(z.out, 
               Gender = "Male", edu18 = 1, 
               Age = mean(bes$Age),
               Income = mean(bes$Income), 
               polinfoindex = mean(bes$polinfoindex), 
               Influence = mean(bes$Influence),  
               edu15 = 0, edu17 = 0, edu19plus = 0, in_school = 0, in_uni = 0)

x.female <- setx(z.out, Gender = "Female", edu18 = 1, 
                 Age = mean(bes$Age),
                 Income = mean(bes$Income), 
                 polinfoindex = mean(bes$polinfoindex), 
                 Influence = mean(bes$Influence),  
                 edu15 = 0, edu17 = 0, edu19plus = 0, in_school = 0, in_uni = 0)

s.out <- sim(z.out, x.male, x1 = x.female)
summary(s.out)
 sim x :
 -----
ev
          mean        sd       50%      2.5%     97.5%
[1,] 0.7847456 0.0253645 0.7853111 0.7335729 0.8319658
pv
         0     1
[1,] 0.223 0.777

 sim x1 :
 -----
ev
          mean         sd       50%      2.5%     97.5%
[1,] 0.8381426 0.01955437 0.8391013 0.7953938 0.8727184
pv
         0     1
[1,] 0.155 0.845
fd
           mean         sd        50%      2.5%      97.5%
[1,] 0.05339698 0.01384552 0.05254808 0.0286439 0.08187463

If you're not familiar with how to interpret Zelig simulation summary, then click at the link below for an annotated example:

Interpreting Zelig Simulation

For the average woman (in our sample) with 18 years of education, we predict a probability of 84% that she will vote. Our uncertainty is plus and minus 4% on the 95% confidence level. Note that we've rounded up the numbers just to simplify the interpretation.

For her male counterpart, we predict the probability of turnout to be 78% (plus and minus 5% based on the 95% confidence interval). Is the difference between them statistically significant or put differently is our hypothetical woman more likely to vote than our hypothetical man?

To check that we look at the first difference. The difference is 5% percentage points on average and between 3 percentage points and 8 percentage points based on the 95% confidence interval.

Our next step will be to compare two groups like men and women while varying another continuous variable from lowest to highest. We use income here. So, we set a sequence for income, vary gender and keep every other variable constant at its appropriate measure of central tendency.

Let's see what the range of Income variable is:

range(bes$Income)
[1]  1 13

Income ranges from 1 to 13 in our sample so we can just say Income = 1:13 or use the seq() function as shown below:

x.male <- setx(z.out, 
               Gender = "Male", edu18 = 1, 
               Age = mean(bes$Age),
               Income = seq(1, 13), 
               polinfoindex = mean(bes$polinfoindex), 
               Influence = mean(bes$Influence),  
               edu15 = 0, edu17 = 0, edu19plus = 0, in_school = 0, in_uni = 0)

x.female <- setx(z.out, Gender = "Female", edu18 = 1, 
                 Age = mean(bes$Age),
                 Income = seq(1, 13), 
                 polinfoindex = mean(bes$polinfoindex), 
                 Influence = mean(bes$Influence),  
                 edu15 = 0, edu17 = 0, edu19plus = 0, in_school = 0, in_uni = 0)

s.out <- sim(z.out, x.male, x1 = x.female)

Now we can plot the confidence interval for the two groups:

ci.plot(s.out, 
        ci = 95,
        xlab = "Income", 
        ylab = "Predicted Probability of Voting",
        main = "Effect of Income by Gender")

# add labels manually
# NOTE: the x and y values below are just coordinates on the plot, you can change them based on 
#       where you want the labels placed
text(x = 3, y = .85, labels = "Women")  
text(x = 6, y = .75, labels = "Men")

Dplyr summarise()

What is the average turnout for men and women? We've seen the group_by() function before, but now we combine it with summarise() to generate summary statistics.

First we group by the variable Gender and then we calculate mean and standard deviation of voter turnout.

bes %>%
  group_by(Gender) %>%
  summarise(avg_turnout = mean(Turnout), sd_turnout = sd(Turnout))
# A tibble: 2 × 3
  Gender avg_turnout sd_turnout
  <fctr>       <dbl>      <dbl>
1 Female   0.7465517  0.4350791
2   Male   0.7332971  0.4423559

Saving a Plot

Saving a plot in RStudio is easy. In the plot window, click on Export and choose "Save as Image".

Exercises

Theory Mechanism Corresponding Variables
Resource Theory More resources reduce transaction costs for political participation. People who possess more knowledge about the political system and who are wealthier will be more likely to vote. Income and polinfoindex
Rational Voter Voters are motivated by three factors: costs of voting, benefits of voting and the probability of casting the decisive vote. The first decreases the probability of voting, the latter two increase the probability of voting. Influence as the probability of casting the decisive vote.
Sociological Theory People with more social capital are well integrated in civil society and feel a greater sense of duty to contribute to society while at the same time having a sense that they can shape society. CivicDutyScores
  1. Estimate a model similar to model2 in the seminar and add Civic Duty (CivicDutyScores) and party identification (PartyID). Show and discuss descriptive statistics of the variables in a meaningful way. E.g.: what does a unit change in a variable mean, are variables comparable. Do this for all variables but concentrate your discussion on the interesting ones.

  2. Do joint hypothesis testing and explain your results. You have added two variables, so the comparison should be to the model without those two new variables.

  3. What is the model's predictive power? How well does it do compared to the naive guess and model2 that we estimated in class?

  4. How many false positives and how many false negatives to we get in our full final model?

  5. Choose an average voter who does not identify with a party and show how likely that voter is to vote. Compare that to a voter who identifies with a political party. Interpret your results.

  6. Thoroughly interpret the full final model in the light of all three theories in substantial terms. Use at least one plot to illustrate your results. Export that plot to a file and put it in word document. If you cannot do this, seek help during office hours!