Introduction to Quantitative Methods

4. Bivariate linear regression models

4.1 Seminar

Setting a Working Directory

Before you begin, make sure to set your working directory to a folder where your course-related files, such as R scripts and datasets, are kept. We recommend that you create a PUBLG100 folder for all your work. Create this folder on the N: drive if you're using a UCL computer, or somewhere on your local disk if you're using a personal laptop.

Once the folder is created, use the setwd() function in R to set your working directory.

Computer                    Recommended Folder Location   R Function
UCL Computers               N: Drive                      setwd("N:/PUBLG100")
Personal Laptop (Windows)   C: Drive                      setwd("C:/PUBLG100")
Personal Laptop (Mac)       Home Folder                   setwd("~/PUBLG100")

After you've set the working directory, verify it by calling the getwd() function.

getwd()

Now download the R script for this seminar from the "Download .R Script" button above, and save it to your PUBLG100 folder.

If you're using a UCL computer, please make sure you're running R version 3.2.0. Some of the seminar tasks and exercises will not work with older versions of R. Click here for help on how to start the new version of R on UCL computers.
Clear the environment as we always do.

# clear environment
rm(list = ls())

Required Packages

We're introducing two new packages in this week's seminar. The Zelig package provides functionality that simplifies some regression tasks, while texreg makes it easy to produce publication-quality output from our regression models. We'll discuss these packages in more detail as we go along, but for now let's just install them with the install.packages() function.

NOTE: Make sure you install Zelig exactly as shown below.

install.packages("https://cran.r-project.org/src/contrib/Archive/Zelig/Zelig_4.2-1.tar.gz", 
                 repos=NULL,
                 type="source")
install.packages("texreg")

Now, let's load Zelig, texreg and the dplyr package that we used last week.

NOTE: We installed dplyr last week, but if you don't have it installed for some reason then make sure to install it as well.
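
If you're not sure whether dplyr is installed, you can check first and install only if needed. Here's a small sketch using base R's requireNamespace():

# install dplyr only if it isn't already available
if (!requireNamespace("dplyr", quietly = TRUE)) {
    install.packages("dplyr")
}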

library(texreg)
library(Zelig)
library(dplyr)

Merging Datasets

Recall from last week that the Quality of Government (QoG) Institute data we worked with had all the variables of interest available in a single dataset. The QoG Institute compiles data from various public sources and provides them as a single comparable dataset that's convenient to work with. Often, however, the variables we're interested in are not available from a single source and must be merged together to create a single dataset.

The dataset we're working with today comes from the 1990 US Census and provides demographic and socio-economic data in multiple files. It includes observations from 1,994 communities, with each observation identified by the state and communityname variables.

Let's load the datasets into R using read.csv() and confirm that the state and communityname variables are indeed present in each dataset.

# load the communities datasets
communities <- read.csv("http://uclspp.github.io/PUBLG100/data/communities.csv")
communities_employment <- read.csv("http://uclspp.github.io/PUBLG100/data/communities_employment.csv")

Now let's see what the dataset looks like. We normally use the names() function to see what the variables are called, but we would also like to know what type of data is stored in each variable. For that we can use the str() function, which gives us a bit more information about a dataset than names() does.

names(communities)
 [1] "state"           "county"          "community"      
 [4] "communityname"   "fold"            "population"     
 [7] "householdsize"   "racepctblack"    "racePctWhite"   
[10] "racePctAsian"    "racePctHisp"     "agePct12t21"    
[13] "agePct12t29"     "agePct16t24"     "agePct65up"     
[16] "numbUrban"       "pctUrban"        "medIncome"      
[19] "pctWWage"        "pctWFarmSelf"    "pctWInvInc"     
[22] "pctWSocSec"      "pctWPubAsst"     "pctWRetire"     
[25] "medFamInc"       "perCapInc"       "whitePerCap"    
[28] "blackPerCap"     "indianPerCap"    "AsianPerCap"    
[31] "OtherPerCap"     "HispPerCap"      "NumUnderPov"    
[34] "PctPopUnderPov"  "PctLess9thGrade" "PctNotHSGrad"   
[37] "PctBSorMore"    
str(communities)
'data.frame':   1994 obs. of  37 variables:
 $ state          : int  8 53 24 34 42 6 44 6 21 29 ...
 $ county         : int  NA NA NA 5 95 NA 7 NA NA NA ...
 $ community      : int  NA NA NA 81440 6096 NA 41500 NA NA NA ...
 $ communityname  : Factor w/ 1828 levels "Aberdeencity",..: 796 1626 2 1788 142 1520 840 1462 669 288 ...
 $ fold           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ population     : num  0.19 0 0 0.04 0.01 0.02 0.01 0.01 0.03 0.01 ...
 $ householdsize  : num  0.33 0.16 0.42 0.77 0.55 0.28 0.39 0.74 0.34 0.4 ...
 $ racepctblack   : num  0.02 0.12 0.49 1 0.02 0.06 0 0.03 0.2 0.06 ...
 $ racePctWhite   : num  0.9 0.74 0.56 0.08 0.95 0.54 0.98 0.46 0.84 0.87 ...
 $ racePctAsian   : num  0.12 0.45 0.17 0.12 0.09 1 0.06 0.2 0.02 0.3 ...
 $ racePctHisp    : num  0.17 0.07 0.04 0.1 0.05 0.25 0.02 1 0 0.03 ...
 $ agePct12t21    : num  0.34 0.26 0.39 0.51 0.38 0.31 0.3 0.52 0.38 0.9 ...
 $ agePct12t29    : num  0.47 0.59 0.47 0.5 0.38 0.48 0.37 0.55 0.45 0.82 ...
 $ agePct16t24    : num  0.29 0.35 0.28 0.34 0.23 0.27 0.23 0.36 0.28 0.8 ...
 $ agePct65up     : num  0.32 0.27 0.32 0.21 0.36 0.37 0.6 0.35 0.48 0.39 ...
 $ numbUrban      : num  0.2 0.02 0 0.06 0.02 0.04 0.02 0 0.04 0.02 ...
 $ pctUrban       : num  1 1 0 1 0.9 1 0.81 0 1 1 ...
 $ medIncome      : num  0.37 0.31 0.3 0.58 0.5 0.52 0.42 0.16 0.17 0.54 ...
 $ pctWWage       : num  0.72 0.72 0.58 0.89 0.72 0.68 0.5 0.44 0.47 0.59 ...
 $ pctWFarmSelf   : num  0.34 0.11 0.19 0.21 0.16 0.2 0.23 1 0.36 0.22 ...
 $ pctWInvInc     : num  0.6 0.45 0.39 0.43 0.68 0.61 0.68 0.23 0.34 0.86 ...
 $ pctWSocSec     : num  0.29 0.25 0.38 0.36 0.44 0.28 0.61 0.53 0.55 0.42 ...
 $ pctWPubAsst    : num  0.15 0.29 0.4 0.2 0.11 0.15 0.21 0.97 0.48 0.02 ...
 $ pctWRetire     : num  0.43 0.39 0.84 0.82 0.71 0.25 0.54 0.41 0.43 0.31 ...
 $ medFamInc      : num  0.39 0.29 0.28 0.51 0.46 0.62 0.43 0.15 0.21 0.85 ...
 $ perCapInc      : num  0.4 0.37 0.27 0.36 0.43 0.72 0.47 0.1 0.23 0.89 ...
 $ whitePerCap    : num  0.39 0.38 0.29 0.4 0.41 0.76 0.44 0.12 0.23 0.94 ...
 $ blackPerCap    : num  0.32 0.33 0.27 0.39 0.28 0.77 0.4 0.08 0.19 0.11 ...
 $ indianPerCap   : num  0.27 0.16 0.07 0.16 0 0.28 0.24 0.17 0.1 0.09 ...
 $ AsianPerCap    : num  0.27 0.3 0.29 0.25 0.74 0.52 0.86 0.27 0.26 0.33 ...
 $ OtherPerCap    : num  0.36 0.22 0.28 0.36 0.51 0.48 0.24 0.18 0.29 0.17 ...
 $ HispPerCap     : num  0.41 0.35 0.39 0.44 0.48 0.6 0.36 0.21 0.22 0.8 ...
 $ NumUnderPov    : num  0.08 0.01 0.01 0.01 0 0.01 0.01 0.03 0.04 0 ...
 $ PctPopUnderPov : num  0.19 0.24 0.27 0.1 0.06 0.12 0.11 0.64 0.45 0.11 ...
 $ PctLess9thGrade: num  0.1 0.14 0.27 0.09 0.25 0.13 0.29 0.96 0.52 0.04 ...
 $ PctNotHSGrad   : num  0.18 0.24 0.43 0.25 0.3 0.12 0.41 0.82 0.59 0.03 ...
 $ PctBSorMore    : num  0.48 0.3 0.19 0.31 0.33 0.8 0.36 0.12 0.17 1 ...

It looks like state is a numeric code and communityname is the name of the community. There are also county and community variables, but they contain some missing values (NAs), so we won't use them. Notice how read.csv() automatically converted communityname to a factor variable when loading the dataset.
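
We can verify those missing values with a quick count; a small check using base R's is.na():

# count missing values in the county and community variables
sum(is.na(communities$county))
sum(is.na(communities$community))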

Now, let's do the same for the employment dataset.

str(communities_employment)
'data.frame':   1994 obs. of  8 variables:
 $ state           : int  8 53 24 34 42 6 44 6 21 29 ...
 $ communityname   : Factor w/ 1828 levels "Aberdeencity",..: 796 1626 2 1788 142 1520 840 1462 669 288 ...
 $ PctUnemployed   : num  0.27 0.27 0.36 0.33 0.12 0.1 0.28 1 0.55 0.11 ...
 $ PctEmploy       : num  0.68 0.73 0.58 0.71 0.65 0.65 0.54 0.26 0.43 0.44 ...
 $ PctEmplManu     : num  0.23 0.57 0.32 0.36 0.67 0.19 0.44 0.43 0.59 0.2 ...
 $ PctEmplProfServ : num  0.41 0.15 0.29 0.45 0.38 0.77 0.53 0.34 0.36 1 ...
 $ PctOccupManu    : num  0.25 0.42 0.49 0.37 0.42 0.06 0.33 0.71 0.64 0.02 ...
 $ PctOccupMgmtProf: num  0.52 0.36 0.32 0.39 0.46 0.91 0.49 0.18 0.29 0.96 ...

It seems that state and communityname are common to both datasets, so we can use them for the merge.

Now let's use the merge() function to merge these two datasets together. There are three arguments that we need to provide to the merge() function so it knows what we are trying to merge and how.

merge(x, y, by)
Argument   Description
x          The first dataset to merge. This is the communities dataset in our case.
y          The second dataset to merge. This is the communities_employment dataset.
by         The name of the column or columns that are common to both datasets. We
           know that state and communityname are common to both, so we'll pass
           them as a vector combining the two names.

For more information on how the merge() function works, type help(merge) in R.
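
One detail worth knowing from help(merge): if the key columns had different names in the two datasets, we would use the by.x and by.y arguments instead of by. A hypothetical sketch with made-up data frames:

# hypothetical example: the key column is named differently in each data frame
df1 <- data.frame(state_code = c(8, 53), x = c(1, 2))
df2 <- data.frame(state = c(8, 53), y = c(3, 4))
merged <- merge(df1, df2, by.x = "state_code", by.y = "state")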

# merge the two datasets
communities <- merge(communities, communities_employment, by = c("state", "communityname"))

# explore dataset
names(communities)
 [1] "state"            "communityname"    "county"          
 [4] "community"        "fold"             "population"      
 [7] "householdsize"    "racepctblack"     "racePctWhite"    
[10] "racePctAsian"     "racePctHisp"      "agePct12t21"     
[13] "agePct12t29"      "agePct16t24"      "agePct65up"      
[16] "numbUrban"        "pctUrban"         "medIncome"       
[19] "pctWWage"         "pctWFarmSelf"     "pctWInvInc"      
[22] "pctWSocSec"       "pctWPubAsst"      "pctWRetire"      
[25] "medFamInc"        "perCapInc"        "whitePerCap"     
[28] "blackPerCap"      "indianPerCap"     "AsianPerCap"     
[31] "OtherPerCap"      "HispPerCap"       "NumUnderPov"     
[34] "PctPopUnderPov"   "PctLess9thGrade"  "PctNotHSGrad"    
[37] "PctBSorMore"      "PctUnemployed"    "PctEmploy"       
[40] "PctEmplManu"      "PctEmplProfServ"  "PctOccupManu"    
[43] "PctOccupMgmtProf"
View(communities)

The dplyr select() Function

Since our dataset has more columns than we need, let's select only the ones we're interested in and rename them with meaningful names. One approach would be to use either the subset() function or the square bracket [ ] extraction operator for selecting the columns we're interested in (see the comparison sketch after the next code block). But the easiest way to accomplish this is with dplyr's select() function, which allows us to select the columns we need and rename them at the same time.

communities <- select(communities, 
                      state, 
                      Community = communityname, 
                      UnemploymentRate = PctUnemployed, 
                      NoHighSchool = PctNotHSGrad,
                      White = racePctWhite)
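
For comparison, here is roughly what the same column selection would look like with the bracket operator. It's left commented out because it keeps the original column names (renaming would need a separate step) and, after the select() call above, those columns no longer exist:

# for comparison only (would need to run instead of the select() call above):
# communities[, c("state", "communityname", "PctUnemployed", "PctNotHSGrad", "racePctWhite")]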

Now that we've merged the dataset and renamed the columns the way we want, let's try to visualize the data with a simple plot command.

Is there a correlation between unemployment rate and lack of high school level education?

plot(communities$NoHighSchool, communities$UnemploymentRate,
     xlab = "Adults without High School education (%)",
     ylab = "Unemployment Rate")
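
The scatterplot suggests a positive relationship. We can also put a number on it with the correlation coefficient:

# correlation between lack of high school education and unemployment
cor(communities$NoHighSchool, communities$UnemploymentRate)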

In order to answer that question empirically, we will run a linear regression using the lm() function in R. The lm() function needs to know the relationship we're trying to model and the dataset that holds our observations. The two arguments we need to provide to lm() are described below.

Argument   Description
formula    The formula describes the relationship between the dependent and
           independent variables, for example
           dependent.variable ~ independent.variable. Recall from last week that
           we used something very similar with the t-test when comparing the
           difference of means. In our case, we'd like to model the relationship
           using the formula: UnemploymentRate ~ NoHighSchool
data       The name of the dataset that contains the variables of interest. In
           our case, this is the merged dataset called communities.

For more information on how the lm() function works, type help(lm) in R.

model1 <- lm(UnemploymentRate ~ NoHighSchool, data = communities)

The lm() function has modeled the relationship between UnemploymentRate and NoHighSchool and we've saved it in a variable called model1. Let's use the summary() function to see what this linear model looks like.

summary(model1)
Call:
lm(formula = UnemploymentRate ~ NoHighSchool, data = communities)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42347 -0.08499 -0.01189  0.07711  0.56470 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.078952   0.006483   12.18   <2e-16 ***
NoHighSchool 0.742385   0.014955   49.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1352 on 1992 degrees of freedom
Multiple R-squared:  0.553, Adjusted R-squared:  0.5527 
F-statistic:  2464 on 1 and 1992 DF,  p-value: < 2.2e-16

Interpreting Regression Output

The output from lm() might seem overwhelming at first so let's break it down one item at a time.

  1. The dependent variable, also sometimes called the outcome variable. We are
     trying to model the effects of NoHighSchool on UnemploymentRate, so
     UnemploymentRate is the dependent variable.
  2. The independent variable, or predictor variable. In our example,
     NoHighSchool is the independent variable.
  3. The differences between the observed values and the predicted values,
     called residuals.
  4. The coefficients for the intercept and the independent variables. Using the
     coefficients we can write down the relationship between the dependent and
     the independent variables as:

     UnemploymentRate = 0.078952 + ( 0.7423853 * NoHighSchool )

     This tells us that for each unit increase in the variable NoHighSchool,
     UnemploymentRate increases by 0.7423853. (See the code sketch after this
     list for how to extract these quantities directly.)
  5. The p-value. Recall that under the null hypothesis the coefficient of
     interest is zero. The p-value tells us whether we can reject that null
     hypothesis.
  6. The standard error, which estimates the standard deviation of the
     coefficients in our model. We can think of the standard error as a measure
     of precision for the estimated coefficients.
  7. The t statistic, obtained by dividing the coefficient by its standard
     error. For example, 0.742385 / 0.014955 ≈ 49.64, matching the output above.
  8. The R-squared and adjusted R-squared, which tell us how much of the
     variance in the dependent variable is accounted for by the model. The
     adjusted R-squared is always smaller than the R-squared, as it takes into
     account the number of independent variables and the degrees of freedom.
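
All of these quantities can also be pulled directly from the fitted model object. Here's a short sketch using base R accessor functions:

# extract the estimated coefficients (intercept and slope)
coef(model1)

# look at the first few residuals and fitted values
head(residuals(model1))
head(fitted(model1))

# pull the R-squared from the model summary
summary(model1)$r.squared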

Now let's plot the regression line with our observations using the abline() function.

plot(communities$NoHighSchool, communities$UnemploymentRate,
     xlab = "Adults without High School education (%)",
     ylab = "Unemployment Rate")
abline(model1, col = "red")

While the summary() function provides a slew of information about a fitted regression model, we often need to present our findings in easy-to-read tables similar to those you see in journal publications. The texreg package we installed earlier allows us to do just that.

Let's take a look at how to display the output of a regression model on the screen using the screenreg() function from texreg.

screenreg(model1)
=========================
              Model 1    
-------------------------
(Intercept)      0.08 ***
                (0.01)   
NoHighSchool     0.74 ***
                (0.01)   
-------------------------
R^2              0.55    
Adj. R^2         0.55    
Num. obs.     1994       
RMSE             0.14    
=========================
*** p < 0.001, ** p < 0.01, * p < 0.05

Returning to our example, are there other variables that might explain unemployment rates in our communities dataset? For example, is unemployment rate higher or lower in communities with different levels of minority population?

We first create a new variable called Minority by subtracting the percentage of the White population from 1. Alternatively, we could have added up the percentages of the Black, Hispanic and Asian populations, since our census data also has those variables (see the sketch after the next code block).

communities$Minority <- 1 - communities$White
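
Here is what that alternative would look like; it's commented out because racepctblack, racePctHisp and racePctAsian were dropped by our earlier select() call, so this line would have to run before that step:

# alternative sketch (run before the select() call above):
# communities$Minority <- communities$racepctblack + communities$racePctHisp + communities$racePctAsian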

Next we fit a linear model using Minority as the independent variable.

model2 <- lm(UnemploymentRate ~ Minority, data = communities)
summary(model2)
Call:
lm(formula = UnemploymentRate ~ Minority, data = communities)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45521 -0.12189 -0.02369  0.10162  0.68203 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.257948   0.005506   46.85   <2e-16 ***
Minority    0.428702   0.015883   26.99   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.173 on 1992 degrees of freedom
Multiple R-squared:  0.2678,    Adjusted R-squared:  0.2674 
F-statistic: 728.5 on 1 and 1992 DF,  p-value: < 2.2e-16

Now let's see how this model compares to our first model. We can show the regression line from model2 just like we did with our first model.

plot(communities$Minority, communities$UnemploymentRate,
     xlab = "Minority population (%)",
     ylab = "Unemployment Rate")
abline(model2, col = "blue")

Does model2 offer a better fit than model1? Maybe we can answer that question by looking at the regression tables instead. Let's print the two models side-by-side in a single table with screenreg().

screenreg(list(model1, model2))
======================================
              Model 1      Model 2    
--------------------------------------
(Intercept)      0.08 ***     0.26 ***
                (0.01)       (0.01)   
NoHighSchool     0.74 ***             
                (0.01)                
Minority                      0.43 ***
                             (0.02)   
--------------------------------------
R^2              0.55         0.27    
Adj. R^2         0.55         0.27    
Num. obs.     1994         1994       
RMSE             0.14         0.17    
======================================
*** p < 0.001, ** p < 0.01, * p < 0.05
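
If you want to compare the fit statistics programmatically rather than reading them off the table, you can pull them straight from the model summaries:

# compare the R-squared of the two models directly
summary(model1)$r.squared
summary(model2)$r.squared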

Finally, let's save the same output as a Microsoft Word document using htmlreg().

htmlreg(list(model1, model2), file = "lab4_model_comparison.doc")
The table was written to the file 'lab4_model_comparison.doc'.

Predictions and Confidence Intervals

We will use the Zelig package since it provides convenient functions for making predictions and plotting confidence intervals. First, let's go back to our last model, where we used Minority as the independent variable, but this time estimate the model using the Zelig package. The arguments for the zelig() function are very similar to what we used with lm(). The only difference is that we need to tell Zelig what type of model we're trying to estimate.

Argument   Description
formula    The formula describes the relationship between the dependent and
           independent variables, for example
           dependent.variable ~ independent.variable.
data       The name of the dataset that contains the variables of interest.
model      The type of model to estimate. Since we're interested in a least
           squares regression model, we use the model = "ls" argument.

z.out <- zelig(UnemploymentRate ~ Minority, data = communities, model = "ls")
 How to cite this model in Zelig:
  Kosuke Imai, Gary King, and Olivia Lau. 2015.
  "ls: Least Squares Regression for Continuous Dependent Variables"
  in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
  http://gking.harvard.edu/zelig

The model estimated with zelig() is saved in a variable called z.out. We can use the summary() function exactly the same way as we used it with lm() to print out the estimated model.

summary(z.out)
Call:
lm(formula = formula, weights = weights, model = F, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45521 -0.12189 -0.02369  0.10162  0.68203 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.257948   0.005506   46.85   <2e-16 ***
Minority    0.428702   0.015883   26.99   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.173 on 1992 degrees of freedom
Multiple R-squared:  0.2678,    Adjusted R-squared:  0.2674 
F-statistic: 728.5 on 1 and 1992 DF,  p-value: < 2.2e-16

Next, let's use Zelig's setx() and sim() functions for making predictions and plotting confidence intervals.

# set the independent variable Minority to a sequence from 0 to 1 in steps of 0.1
x.out <- setx(z.out, Minority = seq(0, 1, 0.1))

# simulate quantities of interest at each of those values
s.out <- sim(z.out, x = x.out)

summary(s.out)
Expected Values: E(Y|X) 
                    mean          sd       50%      2.5%     97.5%
[Minority=0]   0.2579665 0.005677340 0.2580478 0.2473649 0.2691032
[Minority=0.1] 0.3007337 0.004531902 0.3007323 0.2918661 0.3099384
[Minority=0.2] 0.3435213 0.003877042 0.3434656 0.3361714 0.3508412
[Minority=0.3] 0.3865521 0.004068235 0.3863071 0.3787131 0.3950038
[Minority=0.4] 0.4293833 0.004442889 0.4293484 0.4207419 0.4378098
[Minority=0.5] 0.4721306 0.005599804 0.4720700 0.4614503 0.4828565
[Minority=0.6] 0.5153974 0.006867561 0.5153081 0.5021815 0.5295274
[Minority=0.7] 0.5578146 0.008156760 0.5576684 0.5416097 0.5739449
[Minority=0.8] 0.6002469 0.009260266 0.6001272 0.5817250 0.6185873
[Minority=0.9] 0.6436730 0.011278216 0.6438739 0.6220530 0.6662544
[Minority=1]   0.6866793 0.012432337 0.6863878 0.6627996 0.7108965


Predicted Values: Y|X 
                    mean          sd       50%      2.5%     97.5%
[Minority=0]   0.2579665 0.005677340 0.2580478 0.2473649 0.2691032
[Minority=0.1] 0.3007337 0.004531902 0.3007323 0.2918661 0.3099384
[Minority=0.2] 0.3435213 0.003877042 0.3434656 0.3361714 0.3508412
[Minority=0.3] 0.3865521 0.004068235 0.3863071 0.3787131 0.3950038
[Minority=0.4] 0.4293833 0.004442889 0.4293484 0.4207419 0.4378098
[Minority=0.5] 0.4721306 0.005599804 0.4720700 0.4614503 0.4828565
[Minority=0.6] 0.5153974 0.006867561 0.5153081 0.5021815 0.5295274
[Minority=0.7] 0.5578146 0.008156760 0.5576684 0.5416097 0.5739449
[Minority=0.8] 0.6002469 0.009260266 0.6001272 0.5817250 0.6185873
[Minority=0.9] 0.6436730 0.011278216 0.6438739 0.6220530 0.6662544
[Minority=1]   0.6866793 0.012432337 0.6863878 0.6627996 0.7108965
plot(s.out, qi = "pv", xlab = "Minority population (%)")

Additional Resources

Linear Regression Interactive App

Exercises

  1. Load the immigration dataset (http://uclspp.github.io/PUBLG100/data/communities_immig.csv) and merge it with the communities dataset.
  2. Rename the PctImmigRec5 variable to RecentImmigration5.
  3. Estimate a model explaining the relationship between unemployment rate and recent immigration over the past 5 years using the variable RecentImmigration5.
  4. Plot the regression line of the model.
  5. How does this model compare to the models we estimated in the seminar with NoHighSchool and Minority as independent variables? Present your findings by comparing the output of all three models in side-by-side tables using the texreg package.
  6. Save your model comparison table to a Microsoft Word document (or another format if you don't use Word).
  7. Generate predicted values from the fitted model with RecentImmigration5 and plot the confidence interval using Zelig.
  8. Save all the plots as graphics files.
    • Hint: Look at the Plots pane in RStudio for an option to save your plot. Alternatively, you can use the png() function to save the graphics file directly from your R script (see the sketch below).
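
For example, a minimal sketch of saving the first scatterplot with png(); the filename and dimensions are just placeholders:

# save a plot directly to a file; adjust the filename and size as needed
png("unemployment_vs_education.png", width = 800, height = 600)
plot(communities$NoHighSchool, communities$UnemploymentRate,
     xlab = "Adults without High School education (%)",
     ylab = "Unemployment Rate")
dev.off()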