Before you begin, make sure to set your working directory to a folder where your course-related files, such as R scripts and datasets, are kept. We recommend that you create a PUBLG100 folder for all your work. Create this folder on the N: drive if you're using a UCL computer, or somewhere on your local disk if you're using a personal laptop. Once the folder is created, use the setwd() function in R to set your working directory.
                            Recommended Folder Location   R Function
UCL Computers               N: Drive                      setwd("N:/PUBLG100")
Personal Laptop (Windows)   C: Drive                      setwd("C:/PUBLG100")
Personal Laptop (Mac)       Home Folder                   setwd("~/PUBLG100")
After you've set the working directory, verify it by calling the getwd() function.
getwd()
Now download the R script for this seminar from the "Download .R Script" button above, and save it to your PUBLG100 folder.
# clear environment
rm(list = ls())
We're introducing two new packages in this week's seminar. The Zelig package provides functionality that simplifies some regression tasks, while texreg makes it easy to produce publication-quality output from our regression models. We'll discuss these packages in more detail as we go along, but for now let's just install them with the install.packages() function.
NOTE: Make sure you install Zelig exactly as shown below.
install.packages("https://cran.r-project.org/src/contrib/Archive/Zelig/Zelig_4.21.tar.gz",
                 repos = NULL,
                 type = "source")
Installing package into '/Users/altaf/Library/R/3.2/library'
(as 'lib' is unspecified)
install.packages("texreg")
Installing package into '/Users/altaf/Library/R/3.2/library'
(as 'lib' is unspecified)
The downloaded binary packages are in
/var/folders/62/x92vvd_n08qb3vw5wb_706h80000gp/T//RtmpixY5GY/downloaded_packages
Now, let's load Zelig, texreg, and the dplyr package that we used last week.
NOTE: We installed dplyr last week, but if you don't have it installed for some reason then make sure to install it as well.
library(texreg)
Warning: package 'texreg' was built under R version 3.2.3
library(Zelig)
library(dplyr)
Recall from last week that the Quality of Governance (QoG) Institute data we worked with had all the variables of interest available to us in a single dataset. The QoG Institute compiles datasets from various public sources and provides them as a single, comparable dataset that's convenient to work with. Often, however, the variables we're interested in are not available from a single source and must be merged together to create a single dataset.
The dataset we're working with today comes from the 1990 US Census and provides demographic and socioeconomic data in multiple files. The dataset includes observations from 1,994 communities, with each observation identified by the state and communityname variables.
Let's load the datasets into R using read.csv() and confirm that state and communityname are indeed present in each dataset.
# load the communities datasets
communities <- read.csv("http://uclspp.github.io/PUBLG100/data/communities.csv")
communities_employment <- read.csv("http://uclspp.github.io/PUBLG100/data/communities_employment.csv")
Now let's see what the dataset looks like. We normally use the names() function to see what the variables are called, but we would also like to know what type of data is stored in each variable. We can use the str() function, which gives us a bit more information about a dataset than the names() function.
names(communities)
[1] "state" "county" "community"
[4] "communityname" "fold" "population"
[7] "householdsize" "racepctblack" "racePctWhite"
[10] "racePctAsian" "racePctHisp" "agePct12t21"
[13] "agePct12t29" "agePct16t24" "agePct65up"
[16] "numbUrban" "pctUrban" "medIncome"
[19] "pctWWage" "pctWFarmSelf" "pctWInvInc"
[22] "pctWSocSec" "pctWPubAsst" "pctWRetire"
[25] "medFamInc" "perCapInc" "whitePerCap"
[28] "blackPerCap" "indianPerCap" "AsianPerCap"
[31] "OtherPerCap" "HispPerCap" "NumUnderPov"
[34] "PctPopUnderPov" "PctLess9thGrade" "PctNotHSGrad"
[37] "PctBSorMore"
str(communities)
'data.frame': 1994 obs. of 37 variables:
$ state : int 8 53 24 34 42 6 44 6 21 29 ...
$ county : int NA NA NA 5 95 NA 7 NA NA NA ...
$ community : int NA NA NA 81440 6096 NA 41500 NA NA NA ...
$ communityname : Factor w/ 1828 levels "Aberdeencity",..: 796 1626 2 1788 142 1520 840 1462 669 288 ...
$ fold : int 1 1 1 1 1 1 1 1 1 1 ...
$ population : num 0.19 0 0 0.04 0.01 0.02 0.01 0.01 0.03 0.01 ...
$ householdsize : num 0.33 0.16 0.42 0.77 0.55 0.28 0.39 0.74 0.34 0.4 ...
$ racepctblack : num 0.02 0.12 0.49 1 0.02 0.06 0 0.03 0.2 0.06 ...
$ racePctWhite : num 0.9 0.74 0.56 0.08 0.95 0.54 0.98 0.46 0.84 0.87 ...
$ racePctAsian : num 0.12 0.45 0.17 0.12 0.09 1 0.06 0.2 0.02 0.3 ...
$ racePctHisp : num 0.17 0.07 0.04 0.1 0.05 0.25 0.02 1 0 0.03 ...
$ agePct12t21 : num 0.34 0.26 0.39 0.51 0.38 0.31 0.3 0.52 0.38 0.9 ...
$ agePct12t29 : num 0.47 0.59 0.47 0.5 0.38 0.48 0.37 0.55 0.45 0.82 ...
$ agePct16t24 : num 0.29 0.35 0.28 0.34 0.23 0.27 0.23 0.36 0.28 0.8 ...
$ agePct65up : num 0.32 0.27 0.32 0.21 0.36 0.37 0.6 0.35 0.48 0.39 ...
$ numbUrban : num 0.2 0.02 0 0.06 0.02 0.04 0.02 0 0.04 0.02 ...
$ pctUrban : num 1 1 0 1 0.9 1 0.81 0 1 1 ...
$ medIncome : num 0.37 0.31 0.3 0.58 0.5 0.52 0.42 0.16 0.17 0.54 ...
$ pctWWage : num 0.72 0.72 0.58 0.89 0.72 0.68 0.5 0.44 0.47 0.59 ...
$ pctWFarmSelf : num 0.34 0.11 0.19 0.21 0.16 0.2 0.23 1 0.36 0.22 ...
$ pctWInvInc : num 0.6 0.45 0.39 0.43 0.68 0.61 0.68 0.23 0.34 0.86 ...
$ pctWSocSec : num 0.29 0.25 0.38 0.36 0.44 0.28 0.61 0.53 0.55 0.42 ...
$ pctWPubAsst : num 0.15 0.29 0.4 0.2 0.11 0.15 0.21 0.97 0.48 0.02 ...
$ pctWRetire : num 0.43 0.39 0.84 0.82 0.71 0.25 0.54 0.41 0.43 0.31 ...
$ medFamInc : num 0.39 0.29 0.28 0.51 0.46 0.62 0.43 0.15 0.21 0.85 ...
$ perCapInc : num 0.4 0.37 0.27 0.36 0.43 0.72 0.47 0.1 0.23 0.89 ...
$ whitePerCap : num 0.39 0.38 0.29 0.4 0.41 0.76 0.44 0.12 0.23 0.94 ...
$ blackPerCap : num 0.32 0.33 0.27 0.39 0.28 0.77 0.4 0.08 0.19 0.11 ...
$ indianPerCap : num 0.27 0.16 0.07 0.16 0 0.28 0.24 0.17 0.1 0.09 ...
$ AsianPerCap : num 0.27 0.3 0.29 0.25 0.74 0.52 0.86 0.27 0.26 0.33 ...
$ OtherPerCap : num 0.36 0.22 0.28 0.36 0.51 0.48 0.24 0.18 0.29 0.17 ...
$ HispPerCap : num 0.41 0.35 0.39 0.44 0.48 0.6 0.36 0.21 0.22 0.8 ...
$ NumUnderPov : num 0.08 0.01 0.01 0.01 0 0.01 0.01 0.03 0.04 0 ...
$ PctPopUnderPov : num 0.19 0.24 0.27 0.1 0.06 0.12 0.11 0.64 0.45 0.11 ...
$ PctLess9thGrade: num 0.1 0.14 0.27 0.09 0.25 0.13 0.29 0.96 0.52 0.04 ...
$ PctNotHSGrad : num 0.18 0.24 0.43 0.25 0.3 0.12 0.41 0.82 0.59 0.03 ...
$ PctBSorMore : num 0.48 0.3 0.19 0.31 0.33 0.8 0.36 0.12 0.17 1 ...
It looks like state is a numeric code and communityname is the name of the community. There are also county and community variables, but they contain some missing values (NAs), so we won't use them. Notice how read.csv() automatically converted communityname to a factor variable when loading the dataset.
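A factor stores each value as an integer code together with a table of text labels called levels. Here is a quick sketch of how to inspect it (assuming the communities dataset has been loaded as above):

```r
# confirm the storage type of the column
class(communities$communityname)

# number of distinct community names (the factor's levels)
nlevels(communities$communityname)

# convert back to plain character strings when needed
head(as.character(communities$communityname))
```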
Now, let's do the same for the employment dataset.
str(communities_employment)
'data.frame': 1994 obs. of 8 variables:
$ state : int 8 53 24 34 42 6 44 6 21 29 ...
$ communityname : Factor w/ 1828 levels "Aberdeencity",..: 796 1626 2 1788 142 1520 840 1462 669 288 ...
$ PctUnemployed : num 0.27 0.27 0.36 0.33 0.12 0.1 0.28 1 0.55 0.11 ...
$ PctEmploy : num 0.68 0.73 0.58 0.71 0.65 0.65 0.54 0.26 0.43 0.44 ...
$ PctEmplManu : num 0.23 0.57 0.32 0.36 0.67 0.19 0.44 0.43 0.59 0.2 ...
$ PctEmplProfServ : num 0.41 0.15 0.29 0.45 0.38 0.77 0.53 0.34 0.36 1 ...
$ PctOccupManu : num 0.25 0.42 0.49 0.37 0.42 0.06 0.33 0.71 0.64 0.02 ...
$ PctOccupMgmtProf: num 0.52 0.36 0.32 0.39 0.46 0.91 0.49 0.18 0.29 0.96 ...
It seems that state and communityname are common to both datasets, so we can use them to do the merge.
Now let's use the merge() function to merge these two datasets together. There are three arguments that we need to provide to the merge() function so it knows what we are trying to merge and how.
merge(x, y, by)
Argument   Description
x          The first dataset to merge. This is the communities dataset in our case.
y          The second dataset to merge. This is the communities_employment dataset.
by         Name of the column or columns that are common to both datasets. We know that state and communityname are common to both datasets, so we'll pass them as a vector by combining the two names together.
For more information on how the merge() function works, type help(merge) in R.
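To get a feel for what merging on two key columns does, here is a small sketch with two made-up data frames (the names toy_a and toy_b are purely for illustration):

```r
# two tiny, made-up data frames sharing the key columns "state" and "communityname"
toy_a <- data.frame(state = c(1, 1, 2),
                    communityname = c("Alpha", "Beta", "Alpha"),
                    population = c(100, 200, 300))

toy_b <- data.frame(state = c(1, 2, 1),
                    communityname = c("Alpha", "Alpha", "Beta"),
                    unemployment = c(0.05, 0.07, 0.04))

# rows are matched on BOTH keys, so (1, "Alpha") and (2, "Alpha") remain distinct
merged_toy <- merge(toy_a, toy_b, by = c("state", "communityname"))
merged_toy
```

By default, merge() keeps only the rows whose key values appear in both data frames; rows without a match are dropped.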
# merge the two datasets
communities <- merge(communities, communities_employment, by = c("state", "communityname"))
# explore dataset
names(communities)
[1] "state" "communityname" "county"
[4] "community" "fold" "population"
[7] "householdsize" "racepctblack" "racePctWhite"
[10] "racePctAsian" "racePctHisp" "agePct12t21"
[13] "agePct12t29" "agePct16t24" "agePct65up"
[16] "numbUrban" "pctUrban" "medIncome"
[19] "pctWWage" "pctWFarmSelf" "pctWInvInc"
[22] "pctWSocSec" "pctWPubAsst" "pctWRetire"
[25] "medFamInc" "perCapInc" "whitePerCap"
[28] "blackPerCap" "indianPerCap" "AsianPerCap"
[31] "OtherPerCap" "HispPerCap" "NumUnderPov"
[34] "PctPopUnderPov" "PctLess9thGrade" "PctNotHSGrad"
[37] "PctBSorMore" "PctUnemployed" "PctEmploy"
[40] "PctEmplManu" "PctEmplProfServ" "PctOccupManu"
[43] "PctOccupMgmtProf"
View(communities)
The select() Function
Since our dataset has more columns than we need, let's select only the ones we're interested in and rename them with meaningful names. One approach would be to use either the subset() function or the square bracket [ ] extraction operator to select the columns we're interested in. But the easiest way to accomplish this is with the dplyr select() function, which allows us to select the columns we need and rename them at the same time.
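For comparison, here is a sketch of how the same selection and renaming could be done in base R with the square bracket operator (assuming communities is the merged dataset from above); select() does both steps in a single call:

```r
# base-R sketch: pick the columns by name, then rename them in a second step
selected <- communities[, c("state", "communityname", "PctUnemployed",
                            "PctNotHSGrad", "racePctWhite")]
names(selected) <- c("state", "Community", "UnemploymentRate",
                     "NoHighSchool", "White")
```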
communities <- select(communities,
                      state,
                      Community = communityname,
                      UnemploymentRate = PctUnemployed,
                      NoHighSchool = PctNotHSGrad,
                      White = racePctWhite)
Now that we've merged the dataset and renamed the columns the way we want, let's try to visualize the data with a simple plot command.
Is there a correlation between unemployment rate and lack of high school level education?
plot(communities$NoHighSchool, communities$UnemploymentRate,
xlab = "Adults without High School education (%)",
ylab = "Unemployment Rate")
In order to answer that question empirically, we will run a linear regression using the lm() function in R. The lm() function needs to know the relationship we're trying to model and the dataset for our observations. The two arguments we need to provide to the lm() function are described below.
Argument   Description
formula    The formula describes the relationship between the dependent and independent variables, for example dependent.variable ~ independent.variable. Recall from last week that we used something very similar with the t-test when comparing the difference of means. In our case, we'd like to model the relationship using the formula: UnemploymentRate ~ NoHighSchool
data       This is simply the name of the dataset that contains the variables of interest. In our case, this is the merged dataset called communities.
For more information on how the lm() function works, type help(lm) in R.
model1 <- lm(UnemploymentRate ~ NoHighSchool, data = communities)
The lm() function has modeled the relationship between UnemploymentRate and NoHighSchool, and we've saved it in a variable called model1. Let's use the summary() function to see what this linear model looks like.
summary(model1)
Call:
lm(formula = UnemploymentRate ~ NoHighSchool, data = communities)
Residuals:
     Min       1Q   Median       3Q      Max
-0.42347 -0.08499 -0.01189  0.07711  0.56470

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.078952   0.006483   12.18   <2e-16 ***
NoHighSchool 0.742385   0.014955   49.64   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1352 on 1992 degrees of freedom
Multiple R-squared: 0.553,  Adjusted R-squared: 0.5527
F-statistic: 2464 on 1 and 1992 DF,  p-value: < 2.2e-16
The output from lm() might seem overwhelming at first, so let's break it down one item at a time.
1. The dependent variable, also sometimes called the outcome variable. We are trying to model the effects of NoHighSchool on UnemploymentRate, so UnemploymentRate is the dependent variable.
2. The independent variable, or the predictor variable. In our example, NoHighSchool is the independent variable.
3. The differences between the observed values and the predicted values are called residuals.
4. The coefficients for the intercept and the independent variables. Using the coefficients, we can write down the relationship between the dependent and the independent variables as: UnemploymentRate = 0.078952 + (0.7423853 * NoHighSchool). This tells us that for each unit increase in the variable NoHighSchool, the UnemploymentRate increases by 0.7423853.
5. The p-value of the model. Recall that according to the null hypothesis, the coefficient of interest is zero. The p-value tells us whether we can reject the null hypothesis or not.
6. The standard error estimates the standard deviation of the coefficients in our model. We can think of the standard error as a measure of precision for the estimated coefficients.
7. The t statistic is obtained by dividing the coefficients by the standard error.
8. The R-squared and adjusted R-squared tell us how much of the variance in our model is accounted for by the independent variable. The adjusted R-squared is always smaller than the R-squared, as it takes into account the number of independent variables and degrees of freedom.
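Most of these quantities can also be extracted from the fitted model programmatically. A sketch, assuming model1 has been estimated as above:

```r
# the full coefficient table: estimates, standard errors, t values, p-values
coefs <- summary(model1)$coefficients
coefs

# the t statistic is just the estimate divided by its standard error
coefs["NoHighSchool", "Estimate"] / coefs["NoHighSchool", "Std. Error"]

# residuals are the observed values minus the fitted values
head(residuals(model1))
head(communities$UnemploymentRate - fitted(model1))

# R-squared
summary(model1)$r.squared
```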
Now let's plot the regression line with our observations using the abline() function.
plot(communities$NoHighSchool, communities$UnemploymentRate,
xlab = "Adults without High School education (%)",
ylab = "Unemployment Rate")
abline(model1, col = "red")
While the summary() function provides a slew of information about a fitted regression model, we often need to present our findings in easy-to-read tables similar to what you see in journal publications. The texreg package we installed earlier allows us to do just that.
Let's take a look at how to display the output of a regression model on the screen using the screenreg() function from texreg.
screenreg(model1)
=========================
Model 1

(Intercept) 0.08 ***
(0.01)
NoHighSchool 0.74 ***
(0.01)

R^2 0.55
Adj. R^2 0.55
Num. obs. 1994
RMSE 0.14
=========================
*** p < 0.001, ** p < 0.01, * p < 0.05
Returning to our example, are there other variables that might explain unemployment rates in our communities dataset? For example, is the unemployment rate higher or lower in communities with different levels of minority population?
We first create a new variable called Minority by subtracting the percentage of the White population from 1. Alternatively, we could have added up the percentages of the Black, Hispanic, and Asian populations to get the percentage of minority population, since our census data also has those variables.
communities$Minority <- 1 - communities$White
Next, we fit a linear model using Minority as the independent variable.
model2 <- lm(UnemploymentRate ~ Minority, data = communities)
summary(model2)
Call:
lm(formula = UnemploymentRate ~ Minority, data = communities)
Residuals:
     Min       1Q   Median       3Q      Max
-0.45521 -0.12189 -0.02369  0.10162  0.68203

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.257948   0.005506   46.85   <2e-16 ***
Minority    0.428702   0.015883   26.99   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.173 on 1992 degrees of freedom
Multiple R-squared: 0.2678,  Adjusted R-squared: 0.2674
F-statistic: 728.5 on 1 and 1992 DF,  p-value: < 2.2e-16
Now let's see how this model compares to our first model. We can show the regression line from model2 just like we did with our first model.
plot(communities$Minority, communities$UnemploymentRate,
xlab = "Minority population (%)",
ylab = "Unemployment Rate")
abline(model2, col = "blue")
Does model2 offer a better fit than model1? Maybe we can answer that question by looking at the regression tables instead. Let's print the two models side-by-side in a single table with screenreg().
screenreg(list(model1, model2))
======================================
Model 1 Model 2

(Intercept) 0.08 *** 0.26 ***
(0.01) (0.01)
NoHighSchool 0.74 ***
(0.01)
Minority 0.43 ***
(0.02)

R^2 0.55 0.27
Adj. R^2 0.55 0.27
Num. obs. 1994 1994
RMSE 0.14 0.17
======================================
*** p < 0.001, ** p < 0.01, * p < 0.05
Finally, let's save the same output as a Microsoft Word document using htmlreg().
htmlreg(list(model1, model2), file = "lab4_model_comparison.doc")
The table was written to the file 'lab4_model_comparison.doc'.
We will use the Zelig package since it provides convenient functions for making predictions and plotting confidence intervals. First, let's go back to our last model, where we used Minority as the independent variable, but this time estimate the model using the Zelig package. The arguments for the zelig() function are very similar to what we used with lm(). The only difference is that we need to tell Zelig what type of model we're trying to estimate.
Argument   Description
formula    The formula describes the relationship between the dependent and independent variables, for example dependent.variable ~ independent.variable.
data       This is the name of the dataset that contains the variables of interest.
model      This is where we tell Zelig the type of model to estimate. Since we're interested in a least squares regression model, we will use the model = "ls" argument.
z.out <- zelig(UnemploymentRate ~ Minority, data = communities, model = "ls")
How to cite this model in Zelig:
Kosuke Imai, Gary King, and Olivia Lau. 2015.
"ls: Least Squares Regression for Continuous Dependent Variables"
in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
http://gking.harvard.edu/zelig
The model estimated with zelig() is saved in a variable called z.out. We can use the summary() function exactly the same way as we used it with lm() to print out the estimated model.
summary(z.out)
Call:
lm(formula = formula, weights = weights, model = F, data = data)
Residuals:
     Min       1Q   Median       3Q      Max
-0.45521 -0.12189 -0.02369  0.10162  0.68203

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.257948   0.005506   46.85   <2e-16 ***
Minority    0.428702   0.015883   26.99   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.173 on 1992 degrees of freedom
Multiple R-squared: 0.2678,  Adjusted R-squared: 0.2674
F-statistic: 728.5 on 1 and 1992 DF,  p-value: < 2.2e-16
Next, let's use Zelig's setx() and sim() functions for making predictions and plotting confidence intervals.
x.out <- setx(z.out, Minority = seq(0, 1, 0.1))
s.out <- sim(z.out, x = x.out)
summary(s.out)
Expected Values: E(YX)
mean sd 50% 2.5% 97.5%
[Minority=0] 0.2579665 0.005677340 0.2580478 0.2473649 0.2691032
[Minority=0.1] 0.3007337 0.004531902 0.3007323 0.2918661 0.3099384
[Minority=0.2] 0.3435213 0.003877042 0.3434656 0.3361714 0.3508412
[Minority=0.3] 0.3865521 0.004068235 0.3863071 0.3787131 0.3950038
[Minority=0.4] 0.4293833 0.004442889 0.4293484 0.4207419 0.4378098
[Minority=0.5] 0.4721306 0.005599804 0.4720700 0.4614503 0.4828565
[Minority=0.6] 0.5153974 0.006867561 0.5153081 0.5021815 0.5295274
[Minority=0.7] 0.5578146 0.008156760 0.5576684 0.5416097 0.5739449
[Minority=0.8] 0.6002469 0.009260266 0.6001272 0.5817250 0.6185873
[Minority=0.9] 0.6436730 0.011278216 0.6438739 0.6220530 0.6662544
[Minority=1] 0.6866793 0.012432337 0.6863878 0.6627996 0.7108965
Predicted Values: YX
mean sd 50% 2.5% 97.5%
[Minority=0] 0.2579665 0.005677340 0.2580478 0.2473649 0.2691032
[Minority=0.1] 0.3007337 0.004531902 0.3007323 0.2918661 0.3099384
[Minority=0.2] 0.3435213 0.003877042 0.3434656 0.3361714 0.3508412
[Minority=0.3] 0.3865521 0.004068235 0.3863071 0.3787131 0.3950038
[Minority=0.4] 0.4293833 0.004442889 0.4293484 0.4207419 0.4378098
[Minority=0.5] 0.4721306 0.005599804 0.4720700 0.4614503 0.4828565
[Minority=0.6] 0.5153974 0.006867561 0.5153081 0.5021815 0.5295274
[Minority=0.7] 0.5578146 0.008156760 0.5576684 0.5416097 0.5739449
[Minority=0.8] 0.6002469 0.009260266 0.6001272 0.5817250 0.6185873
[Minority=0.9] 0.6436730 0.011278216 0.6438739 0.6220530 0.6662544
[Minority=1] 0.6866793 0.012432337 0.6863878 0.6627996 0.7108965
plot(s.out, qi = "pv", xlab = "Minority population (%)")
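For comparison, base R's predict() can produce the same kind of expected values with 95% confidence intervals, without Zelig. A sketch, assuming model2 from earlier:

```r
# predictions at the same sequence of Minority values we used with setx()
new_data <- data.frame(Minority = seq(0, 1, 0.1))

# fit = expected value; lwr/upr = bounds of the 95% confidence interval
preds <- predict(model2, newdata = new_data, interval = "confidence", level = 0.95)
head(preds)
```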
Linear Regression Interactive App
Exercises
1. Rename the PctImmigRec5 variable to RecentImmigration5.
2. Estimate a model explaining the relationship between unemployment rate and RecentImmigration5.
3. How does this model compare to the models we estimated with NoHighSchool and Minority as independent variables? Present your findings by comparing the output of all three models in side-by-side tables using the texreg package.
4. Calculate predicted values for RecentImmigration5 and plot the confidence interval using Zelig.
5. Save all the plots as graphics files. Hint: Use the png() function for saving the graphics file directly from your R script.