Introduction to Quantitative Methods

8. Panel Data, Time-Series Cross-Section Models

8.1 Seminar

Setting a Working Directory

Before you begin, make sure to set your working directory to a folder where your course related files such as R scripts and datasets are kept. We recommend that you create a PUBLG100 folder for all your work. Create this folder on N: drive if you're using a UCL computer or somewhere on your local disk if you're using a personal laptop.

Once the folder is created, use the setwd() function in R to set your working directory.

Recommended Folder Location R Function
UCL Computers N: Drive setwd("N:/PUBLG100")
Personal Laptop (Windows) C: Drive setwd("C:/PUBLG100")
Personal Laptop (Mac) Home Folder setwd("~/PUBLG100")

After you've set the working directory, verify it by calling the getwd() function.

getwd()

Now download the R script for this seminar from the "Download .R Script" button above, and save it to your PUBLG100 folder.

If you're using a UCL computer, please make sure you're running R version 3.2.0. Some of the seminar tasks and exercises will not work with older versions of R. Click here for help on how to start the new version of R on UCL computers.

Required Packages

Let's start off by installing two new packages that we are going to need for this week's exercises. The tidyr package works with dplyr and is extremely useful for data wrangling tasks such as reshaping from long-to-wide format or vice-versa. The plm package provides functionality for estimating linear models for panel data.

install.packages("plm")
Installing package into '/Users/altaf/Library/R/3.2/library'
(as 'lib' is unspecified)


The downloaded binary packages are in
    /var/folders/62/x92vvd_n08qb3vw5wb_706h80000gp/T//RtmpKXtnO6/downloaded_packages
install.packages("tidyr")
Installing package into '/Users/altaf/Library/R/3.2/library'
(as 'lib' is unspecified)


The downloaded binary packages are in
    /var/folders/62/x92vvd_n08qb3vw5wb_706h80000gp/T//RtmpKXtnO6/downloaded_packages

Now load the packages in this order:

library(plm)
library(lmtest)
library(texreg)
Warning: package 'texreg' was built under R version 3.2.3
library(tidyr)
library(dplyr)
# clear environment
rm(list = ls())

World Development Indicators (WDI)

We use World Bank's Databank to download World Development Indicators (WDI) from 1960 to 2015 for all 193 UN member states. There are over 1300 indicators available from WDI but we're only interested in the following for the first exercise:

Indicator Definition
NY.GDP.MKTP.KD.ZG GDP growth (annual %)
SH.H2O.SAFE.ZS Improved water source (% of population with access)
SH.XPD.TOTL.ZS Health expenditure, total (% of GDP)
SH.STA.MMRT Maternal mortality ratio (modeled estimate, per 100,000 live births)
SH.PRG.ANEM Prevalence of anemia among pregnant women (%)
SH.MED.NUMW.P3 Nurses and midwives (per 1,000 people)

The query used to download this dataset is available here. You can copy this query and modify it to suit your needs for your own research.

We will use this dataset to understand the causes of maternal mortality rates by demonstrating panel data regression in R. Datasets downloaded from WDI also come with a codebook describing each indicator. The codebook for the dataset we're using today is available here

First, let's load the dataset using the read.csv() function.

wdi_original <- read.csv("http://uclspp.github.io/PUBLG100/data/WDI_Data.csv", na.strings = "..")

IMPORTANT: Notice the use of na.strings = "..". World Bank's WDI datasets by default use two dots ".." to indicate missing values. Although you can change this behavior prior to downloading the dataset, we're showing you one way to get around this problem by telling read.csv() to treat ".." as missing values or NA.

Let's explore the dataset.

View(wdi_original)

As you can see, WDI dataset is not convenient to work with in it's original layout so we need to reshape it for our analysis. By default, indicators and countries are in rows while years are in columns. This is something you can change before downloading the dataset but it's easy enough to deal with using a couple of reshaping functions from dplyr and tidyr. There are 5 steps necessary for reshaping this dataset, and we'll explain each step individually and then put them all together.

  1. Remove the first column from the dataset since we don't need it and it causes problems later. The column name is Series.Name but sometimes it includes non-printable characters so we'll remove it by its numeric index of 1 instead to avoid potential problems.
  2. Filter out any observation where Country.Code is an empty string.
  3. Values for each year are in a separate column, such as X1960..YR1960 for 1960. X1961..YR1961 for 1961 and so on. We want to collect or gather() all these values into a single column called Series.Value.
  4. Next, we convert all indicators to individual columns using spread().
  5. Finally, we extract the year as a numeric using the substring() function.

Now that we know the individual steps, we can chain them together using the %>% pipe operator we saw a few weeks ago.

wdi <- wdi_original %>%
  select(-1) %>% 
  filter(Country.Code != "") %>%
  gather(Year, Series.Value, X1960..YR1960.:X2015..YR2015.) %>%
  spread(Series.Code, Series.Value) %>%
  mutate(Year = as.numeric(substring(Year, 2, 5)))

NOTE: You can use the block of code above for any dataset you download in WDI's default layout. If you change the time period then just remember to adjust the column names on the line where we call the gather() function.

Now that the dataset has been reshaped, we need to rename the variable names to something meaningful.

wdi <- rename(wdi,
              CountryName = Country.Name,
              CountryCode = Country.Code,
              SafeWaterAccess = SH.H2O.SAFE.ZS,
              MaternalMortality = SH.STA.MMRT,
              PregnantWomenWithAnemia = SH.PRG.ANEM,
              HealthExpenditure = SH.XPD.TOTL.ZS)

Let's explore the dataset again and make sure this is how we want the layout to look like.

View(wdi)

Fixed Effects

We would like to understand the causes of maternal mortality rates around the world. In particular, we would like to know how maternal morality rates change over time and across countries and what factors are related to this change.

Our dataset includes a number of indicators that we believe could be related to maternal mortality rates such as healthcare expenditure, proportion of population with access to safe water, etc. But we also know that a number of other factors could also have a profound effect on maternal mortality rates but they are either not easily available or not observable at all. Some of these factors could include a lack of awareness of preventable diseases and attitudes towards prevention or prenatal care that vary greatly from country to country but don't change over time within the same country. Without access to datasets controlling for these factors, our model would suffer from omitted variable bias but a fixed effect model allows us to control for these variables that are constant over time.

We estimate the fixed effect model using the plm() function from the package plm. The plm() function works in a similar fashion as lm() but needs a few additional arguments:

plm(formula, data, index, model, effect)
Argument Description
formula A formula that models the relationship between dependent variable and the independent variables. This is exactly what we do with lm()
data A dataframe object
index The index with variables representing the indices for individual and time values. For example c("CountryCode", "Year")
model For fixed effects we use "within", but model could be one of the following:
- "within"
- "between"
- "random"
- "pooling"
- "fd"
- "ht"
effect One of the following:
- "individual"
- "time"
- "twoways"

We can now run a country fixed effect model using the plm() function. The effect = "individual" tells plm() to estimate "individual" or country fixed effects.

fixed_effects <- plm(MaternalMortality ~ SafeWaterAccess + HealthExpenditure + PregnantWomenWithAnemia, 
                     data = wdi, 
                     index = c("CountryCode", "Year"), 
                     model = "within", 
                     effect = "individual")
summary(fixed_effects)
Oneway (individual) effect Within Model

Call:
plm(formula = MaternalMortality ~ SafeWaterAccess + HealthExpenditure + 
    PregnantWomenWithAnemia, data = wdi, effect = "individual", 
    model = "within", index = c("CountryCode", "Year"))

Unbalanced Panel: n=175, T=7-17, N=2934

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max. 
-563.000  -15.300    0.596   13.300  534.000 

Coefficients :
                        Estimate Std. Error  t-value  Pr(>|t|)    
SafeWaterAccess         -8.79494    0.40917 -21.4948 < 2.2e-16 ***
HealthExpenditure       -4.16778    1.22865  -3.3922 0.0007032 ***
PregnantWomenWithAnemia  5.91744    0.53636  11.0325 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    15806000
Residual Sum of Squares: 10147000
R-Squared      :  0.35799 
      Adj. R-Squared :  0.33627 
F-statistic: 512.248 on 3 and 2756 DF, p-value: < 2.22e-16

The country fixed effects model shows SafeWaterAccess and HealthExpenditure with strong negative correlation with MaternalMortality and PregnantWomenWithAnemia with a strong positive correlation. Intuitively this makes sense. Increase in healthcare expenditure and access to safe drinking water are likely to reduce maternal mortality rates. Similarly, anemia is often linked to malnutrition and would understandably increase risk of complications in pregnant women.

But before we go any further, we need to check whether there are indeed any country fixed effects to begin with. We can test for that with plmtest() function. The plmtest() function can test for the presence of individual or time effects.

plmtest(fixed_effects, effect="individual")
    Lagrange Multiplier Test - (Honda)

data:  MaternalMortality ~ SafeWaterAccess + HealthExpenditure + PregnantWomenWithAnemia
normal = 137.93, p-value < 2.2e-16
alternative hypothesis: significant effects

The null hypothesis for plmtest() is that there are no individual effects. The p-value suggests that we can reject the null hypothesis and that there are indeed country fixed effects present in our model.

In addition to country fixed effects, a number of factors could affect maternal mortality rates that are not specific to each country. For example, development of new vaccines or medicines to fight infections, or investments in awareness campaigns by aid organizations such as WHO about preventable diseases likely vary over time but have similar effects around the world.

We can model these time fixed effects using the effect = "time" argument in plm().

time_effects <- plm(MaternalMortality ~ SafeWaterAccess + HealthExpenditure + PregnantWomenWithAnemia, 
                    data = wdi, 
                    index = c("CountryCode", "Year"), 
                    model = "within", 
                    effect = "time")
summary(time_effects)
Oneway (time) effect Within Model

Call:
plm(formula = MaternalMortality ~ SafeWaterAccess + HealthExpenditure + 
    PregnantWomenWithAnemia, data = wdi, effect = "time", model = "within", 
    index = c("CountryCode", "Year"))

Unbalanced Panel: n=175, T=7-17, N=2934

Residuals :
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-486.00  -80.10   -3.21   51.60 1890.00 

Coefficients :
                         Estimate Std. Error t-value  Pr(>|t|)    
SafeWaterAccess         -10.07778    0.26654 -37.810 < 2.2e-16 ***
HealthExpenditure        23.86712    1.64865  14.477 < 2.2e-16 ***
PregnantWomenWithAnemia  11.87103    0.41316  28.732 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    327280000
Residual Sum of Squares: 106910000
R-Squared      :  0.67334 
      Adj. R-Squared :  0.66875 
F-statistic: 2002.18 on 3 and 2914 DF, p-value: < 2.22e-16

When we account for these time variant factors, we notice that the HealthExpenditure variable is now positively correlated with MaternalMortality. At first glance this seems counterintuitive, but keep in mind that with time fixed effects, our model does not distinguish between observations across different countries. If we examine the dataset, we can see that MaternalMortality has steadily declined over the years, but HealthExpenditure around the world has fluctuated with sharp declines during economic crises and relatively smaller increases during economic boom. When we are not using country fixed effects, these fluctuations make it seem as if an increase in HealthExpenditure is correlated with increase in MaternalMortality, but in fact it's just an artifact of time fixed effects.

Let's run the Lagrange Multiplier test on the time_effects model again to see if indeed there are time fixed effects in our model. Remember, the null hypotheses for the test is that there are no time fixed effects.

plmtest(time_effects, effect="time")
    Lagrange Multiplier Test - time effects (Honda)

data:  MaternalMortality ~ SafeWaterAccess + HealthExpenditure + PregnantWomenWithAnemia
normal = -2.1755, p-value = 0.02959
alternative hypothesis: significant effects

The p-value tells us that we can reject the null hypothesis so we know that there are time fixed effects present in our model.

We also confirmed the presence of country fixed effects in the first model we estimated. Now, in order to control for both country AND time fixed effects, we need to estimate a model using the effect = "twoways" argument.

twoway_effects <- plm(MaternalMortality ~ SafeWaterAccess + HealthExpenditure + PregnantWomenWithAnemia, 
                      data = wdi, 
                      index = c("CountryCode", "Year"), 
                      model = "within", 
                      effect = "twoways")
summary(twoway_effects)
Twoways effects Within Model

Call:
plm(formula = MaternalMortality ~ SafeWaterAccess + HealthExpenditure + 
    PregnantWomenWithAnemia, data = wdi, effect = "twoways", 
    model = "within", index = c("CountryCode", "Year"))

Unbalanced Panel: n=175, T=7-17, N=2934

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max. 
-566.000  -16.500   -0.454   14.400  536.000 

Coefficients :
                        Estimate Std. Error  t-value  Pr(>|t|)    
SafeWaterAccess         -8.55480    0.43546 -19.6455 < 2.2e-16 ***
HealthExpenditure       -3.68403    1.29204  -2.8513  0.004386 ** 
PregnantWomenWithAnemia  5.28353    0.68313   7.7343  1.45e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    12164000
Residual Sum of Squares: 10125000
R-Squared      :  0.16761 
      Adj. R-Squared :  0.15653 
F-statistic: 183.906 on 3 and 2740 DF, p-value: < 2.22e-16

The results of all three models are shown below. All three explanatory variables are statistically significant in each model. The coefficients for our explanatory variable in the twoway fixed effect model are close to the country fixed effects indicating that these factors vary greatly across countries than they do across time.

screenreg(list(fixed_effects, time_effects, twoway_effects), 
          custom.model.names = c("Country Fixed Effects", "Time Fixed Effects", "Twoway Fixed Effects"))
========================================================================================
                         Country Fixed Effects  Time Fixed Effects  Twoway Fixed Effects
----------------------------------------------------------------------------------------
SafeWaterAccess            -8.79 ***             -10.08 ***           -8.55 ***         
                           (0.41)                 (0.27)              (0.44)            
HealthExpenditure          -4.17 ***              23.87 ***           -3.68 **          
                           (1.23)                 (1.65)              (1.29)            
PregnantWomenWithAnemia     5.92 ***              11.87 ***            5.28 ***         
                           (0.54)                 (0.41)              (0.68)            
----------------------------------------------------------------------------------------
R^2                         0.36                   0.67                0.17             
Adj. R^2                    0.34                   0.67                0.16             
Num. obs.                2934                   2934                2934                
========================================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05

More Guns, Less Crimes

We now turn our attention to serial correlation and cross sectional dependence and use the Guns dataset from Stock and Watson.

Gun rights advocate John Lott argues in his book More Guns, Less Crimes that crime rates in the United States decrease when gun ownership restrictions are relaxed. The data used in Lott's research compares violent crimes, robberies, and murders across 50 states to determine whether the so called "shall" laws that remove discretion from license granting authorities actually decrease crime rates. So far 41 states have passed these "shall" laws where a person applying for a licence to carry a concealed weapon doesn't have to provide justification or "good cause" for requiring a concealed weapon permit.

Let's clear the environment and load the dataset used by Lott and see if we can test the arguments made by gun rights advocates.

# clear the environment first
rm(list = ls())

guns_data <- read.csv("http://uclspp.github.io/PUBLG100/data/guns.csv")

The variables we're interested in are described below. You can also get the original codebook from the Stock and Watson website here that describes other variables in the dataset.

Indicator Definition
mur murder rate (incidents per 100,000)
shall = 1 if the state has a shall-carry law in effect in that year
= 0 otherwise
incarc_rate incarceration rate in the state in the previous year (sentenced prisoners per 100,000 residents; value for the previous year)
pm1029 percent of state population that is male, ages 10 to 29

We will focus on murder rates in this example but you could try the same with variables measuring violent crimes or robberies as well.

Let's create a factor variable representing whether a state has passed "shall" law or not. The variable already exists as 0 or 1 but we want to convert it to a factor for our analysis.

guns_data$shall_law <- factor(guns_data$shall, levels = c(0, 1), labels =c("NO", "YES"))

Let's estimate a fixed effect model on panel data using the plm() function. We will restrict our independent variables to the shall_law, incarc_rate, and pm1029.

fixed_effects <- 
  plm(mur ~ shall_law + incarc_rate + pm1029, 
      data = guns_data, 
      index = c("stateid", "year"), 
      model = "within", 
      effect = "individual")

summary(fixed_effects)
Oneway (individual) effect Within Model

Call:
plm(formula = mur ~ shall_law + incarc_rate + pm1029, data = guns_data, 
    effect = "individual", model = "within", index = c("stateid", 
        "year"))

Balanced Panel: n=51, T=23, N=1173

Residuals :
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-21.100  -0.959   0.016   1.080  29.000 

Coefficients :
               Estimate Std. Error t-value  Pr(>|t|)    
shall_lawYES -1.4513886  0.3154300 -4.6013 4.678e-06 ***
incarc_rate   0.0174551  0.0011261 15.4998 < 2.2e-16 ***
pm1029        0.9582993  0.0859610 11.1481 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    12016
Residual Sum of Squares: 9800
R-Squared      :  0.18444 
      Adj. R-Squared :  0.17595 
F-statistic: 84.3526 on 3 and 1119 DF, p-value: < 2.22e-16

The state_effects model shows that all three of our independent variables are statistically significant, with shall_law decreasing murder rates by 1.45 incidents per 100000 members of the population. The effects of incarceration rate and percentage of male population between 10 and 29 years old are also statistically significant.

Before drawing any conclusions let's make sure whether there are any state effects in our model using plmtest() as we did in the previous exercise.

plmtest(fixed_effects, effect="individual")
    Lagrange Multiplier Test - (Honda)

data:  mur ~ shall_law + incarc_rate + pm1029
normal = 47.242, p-value < 2.2e-16
alternative hypothesis: significant effects

The p-value suggests the presence of state effects so let's run a two-way fixed effect model incorporating both state effects and time effects.

twoway_effects <- 
  plm(mur ~ shall_law + incarc_rate + pm1029, 
      data = guns_data, 
      index = c("stateid", "year"), 
      model = "within", 
      effect = "twoways")

summary(twoway_effects)
Twoways effects Within Model

Call:
plm(formula = mur ~ shall_law + incarc_rate + pm1029, data = guns_data, 
    effect = "twoways", model = "within", index = c("stateid", 
        "year"))

Balanced Panel: n=51, T=23, N=1173

Residuals :
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-19.20000  -0.97500  -0.00697   1.01000  27.10000 

Coefficients :
               Estimate Std. Error t-value  Pr(>|t|)    
shall_lawYES -0.5640474  0.3325054 -1.6964 0.0901023 .  
incarc_rate   0.0209756  0.0011252 18.6411 < 2.2e-16 ***
pm1029        0.7326357  0.2189770  3.3457 0.0008485 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    11263
Residual Sum of Squares: 8519.4
R-Squared      :  0.24357 
      Adj. R-Squared :  0.22779 
F-statistic: 117.746 on 3 and 1097 DF, p-value: < 2.22e-16

In a twoway fixed effects model shall_law is no longer significant and the effect of male population between 10 and 29 years old has decreased from 0.95 to 0.73 incidents per 100,000 population.

Serial Correlation

For time series data we need to address the potential for serial correlation in the error term. We will test for serial correlation with Breusch-Godfrey test using pbgtest() and provide solutions for correcting it if necessary.

pbgtest(twoway_effects)
    Breusch-Godfrey/Wooldridge test for serial correlation in panel
    models

data:  mur ~ shall_law + incarc_rate + pm1029
chisq = 765.16, df = 23, p-value < 2.2e-16
alternative hypothesis: serial correlation in idiosyncratic errors

The null hypothesis for the Breusch-Godfrey test is that there is no serial correlation. The p-value from the test tells us that we can reject the null hypothesis and confirms the presence of serial correlation in our error term.

We can correct for serial correlation using coeftest() similar to how we corrected for heteroskedastic errors. We'll use the vcovHC() function for obtaining a heteroskedasticity-consistent covariance matrix, but since we're interested in correcting for autocorrelation as well, we will specify method = "arellano" which corrects for both heteroskedasticity and autocorrelation.

twoway_effects_hac <- coeftest(twoway_effects, 
                               vcov = vcovHC(twoway_effects, method = "arellano", type = "HC3"))

screenreg(list(twoway_effects, twoway_effects_hac),
          custom.model.names = c("Twoway Fixed Effects", "Twoway Fixed Effects (HAC)"))
==============================================================
              Twoway Fixed Effects  Twoway Fixed Effects (HAC)
--------------------------------------------------------------
shall_lawYES    -0.56               -0.56                     
                (0.33)              (0.48)                    
incarc_rate      0.02 ***            0.02 *                   
                (0.00)              (0.01)                    
pm1029           0.73 ***            0.73                     
                (0.22)              (0.54)                    
--------------------------------------------------------------
R^2              0.24                                         
Adj. R^2         0.23                                         
Num. obs.     1173                                            
==============================================================
*** p < 0.001, ** p < 0.01, * p < 0.05

We can see that with heteroskedasticity and autocorrelation consistent (HAC) standard errors, the percent of male population (10 - 29 yr old) is no longer a significant predictor in our model.

Lagged Dependent Variables (LDV) and Dynamic Models

Another way to address serial correlation is by modeling the time dependence directly. We can think of a dynamic model as one that takes into account whether changes in the predictor variables have an immediate effect on our dependent variable or whether the effects are distributed over time. In our example, do changes in gun laws affect murder rates immediately or are the effects distributed over time?

We can account time dependence by incorporating a Lagged Dependent Variable (LDV) in our model. A Lagged Dependent Variable (LDV) (as the name suggests) is one that "lags" behind the original observation by t time-periods. The lag() function generates lagged variables and has the following form:

lag(x, k)
Argument Description
x A vector or matrix of observations
k Number of lags. Default is 1
ldv_model <- 
  plm(mur ~ lag(mur) + shall_law + incarc_rate + pm1029, 
      data = guns_data, 
      index = c("stateid", "year"), 
      model = "within", 
      effect = "twoways")

summary(ldv_model)
Twoways effects Within Model

Call:
plm(formula = mur ~ lag(mur) + shall_law + incarc_rate + pm1029, 
    data = guns_data, effect = "twoways", model = "within", index = c("stateid", 
        "year"))

Unbalanced Panel: n=51, T=22-23, N=1172

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max. 
-23.8000  -0.6240   0.0133   0.6130  17.2000 

Coefficients :
               Estimate Std. Error t-value  Pr(>|t|)    
lag(mur)      0.6553529  0.0185809 35.2702 < 2.2e-16 ***
shall_lawYES -0.0923557  0.2277497 -0.4055  0.685179    
incarc_rate   0.0029214  0.0009262  3.1541  0.001654 ** 
pm1029        0.3839174  0.1501383  2.5571  0.010689 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    11252
Residual Sum of Squares: 3972.3
R-Squared      :  0.64697 
      Adj. R-Squared :  0.60446 
F-statistic: 501.669 on 4 and 1095 DF, p-value: < 2.22e-16

Cross Sectional Dependence

If a federal law imposed restrictions on gun ownership or licensing requirements then the changes would likely affect all 50 states. This is an example of Cross Sectional Dependence and not accounted for in a fixed effect model. Other scenarios could also trigger cross sectional dependence that we should take into consideration. For example, security policies and law enforcement efforts might change after an extraordinary event (think of mass shootings or terrorist attacks) thus influencing law enforcement practices in all states. We can check for cross sectional dependence using the Pesaran cross sectional dependence test or pcdtest().

pcdtest(twoway_effects)
    Pesaran CD test for cross-sectional dependence in panels

data:  formula
z = 3.9121, p-value = 9.148e-05
alternative hypothesis: cross-sectional dependence

As we've seen with other tests, the null hypothesis is that there is no cross sectional dependence. The p-value, however tells that there is indeed cross-sectional dependence and we need to correct it. There are two general approaches to correcting for cross sectional dependence.

Beck and Katz (1995) method or Panel Corrected Standard Errors (PCSE): We can obtain Panel Corrected Standard Errors (PCSE) by first obtaining a robust variance-covariance matrix for panel models with the Beck and Katz (1995) method using the vcovBK()) and passing it to the familiar coeftest() function.

twoway_effects_pcse <- coeftest(twoway_effects, 
                                vcov = vcovBK(twoway_effects, type="HC3", cluster = "group")) 

twoway_effects_pcse
t test of coefficients:

               Estimate Std. Error t value  Pr(>|t|)    
shall_lawYES -0.5640474  0.7662556 -0.7361    0.4618    
incarc_rate   0.0209756  0.0028249  7.4253 2.254e-13 ***
pm1029        0.7326357  0.5118496  1.4313    0.1526    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results from PCSE are sensitive to the ratio between the number of time periods in the dataset (T) and the total number of observations (N). When we're dealing with large datasets (i.e. the T/N ratio is small), we use the Driscoll and Kraay method instead.

Driscoll and Kraay (1998) (SCC): The cross-sectional and serial correlation (SCC) method by Driscoll and Kraay addresses the limitations of Beck and Katz's PCSE method is therefore preferred for obtaining heteroskedasticity and autocorrelation consistent errors that are also robust to cross-sectional dependence. We can get SCC corrected covariance matrix using the vcovSCC() function.

twoway_effects_scc <- coeftest(twoway_effects,
                               vcov = vcovSCC(twoway_effects, type="HC3", cluster = "group"))

twoway_effects_scc
t test of coefficients:

               Estimate Std. Error t value  Pr(>|t|)    
shall_lawYES -0.5640474  0.4349609 -1.2968 0.1949806    
incarc_rate   0.0209756  0.0062285  3.3677 0.0007844 ***
pm1029        0.7326357  0.3854622  1.9007 0.0576074 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
screenreg(list(fixed_effects, twoway_effects, ldv_model, twoway_effects_pcse, twoway_effects_scc), 
          custom.model.names = c("Country Effects", "Twoway Fixed Effects", "LDV", "PCSE", "SCC"))
======================================================================================
              Country Effects  Twoway Fixed Effects  LDV          PCSE       SCC      
--------------------------------------------------------------------------------------
shall_lawYES    -1.45 ***        -0.56                 -0.09      -0.56      -0.56    
                (0.32)           (0.33)                (0.23)     (0.77)     (0.43)   
incarc_rate      0.02 ***         0.02 ***              0.00 **    0.02 ***   0.02 ***
                (0.00)           (0.00)                (0.00)     (0.00)     (0.01)   
pm1029           0.96 ***         0.73 ***              0.38 *     0.73       0.73    
                (0.09)           (0.22)                (0.15)     (0.51)     (0.39)   
lag(mur)                                                0.66 ***                      
                                                       (0.02)                         
--------------------------------------------------------------------------------------
R^2              0.18             0.24                  0.65                          
Adj. R^2         0.18             0.23                  0.60                          
Num. obs.     1173             1173                  1172                             
======================================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05

Exercises

  1. Download the Comparative Political Data Set. Refer to the Codebook for a complete list of variables.
  2. Estimate a model for the electoral fractionalization of the party system as coded by the rae_ele variable in the dataset using all the variables listed at the end of the exercises section.

    • Estimate a fixed effect model and test for country and time fixed effects.
    • Run the necessary tests to check whether country and time fixed effects are present.
  3. Estimate a twoway model and compare to the previous country and time fixed effect models.

  4. Test for serial correlation and cross sectional dependence in the twoway model.
  5. If either serial correlation or cross sectional dependence is present, use the methods discussed in the seminar to obtain heteroskedastic and autocorrelation consistent standard errors.
  6. Compare the HAC and spatially robust standard errors with the twoway model estimated earlier.
  7. Display the results in publication-ready tables and discuss the substantively significant findings.
Variables to be used for exercises 1-9:
Variable Definition
1. vturn Voter turnout in election
2. judrev Judicial review (existence of an independent body which decides whether laws conform to the constitution).
Coded 0 = no, 1 = yes.
3. ud Net union membership as a proportion wage and salary earners in employment (union density)
4. unemp Unemployment rate, percentage of civilian labour force.
5. unemp_pmp Cash expenditure for unemployment benefits as a percentage of GDP (public and mandatory private).
6. realgdpgr Growth of real GDP, percent change from previous year.
7. debt Gross general government debt (financial liabilities) as a percentage of GDP.
8. socexp_t_pmp Total public and mandatory private social expenditure as a percentage of GDP.