Introduction to Quantitative Methods

3. T-test for Difference in Means and Hypothesis Testing

3.1 Seminar

Setting a Working Directory

Before you begin, make sure to set your working directory to a folder where your course related files such as R scripts and datasets are kept. We recommend that you create a PUBLG100 folder for all your work. Create this folder on N: drive if you're using a UCL computer or somewhere on your local disk if you're using a personal laptop.

Once the folder is created, use the setwd() function in R to set your working directory.

Recommended Folder Location R Function
UCL Computers N: Drive setwd("N:/PUBLG100")
Personal Laptop (Windows) C: Drive setwd("C:/PUBLG100")
Personal Laptop (Mac) Home Folder setwd("~/PUBLG100")

After you've set the working directory, verify it by calling the getwd() function.

getwd()

Now download the R script for this seminar from the "Download .R Script" button above, and save it to your PUBLG100 folder.

If you're using a UCL computer, please make sure you're running R version 3.3.1 or newer. Some of the seminar tasks and exercises will not work with older versions of R.

rm(list=ls()) # clear workspace

Loading Dataset in Stata Format

In the last two seminars we worked with data sets in CSV and Excel (.xlsx) formats and learned how to use the readxl package. In this seminar, we will be loading data from Stata file formats. One of the core packages that comes pre-installed with R allows us to directly read a file saved in Stata format. The package is called foreign and it's designed to provide support for non-native (or "foreign") file formats.

The function we need to use for loading a file in Stata format is read.dta(). We use data from the Institute for Quality of Governance today. You can download a codebook for the variables here.

First, we need to download the dataset and save it to our PUBLG100 folder. If you haven't set your working directory yet, make sure to follow the Getting Started instructions from the top of this page.

Download 'Institute for Quality of Governance' Dataset

Now let's load the dataset and do our usual routine of getting to know our data.

library(foreign) # to work with foreign file formats

# loading a STATA format dataset (remember to load the library foreign 1st)
world.data <- read.dta("QoG2012.dta")

# the dimensions: rows (observations) and columns (variables) 
dim(world.data)
[1] 194   7
# the variable names
names(world.data)
[1] "h_j"         "wdi_gdpc"    "undp_hdi"    "wbgi_cce"    "wbgi_pse"   
[6] "former_col"  "lp_lat_abst"
# let's look at the first few observations
head(world.data)
  h_j   wdi_gdpc undp_hdi   wbgi_cce   wbgi_pse former_col lp_lat_abst
1  -5   628.4074       NA -1.5453584 -1.9343837          0   0.3666667
2  -5  4954.1982    0.781 -0.8538115 -0.6026081          0   0.4555556
3  -5  6349.7207    0.704 -0.7301510 -1.7336243          1   0.3111111
4  NA         NA       NA  1.3267342  1.1980436          0   0.4700000
5  -5  2856.7517    0.381 -1.2065741 -1.4150945          1   0.1366667
6  NA 13981.9795    0.800  0.8624368  0.7084046          1   0.1892222

Dplyr Package

Today we introduce another package called dplyr. It is a package that makes it easier for you to work with data sets. It includes functions that let you rename variables, pick out observations that fulfill certain conditions, like gender==female, lets you select certain variables and many more things. From now on, we will introduce a set of new function of dplyr every week. But first we will install the package with install.packages("dplyr"). Each time you want to use the package you have to load it with library(dplyr).

install.packages("dplyr")
Installing package into '/Users/altaf/Library/R/3.3/library'
(as 'lib' is unspecified)


The downloaded binary packages are in
    /var/folders/gq/3p66r9k554s9h3s6fc16tp600000gn/T//RtmpSQ5Xv8/downloaded_packages

Dplyr rename()

The first function from the dplyr package that will make your life easier is rename(). You may have noticed that the variable names in our data are quite cryptic. With the help of the codebook we understand what is going on but we may want to use more intuitive variable names instead. The rename() function lets you do that.

Here's a list of arguments we need to provide to the rename() function and their descriptions.

rename(dataset, expression)
Argument Description
dataset The first argument is the name of your dataset (we have called it world.data).
expression The second argument is the expression that assigns a new variable name to an existing variable in our dataset. For example:
new.variable = existing.variable

Remember, to use any of the dplyr functions, you must first load package with library(dplyr).

# load dplyr
library(dplyr)

We want to rename the variable h_j which is a categorical variable, where 1 means that an independent judiciary exists and the other category means that the executive branch has some control over the judiciary. In order to save your new variable name in the dataset world.data, we have to assign our renamed variable to the dataset using the <- symbol. We will now rename h_j into judiciary.

# dataset.name <- rename(argument1, argument2 = argument3)
# h_j = 1 means there is an independent judiciary
# rename h_j to judiciary

# rename a variable and save the result in our data frame
world.data <- rename(world.data, judiciary = h_j)

# check the result
names(world.data)
[1] "judiciary"   "wdi_gdpc"    "undp_hdi"    "wbgi_cce"    "wbgi_pse"   
[6] "former_col"  "lp_lat_abst"

T-test

We are interested in whether there is a difference in income between countries that have an independent judiciary and countries that do not have an independent judiciary. The t-test is the appropriate test statistic for an interval level dependent variable and a binary independent variable. Our interval level dependent variable is wdi_gdpc which is GDP per capita taken from the World Development Indicators of the World Bank. Our binary independent variable is judiciary. We take a look at our independent variable first.

# frequency table of binary independent variable
table(world.data$judiciary)
 -5   1 
105  64 

As we can see the categories are numbered 1 and -5. We are used to 0 and 1. The codebook tells us that 1 means that the country has an independent judiciary. We will now turn judiciary into a factor variable like we did last time using the factor() function. By default, the factor() function assigns the labels in ascending order to each factor level. But we can also specify the lables in any order we like as long as we also provide the matching levels as shown below:

# creating a factor variable
world.data$judiciary <- factor(world.data$judiciary, 
                               labels = c("independent", "controlled"), 
                               levels = c(1, -5))

# checking the result
head(world.data)
   judiciary   wdi_gdpc undp_hdi   wbgi_cce   wbgi_pse former_col
1 controlled   628.4074       NA -1.5453584 -1.9343837          0
2 controlled  4954.1982    0.781 -0.8538115 -0.6026081          0
3 controlled  6349.7207    0.704 -0.7301510 -1.7336243          1
4       <NA>         NA       NA  1.3267342  1.1980436          0
5 controlled  2856.7517    0.381 -1.2065741 -1.4150945          1
6       <NA> 13981.9795    0.800  0.8624368  0.7084046          1
  lp_lat_abst
1   0.3666667
2   0.4555556
3   0.3111111
4   0.4700000
5   0.1366667
6   0.1892222
# a frequency table of judiciary
table(world.data$judiciary)
independent  controlled 
         64         105 

Now, we check the summary statistics of our dependent variable GDP per captia.

summary(world.data$wdi_gdpc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  226.2  1768.0  5326.0 10180.0 12980.0 63690.0      16 

Dplyr filter()

To get an idea of whether countries with an independent judiciary are richer, we create sub samples of the data based on whether the country has independent judiciary or not and then check the mean GDP per capita. We can use the filter() function from dplyr to create the two sub samples. The first argument of filter() is the name of the dataset. The second argument is the expression or condition that specifies which rows to look at.

filter(dataset, expression)
Argument Description
dataset The first argument is the name of your dataset.
condition The second argument is the condition used for creating the subset, for example:
gdp > 9999

Now let's create free.legal and controlled.legal sub samples of the data using the filter() function:

# creating subsets of our data based on the status of the judiciary
free.legal <- filter(world.data, judiciary == "independent")
controlled.legal <- filter(world.data, judiciary == "controlled")

You may have noticed that wdi_gdpc contains 16 missing values called NA's by the summary() function. In order to calculate the mean we have to tell the mean() function what to do with them. The argument na.rm = TRUE removes all missings. Therefore, whenever you take the mean of a variable containing missing values you type: mean(variable.name, na.rm = TRUE).

# mean income levels, we remove missings
mean(free.legal$wdi_gdpc, na.rm = TRUE)
[1] 17826.59
mean(controlled.legal$wdi_gdpc, na.rm = TRUE)
[1] 5884.882

Finally, we use the t.test() function. The syntax of a t.test for a binary dependent variable and an interval independent variable is:

t.test(formula, mu, alt, conf)

Lets have a look at the arguments.

Arguments Description
formula The formula describes the relationship between the dependent and independent variables, for example:
dependent.variable ~ independent.variable
mu Here, we set the null hypothesis. Suppose, there was no difference in income between countries that have an independent judiciary and those which do not. We would then expect that if we take the mean of income in both groups they should not differ too much from one another and the difference in means should not be significantly different from 0. So, when someone claims countries with an independent judiciary are richer, we would test this claim against the null hypothesis that there is no difference. Thus, we set mu = 0.
alt There are two alternatives to the null hypothesis that the difference in means is zero. The difference could either be smaller or it could be larger than zero. To test against both alternatives, we set alt = "two.sided".
conf Here, we set the level of confidence that we want in rejecting the null hypothesis. Common confidence intervals are: 90%, 95% and 99%.

Lets run the test.

# t.test
# Interval DV (GDP per captia)
# Binary IV (independent judiciary)
t.test(world.data$wdi_gdpc ~ world.data$judiciary, mu=0, alt="two.sided", conf=0.95)
    Welch Two Sample t-test

data:  world.data$wdi_gdpc by world.data$judiciary
t = 6.0094, df = 98.261, p-value = 3.165e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  7998.36 15885.06
sample estimates:
mean in group independent  mean in group controlled 
                17826.591                  5884.882 

Let's interpret the results you get from t.test(). The first line is data: dependent variable by independent variable. This tells you that you are trying to find out whether there is a difference in means of your dependent variable by the groups of an independent variable. In our example: Do countries with independent judiciaries have different mean income levels than countries without independent judiciaries? In the following line you see the t-value, the degrees of freedom and the p-value. Knowing the t-value and the degrees of freedom you can check in a table on t distributions how likely you were to observe this data, if the null-hypothesis was true. The p-value gives you this probability directly. A p-value of 0.02 means that the probability to see this data given that there is no difference in incomes between countries with and without independent judiciaries, is 2%. In the next line you see the 95% confidence interval because we specified conf=0.95. If you were to take 100 samples and in each you checked the means of the two groups, 95 times the difference in means would be within the interval you see there. At the very bottom you see the means of the dependent variable by the two groups of the independent variable. These are the means that we estimated above. In our example, you see the mean income levels in countries were the executive has some control over the judiciary, and in countries were the judiciary is independent.

Pearson's r

When we want to test for a correlation between two variables that are at least interval scaled, we use Pearson's r. We want to test the relationship between control of corruption wbgi_cce and the human development index undp_hdi. Specifically, the human development index is our dependent variable and control of corruption is our independent variable. We will quickly rename the variables like we have learned above.

# renaming variables
world.data <- rename(world.data, hdi = undp_hdi)
world.data <- rename(world.data, corruption.control = wbgi_cce)

A good way to get an idea about a possible relationship between two interval scaled variables is to do a scatter plot like we have done last week.

# scatterplot
plot(x = world.data$corruption.control,
     y = world.data$hdi,
     xlim = c(xmin = -2, xmax = 3),
     ylim = c(ymin = 0, ymax = 1),
     frame = FALSE,
     xlab = "World Bank Control of Corruption Index",
     ylab = "UNDP Human Development Index",
     main = "Relationship b/w Quality of Institutions and Quality of Life")

You can probably spot that there is likely a positive relationship between control of corruption and the human development index. To see the slope, or in other words, the correlation exactly and whether that correlation is significantly different from zero we estimate Pearson's r with the cor.test() function. This will give us the mean correlation as well as an a confidence interval around the mean. The cor.test() function expects the following arguments:

cor.test(x, y, use, conf.level)

These arguments have the following meanings:

Argument Description
x The x variable that you want to correlate.
y The y variable that you want to correlate.
use How R should handle missing values. use="complete.obs" will use only those rows where neither x nor y is missing.
conf.level Here you can specify the confidence interval around the mean correlation. Standard choices are conf.level=0.9 for the 90% confidence level, conf.level=0.95 for the 95% confidence level and conf.level=0.99 for the 99% confidence level.
#  Pearson's r including test statistic
cor.test(world.data$corruption.control, 
         world.data$hdi, use="complete.obs", 
         conf.level = 0.99)
    Pearson's product-moment correlation

data:  world.data$corruption.control and world.data$hdi
t = 12.269, df = 173, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
99 percent confidence interval:
 0.5626121 0.7736905
sample estimates:
      cor 
0.6821114 

Exercises

  1. Rename the variable wbgi_pse into pol.stability.
  2. Check whether political stability is different in countries that were former colonies (former.col = 1).
  3. Choose an alpha level of .01.
  4. We claim the difference in means is 0.3. Check this hypothesis.
  5. Rename the variable lp_lat_abst into latitude.
  6. Check whether latitude and political stability are correlated.