Introduction to Quantitative Methods

Common Problems and Solutions

1. I'm having problems loading a dataset that I downloaded from the web (or moodle). How do I fix it?
  1. Set the working directory to the folder where you saved your dataset, using these instructions.
  2. Load the dataset using just the filename. For example, to load the California test score dataset named caschool.dta from your working directory, do the following:
library(foreign)
california_dataset <- read.dta("caschool.dta")
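If R still cannot find the file, a quick check along these lines may help. This is only a sketch, and it assumes the file is called caschool.dta as in the example above.

```r
# Show the folder R is currently using as the working directory.
getwd()

# List the Stata (.dta) files R can see in that folder.
list.files(pattern = "\\.dta$")

# TRUE means read.dta("caschool.dta") should be able to find the file.
file.exists("caschool.dta")
```

If the last line returns FALSE, the working directory is probably not the folder where the file was saved.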
2. I think I have the wrong version of Zelig. How do I install the correct version?

Follow the instructions here and read more about it here.

3. I'm having problems with the select() function. I keep getting an "unused argument" error. How do I fix it?

This happens because you loaded Zelig, or some other package that defines its own select() function, after loading dplyr. The package loaded later masks dplyr's select(). Here's how you can fix it:

  1. Close RStudio
  2. Reopen RStudio
  3. Load packages at the top of your script using the library() function
  4. Make sure dplyr is loaded LAST
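The load order can be sketched as follows. MASS is used here only as a stand-in for a package that defines its own select(), and the data frame df is made up for illustration. Writing dplyr::select() explicitly is an alternative that works whatever the load order.

```r
# Load other packages first. MASS is a stand-in for any package
# that brings its own select() function.
library(MASS)

# Load dplyr LAST so that its select() is the one R finds.
library(dplyr)

# Made-up data frame, for illustration only.
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Works because dplyr was loaded last.
kept <- select(df, a, b)

# Naming the package explicitly works regardless of load order.
kept_explicit <- dplyr::select(df, a, b)

names(kept)
```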
4. I'm still confused about removing missing values. When do I need to remove NAs?

Ideally you want to keep as much data as possible but sometimes removing NAs becomes necessary. If you're using Zelig's setx() or sim() functions then you MUST remove NAs. Even if you're not using Zelig, you might find it easier to drop missing values (for example when comparing models with different explanatory variables) but be careful when doing so. Don't throw away observations where NAs occur in variables you don't need for your analysis.

The simplest approach for this is to first keep only the variables you care about, and then drop the observations with missing values. Read the next question on how to do that.

5. How do I keep only the variables I care about and drop the rest?

We've shown you two different ways to do this. Use whichever one you prefer.

Example: To keep only the read_scr, math_scr, and avginc variables from the California test score dataset, you could do one of the following:

Method 1. Use the select() function from dplyr.

california_dataset <- select(california_dataset, read_scr, math_scr, avginc)

Method 2. Use the square brackets [ ] to pick the columns you want to keep.

california_dataset <- california_dataset[, c("read_scr", "math_scr", "avginc")]

There are other ways to do the same, but the two listed above are the easiest.

Once you've narrowed down the dataset to the variables you need, you can use na.omit() to drop the observations that still have missing values.
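A minimal sketch of why the order matters, using a small made-up data frame (the values and the column called unused are invented for illustration and are not from the California dataset):

```r
# Made-up data: every row has an NA in at least one column.
df <- data.frame(
  read_scr = c(690, 661, NA, 645),
  math_scr = c(690, 662, 651, NA),
  avginc   = c(22.7, 9.8, 9.0, 9.0),
  unused   = c(NA, NA, 1, 2)
)

# Dropping NAs straight away removes every row,
# because each row has an NA in some column.
nrow(na.omit(df))     # 0

# Keep only the variables needed for the analysis first ...
needed <- df[, c("read_scr", "math_scr", "avginc")]

# ... and then drop rows with missing values.
clean <- na.omit(needed)
nrow(clean)           # 2
```

Two observations survive here only because the NAs in the unused column no longer count against them.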

6. I tried to remove NAs and now I've lost too many observations or I have an empty dataset. What do I do now?

Just reload the dataset and start over. There's no other way to get the observations back once you've thrown them away with na.omit().

7. Why do we sometimes create factors for binary variables?

For binary variables, factors make it easy to interpret your results in a regression table. There are plenty of examples in the seminars where we did this. Try it out on your own. Run a regression with a binary variable as an independent variable, such as male/female or public-school/private-school, and look at the results. Then do the same after creating a factor.

NOTE: This ONLY applies to binary variables (i.e. variables with only two levels, 0 or 1). For categorical variables with more than two levels (such as ethnicity, religion, etc.), you MUST create a factor variable.
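Here is a small sketch of the comparison, using simulated data. The variable names private, school and score are hypothetical and do not come from any of the seminar datasets.

```r
set.seed(1)

# Simulated data: 10 public (0) and 10 private (1) schools.
df <- data.frame(private = rep(c(0, 1), each = 10))
df$score <- 600 + 15 * df$private + rnorm(20)

# Regression with the raw 0/1 variable.
m_numeric <- lm(score ~ private, data = df)

# Same variable as a factor with readable labels.
df$school <- factor(df$private, levels = c(0, 1),
                    labels = c("public", "private"))
m_factor <- lm(score ~ school, data = df)

# The estimates are identical; only the coefficient label changes.
names(coef(m_numeric))   # "(Intercept)" "private"
names(coef(m_factor))    # "(Intercept)" "schoolprivate"
```

The label schoolprivate tells you directly that the coefficient compares private schools with the baseline category, public.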

8. How do I find out the range of values for a variable?

There are a number of ways to do this, but one simple way is to use the summary() function, which gives you the minimum and maximum values of a variable. The summary() function also gives you other useful statistics such as the mean, the median, the first and third quartiles (25th and 75th percentiles), and the number of NAs.

Example: To get the min/max of avginc, you could do the following:

summary(california_dataset$avginc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.335  10.640  13.730  15.320  17.630  55.330 
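If you only need the minimum and maximum, base R's range(), min() and max() are one alternative. The sketch below uses a short made-up vector rather than the real avginc column. Note the na.rm = TRUE argument, without which these functions return NA when the variable has missing values.

```r
# Made-up values for illustration, including one missing value.
income <- c(5.335, 13.73, NA, 55.33)

# Without na.rm = TRUE the result is NA.
max(income)

# With na.rm = TRUE the missing value is ignored.
income_range <- range(income, na.rm = TRUE)   # c(minimum, maximum)
income_range
```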
9. What does the seq() function do? How do I know what values to put inside seq() ?

The seq() function simply generates a sequence. We often use it to set the range of X values for Zelig's simulation. For example, in the California test score dataset, income is measured in thousands of dollars. To predict the effect of income from $20,000 to $50,000, you'd use seq(20, 50, 1), where the last argument, 1, is just the size of each increment.

If you're unsure, take a look at what seq() actually does before using it in other functions like setx().

seq(20, 50, 1)
 [1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
[24] 43 44 45 46 47 48 49 50

If you were dealing with percentages between 0 and 1 then you'd use something like this:

seq(0, 1, 0.1)
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Notice how we changed the increment from 1 to 0.1.
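If you don't have a specific range in mind, one option is to build the sequence from the variable's own minimum and maximum. This is a sketch with made-up income values (not the real avginc column); rounding with floor() and ceiling() is just one way to get tidy end points.

```r
# Made-up income values in thousands of dollars.
income <- c(5.335, 13.73, 15.32, 55.33)

# Round the minimum down and the maximum up, then step by 1.
x_values <- seq(from = floor(min(income)),
                to   = ceiling(max(income)),
                by   = 1)

head(x_values)     # first few values, starting at 5
length(x_values)   # 52 values, from 5 to 56
```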

10. How do we interpret output from Zelig's simulations?

Take a look at Interpreting Zelig Simulation.

11. Can I just use SCC error correction since it corrects for both autocorrelation and cross-sectional dependence?

The short answer is NO, you cannot just use SCC all the time without doing the tests first. SCC is NOT a more general form of HAC; they tackle slightly different issues.

Just as with the choice between robust and classical standard errors, each correction rests on its own assumptions. If those assumptions are violated, the standard errors will not have the desired coverage (for example, 95% confidence intervals will be incorrect). If we have autocorrelation and panel heteroskedasticity but no cross-sectional dependence, then SCC standard errors will have incorrect coverage. If, on the other hand, we have autocorrelation and cross-sectional dependence (correlated heteroskedasticity across panels), then HAC standard errors will have incorrect coverage and we should use SCC.
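The "test first, then choose" workflow might look roughly like the sketch below. It assumes the plm and lmtest packages and uses plm's built-in Grunfeld data purely as a stand-in. The model, and the use of vcovHC() with method = "arellano" as the HAC correction, are assumptions for illustration and may differ from what was done in the seminar.

```r
library(plm)
library(lmtest)

# Example panel data shipped with plm (firm-year observations).
data("Grunfeld", package = "plm")

# A fixed effects model, used here only as a stand-in.
fe_model <- plm(inv ~ value + capital, data = Grunfeld, model = "within")

# Step 1: run the tests before choosing a correction.
pbgtest(fe_model)                 # serial correlation
pcdtest(fe_model, test = "cd")    # cross-sectional dependence

# Step 2a: autocorrelation but no cross-sectional dependence.
se_hac <- coeftest(fe_model,
                   vcov. = vcovHC(fe_model, method = "arellano"))

# Step 2b: autocorrelation and cross-sectional dependence.
se_scc <- coeftest(fe_model, vcov. = vcovSCC(fe_model))

se_hac
se_scc
```

The coefficient estimates are the same in both tables; only the standard errors, and therefore the t-values and p-values, differ.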

12. How do I add output from R to a Word document?

Don't just take a screenshot. In seminar 4 we created a Word-compatible file using the htmlreg() function from the texreg package that you can copy/paste into your answers or essay. If you're having problems using this method, try one of the following:

  1. Save your table as a .html file instead and open it with your browser (Firefox, Chrome, Safari, etc.). Then copy it to a Word document.

    htmlreg(list(model1, model2), file="modelcomparison.html")
    
  2. Create a table by hand or copy/paste the output from the console in RStudio. If you're using the copy/paste method, use a fixed-width (monospaced) font like "Courier" or "Courier New" to ensure that columns align properly.

13. How do I add a plot from R to a Word document?

Again, don't just take a screenshot. At the end of seminar 9 there are instructions for saving a plot to a file.
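One common way to do this in base R is to open a graphics device such as png(), draw the plot, and close the device with dev.off(). This is a sketch that may differ from the seminar 9 instructions. The file name and the plotted values are made up for illustration.

```r
# Made-up values for illustration only.
income <- c(20, 30, 40, 50)
score  <- c(640, 655, 665, 670)

# Open a PNG file in the working directory (size in pixels).
png("score_by_income.png", width = 800, height = 600)

# Draw the plot; it goes to the file, not the Plots pane.
plot(income, score,
     xlab = "Average income (thousands of dollars)",
     ylab = "Test score",
     main = "Test score by income")

# Close the device so the file is written to disk.
dev.off()
```

The saved file can then be added to a Word document with Insert > Pictures.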