Introduction to Quantitative Methods

Common Problems and Solutions

1. I'm having problems loading a dataset that I downloaded from the web (or Moodle). How do I fix it?
  1. Set the working directory to the folder where you saved your dataset, using these instructions.
  2. Load the dataset using just the filename. For example, to load the California test score dataset named caschool.dta from your working directory, do the following:
library(foreign)
california_dataset <- read.dta("caschool.dta")
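
The two steps above can be sketched end to end as follows. This is only an illustration: the folder, the file name example.dta and its two rows of values are made up so that the code runs anywhere, and in practice you would point setwd() at your own download folder. The file.exists() check is a quick way to confirm that R can see the file before you try to load it.

```r
library(foreign)  # provides read.dta() and write.dta()

# Hypothetical setup so the sketch runs anywhere: create a small Stata file
# in a temporary folder that plays the role of your download folder.
download_folder <- tempdir()
write.dta(data.frame(testscr = c(650.5, 661.2), avginc = c(13.7, 22.6)),
          file.path(download_folder, "example.dta"))

# Step 1: point R at the folder that contains the file.
old_wd <- setwd(download_folder)
getwd()                     # shows the folder R is currently looking in
file.exists("example.dta")  # TRUE means R can see the file

# Step 2: load the dataset using just the filename.
example_dataset <- read.dta("example.dta")

# Return to the working directory we started from.
setwd(old_wd)
```

If file.exists() returns FALSE, the working directory or the filename (including its extension) is usually the culprit.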

2. I think I have the wrong version of Zelig. How do I install the correct version?

Follow the instructions here and read more about it here.

3. I'm confused about removing missing values. When do I need to remove NAs?

Ideally you want to keep as much data as possible, but sometimes removing NAs becomes necessary. If you must drop some observations because of missing values, then only do so when the variable(s) you're interested in are missing. Follow the example in Seminar 5 where we drop observations when latitude, globalization and inst_quality are missing.
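
As a minimal sketch of the idea (not necessarily the exact code from Seminar 5), the toy data frame below uses made-up values; only the three variable names come from the seminar example. It drops a row only when one of the variables of interest is missing and stores the result under a new name, so the original data stay untouched.

```r
# Hypothetical toy data; other_var is not used in the analysis.
world_data <- data.frame(
  country       = c("A", "B", "C", "D", "E"),
  latitude      = c(10.5, NA, 45.0, 52.3, 33.1),
  globalization = c(60.2, 71.4, NA, 80.9, 55.0),
  inst_quality  = c(0.4, 0.7, 0.9, 0.8, NA),
  other_var     = c(NA, 2, 3, NA, 5)
)

# Keep a row only if all three variables of interest are observed.
vars_of_interest <- c("latitude", "globalization", "inst_quality")
keep <- complete.cases(world_data[, vars_of_interest])
world_data_clean <- world_data[keep, ]

nrow(world_data)           # 5 rows in the original
nrow(world_data_clean)     # 2 rows survive (countries A and D)
nrow(na.omit(world_data))  # 0 rows: dropping on every column loses everything
```

The last line shows why dropping NAs across all columns can leave you with an empty dataset even when the variables you care about are mostly complete.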

4. I tried to remove NAs and now I've lost too many observations or I've an empty dataset. What do I do now?

Just reload the dataset and start over. Once you've thrown observations away, there's no other way to get them back.

5. Why do we sometimes create factors for binary variables?

For binary variables, factors make it easy to interpret your results in a regression table. There are plenty of examples in the seminars where we did this. Try it out on your own. Run a regression with a binary variable as an independent variable, such as male/female or public-school/private-school, and look at the results. Then do the same after creating a factor.

NOTE: Creating a factor is optional ONLY for binary variables (i.e. variables with only two levels, coded 0 or 1). For categorical variables with more than two levels (such as ethnicity, religion, etc.), you MUST create a factor variable.
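
Here is a small sketch of that comparison. The data frame, its values and the variable names (schools, score, private, school_type) are all hypothetical and chosen only for illustration.

```r
# Hypothetical toy data: private is coded 0 (public school) or 1 (private school).
schools <- data.frame(
  score   = c(640, 655, 648, 670, 682, 675),
  private = c(0, 0, 0, 1, 1, 1)
)

# Numeric 0/1 coding: the coefficient is simply labelled "private".
model_numeric <- lm(score ~ private, data = schools)

# Factor coding: the coefficient label names the category that is being
# compared with the baseline (the first level, "public").
schools$school_type <- factor(schools$private,
                              levels = c(0, 1),
                              labels = c("public", "private"))
model_factor <- lm(score ~ school_type, data = schools)

coef(model_numeric)
coef(model_factor)
```

Both models give the same estimates; the only difference is that the factor version labels the coefficient with the category name, which is easier to read in a regression table.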

6. How do I find out the range of values for a variable?

There are a number of ways to do this, but one simple way is to use the summary() function, which gives you the minimum and maximum values of a variable. The summary() function also gives you other useful statistics such as the mean, the median, the 1st and 3rd quartiles (the 25th and 75th percentiles), and the number of missing values (NAs) if there are any.

Example: To get the min/max of avginc, you could do the following:

summary(california_dataset$avginc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.335  10.640  13.730  15.320  17.630  55.330 
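
If you only need the minimum and maximum, base R also has range(), min() and max(). The sketch below uses a short made-up vector called income rather than the real avginc variable. Note that these functions return NA when the variable contains missing values unless you set na.rm = TRUE.

```r
# Hypothetical income values (in thousands of dollars) with one missing value.
income <- c(5.3, 10.6, 13.7, NA, 17.6, 55.3)

summary(income)              # includes a count of NAs when any are present

range(income)                # NA NA, because of the missing value
range(income, na.rm = TRUE)  # minimum and maximum, ignoring NAs
min(income, na.rm = TRUE)
max(income, na.rm = TRUE)

sum(is.na(income))           # how many values are missing
```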

7. What does the seq() function do? How do I know what values to put inside seq() ?

The seq() function simply generates a sequence. We often use it to set the range of X values for Zelig's simulation. For example, in the California test score dataset, income is measured in thousands of dollars. To predict the effect of income from $20,000 to $50,000, you'd use seq(20, 50, 1) where the last argument, 1, is just the size of each increment.

If you're unsure, take a look at what seq() actually does before using it in other functions like setx().

seq(20, 50, 1)
 [1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
[24] 43 44 45 46 47 48 49 50

If you were dealing with percentages between 0 and 1 then you'd use something like this:

seq(0, 1, 0.1)
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Notice how we changed the increment from 1 to 0.1.

Try different arguments for seq() to see what it does:

seq(0, 100)
seq(0, 100, 20)
seq(0, 100, by = 20)
seq(0, 100, length.out = 5)
seq(0, 100, length.out = 6)
seq(0, 100, length.out = 8)
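
To make the difference between the arguments concrete, here is a short sketch. The object names are just for illustration. The by argument fixes the size of each step, while length.out fixes how many values you get back and lets R work out the step.

```r
# Third positional argument and by = are the same thing: steps of 20.
by_20_positional <- seq(0, 100, 20)
by_20_named      <- seq(0, 100, by = 20)

# length.out asks for a given number of evenly spaced values instead.
five_points  <- seq(0, 100, length.out = 5)  # 0 25 50 75 100
six_points   <- seq(0, 100, length.out = 6)  # same values as steps of 20
eight_points <- seq(0, 100, length.out = 8)  # step is 100 / 7, not a whole number

# With no step given, seq() counts in steps of 1.
length(seq(0, 100))  # 101 values: 0, 1, 2, ..., 100

# diff() shows the step size between neighbouring values.
diff(eight_points)
```

When you want round values on the X axis, by is usually the easier choice; length.out is handy when you care about the number of points rather than their spacing.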

8. Why do we use Zelig and what is the difference between expected and predicted values?

Take a look at the explanation here.

9. How do we interpret output from Zelig's simulations?

Take a look at Interpreting Zelig Simulation.

10. Can I just use SCC error correction since it corrects for both autocorrelation and cross-sectional dependence?

The short answer is NO, you cannot just use SCC all the time without doing the tests first. SCC is NOT a more general form of HAC; they tackle slightly different issues.

Just as with the choice between robust and classical SEs, each correction rests on assumptions, and if those assumptions are violated the standard errors will not have the desired coverage (e.g. 95% confidence intervals that are not really 95%). If we have autocorrelation and panel heteroskedasticity but no cross-sectional dependence, then SCC standard errors will have incorrect coverage. On the other hand, if we have autocorrelation and cross-sectional dependence (correlated heteroskedasticity across panels), then HAC would have incorrect coverage and we should use SCC.
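
The "test first, then correct" workflow might look like the sketch below. It rests on several assumptions that are not stated above: that the plm and lmtest packages are installed, that the tests meant are the Breusch-Godfrey test for serial correlation (pbgtest) and the Pesaran CD test for cross-sectional dependence (pcdtest), and that the HAC correction is plm's Arellano estimator. The Grunfeld data that ship with plm stand in for your own panel dataset.

```r
library(plm)     # panel models, tests and robust covariance estimators
library(lmtest)  # coeftest() re-computes the coefficient table with new SEs

# Grunfeld is an example panel dataset included in plm.
data("Grunfeld", package = "plm")
fe_model <- plm(inv ~ value + capital, data = Grunfeld,
                index = c("firm", "year"), model = "within")

# Step 1: test. Small p-values suggest the problem is present.
serial_test <- pbgtest(fe_model)               # serial correlation
csd_test    <- pcdtest(fe_model, test = "cd")  # cross-sectional dependence

# Step 2: pick the correction that matches the test results.
# Autocorrelation but no cross-sectional dependence: Arellano (HAC) SEs.
hac_table <- coeftest(fe_model, vcov. = vcovHC(fe_model, method = "arellano"))
# Autocorrelation and cross-sectional dependence: Driscoll-Kraay (SCC) SEs.
scc_table <- coeftest(fe_model, vcov. = vcovSCC(fe_model))

serial_test
csd_test
hac_table
scc_table
```

The coefficient estimates are identical in both tables; only the standard errors, and therefore the test statistics and p-values, change.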

11. How do I add output from R to a Word document?

Don't just take a screenshot. In seminar 4 we created a Word-compatible file using the htmlreg() function from the texreg package that you can copy/paste into your answers or essay. If you're having problems using this method, try one of the following:

  1. Save your table as a .html file instead and open it with your browser (Firefox, Chrome, Safari, etc.). Then copy it to a Word document.

    htmlreg(list(model1, model2), file="modelcomparison.html")
    
  2. Create a table by hand or copy/paste the output from the console in RStudio. If you're using the copy/paste method, use a fixed-width or monospaced font like "Courier" or "Courier New" to ensure that columns align properly.

12. How do I add a plot from R to a Word document?

Again, don't just take a screenshot. At the end of seminar 9 there are instructions for saving a plot to a file.
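
One common base R approach, which may or may not be the method shown in seminar 9, is to open a graphics device such as png(), draw the plot, and then close the device with dev.off(). The income values and the file name below are made up for illustration.

```r
# Hypothetical income values (in thousands of dollars).
income <- c(5.3, 10.6, 13.7, 15.3, 17.6, 22.4, 31.0, 55.3)

# The file is written to the current working directory.
plot_file <- "avginc_histogram.png"

png(plot_file, width = 800, height = 600)  # open the file; size is in pixels
hist(income,
     main = "Distribution of average income",
     xlab = "Average income (thousands of dollars)")
dev.off()                                  # close the file so it is written to disk

file.exists(plot_file)  # TRUE once the plot has been saved
```

You can then add the saved image to your Word document with Insert > Pictures. Forgetting dev.off() is a common reason for ending up with an empty or unreadable image file.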