IV Application (mroz)

Mother and father as IVs for education

Author

Kevin Hu

Case Description

Research Interests:

  • Using the data set wooldridge::mroz, researchers are interested in the return to education (educ), measured in log(wage), for married women.

  • Use motheduc and/or fatheduc as instruments for educ.

Reproducible Sources

  1. Wooldridge, J.M. Introductory econometrics: a modern approach[M]. Seventh edition. Australia: Cengage, 2020.

  2. Hill C, Griffiths W E, Lim G C. Principles of econometrics[M]. Fifth edition. NJ: John Wiley & Sons, 2018.

  3. Colonescu C. Principles of Econometrics with R (2016) https://bookdown.org/ccolonescu/RPoE4/

Learning Targets

  1. Understand the nature of endogeneity.

  2. Know the steps of the TSLS method.

  3. Be familiar with the R package functions systemfit::systemfit() and AER::ivreg().

  4. Test instrument validity: instrument relevance (weak instruments) with the restricted F-test, and instrument exogeneity with the J-test.

  5. Test regressor endogeneity with the Hausman test.

Exercise Materials

You can find all the exercise materials in this project under the file directory:

D:/github/course-emiii-accompany/IV-wage-mroz
├── code-mroz.R
├── mroz-var-label.txt
├── workingpaper.html
├── workingpaper.qmd
└── workingpaper_files
    ├── figure-docx
    └── figure-html

OLS Estimation

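Consider the following misspecified wage model, in which educ is suspected to be endogenous:

$$\text{lwage}_i = \beta_1 + \beta_2 \text{educ}_i + \beta_3 \text{exper}_i + \beta_4 \text{expersq}_i + e_i$$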
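
A minimal R sketch of this OLS benchmark, assuming the wooldridge package is installed. The object names mroz_w and lm_ols are illustrative, and restricting the sample to women with an observed log wage is an assumption made here so that all later models can use the same observations.

```r
# Load the data (assumes the wooldridge package is installed).
data("mroz", package = "wooldridge")

# Keep only women with an observed log wage (lwage is missing for non-workers).
mroz_w <- subset(mroz, !is.na(lwage))

# OLS benchmark: educ is treated as if it were exogenous.
lm_ols <- lm(lwage ~ educ + exper + expersq, data = mroz_w)
summary(lm_ols)
```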

Two-stage least squares (TSLS): the solutions

We can carry out the TSLS procedure in two ways:

  • the “Step-by-Step solution”, without variance correction.

  • the “Integrated solution”, with variance correction.

  1. The Step-by-Step solution shows the basic procedure of two-stage least squares (TSLS). But DO NOT use the Step-by-Step solution in your paper! It is for teaching purposes only, because the standard errors reported by the second-stage OLS regression are not correct.

  2. We need an Integrated solution for the following reasons:

  • We should obtain the correct standard errors for tests and inference.

  • We should avoid the tedious steps of the Step-by-Step routine. When the model contains more than one endogenous regressor and many instruments are available, the step-by-step solution becomes extremely tedious.

In the R ecosystem, two packages implement the integrated solution:

  • The systemfit package, with the function systemfit::systemfit().

  • The AER package, with the function AER::ivreg().

Both tools carry out the integrated solution and adjust the variance of the estimators automatically.

TSLS (Step-by-step solution): motheduc as IV

For the Step-by-step solution, let’s use mother’s education (motheduc) as the instrumental variable for education (educ).

We obtain the fitted variable $\widehat{\text{educ}}$ from the following stage 1 OLS regression:

$$\widehat{\text{educ}} = \hat{\gamma}_1 + \hat{\gamma}_2 \text{exper} + \hat{\gamma}_3 \text{expersq} + \hat{\gamma}_4 \text{motheduc}$$

In the second stage, we regress log(wage) on $\widehat{\text{educ}}$ from stage 1, experience (exper) and its quadratic term (expersq):

$$\text{lwage} = \hat{\beta}_1 + \hat{\beta}_2 \widehat{\text{educ}} + \hat{\beta}_3 \text{exper} + \hat{\beta}_4 \text{expersq} + \hat{\epsilon}$$
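
The two stages can be sketched with two calls to lm(). This is an illustration only, assuming the wooldridge package is installed; the names mroz_w, stage1, stage2 and educ_hat are made up for this example.

```r
# Data: working women only (lwage observed); assumes the wooldridge package.
data("mroz", package = "wooldridge")
mroz_w <- subset(mroz, !is.na(lwage))

# Stage 1: regress educ on the exogenous regressors and the instrument.
stage1 <- lm(educ ~ exper + expersq + motheduc, data = mroz_w)
mroz_w$educ_hat <- fitted(stage1)

# Stage 2: replace educ by its fitted value from stage 1.
# The coefficients are the 2SLS estimates, but the standard errors
# printed here are NOT corrected and should not be used for inference.
stage2 <- lm(lwage ~ educ_hat + exper + expersq, data = mroz_w)
coef(summary(stage2))
```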

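TSLS (Integrated solution): motheduc as IV

Let’s now use motheduc as the only instrument for educ with the Integrated solution.

$$
\begin{cases}
\widehat{\text{educ}} = \hat{\gamma}_1 + \hat{\gamma}_2 \text{exper} + \hat{\gamma}_3 \text{expersq} + \hat{\gamma}_4 \text{motheduc} & \text{(stage 1)} \\
\text{lwage} = \hat{\beta}_1 + \hat{\beta}_2 \widehat{\text{educ}} + \hat{\beta}_3 \text{exper} + \hat{\beta}_4 \text{expersq} + \hat{\epsilon} & \text{(stage 2)}
\end{cases}
$$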
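
A minimal sketch of the integrated solution with AER::ivreg(), assuming the wooldridge and AER packages are installed. In the formula, everything after the `|` lists the exogenous regressors plus the instrument; the object name lm_iv_m is illustrative.

```r
# Data: working women only (lwage observed); assumes the wooldridge package.
data("mroz", package = "wooldridge")
mroz_w <- subset(mroz, !is.na(lwage))

# Structural equation | exogenous regressors and instrument(s).
lm_iv_m <- AER::ivreg(
  lwage ~ educ + exper + expersq | exper + expersq + motheduc,
  data = mroz_w
)

# Standard errors reported here account for the two-stage procedure.
summary(lm_iv_m)
```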

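TSLS (Integrated solution): fatheduc as IV

Now let’s use fatheduc as the only instrument for educ.

$$
\begin{cases}
\widehat{\text{educ}} = \hat{\gamma}_1 + \hat{\gamma}_2 \text{exper} + \hat{\gamma}_3 \text{expersq} + \hat{\gamma}_4 \text{fatheduc} & \text{(stage 1)} \\
\text{lwage} = \hat{\beta}_1 + \hat{\beta}_2 \widehat{\text{educ}} + \hat{\beta}_3 \text{exper} + \hat{\beta}_4 \text{expersq} + \hat{\epsilon} & \text{(stage 2)}
\end{cases}
$$

We repeat the whole procedure with the R function systemfit::systemfit() or AER::ivreg(), as before.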

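TSLS (Integrated solution): motheduc and fatheduc as IVs

We can also use both motheduc and fatheduc as instruments for educ.

$$
\begin{cases}
\widehat{\text{educ}} = \hat{\gamma}_1 + \hat{\gamma}_2 \text{exper} + \hat{\gamma}_3 \text{expersq} + \hat{\gamma}_4 \text{motheduc} + \hat{\gamma}_5 \text{fatheduc} & \text{(stage 1)} \\
\text{lwage} = \hat{\beta}_1 + \hat{\beta}_2 \widehat{\text{educ}} + \hat{\beta}_3 \text{exper} + \hat{\beta}_4 \text{expersq} + \hat{\epsilon} & \text{(stage 2)}
\end{cases}
$$

We repeat the whole procedure with the R function systemfit::systemfit() or AER::ivreg(), as before.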
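
As a sketch of how the two tools line up, the model with both instruments can be fitted with each of them and the coefficients compared. This assumes the wooldridge, AER and systemfit packages are installed; the object names are illustrative, and passing a single formula to systemfit() (rather than a list of equations) is the single-equation usage assumed here.

```r
# Data: working women only (lwage observed); assumes the wooldridge package.
data("mroz", package = "wooldridge")
mroz_w <- subset(mroz, !is.na(lwage))

# AER: instruments and exogenous regressors go after the "|".
lm_iv_mf <- AER::ivreg(
  lwage ~ educ + exper + expersq | exper + expersq + motheduc + fatheduc,
  data = mroz_w
)

# systemfit: instruments and exogenous regressors go in `inst`.
sys_iv_mf <- systemfit::systemfit(
  lwage ~ educ + exper + expersq,
  method = "2SLS",
  inst = ~ exper + expersq + motheduc + fatheduc,
  data = mroz_w
)

# Side-by-side coefficients (row names come from the ivreg fit).
cbind(ivreg = coef(lm_iv_mf), systemfit = coef(sys_iv_mf))
```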

Solutions comparison: a glance

So far we have obtained five estimation results from different model settings or solutions:

  1. The misspecified model, estimated directly by OLS.

  2. (Step-by-Step solution) Explicit 2SLS estimation without variance correction (IV regression step by step, with only motheduc as instrument).

  3. (Integrated solution) Dedicated IV estimation with variance correction (using the R tools systemfit::systemfit() or AER::ivreg()):

  • The IV model with only motheduc as instrument for the endogenous variable educ

  • The IV model with only fatheduc as instrument for the endogenous variable educ

  • The IV model with both motheduc and fatheduc as instruments for the endogenous variable educ
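
One way to put the five results side by side is to collect the education coefficient and its standard error from each fit. The sketch below assumes the wooldridge and AER packages are installed; all object names are illustrative. Note that the by-hand 2SLS row shares its point estimate with the corresponding ivreg() row but not its standard error.

```r
# Data: working women only (lwage observed); assumes the wooldridge package.
data("mroz", package = "wooldridge")
mroz_w <- subset(mroz, !is.na(lwage))

# 1. OLS
lm_ols <- lm(lwage ~ educ + exper + expersq, data = mroz_w)

# 2. Step-by-step 2SLS with motheduc (uncorrected standard errors)
mroz_w$educ_hat <- fitted(lm(educ ~ exper + expersq + motheduc, data = mroz_w))
lm_2sls <- lm(lwage ~ educ_hat + exper + expersq, data = mroz_w)

# 3. Integrated IV estimation with three instrument sets
lm_iv_m <- AER::ivreg(lwage ~ educ + exper + expersq |
                        exper + expersq + motheduc, data = mroz_w)
lm_iv_f <- AER::ivreg(lwage ~ educ + exper + expersq |
                        exper + expersq + fatheduc, data = mroz_w)
lm_iv_mf <- AER::ivreg(lwage ~ educ + exper + expersq |
                         exper + expersq + motheduc + fatheduc, data = mroz_w)

models <- list(
  "OLS" = lm_ols,
  "2SLS by hand (motheduc)" = lm_2sls,
  "IV (motheduc)" = lm_iv_m,
  "IV (fatheduc)" = lm_iv_f,
  "IV (motheduc + fatheduc)" = lm_iv_mf
)

# Row 2 of each coefficient table is the education term (educ or educ_hat).
tab <- t(sapply(models, function(m) coef(summary(m))[2, c("Estimate", "Std. Error")]))
round(tab, 4)
```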

After this empirical comparison, the results raise further questions:

  • Which estimation is the best?

  • How to judge and evaluate different instrument choices?

Testing Instrument validity

Consider the general model

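$$Y_i = \beta_0 + \sum_{j=1}^{k} \beta_j X_{ji} + \sum_{s=1}^{r} \beta_{k+s} W_{si} + \epsilon_i$$

  • $Y_i$ is the dependent variable
  • $\beta_0, \ldots, \beta_{k+r}$ are $1+k+r$ unknown regression coefficients
  • $X_{1i}, \ldots, X_{ki}$ are $k$ endogenous regressors
  • $W_{1i}, \ldots, W_{ri}$ are $r$ exogenous regressors which are uncorrelated with $\epsilon_i$
  • $\epsilon_i$ is the error term
  • $Z_{1i}, \ldots, Z_{mi}$ are $m$ instrumental variables

As we know, valid instruments should satisfy both the relevance condition and the exogeneity condition.

$$
\begin{aligned}
E(Z_i X_i) &\neq 0 && \text{(Relevance)} \\
E(Z_i \epsilon_i) &= 0 && \text{(Exogeneity)}
\end{aligned}
$$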

Weak instrument: Restricted F-test

With a single endogenous regressor, we can use an F-test to check for weak instruments.

The basic idea of the F-test is very simple:

If the coefficients of all instruments in the first stage of a 2SLS estimation are zero, the instruments do not explain any of the variation in X, which clearly violates the relevance assumption.

We may use the following rule of thumb:

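  • Run the first-stage regression of a 2SLS estimation

$$X_i = \gamma_0 + \gamma_1 W_{1i} + \cdots + \gamma_p W_{pi} + \theta_1 Z_{1i} + \cdots + \theta_q Z_{qi} + v_i \quad (3)$$

  • Test the restricted joint hypothesis $H_0: \theta_1 = \cdots = \theta_q = 0$ by computing the F-statistic.

  • If the F-statistic is smaller than the critical value (a common rule-of-thumb threshold is 10), the instruments are weak.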

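The rule of thumb is easily implemented in R. Run the first-stage regression with lm() and then compute the restricted F-statistic with the R function car::linearHypothesis().

For all three IV models, we can test the relevance of the instrument(s) in turn.

$$
\begin{aligned}
\text{educ} &= \gamma_1 + \gamma_2 \text{exper} + \gamma_3 \text{expersq} + \theta_1 \text{motheduc} + v && \text{(relevance test 1)} \\
\text{educ} &= \gamma_1 + \gamma_2 \text{exper} + \gamma_3 \text{expersq} + \theta_2 \text{fatheduc} + v && \text{(relevance test 2)} \\
\text{educ} &= \gamma_1 + \gamma_2 \text{exper} + \gamma_3 \text{expersq} + \theta_1 \text{motheduc} + \theta_2 \text{fatheduc} + v && \text{(relevance test 3)}
\end{aligned}
$$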
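
The three relevance tests can be sketched as follows, assuming the wooldridge and car packages are installed. The object names are illustrative; the F-statistic is read from the second row of the table returned by car::linearHypothesis().

```r
# Data: working women only (lwage observed); assumes the wooldridge package.
data("mroz", package = "wooldridge")
mroz_w <- subset(mroz, !is.na(lwage))

# First-stage regressions for the three instrument sets.
first_m  <- lm(educ ~ exper + expersq + motheduc, data = mroz_w)
first_f  <- lm(educ ~ exper + expersq + fatheduc, data = mroz_w)
first_mf <- lm(educ ~ exper + expersq + motheduc + fatheduc, data = mroz_w)

# Restricted F-tests: all instrument coefficients equal to zero.
F_m  <- car::linearHypothesis(first_m, "motheduc = 0")$F[2]
F_f  <- car::linearHypothesis(first_f, "fatheduc = 0")$F[2]
F_mf <- car::linearHypothesis(first_mf, c("motheduc = 0", "fatheduc = 0"))$F[2]

# Compare each statistic with the rule-of-thumb threshold of 10.
c(motheduc = F_m, fatheduc = F_f, both = F_mf)
```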

Weak instrument: Cragg-Donald test

The former test for weak instruments might be unreliable with more than one endogenous regressor, though, because there is indeed one F-statistic for each endogenous regressor.

An alternative is the Cragg-Donald test based on the following statistic:

$$F = \frac{N - G - B}{L} \cdot \frac{r_B^2}{1 - r_B^2}$$

where:

  • N is the sample size;

  • G is the number of exogenous regressors;

  • B is the number of endogenous regressors;

  • L is the number of external instruments;

  • $r_B$ is the lowest canonical correlation.

Canonical correlation is a measure of the correlation between the endogenous variables and the external instruments, after the exogenous regressors have been partialled out of both. It can be calculated with the function cancor() in R.

The critical values can be found in Table 10E.1 of Hill C, Griffiths W, Lim G. Principles of econometrics[M]. John Wiley & Sons, 2018.
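
A rough sketch of the computation with cancor(), applied to the wage model with educ as the single endogenous regressor and motheduc and fatheduc as external instruments. It assumes the wooldridge package is installed. Two choices here are assumptions of this example, not statements from the source: G counts only exper and expersq (not the intercept), and the exogenous regressors are partialled out by taking lm() residuals. With one endogenous regressor the result should be close to the first-stage F-statistic, so the example mainly shows the mechanics.

```r
# Data: working women only (lwage observed); assumes the wooldridge package.
data("mroz", package = "wooldridge")
mroz_w <- subset(mroz, !is.na(lwage))

N <- nrow(mroz_w)
G <- 2  # exogenous regressors: exper, expersq (intercept not counted: assumption)
B <- 1  # endogenous regressors: educ
L <- 2  # external instruments: motheduc, fatheduc

# Partial the exogenous regressors out of the endogenous regressor
# and out of each instrument.
x  <- resid(lm(educ ~ exper + expersq, data = mroz_w))
z1 <- resid(lm(motheduc ~ exper + expersq, data = mroz_w))
z2 <- resid(lm(fatheduc ~ exper + expersq, data = mroz_w))

# Lowest canonical correlation between the two residual sets.
r_B <- min(cancor(cbind(x), cbind(z1, z2))$cor)

# Cragg-Donald statistic as defined in the formula above.
cragg_donald <- ((N - G - B) / L) * r_B^2 / (1 - r_B^2)
cragg_donald
```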

Instrument Exogeneity: J-test

Instrument exogeneity means that all m instruments must be uncorrelated with the error term:

$$\text{Cov}(Z_{1i}, \epsilon_i) = 0; \; \ldots; \; \text{Cov}(Z_{mi}, \epsilon_i) = 0.$$

  • In the context of the simple IV estimator, we will find that the exogeneity requirement can not be tested. (Why?)

  • However, if we have more instruments than we need, we can effectively test whether some of them are uncorrelated with the structural error.

Under over-identification (m>k), consistent IV estimation with (multiple) different combinations of instruments is possible.

If instruments are exogenous, the obtained estimates should be similar.

If the estimates are very different, some or all instruments may not be exogenous.

The Overidentifying Restrictions Test (J-test) checks this formally.

  • The null hypothesis is instrument exogeneity.

$$H_0: E(Z_{hi} \epsilon_i) = 0, \text{ for all } h = 1, 2, \ldots, m$$

The overidentifying restrictions test (also called the J-test, or Sargan test) is an approach to test the hypothesis that the additional instruments are exogenous.

The procedure of the overidentifying restrictions test is:

  • Step 1: Compute the IV regression residuals:

$$\hat{\epsilon}_i^{IV} = Y_i - \left( \hat{\beta}_0^{IV} + \sum_{j=1}^{k} \hat{\beta}_j^{IV} X_{ji} + \sum_{s=1}^{r} \hat{\beta}_{k+s}^{IV} W_{si} \right)$$

  • Step 2: Run the auxiliary regression: regress the IV residuals on the instruments and the exogenous regressors, and test the joint hypothesis $H_0: \theta_1 = 0, \ldots, \theta_m = 0$

$$\hat{\epsilon}_i^{IV} = \theta_0 + \sum_{h=1}^{m} \theta_h Z_{hi} + \sum_{s=1}^{r} \gamma_s W_{si} + v_i \quad (2)$$

  • Step 3: Compute the J statistic: $J = mF$

where F is the F-statistic of the m restrictions $H_0: \theta_1 = \cdots = \theta_m = 0$ in eq (2).

Under the null hypothesis, the J statistic is approximately distributed as $\chi^2(m-k)$ in large samples.

$$J \sim \chi^2(m-k)$$

If J is smaller than the critical value, we fail to reject the null hypothesis that all instruments are exogenous.

If J is larger than the critical value, we reject the null hypothesis: at least one of the instruments is endogenous.

  • We can apply the J-test with the R function car::linearHypothesis().

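Again, we use both motheduc and fatheduc as instruments for educ.

Thus the IV model is over-identified, and we can test the exogeneity of these two instruments with the J-test.

The 2SLS model is set up as below.

$$
\begin{cases}
\widehat{\text{educ}} = \hat{\gamma}_1 + \hat{\gamma}_2 \text{exper} + \hat{\gamma}_3 \text{expersq} + \hat{\gamma}_4 \text{motheduc} + \hat{\gamma}_5 \text{fatheduc} & \text{(stage 1)} \\
\text{lwage} = \hat{\beta}_1 + \hat{\beta}_2 \widehat{\text{educ}} + \hat{\beta}_3 \text{exper} + \hat{\beta}_4 \text{expersq} + \hat{\epsilon} & \text{(stage 2)}
\end{cases}
$$

And the auxiliary regression is

$$\hat{\epsilon}^{IV} = \hat{\alpha}_1 + \hat{\alpha}_2 \text{exper} + \hat{\alpha}_3 \text{expersq} + \hat{\theta}_1 \text{motheduc} + \hat{\theta}_2 \text{fatheduc} + v \quad \text{(auxiliary model)}$$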

Finally, we can calculate the J-statistic by hand or obtain it with dedicated tools.

  • Calculate the J-statistic by hand

  • Use linearHypothesis(., test = "Chisq")
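
Both routes can be sketched in a few lines, assuming the wooldridge, AER and car packages are installed; the object names are illustrative. One caveat worth keeping in mind: linearHypothesis() prints a p-value based on m degrees of freedom, so the p-value for the J-test is recomputed here with m - k degrees of freedom.

```r
# Data: working women only (lwage observed); assumes the wooldridge package.
data("mroz", package = "wooldridge")
mroz_w <- subset(mroz, !is.na(lwage))

m <- 2  # number of instruments: motheduc, fatheduc
k <- 1  # number of endogenous regressors: educ

# Step 1: IV residuals from the over-identified model.
lm_iv_mf <- AER::ivreg(
  lwage ~ educ + exper + expersq | exper + expersq + motheduc + fatheduc,
  data = mroz_w
)
mroz_w$res_iv <- resid(lm_iv_mf)

# Step 2: auxiliary regression of the residuals on instruments and exogenous regressors.
aux <- lm(res_iv ~ exper + expersq + motheduc + fatheduc, data = mroz_w)
restrictions <- c("motheduc = 0", "fatheduc = 0")

# Step 3a: by hand, J = m * F.
F_aux  <- car::linearHypothesis(aux, restrictions)$F[2]
J_hand <- m * F_aux

# Step 3b: the Chi-squared version of the same test gives m * F directly.
J_tool <- car::linearHypothesis(aux, restrictions, test = "Chisq")$Chisq[2]

# p-value with the correct degrees of freedom (m - k).
p_value <- pchisq(J_hand, df = m - k, lower.tail = FALSE)
c(J_hand = J_hand, J_tool = J_tool, p_value = p_value)
```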

Testing Regressor endogeneity

How can we test the regressor endogeneity?

Since OLS is in general more efficient than IV (recall that OLS is BLUE if the Gauss-Markov assumptions hold), we do not want to use IV when it is not needed to obtain consistent estimators.

So before relying on IV to obtain a consistent estimator, we should check whether the suspected regressors are really endogenous in the model.

We therefore test the following hypothesis:

$$H_0: \text{Cov}(X, \epsilon) = 0 \quad \text{vs.} \quad H_1: \text{Cov}(X, \epsilon) \neq 0$$

Hausman test

The Hausman test tells us to use OLS if we fail to reject H0, and to use IV estimation if we reject H0.

Let’s see how to construct a Hausman test. The idea is very simple.

  • If X is in fact exogenous, then both OLS and IV are consistent, but OLS estimates are more efficient than IV estimates.

  • If X is in fact endogenous, then the OLS estimators are inconsistent, while the IV (e.g. 2SLS) estimators remain consistent.

We can compare the difference between the estimates computed by OLS and by IV.

  • If the difference is small, we can conjecture that both OLS and IV are consistent and the small difference between the estimates is not systematic.
  • If the difference is large, this is because the OLS estimates are not consistent. We should use IV in this case.

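Again, we use both motheduc and fatheduc as instruments for educ in our IV model setting.

$$
\begin{cases}
\widehat{\text{educ}} = \hat{\gamma}_1 + \hat{\gamma}_2 \text{exper} + \hat{\gamma}_3 \text{expersq} + \hat{\gamma}_4 \text{motheduc} + \hat{\gamma}_5 \text{fatheduc} & \text{(stage 1)} \\
\text{lwage} = \hat{\alpha}_1 + \hat{\alpha}_2 \widehat{\text{educ}} + \hat{\alpha}_3 \text{exper} + \hat{\alpha}_4 \text{expersq} + \hat{\epsilon} & \text{(stage 2)}
\end{cases}
$$

In R, we can use the IV model diagnostic tools to obtain the Hausman test results.

In fact, for an IV model lm_iv_mf fitted with AER::ivreg(), calling summary(lm_iv_mf, diagnostics = TRUE) reports these results.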
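
A short sketch of this call, assuming the wooldridge and AER packages are installed. The second half is an addition for illustration, not part of the original procedure: it runs the regression-based (Wu-Hausman) form of the test by adding the first-stage residuals to the OLS model, and the squared t-statistic on those residuals should coincide with the Wu-Hausman statistic in the diagnostics table.

```r
# Data: working women only (lwage observed); assumes the wooldridge package.
data("mroz", package = "wooldridge")
mroz_w <- subset(mroz, !is.na(lwage))

lm_iv_mf <- AER::ivreg(
  lwage ~ educ + exper + expersq | exper + expersq + motheduc + fatheduc,
  data = mroz_w
)

# Diagnostics table: weak instruments, Wu-Hausman and Sargan tests.
iv_diag <- summary(lm_iv_mf, diagnostics = TRUE)$diagnostics
iv_diag

# Regression-based check: add the first-stage residuals to the OLS model.
mroz_w$v_hat <- resid(lm(educ ~ exper + expersq + motheduc + fatheduc, data = mroz_w))
lm_hausman <- lm(lwage ~ educ + exper + expersq + v_hat, data = mroz_w)
t_v <- coef(summary(lm_hausman))["v_hat", "t value"]

# A significant v_hat suggests that educ is endogenous.
c(t_squared = t_v^2, wu_hausman = iv_diag["Wu-Hausman", "statistic"])
```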