Does Compulsory School Attendance Affect Schooling and Earnings?
We replicate tables IV, V, and VI of Angrist and Krueger (1991). We start by loading the data.
[1]:
import numpy as np
import pandas as pd
import requests
from sklearn.preprocessing import OneHotEncoder
import subprocess
import tempfile
url = "https://economics.mit.edu/sites/default/files/inline-files/NEW7080_1.rar"
dir = tempfile.TemporaryDirectory()
with open(f"{dir.name}/file.rar", 'wb') as file:
    file.write(requests.get(url).content)
# extract the archive (requires a tar that can read .rar archives, e.g. bsdtar / libarchive)
subprocess.run(["tar", "xf", f"{dir.name}/file.rar", "-C", dir.name])
df = pd.read_stata(f"{dir.name}/NEW7080.dta")
# renaming from
# https://economics.mit.edu/sites/default/files/inline-files/Descriptive%20Statistics%20QOB.txt
df = df.rename(columns={
"v1": "age",
"v2": "ageq",
"v4": "educ",
"v5": "enocent",
"v6": "esocent",
"v9": "lwklywge",
"v10": "married",
"v11": "midatl",
"v12": "mt",
"v13": "neweng",
"v16": "census",
"v18": "qob",
"v19": "race",
"v20": "smsa",
"v21": "soatl",
"v24": "wnocent",
"v25": "wsocent",
"v27": "yob",
})
# replace AGEQ=AGEQ-1900 if CENSUS==80
df.loc[lambda x: x["census"].eq(80), "ageq"] -= 1900
# gen AGEQSQ= AGEQ*AGEQ
df["ageqsq"] = df["ageq"] ** 2
df["yob_dummies"] = df["yob"] % 10
yob_encoder = OneHotEncoder(
    categories=[list(range(9))],
    sparse_output=False,
    handle_unknown="ignore"
)
yob_encoder.set_output(transform="pandas")
yob_dummies = yob_encoder.fit_transform(df[["yob_dummies"]])
df["yqob"] = df["yob_dummies"].astype("str") + df["qob"].astype("str")
yqob_encoder = OneHotEncoder(
categories=[[f"{y}{q}" for y in range(10) for q in [2, 3, 4]]],
sparse_output=False,
handle_unknown="ignore"
).set_output(transform="pandas")
yqob_dummies = yqob_encoder.fit_transform(df[["yqob"]])
df = pd.concat([df, yob_dummies, yqob_dummies], axis=1)
# note: in this extract, year of birth is coded with the century for the 1970 census
# cohort (Table IV) and as two digits for the 1980 census cohorts (Tables V and VI)
cohorts = {
    "IV": df[lambda x: x["yob"].isin(range(1920, 1930))],
    "V": df[lambda x: x["yob"].isin(range(30, 40))],
    "VI": df[lambda x: x["yob"].isin(range(40, 50))],
}
age = ["age", "ageqsq"]
other = ["race", "married", "smsa"]
region = ["neweng", "midatl", "enocent", "wnocent", "soatl", "esocent", "wsocent", "mt"]
yob_names = yob_encoder.get_feature_names_out().tolist()
yqob_names = yqob_encoder.get_feature_names_out().tolist()
We now replicate the results from tables IV, V, and VI. We do not perfectly replicate columns (4) and (8): the authors include age and age squared only in the first stage, not in the second, whereas we include them in both stages; a short sketch after the Stata excerpt below illustrates the distinction. See also the following code from https://economics.mit.edu/sites/default/files/inline-files/QOB%20Table%20IV.do:
** Col 2 4 6 8 ***
ivregress 2sls LWKLYWGE YR20-YR28 (EDUC = QTR120-QTR129 QTR220-QTR229 QTR320-QTR329 YR20-YR28)
ivregress 2sls LWKLYWGE YR20-YR28 AGEQ AGEQSQ (EDUC = QTR120-QTR129 QTR220-QTR229 QTR320-QTR329 YR20-YR28)
ivregress 2sls LWKLYWGE YR20-YR28 RACE MARRIED SMSA NEWENG MIDATL ENOCENT WNOCENT SOATL ESOCENT WSOCENT MT (EDUC = QTR120-QTR129 QTR220-QTR229 QTR320-QTR329 YR20-YR28)
ivregress 2sls LWKLYWGE YR20-YR28 RACE MARRIED SMSA NEWENG MIDATL ENOCENT WNOCENT SOATL ESOCENT WSOCENT MT AGEQ AGEQSQ (EDUC = QTR120-QTR129 QTR220-QTR229 QTR320-QTR329 YR20-YR28)
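To make the distinction concrete, here is a minimal sketch for the 1920-29 cohort and the column (4) specification, using the objects built above. Passing age and age squared via C includes them in both stages (what we do below); passing them via Z instead uses them only to predict education in the first stage, which roughly mimics the authors' original choice. This is an illustration only, not an attempt to reproduce the published numbers.
from ivmodels import KClass

cohort = cohorts["IV"]
X, y = cohort[["educ"]], cohort[["lwklywge"]]

# age and age squared as included exogenous covariates: they enter both stages
both_stages = KClass("tsls").fit(
    X=X, y=y, C=cohort[yob_names + age], Z=cohort[yqob_names]
)

# age and age squared as additional instruments only: they enter the first stage only
first_stage_only = KClass("tsls").fit(
    X=X, y=y, C=cohort[yob_names], Z=cohort[yqob_names + age]
)
print(both_stages.coef_[0], first_stage_only.coef_[0])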
[2]:
from ivmodels import KClass
from ivmodels.tests import (
    anderson_rubin_test,
    conditional_likelihood_ratio_test,
    j_test,
    lagrange_multiplier_test,
    wald_test,
)

for table, cohort in cohorts.items():
    print(f"\nTable {table}")
    for column, kappa, exogenous in [
        ("(1)", "ols", yob_names),
        ("(2)", "tsls", yob_names),
        ("(3)", "ols", yob_names + age),
        ("(4)", "tsls", yob_names + age),
        ("(5)", "ols", yob_names + region + other),
        ("(6)", "tsls", yob_names + region + other),
        ("(7)", "ols", yob_names + region + other + age),
        ("(8)", "tsls", yob_names + region + other + age),
    ]:
        y = cohort[["lwklywge"]]
        X = cohort[["educ"]]
        C = cohort[exogenous]
        Z = cohort[yqob_names]
        estimator = KClass(kappa).fit(X=X, y=y, C=C, Z=Z)
        # for a scalar coefficient, the Wald statistic at beta = 0 equals
        # (coef / std. error)^2, so we can back out the standard error
        wald_stat, wald_p = wald_test(X=X, y=y, Z=Z, C=C, beta=np.zeros(1), estimator=kappa)
        std_error = np.abs(estimator.coef_[0]) / np.sqrt(wald_stat)
        print(f"Column {column}, {estimator.coef_[0]:.4f} ({std_error:.4f})")
        if kappa == "tsls":
            # weak-instrument-robust tests of beta = 0 and a test of overidentifying restrictions
            _, ar_p = anderson_rubin_test(X=X, y=y, Z=Z, C=C, beta=np.zeros(1))
            _, clr_p = conditional_likelihood_ratio_test(X=X, y=y, Z=Z, C=C, beta=np.zeros(1))
            _, lm_p = lagrange_multiplier_test(X=X, y=y, Z=Z, C=C, beta=np.zeros(1))
            _, j_p = j_test(X=X, y=y, Z=Z, C=C)
            print(f"wald: {wald_p:.2g}, ar: {ar_p:.2g}, clr: {clr_p:.2g}, lm: {lm_p:.2g}, j: {j_p:.2g}")
Table IV
Column (1), 0.0802 (0.0004)
Column (2), 0.0769 (0.0150)
wald: 3.2e-07, ar: 0.0085, clr: 0.00052, lm: 0.00093, j: 0.17
Column (3), 0.0802 (0.0004)
Column (4), 0.1352 (0.0337)
wald: 6e-05, ar: 0.095, clr: 0.11, lm: 0.33, j: 0.8
Column (5), 0.0701 (0.0004)
Column (6), 0.0669 (0.0151)
wald: 9.4e-06, ar: 0.028, clr: 0.002, lm: 0.0028, j: 0.23
Column (7), 0.0701 (0.0004)
Column (8), 0.1039 (0.0341)
wald: 0.0023, ar: 0.2, clr: 0.27, lm: 0.7, j: 0.69
Table V
Column (1), 0.0711 (0.0003)
Column (2), 0.0891 (0.0161)
wald: 3.2e-08, ar: 0.013, clr: 1.2e-05, lm: 1e-05, j: 0.66
Column (3), 0.0711 (0.0003)
Column (4), 0.0655 (0.0280)
wald: 0.019, ar: 0.64, clr: 0.38, lm: 0.33, j: 0.71
Column (5), 0.0632 (0.0003)
Column (6), 0.0806 (0.0164)
wald: 8.8e-07, ar: 0.064, clr: 7.4e-05, lm: 5e-05, j: 0.8
Column (7), 0.0632 (0.0003)
Column (8), 0.0509 (0.0279)
wald: 0.069, ar: 0.85, clr: 0.51, lm: 0.42, j: 0.87
Table VI
Column (1), 0.0573 (0.0003)
Column (2), 0.0553 (0.0138)
wald: 5.8e-05, ar: 5.6e-11, clr: 0.0089, lm: 0.039, j: 5.4e-10
Column (3), 0.0574 (0.0003)
Column (4), 0.1293 (0.0191)
wald: 1.4e-11, ar: 1.6e-09, clr: 1.9e-10, lm: 9.8e-08, j: 0.0032
Column (5), 0.0520 (0.0003)
Column (6), 0.0393 (0.0145)
wald: 0.0067, ar: 1e-08, clr: 0.19, lm: 0.29, j: 1.1e-08
Column (7), 0.0521 (0.0003)
Column (8), 0.1138 (0.0200)
wald: 1.3e-08, ar: 2.5e-07, clr: 1.3e-07, lm: 1.6e-05, j: 0.0025
Notably, for the 1920-1929 and 1930-1939 cohorts, the causal effect of education on wages is no longer significant at the 0.05 level when using weak-instrument-robust inference and including age and its square as exogenous variables. The LIML variant of the J-statistic rejects the null of correct model specification at the 0.01 level for the 1940-1949 cohort, making any inference for this cohort questionable.
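To see this, one can invert the Anderson-Rubin test over a grid of candidate coefficients to obtain a weak-instrument-robust confidence set. The sketch below does this for the column (8) specification of Table V; the grid and the helper ar_confidence_set are ours for illustration, not part of ivmodels, and the objects cohorts, yob_names, region, other, age, and yqob_names are those defined above.
import numpy as np
from ivmodels.tests import anderson_rubin_test

def ar_confidence_set(X, y, Z, C, grid, alpha=0.05):
    # keep every candidate beta that the Anderson-Rubin test does not reject at level alpha
    return [
        b for b in grid
        if anderson_rubin_test(X=X, y=y, Z=Z, C=C, beta=np.array([b]))[1] >= alpha
    ]

cohort = cohorts["V"]
exogenous = yob_names + region + other + age  # column (8) specification
retained = ar_confidence_set(
    X=cohort[["educ"]],
    y=cohort[["lwklywge"]],
    Z=cohort[yqob_names],
    C=cohort[exogenous],
    grid=np.linspace(-0.1, 0.2, 61),
)
# under weak instruments the confidence set may be wide, disjoint, or unbounded,
# so inspect the retained grid points rather than assuming a single interval
print(retained)
Consistent with the Anderson-Rubin p-value of 0.85 reported above for column (8) of Table V, zero is not rejected and therefore lies in this confidence set.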