Using Geographic Variation in College Proximity to Estimate the Return to Schooling

Card (1993) estimates the causal effect of years of schooling on hourly wages, using proximity to a college while growing up as an instrument for education.

We start by loading the data.

[1]:
from io import BytesIO
from zipfile import ZipFile

import numpy as np
import pandas as pd
import requests

url = "https://davidcard.berkeley.edu/data_sets/proximity.zip"
content = requests.get(url).content

# From code_bk.txt in the zip file
colspec = {
    "id": (1, 5),  # sequential id runs from 1 to 5225
    "nearc2": (7, 7),  # grew up near 2-yr college
    "nearc4": (10, 10),  # grew up near 4-yr college
    "nearc4a": (12, 13),  # grew up near 4-yr public college
    "nearc4b": (15, 16),  # grew up near 4-yr priv college
    "ed76": (18, 19),  # educ in 1976
    "ed66": (21, 22),  # educ in 1966
    "age76": (24, 25),  # age in 1976
    "daded": (27, 31),  # dads education missing=avg
    "nodaded": (33, 33),  # 1 if dad ed imputed
    "momed": (35, 39),  # moms education
    "nomomed": (41, 41),  # 1 if mom ed imputed
    "weight": (43, 54),  # nls weight for 1976 cross-section
    "momdad14": (56, 56),  # 1 if live with mom and dad age 14
    "sinmom14": (58, 58),  # lived with single mom age 14
    "step14": (60, 60),  # lived step parent age 14
    "reg661": (62, 62),  # dummy for region=1 in 1966
    "reg662": (64, 64),  # dummy for region=2 in 1966
    "reg663": (66, 66),  # dummy for region=3 in 1966
    "reg664": (68, 68),
    "reg665": (70, 70),
    "reg666": (72, 72),
    "reg667": (74, 74),
    "reg668": (76, 76),
    "reg669": (78, 78),  # dummy for region=9 in 1966
    "south66": (80, 80),  # lived in south in 1966
    "work76": (82, 82),  # worked in 1976
    "work78": (84, 84),  # worked in 1978
    "lwage76": (86, 97),  # log wage (outliers trimmed) 1976
    "lwage78": (99, 110),  # log wage in 1978 outliers trimmed
    "famed": (112, 112),  # mom-dad education class 1-9
    "black": (114, 114),  # 1 if black
    "smsa76r": (116, 116),  # in smsa in 1976
    "smsa78r": (118, 118),  # in smsa in 1978
    "reg76r": (120, 120),  # in south in 1976
    "reg78r": (122, 122),  # in south in 1978
    "reg80r": (124, 124),  # in south in 1980
    "smsa66r": (126, 126),  # in smsa in 1966
    "wage76": (128, 132),  # raw wage cents per hour 1976
    "wage78": (134, 138),
    "wage80": (140, 144),
    "noint78": (146, 146),  # 1 if noninterview in 78
    "noint80": (148, 148),
    "enroll76": (150, 150),  # 1 if enrolled in 76
    "enroll78": (152, 152),
    "enroll80": (154, 154),
    "kww": (156, 157),  # the kww score
    "iq": (159, 161),  # a normed iq score
    "marsta76": (163, 163),  # mar status in 1976 1=married, sp. present
    "marsta78": (165, 165),
    "marsta80": (167, 167),
    "libcrd14": (169, 169),  # 1 if lib card in home age 14
}

with ZipFile(BytesIO(content)).open("nls.dat") as file:
    df = pd.read_fwf(
        file,
        names=colspec.keys(),
        # pandas expects half-open [from, to) intervals, starting at 0
        colspecs=[(f - 1, t) for (f, t) in colspec.values()],
        na_values=".",
    )

df = df[lambda x: x["lwage76"].notna()].set_index("id")

# construct potential experience, its square, and age squared
df["exp76"] = df["age76"] - df["ed76"] - 6
df["exp762"] = df["exp76"] ** 2
df["age762"] = df["age76"] ** 2

df["f1"] = df["famed"].eq(1).astype("float")  # mom and dad both > 12 yrs ed
df["f2"] = df["famed"].eq(2).astype("float")  # mom&dad >=12 and not both exactly 12
df["f3"] = df["famed"].eq(3).astype("float")  # mom=dad=12
df["f4"] = df["famed"].eq(4).astype("float")  # mom >=12 and dad missing
df["f5"] = df["famed"].eq(5).astype("float")  # father >=12 and mom not in f1-f4
df["f6"] = df["famed"].eq(6).astype("float")  # mom>=12 and dad nonmissing
df["f7"] = df["famed"].eq(7).astype("float")  # mom and dad both >=9
df["f8"] = df["famed"].eq(8).astype("float")  # mom and dad both nonmissing

indicators = ["black", "smsa66r", "smsa76r", "reg76r"]
# exclude reg669, as sum(reg661, ..., reg669) = 1
indicators += [f"reg66{i}" for i in range(1, 9)]

family = ["daded", "momed", "nodaded", "nomomed", "famed", "momdad14", "sinmom14"]
fs = [f"f{i}" for i in range(1, 9)]  # exclude f9 as sum(f1, ..., f9) = 1
family += fs

X = df[["ed76", "exp76", "exp762"]]  # endogenous
y = df["lwage76"]  # outcome
C = df[family + indicators]  # included exogenous variables
Z = df[["nearc4a", "nearc4b", "nearc2", "age76", "age762"]]  # instruments
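
As a sanity check, we can verify the collinearity that motivates excluding reg669 (and f9) and confirm that dropping observations with missing wages leaves the 3,010 men of Card's sample. A minimal sketch, not part of the original analysis:

regs = [f"reg66{i}" for i in range(1, 10)]
assert df[regs].sum(axis=1).eq(1).all()  # exactly one region dummy per person
assert df.shape[0] == 3010  # men with a non-missing log wage in 1976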

We then estimate the causal effect of education on wages using the Two-Stage Least-Squares (TSLS) and Limited Information Maximum Likelihood (LIML) estimators.

[2]:
from ivmodels.models import KClass

tsls = KClass(kappa="tsls").fit(Z=Z, X=X, y=y, C=C)
print(tsls.named_coefs_[:4])

liml = KClass(kappa="liml").fit(Z=Z, X=X, y=y, C=C)
print(liml.named_coefs_[:4])
intercept    4.680112
ed76         0.144954
exp76        0.061604
exp762      -0.001196
Name: coefficients, dtype: float64
intercept    3.060812
ed76         0.172352
exp76        0.051571
exp762      -0.000713
Name: coefficients, dtype: float64
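
Both estimators belong to the k-class: TSLS corresponds to kappa = 1 and LIML to a data-dependent kappa >= 1, while kappa = 0 recovers OLS. A quick sketch, assuming KClass also accepts a numeric kappa and stores the fitted value as kappa_:

ols = KClass(kappa=0).fit(Z=Z, X=X, y=y, C=C)  # kappa = 0: plain OLS
print(ols.named_coefs_[:4])
print(f"TSLS: kappa = {tsls.kappa_}, LIML: kappa = {liml.kappa_:.4f}")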

For causal effect estimates to inform policy, reliable inference is crucial. Test statistics, p-values, and confidence sets can be computed using the tests in ivmodels.tests.

[3]:
from ivmodels.tests import wald_test

statistic, p_value = wald_test(Z=Z, X=X, C=C, y=y, beta=np.zeros(X.shape[1]))
print(f"{statistic=:.3f}, {p_value=:.3g}")
statistic=284.905, p_value=0

That is, according to the Wald test, the joint causal effect of education, experience, and experience squared is highly significant. Subvector inference on the individual components of the causal effect is easier to interpret.

[4]:
from ivmodels.confidence_set import ConfidenceSet
from ivmodels.tests import inverse_wald_test

statistic, p_value = wald_test(
    Z=Z, X=df[["ed76"]], W=df[["exp76", "exp762"]], C=C, y=y, beta=np.zeros(1)
)
print(f"{statistic=:.3f}, {p_value=:.3g}")

for alpha in [0.05, 0.01, 0.001]:
    quadric = inverse_wald_test(
        Z=Z, X=df[["ed76"]], W=df[["exp76", "exp762"]], C=C, y=y, alpha=alpha
    )
    confidence_set = ConfidenceSet.from_quadric(quadric)
    print(f"{100 * (1 - alpha)}% confidence set: {confidence_set:.3f}")
statistic=10.529, p_value=0.00118
95.0% confidence set: [0.057, 0.233]
99.0% confidence set: [0.030, 0.260]
99.9% confidence set: [-0.002, 0.292]

Test statistics, p-values, and confidence sets can be computed for multiple features at once using the summary method of KClass. The summary also displays the result of Anderson’s (1951) test of reduced rank (a multivariate extension of the first-stage F-test) and the LIML (weak-instrument robust) variant of the J-statistic.

[5]:
features = ["ed76", "exp76", "exp762"]
print(tsls.summary(X=X, Z=Z, C=C, y=y, feature_names=features))
Summary based on the wald test.

         estimate  statistic   p-value              conf. set
ed76        0.145      10.53  0.001175       [0.0574, 0.2325]
exp76      0.0616      6.635  0.009997      [0.01473, 0.1085]
exp762  -0.001196      1.029    0.3104  [-0.003506, 0.001115]

Endogenous model statistic: 284.9, p-value: <1e-16
(Multivariate) F-statistic: 15.47, p-value: 0.001454
J-statistic (LIML): 4.246, p-value: 0.1197

Anderson’s test statistic of reduced rank is of the order of 10, suggesting that the instruments might be weak. The ivmodels package implements three weak-instrument robust tests: the (subvector) Anderson-Rubin test (Anderson and Rubin, 1949; Guggenberger et al., 2012), the (subvector) conditional likelihood-ratio test (Moreira, 2003; Kleibergen, 2021), and the (subvector) Lagrange multiplier test (Kleibergen, 2002; Londschien et al., 2024).
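
The rank statistic reported in the summary can also be computed on its own. A sketch, assuming ivmodels exposes Anderson’s (1951) test as rank_test in ivmodels.tests, with a signature analogous to the other tests:

from ivmodels.tests import rank_test

statistic, p_value = rank_test(Z=Z, X=X, C=C)  # tests rank of the first stage
print(f"{statistic=:.3f}, {p_value=:.3g}")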

[6]:
for test in ["anderson-rubin", "conditional likelihood-ratio", "lagrange multiplier"]:
    print(liml.summary(X=X, Z=Z, C=C, y=y, feature_names=features, test=test))
    print("")
Summary based on the anderson-rubin test.

          estimate  statistic   p-value              conf. set
ed76        0.1724      5.027  0.001748      [0.08205, 0.3557]
exp76      0.05157      2.093   0.09876    [-0.02839, 0.09739]
exp762  -0.0007127      1.495    0.2138  [-0.002968, 0.003186]

Endogenous model statistic: 66.33, p-value: <1e-16
(Multivariate) F-statistic: 15.47, p-value: 0.001454
J-statistic (LIML): 4.246, p-value: 0.1197

Summary based on the conditional likelihood-ratio test.

          estimate  statistic   p-value              conf. set
ed76        0.1724      10.84  0.002516      [0.07273, 0.3987]
exp76      0.05157      2.034    0.1785     [-0.04522, 0.1018]
exp762  -0.0007127     0.2379    0.6444  [-0.003198, 0.004002]

Endogenous model statistic: 327.4, p-value: <1e-16
(Multivariate) F-statistic: 15.47, p-value: 0.001454
J-statistic (LIML): 4.246, p-value: 0.1197

Summary based on the lagrange multiplier test.

          estimate  statistic  p-value                                      conf. set
ed76        0.1724      5.739  0.01659         [-0.6013, -0.05795] U [0.06078, 0.472]
exp76      0.05157      1.648   0.1992           [-0.06781, 0.1025] U [0.116, 0.3515]
exp762  -0.0007127     0.2043   0.6513  [-0.01522, -0.003838] U [-0.003228, 0.005095]

Endogenous model statistic: 324.9, p-value: <1e-16
(Multivariate) F-statistic: 15.47, p-value: 0.001454
J-statistic (LIML): 4.246, p-value: 0.1197

The causal effect of education on wages remains significant at level 0.01 under the weak-instrument robust Anderson-Rubin and conditional likelihood-ratio tests; under the Lagrange multiplier test it is significant at level 0.05 (p = 0.017).
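
The corresponding weak-instrument robust confidence sets can also be computed directly, as in the Wald-based cell above. A sketch, assuming the inverse Anderson-Rubin test mirrors the signature of inverse_wald_test:

from ivmodels.tests import inverse_anderson_rubin_test

quadric = inverse_anderson_rubin_test(
    Z=Z, X=df[["ed76"]], W=df[["exp76", "exp762"]], C=C, y=y, alpha=0.05
)
print(f"95% AR confidence set: {ConfidenceSet.from_quadric(quadric):.3f}")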