{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "74251d3d",
   "metadata": {},
   "source": [
    "# Card (1993): Using geographic variation in college proximity to estimate the return to schooling\n",
    "\n",
    "Card (1993) estimates the causal effect of length of education on hourly wages.\n",
    "Their dataset is based on a 1976 subsample of the National Longitudinal Survey of Young Men and contains 3010 observations.\n",
    "It includes years of education obtained by 1976 (``ed76``), log hourly wages in 1976 (``lwage76``), age in 1976 (``age76``), and indicators of whether the individual lived close to a public four-year college (`nearc4a`), a private four-year college (`nearc4b`), or a two-year college in 1966 (`nearc2`).\n",
    "The dataset also contains variables or indicators on race, metropolitan area, region (summarized in `indicators`), and family background (summarized in `family`).\n",
    "Card (1993) defines (potential) experience `exp76` as `age76` - `ed76` - 6.\n",
    "\n",
    "Card (1993) reports estimates of the causal effect on education on log wages for multiple specifications.\n",
    "We focus on their specification where experience and experience squared are included as endogenous regressors and variables or indicators on race, metropolitan area, region, and family background are included as exogenous regressors.\n",
    "Whereas Card uses age and age squared to instrument for experience and experience squared, we simply include these as additional instruments.\n",
    "We also include all three college proximity indicators as instruments, instead of aggregating `nearc4a` and `nearc4b` into a single instrument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "08ab71f4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ed76</th>\n",
       "      <th>exp76</th>\n",
       "      <th>exp762</th>\n",
       "      <th>lwage76</th>\n",
       "      <th>nearc4a</th>\n",
       "      <th>nearc4b</th>\n",
       "      <th>nearc2</th>\n",
       "      <th>age76</th>\n",
       "      <th>age762</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7</td>\n",
       "      <td>16</td>\n",
       "      <td>256</td>\n",
       "      <td>6.306275</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>29</td>\n",
       "      <td>841</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>12</td>\n",
       "      <td>9</td>\n",
       "      <td>81</td>\n",
       "      <td>6.175867</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>27</td>\n",
       "      <td>729</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>12</td>\n",
       "      <td>16</td>\n",
       "      <td>256</td>\n",
       "      <td>6.580639</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>34</td>\n",
       "      <td>1156</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>11</td>\n",
       "      <td>10</td>\n",
       "      <td>100</td>\n",
       "      <td>5.521461</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>27</td>\n",
       "      <td>729</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>12</td>\n",
       "      <td>16</td>\n",
       "      <td>256</td>\n",
       "      <td>6.591674</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>34</td>\n",
       "      <td>1156</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    ed76  exp76  exp762   lwage76  nearc4a  nearc4b  nearc2  age76  age762\n",
       "id                                                                        \n",
       "2      7     16     256  6.306275        0        0       0     29     841\n",
       "3     12      9      81  6.175867        0        0       0     27     729\n",
       "4     12     16     256  6.580639        0        0       0     34    1156\n",
       "5     11     10     100  5.521461        1        0       1     27     729\n",
       "6     12     16     256  6.591674        1        0       1     34    1156"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from io import BytesIO\n",
    "from zipfile import ZipFile\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import requests\n",
    "\n",
    "url = \"https://davidcard.berkeley.edu/data_sets/proximity.zip\"\n",
    "content = requests.get(url).content\n",
    "\n",
    "# From code_bk.txt in the zip file\n",
    "colspec = {\n",
    "    \"id\": (1, 5),  # sequential id runs from 1 to 5225\n",
    "    \"nearc2\": (7, 7),  # grew up near 2-yr college\n",
    "    \"nearc4\": (10, 10),  # grew up near 4-yr college\n",
    "    \"nearc4a\": (12, 13),  # grew up near 4-yr public college\n",
    "    \"nearc4b\": (15, 16),  # grew up near 4-yr priv college\n",
    "    \"ed76\": (18, 19),  # educ in 1976\n",
    "    \"ed66\": (21, 22),  # educ in 1966\n",
    "    \"age76\": (24, 25),  # age in 1976\n",
    "    \"daded\": (27, 31),  # dads education missing=avg\n",
    "    \"nodaded\": (33, 33),  # 1 if dad ed imputed\n",
    "    \"momed\": (35, 39),  # moms education\n",
    "    \"nomomed\": (41, 41),  # 1 if mom ed imputed\n",
    "    \"weight\": (43, 54),  # nls weight for 1976 cross-section\n",
    "    \"momdad14\": (56, 56),  # 1 if live with mom and dad age 14\n",
    "    \"sinmom14\": (58, 58),  # lived with single mom age 14\n",
    "    \"step14\": (60, 60),  # lived step parent age 14\n",
    "    \"reg661\": (62, 62),  # dummy for region=1 in 1966\n",
    "    \"reg662\": (64, 64),  # dummy for region=2 in 1966\n",
    "    \"reg663\": (66, 66),  # dummy for region=3 in 1966\n",
    "    \"reg664\": (68, 68),\n",
    "    \"reg665\": (70, 70),\n",
    "    \"reg666\": (72, 72),\n",
    "    \"reg667\": (74, 74),\n",
    "    \"reg668\": (76, 76),\n",
    "    \"reg669\": (78, 78),  # dummy for region=9 in 1966\n",
    "    \"south66\": (80, 80),  # lived in south in 1966\n",
    "    \"work76\": (82, 82),  # worked in 1976\n",
    "    \"work78\": (84, 84),  # worked in 1978\n",
    "    \"lwage76\": (86, 97),  # log wage (outliers trimmed) 1976\n",
    "    \"lwage78\": (99, 110),  # log wage in 1978 outliers trimmed\n",
    "    \"famed\": (112, 112),  # mom-dad education class 1-9\n",
    "    \"black\": (114, 114),  # 1 if black\n",
    "    \"smsa76r\": (116, 116),  # in smsa in 1976\n",
    "    \"smsa78r\": (118, 118),  # in smsa in 1978\n",
    "    \"reg76r\": (120, 120),  # in south in 1976\n",
    "    \"reg78r\": (122, 122),  # in south in 1978\n",
    "    \"reg80r\": (124, 124),  # in south in 1980\n",
    "    \"smsa66r\": (126, 126),  # in smsa in 1966\n",
    "    \"wage76\": (128, 132),  # raw wage cents per hour 1976\n",
    "    \"wage78\": (134, 138),\n",
    "    \"wage80\": (140, 144),\n",
    "    \"noint78\": (146, 146),  # 1 if noninterview in 78\n",
    "    \"noint80\": (148, 148),\n",
    "    \"enroll76\": (150, 150),  # 1 if enrolled in 76\n",
    "    \"enroll78\": (152, 152),\n",
    "    \"enroll80\": (154, 154),\n",
    "    \"kww\": (156, 157),  # the kww score\n",
    "    \"iq\": (159, 161),  # a normed iq score\n",
    "    \"marsta76\": (163, 163),  # mar status in 1976 1=married, sp. present\n",
    "    \"marsta78\": (165, 165),\n",
    "    \"marsta80\": (167, 167),\n",
    "    \"libcrd14\": (169, 169),  # 1 if lib card in home age 14\n",
    "}\n",
    "\n",
    "with ZipFile(BytesIO(content)).open(\"nls.dat\") as file:\n",
    "    df = pd.read_fwf(\n",
    "        file,\n",
    "        names=colspec.keys(),\n",
    "        # pandas expects [from, to[ values, starting at 0\n",
    "        colspecs=[(f - 1, t) for (f, t) in colspec.values()],\n",
    "        na_values=\".\",\n",
    "    )\n",
    "\n",
    "df = df[lambda x: x[\"lwage76\"].notna()].set_index(\"id\")\n",
    "\n",
    "# construct potential experience and its square\n",
    "df[\"exp76\"] = df[\"age76\"] - df[\"ed76\"] - 6\n",
    "df[\"exp762\"] = df[\"exp76\"] ** 2\n",
    "df[\"age762\"] = df[\"age76\"] ** 2\n",
    "\n",
    "df[\"f1\"] = df[\"famed\"].eq(1).astype(\"float\")  # mom and dad both > 12 yrs ed\n",
    "df[\"f2\"] = df[\"famed\"].eq(2).astype(\"float\")  # mom&dad >=12 and not both exactly 12\n",
    "df[\"f3\"] = df[\"famed\"].eq(3).astype(\"float\")  # mom=dad=12\n",
    "df[\"f4\"] = df[\"famed\"].eq(4).astype(\"float\")  # mom >=12 and dad missing\n",
    "df[\"f5\"] = df[\"famed\"].eq(5).astype(\"float\")  # father >=12 and mom not in f1-f4\n",
    "df[\"f6\"] = df[\"famed\"].eq(6).astype(\"float\")  # mom>=12 and dad nonmissing\n",
    "df[\"f7\"] = df[\"famed\"].eq(7).astype(\"float\")  # mom and dad both >=9\n",
    "df[\"f8\"] = df[\"famed\"].eq(8).astype(\"float\")  # mom and dad both nonmissing\n",
    "\n",
    "indicators = [\"black\", \"smsa66r\", \"smsa76r\", \"reg76r\"]\n",
    "# exclude reg669, as sum(reg661, ..., reg669) = 1\n",
    "indicators += [f\"reg66{i}\" for i in range(1, 9)]\n",
    "\n",
    "family = [\"daded\", \"momed\", \"nodaded\", \"nomomed\", \"famed\", \"momdad14\", \"sinmom14\"]\n",
    "fs = [f\"f{i}\" for i in range(1, 8)]  # exclude f8 as sum(f1, ..., f8) = 1\n",
    "family += fs\n",
    "\n",
    "# endogenous variables: years of education, experience, experience squared\n",
    "X = df[[\"ed76\", \"exp76\", \"exp762\"]]\n",
    "y = df[\"lwage76\"]  # outcome: log wage\n",
    "# included exog. variables: indicators for family background, region, race\n",
    "C = df[family + indicators]\n",
    "# instruments: proximity to colleges, age, and age squared\n",
    "Z = df[[\"nearc4a\", \"nearc4b\", \"nearc2\", \"age76\", \"age762\"]]\n",
    "\n",
    "pd.concat([X, y, Z], axis=1).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "188dfd87",
   "metadata": {},
   "source": [
    "For simplicity, we reduce to model 1 by replacing $Z$, $X$, and $y$ with their residuals after regressing out $C$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "931debc0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ivmodels.utils import oproj\n",
    "\n",
    "C = C - C.mean(axis=0)\n",
    "# oproj(A, B) returns B minus the projection of B onto the column space of A.\n",
    "Z, X, y = oproj(C, Z, X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b1d7c71-f9b7-4c58-a87e-aebe4927fcf5",
   "metadata": {},
   "source": [
    "\n",
    "## Estimators\n",
    "\n",
    "We compare the ordinary least-squares (OLS), two-stage least-squares (TSLS), and limited information maximum likelihood (LIML) estimators applied to Card's dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d6c5fc2e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "intercept    4.768683\n",
       "ed76         0.072634\n",
       "exp76        0.084529\n",
       "exp762      -0.002290\n",
       "Name: coefficients, dtype: float64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ivmodels import KClass\n",
    "\n",
    "ols = KClass(kappa=\"ols\").fit(Z=None, X=X, y=y)\n",
    "ols.named_coef_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "84a044af-acbe-4c98-acbc-9186120b1431",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "intercept    3.907937\n",
       "ed76         0.144954\n",
       "exp76        0.061604\n",
       "exp762      -0.001196\n",
       "Name: coefficients, dtype: float64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tsls = KClass(kappa=\"tsls\").fit(Z=Z, X=X, y=y)\n",
    "tsls.named_coef_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "332d5952-1748-4138-b1e0-87c85e01e7ca",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "intercept    3.587249\n",
       "ed76         0.172352\n",
       "exp76        0.051571\n",
       "exp762      -0.000713\n",
       "Name: coefficients, dtype: float64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "liml = KClass(kappa=\"liml\").fit(Z=Z, X=X, y=y)\n",
    "liml.named_coef_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "831c74fe",
   "metadata": {},
   "source": [
    "Here $3 = m_x < k = 5$ and the LIML estimator $\\hat{\\beta}_{\\text{LIML}}$ regularizes the two-stage least-squares estimator $\\hat{\\beta}_{\\text{TSLS}}$ away from the ordinary least-squares estimator $\\hat{\\beta}_{\\text{OLS}}$.\n",
    "\n",
    "We numerically validate Proposition 10 in Londschien (2025), that\n",
    "$$AR(\\hat\\beta_\\mathrm{LIML}) = \\frac{n - k}{k} \\cdot (\\hat\\kappa_\\mathrm{LIML} - 1).$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a2da29ef-c1ce-477f-a936-cc45c98f250e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8568537499505241"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Multiply with (n - k - 1) instead of (n - k) due to the included intercept\n",
    "(y.shape[0] - Z.shape[1] - 1) / Z.shape[1] * (liml.kappa_ - 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "9b6dfb6f-1607-4285-94da-5071ec2b98cd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.8568537499489436, 0.5092552733782498)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ivmodels.tests import anderson_rubin_test\n",
    "\n",
    "# This returns a tuple (statistic, p_value)\n",
    "anderson_rubin_test(Z=Z, X=X, y=y, beta=liml.coef_)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60c475b3-b38a-4416-be5e-7d73db69b084",
   "metadata": {},
   "source": [
    "\n",
    "## Tests for the causal parameter\n",
    "\n",
    "We test the null hypothesis \n",
    "$$H_0 \\colon \\text{ The causal effect of education on log wages } \\beta_0 = 0$$\n",
    "using the subvector Wald, Anderson-Rubin, (conditional) likelihood-ratio, and lagrange multiplier tests."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "66d67a01-07ec-4850-8f66-7324185547cd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wald (TSLS): statistic=10.62, p-value=0.0011\n",
      "Wald (LIML): statistic= 9.46, p-value=0.0021\n",
      "AR         : statistic= 5.07, p-value=0.0016\n",
      "LR         : statistic=10.93, p-value=0.0009\n",
      "CLR        : statistic=10.93, p-value=0.0024\n",
      "LM         : statistic= 5.79, p-value=0.0161\n"
     ]
    }
   ],
   "source": [
    "from functools import partial\n",
    "from ivmodels.tests import (\n",
    "    wald_test,\n",
    "    likelihood_ratio_test,\n",
    "    conditional_likelihood_ratio_test,\n",
    "    lagrange_multiplier_test\n",
    ")\n",
    "\n",
    "X, W = X[[\"ed76\"]], X[[\"exp76\", \"exp762\"]]\n",
    "\n",
    "for test, name in [\n",
    "    (partial(wald_test, estimator=\"tsls\"), \"Wald (TSLS)\"),\n",
    "    (partial(wald_test, estimator=\"liml\"), \"Wald (LIML)\"),\n",
    "    (anderson_rubin_test, \"AR\"),\n",
    "    (likelihood_ratio_test, \"LR\"),\n",
    "    (conditional_likelihood_ratio_test, \"CLR\"),\n",
    "    (lagrange_multiplier_test, \"LM\"),\n",
    "]:\n",
    "    stat, pval = test(Z=Z, X=X, W=W, y=y, beta=np.array([0.]))\n",
    "    print(f\"{name:<11}: statistic={stat:5.2f}, p-value={pval:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54ab1442",
   "metadata": {},
   "source": [
    "\n",
    "## Testing model assumptions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c94cea00",
   "metadata": {},
   "source": [
    "We compute the LIML variant of the Sargan J-statistic (Guggenberger, 2012) and the Cragg-Donald test statistic (Cragg and Donald, 1993) of reduced rank."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "562c4c40-dbae-418d-aa7b-912a2e03cd9c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "J-statistic   :  4.284, p-value: 0.1174\n",
      "Rank statistic: 15.613, p-value: 0.0014\n"
     ]
    }
   ],
   "source": [
    "from ivmodels.tests import j_test, rank_test\n",
    "\n",
    "XW = pd.concat([X, W], axis=1)\n",
    "\n",
    "j_stat, j_pval = j_test(Z=Z, X=XW, y=y, estimator=\"liml\")\n",
    "print(f\"J-statistic   : {j_stat:6.3f}, p-value: {j_pval:.4f}\")\n",
    "rank_stat, rank_pval = rank_test(Z=Z, X=XW)\n",
    "print(f\"Rank statistic: {rank_stat:6.3f}, p-value: {rank_pval:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eed80575",
   "metadata": {},
   "source": [
    "The (LIML variant of the) J-test does not reject at $\\alpha = 0.05$. \n",
    "\n",
    "Stock and Yogo (2002) do not tabulate critical values for Anderson's (1951) statistic of reduced rank for Wald tests to have acceptable size for $m=3$ endogenous regressors.\n",
    "For the TSLS centered Wald test at significance $\\alpha = 0.05$ to have an expected size not exceeding $r = 0.1$ for $k=5$ and $m=1, 2$ these critical values would be 26.87 and 19.45. These values are close to the observed value of 15.6.\n",
    "\n",
    "We apply Scheidegger et al.'s (2025) residual prediction test of model misspecification to our dataset.\n",
    "This tests\n",
    "$$\n",
    "H_0 \\colon \\exists \\beta \\in \\mathbb{R}^m \\text{ such that } \\mathbb{E}[ y_i - X_i \\beta \\mid Z_i ] = 0.\n",
    "$$\n",
    "Here, we need to explicitly include the excluded exogenous regressors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "7a53ee6b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1.6869718742813247, 0.04580438012203314)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ivmodels.tests import residual_prediction_test\n",
    "\n",
    "residual_prediction_test(\n",
    "    Z=df[[\"nearc4a\", \"nearc4b\", \"nearc2\", \"age76\", \"age762\"]],\n",
    "    X=df[[\"ed76\", \"exp76\", \"exp762\"]],\n",
    "    y=df[\"lwage76\"],\n",
    "    C=df[family + indicators],\n",
    "    seed=0,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79700f7f",
   "metadata": {},
   "source": [
    "The result is just barely significant at 5\\%.\n",
    "To avoid the $p$-value lottery due to the random train/test split used in the residual prediction test, Scheidegger et al. (2025) suggest aggregating the $p$-values from 50 random splits by taking 2 times the median. This results in a conservative $p$-value (Meinshausen et al., 2009)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "4d7a885c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.1058841781437827"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ps = np.zeros(50)\n",
    "for i in range(50):\n",
    "    _, ps[i] = residual_prediction_test(\n",
    "        Z=df[[\"nearc4a\", \"nearc4b\", \"nearc2\", \"age76\", \"age762\"]],\n",
    "        X=df[[\"ed76\", \"exp76\", \"exp762\"]],\n",
    "        y=df[\"lwage76\"],\n",
    "        C=df[family + indicators],\n",
    "        seed=i\n",
    "    )\n",
    "\n",
    "2 * np.median(ps)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d208eb9",
   "metadata": {},
   "source": [
    "This is not significant at 5%.\n",
    "\n",
    "Is it necessary to treat experience and experience squared as endogenous variables in Card's (1995) analysis?\n",
    "Treating them as exogenous would simplify the analysis by negating the need for subvector tests.\n",
    "Using Scheidegger's (2025) residual prediction test we can verify that instruments are likely not valid if we treat experience and its square as exogenous.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "d7ec3cd7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.003990876020778189"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "for i in range(50):\n",
    "    _, ps[i] = residual_prediction_test(\n",
    "        Z=df[[\"nearc4a\", \"nearc4b\", \"nearc2\", \"age76\", \"age762\"]],\n",
    "        X=df[[\"ed76\"]],\n",
    "        y=df[\"lwage76\"],\n",
    "        C=df[family + indicators + [\"exp76\", \"exp762\"]],\n",
    "        seed=i\n",
    "    )\n",
    "\n",
    "2 * np.median(ps)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa1fab57",
   "metadata": {},
   "source": [
    "\n",
    "## Confidence Sets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0717682",
   "metadata": {},
   "source": [
    "We compute 95\\% (subvector) confidence sets for the causal effect of education on log wages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "cadad646-860c-4b23-8a36-0b3e1993b873",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wald (TSLS): [0.058, 0.232]\n",
      "Wald (LIML): [0.063, 0.282]\n",
      "AR         : [0.083, 0.352]\n",
      "LR         : [0.079, 0.368]\n",
      "CLR        : [0.073, 0.396]\n",
      "LM         : [-0.594, -0.059] U [0.061, 0.467]\n"
     ]
    }
   ],
   "source": [
    "from ivmodels.tests import (\n",
    "    inverse_anderson_rubin_test,\n",
    "    inverse_wald_test,\n",
    "    inverse_likelihood_ratio_test,\n",
    "    inverse_conditional_likelihood_ratio_test,\n",
    "    inverse_lagrange_multiplier_test\n",
    ")\n",
    "\n",
    "for name, inverse_test in [\n",
    "    (\"Wald (TSLS)\", partial(inverse_wald_test, estimator=\"tsls\")),\n",
    "    (\"Wald (LIML)\", partial(inverse_wald_test, estimator=\"liml\")),\n",
    "    (\"AR\", inverse_anderson_rubin_test),\n",
    "    (\"LR\", inverse_likelihood_ratio_test),\n",
    "    (\"CLR\", inverse_conditional_likelihood_ratio_test),\n",
    "    (\"LM\", inverse_lagrange_multiplier_test),\n",
    "]:\n",
    "    print(f\"{name:<11}: {inverse_test(Z=Z, X=X, W=W, y=y, alpha=0.05):.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78ceea6e",
   "metadata": {},
   "source": [
    "The confidence sets are of varying size.\n",
    "The inverse Anderson-Rubin test confidence set is the smallest among the weak-instrument-robust tests.\n",
    "This occurs as $J_\\mathrm{LIML} = 4.284$ is substantial and\n",
    "$$\n",
    "(k - m_w) \\cdot \\mathrm{AR}(\\beta) = \\mathrm{LR}(\\beta) + J_\\mathrm{LIML}\n",
    "$$\n",
    "such that \"a little misspecification improves power\" as discussed in Londschien(2025), section 4.\n",
    "\n",
    "We observe a common phenomenon for the Lagrange multiplier test:\n",
    "The confidence set obtained by inversion is disjoint.\n",
    "This is due to the test being a function of the score, the derivative of the likelihood, such that the statistic is zero both at the minimum but also at the maximum of the likelihood.\n",
    "The Lagrange multiplier test still has its uses, and might result in more smaller confidence sets if one can discard one part of the confidence set a-priori, e.g., as prior knowledge dictates that $\\beta_0 > 0$.\n",
    "\n",
    "We now numerically verify proposition 44 of Londschien (2025).\n",
    "For $\\alpha_\\mathrm{max} = 1 − F_{\\chi^2(k - m_w)}(J_\\mathrm{LIML})$, we expect the inverse Anderson-Rubin test confidence set to be just non-empty.\n",
    "For any $\\alpha > \\alpha_\\mathrm{max}$, we expect the confidence set to be empty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "ce88eab4-9b19-4367-b54e-480b0a16b231",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Inverse AR (1-alpha=0.767640): [0.1721722, 0.1725321]\n",
      "Inverse AR (1-alpha=0.767642): ∅\n"
     ]
    }
   ],
   "source": [
    "import scipy.stats\n",
    "\n",
    "alpha_max = 1 - scipy.stats.chi2(5 - 2).cdf(j_stat)  # not equal to j_pval\n",
    "\n",
    "cs = inverse_anderson_rubin_test(Z=Z, X=X, y=y, W=W, alpha=alpha_max - 1e-6)\n",
    "print(f\"Inverse AR (1-alpha={1 - alpha_max - 1e-6:7f}): {cs:.7f}\")\n",
    "cs = inverse_anderson_rubin_test(Z=Z, X=X, y=y, W=W, alpha=alpha_max + 1e-6)\n",
    "print(f\"Inverse AR (1-alpha={1 - alpha_max + 1e-6:7f}): {cs:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73986ef2",
   "metadata": {},
   "source": [
    "For $\\alpha_\\mathrm{min} = 1 - F_{\\chi^2(k-{m_W})}^{-1}(\\mathrm{CD})$, we expect the inverse Anderson-Rubin test confidence set to be just bounded.\n",
    "For $\\alpha > \\alpha_\\mathrm{min}$, the confidence set should be unbounded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "00d5ee0a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Inverse AR (1-alpha=0.998638): [-inf, -1451.003] U [-0.005, inf]\n",
      "Inverse AR (1-alpha=0.998640): [-0.005, 1452.363]\n"
     ]
    }
   ],
   "source": [
    "alpha_min = rank_pval\n",
    "\n",
    "# The confidence set explodes at exactly alpha_min due to numerical instabilities\n",
    "cs = inverse_anderson_rubin_test(Z=Z, X=X, y=y, W=W, alpha=alpha_min - 1e-6)\n",
    "print(f\"Inverse AR (1-alpha={1 - alpha_min - 1e-6:.6f}): {cs:.3f}\")\n",
    "cs = inverse_anderson_rubin_test(Z=Z, X=X, y=y, W=W, alpha=alpha_min + 1e-6)\n",
    "print(f\"Inverse AR (1-alpha={1 - alpha_min + 1e-6:.6f}): {cs:.3f}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1466111e",
   "metadata": {},
   "source": [
    "The confidence sets obtained using $\\chi^2(k - m_W)$ critical values are more powerful than those proposed by Dufour et al. (2005).\n",
    "For $\\alpha=0.005$, the former are finite and don't include 0, whereas the latter are unbounded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "39cad1bc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Inverse AR (1-alpha=0.995): [0.028, 0.932]\n",
      "Inverse AR (1-alpha=0.995) by projection: [-inf, -1.822] U [-0.023, inf]\n"
     ]
    }
   ],
   "source": [
    "cs = inverse_anderson_rubin_test(Z=Z, X=X, y=y, W=W, alpha=0.005)\n",
    "print(f\"Inverse AR (1-alpha={1 - 0.005}): {cs:.3f}\")\n",
    "\n",
    "cs = inverse_anderson_rubin_test(Z=Z, X=XW, y=y, alpha=0.005).project([0])\n",
    "print(f\"Inverse AR (1-alpha={1 - 0.005}) by projection: {cs:.3f}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:ivmodels] *",
   "language": "python",
   "name": "conda-env-ivmodels-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}