Causal Machine Learning for Novel Settings Boot Camp

August 19-20

Purdue University’s Department of Statistics in the College of Science and the Krannert School of Management, with support from the Regenstrief Center for Healthcare Engineering (RCHE), are co-organizing the 2022 Causal Machine Learning for Novel Settings Boot Camp. The boot camp will be held August 19-20 as a hybrid event, with both virtual and in-person speakers and participants.

The tutorials are aimed at Purdue graduate students and faculty interested in learning more about the principles and modern developments of causal inference. The boot camp will host distinguished invited speakers who will introduce participants to their own state-of-the-art contributions to causal machine learning. In addition, Purdue faculty members will provide demonstrations of the fundamental theories, tools, applications, and software for machine learning and causal inference during the boot camp.

Due to space constraints, in-person participation will be limited. The in-person registration option will only be available to faculty and PhD students sponsored by Krannert, Statistics, and RCHE.

The in-person sessions will be held on campus at the Krannert Center (KCTR). Details for virtual participation will be provided to registrants.

Audience: Faculty and Graduate Students

Goals: Provide participants with a conceptual and practical introduction to aspects of modern causal inference with machine learning.

Organizers:

Mohammad S. Rahman
Vinayak Rao
Arman Sabbaghi

Register to attend virtually

Schedule

Day 1, August 19 (9:00 am - 4:45 pm)

KCTR or Virtual

9:00 am: Welcome and Logistics

9:15 am: "Introduction to fundamental concepts in Causal Inference and ML approaches of Causal Inference." (Arman Sabbaghi, Purdue University)

10:30 am: “Synthetic Control: Methods and Practice.” (Alberto Abadie, MIT) Recording

12:15 pm: Lunch (served for in-person participants)

2:00 pm: “Statistical Inference for Heterogeneous Treatment Effects and Individualized Treatment Rules Discovered by Generic Machine Learning in Randomized Experiments.” (Kosuke Imai, Harvard University) Slides Recording

Abstract: Researchers are increasingly turning to machine learning (ML) algorithms to estimate heterogeneous treatment effects (HET) and develop individualized treatment rules (ITR) using randomized experiments. Despite their promise, ML algorithms may fail to accurately ascertain HET or produce efficacious ITR under practical settings with many covariates and small sample size. In addition, the quantification of estimation uncertainty remains a challenge. We develop a general approach to statistical inference for estimating HET and evaluating ITR discovered by a generic ML algorithm. We utilize Neyman's repeated sampling framework, which is solely based on the randomization of treatment assignment and random sampling of units. Unlike some of the existing methods, the proposed methodology does not require modeling assumptions, asymptotic approximation, or resampling methods. We extend our analytical framework to a common setting, in which the same experimental data is used to both train ML algorithms and evaluate HET/ITR. In this case, our statistical inference incorporates the additional uncertainty due to random splits of data used for cross-fitting.

3:45 pm: “A Tale of Two Panel Data Regressions.” (Dennis Shen, UC Berkeley) Slides Recording

Abstract: A central goal in social science is to evaluate the causal effect of a policy. In this pursuit, researchers often organize their observations in a panel data format, where a subset of units are exposed to a policy (treatment) for some time periods while the remaining units are unaffected (control). The spread of information across time and space motivates two general approaches to estimate and infer causal effects: (i) unconfoundedness, which exploits time series patterns, and (ii) synthetic controls, which exploits cross-sectional patterns. Although conventional wisdom decrees that the two approaches are fundamentally different, we show that they yield numerically identical estimates under several popular settings that we coin the symmetric class. We study the two approaches for said class under a generalized regression framework and argue that valid inference relies on both correlation patterns. Accordingly, we construct a mixed confidence interval that captures the uncertainty across both time and space. We illustrate its advantages over inference procedures that only account for one dimension using data-inspired simulations and empirical applications. Building on these insights, we advocate for panel data agnostic (PANDA) regression--rooted in model checking and based on symmetric estimators and mixed confidence intervals--when the data generating process is unknown.

Day 2, August 20 (9:45 am - 4:00 pm)

KCTR or Virtual

9:45 am: "Overview of matrix completion methods." (Vinayak Rao, Purdue University) Slides Recording

11:00 am: "Causal Matrix Completion." (Dennis Shen, UC Berkeley) Slides Recording

Abstract: Matrix completion studies the recovery of an underlying matrix from a sparse subset of noisy observations. It is traditionally assumed that the entries of the matrix are "missing completely at random" (MCAR), i.e., each entry is revealed in an i.i.d. fashion. This is arguably unrealistic due to the presence of latent confounders (unobserved factors) that determine both the entries of the underlying matrix and the missingness pattern in the observed matrix. For example, a user who vehemently hates being scared is unlikely to watch horror films. In general, these confounders yield "missing not at random" (MNAR) data, which can negatively affect methods that do not account for this selection bias. Through the language of potential outcomes, we develop a formal causal model for matrix completion in this context and provide novel identification arguments for a variety of natural causal estimands. Bridging concepts from synthetic controls and nearest neighbor methods, we design a procedure coined "synthetic nearest neighbors" (SNN). We prove entry-wise finite-sample consistency and asymptotic normality of SNN in the presence of MNAR data. As a byproduct, our results also lead to new theoretical results for the matrix completion literature. Across simulated and real data, we find that SNN performs well relative to other common matrix completion methods.

12:15 pm: Lunch (served for in-person participants)

1:30 pm: Industry session: “Causal Inference and Estimands in Clinical Trials.” (Ilya Lipkovich) Slides Recording

Abstract: In this presentation, we revisit the recent ICH E9(R1) Addendum on Estimands and Sensitivity Analysis in Clinical Trials and discuss various strategies for handling intercurrent events (ICEs) using the causal inference framework. The language of potential outcomes (PO) is widely accepted in the causal inference literature but is not yet recognized in the clinical trial community and was not used in defining causal estimands in ICH E9(R1). We bridge the gap between the causal inference community and clinical trialists by advancing the use of causal estimands in clinical trial settings. We illustrate how concepts from the causal literature, such as POs and dynamic treatment regimens, can facilitate defining and implementing causal estimands for different types of outcomes, providing a unifying language for both observational studies and randomized clinical trials. We emphasize the need for a mix of strategies in handling different types of ICEs, rather than taking a one-strategy-fits-all approach. We suggest that hypothetical strategies should be used more broadly and provide examples of different hypothetical strategies for different types of ICEs.

2:15 pm: Industry session: “Propensity Score Integrated Bayesian Methods for External Borrowing in Hybrid Control Arm Designs.” (Mingyang Shan) Slides

Abstract: While randomized controlled trials (RCTs) are the gold standard to assess the efficacy and safety of experimental treatments, there is increasing interest in improving the efficiency of clinical studies with a small concurrent control group by using information from external control subjects from real-world data (RWD). However, causal inference using multiple data sources requires the treatment assignment mechanism to be unconfounded and the data sources to be strongly ignorable (ensuring exchangeability of potential outcome distributions across studies). Hybrid external control designs (HECD) that augment RCTs with external control subjects can potentially detect and adjust for violations of these assumptions. Several propensity score integrated Bayesian power prior and meta-analytic predictive prior approaches have been proposed to leverage external control data for HECDs by controlling their contribution to the likelihood function according to differences in covariate or conditional outcome distributions. This presentation will provide an overview of the extensions to the causal inference framework, review recent advances in Bayesian external borrowing methodology, and present a simulation study to evaluate the operating characteristics of the integrated propensity score methods under different violations of the unconfoundedness and strong data source ignorability assumptions.

3:15 pm: Industry session: “Causal Inference in Drug Development.” (Yongming Qu) Slides

4:00 pm: Networking Reception (Weiler Lounge in KCTR)

Data and related materials (Purdue authentication required to access data)

Recommended Readings: