Generalizability in Causal Inference: Theory and Algorithms
In the empirical sciences, experiments are invariably conducted with the intent that their results be used elsewhere (e.g., outside the laboratory), where conditions are likely to be different. This practice is based on the premise that, owing to certain commonalities between the source and target environments, causal claims will be valid even where experiments have never been performed. Yet, despite the extensive empirical work relying on this premise, practically no formal treatment has been attempted to reveal the conditions under which environments can differ and still allow, in some formal sense, generalizations to be valid.
This work develops a theoretical framework for understanding, representing, and algorithmizing the generalization problem described above and brings other types of generalization problems, of both causal and statistical character, under the same theoretical umbrella. The generalization problems addressed in this thesis are as follows:
Problem 1. Transportability (generalizing experimental findings across settings, populations, or domains). How can causal information acquired by experiments in one setting be reused to answer causal queries in another, possibly different setting where only passive observations can be collected? This question embraces several sub-problems treated informally in the literature under rubrics such as ``external validity,'' ``meta-analysis,'' ``quasi-experiments,'' and ``heterogeneity.''
Problem 2. Selection bias (generalizing statistical findings across sampling conditions, i.e., under preferential exclusion of units from the sample). How can knowledge from a sampled subpopulation be generalized to the entire population when the sampling process is not random, but determined by variables in the analysis?
Problem 3. Experimental identifiability (generalizing experimental findings across experimental conditions in the same population). How can accessible experiments be used as surrogates for other experiments that are too difficult, expensive, or unethical to be conducted in practice?
Building on the modern theory of causation, we provide algebraic, graphical, and algorithmic conditions to support the inductive step that each of these problems requires. This characterization delineates the formal boundary between estimable and non-estimable effects, and identifies which pieces of scientific knowledge need to be collected in each study to construct a bias-free estimate of the target query. The theory provided in this work is general, in the sense that it takes as input an arbitrary set of generalizability assumptions and decides whether this specific instance admits a solution.
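To make the flavor of such an algebraic condition concrete, the sketch below works through the simplest well-known instance of Problem 1. It assumes that every difference between the source and target populations is captured by the distribution of a set of discrete covariates Z; under that assumption the target effect can be written as P*(y | do(x)) = sum over z of P(y | do(x), z) P*(z), where P is estimated from the source experiment and P*(z) from passive observation of the target. The function name, the covariate labels, and all numbers are hypothetical and serve only as illustration; this is not the general algorithm developed in the thesis.

```python
def transport_effect(source_effect_by_z, target_z_dist):
    """Re-weight stratum-specific experimental effects by the target covariate mix.

    source_effect_by_z: {z: P(y | do(x), z)} estimated in the source experiment.
    target_z_dist:      {z: P*(z)} observed passively in the target population.
    Assumes (hypothetically) that Z accounts for all source/target differences.
    """
    if set(source_effect_by_z) != set(target_z_dist):
        raise ValueError("both inputs must cover the same strata of Z")
    if abs(sum(target_z_dist.values()) - 1.0) > 1e-9:
        raise ValueError("target_z_dist must sum to 1")
    # Transport formula: sum_z P(y | do(x), z) * P*(z)
    return sum(source_effect_by_z[z] * target_z_dist[z] for z in target_z_dist)


# Hypothetical numbers: the effect differs by age group, and so does the age mix.
source_effect_by_z = {"young": 0.30, "old": 0.60}   # from the source experiment
source_z_dist = {"young": 0.80, "old": 0.20}        # age mix in the source
target_z_dist = {"young": 0.40, "old": 0.60}        # age mix in the target

source_effect = transport_effect(source_effect_by_z, source_z_dist)
target_effect = transport_effect(source_effect_by_z, target_z_dist)
print(f"source: {source_effect:.2f}, target: {target_effect:.2f}")
```

With these illustrative inputs the average effect in the source (0.36) differs from the re-weighted effect in the target (0.48), which is why the source estimate cannot simply be copied across populations. When the covariate assumption fails, this re-weighting is no longer justified, and deciding which cases remain estimable is the kind of question the conditions above are meant to answer.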
The problems discussed in this thesis have applications in several empirical sciences, such as bioinformatics, medicine, economics, and the social sciences, as well as in data-driven fields such as machine learning, artificial intelligence, and statistics.