Estimating causal effects is one of the fundamental problems in the empirical sciences. When a randomized study can be performed, causal effects can be estimated with standard, well-understood methods. However, randomized experiments can be imperfect, unethical, or infeasible in many real-world scenarios. In such cases, determining whether and how the causal effect can be estimated depends on the underlying causal structure, which is generally modeled using structural equation models (SEMs). These models allow researchers to express causal assumptions formally and transparently, test them against data, and derive their consequences. As a result, researchers can use SEMs to determine whether their assumptions enable a causal effect to be estimated and, if so, derive a consistent estimator for that effect. Likewise, researchers can derive testable implications of their assumptions and test them against data. While linear SEMs have been studied for nearly a century, no complete and tractable algorithm has been developed for the identification problem, that is, determining whether an effect is estimable and, if so, deriving a consistent estimator for it. Likewise, little work has been done to develop algorithmic methods for deriving testable implications of linear SEMs. In this work, I devise a new family of graph-based methods to address these two fundamental problems in linear SEMs.
Perhaps the most common method of identifying and estimating causal effects in a linear structural model is regression. However, for regression to provide unbiased and consistent estimates of the causal effect, the exogeneity assumption must be satisfied. A common way of testing this assumption is to perform a ``robustness test'', where variables are added to the regression and a consequent shift in the coefficient of interest is taken as evidence of misspecification or bias. However, I show that certain regressors, when added to the regression, will induce a shift even when the model is properly specified. Such robustness tests would produce a false alarm, suggesting that the model is misspecified when it is not. I propose a simple graphical criterion that allows researchers to quickly determine which variables, when added to the regression, constitute informative robustness tests. I also characterize when and how robustness tests are able to detect confounding bias.
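The false-alarm phenomenon can be illustrated with a small simulation. The sketch below is not taken from this work: it assumes a hypothetical, correctly specified linear model X -> Y -> Z with no confounding, a true coefficient of 1.0, and unit-variance Gaussian noise, and it picks a descendant of the outcome as one plausible example of a regressor that shifts the coefficient. All names (`simulate`, `coef_y_on_x`, `coef_y_on_x_given_z`) are illustrative.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)


def simulate(n=50_000, beta=1.0, seed=0):
    """Draw from the assumed model X -> Y -> Z; X is exogenous by construction."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta * xi + rng.gauss(0, 1) for xi in x]
    z = [yi + rng.gauss(0, 1) for yi in y]  # Z is a descendant of the outcome
    return x, y, z


def coef_y_on_x(x, y):
    """Coefficient on x in the regression of y on x alone."""
    return cov(x, y) / cov(x, x)


def coef_y_on_x_given_z(x, y, z):
    """Coefficient on x in the regression of y on x and z (2x2 normal equations)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    return (sxy * szz - sxz * szy) / (sxx * szz - sxz ** 2)


x, y, z = simulate()
b_short = coef_y_on_x(x, y)            # close to the true effect, 1.0
b_long = coef_y_on_x_given_z(x, y, z)  # shifts toward 0.5 in this model
print(f"Y ~ X:     {b_short:.3f}")
print(f"Y ~ X + Z: {b_long:.3f}")
```

In this assumed model the coefficient on X roughly halves once Z is added, although the original regression was unbiased. A shift of this kind therefore says nothing about misspecification, which is why a criterion for choosing informative regressors is needed.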
Another pervasive and well-known method for deriving testable implications is overidentification. Overidentification occurs when the modeling assumptions allow for two distinct and independent estimators of a given parameter. In this case, we can test the identifying assumptions by comparing the two estimates and checking that they are equal. In this work, I extend the state-of-the-art half-trek identification algorithm and apply it to systematically derive overidentifying and other constraints that can be used to test the model against the observed covariance matrix.
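As a minimal sketch of the overidentification idea (not the half-trek procedure itself), consider a hypothetical model in which two independent instruments Z1 and Z2 both affect X, X affects Y with coefficient 1.5, and an unobserved U confounds X and Y. Each instrument yields its own ratio estimator of the X -> Y coefficient, and the model implies that the two must agree. The coefficients, the `direct` parameter, and the function name `iv_estimates` are assumptions made for illustration.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)


def iv_estimates(direct=0.0, n=50_000, beta=1.5, seed=1):
    """Return the two instrumental-variable estimates of the X -> Y coefficient.

    `direct` is a hypothetical direct effect Z2 -> Y. The assumed model sets
    it to zero; a nonzero value violates the model's assumptions.
    """
    rng = random.Random(seed)
    z1 = [rng.gauss(0, 1) for _ in range(n)]
    z2 = [rng.gauss(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder of X and Y
    x = [0.8 * a + 0.6 * b + c + rng.gauss(0, 1) for a, b, c in zip(z1, z2, u)]
    y = [beta * xi + direct * b + c + rng.gauss(0, 1)
         for xi, b, c in zip(x, z2, u)]
    est1 = cov(z1, y) / cov(z1, x)  # estimator based on Z1
    est2 = cov(z2, y) / cov(z2, x)  # estimator based on Z2
    return est1, est2


ok1, ok2 = iv_estimates(direct=0.0)    # assumptions hold: estimates agree
bad1, bad2 = iv_estimates(direct=0.5)  # assumptions violated: estimates diverge
print(f"model holds:    {ok1:.3f} vs {ok2:.3f}")
print(f"model violated: {bad1:.3f} vs {bad2:.3f}")
```

When the assumed model holds, both estimates land near 1.5. When Z2 also affects Y directly, the Z2-based estimate drifts away from the Z1-based one, and that disagreement is the testable implication.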
Previous algorithms designed for the identification of linear SEMs are not always able to identify parameters and testable implications that non-parametric methods can, which is surprising given that the assumption of linearity imposes additional constraints on the observed data. I propose a new decomposition strategy in which an SEM can be recursively reduced into simpler sub-models, allowing the identification of parameters and testable implications that could not be identified in the original model. I prove that the resulting procedure enables the identification of any parameter or testable implication that can be identified by non-parametric algorithms, closing the gap between parametric and non-parametric methods.
Lastly, I devise a new framework called auxiliary variables (AVs) that allows researchers to incorporate knowledge of causal effects into the model. As a result, researchers can utilize AVs to supplement graphical identification and model testing methods with knowledge derived from previously identified causal effects, related studies, or surrogate experiments. I then apply this framework to develop a procedure that alternates steps of identification using instrumental sets with construction of AVs. I prove that, even without utilizing external knowledge of causal effects, this algorithm is the most powerful polynomial-time identification algorithm currently available, subsuming all methods found in the literature.