A Flexible Nonparametric Test for Conditional Independence

This paper proposes a nonparametric test for conditional independence that is easy to implement, yet powerful in the sense that it is consistent and achieves n^{-1/2} local power. The test statistic is based on an estimator of the topological "distance" between restricted and unrestricted probability measures corresponding to conditional independence or its absence. The distance is evaluated using a family of Generically Comprehensively Revealing (GCR) functions, such as the exponential or logistic functions, which are indexed by nuisance parameters. The use of GCR functions makes the test able to detect any deviation from the null. We use a kernel smoothing method when estimating the distance. An integrated conditional moment (ICM) test statistic based on these estimates is obtained by integrating out the nuisance parameters. We simulate the critical values using a conditional simulation approach. Monte Carlo experiments show that the test performs well in finite samples. As an application, we test the key assumption of unconfoundedness in the context of estimating the returns to schooling.


Introduction
In this paper, we propose a flexible nonparametric test for conditional independence. Let X, Y, and Z be three random vectors. The null hypothesis we want to test is that Y is independent of X given Z, denoted Y ⊥ X | Z. Intuitively, this means that given the information in Z, X cannot provide additional information useful in predicting Y. Dawid (1979) showed that some simple heuristic properties of conditional independence can form a conceptual framework for many important topics in statistical inference: sufficiency and ancillarity, parameter identification, causal inference, prediction sufficiency, data selection mechanisms, invariant statistical models, and a subjectivist approach to model-building.
An important application of conditional independence testing in economics is to test a key assumption identifying causal effects. Suppose we are interested in estimating the effect of X (e.g., schooling) on Y (e.g., income), and that X and Y are related by the equation Y = β₀ + β₁X + U, where U (e.g., ability) is an unobserved cause of Y (income) and β₀ and β₁ are unknown coefficients, with β₁ representing the effect of X on Y. (We write a linear structural equation here merely for concreteness.) Since X is typically not randomly assigned and is correlated with U (e.g., unobserved ability will affect both schooling and income), OLS will generally fail to consistently estimate β₁. Nevertheless, if, as in Griliches and Mason (1972) and Griliches (1977), we can find a set of covariates Z (e.g., proxies for ability, such as AFQT scores) such that

U ⊥ X | Z,  (1)

we can estimate β₁ consistently by various methods: covariate adjustment, matching, methods using the propensity score such as weighting and blocking, or combinations of these approaches. Assumption (1) is a key assumption for identifying β₁. It is called a conditional exogeneity assumption by White and Chalak (2008). It enforces the "ignorability" or "unconfoundedness" condition, also known as "selection on observables" (Barnow, Cain, and Goldberger, 1981).
Note that assumption (1) cannot be directly tested, since U is unobservable. But if there are other observable covariates V satisfying certain conditions (see White and Chalak, 2010), then U ⊥ X | Z implies V ⊥ X | Z, so we can test (1) by testing its implication, V ⊥ X | Z. Section 6 of this paper applies this test in the context of a nonparametric study of returns to schooling. In the literature, there are many tests for conditional independence when the variables are categorical. But in economic applications it is common to condition on continuous variables, and there are only a few nonparametric tests for the continuous case. Previous work on testing conditional independence for continuous random variables includes Linton and Gozalo (1997, "LG"), Fernandes and Flores (1999, "FF"), and Delgado and Gonzalez-Manteiga (2001, "DG"). Su and White have several papers (2003, 2007, 2008, 2010, "SW") addressing this question. Although SW's tests are consistent against any deviation from the null, they are only able to detect local alternatives converging to the null at a rate slower than n^{-1/2} and hence suffer from the "curse of dimensionality." Recently, Song (2009) has proposed a distribution-free conditional independence test of two continuous random variables given a parametric single index that achieves the local n^{-1/2} rate. Specifically, Song (2009) tests the hypothesis Y ⊥ X | λ_θ(Z), where λ_θ(·) is a scalar-valued function known up to a finite-dimensional parameter θ, which must be estimated. A main contribution here is that our proposed test also achieves n^{-1/2} local power, despite its fully nonparametric nature. In contrast to Song (2009), the conditioning variables can be multi-dimensional, and there are no parameters to estimate. The test is motivated by a series of papers on consistent specification testing by Bierens (1982, 1990), Bierens and Ploberger (1997), and Stinchcombe and White (1998, "StW"), among others.
Whereas Bierens (1982, 1990) and Bierens and Ploberger (1997) construct tests essentially by comparing a restricted parametric and an unrestricted regression model, the test in this paper follows a suggestion of StW, basing the test on estimates of the topological distance between unrestricted and restricted probability measures, corresponding to conditional independence or its absence.
This distance is measured indirectly by a family of moments, which are the differences of the expectations of a set of test functions under the null and under the alternative. The chosen test functions make use of Generically Comprehensively Revealing (GCR) functions, such as the logistic or normal cumulative distribution functions (CDFs), and are indexed by a continuous nuisance parameter vector γ. Under the null, all moments are zero. Under the alternative, the moments are nonzero for essentially all choices of γ. This is in contrast with DG (2001), which employs an indicator test function that is not generically comprehensively revealing. By construction, the indicator function takes only the values one and zero, whereas a GCR function is more flexible and hence may better represent the information.
We estimate these moments by their sample analogs, using kernel smoothing. An integrated conditional moment (ICM) test statistic based on these estimates is obtained by integrating out the nuisance parameters. Its limiting null distribution is a functional of a mean-zero Gaussian process. We simulate critical values using a conditional simulation approach suggested by Hansen (1996) in a different setting.
The plan of the paper is as follows. In Section 2, we explain the basic idea of the test and specify a family of moment conditions and their empirical counterparts. This family of moment conditions is (essentially) equivalent to the null hypothesis of conditional independence and forms a basis for the test. In Section 3, we establish stochastic approximations of the empirical moment conditions uniformly over the nuisance parameters. We derive the finite-dimensional weak convergence of the empirical moment process. We also provide bandwidth choices for practical use: a simple "plug-in" estimator of the MSE-optimal bandwidth. In Section 4, we formally introduce and analyze our ICM test statistic. In particular, we establish its asymptotic properties under the null and alternatives and provide a conditional simulation approach to simulate the critical values. In Section 5, we report some Monte Carlo results examining the size and power properties of our test and comparing its performance with that of a variety of other tests in the literature. In Section 6, we study the returns to schooling, using the proposed statistic to test the key assumption of unconfoundedness. The last section concludes and discusses directions for further research.

The Null Hypothesis
Let X, Y, and Z be three random vectors, with dimensions d_X, d_Y, and d_Z, respectively.
Writing W = (X′, Y′, Z′)′ with d = d_X + d_Y + d_Z, we want to test the null that Y is independent of X conditional on Z, i.e.,

H₀: Y ⊥ X | Z,

against the alternative that Y and X are dependent conditional on Z, i.e., H_a: Y ⊥̸ X | Z. Let F_{Y|XZ}(y | x, z) be the conditional distribution function of Y given (X, Z) = (x, z) and F_{Y|Z}(y | z) be the conditional distribution function of Y given Z = z. Then we can express the null as

H₀: F_{Y|XZ}(y | x, z) = F_{Y|Z}(y | z) for all (x, y, z).  (3)

The following three expressions are equivalent to one another and to (3): F_{X|YZ}(x | y, z) = F_{X|Z}(x | z); F_{XY|Z}(x, y | z) = F_{X|Z}(x | z) F_{Y|Z}(y | z); and E[1{Y ≤ y} | X, Z] = E[1{Y ≤ y} | Z] almost surely, where we have used the standard notation for distribution functions. Let ψ: ℝ → [0, 1] be a one-to-one mapping with Borel measurable inverse. Define Ψ_Y(Y) = (ψ(Y₁), ..., ψ(Y_{d_Y})) and define Ψ_X(X) and Ψ_Z(Z) similarly. Then Y ⊥ X | Z is equivalent to Ψ_Y(Y) ⊥ Ψ_X(X) | Ψ_Z(Z). The equivalence holds because the sigma-fields are not affected by the transformation. An example of such a transformation is the normal CDF. In practice, we may also use a linear map to bring the data into a bounded set. So without loss of generality, we assume that P(W ∈ [0, 1]^d) = 1 throughout the rest of the paper.
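As a minimal sketch of the preliminary step just described (the function name and the use of NumPy are ours), either a strictly increasing CDF map or a sample-range linear map can carry each coordinate into [0, 1] without disturbing any conditional independence relations:

```python
import numpy as np
from math import erf

def to_unit_interval(v, method="normal"):
    """Map a 1-D array into [0, 1] through a strictly increasing transform.

    'normal' applies the standard normal CDF; 'linear' rescales by the
    sample range. Strictly increasing maps leave the generated
    sigma-fields unchanged, so conditional independence is preserved.
    """
    v = np.asarray(v, dtype=float)
    if method == "normal":
        return 0.5 * (1.0 + np.vectorize(erf)(v / np.sqrt(2.0)))
    lo, hi = v.min(), v.max()
    return (v - lo) / (hi - lo)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
u = to_unit_interval(x)                    # normal-CDF map
w = to_unit_interval(x, method="linear")   # sample-range linear map
```

Either transform would be applied coordinate-by-coordinate to X, Y, and Z before forming the statistic.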

An Equivalent Null Hypothesis in Moment Conditions
The approach adopted in this paper is inspired by a series of papers on consistent specification testing: Bierens (1982, 1990), Bierens and Ploberger (1997), and StW, among others. The tests in those papers are based on an infinite number of moment conditions indexed by nuisance parameters. Bierens (1990) provides a consistent test of specification of nonlinear regression models. Consider the regression function g(x) = E(Y | X = x). Bierens tests the hypothesis that the parametric functional form, f(x, θ), is correctly specified in the sense that g(x) = f(x, θ₀) for some θ₀ ∈ Θ. The test statistic is based on an estimator of a family of moments E{[Y − f(X, θ₀)] exp(γ′X)} indexed by a nuisance parameter vector γ. Under the null hypothesis of correct specification, these moments are zero for all γ. Bierens's (1990) Lemma 1 shows that the converse essentially holds, due to the properties of the exponential function, making the test capable of detecting all deviations from the null.
StW find that a broader class of functions has this property. They extend Bierens's result by replacing the exponential function in the moment conditions with any GCR function, and by extending the probability measures considered in the Bierens (1990) approach to signed measures. As stated in StW, GCR functions include non-polynomial real analytic functions, e.g., exp, the logistic CDF, sine, and cosine, and also some nonanalytic functions like the normal CDF or its density. Further, they point out that such specification tests are based on estimates of topological distances between a restricted model and an unrestricted model. Following this idea, we can construct a test for conditional independence based on estimates of a topological distance between unrestricted and restricted probability measures corresponding to conditional independence or its absence.
To define the GCR property formally, let C(F) be the set of continuous functions on a compact set F ⊂ ℝ^d, and let sp[H_φ(Γ)] be the span of the collection of functions H_φ(Γ) = {w ↦ φ(γ′w̄): γ ∈ Γ}. We write w̄ := (1, w′)′. The definition below is the same as Definition 3.6 in StW.
Definition 1 φ: ℝ → ℝ is generically comprehensively revealing if for all Γ ⊂ ℝ^{1+d} with non-empty interior, the uniform closure of sp[H_φ(Γ)] contains C(F) for every compact set F ⊂ ℝ^d.
Intuitively, GCR functions generate a class of functions indexed by γ ∈ Γ whose span comes arbitrarily close to any continuous function, regardless of the choice of Γ, as long as it has non-empty interior. When there is no confusion, we simply call φ GCR if the generated H_φ is GCR.
We now establish an equivalent hypothesis in the form of a family of moment conditions following StW. Let P be the joint distribution of the random vector W, and let Q be the joint distribution of W under Y ⊥ X | Z. Thus, P is an unrestricted probability measure, whereas Q is restricted. To be specific, P and Q are defined such that for any event A,

P(A) = ∫ 1[w ∈ A] dF_{XYZ}(x, y, z) and Q(A) = ∫ 1[w ∈ A] dF_{X|Z}(x | z) dF_{Y|Z}(y | z) dF_Z(z),

where 1[·] is an indicator function. Since W ∈ [0, 1]^d with probability 1, the domain of integration in the above integrals is a cube in ℝ^d and is omitted for notational simplicity. We will follow the same practice hereafter. Note that the measure P is the same as the measure Q if and only if the null is true. To test the null hypothesis is thus equivalent to testing whether there is any deviation of P from Q. It should be pointed out that the marginal distribution of Z is the same under P and Q regardless of whether the null is true or not.
Let E_P and E_Q be the expectation operators with respect to the measures P and Q. Define

μ_φ(γ) = E_P[φ(γ′W̄)] − E_Q[φ(γ′W̄)],

where γ ∈ Γ is a vector of nuisance parameters, W̄ = (1, W′)′, and φ is such that the indicated expectations exist for all γ. Under the null hypothesis, μ_φ(γ) is obviously zero for any choice of γ and any choice of φ, including GCR functions. To construct a powerful test, we want μ_φ(γ) to be nonzero under the alternative. If μ_{φ₀}(γ₀) is not zero under some alternative, we say that φ₀ can detect that particular alternative for the choice γ = γ₀. An arbitrary function φ₀ may fail to detect some alternatives for some choices of γ. Nevertheless, according to StW, given the boundedness of W, the properties of GCR functions imply that they can detect all possible alternatives for essentially all γ ∈ Γ ⊂ ℝ^{1+d} with Γ having non-empty interior. "Essentially all" γ ∈ Γ means that the set of "bad" γ's, i.e., the set {γ ∈ Γ: μ_φ(γ) = 0 and Y ⊥̸ X | Z}, has Lebesgue measure zero and is not dense in Γ.
Given that any deviation of P from Q can be detected by essentially any choice of γ ∈ Γ, testing H₀: Y ⊥ X | Z is equivalent to testing H₀′: μ_φ(γ) = 0 for essentially all γ ∈ Γ, for a GCR function φ and a set Γ with non-empty interior. The alternative is H_a′: H₀′ is false. A straightforward testing approach would be to estimate μ_φ(γ) and to see how far the estimate is from zero. But if we proceed in that way, we encounter a nonparametric estimator f̂_Z of the density f_Z in the denominator of the test statistic, making the analysis of limiting distributions awkward. To avoid this technical issue, we compute the expectations of φf_Z rather than those of φ, leading to a new "distance" metric between P and Q:

μ_{φf}(γ) = E_P[φ(γ′W̄) f_Z(Z)] − E_Q[φ(γ′W̄) f_Z(Z)].

Using the change-of-measure technique, we have μ_{φf}(γ) = C{E_{P*}[φ(γ′W̄)] − E_{Q*}[φ(γ′W̄)]}, where P* and Q* are probability measures defined according to dP* = C^{−1} f_Z dP and dQ* = C^{−1} f_Z dQ, with C = ∫ f_Z²(z) dz being the normalizing constant. Under the null H₀: Y ⊥ X | Z, P* and Q* are the same measure, and so μ_{φf}(γ) = 0 for all γ ∈ Γ. Under the alternative H_a: Y ⊥̸ X | Z, P* and Q* are different measures. By definition, if φ is GCR, then its revealing property holds for any probability measure (see Definition 3.2 of StW). So under the alternative, we have μ_{φf}(γ) ≠ 0 for essentially all γ ∈ Γ. The behavior of μ_{φf}(γ) under H₀ and H_a implies that we can employ μ_{φf}(γ) in place of μ_φ(γ) to perform our test.
To sum up, when φ is a GCR function, Γ has non-empty interior, and ∫ f_Z²(z) dz < ∞, a null hypothesis equivalent to conditional independence is H₀: μ_{φf}(γ) = 0 for essentially all γ ∈ Γ. That is, the null hypothesis of conditional independence is equivalent to a family of moment conditions indexed by γ. For notational simplicity, we drop the subscripts and write μ(γ) := μ_{φf}(γ) hereafter.

Heuristics for Rates
When the probability density functions exist, conditional independence is equivalent to any of the following:

f_{Y|XZ}(y | x, z) = f_{Y|Z}(y | z), f_{XY|Z}(x, y | z) = f_{X|Z}(x | z) f_{Y|Z}(y | z), f_{XYZ}(x, y, z) f_Z(z) = f_{XZ}(x, z) f_{YZ}(y, z),

where the notation for density functions is self-explanatory. One way to test conditional independence is to compare the densities in a given equation to see if the equality holds. For example, Su and White's (2008) test essentially compares f_{XY|Z} with f_{X|Z} f_{Y|Z}. To do that, they estimate f_{XYZ}, f_Z, f_{XZ}, and f_{YZ} nonparametrically, so their test has power against local alternatives only at a rate of n^{-1/2}h^{-d/4}, the slowest rate among the four nonparametric density estimators, i.e., the rate for f̂_{XYZ}. This rate is slower than n^{-1/2} and hence reflects the "curse of dimensionality." The dimension here is d = d_X + d_Y + d_Z, which is at least three and could potentially be larger. To achieve the rate n^{-1/2}, we do not compare the density functions directly. Instead, our family of moment conditions indirectly measures the distance between f_{XYZ} f_Z and f_{XZ} f_{YZ}, so that for each given γ, the test statistic is based on an estimator of an average that can achieve an n^{-1/2} rate, just as a semiparametric estimator would.
To better understand the moment conditions of the equivalent null, we write

μ(γ) = ∫ φ(γ′w̄) [f_{XYZ}(x, y, z) f_Z(z) − f_{XZ}(x, z) f_{YZ}(y, z)] dw.

Instead of comparing f_{XYZ} f_Z with f_{YZ} f_{XZ}, we now compare their integral transforms.
Before the transformation, f_{XYZ} f_Z and f_{YZ} f_{XZ} are functions of (x, y, z), the data points, and those functions can only be estimated at a nonparametric rate slower than n^{-1/2}. But their integral transforms are functions of γ. For each γ, the transform is an average of the data, so semiparametric techniques can be used here to attain an n^{-1/2} rate. Essentially, we compare two functions by comparing their weighted averages. The two comparisons are equivalent because of the properties of the chosen test functions. That is, if we choose GCR functions for our test functions, defined on a compact index space Γ with non-empty interior, and the transforms agree at essentially every point γ, then the transforms must agree as functions, and as a consequence P and Q must agree. We gain robustness by integrating over many points γ.

Empirical Moment Conditions
With some abuse of notation, we write φ(x, y, z; γ) = φ(γ₀ + x′γ_X + y′γ_Y + z′γ_Z) for γ = (γ₀, γ_X′, γ_Y′, γ_Z′)′. Then the moment conditions can be rewritten as

μ(γ) = E[φ(X, Y, Z; γ) f_Z(Z)] − E[g_{XZ}(X, Z; γ) f_Z(Z)], where g_{XZ}(x, z; γ) = E[φ(x, Y, z; γ) | Z = z].

The first term of μ(γ) is a mean of φf_Z, where φ is known and f_Z can be estimated by a kernel smoothing method. The second term is a mean of g_{XZ} f_Z(Z), where the function g_{XZ}(x, z; γ) is a conditional expectation that can be estimated by a Nadaraya-Watson estimator. Thus we can estimate μ(γ) by

μ̂_{n,h}(γ) = (1/(n(n−1))) Σ_{i=1}^n Σ_{j≠i} [φ(X_i, Y_i, Z_i; γ) − φ(X_i, Y_j, Z_i; γ)] K_h(Z_i − Z_j),

where K_h(u) = h^{−d_u} K(u/h) and K(·) is a multivariate kernel function. In this paper, we follow the standard practice and use a product kernel of the form K(u) = ∏_{ℓ=1}^{d_u} k(u_ℓ), where d_u is the dimension of u and h ≡ h_n is the bandwidth that depends on n.
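As an illustration, the following sketch (ours, not the paper's code) evaluates the double-sum form of the estimator for scalar X, Y, Z, using a logistic φ and a Gaussian kernel; the paper only requires φ to be GCR and k to satisfy Assumption 4, so these particular choices are assumptions of the sketch:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def mu_hat(X, Y, Z, gamma, h):
    """U-statistic estimator of mu(gamma) for scalar X, Y, Z.

    gamma = (g0, gX, gY, gZ). The first phi term uses (X_i, Y_i, Z_i);
    the second swaps in Y_j, which, after kernel weighting on Z_i - Z_j,
    estimates the restricted (null) moment.
    """
    n = len(X)
    g0, gX, gY, gZ = gamma
    # Gaussian kernel on Z differences: K_h(u) = h^{-1} K(u / h).
    dz = (Z[:, None] - Z[None, :]) / h
    K = np.exp(-0.5 * dz**2) / (np.sqrt(2 * np.pi) * h)
    phi_i = logistic(g0 + gX * X + gY * Y + gZ * Z)
    phi_ij = logistic(g0 + gX * X[:, None] + gY * Y[None, :] + gZ * Z[:, None])
    diff = phi_i[:, None] - phi_ij        # note: not symmetric in i, j
    np.fill_diagonal(diff, 0.0)
    np.fill_diagonal(K, 0.0)
    return (diff * K).sum() / (n * (n - 1))

rng = np.random.default_rng(1)
n = 400
Z = rng.uniform(size=n)
X = np.clip(Z + 0.1 * rng.normal(size=n), 0, 1)
Y_null = np.clip(Z + 0.1 * rng.normal(size=n), 0, 1)  # Y, X related only through Z
stat = mu_hat(X, Y_null, Z, gamma=(0.5, 1.0, 1.0, 1.0), h=0.2)
```

Because the simulated design satisfies Y ⊥ X | Z, the computed value should be close to zero up to sampling noise and smoothing bias.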
μ̂_{n,h}(γ) is an empirical version of μ(γ). For each γ ∈ Γ, μ̂_{n,h}(γ) is a second-order U-statistic. When μ̂_{n,h}(γ) is regarded as a process indexed by γ ∈ Γ, it is a U-process.
The summand above is not symmetric in i and j. To achieve the symmetry needed so that the theory of U-statistics and U-processes can be applied, we rewrite

μ̂_{n,h}(γ) = (2/(n(n−1))) Σ_{i<j} φ_{h,2}(W_i, W_j; γ),

where φ_{h,2}(W_i, W_j; γ) = (1/2)[φ(X_i, Y_i, Z_i; γ) − φ(X_i, Y_j, Z_i; γ) + φ(X_j, Y_j, Z_j; γ) − φ(X_j, Y_i, Z_j; γ)] K_h(Z_i − Z_j).


Stochastic Approximations and Finite Dimensional Convergence

Assumptions
In this subsection, we state the assumptions that are required to establish the asymptotic properties of μ̂_{n,h}(γ). We start with a definition, which uses the following multi-index notation: for j = (j₁, ..., j_m) with j_ℓ being nonnegative integers, we denote |j| = j₁ + ... + j_m and D^j g(u) = ∂^{|j|} g(u)/(∂u₁^{j₁} ··· ∂u_m^{j_m}).

Definition 2 G(A, α, C, m), α > 1, is a class of functions g_ν(·): ℝ^m → ℝ indexed by ν ∈ A satisfying the following two conditions: (a) for each ν, g_ν(·) is b times continuously differentiable, where b is the greatest integer that is smaller than α; (b) letting Q_ν(u, v) be the Taylor series expansion of g_ν(u) around v of order b, |g_ν(u) − Q_ν(u, v)| ≤ C‖u − v‖^α for all u, v with ‖u − v‖ ≤ ε, for some constants C > 0 and ε > 0.

In the absence of the index set A, we use G(α, C, m) to denote the class of functions. In this case, our definition is similar to Definition 2 in Robinson (1988) and Definition 2 in DG (2001). A sufficient condition for condition (b) is that the partial derivatives of the b-th order are uniformly Hölder continuous: |D^j g_ν(u) − D^j g_ν(v)| ≤ C‖u − v‖^{α−b} for all j such that |j| = b. We are ready to present our assumptions.
Assumption 1 (Data) (a) {W_i = (X_i′, Y_i′, Z_i′)′: i = 1, ..., n} is an IID sequence of random vectors on the complete probability space (Ω, F, P); (b) each element Z_ℓ of Z is supported on [0, 1]; (c) the distribution of Z admits a density function f_Z(z) with respect to the Lebesgue measure.
Assumption 2 (Smoothness) (a) f_Z ∈ G(q + δ, C, d_Z) for some integer q > 0 and some constants δ > 0 and C > 0; (b) D^j f_Z(z) = 0 for all 0 ≤ |j| ≤ q and all z on the boundary of [0, 1]^{d_Z}; (c) the conditional distribution functions F_{Y|Z}, F_{X|Z}, and F_{XY|Z} admit the respective densities f_{Y|Z}(y|z), f_{X|Z}(x|z), and f_{XY|Z}(x, y|z) with respect to a finite counting measure, the Lebesgue measure, or their product measure; (d) as functions of z indexed by x, y, or (x, y), these conditional densities belong to G(q + δ, C, d_Z).

Assumption 3 (GCR) (a) Γ is compact with non-empty interior; (b) φ ∈ G(q + δ, C, 1).

Assumption 4 (Kernel Function)
The univariate kernel k: ℝ → ℝ is a q-th order symmetric and bounded kernel: (a) ∫k(u) du = 1, ∫u^j k(u) du = 0 for j = 1, ..., q − 1, and ∫|u|^q |k(u)| du < ∞; (b) |k(u)| ≤ C(1 + |u|^η)^{−1} for some constants C > 0 and η > q² + 2q + 2.

Some discussion of the assumptions is in order. The IID condition in Assumption 1 is maintained for convenience. Analogous results hold under weaker conditions, but we leave explicit consideration of these aside. If we know the support of Z_ℓ, then a linear map, if necessary, can be used to ensure that Z_ℓ is supported on [0, 1]. In this case, the support condition in Assumption 1(b) is innocuous. When the support of Z_ℓ is not known, we can estimate the endpoints of the support by min_{i=1,...,n}(Z_{ℓi}) and max_{i=1,...,n}(Z_{ℓi}). Under some conditions, these estimators converge to the true endpoints at the rate of 1/n. As a result, the estimation uncertainty has no effect on our asymptotic results.
Assumptions 2(a) and (d) are needed to control the smoothing bias. Under Assumptions 1(b) and 2(a), we have ∫ f_Z²(z) dz < ∞, so it is not necessary to state the square integrability of f_Z(z) as a separate assumption. In Assumption 2(d), the smoothness condition is with respect to the conditioning variable Z. It does not require the marginal distributions of X and Y to be smooth. In fact, X and Y could be either discrete or continuous. In addition, from a technical point of view, we only need to assume that there exists a version of the conditional density functions satisfying Assumption 2(d).
Assumption 2(b) is a technical condition, which helps avoid the boundary bias problem, a well-known problem for density estimation at the boundary. The GCR approach of StW requires the boundedness of the random vectors, so we have to deal with the boundary bias problem. If Assumption 2(b) does not hold, we can transform Z into Z̃ = (Λ^{−1}(Z₁), Λ^{−1}(Z₂), ..., Λ^{−1}(Z_{d_Z})), where Λ: [0, 1] → [0, 1] is strictly increasing and q + 1 times continuously differentiable with inverse Λ^{−1}, and D^j Λ(z) = 0 at z = 0 and z = 1 for j = 1, ..., q; then Assumption 2(b) is satisfied for the transformed random vector Z̃, and we can work with Z̃ rather than Z. We can do so because Y ⊥ X | Z if and only if Y ⊥ X | Z̃. An example of Λ is the CDF of a beta distribution: Λ(z) = ∫₀^z u^q (1 − u)^q du / B(q + 1, q + 1). If a kernel with compact support is used, we can remove the dominating boundary bias by normalization; see, for example, Li and Racine (2007, p. 31). In this case, we do not need to assume f_Z(·) to be zero on the boundary.
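For q = 2, the beta-CDF choice above is the Beta(3, 3) CDF, which has the closed form Λ(z) = 10z³ − 15z⁴ + 6z⁵; the sketch below (ours) applies the transform Z̃ = Λ^{−1}(Z) with a simple grid-based numerical inverse:

```python
import numpy as np

# q = 2: Lambda is the Beta(3, 3) CDF, Lambda(z) = 10 z^3 - 15 z^4 + 6 z^5.
# Its density 30 z^2 (1 - z)^2 has second-order zeros at 0 and 1, so the
# first q derivatives of Lambda vanish at the endpoints, and the density of
# Z~ = Lambda^{-1}(Z) satisfies the boundary condition of Assumption 2(b).
def Lam(z):
    z = np.asarray(z, dtype=float)
    return 10 * z**3 - 15 * z**4 + 6 * z**5

grid = np.linspace(0.0, 1.0, 10001)

def Lam_inv(u):
    """Numerical inverse of Lambda by monotone interpolation on a fine grid."""
    return np.interp(u, Lam(grid), grid)

rng = np.random.default_rng(5)
Z = rng.uniform(size=500)   # conditioning variable on [0, 1]
Z_tilde = Lam_inv(Z)        # transformed variable, still supported on [0, 1]
```

The grid inverse is adequate here because Λ is strictly increasing; any monotone root-finder would serve equally well.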
From a theoretical point of view, it is necessary to reduce the boundary bias to a certain order so that μ̂_{n,h}(γ) is asymptotically centered at μ(γ). However, if Z_i takes values in a closed subset of the interior of its support with probability close to one, the boundary effect will be small. In this case, we may skip the transformation and ignore the boundary bias in practice.
Assumption 3(a) is needed only when we attempt to establish the uniformity of some asymptotic properties over Γ. Like Assumption 2, Assumption 3(b) helps control the smoothing bias. It is satisfied by many GCR functions, such as exp(·), the normal PDF, sin(·), and cos(·).
The conditions on the higher-order kernel in Assumption 4 are fairly standard. For example, both Robinson (1988) and DG (2001) make a similar assumption. The only difference is that Robinson (1988) and DG (2001) require that η > q + 1, while we require the stronger condition η > q² + 2q + 2 in Assumption 4(b). The stronger condition is needed to control the boundary bias, which is absent in Robinson (1988) and DG (2001), as they assume that Z has unbounded support. Assumption 4(b) is not restrictive. It is satisfied by typical kernels used in practice, as they are either compactly supported or have exponentially decaying tails.
Assumption 5 (Bandwidth) As n → ∞, h → 0, (a) nh^{d_Z} → ∞, and (b) nh^{2q} → 0.

Assumption 5(a) ensures that the degenerate U-statistic in the Hoeffding decomposition of μ̂_{n,h}(γ) is asymptotically negligible. Assumption 5(b) removes the dominating bias of μ̂_{n,h}(γ); see Lemmas 1 and 2 below. A necessary condition for Assumption 5 to hold is that 2q > d_Z.

Stochastic Approximations
To establish the asymptotic properties of μ̂_{n,h}(γ), we develop some stochastic approximations, using the theory of U-statistics and U-processes pioneered by Hoeffding (1948).
Let φ_{h,1}(w; γ) = E[φ_{h,2}(w, W_j; γ)]. Using Hoeffding's H-decomposition, we can decompose μ̂_{n,h}(γ) as

μ̂_{n,h}(γ) = μ_h(γ) + (2/n) Σ_{i=1}^n [φ_{h,1}(W_i; γ) − μ_h(γ)] + R_{n,h}(γ),

where μ_h(γ) = E[φ_{h,2}(W_i, W_j; γ)]. The sum of the first two terms in the H-decomposition is known as the Hájek projection.
For easy reference, we denote it as μ̃_{n,h}(γ) = μ_h(γ) + H_{n,h}(γ), where H_{n,h}(γ) = (2/n) Σ_{i=1}^n [φ_{h,1}(W_i; γ) − μ_h(γ)]. By construction, H_{n,h}(γ) and R_{n,h}(γ) are uncorrelated zero-mean random variables. We show that the projection remainder R_{n,h}(γ) is asymptotically negligible, and as a result μ̂_{n,h}(γ) and its Hájek projection μ̃_{n,h}(γ) have the same limiting distribution. For each given γ and h, R_{n,h}(γ) is a degenerate second-order U-statistic with kernel φ̃_{h,2}(w₁, w₂; γ) = φ_{h,2}(w₁, w₂; γ) − φ_{h,1}(w₁; γ) − φ_{h,1}(w₂; γ) + μ_h(γ). According to the theory of U-statistics (e.g., Lee, 1990), var[R_{n,h}(γ)] = O(n^{−2} var[φ̃_{h,2}(W_i, W_j; γ)]); this can also be proved directly by observing that φ̃_{h,2} is degenerate. If h were fixed, it would follow from basic U-statistic theory that R_{n,h}(γ) = o_p(1/√n) for each γ ∈ Γ. However, in the present setting, h → 0 as n → ∞, so the basic U-statistic theory does not directly apply. Nevertheless, we can show that R_{n,h}(γ) is still o_p(n^{−1/2}) under Assumption 5(a). In fact, we can prove a stronger result, as Lemma 1 shows.
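To make the H-decomposition concrete, the sketch below (ours) uses a toy h-free kernel g(x, y) = xy in place of φ_{h,2} and verifies numerically that a second-order U-statistic splits exactly into its Hájek projection plus a degenerate remainder, with the remainder an order of magnitude smaller:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
m = 1.0                      # known population mean of X, so theta = m**2
X = rng.normal(loc=m, size=n)
c = X - m

# Second-order U-statistic with toy kernel g(x, y) = x * y:
# U = 2/(n(n-1)) * sum_{i<j} X_i X_j, which estimates theta = m^2.
U = (X.sum() ** 2 - (X ** 2).sum()) / (n * (n - 1))

# Hajek projection: theta + (2/n) sum_i [g1(X_i) - theta], g1(x) = m * x.
hajek = m ** 2 + (2.0 / n) * (m * X - m ** 2).sum()

# Degenerate remainder: 2/(n(n-1)) sum_{i<j} (X_i - m)(X_j - m).
R = (c.sum() ** 2 - (c ** 2).sum()) / (n * (n - 1))

# The decomposition is exact: U = hajek + R, with R of order O_p(1/n)
# against O_p(1/sqrt(n)) for the Hajek term's deviation from theta.
```

In the paper's setting the kernel depends on h, which is exactly why Lemma 1 is needed in place of this fixed-kernel calculation.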
We proceed to establish a stochastic approximation of the Hájek projection μ̃_{n,h}(γ). Note that both μ_h(γ) and H_{n,h}(γ) depend on h. Using a Taylor expansion, we can separate the terms independent of h from those associated with h in μ_h(γ) and H_{n,h}(γ). By using a higher-order kernel K and controlling the rate of h so that it shrinks fast enough, we can ensure that the terms associated with h vanish asymptotically, as in Powell, Stock, and Stoker (1989).
More specifically, we first show that μ_h(γ) = μ(γ) + O(h^q), where q is the order of the kernel k. Then we show that H_{n,h}(γ) = (2/n) Σ_{i=1}^n [ψ₁(W_i; γ) − μ(γ)] plus a term associated with h, where ψ₁(W; γ) does not depend on h. Under Assumption 5(b), √n h^q → 0, which makes both the second term of μ_h(γ) and the second term of H_{n,h}(γ) vanish asymptotically. The following lemma presents these results formally. It follows from Lemmas 1 and 2 that √n[μ̂_{n,h}(γ) − μ(γ)] and (2/√n) Σ_{i=1}^n [ψ₁(W_i; γ) − μ(γ)] have the same limiting distribution for each γ ∈ Γ.

Finite Dimensional Convergence
In this subsection, we view μ̂_{n,h}(γ) as a U-process indexed by γ and consider its finite-dimensional convergence.
Let Γ_s = {γ₁, γ₂, ..., γ_s} for some s < ∞ with γ_ℓ ∈ Γ, and define μ̂_{n,h}(Γ_s) = (μ̂_{n,h}(γ₁), ..., μ̂_{n,h}(γ_s))′ and μ(Γ_s) = (μ(γ₁), ..., μ(γ_s))′. If, in addition, H₀ holds, then μ(Γ_s) = 0, and √n μ̂_{n,h}(Γ_s) converges in distribution to a multivariate normal vector with mean zero. Theorem 3 is of interest in its own right. For example, we can use it to construct a Wald test. There may be some power loss if s is small. When s is large enough that Γ_s approximates Γ very well, the power loss will be small. The idea can be motivated from the method of sieves. We do not pursue this here but refer to Huang (2009) for more discussion. Instead, we consider the ICM tests in the next section. Theorem 3 is an important first step in obtaining the asymptotic distributions of the ICM statistics.
Observe that μ̂_{n,h}(γ) (hence μ̃_{n,h}(γ)) is not symmetric in X and Y, whereas the hypothesis Y ⊥ X | Z is. However, √n[μ̂_{n,h}(γ) − μ_h(γ)] is asymptotically equivalent to (2/√n) Σ_{i=1}^n [ψ₁(W_i; γ) − μ(γ)], and it can be readily checked that ψ₁(W; γ) is symmetric in Y and X. Alternatively, we can follow the definition of g_{XZ} in (12) and define g_{YZ}(y, z; γ) = E[φ(X, y, z; γ) | Z = z], g_Z(z; γ) = E[φ(X, Y, z; γ) | Z = z], and g_{XYZ}(x, y, z; γ) = φ(x, y, z; γ), where the last equality is tautological. Then

ψ₁(W; γ) = (1/2)[g_{XYZ}(X, Y, Z; γ) − g_{XZ}(X, Z; γ) − g_{YZ}(Y, Z; γ) + g_Z(Z; γ)] f_Z(Z),

which is clearly symmetric in Y and X. If we construct another estimator, say μ̌_{n,h}(γ), by switching the roles of X and Y, we can show that μ̌_{n,h}(γ) and μ̂_{n,h}(γ) are asymptotically equivalent in the sense that √n[μ̌_{n,h}(γ) − μ̂_{n,h}(γ)] = o_p(1) uniformly over γ ∈ Γ. So there is no asymptotic gain in taking an average of μ̂_{n,h}(γ) and μ̌_{n,h}(γ). This point is further supported by the symmetry of ψ₁(W; γ) in X and Y.

Bandwidth Selection
Although any choice of bandwidth h satisfying Assumption 5 will deliver the asymptotic distribution in Theorem 3, in practice we need some guidance on how to select h. Ideally we would select an h that gives the greatest power for a given size of test, but deriving that procedure would be complicated enough to justify another study. Moreover, it would only make a difference for higher-order results. Thus, for present purposes, we just provide a simple "plug-in" estimator of the MSE-minimizing bandwidth proposed by Powell and Stoker (1996).
Since the test statistic is based on μ̂_{n,h}(γ), which estimates μ(γ), it is appealing to choose an h that minimizes the mean squared error (MSE) of μ̂_{n,h}(γ). After some tedious but straightforward calculations, we get

MSE[μ̂_{n,h}(γ)] = {E[B₅(W; γ)]}² h^{2q} + 4n^{−1} var[ψ₁(W; γ)] − 4n^{−2} var[ψ₁(W; γ)] + 2n^{−2} μ(γ)² + 4n^{−1} C₀ h^q + 2n^{−2} h^{−d_Z} E[ψ(W; γ)] + smaller-order terms,

where B₅ is defined in (43) in the appendix and ψ(W; γ) collects the second moments of the degenerate part of the U-statistic. The term 4n^{−1} var[ψ₁(W; γ)] − 4n^{−2} var[ψ₁(W; γ)] does not depend on h. The term 2n^{−2} μ(γ)² must be of smaller order than 4n^{−1} C₀ h^q, and 4n^{−1} C₀ h^q must be of smaller order than {E[B₅(W; γ)]}² h^{2q}; otherwise there would be a contradiction to Assumption 5(b). So the leading term of MSE[μ̂_{n,h}(γ)] that involves h is

MSE₁[μ̂_{n,h}(γ)] = {E[B₅(W; γ)]}² h^{2q} + 2n^{−2} h^{−d_Z} E[ψ(W; γ)].

By minimizing MSE₁[μ̂_{n,h}(γ)], we obtain the optimal bandwidth

h* = [d_Z E[ψ(W; γ)] / (q {E[B₅(W; γ)]}²)]^{1/(2q+d_Z)} n^{−2/(2q+d_Z)}.

Now Assumption 5(a) is satisfied: nh*^{d_Z} is of order n^{(2q−d_Z)/(2q+d_Z)} → ∞ when 2q > d_Z. And so is Assumption 5(b): nh*^{2q} is of order n^{(d_Z−2q)/(2q+d_Z)} → 0. The optimal bandwidth depends on the unknown quantities E[ψ(W; γ)] and E[B₅(W; γ)]. Here we follow standard practice (e.g., Powell and Stoker, 1996) and use a simple plug-in estimator of h*. Let h₀ be an initial bandwidth. Suppose E[φ_{h,2}(W_i, W_j; γ)⁴] = O(h^{−δ}) for some δ > 0, and let ϱ = max{δ + 2d_Z, 2q + d_Z}. If h₀ → 0 and nh₀^ϱ → ∞, then the resulting estimates are consistent by Proposition 4.2 of Powell and Stoker (1996). The estimator B̂₅ is a "slope" between the two points (h₀^q, μ̂_{n,h₀}(γ)) and ((τh₀)^q, μ̂_{n,τh₀}(γ)) for some constant τ > 0, τ ≠ 1:

B̂₅(γ) = [μ̂_{n,h₀}(γ) − μ̂_{n,τh₀}(γ)] / [h₀^q − (τh₀)^q].

To get a more stable estimator, we could use a regression of μ̂_{n,h₀}(γ) on h₀^q for various values of h₀. Given ψ̂ and B̂₅, the plug-in estimator of h* is

ĥ = [d_Z ψ̂ / (q B̂₅²)]^{1/(2q+d_Z)} n^{−2/(2q+d_Z)}.

In practice we can choose q large enough that ϱ = max{δ + 2d_Z, 2q + d_Z} = 2q + d_Z; then we can choose the initial bandwidth h₀ so that nh₀^{2q+d_Z} → ∞. The data-driven ĥ depends on γ. We may choose different bandwidths for different γ's. This is what we do in our Monte Carlo experiments.
Powell and Stoker (1996) mention one technical proviso: μ̂_n(γ; ĥ) is not guaranteed to be asymptotically equivalent to μ̂_n(γ; h*), since the MSE calculations are based on the assumption that h is deterministic. The suggested solution is to discretize the set of possible scaling constants, replacing ĥ with the closest value, ĥ†, in some finite set. The estimation uncertainty in ĥ† is then small enough that it does not affect the asymptotic MSE.
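The slope and plug-in arithmetic can be illustrated with a toy, noise-free stand-in for the estimator's bias expansion; all names and constants in the sketch below are ours, and a real application would evaluate μ̂_{n,h}(γ) at the two bandwidths instead:

```python
import numpy as np

# Idealized stand-in for the bias expansion mu_hat(h) ~ mu + B5 * h^q.
q, dZ, n = 2, 1, 500
mu_true, B5_true = 0.0, 0.8

def mu_hat_of_h(h):
    return mu_true + B5_true * h ** q   # noise-free by construction

# "Slope" estimator between (h0^q, mu_hat(h0)) and ((tau*h0)^q, mu_hat(tau*h0)).
h0, tau = 0.3, 0.5
B5_hat = (mu_hat_of_h(h0) - mu_hat_of_h(tau * h0)) / (h0 ** q - (tau * h0) ** q)

psi_hat = 0.2   # stand-in for the estimated variance constant E[psi(W; gamma)]
# Plug-in bandwidth at the MSE-minimizing rate n^{-2/(2q + dZ)}.
h_plug = (dZ * psi_hat / (q * B5_hat ** 2)) ** (1.0 / (2 * q + dZ)) \
    * n ** (-2.0 / (2 * q + dZ))
```

With a noise-free expansion the slope recovers B₅ exactly; with real estimates it would carry sampling error, which motivates the regression variant mentioned above.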

An Integrated Conditional Moment Test
In this section, we "integrate out" γ to obtain an integrated conditional moment (ICM) type test statistic, following Bierens (1990) and StW.

The Test Statistic
If φ is GCR, testing H₀: Y ⊥ X | Z is equivalent to testing H₀: μ(γ) = 0 for essentially all γ ∈ Γ. In other words, if we view μ̂_{n,h}(γ) as a random function of γ, we are testing whether its mean function μ(γ) is zero on Γ. If Γ is compact, we can show that √n μ̂_{n,h}(·) converges to a zero-mean Gaussian process under the null. Based on √n μ̂_{n,h}(·), we construct the ICM test statistic

M_n = n ∫ [μ̂_{n,h}(γ)]² dν(γ),

where ν is a probability measure on Γ that is absolutely continuous with respect to the Lebesgue measure on Γ. Here we integrate [μ̂_{n,h}(γ)]², which gives a Cramér-von Mises (CM) type test. Alternatively, we could integrate |μ̂_{n,h}(γ)|^p, 1 ≤ p ≤ ∞. The choice p = ∞ (which gives the maximum over Γ) yields a Kolmogorov-Smirnov (KS) type test. We work with p = 2 for concreteness and because CM-type tests often outperform KS-type tests. As Boning and Sowell (1999) show, choosing ν to be the uniform density has a certain optimality property in a closely related context.
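The integration over γ is straightforward to approximate by Monte Carlo with ν uniform; the sketch below (ours) shows only that step, with a smooth deterministic stub standing in for the kernel estimator μ̂_{n,h}(γ):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

def mu_hat_stub(gamma):
    """Stand-in for the kernel estimator mu_hat_{n,h}(gamma) of Section 2;
    a smooth deterministic function so that the integration step runs."""
    return 0.01 * np.sin(gamma).sum()

# CM-type ICM statistic M_n = n * integral of mu_hat(gamma)^2 d nu(gamma),
# with nu uniform on a hypercube Gamma, approximated by Monte Carlo draws.
S = 200
gammas = rng.uniform(-5.0, 5.0, size=(S, 4))   # gamma in R^{1+d} with d = 3
M_cm = n * np.mean([mu_hat_stub(g) ** 2 for g in gammas])

# KS-type variant (p = infinity): supremum over the same draws.
M_ks = n * max(abs(mu_hat_stub(g)) for g in gammas) ** 2
```

A grid over Γ, or Gauss-type quadrature, would serve equally well for low-dimensional γ.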

Asymptotic Distribution of the Test Statistic
To establish the weak convergence of M_n, we first show that √n[μ̂_{n,h}(γ) − μ(γ)] converges to a Gaussian process. Define Z(·) to be the zero-mean Gaussian process on Γ whose covariance function is the limiting covariance of (2/√n) Σ_{i=1}^n [ψ₁(W_i; ·) − μ(·)]. Then Lemmas 1 and 2 imply that √n[μ̂_{n,h}(·) − μ(·)] converges weakly to Z(·). If H₀ also holds, then T_n(·) := √n μ̂_{n,h}(·) →d Z(·).
Let M : C(Γ) → R⁺ be ‖·‖_∞-continuous. Then, applying the continuous mapping theorem (Billingsley, 1999), we obtain the limiting distribution of the test statistic under H_0.

Global and Local Alternatives
The global alternatives for our conditional independence test can always be written in terms of some nontrivial, nonzero function δ(x, y, z). Under H_a, the mean function Δ(γ) will be nonzero for essentially all γ ∈ Γ provided that φ is GCR. It then follows from Theorem 4 that lim_{n→∞} Pr(M_n > c_n) = 1 for any critical value c_n = o(n). That is, the test is consistent: as the sample size increases, the test will eventually detect the alternative H_a.
To construct a local alternative, we consider a mixture distribution in which one component is a conditional density f̃(y|x, z) of Ỹ given (X̃, Z̃) such that Ỹ ⊥̸ X̃ | Z̃, with mixing weight proportional to a constant c. By construction, f̃(y|x, z) is a nontrivial function of x and z. That is, the distribution of W is a mixture of two distributions: one satisfies the null of conditional independence, and the other does not. The mixing proportion is local to unity. Equivalently, we can rewrite the local alternative as a sequence of distributions drifting toward the null. The essentially nonzero mean of the limit process is the source of the power of the ICM test against the local alternative.

Calculating the Asymptotic Critical Values
Under the null, M_n has a limiting distribution given by a functional of a zero-mean Gaussian process whose covariance function depends on the DGP. The asymptotic critical values therefore depend on the DGP and cannot be tabulated. One could follow Bierens and Ploberger (1997) and obtain upper bounds on the asymptotic critical values. Instead, we use the conditional Monte Carlo approach suggested by Hansen (1996) to simulate the asymptotic null distribution.
To apply this approach, we construct a process T*_n(γ) which, conditional on {W_i}, follows the desired zero-mean Gaussian process. The desired conditional covariance function for T*_n is the sample analogue of the limit covariance, and it is straightforward to show that, under Assumptions 1-5 and the null hypothesis, it converges to the limit covariance. A typical T*_n(γ) is constructed by generating {V_i}_{i=1}^n as IID standard normal random variables independent of {W_i} and forming the corresponding multiplier sum. Following arguments similar to the proof of Theorem 2 in Hansen (1996), we can show that, under the null hypothesis and provided Assumptions 1-5 hold, T*_n reproduces the limit process.
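A minimal sketch of the multiplier construction is given below. It assumes one already has centered influence contributions psi[i, s] for each observation i and grid point γ_s; the array name and interface are our own, not the paper's.

```python
import numpy as np

def multiplier_process(psi: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Conditional Monte Carlo draw in the spirit of Hansen (1996): with IID
    # N(0,1) multipliers V_i independent of the data, set
    #   T*_n(gamma_s) = n^{-1/2} * sum_i V_i * psi[i, s].
    # Conditional on the data, this draw is exactly mean-zero Gaussian with
    # covariance equal to the sample covariance of the columns of psi.
    n = psi.shape[0]
    v = rng.standard_normal(n)
    return (v @ psi) / np.sqrt(n)
```

Each call gives one draw of the process on the grid; calling it repeatedly (with fresh multipliers each time) yields the simulated sample used to approximate the null distribution.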
Simulation results show that the empirical PDFs of M_n and M*_n are fairly close. To save space, we do not report these results here; they are available in Huang (2009).
To approximate the distribution of M_n, we repeat the simulation of T*_n a total of B times, each time computing the corresponding statistic. This gives a simulated sample (M*_{n,1}, ..., M*_{n,B}), whose empirical distribution should be close to the true distribution of the actual test statistic M_n under the null. We then compute the proportion of simulated values that exceed M_n to obtain the simulated asymptotic p-value, and we reject the null hypothesis if this p-value lies below the specified level of the test. As Hansen (1996) points out, B is under the control of the econometrician and can be chosen sufficiently large to obtain a good approximation.
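The p-value step can be sketched as follows (the helper names are ours, for illustration only):

```python
import numpy as np

def simulated_p_value(m_n: float, m_star: np.ndarray) -> float:
    # Proportion of simulated statistics M*_{n,b}, b = 1..B, that exceed the
    # observed M_n: the simulated asymptotic p-value.
    return float(np.mean(m_star > m_n))

def reject_null(m_n: float, m_star: np.ndarray, level: float = 0.05) -> bool:
    # Reject when the simulated p-value falls below the nominal level.
    return simulated_p_value(m_n, m_star) < level
```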

A Rescaled ICM Test
The variance of √n Δ̂_{n,h}(γ) depends on γ. It is plausible that rescaling √n Δ̂_{n,h}(γ) by its standard deviation might yield a somewhat better test. Thus, consider the rescaled process T̃_n(γ) = T_n(γ)/σ̂(γ). Proposition 5. Suppose Assumptions 1-5 hold and that inf_{γ∈Γ} σ(γ) > 0. Then, under the null hypothesis, T̃_n converges weakly to Z̃, a zero-mean Gaussian process on Γ whose covariance function is the correlation function of Z. By the continuous mapping theorem, the limiting null distribution of the rescaled statistic M̃_n follows. To simulate it, we generate {V_i} IID N(0,1), independent of {W_i}, and follow the proof of Theorem 2 in Hansen (1996). As a result, the critical value of M̃_n can be obtained by simulating M̃*_n. Simulation results not reported here show that the empirical PDFs of M̃_n and M̃*_n are fairly close.
Although we do not give formal statements, results analogous to those for M n hold under the local and global alternatives. Simulation results in the next section suggest that the rescaled ICM test has somewhat better power for most experiments.

Monte Carlo Experiments
In this section, we perform Monte Carlo experiments to examine the finite-sample performance of our conditional independence test.
For all simulations, we generate IID {(X_i, Y_i, Z_i)}. We choose φ(·) to be the standard normal PDF and k(·) to be the sixth-order Gaussian kernel (q = 6). The number of replications for each experiment is 1,000, and the number of replications for simulating M*_n or M̃*_n is 999.
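The paper does not spell out the kernel's formula at this point, but one standard construction of a sixth-order Gaussian kernel multiplies the Gaussian density by a polynomial chosen so that the second and fourth moments vanish: k(u) = (15/8 − 5u²/4 + u⁴/8) φ(u). A sketch, under this assumed form:

```python
import math

def gaussian6_kernel(u: float) -> float:
    # Sixth-order Gaussian kernel: integrates to 1 while its 2nd and 4th
    # moments vanish, giving bias of order h^6 for sufficiently smooth densities.
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return (15.0 / 8.0 - 1.25 * u * u + 0.125 * u ** 4) * phi
```

As with any kernel of order q > 2, this kernel necessarily takes negative values in part of its support.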

DGP 1
We first generate a sample {(X_i, Y_i, Z_i)} from the DGP, with Z ∼ N(0, σ_Z²) = N(0, 3). When the dependence coefficient a = 0, the null is true; otherwise, the alternative holds.
We normalize each variable so that its support is comparable to that of the GCR function φ(·). For the standard normal PDF, the support is the real line, but the function is effectively zero outside the interval [−4, 4]. We normalize each variable to be supported on this interval. This can be achieved by taking X̃_i = 8[X_i − min(X_i)]/[max(X_i) − min(X_i)] − 4. We normalize Y_i and Z_i analogously. The conditional independence test is then applied to X̃_i, Ỹ_i, and Z̃_i. Although any compact Γ with a non-empty interior can be used, we take Γ = [−1, 1]⁴. This choice ensures that {W̃_i'γ : γ ∈ Γ} can take any value in the effective support of φ(·). To compute the ICM statistic M_n, we need to compute the integral ∫[T_n(γ)]² dμ(γ). In the absence of a closed-form expression, we recommend Monte Carlo integration. For each simulation replication, we draw 100 values γ_s from the uniform distribution on [−1, 1]⁴ and approximate the integral by the average Σ_{s=1}^{100} T_n²(γ_s)/100. We have also tried 50 random draws, and the results are effectively the same. Note that T_n²(γ_s) depends on the bandwidth parameter h. In our simulation experiments, we employ the data-driven bandwidth ĥ(γ_s) in (25) with h_0 = n^{-1/[3(2q+d_Z)]} and λ = 0.5, using a different bandwidth for each γ_s. Given the bandwidth ĥ(γ_s), we compute the statistic T_n²(γ_s) = n Δ̂²_{n,ĥ(γ_s)}(γ_s). The average of the T_n²(γ_s) gives the ICM statistic M_n; the rescaled ICM statistic M̃_n is computed similarly.
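The min-max normalization onto the effective support [−4, 4] can be sketched as follows (the helper name is ours; the formula is the one given in the text):

```python
import numpy as np

def rescale_to_effective_support(x: np.ndarray, lo: float = -4.0, hi: float = 4.0) -> np.ndarray:
    # Column-wise linear map onto [lo, hi]:
    #   x_tilde = (hi - lo) * (x - min) / (max - min) + lo,
    # which for [lo, hi] = [-4, 4] is the 8*(x - min)/(max - min) - 4 rule.
    mn = x.min(axis=0)
    mx = x.max(axis=0)
    return (hi - lo) * (x - mn) / (mx - mn) + lo
```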
We use DGP 1 to study the finite-sample size and power of the test against conditional mean dependence. We use ρ_{X,Y|Z} to indicate the strength of the dependence between X and Y conditional on Z. Since both X|Z and Y|Z are normal, ρ_{X,Y|Z} fully captures the dependence between X and Y conditional on Z.
We plot the power of the tests for ρ_{X,Y|Z} ranging from −0.9 to 0.9. For this, we choose the coefficient a proportional to ρ_{X,Y|Z}/√(1 − ρ²_{X,Y|Z}) for ρ_{X,Y|Z} = −0.9, −0.8, ..., 0.9. The size and power look fairly good for sample sizes as small as 100, and very good once the sample size reaches 200. The "non-standardized" results in Figure 1 correspond to M_n, and the "standardized" results in Figure 2 correspond to M̃_n. When the sample size is small, the levels of the tests approach their nominal values from below, delivering conservative tests. When the sample size increases to 200, our tests become fairly accurate in size. The power functions show that M̃_n performs better than M_n in this experiment. This may be due to efficiency improvements associated with the partial GLS correction embodied in M̃_n.

DGP 2
DGP 2 is a modification of DGP 1 that focuses on the consequences of fat-tailed distributions. Here, ε_X and ε_Y are proportional to Student's t with 3 degrees of freedom. The power functions for M_n are plotted in Figure 3, and those for M̃_n in Figure 4. The power is a little, but not a lot, worse than for the normal distributions of DGP 1.

DGP 3
DGP 3 is another modification of DGP 1. This time we allow skewness, choosing both ε_X and ε_Y to be centered chi-square random variables. The power functions for M_n are plotted in Figure 5, and those for M̃_n in Figure 6. Here, the power is slightly better than for DGP 1. Overall, the size and power properties of our tests are robust to the data distribution.

Comparison with Other Tests
In this section we compare the standardized ICM test M̃_n with other conditional independence tests. Su and White's (2008) test essentially compares f_{XYZ} f_Z with f_{XZ} f_{YZ} and can detect local alternatives at the rate n^{-1/2} h^{-d/4}. Su and White's (2007) test essentially compares f_{Y|X,Z} with f_{Y|Z} and can detect local alternatives at the rate n^{-1/2} h^{-(d_X + d_Z)/4}. Our test compares integral transforms and can detect local alternatives at the rate n^{-1/2}. We first compare all three tests using DGP 1. Figure 7 shows the power functions when the sample size is 100. The GCR test in the figure is the test we propose. It is clear that our test outperforms the SW 2007 test, which in turn outperforms the SW 2008 test. More specifically, while our GCR test has almost the same empirical size as the SW 2007 test, it is more powerful. The SW 2008 test is very conservative and has almost no power when ρ_{X,Y|Z} is small in absolute value; that is, when the departure from the null is small, the SW 2008 test is less able to detect it than our GCR test or the SW 2007 test. Figure 8 shows the power functions when the sample size is increased to 200. The power of our GCR test improves faster than that of SW 2007, which again improves faster than that of SW 2008. These results are consistent with the local alternative rate results.
Finally, we compare the power function of our M̃_n test with the tests proposed by LG (1997) and DG (2001). Figure 9 reports the results for DGP 1 with n = 200. We report only the results for the Cramér-von Mises type test for each method, as the results for the Kolmogorov-Smirnov type tests are qualitatively similar. In the figure, "LG" and "DG" denote the Cramér-von Mises type tests of LG (1997) and DG (2001), respectively. The figure demonstrates the clear advantage of our GCR test: it is as accurate in size as the LG test but more powerful, and it has better finite-sample performance than the DG test in terms of both size and power.
In all the figures, we also report the "gold standard" t-test. This is as good a test as one could want, in the sense that it is the parametric maximum likelihood test for a = 0 in a correctly specified linear model. Although our test is not as powerful as the t-test (which is to be expected, since our test is fully nonparametric), our GCR test does outperform all the other nonparametric tests. On the other hand, the t-test measures only linear dependence; in the presence of nonlinear dependence, the t-test may be less powerful than the nonparametric tests. This is supported by simulation results not reported here.

Application to Returns to Schooling
As stated in the introduction, one important application of tests for conditional independence is to test a key assumption identifying causal effects. In this section, we provide an example.
In the literature on returns to schooling, the most widely investigated structural equation is a Mincer (1974) type semi-logarithmic human capital earnings function: ln Y_i = β_0 + β_1 S_i + β_2 EXP_i + β_3 EXP²_i + U_i, where the subscript i indexes individuals, ln Y_i is log hourly wage, S_i is years of completed schooling, EXP_i is years of work experience, EXP²_i is work experience squared, and U_i represents unobserved drivers of ln Y_i, centered at zero. The effect of interest is β_1, the effect of an additional year of schooling on wages. In what follows, we drop the i subscript.
Least squares estimates of the Mincer equation suffer from the well-known ability bias problem, which is caused by the dependence of schooling on unobserved ability. To make this explicit, let U = A + ε, where A represents unobserved ability, and rewrite the Mincer equation as ln Y = β_0 + β_1 S + β_2 EXP + β_3 EXP² + A + ε.
One method empirical researchers have adopted to address the ability bias issue is to find proxies Z for ability, for example IQ or AFQT scores, and include these as regressors (e.g., Griliches and Mason, 1972; Griliches, 1977; Blackburn and Neumark, 1993). Now consider the regression of ln Y on S, EXP, and Z. The last equality is justified by a conditional mean independence assumption. If this holds, then the partial derivative of the regression function with respect to s equals β_1, so that the effect of interest, β_1, is identified and can be consistently estimated.
There is no reason a priori that the wage equation must take the specific Mincer form, however. More generally, one can consider a nonparametric specification. The crucial condition justifying the third equality is conditional independence of the unobservables and S, given the covariates. This is called a "conditional exogeneity" assumption by White and Chalak (2008). It implies the "ignorability" or "unconfoundedness" condition, also known in the literature as "selection on observables", ensuring identification of causal effects. Thus, if (32) holds, then even if the specific Mincer function (31) does not, we can still identify the average marginal effect of schooling β_1(s, x, z) and consistently estimate it by various methods. If (32) fails, then the marginal effect of interest is no longer identified (see, e.g., White and Chalak, 2008, Theorem 4.1).
We cannot test (32) directly, as A and ε are unobservable. However, following White and Chalak (2010), suppose we can observe V such that V = f(A, ε, X, Z, η), (33) with η ⊥ S | (A, X, Z), where f denotes some unknown function and η is unobserved. Then we can test unconfoundedness by testing the implied condition (34). Equation (33) provides some guidance on how to choose V. The conditional independence requirement on η is particularly plausible when η is a measurement error, so that both Z and V can be error-laden proxies for ability. Here, we test (34) using data from the National Longitudinal Survey of Youth 1979 (NLSY79). In particular, we use data from survey year 2000 and restrict the sample to white males. We use the age-adjusted standardized AFQT score in 1980 as Z. V includes math and verbal scores for preliminary scholastic aptitude tests from 1981 high school transcripts. To satisfy (33), we use years of schooling beyond high school as S, so that V is not affected by S. X includes actual work experience in survey year 2000 and total tenure with the employer in survey year 2000.
To implement the test, we choose φ(·) to be the standard normal PDF and k(·) the sixth-order Gaussian kernel. We choose Γ and the other metaparameters as described in the Monte Carlo section. Applying our M̃_n test, we do not reject the null hypothesis (34) at the 5% level. Thus, we find no evidence refuting the approach commonly used by empirical researchers, providing some support for parametric or nonparametric estimation of the effects of interest.

Concluding Remarks
In this paper, we develop a flexible nonparametric test for conditional independence that is simple to implement, yet powerful. It is consistent against any deviation from the null and achieves local power at the parametric n^{-1/2} rate, despite its nonparametric character. It is also very flexible, as it allows for a rich class of GCR functions.
There are several useful directions for future research. First, we have assumed that the data are IID, but this is not essential for the results; the approach may be straightforwardly extended to a time-series framework, so that we could test, for example, nonlinear Granger causality. Another extension is to modify the test so that it can be used when Z contains both discrete and continuous variables, a case that is often relevant in applied microeconomics; this extension has been considered in Chapter 3 of Huang (2009). A third direction is to study the bandwidth selection problem further. Here, we choose the bandwidth to minimize the mean squared error of Δ̂_{n,h}(γ); ideally, however, one should choose the bandwidth that optimizes the trade-off between size and power.

Appendix of Proofs
Throughout the proofs, we use C to denote a generic constant that may differ across equations or lines.
Here φ_max = sup_{γ∈Γ} sup_{W∈[0,1]^d} φ(W'γ), which is finite under Assumption 3. Using Assumptions 2 and 4, and combining the resulting bounds with (35) and Assumption 5(a), we conclude that R_{n,h}(γ) = o_p(1/√n) pointwise for each γ ∈ Γ. To show the uniformity result sup_{γ∈Γ} |R_{n,h}(γ)| = o_p(1/√n), we employ the theory of U-processes. In particular, we apply Proposition 4 in DG (2001) with their k = 2. The class of functions under consideration is K = {ψ_{h,2}(W_i, W_j; γ) : γ ∈ Γ}. Since |ψ_{h,2}(W_i, W_j; γ)| ≤ 2φ_max |K_h(Z_i − Z_j)|, we can use K̄(W_i, W_j) = 2φ_max |K_h(Z_i − Z_j)| as the envelope function. As sets of linear functions whose subgraphs are half-planes, both {W_i'γ : γ ∈ Γ} and {W_ij'γ : γ ∈ Γ} are VC-type. Under Assumption 3(b), it is then clear that {φ(W_i'γ) : γ ∈ Γ} and {φ(W_ij'γ) : γ ∈ Γ} are also VC-type. Multiplying by a fixed function K_h(·) changes neither the VC property nor the associated VC characteristics. Therefore {ψ_{h,2}(W_i, W_j; γ) : γ ∈ Γ} is VC-type with VC characteristics independent of h. Applying Proposition 4 in DG (2001) then yields the uniform bound for some constant C that does not depend on h. Proof of Lemma 2: Part (a). We first establish an expansion using Assumption 2(b). Similarly, when z_ℓ ∈ (1 − h, 1], the boundary terms can be handled in the same way. If we choose the exponent in (q/(q+1), 1 − q/(q² + q + 1)), which is feasible, then the remainder is of order h^{q+e} for some e > 0. Repeating the above arguments for the other elements of z, and applying the same argument under Assumptions 4 and 2(a)(b), we have therefore proved the bound Ch^{q+e}. (37) Using this result, the expansion holds with an o(h^q) term that is uniform over γ ∈ Γ. The next approximation holds uniformly over γ ∈ Γ and (x, z) ∈ [0, 1]^{d_X + d_Z}.
Using this result, we obtain the stated expansion uniformly over γ ∈ Γ. By definition, the leading term evaluated at (Z_i, X_i, Z_i, γ) equals g_{XZ}(X_i, Z_i; γ), so the expansion (39) holds uniformly over γ ∈ Γ. It is easy to see that E[ψ_1(W_i; γ)] = Δ(γ), and the o(h^q) term holds uniformly over γ ∈ Γ. Since B_5(X_i, Y_i, Z_i; γ) is continuous in γ, E sup_{γ∈Γ} |B_5(X_i, Y_i, Z_i; γ)| < ∞, (X_i, Y_i, Z_i) is IID, and Γ is compact, a standard textbook argument shows that a ULLN applies to n^{-1} Σ_{i=1}^n B_5(X_i, Y_i, Z_i; γ). That is, sup_{γ∈Γ} |n^{-1} Σ_{i=1}^n B_5(X_i, Y_i, Z_i; γ)| = O_p(1). Combining this with part (a) completes the proof.