Many important social and economic variables are naturally defined for pairs of agents (or dyads). Examples include trade between pairs of countries (e.g., Tinbergen, 1962), input purchases and sales between pairs of firms (e.g., Atalay et al., 2011), research and development (R&D) partnerships across firms (e.g., Konig et al., 2019) and friendships between individuals (e.g., Christakis et al., 2020); see Graham (2020) for many other examples and references. While the statistical analysis of network data began almost a century ago, rigorously justified methods of inference for dyadic or network statistics are only now emerging (cf. Goldenberg et al., 2010).
This dissertation studies statistical inference problems for dyadic data. Throughout, I focus on target parameters of fundamental theoretical and applied interest: density functions, regression functions, density-weighted average derivatives, and coefficients in linear regressions. Dyadic data exhibits a distinctive form of local dependence: the random variables attached to any two dyads that share one or two indices (agents) may be dependent. The four chapters of this dissertation develop a broad set of theoretical results for estimation and inference in nonparametric, parametric, and semiparametric models for dyadic data, and present generic, and in some cases surprising, implications of this local dependence.
In Chapter 1 I study nonparametric estimation of density functions for undirected dyadic random variables (i.e., random variables defined for all n=N(N-1)/2 unordered pairs of agents/nodes in a weighted network of order N). In this setting, I show that density functions may be estimated by applying the kernel estimation method of Rosenblatt (1956) and Parzen (1962). I propose an estimator of the asymptotic variance of these density estimates, inspired by a combination of (i) Newey's (1994) method of variance estimation for kernel estimators in the “monadic” setting and (ii) a variance estimator for the (estimated) density of a simple network first suggested by Holland and Leinhardt (1976). More unusual are the rates of convergence and asymptotic (normal) distributions of these dyadic density estimates. Specifically, I show that they converge at the same rate as the (unconditional) dyadic sample mean: the square root of the number, N, of nodes. This differs from the results for nonparametric estimation of densities and regression functions with monadic data, where the estimators generally converge more slowly than the corresponding sample mean. I then study the robustness of normality-based and bootstrap-based inference procedures. Because the distribution of the kernel density estimator depends on both the unknown presence or absence of dyadic dependence and the bandwidth choice, approximating its distribution well across a wide range of scenarios is both nonstandard and especially desirable. Toward this goal, I first establish the robustness of normality-based inference by showing that the variance estimator is consistent and the asymptotic normal approximation is valid under both dependence regimes, with both conventional and small-bandwidth asymptotics (Cattaneo et al., 2014). I then establish the inconsistency of a wide class of generalized bootstrap procedures (tailored toward U-statistics) in this setting.
Finally, I propose a simple modification of the bootstrap procedure and show that its consistency holds robustly. The chapter ends with a semiparametric efficiency bound calculation for density estimation, which shows that the kernel density estimator attains the optimal asymptotic variance. Sections 1.1, 1.2, and 1.3 of this chapter are joint work with Bryan Graham and James Powell.
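As a rough illustration of the kind of estimator studied in Chapter 1, the following Python sketch averages a kernel over the n=N(N-1)/2 unordered dyads. The Gaussian kernel, the function names `dyadic_kde` and `simulate_dyads`, and the simulated design W_{ij}=A_i+A_j+V_{ij} are assumptions made for this example only; they are not taken from the dissertation.

```python
import numpy as np

def dyadic_kde(W, w, h):
    """Kernel density estimate at the point w from undirected dyadic outcomes.

    W : (N, N) symmetric array; only the entries with i < j are used.
    h : bandwidth (a user choice in this sketch).
    """
    N = W.shape[0]
    iu = np.triu_indices(N, k=1)                   # the N(N-1)/2 unordered pairs
    u = (w - W[iu]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel (assumed)
    return K.mean() / h

def simulate_dyads(N, rng):
    """Hypothetical design with dyadic dependence: W_ij = A_i + A_j + V_ij."""
    A = rng.standard_normal(N)                     # agent-level effects
    V = np.triu(rng.standard_normal((N, N)), k=1)  # dyad-level shocks, i < j
    V = V + V.T                                    # make the shocks symmetric
    return A[:, None] + A[None, :] + V

rng = np.random.default_rng(0)
W = simulate_dyads(300, rng)
f_hat = dyadic_kde(W, w=0.0, h=0.3)
```

In this simulated design each W_{ij} is N(0,3), so `f_hat` can be compared with the true density at zero, 1/sqrt(6π) ≈ 0.230. Dyads that share an agent are correlated through A_i, which is the local dependence described above.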
In Chapter 2 I study nonparametric estimation of regression functions for directed dyadic data. Let i=1,...,N index a simple random sample of units drawn from some large population. For each unit, researchers observe the vector of regressors X_i and, for each of the N(N-1) ordered pairs of units, an outcome Y_{ij}. The outcomes Y_{ij} and Y_{kl} are independent if their indices are disjoint, but may be dependent otherwise (i.e., they are “dyadically dependent”). Let W_{ij}=(X_i',X_j')'; using the sampled data, I seek to construct a nonparametric estimate of the mean regression function g(W_{ij})=E[Y_{ij}|X_i,X_j]. I present two sets of results. First, I calculate lower bounds on the minimax risk for estimating the regression function (i) at a point and (ii) under the infinity norm. Second, I calculate (i) pointwise and (ii) uniform convergence rates for the dyadic analog of the familiar Nadaraya-Watson (NW) kernel regression estimator. I show that the NW kernel regression estimator achieves the optimal rates suggested by the risk bounds when an appropriate bandwidth sequence is chosen. This optimal rate differs from the one available under iid data: the effective sample size is smaller and d_W=dim(W_{ij}) influences the rate differently. This chapter is joint work with Bryan Graham and James Powell.
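The dyadic NW estimator described above can be sketched in Python as a kernel-weighted average over the N(N-1) ordered pairs. This is a minimal sketch that assumes a scalar regressor X_i, a product Gaussian kernel, and a single common bandwidth; the function name `dyadic_nw` and the simulated design are hypothetical and not taken from the dissertation.

```python
import numpy as np

def dyadic_nw(Y, X, w, h):
    """NW estimate of g(w) = E[Y_ij | X_i = w[0], X_j = w[1]] for scalar X.

    Y : (N, N) array of directed outcomes; the diagonal is ignored.
    X : (N,) array of agent-level regressors.
    """
    N = X.shape[0]
    i, j = np.where(~np.eye(N, dtype=bool))   # all N(N-1) ordered pairs, i != j
    ui = (X[i] - w[0]) / h
    uj = (X[j] - w[1]) / h
    K = np.exp(-0.5 * (ui**2 + uj**2))        # product Gaussian kernel (assumed);
                                              # normalizing constants cancel in the ratio
    return np.sum(K * Y[i, j]) / np.sum(K)

# Hypothetical design: g(x_i, x_j) = x_i + 2 x_j, with agent-level errors U_i, U_j
# that make outcomes sharing an index dependent.
rng = np.random.default_rng(0)
N = 200
X = rng.uniform(-1, 1, N)
U = 0.3 * rng.standard_normal(N)
Y = (X[:, None] + 2 * X[None, :] + U[:, None] + U[None, :]
     + 0.5 * rng.standard_normal((N, N)))
g_hat = dyadic_nw(Y, X, w=(0.3, -0.2), h=0.2)  # true g(0.3, -0.2) = -0.1
```

Because the errors U_i and U_j are shared across all dyads involving the same agent, averaging over more dyads per agent does not remove them, which gives some intuition for why the effective sample size is smaller than the number of dyads.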
In Chapter 3 I study estimation of the density-weighted average derivative for directed dyadic data. This parameter is of substantial practical interest because it is proportional to the coefficients in single index models (Powell et al., 1989), which encompass various models of limited dependent variables. Besides carefully setting up the directed dyadic single index regression model with both monadic and dyadic explanatory variables, the main contributions of this chapter are (i) extending the kernel-based estimator of the density-weighted average derivative from the “monadic” iid setting (e.g., Stoker, 1986; Powell et al., 1989; Newey and Stoker, 1993) to directed dyadic data and (ii) proving its robust asymptotic normality using an asymptotic quadratic approximation; that is, asymptotic normality holds under both nondegeneracy and degeneracy and across a wide range of bandwidth sequences. This result presents an interesting contrast between this kernel-based semiparametric estimator and the sample mean of dyadic data, which is asymptotically non-normal when dyadic dependence is absent and for which no uniformly valid, nonconservative inference procedure exists (Menzel, 2021). This chapter begins my analysis of semiparametric models for dyadic data, which continues in the next chapter.
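For orientation, the proportionality to single index coefficients can be written out in the familiar monadic iid setting of Powell et al. (1989). The notation below (δ for the parameter, f for the density of X, G for the link function) is introduced here for illustration and is an assumption; the dyadic version developed in the chapter may be set up differently.

```latex
% Monadic benchmark: g(x) = E[Y | X = x], f = density of X.
\delta \;=\; E\!\left[ f(X)\,\frac{\partial g(X)}{\partial x} \right]
       \;=\; -2\,E\!\left[ Y\,\frac{\partial f(X)}{\partial x} \right],
% where the second equality follows from integration by parts when f
% vanishes on the boundary of its support.
% Under a single index model g(x) = G(x'\beta):
\frac{\partial g(x)}{\partial x} = G'(x'\beta)\,\beta
\quad\Longrightarrow\quad
\delta = E\!\left[ f(X)\,G'(X'\beta) \right]\beta \;\propto\; \beta .
```

The second expression for δ is what makes a kernel-based estimator natural: the unknown density derivative can be replaced by the derivative of a kernel density estimate.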
In Chapter 4 I study error components models for dyadic data, a major motivation for which is separating the monadic and dyadic components of variation. The development parallels that of error components models for panel data: I progressively enrich the random effects model, moving from no covariates to covariates and from homoskedasticity to multiplicative heteroskedasticity. Throughout, I focus on estimating the coefficients in a linear regression that includes both monadic and dyadic explanatory variables. To understand the nature of the estimation problem under different error components models, I study the performance of intuitive OLS estimators, propose more efficient estimators, calculate the asymptotic efficiency bounds (Cramer-Rao lower bound, CRLB), and compare the efficiency bounds to the variances of the estimators. Under homoskedasticity, I prove that the sample mean, which converges at rate O(N^{-1/2}), and the least squares estimator with a double-differencing operation, which converges at rate O(\binom{N}{2}^{-1/2}), achieve the CRLB and are asymptotically efficient for estimating, respectively, the marginal expectation and the coefficients on dyadic variables in a linear regression. Under unknown multiplicative heteroskedasticity, I show that the intuitive two-step semiparametric generalized score estimator of the linear regression coefficients, a natural extension of the classical feasible generalized least squares (FGLS) estimator for linear regression with heteroskedasticity in the “monadic” iid setting, is not adaptive to the unknown heteroskedasticity. Its convergence rate is faster than that of the OLS estimator, O(N^{-1/2}), but slower than the rate suggested by the CRLB, O(\binom{N}{2}^{-1/2}). This contrasts with a familiar result in the monadic iid setting, where a two-step semiparametric generalized score estimator often does achieve adaptivity and the CRLB.
The gap between the performance of the best available estimator and the CRLB suggests that, for this estimation problem with dyadic data, either a better estimator exists that is adaptive and achieves the CRLB, or a tighter efficiency bound exists. I leave closing this gap to future research.
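To give a sense of how double differencing removes monadic components, the Python sketch below uses one common form of the operation: for four distinct agents (i, j, k, l), the combination Y_{ij} - Y_{ik} - Y_{lj} + Y_{lk} eliminates additive agent effects. The model Y_{ij} = βX_{ij} + A_i + A_j + V_{ij}, the random sampling of tetrads, and the function name `double_difference_ols` are assumptions made for illustration; the estimator analyzed in Chapter 4 may be constructed differently.

```python
import numpy as np

def double_difference_ols(Y, X, n_tetrads, rng):
    """Least squares on double-differenced data from randomly drawn tetrads.

    Y, X : (N, N) symmetric arrays of outcomes and a scalar dyadic regressor.
    """
    N = Y.shape[0]
    # Each row holds four distinct agents (i, j, k, l) in random order.
    idx = np.argsort(rng.random((n_tetrads, N)), axis=1)[:, :4]
    i, j, k, l = idx.T

    def dd(M):
        # A_i + A_j - (A_i + A_k) - (A_l + A_j) + (A_l + A_k) = 0
        return M[i, j] - M[i, k] - M[l, j] + M[l, k]

    dY, dX = dd(Y), dd(X)
    return np.sum(dX * dY) / np.sum(dX * dX)

def symmetrize(M):
    """Keep the strict upper triangle and mirror it."""
    M = np.triu(M, k=1)
    return M + M.T

# Hypothetical design in which X_ij is correlated with the agent effects.
rng = np.random.default_rng(0)
N, beta = 40, 1.5
A = 2.0 * rng.standard_normal(N)
Z = symmetrize(rng.standard_normal((N, N)))
V = symmetrize(rng.standard_normal((N, N)))
X = 0.5 * (A[:, None] + A[None, :]) + Z
Y = beta * X + A[:, None] + A[None, :] + V
beta_dd = double_difference_ols(Y, X, n_tetrads=5000, rng=rng)
```

After differencing, only the dyad-level variation in X_{ij} and V_{ij} remains, which is why an estimator of this type can converge at a rate governed by the number of dyads rather than the number of agents.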