A Comparison of Two Methods for Making Statistical Inferences on Nei's Measure of Genetic Distance

Summary

The delta and jackknife methods can be used to estimate Nei's measure of genetic distance and to calculate confidence intervals for this estimate. Computer simulations were used to study the bias and variance of each estimator and the accuracy of the corresponding approximate 95% confidence intervals. The simulations were conducted using 3 sets of data and several sample sizes. The results showed: (1) the jackknife reduced bias; (2) in 8 out of 9 cases the variance and mean square error of the jackknife estimator were less; (3) a second order jackknife reduced the bias the most but suffered a corresponding increase in variance; (4) both the first order jackknife and delta methods yielded intervals whose confidence levels were approximately equal but less than 95%.


Introduction
Introduction of the gel electrophoresis technique has made it possible for population geneticists to assay large samples of structural gene products in natural populations. The loci detectable by this technique are segments of DNA which code for proteins that perform enzymatic functions. A subclass of all genetic variants for a given locus will code for enzymes which can be differentiated by electrophoresis. These genetic variants for a given locus are thus called electrophoretic alleles. Allele and electrophoretic allele will be used interchangeably. Using estimates of electrophoretic allele frequencies, conservative estimates of genetic variation within and between populations can be obtained. Many problems of interest to population geneticists and evolutionary biologists require that genetic differences between populations be expressed in a single statistic. These problems include the process of speciation (Ayala 1975) and the construction of phylogenetic relationships between species (Ayala 1975, Sneath and Sokal 1973). Suggestions of appropriate statistics have not been lacking (Nei 1973). One widely used measure has been Nei's standard measure of genetic distance (Nei 1971, 1972).
Procedures for estimating Nei's distance and the sampling variance of these estimates have been described previously (Nei 1978, Nei and Roychoudhury 1974). In this paper numerical results will be presented that show Nei's standard estimator of genetic distance is biased upwards when only a small number of loci are sampled. The bias introduced when a small number of loci are sampled has been discussed previously by Nei (1973). When only a small number of individuals have been sampled at a large number of loci, Nei (1978) has derived an unbiased estimator of his distance statistic. The problem of bias reduction when only a small number of loci have been sampled has not been studied. An alternate method of estimating Nei's distance, the jackknife method, is examined. For each estimator the bias, variance and mean square error are determined from 3 Monte Carlo studies. These properties are used as criteria for deciding which of these methods for estimating Nei's distance is best.

Key Words: Jackknife; Delta method.

BIOMETRICS, DECEMBER 1979
In addition to studying the properties of these estimators of genetic distance, the accuracy of confidence intervals constructed about these estimators will be examined.

Nei's Distance Measure
To define Nei's estimate of genetic distance for a given sample, let n = number of loci in the sample, m_i = number of alleles at the ith locus, x_k^{(i)} = the frequency of the kth allele at locus i in population X, and y_k^{(i)} = the frequency of the kth allele at locus i in population Y. It will be assumed that the allele frequencies are known exactly, so Nei's (1978) identity statistics are

j_X^{(i)} = \sum_k (x_k^{(i)})^2, \quad j_Y^{(i)} = \sum_k (y_k^{(i)})^2, \quad j_{XY}^{(i)} = \sum_k x_k^{(i)} y_k^{(i)},

J_X = n^{-1} \sum_i j_X^{(i)}, \quad J_Y = n^{-1} \sum_i j_Y^{(i)}, \quad J_{XY} = n^{-1} \sum_i j_{XY}^{(i)},

where the summation over k goes from 1 to m_i and i goes from 1 to n. The genetic distance, D, between populations X and Y is estimated by

\hat{D}_n = -\ln\left( J_{XY} / \sqrt{J_X J_Y} \right).   (1)

The genetic distance being estimated, D, is defined by (1) with the summations taken over all loci in the genome of the species being studied instead of the sample of size n. In general E(\hat{D}_n) \neq D; that is, \hat{D}_n is a biased estimator, as indicated by Nei (1973).
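The estimator in (1) is simple to compute directly from sampled allele frequencies. The sketch below is illustrative only: the function name and the input layout (one list of allele frequencies per locus, aligned across the two populations) are assumptions, not part of the original paper.

```python
import math

def nei_distance(x_freqs, y_freqs):
    """Nei's standard genetic distance estimated from n sampled loci.

    x_freqs[i][k] and y_freqs[i][k] hold the frequency of the kth
    allele at locus i in populations X and Y (hypothetical layout).
    """
    n = len(x_freqs)
    # Average the per-locus identities j_X, j_Y, j_XY over the n loci.
    J_X = sum(sum(p * p for p in xi) for xi in x_freqs) / n
    J_Y = sum(sum(q * q for q in yi) for yi in y_freqs) / n
    J_XY = sum(sum(p * q for p, q in zip(xi, yi))
               for xi, yi in zip(x_freqs, y_freqs)) / n
    # Equation (1): D_n = -ln(J_XY / sqrt(J_X * J_Y)).
    return -math.log(J_XY / math.sqrt(J_X * J_Y))
```

For two populations with identical allele frequencies J_X = J_Y = J_XY, so the estimate is zero; it grows as the populations share fewer alleles.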

The Delta Method
The delta method provides a recipe for determining approximately the expected value and variance of a function of a random variable (or vector). This is accomplished by expanding the function in a Taylor series about the expected value of the random variable and taking the expected value of the first two terms (see Kendall and Stuart 1969, pp. 231-232, for a typical derivation). Nei and Roychoudhury (1974) have used this Taylor series approximation to obtain an estimator of Var(\hat{D}_n) applicable to populations having polymorphic loci, which is

Var(\hat{D}_n) \approx Var(J_X)/(4 J_X^2) + Var(J_Y)/(4 J_Y^2) + Var(J_{XY})/J_{XY}^2 + Cov(J_X, J_Y)/(2 J_X J_Y) - Cov(J_X, J_{XY})/(J_X J_{XY}) - Cov(J_Y, J_{XY})/(J_Y J_{XY}).

The Jackknife

For a recent review of jackknife methodology see Miller (1974). If \hat{D}_{n-1,i} is defined as equation (1) except that the data for the ith locus has been deleted, n pseudovalues may be defined as

d_{n,i} = n \hat{D}_n - (n-1) \hat{D}_{n-1,i}   (i = 1, 2, . . ., n).

The jackknife estimator \tilde{D}_n is simply the mean of the n pseudovalues,

\tilde{D}_n = n^{-1} \sum_{i=1}^n d_{n,i} = n \hat{D}_n - n^{-1}(n-1) \sum_{i=1}^n \hat{D}_{n-1,i}.   (2)

Tukey (1958) suggested that the n pseudovalues be treated as approximately independent and identically distributed random variables. The pseudovalues can then be used to estimate the variance of \tilde{D}_n using the standard estimator

s^2 = [n(n-1)]^{-1} \sum_{i=1}^n (d_{n,i} - \tilde{D}_n)^2.

If \hat{D}_n is biased and its expected value is D + a/n + b/n^2, then the jackknife will eliminate the 1/n term from the bias. There is also a second order jackknife, \tilde{D}_n^{(2)}, such that E(\tilde{D}_n^{(2)}) = D + O(1/n^3). It is defined as

\tilde{D}_n^{(2)} = [n \tilde{D}_n - (n-2) n^{-1} \sum_{i=1}^n \tilde{D}_{n-1,i}] / 2,

where \tilde{D}_{n-1,i} is the same as (2) except that the data for the ith locus has been deleted.
No simple relationship between \tilde{D}_n and the initial random variables j_X^{(i)}, j_Y^{(i)}, and j_{XY}^{(i)} is apparent in (2). It is thus quite difficult to partition the variance of \tilde{D}_n into inter- and intralocus effects as Nei and Roychoudhury (1974) have done for Var(\hat{D}_n). The use of (2) in theoretical studies could be cumbersome. Here attention will be restricted to data analysis only.

Simulation Procedures
Initially, two genetic populations with n loci are defined. Ten thousand random samples of n loci are then drawn with replacement and \hat{D}_n and \tilde{D}_n are calculated from (1) and (2). 95% confidence intervals are constructed about each of these estimates in each sample assuming that

t(D_n^*) = (D_n^* - D) / \hat{s}(D_n^*),   (3)

where D_n^* is either \hat{D}_n or \tilde{D}_n, has a t distribution with n-1 degrees of freedom (d.f.). The frequency with which the interval included D was tabulated, and the bias, variance and mean square error of each estimator were calculated from the 10,000 observations in addition to the confidence level associated with each method.
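The simulation loop above can be sketched in Python. Everything here is illustrative: a made-up set of per-locus identity values stands in for the paper's three data sets, far fewer replicates than 10,000 are run, and the 97.5% t quantile for 14 d.f. is hard-coded rather than looked up.

```python
import math
import random

random.seed(7)

def nei_D(J_X, J_Y, J_XY):
    return -math.log(J_XY / math.sqrt(J_X * J_Y))

# Hypothetical "genome": per-locus identity values that define D.
JX  = [0.90, 0.50, 0.70, 0.80, 0.60, 0.95, 0.55, 0.85]
JY  = [0.80, 0.60, 0.70, 0.90, 0.50, 0.90, 0.65, 0.75]
JXY = [0.70, 0.40, 0.50, 0.80, 0.30, 0.85, 0.45, 0.65]
m = len(JX)
D_true = nei_D(sum(JX) / m, sum(JY) / m, sum(JXY) / m)

def jackknife_sample(n):
    """Draw n loci with replacement; return jackknife estimate and s.e."""
    idx = [random.randrange(m) for _ in range(n)]
    jx = [JX[i] for i in idx]
    jy = [JY[i] for i in idx]
    jxy = [JXY[i] for i in idx]
    sx, sy, sxy = sum(jx), sum(jy), sum(jxy)
    D_full = nei_D(sx / n, sy / n, sxy / n)
    pseudo = [n * D_full - (n - 1) * nei_D((sx - jx[i]) / (n - 1),
                                           (sy - jy[i]) / (n - 1),
                                           (sxy - jxy[i]) / (n - 1))
              for i in range(n)]
    D_jack = sum(pseudo) / n
    se = math.sqrt(sum((d - D_jack) ** 2 for d in pseudo) / (n * (n - 1)))
    return D_jack, se

n, reps = 15, 2000
t_975 = 2.145  # upper 97.5% point of t with n-1 = 14 d.f. (hard-coded)
hits = 0
for _ in range(reps):
    d, se = jackknife_sample(n)
    if d - t_975 * se <= D_true <= d + t_975 * se:
        hits += 1
coverage = hits / reps  # empirical confidence level of the nominal 95% interval
```

Tabulating `coverage` against the nominal 0.95 is the comparison reported in Tables 2, 3 and 4 of the paper.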
The simulations were carried out using three sets of data to define the allelic frequencies for each of the genetic populations. Information concerning these is summarized in Table 1. From these data, the j^{(i)} vectors, j^{(i)} = (j_X^{(i)}, j_Y^{(i)}, j_{XY}^{(i)})^T, for the n loci randomly sampled with replacement were used to compute the Monte Carlo statistics. Given i, j^{(i)} was assumed known, which is akin to assuming the allelic frequencies were estimated without error. This procedure seems justified since Nei and Roychoudhury (1974) [...] where k is the most frequent allele in population Y. Since only 5 significant digits of j_X^{(i)}, j_Y^{(i)} and j_{XY}^{(i)} were recorded, this modification of the data had no effect on the values of j_X^{(i)} or j_Y^{(i)}, but did produce non-zero j_{XY}^{(i)} so that \hat{D}_n and \tilde{D}_n could be defined. D for this altered set of data was decreased by only 2 x 10^{-4} distance units.

Results
A summary of the results is given in Tables 2, 3 and 4. One run calculating the second order jackknife was carried out with the data of Ayala, Tracey, Barr, McDonald and Perez-Salas (1974) using n = 5. The main advantage of the second order jackknife is its potential bias-reducing properties. This potential was realized but was coupled with a correspondingly high increase in variance. Because of this undesirable effect and the large number of computations necessary, the second order jackknife was not run for larger sample sizes or different data.
The simulation provided a large but finite set of data from which to estimate the bias, confidence level, etc. Therefore confidence intervals have been placed on most of these estimates in Tables 2, 3 and 4 to support conclusions drawn in the discussion.
The estimate of the bias of each estimator was obtained as the deviation of the mean value of each from the parametric value of D (Table 1). The variance of this estimated bias was also estimated and used in calculating a confidence interval for the % bias as shown in Tables 2, 3 and 4, and equal tail confidence intervals were calculated for the variance.

[Table: Ayala et al.'s (1974) data, D = 0.499; \tilde{D}_n = the jackknife estimator, \hat{D}_n = the delta estimator, \tilde{D}_n^{(2)} = the second order jackknife estimator. † Based on numerical evaluation of (4). †† Estimated from the 10,000 values of \hat{D}_n and \tilde{D}_n.]
During the simulation the computer kept track of the number of times, x, that each calculated confidence interval using \hat{D}_n or \tilde{D}_n included D. Obviously the proportion x/10,000 has a binomial variance, [(x/10,000)(1 - x/10,000)]/10,000. To further examine the distributional properties of t(\hat{D}_n), the frequency of |t(\hat{D}_n)| > 2.5 in several runs is presented in Table 5, along with that expected assuming t(\hat{D}_n) ~ t_{n-1}.
Discussion

The major difficulty in obtaining a best estimator of D is that standard solutions such as maximum likelihood estimators are not available, since nothing of sufficient accuracy can be said about the distribution of the random vector (j_X^{(i)}, j_Y^{(i)}, j_{XY}^{(i)}). The two alternative methods considered here both suffer from relatively large variances. This problem has been emphasized previously by Nei (Li and Nei 1975, Nei 1975, Nei and Roychoudhury 1974). Nei stressed the importance of studying a large number of loci. There are, however, many studies of natural populations where less than 30 loci have been sampled and very few with more than 40 studied.

[Table 5: Observed and expected frequency of |t(\hat{D}_n)| > 2.5 for simulation studies of Hedgecock's (1978) and Avise and Ayala's (1976) data.]
\hat{D}_n also has the drawback that it will be biased when a small number of loci are sampled.
The approximate magnitude of this bias is given by

E(\hat{D}_n) - D \approx Var(J_{XY})/(2 J_{XY}^2) - Var(J_X)/(4 J_X^2) - Var(J_Y)/(4 J_Y^2).   (4)

In Tables 2, 3 and 4 the expected bias from equation (4) is given in addition to the observed bias in the simulations. In all cases \hat{D}_n was biased upwards. It can be seen that equation (4) gives a good estimate of the magnitude of the bias for sample sizes of 15 and 30 but is rather poor for a sample size of 5. For small sample sizes, third and higher order terms in the Taylor series are of sufficient magnitude that equation (4) is no longer a good approximation of the bias. The magnitude of the expected and the observed bias increases with increasing D. This result is in accord with equation (4): large values of D imply small values of J_{XY}, and the first term of equation (4) is inversely proportional to J_{XY}^2. Examination of Tables 2, 3 and 4 shows the jackknife to be quite effective at reducing the bias. In all cases with n >= 15, \tilde{D}_n has either no detectable bias or it is <0.1%. One exception is \tilde{D}_{15} in Table 4. In this case the bias of the jackknife estimator is still about one-third that of the delta estimator.
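As a numeric illustration of equation (4), the sketch below compares the delta estimator's Monte Carlo bias against the two-term Taylor approximation on a made-up set of loci. All values are illustrative; because loci are drawn with replacement, the per-locus population variances are divided by n to give Var(J_X), Var(J_Y) and Var(J_XY).

```python
import math
import random

random.seed(3)

def nei_D(J_X, J_Y, J_XY):
    return -math.log(J_XY / math.sqrt(J_X * J_Y))

# Hypothetical per-locus identity values.
JX  = [0.9, 0.5, 0.7, 0.8, 0.6]
JY  = [0.8, 0.6, 0.7, 0.9, 0.5]
JXY = [0.7, 0.4, 0.5, 0.8, 0.3]
m = len(JX)

def mean(v):
    return sum(v) / len(v)

def pvar(v):
    mu = mean(v)
    return sum((x - mu) ** 2 for x in v) / len(v)

D = nei_D(mean(JX), mean(JY), mean(JXY))
n = 30  # loci per sample

# Equation (4): the J's are means of n draws, so Var(J) = pvar / n.
bias_taylor = (pvar(JXY) / n) / (2 * mean(JXY) ** 2) \
            - (pvar(JX) / n) / (4 * mean(JX) ** 2) \
            - (pvar(JY) / n) / (4 * mean(JY) ** 2)

# Monte Carlo bias of the delta (standard) estimator.
reps = 20000
total = 0.0
for _ in range(reps):
    idx = [random.randrange(m) for _ in range(n)]
    total += nei_D(mean([JX[i] for i in idx]),
                   mean([JY[i] for i in idx]),
                   mean([JXY[i] for i in idx]))
bias_mc = total / reps - D
```

Because \hat{D}_n separates into -ln J_{XY} + (1/2) ln J_X + (1/2) ln J_Y, no covariance terms enter the second-order bias, and for n = 30 the Taylor value tracks the simulated bias closely, mirroring the paper's finding that (4) is accurate for n of 15 and 30.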
In all but one case the variance and the mean square error of \tilde{D}_n are less than those of \hat{D}_n. The jackknife's reduction in variance is not as dramatic as the bias reduction. For instance, the variance of \tilde{D}_{30} is only 2% less than the variance of \hat{D}_{30} for Hedgecock's data, 6% less for Ayala's data and 14% less for Avise's data. While these are modest reductions in variance, this fact, coupled with the substantial bias reduction of \tilde{D}_n, makes it the superior estimator. These desirable properties seem to hold over a wide range of sample sizes and values of D.
The last problem considered is the estimation of confidence intervals about \hat{D}_n and \tilde{D}_n.
There are no detectable differences between the confidence levels of intervals generated around \hat{D}_n and \tilde{D}_n. Most intervals in Tables 2, 3 and 4 have confidence levels <95%. This indicates that the assumptions that allow one to infer that equation (3) has a t distribution with n-1 d.f. are not entirely correct. One assumption that does not hold is that \hat{D}_n and \tilde{D}_n are unbiased estimators of D. As n gets larger the confidence levels associated with each method approach 95%. This is due, at least partly, to the fact that for large n, E(\hat{D}_n) and E(\tilde{D}_n) approach D.
An indication of how well \hat{D}_n and \tilde{D}_n are described by a t distribution is given in Table 5.
One striking result is the large discrepancy between the expected and the observed values for Hedgecock's data even when n = 30. This problem occurs because the distribution of \hat{D}_n is not symmetric but is truncated at 0. When D is small this truncation causes the distribution of t(\hat{D}_n) to deviate substantially from a t distribution. Potential solutions to this problem are currently under investigation.