Molecular Polymorphism: How Much Is There and Why Is There So Much?*

The evidence for genetic variation can be traced to Mendel’s experiments: The discovery of the laws of heredity was made possible by the expression of segregating alleles. Since that time, the study of genetic variation in natural populations has been characterized by a gradual discovery of ever-increasing amounts of genetic variation. In the early decades of this century geneticists thought that an individual is homozygous at most gene loci and that individuals of the same species are genetically almost identical. Recent discoveries suggest that, at least in outcrossing organisms, the DNA sequences inherited one from each parent are likely to be different for nearly every gene locus in every individual: ie, that every individual may be heterozygous at most, if not all, gene loci. But the efforts to obtain precise estimates of genetic variation have been thwarted for various reasons.


INTRODUCTION
This paper consists of two parts. First, I review the question of how much molecular genetic variation exists in natural populations. Second, I present some general considerations concerning the processes that contribute to maintain that variation.

PROTEIN POLYMORPHISMS
Genetic variation is an attribute that cannot be exhaustively measured. It is not possible, even if we wanted to, to examine every gene in every individual of a given species, so as to obtain a complete enumeration of the genetic variation in the species. The well-known solution in such a situation is to measure a sample from the group to be evaluated. Two conditions need to be met for a valid extension of the results from a sample to the whole set. First, the sample must be representative or unbiased; second, the sample must be accurately measured. In the case at hand, the requirement that the sample be unbiased applies to two levels: (1) the individual organisms sampled must be, on the average, neither more nor less genetically variable than the population as a whole; (2) the genes sampled must be neither more nor less polymorphic, on the average, than the whole genome. And the condition of accuracy requires that genes that are different be identified as such; ie, it requires that every allelic variant be recognizable.
Neither one of these two necessary conditions for valid sampling has been met in the study of genetic variation. There is no serious difficulty in sampling individuals that are, on the average, as genetically variable as the population as a whole. An important consideration is that the individuals sampled not be particularly either inbred or interrelated; but this is not difficult to satisfy. The difficulty lies in choosing the genes to be sampled. With the methods of Mendelian genetics, the existence of a gene is ascertained by examining the progenies of crosses between individuals showing different forms of a given character; from the proportion of individuals in the various classes, we infer whether one or more genes are involved. By such methods the only genes known to exist are those that are variable. There is no way of obtaining an unbiased sample of the genome, because invariant genes cannot be included in the sample of genes to be examined.
A way out of this problem became possible with the advent of molecular genetics. The genetic information encoded in the coding sequence of the DNA of a structural gene is translated into the sequence of amino acids making up a polypeptide. One can select for study a series of proteins without previously knowing whether or not they are variable in a population, that is, a series of proteins that, with respect to variation, are an unbiased sample of all the structural genes in the organism. If a protein is found to be invariant among individuals, it is inferred that the gene coding for the protein is also invariant; if the protein is variable, the gene is inferred also to be variable and one can measure how variable it is, ie, how many variant forms of the protein exist, and in what frequencies.
Gel electrophoresis is a fairly simple technique that makes possible the study of protein variation with only a moderate investment of time and money. Since the 1960s, genetic variation has been studied in a large variety of organisms by gel electrophoresis. It was clear from the beginning of these studies that not all allelic variants are detected by electrophoresis, and hence that the condition of accuracy is not satisfied. But because genes for electrophoretic studies can be chosen without regard to how variable the genes are, many investigators thought that electrophoresis would provide estimates of variation in structural genes that would be accurate to a first approximation.
Electrophoretic data give the frequency of electromorphs (proteins that differ in electrophoretic mobility). Proteins encoded by different alleles may yield indistinguishable electromorphs, but as a first approximation it is assumed that each electromorph corresponds to only one allele. A variety of statistics can be used to summarize the amount of genetic variation in a population. The most extensively used measures are the polymorphism (P) and the heterozygosity (H). P is simply the proportion of loci found to be polymorphic in the sample. Usually, a locus is considered polymorphic when the frequency of the most common allele (electromorph) is no greater than a certain value, such as 0.99 or 0.95. In outcrossing organisms, H estimates the average frequency of heterozygous loci per individual or, what is equivalent, the average frequency of heterozygous individuals per locus. In naturally inbred organisms, H is a good measure of genetic variation in a population only if it is calculated from the allelic frequencies as the "expected" frequency of heterozygous individuals on the assumption of Hardy-Weinberg equilibrium. H is a better measure of genetic variation than P for most purposes, because it is more precise [1]. A related measure also used by population geneticists is the effective number of alleles, n_e, which is the reciprocal of the average frequency of homozygous individuals, ie, 1/(1 - H).
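The three summary statistics just defined can be illustrated with a minimal sketch. The electromorph frequencies below are hypothetical and serve only to show the arithmetic; the function names are likewise illustrative. H is computed as the expected Hardy-Weinberg heterozygosity averaged over loci, and the 0.95 criterion for polymorphism is one of the two conventional thresholds mentioned above.

```python
def expected_heterozygosity(allele_freqs):
    """Expected Hardy-Weinberg heterozygosity at one locus: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in allele_freqs)


def summarize_variation(loci, threshold=0.95):
    """Return (P, H, n_e) for a list of loci.

    Each locus is a list of allele (electromorph) frequencies summing to 1.
    A locus counts as polymorphic when its most common allele has a
    frequency no greater than `threshold`.
    """
    n_polymorphic = sum(1 for freqs in loci if max(freqs) <= threshold)
    P = n_polymorphic / len(loci)
    H = sum(expected_heterozygosity(freqs) for freqs in loci) / len(loci)
    n_e = 1.0 / (1.0 - H)  # reciprocal of the average homozygosity
    return P, H, n_e


# Hypothetical electromorph frequencies at four loci (not real data).
hypothetical_loci = [
    [1.0],            # invariant locus
    [0.9, 0.1],
    [0.5, 0.3, 0.2],
    [0.98, 0.02],     # variable, but not polymorphic by the 0.95 criterion
]

P, H, n_e = summarize_variation(hypothetical_loci)
print(f"P = {P:.2f}, H = {H:.4f}, n_e = {n_e:.3f}")
```

Note that the fourth hypothetical locus contributes to H although it does not count toward P, which is one reason H is the less arbitrary of the two measures.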
Electrophoretic studies have established that natural populations of most organisms possess large stores of genetic variation, even though not all variants are detected. Table 1 shows that the average heterozygosity is about 6.0% for vertebrates and about 13.4% for invertebrates, although considerable heterogeneity exists within each of these groups. Plants, even those reproducing by self-fertilization, also have considerable genetic variation. The average proportion of polymorphic loci in a population lies between 20 and 50% for most animal or plant species.
How accurate are electrophoretic estimates? That is, what proportion of the total variation is detected by electrophoretic techniques? Electrophoresis cannot, of course, detect nucleotide substitutions that do not change the encoded amino acids.
The question is what proportion of amino acid substitutions are detected. Some biologists have argued that electrophoresis detects only substitutions that change the net electric charge of the encoded proteins and have calculated that about 67% of all amino acid substitutions are electrophoretically cryptic [2]. It is now known, however, that electrically neutral changes can, at least in some cases, be detected [3].
The question raised could ultimately be resolved by obtaining the amino acid sequence of a sufficiently large number of electromorphs with identical electrophoretic mobility. This is clearly not feasible at present because of the enormous time and cost required. A variety of other, less satisfactory, methods have manifested the existence of electrophoretically cryptic variation. The methods used include sequential electrophoresis, heat denaturation, urea denaturation, and peptide mapping.
Sequential electrophoresis consists of performing electrophoresis of the same samples under diverse conditions; eg, using different buffers or different gel concentrations. If tissue samples or enzymes are exposed to high temperature or some other denaturing agent such as urea, two proteins with identical electrophoretic mobility may become distinguishable because one but not the other is denatured by the treatment. Peptide mapping, or "fingerprinting," is practiced by digesting the proteins with trypsin or some other enzyme that hydrolyzes the polypeptides into a number of small peptides; these are then subjected to two-dimensional chromatography or to chromatography in one dimension and electrophoresis in the other.
Table 2 summarizes results obtained by sequential electrophoresis and two denaturation methods in three species of Drosophila, the only organisms in which several loci have been studied by these methods in a given species. The average increase in heterozygosity is 0.04 by sequential electrophoresis and about 0.08 by the denaturation methods, or an increase in the amount of variation (n'_e/n_e) of between 12 and 25%. The methods used tend to uncover more cryptic variation when the loci sampled are more heterozygous to start with. But the average H of the loci sampled is 0.181 to 0.410, substantially greater than the average of 0.150 observed in Drosophila populations when random samples of loci are assayed. Hence, the increase in variation detected by these methods on a random sample of loci might be somewhat smaller than the values shown in Table 2. The amount of cryptic variation detected at the Adh locus of Drosophila melanogaster by three different methods is displayed in Table 3. As might be expected, peptide mapping detects more cryptic variation than either of the other two techniques. Yet the increase in variation, 20%, is not very large.
If we assume that this value, as an average, is an approximate estimate of the amount of cryptic protein variation, we can calculate the "corrected" amount of genetically determined protein variation in natural populations (Table 4).

DNA-SEQUENCE POLYMORPHISM
It has been known for more than a decade that only a small fraction, perhaps less than 10%, of the nuclear DNA of eukaryotes is translated into protein. The recently developed techniques of DNA cloning and sequencing have shown that genes are separated from each other by long DNA sequences that do not become transcribed into RNA. The genes themselves have a complex organization. At both ends they have relatively short sequences that are present in the mature mRNA transcript, but do not code for amino acids. Most genes contain, in addition, intervening sequences (introns), which separate from each other the segments that code for the amino acids (exons). The introns are transcribed in the nucleus, together with the rest of the gene, but they are spliced out before the mRNA migrates to the cytoplasm. The question of how much genetic variation exists in the DNA of an organism can, thus, be formulated in various ways. One may ask the question about the whole genome or about particular components such as, for example, the coding segments. A number of genes have been sequenced in two or more related species, and it has become apparent that different segments evolve at different rates. This suggests that different kinds of segments may have different levels of polymorphism, a hypothesis recently corroborated by direct evidence.

Slightom et al [4] have sequenced two alleles of the Aγ gene, which codes for one of the polypeptides of fetal hemoglobin (Fig. 1). If the Aγ gene is a typical example, it seems likely that at the level of the DNA sequence every outcrossed individual will be heterozygous at nearly all, if not all, loci (that is, if the noncoding sequences are taken into account). The question of heterozygosity needs to be reformulated in terms of the proportion of nucleotide differences, which may be called nucleotide heterozygosity or nucleotide diversity.
Trying to measure nucleotide heterozygosity, one encounters some ambiguity. If only substitutions are considered, the nucleotide heterozygosity of Aγ is 13/1647 = 0.008. If the deletions are also taken into account, the question arises of how they are to be counted. If each deleted segment is counted as one difference independently of its length, then there are three additional differences between the two alleles and the heterozygosity is 16/1647 = 0.010; if each deleted nucleotide is counted as one difference, then the heterozygosity is 39/1647 = 0.024.
The nucleotide heterozygosity in other genes for which two independent alleles have been sequenced is given in Table 5. Three genes (Adh in Drosophila, Cκ in rats, and Aγ in humans) have substitution heterozygosities between 1 and 2%. The DNA sequenced for Adh and Cκ includes only coding regions and thus no deletions were observed. For the insulin genes the substitution heterozygosity is only 0.003, but the 5' flanking region contains a deletion/insertion of 467 contiguous np, which are within a highly repetitive sequence. The constant region of the heavy chain of mouse immunoglobulin consists of eight proteins. One of these, γ2a, is known to differ extensively from one inbred mouse strain to another. The gene, IgG2a, coding for this protein has been sequenced in two strains. Of the 1,108 bases sequenced, 111 (10%) are different. Only 18 (16.2%) of these nucleotide substitutions are silent; the others yield different amino acids in 15% of the sites. There are reasons to presume that the variation observed in the mouse IgG2a gene may not be typical of structural loci. Immunoglobulin genes are very polymorphic; the two alleles sequenced come from two inbred strains, rather than from outbred individuals; the two proteins were known to be very different before the DNA was sequenced.
Indeed, the frequency of amino acid differences between the two allele products is one order of magnitude greater than the average observed in other kinds of protein.
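The ambiguity in counting deletions, noted above for the fetal hemoglobin gene sequenced by Slightom et al [4], can be made explicit with a short sketch. The counts of 13 substitutions, 3 deleted segments, and 1,647 sites are taken from the text; the figure of 26 deleted nucleotides is inferred here as the difference between the 39 total differences quoted and the 13 substitutions. The function and mode names are illustrative only.

```python
def nucleotide_heterozygosity(substitutions, deletion_events,
                              deleted_nucleotides, length, mode):
    """Proportion of nucleotide differences between two alleles.

    mode selects how deletions are counted:
      "substitutions" : deletions ignored
      "events"        : each deleted segment counts as one difference
      "nucleotides"   : each deleted nucleotide counts as one difference
    """
    if mode == "substitutions":
        differences = substitutions
    elif mode == "events":
        differences = substitutions + deletion_events
    elif mode == "nucleotides":
        differences = substitutions + deleted_nucleotides
    else:
        raise ValueError(f"unknown counting mode: {mode}")
    return differences / length


# Counts from the text; 26 = 39 - 13 is inferred, not stated directly.
counts = dict(substitutions=13, deletion_events=3,
              deleted_nucleotides=26, length=1647)

for mode in ("substitutions", "events", "nucleotides"):
    value = nucleotide_heterozygosity(mode=mode, **counts)
    print(f"{mode:>13}: {value:.3f}")
```

The threefold range between the first and the last convention shows why any reported nucleotide heterozygosity should state how insertions and deletions were treated.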

If we exclude from consideration the IgG2a and insulin genes as being atypical, it would appear that nucleotide heterozygosity may be around 1 or 2%. This must be taken only as a very tentative estimate because of the paucity of the data. Estimates of nucleotide heterozygosity have been obtained in four species of sea urchins by DNA denaturation followed by competitive reassociation ("hybridization"). This technique is inexact but has the advantage that it assays the complete genome of an organism. The results for the single-copy DNA are summarized in Table 6. The estimated frequency of nucleotide substitutions ranges from 2 to 4%. This is not very different from the 1-2% estimate of nucleotide heterozygosity derived from the sequence data. Thus, although quantitative estimates of the amount of DNA-sequence variation cannot be provided with confidence for organisms in general, there can be no doubt that the variation is extensive. If the noncoding regions of genes are included, it seems likely that most, if not all, genes are heterozygous in every outbred individual.

SELECTION VERSUS DRIFT
What is the evolutionary significance of this wealth of protein and DNA variation? Is it adaptive, the stuff from which are built the multifarious adaptations of organisms to their environments? Or is it for the most part evolutionary noise, variations that are tolerated by natural selection because they do not modify any significant function of the organisms? All genetic variations arise first by the mutation process, broadly understood so as to include not only the substitution of one nucleotide by another but also deletions, duplications, and reorganizations of the DNA. If the mutants modify the adaptations of organisms, they will increase or decrease in frequency as a result of natural selection. If they have no effect on adaptation, mutants will drift in frequency as a consequence of random sampling from generation to generation. The hypothesis that considers a mutation or a polymorphism as adaptively neutral is the starting null hypothesis of the population geneticist. In recent years, however, Kimura and others [5-7] have argued that, with respect to DNA and protein evolution, adaptive neutrality is no longer just a null hypothesis, but a notion positively supported by evidence.
Two approaches may be followed to test the hypothesis of neutrality versus natural selection. One consists of testing each particular polymorphism to ascertain whether natural selection is implicated [see 8-12]. The other approach is global: It uses theoretical reasoning or empirical evidence to argue for or against the role of natural selection with respect to a general kind of variation, molecular variation in the case at hand [eg, 13,14].
I want to examine here two general arguments, one positive and the other negative, that have been advanced to support the adaptive neutrality of protein variation.
The positive argument relies on the apparent existence of a molecular evolutionary clock. When the rate of evolution is examined in a protein such as cytochrome c, it is observed that amino acid substitutions have occurred in different branches of the phylogeny at different times and at approximately constant rates. What is meant by the phrase "approximately constant rates" is that the substitutions occur with a constant probability, but stochastic variation is expected.
Langley and Fitch [15] have tested statistically the evolution of seven proteins in 17 mammals and found that the variance in the rate of amino acid substitutions is much too large, and thus inconsistent with the hypothesis that the rate was stochastically constant as predicted by the neutrality theory (Table 7). It is possible, however, to maintain that the rate is stochastically constant but that it has a variance greater than expected from a Poisson distribution [16]. One additional problem with this sort of evidence in support of the neutrality hypothesis is that stochastically constant rates of molecular evolution are also predicted by models of natural selection [17]. Therefore, the existence of a molecular evolutionary clock cannot be used in support of either the neutrality or the adaptive hypothesis.
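What it means for a variance to be "greater than expected from a Poisson distribution" can be shown with a simple index of dispersion. This is only a sketch of the general idea, not the procedure of Langley and Fitch [15]. It assumes, for simplicity, that the counts are numbers of substitutions accumulated along lineages of equal duration, so that a constant-rate Poisson process predicts a variance equal to the mean. The counts are hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean.

    For Poisson-distributed counts the expected value is about 1;
    values well above 1 indicate overdispersion.
    """
    return variance(counts) / mean(counts)


def dispersion_statistic(counts):
    """(n - 1) * variance / mean, approximately chi-square distributed
    with n - 1 degrees of freedom if the counts are Poisson with a
    common mean."""
    return (len(counts) - 1) * dispersion_index(counts)


# Hypothetical substitution counts along seven lineages of equal
# duration (illustrative values, not data from any study).
hypothetical_counts = [12, 30, 7, 25, 16, 41, 9]

print("index of dispersion:", dispersion_index(hypothetical_counts))
print("test statistic (6 df):", dispersion_statistic(hypothetical_counts))
```

For these illustrative counts the statistic is far above 12.59, the 5% critical value of chi-square with 6 degrees of freedom, so a Poisson process with a single constant rate would be rejected. Such a rejection does not by itself distinguish between a clock with inflated variance and a rate that truly differs among lineages.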

HETEROSIS AND GENETIC LOAD
The negative argument offered in support of the neutrality theory is based on the concept of genetic load. The argument is that if some alleles are less adaptive than others, then a number of individuals would have less than optimal genotypes at each polymorphic locus subject to natural selection. If the number of such loci is very large, a population might be unable to withstand the burden of so many poorly fit individuals. This argument deserves to be examined in detail because the neutrality hypothesis was largely proposed as the only alternative left for those who rejected natural selection because of the enormous genetic load that would be created by ubiquitous protein polymorphisms.
The genetic load argument is strongest in the case of heterosis; ie, when a polymorphism is maintained owing to the adaptive superiority of the heterozygotes. Sved et al [18], King [19], and others have suggested that an efficient method for testing whether heterosis plays a major role in natural populations is to compare the fitness of ordinary outbred individuals with the fitness of individuals homozygous for a larger-than-average proportion of loci. This method permits one to ascertain whether heterozygotes are at an overall advantage over homozygotes.
Numerous experiments, particularly in Drosophila, have shown that an increase in homozygosity results in a decrease in fitness. The experiments published before 1970 were, in general, carried out by measuring particular components of fitness, mostly viability [20] and fertility [21,22], and were not, in any case, performed under population conditions [23]. Sved and Ayala [24] devised a method by which fitness as a whole can be measured in Drosophila flies made homozygous for full chromosomes, under conditions of equilibrium population density and a stable age distribution. This method has now been used in a number of experiments that yield consistent results in that the fitness of homozygotes for one full chromosome is invariably very low, in the sublethal range.
The method of Sved and Ayala [24] is as follows. Flies homozygous and heterozygous for whole chromosomes sampled from a natural population are obtained using the method shown in Figure 3. The flies recovered in the F3 are used to establish experimental populations, where the course of natural selection can be studied over many generations. Since the balancer chromosome inhibits recombination, only two kinds of viable zygotes can exist at any time-those homozygous for the wild chromosome and those heterozygous for the wild and the balancer chromosome; all zygotes homozygous for the balancer chromosome die before completing development. If the homozygotes for the wild chromosome have lower fitness than the balancer heterozygotes, a stable equilibrium will eventually be established between the two types of flies. The relative fitness of the homozygotes can be directly calculated from the zygotic equilibrium frequencies. If the balancer heterozygotes have lower fitness than the chromosomal homozygotes, the balancer chromosome will gradually decrease in frequency; the relative fitness of the two kinds of flies can then be estimated from the rate of elimination. Control experimental populations are set up with flies heterozygous for different wild chromosomes, and for these and the balancer chromosome. The heterozygotes for different wild chromosomes have genetic constitutions comparable to flies in a natural population. The control populations permit an estimation of the fitness of balancer heterozygotes relative to wild heterozygotes. This estimate of fitness can be used to estimate the fitness of chromosomal homozygotes relative to flies heterozygous for random combinations of wild chromosomes.
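The logic of estimating homozygote fitness from equilibrium frequencies can be sketched with the standard one-locus, two-allele selection model, treating the wild chromosome (W) and the balancer chromosome (B) as two alleles. This is a simplified illustration under assumptions not stated in the text (random mating, discrete generations, frequencies counted in newly formed zygotes); it is not presented as the estimator actually used by Sved and Ayala [24]. Fitnesses are W/B = 1, W/W = w, and B/B = 0. Equating the marginal fitnesses of the two chromosomes gives the equilibrium frequency of the wild chromosome, p = 1/(2 - w), and hence w = 2 - 1/p.

```python
def next_wild_frequency(p, w_hom):
    """One generation of selection on the wild-chromosome frequency p.

    Genotype fitnesses: W/W = w_hom, W/B = 1, B/B = 0 (lethal).
    """
    q = 1.0 - p
    mean_fitness = p * p * w_hom + 2.0 * p * q  # B/B contributes nothing
    return (p * p * w_hom + p * q) / mean_fitness


def equilibrium_wild_frequency(w_hom):
    """Closed-form equilibrium, valid when w_hom < 1."""
    return 1.0 / (2.0 - w_hom)


def homozygote_fitness_from_equilibrium(p_eq):
    """Invert the equilibrium relation to recover w_hom."""
    return 2.0 - 1.0 / p_eq


def simulate(w_hom, p0=0.5, generations=200):
    """Iterate the recursion from an arbitrary starting frequency."""
    p = p0
    for _ in range(generations):
        p = next_wild_frequency(p, w_hom)
    return p


# Illustrative homozygote fitness in the "sublethal" range.
w = 0.08
p_sim = simulate(w)
print("simulated equilibrium:", round(p_sim, 4))
print("closed form:          ", round(equilibrium_wild_frequency(w), 4))
print("recovered fitness:    ", round(homozygote_fitness_from_equilibrium(p_sim), 4))
```

When the wild homozygotes are fitter than the balancer heterozygotes (w greater than 1 in this notation), no internal equilibrium exists and the balancer is eliminated, which corresponds to the second case described above, where fitness is estimated from the rate of elimination instead.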
With this method, overall fitness rather than a specific fitness component is measured under population conditions. The experiments show that under these conditions all chromosomes become either lethal or semilethal. Figure 4 shows the fitness distribution of 23 second chromosomes sampled from a natural population of Drosophila melanogaster [25]. Although the method is extremely laborious, studies have been conducted in three species of Drosophila. The results are summarized in Table 8.
In order to estimate the number of loci that can be maintained by natural selection in view of the fitness experiments, the assumption is made that selective interactions between loci are multiplicative and that there is no linkage disequilibrium (Table 8). The second or the third chromosomes of D melanogaster are estimated to contain each about 40% of the genome. Therefore, the number of heterozygous loci that could be maintained by heterosis in the whole genome under the assumptions made could be, approximately, 200/0.40 = 500. This is 10% of the 5,000 loci estimated to be present in D melanogaster. The heterozygosity (H) as estimated by electrophoretic methods in D melanogaster is about 0.10; hence, the evidence indicates that all the polymorphisms observed by electrophoresis could be maintained by heterotic natural selection. Therefore, it would seem that arguments of genetic load cannot be used against the hypothesis that many natural polymorphisms are maintained by natural selection.
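The extrapolation from one chromosome to the whole genome is simple arithmetic, restated here as a small sketch. All four input values are those quoted in the text; the variable and function names are illustrative.

```python
def genome_wide_heterotic_loci(loci_per_chromosome, chromosome_fraction):
    """Scale the per-chromosome estimate up to the whole genome."""
    return loci_per_chromosome / chromosome_fraction


loci_per_chromosome = 200   # heterotic loci estimated for one major chromosome
chromosome_fraction = 0.40  # share of the genome in the second or third chromosome
total_loci = 5000           # loci estimated to be present in D melanogaster
electrophoretic_H = 0.10    # heterozygosity estimated by electrophoresis

heterotic_loci = genome_wide_heterotic_loci(loci_per_chromosome, chromosome_fraction)
sustainable_fraction = heterotic_loci / total_loci
observed_heterozygous_loci = electrophoretic_H * total_loci

print("heterotic loci sustainable genome-wide:", round(heterotic_loci))
print("as a fraction of all loci:", round(sustainable_fraction, 2))
print("heterozygous loci per individual implied by H:", round(observed_heterozygous_loci))
```

The number of heterozygous loci that heterosis could sustain (about 500) matches the number implied by the electrophoretic heterozygosity (0.10 of 5,000 loci), which is the basis for the conclusion drawn in the text.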
The calculations just made rely on a number of assumptions. A particularly relevant one is that fitness interactions among loci are multiplicative. Seager et al [27] have performed an experiment to test this hypothesis. The experiment consists of measuring the fitness of flies homozygous (1) for only the second chromosome, (2) for only the third chromosome, and (3) for both the second and the third chromosome. The results are striking. The fitness of D melanogaster homozygous for both the second and the third chromosomes (0.079 ± 0.024) is not significantly different from the fitness of flies homozygous for only the second (0.081 ± 0.014) or only the third chromosome (0.080 ± 0.017). If we assume that fitnesses are multiplicative, the average expected fitness of the double-chromosome homozygotes is 0.0066 ± 0.002; the observed value is more than ten times greater.
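The comparison between the multiplicative expectation and the observed fitness can be reproduced from the mean values quoted above. The sketch multiplies the two single-chromosome means directly; the result (about 0.0065) differs slightly from the 0.0066 reported in the text, possibly because the published figure averages products over individual chromosome combinations, which is an assumption made here rather than something the text states.

```python
def multiplicative_expectation(*fitnesses):
    """Expected joint fitness if the separate effects multiply."""
    expected = 1.0
    for w in fitnesses:
        expected *= w
    return expected


# Mean relative fitnesses quoted in the text (standard errors omitted).
w_second_only = 0.081
w_third_only = 0.080
w_both_observed = 0.079

w_both_expected = multiplicative_expectation(w_second_only, w_third_only)
excess = w_both_observed / w_both_expected

print(f"expected under multiplicative model: {w_both_expected:.5f}")
print(f"observed / expected: {excess:.1f}")
```

The observed double-homozygote fitness exceeds the multiplicative expectation by roughly a factor of twelve, consistent with the statement that it is more than ten times greater.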
The experiments of Seager et al [27] manifest, therefore, large negative synergistic interactions. If we assume that similar synergistic interactions occur when there is homozygosis for one full chromosome or less, then the fitness depression observed in the homozygotes for only one chromosome could account for a number of heterotic loci much greater than calculated above. And it should be noted in addition that other forms of natural selection, such as frequency dependence and the associated phenomenon of overcompensation [28], cause a lesser genetic load than heterosis and may in fact increase the ability of a population to exploit the environmental resources.