Genetic polymorphism: from electrophoresis to DNA sequences

Recent studies indicate that the amount of protein variation undetected by electrophoresis may be reasonably small. Nevertheless, at the protein level, a typical sexually-reproducing organism may be heterozygous at 20 or more percent of the gene loci. Although the evidence is limited, it appears that at the level of the DNA nucleotide sequence every individual is heterozygous at every locus — if introns as well as exons are taken into account. The evidence available does not support the hypothesis that, at least at the protein level, the variation is adaptively neutral.


An elusive problem
Genetic variation is one of the fundamental parameters of the evolutionary process. This is because the evolutionary potential of a population is a function of the amount of genetic variation present in the population at a given time (and also, of course, of the rate of mutation, but this will largely be reflected in the amount of genetic variation present). The positive relationship between amount of genetic variation and rate of evolution has been demonstrated mathematically 2 and corroborated experimentally 3, but it is intuitively obvious as well: the greater the number of variable gene loci and the more alleles there are at each locus, the greater the possibility for change in the frequency of some alleles at the expense of others. The evidence for genetic variation can be traced to Mendel's experiments: the discovery of the laws of heredity was made possible by the expression of segregating alleles. Since that time, the study of genetic variation in natural populations has been characterized by a gradual discovery of ever increasing amounts of genetic variation. In the early decades of this century geneticists thought that an individual is homozygous at most gene loci and that individuals of the same species are genetically almost identical. Recent discoveries suggest that, at least in outcrossing organisms, the DNA sequences inherited one from each parent are likely to be different for every gene locus in every individual; i.e., that every individual may be heterozygous at every gene locus. But the efforts to obtain precise estimates of genetic variation have been thwarted for various reasons. Genetic variation is an attribute that cannot be exhaustively measured. It is not possible, even if we so wanted, to examine every gene in every individual of a given species, so as to obtain a complete enumeration of the genetic variation in the species. The well-known solution in such a situation is to measure a sample from the group to be evaluated.
Two conditions need to be met for a valid extension of the results obtained in the study of a sample to the whole set. First, the sample must be representative or unbiased; second, the sample must be accurately measured. In the case at hand, the requirement that the sample be unbiased applies to 2 levels: a) the individual organisms sampled must be, on the average, neither more nor less genetically variable than the population as a whole; b) the genes sampled must be neither more nor less polymorphic, on the average, than the whole genome. And the condition of accuracy requires that genes that are different be identified as such; i.e., it requires that every allelic variant be recognizable. Neither of the 2 necessary conditions for valid sampling has been met in the study of genetic variation. There is no serious difficulty in sampling individuals that are, on the average, as genetically variable as the population as a whole. An important consideration is that the individuals sampled be neither particularly inbred nor interrelated; but this is not difficult to satisfy. The difficulty lies in choosing the genes to be sampled. With the methods of Mendelian genetics, the existence of a gene is ascertained by examining the progenies of crosses between individuals showing different forms of a given character; from the proportion of individuals in the various classes, we infer whether one or more genes are involved. By such methods, therefore, the only genes known to exist are those that are variable. There is no way of obtaining an unbiased sample of the genome, because invariant genes cannot be included in the sample of genes to be examined.
A way out of this problem became possible with the advent of molecular genetics. The genetic information encoded in the coding sequence of the DNA of a structural gene is translated into the sequence of amino acids making up a polypeptide. One can select for study a series of proteins without previously knowing whether or not they are variable in a population, that is, a series of proteins that, with respect to variation, are an unbiased sample of all the structural genes in the organism. If a protein is found to be invariant among individuals, it is inferred that the gene coding for the protein is also invariant; if the protein is variable, the gene is inferred also to be variable and one can measure how variable it is, i.e., how many variant forms of the protein exist, and in what frequencies.
Gel electrophoresis is a fairly simple technique that makes possible the study of protein variation with only a moderate investment of time and money. Since the 1960s, genetic variation has been studied in a large variety of organisms by gel electrophoresis. It was clear from the beginning of these studies that not all allelic variants are detected by electrophoresis, and hence that the condition of accuracy is not satisfied. But because genes for electrophoretic studies can be chosen without regard to how variable the genes are, many investigators thought that electrophoresis would provide estimates of variation in structural genes that would be accurate to a first approximation. This expectation has not, however, been fulfilled. At present, it appears doubtful that the genes studied by electrophoresis are an unbiased sample of the structural genes, let alone the genome as a whole; and it is questionable whether formulae can be found to transform electrophoretic measures into 'true' estimates of genetic variation even for the genes assayed by electrophoresis. The past few years have witnessed an important new development: techniques for the isolation ('cloning') of genes and other DNA segments and for ascertaining their nucleotide sequence. The condition of accurate measurement is fully satisfied by these techniques, because every nucleotide difference (i.e., every allele) can be detected. And there is hope that the condition of unbiased sampling may also be satisfied, because all sorts of genes, whether translated or only transcribed, and indeed any kind of DNA sequence can be subject to study. Only the future will tell whether these expectations are fulfilled.

Protein polymorphisms
During the first half of the 20th century, it gradually became apparent that genetic variation is pervasive. The evidence came primarily from 3 kinds of study: morphological variation, artificial selection, and inbreeding 4,5. But quantitative measures of genetic variation were not possible: there was no way to determine the proportion of all gene loci that were not variable, nor the degree of polymorphism of variable genes. In the 1960s, gel electrophoresis followed by selective staining provided a simple method for assaying variation in enzymes and other soluble proteins. Electrophoresis makes possible the study of gene loci independent of whether they are variable or not. Most of the proteins assayed are encoded by single gene loci. The gel patterns can, then, be interpreted as single-locus genotypes. Genotypic and allelic frequencies, as well as other relevant genetic information, can be readily obtained. Thus, the way might seem open for obtaining measures of genetic variation, even though these measures are minimum estimates because not all allelic differences are detected by electrophoresis. The application of electrophoretic techniques to the study of genetic variation generated enormous enthusiasm among evolutionists for one additional reason: it provides a method for obtaining genetic information from organisms not suitable for breeding experiments. Organisms with long generations, or that cannot be bred in the laboratory because they live in exotic environments such as the deep sea or for other reasons, could now be assayed for certain genetic parameters. Prior to the electrophoretic revolution, genetic data existed for only a few dozen multicellular organisms. Now, hundreds of different species have been studied by electrophoresis. The number of loci sampled in many species is sufficiently large, 15 or more, so that average estimates of genetic variation can be advanced with some degree of confidence.
A partial summary is given in table 1; reviews can be found in references 6-11. Electrophoretic data give the frequency of electromorphs (proteins that differ in electrophoretic mobility). Proteins encoded by different alleles may yield indistinguishable electromorphs, but as a first approximation it is assumed that each electromorph corresponds to only one allele. A variety of statistics can be used to summarize the amount of genetic variation in a population. The most extensively used measures are the polymorphism (P) and the heterozygosity (H). P is simply the proportion of loci found to be polymorphic in the sample. Usually, a locus is considered polymorphic when the frequency of the most common allele (electromorph) is no greater than a certain value, such as 0.99 or 0.95. In outcrossing organisms, H estimates the average frequency of heterozygous loci per individual or, what is equivalent, the average frequency of heterozygous individuals per locus. In naturally inbred organisms, H is a good measure of genetic variation in a population if it is calculated from the allelic frequencies as the 'expected' frequency of heterozygous individuals on the assumption of Hardy-Weinberg equilibrium. H is a better measure of genetic variation than P for most purposes, because it is more precise 5. A related measure also used by population geneticists is the effective number of alleles, n_e, which is the reciprocal of the average frequency of homozygous individuals, i.e. 1/(1 - H). Electrophoretic studies have confirmed that natural populations of most organisms possess large stores of genetic variation. Even though not all variants are detected, table 1 shows that the average heterozygosity is about 6.0% for vertebrates and about 13.4% for invertebrates, although considerable heterogeneity exists within each of these groups. Even self-fertilizing plants have considerable genetic variation.
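The three summary statistics just described can be illustrated with a minimal sketch. The allele (electromorph) frequencies below are invented for illustration and are not taken from table 1; the function names are likewise hypothetical. The sketch assumes the 0.95 criterion for polymorphism and computes H as the expected Hardy-Weinberg heterozygosity averaged over loci.

```python
def expected_heterozygosity(allele_freqs):
    """Expected Hardy-Weinberg heterozygosity at one locus: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in allele_freqs)


def is_polymorphic(allele_freqs, threshold=0.95):
    """A locus counts as polymorphic when its most common allele
    has a frequency no greater than the threshold."""
    return max(allele_freqs) <= threshold


def summarize(loci, threshold=0.95):
    """Return (P, H, n_e) for a list of per-locus allele-frequency lists."""
    n_loci = len(loci)
    P = sum(is_polymorphic(f, threshold) for f in loci) / n_loci
    H = sum(expected_heterozygosity(f) for f in loci) / n_loci
    n_e = 1.0 / (1.0 - H)  # reciprocal of the average homozygosity
    return P, H, n_e


# Hypothetical electromorph frequencies at four loci (illustration only).
example_loci = [
    [1.0],          # invariant locus
    [0.9, 0.1],
    [0.5, 0.5],
    [0.98, 0.02],   # variable, but not polymorphic by the 0.95 criterion
]

P, H, n_e = summarize(example_loci)
print(f"P = {P:.2f}, H = {H:.4f}, n_e = {n_e:.3f}")
```

The fourth locus shows why H is the more informative statistic: it contributes a small amount to H but nothing to P, and whether it counts toward P depends entirely on the arbitrary threshold chosen.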
The average proportion of polymorphic loci in a population lies between 20 and 50% for most organisms. The question raised could ultimately be resolved by obtaining the amino acid sequence of a sufficiently large number of electromorphs with identical electrophoretic mobility. This is clearly not feasible at present because of the enormous cost and time required. A variety of other, less satisfactory methods have manifested the existence of electrophoretically cryptic variation. The methods used include sequential electrophoresis, heat denaturation, urea denaturation, monoclonal antibodies, and peptide mapping. Electrophoretic studies of genetic variation usually employ a single set of experimental conditions in the assay of a given enzyme. The method of sequential electrophoresis consists of applying a variety of conditions to a given enzyme. The conditions most often varied are gel concentration and buffer pH; typically, 6-10 different sets of conditions are used. Electromorphs that have identical mobility under a set of conditions may be distinguishable when the conditions are changed.

Electrophoretically cryptic variation
The species most extensively examined by sequential electrophoresis is Drosophila pseudoobscura.

The problem of bias
Accuracy (i.e., the ability to discriminate among all allelic products that are different) is one of the conditions that data must meet if we are to obtain valid estimates of genetic variation. The other condition is lack of bias, i.e., the genes studied must be a random representation of all the genes in the organism. The genes surveyed by electrophoresis are structural genes coding for soluble proteins. Whether all such genes are randomly represented in electrophoretic surveys is a question that cannot be answered at present. It is not known, either, whether structural genes coding for nonsoluble proteins are more or less variable than genes coding for soluble proteins. The large majority of the DNA of eukaryotic organisms, however, does not code for proteins. Much of this additional DNA may primarily, or exclusively, have a structural function, but a fraction is involved in gene regulation and this will be of considerable evolutionary import. Regulatory genes, sensu lato, are those that regulate or modify the activity of other genes 11. Thus defined, the enzymes (83%) is significantly affected by genetic variation in chromosomes other than that in which the gene coding for the enzyme is located. Variation in regulatory or modifier genes affecting the activity of other genes is, therefore, pervasive. Unfortunately we do not know how many gene loci may have modifier effects on any given enzyme locus. At present there seems to be no way in which the variation in gene regulation can be quantified using statistics such as H or n_e. With respect to variation in gene regulation, we are at the same stage where we were with respect to structural genes before the use of electrophoretic techniques. We can state that the variation is extensive, but we cannot tell how many regulatory genes are polymorphic or how polymorphic they are.

DNA sequence polymorphism in eukaryotes
It has been known for more than a decade that only a small fraction, perhaps less than 10%, of the nuclear DNA of eukaryotes is translated into protein.  . 1). The 2 alleles are from a single individual, one allele from the paternal and the other from the maternal chromosome. The results are summarized in figure 2. The 3 exons have identical sequences, but nucleotide substitutions occur in the 5' flanking sequence and in both introns. Most of the substitutions occur in the 5' region of the larger intron; the 2 alleles also differ in this region by 2 four-nucleotide gaps. The length of the DNA sequenced is 1468 nucleotide pairs; the total number of substitutions is 24. Depending on whether or not the 2 gaps are counted as differences, the percentage of nucleotide differences is either 2.2 or 1.6%. The constant region of the heavy chain of mouse immunoglobulin consists of 8 proteins. One of these, γ2a, is known to differ extensively from one inbred mouse strain to another. The gene, IgG2a, coding for this protein has been sequenced in 2 strains 34. Of the 1108 bases sequenced, 111 (10%) are different. Only 18 (16.2%) of these nucleotide substitutions are silent; the others yield different amino acids in 15% of the sites. There are reasons to presume that the variation observed in the mouse IgG2a gene may not be typical of structural loci. Immunoglobulin genes are very polymorphic; the 2 alleles sequenced come from 2 inbred strains, rather than from outbred individuals; the 2 proteins were known to be very different before the DNA was sequenced. Indeed the frequency of amino acid differences between the 2 allele products is one order of magnitude greater than the average observed in other kinds of protein.
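The percentages quoted for the two sequenced genes follow from simple ratios, which can be checked with a short sketch. One assumption is made here that the text does not state explicitly: when the 2 gaps are counted, each of the 4 nucleotides in a gap is treated as one difference, which is the reading that reproduces the 2.2% figure. The variable and function names are illustrative.

```python
def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Globin alleles: 1468 nucleotide pairs, 24 substitutions, 2 four-nucleotide gaps.
globin_length = 1468
globin_substitutions = 24
gap_nucleotides = 2 * 4  # assumption: every gap nucleotide counts as one difference

globin_without_gaps = percent(globin_substitutions, globin_length)
globin_with_gaps = percent(globin_substitutions + gap_nucleotides, globin_length)

# Mouse IgG2a: 1108 bases sequenced, 111 different, 18 of them silent.
igg_length = 1108
igg_differences = 111
igg_silent = 18

igg_divergence = percent(igg_differences, igg_length)
igg_silent_share = percent(igg_silent, igg_differences)

print(f"globin: {globin_without_gaps:.1f}% without gaps, {globin_with_gaps:.1f}% with gaps")
print(f"IgG2a:  {igg_divergence:.1f}% different, {igg_silent_share:.1f}% of differences silent")
```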
In any case, the globin gene results 33 suggest that, at least if introns are taken into account, every diploid individual may be heterozygous at virtually every gene locus. When the DNA base-sequence is considered, questions about heterozygosity will have to be answered not in terms of gene loci (because 100% of the loci are likely to be heterozygous), but in terms of nucleotides. And there is evidence indicating that the values reported above (2 and 10% nucleotide differences for the globin and the immunoglobulin gene, respectively) might not be far off the mark.
The genome of eukaryotes consists of single-copy DNA, which typically may be around 70% of the total, and of repetitive DNA. The latter is made up of sequences each represented by several copies, sometimes many thousands, in the genome. Britten et al. 35 and Grula et al. 36 have used techniques for DNA denaturation, followed by competitive reassociation ('hybridization') of the dissociated DNA strands, in order to estimate the amount of nucleotide variation in single-copy DNA. The estimated frequencies of nucleotide substitutions in the 4 species of sea urchins examined are: Strongylocentrotus purpuratus, 4%; S. franciscanus, 3.2%; S. intermedius, 3%; and S. drobachiensis, 2%. The single-copy DNA consists of 2 fractions, one less polymorphic than the other. The less polymorphic fraction makes up the larger part of the DNA. In S. purpuratus the 'heterozygosity' values are 3% and 9% for the less polymorphic and the more polymorphic fractions, respectively. After correction for silent substitutions, 2-4% nucleotide substitutions in translated DNA would yield 5-9% amino acid differences. An electrophoretic study of 12 enzyme systems in S. intermedius has given a heterozygosity estimate of 0.18, which is not very different from the mean value for invertebrates (see table 1). If we assume that H = 0.18 corresponds approximately to 1 amino acid difference per 5 proteins, and that the average length of a protein is 300 amino acids, the electrophoretic data would reflect 1 substitution per 1500 amino acids 36. The 'heterozygosity' value obtained from the reassociation data is about 100 times greater (see above: 5-9% amino acid substitutions are about 1 in 15). The difference may be due in part to the inability of detecting all amino acid substitutions by electrophoresis. But it seems likely that the larger proportion of the nucleotide diversity observed by reassociation involves DNA that does not code for amino acids.
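The order-of-magnitude comparison between the electrophoretic and the reassociation estimates can be retraced numerically. The sketch below uses only the figures given in the text (1 amino acid difference per 5 proteins, 300 amino acids per protein, 5-9% amino acid differences from reassociation); the variable names are illustrative.

```python
# Electrophoretic estimate: about 1 amino acid difference per 5 proteins,
# with an assumed average protein length of 300 amino acids.
proteins_per_difference = 5
amino_acids_per_protein = 300
electro_aa_per_difference = proteins_per_difference * amino_acids_per_protein

# Reassociation estimate: 5-9% amino acid differences after correcting
# for silent substitutions.
reassoc_low, reassoc_high = 0.05, 0.09

# How many times greater the reassociation estimate is than the
# electrophoretic one, at each end of the 5-9% range.
ratio_low = reassoc_low * electro_aa_per_difference
ratio_high = reassoc_high * electro_aa_per_difference

print(f"electrophoresis: 1 difference per {electro_aa_per_difference} amino acids")
print(f"reassociation:   1 difference per {1 / reassoc_high:.0f} to {1 / reassoc_low:.0f} amino acids")
print(f"ratio: {ratio_low:.0f}- to {ratio_high:.0f}-fold")
```

The range brackets the "about 100 times" stated in the text, which corresponds to taking the reassociation value as roughly 1 difference in 15 amino acids.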
In any case, it deserves notice that the frequency of nucleotide heterozygosity observed by DNA hybridization (2-4%) is not very different from the value obtained by sequencing the Aγ gene (2%).
DNA cleavage with restriction endonucleases is another method to estimate the proportion of nucleotide differences in the DNA. DNA-sequence polymorphisms have been detected by endonuclease digestion in human globin genes 37,38 and in the ovalbumin gene of chicken 39. Jeffreys 40 has examined in 60 unrelated human individuals a continuous DNA segment containing several globin genes of the beta family (i.e., most of the segment shown on top of figure 1). A cleavage site in one but not another DNA sequence means that the 2 sequences differ by at least 1 base-pair at the site (each cleavage site contains 4 or more contiguous nucleotides). The number of cleaved sites is 52-54, amounting to 300-310 base pairs; the number of variant sites is 3. The frequency of variable nucleotide sites may, then, be calculated as 3/300 = 1%. But this 'intuitive' estimate can be shown to be biased; the corrected estimate is 0.5% 41. Moreover, this is an estimate of polymorphism, not of heterozygosity. The latter can be estimated as 0.1% 41. The nucleotide heterozygosity value based on Jeffreys' data is about 20 times smaller than the value obtained from the actual sequence of the Aγ gene.
This may be accidental, due e.g. to the small number of nucleotides assayed by Jeffreys; or it may be that the endonuclease technique yields biased low estimates because restriction sites are more conserved than others. The second alternative may be questioned in view of the large frequency of nucleotide differences detected in the mitochondrial DNA of some but not other organisms by restriction endonucleases. In mice, for example, about 2% of nucleotides are different between individuals in Mus musculus as well as in M. domesticus, whereas no differences have been detected in M. molossinus 42. In primates, the frequency of nucleotide differences between individuals is 1.0-1.3% in chimpanzees, but only 0.3% in humans 43 (see also Upholt  Although quantitative estimates of the amount of DNA-sequence variation cannot be provided with confidence for organisms in general, there can be no doubt that the variation is extensive. If the noncoding regions of genes are included, it seems likely that most, if not all, genes are heterozygous in every outbred individual. The amount of variation in the flanking sequences that occur between genes is also likely to be large.

Are genetic polymorphisms neutral or adaptive?
Natural populations store large amounts of genetic variation. What is the evolutionary significance of the variation? One possible answer is that the variation in protein and DNA sequence found in natural populations is for the most part adaptively neutral; i.e., that alternative genotypes have identical fitness. If this were the case, the evolution of the alternative sequences (alleles) would be determined by the random process of sampling from generation to generation. Another possible answer is that the variation is adaptively significant and, thus, that natural selection plays a significant role in molecular evolution.
Two arguments, one direct and the other indirect, may be adduced in support of the neutrality hypothesis. The direct argument relies on the apparent existence of a molecular evolutionary clock. When the rate of evolution is examined in, say, a protein such as cytochrome c, it is observed that amino acid substitutions have occurred in different branches and at different times at approximately constant rates. What is meant by the phrase 'approximately constant rates' is that the substitutions occur with a constant probability, but stochastic variation is expected.
Langley and Fitch 48 have tested statistically the evolution of 7 proteins in 17 mammals and found that the variance in the rate of amino acid substitutions is much too large to be consistent with the hypothesis that the rate was stochastically constant as predicted with the neutrality theory. It is possible, however, to maintain that the rate is stochastically constant but that it has a variance greater than expected from a Poisson distribution 49.
One problem with this sort of evidence in support of the neutrality hypothesis is that stochastically constant rates of molecular evolution are also predicted by models of natural selection 50. Therefore, the existence of a molecular evolutionary clock cannot be used in support of either the neutrality or the adaptive hypothesis. A more serious objection against the neutrality theory comes from recent data on the rate of nucleotide substitutions. Kimura 51 has demonstrated that, according to the neutrality theory, the rate of evolution of neutral alleles (adaptively neutral amino acid or nucleotide substitutions) is exactly the neutral mutation rate, independent of the size of the population, the length of the generations, and any other parameters. Analysis of the globin genes and other DNA sequences has, however, shown that the rate of nucleotide substitutions is significantly different for nucleotides that yield amino acid substitutions than for nucleotides in redundant 3rd positions; for those in the translated segments (exons) than for those in the nontranslated segments (introns) of genes; for those in genes than for those in intergenic sequences; and so on 33,52-54. There is no reason to believe that the rate of mutation would be systematically different for these various categories of nucleotides. Thus, the neutrality theory would require that the fraction of nucleotide mutations that are neutral be different for different nucleotides, even among those that do not yield amino acid substitutions. Consider, for example, nucleotide mutations in redundant 3rd positions of codons. It is now known that the evolutionary rate of substitution for these nucleotides is several times smaller than for nucleotides in intergenic sequences 33,52-54.
It follows that only a small fraction of the mutations in redundant 3rd positions can be neutral, thus falsifying the claim made in the past by proponents of the neutrality theory, namely that all (or nearly all) mutations in redundant 3rd positions would be neutral. The point is that if the fraction of nucleotides that are neutral is different for different kinds of nucleotides (even for those not affecting the amino acid sequence), for different parts of the genome, and perhaps for different groups of organisms, then the neutrality theory loses its predictive value and becomes an ad hoc explanation, claiming simply that those nucleotide substitutions that do occur in evolution are neutral. The interesting question in order to understand molecular evolution is no longer whether some nucleotide substitutions are neutral, but rather what the nature is of the selective constraints that determine the rates of nucleotide substitution for different genes or parts of the genome. The indirect argument offered in support of the neutrality theory is based on the concept of genetic load. The argument is that if some alleles are less adaptive than others, then a number of individuals would have less than optimal genotypes at each polymorphic locus subject to natural selection. If the number of such loci is very large, a population might be unable to withstand the burden of so many poorly fit individuals. The genetic-load argument is strongest in the case of heterosis; i.e., when a polymorphism is maintained owing to the adaptive superiority of the heterozygotes. Sved et al. 55, King 56 and others have suggested that an efficient method for testing whether heterosis plays a major role in natural populations is to compare the fitness of ordinary outbred individuals with the fitness of individuals homozygous for a larger than average proportion of loci. This method permits one to ascertain whether heterozygotes are at an overall advantage over homozygotes.
Numerous experiments, particularly in Drosophila, have shown that an increase in homozygosity results in a decrease in fitness. The experiments published before 1970 were, in general, carried out by measuring particular components of fitness, mostly viability 57 and fertility 58,59, and were not, in any case, performed under population conditions 60. Sved and Ayala 61 devised a method by which fitness as a whole can be measured under population conditions, in Drosophila flies made homozygous for full chromosomes, under conditions of equilibrium population density and a stable age distribution. This method has now been used in a number of experiments that yield consistent results in that the fitness of homozygotes for one full chromosome is invariably very low, in the sublethal range (table 7). In all the experiments reported in table 7, wild chromosomes were sampled from natural populations and flies made homozygous for a whole chromosome by means of crosses with special laboratory stocks. Chromosomes that reduced the viability of homozygotes to zero or near zero were eliminated from the fitness studies. Fitness was, then, measured in population cages over many generations by comparing the fitness of homozygous flies with the fitness of flies heterozygous for random combinations of wild chromosomes. In order to estimate the number of loci that can be maintained by natural selection in view of the fitness experiments, the assumption is made that selective interactions between loci are multiplicative and that there is no linkage disequilibrium 68. If at each locus maintained by heterosis the heterozygote has a 0.01 selective advantage over either homozygote, then the fitness of a homozygous individual relative to an individual heterozygous at 210 loci would be (0.99)^210 ≈ 0.12. This is approximately the mean fitness of individuals homozygous for a complete 2nd or a 3rd chromosome in D. melanogaster (see table 7).
Since, under the assumptions made, an individual would be heterozygous on the average at 50% of the heterotic loci, the total number of polymorphic loci maintained by heterosis in each chromosome could be 420. The 2nd and 3rd chromosomes of D. melanogaster are estimated to contain together about 75% of the genome. Therefore, the number of polymorphic loci that could be maintained by heterosis in the whole genome could be, approximately, (420 + 420)/0.75 = 1120. These calculations are based on assumptions which are unlikely to apply in nature. But some more realistic assumptions 6~ and recent experimental results 66 indicate that an even greater number of polymorphisms could be maintained by heterotic natural selection in natural populations of Drosophila. It should be pointed out, however, that these fitness experiments do not demonstrate that the decrease in fitness of the homozygous flies is due to homozygosis for heterotic loci. It is equally possible that it is due to homozygosis for deleterious alleles present in all wild chromosomes. But these experiments do show that arguments of genetic load cannot be used against the hypothesis that many natural polymorphisms are maintained by heterosis. Moreover, other forms of balancing selection may also contribute to the maintenance of genetic polymorphisms. Frequency-dependent selection is a more effective mechanism to maintain genetic polymorphisms than heterosis 69.
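The genetic-load calculation above can be retraced step by step. The sketch below uses only the figures given in the text and makes the same simplifying assumptions (multiplicative fitness across loci, no linkage disequilibrium, a 0.01 heterozygote advantage at every heterotic locus, and an average individual heterozygous at half of those loci); the variable names are illustrative.

```python
# Selective advantage of the heterozygote over either homozygote at each locus.
selective_advantage = 0.01

# Number of heterotic loci at which an outbred individual is heterozygous
# on the chromosome in question.
heterozygous_loci = 210

# Multiplicative fitness: a whole-chromosome homozygote loses the advantage
# at every one of those loci.
relative_fitness = (1.0 - selective_advantage) ** heterozygous_loci

# An average individual is heterozygous at 50% of the heterotic loci,
# so the chromosome carries twice as many such loci.
fraction_heterozygous = 0.5
loci_per_chromosome = heterozygous_loci / fraction_heterozygous

# The 2nd and 3rd chromosomes together hold about 75% of the genome.
genome_fraction = 0.75
genome_wide_loci = (2 * loci_per_chromosome) / genome_fraction

print(f"relative fitness of homozygote: {relative_fitness:.2f}")
print(f"heterotic loci per chromosome:  {loci_per_chromosome:.0f}")
print(f"heterotic loci, whole genome:   {genome_wide_loci:.0f}")
```

Because fitness is multiplicative, the result is sensitive to the assumed selective advantage: halving it to 0.005 would roughly double the number of loci compatible with the same observed fitness.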