Sequence variation in the dihydrofolate reductase-thymidylate synthase (DHFR-TS) and trypanothione reductase (TR) genes of Trypanosoma cruzi

Dihydrofolate reductase-thymidylate synthase (DHFR-TS) and trypanothione reductase (TR) are important enzymes for the metabolism of protozoan parasites from the family Trypanosomatidae (e.g. Trypanosoma spp., Leishmania spp.) that are targets of current drug-design studies. Very limited information exists on the levels of genetic polymorphism of these enzymes in natural populations of any trypanosomatid parasite. We present results of a survey of nucleotide variation in the genes coding for those enzymes in a large sample of strains from Trypanosoma cruzi, the agent of Chagas’ disease. We discuss the results from an evolutionary perspective. A sample of 31 strains shows 39 silent and five amino acid polymorphisms in DHFR-TS, and 35 silent and 11 amino acid polymorphisms in TR. No amino acid replacements occur in regions that are important for the enzymatic activity of these proteins, but some polymorphisms occur in sites previously assumed to be invariant. The sequences from both genes cluster in four major groups, a result that is not fully consistent with the current classification of T. cruzi in two major groups of strains. Most polymorphisms correspond to fixed differences among the four sequence groups. Two tests of neutrality show that there is no evidence of adaptive divergence or of selective events having shaped the distribution of polymorphisms and fixed differences in these genes in T. cruzi. However, one nearly significant reduction of variation in the TR sequences from one sequence group suggests a recent selective event at, or close to, that locus. © 2002 Elsevier Science B.V. All rights reserved.


Introduction
Enzymes that are essential to the metabolism of parasitic protozoa are attractive targets for antiparasite chemotherapy. Drugs that block the activity of those enzymes can inhibit the parasite's growth and therefore represent viable alternatives or complements to the development of vaccines. Two important metabolic enzymes of human parasites from the family Trypanosomatidae (Trypanosoma spp., Leishmania spp.) have received much attention as potential targets for the development of chemotherapeutic agents: the bifunctional dihydrofolate reductase-thymidylate synthase (DHFR-TS) and trypanothione reductase (TR).
In most organisms, the enzymes dihydrofolate reductase (DHFR) and thymidylate synthase (TS) catalyze consecutive reactions in the de novo synthesis of 2′-deoxythymidylate (dTMP) and exist as separate monofunctional proteins [1]. However, in protozoa DHFR and TS are expressed as a bifunctional monomeric enzyme, with the DHFR domain at the amino terminus and TS at the carboxy terminus of the polypeptide [2–4]. DHFR-TS has been a major target of research on antifolate drugs due to its central role in cellular metabolism and DNA synthesis. However, despite the success of antifolate chemotherapy against bacteria and malaria parasites, there are still no antifolate agents that can effectively block the activity of DHFR-TS in trypanosomatids [5].
Those difficulties have triggered interest in the enzyme trypanothione reductase (TR) as a more likely target for the development of drugs against trypanosomatid parasites [6]. Trypanosomatids differ from other organisms in that they lack the glutathione/glutathione reductase system for maintaining the stable reducing intracellular environment necessary for protection against oxidative stress. Instead, they rely on TR and a derivative of glutathione called trypanothione [7–9]. TR has therefore attracted a lot of attention as a potential target for drugs that block the trypanothione metabolism of trypanosomatid parasites without interfering with the glutathione metabolism of the human host [6,10].
Although nucleotide sequences of the genes coding for TR and DHFR-TS have been obtained for the majority of important trypanosomatid parasites [11–21], there is almost no information on the sequence polymorphism of these genes in natural populations of any trypanosomatid parasite. Such information is especially relevant for Trypanosoma cruzi, the agent of Chagas' disease, which is very polymorphic at the genetic level [22,23]. Until very recently, the genes coding for DHFR-TS and TR had been sequenced in only one and three strains of T. cruzi, respectively [15,19,24,25]. Nucleotide sequences of the DHFR-TS and TR genes from a large group of strains of T. cruzi that represent most of the genetic diversity of this parasite were recently obtained [26]. Here we use that large comparative sequence dataset to study the genetic polymorphism and evolution of the DHFR-TS and TR genes in T. cruzi.

Samples
General information about the origin of the 31 T. cruzi strains included in this study is given in Table 1. DNA samples were obtained from M. Tibayrenc and collaborators (CEPM CNRS/ORSTOM, Montpellier, France). Three samples from two species of bat trypanosomes (T. cruzi marinkellei and T. vespertilionis) were also included, and used to root the phylogenetic trees. Previously published sequences of T. cruzi were also included in the analyses: TR from the CL strain (GenBank acc. no. M38051) [15], Silvio strain (Z13958) [24], and CAI strain (M97953) [25]; and the DHFR-TS sequence from the Y strain (L22484) [19].

Results and discussion
Heterozygosity and haplotypic diversity of the DHFR-TS and TR genes in T. cruzi
Partial sequences of 1473 bp, corresponding to 94% of the complete sequence of the DHFR-TS gene (total length 1563 bp), were collected from 31 strains of T. cruzi (Table 1). The sequences start at position 31 of the T. cruzi gene (codon 11) and end 60 bp before the stop codon. Nucleotide composition is slightly biased (57.9% G+C), the bias being more evident at third codon positions (68.8%). A measure of codon bias, the effective number of codons (ENC) [34], indicates that DHFR-TS has a moderate level of codon bias in T. cruzi (ENC = 48.66).
Partial sequences of 1290 bp were obtained for the TR gene (total length 1476 bp). The collected sequence starts at position 76 of the T. cruzi gene (codon 26) and ends 111 bp before the stop codon. The sequences have no detectable nucleotide composition bias (52.2% G+C). Although the G+C content of synonymous third codon positions is 60.0%, there is no evidence of codon usage bias in this gene (ENC = 52.76) [34].
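The composition figures reported above (overall G+C and G+C at third codon positions) can be illustrated with a minimal sketch. The function names and the toy sequence are assumptions made for this example and are not the software used in the study; the input is assumed to be an in-frame coding sequence, and all third positions are counted without filtering for synonymous sites.

```python
def gc_content(seq):
    """Fraction of G or C among the unambiguous bases (A, C, G, T) of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence has no unambiguous bases")
    return sum(b in "GC" for b in bases) / len(bases)


def gc3_content(coding_seq):
    """G+C fraction at third codon positions of an in-frame coding sequence."""
    # Ignore a trailing incomplete codon, then take every third base.
    usable = len(coding_seq) - len(coding_seq) % 3
    return gc_content(coding_seq[2:usable:3])


# Hypothetical in-frame fragment (not taken from the DHFR-TS or TR genes).
toy = "ATGGCCAAGCTT"
print(gc_content(toy), gc3_content(toy))
```

Comparing the two values for a real alignment shows whether compositional bias is concentrated at third positions, as described for DHFR-TS.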
Although most of the strains are homozygous for the sequences of these two genes, several heterozygous strains were observed. As previously described [26], the PCR products from those strains were cloned and multiple clones (5–10) sequenced to infer the haplotypes. Two haplotypes were found in all the heterozygous strains. All the variable sites from all collected sequences are shown in Tables 2 and 3. Sequences from heterozygous strains are labeled with an H1 or H2 suffix after the strain name, where H1 and H2 stand for haplotypes 1 and 2. In Tables 2 and 3 the sequences are organized using the four sequence groups (A–D) defined by Machado and Ayala [26], which reflect the phylogenetic affinities among the haplotypes (see below).
While most haplotypes from the same strain differ at only 1–3 positions, the two DHFR-TS and TR haplotypes of strains SOC3 cl5, EPP, PSC-O, CL F11F5 and TULAHUEN cl2 are fairly divergent, differing in at least 16 or 22 sites (in DHFR-TS and TR, respectively). As shown by Machado and Ayala [26], that haplotype structure suggests the occurrence of at least one hybridization event in T. cruzi, because the two nuclear haplotypes fall in two distantly related sequence clades (B and C) and the heterozygous strains carry only one mitochondrial haplotype, thus ruling out laboratory contamination. Interestingly, the strain chosen for the T. cruzi genome project, CL F11F5 (CL Brener), is heterozygous for DHFR-TS and TR, and is inferred to have a hybrid genotype based on these nucleotide data [26] and a combination of multilocus enzyme electrophoresis and RAPD data [35]. Analyses of molecular variance (AMOVA) [36] show that most of the genetic diversity found in these genes is explained by variation among the four sequence groups rather than by variation within each sequence group: 85 and 91% of the total genetic variation found, respectively, in the DHFR-TS and TR sequences of T. cruzi are due to differences among sequence clades. Haplotypic diversity (Hd) [37] for these genes, defined as the probability of randomly choosing two different gene copies from the sample, is high in T. cruzi. The DHFR-TS sample shows 23 haplotypes in the 41 sequences sampled (Hd = 0.941), and each sequence clade differs in its variability (Table 2).
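The definition of haplotypic diversity given above (the probability that two gene copies drawn from the sample are different) corresponds to the usual unbiased estimator Hd = n/(n − 1) × (1 − Σ p_i²), where p_i is the sample frequency of haplotype i. The sketch below is only an illustration: the function name and the toy labels are hypothetical, and the reported value of 0.941 cannot be reproduced here because the individual haplotype frequencies are given in Table 2, not in the text.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Unbiased haplotype diversity: n/(n-1) * (1 - sum of squared frequencies).

    `haplotypes` is a list with one hashable label (or sequence string)
    per sampled gene copy.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two gene copies are required")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Toy sample (hypothetical labels): two haplotypes, two copies each.
print(haplotype_diversity(["h1", "h1", "h2", "h2"]))
```

The estimator is 0 when every copy carries the same haplotype and 1 when every copy is different.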

Nucleotide and amino acid variation in the DHFR-TS gene
Fifty-one nucleotide sites are variable in the DHFR-TS sample; 39 are silent polymorphisms and 12 cause an amino acid change (Table 2). However, only eight [...]
Table 2. List of polymorphic nucleotide sites in the DHFR-TS sequences of T. cruzi. Clade designations are from [26] (see Figs. 3 and 4). Nucleotide and amino acid positions correspond to those of the GenBank reference T. cruzi sequence (L22484) [19]. Changes that are likely to be sequencing errors in the GenBank sequence are marked in bold and shaded (see text).
Table 3. List of polymorphic sites in the TR sequences of T. cruzi. Clade designations are from [26] (see Figs. 3 and 4). Nucleotide and amino acid positions correspond to those of the GenBank reference T. cruzi sequence (M38051) [15]. a GenBank sequence (Z13958) [24]. b GenBank sequence (M97953) [25]. c GenBank reference sequence (M38051) [15].
Interestingly, all the multiple changes in the same codon are observed in the GenBank sequence (accession no. L22484) [19]. The substitution pattern in that sequence suggests that the observed changes are likely to be sequencing errors rather than real nucleotide substitutions. First, all changes are unique to the GenBank sequence and involve CG to GC or GC to CG substitutions at adjacent nucleotide positions, which suggests sequencing errors due to compression problems or scoring errors. Second, the three inferred amino acid substitutions are non-conservative at the biochemical level (Arginine (R) to Alanine (A) and vice versa, and Glycine (G) to Alanine (A)).
Third, with the exception of the GenBank sequence (L22484), codons 21 and 430 code for amino acids that are conserved across all trypanosomatid species sequenced to date (Arginine (R) and Glycine (G)) (Fig. 1). Omitting the GenBank sequence, the numbers of observed silent and replacement polymorphisms are 39 and 5, respectively. The vast majority of nucleotide polymorphisms correspond to fixed nucleotide differences between sequence clades of T. cruzi (see below) [26]. Eleven of the 44 polymorphisms are singletons (observed in only one sequence), and all of them are silent changes. Fig. 1 shows the alignment of the DHFR-TS amino acid sequences from a subset of the T. cruzi strains and different trypanosomatids. In all trypanosomatids the first 234 residues have been assigned to the DHFR domain [19]. Four amino acid polymorphisms were observed in the DHFR domain, while only one was observed in the TS domain (Fig. 1). Three of the four amino acid changes in the DHFR domain correspond to fixed differences among clades, and were observed in sites that are also variable in other trypanosomatid sequences. The change in residue 149 (E–G) is only observed in one of the haplotypes from two putatively hybrid T. cruzi strains (EPP H2, PSC-O H2) [26] (Table 2). The change observed in residue 440 of the more conserved TS domain is only observed in two strains from the same sequence clade (OPS21 cl11, CUICA cl1) (Table 2). No changes were observed in the 15 conserved residues that are suggested to be involved in dihydrofolate binding in two bacterial DHFR enzymes [38,39]. With the exception of those polymorphisms observed in the GenBank sequence, all the observed amino acid polymorphisms in the DHFR-TS gene of T. cruzi are conservative at the biochemical level.
Fig. 1. Alignment of DHFR-TS amino acid sequences from T. cruzi and other trypanosomatids [...] [19]. Residues 1–234 correspond to the DHFR domain, and the remaining residues to the TS domain [19]. Polymorphic sites in T. cruzi are highlighted. Potential mistakes in the GenBank sequence (residues 21, 324 and 430) are also highlighted. The first six residues of the TS domain are marked in bold. Leishmania major (X51733) [11]; L. amazonensis (X51735) [14]; Crithidia fasciculata (M22852) [13].
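The distinction between silent and replacement polymorphisms used throughout this section can be made explicit with a short sketch that translates the two alleles of a codon and compares the encoded amino acids. It assumes the standard genetic code (appropriate for nuclear genes); the function name and the example codons are hypothetical and are not read from Table 2.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_change(codon_a, codon_b):
    """Return ('silent' or 'replacement', amino acid of a, amino acid of b)."""
    aa_a = CODON_TABLE[codon_a.upper()]
    aa_b = CODON_TABLE[codon_b.upper()]
    kind = "silent" if aa_a == aa_b else "replacement"
    return kind, aa_a, aa_b


# Hypothetical codon pairs chosen only to illustrate both outcomes.
print(classify_change("CTT", "CTC"))  # both codons encode leucine
print(classify_change("GAA", "GGA"))  # glutamate versus glycine
```

The second pair gives an E to G replacement of the same kind as the change reported at residue 149, although the actual codons at that site are not stated in the text.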

Nucleotide and amino acid variation in the TR gene
TR has more amino acid polymorphisms than DHFR-TS (Table 3). Eleven of the 46 polymorphic sites observed in T. cruzi cause amino acid replacements. Eleven singletons were observed, three of which cause amino acid replacements (in strains FLORIDA C16, Silvio and CM17); six of the singletons occur in the sequence from strain CANIII, which is the only strain sampled from clade D, one of the four sequence clades defined for T. cruzi (see below) [26]. Fig. 2 shows the alignment of the TR amino acid sequences from a selected group of T. cruzi strains and all available trypanosomatid sequences. Five of the 11 amino acid changes observed in T. cruzi occur in sites that were previously assumed to be invariant among trypanosomatids. Among those five sites, changes at sites 402–403 (NI–KV) and 441 (V–I) correspond to fixed differences among clades. Interestingly, the conservative amino acid changes that are unique to strain CM17 (position 247, G–S) and to one of the haplotypes from strain FLORIDA C16 (position 278, D–E) occur in sites of the protein that are completely conserved across trypanosomatids (Fig. 2) and even in the human glutathione reductase [18]. The remaining six amino acid changes are observed in regions of the protein that are variable in trypanosomatids and, with the exception of the change in site 95 (K–N) of the Silvio strain, correspond to fixed differences among clades. None of the observed changes fall in sites that have been suggested to be important for the enzymatic activity of TR [18].
In the only additional study of TR nucleotide polymorphism in another species of trypanosomatid (Crithidia fasciculata), three haplotypes were observed in a sample of five genomic clones [16]. In that sample, only one of the 14 polymorphic sites that were observed leads to an amino acid replacement. That replacement is conservative (Q–E) and occurs at the very 3′ end of the gene in a region not covered by our partial sequences. Interestingly, the proportion of replacement to silent polymorphisms is much higher in T. cruzi (11/35) than in C. fasciculata (1/13). In fact, in the region sequenced by us there are no amino acid polymorphisms in the C. fasciculata sample [16]. Additional sampling in C. fasciculata is necessary to determine whether that observation reflects higher selective constraints on the evolution of this gene in this organism.

Phylogeny of the DHFR-TS and TR sequences from T. cruzi
Pairwise corrected distances among selected sequences of T. cruzi and other trypanosomatids are shown in Table 4. Genetic divergences among T. cruzi strains are low, never exceeding 2%, while distances to the distantly related trypanosomatids Crithidia and Leishmania are fairly large (45–50%). Figs. 3 and 4 show that the DHFR-TS and TR sequences of T. cruzi cluster in four major sequence clades (hereafter referred to as clades A, B, C and D, after Machado and Ayala [26]). The same pattern is observed in sequences from other nuclear [40] and mitochondrial loci [26]. The reconstructed genealogies do not fully agree with former phylogenetic studies based on non-nucleotide genetic data [35,41,42] that have suggested the presence of two major phylogenetic lineages in T. cruzi (recently named T. cruzi I and T. cruzi II [43]). All sequences from strains classified as T. cruzi I are monophyletic and fall in clade A. On the other hand, sequences from strains classified as T. cruzi II are paraphyletic, falling into clades B, C and D. Each of those clades is monophyletic, but clades B and D are more closely related to clade A than to clade C (Figs. 3 and 4). Clade C corresponds to the most anciently derived group of T. cruzi sequences.
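The corrected distances in Table 4 are Tamura–Nei distances. As a rough sketch of what that correction computes for one pair of aligned sequences, the function below applies the standard Tamura–Nei (1993) formula, with P1 and P2 the proportions of purine and pyrimidine transitions and Q the proportion of transversions. Two choices here are assumptions of this example and not details taken from the study: base frequencies are estimated by pooling the two sequences, and sites with gaps or ambiguous bases are skipped. The function name and toy sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}


def tamura_nei_distance(seq1, seq2):
    """Tamura-Nei (1993) distance between two aligned nucleotide sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq1.upper(), seq2.upper())
             if x in "ACGT" and y in "ACGT"]
    n = len(pairs)
    if n == 0:
        raise ValueError("no comparable sites")
    # Base frequencies pooled over both sequences (an assumption of this sketch).
    pooled = "".join(x + y for x, y in pairs)
    g_a, g_c, g_g, g_t = (pooled.count(b) / (2 * n) for b in "ACGT")
    if min(g_a, g_c, g_g, g_t) == 0:
        raise ValueError("all four bases must be present")
    g_r, g_y = g_a + g_g, g_c + g_t
    # Proportions of A<->G transitions, C<->T transitions and transversions.
    p1 = sum({x, y} == {"A", "G"} for x, y in pairs) / n
    p2 = sum({x, y} == {"C", "T"} for x, y in pairs) / n
    q = sum(x != y and (x in PURINES) != (y in PURINES) for x, y in pairs) / n
    k1 = 2 * g_a * g_g / g_r
    k2 = 2 * g_t * g_c / g_y
    k3 = 2 * (g_r * g_y - g_a * g_g * g_y / g_r - g_t * g_c * g_r / g_y)
    w1 = 1 - p1 / k1 - q / (2 * g_r)
    w2 = 1 - p2 / k2 - q / (2 * g_y)
    w3 = 1 - q / (2 * g_r * g_y)
    if min(w1, w2, w3) <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -k1 * math.log(w1) - k2 * math.log(w2) - k3 * math.log(w3)


# Toy alignment (hypothetical): one C/T transition in ten sites.
print(tamura_nei_distance("ACGTACGTAC", "ACGTACGTAT"))
```

The corrected value exceeds the raw proportion of differing sites because the formula accounts for multiple substitutions at the same site.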
The current classification of T. cruzi in two distinct groups based on non-nucleotide genetic data (allozymes, RAPDs, RFLPs, microsatellites) cannot be fully reconciled with the gene genealogies shown here and in previous studies. The fact that none of the genealogies reconstructed with nuclear or mitochondrial loci recovers T. cruzi I and T. cruzi II as two distinct groups of strains suggests that either the current classification is wrong or that T. cruzi may have had a complicated ancestral demographic history. The evidence provided by the well-supported gene genealogies is nevertheless insufficient for rejecting the current classification of T. cruzi. This classification, based on non-nucleotide sequence data, could still constitute a better representation of the actual evolutionary relationships among T. cruzi strains than that suggested by the gene genealogies, because the former reflects relationships among multiple loci randomly sampled from the genome, that is, relationships inferred from genome-wide patterns of variation, while the latter only reflects relationships among alleles from a single locus [26].
Fig. 2. Alignment of amino acid TR sequences from T. cruzi and other trypanosomatids. Representative sequences from each sequence clade of T. cruzi are included. Positions are defined by the T. cruzi reference sequences from strains CL (Accession M38051) [15] and Silvio (Z13958) [24]. The additional T. cruzi GenBank amino acid sequence from strain CAI (M97953) [25] is identical to the amino acid sequence from strain TEH and is not shown. Sites that are polymorphic in the T. cruzi sequences are highlighted. T. brucei (X63188) [17]; T. congolense (M21122) [12]; Crithidia fasciculata (Z12618) [17]; Leishmania donovani (Z23135) [20].
Under the assumption that the classification of T. cruzi in two distinct groups is correct, the conflicting portraits of the history of this organism could be reconciled by proposing that T. cruzi has had a demographic history that includes at least one major genetic exchange event leading to the formation of T. cruzi II. Machado and Ayala [26] proposed that the recent ancestor of T. cruzi may have consisted of at least four isolated lineages that carried the ancestral alleles of the four distinct sequence clades (A–D) observed in extant strains, and that recent genetic exchange events resulted in most of the current T. cruzi II strains carrying combinations of alleles from at least two of the ancestral lineages (alleles from clades B and C). Under that explanation, the genome of T. cruzi II strains would be a mosaic formed with alleles from clades B, C and, possibly, D. This explanation predicts that some strains from T. cruzi II should carry alleles from sequence clade B at some parts of their genome and alleles from clade C at others. That pattern has yet to be observed. However, the hybrid strains from T. cruzi II partially fit that description, although the observation of current complete heterozygosity at the regions of the genome where the DHFR-TS and TR loci are located suggests that this hybridization event is more recent than the event(s) leading to the formation of T. cruzi II.
One also needs to consider the possibility that the potentially complex history of T. cruzi may not allow a single phylogenetic tree or a simple classification to represent the evolutionary history of this organism. Discordance among histories reconstructed using different genes has been observed in several groups of closely related species and among populations within species [44], where gene trees from different loci render incongruent histories that are consistent with complex ancestral demographic histories or histories that involve hybridization events. Thus, before undertaking a reevaluation (or reaffirmation) of the current classification of T. cruzi as an accurate representation of its evolutionary history, it will be necessary to collect more sequence data from multiple loci located in different regions of the genome. The results from the current genome sequence project of T. cruzi [45] should provide a guide for choosing loci at selected regions of the genome and carrying out such a study.

Tests of neutrality
In order to determine whether there is evidence of adaptive protein divergence for these enzymes or whether these genes have been recently under selection, two standard tests of neutrality were applied. Both tests focus on the correlation between the amounts of polymorphism and divergence that is expected under neutrality, due to the linear dependence of both patterns on the neutral mutation rate. For applying the tests, we considered each sequence clade as an independent group (i.e. with no genetic exchange among groups) and compared patterns of polymorphism within each clade with patterns of divergence among clades. We also compared all T. cruzi sequences with a single sequence from either one of the two outgroups (T. c. marinkellei and T. vespertilionis). The DHFR-TS GenBank sequence (Accession L22484) was not included in the analyses based on the evidence presented above suggesting that several of the nucleotide substitutions observed in that sequence are sequencing mistakes.
Table 4. Tamura–Nei distances among a subset of DHFR-TS sequences (above the diagonal) and TR sequences (below the diagonal), from representative T. cruzi strains and outgroups. [...] [13]; TR: Accession Z12618 [17]. c DHFR-TS: L. major Accession M12734 [11]; TR: L. donovani Accession Z23135 [20].
The McDonald–Kreitman test [27] (Table 5) examines whether the ratio of silent to amino acid variation is the same for polymorphisms as it is for fixed differences between groups of organisms. Under the assumption that these two kinds of variation are selectively neutral, the ratios are expected to be the same. Table 5 shows that the hypothesis of selective neutrality is not rejected in any of the comparisons. Even if the DHFR-TS GenBank sequence is included, the test does not reject neutrality (not shown). Thus, there is no evidence of adaptive divergence for the DHFR-TS and TR enzymes in T. cruzi.
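The comparison made by the McDonald–Kreitman test can be sketched as a G-test of independence on a 2×2 table (silent versus replacement changes, polymorphic versus fixed) with Williams' correction, which is the form of the test mentioned in the table notes. The code below is a minimal illustration under stated assumptions: the function names are hypothetical, the counts are invented for the example and are not the values of Table 5, and the P value uses the χ² distribution with one degree of freedom.

```python
import math


def g_test_2x2(a, b, c, d):
    """G-test of independence on [[a, b], [c, d]] with Williams' correction.

    Returns (corrected G, P value from the chi-square distribution, 1 d.f.).
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if min(rows) == 0 or min(cols) == 0:
        raise ValueError("every row and column total must be positive")
    observed = ((a, b), (c, d))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # cells with zero counts contribute nothing to G
                g += 2 * o * math.log(o * n / (rows[i] * cols[j]))
    # Williams' correction factor for a 2x2 table.
    q = 1 + ((n / rows[0] + n / rows[1] - 1)
             * (n / cols[0] + n / cols[1] - 1)) / (6 * n)
    g_adj = max(g, 0.0) / q
    # Survival function of chi-square with 1 d.f.: erfc(sqrt(x / 2)).
    return g_adj, math.erfc(math.sqrt(g_adj / 2))


def mcdonald_kreitman(silent_poly, repl_poly, silent_fixed, repl_fixed):
    """Compare silent:replacement ratios for polymorphisms and fixed differences."""
    return g_test_2x2(silent_poly, repl_poly, silent_fixed, repl_fixed)


# Invented counts, for illustration only (not taken from Table 5).
g_adj, p = mcdonald_kreitman(12, 2, 30, 6)
print(f"G = {g_adj:.3f}, P = {p:.3f}")
```

A large P value means the two ratios are statistically indistinguishable, which is the outcome expected under neutrality.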
The second test we applied was the HKA test [28] (Table 6), which considers polymorphism and divergence at two or more loci. Natural selection is inferred when the observed values of divergence or polymorphism depart exceptionally from expected values generated by fitting a neutral, constant population size model. We applied the HKA test to sequence clades A, B and C (the low number of sequences did not allow the test to be conducted with clade D). In each case a single sequence from one of the two bat trypanosome outgroups was used. The significance of the observed HKA statistic was determined by comparison to the χ² distribution and by comparison with the distribution of the statistic following 1000 coalescent simulations. The test did not reject neutrality in clades A or B, regardless of the outgroup sequence used. For those cases none of the HKA tests approached statistical significance and the P values obtained by simulation or from the χ² distribution were very similar. Interestingly, neutrality was rejected for clade C only when the HKA statistic was compared with the simulated distribution and before correcting for multiple tests. In those cases, the test statistic also approached statistical significance when compared to the χ² distribution (Table 6). The almost significant departure of clade C from the null neutral pattern is due to a lower than expected polymorphism in TR. While sequences from clade C show ten polymorphic sites in DHFR-TS (not including the GenBank sample), there are no polymorphisms in TR. That observation does not fit the neutral expectation because the level of divergence between the TR sequences from clade C and the outgroup is not different from that of the other sequence clades. This observation suggests the occurrence of a recent selective event at, or close to, the TR locus in the strains carrying sequences from clade C.
Fig. 3. Genealogical relationships among DHFR-TS sequences from T. cruzi (Neighbor joining tree). Sequences from T. cruzi marinkellei and T. vespertilionis were used as outgroups. Numbers below or above the branches are bootstrap values 50% (500 replications). The conspicuous long branch in the GenBank sequence (Y) is generated by the unique substitutions in that sequence that we have identified as potential sequencing errors (see text).
Fig. 4. Genealogical relationships among TR sequences from T. cruzi (Neighbor joining tree). Sequences from T. cruzi marinkellei and T. vespertilionis were used as outgroups. Numbers below or above the branches are bootstrap values 50% (500 replications). GenBank sequences are marked with asterisks (**).

Conclusions
This study has uncovered a large number of polymorphisms in the DHFR-TS and TR genes of T. cruzi . Most of the genetic variation is due to differences among sequence clades, reflecting a history of strong ancestral population structure and long-term clonal divergence of at least four distinct populations.
Most nucleotide variation is silent. A few amino acid polymorphisms were observed, but none occurs in sites that are functionally important. The sites in enzyme regions being targeted by drug design studies are all conserved in our extensive sample of T. cruzi strains. The high amino acid conservation across trypanosomatids suggests that drugs designed against DHFR-TS and TR for one trypanosomatid species may work in other species.
This study opens up the possibility of studying the evolution of drug resistance in action in T. cruzi. Our data provide a unique opportunity to compare the amount and type of genetic variation of the DHFR-TS and TR genes in natural populations of this parasite prior to and after the use of potential selective agents. The comparisons could make it possible to detect and then follow the evolutionary dynamics of new amino acid mutations responsible for the evolution of drug-resistant strains in nature. Moreover, available studies on the molecular mechanisms responsible for resistance against drugs that block the activity of DHFR in Plasmodium falciparum [46–48], and on the selection of different amino acid point mutations in different populations of that parasite [49,50], would allow interesting and informative comparisons with T. cruzi. It will be possible, for instance, to try to determine whether mechanisms of drug resistance are similar in both parasites (i.e. do similar point mutations confer resistance?) and, more interestingly, whether the evolutionary dynamics of selected mutations are similar in both parasites. The last comparison is quite relevant given that the population structures of the two parasites are different: clonal in T. cruzi [22,51], but sexual in P. falciparum, with different degrees of population structure (or inbreeding) that are correlated with the frequency of transmission [52,53]. One therefore expects to see contrasting dynamics reflecting these differences.
Table 5 (footnotes). The DHFR-TS sequence from GenBank (Acc. # L22484) was not included in the analyses. Clade names correspond to previously defined sequence clades [26] (see Figs. 3 and 4). * G-tests of independence were performed using Williams' correction [55].
Table 6 (footnotes). The tests use polymorphism within group 1 and divergence between group 1 and a single sequence from group 2 (T. c. marinkellei or T. vespertilionis). GenBank sequences were not included in the analyses. Clade names correspond to the sequence clades defined by Machado and Ayala [26] (see Figs. 3 and 4). a The HKA test statistic [28]. b The probability of a χ² higher than observed, estimated with 1000 coalescent simulations. c The probability of a χ² higher than observed, based on the χ² distribution.