Population genetics of Plasmodium resistance genes in Anopheles gambiae : no evidence for strong selection

Anopheles mosquitoes are the primary vectors for malaria in Africa, transmitting the disease to more than 100 million people annually. Recent functional studies have revealed mosquito genes that are crucial for Plasmodium development, but there is presently little understanding of which genes mediate vector competence in the wild, or evolve in response to parasite-mediated selection. Here, we use population genetic approaches to study the strength and mode of natural selection on a suite of mosquito immune system genes, CTL4, CTLMA2, LRIM1, and APL2 (LRRD7), which have been shown to affect Plasmodium development in functional studies. We sampled these genes from two African populations of An. gambiae s.s. , along with several closely related species, and conclude that there is no evidence for either strong directional or balancing selection on these genes. We highlight a number of challenges that need to be met in order to apply population genetic tests for selection in Anopheles mosquitoes; in particular the dearth of suitable outgroup species and the potential difficulties that arise when working within a closely-related species complex.


Introduction
Mosquitoes belonging to the Anopheles gambiae Giles species complex include the primary vectors of human-pathogenic Plasmodium species in sub-Saharan Africa, and as such are indirectly responsible for the deaths of more than one million people annually (World Malaria Report, WHO 2005). This has driven research into the molecular basis of mosquito-Plasmodium interaction, most notably the sequencing of the An. gambiae genome (Christophides et al . 2002;Holt et al . 2002) and the subsequent identification of genes that mediate mosquito susceptibility to Plasmodium infections (e.g. Blandin et al . 2004;Osta et al . 2004;Abraham et al . 2005;Vlachou & Kafatos 2005;Dong et al . 2006a, b). These genes have largely been identified through expression or gene knockout studies, and while it is clear that such genes are critically involved in Plasmodium development and/or transmission, these studies do not indicate which genes are key players in the ecology and evolution of mosquito-Plasmodium interactions. In particular, it is generally not known which immune system genes are targets of parasite adaptation, or which mosquito genes mediate variation in vector competence in the field, and thus human exposure to Plasmodium (but see Riehle et al . 2006).
Analysis of DNA polymorphism and divergence can identify which resistance genes are targets of strong selection (reviewed in Nielsen 2005). On the one hand, strong directional selection (as in an arms-race) is expected to increase the rate of amino-acid substitution between species (reviewed in Yang & Bielawski 2000), and the concomitant spread of new advantageous alleles will reduce the level of within-species genetic diversity around the selected locus (a 'selective sweep', Maynard Smith & Haigh 1974;Kaplan et al . 1989;Braverman et al . 1995;Kim & Stephan 2000). On the other hand, if heterozygous genotypes are advantageous, if rare alleles experience a selective advantage, or if there are strong costs to resistance alleles in the absence of the parasite, then genetic diversity can be maintained for extended periods of time and divergence between alternate haplotypes may be extreme (e.g. Seger 1988;Stahl et al . 1999;Charlesworth 2006). Importantly, the evolutionary history and population genetics of resistance genes may have implications for the control of malaria through the use of genetically modified mosquitoes (Little 2006). For example, if host-parasite interactions maintain adaptive diversity at a particular host locus, then artificially driving a 'resistance' allele from this locus (e.g. as identified by a small experimental study) through the mosquito population could facilitate the future spread of a new, or currently rare, strain of Plasmodium (Little 2006; see also Slate 2005). The degree of threat posed by this process could be elucidated through population genetic studies that shed light on the rate of adaptation or the nature of coevolutionary interactions that resistance genes are subject to. Generally, there is little understanding of how transient resistance polymorphisms are likely to be in natural populations.
However, population-genetic approaches to understanding patterns of molecular evolution demand a number of prerequisites that may be difficult to fulfil for some taxa. Firstly, analyses involving between-species comparisons, such as those based on the rates of synonymous substitution (substitutions that do not alter an amino acid: K S ) and nonsynonymous substitution (those that cause an amino acid change: K A ) require the availability of outgroup sequences with an appropriate level of divergence (e.g. Yang & Bielawski 2000). Secondly, analyses based on within-lineage variation (including genetic diversity, haplotype structure, and the distribution of allele frequencies) assume that loci share the same history of population size and growth, which may not be the case for loci which reside on chromosomal inversions that fluctuate in frequency (Andolfatto et al . 2001) or which have introgressed from other lineages. We discuss below how both may be problematic for An. gambiae.
Anopheles gambiae is a member of the Pyretophorus Series of Anopheles (subgenus Cellia ), which comprises both African (e.g. the gambiae complex) and Oriental taxa (such as An. vagus Dönitz), including some of the most notorious malaria vectors in both regions. The Afrotropical species include An. daudi Coluzzi, An. christyi (Hunt et al . 1998). Unfortunately, despite their biomedical importance, surprisingly little is known of the internal systematics of the Pyretophorus Series (Foley et al . 1998;Anthony et al . 1999;Harbach 2004), making a priori selection of outgroups difficult. Sequence similarity between members of the gambiae complex is known to be high (e.g. Besansky et al . 1994, e.g. Besansky et al . 2003a, while outside of the Pyretophorus Series other well-studied Anopheles mosquitoes appear to be highly divergent, such that K S for nuclear genes is approaching saturation (e.g. Little & Cobbe 2005). In addition, members of the gambiae complex may share considerable ancestral polymorphism, and/or may not be fully reproductively isolated from each other (e.g. Besansky et al . 1994;Besansky et al . 2003a;Donnelly et al . 2004). For example, even rare matings and partial hybrid sterility, as seen between complex members (White 1971;White 1974), may allow homogenization of genetic diversity for neutral or globally favourable alleles. Furthermore, An. gambiae s.s. itself comprises differentiated lineages that may constitute a case of incipient speciation (i.e. M and S molecular forms: Slotman et al . 2007;della Torre et al . 2002;Turner et al . 2005; but see also Yawson et al . 2007). Shared ancestral polymorphism, incipient speciation, and recent introgression or intermittent gene flow between the complex members make it especially difficult to unambiguously ascribe otherwise unusual patterns of diversity to selection. Therefore, the identification of suitable (and unsuitable) outgroups is a priority for studies of molecular evolution in An. gambiae .
Although it is probable that many genes across the Anopheles genome mediate interactions with Plasmodium , early reports suggested that the polymorphic chromosomal inversion 2La may be strongly associated with a plasmodium susceptibility phenotype in some mosquito lineages (e.g. Vernick & Collins 1989;Petrarca & Beier 1992). Efforts were made to map susceptibility genes in this region (Zheng et al . 1997) and a number of candidate genes have since been identified. Following a functional screen of candidate genes, a leucine rich repeat gene ( LRIM1 ) located in the 2La inversion, and two C-type lectins ( CTL4 and CTLMA2 ) also located on 2L, but outside the inversion, were identified as giving strong Plasmodium -related phenotypes in the non-natural host-parasite combination of An. gambiae and P. berghei (Osta et al . 2004; but see also Cohuet et al . 2006). In this species combination, CTL4 and CTLMA2 appear to be essential for successful Plasmodium invasion, as RNAi knockdowns resulted in an increased in Plasmodium melanization from almost zero to 97% ( CTL4 ) and 53% ( CTLMA2 ) (Osta et al . 2004). Conversely, LRIM1 may be involved in resistance to Plasmodium infection, as knockdowns of this gene resulted in a ~3.6-fold increase in oocyte numbers (Osta et al . 2004). Recently, Riehle et al . (2006) measured Plasmodium -susceptibility in the progenies of wild-caught female mosquitoes, and found a quantitative effect associated within a region overlapping the 2La inversion. Following a database search for candidate genes within this quantitative trait locus, they identified two candidate genes (of 976 loci within the 15Mbp interval) which were designated APL1 and APL2 . APL2 is synonymous with LRRD7 (Dong et al . 2006a), and in that study was found to be induced in response to Plasmodium infection, and to display increased levels of Plasmodium infection upon RNAi knock-down (Dong et al . 2006a).
Segregating chromosomal inversions, such 2La, affect selection in a number of ways. Genetic recombination is reduced within inversions, allowing long-term and longer range association between alleles at different loci (reviewed in Schaeffer & Anderson 2005;Kirkpatrick & Barton 2006). These associations may be adaptive, and potentially contribute to the speciation process (e.g. Ayala & Coluzzi 2005;Butlin 2005;Kirkpatrick & Barton 2006). For example the frequency of the 2La inversion An. gambiae is known to correlate with the abiotic conditions ( Fig. 2 in Coluzzi 1992), and is thought to be advantageous in arid environments. However, the increase in haplotype structure associated with inversions can also complicate, or even mislead, population-genetic inference of selection. Selection at linked loci will have a greater impact on local diversity and the allele-frequency spectrum that it otherwise would, and in the case of 2La, association with very strongly selected traits such as resistance to the pesticide dieldrin (Brooke et al . 2000; but see also Brooke et al . 2006) might dominate patterns of diversity. Indeed, it has been suggested that 2La inversion polymorphism is adaptively maintained in An. gambiae , even within laboratory populations (Brooke et al . 2000), and such balancing selection acting on the whole inversion might maintain highly divergent alleles at genes such as LRIM1, even in the absence of any direct selection acting on them.
Here we take a population-genetic approach to understanding the evolution of genes implicated in Plasmodium resistance or susceptibility in An gambiae , as identified by Osta et al . (2004;CTL4, CTLMA2, LRIM1), and Riehle et al . (2006) and Dong et al. (2006a;APL2/LRRD7). First, by considering the degree of substitution saturation at synonymous sites, we screened potential outgroups for their suitability for use in studies of sequence evolution in Anopheles gambiae s.s. Second, we surveyed An. gambiae populations from East and West Africa for the level and distribution of genetic diversity in these genes. Third, we used population-genetic tests based on the observed levels genetic differentiation between population and patterns of diversity to test whether these genes show evidence of strong or recent adaptive evolution.

Origin and identification of samples
The identification of all An. gambiae complex members was verified by diagnostic PCR (Scott et al. 1993). An. gambiae s.s. individuals fall into two groups that are reported to be partially reproductively isolated, known as the M and S molecular forms, and we assayed for the presence of M and S from IGS sequences using PCR-RFLP (e.g. Favia et al. 1997). An. gambiae individuals were collected from three sites in Africa: Mount Cameroon region (Cameroon, provided by S. Wanji, University of Buea), Mbita (Suba district, Western Kenya, provided by H. M. Ferguson, University of Glasgow, UK) and Paziani (Malindi district, coastal Kenya). As expected from their known geographical distributions , all surveyed Mbita (n = 26) and Paziani individuals were S form, and all but one of the Cameroon individuals were M form (n = 38; S-form individual not used in this study).
Other species were sampled from field collections and/or laboratory strains. Specifically, An. arabiensis individuals were sourced from a laboratory strain maintained at London School of Hygiene and Tropical Medicine (Strain 'Dondotha', provided by C. Curtis) and from field collections in Ahero (Kenya); An. merus individuals were sourced from a laboratory strain maintained by the Medical Research Council of South Africa (provided by R. Maharaj; MRC, Durban, South Africa) and from field collections in Kilifi (Kenya). An. quadriannulatus A individuals were sourced from a laboratory strain maintained at the University of Wageningen (Strain 'Sangqua', Zimbabwe, provided by W. Takken). An. bwambae (Bundibugyo,Uganda), An. christyi (Runkungiri District, Uganda) and An. vagus (Vietnam) were collected for ongoing taxonomic work by Y. M. Linton. An. albimanus and An. stephensi individuals were sourced from a laboratory strain maintained at the LSHTM (provided by C. Curtis, strains 'PANAMA' and 'DUB 234', respectively). Aedes aegypti sequences were derived from the publicly available draft genome sequence (strain LVP, Broad Institute and The Institute for Genomic Research).

Loci analysed
We amplified three An. gambiae loci that have been experimentally associated with resistance or susceptibility to Plasmodium berghei infection (LRIM1, CTL4, CTLMA2), and one candidate locus that experimentally associated with both P. bergei and P. falciparium phenotypes (APL2/ LRRD7). We also attempted to amplify APL1, but were unable to reliably attribute amplified sequences to a single locus; therefore APL1 was not included for further study. In An. gambiae the four loci are spread over approximately 27 Mbp on chromosome arm 2L, between cytological bands 24B and 21F (Fig. 1). In addition to these infectionassociated loci, we also sequenced two putative 'housekeeping' genes from the same region; ENSANGG00000019219 that has homology to Drosophila ubiquitin C-terminal hydrolase CG3431 (EC 3.4.19.12), and ENSANGG00000008286 that has homology to kynurenine 3-monooxygenases (EC 1.14.13.9). These loci are likely to be engaged in routine cell function and not associated with resistance to parasite infection.

2La inversion status
Two of the sequenced loci reside in the region of the 2La chromosomal inversion (ENSANGG0000008286 and LRIM1; Fig. 1), which is polymorphic in An. gambiae, and fixed in the other members of the gambiae complex: 2La in arabiensis, 2L+ a in the other complex members (reviewed by Coluzzi et al. 2002). Reduced recombination within the inversion and fluctuations in inversion frequency are expected to affect patterns of genetic diversity, particularly in populations where the inversion is polymorphic. We therefore attempted to identify the 2La/2L+ a inversion karytopes of individuals sampled from the Mbita and Mount Cameroon populations using a recently published PCR assay based on the cloned inversion breakpoints (Sharakhov et al. 2006;White et al. 2007). This assay uses a multiplex PCR with a single primer inside the inversion, paired with competing primers outside of the inversion breakpoints, one at either end. These are designed to amplify fragments of 207 bp and 492 bp, indicative of inversion forms 2L+ a and 2La, respectively.
In addition to these expected fragment lengths, we found the primers also amplified fragments of lengths c. 687 bp, 672 bp, 760 bp and 1020 bp. These do not appear to be PCR artefacts or to derive from a second unlinked locus, as within the polymorphic Mbita population the 2La/2L+ a amplification fragments were in Hardy-Weinberg equilibrium (51 individuals, χ 2 = 0.44, 1df., P = 0.51). Furthermore, direct sequencing of these fragments from a small subset of individuals allowed us to identify them as insertion/deletion derivatives of expected PCR products of the published assay. Therefore we were able to tentatively assign 2La/2L+ a inversion status based on fragment length, contingent on the assay primers being truly diagnostic in these populations. The Mount Cameroon population appeared to be fixed for the 2L+ a variant (20 individuals assayed), whilst the inversion was polymorphic in the Mbita population (54% 2La, 46% 2L+ a , 51 individuals assayed). The inversion-assay sequences have been submitted to GenBank under accession numbers EF519331-EF519344. Unfortunately, because we lacked sufficient DNA from some sampled individuals, we were unable to obtain molecular karyotypes for all the Mbita individuals genotyped at other loci. However, sufficient individuals were assayed to confirm that that the inversion karyotype did not have a large impact upon our conclusions (see Results and Discussion).

PCR and sequencing
Genomic DNA was extracted from single mosquitoes using DNeasy kits (QIAgen). PCR primers for each amplified region were designed from the published genome sequence of An. gambiae (Holt et al. 2002), and primer sequences are given in Table S1 (Supplementary material). Where possible, primers were positioned in protein-coding DNA sequence to facilitate amplification in species other than An. gambiae s.s. Consequently, sequences do not represent entire genes (see Table 2 for amplified lengths). Where amplification repeatedly failed, additional primers were investigated, but amplification success was generally sporadic outside of the gambiae complex.
Approximately 40% of sequences from CTL4, CTLMA2 and LRIM1 were cloned and each allele sequenced separately to allow determination of phase between heterozygous sites. All other sequences were obtained by directly sequencing PCR products derived from heterozygous individuals, and are therefore unphased. For cloned PCR products, LRIM1 was amplified as a single fragment, and CTL4:CTLMA2 (which are close neighbours) were amplified as a second fragment. Following PCR, unincorporated PCR primers and dNTPs were removed using 'QIAquick' spin-column kits (QIAgen), and products were cloned using TOPO Cloning Kits for sequencing (Invitrogen). From the remainder of the PCR fragments, that were not cloned and also for colony PCR products that derived from cloned fragments, unincorporated PCR primers and dNTPs were removed using exonuclease I (New England BioLabs) and shrimp alkaline phosphatase (Amersham). The PCR products were then sequenced in both directions using BigDye™ Terminator Cycle Sequencing Kit (v3.1, Applied BioSystems) reagents and an ABI capillary sequencer. The sequence chromatograms were inspected by eye to confirm the validity of all differences within and between species, and assembled using SeqManII (DNAstar Inc., Madison USA). All sequences have been submitted to GenBank as aligned sets using ambiguity codes to indicate heterozygous sites in direct sequences. Sequence accession numbers span the range EF519345-EF519529.

Genetic divergence between lineages
Many population-genetic tests for selection require the number of substitutions between species to be estimated (e.g. that of Hudson et al. 1987), and/or that homologous synonymous and nonsynonymous sites can be correctly identified between species (e.g. the McDonald-Kreitman test, 1991). If divergence is too high, substitutions may be saturated (particularly at synonymous sites and any nonsynonymous sites under positive selection) and homology between codons difficult to establish. Conversely, if divergence it low, then (weak) selection may not have had time to drive a detectable number of substitutions. Therefore, in order to select outgroups we estimated divergence between An. gambiae and the other lineages using two methods.
First, for all comparisons we calculated the Jukes-Cantor corrected number of substitutions using dnasp Version 4.10.3 (Rozas et al. 2003). Second, divergence between more distantly related species outside of the An. gambiae complex was estimated using paml Version 3.15 (Yang 1997), using both pairwise estimates (runmode = -2) and, for housekeeping locus ENSANGG0000008286 only, a tree-based analysis (runmode = 0, Model 1) where tree topology was determined using the neighbour-joining method, and was based on nonsynonymous sites only (as implemented in mega Version 3.1; Kumar et al. 2004). Jukes-Cantor correction assumes all changes are equally likely, and will tend to underestimate the number of substitutions when the sequence is approaching saturation; whereas the maximum-likelihood (ML) method should be less susceptible to this. MK tests (McDonald & Kreitman 1991) are intended to identify selection through an excess of amino acid substitution between species. Briefly, if it is assumed that synonymous sites are selectively neutral (or close to neutrality) and that polymorphic nonsynonymous sites are close to neutrality (a reasonable assumption as selected sites will be only transiently polymorphic) then departures from independence in a 2 × 2 contingency table for synonymous and nonsynonymous fixed differences and polymorphisms can be ascribed to selection acting on amino acid substitutions. However, if there have been multiple substitutions at each synonymous site (i.e. divergence is approaching mutation saturation) there is a danger that the number of synonymous substitutions will be underestimated, leading to an erroneous inference of positive selection. We therefore only applied MK tests to comparisons within the An. gambiae complex, where K S << 1. Tests were performed using dnasp Version 4.10.3 (Rozas et al. 2003).

Genetic diversity and differentiation
As a new allele spreads to fixation it carries with it neighbouring variants, reducing genetic diversity in the surrounding region (a 'selective sweep', e.g. Maynard Smith & Haigh 1974;Kaplan et al. 1989;Braverman et al. 1995). Thus a local reduction in diversity can provide evidence for recent selection. We calculated pairwise diversity (π) at synonymous and nonsynonymous sites in An. gambiae for all sequenced loci using dnasp. This expected reduction in diversity following a selective sweep forms the basis of the Hudson-Kreitman-Aguadé (HKA; Hudson et al. 1987) test for selection, in which diversity is compared between loci hypothesized to be under selection, and loci believed to be evolving according to the neutral model. Since variation in diversity between loci may also be due to variation in mutation rates, or to differing proportions of very highly constrained sites, HKA tests also take into account silent-site divergence between loci (Hudson et al. 1987).
Because of the potential for multiple substitutions at synonymous sites, we chose not to apply HKA tests to species comparisons outside the An. gambiae species complex; therefore we used An. merus as the outgroup for all HKA tests, as this taxon showed the largest average synonymous site divergence from An. gambiae seen within the complex (K S = 3.9%-11.1%, depending on locus). We assumed that the 'housekeeping' loci ENSANGG0000019219 and ENSANGG0000008286 would not be under strong positive or balancing selection, and we tested these 'control' loci against each putative Plasmodium resistance/ susceptibility gene (CTL4, CTLMA2, LRIM1, and APL2) using synonymous sites only. First, we pooled An. gambiae data from both Mbita and Cameroon populations and applied HKA tests as implemented in dnasp, which assesses significance based on the Chi-squared distribution. Second, we applied tests separately to Mbita and Cameroon populations using the program hka (Hey 2004), which assesses significance by coalescent simulation (10 000 replicates for each test), conservatively assuming samples come from a single panmictic population, and that loci are unlinked from each other but otherwise nonrecombining.
The distribution of genetic diversity within and between populations may also be informative regarding selection (reviewed in Beaumont 2005), as global selective sweeps or strong balancing selection will tend to reduce differentiation between populations at selected loci, while local adaptation will tend to increase differentiation. Therefore, to identify any large differences in differentiation, we calculated F ST between Mount Cameroon and Mbita, Kenya (West and East Africa, respectively) for each of the analysed loci. Rare introgression and/or ancestral polymorphism mean that the gambiae complex members can often share variation (Besansky et al. 1994;Besansky et al. 2003a;Slotman et al. 2005), such that the calculation of F ST between complex members becomes meaningful (e.g. Besansky et al. 2003a). Therefore, where sample sizes allowed, we additionally calculated F ST between An. gambiae and both An. merus and An. arabiensis.
Deviations from the neutral allele-frequency distribution Diversity will be regained by mutation after a selective sweep, leading to an excess of new rare variants. Tajima's D statistic (1989b), which measures the difference between average pairwise diversity (which in turn is dependent on allele frequencies) and Watterson's estimate of θ (which depends only the number of segregating sites ;Watterson 1975), is sensitive to this excess of rare variants and can be used to infer recent selection. Fu & Li (1993) have proposed a number of related test statistics that similarly identify deviations from the expected site frequency spectrum that are likely to be associated with recent selection. Fu and Li's D statistic (1993) is sensitive to the ratio of new to old mutations, as distinguished by polarizing alleles using an outgroup, and their D* statistic is intended to do the same in the absence of an outgroup. Fay & Wu (2000) have proposed a statistic, H, which is sensitive to a selective sweep through its effect on the number of high-frequency derived variants.
We calculated all four of these test statistics for each of the loci using dnasp. Because these statistics are also sensitive to population structure, we applied tests separately to the Mount Cameroon and Mbita populations, as well as to all An. gambiae individuals combined. Significance was assessed on the basis of coalescent simulations (as implemented in dnasp) using 1000 replicates for each test. Simulations were conditional on the number of segregating sites and conservatively assumed no recombination within loci.

Cross-species amplification and levels conservation
We were able to amplify all loci from five species of the An. gambiae complex (An. gambiae s.s., An arabiensis, An. merus, An. bwambae and An. quadriannulatus), with the exception of 'housekeeping' locus ENSANGG00000019219 from An. bwambae. Across species outside of the An. gambiae complex the housekeeping locus ENSANGG0000008286 proved the easiest to amplify (Table 1, Fig. 2), but amplification of the other loci was (Table 1). In particular, we were unable to amplify a product for LRIM1 for any species outside of the complex, and we were able to amplify only three loci from An. vagus and two from An. christyi.
The putative 'housekeeping' genes ENSANGG00000019219 and ENSANGG0000008286 were the most highly conserved between species; for example, homologues of the An. gambiae sequences could be identified in the draft Aedes aegypti genome (c. 80% amino acid identity; Table 1). In contrast, APL2, CTLMA2 and CTL4 were less conserved: we were unable to identify clear orthologues of these genes in the Aedes genome and there was relatively low amino acid identity between An. gambiae and other Anopheles mosquitoes (e.g. 71% in CTL4 between An. gambiae and An. vagus). Furthermore, our failure to amplify these loci from more than one species outside of the An. gambiae complex, and our failure to amplify LRIM1 from any species outside of the complex is, was suggestive of reduced sequence conservation.
As expected, all methods indicated that synonymous substitution between An. gambiae and Ae. aegypti was approaching saturation, i.e. K S >> 1 (Table 1). Comparisons between An. gambiae and other Anopheles species gave K S estimates in the range c. 0.4-1.5, and in general ML estimates using codeml (paml) were much higher than those based on the Jukes-Cantor correction. This suggests that multiple substitutions have occurred at many synonymous sites, such that raw counts, or simple corrections thereof, are a poor estimator of the number of synonymous substitutions between An. gambiae and the nearest relatives of the An. gambiae complex. In contrast, within the complex, synonymous site divergence was low, in the range K S 0.03-0.11, depending on locus and species comparison ( Table 2). For most loci (data not shown, but see Fig. 1

Fig. 2
Neighbour-joining tree for ENSANGG0000008286. In order to simplify the tree, only the most divergent sequences from within the An. gambiae complex are shown. K S (as estimated by paml) is approaching 1 between the An. gambiae complex and the next most closely related species. Topology is based on neighbourjoining using nonsynonymous sites only, and bootstrap values for major divergences are shown above internal nodes (mega Version 3.1, Kumar et al. 2004). Branch lengths are based on synonymous site divergence estimated using paml Version 3.14 (Yang 1997).   However, it should be noted that the extremely small numbers of fixed differences between An. gambiae s.s and the other complex members mean that there is very little power to identify an excess of amino acid substitutions.

Patterns of genetic diversity
Synonymous site diversity in An. gambiae, as measured by π (the average pairwise differences between sequences), ranged between approximately 2 and 4% at synonymous sites, and 0.2 and 0.5% at nonsynonymous sites (Table 3). Diversity was slightly, but nonsignificantly, higher in the Kenyan population (Mbita) than the Mount Cameroon population (P = 0.17, Mann-Whitney test across loci; Table 3). HKA tests measure heterogeneity in diversity between loci, relative to differentiation between species (e.g. as summarized by S/K in Table 4). All HKA tests for differences between the housekeeping loci and each of the candidate loci, tested independently, were nonsignificant. We found low, but significantly greater than zero, genetic differentiation between the Kenyan population (Mbita, all 'S' molecular form) and the Mount Cameroon population (all 'M' form; P < 0.05 for each locus, based on permutation tests). However, differentiation did not differ markedly between loci (highest, ENSAGG00000019219 F ST = 0.22; lowest, CTL4 F ST = 0.13). Statistics that measure a departure from the expected allele frequency distribution can be sensitive to population structure (e.g. Tajima 1989a); therefore, we calculated these statistics separately for the Mount Cameroon and Mbita (Kenya) populations. All statistics were calculated for synonymous sites only. For Mbita, Tajima's D statistic was generally positive, which is indicative of a high proportion of intermediate-frequency alleles, while for Mount Cameroon Tajima's D was generally negative, indicating an excess of high and low frequency alleles (Fig. 3a). This deviation in Tajima's D was only significant for APL2 in the Mbita population (no correction for multiple tests). However, across all loci the difference in Tajima's D between locations was marginally significant (P = 0.04, Mann-Whitney test). Fu and Li's D statistic was similarly more positive for the Mbita population than the Mount Cameroon population (P = 0.03, Mann-Whitney), but again only differed significantly from neutrality for one locus (CTL4 in Mbita, no correction for multiple tests; Fig. 3b). Fu and Li's D* statistic, which does not require an outgroup, was highly correlated with Fu and Li's D (r 2 = 0.97, data not shown). Fay and Wu's H statistic, which is sensitive to an excess of high-frequency derived alleles, did not differ from the neutral expectation for any loci (Fig. 3c). 'sites' is the number of (synonymous) sites analysed, n the number of haplotypes, S the number of synonymous segregating sites, K the number of synonymous substitutions between species. An. merus was used as the outgroup for all tests. A priori, it might be expected that the 2La inversion would affect patterns of genetic diversity in LRIM1 and ENSANGG00000008286. This could either be through fluctuations in inversion frequency leading to reduced polymorphism, or balancing selection acting on the inversion as a whole incidentally maintaining divergent haplotypes in these genes. However, we saw no evidence of this. Despite the Mt. Cameroon population being fixed for 2L+ a , and the Mbita population being polymorphic for the inversion, diversity was not appreciably elevated or reduced in the region of the inversion in the Mbita sample relative to the Mount Camaroon sample (Table 3), and deviations from the neutral allele frequency spectrum (e.g. as measured by Tajima's D or Fu & Li's D) appear to be no larger for LRIM1 and ENSANGG00000008286 than for other loci in the Mbita population (Fig. 3). Moreover, based on karyotype-assayed mosquitoes homozygous for either the 2La and 2L+ a inversion type, we could detect only one fixed difference across LRIM1 and ENSANGG00000008286, out of 115 segregating sites in total (only 3 homozygous mosquitoes genotyped for LRIM1 and ENSANGG00000008286 were available from the Mbita population for the 2La assay, making this a very conservative test for the effect of 2La).

Synonymous site divergence, introgression, and outgroup choice
This study used DNA polymorphism and divergence data to test whether mosquito immunity genes thought to be important for Plasmodium development might also be subject to host-pathogen arms races or balancing selection. For some of the tests, in particular MK and HKA tests, an accurate estimate of the number of synonymous substitutions between species is needed. This requires an outgroup for which synonymous substitution does not approach saturation. For many well-studied taxa this is a not difficult (e.g. Drosophila melanogaster-D. simulans K S = 0.12), but our data suggest that it may be hard to meet this requirement for Anopheles gambiae (e.g. Fig. 2). In particular, although previous morphology-based phylogenetic analyses suggested that An. vagus (an oriental member of the Pyretophorus series) and An. christyi (thought to be the basal taxon of the Pyretophorus series), were amongst the most closely related taxa to the An. gambiae complex (Anthony et al. 1999), for the genes we analysed K S was approaching 1 between these lineages (Fig. 2, Table 1), making them potentially inappropriate for outgroup-based tests. Although further systematic work may reveal that other taxa (unavailable to us) constitute a more appropriate outgroup, given the level of divergence in the genes analysed here it seems probable that future comparative work of this type will be restricted to species within the gambiae complex.
Unfortunately, using a species from within the complex as an outgroup presents its own problems. Members of the An. gambiae complex (including An. arabiensis, An. bwambae, An. merus and An. quadriannulatus A) are closely related, with K S highly variable between loci, but of the order of 3%-10% (Table 2, see also Besansky et al. 2003b)and if a correction is made for the level of diversity within lineages (K S -π S , p220. Nei 1987) then this becomes 0%-10%. Although the close relationships and low level of divergence within the An. gambiae complex make MK and HKA analyses viable in principle, in a given length of sequence the number of fixed differences is small, reducing statistical power. Furthermore, for some loci, sequences from other members of the gambiae complex nest within the An. gambiae s.s. clade (e.g. Fig. 2). In fact F ST (which measures genetic differentiation as the proportion of total diversity that is due to between-group differences) can be calculated between members of the An. gambiae complex as if they were partially isolated populations (Besansky et al. 2003b), and does not approach one (F ST is also highly variable between loci, Table 5; F ST range 0.3 -0.9 here and 0.1-0.8 in Besansky et al. 2003b). This may result either from introgression between species, or from the continued segregation of inherited ancestral polymorphisms, and may have implications for the inference of selection. Specifically, this may mean that, in effect, we are using an allele from within the population as an arbitrary outgroup for our analyses, based only on divergence. Moreover, introgressed alleles will not be differentiated at all between the hybridizing species -preventing comparative analysis on those loci -and this may even be compounded if introgression is more common for strongly selected loci.

Evidence for selection
Despite the potential difficulties in selecting a suitable outgroup, we were able to apply several population-genetic methods for detecting selection. Amino acid conservation between An. gambiae and other species was lower for APL2, CTL4 and CTLMA2 than for the 'housekeeping' locus ENSANGG0000008286 (Table 1), and our repeated failure to amplify LRIM1 outside of the An. gambiae complex was also suggestive of reduced sequence conservation. Moreover, within the gambiae complex K A /K S ratios were generally higher for the four putative Plasmodium-associated genes than for the two 'housekeeping' loci; for example, across all within-complex comparisons the average K A /K S for candidate resistance/susceptibility-associated genes was 0.14, but only 0.04 for the 'housekeeping' loci. An elevated rate of amino-acid evolution is in-line with what is known for immunity genes in other taxa (Hurst & Smith 1999;Schlenke & Begun 2003). However, the K A /K S ratio per se provides no evidence for adaptive change, as the observed reduction in amino-acid conservation may equally be due to lower selective constraints in these immunity-related genes relative to our house-keeping genes.
McDonald-Kreitman tests can identify positive selection by comparing the relative numbers of synonymous and nonsynonymous within-species polymorphisms and between-species substitutions (McDonald & Kreitman 1991). If there are an excess of amino-acid substitutions between species, this can be ascribed to selection fixing new amino-acid variants in each species. For most of the loci we analysed we found an excess of amino-acid variation, rather than an excess of amino acid substitutions (NI > 1, Table 2). This suggests that these none of the analysed loci (APL2, LRIM1, CTL4 or CTLMA2) have been under strong or consistent directional selection since the common ancestor of extant An. gambiae and other members of the gambiae complex. Although it may be argued that this is not a robust result, as the relatively close relationship (K S ~ 3 -11%) between species leads to low power in MK tests, subject to the caveats regarding introgression given above, it does at least suggest that if directional selection is acting it is likely to be weak. For example, it is possible to find positive MK tests in comparisons between very closely related species, such as Drosophila yakuba-D. santomea (K S ~3%) (antiviral genes R2D2 and Ago2, Obbard et al. 2006).
Recent selective sweeps associated with the fixation of a single new allele can be detected through their effect on genetic diversity and the allele frequency spectrum (reviewed in Nielsen 2005). Although these statistics can be sensitive to demographic processes (such as population growth), when compared between loci they provide a useful method for detecting selection in the absence of an outgroup. However, none of the loci we analysed showed an extreme level of synonymous site diversity, and all were close to the level found previously for other genes in An. gambiae (1.9%-4.3% here vs. 1%-4.9% in Besansky et al. 2003a). Genetic diversity was slightly higher for the housekeeping genes than for the putative Plasmodium-resistance genes (average 3.5% vs. 2.6% at synonymous sites), but HKA tests found that this effect was not significant (Table 4). This may be due to the low power we have to detect small deviations, given the sample sizes (of the order of 30 polymorphic sites and 20 fixed differences for each test, Table 4), but again at least suggests that any reduction in diversity compared to housekeeping loci is small.
Although skews in the allele-frequency distribution were not significantly different from a neutral model for individual loci, suggesting that none have undergone a recent selective sweep, overall there was a slight but significant difference between the Mbita (East Africa) and Mount Cameroon (West Africa) populations. Tajima's D differed between populations, being generally positive in the Kenya population (a slight nonsignificant excess of intermediate frequency alleles), and negative in the Mount Cameroon population (a slight nonsignificant excess of extreme-frequency alleles). Fu and Li's D (and D*) showed a similar pattern, with a greater proportion of new mutations in the Cameroon population than the Kenya population. This may indicate a difference in demographic history between populations, e.g. population growth or a recent bottle-neck in the Mount Cameroon population, and/or some form of substructuring within in the Mbita population. However, because all Mount Cameroon individuals analysed in this study were M form, and all Mbita individuals were S form (as assessed by PCR-RFLP) it is impossible to say whether this difference in the allele frequency spectrum, and the level of population differentiation, is due to demographic factors associated with geographical region, or with the M and S molecular forms, or both. Perhaps surprisingly, we also found the 2La  Table 2) so that these values should be treated with caution. All statistics are significantly greater than zero, by 1000 permutations. inversion had little or no impact on the two loci it spans (LRIM1 and ENSANGG0000008286). However, it is known that recombination is most strongly suppressed near inversion breakpoints (these genes are c. 9.8Mbp and 3.5 Mbp from the inversion ends, respectively) and gene conversion can homogenize genetic diversity between inversions over time (e.g. Schaeffer & Anderson 2005).

Understanding Plasmodium resistance in Anopheles mosquitoes
Despite being previously implicated in Plasmodium resistance or susceptibility, none of the genes we analysed showed any evidence of strong or recent adaptive evolution in An. gambiae, either under directional or balancing selection. It is therefore possible that that there is little or no selective pressure for adaptive change exerted on these genes. Indeed, contrary to the study that identified LRIM1, CTL4 and CTLMA2 (which used the non-natural host-parasite combination of An. gambiae and P. berghei; Osta et al. 2004), and in broad agreement with our findings, recent work indicates that these genes may not be important in resistance to P. falciparum, a natural parasite of An. gambiae (Cohuet et al. 2006). However, this does not necessarily detract from the wider host-parasite implications of our work, as genes involved in the encapsulation process are likely to engage with many parasites other than P. falciparum.
In addition, many Anopheles loci other than those analysed here have been identified as putative immunity genes (Christophides et al. 2002), and a few are good candidates for mediating levels of Plasmodium infection. These include APL1 (Riehle et al. 2006), which was identified in the same screen as APL2 but which we were unable to reliably amplify by PCR, serine protease inhibitors and TEPs (Blandin et al. 2004;Abraham et al. 2005;, which are the subject to ongoing population-genetic work in our lab. We believe studies which can uncover the rate and mode of evolution in genes mediating vector competence are essential if we are to gain a full understanding of insect-vectored diseases such as malaria. It is perhaps too often overlooked that Plasmodium evolution may be shaped by interaction with the vector's immune system, as well as that of the vertebrate host.