INAUGURAL ARTICLE by a Recently Elected Academy Member: Prevalence of positive selection among nearly neutral amino acid replacements in Drosophila

2007 
Synonymy in the genetic code results in a natural periodicity in which the third nucleotide of many codons is only weakly constrained because any of two or more nucleotides at this position specify the same amino acid in the polypeptide chain. Fourfold degenerate codons allow any nucleotide at the third position, whereas twofold degenerate codons treat either both pyrimidine nucleotides or both purine nucleotides as synonymous. Of the 20 common amino acids, the codons for 12 are twofold degenerate at the third position, those for 1 are threefold degenerate (isoleucine, which allows U, C, or A at the third position), and those for 8 are fourfold degenerate. (In this tabulation, leucine, serine, and arginine are each counted twice because each is specified by six codons, and methionine and tryptophan are omitted because each is specified by a single codon.) In a typical coding sequence with a GC content of 50%, the average codon degeneracy is 3. The high level of synonymy in the genetic code is a boon to population genomics, because the synonymous sites in a coding sequence serve as a sort of internal control for historical and demographic factors affecting a population, relatively free of selective constraint. Because nonsynonymous sites in the same coding sequence share the same history and demography as the synonymous sites, but may be subject to greater selective constraints or even positive selection, comparisons between nonsynonymous sites and synonymous sites can potentially reveal the magnitude and direction of selection pressures operating on the nonsynonymous sites. An early application of this approach compared the frequency spectrum of polymorphic nonsynonymous sites with that of synonymous sites among sequences encoding 6-phosphogluconate dehydrogenase in a sample of the enteric bacterium Escherichia coli (1). An excess of low-frequency nonsynonymous polymorphisms suggested that most amino acid polymorphisms in this enzyme are very slightly deleterious, with a selection coefficient on the order of 6–26 times the reciprocal of the effective population size.
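The third-position tabulation given above can be checked directly against the standard genetic code. The following is a minimal sketch, not part of the original analysis; the function name `third_position_classes` is hypothetical, and the code table is the standard nuclear code written out by hand.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated with bases in the order U, C, A, G
# at each position ('*' marks a stop codon).
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def third_position_classes():
    """Count codon families by third-position degeneracy.

    A family is one amino acid together with one fixed pair of first and
    second nucleotides, so leucine, serine, and arginine each contribute
    two families. Stop codons are ignored.
    """
    families = Counter()
    for first, second in product(BASES, repeat=2):
        for third in BASES:
            aa = CODE[first + second + third]
            if aa != "*":
                families[(first + second, aa)] += 1
    # Map degeneracy (1, 2, 3, or 4) to the number of families with it.
    return Counter(families.values())


classes = third_position_classes()
for degeneracy in sorted(classes):
    print(degeneracy, classes[degeneracy])
```

Counting families by their first two nucleotides, rather than by amino acid alone, is what makes the six-codon amino acids appear twice, as in the parenthetical remark above.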
No more than half of all amino acid polymorphisms in the enzyme could be considered selectively neutral. An important extension of this approach came from McDonald and Kreitman (2), who compared polymorphisms within species to divergence between species. This approach avoided any need to estimate the allele-frequency spectrum of polymorphisms, while taking advantage of evolutionary changes through time. First applied to the Adh gene encoding alcohol dehydrogenase in three species of the Drosophila melanogaster species subgroup, the approach yielded evidence that a significant proportion of amino acid replacements between species are driven by positive selection. Explicit expressions for the expected values in comparisons of polymorphism and divergence were soon developed based on a sampling theory for the independent infinite-sites model with selection (3). Application of this theory to the Drosophila Adh data again suggested small selection coefficients, on the order of five times the reciprocal of the effective population size, and that the number of amino acids in the enzyme that are susceptible to favorable mutation at any one time ranges from 2 to 23. One limitation of the McDonald–Kreitman test is that, for the sample sizes typically available, the statistical test for homogeneity in a 2 × 2 table has relatively low power. Another limitation is that such data often include one or more cells whose entry is 0. Thus, there has been an effort to examine polymorphism-divergence data across multiple genes to estimate α, defined as the fraction of amino acid fixations driven by positive selection (4, 5). Maximum-likelihood approaches yield estimates of α of 25% ± 20% across several species of Drosophila (6, 7).
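To make the 2 × 2 comparison concrete, the sketch below computes a two-sided Fisher exact test for a McDonald–Kreitman table together with a simple point estimate of α. It is an illustration under stated assumptions, not the method used in the studies cited: the estimator α = 1 − (Ds·Pn)/(Dn·Ps) is a commonly used single-table form and may differ from the multi-gene and maximum-likelihood estimators of refs. 4–7, the function names are hypothetical, and the counts are invented for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def alpha_estimate(dn, ds, pn, ps):
    """Point estimate alpha = 1 - (Ds * Pn) / (Dn * Ps).

    Returns None when a zero cell makes the ratio undefined.
    """
    if dn == 0 or ps == 0:
        return None
    return 1 - (ds * pn) / (dn * ps)


# Hypothetical counts (assumption, not data from any cited study):
# fixed differences: 10 nonsynonymous, 20 synonymous;
# polymorphisms:      5 nonsynonymous, 40 synonymous.
dn, ds, pn, ps = 10, 20, 5, 40
print("P =", fisher_exact_two_sided(dn, ds, pn, ps))
print("alpha =", alpha_estimate(dn, ds, pn, ps))
```

The `None` return in `alpha_estimate` mirrors the zero-cell problem noted above: with Dn or Ps equal to 0, a single-gene estimate is undefined, which is one motivation for pooling across genes.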
This approach assumes that harmful mutations are so drastically deleterious, and beneficial mutations so strongly favored, that their fate is settled so rapidly by selection that they cannot contribute significantly to the level of amino acid polymorphism. Considerable evidence suggests that this assumption is not correct (1, 5, 8–10). To the extent that mildly deleterious and mildly favorable nonsynonymous substitutions contribute to amino acid polymorphisms, the estimate of α is biased downward. The assumption of fluctuating selection leads to somewhat higher estimates (11). Quite another approach to the analysis of polymorphism and divergence makes use of population genetics theory (3) to estimate the values of the parameters governing mutation, selection, and random genetic drift at independent nucleotide sites (12). The intuitive appeal of this approach is that it avoids the artificial dichotomy between what is selectively neutral and what is not, and instead focuses on the actual estimates of the selection coefficients that emerge from the analysis. In this model, the entry in each cell of a McDonald–Kreitman table can be shown to be an independent Poisson random variable (3), and the parameters governing mutation, selection, random genetic drift, and time since species divergence can be estimated by Markov chain Monte Carlo simulation using a hierarchical Bayesian model (12). In the original formulation, each nonsynonymous substitution likely to contribute to polymorphism or divergence in a particular gene is assumed to have the same selective effect, but these values can differ from one gene to the next. The selective effect is scaled according to the diploid effective population number, which is to say that it is estimated as some multiple of N_e s, where s is the conventional selection coefficient and N_e is the diploid effective population size.
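Why even small scaled selection coefficients matter for divergence can be seen from the standard diffusion result for the fixation rate of a selected mutation relative to a neutral one. The sketch below assumes the scaling in which γ = N_e s and the relative rate is 2γ / (1 − e^(−2γ)); other conventions differ by a factor of 2 in γ, so the scaling used here is an assumption rather than a statement of the exact parameterization in refs. 3 and 12. The function name is hypothetical.

```python
from math import expm1


def relative_fixation_rate(gamma):
    """Fixation rate of a mutation with scaled selection gamma = N_e * s,
    relative to a neutral mutation: 2*gamma / (1 - exp(-2*gamma)).

    Intended for moderate |gamma|; very large negative values would
    overflow in expm1.
    """
    if abs(gamma) < 1e-12:
        return 1.0  # neutral limit of the expression
    # 1 - exp(-2*gamma) equals -expm1(-2*gamma); expm1 stays accurate near 0.
    return 2.0 * gamma / (-expm1(-2.0 * gamma))


for gamma in (-5, -2, 0, 2, 5, 10):
    print(gamma, relative_fixation_rate(gamma))
```

Under this scaling, mildly deleterious mutations fix at a small fraction of the neutral rate while mildly favorable ones fix at a multiple of it, which is why the two classes leave different signatures in polymorphism and in divergence.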
This approach is reliable provided that the species being compared are sufficiently closely related that multiple nucleotide substitutions at the same site, or synonymous sites mutating to nonsynonymous sites or vice versa, can be ignored (13). The assumption that each nonsynonymous substitution in a gene has the same selective effect is obviously artificial, but it served the original purpose of estimating the distribution of the scaled selection coefficient among genes (12). A more sophisticated and biologically realistic model was introduced by Sawyer et al. (9). In this model, the selective effect of each nonsynonymous mutation likely to contribute to polymorphism or divergence is regarded as a random sample from some underlying normal distribution whose mean but not variance may differ from one gene to the next. The spirit of the model is analogous to that of analysis of variance, in which different “treatments” (in this case, genes) have different “effects” (in this case, mean selective effects). The assumption that the underlying distributions are Gaussian is natural in a continuous-time model of selection (14) given the implications of the Central Limit Theorem, but plausible alternatives should also eventually be considered. Changes in demographics can confound the interpretation of polymorphism and divergence (2, 5, 15). For example, a rapid dramatic increase in the effective population number will result in the selective elimination of some deleterious nonsynonymous polymorphisms that might previously have remained polymorphic, thereby reducing the nonsynonymous polymorphisms without affecting nonsynonymous divergence. Demographics need to be considered for the sibling species D. melanogaster and Drosophila simulans, which appear to have expanded their range out of Africa ≈10,000–15,000 years ago (16, 17), probably accompanied by a population bottleneck followed by an expansion (18).
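The random-effects structure of the Sawyer model described above (gene-specific means, a shared within-gene variance) can be sketched as a small simulation. This is only an illustration of the hierarchy, not the authors' Markov chain Monte Carlo procedure: drawing the gene means themselves from a normal distribution is an assumption made here for convenience, and all names and parameter values are hypothetical.

```python
import random


def simulate_selective_effects(n_genes, n_mutations, mu,
                               sigma_between, sigma_within, seed=0):
    """Draw scaled selective effects under a two-level normal model.

    Each gene gets its own mean effect (assumed here to be normal with
    mean `mu` and SD `sigma_between`); every new mutation in that gene
    is then drawn from a normal distribution with that gene's mean and
    a within-gene SD shared by all genes.
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(n_genes):
        gene_mean = rng.gauss(mu, sigma_between)
        effects.append([rng.gauss(gene_mean, sigma_within)
                        for _ in range(n_mutations)])
    return effects


# Hypothetical parameter values, chosen only for illustration.
genes = simulate_selective_effects(n_genes=3, n_mutations=5, mu=-1.0,
                                   sigma_between=2.0, sigma_within=1.0)
for i, gene in enumerate(genes):
    print(i, [round(g, 2) for g in gene])
```

As in analysis of variance, the genes play the role of treatments: differences among rows reflect the between-gene spread, whereas scatter within a row reflects the shared within-gene variance.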
Hence, for Drosophila the ideal polymorphism data would seem to be those derived from African populations. As it happens, Proschel et al. (19) have recently acquired such data for a large set of genes. These data afford a valuable opportunity to apply the Sawyer model (9) to estimate values of great interest in population genomics, including the fraction of amino acid polymorphisms that are deleterious, the fraction of amino acid differences between related species that are nearly neutral or positively selected, and the distribution of selection coefficients among new mutations likely to become polymorphic or among mutations that are fixed. In this article we present the results of the analysis. The principal inferences are that the majority of amino acid polymorphisms within Drosophila species are mildly deleterious but that a large fraction of amino acid differences between species are driven by positive selection. However, the magnitude of selection that needs to be postulated to explain the data is extremely small, usually >2 but <10 times the reciprocal of the effective population size. These results are predicated on the assumption that most synonymous polymorphisms and fixed differences are selectively neutral or nearly neutral, and so they pertain only to amino acid substitutions and not to nucleotide substitutions in noncoding DNA.