Context dependency of nucleotide probabilities and variants in human DNA

2021 
Abstract

Background: Genomic DNA has been shaped by mutational processes through evolution. The cellular machinery for error correction and repair has left its marks in the nucleotide composition, along with structural and functional constraints. Therefore, the probability of observing a base in a certain position in the human genome is highly context-dependent. Mutations are also known to depend on the genomic context, but in previous work the nucleotide distribution and the mutability have not been combined.

Results: Here we use a context-dependent nucleotide model as the basis for a mutability model for the human genome. We first investigate simple models of nucleotides conditioned on sequence context and develop a bidirectional Markov model that depends on up to 14 nucleotides to each side. We show how the genome predictability varies across different types of genomic regions. Surprisingly, this model can predict a base from its context with an average accuracy of more than 50%. Inspired by DNA substitution models, we develop a model of mutability that estimates a mutation matrix (called the alpha matrix) on top of the nucleotide distribution. The advantage of this separation into two terms is that the alpha matrix can be estimated from a much smaller context than the nucleotide model, while the final model still depends on the full context of the nucleotide model. With the bidirectional Markov model of order 14 and an alpha matrix dependent on just one base to each side, we obtain a model that compares well with a model of mutability that estimates mutation probabilities directly conditioned on three nucleotides to each side. For high-probability population variants, which are mainly CpG sites, the simple model fits better than our hybrid model, but for somatic variants the opposite holds. Interestingly, the model is not very sensitive to the size of the context for the alpha matrix.

Conclusions: Our study found strong context dependencies of nucleotides in the human genome. The best model can estimate the nucleotide probabilities depending on contexts of up to 14 nucleotides to each side. Based on these models, a substitution model was constructed that separates into the context model and an alpha matrix that depends on a small context. These models fit variants very well.
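
To make the idea of a nucleotide probability conditioned on its sequence context concrete, the following is a minimal sketch of a count-based model that conditions the central base on k flanking bases to each side. It is an illustration only and is not the bidirectional Markov model of the paper; the function names, the additive pseudocount, and the choice to score on the same sequence used for counting are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def count_contexts(seq, k):
    """Count the central base for every (left k-mer, right k-mer) context."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq) - k):
        context = (seq[i - k:i], seq[i + 1:i + k + 1])
        counts[context][seq[i]] += 1
    return counts


def base_probabilities(counts, context, pseudocount=1.0):
    """Return P(base | context) with additive smoothing over A, C, G, T.

    The pseudocount is an assumption of this sketch, not taken from the paper.
    """
    observed = counts.get(context, Counter())
    total = sum(observed.values()) + 4 * pseudocount
    return {b: (observed[b] + pseudocount) / total for b in "ACGT"}


def prediction_accuracy(seq, counts, k):
    """Fraction of positions whose most probable base equals the actual base."""
    positions = range(k, len(seq) - k)
    hits = 0
    for i in positions:
        context = (seq[i - k:i], seq[i + 1:i + k + 1])
        probs = base_probabilities(counts, context)
        hits += max(probs, key=probs.get) == seq[i]
    return hits / len(positions)


if __name__ == "__main__":
    toy = "ACGTACGTACGTACGT"  # toy sequence, not real genomic data
    model = count_contexts(toy, k=1)
    print(base_probabilities(model, ("A", "G")))
    print(prediction_accuracy(toy, model, k=1))
```

Scoring on the sequence used for counting overstates accuracy; a realistic evaluation would hold out separate regions. A direct count table like this also grows as 4^(2k), which is one reason a large context such as 14 bases to each side calls for a more structured model than plain counting.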