LYRUS: A Machine Learning Model for Predicting the Pathogenicity of Missense Variants

2021 
Single amino acid variations (SAVs) are a primary contributor to variations in the human genome. Identifying pathogenic SAVs can aid in the diagnosis and understanding of the genetic architecture of complex diseases, such as cancer. Most approaches for predicting the functional effects or pathogenicity of SAVs rely on either sequence or structural information. Nevertheless, previous analyses have shown that methods that depend on only sequence or structural information may have limited accuracy. Recently, researchers have attempted to increase the accuracy of their predictions by incorporating protein dynamics into pathogenicity predictions. This study presents LYRUS, a machine learning method that uses an XGBoost classifier selected by TPOT to predict the pathogenicity of SAVs. LYRUS incorporates five sequence-based features, six structure-based features, and four dynamics-based features. Uniquely, LYRUS includes a newly proposed sequence co-evolution feature called variation number. LYRUS's performance was evaluated using a dataset that contains 4,363 protein structures corresponding to 20,307 SAVs based on human genetic variant data from the ClinVar database. Based on our dataset, the LYRUS classifier has higher accuracy, specificity, F-measure, and Matthews correlation coefficient (MCC) than alternative methods including PolyPhen2, PROVEAN, SIFT, Rhapsody, EVMutation, MutationAssessor, SuSPect, FATHMM, and MVP. Variation numbers used within LYRUS differ greatly between pathogenic and neutral SAVs, and have a high feature weight in the XGBoost classifier employed by this method. Applications of the method to PTEN and TP53 further corroborate LYRUS's strong performance. LYRUS is freely available and the source code can be found at https://github.com/jiaying2508/LYRUS.
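The four evaluation metrics named above (accuracy, specificity, F-measure, and MCC) can all be derived from the counts of a binary confusion matrix. The following is a minimal sketch using their standard definitions, with pathogenic taken as the positive class. The function name `binary_metrics` and the example counts are hypothetical and chosen for illustration only; they are not taken from the LYRUS code or its reported results.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp/fn: pathogenic variants predicted pathogenic/neutral.
    tn/fp: neutral variants predicted neutral/pathogenic.
    """

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    specificity = ratio(tn, tn + fp)  # true negative rate
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)       # sensitivity
    f_measure = ratio(2 * precision * recall, precision + recall)
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Illustrative counts only (assumed, not from the study).
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, so it stays informative when the pathogenic and neutral classes are imbalanced, which is one reason it is commonly reported alongside accuracy.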