Identifying DNA-binding proteins based on multi-features fusion and LASSO feature selection

2020 
DNA-binding proteins play an indispensable role in maintaining genetic information and are important for biomedical research, but traditional experimental methods identify them inefficiently because of their sheer number. By contrast, machine learning offers satisfactory speed and decent accuracy. This work therefore extracts four different feature sets from primary and secondary sequence information: RS, PseAACS, PSSM-ACCT and PSSM-DWT. Using LASSO for dimension reduction, we experiment with combinations of the feature submodels to find the optimal number of top-ranked features. The selected features are used to train an ensemble subspace discriminant classifier that predicts DNA-binding proteins. Three datasets are used to evaluate the proposed approach: PDB1075 and PDB594 for 5-fold cross-validation, and PDB186 for an independent test. In 5-fold cross-validation, PDB1075 and PDB594 reach a high precision of 86.98% and 88.2%, respectively, while the accuracy on the independent test is 75.8%, which suggests that the proposed method can predict DNA-binding proteins effectively.
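The LASSO ranking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic stand-ins for the fused feature vectors, the labels are coded as +1/-1, and the function names (`soft_threshold`, `lasso_coordinate_descent`, `top_k_features`), the penalty `lam=0.05` and the cut-off `k=2` are all assumptions chosen for illustration. The ensemble subspace discriminant classifier is not reproduced here.

```python
import random


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            scale = sum(c * c for c in col) / n
            new_w = soft_threshold(rho, lam) / scale if scale > 0 else 0.0
            delta = new_w - w[j]
            if delta != 0.0:
                # Keep the residual consistent with the updated weight.
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_w
    return w


def top_k_features(weights, k):
    """Return the indices of the k features with the largest |weight|."""
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return order[:k]


# Synthetic stand-in for fused features (assumption): 8 columns, of which
# only columns 0 and 3 actually determine the +1/-1 class label.
rng = random.Random(0)
n_samples, n_features = 300, 8
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [1.0 if 2.0 * row[0] - 3.0 * row[3] > 0 else -1.0 for row in X]

weights = lasso_coordinate_descent(X, y, lam=0.05)
selected = top_k_features(weights, k=2)

# Reduced feature matrix that would be passed on to a downstream classifier.
X_reduced = [[row[j] for j in selected] for row in X]
```

In practice the number of retained features would be tuned, for example by cross-validation, rather than fixed in advance as it is in this toy example.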