nGIA: A novel Greedy Incremental Alignment based algorithm for gene sequence clustering

2022 
Gene sequence clustering is a fundamental task in computational biology and bioinformatics, underpinning studies of phylogenetic relationships, gene function prediction, and related problems. With the rapid growth of biological data (gene and protein sequences), gene sequence clustering algorithms face growing challenges in both precision and efficiency. The increasing number of redundant sequences in gene sequence databases raises the memory and computing demands of most clustering methods. For example, the original greedy incremental alignment-based (GIA) clustering algorithm obtains high-precision clustering results, but with very poor efficiency. More efficient greedy incremental clustering algorithms have since been developed, but they usually trade clustering precision for speed. Algorithms with a better balance between precision and speed are needed. This paper proposes a novel Greedy Incremental Alignment-based algorithm, called nGIA, for gene clustering with high efficiency and precision. nGIA consists of a pre-filter, a modified short word filter, a new data packing strategy, and a modified greedy incremental method, and is parallelized on GPU. Experimental evaluations on four independent datasets show that the proposed tool can cluster datasets with a high precision of 99.99%. Compared with CD-HIT, Vsearch, and Uclust, nGIA is on average 13.6x, 6.2x, and 1.7x faster, respectively. In addition, we have developed a multi-node version to handle large data sets. A strong scalability test shows that the multi-node version of nGIA can scale up to 32 threads with a 31% parallel efficiency. The software is available at .
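
To make the general idea concrete, the following is a minimal CPU-only sketch of greedy incremental clustering with a short word filter. It is not nGIA's implementation: the function names, the word size `k`, the LCS-based identity measure, and the heuristic word-count bound are all assumptions chosen for illustration, and the pre-filter, data packing strategy, and GPU parallelization described above are omitted.

```python
from math import ceil


def kmers(seq, k):
    """Set of all length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_words(query, rep_words, k):
    """Number of word positions in query whose word occurs in the representative."""
    return sum(1 for i in range(len(query) - k + 1) if query[i:i + k] in rep_words)


def lcs_identity(a, b):
    """Stand-in identity (assumption): LCS length divided by the shorter length."""
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / min(len(a), len(b))


def greedy_cluster(seqs, threshold=0.9, k=3):
    """Greedy incremental clustering: longest sequences become representatives first."""
    clusters = {}  # representative sequence -> list of member sequences
    reps = []      # (representative, its word set), in order of creation
    for s in sorted(seqs, key=len, reverse=True):
        # Heuristic bound (assumption): each allowed mismatch removes at most k words.
        need = (len(s) - k + 1) - ceil((1 - threshold) * len(s) * k)
        for rep, words in reps:
            if shared_words(s, words, k) < need:
                continue  # short word filter: skip the costly comparison
            if lcs_identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            # No representative was similar enough: s starts a new cluster.
            reps.append((s, kmers(s, k)))
            clusters[s] = [s]
    return clusters
```

Calling `greedy_cluster` on a list of sequence strings returns a dictionary that maps each representative to its cluster members. In this sketch the word filter only saves work; the accept or reject decision is always made by the identity comparison.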