Linear normalised hash function for clustering gene sequences and identifying reference sequences from multiple sequence alignments

Manal Helal, Fanrong Kong, Sharon C.-A. Chen, Fei Zhou, Dominic E. Dwyer, John Potter, Vitali Sintchenko

2012

Background: Comparative genomics has placed additional demands on assessing similarity between sequences and on clustering them as a means of classification. However, defining the optimal number of clusters, cluster density and cluster boundaries for sets of potentially related gene sequences with variable degrees of polymorphism remains a significant challenge. The aim of this study was to develop a method that identifies the cluster centroids and the optimal number of clusters for a given sensitivity level and that works equally well across different sequence datasets.

Keywords:

Comparative genomics
Hash function
Multiple sequence alignment
Cluster analysis
Polymorphism
Gene
Bioinformatics
Distance matrix
Computer science
Text mining
Pattern recognition
Data mining
Cluster (physics)
Centroid
Artificial intelligence
