Alignment-free sequence analysis

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches. In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches. The emergence and need for the analysis of different types of data generated through biological research has given rise to the field of bioinformatics. Molecular sequence and structure data of DNA, RNA, and proteins, gene expression profiles or microarray data, metabolic pathway data are some of the major types of data being analysed in bioinformatics. Among them sequence data is increasing at the exponential rate due to advent of next-generation sequencing technologies. Since the origin of bioinformatics, sequence analysis has remained the major area of research with wide range of applications in database searching, genome annotation, comparative genomics, molecular phylogeny and gene prediction. The pioneering approaches for sequence analysis were based on sequence alignment either global or local, pairwise or multiple sequence alignment. Alignment-based approaches generally give excellent results when the sequences under study are closely related and can be reliably aligned, but when the sequences are divergent, a reliable alignment cannot be obtained and hence the applications of sequence alignment are limited. Another limitation of alignment-based approaches is their computational complexity and are time-consuming and thus, are limited when dealing with large-scale sequence data. The advent of next-generation sequencing technologies has resulted in generation of voluminous sequencing data. The size of this sequence data poses challenges on alignment-based algorithms in their assembly, annotation and comparative studies. Alignment-free methods can broadly be classified into five categories: a) methods based on k-mer/word frequency, b) methods based on the length of common substrings, c) methods based on the number of (spaced) word matches, d) methods based on micro-alignments, e) methods based on information theory and f) methods based on graphical representation. Alignment-free approaches have been used in sequence similarity searches, clustering and classification of sequences, and more recently in phylogenetics (Figure 1). Such molecular phylogeny analyses employing alignment-free approaches are said to be part of next-generation phylogenomics. A number of review articles provide in-depth review of alignment-free methods in sequence analysis. The AFproject is an international collaboration to benchmark and compare software tools for alignment-free sequence comparison. The popular methods based on k-mer/word frequencies include feature frequency profile (FFP), Composition vector (CV), Return time distribution (RTD), frequency chaos game representation (FCGR). and Spaced Words The methodology involved in FFP based method starts by calculating the count of each possible k-mer (possible number of k-mers for nucleotide sequence: 4k, while that for protein sequence: 20k) in sequences. Each k-mer count in each sequence is then normalized by dividing it by total of all k-mers' count in that sequence. This leads to conversion of each sequence into its feature frequency profile. The pair wise distance between two sequences is then calculated Jensen–Shannon (JS) divergence between their respective FFPs. The distance matrix thus obtained can be used to construct phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA etc. In this method frequency of appearance of each possible k-mer in a given sequence is calculated. The next characteristic step of this method is the subtraction of random background of these frequencies using Markov model to reduce the influence of random neutral mutations to highlight the role of selective evolution. The normalized frequencies are put a fixed order to form the composition vector (CV) of a given sequence. Cosine distance function is then used to compute pairwise distance between CVs of sequences. The distance matrix thus obtained can be used to construct phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA etc. This method can be extended through resort to efficient pattern matching algorithms to include in the computation of the composition vectors: (i) all k-mers for any value of k, (ii) all substrings of any length up to an arbitrarily set maximum k value, (iii) all maximal substrings, where a substring is maximal if extending it by any character would cause a decrease in its occurrence count. The RTD based method does not calculate the count of k-mers in sequences, instead it computes the time required for the reappearance of k-mers. The time refers to the number of residues in successive appearance of particular k-mer. Thus the occurrence of each k-mer in a sequence is calculated in the form of RTD, which is then summarised using two statistical parameters mean (μ) and standard deviation (σ). Thus each sequence is represented in the form of numeric vector of size 2·4k containing μ and σ of 4k RTDs. The pair wise distance between sequences is calculated using Euclidean distance measure. The distance matrix thus obtained can be used to construct phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA etc.

Parent Topic

Child Topic

No Parent Topic