language-icon Old Web
English
Sign In

Smith–Waterman algorithm

The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981. Like the Needleman–Wunsch algorithm, of which it is a variation, Smith–Waterman is a dynamic programming algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme). The main difference to the Needleman–Wunsch algorithm is that negative scoring matrix cells are set to zero, which renders the (thus positively scoring) local alignments visible. Traceback procedure starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment. Because of its quadratic complexity in time and space, it often cannot be practically applied to large-scale problems and is replaced in favor of less general but computationally more efficient alternatives such as (Gotoh, 1982), (Altschul and Erickson, 1986), and (Myers and Miller, 1988). In 1970, Saul B. Needleman and Christian D. Wunsch proposed a heuristic homology algorithm for sequence alignment, also referred to as the Needleman–Wunsch algorithm. It is a global alignment algorithm that requires O ( m n ) {displaystyle O(mn)} calculation steps ( m {displaystyle m} and n {displaystyle n} are the lengths of the two sequences being aligned). It uses the iterative calculation of a matrix for the purpose of showing global alignment. In the following decade, Sankoff, Reichert, Beyer and others formulated alternative heuristic algorithms for analyzing gene sequences. Sellers introduced a system for measuring sequence distances. In 1976, Waterman et al. added the concept of gaps into the original measurement system. In 1981, Smith and Waterman published their Smith–Waterman algorithm for calculating local alignment. The Smith–Waterman algorithm is fairly demanding of time: To align two sequences of lengths m {displaystyle m} and n {displaystyle n} , O ( m n ) {displaystyle O(mn)} time is required. Gotoh and Altschul optimized the algorithm to O ( m n ) {displaystyle O(mn)} steps. The space complexity was optimized by Myers and Miller from O ( m n ) {displaystyle O(mn)} to O ( n ) {displaystyle O(n)} (linear), where n {displaystyle n} is the length of the shorter sequence, for the case where one only of the many possible optimal alignments is desired. In recent years, genome projects conducted on a variety of organisms generated massive amounts of sequence data for genes and proteins, which requires computational analysis. Sequence alignment shows the relations between genes or between proteins, leading to a better understanding of their homology and functionality. Sequence alignment can also reveal conserved domains and motifs. One motivation for local alignment is the difficulty of obtaining correct alignments in regions of low similarity between distantly related biological sequences, because mutations have added too much 'noise' over evolutionary time to allow for a meaningful comparison of those regions. Local alignment avoids such regions altogether and focuses on those with a positive score, i.e. those with an evolutionarily conserved signal of similarity. A prerequisite for local alignment is a negative expectation score. The expectation score is defined as the average score that the scoring system (substitution matrix and gap penalties) would yield for a random sequence. Another motivation for using local alignments is that there is a reliable statistical model (developed by Karlin and Altschul) for optimal local alignments. The alignment of unrelated sequences tends to produce optimal local alignment scores which follow an extreme value distribution. This property allows programs to produce an expectation value for the optimal local alignment of two sequences, which is a measure of how often two unrelated sequences would produce an optimal local alignment whose score is greater than or equal to the observed score. Very low expectation values indicate that the two sequences in question might be homologous, meaning they might share a common ancestor. Let A = a 1 a 2 . . . a n {displaystyle A=a_{1}a_{2}...a_{n}} and B = b 1 b 2 . . . b m {displaystyle B=b_{1}b_{2}...b_{m}} be the sequences to be aligned, where n {displaystyle n} and m {displaystyle m} are the lengths of A {displaystyle A} and B {displaystyle B} respectively.

[ "Sequence alignment" ]
Parent Topic
Child Topic
    No Parent Topic