language-icon Old Web
English
Sign In

Suffix tree

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations. In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations. The construction of such a tree for the string S {displaystyle S} takes time and space linear in the length of S {displaystyle S} . Once constructed, several operations can be performed quickly, for instance locating a substring in S {displaystyle S} , locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provide one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself. The concept was first introduced by Weiner (1973), which Donald Knuth subsequently characterized as 'Algorithm of the Year 1973'. The construction was greatly simplified by McCreight (1976), and also by Ukkonen (1995). Ukkonen provided the first online-construction of suffix trees, now known as Ukkonen's algorithm, with running time that matched the then fastest algorithms.These algorithms are all linear-time for a constant-size alphabet, and have worst-case running time of O ( n log ⁡ n ) {displaystyle O(nlog n)} in general. Farach (1997) gave the first suffix tree construction algorithm that is optimal for all alphabets. In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range. Farach's algorithm has become the basis for new algorithms for constructing both suffix trees and suffix arrays, for example, in external memory, compressed, succinct, etc. The suffix tree for the string S {displaystyle S} of length n {displaystyle n} is defined as a tree such that: Since such a tree does not exist for all strings, S {displaystyle S} is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n {displaystyle n} leaf nodes, one for each of the n {displaystyle n} suffixes of S {displaystyle S} . Since all internal non-root nodes are branching, there can be at most n −  1 such nodes, and n + (n − 1) + 1 = 2n nodes in total (n leaves, n − 1 internal non-root nodes, 1 root). Suffix links are a key feature for older linear-time construction algorithms, although most newer algorithms, which are based on Farach's algorithm, dispense with suffix links. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string χ α {displaystyle chi alpha } , where χ {displaystyle chi } is a single character and α {displaystyle alpha } is a string (possibly empty), it has a suffix link to the internal node representing α {displaystyle alpha } . See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree. A generalized suffix tree is a suffix tree made for a set of words instead of a single word. It represents all suffixes from this set of words. Each word must be terminated by a different termination symbol or word. A suffix tree for a string S {displaystyle S} of length n {displaystyle n} can be built in Θ ( n ) {displaystyle Theta (n)} time, if the letters come from an alphabet of integers in a polynomial range (in particular, this is true for constant-sized alphabets).For larger alphabets, the running time is dominated by first sorting the letters to bring them into a range of size O ( n ) {displaystyle O(n)} ; in general, this takes O ( n log ⁡ n ) {displaystyle O(nlog n)} time.The costs below are given under the assumption that the alphabet is constant.

[ "Data structure", "Suffix", "C++ string handling", "String (computer science)", "FM-index", "Compressed suffix array", "Longest common substring problem", "Heavy path decomposition", "Range minimum query" ]
Parent Topic
Child Topic
    No Parent Topic