Extensive Simulations for Longest Common Subsequences: Finite Size Scaling, a Cavity Solution, and Configuration Space properties

1998 
The Longest Common Subsequence (LCS) Problem asks for the longest sequence of (non-contiguous) matches between two given strings of characters. Using extensive Monte Carlo simulations, we find a finite size scaling law of the form E(L)/N =C + A/(N^1/2 ln N)+... for the mean LCS length of two random strings of size N over S letters. We provide precise estimates of C for S between 2 and 15. We consider also a related Bernoulli Matching model where the different entries of an N times M array are independently occupied with probability 1/S. In that case we find the expression of the limit of L(N,M)/N as N grows to infinity, as a function of r=M/N. This expression provides a very good approximation for the Random String model, which gets more and more accurate as S increases. The question of the ``universality class'' of the LCS problem is also considered. For the Bernoulli Matching model we find very good agreement with recent scaling predictions of Hwa and Lassig for Needleman-Wunsch sequence alignment. We find however that the variance of the LCS length has a different scaling different in the Random String model, suggesting that long-ranged correlations among the matches are relevant in this model. We finally study the ``ground state'' properties of this problem. We find in particular that the number of solutions typically grows exponentially with N, i.e. this system has a residual entropy at T=0. Also the overlap between two LCSs chosen at random is found to be self averaging and to aproach a definite value q(S)<1 as N grows.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    19
    References
    16
    Citations
    NaN
    KQI
    []