Mappability and read length

2014 
Power-law distributions are the main functional form for the distribution of repeat size and repeat copy number in the human genome. When the genome is broken into fragments for sequencing, the limited size of fragments and reads may prevent a unique alignment of repeat sequences to the reference sequence. Repeats in the human genome can be as long as $10^4$ bases, or $10^5$ to $10^6$ bases when mismatches between repeat units are allowed. Sequence reads from these regions are therefore unmappable when the read length is on the order of $10^3$ bases. At a read length of exactly 1000 bases, slightly more than 1% of the assembled genome, and slightly less than 1% of the 1 kb reads, are unmappable, excluding the unassembled portion of the human genome (8% in GRCh37). The slow decay (long tail) of the power-law function implies diminishing returns in converting unmappable regions and reads into mappable ones as the read length increases, although increasing the read length always moves mappability toward 100%.
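The link between repeat length and mappability can be illustrated on a toy sequence. The sketch below is not the paper's pipeline: it assumes exact matching on the forward strand only, and it treats a read as unmappable when its sequence occurs more than once in the reference. The function name `unmappable_fraction` and the 40-base `GENOME` string (two copies of a hypothetical 8-base repeat separated by unique sequence) are invented for illustration.

```python
from collections import Counter

# Hypothetical toy reference: an 8-base repeat unit placed twice
# between unique flanking sequences (40 bases in total).
REPEAT = "ACGTTGCA"
GENOME = "GGATCCTA" + REPEAT + "TTAGGCAT" + REPEAT + "CCGATAGT"


def unmappable_fraction(genome: str, read_length: int) -> float:
    """Fraction of read start positions whose read occurs more than once.

    Assumes exact matches on the forward strand only (no mismatches,
    no reverse complement), which is a simplification.
    """
    n_positions = len(genome) - read_length + 1
    if read_length <= 0 or n_positions <= 0:
        raise ValueError("read_length must be between 1 and len(genome)")
    reads = [genome[i:i + read_length] for i in range(n_positions)]
    counts = Counter(reads)
    # A read is unmappable if the same sequence starts at another position.
    repeated = sum(1 for read in reads if counts[read] > 1)
    return repeated / n_positions


if __name__ == "__main__":
    for k in (4, 8, 9, 12):
        print(k, round(unmappable_fraction(GENOME, k), 4))
```

In this toy case, reads of 8 bases or fewer that fall entirely inside a repeat copy cannot be placed uniquely, while any read of 9 bases or more extends into unique flanking sequence and becomes mappable. Real genomes contain repeats over a wide range of lengths, which is why the unmappable fraction shrinks gradually rather than dropping to zero at a single read length.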