Virus detection in high-throughput sequencing data without a reference genome of the host

2018 
Abstract Discovery of novel viruses in host samples is a multidisciplinary process which relies increasingly on next-generation sequencing (NGS) followed by computational analysis. A crucial step in this analysis is to separate host sequence reads from the sequence reads of the virus to be discovered. This becomes especially difficult if no reference genome of the host is available. Furthermore, if the total number of viral reads in a sample is low, de novo assembly of a virus which is a requirement for most existing pipelines is hard to realize. We present a new modular, computational pipeline for discovery of novel viruses in host samples. While existing pipelines rely on the availability of the hosts reference genome for filtering sequence reads, our new pipeline can also cope with cases for which no reference genome is available. As a further novelty of our method a decoy module is used to assess false classification rates in the discovery process. Additionally, viruses with a low read coverage can be identified and visually reviewed. We validate our pipeline on simulated data as well as two experimental samples with known virus content. For the experimental samples, we were able to reproduce the laboratory findings. Our newly developed pipeline is applicable for virus detection in a wide range of host species. The three modules we present can either be incorporated individually in other pipelines or be used as a stand-alone pipeline. We are the first to present a decoy approach within a virus detection pipeline that can be used to assess error rates so that the quality of the final result can be judged. We provide an implementation of our modules via Github. However, the principle of the modules can easily be re-implemented by other researchers.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    31
    References
    5
    Citations
    NaN
    KQI
    []