ContaTester: Fast cross-contamination estimation and identification for large human sequencing cohorts

2021 
Background: Interest in genomic medicine for human health studies and clinical applications is rapidly increasing. Clinical applications require contamination-free samples to avoid misleading results and to provide a sound basis for diagnosis. Results: Here we present ContaTester, a tool that requires only allele balance information gathered from a VCF file to detect cross-contamination in germline human DNA samples. Based on a regression model of the allele balance distribution, ContaTester allows fast checking of contamination levels for single samples or large cohorts (less than two minutes per sample). We demonstrate the efficiency of ContaTester using experimental validations: ContaTester gives results similar to those of methods requiring alignment data, with a significantly smaller storage footprint and less computation time. Additionally, for contamination levels above 5%, ContaTester can identify contaminants across a cohort, providing important clues for troubleshooting and quality assessment. Conclusions: ContaTester estimates contamination levels from VCF files generated by whole-genome sequencing of normal samples and provides reliable contaminant identification for cohorts or experimental batches.
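To make "allele balance information gathered from a VCF file" concrete, the following is a minimal sketch, not ContaTester's actual implementation. It assumes each record carries a standard `AD` (allelic depth) FORMAT field, considers biallelic sites only, and defines allele balance as ALT depth divided by REF plus ALT depth. The function names `allele_balances` and `ab_histogram` are hypothetical, and the regression model itself is not reproduced here.

```python
def allele_balances(vcf_lines, sample_index=0):
    """Yield the ALT allele balance for each biallelic record that has an AD field.

    Assumption: allele balance = ALT depth / (REF depth + ALT depth).
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 + sample_index:
            continue  # no genotype column for the requested sample
        if "," in fields[4]:
            continue  # skip multiallelic sites in this sketch
        keys = fields[8].split(":")
        if "AD" not in keys:
            continue
        values = fields[9 + sample_index].split(":")
        ad_idx = keys.index("AD")
        if ad_idx >= len(values):
            continue  # trailing FORMAT fields may be dropped
        try:
            ref_depth, alt_depth = (int(x) for x in values[ad_idx].split(","))
        except ValueError:
            continue  # missing ('.') or malformed depths
        total = ref_depth + alt_depth
        if total == 0:
            continue
        yield alt_depth / total


def ab_histogram(balances, bins=20):
    """Count allele balances in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for ab in balances:
        idx = min(int(ab * bins), bins - 1)  # ab == 1.0 goes in the last bin
        counts[idx] += 1
    return counts
```

In an uncontaminated germline sample, allele balances are expected to cluster near 0.5 (heterozygous) and 1.0 (homozygous alternate); a binned distribution like the one above is the kind of summary a model of the allele balance distribution could take as input.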