BagGMM: Calling copy number variation by bagging multiple Gaussian mixture models from tumor and matched normal next-generation sequencing data

2019 
Abstract Copy number variations (CNVs) contribute significantly to human genomic variability, and some lead to disease. However, effective detection of CNVs from whole-genome next-generation sequencing (NGS) data remains challenging. Here, we present BagGMM, a new method that calls CNVs from NGS data of matched tumor-normal samples. BagGMM extracts the read depth ratios of tumor samples to normal samples, divides the genomic sequence into segments with sliding windows and computes the average coverage ratio of each segment, filters candidate deletions and duplications using a coarse criterion on the coverage ratio, and then fits Gaussian mixture models (GMMs) to the remaining ratios to resolve the copy number states that are still ambiguous after filtration. Bagging multiple GMMs, rather than relying on a single GMM, reduces false positive calls and thus enhances the detection power of BagGMM. To balance the computation speed of the GMMs against false positive calls, we employ a "large window, then small windows" segmentation procedure, which also helps determine the boundaries of CNV regions. We apply BagGMM to three simulated datasets and to two groups of human whole-genome sequencing (WGS) data, from breast cancer patients and ovarian cancer patients, to identify CNVs. All experiments demonstrate that BagGMM robustly identifies CNVs of different sizes and states. The performance of the tool is compared with that of four existing CNV detection methods. BagGMM shows a significant improvement in both sensitivity and specificity for detecting both copy number gains and losses.
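The pipeline outlined in the abstract (windowed tumor/normal depth ratios, a coarse filter, then bagged GMMs on the ambiguous windows) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. Everything concrete here is an assumption made for illustration: the function names, non-overlapping windows, the 0.6/1.4 filter thresholds, three mixture components with initial means 0.7/1.0/1.3, 15 bootstrap models, and majority voting.

```python
import math
import random
from collections import Counter

STATES = ("loss", "neutral", "gain")  # assumed three-state labelling


def window_ratios(tumor_depth, normal_depth, window):
    """Tumor/normal depth ratio per non-overlapping window (ratio of summed depths)."""
    ratios = []
    for start in range(0, len(tumor_depth) - window + 1, window):
        t = sum(tumor_depth[start:start + window])
        n = sum(normal_depth[start:start + window])
        ratios.append(t / n if n > 0 else float("nan"))
    return ratios


def coarse_filter(ratio, low=0.6, high=1.4):
    """Call obvious events up front; the thresholds are hypothetical."""
    if ratio < low:
        return "loss"
    if ratio > high:
        return "gain"
    return "ambiguous"


def log_score(x, weight, mean, var):
    """Log of weight * Normal(x | mean, var)."""
    return (math.log(weight) - 0.5 * math.log(2 * math.pi * var)
            - (x - mean) ** 2 / (2 * var))


def fit_gmm_1d(xs, init_means, n_iter=50, var_floor=1e-4):
    """Plain EM for a one-dimensional Gaussian mixture."""
    k = len(init_means)
    means, variances, weights = list(init_means), [0.01] * k, [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for x in xs:
            logs = [log_score(x, weights[j], means[j], variances[j]) for j in range(k)]
            top = max(logs)
            dens = [math.exp(value - top) for value in logs]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue  # component received no data in this resample
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, var_floor)
    return weights, means, variances


def bagged_gmm_calls(ratios, n_models=15, seed=0, init_means=(0.7, 1.0, 1.3)):
    """Majority vote over GMMs fitted to bootstrap resamples of the ratios."""
    rng = random.Random(seed)
    votes = [Counter() for _ in ratios]
    for _ in range(n_models):
        sample = [rng.choice(ratios) for _ in ratios]  # bootstrap resample
        weights, means, variances = fit_gmm_1d(sample, init_means)
        # Rank components by fitted mean: lowest -> loss, highest -> gain.
        order = sorted(range(len(means)), key=lambda j: means[j])
        state_of = {comp: STATES[rank] for rank, comp in enumerate(order)}
        for i, x in enumerate(ratios):
            best = max(range(len(means)),
                       key=lambda j: log_score(x, weights[j], means[j], variances[j]))
            votes[i][state_of[best]] += 1
    return [v.most_common(1)[0][0] for v in votes]
```

In this sketch, windows whose ratio falls outside the coarse thresholds are called directly, and only the `"ambiguous"` ones are passed to `bagged_gmm_calls`. Voting across resampled models means that a single poorly fitted mixture (for example, one whose bootstrap sample happens to miss a cluster) is outvoted by the others, which is the intuition behind using bagging to lower false positives.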