A fast and robust strategy to remove variant level artifacts in Alzheimer's Disease Sequencing Project data

2021 
Whole-exome sequencing (WES) and whole-genome sequencing (WGS) are expected to be critical to further elucidate the missing genetic heritability of Alzheimer's disease (AD) risk by identifying rare coding and/or noncoding variants that contribute to AD pathogenesis. In the United States, the Alzheimer's Disease Sequencing Project (ADSP) has taken a leading role in sequencing AD-related samples at scale, with the resultant data being made publicly available to researchers to generate new insights into the genetic etiology of AD. To achieve sufficient power, the ADSP has adopted a study design in which subsets of larger AD cohorts are collected and sequenced across multiple centers, using a variety of sequencing kits. This approach may lead to variable variant quality across sequencing centers and/or kits. Here, we performed exome-wide and genome-wide association analyses on AD risk using the latest ADSP WES and WGS data releases. We observed that many variants displayed large variation in allele frequencies across sequencing centers/kits and contributed to spurious association signals with AD risk. We also observed that adjusting for sequencing kit/center in association models could not fully account for these spurious signals. To address this issue, we designed and implemented novel filters that aim to capture and remove these center/kit-specific artifactual variants. We conclude by deriving a novel, fast, and robust approach to filter variants that represent sequencing center- or kit-related artifacts underlying spurious associations with AD risk in ADSP WES and WGS data. This approach will be important to support future robust genetic association studies on ADSP data, as well as other studies with similar designs.

Author Summary

Next-generation sequencing data represent a highly valuable resource to uncover rare coding and/or noncoding genetic variants that contribute to Alzheimer's disease risk. To achieve the large sample sizes that such analyses require, the Alzheimer's Disease Sequencing Project (ADSP) has taken the leading role in sequencing Alzheimer's disease-related samples at scale in the United States. The ADSP's study design, however, leads to variable variant quality across the involved sequencing centers, necessitating a quality control approach that ensures robust genetic association analyses. Here, we present and validate a rigorous quality control pipeline, for which we specifically developed a new strategy to handle inter-center variant quality issues in the ADSP. In doing so, we provide a first glance into exome- and genome-wide associations with Alzheimer's disease risk using the latest releases of ADSP data (20.5k and 16.9k individuals, respectively). In sum, our pipeline is important to support future robust genetic association studies on ADSP data, as well as other studies with similar designs. This in turn will contribute to accelerating Alzheimer's disease gene discovery and gene-driven therapy development.
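To illustrate the kind of signal described above (a variant whose allele frequency differs markedly across sequencing centers or kits), the following minimal sketch computes a Pearson chi-square heterogeneity statistic over per-center allele counts and flags variants that exceed a cutoff. It is not the filter developed in this work, whose details are not given here. The function names, the input layout, and the cutoff of 30.0 are hypothetical choices for illustration, and it is assumed that counts would be taken within a single diagnostic group (for example, controls only) so that true case/control differences are not mistaken for artifacts.

```python
def heterogeneity_stat(counts):
    """Pearson chi-square statistic for alt/ref allele counts across centers.

    counts: dict mapping center (or kit) name -> (alt_alleles, total_alleles).
    This layout is an assumption made for the sketch.
    """
    # Ignore centers with no called alleles to avoid division by zero.
    usable = [(alt, n) for alt, n in counts.values() if n > 0]
    total_alt = sum(alt for alt, _ in usable)
    total_n = sum(n for _, n in usable)
    total_ref = total_n - total_alt
    # A monomorphic variant carries no heterogeneity information.
    if total_alt == 0 or total_ref == 0:
        return 0.0

    stat = 0.0
    for alt, n in usable:
        ref = n - alt
        # Expected counts if every center shared the pooled allele frequency.
        exp_alt = n * total_alt / total_n
        exp_ref = n * total_ref / total_n
        stat += (alt - exp_alt) ** 2 / exp_alt + (ref - exp_ref) ** 2 / exp_ref
    return stat


def is_center_artifact(counts, cutoff=30.0):
    """Flag a variant whose statistic exceeds a hypothetical cutoff."""
    return heterogeneity_stat(counts) > cutoff


# Allele frequency is 1% in every center: no evidence of heterogeneity.
consistent = {"center_A": (10, 1000), "center_B": (20, 2000), "center_C": (5, 500)}
# One center shows a 10% frequency while the others show 1%.
suspicious = {"center_A": (10, 1000), "center_B": (200, 2000), "center_C": (5, 500)}

print(is_center_artifact(consistent), is_center_artifact(suspicious))
```

In practice the cutoff would be chosen from the chi-square distribution with (number of centers - 1) degrees of freedom and an appropriate multiple-testing threshold, and rare variants with small expected counts would call for an exact test instead.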