Profiling and leveraging relatedness in a precision medicine cohort of 92,455 exomes

Jeffrey Staples, Evan K. Maxwell, Nehal Gosalia, Claudia Gonzaga-Jauregui, Christopher Snyder, Alicia Hawes, John Penn, Ricardo Ulloa, Xiaodong Bai, Alexander E. Lopez, Cristopher V. Van Hout, Colm O'Dushlaine, Tanya M. Teslovich, Shane McCarthy, Suganthi Balasubramanian, H. Lester Kirchner, Joseph B. Leader, Michael F. Murray, David H. Ledbetter, Alan R. Shuldiner, George D. Yancopoulos, Frederick E. Dewey, David J. Carey, John D. Overton, Aris Baras, Lukas Habegger, Jeffrey G. Reid

2017

Large-scale human genetics studies are ascertaining increasing proportions of populations as they continue growing in both number and scale. As a result, the amount of cryptic relatedness within these study cohorts is growing rapidly and has significant implications for downstream analyses. We demonstrate this growth empirically among the first 92,455 exomes from the DiscovEHR cohort and, using SimProgeny, a custom simulation framework we developed, show that these measures are in line with expectations given the underlying population and ascertainment approach. For example, we identified ~66,000 close (first- and second-degree) relationships within DiscovEHR involving 55.6% of study participants. Our simulation results project that >70% of the cohort will be involved in these close relationships as DiscovEHR scales to 250,000 recruited individuals. We reconstructed 12,574 pedigrees using these relationships (including 2,192 nuclear families) and leveraged them for multiple applications. The pedigrees substantially improved the phasing accuracy of 20,947 rare, deleterious compound heterozygous mutations. Reconstructed nuclear families were critical for identifying 3,415 de novo mutations in ~1,783 genes. Finally, we demonstrate the segregation of known and suspected disease-causing mutations through reconstructed pedigrees, including a tandem duplication in LDLR causing familial hypercholesterolemia. In summary, this work highlights the prevalence of cryptic relatedness expected among large healthcare population genomic studies and demonstrates several analyses that are uniquely enabled by large amounts of cryptic relatedness.
