Spark: A navigational paradigm for genomic data exploration

Cydney B. Nielsen,Hamid Younesy,Henriette O'Geen,Xiaoqin Xu,Andrew R. Jackson,Aleksandar Milosavljevic,Ting Wang,Joseph F. Costello,Martin Hirst,Peggy J. Farnham,Steven J.M. Jones


2012

A pressing challenge arising from the productivity of large-scale data-generating consortia, such as the Encyclopedia of DNA Elements (ENCODE) Project (The ENCODE Project Consortium 2012) or the Roadmap Epigenomics Project (Bernstein et al. 2010), is ensuring that these data are accessible to the biological community for analysis. While public repositories provide easy access to primary data, subsequent data processing and analysis can pose a significant computational hurdle to many biologists. In addition, the depth and breadth of these resources are unprecedented, and much of the initial analysis may be exploratory in nature. The biologically interesting signals may be too poorly understood at the outset to be identified and analyzed in an automated fashion. Visualization is a powerful approach in such cases. Not only does it lower the computational barrier for use, but it is also particularly effective in facilitating human reasoning about complex data, which is essential during this early exploration phase. Genome browsers are one class of visualization tool that has enjoyed widespread popularity among biologists and that frequently serves as the primary means of examining genome-wide data during the initial inspection and discovery phases. Part of their power comes from the ability to integrate diverse data sets by plotting them as vertically stacked ‘tracks’ across a common genomic x-axis. Genome browsers have played an important role in increasing the accessibility of large public data sets; for example, the ENCODE data resource is currently hosted by the UCSC Genome Browser (Kent et al. 2002). However, the power of genome-wide data sets is in their ability to reveal global regulatory patterns that would be difficult, if not impossible, to extrapolate from studies of individual loci.
Genome browsers inherently limit the data view to individual loci, and while invaluable for visualizing data patterns at specific regions of interest, they have limited power to facilitate global analysis. For many types of queries, there is a mismatch between the level of data abstraction at which the investigator wishes to interrogate the data set (e.g., gene set) and the level at which the data are displayed in a genome browser (e.g., individual gene). As a result, computational experts typically conduct such global analyses with custom tools. Recently, the Human Epigenome Browser (Zhou et al. 2011) enabled users to filter the genomic x-axis to only annotated genes involved in a pathway of interest, as queried by a KEGG identifier. This is an important step toward replacing the genome coordinate axis with a functional axis and enabling comparisons of data tracks across multiple loci within the genome browser framework, but depending on the size of the gene set, it can still be challenging to obtain an overview of the data patterns from such a view.
There are several good examples of computational methods that generate biologically meaningful genome-wide data summaries. One common approach used to interpret epigenomic data, such as histone modifications and DNA methylation, is to identify and functionally characterize combinatorial data patterns. For example, methylation of both lysine 4 and lysine 27 on histone H3 is an epigenetic signature characteristic of embryonic stem cells, termed a ‘bivalent domain,’ thought to silence developmental genes while keeping them poised for activation (Azuara et al. 2006; Bernstein et al. 2006). Early work in signature detection clustered well-annotated promoters on the basis of specific histone modification patterns derived from chromatin immunoprecipitation (ChIP) coupled with microarray (ChIP-chip) data (Heintzman et al. 2007). Both seqMINER (Ye et al. 2011) and Cistrome (Liu et al. 2011) are analysis tools that include such a clustering approach and provide cluster visualization through static heatmaps. A probabilistic method, ChromaSig, subsequently eliminated the dependence on existing annotations and offered a way to discover chromatin signatures de novo by searching genome-wide using data from ChIP followed by sequencing (ChIP-seq) (Hon et al. 2008). More recently, hidden Markov model (HMM) and Bayesian network approaches have been applied to uncover recurrent chromatin states (Ernst and Kellis 2010; Hoffman et al. 2012). However, none of these approaches supports interactive data exploration. All of the above tools produce static summary images, typically in the form of heatmaps, and there are few or no mechanisms by which to dynamically guide the analysis based on human knowledge of the biological system under study.
Here we present Spark, a visualization approach that employs clustering to create a global data overview and high-level entry point for analysis, while also enabling interactive drill-down to the supporting data at the level of individual loci. It is intended to facilitate responsive exploratory navigation through a genome-wide data set and to be used as a complement to genome browsing. Its novelty over existing tools lies in its support of user-guided clustering, specifically enabling users to split existing clusters into subclusters and thus direct the clustering algorithm toward patterns of interest. Given that the clusters are generated across a set of user-specified input regions, Spark supports the analysis of both well-annotated regions and potential novel elements, such as those identified as having enrichments in a particular ChIP-seq experiment.
The tool is connected to popular external resources, which minimizes the need for programmatic data manipulation: the display links individual loci to the corresponding view in the UCSC Genome Browser, and gene ontology (GO) analysis is available at the cluster level by interfacing with the DAVID suite of tools (Huang et al. 2009). Spark employs a very general clustering technique with few parameters and can therefore flexibly handle diverse data sets. The ENCODE and Human Epigenome Atlas data sets are directly accessible through the Spark user interface, and initial results suggest that Spark will be a valuable exploratory tool for these communities.
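The idea of user-guided clustering described above (cluster a set of input regions, then split one cluster of interest into subclusters) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the text here does not name the clustering algorithm, so plain k-means stands in for it, and the function names `kmeans` and `split_cluster`, the toy signal values, and the four-bin representation of each region are all hypothetical rather than taken from Spark.

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means (a stand-in algorithm); returns one label in 0..k-1 per vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each region to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            for v in vectors
        ]
        # Move each center to the mean of its members; an empty cluster keeps its center.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def split_cluster(vectors, labels, target, k_sub=2, seed=0):
    """Re-cluster only the members of `target` into k_sub subclusters.

    The first subcluster keeps the id `target`; the others receive new ids.
    Regions outside `target` keep their original labels.
    """
    idx = [i for i, lab in enumerate(labels) if lab == target]
    sub = kmeans([vectors[i] for i in idx], k_sub, seed=seed)
    next_id = max(labels) + 1
    new_labels = list(labels)
    for i, s in zip(idx, sub):
        new_labels[i] = target if s == 0 else next_id + s - 1
    return new_labels

# Hypothetical input: one signal track summarized in four bins per region.
regions = [
    [0.0, 0.1, 0.0, 0.1], [0.1, 0.0, 0.1, 0.0],  # low signal throughout
    [0.9, 0.8, 0.1, 0.0], [1.0, 0.9, 0.0, 0.1],  # signal at the left edge
    [0.0, 0.1, 0.9, 1.0], [0.1, 0.0, 0.8, 0.9],  # signal at the right edge
]

labels = kmeans(regions, k=2)                    # coarse overview
target = max(set(labels), key=labels.count)      # the "user" picks the largest cluster
refined = split_cluster(regions, labels, target) # drill down into that cluster only
print(labels, refined)
```

Restricting the second clustering pass to the members of a single cluster is the design point: the coarse overview stays stable while the cluster of interest is refined.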
