Building Wrangler: A transformational data intensive resource for the open science community

2014 
With the growth of data in science and engineering fields and the I/O intense technologies used to carry out research with these massive datasets, it has become clear new solutions to support data research is required. In support of this, the Texas Advanced Computing Center presents Wrangler, the first open science research platform built from the ground up in support of data. Wrangler features a replicated 10 PB Lustre based parallel file system, compute capacity of 120 Intel Haswell nodes and 15 TB of RAM. In addition to the base system, Wrangler features a unique NAND flash-based storage system from DSSD, providing users with 0.5 PB of storage 1 TB/s bandwidth and 250 million IOP/s across the cluster. Supporting Hadoop, but not just Hadoop, Wrangler will provide current and future researchers with an environment supporting the most I/O intensive workflows in fields from astronomy to paleontology. With data at the forefront of Wrangler's mission, support for ETL workflows, data curation, and data publication will enable users as they both discover new results and publish their own research. Support for both SQL and noSQL databases and GIS based extensions will also be provided, allowing users to leverage these tools for both data cataloging and cross-study integration. Wrangler will allow users to focus more on what is most important to them, the data and knowledge gained from its analysis, and less on the details of curation and I/O optimization.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    3
    References
    4
    Citations
    NaN
    KQI
    []