ProvDB: Lifecycle Management of Collaborative Analysis Workflows

2017 
As data-driven methods become pervasive in a wide variety of disciplines, there is an urgent need to develop scalable and sustainable tools that simplify the process of data science, make it easier for users to keep track of the analyses being performed and the datasets being generated, and enable users to understand and analyze the workflows. In this paper, we describe our vision of a unified provenance and metadata management system to support lifecycle management of complex collaborative data science workflows. We argue that information about the analysis processes and data artifacts can, and should, be captured in a semi-passive manner, and we show that querying and analyzing this information can not only simplify bookkeeping and debugging tasks but also enable a rich new set of capabilities, such as identifying flaws in the data science process itself. Automated analysis and monitoring of this information can also significantly reduce the time users spend fixing post-deployment problems. We have implemented a prototype system, PROVDB, on top of git and Neo4j, and we describe its key features and capabilities.