Augmented Lineage: Traceability of Data Analysis Including Complex UDFs.

2021 
Data lineage allows information to be traced to its origin in data analysis by showing how the results were derived. Although many methods have been proposed to identify the source data from which analysis results are derived, analysis is becoming increasingly complex with regard to both its targets (e.g., images, videos, and text) and its technology (e.g., AI and machine learning). In such complex data analysis, simply showing the source data may not ensure traceability. Analysts often need to know which parts of an image are relevant to the output and why a classifier made a particular decision. Recent studies have intensively investigated interpretability and explainability in the machine learning (ML) domain. Integrating these techniques into the lineage framework will greatly enhance the traceability of complex data analysis, including the basis for decisions. In this paper, we propose the concept of augmented lineage, an extension of conventional lineage, together with an efficient method to derive it for complex data analysis. We express complex data analysis flows using relational operators combined with user-defined functions (UDFs), which can represent invocations of AI/ML models within the analysis. We then present an algorithm to derive the augmented lineage for arbitrarily chosen tuples among the analysis results. We also experimentally demonstrate the efficiency of the proposed method.
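To make the idea concrete, the following is a minimal illustrative sketch, not the algorithm proposed in the paper. It assumes a simple eager scheme in which every tuple carries the ids of its source tuples together with the explanations returned by any UDF applied to it. All names here (`Traced`, `scan`, `apply_udf`, `join`, `select`, `toy_classifier`) are hypothetical, and the keyword matcher merely stands in for an AI/ML model that can report the basis of its decision.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass
class Traced:
    """A tuple plus its source-tuple ids and the UDF explanations attached to it."""
    values: Row
    sources: Tuple[str, ...]
    reasons: Tuple[str, ...] = ()


def scan(table: str, rows: List[Row]) -> List[Traced]:
    # Each base tuple is its own lineage, identified as "<table>:<position>".
    return [Traced(dict(r), (f"{table}:{i}",)) for i, r in enumerate(rows)]


def apply_udf(rel: List[Traced], udf: Callable[[Row], Tuple[Any, str]],
              out_col: str) -> List[Traced]:
    # The UDF returns (value, reason); the reason is what "augments" the lineage.
    out = []
    for t in rel:
        value, reason = udf(t.values)
        out.append(Traced({**t.values, out_col: value}, t.sources,
                          t.reasons + (reason,)))
    return out


def join(left: List[Traced], right: List[Traced], key: str) -> List[Traced]:
    # An equi-join result depends on both inputs, so lineage is concatenated.
    return [Traced({**l.values, **r.values}, l.sources + r.sources,
                   l.reasons + r.reasons)
            for l in left for r in right if l.values[key] == r.values[key]]


def select(rel: List[Traced], pred: Callable[[Row], bool]) -> List[Traced]:
    # Selection only filters tuples; lineage passes through unchanged.
    return [t for t in rel if pred(t.values)]


def toy_classifier(row: Row) -> Tuple[str, str]:
    # Hypothetical stand-in for an ML model with an explanation facility.
    text = row["text"].lower()
    for word in ("refund", "broken"):
        if word in text:
            return "complaint", f"matched keyword '{word}'"
    return "other", "no complaint keyword matched"


reviews = [{"id": 1, "text": "Item arrived broken"},
           {"id": 2, "text": "Great service"}]
orders = [{"id": 1, "customer": "A"}, {"id": 2, "customer": "B"}]

labeled = apply_udf(scan("reviews", reviews), toy_classifier, "label")
result = select(join(labeled, scan("orders", orders), "id"),
                lambda v: v["label"] == "complaint")

for t in result:
    print(t.values["customer"], t.sources, t.reasons)
```

For any result tuple, `sources` gives the conventional lineage (which input tuples contributed) and `reasons` gives the added explanation of why the UDF produced its output. This sketch computes both for every tuple up front; deriving them only for tuples the analyst selects afterwards would call for a different, lazier design.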