The Yin & Yang of Data Management - Four intractable contradictions and their harmonious resolutions

2017 
ABSTRACT

Objectives
In our long-term prospective cohort study of over 500,000 adults, participants are actively followed up through established morbidity and mortality registries, and by linkage to the national health insurance system. We also administer regular questionnaires; take physical measurements; and gather genetic, metabolomic, and even meteorological data. Importing, integrating and distributing such a large and diverse dataset presents practical challenges, but also more fundamental ones. When individually reasonable data management requirements come into conflict, can a solution be found that satisfies both without compromising either? In this session I present, with examples, four such conflicts and the ways in which we harmonise and resolve them.

Approach & Results

Gather perfect data / Handle imperfect data
A data manager does not want to allow erroneous data into their database, but nor do they want to discard data that are imperfect but still meaningful. The approach that we take is to implement automatic validation at the point of entry, tailored to the data source in question.

Fix data issues / Don't make assumptions
Ideally every data issue that is detected should be fixed or flagged, but correction is not always possible and it is rarely clear where to draw the line between error and outlier. We address this via comprehensive data documentation, empowering each analyst to identify, assess and handle the values that might be problematic for them.

Be flexible / Be consistent
Every researcher has different data requirements, definitions and exclusions. Data management must support this without needless duplication of effort, and without leaving everyone working on incompatible datasets. Our solution is centralised distribution: a single core database, centrally maintained and updated, from which all analysts' datasets are derived, and into which the work of individuals is incorporated for the benefit of all.

Keep it simple / Include everything
Most analyses examine some areas of the data in great detail but require only basic summaries of others. We support this using well-chosen data aggregation, offering multiple levels of detail so that each analyst can decide how far they wish to ‘zoom in’ on each element.

Conclusion
Some apparently contradictory requirements of data management can be resolved with the above techniques, creating a resource suitable for a wide range of applications without compromise.
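
As an illustration of the "Keep it simple / Include everything" idea, the following is a minimal sketch of aggregation at several levels of detail. It is not the study's actual implementation: the participant identifiers, the code strings, the three level names and the `aggregate` function are all hypothetical, chosen only to show how one set of detailed records could serve both coarse and fine-grained analyses.

```python
from collections import defaultdict

# Hypothetical event records: (participant_id, hierarchical_code).
# Both the identifiers and the codes are made-up placeholders.
EVENTS = [
    ("P001", "I21.0"),
    ("P001", "I63.9"),
    ("P002", "I21.4"),
    ("P003", "J45.9"),
]


def aggregate(events, level):
    """Count distinct participants per group at the requested level of detail.

    level="chapter"  groups by the first character of the code (coarsest)
    level="category" groups by the part of the code before the dot
    level="full"     groups by the complete code (finest)
    """
    def key(code):
        if level == "chapter":
            return code[0]
        if level == "category":
            return code.split(".")[0]
        if level == "full":
            return code
        raise ValueError(f"unknown level: {level}")

    groups = defaultdict(set)
    for participant, code in events:
        groups[key(code)].add(participant)
    # Sets ensure a participant is counted once per group.
    return {group: len(members) for group, members in sorted(groups.items())}
```

An analyst needing only a basic summary could call `aggregate(EVENTS, "chapter")`, while one studying a particular area in depth could request `"category"` or `"full"` from the same underlying records.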