C3-1: Creation of Deidentified Test Datasets from the VDW for Code Development Prior to IRB Approval

2013 
Background/Aims: The turn-around time of data-only studies could be shortened by about a month if programmers were to start developing extraction and analytic code before official IRB approval has been granted. However, while code development is likely more successful and efficient if the programs can be checked against data, fully identified datasets like the Virtual Data Warehouse (VDW) cannot be accessed without the proper authorizations. A solution to this dilemma is the creation of deidentified test datasets. These test datasets are based on the VDW and preserve its structure and much of the richness of the data. Yet, because they are sufficiently deidentified, their use for code development falls outside the purview of the Privacy Rule's research provisions and does not qualify as Human Subjects research, so IRB approval is not required.

Methods: We created VDW-based test datasets for 25,000 randomly selected individuals. Only patients up to the age of 90 and within the middle 90% of the age-specific height and weight distribution were eligible. All identifiers, such as MRN, encounter ID, provider ID, and facility code, were replaced with randomly assigned study identifiers, and crosswalks from VDW to study identifier were destroyed immediately after test dataset creation. The same study identifier was used for a given VDW identifier across all tables. Every date was shifted by a specific number of days that varied randomly across individuals but was consistent for all dates associated with one person. Additional deidentification measures, such as random sorts and grouping into larger categories, were also applied.

Conclusions: A combination of methods serves to sufficiently deidentify a group of test datasets that can be used for code development while awaiting IRB approval for the project. The use of consistent study identifiers across datasets and date shifting by a person-specific constant preserves important relationships across content areas and time.
Use of these deidentified test datasets allows for programming to begin as soon as the study population and protocol are sufficiently defined. This, in turn, enhances research efficiencies, as data analysis can begin sooner, data-driven decisions can happen earlier, and studies can be completed more quickly.
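
The two relationship-preserving steps described in the Methods (one study identifier per original identifier across all tables, and one date offset per person) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `deidentify`, the column names, the in-memory table layout (lists of dictionaries), the 180-day offset range, and the assumption that every row carries the person identifier are all hypothetical choices made for illustration. Eligibility filtering and grouping into larger categories are not shown.

```python
import random
from datetime import date, timedelta


def deidentify(tables, id_columns, date_columns, person_column="mrn",
               max_shift_days=180, seed=None):
    """Return deidentified copies of `tables` (dict: name -> list of row dicts).

    Hypothetical sketch: assumes identifier values are strings, every row
    contains `person_column`, and `person_column` is listed in `id_columns`.
    """
    rng = random.Random(seed)

    # One crosswalk per identifier type, built over ALL tables, so a given
    # original value maps to the same study identifier everywhere.
    crosswalks = {}
    for col in id_columns:
        values = sorted({row[col] for rows in tables.values()
                         for row in rows if col in row})
        study_ids = list(range(1, len(values) + 1))
        rng.shuffle(study_ids)  # random assignment of study identifiers
        crosswalks[col] = dict(zip(values, study_ids))

    # One nonzero day offset per person, reused for every date of that person.
    offsets = [d for d in range(-max_shift_days, max_shift_days + 1) if d != 0]
    shifts = {person: timedelta(days=rng.choice(offsets))
              for person in sorted(crosswalks[person_column])}

    result = {}
    for name, rows in tables.items():
        new_rows = []
        for row in rows:
            shift = shifts[row[person_column]]  # looked up by ORIGINAL id
            new_row = {}
            for col, value in row.items():
                if col in crosswalks:
                    new_row[col] = crosswalks[col][value]
                elif col in date_columns:
                    new_row[col] = value + shift
                else:
                    new_row[col] = value
            new_rows.append(new_row)
        rng.shuffle(new_rows)  # random sort so row order carries no information
        result[name] = new_rows

    # Discard the crosswalks and offsets once the test datasets exist.
    crosswalks.clear()
    shifts.clear()
    return result
```

Because each person's dates all move by the same offset, intervals such as age at encounter or days between visits are unchanged in the output, while the calendar dates themselves no longer match the source data.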