Generalizability Challenges of Mortality Risk Prediction Models: A Retrospective Analysis on a Multi-center Database

2021 
Importance: Modern predictive models require large amounts of data for training and evaluation, which can result in models that are specific to certain locations, the populations in them, and their clinical practices. Yet best practices and guidelines for clinical risk prediction models have not yet considered such challenges to generalizability.

Objectives: To investigate changes in measures of predictive discrimination, calibration, and algorithmic fairness when transferring models for predicting in-hospital mortality across ICUs in different populations, and to study the reasons for the lack of generalizability in these measures.

Design, Setting, and Participants: In this multi-center cross-sectional study, electronic health records from 179 hospitals across the US with 70,126 hospitalizations were analyzed. Data were collected from 2014 to 2015.

Main Outcomes and Measures: The main outcome is in-hospital mortality. The generalization gap, defined as the difference between model performance metrics across hospitals, is computed for discrimination and calibration metrics, namely area under the receiver operating characteristic curve (AUC) and calibration slope. To assess model performance by the race variable, we report differences in false negative rates across groups. Data were also analyzed using a causal discovery algorithm, "Fast Causal Inference" (FCI), that infers paths of causal influence while identifying potential influences associated with unmeasured variables.

Results: In-hospital mortality rates ranged from 3.9% to 9.3% (1st to 3rd quartile) across hospitals. When transferring models across hospitals, AUC at the test hospital ranged from 0.777 to 0.832 (1st to 3rd quartile; median 0.801); calibration slope from 0.725 to 0.983 (1st to 3rd quartile; median 0.853); and disparity in false negative rates from 0.046 to 0.168 (1st to 3rd quartile; median 0.092).
When transferring models across geographies, AUC ranged from 0.795 to 0.813 (1st to 3rd quartile; median 0.804); calibration slope from 0.904 to 1.018 (1st to 3rd quartile; median 0.968); and disparity in false negative rates from 0.018 to 0.074 (1st to 3rd quartile; median 0.040). The distribution of all variable types (demography, vitals, and labs) differed significantly across hospitals and regions. The distribution of the race variable and of some clinical (vitals, labs, and surgery) variables shifted by hospital or region. The race variable also mediated differences in the relationship between clinical variables and mortality by hospital or region.

Conclusions and Relevance: Group-specific metrics should be assessed during generalizability checks to identify potential harms to the groups. To develop methods that improve and guarantee the performance of prediction models in new environments for groups and individuals, better understanding and provenance of health processes, as well as of data generating processes by sub-group, are needed to identify and mitigate sources of variation.

Key Points

Question: Does the sub-group level performance of mortality risk prediction models vary significantly when applied to hospitals or geographies different from the ones in which they are developed? What characteristics of the datasets explain the performance variation?

Findings: In this retrospective cross-sectional study based on a multi-center critical care database, mortality risk prediction models developed in one hospital or geographic region exhibited a lack of generalizability to different hospitals or regions. The distribution of clinical (vitals, labs, and surgery) variables varied significantly across hospitals and regions. According to the causal inference results, dataset shifts in race and clinical variables due to hospital or geography result in mortality prediction differences, and the race variable commonly mediated changes in clinical variable shifts.
Meaning: The findings provide evidence that such models can exhibit disparities in performance across racial groups even while performing well in terms of average population-wide metrics. Therefore, assessing subgroup performance differences should be included in model evaluation guidelines. Based on shifts in variables mediated by the race variable, understanding and provenance of data generating processes by population sub-group are needed to identify and mitigate sources of variation, and can be used to decide whether to use a risk prediction model in new environments.
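
The measures named in the abstract (AUC, calibration slope, disparity in false negative rates, and the generalization gap) can be illustrated with a minimal NumPy sketch. This is not the authors' code. The function names, the 0.5 decision threshold, the max-minus-min definition of disparity, and the source-minus-target sign of the gap are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def auc(y_true, p_pred):
    """AUC as the probability that a positive case is ranked above a negative one."""
    y = np.asarray(y_true)
    p = np.asarray(p_pred, dtype=float)
    diff = p[y == 1][:, None] - p[y == 0][None, :]
    # Ties between a positive and a negative count as half a correct ranking.
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def calibration_slope(y_true, p_pred, n_iter=25):
    """Slope from a logistic regression of the outcome on logit(predicted risk).

    Fitted with plain Newton-Raphson steps; a slope of 1 indicates that
    predicted risks are neither too extreme nor too moderate.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), 1e-6, 1 - 1e-6)
    X = np.column_stack([np.ones_like(p), np.log(p / (1 - p))])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - mu)
        hess = X.T @ (X * (mu * (1 - mu))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return float(beta[1])


def false_negative_rate(y_true, p_pred, threshold=0.5):
    """Share of true positives predicted below the threshold (assumed 0.5)."""
    y = np.asarray(y_true)
    predicted_positive = np.asarray(p_pred, dtype=float) >= threshold
    # Returns nan if the group contains no positive cases.
    return float(np.mean(~predicted_positive[y == 1]))


def fnr_disparity(y_true, p_pred, groups, threshold=0.5):
    """Assumed definition: largest minus smallest group-level false negative rate."""
    y = np.asarray(y_true)
    p = np.asarray(p_pred, dtype=float)
    g = np.asarray(groups)
    rates = [false_negative_rate(y[g == k], p[g == k], threshold) for k in np.unique(g)]
    return max(rates) - min(rates)


def generalization_gap(metric_at_source, metric_at_target):
    """Assumed sign convention: metric at the development site minus the test site."""
    return metric_at_source - metric_at_target
```

In a transfer experiment, each metric would be computed once on held-out data from the development hospital and once on data from the test hospital, and `generalization_gap` applied to the pair.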