
Abstract
Objectives:
To derive and externally validate supervised machine learning (ML) models predictive of cardiac surgery-associated acute kidney injury (CS-AKI).
Design:
Retrospective cohort analysis.
Setting:
Four cardiac surgical centers, from January 2019 to February 2022.
Patients:
Patients aged 7 days to 18 years who underwent cardiac surgery.
Interventions:
None.
Measurements and Main Results:
CS-AKI was defined using Kidney Disease: Improving Global Outcomes criteria during the first 7 postoperative days, with stages 2 and 3 classified as severe. Data analysis followed two approaches: 1) combining three centers for derivation and using the fourth for external validation and 2) randomly dividing the entire dataset into derivation and validation cohorts in a 4:1 ratio. Forty ML models were developed across five derivation-validation pairs using four ML algorithms (light gradient-boosting machine, extreme gradient boosting, categorical boosting, and histogram gradient boosting) to predict two outcomes (any and severe CS-AKI) from preoperative, intraoperative, and immediate postoperative variables. SHapley Additive exPlanations was used to analyze input variable importance. A cohort of 1100 patients was analyzed. Any CS-AKI and severe CS-AKI occurred in 49.1% and 23.1% of patients, respectively. Model performance on external validation varied widely across the 40 ML models. For any CS-AKI, the ranges were: area under the receiver operating characteristic curve (AUROC) 0.64–0.83, sensitivity 0.29–0.86, specificity 0.46–0.95, positive predictive value (PPV) 0.50–0.85, and negative predictive value (NPV) 0.60–0.86. For severe CS-AKI, the ranges were: AUROC 0.65–0.77, sensitivity 0.04–0.58, specificity 0.77–0.99, PPV 0.32–0.75, and NPV 0.78–0.90. Preoperative serum creatinine, cardiopulmonary bypass, aortic cross-clamp duration, weight, and age at surgery were the most important predictors associated with CS-AKI.
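As an illustration of how the count of 40 models arises, the following minimal Python sketch enumerates the model grid. It assumes that the five derivation-validation pairs consist of four leave-one-center-out splits plus one random 4:1 split; the abstract does not state this breakdown explicitly. The center labels and the pair and outcome names are hypothetical.

```python
from itertools import product

# Hypothetical center labels; the abstract does not name the four centers.
centers = ["A", "B", "C", "D"]

# Assumption: approach 1 holds out each center once for external validation
# (four pairs), and approach 2 adds one random 4:1 split of the pooled data.
pairs = [f"holdout_{c}" for c in centers] + ["random_4to1"]

# The four algorithms and two outcomes named in the abstract.
algorithms = ["LightGBM", "XGBoost", "CatBoost", "HistGradientBoosting"]
outcomes = ["any_cs_aki", "severe_cs_aki"]

# One model per (pair, algorithm, outcome) combination.
model_grid = list(product(pairs, algorithms, outcomes))
print(len(model_grid))  # 5 pairs x 4 algorithms x 2 outcomes = 40
```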
Conclusions:
This analysis of a retrospective multicenter dataset shows that the external performance of ML models varies, highlighting challenges in generalizability that may be due to center-based differences in practice.