Parsimonious machine learning models to predict resource use in cardiac surgery across a statewide collaborative

Objective: We sought to develop several parsimonious machine learning models to predict resource utilization and clinical outcomes following cardiac operations using only preoperative factors.
Methods: All patients undergoing coronary artery bypass grafting and/or valve operations were identified in the 2015-2021 University of California Cardiac Surgery Consortium repository. The primary end point of the study was length of stay (LOS). Secondary end points included 30-day mortality, acute kidney injury, reoperation, postoperative blood transfusion, and duration of intensive care unit admission (ICU LOS). Linear regression, gradient boosted machine, random forest, and extreme gradient boosting predictive models were developed. The coefficient of determination and area under the receiver operating characteristic curve (AUC) were used to compare models. Important predictors of increased resource use were identified using SHapley summary plots.
Results: Compared with all other modeling strategies, gradient boosted machines demonstrated the greatest performance in the prediction of LOS (coefficient of determination, 0.42), ICU LOS (coefficient of determination, 0.23), and 30-day mortality (AUC, 0.69). Advancing age, reduced hematocrit, and multiple-valve procedures were associated with increased LOS and ICU LOS. Furthermore, the gradient boosted machine model best predicted acute kidney injury (AUC, 0.76), whereas random forest exhibited the greatest discrimination in the prediction of postoperative transfusion (AUC, 0.73). We observed no difference in performance between modeling strategies for reoperation (AUC, 0.80).
Conclusions: Our findings affirm the utility of machine learning in the estimation of resource use and clinical outcomes following cardiac operations. We identified several risk factors associated with increased resource use, which may be used to guide case scheduling in times of limited hospital capacity.

(JTCVS Open 2022;11:214-28)
Central figure: Observed length of stay versus predictions by machine learning model.

CENTRAL MESSAGE
Compared with traditional linear regression, machine learning yielded superior performance in the prediction of length of stay, mortality, acute kidney injury, and reoperation following cardiac operations.

PERSPECTIVE
This study outlined the development of machine learning (ML) models to predict length of stay (LOS) following cardiac operations. Several clinical, operation-related, and hospital characteristics were found to be associated with increased LOS. Taken together, our findings suggest that ML models may be used to inform case scheduling strategies during times of limited hospital capacity.
Video clip is available online.
The COVID-19 pandemic has placed unprecedented strain on health care systems, influencing the allocation of personnel and resources. Several groups have reported cardiac surgery case volume reductions of 45% to 94%, with significant regional variability. [1][2][3][4] Subject to rates of "reopening" and patients' desire to proceed with elective surgery, the projected time to equilibrium between backlogged cases and ongoing surgical need is estimated to be 12 to 22 months. [5][6][7] Furthermore, recovery from cessation of elective cases requires a nuanced approach to the management of deferred and newly presenting patients as well as ongoing demands for perioperative resources. With estimates that operating volume must exceed 120% of baseline to accommodate deferred patients while concurrently preventing excess waitlist morbidity, rapid and accurate prediction of hospital bed occupancy and resource utilization is especially crucial. 6
The Society of Thoracic Surgeons (STS), among others, has successfully implemented risk models to provide canonical estimates for parameters such as mortality, postoperative complications, and prolonged length of stay (LOS). 8 However, as demonstrated by several reports of poor calibration when applied at the institutional level, these predictive tools are often cumbersome and require numerous data fields to yield a predicted risk without accounting for local variations in clinical practice. [9][10][11] Furthermore, most available models predict prolonged LOS in a binary manner, rather than estimating the actual duration of hospitalization in days. 8,[12][13][14] The classification of LOS into prolonged and routine reduces generalizability and limits the application of such tools in acute care settings. 15 Machine learning (ML) algorithms allow for complex modeling of nonlinear relationships between predictive factors and have demonstrated superior discrimination and calibration in several clinical applications. 
[16][17][18] Therefore, we sought to develop ML-based models to predict LOS, 30-day mortality and select complications using an academic, statewide database. We hypothesized that a parsimonious ML model, containing few explanatory covariates, would yield superior discrimination and calibration compared with traditional linear and logistic regression.

Study Population
All adults (aged 18 years or older) who underwent coronary artery bypass grafting (CABG) and/or valve operations were identified from the 2015 to 2021 University of California Cardiac Surgery Consortium (UCCSC) repository. Founded in 2013, the UCCSC is a collaborative among 5 academic hospitals across California. Data elements, including those submitted to the STS, are prospectively collected in compliance with policies of individual institutions and the University of California Systemwide Review Board (IRB No. 16-000558, approved May 6, 2016, renewed April 15, 2020). The need for written patient consent for the publication of the study data was waived by the institutional review board due to the de-identified nature of the UCCSC.
Patients were stratified by the class of operation performed: isolated CABG, isolated valve, concomitant CABG/valve, and multivalve operations. Those who required left ventricular assist device implantation, extracorporeal membrane oxygenation, or transcatheter procedures were excluded to maintain cohort homogeneity. Moreover, records with missing values for overall and intensive care unit (ICU) LOS as well as 30-day mortality were excluded (Figure E1). Patients with LOS or ICU LOS above the 95th percentile (>30 days for LOS, >259 hours for ICU LOS) were similarly excluded.

Variable and Outcome Definitions
The primary end point was overall LOS. Mortality at 30 days, acute kidney injury (AKI), postoperative blood transfusion, reoperation, and ICU LOS were also considered. Patient comorbidities, operative characteristics, and complications, including AKI, postoperative blood transfusion, and reoperation, were defined in accordance with the STS Adult Cardiac Database dictionary. 19 Annual operative caseload, number of adult cardiac surgeons, and total numbers of low-acuity and cardiothoracic ICU beds were tabulated for each institution. Variables with missing values in >20% of patients were not considered for inclusion. For retained features with missing data, values were imputed using the median and mode for continuous and categorical variables, respectively. The number of records with missing data for each variable is reported in Table E1.

Modeling Techniques
We compared 3 ML algorithms to traditional multivariable linear and logistic regression: gradient boosted machines (GBM), extreme gradient boosting (XGBoost), and random forest (RF). These algorithms autonomously generate a large set of decision trees to capture nuanced patterns in training data. In the case of RF, every decision tree is developed independently, and the final output of the model is the arithmetic mean of the outputs of the individual trees. In contrast, the XGBoost and GBM algorithms train decision trees in a stepwise manner to compensate for the errors of prior trees, and the output is the weighted average of each decision tree's estimate. 20 A brief schematic highlighting the differences between boosting (XGBoost and GBM) and bagging (RF) classifiers is shown in Figure E2. Hyperparameters, which control the learning process of ML models, were selected using the GridSearchCV utility of the sklearn package.
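The contrast between bagging and boosting can be sketched with scikit-learn's reference implementations. This is a minimal illustration on synthetic data, not the consortium's actual pipeline; the hyperparameter grids shown are examples only.

```python
# Minimal sketch of bagging (RF) vs boosting (GBM) with GridSearchCV tuning.
# Synthetic data stands in for the UCCSC cohort; the grids are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

models = {
    # Bagging: trees are grown independently; the prediction is their mean.
    "rf": GridSearchCV(
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
        cv=3,
    ),
    # Boosting: trees are fit sequentially to the residuals of prior trees.
    "gbm": GridSearchCV(
        GradientBoostingRegressor(random_state=0),
        {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
        cv=3,
    ),
}
for name, search in models.items():
    search.fit(X_train, y_train)
    print(name, round(search.score(X_test, y_test), 2))  # test-set R^2
```

GridSearchCV exhaustively refits each candidate hyperparameter combination with internal cross-validation and retains the best-scoring configuration.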

Model Development
Thirty-seven preoperative patient and hospital characteristics were chosen as candidate predictors. Clinical variables were selected from the STS risk score variable list based on clinical relevance and are listed in Table 1. 8 Hospital factors were incorporated to account for variation in practice across participating institutions. Variable selection was performed using recursive feature elimination, an ML technique used to reduce collinearity and eliminate covariates with low variance. In recursive feature elimination, cross-validation is used to exhaustively evaluate variable sets of different sizes and select the best collection of features. Given that transportability and ease of use are important aspects of risk tools, we identified the smallest set of variables that retained maximum predictive performance. This algorithm was independently applied using linear regression and GBM to ascertain any differences between modeling strategies. Selected variables were used for all subsequent model development (Table E3). We also compared the performance of ML against the STS risk scores for 30-day mortality, AKI, and reoperation.
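Recursive feature elimination with cross-validation can be approximated with scikit-learn's RFECV, as in the sketch below. Synthetic features stand in for the 37 UCCSC candidate predictors; the estimator and scoring choices are illustrative, not the study's exact configuration.

```python
# Sketch of recursive feature elimination with cross-validation (RFECV).
# Synthetic data with a few informative features stands in for the candidates.
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X, y = make_regression(
    n_samples=400, n_features=15, n_informative=5, noise=5.0, random_state=0
)

# RFECV repeatedly drops the weakest features and scores each subset by
# cross-validated R^2, returning the smallest set that retains maximum
# predictive performance.
selector = RFECV(LinearRegression(), step=1, cv=5, scoring="r2").fit(X, y)
print("features retained:", selector.n_features_)
print("selected mask:", selector.support_)
```

`selector.support_` is a boolean mask over the candidate columns; plotting the cross-validated score against subset size yields a curve analogous to Figure 1.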
The derivation cohort consisted of operations performed before March 2020, whereas the remainder comprised the validation dataset. To obtain cross-validated performance metrics, models were fit using 50% of the derivation cohort and tested using the remainder. This process was repeated 100 times to acquire model performance metrics, which are reported as means with 95% CIs. To account for potential differences in case-mix due to the COVID-19 pandemic, we assessed the stability of model performance in the pre-COVID-19 (derivation) and post-COVID-19 (validation) eras.
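The repeated 50/50 split-sample scheme described above can be sketched as follows. This is an illustrative Monte Carlo resampling loop on synthetic data; the normal-approximation CI is one common way to summarize the 100 repeats and is an assumption, not necessarily the study's exact method.

```python
# Sketch of the repeated split-sample scheme: 100 random 50/50 splits,
# reporting mean R^2 with a 95% CI (normal approximation).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)

scores = []
for seed in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))  # held-out R^2

mean = np.mean(scores)
ci = 1.96 * np.std(scores, ddof=1) / np.sqrt(len(scores))
print(f"R^2 = {mean:.3f} (95% CI, {mean - ci:.3f}-{mean + ci:.3f})")
```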

Model Evaluation and Interpretation
Linear regression, GBM, RF, and XGBoost models were compared using the coefficient of determination (R²) between observed and predicted values. Binary classifiers were evaluated using the area under the receiver operating characteristic curve (AUC). The accuracy of probabilistic predictions was assessed using the Brier score, for which lower values denote superior calibration. Model R² and Brier scores were analyzed using a paired t test, which allowed for comparison of model performance across cross-validation folds. Similarly, model AUCs were compared using DeLong's test, which specifically accounts for the influence of model evaluation on a common test set. SHapley additive values were calculated to estimate the marginal influence of each covariate on the output of a decision tree model. 17 Baseline characteristics are reported as means with SD or medians with interquartile range (IQR), as appropriate. Means were analyzed using the adjusted Wald test, whereas medians were analyzed with the Mann-Whitney U test. Categorical variables are reported as frequencies and were compared using the Pearson χ² test. Statistical significance was set at α = 0.05. Statistical analysis was conducted using Stata 16.0 (StataCorp) and Python version 3.9. The sklearn, shap, and xgboost packages of Python were used to develop and assess ML models as described above. 21,22
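The three evaluation metrics above are all available in scikit-learn. The sketch below applies them to toy predictions (DeLong's test is not part of scikit-learn and is omitted here); the values are arbitrary examples, not study data.

```python
# Sketch of the three evaluation metrics on toy predictions.
from sklearn.metrics import brier_score_loss, r2_score, roc_auc_score

# Continuous outcome (e.g., LOS in days): coefficient of determination R^2.
y_days = [5, 8, 12, 6, 20]
pred_days = [6, 9, 10, 7, 18]
print("R^2:", round(r2_score(y_days, pred_days), 2))  # -> 0.93

# Binary outcome (e.g., 30-day mortality): discrimination via AUC,
# calibration via the Brier score (lower = better calibrated).
y_event = [0, 0, 1, 0, 1]
pred_prob = [0.1, 0.3, 0.7, 0.2, 0.6]
print("AUC:", roc_auc_score(y_event, pred_prob))  # -> 1.0
print("Brier:", round(brier_score_loss(y_event, pred_prob), 3))  # -> 0.078
```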

Population Characteristics
Across 5 participating centers, 6,316 patients met study criteria. The study cohort was predominantly male (72.5%), with a mean age of 63 years. A significant proportion of patients had preexisting medical conditions such as diabetes, congestive heart failure, and atrial fibrillation (Table 1). The most frequent operation was isolated CABG (50.5%), followed by isolated valve (33.3%) and concomitant CABG/valve operations (10.6%). The majority of operations were performed electively. Over the study period, the highest volume center performed 1,205 operations, whereas the lowest volume center performed 626 operations. The 30-day mortality rate was 0.9%. Overall, 27.7% of patients received postoperative transfusions, and 1.5% developed AKI. Median LOS was 8 days (IQR, 6-13 days) with a median ICU LOS of 74 hours (IQR, 47-116 hours).
Comparison of baseline characteristics and outcomes between the derivation and validation cohorts is shown in Tables 1 and 2. Patients in the validation cohort were marginally older (64 ± 13 vs 63 ± 13 years; P < .001) and had greater rates of congestive heart failure (45.0% vs 33.8%; P < .001) and peripheral vascular disease (11.1% vs 8.3%; P = .003). Valve operations were more frequent in the validation group, compared with derivation. Although rates of 30-day mortality and AKI were similar, the incidence of reoperation (6.9% vs 9.1%; P = .014) and postoperative blood transfusion (23.1% vs 28.8%; P < .001) was lower in the validation cohort. The distribution of LOS and ICU LOS was statistically different between the derivation and validation datasets (Table 2).

Variable Selection
Recursive feature elimination was applied to 37 candidate variables to determine the optimal covariate set in the prediction of overall LOS. Figure 1 demonstrates the cross-validated R² versus the number of covariates included in each model. The GBM model outperformed linear regression, regardless of feature set size. Notably, after the inclusion of 23 features, no appreciable increase in performance was observed for the GBM or linear regression model. Thus, all models were developed using the 23 features that were most strongly associated with LOS (Table E3).

Resource Utilization
Linear regression, GBM, RF, and XGBoost models were developed to predict in-hospital LOS. Compared with linear regression, the GBM model yielded a higher R² (0.42 vs 0.41; P < .001; Table 3). As shown in Figure E3, predictions by the GBM model were more strongly correlated with observed values for LOS, compared with linear regression. Although the difference in cross-validated R² between the 2 strategies was subtle, the GBM model greatly outperformed linear regression in the validation dataset (R², 0.47 vs 0.42) (Table 4). When assessing cumulative model error in the validation cohort, the GBM model resulted in a 197-day reduction in error across all patients relative to linear regression.
The GBM model was interpreted using SHapley summary plots, and the most salient predictors of LOS were ranked by their relative importance (y-axis). Figure 2 depicts how high (red dot) and low (blue dot) feature values corresponded to a change in LOS prediction. Elective admission had the highest feature importance and was associated with significantly decreased LOS. In addition, we found decreased hematocrit and serum albumin levels to increase the estimated LOS. Certain procedures, such as concomitant CABG/valve and multivalve operations, were found to confer longer LOS. Notably, an increased number of floor beds conferred greater estimated LOS (Figure 2). In the prediction of ICU LOS, the GBM model demonstrated significantly increased cross-validated R² compared with linear regression (0.23 vs 0.15; P < .001). However, in the validation dataset, the XGBoost model demonstrated the highest performance (Table 4). Decreased preoperative creatinine level, low ejection fraction, and preexisting congestive heart failure were associated with greater predicted ICU LOS. Notably, increased annual hospital volume and a higher number of low-acuity beds were associated with lower estimated ICU LOS (Figure 3).

Clinical Outcomes
The GBM, RF, and XGBoost models outperformed logistic regression in the prediction of 30-day mortality (AUC, 0.69 vs 0.67; P < .001). Furthermore, the GBM and RF models outperformed logistic regression and XGBoost in the prediction of AKI (Table 3). Whereas postoperative blood transfusion was best predicted by GBM and XGBoost, all modeling strategies displayed similar discrimination in the estimation of reoperation (Table 3). The STS risk scores for 30-day mortality and AKI outperformed the ML models. However, ML displayed greater discrimination than the STS model in the prediction of reoperation (Table 3). These comparisons were consistent when evaluating the Brier score for each model (Tables E4 and E5).

DISCUSSION
Reliable estimation of hospitalization duration remains a challenge for surgeons and administrators alike. In the present study, we developed several parsimonious ML models to provide a readily usable prediction instrument for LOS (Video 1). This work represents one of the largest applications of ML to discretely model LOS using a multicenter, academic dataset. Compared with linear and logistic regression, we found ML algorithms to exhibit higher performance for prognostication of LOS, 30-day mortality, AKI, postoperative transfusion, and ICU LOS. Using autonomous techniques, we identified several key predictors of increased resource use, including existing comorbidities and decreased preoperative hematocrit and serum albumin levels. Finally, we noted a significant influence of hospital characteristics on ICU LOS, suggesting the need to incorporate center-specific characteristics in predictive tools.
Several clinical characteristics, including preoperative anemia, renal dysfunction, and operative complexity, were associated with increased overall and ICU LOS. These findings are expected because laboratory values such as hematocrit level, international normalized ratio, creatinine level, and albumin level are incorporated in virtually every clinical risk score calculator. 8,24 In addition to patient factors, we found certain hospital structural characteristics to influence ICU LOS. For example, increasing institutional cardiac volume and a greater number of low-acuity beds were associated with reduced ICU LOS. Several factors may contribute to this important finding. Higher institutional cardiac surgery volume may represent greater expertise, the presence of standardized care pathways, and more efficient hospital throughput for these cases. Moreover, greater availability of low-acuity beds may lead to less delay in transitioning out of the ICU when clinical milestones are met. 15 Consistent with this notion, several prior studies have demonstrated wide variation in hospital practices that may influence LOS, such as expedited discharge after lung resection and CABG. 25,26 A nationwide study of minimally invasive esophagectomy in the Netherlands demonstrated great heterogeneity in ICU LOS, pointing to differences in the use of early extubation protocols and analgesic modalities as contributing factors. 27 Investigation at a broader scale is necessary to confirm the generalizability of our findings and to identify modifiable practice patterns that increase LOS.
In the present work, ML models exhibited superior accuracy in the prognostication of overall and ICU length of stay, compared with linear regression. A single-center study similarly compared linear regression and artificial neural networks, finding the latter to have enhanced LOS prediction for patients undergoing isolated CABG. 28 Furthermore, LaFaro and colleagues 29 used a sample of 185 patients undergoing cardiac surgery to show that artificial neural networks yield more accurate estimates of ICU LOS compared with linear regression. The improved performance of ML models is likely attributable to their ability to capture nonlinear interactions between covariates and outcomes of interest. Although the decision-tree structure evaluates such interactions autonomously, linear regression models can only accommodate explicitly included interaction terms, making the development of an equivalent model cumbersome and more prone to bias. Our findings are in congruence with the growing body of literature, which demonstrates increased performance of ML models in the clinical setting. [16][17][18] Thus, ML algorithms should be considered as a viable and potentially superior alternative modeling approach in surgical care applications.
Although ML methods outperformed linear strategies for the prediction of reoperation, the STS models outperformed ML for 30-day mortality and AKI. This observation is most attributable to the large sample used to derive the STS risk scores as well as the incorporation of more than 100 data fields. 14 Nonetheless, the STS models are limited to operations involving either CABG or single-valve replacement, and do not provide risk estimates for aortic surgery or multivalve procedures. Such operations present a more heterogeneous risk profile and may reduce the performance of predictive models. We opted to include such operations in our modeling attempts to develop a tool that accurately reflects the case-mix at our 5 academic institutions. Indeed, procedures not accounted for by the STS comprised approximately 5% of our study cohort. Regardless, ML approaches are gradually being incorporated into the STS models to provide more bespoke estimates, an effort that will certainly improve risk prediction across cases performed in the United States.
The predictive models developed in the present work have considerable utility in the clinical and administrative settings. Their mode of application is tunable to an institution's needs, and the insights that they provide have the potential to enhance clinical outcomes. A landmark randomized controlled trial by Shimabukuro and colleagues 30 found the implementation of ML models to reduce ICU mortality and LOS, demonstrating that such tools can tangibly improve clinical outcomes and decrease resource utilization. Our group has chosen to make the ML models with the greatest R² and AUC available for public use. This online tool may be used by clinicians when evaluating patient risk or by administrators who wish to apply our predictive model at the programmatic level. However, a model that continuously incorporates postoperative events into the estimated LOS would be most pertinent to patient care in the perioperative setting.
Further efforts to develop such tools are warranted.
Given the premium placed on low-acuity and ICU beds during the COVID-19 pandemic, hospitals transiently reduced surgical volume. Prachand and colleagues 31 proposed a widely used framework for triaging medically necessary, time-sensitive procedures, which highlights several key factors, such as operating-room time, estimated LOS, and anticipated blood loss, when determining resource allocation. In the event of a significant reduction in operating capacity, algorithms that balance the risk associated with delay in operative management against estimated resource use may be necessary. Our proposed ML-based models may better inform decisions about scheduling and optimizing case-mix to ensure sufficient hospital throughput. Given the wide availability of ML and the use of few explanatory variables, prospective studies may readily determine the pragmatic influence of such models in optimizing hospital efficiency.
The present study has several limitations. As a multicenter study confined to a group of academic centers, our findings are not generalizable to the cardiac surgical population at large. In addition, although the consortium makes a concerted effort to homogenize practice patterns across participating institutions, certain clinical factors may vary by center and surgeon, such as the threshold for blood transfusion. Transfer status was similarly not captured in the UCCSC and could not be accounted for in our predictive models. Furthermore, despite the relatively large size of the dataset, prospective application of the ML models is required to externally validate their utility. Nonetheless, we used robust statistical methods and a large, multi-institutional dataset.

CONCLUSIONS
We have demonstrated the superior performance of ML models in providing accurate predictions of LOS using a multi-institutional cardiac surgery database. Derived from few variables, such models can estimate resource use and better inform projected hospital census. Leveraging the information derived from machine learning models may be especially useful in reducing the influence of pandemic-related disruptions on cardiac surgical programs (Figure 4).

Conflict of Interest Statement
Dr Shemin serves as a consultant to the Edwards Lifesciences Advisory Board and as a co-principal investigator on the PARTNER II trial. All other authors reported no conflicts of interest.
The Journal policy requires editors and reviewers to disclose conflicts of interest and to decline handling or reviewing manuscripts for which they have a conflict of interest. The editors and reviewers of this article have no conflicts of interest.