• Mon. Dec 4th, 2023

Healthcare Definition

Healthcare Definition, You Can't Live Withou It.

Machine learning generalizability across healthcare settings: insights from multi-site COVID-19 screening


Neural network models were trained to predict the COVID-19 status of patients presenting to hospital emergency departments across four independent UK National Health Service (NHS) Trusts—Oxford University Hospitals NHS Foundation Trust (OUH), University Hospitals Birmingham NHS Trust (UHB), Bedfordshire Hospitals NHS Foundations Trust (BH), and Portsmouth Hospitals University NHS Trust (PUH). The inclusion and exclusion criterion used can be found in Supplementary Note 1. United Kingdom National Health Service (NHS) approval via the national oversight/regulatory body, the Health Research Authority (HRA), has been granted for this work (IRAS ID: 281832).

For each site, data was split into training, validation, and test sets. From OUH, we had two data extracts. The first data extract consisted of 114,957 COVID-free patient presentations prior to the global COVID-19 outbreak, and 701 COVID-positive (by PCR testing) patient presentations during the first wave of the COVID-19 epidemic in the UK (pre-pandemic and UK “wave one” cases, to June 30, 2020). We split this cohort into a training and validation set using an 80:20 ratio, respectively. This ensures that the label of COVID-19 status is correct during training. Our second data extract consisted of 20,845 COVID-negative patients and 2,012 COVID-positive patients (PCR-confirmed) and was used as the test set (presentations from UK “wave two”, from October 1, 2020 to March 6, 2021). For PUH, UHB, and BH, we had only one extract each, consisting exclusively of presentations from during the COVID-19 outbreak, and thus used a 60:20:20 split for the training, validation, and test sets, respectively. A summary of population statistics can be found in Table 1, and a summary of training, validation, and test cohorts can be found in Table 2. Additional statistics describing population distributions can be found in Supplementary Note 2.

Table 1 Summary population characteristics for OUH, PUH, UHB, and BH cohorts.
Table 2 Training, validation, and test set distributions.

During the development process, training sets were used for model development, hyperparameter optimization, and training; validation sets were used for model optimization; and after successful development and training, test sets were used to evaluate the performance of the final models. The OUH training set consisted of COVID-free cases prior to the outbreak, and so we matched every COVID-positive case to twenty COVID-free presentations based on age, representing a simulated prevalence of 5%. For PUH, UHB, and BH, we used undifferentiated test sets representing all patients admitted to the respective trusts in the defined time periods.

Sensitivity, specificity, area under receiver operator characteristic curve (AUROC), and positive and negative predictive values (PPV and NPV) are reported, alongside 95% confidence intervals (CIs), comparing model predictions to results of confirmatory viral testing (laboratory PCR and SAMBA-II). CIs for sensitivity, specificity and predictive values were computed based on standard error, and for AUROC calculated using Hanley and McNeil’s method26. To assess the overall performance of each method, we compute the mean of each metric over all models of the same framework, for each of the held-out test sites they were evaluated on. Performance across methods were compared to determine the best one.

Feature sets and preprocessing

To train and validate our models, we used clinical data with linked, deidentified demographic information for all patients presenting to emergency departments across the four hospital groups. To better compare our results to the clinical validation study performed in ref. 10, we used a similar set of features to one of their models—“CURIAL-Lab”—which used a focused subset of routinely collected clinical features. These included blood tests and vital signs, excluding the coagulation panel and blood gas testing, as these are not performed universally and are less informative10. However, unlike CURIAL-Lab, we did not include the type of oxygen delivery device as a feature, as it is not coded universally between sites (requiring custom preprocessing in order to make the test site data equivalent), and as neural networks evaluate features heavily on their variability with respect to other variables, we wanted to use a feature set consisting of entirely continuous variables to avoid any optimization or convergence issues. Subsequently, we also did not include oxygen saturation, as this is clinically uninterpretable without knowledge of how much oxygen support is needed. Table 3 summarizes the final features included. Summary statistics of vital signs and blood tests are presented in Supplementary Tables 1, 2, respectively.

Table 3 Clinical predictors considered (ALT alanine aminotransferase, CRP C-reactive protein, eGFR estimated glomerular filtration rate).

Consistent with the training performed in ref. 10, we used population median imputation to replace any missing values. We then standardized all features in our data to have a mean of 0 and a standard deviation of 1. To ensure that all test sets were treated independently from the training data, on the assumption that the training data are not accessible at the point of modeling the data from the test sites, preprocessing was performed independently on each target site. Thus, imputation and standardization methods based on the training data were used to preprocess the training and validation sets; the test sets were preprocessed using the same methods based instead on each respective test cohort. For example, the standardization requires knowledge of the variance of each variable; in preprocessing the test data, the variance was thus derived from the test data and not from the training data. This ensures that there is no distribution leakage between the training and test sets, allowing for unbiased external evaluation, which is the hypothesis that we seek to test. Additionally, because test sets are preprocessed independently to the training sets, all comparator models are tested on the same test cohorts, allowing for direct comparison of the various methods.

Model architecture and optimization

To predict COVID-19 status, we used a fully-connected neural network (Fig. 3). The rectified linear unit (ReLU) activation function was used for the hidden layers and the sigmoid activation function was used in the output layer. For updating model weights, the Adaptive Moment Estimation (Adam) optimizer was used during training.

Fig. 3: Model architecture used for classification.
figure 3

Figure shows neural network architecture used for classification, including a visual of the decision boundary used to determine COVID-19 positive or negative status.

For each model developed, appropriate hyperparameter values were determined through standard fivefold cross-validation (CV), using respective training sets. Fivefold CV was used to ensure that hyperparameter values were evaluated on as much data as possible, as to provide the best estimate of potential model performance on unseen data. We performed a grid search to determine: (i) the number of nodes to be used in each layer of the neural network, (ii) the learning rate, and (iii) the dropout rate. When used in combination with fivefold CV, performance estimates for all combinations of hyperparameter values can be evaluated, allowing for the optimal combination to chosen for model training. Details on the hyperparameter values used in the final models can be found in Supplementary Table 3.

The raw output of many ML classification algorithms is a probability of class membership, which must be interpreted before it’s mapped to a particular class label (Fig. 3). For binary classification, the default threshold is typically 0.5, where all values equal to or greater than 0.5 are mapped to one class and all other values are mapped to the other. However, this default threshold can lead to poor performance, especially when there is a large class imbalance (as seen with our training datasets). Thus, we used a grid search to adjust the decision boundary used for identifying COVID-19 positive or negative cases. For our purposes, the threshold was optimized to sensitivities of 0.85 to ensure clinically acceptable performance in detecting positive COVID-19 cases. This sensitivity was chosen to exceed the sensitivity of lateral flow device (LFD) tests which are used in routine care. LFD sensitivity for OUH admissions between December 23, 2021 and March 6, 2021 was 56.9% (95% confidence interval 51.7–62.0%)10. Additionally, the gold standard for diagnosis of viral genome targets is by real-time polymerase chain reaction (RT-PCR), which has an estimated sensitivity of ~70%27. Thus, a threshold of 0.85 ensures that the model is effective at identifying COVID-19 positive cases, while exceeding the sensitivities of current diagnostic testing methods.

Baseline models

To start, we established baseline models for single-site and multi-site training. For the multi-site baseline, we trained and optimized a model using the combined training data and combined validation data from all the sites. Here, preprocessing was performed after the data had been combined. This model was then tested on all four test sets, separately.

For the single-site baseline models, we used data from each site separately in training, building four custom models. We then subsequently tested each model on the held-out test set from the same site used for training, akin to internal validation.

After training the two baseline models described, we then evaluated three different methods for adopting the single-site models in new settings (for use as ready-made models), akin to external validation. The three methods are: (1) Testing models “as-is”, (2) Readjusting model output thresholds using test site-specific data, and (3) Finetuning models (via transfer learning) using test site-specific data.

Testing “as-is”

The first method applies a ready-made model “as-is,” without any modifications. Thus, for the previously trained single-site models, we directly evaluated them on each of the external test sets. This method can be used when the external site does not have access to the original training data, preprocessing transforms, model architecture, and model weights.

Threshold adjustment

To adapt a ready-made model to a new setting, a ready-made model can be adjusted, via re-thresholding, using test site-specific data. For each site tested, we used the site-specific validation set to readjust the decision boundary on each ready-made model (i.e., optimizing the model to a sensitivity of 0.85, based on the new site’s data). The readjusted model was then evaluated on its respective test set. This method can also be used when the external site does not have access to the original training data, preprocessing transforms, model architecture, or model weights, as it only modifies the decision threshold of the final model output (recall Fig. 3).

Transfer learning

The final method adapts a ready-made model to a new setting using transfer learning. For each site tested, we used the site-specific training and validation set to finetune each ready-made model by updating the existing weights. For each site, we randomly sampled 50% of the training data to perform transfer learning, modeling real-world scenarios where new sites may not have a lot of data to train standalone models. We used a small learning rate (0.0001) so that the procedure would not completely override the previously learned weights as might otherwise occur with a larger value of the learning rate. As before, the test site-specific validation set was used to optimize the model to a sensitivity of 0.85. The resultant finetuned model was then evaluated on its respective test set. This technique can be used when the external site does not have access to the original training data or the preprocessing pipeline (including functions, transforms, etc.), but does have access to the model architecture and pretrained model weights.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.


By admin

Leave a Reply

Your email address will not be published. Required fields are marked *