While building our other packages, particularly the Stranded Patients and Readmissions modules, it became apparent that being able to predict length of stay (LOS) would help establish which factors cause some patients to stay longer than others.
Because of these musings, and our progress with the other modules, the data science team at Draper & Dash decided to create an algorithm that predicts what a patient's length of stay will be. The prediction is made on the first day of the inpatient stay, allowing key decision makers and clinical staff to identify longer-staying patients sooner. Again, there are links to what are known as long stayer patients (another way of saying a patient has had an LOS of more than 7 days).
Our ML algorithm: the technicals
Most researchers and data scientists in the NHS, when they come to predict LOS, reach for regression techniques. The D&D approach is to turn this concept on its head: instead of regression, we create categorical bandings of LOS and then predict which of these categories a patient will fall into. The categories were chosen using simple descriptive analysis to understand the range of values. Our algorithm buckets LOS into the categories listed below:
- 0 to 2 days
- 3 to 5 days
- 6 to 15 days
- 15 to 25 days
- 25 to 30 days
- and so on.
Trusts interested in using this method can change these cohorts and bands, as a quick descriptive summary analysis will reveal the distribution of LOS in their own data.
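As an illustration of the banding step, here is a minimal Python sketch, not our production code. The edge values come from the list above; the function name `los_band`, the "over 30 days" catch-all label and the rule that a boundary value (such as 15 days) falls into the lower band are assumptions made for the example.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper edge (in days) of each band, taken from the list above.
BAND_UPPER_EDGES = [2, 5, 15, 25, 30]

# Labels follow the list above; "over 30 days" is a hypothetical catch-all
# standing in for "and so on".
BAND_LABELS = [
    "0 to 2 days",
    "3 to 5 days",
    "6 to 15 days",
    "15 to 25 days",
    "25 to 30 days",
    "over 30 days",
]


def los_band(los_days):
    """Return the band label for a length of stay given in whole days.

    Assumption: a value sitting exactly on a shared edge (e.g. 15)
    is placed in the lower band.
    """
    if los_days < 0:
        raise ValueError("length of stay cannot be negative")
    return BAND_LABELS[bisect_left(BAND_UPPER_EDGES, los_days)]


# A quick descriptive summary of made-up LOS values, to see how many
# patients land in each band before deciding whether to move the edges.
sample_los = [0, 1, 1, 2, 3, 4, 7, 9, 15, 16, 28, 45]
band_counts = Counter(los_band(days) for days in sample_los)
```

A trust that wants different bands would only need to change the two lists at the top.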
Our ML Models
Once LOS has been converted into categorical bands (buckets), we run our custom-built ensemble ML algorithm to handle the classification task.
All the capabilities of the algorithm have been added by our in-house data science experts and are built into the module, so if you are interested in the model, your analysts do not need to put in the sweat: we have put the effort in for you.
The algorithm makes a complex prediction using numerous predictor variables (the things we use to make the predictions, e.g. patient age, co-morbidities, etc.). Because it is an ensemble model, it has the advantage of combining several ML models and harnessing their strengths to improve prediction accuracy and depth.
Our custom (no one else has it) ensemble methods
The models used to address the ML problem are Gradient Boosting Machines (GBM) and Recursive Partitioning (RPART).
Once we have obtained the predictions from the two models, we combine them to obtain a better, more balanced and more trustworthy result, in essence refining the accuracy of the prediction.
What do these models do?
- RPART is a kind of decision tree that handles many predictor variables at once. It allows extensive user customization, so it can be tuned towards specificity, RMSE (Root Mean Squared Error, used in regression analysis), or any other measure that the ML engineer / data scientist deems important for a particular use case. Moreover, it is a robust model, and its specificity tends to be high.
- GBM is a different kind of tree algorithm, in which each new tree is fitted to a modified version of the original data set. It is an ensemble of weak prediction models whose combined “weak” predictions generate a strong prediction. What does boosting mean? In a multivariate prediction, some observations are easy to classify and some are much harder. In each following iteration, the model gives more emphasis to the observations it currently gets wrong and less to those that are easy to classify (in gradient boosting, this is done by fitting the new tree to the errors of the trees built so far). The new tree is then added to the existing ones, and the process is repeated for a specified number of iterations, incrementally improving the results.
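To make the boosting idea concrete, here is a toy sketch in Python. It is a simplification under stated assumptions, not the GBM implementation we use: it boosts one-split trees (stumps) on a single made-up predictor with squared-error loss, whereas a real GBM classifier works on many predictors and a classification loss. All names (`fit_stump`, `boost`, `predict`) and the data are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a one-split tree: pick the threshold on x that minimises squared
    error, predicting the mean target on each side.
    Needs at least two distinct x values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a new stump to the residuals (the errors of the
    ensemble so far) and adds a damped copy of it to the ensemble."""
    base = sum(ys) / len(ys)              # start from the overall mean
    preds = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)  # the "modified" data set
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history


def predict(base, stumps, x, learning_rate=0.5):
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Made-up data: a single risk score (x) against LOS in days (y).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 2, 2, 4, 6, 9, 12, 20]
base, stumps, mse_history = boost(xs, ys)
```

The training error recorded in `mse_history` shrinks from round to round, which is the incremental improvement described above.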
Once we have generated predictions with these two models, we combine their outputs, giving each model a different contribution, to generate a more balanced and useful output.
The D&D Feature Importance Locator Engine (FILE) is then used to generate a string of the dominant factors that drove the predicted LOS. As mentioned previously, this would be a string of factors similar to this example: age between 80-90, presenting complaint of fracture, social care intervention needed, etc.
This method was covered in our previous blog post.
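One common way to combine two classifiers is a weighted average of the probabilities each assigns to every band. The sketch below shows that idea in Python; it is an assumption for illustration, not a description of our exact combination rule, and the weight of 0.6, the function names and the probabilities are all made up.

```python
def combine_predictions(probs_gbm, probs_rpart, weight_gbm=0.6):
    """Weighted average of two models' per-band probabilities.

    weight_gbm is a hypothetical contribution for the GBM model;
    the RPART model receives the remainder (1 - weight_gbm).
    """
    weight_rpart = 1.0 - weight_gbm
    return {
        band: weight_gbm * probs_gbm[band] + weight_rpart * probs_rpart[band]
        for band in probs_gbm
    }


def top_band(probs):
    """Return the band with the highest combined probability."""
    return max(probs, key=probs.get)


# Made-up probabilities for one patient from each model.
gbm_probs = {"0 to 2 days": 0.2, "3 to 5 days": 0.5, "6 to 15 days": 0.3}
rpart_probs = {"0 to 2 days": 0.1, "3 to 5 days": 0.2, "6 to 15 days": 0.7}

combined = combine_predictions(gbm_probs, rpart_probs)
predicted_band = top_band(combined)
```

In this made-up case the two models disagree, and the combined probabilities settle the matter according to how confident each model is and how much weight it carries.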
Want to find out more?
We are truly excited about our machine learning complement. We have much more in the pipeline, so stay posted for the exciting things we are doing with radiology turnaround times over the coming weeks.
If you are interested in any of our existing solutions, then please contact info@draperanddash.com.