Predictive Solutions Series – Stranded and Super Stranded ML Module

What is the motive for this solution now? Year on year, patient flow is an issue for trusts across the country, especially downstream flow. But what if you could identify a patient who is likely to stay longer than 7 days? What if you could also identify those super long stay (super stranded) patients whose inpatient length of stay exceeds 21 days? We have an algorithm and module that do just that.

Our development team at Draper and Dash has created an application and predictive module that can identify these patients within the first day of their inpatient journey. For this, we use patient features such as age, gender, day of the week of admission, waiting time, and so on. Our aim is to create models that generalise from trust to trust.
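Purely as an illustration, a modelling frame for these two flags might look something like the sketch below in R. The column names, the synthetic values, and the way length of stay is simulated are all assumptions for the example, not our production schema; only the 7-day and 21-day thresholds come from the definitions above.

```r
# Minimal sketch of a modelling frame for the stranded / super stranded flags.
# Column names and the synthetic data are illustrative assumptions.
set.seed(123)
n <- 1000

patients <- data.frame(
  age            = sample(18:95, n, replace = TRUE),
  gender         = factor(sample(c("F", "M"), n, replace = TRUE)),
  admission_day  = factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                                 n, replace = TRUE)),
  wait_hours     = round(rexp(n, rate = 1 / 4), 1),   # time waited before admission
  length_of_stay = rpois(n, lambda = 6)               # inpatient length of stay (days)
)

# Binary targets: stranded = LOS > 7 days, super stranded = LOS > 21 days
patients$stranded       <- factor(ifelse(patients$length_of_stay > 7,  "Yes", "No"))
patients$super_stranded <- factor(ifelse(patients$length_of_stay > 21, "Yes", "No"))

str(patients)
```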

The technicals

You may ask, "surely it is easy to predict whether someone will be deemed stranded?" We answer: not without the expertise of our in-house healthcare data science team. This is in fact a very complex prediction, using a great number of variables, across many different patient types. Our method is to create a customised ensemble model (in plain terms, a combination of multiple models) that takes advantage of the predictive capabilities of three different ML models and combines them to boost accuracy, sensitivity and specificity (https://www.draperanddash.com/machinelearning/2019/08/bucketing-and-highlighting-dominant-predictors-in-your-ml-models/).

The algorithms

The models used are Recursive Partitioning (rpart), trained via the caret package; Naïve Bayes (NB), from the e1071 package; and the K-means clustering algorithm, available in base R's stats package.

One might wonder: why three different algorithms? And why these three?

Because RPART, NB, and K-means take very different approaches to making a prediction, and this diversity is extremely useful for obtaining a balanced and trustworthy output. So what do these models do?

  1. RPART is a kind of decision tree, particularly well suited to multivariate analysis. It allows extensive user customisation, so it can be tuned to favour specificity, RMSE (Root Mean Squared Error, used in regression analysis), or any other metric the ML engineer or data scientist deems important for a particular use case. It is also a robust and precise model: its specificity tends to be high.
  2. Naïve Bayes is a simple and fast probabilistic classifier. It assumes that all variables are independent of each other and, using a simple statistical approach, produces the least conservative of the three outputs. Its specificity is weaker, but it tends to perform very well on sensitivity. To understand what these terms mean, see our blog post on understanding confusion matrices: https://www.draperanddash.com/machinelearning/2019/07/confusion-matrices-evaluating-your-classification-models/.
  3. The K-means clustering algorithm uses a unique approach, and it was specifically designed for the Stranded and Super Stranded apps. After bucketing the variables (https://www.draperanddash.com/machinelearning/2019/08/bucketing-and-highlighting-dominant-predictors-in-your-ml-models/), we use this model to find "groups of neighbours", that is, patients whose data is distributed in a similar fashion to one another. Once we have obtained the different groups (in our case, we split the data into around 800 groups), we calculate the mean of the stranded flag within each group (i.e. the proportion of stranded patients), multiply it by 100, and use this number as the probability that a patient in this group will be stranded. When a new patient is admitted as an inpatient, we simply find out which group that patient belongs to and assign a probability of being stranded based on the historic data of patients within the same group. A minimal code sketch of this grouping approach, alongside the other two models, follows below.
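To make this more concrete, here is a minimal, hedged sketch in R of how the three base models could be fitted and scored. Everything here is illustrative: the synthetic data and column names, the train/test split, and the 25 clusters (chosen only to keep the toy example small, where the production module uses around 800 bucketed groups) are assumptions, not our tuned configuration.

```r
library(caret)   # provides train(), which wraps the rpart recursive partitioning model
library(e1071)   # provides naiveBayes()

# Illustrative synthetic data, as in the earlier sketch
set.seed(123)
n <- 1000
patients <- data.frame(
  age           = sample(18:95, n, replace = TRUE),
  gender        = factor(sample(c("F", "M"), n, replace = TRUE)),
  admission_day = factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                                n, replace = TRUE)),
  wait_hours    = round(rexp(n, rate = 1 / 4), 1)
)
los <- rpois(n, lambda = 4 + patients$age / 20)           # length of stay, loosely driven by age
patients$stranded <- factor(ifelse(los > 7, "Yes", "No"))

train_idx <- sample(seq_len(n), size = 0.8 * n)
train_df  <- patients[train_idx, ]
test_df   <- patients[-train_idx, ]

# 1. Recursive partitioning (decision tree), trained via caret
rpart_fit  <- train(stranded ~ ., data = train_df, method = "rpart")
rpart_prob <- predict(rpart_fit, newdata = test_df, type = "prob")[, "Yes"]

# 2. Naive Bayes from e1071
nb_fit  <- naiveBayes(stranded ~ ., data = train_df)
nb_prob <- predict(nb_fit, newdata = test_df, type = "raw")[, "Yes"]

# 3. K-means "groups of neighbours": cluster the training patients, then use the
#    historic stranded rate within each cluster as the probability for new patients
#    assigned to that cluster.
X  <- scale(model.matrix(~ . - 1,
                         data = patients[, c("age", "gender", "admission_day", "wait_hours")]))
km <- kmeans(X[train_idx, ], centers = 25, nstart = 5)

cluster_rate <- tapply(train_df$stranded == "Yes", km$cluster, mean)   # per-cluster stranded rate

nearest_centre <- function(x, centres) {
  which.min(colSums((t(centres) - x) ^ 2))   # index of the closest cluster centroid
}
test_cluster <- apply(X[-train_idx, , drop = FALSE], 1, nearest_centre, centres = km$centers)
km_prob      <- as.numeric(cluster_rate[as.character(test_cluster)])
```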

Once we have run the prediction with these three models, we combine their outputs, weighting each model's contribution differently, to generate a more balanced, accurate, and stronger prediction.
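As a hedged sketch of this combination step: the probability vectors below are toy stand-ins for the rpart_prob, nb_prob and km_prob outputs produced above, and the weights and 0.5 threshold are arbitrary placeholders rather than our tuned contributions.

```r
# Toy per-patient probabilities of being stranded from the three base models
rpart_prob <- c(0.82, 0.15, 0.40, 0.91, 0.05)
nb_prob    <- c(0.76, 0.30, 0.55, 0.91, 0.12)
km_prob    <- c(0.68, 0.22, 0.47, 0.88, 0.09)

# Illustrative weights only; the real contributions are tuned during development
weights <- c(rpart = 0.4, nb = 0.3, km = 0.3)

ensemble_prob <- weights["rpart"] * rpart_prob +
                 weights["nb"]    * nb_prob +
                 weights["km"]    * km_prob

# Flag patients whose blended probability crosses a (placeholder) threshold
stranded_flag <- ifelse(ensemble_prob >= 0.5, "Yes", "No")
data.frame(rpart_prob, nb_prob, km_prob, ensemble_prob, stranded_flag)
```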

Using the method explained in a previous blog post (https://www.draperanddash.com/machinelearning/2019/08/bucketing-and-highlighting-dominant-predictors-in-your-ml-models/), we also generate a list of the most important variables behind each prediction.
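To give a feel for what a variable importance output looks like, here is a generic, self-contained illustration using rpart's built-in importance scores on the same kind of synthetic data. This is not the bucketing method from the linked post, and the data and column names are again assumptions made for the example.

```r
library(rpart)

# Synthetic data as in the earlier sketches (illustrative only)
set.seed(123)
n <- 1000
patients <- data.frame(
  age           = sample(18:95, n, replace = TRUE),
  gender        = factor(sample(c("F", "M"), n, replace = TRUE)),
  admission_day = factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                                n, replace = TRUE)),
  wait_hours    = round(rexp(n, rate = 1 / 4), 1)
)
los <- rpois(n, lambda = 4 + patients$age / 20)
patients$stranded <- factor(ifelse(los > 7, "Yes", "No"))

fit <- rpart(stranded ~ ., data = patients, method = "class")

# Named vector of importance scores; higher = more influential in the fitted tree
sort(fit$variable.importance, decreasing = TRUE)
```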

With this customised system, we obtain predictions that are trustworthy, tested, and genuinely useful to the practitioner.

Want to find out more?

If you are interested in identifying long-staying patients sooner, then act now and arrange a demo with our support team. Enquiries to info@draperanddash.com.

Alfonso Portabales – Data Scientist