Data Preparation

Model your data as a series of process observations or measures that are associated with an outcome of interest.  Compose each observation as a common set of features (also known as independent variables or factors) along with the associated outcome.  Both features and outcomes are either numerical (dates, times, ages, etc.), binary (yes/no, true/false, etc.) or categorical (gender, service line, floor unit, shift, DRG, etc.).



Example Applications

Prepare a comma-delimited file with a header row as shown below.  Include a single categorical or binary outcome column as the rightmost column.  Input data can be numerical, categorical or free text.  Input data columns contain feature values that may be expected to have some impact on the outcome.  Feature selection, also called variable selection, is best determined in advance of data preparation by subject matter experts.  We work closely with our customers to incorporate domain knowledge within the feature set design.

Numerical feature values must begin with a number and have an ordered relationship, such as patient temperature.

Categorical feature values must begin with a letter and have no order among the possible values of the variable (e.g., there is no order relationship between Sunny and Rain; neither is bigger or smaller than the other, they are just distinct).
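The first-character rule above can be checked before a file is submitted. The following is a minimal sketch, not part of the product itself: the sample columns, the values, and the infer_column_types helper are all hypothetical, and only the first data row is inspected.

```python
import csv
import io

# Hypothetical sample file; column names and values are illustrative only.
SAMPLE_CSV = """Age,Temperature,Shift,Outcome
64,38.2,Night,Advanced
51,37.1,Day,Basic
"""

def infer_column_types(csv_text):
    """Label each column 'numerical' or 'categorical' from its first data row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)      # header row with the column names
    first_row = next(reader)   # first observation
    types = {}
    for name, value in zip(header, first_row):
        value = value.strip()
        # Rule from the text: numerical values begin with a number,
        # categorical values begin with a letter.
        types[name] = "numerical" if value[:1].isdigit() else "categorical"
    return types

column_types = infer_column_types(SAMPLE_CSV)
```

Under this rule a value such as -5 would not count as numerical, because it begins with a sign rather than a digit.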

Sample COVID-19 Critical Care Study

The ability to accurately predict the number of needed ventilators is one important aspect in the management of this disease.  Machine learning can leverage the various data points needed to create a predictive model.  This study was based on statistical data published in the ICNARC report on COVID-19 in Critical Care dated 04-APR-2020.  Table 2 of this report characterizes those critical COVID-19 patients who needed 'Basic' respiratory intervention and those who needed 'Advanced' intervention (need for mechanical ventilation).  A patient dataset was synthesized (n=2000) based on the published statistics, and a predictive model for the type of respiratory intervention was generated.


Cross-validation metrics for the learned multi-class classification model, measuring its ability to classify the need for advanced respiratory support:


 Average MicroAccuracy: 0.823    Standard deviation: (0.013 - Confidence Interval 95%)
 Average MacroAccuracy: 0.758   Standard deviation: (0.021 - Confidence Interval 95%)
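The source does not say which toolkit produced these metrics, so the sketch below assumes the common definitions: micro-accuracy is the fraction of all predictions that are correct, and macro-accuracy is the unweighted mean of the per-class accuracies. The micro_macro_accuracy helper and the four sample labels are hypothetical and are not taken from the study.

```python
from collections import defaultdict

def micro_macro_accuracy(actual, predicted):
    """Return (micro, macro) accuracy for paired label lists."""
    correct = defaultdict(int)  # correct predictions per actual class
    total = defaultdict(int)    # observations per actual class
    for a, p in zip(actual, predicted):
        total[a] += 1
        if a == p:
            correct[a] += 1
    # Micro: pool every observation, so large classes dominate.
    micro = sum(correct.values()) / sum(total.values())
    # Macro: average the per-class accuracy, so each class counts equally.
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro

# Illustrative labels only: a model that always predicts 'Basic'.
actual = ["Basic", "Basic", "Basic", "Advanced"]
predicted = ["Basic", "Basic", "Basic", "Basic"]
micro, macro = micro_macro_accuracy(actual, predicted)
```

In this toy case the micro value is 0.75 and the macro value is 0.5. That gap is why a macro figure lower than the micro figure can indicate that the less frequent class is predicted less reliably.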

 Below is a chart that depicts the predictive strength of various factors that were measured within the cited population study.




Sample HCAHPS Quality Ratings Study

The sample below shows encounter data coupled with HCAHPS survey results.  In this example, all encounter data features are categorical.  Use as many categories (columns) as you wish; however, use as few categorical values as possible within each category.  For instance, if the answer to a question is 'sometimes' or 'usually', those answers should be rolled up into a single category.  In our example below, responses of 'sometimes', 'usually' and 'never' were rolled up into a category of 'Other'.  A good practice is to avoid categorical values that do not help distinguish between outcomes of interest.  In this case we were only interested in distinguishing leading factors between 'Always', the only desired response, and all other responses.
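The roll-up described above can be sketched in a few lines. This is an illustration under assumptions: the roll_up helper and the sample responses are hypothetical, and the case-insensitive comparison is a choice made here rather than something the source specifies.

```python
DESIRED = "Always"  # the only desired survey response

def roll_up(response):
    """Map a raw survey response to 'Always' or 'Other'."""
    # Comparison ignores case and surrounding spaces (an assumption of this sketch).
    if response.strip().lower() == DESIRED.lower():
        return DESIRED
    return "Other"

# Hypothetical raw responses, as they might appear in a survey export.
responses = ["Always", "Usually", "sometimes", "Never", "always"]
rolled = [roll_up(r) for r in responses]
```

After the roll-up the outcome column holds only two values, which matches the binary outcome described earlier.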

This is the question:

"During this hospital stay, how often did nurses explain things in a way you could understand?"

So what are the results?

Although Table 1 below consists of only 100 samples and 8 encounter features, it is difficult to manually identify which factors most distinguish the undesired response of 'Other' from the desired response of 'Always'.  View the table below.  For us humans there are too many inputs, too many outputs, too many anomalies and too much randomness.  We would never be able to determine, from looking at the data, which data relationships can predict the outcome.  However, machine learning is able to train on 80% of this data and predict the response label (outcome) of the other 20% with 96-100% accuracy, and tell you what findings it used in prediction!
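The source does not describe how the 80/20 division was made. As a minimal sketch, assuming a simple shuffled split, the hypothetical split_80_20 helper below divides 100 placeholder rows into training and held-out sets.

```python
import random

def split_80_20(rows, seed=0):
    """Shuffle rows and return (train, test) holding 80% and 20% of them."""
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = (len(shuffled) * 4) // 5         # integer arithmetic for the 80% mark
    return shuffled[:cut], shuffled[cut:]

# Placeholder rows standing in for the 100 samples of Table 1.
rows = list(range(100))
train, test = split_80_20(rows)
```

The model is fitted on the training rows only, and the accuracy is then measured on the held-out rows it has not seen.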

We can use a series of rules derived from machine learning to implement process interventions designed to improve patient satisfaction.  For example, we may learn that a particular service line and staff shift are strong factors in a negative survey response.  One can then focus on those process areas for a positive impact on survey results.

 Table 1 - Simulated encounter and HCAHPS results data

  Q3 Survey Data






HTO Magazine