predicting-readmission-nlp-from-discharge-summary

Data Science Capstone Project

Predicting 30-Day All-Cause Readmission from Hospital Discharge Summary

Student name: Cynthia Pedrasa
Student pace: self paced
Scheduled project review date/time: Monday, July 27, 2020
Instructor name: Jeff Herman
Deliverables:

  1. Jupyter Notebooks:
    a. Data Preprocessing
    b. NLP Text Preparation
    c. Model tuning

  2. Blog Post:

  3. Executive Summary:

Hospital Readmissions are both a Clinical and Financial Problem!

Business Drivers

Currently, clinical data use is limited to structured information: dashboards report only discrete data elements and coded information. However, more than 80 percent of a healthcare organization’s data is reportedly unstructured, including physician notes, clinical assessments, registration forms, discharge summaries, and other nonstandardized electronic forms, which makes data collection and analysis difficult with standard methods. If we can extract insights from free-text doctors’ notes and identify risk factors in this sea of data, we may be able to supplement the prediction of readmission risk and improve patient outcomes.

Figure source: Weber GM, et al., Finding the missing link for big biomedical data. JAMA 2014; 311(24):

Barriers to healthcare data:
* patient protection
* data quality
* cost (monetary, time, resources)
* transparency
* disparate rules across stakeholders

| Structured Data Characteristics | Unstructured Data Characteristics |
| --- | --- |
| Pre-defined ontology | Not pre-defined – may be text, image, sound, video |
| Easy to search | Difficult to search |
| Examples: ICD-10-CM, CPT, LOINC, SNOMED | Examples: Discharge Summary, Clinical notes, Radiographs, Mobile health data |

We will use natural language processing to turn unstructured discharge summary data into information that helps identify at-risk patients and allows clinicians to intervene. Hospital discharge summaries are the primary documents that communicate a patient’s care plan to the post-hospital care team, and they accompany the patient to the next setting of care. High-quality discharge summaries are generally considered essential for patient safety during transitions between care settings, particularly in the initial post-hospital period, and they play an important role in preventing avoidable hospital readmissions.

The Joint Commission mandates that six components be present in all U.S. hospital discharge summaries:
1. Reason for hospitalization
2. Significant findings
3. Procedures and treatment provided
4. Patient’s discharge condition
5. Patient and family instructions (as appropriate)
6. Attending physician’s signature

Build the MIMIC Database

We will use MIMIC-III (Medical Information Mart for Intensive Care III), a freely accessible critical care database. MIMIC-III is a relational database that contains de-identified data from over 40,000 patients admitted to Beth Israel Deaconess Medical Center in Boston, Massachusetts, from 2001 to 2012. Because MIMIC-III contains detailed information about the care of real patients, access requires credentialing.

To get access to the data for this project, request access at https://mimic.physionet.org/gettingstarted/access/ and complete the required CITI training course, “Data or Specimens Only Research”.

    from IPython.display import IFrame
    IFrame('https://mit-lcp.github.io/mimic-schema-spy/relationships.html', width=700, height=450)

Data Description

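A SQLite database was generated from the MIMIC-III CSV files. MIMIC-III is a relational database consisting of 26 tables. For a detailed description of the database structure, see the MIMIC-III Clinical Database overview (https://mimic.physionet.org/gettingstarted/overview/) and the Database Schema (https://mit-lcp.github.io/mimic-schema-spy/relationships.html).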
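As a rough illustration of how CSV files might be loaded into SQLite, the sketch below uses pandas with the standard-library sqlite3 module. The helper name `csv_to_sqlite`, the chunk size, and the sample file and table names are assumptions made for this example; this is not the project's actual loading code.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def csv_to_sqlite(csv_path, db_path, table_name, chunksize=50_000):
    """Append one CSV file to a SQLite table in chunks and return the row count."""
    rows = 0
    conn = sqlite3.connect(db_path)
    try:
        # Reading in chunks keeps memory use bounded for large CSV files.
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            chunk.to_sql(table_name, conn, if_exists="append", index=False)
            rows += len(chunk)
        conn.commit()
    finally:
        conn.close()
    return rows


# Tiny made-up CSV standing in for one MIMIC-III table (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "ADMISSIONS_sample.csv"
    csv_file.write_text("SUBJECT_ID,HADM_ID\n1,100\n2,200\n")
    db_file = Path(tmp) / "mimic_sample.db"

    n_loaded = csv_to_sqlite(csv_file, db_file, "admissions")

    conn = sqlite3.connect(db_file)
    n_in_db = conn.execute("SELECT COUNT(*) FROM admissions").fetchone()[0]
    conn.close()
```

Calling the helper once per CSV file would build up the database table by table; `if_exists="append"` means rerunning it on the same file would duplicate rows, so a fresh database file is the safer starting point.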

The data files are distributed in comma-separated value (CSV) format following the RFC 4180 standard. Notably, string fields that contain commas, newlines, and/or double quotes are enclosed in double quotes ("). Actual double quotes in the data are escaped using an additional double quote. For example, the string she said "the patient was notified at 6pm" would be stored in the CSV as "she said ""the patient was notified at 6pm""". More detail is provided on the RFC 4180 description page: https://tools.ietf.org/html/rfc4180
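The quoting rule can be checked with Python's standard csv module, whose default dialect applies the same double-quote escaping. This is only a small round-trip sketch using the example string from the paragraph above; the variable names are illustrative.

```python
import csv
import io

original = 'she said "the patient was notified at 6pm"'

# Write a single-field row; the default dialect doubles embedded quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow([original])
encoded = buffer.getvalue().strip()

# Read the row back to confirm the field survives the round trip.
decoded = next(csv.reader(io.StringIO(encoded)))[0]

print(encoded)
print(decoded == original)
```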

Visualization

Multidisciplinary healthcare workers generate many unstructured notes every day, e.g. assessments, medical histories, progress notes, and discharge notes.

For hospital readmissions, we consider only adult, non-elective encounters.
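A cohort filter of this kind might look like the following pandas sketch. The column names `age` and `admission_type`, the age threshold of 18, and the toy rows are all assumptions for illustration; the project's own preprocessing may derive age and encounter type differently.

```python
import pandas as pd


def keep_adult_non_elective(admissions, min_age=18):
    """Keep rows for adult patients whose admission type is not ELECTIVE."""
    is_adult = admissions["age"] >= min_age
    is_non_elective = admissions["admission_type"].str.upper() != "ELECTIVE"
    return admissions.loc[is_adult & is_non_elective].reset_index(drop=True)


# Toy admissions table (hypothetical values).
admissions = pd.DataFrame(
    {
        "hadm_id": [1, 2, 3, 4],
        "age": [65, 0, 17, 80],
        "admission_type": ["EMERGENCY", "NEWBORN", "EMERGENCY", "ELECTIVE"],
    }
)

cohort = keep_adult_non_elective(admissions)
print(cohort)
```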

We are predicting whether a patient is readmitted within 30 days: “yes” (1) means the patient had a hospital readmission, and “no” (0) means they did not.
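One way to derive such a binary label is to compare each discharge time with the same patient's next admission time. The sketch below assumes columns named `subject_id`, `admittime`, and `dischtime`, and a hypothetical output column `readmit_30d`; which later admissions should count toward the label is a design decision this sketch does not settle.

```python
import pandas as pd


def add_readmission_label(admissions, window_days=30):
    """Label each admission 1 if the same patient is admitted again within the window."""
    adm = admissions.sort_values(["subject_id", "admittime"]).copy()
    # Next admission time for the same patient (NaT for the last admission).
    next_admit = adm.groupby("subject_id")["admittime"].shift(-1)
    gap_days = (next_admit - adm["dischtime"]).dt.total_seconds() / 86400
    # Comparisons against NaN are False, so last admissions get label 0.
    adm["readmit_30d"] = ((gap_days >= 0) & (gap_days <= window_days)).astype(int)
    return adm


# Toy data: patient 1 returns 15 days after the first discharge.
admissions = pd.DataFrame(
    {
        "subject_id": [1, 1, 1, 2],
        "admittime": pd.to_datetime(
            ["2020-01-01", "2020-01-20", "2020-06-01", "2020-03-01"]
        ),
        "dischtime": pd.to_datetime(
            ["2020-01-05", "2020-01-25", "2020-06-05", "2020-03-04"]
        ),
    }
)

labeled = add_readmission_label(admissions)
print(labeled[["subject_id", "readmit_30d"]])
```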

30-day hospital readmission is a problem!

What were the sources of admission for our patients?

[Figure]

Data Preprocessing

[Figure]

TEXT PROCESSING
Note: We skipped the use of Stemming (process of reducing each word to its root or base) and Lemmatization (a more calculated process of returning the base or dictionary form of a word) in this project.

After cleaning the text and splitting it into tokens, we converted the discharge summary text to numbers.
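The text-to-numbers step can be sketched with a bag-of-words representation from scikit-learn. The choice of `CountVectorizer`, the cleaning rules in the hypothetical `clean_and_tokenize` helper, and the two sample notes are assumptions for illustration; the notebooks may use a different vectorizer or cleaning procedure.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def clean_and_tokenize(text):
    """Lowercase, replace anything that is not a letter with a space, split on whitespace."""
    text = text.lower().replace("\n", " ")
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


# Two made-up notes standing in for discharge summaries.
notes = [
    "Patient admitted with CHF.\nDischarged home on 40mg lasix.",
    "Patient discharged home.",
]

# token_pattern=None because the custom tokenizer replaces the default pattern.
vectorizer = CountVectorizer(
    tokenizer=clean_and_tokenize, lowercase=False, token_pattern=None
)
X = vectorizer.fit_transform(notes)  # sparse matrix: one row per note
vocabulary = vectorizer.vocabulary_  # maps each token to its column index

print(X.shape)
print(sorted(vocabulary))
```

Each row of `X` counts how often each vocabulary token occurs in a note, which is the numeric input a classifier can consume.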

Model Selection

We spot-checked several machine learning classification algorithms and compared their AUC scores.

The XGBoost classifier had the best AUC score and was selected to predict readmissions from discharge notes.
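A spot-check of this kind can be sketched as a loop over candidate classifiers scored by cross-validated AUC. The candidate list, the synthetic data, and the 5-fold setting below are assumptions for illustration; an `XGBClassifier` from the xgboost package could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the vectorized discharge summaries.
X, y = make_classification(
    n_samples=300, n_features=20, weights=[0.8, 0.2], random_state=0
)

# Hypothetical candidate set; not the exact list used in the project.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean ROC AUC over 5 cross-validation folds for each candidate.
auc_by_model = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_model_name = max(auc_by_model, key=auc_by_model.get)

for name, auc in sorted(auc_by_model.items(), key=lambda item: -item[1]):
    print(f"{name}: AUC = {auc:.3f}")
```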

Train Test Split

For model validation, the dataset was split into training, validation, and test sets.
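A three-way split can be produced by calling scikit-learn's `train_test_split` twice. The 70/15/15 proportions, the stratification, and the toy arrays below are assumptions for illustration, since the split ratios are not stated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced label vector (20% positives).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# First split: 70% training, 30% held out.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Second split: divide the held-out part evenly into validation and test.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=42
)

print(len(X_train), len(X_valid), len(X_test))
```

Stratifying on the label keeps the share of readmissions roughly equal across the three sets, which matters when the positive class is rare.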

Grid search was run to determine the best parameters for the selected classifier. The model with the tuned hyperparameters was fit on the training data and assessed using the metrics in the sections that follow.
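The grid-search step can be sketched with scikit-learn's `GridSearchCV`. To keep the example dependent on scikit-learn only, `GradientBoostingClassifier` is swapped in here as a stand-in for the XGBoost classifier; the parameter grid, the synthetic data, and the 3-fold setting are likewise assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the vectorized notes and readmission labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Small hypothetical grid; a real search would cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# Score the refit best model on data it has not seen during the search.
valid_scores = search.best_estimator_.predict_proba(X_valid)[:, 1]
valid_auc = roc_auc_score(y_valid, valid_scores)

print(search.best_params_)
print(f"validation AUC = {valid_auc:.3f}")
```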

Cross-Validation: Validation Accuracy

A given classification model was fit on the training data. It then classified the validation data. To assess the accuracy of the model, those predictions were compared to the actual labels.

Confusion Matrices

A confusion matrix was generated for each classifier. The confusion matrix shows the number of:
* true positives
* true negatives
* false positives
* false negatives
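The four counts in a confusion matrix (true negatives, false positives, false negatives, true positives) can be read off with scikit-learn, and accuracy, precision, and recall follow from them. The label vectors below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual labels and predictions (1 = readmitted, 0 = not readmitted).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```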

Performance Metrics

[Figure]

Feature Importance

[Figure]

Conclusions

[Figure]

Future Work

Compared with random prediction (AUC = 0.5), our model's AUC of 0.71 is a good baseline for further improvement.
Feature engineering, model ensembles, and further parameter tuning would help the model's adoption as a clinical decision system for evaluating readmission risk. Exploring other unstructured notes and/or combining them with structured clinical information could strengthen predictive scores.

Predicting hospital readmissions from unstructured data opens many opportunities in predictive analytics, where a vast amount of untapped data could be used to reduce hospital readmissions, improve patient outcomes, and lower healthcare costs while providing quality care.

Reduce Readmission by Predicting Patients at risk for Readmission

===========================================================================

Prerequisites

===========================================================================
You may need to install some software and packages.

  1. Install Anaconda (https://docs.anaconda.com/anaconda/install/)

  2. Install DB Browser for SQLite (https://sqlitebrowser.org/)

  3. Install Scikit-learn (https://anaconda.org/anaconda/scikit-learn)
    conda install -c anaconda scikit-learn
    
  4. Install NLTK (http://www.nltk.org/install.html)
    conda install -c anaconda nltk
    
  5. Install the TensorFlow deep learning library

  6. Install Imbalanced-Learn Library (https://anaconda.org/conda-forge/imbalanced-learn)
    conda install -c conda-forge imbalanced-learn
    
  7. Install XGBoost Library (https://anaconda.org/conda-forge/xgboost)
    conda install -c conda-forge xgboost
    

===========================================================================

Acknowledgments

===========================================================================

References

MIMIC-III citation

MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35. Available at: http://www.nature.com/articles/sdata201635

Mimic III Data

Pollard, T. J. & Johnson, A. E. W. The MIMIC-III Clinical Database http://dx.doi.org/10.13026/C2XW26 (2016).

PhysioNet

PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov P, Mark RG, Mietus JE, Moody GB, Peng C, and Stanley HE. Circulation 101(23):e215–e220, 2000.

MIMIC Code Repository

Johnson, Alistair EW, David J. Stone, Leo A. Celi, and Tom J. Pollard. “The MIMIC Code Repository: enabling reproducibility in critical care research.” Journal of the American Medical Informatics Association (2017): ocx084.

MIMIC-III has extensive documentation, linked below, with additional information about the data source.

MIMIC-III Data

https://mimic.physionet.org/gettingstarted/overview/

MIMIC-III Schema

https://mit-lcp.github.io/mimic-schema-spy/relationships.html