Student name: Cynthia Pedrasa
Student pace: self paced
Scheduled project review date/time: Monday, July 27, 2020
Instructor name: Jeff Herman
Deliverables:
Jupyter Notebooks:
a. Data Preprocessing
b. NLP Text Preparation
c. Model tuning
Currently, clinical data use is limited to structured information. Dashboards are limited to reporting discrete data elements and coded information. However, more than 80 percent of a healthcare organization’s data is reported to be unstructured, including physician notes, clinical assessments, registration forms, discharge summaries, and other nonstandardized electronic forms, which makes data collection and analysis difficult using standard methods. Physicians’ notes are free text; if we can identify risk factors in this sea of data, we may be able to supplement the prediction of readmission risk and improve patient outcomes.
Source: Weber GM, et al., Finding the missing link for big biomedical data. JAMA 2014; 311(24):
Barriers to healthcare data: |
---|
* patient protection |
* data quality |
* cost (monetary, time, resources) |
* transparency |
* disparate rules across stakeholders |
Structured Data Characteristics | Unstructured Data Characteristics |
---|---|
Pre-defined ontology | Not pre-defined – may be text, image, sound, video |
Easy to search | Difficult to search |
Examples: | Examples: |
ICD-10-CM | Discharge Summary |
CPT | Clinical notes |
LOINC | Radiographs |
SNOMED | Mobile health data |
We will use natural language processing to turn the unstructured discharge summary data into information that helps identify at-risk patients and allows clinicians to intervene. Hospital discharge summaries serve as the primary documents communicating a patient’s care plan to the post-hospital care team. The discharge summary is the form of communication that accompanies the patient to the next setting of care. High-quality discharge summaries are generally thought to be essential for promoting patient safety during transitions between care settings, particularly during the initial post-hospital period. They play an important role in preventing avoidable hospital readmissions.
The Joint Commission mandates that six components be present in all U.S. hospital discharge summaries: |
---|
1. Reason for hospitalization |
2. Significant findings |
3. Procedures and treatment provided |
4. Patient’s discharge condition |
5. Patient and family instructions (as appropriate) |
6. Attending physician’s signature |
We will use MIMIC-III (Medical Information Mart for Intensive Care III), a freely accessible hospital database. MIMIC-III is a relational database that contains de-identified data from over 40,000 patients who were admitted to Beth Israel Deaconess Medical Center in Boston, Massachusetts from 2001 to 2012. MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access.
To get access to the data for this project, request access at https://mimic.physionet.org/gettingstarted/access/ and complete the required CITI training course, “Data or Specimens Only Research”.
from IPython.display import IFrame
IFrame('https://mit-lcp.github.io/mimic-schema-spy/relationships.html', width=700, height=450)
A SQLite database was generated from the MIMIC-III CSV files. MIMIC-III is a relational database consisting of 26 tables. For a detailed description of the database structure, see the MIMIC-III Clinical Database and Database Schema documentation.
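The database build could look roughly like the sketch below. It is not the author's script: the function name `load_csv_to_sqlite` and the chunk size are assumptions made for illustration. It uses `pandas.read_csv` with `chunksize` and `DataFrame.to_sql`, so a large CSV file is appended to the table piece by piece rather than loaded into memory at once.

```python
import sqlite3

import pandas as pd


def load_csv_to_sqlite(csv_path, db_path, table_name, chunksize=50_000):
    """Append the rows of one CSV file to a table in a SQLite database.

    Hypothetical helper; returns the number of rows written.
    """
    rows_written = 0
    conn = sqlite3.connect(db_path)
    try:
        # Read the file in chunks so it is never held in memory all at once.
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            chunk.to_sql(table_name, conn, if_exists="append", index=False)
            rows_written += len(chunk)
        conn.commit()
    finally:
        conn.close()
    return rows_written
```

A call such as `load_csv_to_sqlite("ADMISSIONS.csv", "mimic3.db", "ADMISSIONS")` would load one table; the database file name `mimic3.db` is a placeholder.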
The data files are distributed in comma-separated value (CSV) format following the RFC 4180 standard. Notably, string fields which contain commas, newlines, and/or double quotes are encapsulated by double quotes ("). Actual double quotes in the data are escaped using an additional double quote. For example, the string `she said "the patient was notified at 6pm"` would be stored in the CSV as `"she said ""the patient was notified at 6pm"""`. More detail is provided on the RFC 4180 description page: https://tools.ietf.org/html/rfc4180
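The quoting rule can be checked with Python's standard `csv` module, whose default dialect doubles embedded quotes in the same way. This is only an illustration of the escaping convention; the variable names are arbitrary.

```python
import csv
import io

original = 'she said "the patient was notified at 6pm"'

# Write one row: csv.writer wraps the field in quotes and doubles the inner quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow([original])
encoded = buffer.getvalue()

# Read it back: csv.reader undoes the quoting and returns the original string.
decoded = next(csv.reader(io.StringIO(encoded)))[0]

print(encoded.strip())
print(decoded == original)
```

Reading the MIMIC-III CSV files with a parser that follows this convention keeps multi-line note text intact in a single field.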
Many unstructured notes (e.g., assessments, medical histories, progress notes, and discharge notes) are generated daily by multidisciplinary healthcare workers.
For hospital readmissions, we are only concerned with adult, non-elective encounters.
We are predicting readmission as a binary label: “yes” (1) means the patient was readmitted, and “no” (0) means the patient was not readmitted.
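The cohort filter and the binary label could be derived along the lines of the sketch below. Several things here are assumptions rather than the author's method: the 30-day readmission window, the adult cutoff of 18 years, the function name `label_readmissions`, and the `AGE` column (MIMIC-III does not store age directly, so it is assumed to have been computed beforehand). `SUBJECT_ID`, `ADMITTIME`, `DISCHTIME`, and `ADMISSION_TYPE` are columns of the MIMIC-III ADMISSIONS table.

```python
import pandas as pd


def label_readmissions(admissions, window_days=30, min_age=18):
    """Keep adult, non-elective admissions and flag those followed by a readmission."""
    df = admissions.copy()
    df["ADMITTIME"] = pd.to_datetime(df["ADMITTIME"])
    df["DISCHTIME"] = pd.to_datetime(df["DISCHTIME"])

    # Cohort filter: adults only, elective admissions excluded.
    cohort = df[(df["AGE"] >= min_age) & (df["ADMISSION_TYPE"] != "ELECTIVE")]
    cohort = cohort.sort_values(["SUBJECT_ID", "ADMITTIME"]).reset_index(drop=True)

    # Time of the same patient's next admission in the cohort (NaT if there is none).
    next_admit = cohort.groupby("SUBJECT_ID")["ADMITTIME"].shift(-1)
    days_to_next = (next_admit - cohort["DISCHTIME"]).dt.total_seconds() / 86400

    # 1 = readmitted within the assumed window, 0 = not readmitted.
    cohort["READMISSION"] = (days_to_next < window_days).astype(int)
    return cohort


# Toy data for illustration only; AGE is a hypothetical precomputed column.
toy = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2, 3],
    "ADMITTIME": ["2101-01-01", "2101-01-20", "2101-03-01", "2101-05-01"],
    "DISCHTIME": ["2101-01-05", "2101-01-25", "2101-03-04", "2101-05-09"],
    "ADMISSION_TYPE": ["EMERGENCY", "EMERGENCY", "ELECTIVE", "URGENT"],
    "AGE": [60, 60, 45, 70],
})
labeled = label_readmissions(toy)
print(labeled[["SUBJECT_ID", "ADMISSION_TYPE", "READMISSION"]])
```

In the toy data, only the first admission of patient 1 is followed by another admission within the window, so it is the only row labeled 1.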
What were the sources of admission for our patients?
TEXT PROCESSING
Note: We skipped the use of Stemming (process of reducing each word to its root or base) and Lemmatization (a more calculated process of returning the base or dictionary form of a word) in this project.
After cleaning the text and splitting it into tokens, we converted the discharge summary text to numbers.
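A cleaning and tokenizing step of this kind might look like the sketch below. The body is an assumption, since the project's own tokenizer is not shown here: it lowercases the text, replaces punctuation and digits with spaces, and splits on whitespace, with no stemming or lemmatization.

```python
import string

# Map every punctuation character and digit to a space.
_STRIP_TABLE = str.maketrans({ch: " " for ch in string.punctuation + string.digits})


def clean_tokenize(text):
    """Lowercase, drop punctuation and digits, and split on whitespace."""
    return text.lower().translate(_STRIP_TABLE).split()


tokens = clean_tokenize("Pt. admitted 3x; CHF!")
print(tokens)  # ['pt', 'admitted', 'x', 'chf']
```

Because the function takes a string and returns a list of tokens, it has the shape scikit-learn expects for a custom `tokenizer` argument.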
Bag-of-Words Model (BoW)
We want to classify documents: each discharge summary from an admission is an input, and a class label (“Readmission”) is the output of our predictive algorithm. Algorithms take vectors of numbers as input, so we need to convert documents to fixed-length vectors of numbers. For this project we use the Bag-of-Words (BoW) model. This model ignores the order of words and focuses on which words occur in a document and how often.
Count Vectorizer
Count Vectorizer will be utilized to tokenize a collection of text documents, build a vocabulary of known words and also to encode new documents using that vocabulary.
An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.
Here are the steps:
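from sklearn.feature_extraction.text import CountVectorizer
# Build a vocabulary of the 3,000 most frequent tokens in the training notes
vect = CountVectorizer(max_features=3000, tokenizer=clean_tokenize, stop_words=stop_words)
vect.fit(df_train.TEXT.values)
# Encode the training and validation notes with the fitted vocabulary
X_train_tf = vect.transform(df_train.TEXT.values)
X_valid_tf = vect.transform(df_valid.TEXT.values)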
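To see what the fitted vocabulary and the encoded vectors look like, here is a small self-contained sketch on made-up sentences. The toy documents are invented for illustration, and the default scikit-learn tokenizer is used instead of the project's `clean_tokenize`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up notes standing in for discharge summaries.
train_docs = [
    "patient admitted with chest pain",
    "patient discharged home in stable condition",
]
new_docs = ["chest pain resolved patient stable"]

vect = CountVectorizer(max_features=3000)
vect.fit(train_docs)                    # learn the vocabulary from training text only

X_train = vect.transform(train_docs)    # sparse matrix: one row per document
X_new = vect.transform(new_docs)        # words not in the vocabulary are ignored

print(sorted(vect.vocabulary_))
print(X_train.shape)
print(X_new.toarray())
```

The word "resolved" never appeared in the training text, so it contributes nothing to the vector for the new document.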
TF-IDF (Term Frequency - Inverse Document Frequency): word frequency scores that highlight words that are frequent within a document but not across documents.
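The idea can be illustrated with the textbook formula, tf-idf = (count of term in document / document length) × log(number of documents / number of documents containing the term). This is a plain-Python sketch on invented sentences; library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed and normalized variant, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document using the unsmoothed textbook formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = ["patient has chest pain", "patient has fever", "patient denies pain"]
scores = tf_idf(docs)
print(scores[1])
```

The word "patient" occurs in every document, so its score is zero, while "fever" occurs in only one document and scores highest there.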
Spot-check machine learning classification algorithms:
XGBoost Classifier was selected as the model with the best AUC score to predict readmissions from discharge notes.
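A spot-check of this kind can be sketched as a loop that fits several candidate classifiers and compares their validation AUC. The candidate list and the synthetic data below are assumptions for illustration; they are not the models or data used in this project. XGBoost is left out so the sketch depends only on scikit-learn, but `xgboost.XGBClassifier` follows the same estimator interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the vectorized notes and readmission labels.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical candidate list.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

auc_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    prob = model.predict_proba(X_valid)[:, 1]   # probability of the positive class
    auc_scores[name] = roc_auc_score(y_valid, prob)

best_model = max(auc_scores, key=auc_scores.get)
print(auc_scores, best_model)
```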
For validation, the dataset was split into training, validation, and test sets.
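One way to produce the three sets is to call scikit-learn's `train_test_split` twice. The 70/15/15 proportions, the stratification on the label, and the helper name `split_train_valid_test` are assumptions for this sketch; the actual proportions used in the project are not stated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_train_valid_test(X, y, valid_frac=0.15, test_frac=0.15, seed=42):
    """Split into train/validation/test sets, keeping the class ratio in each."""
    holdout_frac = valid_frac + test_frac
    # First split: training set versus everything held out.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout_frac, stratify=y, random_state=seed
    )
    # Second split: divide the held-out part into validation and test sets.
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_hold, y_hold, test_size=test_frac / holdout_frac,
        stratify=y_hold, random_state=seed,
    )
    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)


# Toy data: 100 rows, 20 of them positive.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)
train, valid, test = split_train_valid_test(X, y)
print(len(train[0]), len(valid[0]), len(test[0]))
```

Stratifying matters when readmissions are a small share of all admissions, because it keeps positives from ending up concentrated in one of the three sets.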
Grid search was run to determine the best parameters for the selected classifier. The model with the tuned hyperparameters was fit on the training data and assessed using the following metrics:
A given classification model was fit on the training data. It then classified the validation data. To assess the accuracy of the model, those predictions were compared to the actual labels.
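The grid search mentioned above can be sketched with scikit-learn's `GridSearchCV`. To keep the example free of extra dependencies, `GradientBoostingClassifier` stands in for XGBoost here; that swap, the parameter grid, the synthetic data, and the 3-fold setting are all assumptions. `xgboost.XGBClassifier` accepts the same three parameter names and could be passed as the estimator instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized training data.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hypothetical grid; every combination is fit and scored.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.3],
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",   # rank parameter combinations by AUC
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, round(best_auc, 3))
```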
A confusion matrix was generated for each classifier. The confusion matrix shows the number of true positives, true negatives, false positives, and false negatives.
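As a small worked example, the sketch below builds a confusion matrix and a few derived metrics from six invented predictions. The labels, the probabilities, the 0.5 threshold, and the choice of accuracy, recall, and precision as derived metrics are assumptions for illustration.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Invented labels and predicted probabilities of readmission.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9]

threshold = 0.5
y_pred = [int(p >= threshold) for p in y_prob]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # share of actual readmissions that were caught
precision = tp / (tp + fp)     # share of predicted readmissions that were correct
auc = roc_auc_score(y_true, y_prob)   # threshold-free ranking quality

print(tn, fp, fn, tp, accuracy, recall, precision, auc)
```

AUC is computed from the probabilities rather than the thresholded predictions, which is why it can be compared across classifiers without choosing a cutoff first.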
Compared to random predictions (AUC = 0.5), the results from our predictive model (AUC = 0.71) are a good baseline for further improvement.
Feature engineering, ensembles of models, and parameter tuning will help the adoption of the model as a clinical decision system for evaluating readmission.
Explore other unstructured notes and/or combine them with structured clinical information to strengthen predictive scores.
Predicting hospital readmissions based on unstructured data opens many opportunities in predictive analytics, where a vast amount of untapped data could be used to reduce hospital readmissions, improve patient outcomes, and lower healthcare costs while providing quality care.
Reduce Readmission by Predicting Patients at risk for Readmission
===========================================================================
===========================================================================
You may need to install some software and packages.
Install Anaconda (https://docs.anaconda.com/anaconda/install/)
Install SQLite (https://sqlitebrowser.org/)
conda install -c anaconda scikit-learn
conda install -c anaconda nltk
Install the TensorFlow deep learning library
conda install -c conda-forge imbalanced-learn
conda install -c conda-forge xgboost
===========================================================================
===========================================================================
MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35. Available at: http://www.nature.com/articles/sdata201635
Pollard, T. J. & Johnson, A. E. W. The MIMIC-III Clinical Database http://dx.doi.org/10.13026/C2XW26 (2016).
PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov P, Mark RG, Mietus JE, Moody GB, Peng C, and Stanley HE. Circulation. 101(23), e215–e220. 2000.
Johnson, Alistair EW, David J. Stone, Leo A. Celi, and Tom J. Pollard. “The MIMIC Code Repository: enabling reproducibility in critical care research.” Journal of the American Medical Informatics Association (2017): ocx084.
MIMIC-III has extensive documentation, linked below, that provides additional information about the data source.
https://mimic.physionet.org/gettingstarted/overview/
https://mit-lcp.github.io/mimic-schema-spy/relationships.html