Student name: Cynthia Pedrasa
Student pace: self paced
Scheduled project review date/time: Monday, Jun 15, 2020 Time: 10:00am-11:00am (EDT)
Instructor name: Jeff Herman
Deliverables:
Jupyter Notebooks:
a. Data Preprocessing
b. Model Evaluation
For the final project for Module 5, we have chosen a binary classification project on predicting the 30-day readmission risk of patients with diabetes.
Hospital readmissions are associated with unfavorable patient outcomes and high financial costs.
The Medicare Payment Advisory Commission (MedPAC) reported that in 2005, 17.6% of hospital admissions resulted in readmissions within 30 days of discharge, 11.3% within 15 days, and 6.2% within 7 days (National Conference of State Legislatures).
Diabetes is one of the most frequently treated conditions in US hospitals, with a 20.3% readmission rate (Agency for Healthcare Research and Quality).
Healthcare regulatory agencies focus on 30-day readmission rates as a way to improve quality. The Centers for Medicare and Medicaid Services (CMS) has labeled the 30-day readmission rate as a measure of healthcare quality and emphasizes its reduction as a strategy to lower healthcare costs while maintaining quality. Some of these mandatory and/or voluntary programs are as follows:
A successful predictive model will help the Healthcare Organization:
### Data Science Workflow
Understanding the typical data science workflow is important for business understanding and problem solving. Using the OSEMN framework, the student will work through its steps in an iterative, non-linear process.
In this Module 5 project, the student will build and test several binary classification algorithms to predict 30-day hospital readmissions of patients with diabetes from electronic medical records. The models will be tuned to improve accuracy, and the model with the best score will be selected to make predictions.
Part I of the project covers the introduction, data loading, and data scrubbing through completion of preprocessing of the final dataframe.
Part II of the project covers data modeling, tuning, evaluation, performance metrics, and finalizing and saving the model for later prediction use.
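
The performance metrics mentioned for Part II can be illustrated with a minimal sketch. It uses plain Python and made-up toy labels (not project results) to show how accuracy, precision, recall, and F1 follow from the confusion-matrix counts of a binary classifier. The function names `confusion_counts` and `classification_metrics` are hypothetical helpers introduced only for this illustration; the notebooks themselves may rely on scikit-learn's metric functions instead.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1, guarding against division by zero."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for illustration: 1 = readmitted within 30 days.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting precision and recall next to accuracy is a design choice worth considering here, since a classifier can score high accuracy on a target where one class dominates while still missing many positive cases.
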
In this project we would like to find the answers to the following questions:
UCI Machine Learning Datasets Repository
The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.
Target Variable:
Attribute | Description |
---|---|
Readmitted | Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission |
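
Because the project is framed as a binary classification of 30-day readmission, the three-valued target above has to be collapsed into two classes. The following is a minimal sketch of one way to do that, assuming that “<30” is the positive class and that both “>30” and “No” count as negative. The helper name `binarize_readmitted` is hypothetical and the sample values are made up for illustration.

```python
def binarize_readmitted(value):
    """Map the raw target to 1 (readmitted within 30 days) or 0 (otherwise).

    Assumption: only "<30" is positive; ">30" and "No" (in any letter case)
    are both treated as the negative class.
    """
    return 1 if value.strip() == "<30" else 0


# Made-up sample of raw target values.
sample = ["<30", ">30", "No", "NO", "<30"]
labels = [binarize_readmitted(v) for v in sample]
print(labels)
```
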
Predictors:
Attribute | Description |
---|---|
Encounter ID | Unique identifier of an encounter |
Patient number | Unique identifier of a patient |
Race | Values: Caucasian, Asian, African American, Hispanic, and other |
Gender | Values: male, female, and unknown/invalid |
Age | Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100) |
Weight | Weight in pounds |
Admission type | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available |
Discharge disposition | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available |
Admission source | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital |
Time in hospital | Integer number of days between admission and discharge |
Payer code | Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay |
Medical specialty | Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon |
Number of lab procedures | Number of lab tests performed during the encounter |
Number of procedures | Number of procedures (other than lab tests) performed during the encounter |
Number of medications | Number of distinct generic names administered during the encounter |
Number of outpatient visits | Number of outpatient visits of the patient in the year preceding the encounter |
Number of emergency visits | Number of emergency visits of the patient in the year preceding the encounter |
Number of inpatient visits | Number of inpatient visits of the patient in the year preceding the encounter |
Diagnosis 1 | The primary diagnosis (coded as first three digits of ICD9); 848 distinct values |
Diagnosis 2 | Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values |
Diagnosis 3 | Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values |
Number of diagnoses | Number of diagnoses entered to the system |
Glucose serum test result | Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured |
A1c test result | Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured. |
Change of medications | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change” |
Diabetes medications | Indicates if there was any diabetic medication prescribed. Values: “yes” and “no” |
24 features for medications | For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed |
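
Several predictors in the table are categorical strings that need a numeric form before modeling. As a hedged sketch, the code below shows two possible encodings: the midpoint of a 10-year age interval, and two indicator flags derived from a medication value. Both helper names (`age_midpoint`, `medication_flags`) and the choice of encoding are assumptions made for illustration, not necessarily the approach taken in the preprocessing notebook.

```python
import re


def age_midpoint(bracket):
    """Return the midpoint of an age interval such as "[70, 80)" or "[70-80)"."""
    bounds = [int(n) for n in re.findall(r"\d+", bracket)]
    if len(bounds) != 2:
        raise ValueError(f"unexpected age interval: {bracket!r}")
    low, high = bounds
    return (low + high) / 2


def medication_flags(value):
    """Turn a medication value into 'prescribed' and 'dose_changed' indicators."""
    v = value.strip().lower()
    if v not in {"up", "down", "steady", "no"}:
        raise ValueError(f"unexpected medication value: {value!r}")
    return {
        "prescribed": int(v != "no"),             # any value other than "no"
        "dose_changed": int(v in {"up", "down"}),  # dosage raised or lowered
    }


print(age_midpoint("[70, 80)"))
print(medication_flags("Up"))
print(medication_flags("no"))
```

The midpoint keeps the natural ordering of the age groups in a single numeric column, whereas one-hot encoding would treat the intervals as unordered categories; which is preferable depends on the model.
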
Attributes with null values | Null count |
---|---|
race | 2273 |
weight | 98569 |
payer_code | 40256 |
medical_specialty | 49949 |
diag_1 | 21 |
diag_2 | 358 |
diag_3 | 1423 |
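
Null counts like those in the table above can be produced by scanning each column for a missing-value marker. The sketch below assumes that missing entries are encoded with a `?` placeholder (an assumption about the raw file, not something stated in this section) and uses a tiny made-up CSV sample rather than the real data. The helper name `count_missing` is hypothetical.

```python
import csv
import io
from collections import Counter

MISSING = "?"  # assumed placeholder for missing values in the raw file


def count_missing(rows, missing=MISSING):
    """Count, per column, the cells that are empty or equal to the placeholder."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() in ("", missing):
                counts[column] += 1
    return dict(counts)


# Made-up sample with the same column names as the table above.
sample_csv = """race,weight,payer_code,diag_1
Caucasian,?,MC,250
?,?,?,428
AfricanAmerican,?,HM,?
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
missing_counts = count_missing(rows)
missing_fraction = {col: n / len(rows) for col, n in missing_counts.items()}
print(missing_counts)
print(missing_fraction)
```

Comparing the missing fraction per column is one way to decide whether an attribute such as weight is worth keeping or should be dropped during scrubbing.
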
You may need to install some software and packages.
Install Anaconda (https://docs.anaconda.com/anaconda/install/)
conda install -c anaconda scikit-learn
conda install -c conda-forge imbalanced-learn
conda install -c conda-forge xgboost
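
After running the install commands above, a quick check can confirm that the packages are importable from the notebook's environment. This is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical, and the import names (`sklearn`, `imblearn`, `xgboost`) are the modules these three packages are normally imported as.

```python
from importlib.util import find_spec

# Install name -> module name used in import statements.
REQUIRED = {
    "scikit-learn": "sklearn",
    "imbalanced-learn": "imblearn",
    "xgboost": "xgboost",
}


def check_packages(required=REQUIRED):
    """Return {install_name: True/False} depending on whether the module is found."""
    return {pkg: find_spec(module) is not None
            for pkg, module in required.items()}


status = check_packages()
for pkg, available in status.items():
    print(f"{pkg}: {'installed' if available else 'missing'}")
```
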