
The Impacts of Advanced Categorical Data Analysis on the Healthcare Industry

The Potential of GALILEO in an Increasingly Digital World


By: Sean Jordan
Mentor: Dr. Philip B. Graff
Sector/Group: AOS/QAS
Abstract
Recently, the healthcare industry has been drowning in the amount of data that it has had to take in and analyze from its patients, the number of which will only increase in the future. GALILEO, a new cluster-based data analysis program that is currently being developed at JHU/APL, can help mitigate this problem. The GALILEO program code was tested in Python, using the Jupyter Notebook, on different types of medical datasets taken from healthcare providers. The results from these tests, although not operationally useful, suggest areas of the program that need improvement and development in the future. With future work on making the GALILEO program more universally applicable to a greater variety of datasets, it can help medical providers better extrapolate useful information from the large patient datasets they maintain.

BIG DATA & MEDICINE
Why is it a problem?
• Big Data Analysis: analysis of large data sets
• Some data sets grow with the human population
• The healthcare industry suffers most from high influxes of data
• Mean, standard deviation, etc. are used to analyze quantitative data sets
• Limited options are available for mixed/categorical datasets, which are common in medicine
• The time needed for health diagnoses and their cost keep rising
• There were approximately 35 million patients admitted to US hospitals in 2017
• Records must be kept for people not admitted as well, causing a massive buildup of data in hospitals
[Figure caption: The human population has been rising exponentially for the past few centuries]

Machine Learning
• Unsupervised Machine Learning: Finding hidden patterns in unlabeled data
• Supervised Machine Learning: Inferring a fit function from labeled training data, and using that function to guess the classification of unlabeled data

Data Collection Methods
• Notepad++: Edited Python files in the GALILEO program
• Jupyter Notebook: Ran GALILEO files on many data sets
• Scikit-learn: Made confusion matrices for data sets
• Matplotlib: Plotted graphs showing AIC/BIC/DIC

GALILEO
What is it?
• A cluster-based data analysis program
• Being developed in the Large Scale Analytics Group
• Uses entropy-based density metrics to cluster data values into different, high-quality clusters
• Proficient at analyzing quantitative, categorical, and mixed datasets for hidden clues and relationships
• Created clusters are ranked by Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Density Information Criterion (DIC)
• GALILEO itself performs unsupervised machine learning
• Scikit-learn enables supervised learning techniques
[Figure caption: The two distributions have the same number of data points but different densities, which affects how GALILEO treats the resulting cluster]
[Figure caption: GALILEO sorts data values into clusters with different densities]
[Figure caption: The optimal cluster is found with the maximum value for the DIC and the minimum values for the AIC and BIC]

GALILEO Proof of Work
The Mushrooms Dataset
• Common benchmark dataset used to test categorical data analysis programs such as GALILEO
• 23 columns/attributes
• 8124 rows of data
• Contains different letters representing different characteristics of tested mushrooms
Results
• GALILEO found that the mushroom dataset can be best represented with 23 different clusters
• The original mushroom dataset contained 23 different species of mushrooms
• Shows that GALILEO has potential when analyzing categorical datasets
• The confusion matrix shows a few main concentrations with little spread, pointing towards precise lines around the synthesized clusters

Test Data Set #1
Cleveland Heart Disease Data Set
• Contains anonymous patient information on certain symptoms of heart disease
• 14 columns/attributes
• 303 instances of data
• Optimized for machine learning
Results
• For this test, GALILEO was not nearly as effective as it was for the mushroom dataset. The best number of clusters was found to be two (the minimum value), and the resulting confusion matrix shows a lot of undesired spread.

Test Data Set #2
CASP Protein Data Set
• Contains data on a variety of physicochemical properties of tertiary protein structure
• 9 columns/attributes
• 46,000 instances of data
• Optimized for machine learning
Results
• For this test, GALILEO was fairly effective in summarizing the data set. It found 23 to be the best number of clusters, which is very close to the number of clusters that the data set should theoretically have. The resulting confusion matrix shows a few concentrated areas of populated clusters, which is promising.

Conclusions
• Too many unique values or too few data instances cause under-fitting problems for GALILEO
• Only works well with datasets that have enough samples for their attribute space
• Provides new ways to view a data set
• Needs more work and testing to identify weaknesses
• Can have a huge impact on the digital world of data
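
The confusion matrices mentioned in the results can be illustrated with a minimal Python sketch using scikit-learn, which the poster lists as the tool used to build them. The labels below are hypothetical toy values, not GALILEO output, and the `row_concentration` helper is an assumption introduced here as one simple way to put a number on "concentration" versus "spread". It is not a metric taken from the poster.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (not taken from any of the poster's data sets):
# the known class of each record and the cluster a program assigned to it.
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_ids = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])


def label_confusion(true_labels, cluster_ids):
    """Return a matrix whose rows are true classes and columns are clusters."""
    return confusion_matrix(true_labels, cluster_ids)


def row_concentration(matrix):
    """Fraction of each true class that lands in its single largest cluster.

    Values near 1.0 mean the class is concentrated in one cluster;
    lower values mean its records are spread over several clusters.
    """
    row_sums = matrix.sum(axis=1)
    # Guard against empty rows to avoid division by zero.
    return matrix.max(axis=1) / np.maximum(row_sums, 1)


matrix = label_confusion(true_labels, cluster_ids)
concentration = row_concentration(matrix)
print(matrix)
print(concentration)
```

Using the largest entry of each row, rather than the diagonal, avoids assuming that cluster numbers line up with class numbers, which an unsupervised method does not guarantee.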

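The idea of ranking candidate numbers of clusters by AIC and BIC, where lower values are better, can be sketched as follows. This is not GALILEO: it swaps in scikit-learn's `GaussianMixture`, which handles quantitative data only, purely because it exposes `aic` and `bic` methods. The DIC is left out because the poster does not define it, and the synthetic data, the seed, and the function names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def rank_cluster_counts(data, candidate_counts, seed=0):
    """Fit one mixture model per candidate count and record (k, AIC, BIC)."""
    scores = []
    for k in candidate_counts:
        # Stand-in model, not GALILEO: a Gaussian mixture with k components.
        model = GaussianMixture(n_components=k, random_state=seed).fit(data)
        scores.append((k, model.aic(data), model.bic(data)))
    return scores


def best_by_bic(scores):
    """Return the candidate count with the lowest BIC."""
    return min(scores, key=lambda row: row[2])[0]


# Synthetic, hypothetical data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=-5.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),
])

scores = rank_cluster_counts(data, candidate_counts=range(1, 5))
for k, aic, bic in scores:
    print(k, round(aic, 1), round(bic, 1))
print("lowest BIC at k =", best_by_bic(scores))
```

Both criteria penalize extra parameters, so adding clusters only pays off when the fit improves enough to offset the penalty.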
• The three data sets used in this research project were taken from the University of California – Irvine (UCI) Machine Learning Repository.

Acknowledgements
• Dr. Philip Graff and the other members of the Large Scale Analytics Group
• Mrs. Dungey and the other members of the Long Reach High School teachers and staff

Citations
Dinov, I. D. (2016, March 17). Volume and value of big healthcare data. Retrieved from HHS Public Access website: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4795481/pdf/nihms-766954.pdf
Donalek, C. (2011, April). Supervised and unsupervised learning. Retrieved from Caltech Astronomy website: http://www.astro.caltech.edu/~george/aybi199/Donalek_Classif.pdf
Dutcher, J. (2014, September 3). What is big data? Retrieved from datascience@berkeley website: https://datascience.berkeley.edu/what-is-big-data/
Savkli, C., Lin, J., Graff, P., & Kinsey, M. (2017, August 24). GALILEO: A generalized low-entropy mixture model. Retrieved from Cornell University Library website: https://arxiv.org/pdf/1708.07242.pdf

11100 Johns Hopkins Road, Laurel, MD 20723
