Omic and Electronic Health Record Big Data Analytics For Precision Medicine

IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 64, NO.
2, FEBRUARY 2017 263
Omic and Electronic Health Record Big Data

Analytics for Precision Medicine
Po-Yen Wu, Member, IEEE, Chih-Wen Cheng, Member, IEEE, Chanchala D. Kaddi, Member, IEEE,
Janani Venugopalan, Member, IEEE, Ryan Hoffman, Member, IEEE,
and May D. Wang , Senior Member, IEEE
(Review Paper)
AbstractObjective: Rapid advances of high-throughput

technologies and wide adoption of electronic health records
(EHRs) have led to fast accumulation of omic and EHR
data. These voluminous complex data contain abundant in-
formation for precision medicine, and big data analytics can
extract such knowledge to improve the quality of healthcare.
Methods: In this paper, we present omic and EHR data
characteristics, associated challenges, and data analytics Fig. 1. Key types of biomedical big data for precision medicine.
including data preprocessing, mining, and modeling. Re-
sults: To demonstrate how big data analytics enables preci-
sion medicine, we provide two case studies, including iden-
tifying disease biomarkers from multi-omic data and incor- system. The goal of the early personalized medicine model is
porating omic information into EHR. Conclusion: Big data
to customize healthcare delivery for each individual and to max-
analytics is able to address omic and EHR data challenges
for paradigm shift toward precision medicine. Significance: imize the effectiveness of each patients treatment [1]. In 2009,
Big data analytics makes sense of omic and EHR data to Hood and Friend propose the personalized, predictive, preven-
improve healthcare outcome. It has long lasting societal im- tive, and participatory medicine (a.k.a. P4 medicine) model
pact. that aims to transform current reactive care to future proactive
medicine, and ultimately to reduce healthcare expenditure and
Index TermsBig data analytics, bioinformatics, elec- improve patients health outcome [2]. Recently, the new preci-
tronic health records (EHRs), health informatics, omic sion medicine model is proposed to precisely classify patients
data, precision medicine. into subgroups sharing a common biological basis of diseases
for more effective treatment and improved care outcome [3], [4].
Precision medicine requires data utility ranging from collection
I. INTRODUCTION and management (i.e., data storage, sharing, and privacy) to
O ACHIEVE the best care for patients, many models have analytics (i.e., data mining, integration, and visualization) [5].
T been proposed over the years to improve the healthcare Because of rapid advances in biotechnologies, highly complex
biomedical data are becoming available in huge volumes [6].
To make sense of these heterogeneous data, big data analytics,
Manuscript received September 20, 2015; revised January 31, 2016;
accepted February 18, 2016. Date of publication October 10, 2016; date
including data quality control, analysis, modeling, interpreta-
of current version January 18, 2017. This work was supported in part tion, and validation, is needed to cover application areas such
by grants from the National Center for Advancing Translational Sciences as bioinformatics [7][9], health informatics [10][12], imaging
of the National Institutes of Health (NIH) under Award UL1TR000454 informatics [13], [14], and sensor informatics [15], [16].
and Award NIH R01CA163256, the Georgia Research Alliance Cancer
Coalition (Distinguished Cancer Scholar Award to Prof. M. D. Wang), the As presented in 2015 U.S. Precision Medicine Initiative [17],
Childrens Healthcare of Atlanta, Centers for Disease Control and Pre- incorporating -omic data and knowledge into electronic health
vention, Microsoft Research, and Hewlett-Packard. The content of this record (EHR) (see Fig. 1) is viewed as a necessary step for
article is solely the responsibility of the authors and does not necessarily
represent the official views of the NIH. Asterisk indicates corresponding
delivering precision medicine [3], [5], [17], [18]. Thus, this
author. paper reviews big omic and EHR data analytics for preci-
P.-Y. Wu is with the School of Electrical and Computer Engineering, sion medicine with key terms summarized in Tables I and II.
Georgia Institute of Technology. Section II presents omic and EHR data characteristics, chal-
C.-W. Cheng, C. D. Kaddi, J. Venugopalan, and R. Hoffman are with the
Department of Biomedical Engineering, Georgia Institute of Technology, lenges, and big data analytics; Section III uses two case studies to
and also with Emory University. illustrate the impact of big data analytics in precision medicine;
M. D. Wang is with the Department of Biomedical Engineering,
Section IV enumerates several well-known biomedical big data
Georgia Institute of Technology, Atlanta, GA 30332, USA, and also
initiatives; Section V discusses current opportunities in big data
with Emory University, Atlanta, GA 30322, USA (e-mail: maywang@
bme.gatech.edu). analytics for precision medicine; and finally, Section VI con-
Digital Object Identifier 10.1109/TBME.2016.2573285 cludes this paper.
0018-9294 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution
requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.
264 IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 64, NO. 2, FEBRUARY 2017
TABLE I adoption of EHR for the entire population provides a foun-

BIOMEDICAL BIG DATA KEYWORDS dation for studying healthcare efficiency and safety [19]. Omic
data analytics often aims at finding biomarkers by cleaning up
Topics Keywords raw data generated by NGS or MS, extracting molecular pro-
Omic Data Genomics, transcriptomics, epigenomics, proteomics,
files, identifying statistically significant molecules, constructing
metabolomics, etc. models describing molecular interactions or temporal system
EHR Data Big data in EHR, next-generation EHR, clinical data behavior, and validating putative biomarkers. EHR data analyt-
management, medical coding systems, etc.
Data Challenges Biomedical big data challenges, omic data challenges, ics typically aims at predicting future outcome based on pop-
EHR data challenges, etc. ulation and individual longitudinal data. The analytics has a
Omic Data Analytics NGS sequence mapping, NGS variant detection, RNA-seq similar process such as data cleaning, clinical features identifi-
computation, ChIP-seq computation, MS preprocessing,
NGS biomarker identification, NGS differential analysis, cation, predictive model construction, and clinical validation.
omic network analysis, omic dynamic modeling, etc.
EHR Data Analytics Temporal medical data mining, irregular time series analysis
in EHR, clinical decision support, unsupervised/ supervised A. Biomedical Big Data
learning in EHR, waveform analysis in EHR, etc.
Big Data Analytics Enablers Big data harmonization, big data platform, big data 1) Big Omic Data: Omic data contain a comprehen-
framework
sive catalog of molecular profiles (e.g., genomic, transcrip-
tomic, epigenomic, proteomic, and metabolomic as explained
TABLE II in Table II) in biological samples that provide a basis for
OMIC AND EHR DATA CONCEPT GLOSSARY precision medicine [17]. The genome, transcriptome, and
epigenome are upstream of the proteome and metabolome.
Term Definition Ref. A genome is unique and mostly invariant over time with
its knowledge embedded in single nucleotide polymorphisms
Genome An organisms complete set of DNA [20] (SNPs), frameshift mutations (insertions or deletions; or in-
Transcriptome A collection of all the gene readouts present in an [20]
organisms cell dels), copy number variations (CNVs), and other struc-
Epigenome A multitude of chemical compounds that can tell [20] tural variations (SVs) [30], [31]; transcriptomic knowl-
the genome what to do edge is contained in gene expression, transcript expression,
Proteome An entire set of proteins encoded by the genome [21]
Metabolome A comprehensive catalogue of metabolites in an [21]
gene fusion, and alternative splicing [32], [33]; epigenomic
organisms cell knowledge is carried in protein-DNA binding sites, histone
Omics The study of the -ome [21] modification patterns, and DNA methylation patterns [34]; pro-
Single Nucleotide A variation at a single position in a DNA sequence [22]
Polymorphism among individuals teomic knowledge is reflected by protein expression, post-
Frameshift Mutation A genetic mutation caused by a deletion or [22] translational modification, and proteinprotein interactions
(Indels) insertion in a DNA sequence that shifts the way the [35]; and metabolomic knowledge is shown in the abundance
sequence is read
Copy Number Variation The number of copies of a particular genetic [22] of metabolites [36]. Because epigenomic information impacts
sequence differs between individuals transcriptomic, proteomic, and metabolomic profiles [37], and
Structural Variation Genomic alterations that involve segments of [23] the proteome and metabolome are directly responsible for the
DNA that are larger than 1 kb, and can be
microscopic or submicroscopic establishment of phenotypes, uncovering interactions among the
Fusion Gene A new gene formed by the breakage and [24] proteome, metabolome, and upstream processes is a key toward
re-joining of two different genes
Spanning Read Pair paired reads that harbor a fusion boundary in the [25]
precision medicine.
insert sequence 2) Big EHR Data: EHR data can be unstructured (e.g.,
Split Read A read that harbors a fusion boundary in the read [25] clinical notes) or structured (e.g., ICD-9 diagnosis codes, ad-
itself
Alternative Splicing A process that includes or excludes certain exons [26]
ministrative data, chart, and medication) [38]. Written or dic-
when forming mature mRNAs tated clinical notes describe the patients condition and are the
Protein-DNA Binding A segment of DNA sequences where targeted [27] most efficient and human-intuitive way for clinical documen-
Site proteins may bind
Histone Modification A covalent post-translational modification to [27] tation. However, they are the most challenging for computer
histone proteins that can impact gene expression analysis because of unstructured and heterogeneous data for-
DNA Methylation The addition of methyl (CH3 ) group to DNA that [27] mats; abundant typing and spelling errors; violation of natu-
modifies the function of the genes
Antecedent (Ant.) in A set of conditions which the outcome variable [28] ral language grammar; and rich domain-specific abbreviations,
ARL depends on acronyms, and idiosyncrasies [39]. Structured EHR data can be
Consequent (Cons.) in A set of conditions serving as the outcome variable [28] categorized into two classes [40]. Administrative data include
ARL
Confidence in ARL Confidence(Ant. Cons.) =
count( A n t. C on s. )
[28] those remain unchanged during the entire course of a clinical
count( A n t. )
Fast Healthcare FHIR uses standardized Resources (i.e., [29] encounter (e.g., demographic data), and those keep updating
Interoperability predefined data formats and elements) to exchange over time (e.g., diagnoses and procedures) [41]. Ancillary clini-
Resources (FHIR) EHR data.
cal data are frequently recorded during a clinical encounter that
can be discrete (e.g., physiological measures, medication, and
II. BIG DATA FOR PRECISION MEDICINE lab tests), or continuous (e.g., respiration, blood pressure, pulse
oximetry, and electrocardiography waveforms captured by sen-
The invention of high-throughput omic assays such as next- sors, either through bedside monitoring devices or ambulatory,
generation sequencing (NGS) and mass spectrometry (MS) has implanted, or wearable devices) [40].
led to fast accumulation of omic data. Likewise, the wide
WU et al.: OMIC AND ELECTRONIC HEALTH RECORD BIG DATA ANALYTICS FOR PRECISION MEDICINE 265
B. Challenges Associated With Omic and EHR Data TABLE III

SELECTED METHODS FOR DIMENSIONALITY REDUCTION
Omic and EHR big data analytics is challenging due to data
frequency, quality, dimensionality, and heterogeneity. Method Advantages Limitations
1) Diverse Data Collection Frequency: First, differ-
ent data modalities have different data collection frequency. For Feature extraction: PCA, Reduces dimensionality; Performance usually inferior to
SVD, tensor-based relatively immune to noise supervised approaches; difficult
example, in omic data, a genome is invariant over a long pe- approaches [54] to interpret results
riod of time, and often only needs a one-time data acquisition, Feature selection: Reduces dimensionality; easy Sometimes affected by noisy
while other types of omic data vary with environment, tissue filter-based (mRMR), to interpret data
wrapper-based
types, and time that require multitime-point acquisition. In EHR, (sequential feature
bedside monitoring data are captured at very high frequency, selection) [55]
while lab tests may be taken a few times a day. In addi- *
Highly impactful method with more than 50 000 relevant papers.
tion, data generation frequency may be influenced by cost
(e.g., proteomic/metabolomic data generated by MS versus ge-
nomic/transcriptomic/epigenomic data generated by NGS). Sec- subset of variables or latent factors that preserve as much of the
ond, data collection frequency can be irregular. For example, in characteristics of the original data as possible with two strate-
EHR, most clinical variables have irregular sampling frequen- gies (see Table III): 1) feature selection that aims to select an
cies, depending on the criticality of a patient and the easiness of optimal subset of existing features; and 2) feature extraction that
a measurement. aims to transform existing features into a more compact set of
2) Inherent Data Quality Issues: In omic data, quality dimensions [56].
issues are caused by a combination of biological, instrumental, Feature selection techniques consist of filtering, wrapper,
and environmental factors such as sample contamination [42], or embedded methods. Filtering methods limit the number
[43], batch effects [44], [45], and low signal-to-noise ratios [46], of features by calculating a score designed to estimate the
[47]. In EHR data, quality issues include missing data because usefulness of each feature. Thus, they are generally faster and
recorded clinical variables vary with each clinical encounter and do not require explicit class labeling. The minimum redundancy
depend on clinical teams assessment of the patients condition maximum relevance (mRMR) method is a filtering method that
[48], and erroneous data entries happening due to data entry iteratively selects features sharing the most mutual information
mistakes or misinterpretation of original documents when en- (relevance) with the least redundancy [57]. In contrast, wrap-
tering [49]. For high-resolution waveform data, common quality per methods select a subset of features (i.e., wrap the feature
issues include random noise, gaps in the waveform, and artifacts selection) for targeted learning models by using evaluation met-
(e.g., patients motion) [50]. These data quality issues may lead rics such as cross-validation accuracy [58]. Embedded methods
to wrong conclusion, but correcting these remains challenging. integrate machine learning algorithms [e.g., support vector ma-
3) High Dimensionality: A big challenge in either omic chines (SVM)] with recursive feature elimination [59].
or EHR data mining is the curse of dimensionality associated Among feature extraction techniques, principal component
with high-dimensional data [51]. Omic data often have many analysis (PCA) is a basic method that identifies a small num-
dimensions or features (may be more than 104 ) much larger than ber of orthogonal linear vectors [51]. Its performance heavily
the number of samples available, while EHR data may contain depends on correctly identifying an optimal number of com-
a large sample size of high-dimensional data but with each ponents, and requires careful testing and validation [60]. Other
individual sample only sparsely populated. Making sense of techniques include artificial neural networks such as autoen-
these data with statistical significance presents to be challenging. coders [61], and nonlinear kernels in PCA [62].
4) Heterogeneous Data Types: In omics, using un- 2) Omic Data Preprocessing: NGS and MS high-
derlying molecular fingerprints to characterize disease subtypes throughput assays require different preprocessing methods that
may require heterogeneous multi-omic data. For example, the are summarized in Table IV. NGS is a popular assay for ge-
integrative personal omics profile project has integrated multi- nomic, transcriptomic, and epigenomic studies. Its common
ple molecular expression profiles to uncover dynamic molecular preprocessing step is sequence mapping that identifies not only
changes between healthy and diseased states [52]. However, in- the origin but also the alignment of each read [78]. This step is
tegrating multi-omic data is challenging because of variations computationally intensive and requires auxiliary data structures
in represented biological processes, technical and biological (e.g., the hash table [63] and the BurrowsWheeler transform
noise levels, identification accuracy, spatiotemporal resolution, [64]), multithreading, or in-memory computing [65] for im-
and many other confounding factors [53]. In EHR, the data are proved computational efficiency. Genomic studies typically aim
inherently heterogeneous. To accomplish precision medicine, to identify variants in a sequenced genome [79]. Small-scale
it is necessary and critical to make sense of heterogeneous variant (i.e., SNPs and indels) detection uses per base differ-
data. ences between reads and the reference genome as the evidence
[30], [66]. Large-scale variant (i.e., SVs) detection uses read-
pair-based, read-depth-based, split-read-based, and assembly-
C. Big Data Analytics for Precision Medicine based methods [80], [81]. Transcriptomic studies mostly
1) General Analytics for Biomedical Big Data: Most center on expression profiling, fusion gene detection, and al-
omic and EHR are high-dimensional data that not only require ternative splicing detection [32], [33]. Expression profiling as-
longer computational time but also affect the accuracy of analy- sociates mapped reads with genes and isoforms. Different profil-
sis. Thus, we try to reduce data dimensionality by identifying a ing methods handle multimapped reads differently, where some
TABLE IV TABLE V
SELECTED TOOLS FOR OMIC DATA PREPROCESSING SELECTED TOOLS FOR OMIC BIOMARKER IDENTIFICATION
Tool Assay Omic Data Key Functionality Tool Omic Data Omic Biomarker Approach
GMAP* [63] Next-generation Genomic, Sequence mapping SNPassoc [89] Genomic Significant SNPs associated with Genome-wide
sequencing transcriptomic, traits association
and epigenomic studies
BWA [64] SNPTEST [90]
STAR [65] VAT [91] Significant SNPs and indels
GATK [30] Genomic Genomic variant discovery associated with traits

SAMtools [66] PLINK [92] Significant SNPs, indels, and
HTSeq [67] Gene expression quantification CNVs associated with traits
BEDTools [68] CNVRuler [93] Significant CNVs associated with
RSEM [69] Gene and transcript expression traits
quantification edgeR [94] Transcriptomic Differentially expressed genes Differential
Cufflinks [70] /transcripts analysis (model
fitting and
defuse [25] Transcriptomic Gene fusion detection
statistical tests)
TopHat-Fusion [24]
DESeq2 [95]
Trans-ABySS [71] Alternative splicing detection and
omniBiomarker [96]
quantification
DiffSplice [97] Differential alternative splicing
Trinity [72]
MATS [98]
Cufflinks [70]
DBChIP [99] Epigenomic Differential binding sites
Scripture [73]
ChIPDiff [100] Differential histone modification
MACS [74] Epigenomic ChIP-seq peak calling
sites
SISSRs [75]
QDMR [101] Differentially methylated regions
OpenMS [76] Mass Proteomic and Peak detection and quantification
DetectTLC [102] Proteomic and Molecular patterns in mass Similarity
spectrometry metabolomic
metabolomic spectrometry images scoring
MZmine 2 [77]
Automics [103] Differentially abundant Supervised and
metabolites unsupervised
GMAP stands for genomic mapping and alignment program; BWA, Burrows-Wheeler learning
aligner; STAR, spliced transcripts alignment to a reference; GATK, genome analysis toolkit; MetaboAnalyst
RSEM, RNA-seq by expectation-maximization; Trans-ABySS, transcriptome assembly and [104]
analysis pipeline; MACS, model-based analysis of ChIP-seq; and SISSRs, site identification
from short sequence reads. Highly impactful tool with more than 50 citations per year. SNPassoc stands for SNP-based whole genome association studies; VAT, variant association
tools; PLINK, population-based linkage analyses; edgeR, empirical analysis of digital
gene expression data in R; MATS, multivariate analysis of transcript splicing; DBChIP,
methods associate the reads with all loci [67], [68], while others differential binding with ChIP-seq data; and QDMR, quantitative differentially methylated
regions. Highly impactful tool with more than 50 citations per year.
probabilistically associate the reads with only a few model-
inferred loci [69], [70]. Fusion gene detection relies on two
factors, the spanning read pairs and the split read [24], [25].
Alternative splicing detection relies on either de novo transcrip- biological conditions (e.g., disease vs. non-disease) or differ-
tome assembly [71], [72], [82][84], or inference from sequence ent time points (e.g., before vs. after a treatment). Thus, Ta-
mapping outputs [70], [73]. Epigenomic studies mainly focus ble V summarizes selected tools that identify discriminatory
on identifying patterns of protein-DNA binding sites, histone biomarkers among different groups. Most omic biomarkers
modification, and DNA methylation [34]. Epigenomic data pre- are identified by investigating statistically significant differences
processing builds a profile representing the density of reads among groups, such as differentially expressed genes or tran-
along the genome, models background noises, and determines scripts [94][96], differential alternative splicing [97], [98], dif-
statistically significant peaks [78]. ferential protein-DNA binding [99], differential histone modifi-
MS is for proteomic and metabolomic studies, and its pre- cation [100], and differential DNA methylation [101]. The basic
processing steps include alignment, baseline correction, and idea is to quantify and then fit the abundance of each group to
peak detection [85]. In chromatography-coupled MS, chromato- Poisson-based distributions (e.g., the Poisson distribution and
graphic peak alignment can correct drift to ensure coherent the negative binomial distribution), followed by statistical tests
retention time and accurate mass across samples, and mass- (e.g., the Fishers exact test and the likelihood ratio test) that
to-charge ratio (m/z) alignment can ensure mass spectra and determine the statistical significance of each molecular feature.
component features are comparable among samples [86]. For genomic data, genome-wide association studies (GWAS)
In matrix-assisted laser desorption ionization (MALDI) MS, uses different approaches (e.g., the chi-squared test or logistic
baseline correction is particularly important. Low-mass mea- regression) to assess the degree of association between each vari-
surement noise from the chemical matrix used in MALDI ant and a targeted trait, and then select most significant variants
experiments affects the spectral baseline and needs to be re- as biomarkers [105]. Most GWAS focuses on SNP association
moved prior to analysis [87]. Peak detection is then performed [89][91], while only a few infer CNV or SV association [92],
based on criteria such as signal-to-noise ratio, peak shape, and [93].
detection thresholds [88]. A common subsequent step is the 4) Systems Biology Modeling Using Omic Data:
identification, and potentially the filtering, of isotopic peaks To gain insights about a complex molecular system, we can
from the spectrum [77]. conduct systems biology modeling using either static network
3) Biomarker Identification Using Omic Data: In analysis or dynamic temporal analysis based on omic fea-
practice, different groups of samples are collected for different tures (see Table VI).
TABLE VI TABLE VII

SELECTED TOOLS FOR OMIC DATA MODELING SELECTED METHODS FOR EHR DATA PREPROCESSING
Tool Modeling Approach Key Functionality Method Advantages Limitations

Type
Missing data: listwise Simple to implement; Loss of statistical power;
WGCNA [106] Static Correlation between Network construction, module deletion, mean filling complete case analysis introduces biases;
Network quantitative variables detection, and gene selection [118], [119] underestimates variances
analysis Missing data: hot deck, Simple to implement and Introduces biases;
CODENSE [107] Summary graphs and Mining frequent coherent dense nearest neighbor [120] interpret; immune to underestimates variances
dense subgraphs forsubgraphs across large numbers cross-user
frequent edges of massive graphs inconsistencies
MEMo [108] Mutually exclusive Network construction, module Missing data: Simple to implement and Does not account for
genomic alterationsdetection, and gene selection interpolation (linear, interpret; direct relationships among
CellDesigner [109] Dynamic Ordinary and partial
Graphical interface for ODE or piecewise linear, spline, estimation on the basis of different features
Temporal differential equations
PDE model implementation and cubic) [121] neighbors
Analysis simulation; systems biology Missing data: Accounts for uncertainty Does not account for
markup language compatibility model-based filling in imputations missing data mechanisms
(expectation (i.e., MCAR, MAR, and
NetLogo [110] Agent-based models General-purpose modeling
environment capable of maximization, maximum MNAR)
simulating hundreds to thousands likelihood, multiple
of interacting agents imputations) [122]
BoolNet [111] Boolean models Simulating and analyzing Boolean Waveforms: noise Generally simple to Falls short in situations
and probabilistic Boolean models filtering (IIR, FIR, PCA, implement where true waveform is
Snoopy [112] Petri nets Network modeling using Petri ICA, Kalman filter, obscured by artifact such
nets; hierarchical structure and wavelets) [50], [123] as patient motion
multiple class compatibility Waveforms: signal Human-interpretable Can be complex to
quality indices metrics of signal quality implement and
[123][125] computationally
WGCNA stands for weighted correlation network analysis; CODENSE, coherent dense
intensive; may require
subgraphs; and MEMo, mutual exclusivity modules in cancer. Highly impactful tool with ad-hoc calibration based
more than 50 citations per year. on the features of the
target waveform
Static network analysis studies the interactome (i.e., a com- Waveforms: sensor Improved SNR; reduces Computationally
fusion [126], [127] data dimensionality while intensive; loss of detail
plete set of molecular interactions) with three steps [113]: iden- increasing data quality from individual sensor
tifying a network scaffold that describes interactions among waveforms
omic features [106], [108], decomposing the network scaffold
Highly impactful method with more than 50 000 relevant papers.
into smaller network modules [106][108], and mathematically
representing each network module for downstream simulation
and analysis [114]. Most interactome networks use a single
omic data such as metabolic networks and gene regulatory with mutual information, or infer other ad hoc definitions of
networks. Few incorporate multi-omic data but are limited to signal quality to identify artifacts and gaps in the waveform
simpler organisms (e.g., S. cerevisiae and C. elegans) [115]. [123][125];
Dynamic temporal analysis (e.g., ordinary or partial differ- 3) sensor fusion techniques (e.g., using redundant mea-
ential equations, Boolean networks, agent-based models, and surements of electrocardiography, blood pressure sensors, and
Petri nets [116]) makes use of temporal measurement of omic photoplethysmography to derive a more reliable measure of
data to develop and validate dynamic predictive models of com- heart rate than any single signal alone [126], [127]) to correct
plex systems. For example, a recent study on A. thaliana used a artifacts and fill in gaps in the waveform.
Granger causality model to integrate two types of metabolomic 6) EHR Data Mining: To derive actionable knowledge
data acquired at multiple time points for studying the interaction from complex EHR big data, two strategies such as static end-
of primary and secondary metabolism [117]. point prediction and temporal data mining are summarized in
5) EHR Data Preprocessing: Information embedded Table VIII.
in EHR is abundant but disorganized in nature. Thus, EHR 6) a) Static endpoint prediction: After dimension-
data requires systematic preprocessing that are summarized in ality reduction, we can model the relationship between selected
Table VII. On EHR missing data, conventional approaches either clinical features (i.e., the patients condition) and targeted clin-
impute missing values by the mean or median in a population, ical endpoints (i.e., the clinical outcome) with three groups of
or listwise or pairwise delete records with missing values. These techniques: Regression analysis is a statistical process that esti-
approaches are simple and easy to implement, but they ignore mates the relationship between independent variables (i.e., fea-
the underlying data structure and tend to introduce additional bi- tures) and dependent variables (i.e., endpoints). If dependent
ases [128]. Thus, more robust missing data imputation methods variables follow distributions such as normal, Poisson, and bi-
such as interpolation [129], multiple imputation [130], expec- nomial, we can use a generalized linear model for regression
tation maximization [131], and maximum likelihood [132] are model fitting; Classification involves building statistical models
needed. that assign a new observation to a known class. Many classifica-
On high time-resolution waveform data quality issues [50], tion techniques such as decision trees, k-nearest neighbors, and
we can use the following: SVMs prove to be effective in clinical applications; Associate
1) filtering strategies such as median filtering, Kalman filter- Rule Learning (ARL) discovers frequent and reliable associa-
ing, and model-based filtering to handle noise [50], [123]; tions among clinical variables, and these association rules de-
2) signal quality indices that detect the presence of expected scribe that if all elements in the antecedent occurs, all elements
physiological features, quantify the agreement between signals in the consequent should occur with certain confidence [28]. In
TABLE VIII TABLE IX

SELECTED METHODS FOR EHR DATA MINING SELECTED PLATFORMS FOR BIG DATA ANALYTICS
Method Advantages Limitations Platform Advantages Limitations
Logistic regression, cox Simple to implement and Sensitive to outliers Apache Hadoop Horizontally scalable; Generally most effective for
regression, local regression interpret; direct estimates of (MapReduce) [11], [142] fault-tolerant; designed to be batch-mode processing; not
(LOESS) [133] relevant hazards for Cox deployed on always appropriate for
regression commodity-grade hardware; real-time, online analytics
Logistic regression with LASSO Reduces feature space Prone to overfitting free and open-source
regularization [134] IBM InfoSphere Platform Includes purpose-built tools Commercial licensing
Hidden Markov models [135] Simultaneous detection, Sensitive to the design of [143] to handle streaming
segmentation, and the Markov model being information; integrates with
classification in a waveform trained open-source tools such as
Conditional random fields [136] Supports temporal analysis; Sensitive to Hadoop
resistant to differences in regularization and feature Apache Spark Streaming Integrates with the Hadoop Depends on more expensive
class prevalence space size [144] stack; allows one code base hardware with large amounts
Relational subgroup discovery, Valid sequential techniques Tradeoffs between for both batch-mode and of RAM to work efficiently
online analysis
episode rule mining, windowing for some clinical applications simplicity, complexity,
[137] and temporal resolution Tableau, QlikView, TIBCO Visualization of large and Generally incomplete
Rule mining, Allens interval Temporal mining/modeling Requires specific Spotfire, and other visual complex data sets solutions, requiring other
algebra, directed acyclic graph capabilities experimental design analytics tools tools to effectively handle
[138] data storage
*
Highly impactful method with more than 50 000 relevant papers. Highly impactful platform.
general, these machine learning techniques prefer a large sample

size.
6) b) Temporal data mining: EHR captures diagno-
sis, treatment, and outcome chronologically throughout a med-
ical encounter, and thus, it is important to model temporal rela-
tionship between events using temporal data mining techniques
such as the hidden Markov model (HMM) and the conditional
random field (CRF) [135], [136]. One constraint of HMM and
CRF is that they require predefined clinical variables and out-
come categories often difficult to generalize for a given treat-
ment of a given patient. Thus, temporal association rule mining Fig. 2. Integrative analysis of multi-omic data leads to the improved un-
(TARM) is proposed to discover causality between the event derstanding of cancer mechanisms, which in turn enables more precise
classification of cancer subtypes.
and outcome. A temporal association rule, denoted by A T C,
describes an antecedent A followed by a consequent C sep-
arated by a time difference T. Because the selection of the To deploy biomedical big data for precision medicine in
event and outcome is flexible, the TARM model can be tailored health care, there is a critical need to address domain-specific
for any eventoutcome combination in various clinical settings challenges such as the requirements of Health Insurance
[137], [139]. Portability and Accountability Act, Health Information Tech-
nology for Economic and Clinical Health, and other privacy
regulations. Thus, security is an important enabling technology
D. Enablers of Biomedical Big Data Analytics (e.g., encryption for protected health information) in biomedical
big data [140], [145], [148].
The big data revolution has led to the development of enter-
prise tools and platforms for extracting, summarizing, and in-
terpreting knowledge from rapidly generated data, for business III. CASE STUDIES
intelligence, analytics, and predictive modeling as summarized In this section, we present two real-world applications to
in Table IX [11], [140], [141]. illustrate the utility of biomedical big data analytics for pre-
Distributed computing systems such as Apache Hadoop cision medicine: 1) integrative omic data for the improved
(based on MapReduce) provide the storage and processing understanding of cancer mechanisms (see Fig. 2); and 2) the
backbone for dealing with very large datasets [142]. Specific incorporation of genomic knowledge into the EHR system for
tools also exist to solve more specialized problems. For exam- improved patient diagnosis and care (see Fig. 3).
ple, IBM InfoSphere Streams and Apache Spark Streaming can
handle real-time streaming data [143], [144]. Cloud computing
providers such as Amazon Elastic Compute Cloud (EC2) can A. Integrative Omics for Precise Cancer Understanding
provide on-demand computing power to accommodate scalable One notable effort that integrates multi-omic data for the
growth from development to truly big data production [145]. improved understanding of cancer mechanisms is The Cancer
Many cloud-based services in bioinformatics such as Illuminas Genome Atlas (TCGA) [149]. TCGA hosts public datasets of 27
BaseSpace [146] and the Galaxy project [147] are deployed on cancer types with more than 11 000 patient cases. Each patient
Amazon EC2. is annotated with clinical data (i.e., demographic, diagnostic,
Another big challenge is that each individual typically has

millions of variants. The consortium has proposed one potential
solution that stores only known pathological variants in the
EMR system. However, because the set of known pathological
variants may change over time, this approach may lead to the
inclusion of false positive and the exclusion of false negative
variants. Thus, an alternative solution is to archive raw data in
separate repositories easily accessible when necessary [158].
EMR is only for local clinic and hospital, while EHR con-
Fig. 3. Integrating derived omic knowledge into the existing EHR sys- tains and shares medical records among all participant clinics
tem is an approach to utilizing molecular information for clinical decision and hospitals [159]. Thus, interoperability is critical in using big
support, and it also help deliver precision medicine. data for precision medicine. Recently, the Health Level Seven
International proposed the Fast Healthcare Interoperability Re-
sources (FHIR) standard that addresses this important issue. On
clinical genomics, several new FHIR resources and extension
and survival data) and multimodal omic data (i.e., genomic, definitions are designed for variant data [160]. With such a stan-
transcriptomic, epigenomic, and proteomic). dardized data exchange protocol, clinicians can utilize genomic
We use head and neck squamous cell carcinoma (HNSCC) information with other existing EHR data to determine the most
as an example to illustrate the integrative multi-omic study for effective treatment for each patient, which is a paradigm shift
precision medicine [150]. In 2014, a pan-cancer study with toward precision medicine.
12 cancer types using multi-omic TCGA data was performed
[151]. Among 3527 samples in total, 305 were HNSCC. Six
different data types (i.e., DNA copy number, methylation, mu- IV. BIOMEDICAL BIG DATA INITIATIVES
tation, and expression of mRNAs, miRNAs, and proteins) were To apply big data analytics for precision medicine, Table X
analyzed both separately and integratively. By using clustering- summarizes multiple consortium initiatives that collect and or-
based methods, pathway activities (inferred from gene ganize data from various projects and trials, and make them
expression and copy number data) have shown common CNVs, available to the research community for secondary data use.
mutation frequency patterns, and survival patterns between HN- First, initiatives such as Project Data Sphere aim to im-
SCC and lung squamous cell carcinomas or some bladder can- prove research efficiency and to encourage collaboration by
cers. Such integrative pan-cancer analysis provides more precise integrating information of clinical trials for different cancers.
subtyping across multiple cancers sharing common molecular- For example, the European Union-funded RD-Connect aggre-
level processes underlying cancer development. This new sub- gates data of multiple rare diseases from around the world.
typing system reflects precision medicine because it finds TCGA of U.S., Therapeutically Applicable Research to Gener-
precise classification of patients into disease subgroups. ate Effective Treatments, and the International Cancer Genome
TCGA Research Network has published more than 30 articles Consortium aim to study multiple aspects of individual diseases
describing multi-omic investigation on numerous cancer types, by collecting multi-omic data of hundreds of patients for each
and identified more precise, clinically relevant subtyping for cancer type. In contrast, the U.S. 1000 Genomes Project and the
multiple cancers [152][154]. U.K.-based 100 000 Genomes Project aim to connect genotypes
with phenotypes using single omic data type.
Second, large data repositories have been created and main-
B. Adoption of Genomics in EHR for Precision Medicine tained by organizations such as U.S. National Institutes of
In a clinical setting, healthcare providers use electronic medi- Health (NIH) and the World Health Organization (WHO). The
cal record (EMR) for clinical decision support. Thus, it is impor- Trans-NIH BioMedical Informatics Coordinating Committee
tant to incorporate omic data and knowledge into EMR. The has established a data repository that archives the data from
Electronic Medical Records and Genomics (eMERGE) Network 61 large multicenter studies for promoting secondary use of
consortium aims to identify causal genomic variants (mostly biomedical data [161]. The Global Health Observatory Data
SNPs) for EMR-based phenotypes and to integrate identified Repository is maintained by WHO for population-level health
genotype-phenotype associations into the EMR system [155]. studies [162]. As another example, the Health Indicators Ware-
One crucial challenge is on how to store variants present in an house within the U.S. Department of Health and Human Ser-
individual or even in family members in the EMR [156]. The vices provides country-level and state-level aggregated clinical
consortium has proposed several recommendations on augment- information [163].
ing the current EMR structure.
1) It should store various genomic variants, such as SNPs,
indels, and CNVs, in a discrete computable format. V. DISCUSSION
2) It needs to satisfy interoperability to reduce the burden in Among many data types included in the NIH Big Data to
data transfer and update within and between healthcare facilities. Knowledge Initiative, omic data, EHR, and medical imaging
3) It has to support rule-based decision support engines. data are the three most important biomedical big data. We con-
4) It must contain abundant visualization elements for easier ducted the review of omic and EHR data because of their close
interpretation [157]. relationship with precision medicine [3], [5], [17], [18].
TABLE X positive impact of knowledge and insight obtained from inte-

SELECTED BIOMEDICAL BIG DATA INITIATIVES grative analysis of genomic and transcriptomic [169], transcrip-
tomic and proteomic [170], and multiple omic data types [53],
Consortium Data Sources Data Elements [151] on disease diagnosis, prognosis, and treatment. The next
TCGA Multi-omic data for 27 cancer types, Clinical, genomic, transcriptomic,
important direction is the development of guidelines (or best
covering more than 11 000 cases epigenomic, and proteomic data practices) for omic data integration and interpretation that will
Project Data Patient-level data from comparator Common data include baseline, safety, in turn enable better prediction of biosystem behavior, and safer
Sphere arms of Phase IIB and III clinical efficacy, medication, dosing, lab test,
trials; currently contains 33 trials medical history, and demographic data and more effective therapeutics.
covering 12 cancer types 2) Waveform and irregularly spaced time series analysis:
TARGET Multi-omic data for seven types of Clinical, genomic, transcriptomic, and
childhood cancers epigenomic data
Real-time streaming data analytics needs to be further devel-
1,000 Large-scale genome sequencing Low-coverage whole genome oped due to the pervasive use of wearable sensors in either the
Genomes project for populations of African, sequencing for 179 individuals; critical care setting or in the continuous home monitoring setting
Project European, and East Asian ancestry high-coverage targeted exome
sequencing for 697 individuals for fitness and preventative medicine [171], and the need to re-
100,000 Large-scale genome sequencing Genome sequencing will be duce alarm fatigue [172]. However, the challenge for irregularly
Genomes project for studying cancers and rare completed in 2017 sampled temporal data remains and requires advanced impu-
Project diseases in the U.K.
ICGC Genomic data for 18 cancer types; SNPs, CNVs, methylation, and gene tation techniques and robust parameter extraction techniques
partially overlap the TCGA data and miRNA expression [50], [173].
RD-ConnectInfrastructure project funded by Currently links to three biobanks and
European Union for facilitating rare more than 150 rare disease registries
3) Patient similarity: Precision medicine promotes precise
disease research subgroup classification of patients based on biological basis
ADNI Multicenter, longitudinal study with Clinical, genetic, magnetic resonance such as molecular profiles. Thus, EHR mining can assist in pa-
elderly control subjects, early imaging, and positron emission
Alzheimers disease subjects, and tomography imaging data tient classification based on clinical measurements (e.g., drug
mild cognitive impairment subjects responses, physiological signals, and disease susceptibility).
iDASH Data from 17 focused trials, each of Imaging, EHR, sensor, and genomic However, because of high patient variability for any disease,
which represents a specific objective data from multiple clinical trials
and a patient population the precise subtypes of many diseases remain unknown as of
GHO Worldwide population and Population-level statistics and today, and it requires systematic big data analytics to model
environmental data for infectious modeling
diseases, noncommunicable diseases, physician knowledge to validate the reliability of patient sub-
sexually transmitted diseases, and grouping based on EHR mining.
childrens health
BMIC Large trials encompassing thousands EHR, imaging, genetic, and social
of samples research data VI. CONCLUSION
MIMIC II ICU data for more than 30 000 Chart data, administrative data, alert
patients with more than 40 000 ICU data, lab results, electronic In this review, we present omic and EHR big data challenges
stays documentation, and bedside monitor and current progressed. We provide case studies to show how
trends and waveforms
HIW Federal data for aggregated health Data element varies, depending on the big data analytics can facilitate precision medicine. Because
indices by geography; covers data trials biomedical big data analytics is in its infancy, more biomedi-
from claims, healthcare cost, to
population statistics
cal data scientists and engineers are needed to gain necessary
biomedical knowledge, to use large data provided by biomedi-
TCGA stands for The Cancer Genome Atlas; TARGET, Therapeutically Applicable Re- cal big data initiatives, and to put concerted effort in areas such
search to Generate Effective Treatments; ICGC, International Cancer Genome Consortium; as multi-omic data integration, waveform and time series data
ADNI, Alzheimers Diseases Neuroimaging Initiative; iDASH, Integrating Data for Analy-
sis, Anonymization, and Sharing; GHO, WHO Global Health Observatory Data Repository;
analysis, and patient similarity and so on to speed up big data
BMIC, Trans-NIH BioMedical Informatics Coordinating Committee; MIMIC II, Multipa- research for precision medicine. By delivering the most suitable
rameter Intelligent Monitoring in Intensive Care II; and HIW, Health Indicators Warehouse. and effective treatment to each patient based on their precise
subtyping information, the healthcare system can achieve better
care efficiency and quality.
Big data have had major societal impact in energy, environ-
ment, financial, and others. They motivate rapid advances in REFERENCES
data storage, data mining and analytics, data retrieval, and data
visualization [164], [165]. When applying to biomedicine and [1] G. H. Fernald et al., Bioinformatics challenges for personalized
medicine, Bioinformatics, vol. 27, pp. 17411748, Jul. 2011.
healthcare, big data will improve quality and outcome by dis- [2] L. Hood and S. H. Friend, Predictive, personalized, preventive, par-
covering new knowledge (e.g., automated identification of post- ticipatory (P4) cancer medicine, Nature Rev. Clin. Oncol., vol. 8,
operative complications in EHR data [166]); disseminating new pp. 184187, Mar. 2011.
[3] S. Desmond-Hellmann et al., Toward Precision Medicine: Building a
knowledge (e.g., data-driven clinical decision support systems Knowledge Network for Biomedical Research and a New Taxonomy of
such as IBM Watson); incorporating omic data into EHR (e.g., Disease. Washington, DC, USA: National Academies Press, 2011.
eMERGE network [167]); and implementing patient-centered [4] A. Katsnelson, Momentum grows to make personalized medicine
care (e.g., e-health [168]). more precise, Nature Med., vol. 19, Mar. 2013, Art. no. 249.
[5] R. Mirnezami et al., Preparing for precision medicine, New Engl. J.
To accelerate the delivery of precision medicine, more re- Med., vol. 366, pp. 489491, Feb. 2012.
search is needed in the following biomedical big data areas. [6] C. G. Chute et al., Some experiences and opportunities for big data in
1) Omic data integration: As illustrated by the TCGA case translational research, Genetics Med., vol. 15, pp. 802809, Oct. 2013.
[7] L. Dai et al., Bioinformatics clouds for big data manipulation, Biol.
study, integrative multi-omic data analysis is of growing im-
Direct, vol. 7, Nov. 2012, Art. no. 43.
portance because it provides holistic view of molecular finger- [8] V. Marx, Biology: The big challenges of big data, Nature, vol. 498,
prints for each patients condition. Recent research has shown pp. 255260, Jun. 2013.
[9] A. ODriscoll et al., Big data, Hadoop and cloud computing in ge- [39] S. M. Meystre et al., Extracting information from textual documents
nomics, J. Biomed. Informat., vol. 46, pp. 774781, Oct. 2013. in the electronic health record: A review of recent research, Yearbook
[10] T. B. Murdoch and A. S. Detsky, The inevitable application of big data Med. Informat., vol. 35, pp. 128144, 2008.
to health care, J. Amer. Med. Assoc., vol. 309, pp. 13511352, Apr. [40] P. B. Jensen et al., Mining electronic health records: Towards better
2013. research applications and clinical care, Nature Rev. Genetics, vol. 13,
[11] W. Raghupathi and V. Raghupathi, Big data analytics in healthcare: pp. 395405, Jun. 2012.
Promise and potential, Health Inf. Sci. Syst., vol. 2, Feb. 2014, Art. no. 3. [41] P. S. Romano et al., A comparison of administrative versus clinical
data: Coronary artery bypass surgery as an example, J. Clin. Epidemiol.,
[12] D. W. Bates et al., Big data in health care: Using analytics to identify vol. 47, pp. 249260, Mar. 1994.
and manage high-risk and high-cost patients, Health Affairs, vol. 33, [42] R. K. Patel and M. Jain, NGS QC toolkit: A toolkit for quality control
pp. 11231131, Jul. 2014. of next generation sequencing data, PloS ONE, vol. 7, Feb. 1, 2012,
[13] W. Hsu et al., Biomedical imaging informatics in the era of precision Art. no. e30619.
medicine: Progress, challenges, and opportunities, J. Amer. Med. Inform. [43] T. H. Stokes et al., Chip artifact correction (caCORRECT): A bioinfor-
Assoc., vol. 20, pp. 10101013, Nov. 2013. matics system for quality assurance of genomics and proteomics array
[14] K. Clark et al., The Cancer Imaging Archive (TCIA): Maintaining data, Ann. Biomed. Eng., vol. 35, pp. 106880, Jun. 2007.
and operating a public information repository, J. Digit. Imag., vol. 26, [44] J. T. Leek et al., Tackling the widespread and critical impact of
pp. 10451057, Dec. 2013. batch effects in high-throughput data, Nature Rev. Genetics, vol. 11,
[15] H. Banaee et al., Data mining for wearable sensors in health monitoring pp. 733739, Oct. 2010.
systems: A review of recent trends and challenges, Sensors, vol. 13, [45] W. E. Johnson et al., Adjusting batch effects in microarray expression
pp. 1747217500, Dec. 2013. data using empirical Bayes methods, Biostatistics, vol. 8, pp. 118127,
[16] L. D. Xu et al., Internet of things in industries: A survey, IEEE Trans. Jan. 2007.
Ind. Informat., vol. 10, no. 4, pp. 22332243, Nov. 2014. [46] C. Ledergerber and C. Dessimoz, Base-calling for next-generation se-
[17] F. S. Collins and H. Varmus, A new initiative on precision medicine, quencing platforms, Briefings Bioinformat., vol. 12, pp. 489497, Sep.
New Engl. J. Med., vol. 372, pp. 793795, Feb. 2015. 2011.
[18] I. S. Kohane, Ten things we have to do to achieve precision medicine, [47] C. A. Smith et al., XCMS: Processing mass spectrometry data for
Science, vol. 349, pp. 3738, Jul. 2015. metabolite profiling using nonlinear peak alignment, matching, and iden-
[19] R. Hillestad et al., Can electronic medical record systems transform tification, Anal. Chem., vol. 78, pp. 779787, Feb. 1, 2006.
health care? Potential health benefits, savings, and costs, Health Affairs, [48] N. G. Weiskopf and C. Weng, Methods and dimensions of electronic
vol. 24, pp. 11031117, Sep. 2005. health record data quality assessment: Enabling reuse for clinical re-
[20] NHGRI. [Online]. Available: https://www.genome.gov/10000202/fact- search, J. Amer. Med. Informat. Assoc., vol. 20, pp. 144151, Jan. 2013.
sheets/ [49] S. Bowman, Impact of electronic health record systems on informa-
[21] R. Romero et al., The use of high-dimensional biology (genomics, tran- tion integrity: Quality and safety implications, Perspectives Health Inf.
scriptomics, proteomics, and metabolomics) to understand the preterm Manage., vol. 10, Oct. 2013.
parturition syndrome, BJOG Int. J. Obstetrics Gynaecol., vol. 113, [50] G. D. Clifford et al., Robust parameter extraction for decision support
pp. 118135, Dec. 2006. using multimodal intensive care data, Philos. Trans. A, Math. Phys. Eng.
[22] Nature. [Online]. Available: http://www.nature.com/scitable Sci., vol. 367, pp. 411429, Jan. 2009.
[23] L. Feuk et al., Structural variation in the human genome, Nature Rev. [51] M. Girolami et al., Analysis of complex, multidimensional datasets,
Genetics, vol. 7, pp. 8597, Feb. 2006. Drug Discovery Today: Technol., vol. 3, pp. 1319, Apr. 2006.
[24] D. Kim and S. L. Salzberg, TopHat-Fusion: An algorithm for dis- [52] R. Chen et al., Personal omics profiling reveals dynamic molecular and
covery of novel fusion transcripts, Genome Biol., vol. 12, Aug. 2011, medical phenotypes, Cell, vol. 148, pp. 12931307, Mar. 2012.
Art. no. R72. [53] L. Stanberry et al., Integrative analysis of longitudinal metabolomics
[25] A. McPherson et al., deFuse: An algorithm for gene fusion discov- data from a personal multi-omics profile, Metabolites, vol. 3,
ery in tumor RNA-seq data, PLoS Comput. Biol., vol. 7, May 2011, pp. 741760, Sep. 2013.
Art. no. e1001138. [54] P. Luukka and J. Lampinen, A classification method based on princi-
[26] D. L. Black, Mechanisms of alternative pre-messenger RNA splicing, pal component analysis and differential evolution algorithm applied for
Annu. Rev. Biochem., vol. 72, pp. 291336, 2003. prediction diagnosis from clinical emr heart data sets, Comput. Intell.
[27] What-is-Epigenetics. [Online]. Available: http://www.whatisepigenetics. Optim.: Appl. Implementations, vol. 7, pp. 263283, 2010.
com/ [55] Y. Huang et al., Feature selection and classification model construction
[28] C.-W. Cheng et al., icuARMAn ICU clinical decision support sys- on type 2 diabetic patients data, Artif. Intell. Med., vol. 41, pp. 251262,
tem using association rule mining, IEEE J. Transl. Eng. Health Med., Nov. 2007.
vol. 1, Nov. 2013, Art. no. 4400110. [56] P. Cunningham, Dimension reduction, in Machine Learning Tech-
[29] Health Level Seven International. [Online]. Available: http://hl7.org/fhir/ niques for Multimedia: Case Studies on Organization and Retrieval.
[30] M. A. DePristo et al., A framework for variation discovery and geno- Berlin, Germany: Springer, 2008, pp. 91112.
typing using next-generation DNA sequencing data, Nature Genetics, [57] H. Peng et al., Feature selection based on mutual information: Criteria
vol. 43, pp. 491498, May 2011. of max-dependency, max-relevance, and min-redundancy, IEEE Trans.
[31] C. Xie and M. T. Tammi, CNV-seq, a new method to detect copy number Pattern Anal., vol. 27, no. 8, pp. 12261238, Aug. 2005.
variation using high-throughput sequencing, BMC Bioinformat., vol. 10, [58] R. Kohavi and G. H. John, Wrappers for feature subset selection, Artif.
Mar. 2009, Art. no. 80. Intell., vol. 97, pp. 273324, Dec. 1997.
[32] F. Ozsolak and P. M. Milos, RNA sequencing: Advances, challenges [59] Y. Saeys et al., A review of feature selection techniques in bioinfor-
and opportunities, Nature Rev. Genetics, vol. 12, pp. 8798, Feb. 2011. matics, Bioinformatics, vol. 23, pp. 25072517, Oct. 2007.
[33] C. Trapnell et al., Differential gene and transcript expression analysis of [60] J. l. Coste et al., Methodological issues in determining the dimension-
RNA-SEQ experiments with TopHat and Cufflinks, Nature Protocols, ality of composite health measures using principal component analysis:
vol. 7, pp. 562578, Mar. 2012. Case illustration and suggestions for practice, Quality Life Res., vol. 14,
[34] M. Hirst and M. A. Marra, Next generation sequencing based ap- pp. 641654, May 2005.
proaches to Epigenomics, Brief Funct. Genomics, vol. 9, pp. 455465, [61] G. E. Hinton and R. R. Salakhutdinov, Reducing the dimensionality of
Dec. 2010. data with neural networks, Science, vol. 313, pp. 504507, Jul. 2006.
[35] R. Aebersold and M. Mann, Mass spectrometry-based proteomics, [62] B. Scholkopf et al., Nonlinear component analysis as a kernel eigen-
Nature, vol. 422, pp. 198207, Mar. 13, 2003. value problem, Neural Comput., vol. 10, pp. 12991319, Jul. 1998.
[36] A. Buchholz et al., Metabolomics: Quantification of intracellular [63] T. D. Wu and C. K. Watanabe, GMAP: A genomic mapping and align-
metabolite dynamics, Biomolecular Eng., vol. 19, pp. 515, Jun. ment program for mRNA and EST sequences, Bioinformatics, vol. 21,
2002. pp. 18591875, May 2005.
[37] S. J. Liu, Epigenetics advancing personalized nanomedicine in cancer [64] H. Li and R. Durbin, Fast and accurate short read alignment with
therapy, Adv. Drug Del. Rev., vol. 64, pp. 15321543, Oct. 2012. Burrows-Wheeler transform, Bioinformatics, vol. 25, pp. 17541760,
[38] K. Hayrinen et al., Definition, structure, content, use and impacts of Jul. 2009.
electronic health records: A review of the research literature, Int. J. Med. [65] A. Dobin et al., STAR: Ultrafast universal RNA-seq aligner, Bioinfor-
Informat., vol. 77, pp. 291304, May 2008. matics, vol. 29, pp. 1521, Jan. 2013.
[66] H. Li, A statistical framework for SNP calling, mutation discovery, [93] J. H. Kim et al., CNVRuler: A copy number variation-based
association mapping and population genetical parameter estimation from case-control association analysis tool, Bioinformatics, vol. 28,
sequencing data, Bioinformatics, vol. 27, pp. 29872993, Nov. 2011. pp. 17901792, Jul. 2012.
[67] S. Anders et al., HTSeq A Python framework to work with high- [94] M. D. Robinson et al., EDGER: A bioconductor package for differen-
throughput sequencing data, Bioinformatics, vol. 31, pp. 166169, Jan. tial expression analysis of digital gene expression data, Bioinformatics,
2015. vol. 26, pp. 139140, Jan. 2010.
[68] A. R. Quinlan and I. M. Hall, BEDTools: A flexible suite of utilities [95] M. I. Love et al., Moderated estimation of fold change and dispersion
for comparing genomic features, Bioinformatics, vol. 26, pp. 841842, for RNA-seq data with DESeq2, Genome Biol., vol. 15, Dec. 2014,
Mar. 2010. Art. no. 550.
[69] B. Li and C. N. Dewey, RSEM: Accurate transcript quantification from [96] J. H. Phan et al., OmniBiomarker: A Web-based application for
RNA-seq data with or without a reference genome, BMC Bioinformat., knowledge-driven biomarker identification, IEEE Trans. Biomed. Eng.,
vol. 12, Aug. 2011, Art. no. 323. vol. 60, no. 12, pp. 33643367, Dec. 2013.
[70] C. Trapnell et al., Transcript assembly and quantification by RNA- [97] Y. Hu et al., DiffSplice: The genome-wide detection of differential
Seq reveals unannotated transcripts and isoform switching during cell splicing events with RNA-seq, Nucleic Acids Res., vol. 41, Jan. 2013,
differentiation, Nature Biotechnol., vol. 28, pp. 511515, May 2010. Art. no. e39.
[71] G. Robertson et al., De novo assembly and analysis of RNA-seq data, [98] S. H. Shen et al., MATS: A Bayesian framework for flexible detection
Nature Methods, vol. 7, pp. 909912, Nov. 2010. of differential alternative splicing from RNA-seq data, Nucleic Acids
[72] M. G. Grabherr et al., Full-length transcriptome assembly from RNA- Res., vol. 40, Apr. 2012, Art. no. R74.
seq data without a reference genome, Nature Biotechnol., vol. 29, [99] K. Liang and S. Keles, Detecting differential binding of transcrip-
pp. 644652, Jul. 2011. tion factors with ChIP-seq, Bioinformatics, vol. 28, pp. 121122, Jan.
[73] M. Guttman et al., Ab initio reconstruction of cell type-specific tran- 2012.
scriptomes in mouse reveals the conserved multi-exonic structure of [100] H. Xu et al., An HMM approach to genome-wide identification of dif-
lincRNAs, Nature Biotechnol., vol. 28, pp. 503510, May 2010. ferential histone modification sites from ChIP-seq data, Bioinformatics,
[74] Y. Zhang et al., Model-based analysis of ChIP-Seq (MACS), Genome vol. 24, pp. 23442349, Oct. 2008.
Biol., vol. 9, Sep. 2008, Art. no. R137. [101] Y. Zhang et al., QDMR: A quantitative method for identification
[75] R. Jothi et al., Genome-wide identification of in vivo protein- of differentially methylated regions by entropy, Nucleic Acids Res.,
DNA binding sites from ChIP-Seq data, Nucleic Acids Res., vol. 36, vol. 39, May 2011, Art. no. e58.
pp. 52215231, Sep. 2008. [102] C. D. Kaddi et al., DetectTLC: Automated reaction mixture screening
[76] M. Sturm et al., OpenMS An open-source software framework for utilizing quantitative mass spectrometry image features, J. Amer. Soc.
mass spectrometry, BMC Bioinformat., vol. 9, Mar. 2008, Art. no. 163. Mass. Spectrometry, vol. 27, pp. 359365, Oct. 2015.
[77] T. Pluskal et al., MZmine 2: Modular framework for processing, visu- [103] T. Wang et al., Automics: An integrated platform for NMR-based
alizing, and analyzing mass spectrometry-based molecular profile data, metabonomics spectral processing and data analysis, BMC Bioinfor-
BMC Bioinformat., vol. 11, Jul. 2010, Art. no. 395. matics, vol. 10, Mar. 2009, Art. no. 83.
[78] S. Pepke et al., Computation for ChIP-seq and RNA-seq studies, [104] J. Xia et al., MetaboAnalyst 2.0 A comprehensive server
Nature Methods, vol. 6, pp. S22S32, Nov. 2009. for metabolomic data analysis, Nucleic Acids Res., vol. 40,
[79] S. Pabinger et al., A survey of tools for variant analysis of next- pp. W127W133, Jul. 2012.
generation genome sequencing data, Brief Bioinformat., vol. 15, [105] W. S. Bush and J. H. Moore, Chapter 11: Genome-wide association
pp. 256278, Mar. 2014. studies, PloS Comput. Biol., vol. 8, Dec. 2012, Art. no. e1002822.
[80] C. Alkan et al., Genome structural variation discovery and genotyping, [106] P. Langfelder and S. Horvath, WGCNA: An R package for weighted
Nature Rev. Genetics, vol. 12, pp. 363375, May 2011. correlation network analysis, BMC Bioinformat., vol. 9, Dec. 2008,
[81] M. Zhao et al., Computational tools for copy number variation (CNV) Art. no. 559.
detection using next-generation sequencing data: Features and perspec- [107] H. Y. Hu et al., Mining coherent dense subgraphs across massive
tives, BMC Bioinformat., vol. 14, Sep. 2013, Art. no. S1. biological networks for functional discovery, Bioinformatics, vol. 21,
[82] Z. Chang et al., Bridger: A new framework for de novo transcrip- pp. I213I221, Jun. 2005.
tome assembly using RNA-seq data, Genome Biol., vol. 16, Feb. 2015, [108] G. Ciriello et al., Mutual exclusivity analysis identifies oncogenic
Art. no. 30. network modules, Genome Res., vol. 22, pp. 398406, Feb. 2012.
[83] M. Pertea et al., StringTie enables improved reconstruction of [109] A. Funahashi et al., CellDesigner 3.5: A versatile modeling tool for
a transcriptome from RNA-seq reads, Nature Biotechnol., vol. 33, biochemical networks, Proc. IEEE, vol. 96, no. 8, pp. 12541265, Aug.
pp. 290295, Mar. 2015. 2008.
[84] J. A. Martin and Z. Wang, Next-generation transcriptome assembly, [110] U. Wilensky. 1999. NetLogo. [Online]. Available: http://ccl.northwes-
Nature Rev. Genetics, vol. 12, pp. 671682, Oct. 2011. tern.edu/netlogo/
[85] L. N. Mueller et al., An assessment of software solutions for the analysis [111] C. Mussel et al., BoolNet An R package for generation, recon-
of mass spectrometry based quantitative proteomics data, J. Proteome struction and analysis of Boolean networks, Bioinformatics, vol. 26,
Res., vol. 7, pp. 5161, Jan. 2008. pp. 13781380, May 2010.
[86] W. B. Dunn et al., Procedures for large-scale metabolic profiling of [112] C. Rohr et al., Snoopy A unifying Petri net framework to investi-
serum and plasma using gas chromatography and liquid chromatography gate biomolecular networks, Bioinformatics, vol. 26, pp. 974975, Apr.
coupled to mass spectrometry, Nature Protocols, vol. 6, pp. 10601083, 2010.
Jun. 2011. [113] A. R. Joyce and B. O. Palsson, The model organism as a system:
[87] T. Alexandrov, MALDI imaging mass spectrometry: Statistical data integrating omics data sets, Nature Rev. Molecular Cell Biol., vol. 7,
analysis and current computational challenges, BMC Bioinformat., pp. 198210, Mar. 2006.
vol. 13, Nov. 2012, Art. no. S11. [114] A. L. Barabasi et al., Network medicine: A network-based ap-
[88] C. Yang et al., Comparison of public peak detection algorithms for proach to human disease, Nature Rev. Genetics, vol. 12, pp. 5668,
MALDI mass spectrometry data analysis, BMC Bioinformat., vol. 10, Jan. 2011.
Jan. 2009, Art. no. 4. [115] M. Vidal et al., Interactome networks and human disease, Cell,
[89] J. R. Gonzalez et al., SNPassoc: An R package to perform whole vol. 144, pp. 986998, Mar. 2011.
genome association studies, Bioinformatics, vol. 23, pp. 644645, Mar. [116] H. de Jong, Modeling and simulation of genetic regulatory systems: A
2007. literature review, J. Comput. Biol., vol. 9, pp. 67103, Jul. 2002.
[90] J. Marchini et al., A new multipoint method for genome-wide asso- [117] H. Doerfler et al., Granger causality in integrated GC-MS and LC-
ciation studies by imputation of genotypes, Nature Genetics, vol. 39, MS metabolomics data reveals the interface of primary and secondary
pp. 906913, Jul. 2007. metabolism, Metabolomics, vol. 9, pp. 564574, Jun. 2013.
[91] G. T. Wang et al., Variant association tools for quality control and [118] J. Lee et al., Interrogating a clinical database to study treatment
analysis of large-scale sequence and genotyping array data, Amer. J. of hypotension in the critically ill, BMJ Open, vol. 2, Jun. 2012,
Human Genetics, vol. 94, pp. 770783, May 2014. Art. no. e000916.
[92] S. Purcell et al., PLINK: A tool set for whole-genome association and [119] J. Lee and R. G. Mark, An investigation of patterns in hemodynamic
population-based linkage analyses, Amer. J. Human Genetics, vol. 81, data indicative of impending hypotension in intensive care, Biomed.
pp. 559575, Sep. 2007. Eng. Online, vol. 9, Oct. 2010, Art. no. 62.
[120] E. C. Lau et al., Use of electronic medical records (EMR) for oncol- [145] H. Demirkan and D. Delen, Leveraging the capabilities of service-
ogy outcomes research: Assessing the comparability of EMR informa- oriented decision support systems: Putting analytics and big data in
tion to patient registry and health claims data, Clin. Epidemiol., vol. 3, cloud, Decision Support Syst., vol. 55, pp. 412421, May 2013.
pp. 259272, Oct. 2011. [146] Illumina. BaseSpace: Genomics Cloud Computing. [Online]. Available:
[121] G. Allasia, Approximating potential integrals by cardinal basis inter- https://basespace.illumina.com
polants on multivariate scattered data, Comput. Math. Appl., vol. 43, [147] E. Afgan et al., Harnessing cloud computing with galaxy cloud, Nature
pp. 275287, Feb.-Mar. 2002. Biotechnol., vol. 29, pp. 972974, Nov. 2011.
[122] C. A. Welch et al., Evaluation of two-fold fully conditional specifica- [148] A. Wright et al., Creating and sharing clinical decision support con-
tion multiple imputation for longitudinal electronic health record data, tent with Web 2.0: Issues and examples, J. Biom. Informat., vol. 42,
Statist. Med., vol. 33, pp. 37253737, Sep. 2014. pp. 334346, 2009.
[123] Q. Li et al., Robust heart rate estimation from multiple asynchronous [149] T. C. G. Atlas. [Online]. Available: http://cancergenome.nih.gov/
noisy sources using signal quality indices and a Kalman filter, Physiol. [150] C. D. Kaddi and M. D. Wang, Models for predicting stage in head
Meas., vol. 29, pp. 1532, Jan. 2008. and neck squamous cell carcinoma using proteomic data, in Proc. 36th
[124] W. Zong et al., Reduction of false arterial blood pressure alarms using Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., 2014, pp. 52165219.
signal quality assessment and relationships between the electrocardio- [151] K. A. Hoadley et al., Multiplatform analysis of 12 cancer types re-
gram and arterial blood pressure, Med. Biol. Eng. Comput., vol. 42, veals molecular classification within and across tissues of origin, Cell,
pp. 698706, Sep. 2004. vol. 158, pp. 929944, 2014.
[125] I. Silva et al., Signal quality estimation with multichannel adaptive [152] R. G. Verhaak et al., Integrated genomic analysis identifies clini-
filtering in intensive care settings, IEEE Trans. Biomed. Eng., vol. 59, cally relevant subtypes of glioblastoma characterized by abnormalities
no. 9, pp. 24762485, Sep. 2012. in PDGFRA, IDH1, EGFR, and NF1, Cancer Cell, vol. 17, pp. 98110,
[126] J. M. Feldman et al., Robust sensor fusion improves heart rate esti- Jan. 19, 2010.
mation: Clinical evaluation, J. Clin. Monitoring, vol. 13, pp. 379384, [153] Cancer Genome Atlas Network, Comprehensive molecular portraits of
Nov. 1997. human breast tumours, Nature, vol. 490, pp. 6170, Oct. 4, 2012.
[127] T. Wartzek et al., Robust sensor fusion of unobtrusively measured heart [154] Cancer Genome Atlas Research Network, Comprehensive genomic
rate, IEEE J. Biomed. Health Informat., vol. 18, no. 2, pp. 654660, Mar. characterization of squamous cell lung cancers, Nature, vol. 489,
2014. pp. 51925, Sep. 27, 2012.
[128] J. Labar`ere et al., How to derive and validate clinical prediction mod- [155] O. Gottesman et al., The electronic medical records and genomics
els for use in intensive care medicine, Intensive Care Med., vol. 40, (eMERGE) network: Past, present, and future, Genetics Med., vol. 15,
pp. 513527, Apr. 2014. pp. 761771, Oct. 2013.
[129] J. L. Schafer, Multiple imputation: A primer, Statist. Methods Med. [156] K. Marsolo and S. A. Spooner, Clinical genomics in the world of the
Res., vol. 8, pp. 315, 1999. electronic health record, Genetics Med., vol. 15, pp. 786791, Oct. 2013.
[130] J. Tian et al., Missing data analyses: A hybrid multiple imputation [157] J. L. Kannry and M. S. Williams, Integration of genomics into the
algorithm using Gray System Theory and entropy based on clustering, electronic health record: Mapping terra incognita, Genetics Med.,
Appl. Intell., vol. 40, pp. 376388, Mar. 2014. vol. 15, pp. 757760, Jun. 2013.
[131] K. A. Hallgren and K. Witkiewitz, Missing data in alcohol clinical [158] A. G. Ury, Storing and interpreting genomic information in widely
trials: A comparison of methods, Alcohol Clin. Exp. Res., vol. 37, deployed electronic health record systems, Genetics Med., vol. 15,
pp. 21522160, Dec. 2013. pp. 779785, Aug. 2013.
[132] N. Stadler et al., Pattern alternating maximization algorithm for miss- [159] C. A. Caligtan and P. C. Dykes, Electronic health records and personal
ing data in high-dimensional problems, J. Mach. Learn. Res., vol. 15, health records, Semin. Oncol. Nursing, vol. 27, pp. 218228, Aug. 2011.
pp. 19031928, Jan. 2014. [160] G. Alterovitz et al., SMART on FHIR Genomics: Facilitating stan-
[133] L. Fuchs et al., ICU admission characteristics and mortality rates dardized clinico-genomic apps, J. Amer. Med. Informat. Assoc., vol. 22,
among elderly and very elderly patients, Intensive Care Med., vol. 38, pp. 11738, Nov. 2015.
pp. 16541661, Oct. 2012. [161] NIH-BMIC. [Online]. Available: https://www.nlm.nih.gov/NIHbmic/
[134] Y. Li et al., A method for controlling complex confounding effects in [162] WHO. [Online]. Available: http://apps.who.int/gho/data/node.main
the detection of adverse drug reactions using electronic health records, [163] NCHS. [Online]. Available: http://www.healthindicators.gov/
J. Amer. Med. Informat. Assoc., vol. 21, pp. 308314, Mar. 2014. [164] H. C. Chen et al., Business intelligence and analytics: From big data to
[135] R. V. Andreao et al., ECG signal analysis through hidden Markov big impact, MIS Quart., vol. 36, pp. 11651188, Dec. 2012.
models, IEEE Trans. Biomed. Eng., vol. 53, no. 8, pp. 15411549, Aug. [165] M. Chen et al., Big data: A survey, Mobile Netw. Appl., vol. 19,
2006. pp. 171209, Apr. 2014.
[136] G. de Lannoy et al., Weighted conditional random fields for supervised [166] H. J. Murff et al., Automated identification of postoperative com-
interpatient heartbeat classification, IEEE Trans. Biomed. Eng., vol. 59, plications within an electronic medical record using natural language
no. 1, pp. 241247, Jan. 2012. processing, JAMA, vol. 306, pp. 848855, 2011.
[137] J. Klema et al., Sequential data mining: A comparative case study in [167] C. A. McCarty et al., The emerge network: A consortium of biorepos-
development of atherosclerosis risk factors, IEEE Trans. Syst., Man, itories linked to electronic medical records data for conducting genomic
Cybern. C, Appl. Rev., vol. 38, no. 1, pp. 315, Jan. 2008. studies, BMC Med. Genomics, vol. 4, 2011, Art. no. 13.
[138] R. Harpaz et al., Novel data-mining methodologies for adverse drug [168] L. Ricciardi et al., A national action plan to support consumer engage-
event discovery and analysis, Clin. Pharmacol. Therapeutics, vol. 91, ment via e-health, Health Affairs, vol. 32, pp. 376384, 2013.
pp. 10101021, Jun. 2012. [169] M. K. Keck et al., Integrative analysis of head and neck cancer iden-
[139] S. K. Harms and J. S. Deogun, Sequential association rule mining with tifies two biologically distinct HPV and three non-HPV subtypes, Clin.
time lags, J. Intell. Inf. Syst., vol. 22, pp. 722, Jan. 2004. Cancer Res., vol. 21, pp. 870881, Feb. 15, 2015.
[140] H. Chen et al., Business intelligence and analytics: From big data to [170] C. Vogel and E. M. Marcotte, Insights into the regulation of protein
big impact, MIS Quart., vol. 36, pp. 11651188, Dec. 2012. abundance from proteomic and transcriptomic analyses, Nature Rev.
[141] O. Ola and K. Sedig, The challenge of big data in public health: An Genetics, vol. 13, pp. 227232, Apr. 2012.
opportunity for visual analytics, Online J. Public Health Informat., [171] J. Andreu-Perez et al., Big data for health, IEEE J. Biom. Health
vol. 5, Feb. 2014, Art. no. 223. Informat., vol. 19, no. 4, pp. 11931208, Jul. 2015.
[142] R. C. Taylor, An overview of the Hadoop/MapReduce/HBase frame- [172] M. Cvach, Monitor alarm fatigue: An integrative review, Biomed. In-
work and its current applications in bioinformatics, BMC Bioinformat., strum. Technol., vol. 46, pp. 268277, Jul. 2012.
vol. 11, Dec. 2010, Art. no. S1. [173] R. Martis et al., Wavelet-based machine learning techniques for
[143] A. Biem et al., IBM infosphere streams for scalable, real-time, intelli- ECG signal analysis, in Machine Learning in Healthcare Informatics,
gent transportation services, in Proc. ACM SIGMOD Int. Conf. Manage. vol. 56, S. Dua, U. R. Acharya, and P. Dua, Eds. Berlin, Germany:
Data, 2010, pp. 10931104. Springer, 2014, pp. 2545.
[144] M. Zaharia et al., Spark: Cluster computing with working sets, in
Authors photograph and biography not available at the time of publica-
Proc. 2nd USENIX Conf. Hot Topics Cloud Comput., 2010, pp. 1010.
tion.

Omic and Electronic Health Record Big Data Analytics For Precision Medicine

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Omic and Electronic Health Record Big Data Analytics For Precision Medicine

Încărcat de

Drepturi de autor:

Formate disponibile

IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 64, NO.

2, FEBRUARY 2017 263

Omic and Electronic Health Record Big Data

AbstractObjective: Rapid advances of high-throughput

TABLE I adoption of EHR for the entire population provides a foun-

B. Challenges Associated With Omic and EHR Data TABLE III

TABLE VI TABLE VII

Tool Modeling Approach Key Functionality Method Advantages Limitations

TABLE VIII TABLE IX

Method Advantages Limitations Platform Advantages Limitations

general, these machine learning techniques prefer a large sample

Another big challenge is that each individual typically has

TABLE X positive impact of knowledge and insight obtained from inte-

S-ar putea să vă placă și