Documente Academic
Documente Profesional
Documente Cultură
(Review Paper)
0018-9294 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution
requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.
264 IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 64, NO. 2, FEBRUARY 2017
TABLE IV TABLE V
SELECTED TOOLS FOR OMIC DATA PREPROCESSING SELECTED TOOLS FOR OMIC BIOMARKER IDENTIFICATION
Tool Assay Omic Data Key Functionality Tool Omic Data Omic Biomarker Approach
GMAP* [63] Next-generation Genomic, Sequence mapping SNPassoc [89] Genomic Significant SNPs associated with Genome-wide
sequencing transcriptomic, traits association
and epigenomic studies
BWA [64] SNPTEST [90]
STAR [65] VAT [91] Significant SNPs and indels
GATK [30] Genomic Genomic variant discovery associated with traits
SAMtools [66] PLINK [92] Significant SNPs, indels, and
HTSeq [67] Gene expression quantification CNVs associated with traits
BEDTools [68] CNVRuler [93] Significant CNVs associated with
RSEM [69] Gene and transcript expression traits
quantification edgeR [94] Transcriptomic Differentially expressed genes Differential
Cufflinks [70] /transcripts analysis (model
fitting and
defuse [25] Transcriptomic Gene fusion detection
statistical tests)
TopHat-Fusion [24]
DESeq2 [95]
Trans-ABySS [71] Alternative splicing detection and
omniBiomarker [96]
quantification
DiffSplice [97] Differential alternative splicing
Trinity [72]
MATS [98]
Cufflinks [70]
DBChIP [99] Epigenomic Differential binding sites
Scripture [73]
ChIPDiff [100] Differential histone modification
MACS [74] Epigenomic ChIP-seq peak calling
sites
SISSRs [75]
QDMR [101] Differentially methylated regions
OpenMS [76] Mass Proteomic and Peak detection and quantification
DetectTLC [102] Proteomic and Molecular patterns in mass Similarity
spectrometry metabolomic
metabolomic spectrometry images scoring
MZmine 2 [77]
Automics [103] Differentially abundant Supervised and
metabolites unsupervised
GMAP stands for genomic mapping and alignment program; BWA, Burrows-Wheeler learning
aligner; STAR, spliced transcripts alignment to a reference; GATK, genome analysis toolkit; MetaboAnalyst
RSEM, RNA-seq by expectation-maximization; Trans-ABySS, transcriptome assembly and [104]
analysis pipeline; MACS, model-based analysis of ChIP-seq; and SISSRs, site identification
from short sequence reads. Highly impactful tool with more than 50 citations per year. SNPassoc stands for SNP-based whole genome association studies; VAT, variant association
tools; PLINK, population-based linkage analyses; edgeR, empirical analysis of digital
gene expression data in R; MATS, multivariate analysis of transcript splicing; DBChIP,
methods associate the reads with all loci [67], [68], while others differential binding with ChIP-seq data; and QDMR, quantitative differentially methylated
regions. Highly impactful tool with more than 50 citations per year.
probabilistically associate the reads with only a few model-
inferred loci [69], [70]. Fusion gene detection relies on two
factors, the spanning read pairs and the split read [24], [25].
Alternative splicing detection relies on either de novo transcrip- biological conditions (e.g., disease vs. non-disease) or differ-
tome assembly [71], [72], [82][84], or inference from sequence ent time points (e.g., before vs. after a treatment). Thus, Ta-
mapping outputs [70], [73]. Epigenomic studies mainly focus ble V summarizes selected tools that identify discriminatory
on identifying patterns of protein-DNA binding sites, histone biomarkers among different groups. Most omic biomarkers
modification, and DNA methylation [34]. Epigenomic data pre- are identified by investigating statistically significant differences
processing builds a profile representing the density of reads among groups, such as differentially expressed genes or tran-
along the genome, models background noises, and determines scripts [94][96], differential alternative splicing [97], [98], dif-
statistically significant peaks [78]. ferential protein-DNA binding [99], differential histone modifi-
MS is for proteomic and metabolomic studies, and its pre- cation [100], and differential DNA methylation [101]. The basic
processing steps include alignment, baseline correction, and idea is to quantify and then fit the abundance of each group to
peak detection [85]. In chromatography-coupled MS, chromato- Poisson-based distributions (e.g., the Poisson distribution and
graphic peak alignment can correct drift to ensure coherent the negative binomial distribution), followed by statistical tests
retention time and accurate mass across samples, and mass- (e.g., the Fishers exact test and the likelihood ratio test) that
to-charge ratio (m/z) alignment can ensure mass spectra and determine the statistical significance of each molecular feature.
component features are comparable among samples [86]. For genomic data, genome-wide association studies (GWAS)
In matrix-assisted laser desorption ionization (MALDI) MS, uses different approaches (e.g., the chi-squared test or logistic
baseline correction is particularly important. Low-mass mea- regression) to assess the degree of association between each vari-
surement noise from the chemical matrix used in MALDI ant and a targeted trait, and then select most significant variants
experiments affects the spectral baseline and needs to be re- as biomarkers [105]. Most GWAS focuses on SNP association
moved prior to analysis [87]. Peak detection is then performed [89][91], while only a few infer CNV or SV association [92],
based on criteria such as signal-to-noise ratio, peak shape, and [93].
detection thresholds [88]. A common subsequent step is the 4) Systems Biology Modeling Using Omic Data:
identification, and potentially the filtering, of isotopic peaks To gain insights about a complex molecular system, we can
from the spectrum [77]. conduct systems biology modeling using either static network
3) Biomarker Identification Using Omic Data: In analysis or dynamic temporal analysis based on omic fea-
practice, different groups of samples are collected for different tures (see Table VI).
WU et al.: OMIC AND ELECTRONIC HEALTH RECORD BIG DATA ANALYTICS FOR PRECISION MEDICINE 267
Logistic regression, cox Simple to implement and Sensitive to outliers Apache Hadoop Horizontally scalable; Generally most effective for
regression, local regression interpret; direct estimates of (MapReduce) [11], [142] fault-tolerant; designed to be batch-mode processing; not
(LOESS) [133] relevant hazards for Cox deployed on always appropriate for
regression commodity-grade hardware; real-time, online analytics
Logistic regression with LASSO Reduces feature space Prone to overfitting free and open-source
regularization [134] IBM InfoSphere Platform Includes purpose-built tools Commercial licensing
Hidden Markov models [135] Simultaneous detection, Sensitive to the design of [143] to handle streaming
segmentation, and the Markov model being information; integrates with
classification in a waveform trained open-source tools such as
Conditional random fields [136] Supports temporal analysis; Sensitive to Hadoop
resistant to differences in regularization and feature Apache Spark Streaming Integrates with the Hadoop Depends on more expensive
class prevalence space size [144] stack; allows one code base hardware with large amounts
Relational subgroup discovery, Valid sequential techniques Tradeoffs between for both batch-mode and of RAM to work efficiently
online analysis
episode rule mining, windowing for some clinical applications simplicity, complexity,
[137] and temporal resolution Tableau, QlikView, TIBCO Visualization of large and Generally incomplete
Rule mining, Allens interval Temporal mining/modeling Requires specific Spotfire, and other visual complex data sets solutions, requiring other
algebra, directed acyclic graph capabilities experimental design analytics tools tools to effectively handle
[138] data storage
*
Highly impactful method with more than 50 000 relevant papers. Highly impactful platform.
[9] A. ODriscoll et al., Big data, Hadoop and cloud computing in ge- [39] S. M. Meystre et al., Extracting information from textual documents
nomics, J. Biomed. Informat., vol. 46, pp. 774781, Oct. 2013. in the electronic health record: A review of recent research, Yearbook
[10] T. B. Murdoch and A. S. Detsky, The inevitable application of big data Med. Informat., vol. 35, pp. 128144, 2008.
to health care, J. Amer. Med. Assoc., vol. 309, pp. 13511352, Apr. [40] P. B. Jensen et al., Mining electronic health records: Towards better
2013. research applications and clinical care, Nature Rev. Genetics, vol. 13,
[11] W. Raghupathi and V. Raghupathi, Big data analytics in healthcare: pp. 395405, Jun. 2012.
Promise and potential, Health Inf. Sci. Syst., vol. 2, Feb. 2014, Art. no. 3. [41] P. S. Romano et al., A comparison of administrative versus clinical
data: Coronary artery bypass surgery as an example, J. Clin. Epidemiol.,
[12] D. W. Bates et al., Big data in health care: Using analytics to identify vol. 47, pp. 249260, Mar. 1994.
and manage high-risk and high-cost patients, Health Affairs, vol. 33, [42] R. K. Patel and M. Jain, NGS QC toolkit: A toolkit for quality control
pp. 11231131, Jul. 2014. of next generation sequencing data, PloS ONE, vol. 7, Feb. 1, 2012,
[13] W. Hsu et al., Biomedical imaging informatics in the era of precision Art. no. e30619.
medicine: Progress, challenges, and opportunities, J. Amer. Med. Inform. [43] T. H. Stokes et al., Chip artifact correction (caCORRECT): A bioinfor-
Assoc., vol. 20, pp. 10101013, Nov. 2013. matics system for quality assurance of genomics and proteomics array
[14] K. Clark et al., The Cancer Imaging Archive (TCIA): Maintaining data, Ann. Biomed. Eng., vol. 35, pp. 106880, Jun. 2007.
and operating a public information repository, J. Digit. Imag., vol. 26, [44] J. T. Leek et al., Tackling the widespread and critical impact of
pp. 10451057, Dec. 2013. batch effects in high-throughput data, Nature Rev. Genetics, vol. 11,
[15] H. Banaee et al., Data mining for wearable sensors in health monitoring pp. 733739, Oct. 2010.
systems: A review of recent trends and challenges, Sensors, vol. 13, [45] W. E. Johnson et al., Adjusting batch effects in microarray expression
pp. 1747217500, Dec. 2013. data using empirical Bayes methods, Biostatistics, vol. 8, pp. 118127,
[16] L. D. Xu et al., Internet of things in industries: A survey, IEEE Trans. Jan. 2007.
Ind. Informat., vol. 10, no. 4, pp. 22332243, Nov. 2014. [46] C. Ledergerber and C. Dessimoz, Base-calling for next-generation se-
[17] F. S. Collins and H. Varmus, A new initiative on precision medicine, quencing platforms, Briefings Bioinformat., vol. 12, pp. 489497, Sep.
New Engl. J. Med., vol. 372, pp. 793795, Feb. 2015. 2011.
[18] I. S. Kohane, Ten things we have to do to achieve precision medicine, [47] C. A. Smith et al., XCMS: Processing mass spectrometry data for
Science, vol. 349, pp. 3738, Jul. 2015. metabolite profiling using nonlinear peak alignment, matching, and iden-
[19] R. Hillestad et al., Can electronic medical record systems transform tification, Anal. Chem., vol. 78, pp. 779787, Feb. 1, 2006.
health care? Potential health benefits, savings, and costs, Health Affairs, [48] N. G. Weiskopf and C. Weng, Methods and dimensions of electronic
vol. 24, pp. 11031117, Sep. 2005. health record data quality assessment: Enabling reuse for clinical re-
[20] NHGRI. [Online]. Available: https://www.genome.gov/10000202/fact- search, J. Amer. Med. Informat. Assoc., vol. 20, pp. 144151, Jan. 2013.
sheets/ [49] S. Bowman, Impact of electronic health record systems on informa-
[21] R. Romero et al., The use of high-dimensional biology (genomics, tran- tion integrity: Quality and safety implications, Perspectives Health Inf.
scriptomics, proteomics, and metabolomics) to understand the preterm Manage., vol. 10, Oct. 2013.
parturition syndrome, BJOG Int. J. Obstetrics Gynaecol., vol. 113, [50] G. D. Clifford et al., Robust parameter extraction for decision support
pp. 118135, Dec. 2006. using multimodal intensive care data, Philos. Trans. A, Math. Phys. Eng.
[22] Nature. [Online]. Available: http://www.nature.com/scitable Sci., vol. 367, pp. 411429, Jan. 2009.
[23] L. Feuk et al., Structural variation in the human genome, Nature Rev. [51] M. Girolami et al., Analysis of complex, multidimensional datasets,
Genetics, vol. 7, pp. 8597, Feb. 2006. Drug Discovery Today: Technol., vol. 3, pp. 1319, Apr. 2006.
[24] D. Kim and S. L. Salzberg, TopHat-Fusion: An algorithm for dis- [52] R. Chen et al., Personal omics profiling reveals dynamic molecular and
covery of novel fusion transcripts, Genome Biol., vol. 12, Aug. 2011, medical phenotypes, Cell, vol. 148, pp. 12931307, Mar. 2012.
Art. no. R72. [53] L. Stanberry et al., Integrative analysis of longitudinal metabolomics
[25] A. McPherson et al., deFuse: An algorithm for gene fusion discov- data from a personal multi-omics profile, Metabolites, vol. 3,
ery in tumor RNA-seq data, PLoS Comput. Biol., vol. 7, May 2011, pp. 741760, Sep. 2013.
Art. no. e1001138. [54] P. Luukka and J. Lampinen, A classification method based on princi-
[26] D. L. Black, Mechanisms of alternative pre-messenger RNA splicing, pal component analysis and differential evolution algorithm applied for
Annu. Rev. Biochem., vol. 72, pp. 291336, 2003. prediction diagnosis from clinical emr heart data sets, Comput. Intell.
[27] What-is-Epigenetics. [Online]. Available: http://www.whatisepigenetics. Optim.: Appl. Implementations, vol. 7, pp. 263283, 2010.
com/ [55] Y. Huang et al., Feature selection and classification model construction
[28] C.-W. Cheng et al., icuARMAn ICU clinical decision support sys- on type 2 diabetic patients data, Artif. Intell. Med., vol. 41, pp. 251262,
tem using association rule mining, IEEE J. Transl. Eng. Health Med., Nov. 2007.
vol. 1, Nov. 2013, Art. no. 4400110. [56] P. Cunningham, Dimension reduction, in Machine Learning Tech-
[29] Health Level Seven International. [Online]. Available: http://hl7.org/fhir/ niques for Multimedia: Case Studies on Organization and Retrieval.
[30] M. A. DePristo et al., A framework for variation discovery and geno- Berlin, Germany: Springer, 2008, pp. 91112.
typing using next-generation DNA sequencing data, Nature Genetics, [57] H. Peng et al., Feature selection based on mutual information: Criteria
vol. 43, pp. 491498, May 2011. of max-dependency, max-relevance, and min-redundancy, IEEE Trans.
[31] C. Xie and M. T. Tammi, CNV-seq, a new method to detect copy number Pattern Anal., vol. 27, no. 8, pp. 12261238, Aug. 2005.
variation using high-throughput sequencing, BMC Bioinformat., vol. 10, [58] R. Kohavi and G. H. John, Wrappers for feature subset selection, Artif.
Mar. 2009, Art. no. 80. Intell., vol. 97, pp. 273324, Dec. 1997.
[32] F. Ozsolak and P. M. Milos, RNA sequencing: Advances, challenges [59] Y. Saeys et al., A review of feature selection techniques in bioinfor-
and opportunities, Nature Rev. Genetics, vol. 12, pp. 8798, Feb. 2011. matics, Bioinformatics, vol. 23, pp. 25072517, Oct. 2007.
[33] C. Trapnell et al., Differential gene and transcript expression analysis of [60] J. l. Coste et al., Methodological issues in determining the dimension-
RNA-SEQ experiments with TopHat and Cufflinks, Nature Protocols, ality of composite health measures using principal component analysis:
vol. 7, pp. 562578, Mar. 2012. Case illustration and suggestions for practice, Quality Life Res., vol. 14,
[34] M. Hirst and M. A. Marra, Next generation sequencing based ap- pp. 641654, May 2005.
proaches to Epigenomics, Brief Funct. Genomics, vol. 9, pp. 455465, [61] G. E. Hinton and R. R. Salakhutdinov, Reducing the dimensionality of
Dec. 2010. data with neural networks, Science, vol. 313, pp. 504507, Jul. 2006.
[35] R. Aebersold and M. Mann, Mass spectrometry-based proteomics, [62] B. Scholkopf et al., Nonlinear component analysis as a kernel eigen-
Nature, vol. 422, pp. 198207, Mar. 13, 2003. value problem, Neural Comput., vol. 10, pp. 12991319, Jul. 1998.
[36] A. Buchholz et al., Metabolomics: Quantification of intracellular [63] T. D. Wu and C. K. Watanabe, GMAP: A genomic mapping and align-
metabolite dynamics, Biomolecular Eng., vol. 19, pp. 515, Jun. ment program for mRNA and EST sequences, Bioinformatics, vol. 21,
2002. pp. 18591875, May 2005.
[37] S. J. Liu, Epigenetics advancing personalized nanomedicine in cancer [64] H. Li and R. Durbin, Fast and accurate short read alignment with
therapy, Adv. Drug Del. Rev., vol. 64, pp. 15321543, Oct. 2012. Burrows-Wheeler transform, Bioinformatics, vol. 25, pp. 17541760,
[38] K. Hayrinen et al., Definition, structure, content, use and impacts of Jul. 2009.
electronic health records: A review of the research literature, Int. J. Med. [65] A. Dobin et al., STAR: Ultrafast universal RNA-seq aligner, Bioinfor-
Informat., vol. 77, pp. 291304, May 2008. matics, vol. 29, pp. 1521, Jan. 2013.
272 IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 64, NO. 2, FEBRUARY 2017
[66] H. Li, A statistical framework for SNP calling, mutation discovery, [93] J. H. Kim et al., CNVRuler: A copy number variation-based
association mapping and population genetical parameter estimation from case-control association analysis tool, Bioinformatics, vol. 28,
sequencing data, Bioinformatics, vol. 27, pp. 29872993, Nov. 2011. pp. 17901792, Jul. 2012.
[67] S. Anders et al., HTSeq A Python framework to work with high- [94] M. D. Robinson et al., EDGER: A bioconductor package for differen-
throughput sequencing data, Bioinformatics, vol. 31, pp. 166169, Jan. tial expression analysis of digital gene expression data, Bioinformatics,
2015. vol. 26, pp. 139140, Jan. 2010.
[68] A. R. Quinlan and I. M. Hall, BEDTools: A flexible suite of utilities [95] M. I. Love et al., Moderated estimation of fold change and dispersion
for comparing genomic features, Bioinformatics, vol. 26, pp. 841842, for RNA-seq data with DESeq2, Genome Biol., vol. 15, Dec. 2014,
Mar. 2010. Art. no. 550.
[69] B. Li and C. N. Dewey, RSEM: Accurate transcript quantification from [96] J. H. Phan et al., OmniBiomarker: A Web-based application for
RNA-seq data with or without a reference genome, BMC Bioinformat., knowledge-driven biomarker identification, IEEE Trans. Biomed. Eng.,
vol. 12, Aug. 2011, Art. no. 323. vol. 60, no. 12, pp. 33643367, Dec. 2013.
[70] C. Trapnell et al., Transcript assembly and quantification by RNA- [97] Y. Hu et al., DiffSplice: The genome-wide detection of differential
Seq reveals unannotated transcripts and isoform switching during cell splicing events with RNA-seq, Nucleic Acids Res., vol. 41, Jan. 2013,
differentiation, Nature Biotechnol., vol. 28, pp. 511515, May 2010. Art. no. e39.
[71] G. Robertson et al., De novo assembly and analysis of RNA-seq data, [98] S. H. Shen et al., MATS: A Bayesian framework for flexible detection
Nature Methods, vol. 7, pp. 909912, Nov. 2010. of differential alternative splicing from RNA-seq data, Nucleic Acids
[72] M. G. Grabherr et al., Full-length transcriptome assembly from RNA- Res., vol. 40, Apr. 2012, Art. no. R74.
seq data without a reference genome, Nature Biotechnol., vol. 29, [99] K. Liang and S. Keles, Detecting differential binding of transcrip-
pp. 644652, Jul. 2011. tion factors with ChIP-seq, Bioinformatics, vol. 28, pp. 121122, Jan.
[73] M. Guttman et al., Ab initio reconstruction of cell type-specific tran- 2012.
scriptomes in mouse reveals the conserved multi-exonic structure of [100] H. Xu et al., An HMM approach to genome-wide identification of dif-
lincRNAs, Nature Biotechnol., vol. 28, pp. 503510, May 2010. ferential histone modification sites from ChIP-seq data, Bioinformatics,
[74] Y. Zhang et al., Model-based analysis of ChIP-Seq (MACS), Genome vol. 24, pp. 23442349, Oct. 2008.
Biol., vol. 9, Sep. 2008, Art. no. R137. [101] Y. Zhang et al., QDMR: A quantitative method for identification
[75] R. Jothi et al., Genome-wide identification of in vivo protein- of differentially methylated regions by entropy, Nucleic Acids Res.,
DNA binding sites from ChIP-Seq data, Nucleic Acids Res., vol. 36, vol. 39, May 2011, Art. no. e58.
pp. 52215231, Sep. 2008. [102] C. D. Kaddi et al., DetectTLC: Automated reaction mixture screening
[76] M. Sturm et al., OpenMS An open-source software framework for utilizing quantitative mass spectrometry image features, J. Amer. Soc.
mass spectrometry, BMC Bioinformat., vol. 9, Mar. 2008, Art. no. 163. Mass. Spectrometry, vol. 27, pp. 359365, Oct. 2015.
[77] T. Pluskal et al., MZmine 2: Modular framework for processing, visu- [103] T. Wang et al., Automics: An integrated platform for NMR-based
alizing, and analyzing mass spectrometry-based molecular profile data, metabonomics spectral processing and data analysis, BMC Bioinfor-
BMC Bioinformat., vol. 11, Jul. 2010, Art. no. 395. matics, vol. 10, Mar. 2009, Art. no. 83.
[78] S. Pepke et al., Computation for ChIP-seq and RNA-seq studies, [104] J. Xia et al., MetaboAnalyst 2.0 A comprehensive server
Nature Methods, vol. 6, pp. S22S32, Nov. 2009. for metabolomic data analysis, Nucleic Acids Res., vol. 40,
[79] S. Pabinger et al., A survey of tools for variant analysis of next- pp. W127W133, Jul. 2012.
generation genome sequencing data, Brief Bioinformat., vol. 15, [105] W. S. Bush and J. H. Moore, Chapter 11: Genome-wide association
pp. 256278, Mar. 2014. studies, PloS Comput. Biol., vol. 8, Dec. 2012, Art. no. e1002822.
[80] C. Alkan et al., Genome structural variation discovery and genotyping, [106] P. Langfelder and S. Horvath, WGCNA: An R package for weighted
Nature Rev. Genetics, vol. 12, pp. 363375, May 2011. correlation network analysis, BMC Bioinformat., vol. 9, Dec. 2008,
[81] M. Zhao et al., Computational tools for copy number variation (CNV) Art. no. 559.
detection using next-generation sequencing data: Features and perspec- [107] H. Y. Hu et al., Mining coherent dense subgraphs across massive
tives, BMC Bioinformat., vol. 14, Sep. 2013, Art. no. S1. biological networks for functional discovery, Bioinformatics, vol. 21,
[82] Z. Chang et al., Bridger: A new framework for de novo transcrip- pp. I213I221, Jun. 2005.
tome assembly using RNA-seq data, Genome Biol., vol. 16, Feb. 2015, [108] G. Ciriello et al., Mutual exclusivity analysis identifies oncogenic
Art. no. 30. network modules, Genome Res., vol. 22, pp. 398406, Feb. 2012.
[83] M. Pertea et al., StringTie enables improved reconstruction of [109] A. Funahashi et al., CellDesigner 3.5: A versatile modeling tool for
a transcriptome from RNA-seq reads, Nature Biotechnol., vol. 33, biochemical networks, Proc. IEEE, vol. 96, no. 8, pp. 12541265, Aug.
pp. 290295, Mar. 2015. 2008.
[84] J. A. Martin and Z. Wang, Next-generation transcriptome assembly, [110] U. Wilensky. 1999. NetLogo. [Online]. Available: http://ccl.northwes-
Nature Rev. Genetics, vol. 12, pp. 671682, Oct. 2011. tern.edu/netlogo/
[85] L. N. Mueller et al., An assessment of software solutions for the analysis [111] C. Mussel et al., BoolNet An R package for generation, recon-
of mass spectrometry based quantitative proteomics data, J. Proteome struction and analysis of Boolean networks, Bioinformatics, vol. 26,
Res., vol. 7, pp. 5161, Jan. 2008. pp. 13781380, May 2010.
[86] W. B. Dunn et al., Procedures for large-scale metabolic profiling of [112] C. Rohr et al., Snoopy A unifying Petri net framework to investi-
serum and plasma using gas chromatography and liquid chromatography gate biomolecular networks, Bioinformatics, vol. 26, pp. 974975, Apr.
coupled to mass spectrometry, Nature Protocols, vol. 6, pp. 10601083, 2010.
Jun. 2011. [113] A. R. Joyce and B. O. Palsson, The model organism as a system:
[87] T. Alexandrov, MALDI imaging mass spectrometry: Statistical data integrating omics data sets, Nature Rev. Molecular Cell Biol., vol. 7,
analysis and current computational challenges, BMC Bioinformat., pp. 198210, Mar. 2006.
vol. 13, Nov. 2012, Art. no. S11. [114] A. L. Barabasi et al., Network medicine: A network-based ap-
[88] C. Yang et al., Comparison of public peak detection algorithms for proach to human disease, Nature Rev. Genetics, vol. 12, pp. 5668,
MALDI mass spectrometry data analysis, BMC Bioinformat., vol. 10, Jan. 2011.
Jan. 2009, Art. no. 4. [115] M. Vidal et al., Interactome networks and human disease, Cell,
[89] J. R. Gonzalez et al., SNPassoc: An R package to perform whole vol. 144, pp. 986998, Mar. 2011.
genome association studies, Bioinformatics, vol. 23, pp. 644645, Mar. [116] H. de Jong, Modeling and simulation of genetic regulatory systems: A
2007. literature review, J. Comput. Biol., vol. 9, pp. 67103, Jul. 2002.
[90] J. Marchini et al., A new multipoint method for genome-wide asso- [117] H. Doerfler et al., Granger causality in integrated GC-MS and LC-
ciation studies by imputation of genotypes, Nature Genetics, vol. 39, MS metabolomics data reveals the interface of primary and secondary
pp. 906913, Jul. 2007. metabolism, Metabolomics, vol. 9, pp. 564574, Jun. 2013.
[91] G. T. Wang et al., Variant association tools for quality control and [118] J. Lee et al., Interrogating a clinical database to study treatment
analysis of large-scale sequence and genotyping array data, Amer. J. of hypotension in the critically ill, BMJ Open, vol. 2, Jun. 2012,
Human Genetics, vol. 94, pp. 770783, May 2014. Art. no. e000916.
[92] S. Purcell et al., PLINK: A tool set for whole-genome association and [119] J. Lee and R. G. Mark, An investigation of patterns in hemodynamic
population-based linkage analyses, Amer. J. Human Genetics, vol. 81, data indicative of impending hypotension in intensive care, Biomed.
pp. 559575, Sep. 2007. Eng. Online, vol. 9, Oct. 2010, Art. no. 62.
WU et al.: OMIC AND ELECTRONIC HEALTH RECORD BIG DATA ANALYTICS FOR PRECISION MEDICINE 273
[120] E. C. Lau et al., Use of electronic medical records (EMR) for oncol- [145] H. Demirkan and D. Delen, Leveraging the capabilities of service-
ogy outcomes research: Assessing the comparability of EMR informa- oriented decision support systems: Putting analytics and big data in
tion to patient registry and health claims data, Clin. Epidemiol., vol. 3, cloud, Decision Support Syst., vol. 55, pp. 412421, May 2013.
pp. 259272, Oct. 2011. [146] Illumina. BaseSpace: Genomics Cloud Computing. [Online]. Available:
[121] G. Allasia, Approximating potential integrals by cardinal basis inter- https://basespace.illumina.com
polants on multivariate scattered data, Comput. Math. Appl., vol. 43, [147] E. Afgan et al., Harnessing cloud computing with galaxy cloud, Nature
pp. 275287, Feb.-Mar. 2002. Biotechnol., vol. 29, pp. 972974, Nov. 2011.
[122] C. A. Welch et al., Evaluation of two-fold fully conditional specifica- [148] A. Wright et al., Creating and sharing clinical decision support con-
tion multiple imputation for longitudinal electronic health record data, tent with Web 2.0: Issues and examples, J. Biom. Informat., vol. 42,
Statist. Med., vol. 33, pp. 37253737, Sep. 2014. pp. 334346, 2009.
[123] Q. Li et al., Robust heart rate estimation from multiple asynchronous [149] T. C. G. Atlas. [Online]. Available: http://cancergenome.nih.gov/
noisy sources using signal quality indices and a Kalman filter, Physiol. [150] C. D. Kaddi and M. D. Wang, Models for predicting stage in head
Meas., vol. 29, pp. 1532, Jan. 2008. and neck squamous cell carcinoma using proteomic data, in Proc. 36th
[124] W. Zong et al., Reduction of false arterial blood pressure alarms using Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., 2014, pp. 52165219.
signal quality assessment and relationships between the electrocardio- [151] K. A. Hoadley et al., Multiplatform analysis of 12 cancer types re-
gram and arterial blood pressure, Med. Biol. Eng. Comput., vol. 42, veals molecular classification within and across tissues of origin, Cell,
pp. 698706, Sep. 2004. vol. 158, pp. 929944, 2014.
[125] I. Silva et al., Signal quality estimation with multichannel adaptive [152] R. G. Verhaak et al., Integrated genomic analysis identifies clini-
filtering in intensive care settings, IEEE Trans. Biomed. Eng., vol. 59, cally relevant subtypes of glioblastoma characterized by abnormalities
no. 9, pp. 24762485, Sep. 2012. in PDGFRA, IDH1, EGFR, and NF1, Cancer Cell, vol. 17, pp. 98110,
[126] J. M. Feldman et al., Robust sensor fusion improves heart rate esti- Jan. 19, 2010.
mation: Clinical evaluation, J. Clin. Monitoring, vol. 13, pp. 379384, [153] Cancer Genome Atlas Network, Comprehensive molecular portraits of
Nov. 1997. human breast tumours, Nature, vol. 490, pp. 6170, Oct. 4, 2012.
[127] T. Wartzek et al., Robust sensor fusion of unobtrusively measured heart [154] Cancer Genome Atlas Research Network, Comprehensive genomic
rate, IEEE J. Biomed. Health Informat., vol. 18, no. 2, pp. 654660, Mar. characterization of squamous cell lung cancers, Nature, vol. 489,
2014. pp. 51925, Sep. 27, 2012.
[128] J. Labar`ere et al., How to derive and validate clinical prediction mod- [155] O. Gottesman et al., The electronic medical records and genomics
els for use in intensive care medicine, Intensive Care Med., vol. 40, (eMERGE) network: Past, present, and future, Genetics Med., vol. 15,
pp. 513527, Apr. 2014. pp. 761771, Oct. 2013.
[129] J. L. Schafer, Multiple imputation: A primer, Statist. Methods Med. [156] K. Marsolo and S. A. Spooner, Clinical genomics in the world of the
Res., vol. 8, pp. 315, 1999. electronic health record, Genetics Med., vol. 15, pp. 786791, Oct. 2013.
[130] J. Tian et al., Missing data analyses: A hybrid multiple imputation [157] J. L. Kannry and M. S. Williams, Integration of genomics into the
algorithm using Gray System Theory and entropy based on clustering, electronic health record: Mapping terra incognita, Genetics Med.,
Appl. Intell., vol. 40, pp. 376388, Mar. 2014. vol. 15, pp. 757760, Jun. 2013.
[131] K. A. Hallgren and K. Witkiewitz, Missing data in alcohol clinical [158] A. G. Ury, Storing and interpreting genomic information in widely
trials: A comparison of methods, Alcohol Clin. Exp. Res., vol. 37, deployed electronic health record systems, Genetics Med., vol. 15,
pp. 21522160, Dec. 2013. pp. 779785, Aug. 2013.
[132] N. Stadler et al., Pattern alternating maximization algorithm for miss- [159] C. A. Caligtan and P. C. Dykes, Electronic health records and personal
ing data in high-dimensional problems, J. Mach. Learn. Res., vol. 15, health records, Semin. Oncol. Nursing, vol. 27, pp. 218228, Aug. 2011.
pp. 19031928, Jan. 2014. [160] G. Alterovitz et al., SMART on FHIR Genomics: Facilitating stan-
[133] L. Fuchs et al., ICU admission characteristics and mortality rates dardized clinico-genomic apps, J. Amer. Med. Informat. Assoc., vol. 22,
among elderly and very elderly patients, Intensive Care Med., vol. 38, pp. 11738, Nov. 2015.
pp. 16541661, Oct. 2012. [161] NIH-BMIC. [Online]. Available: https://www.nlm.nih.gov/NIHbmic/
[134] Y. Li et al., A method for controlling complex confounding effects in [162] WHO. [Online]. Available: http://apps.who.int/gho/data/node.main
the detection of adverse drug reactions using electronic health records, [163] NCHS. [Online]. Available: http://www.healthindicators.gov/
J. Amer. Med. Informat. Assoc., vol. 21, pp. 308314, Mar. 2014. [164] H. C. Chen et al., Business intelligence and analytics: From big data to
[135] R. V. Andreao et al., ECG signal analysis through hidden Markov big impact, MIS Quart., vol. 36, pp. 11651188, Dec. 2012.
models, IEEE Trans. Biomed. Eng., vol. 53, no. 8, pp. 15411549, Aug. [165] M. Chen et al., Big data: A survey, Mobile Netw. Appl., vol. 19,
2006. pp. 171209, Apr. 2014.
[136] G. de Lannoy et al., Weighted conditional random fields for supervised [166] H. J. Murff et al., Automated identification of postoperative com-
interpatient heartbeat classification, IEEE Trans. Biomed. Eng., vol. 59, plications within an electronic medical record using natural language
no. 1, pp. 241247, Jan. 2012. processing, JAMA, vol. 306, pp. 848855, 2011.
[137] J. Klema et al., Sequential data mining: A comparative case study in [167] C. A. McCarty et al., The emerge network: A consortium of biorepos-
development of atherosclerosis risk factors, IEEE Trans. Syst., Man, itories linked to electronic medical records data for conducting genomic
Cybern. C, Appl. Rev., vol. 38, no. 1, pp. 315, Jan. 2008. studies, BMC Med. Genomics, vol. 4, 2011, Art. no. 13.
[138] R. Harpaz et al., Novel data-mining methodologies for adverse drug [168] L. Ricciardi et al., A national action plan to support consumer engage-
event discovery and analysis, Clin. Pharmacol. Therapeutics, vol. 91, ment via e-health, Health Affairs, vol. 32, pp. 376384, 2013.
pp. 10101021, Jun. 2012. [169] M. K. Keck et al., Integrative analysis of head and neck cancer iden-
[139] S. K. Harms and J. S. Deogun, Sequential association rule mining with tifies two biologically distinct HPV and three non-HPV subtypes, Clin.
time lags, J. Intell. Inf. Syst., vol. 22, pp. 722, Jan. 2004. Cancer Res., vol. 21, pp. 870881, Feb. 15, 2015.
[140] H. Chen et al., Business intelligence and analytics: From big data to [170] C. Vogel and E. M. Marcotte, Insights into the regulation of protein
big impact, MIS Quart., vol. 36, pp. 11651188, Dec. 2012. abundance from proteomic and transcriptomic analyses, Nature Rev.
[141] O. Ola and K. Sedig, The challenge of big data in public health: An Genetics, vol. 13, pp. 227232, Apr. 2012.
opportunity for visual analytics, Online J. Public Health Informat., [171] J. Andreu-Perez et al., Big data for health, IEEE J. Biom. Health
vol. 5, Feb. 2014, Art. no. 223. Informat., vol. 19, no. 4, pp. 11931208, Jul. 2015.
[142] R. C. Taylor, An overview of the Hadoop/MapReduce/HBase frame- [172] M. Cvach, Monitor alarm fatigue: An integrative review, Biomed. In-
work and its current applications in bioinformatics, BMC Bioinformat., strum. Technol., vol. 46, pp. 268277, Jul. 2012.
vol. 11, Dec. 2010, Art. no. S1. [173] R. Martis et al., Wavelet-based machine learning techniques for
[143] A. Biem et al., IBM infosphere streams for scalable, real-time, intelli- ECG signal analysis, in Machine Learning in Healthcare Informatics,
gent transportation services, in Proc. ACM SIGMOD Int. Conf. Manage. vol. 56, S. Dua, U. R. Acharya, and P. Dua, Eds. Berlin, Germany:
Data, 2010, pp. 10931104. Springer, 2014, pp. 2545.
[144] M. Zaharia et al., Spark: Cluster computing with working sets, in
Authors photograph and biography not available at the time of publica-
Proc. 2nd USENIX Conf. Hot Topics Cloud Comput., 2010, pp. 1010.
tion.