
MINING HEALTH EXAMINATION RECORDS A GRAPH BASED APPROACH

LITERATURE SURVEY

ENSEMBLE ROUGH HYPERCUBOID APPROACH FOR CLASSIFYING CANCERS

Jin-Mao Wei, Nankai University, Tianjin, and Northeast Normal University, Jilin

Cancer classification is the critical basis for patient-tailored therapy. Conventional histological analysis tends to be unreliable because different tumors may have a similar appearance. Advances in microarray technology make individualized therapy possible, and various machine learning methods can be employed to classify cancer tissue samples based on microarray data.
However, few methods can readily generate rules that are accurate, reliable, and biologically interpretable.
In this paper, we introduce an approach for classifying cancers based on the principle of
minimal rough fringe. To train rough hypercuboid classifiers from gene expression data sets, the method dynamically evaluates all available genes and selects those with the smallest implicit regions as the dimensions of the implicit hypercuboids.
An unseen object is predicted to belong to a class if it falls within that class's hypercuboid. Based upon this method, ensemble rough hypercuboid classifiers are subsequently constructed. Experimental results on several public cancer gene expression data sets show that the proposed method generates more accurate and interpretable rules than several other machine learning methods, and is therefore a feasible way of classifying cancer tissues in biomedical applications.
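To make the classification rule above concrete, the following is a minimal sketch (assuming numpy arrays of expression values): each class is represented by an axis-aligned box spanning the min/max of a few selected genes, and an unseen sample is assigned to the class whose box contains it. The rough-set machinery for choosing genes with minimal implicit regions is the paper's actual contribution and is not reproduced here; all function names and inputs are illustrative only.

import numpy as np

def fit_hypercuboids(X, y, selected_genes):
    """For each class label, record the min/max expression of the selected genes over its training samples."""
    boxes = {}
    for label in np.unique(y):
        Xc = X[y == label][:, selected_genes]
        boxes[label] = (Xc.min(axis=0), Xc.max(axis=0))
    return boxes

def predict(sample, boxes, selected_genes):
    """Assign the class whose hypercuboid contains the sample; None if no box matches."""
    values = sample[selected_genes]
    for label, (low, high) in boxes.items():
        if np.all(values >= low) and np.all(values <= high):
            return label
    return None
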
COGBOOST: BOOSTING FOR FAST COST-SENSITIVE GRAPH CLASSIFICATION
Shirui Pan, College of Information Engineering, Northwest A&F University, Yangling, China
Graph classification has drawn great interest in recent years due to the increasing
number of applications involving objects with complex structure relationships. To date, all
existing graph classification algorithms assume, explicitly or implicitly, that misclassifying
instances in different classes incurs an equal amount of cost (or risk), which is often not the case
in real-life applications (where misclassifying certain classes of samples, such as diseased patients, incurs a higher cost than misclassifying others).
Although cost-sensitive learning has been extensively studied, existing methods assume data with an instance-feature representation. Graphs, however, do not have such features readily available for learning; the feature space of graph data is potentially infinite and needs to be carefully explored in order to favor classes with a higher cost.
In this paper, we propose CogBoost, a fast cost-sensitive graph classification algorithm,
which aims to minimize the misclassification costs (instead of the errors) and achieve fast
learning speed for large scale graph data sets.
To minimize the misclassification costs, CogBoost iteratively selects the most
discriminative subgraph by considering costs of different classes, and then solves a linear
programming problem in each iteration using an optimal loss function based on the Bayes decision rule.
In addition, a cutting plane algorithm is derived to speed up the solving of linear programs for
fast learning on large scale data sets. Experiments and comparisons on real-world large graph
data sets demonstrate the effectiveness and the efficiency of our algorithm.
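As a rough illustration of the cost-sensitive idea described above (not of CogBoost itself), the sketch below shows how class-specific misclassification costs could drive a boosting-style re-weighting step so that expensive classes, such as diseased patients, receive progressively more attention. The subgraph mining, linear program, and cutting plane solver are omitted, and all names here are hypothetical.

import numpy as np

def cost_sensitive_reweight(weights, y_true, y_pred, cost):
    """weights: current sample weights (numpy array); cost: dict mapping class label -> misclassification cost."""
    per_sample_cost = np.array([cost[label] for label in y_true])
    miss = (y_true != y_pred).astype(float)
    new_w = weights * np.exp(per_sample_cost * miss)  # costly classes gain weight faster when misclassified
    return new_w / new_w.sum()                        # renormalise to a probability distribution
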
EXTRACTION OF INTERPRETABLE MULTIVARIATE PATTERNS FOR EARLY
DIAGNOSTICS
Mohamed F. Ghalwash, Center for Data Anal. & Biomed. Inf., Temple Univ., Philadelphia, PA, USA
Vladan Radosavljevic, Center for Data Anal. & Biomed. Inf., Temple Univ., Philadelphia, PA, USA
Zoran Obradovic, Center for Data Anal. & Biomed. Inf., Temple Univ., Philadelphia, PA, USA

Leveraging temporal observations to predict a patient's health state at a future period is a very challenging task. Providing such a prediction early and accurately allows for designing a more successful treatment that starts before a disease completely develops. Information for this kind of early diagnosis can be extracted using temporal data mining methods that handle complex multivariate time series.
However, physicians usually prefer to use interpretable models that can be easily
explained, rather than relying on more complex black-box approaches.
In this study, a temporal data mining method is proposed for extracting interpretable
patterns from multivariate time series data, which can be used to assist in providing interpretable
early diagnosis. The problem is formulated as an optimization based binary classification task
addressed in three steps.
First, the time series data is transformed into a binary matrix representation suitable for
application of classification methods. Second, a novel convex-concave optimization problem is
defined to extract multivariate patterns from the constructed binary matrix. Then, a mixed integer
discrete optimization formulation is provided to reduce the dimensionality and extract
interpretable multivariate patterns.
Finally, those interpretable multivariate patterns are used for early classification in
challenging clinical applications. In the conducted experiments on two human viral infection
datasets and a larger myocardial infarction dataset, the proposed method was more accurate and
provided classifications earlier than three alternative state-of-the-art methods.
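The first step described above, transforming a multivariate time series into a binary matrix, could look like the following sketch, which simply thresholds each variable at every time point; the thresholds and the example data are placeholders rather than the paper's actual construction.

import numpy as np

def binarize_series(series, thresholds):
    """series: (n_timesteps, n_variables) array; thresholds: one cut-off per variable.
    Returns a 0/1 matrix marking where each variable exceeds its threshold."""
    return (series > thresholds[np.newaxis, :]).astype(int)

# Example: 5 time points, 3 clinical variables, arbitrary thresholds.
ts = np.random.rand(5, 3)
binary_matrix = binarize_series(ts, np.array([0.5, 0.3, 0.7]))
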
STABILIZED SPARSE ORDINAL REGRESSION FOR MEDICAL RISK
STRATIFICATION

Truyen Tran, Dinh Phung, Wei Luo, Svetha Venkatesh


Knowledge & Information Systems, June 2015, Vol. 43, Issue 3, p. 555

The recent wide adoption of Electronic Medical Records (EMR) presents great
opportunities and challenges for data mining. The EMR data is largely temporal, often noisy,
irregular and high dimensional. This paper constructs a novel ordinal regression framework
for predicting medical risk stratification from EMR.
First, a conceptual view of EMR as a temporal image is constructed to extract a diverse
set of features. Second, ordinal modeling is applied for predicting cumulative or progressive
risk. The challenges are building a transparent predictive model that works with a large
number of weakly predictive features, and at the same time, is stable against re-sampling
variations.
Our solution employs sparsity methods that are stabilized through domain-specific
feature interaction networks. We introduce two indices that measure the model stability
against data re-sampling. Feature networks are used to generate two multivariate Gaussian
priors with sparse precision matrices (the Laplacian and Random Walk).
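A brief sketch of how a feature interaction network can act as a Gaussian prior with a sparse precision matrix: the graph Laplacian L of the network yields a quadratic penalty w^T L w that pulls the weights of connected features toward each other, combined with an L1 term for sparsity. A squared loss stands in for the paper's ordinal model, and the names and regularization constants below are illustrative assumptions.

import numpy as np

def graph_laplacian(adjacency):
    """L = D - A for a symmetric feature-interaction adjacency matrix A."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

def penalized_objective(w, X, y, L, lam1, lam2):
    """Squared-loss placeholder + L1 sparsity + network-smoothness penalty w^T L w."""
    residual = X @ w - y
    return residual @ residual + lam1 * np.abs(w).sum() + lam2 * (w @ L @ w)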

LEARNING PHENOTYPE STRUCTURE USING SEQUENCE MODEL

Yuhai Zhao, Sch. of Inf. Sci. & Eng., Northeastern Univ., Shenyang, China
Guoren Wang, Sch. of Inf. Sci. & Eng., Northeastern Univ., Shenyang, China
Xiang Zhang, Dept. of Electr. Eng. & Comput. Sci., Case Western Reserve Univ., Cleveland, OH, USA

Advanced microarray technologies have made it possible to simultaneously monitor the expression levels of all genes. An important problem in microarray data analysis is to discover phenotype structures.
The goal is to
1) find groups of samples corresponding to different phenotypes (such as disease or
normal), and
2) for each group of samples, find the representative expression pattern or signature that
distinguishes this group from others.
Several methods have been proposed for this problem; however, a common drawback is that the identified signatures often include a large number of genes with low discriminative power.
In this paper, we propose a g*-sequence model to address this limitation, in which the relative order of expression values among genes is exploited. Compared with existing methods, the proposed sequence model is more robust to noise and can discover signatures with greater discriminative power using fewer genes, which is important for the subsequent analysis by biologists.
We prove that the problem of phenotype structure discovery is NP-complete. An efficient
algorithm, FINDER, is developed, which includes three steps:
1) trivial g*-sequence identification,
2) phenotype structure discovery, and
3) refinement. Effective pruning strategies are developed to further improve the
efficiency.
We evaluate the performance of FINDER and the existing methods using both synthetic
and real gene expression data sets. Extensive experimental results show that FINDER
dramatically improves the accuracy of the phenotype structures discovered (in terms of both
statistical and biological significance) and detects signatures with high discriminative power.
Moreover, it is orders of magnitude faster than other alternatives.
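To illustrate the ordered-expression idea behind the g*-sequence model, the sketch below summarizes each sample by the rank order of a few candidate genes, an encoding that is insensitive to monotone noise; the FINDER mining and pruning steps themselves are not reproduced, and the helper names are hypothetical.

import numpy as np

def expression_order(sample, gene_indices):
    """Return the given genes sorted by ascending expression in this sample."""
    values = sample[gene_indices]
    return tuple(np.asarray(gene_indices)[np.argsort(values)])

def shared_order(samples, gene_indices):
    """Candidate signature: the common ordering if every sample in the group agrees on it."""
    orders = {expression_order(s, gene_indices) for s in samples}
    return orders.pop() if len(orders) == 1 else None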
