
A data mining model of knowledge discovery based on deep learning

Yonglin Ma¹, Yuanhua Tan², Chaolin Zhang², and Yici Mao³
¹ Application Management Office of SINOPEC IT Management Department, Beijing, 100728, China
² Karamay Hongyou Software Co., Xinjiang, 834000, China
³ Karamay Municipal People's Government Bureau of Information Industry, Xinjiang, 834000, China
E-mail: mayl@sinopec.com (Y. Ma), tanyh66@petrochina.com.cn (Y. Tan),
lin_2728@126.com (C. Zhang), 18909901617@126.com (Y. Mao)

Abstract—With the development of database technology and the spread of the internet, the amount of data in databases increases at an exponential speed, which yields difficult problems such as excess data and information explosion. Traditional database technology is restricted to reading, writing, querying and basic statistical operations, and cannot acquire deep data attributes or implicit information. Faced with the huge databases in all kinds of fields, it is more and more difficult to cope with big data using conventional technology alone, and new techniques that can handle these data at a higher level are eagerly demanded. Therefore, KDD (Knowledge Discovery in Databases) technology arises at this historic moment. KDD is an integrated process, which includes data input, iterative solving, user interface and many other custom requirements and design decisions, in which data mining (DM) is a key and specific step. This paper analyzes the state of the art of DM technology, and points out the challenges and the technological bottleneck of DM. Moreover, a data mining model architecture of knowledge discovery based on deep learning is proposed.

Keywords—KDD; data mining model; deep learning

I. INTRODUCTION

In recent years, with the unceasing development of computer and database technology, people have entered an entirely new era where big data is everywhere [1]. The data volume of different industries and businesses is accumulating at a dramatic pace. According to statistics, the data transferred from NASA satellites in orbit to ground monitoring stations exceeds 50 GB every minute, and the number of transaction items of Alibaba in China is over 200 million every day. In addition, the data volumes of meteorological records, medical treatment and scientific research are also increasing at a nearly exponential speed. How to find effective and useful information in a mass of fuzzy, noisy databases is a critical problem that urgently needs to be solved [2]-[7]. Traditional database technology can only perform reading, writing, querying and other basic statistical operations, but does not have the ability to explore the deep relationships in data, let alone extract high-level knowledge. Fortunately, KDD provides a reasonable idea for solving these problems [8].

KDD is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources [4]. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inference. KDD can transform data into knowledge, and DM plays a key role in this transformation process. DM usually refers to extracting effective information from a mass of data by using all kinds of algorithms [2]. In most cases, DM has a close relationship with computer science, using statistical algorithms, on-line analytic processing, information retrieval, machine learning, expert systems, pattern recognition and other recent techniques to achieve this assignment [9]. When it comes to the relationship between KDD and DM, researchers hold different opinions. In our opinion, KDD refers to the overall process of discovering useful knowledge from data, while DM refers to a particular step in this process [10],[11]: the application of specific algorithms for extracting patterns from a mass of data. The process of KDD is shown in Fig. 1.

Fig. 1. The process of KDD. According to information transfer theory, the information amount cannot increase during a transform; the information amount therefore decreases, but the information level becomes higher.

According to Fig. 1, it is easily found that in the process from data source to knowledge, the information level increases even though the information amount decreases, and the information becomes easier and easier to analyze and process. The importance of DM in the KDD process is also easy to see; thus, the DM model is the focus of this paper. Facing such enormous data, human analysts with no special tools can no longer make sense of it. However, data mining can automate the process of finding relationships and patterns in raw data, and the results can be further utilized in an automated decision support system.
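The KDD chain of Fig. 1 can be made concrete with a minimal sketch. The toy transaction records and the three stage functions below are our own illustrative assumptions, not code from the paper; the point is only that each stage reduces the data amount while raising the information level:

```python
# Minimal sketch of the KDD chain in Fig. 1. Records and stages are
# illustrative assumptions: each stage shrinks the data amount while
# raising the information level.

raw = [
    {"user": "u1", "item": "milk",  "qty": 2},
    {"user": "u2", "item": "milk",  "qty": 1},
    {"user": "u1", "item": "bread", "qty": None},   # noisy record
    {"user": "u3", "item": "eggs",  "qty": 4},
    {"user": "u2", "item": "milk",  "qty": 3},
]

def select(records):
    # Data selection: keep only the fields relevant to the mining task.
    return [{"item": r["item"], "qty": r["qty"]} for r in records]

def preprocess(records):
    # Cleaning: drop noisy/incomplete records.
    return [r for r in records if r["qty"] is not None]

def mine(records):
    # "Mining": extract a simple frequency pattern from the cleaned data.
    counts = {}
    for r in records:
        counts[r["item"]] = counts.get(r["item"], 0) + 1
    return counts

patterns = mine(preprocess(select(raw)))
best = max(patterns, key=patterns.get)    # "knowledge": most frequent item
print(patterns, best)                     # {'milk': 3, 'eggs': 1} milk
```

Five raw records become four cleaned ones and finally a single extracted pattern, mirroring the amount-down, level-up arrow of Fig. 1.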

978-1-4799-8389-6/15/$31.00 ©2015 IEEE
Fig. 2. The hierarchical structure of the visual perception system in the brain. The input information is pixels; after a process of iteration and abstraction, the input is transformed into high-level patterns.
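The iterative abstraction pictured in Fig. 2 can be imitated numerically by stacking simple nonlinear layers that map a raw "pixel" vector into progressively shorter, more abstract codes. The layer sizes and random weights below are purely illustrative assumptions, not the paper's network:

```python
import numpy as np

# Illustrative only: layer sizes and random weights are our own assumptions.
rng = np.random.default_rng(0)

def layer(x, w):
    # One stage of abstraction: linear mixing followed by a nonlinearity.
    return np.tanh(w @ x)

x = rng.random(64)                    # raw "pixel" input vector
sizes = [64, 32, 16, 4]               # each level is smaller and more abstract
weights = [rng.standard_normal((m, n)) * 0.1
           for n, m in zip(sizes, sizes[1:])]

h = x
for w in weights:                     # iterate: pixels -> ... -> high-level code
    h = layer(h, w)

print(h.shape)                        # (4,)
```

The 64-dimensional input ends up as a 4-dimensional code, the same pixels-to-patterns direction as the hierarchy in Fig. 2 (training the weights is of course what deep learning adds).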

Conventional data mining technology absorbs the ideas of classical algorithms in the AI field, and most DM models have a close relationship with AI algorithms [1]. Since the term "data mining" was introduced in the last century, it has achieved great development. Data mining is a natural consequence of the increasing use of computerized databases to store data and provide answers to business analysts [12]-[15]. In the 1960s, people only managed to do data collection, which merely meant recording the data, owing to the restrictions of computer technology and limited storage. In the 1980s, with the development of computer and database technology, storage devices became highly developed, and data analysis and querying technology also matured. Since the 21st century, people have begun to pay more attention to the deep-level information in data in order to predict what will happen next [16], and research on data mining has reached a peak. So far, data mining models can be mainly classified into the following categories: 1) models based on classical statistical algorithms; 2) models based on clustering; 3) models based on SVM; 4) models based on artificial networks. However, after years of research, most conventional DM models encounter a technological bottleneck that is hardly broken. The core problem is that the data analysis is not deep enough, so raw data cannot be transformed into knowledge successfully; much research even stops at the level of classification. In 2006, Hinton proposed greedy layer-wise training of deep networks [1]. His work brought new hope for deep-structure analytical optimization problems. Later, the multi-layer automatic encoder for deep structures was proposed, an outstanding breakthrough in the deep analysis of raw data. Deep learning research has boomed since then and also provides a new idea for DM model structure. In this paper, deep learning is applied in a DM model of knowledge discovery [16]-[17], and a novel DM model architecture for KDD based on deep learning is proposed.

In 1981, David Hubel and Torsten Wiesel found that the information processing of the visual perception system in the brain has a hierarchical structure, as shown in Fig. 2, and they carried out many experiments on cat brains to prove their conjecture. This result made people realize that the working pattern of the nervous system (nerve center and brain) was probably an unceasing iterative and abstract process, and deep learning is also based on this theory: it tries to simulate the perception system in the brain by using a multi-hierarchy structure. Unfortunately, most such experiments before 2006 failed; greedy layer-wise training then provided a reliable method for the implementation of deep learning [6]. In view of the latest work in deep learning, the deep belief network (DBN) is used in this paper [18], and supervised and unsupervised learning are integrated together in the proposed DM model based on deep learning.

II. CONVENTIONAL DM MODELS IN KDD

In this section, the conventional DM models are presented and compared. Moreover, the technological bottleneck of conventional DM models is pointed out.

A. Model based on classical statistical algorithms

At the preliminary development stage of DM, most models were based on classical statistical algorithms [19]-[21]. Among them, two representative models are the Naive Bayes model and the Decision Tree model [21]. Both are based on the analysis of given data, and a prediction comes out when an unknown variable enters the built model. A representative application of these two models is to solve well-known problems such as whether or not to play tennis according to the weather conditions [8]. Eqs. (1)-(2) show how the Naive Bayes model integrates the prior probability and the posterior probability in a probabilistic model:

$P(c \mid x) = \dfrac{P(x \mid c)\, P(c)}{P(x)}$  (1)

$P(c \mid X) \propto P(x_{1} \mid c) \times P(x_{2} \mid c) \times \cdots \times P(x_{n} \mid c) \times P(c)$  (2)

where P(x|c) is the likelihood, P(c) is the class prior probability, P(x) is the predictor prior probability, and P(c|x) is the posterior probability.

The Decision Tree model is based on the ID3 algorithm, which divides the raw data into sub-datasets while the branches and leaves of the decision tree gradually develop. The decision tree is built from top to bottom; the raw data are split into sub-datasets of samples according to a certain standard. ID3 uses entropy to measure the amount of uncertainty in a data set S; the entropy H(S) is defined as follows:

2015 IEEE 10th Conference on Industrial Electronics and Applications (ICIEA)
$H(S) = -\sum_{x \in X} p(x) \log_{2} p(x)$  (3)

where S is the current dataset whose entropy is being calculated, X is the set of classes in S, and p(x) is the proportion of the number of elements in class x to the number of elements in S. In particular, if H(S) = 0, the subset S is perfectly classified; otherwise, S can be divided further. In that case, the ID3 algorithm uses information gain to measure the difference in entropy from before to after the subset S is split on an attribute A [19]. In ID3, the information gain is calculated for each remaining attribute, and the attribute with the largest information gain is used to split the subset S in this iteration.
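Eqs. (1)-(3) can be checked on a toy play-tennis table. The table values and the single "outlook" attribute are our own illustrative assumptions; the sketch computes the unnormalized Naive Bayes posterior of Eqs. (1)-(2) and the entropy and information gain used by ID3 per Eq. (3):

```python
import math
from collections import Counter

# Toy play-tennis table (values are an illustrative assumption, not from
# the paper): each row is (outlook, play).
rows = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
        ("overcast", "yes"), ("sunny", "yes")]

def entropy(labels):
    # Eq. (3): H(S) = -sum p(x) * log2 p(x) over the classes in S.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

h_s = entropy([play for _, play in rows])   # uncertainty of the whole set S

def info_gain(rows):
    # ID3 gain: H(S) minus the size-weighted entropy of the subsets
    # obtained by splitting on the "outlook" attribute.
    by_value = {}
    for outlook, play in rows:
        by_value.setdefault(outlook, []).append(play)
    n = len(rows)
    remainder = sum(len(ls) / n * entropy(ls) for ls in by_value.values())
    return entropy([p for _, p in rows]) - remainder

gain = info_gain(rows)

def nb_score(outlook, cls):
    # Eqs. (1)-(2): P(c|x) is proportional to P(x|c) * P(c).
    cls_rows = [r for r in rows if r[1] == cls]
    prior = len(cls_rows) / len(rows)                                  # P(c)
    likelihood = sum(1 for r in cls_rows if r[0] == outlook) / len(cls_rows)
    return likelihood * prior

pred = max(("yes", "no"), key=lambda c: nb_score("overcast", c))
print(round(h_s, 3), round(gain, 3), pred)   # 0.954 0.266 yes
```

On this table, splitting on "outlook" removes about 0.27 bits of uncertainty, and the Naive Bayes posterior favors "yes" for overcast days, matching the intuition behind both models.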
B. Model based on clustering

Clustering algorithms usually merge the input data according to center distance or hierarchical structure [23]-[25]. They try to find the similar attributes of the raw data so as to classify the data by comparing their similarity. Common clustering algorithms include the K-means algorithm and the expectation maximization (EM) algorithm. Distance is usually used to measure the similarity of data, and there are three common distance functions in clustering algorithms, defined as follows:

Euclidean: $d = \sqrt{\sum_{i=1}^{k} (x_{i} - y_{i})^{2}}$  (4)

Manhattan: $d = \sum_{i=1}^{k} |x_{i} - y_{i}|$  (5)

Minkowski: $d = \Big( \sum_{i=1}^{k} |x_{i} - y_{i}|^{q} \Big)^{1/q}$  (6)

A clustering algorithm is a reasonable choice when we have to classify the raw data into categories. However, there is no transformation from data to knowledge, which means that a clustering algorithm can only perform preprocessing for our data mining model.

C. Model based on SVM

SVM is a kind of regression algorithm. Several regression algorithms are commonly used in mathematical analysis [9], such as ordinary least squares, logistic regression, stepwise regression, multivariate adaptive regression splines, and locally estimated scatterplot smoothing. However, when dealing with high-dimensional data, these methods easily encounter problems. In contrast, SVM has a great advantage in dealing with high-dimensional data, and can divide data by hyperplanes [12], as shown in Fig. 3.

Fig. 3. Schematic of SVM, where the green points and the red points are separated into two regions by the support vectors.

The original optimal hyperplane algorithm proposed by Vapnik in 1963 was a linear classifier. Later, in 1992, B. E. Boser suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes, so SVM can solve both linear and nonlinear classification problems. In most cases, searching for the hyperplane is an optimization problem, as shown in Eq. (7),

$\arg\min_{w, \xi, b} \; \tfrac{1}{2} \|w\|^{2} + C \sum_{i=1}^{n} \xi_{i}$  (7)

subject to the constraint in Eq. (8),

$y_{i} (w \cdot x_{i} + b) \ge 1 - \xi_{i}, \quad i = 1, 2, \ldots, n$  (8)

in which C is the penalty coefficient and $\xi_{i}$ is a slack (relaxation) factor.

D. Model based on artificial networks

The artificial network derives from the idea of simulating biological neural networks and is an important branch of machine learning [18]-[24]. Some well-known artificial networks include the biological neural network, the back-propagation network, and the Hopfield network. In fact, all of these artificial networks share two main ideas, namely feedforward of the raw data and back propagation of the error, as shown in Fig. 4.

Fig. 4. Two main ideas in artificial networks: one is feedforward of the raw data, and the other is back propagation of the error.

By adjusting the input weights of the perceptrons and the feedback weights of the error, the network can reach its optimum state.

E. Technological bottleneck in DM

Though many DM models have been proposed, their development still encounters a technological bottleneck after years of research. The core problem is that most conventional DM models do not have deep analysis ability, and thus cannot simulate the thought of primates; these models mainly work on mathematical analysis rather than intelligent analysis. However, with the continuous increase of the data amount, it is more and more important to handle the raw data in a new way that can consider the deep relationships in the input data. Luckily, deep learning has some unique advantages compared with the mentioned models. A novel DM model architecture based on deep learning is proposed in this paper. The most important component of the model is the deep learning unit. This unit is

composed of deep belief networks (DBNs). The structure of DBNs and the training process are described in the next section.

III. DM ARCHITECTURE BASED ON DEEP LEARNING

A. Basic unit of the deep learning network

The main assignment of deep learning is to convert raw data into high-level knowledge [25]; in essence, it is a process of encoding. The raw data in a dataset differ in length, capacity and format, but after being encoded these data become uniform and easy to handle. In order to measure the accuracy of the coding, each encoder has a corresponding decoder that can restore the code to the raw data. If the restored result is identical to the raw data, or differs only slightly, the encoding process is effective; otherwise, it is probably invalid [26]. The decoder is not a necessary component of the final DM model, but it is necessary during the construction of the model. In this paper, a sparse auto-encoder is used to construct the encoder and decoder structure; that is to say, we attach an L1 regularity restriction on top of the auto-encoder. As shown in Fig. 5, the code and error are defined as in Eq. (9):

$\text{input: } X; \quad \text{code: } h = W^{T} X; \quad \text{error: } E(W, X) = \|Wh - X\|^{2} + \lambda \sum_{j} |h_{j}|$  (9)

Fig. 5. The scheme of the encoder and decoder; a sparse auto-encoder is applied in this process.

As shown in Fig. 5, the sparse auto-encoder makes the code result sparse, and the sparse expression is usually more effective. This mechanism is similar to the neural activity of primates, since most nerve cells in the nervous system are restrained, and only a few are in an excitatory state when external stimulation exists.

B. DBN structure

Suppose a two-layer diagram with no connections within the same layer, where the first layer is the visible (input) layer and the other is the hidden layer. If all the nodes are random binary variables and the probability distribution p(v, h) follows a Boltzmann distribution, such a structure is called a Restricted Boltzmann Machine (RBM) [25], as shown in Fig. 6.

Fig. 6. Structure of the Restricted Boltzmann Machine, where the bottom layer is the visible layer and the top layer is the hidden layer.

DBNs are made from multiple RBMs, and the joint distribution of a stack of l hidden layers factorizes as in Eq. (10):

$p(v, h^{1}, h^{2}, \ldots, h^{l}) = p(v \mid h^{1})\, p(h^{1} \mid h^{2}) \cdots p(h^{l-1} \mid h^{l})$  (10)

A typical network structure of a DBN is shown in Fig. 7. This network can be divided into a visible layer and some hidden layers; there are connections between layers, but no connections within a layer. The hidden-layer units are trained to capture the relationships of the high-dimensional data in the visible layer. In our proposed model based on deep learning, a DBN is used to construct an auto-learning network.

Fig. 7. Structure of DBNs, which is composed of a visible layer and some hidden layers.

In machine learning, DBNs can be viewed as a generative graphical model, or alternatively as a deep neural network composed of multiple layers [19]. From Fig. 7, a DBN can be considered a composition of simple, unsupervised networks such as RBMs or auto-encoders, where each sub-network's hidden layer serves as the visible layer for the next. This also leads to a fast, layer-by-layer unsupervised training procedure.

C. Deep learning training

Supervised learning and unsupervised learning are integrated together in the proposed DM model based on deep learning. In supervised learning, the input data are called training data, and each group of training data has an explicit label or corresponding result. Supervised learning is a process of comparing the prediction result with the real data and correcting the error manually; the learning model is adjusted until the prediction result reaches an expected accuracy [26]. On the contrary, in unsupervised learning the input data are not labeled, and the learning model tries to infer the inner connections of the input data. In practice, the data to be dealt with are enormous and it is impossible to label them all, so our aim is to build an auto-learning model for coping with unlabeled data. Ideally, the learning model should output accurate prediction results. However, this is only an ideal situation, and much new data can hardly be recognized in unsupervised

learning. To solve this problem, we set a recognition buffer in the proposed model. Once data cannot be recognized, it is added into the recognition buffer, and it enters the model again after a result is attached manually. With the above components, our DM model architecture based on deep learning is shown in Fig. 8.

Fig. 8. Architecture of the proposed DM model based on deep learning. Supervised learning and unsupervised learning are integrated together in this model; in particular, a recognition buffer unit is used to take user input into consideration.

IV. CONCLUSION AND FUTURE WORK

Data mining has been a hot research topic in recent years, driven by the urgent demand of handling big data. It is not an independent branch of the AI field but a subject that relates to statistics, machine learning, expert systems and many other recent techniques. So far, many DM models have been proposed, but to some extent these models are not successful because they cannot simulate the thought of primates; there is still a long way to go to reach the human blueprint in the robotic field.

Based on this investigation of state-of-the-art data mining technology, a novel DM model architecture based on deep learning is proposed in this paper. Compared with conventional DM models, the proposed model has deep analytical ability, and the raw data can be converted to knowledge more explicitly. The proposed DM model has broad application prospects and can be applied in weather forecasting, data trend analysis, commercial prediction and many other businesses or industries. The implementation of the proposed model will be carried out in our future work to prove its effectiveness.

V. ACKNOWLEDGEMENT

This research is funded by the National High Technology Research and Development Program of China (863 Program) under Grant No. 2013AA01A607 and the Twelfth Five-Year Plan for Xinjiang Manufacturing Information Technology Demonstration Project under Grant No. 201130110-3.

REFERENCES

[1] I. Nikolova, "Deep learning architecture for data mining from surgical data," Proceedings of the 35th International Convention, IEEE, 2012, 998-1002.
[2] G. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, 2006, 18(7), 1527-1554.
[3] S. Shin, Y. Guo, Y. Choi, Y. Choi, M. Choi, and C. Kim, "Development of a robust data mining method using CBFS and RSM," Perspectives of Systems Informatics, Springer Berlin Heidelberg, 2007, 377-388.
[4] C. Poultney, S. Chopra, and Y. L. Cun, "Efficient learning of sparse representations with an energy-based model," Advances in Neural Information Processing Systems, 2006, 1137-1144.
[5] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery in databases," AI Magazine, 1996, 17(3), 37.
[6] Y. Bengio, P. Lamblin, D. Popovici, et al., "Greedy layer-wise training of deep networks," Advances in Neural Information Processing Systems, 2007, 19, 153.
[7] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, 2006, 313(5786), 504-507.
[8] J. Han, "Data mining techniques," ACM SIGMOD Record, ACM, 1996, 25(2), 545.
[9] J. Weston and C. Watkins, "Support vector machines for multi-class pattern recognition," ESANN, 1999, 99, 219-224.
[10] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, et al., Advances in Knowledge Discovery and Data Mining, 1996.
[11] S. Zhang, C. Zhang, and Q. Yang, "Data preparation for data mining," Applied Artificial Intelligence: An International Journal, 2003, 17(5-6), 375-381.
[12] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, 1995, 20(3), 273-297.
[13] J. Wu and E. Chen, "A novel nonparametric regression ensemble for rainfall forecasting using particle swarm optimization technique coupled with artificial neural network," Advances in Neural Networks, Springer Berlin Heidelberg, 2009, 49-58.
[14] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992, 144-152.
[15] A. Aizerman, E. M. Braverman, and L. I. Rozoner, "Theoretical foundations of the potential function method in pattern recognition learning," Automation and Remote Control, 1964, 25, 821-837.
[16] C. Christopher, "Encyclopaedia Britannica: definition of data mining," retrieved 2010-12-09, 2010.
[17] T. Hastie, R. Tibshirani, J. Friedman, et al., "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, 2005, 27(2), 83-85.
[18] J. Han and M. Kamber, Data Mining, Southeast Asia Edition: Concepts and Techniques, Morgan Kaufmann, 2006. ISBN 9781558604896.
[19] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005. ISBN 9780123748560.
[20] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2011. ISBN 0471228524.
[21] K. Zhao, B. Liu, T. M. Tirpak, et al., "A visual data mining framework for convenient identification of useful knowledge," Fifth IEEE International Conference on Data Mining, 2005, 8 pp.
[22] H. P. Kriegel, P. Kröger, J. Sander, et al., "Density-based clustering," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2011, 1(3), 231-240.
[23] R. Agrawal, J. Gehrke, D. Gunopulos, et al., "Automatic subspace clustering of high dimensional data," Data Mining and Knowledge Discovery, 2005, 11(1), 5-33.
[24] E. Achtert, C. Böhm, H. P. Kriegel, et al., "Detection and visualization of subspace cluster hierarchies," Advances in Databases: Concepts, Systems and Applications, 2007, 152-163.
[25] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, 3642-3649.
[26] R. Salakhutdinov and H. Larochelle, "Efficient learning of deep Boltzmann machines," International Conference on Artificial Intelligence and Statistics, 2010, 693-700.

