
Data Types Generalization for Data Mining Algorithms*

Mon-Fong Jiang Shian-Shyong Tseng+ Shan-Yi Liao

Department of Computer and Information Science, National Chiao Tung University, Hsinchu 300, Taiwan, R.O.C. E-mail: sstseng@cis.nctu.edu.tw

Abstract
With the growth of database applications, mining interesting information from huge databases has become a major concern, and a variety of mining algorithms have been proposed in recent years. The data processed in data mining may be obtained from many sources that use different data types. However, no single algorithm can be applied to all applications, because each algorithm accepts only certain data types. The selection of an appropriate mining algorithm is therefore based not only on the goal of the application but also on data fittability, that is, on whether the data types fit the algorithm. Transforming a non-fitting data type into the target one is thus an important task in data mining, but it is often tedious or complex because many data types exist in the real world. Merging the similar data types for a selected mining algorithm into a generalized data type is a promising way to reduce the transformation complexity. In this work, the data types fittability problem for six kinds of widely used data mining techniques is discussed, and a data types generalization process consisting of a merging phase and a transforming phase is proposed. In the merging phase, the original data types of the data sources to be mined are first merged into generalized ones. The transforming phase then converts the generalized data types into the target ones for the selected mining algorithm. Using the data types generalization process, the user can select an appropriate mining algorithm based only on the goal of the application, without considering the data types.

1. Introduction
In recent years, the amount of data of all kinds has grown rapidly. Widely available, low-cost computer technology now makes it possible both to collect historical data and to institute on-line analysis of newly arriving data. Automated data generation and gathering leads to tremendous amounts of data stored in databases. Although we are flooded with data, we lack knowledge. Data mining [4, 9, 16] is the automated discovery of non-trivial, previously unknown, and potentially useful knowledge embedded in databases. Different kinds of data mining methods and algorithms have been proposed [4, 9], each of which has its own advantages and suitable application domains. However, it is difficult for users to choose an appropriate one by themselves. This is because the data provided cannot be used directly by data mining algorithms. Since most data mining algorithms can only be applied to some specific data types, the types of data stored in databases restrict the choice of data mining methods. If certain kinds of knowledge are to be obtained using some data mining algorithm, data types transformation must be done first; this is what we call the data types fittability problem for data mining. For the time being, there is no tool that helps users perform this kind of data types transformation. In this paper, we survey and analyze the data types fittability problem for data mining algorithms, and then propose a data types generalization process to solve the problem for the attributes in relational databases. The data types generalization process, which consists of a merging phase and a transforming phase, is a procedure for transforming the data types of the attributes contained in relations (tables). In the merging phase, the original data types of the data sources to be mined are first merged into generalized ones. The transforming phase then converts the generalized data types into the target ones for the selected mining algorithm. Using the data types generalization process, the user can select an appropriate mining algorithm based only on the goal of the application, without considering the data types.

2. Related work
As mentioned above, because many data mining algorithms can only be applied to a restricted range of data types, users may need to perform data types transformation before the selected algorithm is executed. In this paper, we propose a general concept called the data types generalization process, which provides a procedure for this kind of data types transformation. Data types generalization can be seen as a pre-processing step of data mining. Of course, other pre-processing steps, such as data selection, data cleaning, dimension (attribute) reduction, and missing data handling, may also need to be performed before running the selected data mining algorithm. In summary, the whole process of data mining is the so-called KDD (knowledge discovery in databases) process [5, 7], as shown in Figure 1.

* This work was supported by the National Science Council of the Republic of China under Grant No. NSC88-2213-E-009-078.

0-7803-5731-0/99/$10.00 ©1999 IEEE



Figure 1: The KDD process and the role of data types generalization.

There is a major difference between the data types generalization process and other data mining pre-processes. Other pre-processes (such as missing value handling) are independent of the selected data mining method; that is, they can be done without knowing which data mining algorithm will be used. The data types generalization process, in contrast, clearly depends on the desired mining method. The purpose of data transformation using data types generalization is to make the specified data set suitable for the mining algorithm. To achieve this goal, we must survey both the data types in databases and their relations with the various data mining methods. The flow of solving a data mining problem with data transformation is illustrated in Figure 2.

Figure 2: Solving data mining problems with data transformation. Data types transformation must be based on the selected data mining method.

Some researchers have proposed generalizing the data contained in attributes using "attribute-oriented induction" [5, 8, 11], which offers two major advantages for the mining of large databases. First, it allows the raw data to be handled at higher conceptual levels. Generalization is performed with the use of "attribute concept hierarchies", where the leaves of a given attribute's concept hierarchy correspond to the attribute's values in the data (referred to as primitive level data). Generalization of the training data is achieved by replacing primitive level data with higher level concepts. In fact, data generalization using attribute concept hierarchies is a kind of data type transformation that reduces the number of distinct values contained in attributes. In this paper, we first provide a typical description of the data types fittability problem and a data types generalization process to define and solve the data types transformation problem for attributes. Hence, data generalization using concept hierarchies is included in the process for performing the specified data types transformation.

Another related line of work surveys how to transform data into numerical values. Almost all data-driven algorithms utilize numeric inputs. From a computer processing point of view, handling computations with numbers is easier and more efficient. Therefore, if the input values are non-numeric (e.g., text strings), in many cases they should be intelligently converted to meaningful numerical values. Numerical values can be seen as a data type, and transforming data into numerical values is a kind of data types transformation. These strategies are included in the data types generalization process for performing data types transformation.

3. Analysis of the data types fittability problem
In recent years, due to the explosion of information and the rapid growth of database applications, data mining techniques have become more and more important. For this reason, different kinds of data mining methods and algorithms have been proposed. However, it is difficult for users to choose a suitable one by themselves without prior knowledge of data mining. In fact, which data mining method should be applied depends on both the characteristics of the data to be mined and the kind of knowledge to be found through the data mining process. Hence, the types of data stored in databases play an important role in the data mining process and restrict the data mining methods that users can choose. Each kind of data mining method can only be applied to the databases that suit it; this is what we call "the data types fittability problem" for data mining. To solve this problem, we need to investigate the relationships between the characteristics of the data to be mined and the various kinds of data mining techniques. With these relationships, we can clearly analyze the data types fittability problem and determine whether the data types transformation can be performed. Hence, analyzing these relationships is preparatory work for our data types generalization process, and it explains why the process can solve the data types fittability problem. We now present the analysis.

3.1 Four kinds of data forms for data mining
Data mining techniques can usually be applied to four kinds of data forms: textual, temporal, transactional, and relational. Different kinds of data forms are used to store different kinds of data types. We describe each kind of data form in the following:

(1) Textual data forms: Textual data forms are used to represent texts or documents. Basically, data in this form can be seen as a very large set of characters.
(2) Temporal data forms: Time-series data is stored in temporal data forms. Data that varies with time (such as historical data) can be stored as numerical time series.
(3) Transactional data forms: For example, the past transactions of a market can be stored in transactional data forms. Each transaction records the list of items bought in that transaction.


(4) Relational data forms: Relational data forms are the most widely used data forms and can store different kinds of data. The basic units of relational data forms are relations (tables). Relations are composed of attributes, each of which can have a different data type.

3.2 Six kinds of data mining techniques
Data mining techniques are usually classified by the characteristics of the data to be mined and the knowledge users try to find. Following previous surveys [4, 9], they can be divided into six kinds. For each kind, we list the knowledge to be found and the most suitable data types to be mined:

(1) Multilevel data generalization, summarization, and characterization [4, 11]: Knowledge to be found: The purpose of this kind of technique is to view the data stored in databases at a higher level. The higher-level views are used to represent rules that explain certain concepts and are thus easier for people to understand. The most suitable data types: It is applied to relational databases (relational data forms) by gathering statistics on the data stored in attributes, which can provide the higher-level views.
(2) Mining association rules [1, 6]: Knowledge to be found: The purpose of mining association rules is to find the associations between items in a huge number of transactions, for example, to understand customers' purchasing behavior. The most suitable data types: It is usually applied to transactional data forms. It is also suitable for data stored in relational data forms if the data is processed in advance, for example by grouping by transaction.
(3) Data classification: Knowledge to be found: The basic purpose is to find the classification principle from a pre-classified data set (training data). This principle can then be used to classify newly arriving data. ID tree, version space, case-based reasoning, and neural networks are all popular classification methods. The most suitable data types: It is applied to relational databases, where a tuple represents a sample. One attribute is treated as the target attribute for classification (output) and the other attributes are treated as the data pattern (inputs).
(4) Clustering analysis [15]: Knowledge to be found: According to the similarities among patterns, the more similar patterns are grouped into a cluster. There is no pre-defined class for each pattern. K-means clustering is a common clustering method. The most suitable data types: The same as for classification; each tuple in the relational database represents a data pattern.
(5) Pattern-based similarity search: Knowledge to be found: The purpose of this technique is to search databases for patterns similar to a given pattern. It can be divided into two kinds: text similarity search and time-series similarity search. The most suitable data types: Textual data forms are suitable for text similarity search, and temporal data forms are suitable for time-series similarity search.
(6) Discovery and analysis of time series or trends: Knowledge to be found: From the previous changes of data values at different times, the trend of a time series can be found and used to predict possible future changes. Predicting stock prices is a good example. The most suitable data types: It is often applied to temporal data forms.

Figure 3 illustrates the relationships between the data to be mined and the six kinds of data mining techniques. A link between a mining technique and a data form indicates that the data form is most suitable for that technique.

Figure 3: The relationships between data forms to be mined and the six kinds of data mining techniques.

3.3 Analysis of the data types fittability problem for data mining
We can now further analyze the data types fittability problem using Figure 3. The data types fittability problem can be described from two aspects:
(1) Data types fittability between different kinds of data forms
It is obvious that the data types stored in the four kinds of data forms are all different. Figure 3 points out that each kind of data mining technique has its most suitable data forms, and hence the data types fittability problem occurs. For example, transactional data forms are most suitable for mining association rules. This does not mean, however, that the technique of mining association rules can only be applied to transactional data forms. The transactions of a supermarket can also be stored in a relation. If one wants to mine association rules from the relation, one can first process the data with SQL operations such as grouping by transaction. After that, the relation is transaction-like and can be used for mining association rules. Therefore, a mining technique can be applied to different kinds of data forms if the data to be mined can be transformed into the form that is most suitable for the technique. As another example, we can apply classification algorithms to textual data forms if each text in the database can be transformed into a set of attributes (features). This kind of transformation depends on the knowledge users attempt to find. In this paper, it is beyond our discussion and is left as future work for our data types generalization process. We assume here that users apply data mining techniques to the most suitable data forms.
(2) Data types fittability for each kind of data forms
We now analyze the data types for each of the four kinds of data forms. We can divide them into two groups based on the characteristics of their data:


(a) Textual, temporal and transactional data forms: For these three kinds of data forms, there is no need to transform the data types. The data stored in each of these forms can be seen as an indivisible data type: texts, time series, and transactions are complex data types, and the data mining methods for them are designed specifically for these types. Since data mining techniques already exist that can be applied to these data types directly, we do not have to deal with these data forms.
(b) Relational data forms: The difference between relational data forms and the above three kinds is that relational data forms consist of attributes. Each attribute indicates a real-world feature and hence can have a different data type. Many kinds of data mining algorithms can be applied to a relational database (RDB), and not all of them are suitable for every data type of attribute. Therefore, the relationships between the data types of attributes and the mining techniques need to be understood. With these relationships, one can tell whether an attribute's data type needs to be transformed. The data types of the attributes that need to be transformed are then transformed using some transformation strategies. The whole process for doing this is our data types generalization process. With this process, we can solve the data types fittability problem for the attributes of an RDB.
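The group-by-transaction step described under aspect (1), which makes a relation transaction-like, can be sketched in a few lines. This is only an illustrative sketch: the function name `relation_to_transactions` and the (transaction id, item) row layout are assumptions, not something the paper specifies.

```python
from collections import defaultdict


def relation_to_transactions(rows):
    """Group (transaction_id, item) tuples of a relation into transactions.

    Hypothetical helper: mirrors an SQL "group by transaction" step so that
    the result can be fed to an association rule miner.
    """
    baskets = defaultdict(set)
    for transaction_id, item in rows:
        baskets[transaction_id].add(item)
    # Sort the items so that each transaction has a stable representation.
    return {tid: sorted(items) for tid, items in baskets.items()}


# A supermarket relation with one purchased item per tuple (made-up data).
rows = [
    (1, "bread"), (1, "milk"),
    (2, "milk"), (2, "beer"), (2, "bread"),
    (3, "beer"),
]
transactions = relation_to_transactions(rows)
```

Each value of `transactions` is then the item list of one transaction, which is the form that association rule mining expects.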

4. The data types generalization process
The data types generalization process we propose is a general process that solves the data types fittability problem for the attributes of an RDB using data transformation strategies. The goal of data transformation is to transform the data types of the original data stored in databases into a form to which the desired data mining algorithm can be applied. In fact, it does not seem necessary to transform among all data types in relational databases, because most data mining algorithms handle only a subset of the data types in an RDB. Our idea is first to merge the data types that have similar characteristics into a generalized data type, and then to transform the resulting generalized data types. Accordingly, our data types generalization process is composed of two phases: the merging phase and the transforming phase. The merging phase maps the original data types of attributes to the corresponding generalized data types. The transforming phase then transforms the data types of the attributes that need to be changed. Figure 4 briefly illustrates the flow of the data types generalization process.

Figure 4: A flow of the data types generalization process. The mapping result table stores the mapping between attributes and generalized data types.

4.1 The first phase: merging phase
The first phase of the data types generalization process merges the data types with similar characteristics into a generalized data type. The merging is based on which kinds of mining approaches can be applied to the data types. In other words, the data types that belong to the same generalized data type are suitable for the same data mining algorithms. To carry out the merging phase, we must survey the data types in relational databases and their relationships to various data mining algorithms. As mentioned in Section 3.1, relational databases are the most widely used databases and are composed of tables (relations). Hence, in our process, the table is the basic unit for performing data mining methods such as classification and clustering. A table has two parts, a heading and a body. The body is a set of tuples, each of which corresponds to a data sample in the real world. The heading is a set of attributes, each of which indicates an independent and meaningful data feature. In addition, different attributes can have different data types. It is therefore appropriate to analyze the data types of attributes rather than the data types of the RDB as a whole. Following earlier work [3], we generalize the data types contained in attributes into two generalized data types, based on the characteristics of the various mining methods applied to RDB:
(1) Discrete data type: Basically, the values of an attribute of discrete data type come from a predefined finite data set, and the distance between any two values in the set cannot be computed directly. Typically, the enumerated (user-defined) data type and the character data type belong to this kind. Moreover, the data set can be enumerated or multi-dimensional enumerated. An enumerated data set implies that the possible values of the attribute are bounded and can be listed one by one. The Boolean type is the simplest example; its possible values are true and false. A multi-dimensional enumerated data set, on the other hand, implies that the possible values of an attribute are numerous and cannot easily be listed one by one. This kind of data has a fixed format and thus can be divided into several parts. Each part can be seen as an enumerated discrete data set, which is why we call this kind of data multi-dimensional enumerated. For example, assume an attribute's name is address. The number of possible values of an address is very large, and an address may have a fixed format (such as floor, number, road, city, etc.). According to the format, an address can be divided into several parts, and the values in each part can be enumerated. When dealing with multi-dimensional enumerated data types, we need to make use of their formats. A discrete data type is usually a user-defined type or a character type, so it can be easily understood by people.
(2) Continuous data type: Numeric data types (e.g., int, long int, float, double, etc.) are continuous data types. In contrast to discrete data type, each value of this type has a relation or order with respect to other values, and we know exactly the distance or similarity between any two values. For example, if an attribute is a student's grade in a course (assume the range of the grade is 0-100), then this attribute is of continuous data type. We know exactly the distance between any two students' grades by subtracting the lower one from the higher one. On the other hand, continuous data type is not as easy for people to understand as discrete data type.
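The merging phase can be pictured as building the mapping result table: each attribute is looked up by its original data type and assigned a generalized data type. The sketch below is one possible reading, not the authors' implementation; the type names in `DISCRETE_TYPES` and `CONTINUOUS_TYPES`, the function `merge_data_types`, and the example schema are all assumptions made for illustration.

```python
# Hypothetical type names; the paper describes the two categories informally.
DISCRETE_TYPES = {"boolean", "char", "enum"}
CONTINUOUS_TYPES = {"int", "long int", "float", "double"}


def merge_data_types(schema):
    """Build a mapping result table: attribute name -> generalized data type."""
    mapping = {}
    for attribute, data_type in schema.items():
        name = data_type.strip().lower()
        if name in DISCRETE_TYPES:
            mapping[attribute] = "discrete"
        elif name in CONTINUOUS_TYPES:
            mapping[attribute] = "continuous"
        else:
            # Types such as texts or multimedia objects are outside the process.
            mapping[attribute] = "not handled"
    return mapping


# Made-up table heading: attribute name -> original data type.
student = {"name": "char", "passed": "boolean", "grade": "int", "photo": "blob"}
mapping = merge_data_types(student)
```

Keeping the result as a plain attribute-to-type table makes the later transforming phase a matter of consulting this table and converting only the attributes whose generalized type does not fit the selected algorithm.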

Discrete data type is suitable for some supervised learning methods (e.g., classification methods such as version space and ID tree), but cannot be used directly for unsupervised methods (e.g., clustering). This is because most clustering algorithms (e.g., k-means clustering) need to compute the distances or similarities between samples, which cannot be obtained from the original values of discrete data types. Continuous data type is suitable for clustering methods: we can perform clustering on a table that consists of continuous-type attributes. This kind of data type is also suitable for some classification methods (such as neural networks). We summarize the comparison of these two data types in Table 1.

Table 1: A comparison of discrete and continuous data types
Property | Discrete | Continuous
Data types | Enumerated (user-defined), Character | Integer, Float, Double
Computability of the distance of any two values | No | Yes
Suitable for classification methods? | Yes (e.g., ID tree, version space) | Yes (e.g., neural network)
Suitable for clustering methods? | No | Yes (e.g., K-means clustering)
Can be easily realized? | Yes | No

Some other data types in RDB, such as texts or multimedia objects, are beyond our discussion. Our focus is how to transform these two kinds of data in order to apply the desired data mining algorithm.

4.2 The second phase: transforming phase
As mentioned above, the data types of attributes can be divided into two generalized types. Now, consider the distinct values of an attribute of discrete data type. For most data mining methods, it is difficult to deal with an attribute with numerous distinct values, such as one of the multi-dimensional enumerated discrete data type. Hence, reducing the number of distinct values in a discrete-type attribute is another way to transform the data type. In other words, we now have three kinds of generalized data type: discrete data with numerous distinct values, discrete data with a few distinct values, and continuous data. Our goal is to find ways to transform among these three data types. The strategies are shown in Figure 5.

Figure 5: The strategies for transforming over three kinds of generalized data types.

After finishing the merging phase, we have to find ways to transform among these generalized data types in the second phase, the transforming phase. Domain knowledge often has to be involved in the transforming phase, so it is difficult to carry it out fully automatically. However, it is still useful to know the possible data transformation strategies, because we can then interact well with the domain experts by asking them to provide the knowledge required for the data transformation. Furthermore, users can decide how far the transforming phase should go according to the available knowledge, using different transformation strategies. This is a point of concern when developing a tool for data mining. We now describe four transformation strategies for transforming among these three generalized data types:
(1) Reducing the distinct values of discrete data type
Generalization on the target attribute: An attribute containing too many distinct values is not easy for many data mining algorithms to deal with. A user may also find an attribute with numerous distinct values hard to understand. In this case, we can perform generalization on the attribute. The purpose of generalization is to reduce the number of distinct values contained in the target attribute while making the data more understandable for users. Domain knowledge (offered by domain experts) is usually needed to perform generalization, although sometimes it can still be done automatically or semi-automatically. Generalization can be done at different levels, and thus users can obtain different results using the same data mining algorithm. Excessive generalization will obviously cause information loss. Therefore, deciding the level of generalization is a major concern.

(2) Discrete data type to continuous data type
Define distance or similarity using available domain knowledge: The question is how to find the similarity between two distinct values in an attribute of discrete data type. Our principle is to use knowledge offered by domain experts. This knowledge may be a definition of the exact distance between any two of the distinct values, or a similarity matrix, that is, a matrix that states the similarity of any two distinct values. Once we know the distance or similarity, the discrete data can be treated as continuous data.
(3) Multi-dimensional enumerated data type to continuous data type
Define similarity using data format: The basic principle of this kind of transformation is to define the similarity of two distinct values by making use of their data format. This strategy is only suitable for the multi-dimensional enumerated discrete data type. Because of the huge number of distinct values in multi-dimensional enumerated discrete attributes, we need to use the data format in order to define similarities among them.
(4) Continuous data type to discrete data type
Clustering on the target attribute and define clusters: Some mining methods are not suitable for numeric data; moreover, numeric data is not a good way to represent knowledge. Hence, we need methods to transform continuous data type into discrete data type. The approach we suggest in this paper is clustering on the target attribute. The number of final clusters that the user wants can be predefined, and the original numeric data is transformed into a set of clusters by the clustering. The obtained clusters can then be given suitable names or symbols that indicate their meanings. Finally, we can use the names of the clusters instead of the numeric data, so that the original data is changed from continuous to discrete.
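The four strategies above can each be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' tool: the concept hierarchy, the similarity values, the address layout, and every function name are hypothetical. The distance for strategy (2) is taken as 1 minus the expert-given similarity, the similarity for strategy (3) as the fraction of shared leading parts, and plain one-dimensional k-means is used for strategy (4); all three are choices made here for concreteness.

```python
# All names and data are hypothetical illustrations, not taken from the paper.

# Strategy (1): generalization with a concept hierarchy (value -> parent).
HIERARCHY = {"Hsinchu": "North", "Taipei": "North",
             "Tainan": "South", "Kaohsiung": "South"}


def generalize(values, hierarchy):
    """Replace each value by its parent concept; unknown values stay as-is."""
    return [hierarchy.get(v, v) for v in values]


# Strategy (2): distance derived from an expert-supplied similarity matrix.
SIMILARITY = {frozenset(("red", "orange")): 0.8,
              frozenset(("red", "blue")): 0.1,
              frozenset(("orange", "blue")): 0.2}


def matrix_distance(a, b, similarity):
    """Use 1 - similarity as the distance; identical values are at distance 0."""
    return 0.0 if a == b else 1.0 - similarity[frozenset((a, b))]


# Strategy (3): similarity from the fixed format of a multi-part value.
def format_similarity(a, b):
    """Fraction of leading parts shared by two non-empty tuples of parts,
    ordered from most general to most specific (e.g. city, road, number)."""
    shared = 0
    for part_a, part_b in zip(a, b):
        if part_a != part_b:
            break
        shared += 1
    return shared / max(len(a), len(b))


# Strategy (4): cluster a continuous attribute, then name the clusters.
def kmeans_1d(values, k, iterations=20):
    """Plain 1-D k-means (needs len(values) >= k); returns sorted centres."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centres over the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Keep the old centre when a cluster receives no values.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return sorted(centres)


def discretize(values, centres, names):
    """Replace each number by the name of its nearest cluster centre."""
    return [names[min(range(len(centres)), key=lambda i: abs(v - centres[i]))]
            for v in values]


cities = generalize(["Hsinchu", "Taipei", "Tainan"], HIERARCHY)
grades = [35, 40, 42, 68, 70, 75, 92, 95, 98]
centres = kmeans_1d(grades, 3)
labels = discretize(grades, centres, ["low", "medium", "high"])
```

In practice the hierarchy, the similarity matrix, the part ordering, and the cluster names would all come from domain experts, which is where the interaction described in this section takes place.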

5. Concluding remarks
In this paper, we first noted that no data mining algorithm can be applied to all kinds of data types. We then defined the data types fittability problem for data mining. After that, we surveyed the relationships between the four kinds of data forms and the six kinds of widely used data mining techniques, and gave an analysis of the data types fittability problem. Our data types generalization process has also been described in detail, and various kinds of transformation strategies have been listed. In conclusion, we proposed a data types generalization process that helps users select a suitable data mining algorithm for their data sets by solving the data types fittability problem for attributes in RDB.

References
[1] R. Agrawal and J. C. Shafer, Parallel mining of association rules, IEEE Trans. Knowledge and Data Engineering, Vol. 8, No. 6, pp. 962-969, Dec., 1996.
[2] R. Agrawal, T. Imielinski, and A. Swami, Database mining: a performance perspective, IEEE Trans. Knowledge and Data Engineering, Vol. 5, No. 6, pp. 914-925, Dec., 1993.
[3] C. Apte and S. Weiss, Data mining with decision trees and decision rules, Future Generation Computer Systems, Vol. 13, pp. 197-210, 1997.
[4] M. S. Chen, J. Han and P. S. Yu, Data Mining: An Overview from Database Perspective, IEEE Trans. Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, Dec., 1996.
[5] D. W. Cheung, A. W.-C. Fu, and J. Han, Knowledge discovery in databases: A rule-based attribute-oriented approach, Proc. 1994 Int'l Symp. on Methodologies for Intelligent Systems, pp. 164-173, Charlotte, North Carolina, Oct., 1994.
[6] D. W. Cheung, V. T. Ng, A. W. Fu, and Y. Fu, Efficient mining of association rules in distributed databases, IEEE Trans. Knowledge and Data Engineering, Vol. 8, No. 6, pp. 911-922, Dec., 1996.
[7] U. Fayyad and P. Stolorz, Data mining and KDD: Promise and challenges, Future Generation Computer Systems, Vol. 13, pp. 99-115, 1997.
[8] J. Han, Mining knowledge at multiple concept levels, Proc. of 4th Int. Conf. on Information and Knowledge Management, pp. 19-24, Baltimore, Maryland, Nov., 1995.
[9] J. Han, Data mining techniques, ACM-SIGMOD'96 Conference Tutorial, June, 1996.
[10] J. Han, Y. Cai, and N. Cercone, Data-driven discovery of quantitative rules in relational database, IEEE Trans. Knowledge and Data Engineering, Vol. 5, No. 1, pp. 29-40, Feb., 1993.
[11] J. Han and Y. Fu, Dynamic generation and refinement of concept hierarchies for knowledge discovery in databases, Proc. AAAI'94 Workshop on Knowledge Discovery in Databases (KDD'94), pp. 157-168, Seattle, WA, July 1994.
[12] T. B. Ho, Discovering and using knowledge from unsupervised data, Japan Advanced Institute of Science and Technology, Tatsunokuchi, Ishikawa, 923-12, Japan.
[13] G. J. Hwang and S. S. Tseng, EMCUD: A knowledge acquisition method which captures embedded meanings under uncertainty, International Journal of Man Machine Studies, Vol. 33, pp. 431-451, 1990.
[14] G. J. Hwang, Knowledge elicitation and integration from multiple experts, Journal of Information Science and Engineering, Vol. 10, No. 1, pp. 99-109, March, 1994.
[15] A. K. Jain and R. C. Dubes, Algorithms for clustering data, Prentice-Hall Inc., pp. 58-89, 1988.
[16] J. P. Yoon and L. Kerschberg, A framework for knowledge discovery and evolution in databases, IEEE Trans. Knowledge and Data Engineering, Vol. 5, No. 6, pp. 973-979, Dec., 1993.

