
DATA MINING TECHNIQUES

Abstract

With the large amount of data stored in databases and data warehouses, it is increasingly
important to develop powerful tools for analysing such data and mining interesting
knowledge from it. Data mining is the process of inferring knowledge from such huge data.
The main problem in retrieving information from the World Wide Web is the enormous
number of unstructured documents and resources, i.e., the difficulty of locating and
tracking appropriate sources. This report surveys a customer satisfaction study conducted
at San Francisco Airport and explains how it can inform decisions for improving the
airport's facilities.
Introduction

Today is an information age: everyone holds a great deal of data on particular topics.
Some of this data is useful, while the rest is not. Many corporations have changed their
management style and adopted a customer relationship management approach, so every company
wants its customers to be fully satisfied. To attain this objective they conduct various
surveys, gathering large amounts of data about their customer base. The main problem is
extracting patterns from this data: the volume is so large that analysing it all manually
is next to impossible. Data mining tools help companies understand and extract patterns
from such big data.
What is data mining?

Data mining is the extraction of useful information from large data sets, analysing it and
then restructuring it into a useful form. It includes identifying patterns that were not
previously seen, and identifying relationships among those patterns to predict future
behaviour. It derives its origin from computer science. In business, it can be used to
identify relationships among the various factors that affect the business. For example,
retailers such as Wal-Mart use data mining to identify buying behaviour patterns among
their customers and implement marketing strategies to attract them in the future. Data
mining has found extensive usage in the fields of business, science, engineering,
medicine, and sports.
Stages in data mining

Data mining involves the following basic stages:
1. Defining the problem - identifying the objectives and requirements of the project.
2. Collecting the data - collecting the appropriate data that can address the problem and
identifying any patterns in it. This stage also involves cleansing and transforming the data.
3. Creating a model - building the model with the help of algorithms, then testing,
evaluating and interpreting it.
4. Applying the model - integrating the data mining models into applications.

VARIOUS TASKS OF DATA MINING

There are six different types of tasks used in data mining. These tasks are further
divided into two subtypes: (1) directed data mining and (2) undirected data mining.
1. Directed data mining - the data is used directly to build a model that describes one or
more attributes of interest.
2. Undirected data mining - the goal is to establish some relationship among all the
attributes.
The six tasks of data mining are as follows:
1. Classification
2. Estimation
3. Prediction
4. Affinity grouping or association rules
5. Clustering
6. Description and visualization

[1] CLASSIFICATION:

In classification, the features of the presented data are examined and assigned to various
pre-determined classes. Various methods are used to build a model that is then applied to
unclassified data, converting it into classified data. Examples of classification tasks
are as follows:
Loan applicants are classified as low, medium or high risk. Income groups are classified
as lower class, middle class and high income.

[2] ESTIMATION:

In classification we get discrete outcomes such as low, medium or high, but in estimation
we get continuously valued outcomes. When some input data is available, estimation
techniques are used to come up with a value for some unknown continuous variable. In
estimation one wants to come up with an acceptable value or range for unknown parameters.
For example:
Estimating the number of children in a family from the literacy rate, or estimating a
person's profession from the gadgets they own.

[3] PREDICTION:

In prediction, the task is the same as in classification and estimation; the only
difference is that here we classify the data according to some predicted future value.
There is no way to check the accuracy of the classification except to wait and see.

The main reason to treat prediction separately is that there are additional issues
regarding the temporal relationship of the input variables which do not arise in
classification and estimation. Any method used for classification and estimation can be
used for the predictive task: historical data is used to build a model that explains
currently observed behaviour. Examples of predictive tasks in data mining are as follows:
Predicting which employees will leave within six months, or predicting the size of the
balance that will be transferred.
[4] ASSOCIATION RULES

This task is also known as affinity grouping. Here we find out which things go together;
a classic example is finding which items in a shopping cart go together. This task is the
main function of market basket analysis. Retail chains often use it to arrange items on
shelves and in catalogues.
An association rule works on the simple approach of generating rules from the data. If
two items, say a cell phone and a memory card, frequently occur together, then we generate
two association rules from that, as follows:
People who buy a cell phone also buy a memory card with probability x.
People who buy a memory card also buy a cell phone with probability y.
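The two probabilities above are just confidences computed from transaction data. A minimal sketch, using made-up example baskets rather than real survey data:

```python
# Sketch of deriving the two association rules from transaction data.
# The transactions below are made-up examples, not real survey data.
transactions = [
    {"cell phone", "memory card"},
    {"cell phone", "memory card", "charger"},
    {"cell phone"},
    {"memory card"},
    {"memory card", "charger"},
]

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent): the fraction of baskets containing
    the antecedent that also contain the consequent."""
    has_a = [t for t in transactions if antecedent in t]
    both = [t for t in has_a if consequent in t]
    return len(both) / len(has_a)

# "People who buy a cell phone also buy a memory card with probability x"
x = confidence("cell phone", "memory card", transactions)
# "People who buy a memory card also buy a cell phone with probability y"
y = confidence("memory card", "cell phone", transactions)
print(x, y)
```

Note that x and y generally differ: the rule is directional, which is why the two rules are stated separately.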
[5] CLUSTERING

Clustering means dividing the data into some number of groups. The data within each group
is homogeneous, while the groups themselves are heterogeneous with respect to each other.
What distinguishes clustering from classification is that the latter relies on
pre-determined classes: in classification, all data is assigned to pre-determined classes
on the basis of a model.
In clustering no such pre-defined classes are available. It is up to the user to determine
what meaning to attach to the resulting clusters; for example, a cluster of employees
might share a similar background. Clustering is often done as a prelude to some other form
of data mining or modelling. For example, if you want to make a promotional strategy for a
product, then rather than making one strategy for every customer, you divide your
customers into clusters with similar buying habits and then decide which strategy suits
each cluster.

[6] DESCRIPTION AND VISUALIZATION

The main task of description and visualization is to describe what is going on in a
complicated database in a way that increases our understanding. This is one of the most
powerful tools of data mining, although data visualization is not an easy task.
Visualization is useful for providing a visual representation of data, such as plotting a
company's customers on a map of a country.

Techniques of Data Mining

All the above tasks have various techniques for building a model. Some of the most
important techniques are described below:
[1] STATISTICS

Various statistical techniques are used for data mining, such as Bayesian networks,
regression analysis, correlation analysis and cluster analysis. A statistical model is
built up using a training database. A Bayesian network is a directed graph which
represents the causal relationships among variables using Bayes' theorem. Correlation
analysis is used to find the correlation between two variables. Cluster analysis finds
groups in a set of objects based on distance measures. The following is an example of
linear regression.

The correlation coefficient, a parameter used in correlation analysis, is a measure of the
linear association between two variables. The correlation coefficient lies between -1 and
+1. A value of +1 indicates that two variables are perfectly related in a positive linear
sense, a value of -1 indicates that they are perfectly related in a negative linear sense,
and a value of 0 indicates that there is no linear relationship between the two variables.
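The coefficient described above can be computed directly from two samples. A small sketch, with made-up sample values:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples:
    covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfect positive linear relation, near +1
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfect negative linear relation, near -1
```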
The following is an example of a Bayesian network, where nodes represent variables and
edges represent dependencies. From this diagram we see that age, occupation and diet lead
to disease, and disease leads to symptoms.

[2] MACHINE LEARNING

Statistical methods do not work efficiently with subjective, non-quantifiable information
in their models. They also have to assume particular distributions of parameters and
independence of attributes. Machine learning can produce better predictive accuracy
because it is free from the parametric and structural assumptions that underlie
statistical methods. Some machine learning techniques are as follows:

Artificial Neural Network - The neural network is a very popular technique for data
mining. It is a general purpose tool applied to prediction, classification and clustering.
The model has many non-linear processing elements arranged in a pattern similar to
biological neural networks. This technique is applied across a broad range of industries,
from financial institutions to medical companies.

Genetic Algorithm - This algorithm is based on natural selection and natural genetics, and
is also called an evolutionary algorithm. It is used for optimisation problems in various
industries, such as complex scheduling and resource optimisation problems. The technique
is also used in combination with other data mining techniques.
Decision tree - A decision tree is a structure that divides a large collection of data
into successively smaller sets of records by applying simple decision rules. It is a
flow-chart-like tree structure, where each internal node denotes a test on an attribute,
each branch represents an outcome of the test, and the leaf nodes represent the classes or
class distributions. The topmost node in a tree is called the root node. The following
decision tree indicates whether a customer at a company is likely to buy a computer or not.
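A decision tree of this kind is just a nest of attribute tests. The sketch below hand-codes a tree in the spirit of the "buys computer" example; the attributes, thresholds and outcomes are illustrative assumptions, not taken from the report's figure.

```python
# Hypothetical hand-built decision tree for the "buys computer" example.
# The attributes and thresholds are illustrative assumptions only.
def buys_computer(age, is_student, credit_rating):
    if age < 30:                              # internal node: test on age
        return is_student                     # leaf: young customers buy if students
    elif age <= 40:
        return True                           # leaf: middle-aged customers buy
    else:
        return credit_rating == "excellent"   # leaf: seniors buy if credit is excellent

print(buys_computer(25, True, "fair"))
print(buys_computer(45, False, "fair"))
```

Each call walks from the root test down one branch to a leaf, which is exactly how a learned tree classifies a record.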

[3] FUZZY LOGIC:

This technique is simply an extension of the classical logic system. It provides a
conceptual framework for dealing with the problem of knowledge representation in an
environment of uncertainty and imprecision. Fuzzy logic in its pure form is not useful for
classification, but it is used in hybrid classification techniques. In fuzzy logic any
logical system can be fuzzified.

[4] ROUGH SETS TECHNIQUE:

In this theory, sets or concepts are approximated by means of binary relations constructed
from empirical data, based on the notion of indiscernibility and the inability to
distinguish between objects. This model is applied in two directions: first, decision rule
induction from attribute value tables, and second, data filtration by template generation.
Techniques of Clustering

Clustering means grouping objects in such a way that the objects in one group are more
homogeneous with each other than with objects in other groups. In data mining this
procedure is applied to data to present it in an organised way. Clustering techniques are
divided into four categories, as follows:
1. Hierarchical methods
2. Density based methods
3. Grid based methods
4. Partitioning methods
Each category contains several techniques. We now discuss each of them separately.
1. HIERARCHICAL METHOD

In this method the data are organised into a tree of clusters, or dendrogram. Two
techniques fall under this method: 1) agglomerative and 2) divisive. The agglomerative
approach works bottom-up by growing clusters: the process starts with each item as its own
cluster and continuously merges clusters until a single cluster encompasses all items or
certain terminating conditions are met. The divisive approach works top-down: the process
starts with a root cluster containing everything, divides it into sub-clusters, divides
those again, and ends when every cluster holds only one item or certain terminating
conditions are met.
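The agglomerative (bottom-up) approach can be sketched in a few lines. This uses single linkage (the distance between two clusters is the distance between their closest members) on 1-D points; the data and the stopping rule of 3 clusters are illustrative assumptions.

```python
# Minimal agglomerative clustering sketch with single linkage on 1-D points.
def single_link_dist(a, b):
    """Distance between clusters = distance between their closest members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, k):
    clusters = [[p] for p in points]          # start: one cluster per item
    while len(clusters) > k:                  # merge until k clusters remain
        # find the closest pair of clusters
        pairs = [(single_link_dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into i
        del clusters[j]
    return clusters

print(agglomerate([1, 2, 10, 11, 30], 3))
```

Setting k = 1 reproduces the full bottom-up merge described above, ending with a single cluster of all items.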

2. DENSITY BASED CLUSTERING METHOD

Density based clustering techniques aim to find arbitrarily shaped clusters. DBSCAN and
OPTICS are two techniques of this method.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

This technique is designed to find clusters of arbitrary shape. The density around an
object is measured by the number of objects close to it. It uses two input parameters:
ε and MinPts. ε defines the radius of the neighbourhood of an object, and MinPts defines
the minimum number of points that should be in the neighbourhood of an object for it to be
considered a core object. The algorithm works as follows:
(1) First, mark all objects as unvisited.
(2) Then randomly visit an unvisited object x. If x has at least MinPts objects in its
ε-neighbourhood, a new cluster C is created for it; otherwise it is marked as a noise
point.

(3) If a cluster was created, we iteratively visit each point y in this newly formed
cluster. If y is unvisited, we mark it as visited, and if y has at least MinPts points in
its ε-neighbourhood, we add those points to the cluster C. If y is not yet a member of any
cluster, it is added to the cluster C.
(4) Repeat steps 2 and 3 until all objects are visited.
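The four steps above can be sketched in pure Python. This version works on 1-D points for brevity; the sample data and the eps/min_pts values are illustrative assumptions.

```python
# Pure-Python sketch of the DBSCAN steps above, for 1-D points.
def region(points, x, eps):
    """Indices of all points within eps of point index x (its neighbourhood)."""
    return [i for i, p in enumerate(points) if abs(p - points[x]) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)        # step 1: None = unvisited, -1 = noise
    cluster_id = -1
    for x in range(len(points)):
        if labels[x] is not None:
            continue
        neigh = region(points, x, eps)
        if len(neigh) < min_pts:         # step 2: not a core object
            labels[x] = -1               # mark as noise (may be re-claimed later)
            continue
        cluster_id += 1                  # step 2: create a new cluster C
        labels[x] = cluster_id
        queue = [i for i in neigh if i != x]
        while queue:                     # step 3: expand the cluster
            y = queue.pop()
            if labels[y] == -1:          # noise point reachable from a core point
                labels[y] = cluster_id
            if labels[y] is not None:
                continue
            labels[y] = cluster_id
            y_neigh = region(points, y, eps)
            if len(y_neigh) >= min_pts:  # y is itself a core object: keep growing
                queue.extend(y_neigh)
        # step 4: the outer loop continues until every object is visited
    return labels

pts = [1.0, 1.1, 1.2, 5.0, 5.1, 9.0]
print(dbscan(pts, eps=0.3, min_pts=2))
```

On this data the first three points form one cluster, the next two form another, and the isolated point at 9.0 is labelled noise.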

OPTICS (Ordering Points to Identify Clustering Structure)

DBSCAN has the disadvantage that it burdens the user with choosing the input parameters;
moreover, different parts of the data could require different parameters. OPTICS was
designed to surmount these challenges. OPTICS computes two additional attributes, the core
distance and the reachability distance, which are used to derive an ordering of the points
such that clusters with higher density are finished first.

3. GRID BASED CLUSTERING METHOD

Grid based methods divide the attribute space into cells, and all clustering operations
are performed on these cells. The main advantage of this technique is fast processing
time. STING and CLIQUE are two techniques used in this method.
a. STING (Statistical Information Grid)

This technique mainly uses numerical attributes. First, statistical information such as
the mean, maximum and minimum is stored in each rectangular cell. Then the parameters of
the higher level cells are derived from the parameters of the bottom level cells. To
answer a query, a layer is first determined from which query processing is to begin; this
layer consists of a small number of cells. For each cell we check its relevance by
computing a confidence interval. Irrelevant cells are removed, and this process continues
until the bottom layer is reached.
b. CLIQUE (Clustering In QUEst)

Certain data is irrelevant and makes the whole process complicated. CLIQUE divides the
whole data space into non-overlapping intervals known as cells. If the number of objects
that map to a cell exceeds a threshold, the cell is known as dense; otherwise, the cell is
sparse. The procedure for this technique is as follows:
i. The dimension space is divided into non-overlapping units called cells.
ii. Then identify the dense and sparse cells.
iii. Then use the dense cells to assemble the clusters.
iv. Then, starting with an arbitrary dense cell, find the maximal region of connected
dense cells in all dimensions.
v. At last, repeat step iv until all cells are covered.
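The core grid idea, binning points into cells and marking a cell dense when it exceeds a threshold (steps i and ii above), can be sketched as follows; the sample points, cell size and threshold are illustrative assumptions.

```python
# Sketch of the grid-based idea: bin 2-D points into cells and mark a
# cell "dense" when it holds at least `threshold` points.
def dense_cells(points, cell_size, threshold):
    counts = {}
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))  # map point to its cell
        counts[cell] = counts.get(cell, 0) + 1
    return {c for c, n in counts.items() if n >= threshold}

pts = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1),   # three points fall in cell (0, 0)
       (5.1, 5.2),                            # a lone point: its cell stays sparse
       (2.1, 2.2), (2.3, 2.4)]                # two points fall in cell (2, 2)
print(dense_cells(pts, cell_size=1.0, threshold=2))
```

Steps iii-v would then connect adjacent dense cells into clusters; the speed advantage comes from working with cell counts instead of the raw points.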

4 PARTITIONING METHODS

In this method the data is divided into K clusters, where K is less than or equal to N
(the number of objects) and each cluster contains at least one element. The clusters are
improved by relocating objects from one group to a more relevant one. This process
continues until the clusters stabilise and no more migration of data from one cluster to
another takes place. This method is used for small and medium sized databases. The
techniques used in this method are K-means and K-medoids.
K-MEANS TECHNIQUE

This is the most commonly used clustering algorithm. Here K refers to the number of
clusters; each cluster must contain at least one element, and the clusters must not
overlap. Descriptions of K-means and related algorithms gloss over the selection of K, but
in many cases there is no a priori reason to select a particular value, so there is really
an outermost loop to these algorithms that occurs during analysis rather than in the
computer program. This outer loop consists of performing automatic cluster detection using
one value of K, evaluating the results, then trying again with another value of K or
perhaps modifying the data. After each trial, the strength of the resulting clusters can
be evaluated by comparing the average distance between records in a cluster with the
average distance between clusters, and by other procedures. These tests can be automated,
but the clusters must also be evaluated on a more subjective basis to determine their
usefulness for a given application. As shown in the figure, different values of K may lead
to very different clusterings that are equally valid. The figure shows the clustering of a
deck of playing cards for K = 2 and K = 4. Is one better than the other? It depends on the
use to which the clusters will be put.

PROCESS OF DOING K-MEANS

1. First, choose the value of K. As discussed above, we can choose whatever value of K we
need according to our requirement; in the example above it is 2 or 4.
2. Then assign each element to the cluster for which it is most suitable. In the example
above, all playing cards are divided into 2 clusters according to their colour.
3. Then find the mean of each cluster.
4. Then compare the value of each element against the cluster means and reassign each
element to the cluster for which it is most suitable.
5. Repeat this process until we get the most suitable clusters.
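The five steps above can be sketched in pure Python. This version runs on 1-D data for clarity; the sample values and K = 2 are illustrative assumptions.

```python
import random

# Pure-Python sketch of the K-means steps above on 1-D data.
def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)                  # step 1: pick K initial centres
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                           # steps 2 and 4: assign each
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)            # element to its nearest mean
        # step 3: recompute each cluster's mean (keep old mean if a cluster empties)
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return sorted(means)                           # step 5: stabilised cluster means

print(kmeans([1.0, 1.2, 0.8, 9.0, 9.4, 8.6], k=2))
```

On this well-separated data the means settle near 1.0 and 9.0 within a few iterations, whichever points the seed picks as initial centres.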

ADVANTAGES:
This technique is very easy compared to other techniques and is also very easy to
implement.
It is very efficient at processing large databases.
WEAKNESSES:
This method is applicable only when the mean of a cluster is defined.
It is not applicable to categorical data.
You have to specify the value of K before applying the technique.
It is unable to handle outliers and noisy data.
The total run time depends on the initial partition.
K-MEDOIDS ALGORITHM
This is a variation of the K-means technique and is less sensitive to outliers because,
instead of using means, the clusters are represented by one of their own points. Clusters
are formed by points close to their respective medoids. The function used for
classification is a measure of the dissimilarity of points in a cluster from their
representative. The partitioning is done by minimising the sum of the dissimilarities
between each object and its cluster representative; this criterion is called the
absolute-error criterion. Two main types of k-medoids clustering are PAM (Partitioning
Around Medoids) and CLARA (Clustering LARge Applications).

K-MEAN USING WEKA (SURVEY OF SAN FRANCISCO AIRPORT)

Now that we understand K-means clustering well, the main question is how to apply this
technique to solve practical problems. The method helps solve problems such as analysing a
customer satisfaction survey in marketing, or a geographical demand survey to reduce
distribution costs. K-means is an easy and widely used clustering technique in the
practical world.
The aviation industry has been suffering for the past 3-4 months, and every airport is
trying to give maximum satisfaction to its customers. San Francisco International Airport
conducted a survey to analyse its customers' satisfaction level, gathering data from 3500
customers who answered several questions regarding the airport's facilities. The airport
now wants to analyse this data to understand patterns and the various requirements of its
customers, so it uses data mining clustering techniques to solve the problem.
For the airport it is important to understand how many customers use the services it
offers, whether customers visit the airport frequently or once a year, and how they feel
about and rate the various services offered by the airport authority. To sort all this
data and use it to make decisions, various data mining software packages are available to
analyse this type of problem, such as Orange, Weka and Tanagra. Here we analyse the data
using the Weka tool.
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning
software written in Java, developed at the University of Waikato, New Zealand.

Weka contains a collection of visualization tools and algorithms for data analysis and
predictive modelling, together with graphical user interfaces for easy access to this
functionality. Weka supports several standard data mining tasks, more specifically data
pre-processing, clustering, classification, regression, visualization, and feature
selection, but for our topic we will be dealing with the clustering technique using the
Weka software.

When you install the Weka tool and open it, the above dialog box opens with various
options. For our clustering project we need to click the Explorer option. The functions of
the options are:

Explorer: a platform for exploring data; it supports data pre-processing, attribute
selection, learning and visualization.
Experimenter: a platform for performing experiments and conducting statistical tests
between machine learning tools.
KnowledgeFlow: similar to the Explorer but with a drag and drop interface; it also
supports visual design.
Simple CLI: provides a simple command line interface for executing Weka commands.
By clicking Explorer, another dialog box opens in which you upload the data you want to
analyse. The data should be in either CSV or ARFF format. When we select the Explorer
option it appears as shown in fig (2). Then select Open file and choose the data to be
uploaded for analysis. Weka provides filters, but they are not needed for the K-means
technique because it automatically handles numerical and categorical data.
The algorithm normalizes numerical attributes automatically when doing distance
computations. The interface lists all the attributes present in the dataset; we can select
any that we want to include, or select all.

FIGURE (2)

For our particular data there are 3535 instances (the total number of responses from the
customers) and 16 attributes for the analysis of the data, on which we can base decisions
about how to improve the service. The 16 attributes are:
1. Art work and exhibition
2. Restaurant
3. Shop
4. Signage and direction
5. Escalators
6. Info in screen
7. Info booth at lower level
8. Info booth at upper level
9. Wifi
10. Parking
11. Long term parking
12. Rental centre
13. SFO as a whole
14. Age
15. Sex
16. Income
In figure 2 it is shown that we selected Art work and exhibition as the attribute; the
table on the right side of the Weka tool shows the number of respondents who answered
average, good, outstanding, below average, blank, never visited, or unacceptable.
It can be inferred from the data that 696 people responded average, 1066 responded good,
and so on, as shown in the figure.
To compare attributes we can use the bar graph in the bottom right corner of the Weka
tool. Here we selected Income as the other attribute, so the graph shows the relation
between Art work and exhibition and Income. The blue colour in the first bar shows that,
of the 696 people who responded average for Art work & exhibition, around 18 per cent also
chose average for Income; so around 128 people were common to both attributes.
Similarly, the red colour indicates that around 20 per cent of those 696 respondents also
responded good for Income.
The main drawback of this tool is that we can't get the exact number of people common to
both attributes. In this way we can relate the other attributes with Income; figure 2(i)
shows the relation of all the attributes with the Income attribute.

FIGURE 2(i)

To see how clustering works for this type of data we need the cluster-based view. To
select clustering, choose Cluster from the top tab and click the Choose button on the left
side, then select the clustering algorithm to apply. Weka provides various algorithms such
as hierarchical clustering, Cobweb, DBSCAN, EM, etc. As the focus is on K-means clustering
we select SimpleKMeans. This is shown in the figure below.

Then click on the textbox shown to the right of the Choose menu. A pop-up window opens,
shown below, in which you edit the parameters of clustering such as the distance function,
maximum iterations, number of clusters, and seed. These parameters are very important for
the clustering technique because they affect the final output.
To explain the different possible outcomes: for the distance function we have Euclidean,
Manhattan and Chebyshev distance. This is important because it is used to calculate the
distance between the centroid and the original values. For K-means we need to select
Euclidean distance, because if we choose Manhattan we get K-median clustering instead.
The number of iterations is the number of repetitions done to find the final clusters. The
seed value is used to generate the initial cluster centres used at the start of the
clustering process. Then the number of clusters is selected; we are choosing 5 clusters
for our data.

FIGURE (4)

Once all the parameters have been specified, start the clustering algorithm. For
clustering we have different modes: use training set, supplied test set, percentage split,
and classes to clusters evaluation. Each mode has a different purpose: the training set
mode uses all the data to build the clusters; percentage split uses the value in the %
field to specify how much of the data is used for clustering (the default is 66%); and
classes to clusters evaluation compares the resulting cluster assignment with the known
classes of the instances, to determine whether instances of the same class have been
assigned to the same cluster. We use the training set mode and then click the Start
button. If you want to view the results of clustering in a separate window, right click
the result set in the Result List panel. The result window shows the centroid of each
cluster as well as statistics on the number and percentage of instances assigned to the
different clusters. Cluster centroids are the mean vectors for each cluster (so each
dimension value in the centroid represents the mean value for that dimension in the
cluster); thus, centroids can be used to characterise the clusters.

Finally, we get the result, which divides the whole data into 5 different clusters.
In this result the number of iterations is 5, which means the data was processed 5 times
before giving the required output, and the sum of squared errors is 24399, which gives the
sum of the squared distances between the centroids and the instances. The lower the
squared error, the more accurate the clusters are said to be, and this value depends on
two parameters: 1) the number of clusters and 2) the seed value. We also see the total
instances and the percentage of instances assigned to each cluster.
Cluster 0 contains 384 customer responses, consisting of those who responded average for
all the parameters, mostly males aged 45-54.
Cluster 1 contains 900 customer responses, consisting of those who gave no response for
Art work & exhibition, Restaurant, Shop, Info booth at lower and upper level, Wifi,
Parking, Long term parking and Rental centre, responded good for SFO as a whole, and
outstanding for Escalators, Info on screen, and Signage and direction. It mostly consists
of females in the 25-34 age group.

It can also be inferred that the majority of females aged 25-34 are satisfied by the
services of SFO airport, while males aged 45-54 are not fully satisfied. Thus SFO should
focus more on that age group and try to improve its facilities in a way that satisfies
them.

FIGURE (6)

This is one way of analysing clusters. Another way to analyse the clustering result is by
visualization. This technique is easy to understand and judge, so many people use it to
analyse clusters.
For this, first right click the result set in the left "Result list" panel and select
"Visualize cluster assignments". The visualization window pops up as shown in Fig 6. In
it, choose the cluster number and any attributes for each of the three available
dimensions (x-axis, y-axis, and colour). Different combinations of choices will render
different relationships within each cluster.
In the example above we chose the cluster number on the x-axis, the instance number on the
y-axis, and Art work and exhibition as the colour. We get further results by visualizing
different attributes. The figure for this is given below.

FIGURE (7)
So, this is the way K-means clustering works in the Weka tool. We get the final result in
graphical or numerical format, which can be used for further business goals. In our case
we analyse whether customers are satisfied or not; if customers are not happy, which
attribute we are lacking in; and which types of customers are unhappy, whether rich or
middle class. All this information is generated from the graphs and numbers we get.

The same analysis can also be done in the Orange software. The main difference between
Weka and Orange is that Orange provides only graphical interpretation, while Weka provides
both graphical and numerical output.
This is how the K-means technique works in open source software, and how the clustering
task is done in the practical world.
Conclusion

Thus we can say that by using various tools we can segregate data in a way that is useful
for making decisions to improve a service, or for learning about the different
requirements of a firm. Weka is one simple tool which can be used to classify, cluster and
interpret large data.

Bibliography

Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: practical machine learning
tools and techniques. Burlington: Elsevier Inc.
Jain, D. R. (n.d.). Introduction to Data Mining Techniques. Retrieved from iasri.res.in:
http://www.iasri.res.in/ebook/expertsystem/datamining.pdf
Kharb, M. J. (2013). K-means Clustering Technique on Search Engine. International Journal
of Information and Computation Technology, 506-510.
Berry, M. J. A., & Linoff, G. (2004). Data mining techniques: for marketing, sales, and
customer relationship management. Indianapolis: Wiley Publishing, Inc.
Singh, S. S., & Chauhan, N. C. (2011). K-means v/s K-medoids: A Comparative Study.
DePaul University. (n.d.). Retrieved from:
http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/preprocess.html
