
Data Mining and Warehousing

Project 1

Done By: Anfal Alghanim


ID: 210018549
Submitted to: Ms. Rabab Alkalifah
Date: 11 May

A. Objectives
I have selected the data set named flag, from the UCI Machine Learning Repository. The benefit to be derived from association rule mining is finding co-occurrence relationships (associations). Such patterns can support prediction, for example predicting the religion of a country from its size and the number of colors in its flag. I chose this data set because it is easy to read and understand, it lends itself well to several filters, and I found interesting initial rules to extract from it. In later steps I made several choices; the reasoning for each will be given when it arises.

B. Data set description


The data set has 10 numeric-valued attributes; the remainder are either Boolean- or nominal-valued. Preprocessing was done to make the data amenable for association rule mining: data cleaning, integration, reduction, and transformation. The details of each step are specified below.
Preparing and preprocessing the data
I converted the .data and .names files to CSV format in the following way:
Open Excel > Open File > choose flag.data > in the Text Import Wizard choose Delimited and the start row > Next > check Comma in the Delimiters box > Next > Finish. The same applies to the flag.names file, except that everything but the attribute names is removed from it first.
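For anyone scripting this instead of using the Excel wizard, the same conversion can be sketched in Python with pandas. This is only an illustrative sketch: the two inlined rows and the shortened name list are made up, not the real file contents.

```python
# Hypothetical script version of the Excel import wizard: read the
# comma-delimited flag.data file and attach attribute names taken from
# flag.names. Two made-up rows stand in for the real file here.
import io

import pandas as pd

sample = io.StringIO(          # in practice: open("flag.data")
    "CountryA,648,2,3\n"
    "CountryB,29,6,0\n"
)
names = ["name", "area", "religion", "colours"]   # subset of the 30 names
df = pd.read_csv(sample, header=None, names=names)
print(df)
```

The resulting frame can then be written out with `df.to_csv(...)` in place of the manual Save As step.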

Data Cleaning:
1. Missing values
In this step, missing values, noisy data, and inconsistencies should be resolved. Since my data set is complete in the original file flag-new, I deleted some values from two records (language, religion) to create a missing-value problem, and then applied the following method in WEKA to resolve it:
Open the file > Choose button > weka > filters > unsupervised > attribute > ReplaceMissingValues > Apply > save.

This replaced the missing values in my dataset with the modes and means from the training data. The missing fields were filled with 5.298429 and 2.172775 respectively. The new file name: flag-newReplace missing value

I also used constant values as replacements (e.g. Anfal for nominal attributes and 0 for numeric attributes), but this time I first deleted the value of the mainhue attribute in the first record, using the following method:
Open the file > Choose button > weka > filters > unsupervised > attribute > ReplaceMissingWithUserConstant > click the filter area to specify the values > in the nominal/string replacement value field write Anfal, and in the numeric replacement value field write 0 > OK > Apply > save.
The new file name: flag-new-Replace-Anfal.arff
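The two WEKA filters used above can be mirrored in pandas. The sketch below uses a tiny made-up two-column frame, not the actual flag data:

```python
# Sketch of ReplaceMissingValues (mode for nominal, mean for numeric) and of
# ReplaceMissingWithUserConstant ("Anfal" / 0), on a toy two-column frame.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "language": ["Spanish", None, "Arabic", "Spanish"],  # nominal attribute
    "religion": [1.0, 2.0, np.nan, 1.0],                 # numeric attribute
})

# Mode/mean replacement, as ReplaceMissingValues does:
filled = df.copy()
filled["language"] = filled["language"].fillna(filled["language"].mode()[0])
filled["religion"] = filled["religion"].fillna(filled["religion"].mean())

# Constant replacement, as ReplaceMissingWithUserConstant does:
constant = df.fillna({"language": "Anfal", "religion": 0})
print(filled["language"].tolist(), constant["religion"].tolist())
```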

2. Noisy data
I used a filter that removes instances which are incorrectly classified, using the following method:
Open the file > Choose button > weka > filters > unsupervised > instance > RemoveMisclassified > OK > Apply > save.
Figure 1: data before applying the function

Figure 2: After applying the function

3. Outlier detection
Now, after removing the noise from my dataset, 63 rows remain. To find outliers in my dataset I applied the following method:


Open the file > Choose button > weka > filters > unsupervised > attribute > InterquartileRange > Apply > save.

Before, I had only 30 attributes; after applying the outlier filter, I had two new attributes, Outlier and ExtremeValue. Figure 3 shows that 5 instances contain an outlier and 58 do not, which is a good thing: the fewer, the better. The IQR filter labels an instance YES if it has an outlier in any attribute and NO if it does not. Similarly for the ExtremeValue attribute: if the filter finds that an instance represents an extreme value, it writes YES, otherwise NO. Figure 4 shows how many extreme values I have.
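The rule the InterquartileRange filter applies can be sketched as follows, on toy numbers; WEKA's default factors are 1.5 for outliers and 3 for extreme values:

```python
# Rough equivalent of WEKA's InterquartileRange filter: a value is an outlier
# if it falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], and an extreme value if it
# falls outside the same fences with a factor of 3 instead of 1.5.
import pandas as pd

area = pd.Series([9, 1, 6, 29, 0, 0, 1, 10, 28, 1247])  # toy "area" column
q1, q3 = area.quantile(0.25), area.quantile(0.75)
iqr = q3 - q1
outlier = (area < q1 - 1.5 * iqr) | (area > q3 + 1.5 * iqr)
extreme = (area < q1 - 3.0 * iqr) | (area > q3 + 3.0 * iqr)
print(int(outlier.sum()), int(extreme.sum()))
```

Here only the huge value 1247 is fenced out, on both settings.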

Figure 3: Outlier

Figure 4: Extreme value

To remove the outliers I used the following method:

Open the file > Choose button > weka > filters > unsupervised > instance > RemoveWithValues > click on the filter field (to adjust the properties).

Now follow Figures 5 and 6.

First I specified the index of the Outlier attribute, and I set the nominal indices to the last value, because as you can see in Figure 6 the last attribute of instance 31 is YES, so I am telling the filter to remove the instances whose value is YES.

Figure 5: Remove with value

Figure 6: Outlier, YES

Figure 7: No outlier

After applying the filter, the data is cleaned of outliers; the same procedure applies to extreme values. After removing them, my dataset has 16 rows.

Integration:
Integration here means merging two dataset files. Each record of my data set carries a different country name, so I treat this column as the ID of the record. First I divided my data into Part1 and Part2. Part1 contains the first 16 attributes, the 16th being black. Part2 contains the attributes from number 16 (renamed black2) to the last one. I repeated attribute 16 in both files to create a redundancy problem, but I needed to change its name in Part2 to make the merge work. After that I ran WEKA and clicked Simple CLI.
Figure 8: Step 1

In the CLI window, I entered the following command:

java weka.core.Instances merge C:\Users\Anna\Desktop\backupDM\Part1.csv C:\Users\Anna\Desktop\backupDM\Part2.csv > C:\Users\Anna\Desktop\backupDM\Merge.csv

Figure 9: Step 2

Figure 10: Step 3

Result:
Finished redirecting output to 'C:\Users\Anna\Desktop\backupDM\Merge.csv'. This way I created a file called Merge that combines both part files. Now, how can I remove the redundant attribute?
Remove redundant attributes:
Because the Merge file does not open in WEKA, I made another version in .arff format, then:
Open the Merge2 file > Choose button > weka > filters > unsupervised > attribute > Remove > in the field next to the Choose button, click and specify the index of the desired attribute > OK > Apply > save.
The new file: mergeAndremove
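As a sanity check, the same merge-then-remove sequence can be sketched with pandas; the two tiny frames below are hypothetical stand-ins for Part1 and Part2:

```python
# Pandas sketch of the integration step: join Part1 and Part2 on the country
# name (the record ID) and drop the duplicated attribute by name instead of
# removing it by index, as done in WEKA.
import pandas as pd

part1 = pd.DataFrame({"name": ["Spain", "Poland"], "colours": [2, 2], "black": [0, 0]})
part2 = pd.DataFrame({"name": ["Spain", "Poland"], "black2": [0, 0], "mainhue": ["red", "white"]})

merged = part1.merge(part2, on="name")     # country name acts as the record ID
merged = merged.drop(columns=["black2"])   # remove the redundant copy
print(merged.columns.tolist())
```

Dropping by column name avoids the index bookkeeping the WEKA Remove filter requires.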

Figure 11: Redundant attribute

Figure 12: After remove

Data reduction:
The idea behind this step is to further reduce the dataset. There are parametric and non-parametric families of reduction methods; here I will apply sampling and Principal Component Analysis (PCA).
First, the sampling method:
This extracts a specified percentage of a given dataset and returns the reduced dataset.
Open the mergeAndremove file > Choose button > weka > filters > unsupervised > instance > Resample > Apply > save.

noReplacement means the filter only reduces the data and does not duplicate records.

sampleSizePercent specifies what percentage to keep. I chose 50, to reduce the dataset to half its size.
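The effect of Resample with noReplacement=true and sampleSizePercent=50 can be sketched like this, with 16 dummy rows standing in for my cleaned file:

```python
# Sketch of WEKA's Resample filter with noReplacement=true and
# sampleSizePercent=50: draw half of the rows, without replacement.
import pandas as pd

df = pd.DataFrame({"name": [f"country{i}" for i in range(16)]})
half = df.sample(frac=0.5, replace=False, random_state=1)  # 8 of 16 rows
print(len(half))
```

Because `replace=False`, no record can appear twice in the sample.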

Figure 13: Before Resample

Figure 14: After resample

After applying discretization to the colours attribute, only 3 colours were visualized while the records contain 4, even when I set the bin value to 4; one of the colours was removed entirely along with the redundant colours. This made the data misleading, so I cancelled this filter.
Second, the PCA method:
The purpose of principal components analysis is to:

reduce a large number of variables to a smaller number of summary variables called principal components (or factors), and

reduce the complexity of the multivariate data in the principal-components space.

To apply it to my data:
Open the New-merge file > Choose button > weka > filters > unsupervised > attribute > PrincipalComponents > Apply > save.

Figure 15 : PCA

The PCA filter did not work well with my dataset; it caused many of the original attributes to be removed.
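For reference, here is a minimal PCA sketch with NumPy on random stand-in data. Note that collapsing many attributes into a few components is exactly what PCA is designed to do, so seeing the original attributes disappear is expected behaviour rather than an error:

```python
# Minimal PCA via SVD: centre the data, take the top right-singular vectors,
# and project onto them. 16 random records with 6 attributes stand in for the
# real dataset here.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 6))

Xc = X - X.mean(axis=0)                      # centre each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                            # keep the top 2 components
print(Z.shape)
```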

Histograms (non-parametric data reduction):

I used histograms to represent the attributes language and religion, visualized by their class. First, a histogram of religion (Figure 16): each colour represents a religion, and the numbers show how many countries have that religion. Similarly for the language histogram (Figure 17).
Figure 16: Histogram for religion; each bucket represents one value shared by several countries

(Bucket labels: Catholic, Muslim, Buddhist, Ethnic, Marxist)
Figure 17: Histogram for language; each bucket represents one value shared by several countries

(Bucket labels: Spanish, German, Slavic, Arabic, Others)

Scatter plot visualization:

A scatter plot is used to show correlation among attributes. I chose to visualize the number of colours in the flag against the country name; the plot shows a horizontal pattern, which means there is no observed correlation between the two attributes (Figure 18).

Figure 18: Scatter plot neither positive nor negative

Transformation
In data transformation I apply discretization to several attributes: colours, religion, language, and area:

Discretization
Open the file > Choose button > weka > filters > unsupervised > attribute > Discretize > Apply > save.

To turn some numeric attributes into nominal ones, I applied the filter to the attributes colours, area, religion, and language. Then I replaced the encoded values with nominal values in the Word file, as shown in Figure 21 for religion. The completed file name: DisLang. After that I eliminated all other attributes (except the name, of course) and implemented the association rules described later.
The bin value differs according to how many values a specific attribute takes in my dataset; for example, the colours attribute currently has 4 colours.
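WEKA's equal-width Discretize step on the colours attribute can be sketched with pandas; the 16 values below are illustrative, chosen only to span the four colour counts:

```python
# Pandas sketch of WEKA's unsupervised Discretize filter: cut a numeric
# attribute into 4 equal-width bins, matching the four colour counts, and
# give the bins nominal labels as done manually in the Word file.
import pandas as pd

colours = pd.Series([2, 2, 5, 3, 3, 3, 3, 3, 2, 3, 2, 2, 2, 3, 2, 4])
binned = pd.cut(colours, bins=4, labels=["Two", "Three", "Four", "Five"])
print(binned.value_counts().to_dict())
```

Labelling the bins directly avoids the separate search-and-replace pass over encoded values.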
Figure 19: Properties of color

Figure 20: after applying filter on color

Figure 21: Convert to nominal

Drawing the decision tree of the data:

Name (ID)      Language   Colours   Area    Class: Religion
Austria        German     Two       Below   Catholic
Bahrain        Arabic     Two       Below   Muslim
Bulgaria       Slavic     Five      Below   Marxist
Colombia       Spanish    Three     Below   Marxist
Congo          Others     Three     Below   Ethnic
Ecuador        Spanish    Three     Below   Catholic
Ethiopia       Others     Three     Below   Catholic
Gibraltar      Spanish    Three     Below   Catholic
Kampuchea      Others     Two       Below   Buddhist
Liechtenstein  German     Three     Below   Catholic
Morocco        Arabic     Two       Below   Muslim
Poland         Slavic     Two       Below   Marxist
Spain          Spanish    Two       Above   Catholic
Thailand       Others     Three     Below   Buddhist
Vietnam        Others     Two       Below   Catholic
Yugoslavia     Slavic     Four      Below   Marxist

1) Using WEKA:
I copied this table into an Excel file, saved it as CSV, and opened it in WEKA. Go to the Classify tab > Choose button > weka > classifiers > trees > J48.
Then from the test options choose Use training set > Start.
In the result list, right-click and choose Visualize tree.

Figure 22 : Tree in WEKA

Results:

2) Constructing the tree manually:

a. Calculate the entropy of the class and of each attribute, together with the information gain.

Results:
The highest gain is for Language, so it is the root.
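The entropy and information-gain computation behind this choice of root can be checked with a short script; the rows below are transcribed from the table above (language and religion only):

```python
# Entropy / information gain for the manual tree, on the 16 rows of the table.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, cls):
    """Entropy of the class minus the expected entropy after splitting on attr."""
    n = len(rows)
    total = entropy([r[cls] for r in rows])
    rem = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[cls] for r in rows if r[attr] == v]
        rem += len(subset) / n * entropy(subset)
    return total - rem

rows = [
    {"lang": "German",  "rel": "Catholic"}, {"lang": "Arabic",  "rel": "Muslim"},
    {"lang": "Slavic",  "rel": "Marxist"},  {"lang": "Spanish", "rel": "Marxist"},
    {"lang": "Others",  "rel": "Ethnic"},   {"lang": "Spanish", "rel": "Catholic"},
    {"lang": "Others",  "rel": "Catholic"}, {"lang": "Spanish", "rel": "Catholic"},
    {"lang": "Others",  "rel": "Buddhist"}, {"lang": "German",  "rel": "Catholic"},
    {"lang": "Arabic",  "rel": "Muslim"},   {"lang": "Slavic",  "rel": "Marxist"},
    {"lang": "Spanish", "rel": "Catholic"}, {"lang": "Others",  "rel": "Buddhist"},
    {"lang": "Others",  "rel": "Catholic"}, {"lang": "Slavic",  "rel": "Marxist"},
]
print(round(info_gain(rows, "lang", "rel"), 3))
```

Language splits German, Arabic, and Slavic into pure subsets, which is why its gain dominates.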

Figure 23: Tree manually

C. Rule mining process:

Using the Apriori method to find associations among specific attributes, with the WEKA tool.
Open the file RULL1 (the extra attributes already eliminated) > Associations tab > choose Apriori > set its properties (follow the pictures). When you are done, click Start.

Figure 24: RULL

Figure 25: Generating rules

Figure 26: Properties

D. Resulting rules
I found interesting patterns among language, religion, and colours (file name: RULL1); 34 rules were found.

Figure 27: Resulting Rules

General description:
If we know a certain religion, we can tell which language the people of that religion speak (rule 8). If the number of colours in a flag is 2, we can tell which language they speak (rule 20).
Of course, the closer the number of instances matching the whole rule is to the number matching its antecedent, the higher the confidence of the rule. For example, rule 14 has 7 instances matching its antecedent, 4 of which also confirm the consequent language=Spanish; it has a confidence of 0.5. Rule 23, by contrast, has 2 instances matching its antecedent, and both of them (all of them, in fact) confirm the consequent colours=2, so it has a confidence of 1.0.
This raises clients' trust in the rules and helps them choose the suitable rules based on their decision, for example how confident they want a rule to be for specific input attributes.
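WEKA's confidence and lift figures can be reproduced by hand from the rule counts. The numbers below are read off rule 2 in the list that follows, together with the 16-record file:

```python
# Checking WEKA's numbers for the rule "colours=Three 7 ==> religion=Catholic 5":
# confidence = P(consequent | antecedent), lift = confidence / P(consequent).
n = 16                      # records in the file
n_three = 7                 # instances with colours=Three
n_catholic = 7              # instances with religion=Catholic
n_both = 5                  # instances matching both sides of the rule

confidence = n_both / n_three
lift = confidence / (n_catholic / n)
print(round(confidence, 2), round(lift, 2))
```

This reproduces the conf:(0.71) and lift:(1.63) that WEKA reports for that rule.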
Interesting patterns:
1) language=Spanish religion=Catholic 4 ==> colours=Three 3    conf:(0.75) < lift:(1.71)>
2) colours=Three 7 ==> religion=Catholic 5    conf:(0.71) < lift:(1.63)>

Then I applied the same process to the file RULL2 and found interesting patterns among religion, colours, and area, ignoring the rest of the attributes, which are not as important as these. Result: 18 rules were found. Interesting pattern:
1) area=Below religion=Catholic 6 ==> colours=Three 4    conf:(0.67) < lift:(1.52)>

Figure 28: Generating rules

The area attribute was divided into two parts, Above (> 1000) and Below (< 1000), in thousands of square km.
Which rules to show the client depends on which are interesting and on the client's request. An example is shown in the next section.

E. Recommendations
The client can use the discovered rules in education and research, or for tourism information; it depends on the goal he wants to pursue with them. Say the client runs a program to travel around the world spreading the religion of Islam. He has a few details about the countries he intends to visit, and he needs statistics on the environment and beliefs of countries matching those details, to prepare himself for the community. As a small example, let us narrow the range to the attributes I have (religion, area, language). A program can be built on top of WEKA that takes the client's details as attribute values and computes statistics with WEKA to return results; for example, take a look at the snapshots below of the program I built using ASP.NET.
Due to lack of time, and just to show the representative idea, the program I built is not connected to WEKA; the results were extracted from WEKA beforehand.

The button shows the client the results of exactly his search, plus similar results that share some of his search data.

Experiments
Extra work: my own tests, based on my understanding, applied to the flag-for-test dataset.
For the second problem I created noise and resolved it with clustering in WEKA, as follows:
Open the file > Choose button > weka > filters > unsupervised > attribute > AddNoise > in the window that appears, specify 50% noise to be applied to the last attribute > click Apply > save.
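A rough stand-in for what the AddNoise filter does (re-assigning a percentage of the values of one attribute) can be sketched as:

```python
# Hypothetical equivalent of WEKA's AddNoise filter on a binary nominal
# attribute: pick 50% of the rows at random and flip their value.
import random

random.seed(3)
orange = [0] * 16                         # before noise: one unique value
noisy = list(orange)
idx = random.sample(range(len(noisy)), k=len(noisy) // 2)   # 50% of the rows
for i in idx:
    noisy[i] = 1 - noisy[i]              # flip 0 <-> 1
print(sum(noisy))
```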
Here you can see the data before the noise affected it, where the orange attribute had a single unique value.

Figure 29: before noise

Here you can see the data after the noise function was applied.

Figure 30: After the noise is added

To minimize the noise, go to the Cluster tab > under Cluster mode, select the radio button Classes to clusters evaluation, then choose the attribute in which you created the noise > Start.
Figure 31 : Cluster

Separate experiments applied to flag-for-test for reduction using Excel filters:


In my dataset I have 194 records. I used Excel's filtering feature and filtered my data by the Circle, Triangle, and Orange attributes. For the first column I set the filter to show the data for flags that have the values 1, 2, or 4, i.e. only flags with 1, 2, or 4 circles. For the second column I showed only the rows with the value 0, meaning only flags without triangles. For the third column I likewise kept only the value 0, i.e. flags that do not contain the colour orange. Now my data set has 104 records.
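The same three Excel column filters can be expressed as a single boolean mask in pandas; the five rows below are made up for illustration:

```python
# The three Excel filters combined: 1, 2, or 4 circles, no triangles,
# and no orange colour.
import pandas as pd

df = pd.DataFrame({
    "name":     ["A", "B", "C", "D", "E"],
    "circles":  [1, 3, 2, 4, 2],
    "triangle": [0, 0, 1, 0, 0],
    "orange":   [0, 0, 0, 1, 0],
})
kept = df[df["circles"].isin([1, 2, 4]) & (df["triangle"] == 0) & (df["orange"] == 0)]
print(kept["name"].tolist())
```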
