
Proceedings of the International Conference on Communication and Electronics Systems (ICCES 2018)

IEEE Xplore Part Number:CFP18AWO-ART; ISBN:978-1-5386-4765-3

Preprocessing Using Attribute Selection in Data Stream Mining
R. Sangeetha, Research Scholar, Dept. of Computer Science, Erode Arts and Science College, Erode, India (geethsanmca@gmail.com)
Dr. S. Sathappan, Associate Professor, Dept. of Computer Science, Erode Arts and Science College, Erode, India (devisathappan@gmail.com)

Abstract - Data stream mining deals with an enormous, continuously growing, dynamic set of data in which a time field is registered as information is entered online. Such data includes numerous attributes that ease communication and the ordering process between the customer and the commercial centre. The basic principle of mining is to analyse the data from various perspectives. The need to explore such data and turn it into useful information, in both online and offline streaming, makes this a major challenge. Data collection and preprocessing are the essential early stages of mining. The methods used in preprocessing enhance and ease the process; without them, mining and gaining knowledge about the data become difficult. Integrating and transforming the data into an understandable format needs efficient preprocessing tools and procedures. This paper explores preprocessing methods by applying Attribute Selection in the WEKA tool, which aids in producing a simplified and structured set of information. The proposed attribute selection method removes irrelevant attributes by using the Cfs Subset evaluator with the Greedy search method. Finally, the size of the file before and after preprocessing and the number of attributes eliminated through this method are listed as the performance evaluation.

Keywords - Data stream mining, Preprocessing, Attribute selection, Correlation coefficient, Greedy search.

I. INTRODUCTION

Data stream mining refers to analysing high-speed, enormous data with no limit. [1] Streams have the features of continual arrival, multiplicity, rapidity, time variance, boundlessness and unpredictability, which raise new research problems and challenges. Analysing this streaming model has drawn significant attention from researchers in the past few years. The challenges include memory limitation, faster computation in just one pass, and so on, so the extraction of the hidden knowledge requires automated, efficient techniques. However, classic data mining procedures do not fit stream mining, since they involve multiple passes over the data to explore the patterns. The patterns also keep changing over time, so the changes should be recorded. As the volume of data is bulky in the case of streaming, resource allocation is a big challenge. To minimise this issue, data preprocessing is required, in which the irrelevant attributes and duplicate instances are removed to reduce the file size and memory space.

Preprocessing
Preprocessing is a procedure for cleaning, integrating and transforming the data before further processing. [2] A preprocessor is a program that processes the raw data into the clean and refined set that is used as input to another program. The data contains impurities, missing values and redundancies that make it unsuitable for direct processing, as these lead to low accuracy and high complexity. Streaming data carries a long list of attributes for the customer's convenience, but not all of them are needed for the analysis. So the preprocessing technique eliminates the extraneous attributes and instances while, in parallel, retaining the set with the most relevant information. The final stage of data stream mining is acquiring the knowledge, but it can neither be obtained easily, due to the huge data growth, nor be understood easily, unless a preprocessing method is employed.

[Figure 1 is a flow diagram: Raw data -> Preprocessing methods (Data cleaning, Integration, Selection, Reduction and Transformation) -> Preprocessed data for analysis.]

Figure 1. Data Preprocessing
Figure 1 shows the data preprocessing methods. They consist of Cleaning, Integration, Selection and Transformation.
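To make the flow of Figure 1 concrete, the sketch below chains a few standard WEKA filters for the cleaning and reduction stages. It is only a minimal illustration under assumed names: the input file raw.csv, the removed attribute index and the particular filters are placeholders, not the configuration used later in this paper.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.instance.RemoveDuplicates;

public class PreprocessingPipeline {
    public static void main(String[] args) throws Exception {
        // Raw data: any format WEKA can read (ARFF, CSV, ...); "raw.csv" is a placeholder.
        Instances raw = DataSource.read("raw.csv");
        raw.setClassIndex(raw.numAttributes() - 1);

        // Cleaning: fill missing values with the mean/mode of each attribute.
        ReplaceMissingValues fillMissing = new ReplaceMissingValues();
        fillMissing.setInputFormat(raw);
        Instances cleaned = Filter.useFilter(raw, fillMissing);

        // Selection: drop an attribute assumed to be irrelevant (index 1 here is only an example).
        Remove dropIrrelevant = new Remove();
        dropIrrelevant.setAttributeIndices("1");
        dropIrrelevant.setInputFormat(cleaned);
        Instances reduced = Filter.useFilter(cleaned, dropIrrelevant);

        // Reduction: remove duplicate instances.
        RemoveDuplicates dedup = new RemoveDuplicates();
        dedup.setInputFormat(reduced);
        Instances preprocessed = Filter.useFilter(reduced, dedup);

        System.out.println("Instances before/after: " + raw.numInstances()
                + " / " + preprocessed.numInstances());
    }
}
```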


Steps in Preprocessing:

A. Data Cleaning
The data may be incomplete, noisy and inconsistent. There might be missing values and attributes, and the data may also contain errors and outliers. If the record is not pure, the results would be neither reliable nor accurate.
This process involves techniques including Aggregation, Sampling, Removal, Filling the missed value, Filling the missing attribute, etc.
Strategies:
Missing Values
The file can have many missing entries, and these make way for misclassification.
Methods to handle missing values:
1. Ignoring the tuple: This is taken when the label is missing. While ignoring the tuple, the remaining attributes of that tuple are also not used. This is not effective unless the tuple contains several attributes with missed values.
2. Filling the missed value: The values are entered manually, which is time consuming.
3. Replace the value: Replace all missing values by a unique constant. This method is easy but simplistic.
4. Use a mean or median to fill the value: This picks the "middle" value of a series to fill the value. Symmetric distributions call for the mean value, while skewed distributions call for the median to fill or replace.
5. Use the most probable value to fill the value: Use "Bayesian method or decision tree" induction to find the most probable value.

Noisy Data
Noise is defined as a "random error or variance" in a measured variable.
Methods to handle noisy data:
1. Binning: These methods sort the data and smooth each value based on the values around it. The values are allotted into a number of "bins"; in one variation, each value is exchanged for the bin mean.
2. Regression: Data smoothing can also be done by regression, which fits the values to a function. Linear regression finds the "best" line between two attributes so that one can be used to foresee the other. Multiple linear regression is a variation in which more than two attributes are selected and the data are fitted to a multi-dimensional surface.
3. Outlier analysis: Outliers are analysed by clustering. Similar values are formed into units, and the values that fit no unit form the outliers.
The output of the cleaning process is a structured one.

B. Data Integration
The facts from different sources are integrated into one unit. They are in different formats at different sites. The sources may be 'files, spreadsheets, documents, data cubes, the internet' and so on. Integration is a vital task because the records come from different sources and do not match. It is really difficult to confirm whether the entities in two different sources have the same value or not. To reduce errors in the integration process, 'Metadata' is used. The most usual issue in the dataset is redundancy: the same fact may be available in different datasets or even in different sources. Without affecting the reliability, the integration process tries to reduce redundancy.
Strategies:
1. Manual Integration - The user operates directly on the relevant information.
2. Application Integration - A particular application does all the process.
3. Middleware Integration - The integration process is transferred to the middleware.
4. Virtual Integration - Data resides in the source systems and a set of unified views is provided for accessing it.
5. Physical Integration - Creates a new system that keeps a copy of the data from the base systems.

C. Data Selection and Reduction
The relevant data is retrieved for analysis. As the information is collected from various sources, it must be selected based on the requirements. Not all the fields are needed for a particular analysis. So the relevant records are selected with their attributes before mining, and all the others are reduced.
Strategies:
Data Selection
Relevant attributes are selected while the others are eliminated, manually or by procedures. It creates a subset of meaningful attributes, reducing the inputs for processing and analysis, or finding the most meaningful inputs.
Selection methods:
1. Filter method: Features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable.
2. Wrapper method: A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.
3. Embedded method: Embedded methods assess the best contributing features for the accuracy of the model while it is being created.

Data Reduction
Data reduction techniques are utilized to obtain a reduced representation of the data set that is much smaller in dimension but still contains the important information.
Reduction methods:
1. Dimensionality reduction
It lessens the number of variables or attributes on some consideration. It includes:
a. Wavelet transforms: A wavelet transform maps a vector into a numerically different vector of wavelet coefficients. The wavelet-transformed vector can be truncated.
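The missing-value strategy 4 (mean or median fill) and the binning method under Noisy Data above are easy to express in a few lines. The following sketch is a plain illustration with made-up values, not the paper's implementation: it fills missing entries of a numeric column with the column mean and then smooths the column by replacing each value with the mean of its equal-width bin.

```java
import java.util.Arrays;

public class CleaningSketch {
    // Fill missing entries (encoded here as Double.NaN) with the column mean.
    static double[] fillWithMean(double[] column) {
        double sum = 0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) { sum += v; count++; }
        }
        double mean = sum / count;
        double[] filled = column.clone();
        for (int i = 0; i < filled.length; i++) {
            if (Double.isNaN(filled[i])) filled[i] = mean;
        }
        return filled;
    }

    // Smooth by equal-width bins: each value is replaced by the mean of its bin.
    static double[] smoothByBinMeans(double[] column, int numBins) {
        double min = Arrays.stream(column).min().getAsDouble();
        double max = Arrays.stream(column).max().getAsDouble();
        double width = (max - min) / numBins;

        double[] binSum = new double[numBins];
        int[] binCount = new int[numBins];
        int[] binOf = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            int b = (int) Math.min(numBins - 1, (column[i] - min) / width);
            binOf[i] = b;
            binSum[b] += column[i];
            binCount[b]++;
        }
        double[] smoothed = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            smoothed[i] = binSum[binOf[i]] / binCount[binOf[i]];
        }
        return smoothed;
    }

    public static void main(String[] args) {
        // Toy column, e.g. an ordered quantity with two missing entries.
        double[] quantity = {2, 3, Double.NaN, 5, 40, 4, Double.NaN, 6};
        double[] filled = fillWithMean(quantity);
        double[] smoothed = smoothByBinMeans(filled, 3);
        System.out.println("Filled:   " + Arrays.toString(filled));
        System.out.println("Smoothed: " + Arrays.toString(smoothed));
    }
}
```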


b. Principal Components Analysis: Selects the important attributes.
c. Attribute subset formation: Redundant attributes are removed by forming a subset.

2. Numerosity reduction
This replaces the initial volume with other, simpler forms of data representation.
It may be:
a. Parametric: Data parameters are stored instead of the actual data.
b. Non-parametric: Simplified forms of the data are stored using histograms or sampling.

3. Compression
Transformations are applied to get a "reduced or compressed" form of the original.
It may be:
a. Lossless: The primary information can be recreated without loss.
b. Lossy: The primary information cannot be recreated; only an approximation is obtained.
There are numerous lossless algorithms for compression, but they allow only limited data manipulation. Dimensionality reduction and numerosity reduction are also considered forms of compression.

D. Data Transformation
The records are in different formats because the sources are different. They might be transformed into a form suitable for mining. This process includes 'Generalization, Aggregation, Normalization', etc. The database can have mixed data types such as integer and floating point data. Before it is mined, it might be changed into a single form for consistency.
Transformation Strategies
1. Smoothing: It removes noise. Techniques include 'binning, regression, and clustering'.
2. Attribute construction: New attributes are created and added.
3. Aggregation: Constructs a data cube for analysis.
4. Normalization: Attributes are scaled into a small range, which assigns comparable numbers to all the attributes.
5. Discretization: Numeric values are replaced by intervals.
6. Hierarchy generation: Attributes are modified from a higher to a lower level.

The remaining sections of this paper are organized as follows: Section II describes the attribute/feature selection procedures in preprocessing, Section III explains the related works, Section IV elaborates the existing and the proposed preprocessing procedures, Section V lists the experimental results with evaluation metrics and Section VI concludes the survey and lists the future work.

II. ATTRIBUTE/FEATURE SELECTION IN PREPROCESSING

Data streaming shows tremendous growth in data owing to customer and electronic commerce communication. The data should be fed to classification and clustering techniques to get meaningful patterns. [3] The representation of the data often consists of too many features, but only a few of them might be related to the concept. The features that explain the data should be limited, by eliminating the irrelevant and redundant ones, to ease the mining process. For that ease, attribute selection procedures are required in data preprocessing.
Features could be of any type, such as discrete, continuous or nominal. Selection is the process of selecting a subset of the original features according to certain criteria. It increases the speed and accuracy of the mining algorithm.

Objectives of attribute selection:
a. To simplify the models
b. To reduce training times
c. To lessen the difficulties of dimensionality
d. To enhance the process
e. To improve the accuracy
f. To avoid over fitting

Procedure Feature subset
Step 1: Start with all features.
Step 2: Apply an evaluator for the generation of the feature subset.
Step 3: Put in a search method for the evaluation of the subset.
Step 4: Evaluate the stopping criteria.
Step 5: Validate the subset.

Description
1. Generation of the feature subset
It begins with all of the features and a search strategy. An exhaustive search needs to examine all 2^n possible subsets of n features.
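To see why exhaustive generation is impractical, the toy sketch below enumerates every one of the 2^n candidate subsets for a small attribute list and scores each with a placeholder evaluator. The attribute names and the scoring rule are illustrative only, not the evaluator proposed in this paper.

```java
import java.util.ArrayList;
import java.util.List;

public class ExhaustiveSubsetSketch {
    public static void main(String[] args) {
        String[] features = {"Ivno", "Scode", "Quantity", "Uprice"}; // toy attribute list
        int n = features.length;
        long total = 1L << n;                     // 2^n candidate subsets
        System.out.println("Candidate subsets for n=" + n + ": " + total);

        for (long mask = 0; mask < total; mask++) {
            List<String> subset = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1L << i)) != 0) subset.add(features[i]);
            }
            // Placeholder evaluator: a real one would use CFS merit, information gain, etc.
            double score = subset.isEmpty() ? 0.0 : 1.0 / subset.size();
            System.out.println(subset + " -> score " + score);
        }
        // For the 8 attributes of the dataset used later, 2^8 = 256 subsets would already
        // have to be evaluated; a greedy search avoids this blow-up.
    }
}
```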


Characteristics of feature subset selection
The following characteristics should be identified, with the irrelevant and redundant features removed.
a. Relevant: These features have an influence on the output and play a significant role in describing the data.
b. Irrelevant: These features do not have any effect on the outcome and have no vital role.
c. Redundant: A redundancy exists whenever a feature takes over the role of another feature.

2. Subset Evaluation
After generating subsets of features, they should be evaluated. There are three types of approaches, namely Filter, Wrapper and Embedded, used as evaluators.

3. Stopping Criteria
A stopping criterion controls the flow of the procedure and determines when the process must terminate. It is applied within a search algorithm. It is necessary to halt the search after the number of selected features reaches a pre-determined threshold value. After an iterative process, the best subsets among the generated candidates are selected.
Frequently used stopping criteria:
- The search reaches the end and is completed.
- The search reaches a specified boundary, set as a minimum number of features or a maximum number of iterations.
- A reasonably large number of good subsets has been found.

4. Validation
The straightforward method for validating is to apply a learning algorithm and measure the result against some prior knowledge about the description of the features.

Advantages of Attribute Selection Procedure
- Improves the quality of the model.
- Reduction in memory space.
- Able to discover meaningful patterns with a limited search.
- Reduction from high to low dimensionality lessens the complexity and improves the accuracy.

III. LITERATURE REVIEW

Poonam et al. [4] did a survey on data stream mining. A traditional dataset stores records without the concept of time unless it is added explicitly. Data stream mining differs, as it includes real-time data with the time concept, so different and efficient techniques should be employed for this online analysis. This paper identifies prominent streaming techniques through a theoretical analysis of stream mining.
Divya Tomar et al. [5] made a survey on preprocessing and post-processing techniques in data mining. The data may include several inconsistencies, missing values and irrelevant data. These are all removed with the use of preprocessing, which is carried out earlier. This paper elaborates the pre- and post-processing with various methods. It also describes three visualization tools, as visualization is vital in exploring the data.
Francisca Rosario et al. [6] put forth a preprocessing procedure to select the significant attributes from the whole set. Relief is a feature selection algorithm with random selection of features. This algorithm adopts random selection for weight estimation. The attribute weight estimated by Relief has a probabilistic interpretation. A sample is selected, and the nearest hit and nearest miss are identified. The results show that the accuracy is better when the Relief preprocessing procedure is applied.
Uma K, M. Hanumanthappa et al. [7] proposed data collection and preprocessing methods for a healthcare data set. This set contains structured and unstructured data and combines text and images in a single file. This type of heterogeneous data needs the best preprocessing tool for the cleaning of data before further analysis. The paper discusses the collection methods, the types of preprocessing techniques and their need. For handling missing values the paper employs the imputation method, and finally a data reduction method is used to eliminate the complexity and cost.
Tolga Demirhan et al. [8] analysed performance and achievement of students with preprocessing techniques in WEKA. The information was created through the voluntary participation of Trakya University; 51 questions were framed for 156 students to assess. The contribution of each attribute is found by the Select Attributes algorithms provided by WEKA. In this study, Info-Gain Attribute Evaluation was used to reveal how much information is provided by each attribute of each instance in the set. Finally, the five most effective attributes were selected through this method.

IV. METHODOLOGY

Existing Preprocessing Procedure:
The existing procedures include the Information Gain and Relief-F selection measures as preprocessing methods for feature or attribute selection.

A. Information Gain (IG) - Attribute selection
To determine the best attribute, the measure called Information Gain is used. [8] A common measure for the information is the Shannon entropy.
Entropy measure:
Information entropy is the average rate at which information is produced by a source of data. The measure of information entropy associated with each possible data value is the negative log of its probability mass function, as given in (1). Thus, data with a lower probability carries more information than a higher-probability value.
The entropy of Y is

    H(Y) = -\sum_{i} p_i \log_2 p_i    --- (1)

where p_i is the probability of the i-th value of the random variable Y.
If the observed values in the training data set D are partitioned according to the values of a second attribute X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of Y prior to partitioning, then there is a relationship between the attributes Y and X, as given in (2).
The entropy of Y after observing X is

    H(Y \mid X) = -\sum_{x} p(x) \sum_{y} p(y \mid x) \log_2 p(y \mid x)    --- (2)
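The entropy and conditional entropy in (1) and (2), and the information gain defined in (3) below, can be computed directly from frequency counts. The short sketch below does this for two small nominal columns; it is an illustration of the formulas only, with made-up values, not the paper's experiment.

```java
import java.util.HashMap;
import java.util.Map;

public class EntropySketch {
    // Shannon entropy of a nominal column, equation (1).
    static double entropy(String[] y) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : y) counts.merge(v, 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / y.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Conditional entropy H(Y|X), equation (2): weighted entropy of Y within each partition of X.
    static double conditionalEntropy(String[] y, String[] x) {
        Map<String, java.util.List<String>> partitions = new HashMap<>();
        for (int i = 0; i < x.length; i++) {
            partitions.computeIfAbsent(x[i], k -> new java.util.ArrayList<>()).add(y[i]);
        }
        double h = 0.0;
        for (java.util.List<String> part : partitions.values()) {
            double weight = (double) part.size() / y.length;
            h += weight * entropy(part.toArray(new String[0]));
        }
        return h;
    }

    public static void main(String[] args) {
        // Toy class labels and a toy attribute; the values are illustrative only.
        String[] cls  = {"yes", "yes", "no", "no", "yes", "no"};
        String[] attr = {"uk",  "uk",  "fr", "fr", "uk",  "fr"};

        double hY = entropy(cls);
        double hYgivenX = conditionalEntropy(cls, attr);
        System.out.println("H(Y)     = " + hY);
        System.out.println("H(Y|X)   = " + hYgivenX);
        System.out.println("IG(Y; X) = " + (hY - hYgivenX)); // equation (3)
    }
}
```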


Then the Information Gain (IG) is calculated as

    IG(Y; X) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y)    --- (3)

Equation (3) shows that the information gained about Y after observing X is equal to the information gained about X after observing Y.

Procedure IG
Step 1: Initialize the feature set with zero.
Step 2: Calculate the Shannon entropy for the class.
Step 3: Calculate the entropy for the attribute values.
Step 4: Measure the conditional probability of each term.
Step 5: Select the term with the highest information gain.
Step 6: Add the term to the feature set.

Advantage:
- Significantly improves the accuracy and complexity by eliminating irrelevant attributes.
Disadvantage:
- It is biased in favour of attributes with more values, even when they are not more informative.

B. Relief-F - Attribute Selection
It is expanded as "Relief Feature". Relief-F is a feature selection algorithm with random selection of features and weight calculation. This algorithm weights each feature according to its relevance to the class. [6] Initially, all weights are set to zero and then updated iteratively. A sample is selected from the data; the nearest neighbouring sample that belongs to the same class, known as the nearest hit, and the nearest neighbouring sample that belongs to the opposite class, known as the nearest miss, are identified. A change in attribute value accompanied by a change in class increases the weighting of the attribute, while a change with no change in class decreases the weighting of the attribute. This procedure of updating the weight of the attribute is performed over a random set, for every sample in the data. The weight updates are then averaged so that the final weight lies in the range [-1, 1]. The attribute weight estimated in (4) has a probabilistic interpretation: it is proportional to the difference between two conditional probabilities, involving the nearest miss and the nearest hit respectively.
The weight is updated by

    W_i = W_i - (x_i - \mathrm{nearhit}_i)^2 + (x_i - \mathrm{nearmiss}_i)^2    --- (4)

where W_i is the weight of feature i and x_i is the value of feature i in the sampled instance. The nearest hit and nearest miss are found using Euclidean distance.

Procedure Relief-F
Step 1: Initialize the weights with zero.
Step 2: Randomly select instances.
Step 3: Find the nearest hit and nearest miss for the randomly selected instances based on Euclidean distance.
Step 4: Calculate the weight for each feature.
Step 5: Select the features above the threshold value.
Step 6: Add the selected features to the feature set.

Advantages:
- More robust with noisy data.
- Handles multi-class problems.
- Robust with incomplete data.
Disadvantages:
- The estimation of features using Euclidean distance, which uses the mean value, will have some negative effect if the instances of the features are outliers.

Proposed Preprocessing Procedure:
The proposed method employs an attribute selection method which involves the Correlation feature subset (Cfs) evaluator with the Greedy forward search method. Cfs is a filter-based method.
Filter Method:
The filter model relies on analysing the general characteristics of the data [3] without involving any learning algorithm, and it is free from the bias associated with any learning model.

[Figure 2 is a flow diagram of the filter approach: the dataset with all features is passed to a feature subset evaluator plus a search method, which produce a subset of features; the subset is then given to the learning algorithm for knowledge gaining.]

Figure 2. Filter method
Figure 2 shows the steps in evaluating the filter method. This approach is efficient and fast to compute.

Proposed Filter based Evaluator:
In the proposed method the features are scored, and the features with the best scores are used in building the model, while the others are not used for the analysis. Usually the features are ranked by the Pearson correlation score as in (5). It is computed by calculating the covariance of two variables and dividing by the product of their standard deviations.
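The Pearson coefficient described above (covariance divided by the product of the standard deviations) is the building block for both the feature-class and feature-feature correlations used in (5). A small, self-contained sketch with made-up numeric columns follows.

```java
public class PearsonSketch {
    // Pearson correlation: covariance of x and y divided by the product of their standard deviations.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;

        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            cov  += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
            varY += (y[i] - meanY) * (y[i] - meanY);
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Illustrative columns only: a numeric attribute and a numeric target.
        double[] feature = {2, 4, 6, 8, 10, 3};
        double[] target  = {1, 2, 3, 4,  5, 1.4};
        System.out.println("Pearson r = " + pearson(feature, target));
    }
}
```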


Cfs subset Evaluator:
It is expanded as "Correlation feature subset". It evaluates the worth of a subset [9] by considering the individual predictive ability of each attribute along with the degree of redundancy between them. It starts with discretization, which is the process of converting a large number of data values into a smaller number; the rest of the features should be ignored. Correlation is a well-known similarity measure for quantifying the information shared between two features. If two features are linearly dependent, their correlation coefficient is ±1. If they are uncorrelated, the correlation coefficient is 0.
Hypothesis:
"A good feature subset is one that contains features which are highly correlated with the class, but uncorrelated with each other."
This leads to two definitions.
Feature-class correlation - Indicates how much a feature is correlated to a specific class.
Feature-feature correlation - Indicates the correlation between two features.
Pearson Correlation:

    \mathrm{Merit}_S = \frac{k \, \overline{r}_{cf}}{\sqrt{k + k(k-1) \, \overline{r}_{ff}}}    --- (5)

Equation (5) states the merit of a feature subset, where:
k - the number of features in the set;
Merit_S - the merit of the group of attributes S comprising the k attributes chosen from the whole set;
r_cf - the average correlation between the chosen attributes and the class;
r_ff - the average inter-correlation among the chosen attributes within the group.
Here, r_cf is the average feature-class correlation and r_ff is the average feature-feature correlation.

Proposed search method:
A search method is a stage-by-stage procedure used to find significant data within a collection. It is considered a fundamental procedure in computing. Its efficiency is measured by the number of comparisons the search performs in the worst case. The search algorithm depends on the data structure and on prior knowledge about the data.

Greedy forward stepwise search method:
The greedy approach is based [10] on the heuristic of making the locally optimal choice at each node; by making these locally optimal choices, it aims to reach the optimal solution.
The algorithm can be summarized as:
1. At each stage, pick out the best attribute, usually with respect to the nominal class value, as the test condition.
2. Now split the node into the possible outcomes.
3. Repeat the above steps till all the test conditions have been exhausted into leaves.
It is the simplest greedy algorithm. A greedy algorithm is a paradigm of making the locally optimal choice at each stage; there is no backtracking. Greedy search may start with no attributes in the subset, with all of them, or at any point in between, while searching for the best features. By traversing the space it produces a ranked list of attributes and records the attributes which are selected. It starts from the empty set, selects the features which satisfy the objective function and updates the final set.
In general, greedy algorithms have five components:
i. A candidate set, from which a solution is created.
ii. A selection function, which chooses the best candidate to be added to the solution.
iii. A feasibility function, which is used to determine if a candidate can be used to contribute to a solution.
iv. An objective function, which assigns a value to a solution, or a partial solution.
v. A solution function, which will indicate when we have discovered a complete solution.

Procedure Cfs with Greedy
Step 1: Initialize the feature subset as empty.
Step 2: Discretize the training dataset.
Step 3: Calculate the Pearson correlation over the set.
Step 4: Find the feature-class and feature-feature correlations.
Step 5: Select the features which are highly correlated.
Step 6: Add them to the feature subset.
Step 7: Initialize the candidate set with the x entries obtained from the correlation feature subset.
Step 8: Initialize the optimum solution set to empty.
Step 9: Select an entry from the candidate set using the Greedy strategy.
Step 10: If it is optimal, with high values, move it to the optimum solution set, else remove it.
Step 11: Repeat step 9 until the candidate set is empty.
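In WEKA, the evaluator and search combination described in this procedure corresponds to the CfsSubsetEval evaluator paired with GreedyStepwise searching forward. The sketch below is a minimal illustration of that pairing; the ARFF file name and the choice of the last attribute as the class are assumptions, not taken from the paper.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CfsGreedySelection {
    public static void main(String[] args) throws Exception {
        // "OnlineRetail.arff" is a placeholder name for the dataset described in Section V.
        Instances data = DataSource.read("OnlineRetail.arff");
        data.setClassIndex(data.numAttributes() - 1);   // assume the class is the last attribute

        CfsSubsetEval evaluator = new CfsSubsetEval();   // correlation-based subset merit, eq. (5)
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(false);                // greedy forward stepwise search

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(evaluator);
        selector.setSearch(search);
        selector.SelectAttributes(data);

        // Indices of the retained attributes (the class index is appended at the end).
        int[] selected = selector.selectedAttributes();
        System.out.println("Selected attribute indices: " + java.util.Arrays.toString(selected));
    }
}
```

A configuration of this kind produces a reduced attribute list of the sort reported for the proposed procedure in Table 2.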


Advantages of Cfs with Greedy procedure
a. Fast and easy to apply, being filter based.
b. Greedy takes a decision at every step.
c. It never backtracks.
d. It lessens the storage space.
e. It minimizes the time and complexity of the further mining process by reducing the feature set.

Disadvantages of Cfs with Greedy procedure
a. For large, multidimensional data it won't give the optimum solution.
b. It takes a longer time for big data.

V. EXPERIMENTAL RESULTS

This work is carried out in WEKA 3.8.6. The "Waikato Environment for Knowledge Analysis" [11] is a machine learning software package coded in Java, built at the 'University of Waikato, New Zealand'. It natively supports 'ARFF' files; an Excel file can simply be transformed to ARFF format via the 'CSV' format. WEKA also provides access to SQL databases and to the results returned by a database query.

Data set
The Online Retail dataset is taken from the UCI repository [12]. The dataset holds one week of online ordering details, with eight attributes and 9974 instances, from a UK-based registered online retail store. The store sells unique all-occasion gifts, and many customers are wholesalers.

Data set description

Table 1. Dataset description
Attribute Number | Attribute Name | Attribute Description
A1 | Ivno     | Invoice number
A2 | Scode    | Stock code
A3 | Product  | Name of the product
A4 | Quantity | Ordered quantity
A5 | Ivdte    | Invoice date
A6 | Uprice   | Unit price
A7 | Cusid    | Customer identity
A8 | Country  | Customer country

Table 1 lists the attributes, with their names and descriptions.

Data set before Preprocessing
[Figure 3 is a screenshot of the WEKA attribute list before preprocessing.]
Figure 3. Attributes before Preprocessing
Figure 3 shows the attribute list screen before preprocessing in WEKA. There are eight attributes.

Data set after Preprocessing
[Figure 4 is a screenshot of the WEKA attribute list after preprocessing.]
Figure 4. Attributes after Preprocessing
Figure 4 shows the attribute list screen after preprocessing in WEKA. There are five attributes.

Evaluation

Table 2. Evaluation metrics
Metrics | Existing Procedure 1 (ReliefF) | Existing Procedure 2 (Information Gain) | Proposed Procedure (Correlation feature subset)
Number of attributes selected | 6 - A1, A2, A3, A5, A7, A8 | 6 - A1, A3, A4, A5, A7, A8 | 5 - A2, A4, A6, A7, A8
Selected attributes | Ivno, Scode, Product, Ivdte, Cusid, Country | Ivno, Product, Quantity, Ivdte, Cusid, Country | Scode, Quantity, Uprice, Cusid, Country
Class value types handled | Binary, date, Numeric, Nominal, Missing value | Binary, Nominal, Missing value | Binary, date, Numeric, Nominal, Missing value
File size in KB before / after preprocessing | 822 / 751 | 822 / 716 | 822 / 228

Table 2 lists the performance of the existing and proposed procedures with the evaluation metrics. Attributes A1, A2, ..., A8 are described in Table 1.
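One way to obtain a reduced dataset of the kind shown in Figure 4 and measured in Table 2 is to apply WEKA's supervised attribute-selection filter with the same evaluator and search, and then save the result to a new ARFF file whose size can be compared with the original. The snippet below is a sketch under assumed file names, not the exact procedure followed in the paper.

```java
import java.io.File;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class SaveReducedDataset {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("OnlineRetail.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Attribute-selection filter configured with the Cfs evaluator and greedy forward search.
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new CfsSubsetEval());
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(false);
        filter.setSearch(search);
        filter.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, filter);
        System.out.println("Attributes before/after: "
                + data.numAttributes() + " / " + reduced.numAttributes());

        // Save the reduced dataset; comparing file sizes gives the last row of Table 2.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(reduced);
        saver.setFile(new File("OnlineRetail_reduced.arff"));    // placeholder output name
        saver.writeBatch();
    }
}
```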


Evaluation Metrics Description:

Number of attributes selected
In the existing procedures 1 and 2, six attributes are selected, while in the proposed procedure five relevant attributes are selected.

Selected attributes
In the proposed procedure the selected attributes alone define the data set in a meaningful way; there is no redundant description.
In existing procedure 1, the selected attributes Scode and Product carry redundant information, as both describe the product.
In existing procedure 2, the selected attributes Ivno and Ivdte carry redundant information.
In the proposed procedure the relevant attributes, with no redundancy, are selected, and the five attributes give in-depth knowledge about the dataset.

Capability in handling class value types
Existing procedure 1 can handle various class value types.
Existing procedure 2 can handle three types of class values.
The proposed procedure can handle various class value types.

File size before / after preprocessing
The file size before preprocessing is 822 KB. It is reduced by eliminating attributes during preprocessing, which eases the further processing and needs a limited amount of storage space.
In the existing procedures, the Product and Ivdte attributes are selected, which leads to high storage space.
The proposed procedure lessens the size of the file to the maximum extent by eliminating the insignificant attributes and the attributes which give redundant information while, in parallel, retaining the most relevant attributes. In doing so, the proposed procedure removes the product name and the invoice date, which need more bits of storage than the rest of the attributes, and so the size is minimized to the maximum extent. The name of the product can still be obtained from the stock code (Scode) of the product.

VI. CONCLUSION AND FUTURE WORK

A data stream contains dynamic, fast-growing and unstructured data arriving online. Mining this kind of data requires highly efficient techniques. Preprocessing eases this mining by removing irrelevant information and reduces the size and complexity involved. An attribute selection procedure with the Cfs subset evaluator and the Greedy search method is employed as the preprocessing procedure. The evaluation metrics are measured for the existing and proposed procedures. The proposed procedure in WEKA performs better than the existing ones by retaining the most relevant attributes while reducing the number of attributes and the file size.
This preprocessing work will be carried forward in future to further data mining or stream mining tasks, such as Classification and Clustering, for analysing the underlying hidden patterns.

REFERENCES

[1] C. Aggarwal, "Data Streams: Models and Algorithms", Springer Science and Business Media, Volume 31, 2007.
[2] J. Gama, "Knowledge Discovery from Data Streams", Chapman and Hall/CRC, 2010.
[3] K. Sudha, J. Jebamalar Tamilselvi, "A Review of Feature Selection Algorithms for Data Mining Techniques", International Journal on Computer Science and Engineering (IJCSE), Vol. 7, No. 6, June 2015.
[4] Poonam Debnath, Santosh Kumar Chobe, "A Quick Survey on Data Stream Mining", International Journal of Computer Science and Information Technologies, Vol. 5 (3), 2014.
[5] Divya Tomar, Sonali Agarwal, "A Survey on Pre-processing and Post-processing Techniques in Data Mining", International Journal of Database Theory and Application, Vol. 7, No. 4, 2014.
[6] S. Francisca Rosario, K. Thangadurai, "RELIEF: Feature Selection Approach", International Journal of Innovative Research & Development, Vol. 4, Issue 11, 2015.
[7] Uma K, M. Hanumanthappa, "Data Collection Methods and Data Preprocessing Techniques for Healthcare Data Using Data Mining", International Journal of Scientific & Engineering Research, Volume 8, Issue 6, June 2017.
[8] Tolga Demirhan, Ilker Hacioglu, "Performance and Achievement Analysis of a Dataset of Distance Education Samples with WEKA", The Eurasia Proceedings of Educational & Social Sciences (EPESS), Volume 8, 2017.
[9] M. A. Hall, "Correlation-based Feature Selection for Machine Learning", Department of Computer Science, The University of Waikato, Hamilton, New Zealand, 1999.
[10] http://en.wikipedia.org/wiki/Greedy_algorithm
[11] Eshwari Girish Kulkarni, Raj B. Kulkarni, "WEKA: Powerful Tool in Data Mining", International Journal of Computer Applications (0975-8887), National Seminar on Recent Trends in Data Mining (RTDM), 2016.
[12] C. Blake, E. Keogh, C. J. Merz, "UCI Repository of Machine Learning Databases".

