Abstract - Data stream mining deals with an enormous, continuously growing, dynamic set of data, time-stamped as the information is entered online. Such data includes numerous attributes that ease the communication and ordering process between the customer and the commercial centre. The basic principle of mining is to analyze the data from various perspectives. The need to turn such data into useful information, in both online and offline streaming, makes this a major challenge. Data collection and preprocessing are the essential early stages of mining. The methods used in preprocessing enhance and ease the process; without them, mining and gaining knowledge about the data become difficult. Integrating and transforming the data into an understandable format requires efficient preprocessing tools and procedures. This paper explores preprocessing methods by applying Attribute Selection in the WEKA tool, which yields a simplified and structured set of information. The proposed attribute selection method removes irrelevant attributes by using the Cfs Subset evaluator with the Greedy search method. Finally, the file size before and after preprocessing, and the number of attributes eliminated by this method, are listed as the performance evaluation.

Since the volume of data is bulky in the case of streaming, resource allocation is a big challenge. In order to minimize this issue, data preprocessing is required, in which the irrelevant attributes and duplicate instances are removed to reduce the file size and memory space.

Preprocessing
Preprocessing is a procedure for cleaning, integrating, and transforming the data before further processing [2]. A preprocessor is a program that turns the raw data into the clean, refined set that is used as input to another program. The raw data contains impurities, missing values, and redundancies that make it unsuitable for direct processing, since they lead to low accuracy and high complexity. Streaming data carries a long list of attributes for the customer's convenience, but not all of them are needed for the analysis, so the preprocessing technique eliminates the extraneous attributes and instances while retaining the set with the most relevant information. The final stage of data stream mining is acquiring knowledge, but given the huge growth of data this can neither be easily obtained nor easily understood unless a preprocessing method is employed.
A. Data Cleaning
Raw data contains errors and outliers; if a record is not pure, the results will be neither reliable nor accurate. This process involves techniques including aggregation, sampling, removing records, filling in missed values, filling in missing attributes, and so on.
Strategies:
Missing Values
The file can have many missing entries, which opens the way to misclassification.
Methods to handle missing values:
1. Ignoring the tuple: This is done when the class label is missing. When a tuple is ignored, the remaining attributes of that tuple are not used either, so this is not effective unless the tuple contains several attributes with missed values.
2. Filling in the missed value manually: The values are entered by hand, which is time consuming.
3. Replacing the value: Replace all missing values with a unique constant. This method is easy, but not perfect.
4. Using the mean or median to fill the value: Pick the "middle" value of a series to fill in the gap. Symmetric distributions call for the mean, while skewed distributions call for the median (a small sketch of this strategy follows the list).
5. Using the most probable value to fill the value: A "Bayesian method or decision tree" induction is used to find the most probable value.
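As an illustration of strategy 4, the following minimal sketch fills the missing entries of one numeric column, represented here as a list of Double values with null marking a missing entry; the class and column values are illustrative, not from the paper's dataset.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of strategy 4: fill missing numeric entries with the mean
// (symmetric distribution) or the median (skewed distribution).
public class MissingValueFiller {

    // Replace each null with the mean of the observed values.
    static List<Double> fillWithMean(List<Double> column) {
        double sum = 0;
        int count = 0;
        for (Double v : column) {
            if (v != null) { sum += v; count++; }
        }
        double mean = sum / count;
        List<Double> filled = new ArrayList<>();
        for (Double v : column) filled.add(v == null ? mean : v);
        return filled;
    }

    // Replace each null with the median of the observed values.
    static List<Double> fillWithMedian(List<Double> column) {
        double[] observed = column.stream()
                .filter(v -> v != null)
                .mapToDouble(Double::doubleValue)
                .sorted()
                .toArray();
        int n = observed.length;
        double median = (n % 2 == 1)
                ? observed[n / 2]
                : (observed[n / 2 - 1] + observed[n / 2]) / 2.0;
        List<Double> filled = new ArrayList<>();
        for (Double v : column) filled.add(v == null ? median : v);
        return filled;
    }

    public static void main(String[] args) {
        List<Double> ages = Arrays.asList(23.0, null, 31.0, 45.0, null, 27.0);
        System.out.println(fillWithMean(ages));   // nulls become 31.5
        System.out.println(fillWithMedian(ages)); // nulls become 29.0
    }
}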
Noisy Data
Noise is a "random error or variance" in a measured variable.
Methods to handle noisy data:
1. Binning: Binning methods smooth a sorted set of data values by consulting the values around each one. The sorted values are distributed into a number of "bins"; in one variation, each value is exchanged for its bin's mean (a sketch of this variation follows the list).
2. Regression: Data smoothing can also be done by regression, which fits the values to a function. Linear regression finds the "best" line through two attributes so that one can be used to foresee the other. Multiple linear regression is a variation in which more than two attributes are selected and the data are fitted to a multi-dimensional surface.
3. Outlier analysis: Outliers are analysed by clustering: similar values are formed into units, and the values that fit none of the units form the outliers. The output of the cleaning process is a structured one.
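A minimal sketch of smoothing by bin means, using an illustrative price column split into three equal-frequency bins (the class name and values are hypothetical):

import java.util.Arrays;

// Sketch of smoothing by bin means: sort the values, split them into
// equal-frequency bins, and replace every value by its bin's mean.
public class BinSmoother {

    static double[] smoothByBinMeans(double[] values, int numBins) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] smoothed = new double[sorted.length];
        int binSize = (int) Math.ceil((double) sorted.length / numBins);
        for (int start = 0; start < sorted.length; start += binSize) {
            int end = Math.min(start + binSize, sorted.length);
            double sum = 0;
            for (int i = start; i < end; i++) sum += sorted[i];
            double mean = sum / (end - start);
            for (int i = start; i < end; i++) smoothed[i] = mean;
        }
        return smoothed;
    }

    public static void main(String[] args) {
        double[] prices = {4, 8, 15, 21, 21, 24, 25, 28, 34};
        System.out.println(Arrays.toString(smoothByBinMeans(prices, 3)));
        // -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
    }
}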
B. Data Integration
The facts from different sources are integrated into one unit. They arrive in different formats from different sites; the sources may be 'files, spread sheets, documents, data cubes, the internet' and so on. Integration is a vital task because data from different sources do not match, and it is really difficult to confirm whether the entities in two different sources carry the same value or not. To reduce such errors, the integration process uses 'metadata'. The most usual issue in a dataset is redundancy: the same fact may be available in different datasets or even in different sources. Without affecting reliability, the integration process tries to reduce redundancy.
Strategies:
1. Manual Integration – The user operates directly on the relevant information.
2. Application Integration – A particular application does all the processing.
3. Middleware Integration – The integration process is transferred to a middleware layer.
4. Virtual Integration – The data resides in the source systems, with a unified view for accessing it.
5. Physical Integration – Creates a new system that keeps a copy of the data from the base systems.

C. Data Selection and Reduction
The relevant data is retrieved for analysis. As the information is collected from various sources, it must be selected based on the requirements: not every field is needed for a particular analysis. So the relevant records, with their attributes, are selected before mining, and all the others are reduced.
Strategies:
Data Selection –
Relevant attributes are selected while the others are eliminated, manually or by procedures. This creates a subset of meaningful attributes, reducing the inputs for processing and analysis, or finding the most meaningful inputs.
Selection methods:
1. Filter method: Features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable.
2. Wrapper method: A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.
3. Embedded method: Embedded methods assess the best contributing features for the accuracy of the model while it is being created.

Data Reduction
Data reduction techniques are utilized to obtain a reduced representation of the data set that is much smaller in dimension but still contains the important information.
Reduction methods:
1. Dimensionality reduction
It lessens the number of variables or attributes on some consideration. It includes:
a. Wavelet transforms: A wavelet transform maps a data vector to a numerically different vector of wavelet coefficients, and the wavelet-transformed vector can be truncated (a sketch follows this list).
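A minimal sketch of the truncation idea, assuming a one-level Haar transform (one common wavelet, used here only for illustration) applied to a toy signal; the signal and threshold are hypothetical.

import java.util.Arrays;

// One-level Haar wavelet transform: the data vector is mapped to
// averages + details, and small detail coefficients are truncated.
public class HaarWavelet {

    // One level of the Haar transform for an even-length vector.
    static double[] haar(double[] x) {
        int half = x.length / 2;
        double[] out = new double[x.length];
        for (int i = 0; i < half; i++) {
            out[i]        = (x[2 * i] + x[2 * i + 1]) / Math.sqrt(2); // average
            out[half + i] = (x[2 * i] - x[2 * i + 1]) / Math.sqrt(2); // detail
        }
        return out;
    }

    // Truncate: zero out coefficients with small absolute value.
    static double[] truncate(double[] coeffs, double threshold) {
        double[] kept = coeffs.clone();
        for (int i = 0; i < kept.length; i++) {
            if (Math.abs(kept[i]) < threshold) kept[i] = 0.0;
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] signal = {2, 2, 4, 4, 6, 8, 9, 9};
        double[] coeffs = haar(signal);
        System.out.println(Arrays.toString(truncate(coeffs, 0.5)));
        // The detail half is mostly near zero, so the truncated vector
        // stores far fewer significant numbers than the original.
    }
}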
Then Information Gain (IG) is calculated as,

IG(X; Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)   --- (3)

Equation (3) shows that the information gained about Y after observing X is equal to the information gained about X after observing Y.
Procedure IG
Step 1: Initialize the feature set to empty.
Step 2: Calculate the Shannon entropy for the class.
Step 3: Calculate the entropy for the attribute values.
Step 4: Measure the conditional probability of each term.
Step 5: Select the term with the highest information gain.
Step 6: Add the term to the feature set.
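A minimal sketch of Steps 2-5 for one nominal attribute: the Shannon entropy of the class and the information gain H(Y) - H(Y|X) of equation (3). The outlook/play values are an illustrative toy dataset, not the paper's data.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InformationGain {

    // Shannon entropy H(Y) in bits for a list of nominal labels.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String y : labels) counts.merge(y, 1, Integer::sum);
        double h = 0;
        for (int c : counts.values()) {
            double p = (double) c / labels.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // IG(X; Y) = H(Y) - H(Y|X), with H(Y|X) = sum_v P(X=v) * H(Y|X=v).
    static double informationGain(List<String> x, List<String> y) {
        Map<String, List<String>> partitions = new HashMap<>();
        for (int i = 0; i < x.size(); i++) {
            partitions.computeIfAbsent(x.get(i), k -> new ArrayList<>()).add(y.get(i));
        }
        double conditional = 0;
        for (List<String> part : partitions.values()) {
            conditional += (double) part.size() / y.size() * entropy(part);
        }
        return entropy(y) - conditional;
    }

    public static void main(String[] args) {
        List<String> outlook = List.of("sunny", "sunny", "rain", "rain");
        List<String> play    = List.of("no",    "no",    "yes",  "yes");
        System.out.println(informationGain(outlook, play)); // prints 1.0
    }
}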
Procedure Relief
Step 1: Initialize the feature set with empty.
Step 2: Randomly select instances of the features.
Step 3: Find the nearest hit and the nearest miss for the randomly selected instances, based on Euclidean distance.
Step 4: Calculate the weight for each feature.
Step 5: Select the features above the threshold value.
Step 6: Add the selected features to the feature set.
Advantages:
More robust with noisy data.
Handles multi-class problems.
Robust with incomplete data.
Disadvantages:
The estimation of features using Euclidean distance, which relies on mean values, suffers a negative effect if the instances of the features are outliers.
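A minimal sketch of these weighting steps for numeric features scaled to [0, 1]: sample an instance, find its nearest hit (same class) and nearest miss (other class) by Euclidean distance, and update each feature weight. The toy data, sample count, and seed are illustrative.

import java.util.Random;

public class Relief {

    static double[] weights(double[][] data, int[] labels, int samples, long seed) {
        int numFeatures = data[0].length;
        double[] w = new double[numFeatures];
        Random rnd = new Random(seed);
        for (int s = 0; s < samples; s++) {
            int i = rnd.nextInt(data.length);
            int hit = nearest(data, labels, i, true);
            int miss = nearest(data, labels, i, false);
            for (int f = 0; f < numFeatures; f++) {
                // Penalise features that differ on the hit,
                // reward features that differ on the miss.
                w[f] -= Math.abs(data[i][f] - data[hit][f]) / samples;
                w[f] += Math.abs(data[i][f] - data[miss][f]) / samples;
            }
        }
        return w;
    }

    // Index of the nearest neighbour of instance i with the same
    // class (sameClass = true) or a different class (false).
    static int nearest(double[][] data, int[] labels, int i, boolean sameClass) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < data.length; j++) {
            if (j == i || (labels[j] == labels[i]) != sameClass) continue;
            double d = 0;
            for (int f = 0; f < data[0].length; f++) {
                d += (data[i][f] - data[j][f]) * (data[i][f] - data[j][f]);
            }
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] data = {{0.1, 0.9}, {0.2, 0.8}, {0.9, 0.1}, {0.8, 0.2}};
        int[] labels = {0, 0, 1, 1};
        double[] w = weights(data, labels, 20, 42);
        System.out.printf("w = [%.2f, %.2f]%n", w[0], w[1]);
        // Both features separate the classes, so both get high weights;
        // a threshold on w then selects the features to keep.
    }
}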
The Cfs Subset evaluator assesses the worth of a subset [9] by considering the unique predictive ability of each attribute along with the degree of redundancy between attributes. It starts with discretization, the process of converting a large number of data values into a smaller one; the rest of the features are ignored. Correlation is a well-known similarity measure for gauging the information between two features: if two features are linearly dependent, their correlation coefficient is ±1, and if they are uncorrelated, the correlation coefficient is 0.
Hypothesis:
"A good feature subset is one that contains features which are highly correlated with the class, but uncorrelated with the rest."
This gives rise to two definitions.
Feature-class correlation – indicates how much a feature is correlated with a specific class.
Feature-feature correlation – indicates the correlation between two features.
Pearson Correlation:
Merit_S = (k * r_cf) / sqrt(k + k(k - 1) * r_ff)   --- (5)

Equation (5) states the merit of a feature subset.
Where,
k – the number of features in the subset.
Merit_S – the merit of a group of attributes S consisting of the k attributes chosen from the full set.
r_cf – the average feature-class correlation, i.e. the average correlation between the chosen attributes and the class.
r_ff – the average feature-feature correlation, i.e. the average inter-correlation among the chosen attributes.
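A worked instance of equation (5), with illustrative numbers, showing how redundancy among the chosen features pulls the merit down:

// Equation (5) with hypothetical values: a subset of k = 3 features,
// average feature-class correlation r_cf = 0.6, and average
// feature-feature correlation r_ff = 0.2.
public class CfsMerit {

    static double merit(int k, double rcf, double rff) {
        return (k * rcf) / Math.sqrt(k + k * (k - 1) * rff);
    }

    public static void main(String[] args) {
        // 3 * 0.6 / sqrt(3 + 3 * 2 * 0.2) = 1.8 / sqrt(4.2) ~= 0.878
        System.out.println(merit(3, 0.6, 0.2));
        // Raising r_ff (more redundancy) lowers the merit:
        // 1.8 / sqrt(3 + 6 * 0.8) = 1.8 / sqrt(7.8) ~= 0.644
        System.out.println(merit(3, 0.6, 0.8));
    }
}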
Proposed search method:
A search method is a stage-by-stage procedure used to locate significant data within a collection, and it is considered a fundamental procedure in computing. Its efficiency is measured by the number of comparisons the search performs in the worst case. The choice of search algorithm depends on the data structure and on prior knowledge about the data.

Greedy forward stepwise search method:
The Greedy approach is based [10] on heuristic problem solving, making the locally optimal choice at each node; by making these local optimal choices, it works toward the optimal solution.
The algorithm can be summarized as:
1. At each stage, pick out the best attribute, usually against the nominal class value, as the test condition.
2. Split the node into the possible outcomes.
3. Repeat the above steps until all the test conditions have been exhausted into leaves.
This is the simplest greedy algorithm: a paradigm of making the locally optimal choice at each stage, with no backtracking. A greedy search may start with no attributes, with all attributes, or at some point in between, searching the attribute subsets for the best features. By traversing the space it produces a ranked list of attributes and records the attributes that are selected. It starts from the empty set, adds the features that satisfy the objective function, and updates the final set.
In general, greedy algorithms have five components (a sketch follows the list):
i. A candidate set, from which a solution is created.
ii. A selection function, which chooses the best candidate to be added to the solution.
iii. A feasibility function, used to determine if a candidate can be used to contribute to a solution.
iv. An objective function, which assigns a value to a solution, or a partial solution.
v. A solution function, which will indicate when we have discovered a complete solution.
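A minimal sketch of a greedy forward stepwise search wired to the merit of equation (5): the candidate set holds the unused features, the selection and objective functions pick the candidate whose addition maximizes the merit, and the solution function stops when no candidate improves it. The correlation values are illustrative, not from a real dataset.

import java.util.ArrayList;
import java.util.List;

public class GreedyForwardSearch {

    // Objective function: the merit of equation (5) for a subset,
    // given precomputed feature-class (rcf) and feature-feature (rff)
    // correlations.
    static double merit(List<Integer> subset, double[] rcf, double[][] rff) {
        int k = subset.size();
        if (k == 0) return 0;
        double sumCf = 0, sumFf = 0;
        for (int f : subset) sumCf += rcf[f];
        for (int a : subset) for (int b : subset) if (a != b) sumFf += rff[a][b];
        double avgCf = sumCf / k;
        double avgFf = (k > 1) ? sumFf / (k * (k - 1)) : 0;
        return (k * avgCf) / Math.sqrt(k + k * (k - 1) * avgFf);
    }

    static List<Integer> search(double[] rcf, double[][] rff) {
        List<Integer> selected = new ArrayList<>(); // starts from the empty set
        double best = 0;
        while (true) {
            int bestFeature = -1;
            for (int f = 0; f < rcf.length; f++) {   // candidate set
                if (selected.contains(f)) continue;
                selected.add(f);
                double m = merit(selected, rcf, rff); // objective function
                selected.remove(selected.size() - 1);
                if (m > best) { best = m; bestFeature = f; }
            }
            if (bestFeature == -1) return selected;   // no improvement: stop
            selected.add(bestFeature);                // commit; never backtracks
        }
    }

    public static void main(String[] args) {
        double[] rcf = {0.7, 0.65, 0.1};              // feature-class
        double[][] rff = {{1.0, 0.9, 0.1},            // feature-feature
                          {0.9, 1.0, 0.1},
                          {0.1, 0.1, 1.0}};
        // Prints [0]: feature 1 is redundant with 0, feature 2 is irrelevant.
        System.out.println(search(rcf, rff));
    }
}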
Procedure Cfs with Greedy
Step 1: Initialize the feature subset with empty.
Step 2: Discretize the training dataset.
Step 3: Calculate the Pearson correlation over the set.
Step 4: Find the feature-class and feature-feature correlations.
Step 5: Select the highly correlated features.
Step 6: Add them to the feature subset.
Step 7: Initialize the candidate set with the x entries obtained from the correlation feature subset.
Step 8: Initialize the optimum solution set to empty.
Step 9: Select an entry from the candidate set using the Greedy strategy.
Step 10: If it is optimal, with high values, move it to the optimum solution set; otherwise remove it.
Step 11: Repeat Step 9 until the candidate set is empty.

Advantages of the Cfs with Greedy procedure
a. Fast and easy to build, being filter based.
b. Greedy takes a decision at every step.
c. Never backtracks.
d. Lessens the storage space.
e. Minimizes the time and complexity of the further mining process by reducing the feature set.
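The combination above corresponds to WEKA's CfsSubsetEval evaluator with the GreedyStepwise search method. A minimal sketch of running it programmatically, assuming the WEKA 3 library on the classpath and a hypothetical customer.arff file whose last attribute is the class:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CfsGreedySelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customer.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        CfsSubsetEval evaluator = new CfsSubsetEval();
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(false); // forward stepwise search

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(evaluator);
        selector.setSearch(search);
        selector.SelectAttributes(data);

        // Indices of the retained attributes (class index included).
        System.out.println(java.util.Arrays.toString(selector.selectedAttributes()));

        // The reduced dataset: irrelevant attributes removed.
        Instances reduced = selector.reduceDimensionality(data);
        System.out.println("Attributes before: " + data.numAttributes()
                + ", after: " + reduced.numAttributes());
    }
}

The same evaluator and search configuration is also available interactively in the WEKA Explorer's Select attributes tab.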
V. EXPERIMENTAL RESULTS