
SEMINAR ON

A SURVEY OF CLASSIFICATION TECHNIQUES ON BIG DATA

Agenda

Introduction

Classification techniques

Comparative study

Conclusion

References

Introduction

Classification is a technique that maps data into predefined classes or groups

Predicts group membership for data instances

Knowledge discovery

Future plan

Predicts categorical class labels

Constructs a model from the training set and the class labels in a classifying attribute, then uses that model to classify new data
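The build-then-classify idea above can be sketched in a few lines of Python. This is only an illustration, not a method taken from the seminar: it assumes a nearest-centroid model, and the function names (`fit_centroids`, `predict`) and the toy height/weight data are hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(X, y):
    """Build the model: one mean vector (centroid) per class label."""
    groups = defaultdict(list)
    for features, label in zip(X, y):
        groups[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in groups.items()
    }


def predict(model, features):
    """Classify a new instance with the label of the closest centroid."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical training set: (height_cm, weight_kg) with a class label each.
X_train = [(150, 50), (155, 55), (180, 85), (185, 90)]
y_train = ["small", "small", "large", "large"]

model = fit_centroids(X_train, y_train)
print(predict(model, (152, 52)))
print(predict(model, (178, 80)))
```

The training step is the only place where class labels are used; prediction needs nothing but the stored model and the new instance.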

Introduction (Cont.)

Classification

Supervised: human-guided classification; predictive or directed; example: decision tree

Unsupervised: calculated by software; descriptive or undirected; example: K-means
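To make the unsupervised branch concrete, here is a minimal K-means sketch in plain Python. It groups points without any class labels. The choice of the first k points as starting centroids, the fixed iteration count, and the sample points are simplifying assumptions made for this illustration.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Plain K-means; centroids start at the first k points (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        centroids = updated
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No labels are supplied anywhere, which is the difference from the supervised techniques covered in the rest of the seminar.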

Classification Techniques

Supervised classification techniques:

Decision Tree

Support Vector Machine

Nearest Neighbor

Naïve Bayes

Decision Tree

It is a flowchart-like tree structure that classifies instances by sorting them according to their attribute values

It generates the rules for classifying the dataset

 

Algorithms

Iterative Dichotomiser 3 (ID3): measures entropy and information gain; top-down procedure

C4.5: pruning through a single-pass algorithm

Classification and Regression Tree (CART): Gini diversity index; constructs a binary decision tree; post-pruning based on cost
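The split measures named above (entropy and information gain, and the Gini diversity index) can be computed as in the following sketch. The function names and the tiny weather-style dataset are hypothetical and chosen only to make the numbers easy to check by hand.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini diversity index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Parent entropy minus the weighted entropy of the child partitions."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical dataset: "outlook" determines the class, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["yes", "yes", "no", "no"]

print(information_gain(rows, labels, "outlook"))
print(information_gain(rows, labels, "windy"))
print(gini(labels))
```

A top-down tree builder would pick the attribute with the highest gain at each node, which here is "outlook".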

Bayesian Network

It is a graphical model of the probability relationships among a set of variables (features)

Shows high accuracy and speed

Probabilistic learning

Probabilistic prediction

Incremental

Standards
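The probabilistic and incremental properties listed above can be illustrated with naïve Bayes, the simplest Bayesian network structure, in which every feature is assumed to depend only on the class. The sketch below is an assumption-laden illustration: the class name `NaiveBayes`, the add-one smoothing, and the toy data are not from the seminar.

```python
from collections import Counter, defaultdict
from math import log


class NaiveBayes:
    """Categorical naive Bayes that learns one example at a time."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (label, position) -> value counts
        self.values = defaultdict(set)              # position -> distinct values seen

    def update(self, features, label):
        """Incremental learning: a new example only adjusts the counts."""
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[(label, i)][value] += 1
            self.values[i].add(value)

    def log_scores(self, features):
        """Log prior plus summed log likelihoods for each class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = log(count / total)
            for i, value in enumerate(features):
                # Add-one (Laplace) smoothing avoids zero probabilities.
                numerator = self.feature_counts[(label, i)][value] + 1
                denominator = count + len(self.values[i])
                score += log(numerator / denominator)
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


# Hypothetical training data: (outlook, windy) -> decision.
nb = NaiveBayes()
for features, label in [
    (("sunny", "no"), "play"),
    (("sunny", "yes"), "play"),
    (("overcast", "no"), "play"),
    (("rainy", "yes"), "stay"),
    (("rainy", "no"), "stay"),
]:
    nb.update(features, label)

print(nb.predict(("sunny", "no")))
print(nb.predict(("rainy", "yes")))
```

Because training only increments counters, further examples can be folded in later with `update` without retraining from scratch.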

Nearest Neighbor

Heuristic techniques are used to select a good value of ‘k’

It has some strong consistency results

Instance-based classifiers work by storing training records and using them to predict the class label of unseen cases
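A minimal k-nearest-neighbor sketch shows the instance-based idea: nothing is learned up front, the stored records are simply searched at prediction time. Euclidean distance, majority voting, the default k = 3, and the sample records are assumptions made for this example.

```python
from collections import Counter
from math import dist


def knn_predict(training, query, k=3):
    """training is a list of (features, label) pairs kept as-is in memory."""
    # Rank every stored record by its distance to the query instance.
    neighbors = sorted(training, key=lambda record: dist(record[0], query))[:k]
    # Majority vote among the k closest records decides the class label.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical stored training records.
training = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b"),
]

print(knn_predict(training, (2, 2)))
print(knn_predict(training, (7, 7)))
```

Every prediction scans the whole training set, which is where the storage and computation costs of this technique come from.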

Support Vector Machine (SVM)

It trains a classifier to predict the class of a new sample

Key implementation

Mathematical programming

Kernel function
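As an illustration of the kernel-function ingredient, the sketch below defines three commonly used kernels and builds the Gram matrix that an SVM solver would consume. The parameter defaults (`degree`, `coef0`, `gamma`) and the sample points are arbitrary assumptions, and no solver is included.

```python
from math import exp


def linear_kernel(x, z):
    """Ordinary dot product."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """(x . z + coef0) ** degree"""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * squared Euclidean distance)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-gamma * squared_distance)


def gram_matrix(points, kernel):
    """Pairwise kernel values; this matrix feeds the optimization problem."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(1.0, 2.0), (3.0, 4.0), (0.0, 1.0)]
print(linear_kernel(points[0], points[1]))
print(polynomial_kernel(points[0], points[1]))
print(gram_matrix(points, rbf_kernel))
```

Swapping the kernel changes the implicit feature space while the rest of the training procedure stays the same.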

Support Vector Machine (SVM) (Cont.)

 

Algorithms

Linear: used when the data is linearly separable

Non-linear: used when the data is non-separable or noisy data is available

Not suitable for C class hypothesis
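The difference between the linear and non-linear cases can be checked on two tiny datasets. In this sketch the weights are hand-derived for the toy data rather than produced by a solver, and the feature map `feature_map` is a hypothetical stand-in for what a kernel does implicitly.

```python
def decision(w, b, x):
    """Signed score of x; its sign is the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def hinge_loss(w, b, X, y):
    """Average hinge loss; zero means every point is on or outside the margin."""
    return sum(max(0.0, 1.0 - yi * decision(w, b, xi)) for xi, yi in zip(X, y)) / len(X)


# Linear case: two separable points. The separator below is the
# maximum-margin one for this toy data, derived by hand.
X_lin = [(1.0, 1.0), (3.0, 3.0)]
y_lin = [-1, 1]
w_lin, b_lin = (0.5, 0.5), -2.0
print(hinge_loss(w_lin, b_lin, X_lin, y_lin))

# Non-linear case: XOR-like labels cannot be split by a line in the
# original two dimensions.
X_xor = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
y_xor = [1, -1, -1, 1]


def feature_map(x):
    """Hypothetical explicit mapping that adds the product x1 * x2."""
    return (x[0], x[1], x[0] * x[1])


# In the mapped space a plane separates the classes with zero hinge loss.
w_xor, b_xor = (0.0, 0.0, 1.0), 0.0
X_mapped = [feature_map(x) for x in X_xor]
print(hinge_loss(w_xor, b_xor, X_mapped, y_xor))
```

Both losses come out as zero, but the second only after the data has been lifted into a space where a linear separator exists.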

Comparative Study

Technique: Decision tree
Advantages: Simple to understand and interpret; requires little data preparation
Disadvantages: Locally optimal decisions are made at each node; does not generalize well from the training data

Technique: Bayesian network
Advantages: Able to handle noisy data; well suited for continuous features
Disadvantages: Training time will be large; poor interoperability; requires parameters

Technique: K-nearest neighbor
Advantages: Easy to understand and implement
Disadvantages: Computational costs are high; very sensitive to the local structure of the data; requires large storage

Technique: Support vector machine
Advantages: Finds the best classification function for the training data; less prone to overfitting than other methods
Disadvantages: Computationally expensive; requires a lot of time and storage; poor interpretability of results

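Comparative Study (Cont.)

Algorithm: Trees
Predictive accuracy: Low; Fitting speed: Fast; Prediction speed: Fast; Memory usage: Low; Easy to interpret: Yes; Handles categorical predictors: Yes

Algorithm: SVM
Predictive accuracy: High; Fitting speed: Medium; Prediction speed: *; Memory usage: *; Easy to interpret: *; Handles categorical predictors: No

Algorithm: Naïve Bayes
Predictive accuracy: Low; Fitting speed: **; Prediction speed: **; Memory usage: **; Easy to interpret: Yes; Handles categorical predictors: Yes

Algorithm: Nearest Neighbor
Predictive accuracy: ***; Fitting speed: Fast ***; Prediction speed: Medium; Memory usage: High; Easy to interpret: No; Handles categorical predictors: Yes ***

* SVM prediction speed and memory usage are good

** Naïve Bayes speed and memory usage are good

*** Nearest neighbor usually has good predictions in low dimensions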

Conclusion

Here I have discussed various classification techniques: decision tree, Bayesian network, nearest neighbor and support vector machine

Decision trees and SVMs have complementary operational profiles: where one is accurate, the other tends not to be, and vice versa

References

1. Seema Sharma, Jitendra Agrawal, Shikha Agarwal, Sanjeev Sharma, “Machine Learning Techniques for Data Mining: A Survey”, 979-1-4799-1597-2/13, IEEE, 2013

2. Krisztian Balog, Heri Ramampiaro, “Cumulative Citation Recommendation: Classification vs Ranking”, ACM 978-1-4503-2034-4/13/07, 2013

3. Mohd Fauzi bin Othman, Thomas Moh Shan Yau, “Comparison of Different Classification Techniques Using WEKA for Breast Cancer”, pp. 520-523, Springer-Verlag Berlin Heidelberg, 2007

4. Francesco Ricci, Lior Rokach, Bracha Shapira, Paul B. Kantor, “Recommender Systems Handbook”, ISBN 978-0-387-85819-7, Springer Science + Business Media, LLC, 2011

15-Oct-14