Deep - Visualization in Data Mining

Visualization and Data Mining techniques
By Group 14: Chidroop Madhavarapu (105644921), Deepanshu Sandhuria (105595184)
Data Mining, CSE 634, Prof. Anita Wasilewska
References
http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/10335/ftp:zSzzSzftp.cs.umn.eduzSzdeptzSzuserszSzkumarzSzdatavis.pdf/ganesh96visual.pdf
http://www.ailab.si/blaz/predavanja/ozp/gradivo/2002-Keim-Visualization%20in%20DMIEEE%20Trans%20Vis.pdf
http://www.geocities.com/anand_palm/
http://citeseer.ist.psu.edu/cache/papers/cs/27216/http:zSzzSzwwwusers.cs.umn.eduzSzzCz7EctluzSzPaperTalkFilezSzits02.pdf/shekhar02cubeview.pdf
http://www.cs.umn.edu/Research/shashi-group/
http://www.cs.umn.edu/Research/shashi-group/Book/sdb-chap1.pdf
http://www.cs.umn.edu/research/shashi-group/alan_planb.pdf
http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/27637/http:zSzzSzwwwusers.cs.umn.eduzSzzCz7EpushengzSzpubzSzkdd2001zSzkdd.pdf/shekhar01detecting.pdf
Motivation
Visualization for Data Mining
Huge amounts of information
Limited display capacity of output devices
Visual Data Mining (VDM) is a new approach for exploring very large data sets, combining traditional mining methods and information visualization techniques.
Why Visual Data Mining
VDM Approach
VDM takes advantage of both the power of automatic calculations and the capabilities of human processing.
Human perception offers phenomenal abilities to extract structures from pictures.
Levels of VDM
No or very limited integration
Corresponds to the application of either traditional information visualization or automated data mining methods.
Loose integration
Visualization and automated mining methods are applied sequentially. The result of one step can be used as input for another step.
Full integration
Automated mining and visualization methods are applied in parallel, and their results are combined.
Methods of Data Visualization

Different visualization methods are available depending on the type of data. Data can be:
Univariate
Bivariate
Multivariate
Univariate data
Measurement of a single quantitative variable
Characterizes the distribution
Represented using the following methods:
Histogram
Pie Chart
[Figure: histogram]
[Figure: pie chart]
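As a rough illustration of how a histogram characterizes the distribution of a single quantitative variable, the sketch below sorts values into equal-width bins and prints a text histogram. The function name `histogram_counts`, the sample values, and the number of bins are assumptions made for this example.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are equal (zero range).
    width = (hi - lo) / num_bins or 1.0
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin rather than past it.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample of a single quantitative variable.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
counts = histogram_counts(sample, 4)
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```

A pie chart would use the same counts, shown as shares of the total instead of bar lengths.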
Bivariate Data
Consists of paired samples of two quantitative variables
The variables are related
Represented using the following methods:
Scatter plots
Line graphs
[Figure: scatter plot]
[Figure: line graph]
Multivariate Data
Multidimensional representation of multivariate data
Represented using the following methods:
Icon based methods

Pixel based methods
Dynamic parallel coordinate system
Icon based Methods
Pixel Based Methods
Approach:
Each attribute value is represented by one colored pixel (the value ranges of the attributes are mapped to a fixed color map). The values of each attribute are presented in separate sub windows.
Examples:
Dense Pixel Displays
Dense Pixel Display

Approach:
Each attribute value is represented by one colored pixel (the value ranges of the attributes are mapped to a fixed color map). Different attributes are presented in separate sub windows.
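The pixel-based approach can be sketched as a mapping from attribute values to indices in a fixed color map, with one list of pixels per attribute standing in for a subwindow. This is a minimal sketch under stated assumptions: the function names, the four-entry color map, and the linear scaling of each attribute's value range are illustrative choices, not taken from a specific system.

```python
def to_color_index(value, lo, hi, num_colors):
    """Map a value in [lo, hi] to an index into a fixed color map."""
    if hi == lo:
        return 0
    fraction = (value - lo) / (hi - lo)
    # The maximum value maps to the last color instead of overflowing.
    return min(int(fraction * num_colors), num_colors - 1)


def pixel_subwindows(records, num_colors=4):
    """Return one list of color indices per attribute (one subwindow each)."""
    num_attributes = len(records[0])
    windows = []
    for a in range(num_attributes):
        column = [r[a] for r in records]
        lo, hi = min(column), max(column)
        windows.append([to_color_index(v, lo, hi, num_colors) for v in column])
    return windows


# Hypothetical records with two attributes each.
records = [(0.0, 10), (5.0, 20), (10.0, 30)]
for a, window in enumerate(pixel_subwindows(records)):
    print(f"attribute {a}: {window}")
```

A real dense pixel display would also decide how pixels are arranged inside each subwindow. That step is left out here.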
Visual Data Mining: Framework and Algorithm Development
Ganesh, M., Han, E.H., Kumar, V., Shekhar, S., & Srivastava, J. (1996).
Working Paper. Twin Cities, MN: University of Minnesota, Twin Cities Campus.
Abstract
VDM refers to the use of visualization techniques in the data mining process to:
Evaluate
Monitor
Guide
This paper provides a framework for VDM via the loose coupling of databases and visualization systems.
The paper applies VDM to designing new algorithms that learn decision trees by manually refining some of the decisions made by well-known algorithms such as C4.5.
Components of VQLBCI
The three major components of VQLBCI are Visual Representations, Computations and Events.
Visual Development of Algorithms
The most interesting use of visual data mining is the development of new insights and algorithms.
The figure below shows the ER diagram for learning classification decision trees.
This model allows the user to monitor the quality and impact of decisions made by the learning procedure.
The learning procedure can be refined interactively via a visual interface.
[Figure: ER diagram for the search space of the decision tree learning algorithm]
General Framework
Learning a classification decision tree from a training data set can be regarded as a process of searching for the best decision tree that meets user-provided goal constraints.
The problem space of this search process consists of Model Candidates, a Model Candidate Generator and Model Constraints.
Many existing classification-learning algorithms, such as C4.5 and CDP, fit nicely within this search framework.
New learning algorithms that fit user requirements can be developed by defining the components of the problem space.
General Framework
A Model Candidate corresponds to a partial classification decision tree. Each node of the decision tree is a Model Atom.
The search process is the process of finding a final model candidate that meets the user's goal specifications.
The Model Candidate Generator transforms the current model candidate into a new model candidate by selecting one model atom to expand from the expandable leaf model atoms.
Model Constraints (used by the Model Candidate Generator) provide controls and boundaries for the search space.
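The interplay of model candidate, generator and constraints can be sketched as a small search loop. This is a toy sketch under stated assumptions, not the paper's implementation: the class names, the flat list of atoms, and the callbacks `is_acceptable`, `is_expandable`, `rank` and `expand` are hypothetical stand-ins for the acceptability constraint, the expandability constraint, the traversal strategy and the splitting step.

```python
from dataclasses import dataclass, field


@dataclass
class ModelAtom:
    """A node of the partial decision tree."""
    depth: int
    expanded: bool = False


@dataclass
class ModelCandidate:
    """A partial classification decision tree, stored as a flat list of atoms."""
    atoms: list = field(default_factory=list)

    def leaves(self):
        return [a for a in self.atoms if not a.expanded]


def search(candidate, is_acceptable, is_expandable, rank, expand):
    """Expand one leaf at a time until the acceptability constraint holds."""
    while not is_acceptable(candidate):
        expandable = [a for a in candidate.leaves() if is_expandable(a)]
        if not expandable:
            break  # nothing left to expand
        atom = min(expandable, key=rank)      # traversal strategy picks the atom
        atom.expanded = True
        candidate.atoms.extend(expand(atom))  # generator adds the child atoms
    return candidate


# Toy instantiation (every choice here is an illustrative assumption):
# binary splits, leaves expandable up to depth 2, stop at 7 atoms or more.
result = search(
    ModelCandidate([ModelAtom(depth=0)]),
    is_acceptable=lambda c: len(c.atoms) >= 7,
    is_expandable=lambda a: a.depth < 2,
    rank=lambda a: a.depth,
    expand=lambda a: [ModelAtom(a.depth + 1), ModelAtom(a.depth + 1)],
)
print(len(result.atoms), [a.depth for a in result.atoms])
```

Swapping any one of the four callbacks yields a different learning algorithm within the same loop.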
Search Process
Acceptability Constraint
Model Constraints consist of Acceptability constraints, Expandability constraints and a Data-Entropy calculation function.
An acceptability constraint predicate specifies when a model candidate is acceptable and thus allows the search process to stop. Examples:
A1) Total number of expandable leaf model atoms = 0.
A2) Overall error rate of the model candidate <= acceptable error rate.
A3) Total number of model atoms in the model candidate >= maximal allowable tree size.
A1 is used in C4.5 and CDP
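The three example predicates translate directly into boolean functions over a few summary statistics of a model candidate. In the sketch below, the `CandidateSummary` class and its field names are hypothetical. Combining the predicates with `or` is also an assumption, since an algorithm may use only a subset of them.

```python
from dataclasses import dataclass


@dataclass
class CandidateSummary:
    """Hypothetical summary statistics of a model candidate."""
    expandable_leaves: int
    error_rate: float
    num_atoms: int


def a1(c):
    """A1: no expandable leaf model atoms remain."""
    return c.expandable_leaves == 0


def a2(c, acceptable_error_rate):
    """A2: overall error rate is within the acceptable error rate."""
    return c.error_rate <= acceptable_error_rate


def a3(c, max_tree_size):
    """A3: the candidate has reached the maximal allowable tree size."""
    return c.num_atoms >= max_tree_size


def acceptable(c, acceptable_error_rate, max_tree_size):
    """Stop the search when any predicate holds (assumed combination)."""
    return a1(c) or a2(c, acceptable_error_rate) or a3(c, max_tree_size)


candidate = CandidateSummary(expandable_leaves=3, error_rate=0.12, num_atoms=15)
print(acceptable(candidate, acceptable_error_rate=0.10, max_tree_size=31))
print(acceptable(candidate, acceptable_error_rate=0.15, max_tree_size=31))
```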
Expandability Constraint
An expandability constraint predicate specifies whether a leaf model atom is expandable or not. Examples:
C4.5 uses E1 and E2
CDP uses E2 and E3
Traversal Strategy
The traversal strategy ranks expandable leaf model atoms based on the model atom attributes. Examples:
T1) Increasing order of depth
T2) Decreasing order of depth
T3) Orders based on other model atom attributes
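Each traversal strategy amounts to a different sort key over the expandable leaf model atoms. The sketch below uses hypothetical atoms with a `depth` attribute and an `errors` attribute. The `errors` field is an assumed example of another model atom attribute.

```python
# Hypothetical expandable leaf model atoms.
leaves = [
    {"id": "a", "depth": 3, "errors": 4},
    {"id": "b", "depth": 1, "errors": 9},
    {"id": "c", "depth": 2, "errors": 1},
]


def by_increasing_depth(atoms):
    """Shallow atoms first."""
    return sorted(atoms, key=lambda a: a["depth"])


def by_decreasing_depth(atoms):
    """Deep atoms first."""
    return sorted(atoms, key=lambda a: a["depth"], reverse=True)


def by_attribute(atoms, name):
    """Order by any other atom attribute, largest value first."""
    return sorted(atoms, key=lambda a: a[name], reverse=True)


print([a["id"] for a in by_increasing_depth(leaves)])
print([a["id"] for a in by_decreasing_depth(leaves)])
print([a["id"] for a in by_attribute(leaves, "errors")])
```

Ordering by increasing depth expands the tree level by level. Ordering by decreasing depth follows one branch to the bottom first.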
Steps in Visual Algorithm Development
No single algorithm is best all the time; performance is highly data dependent.
By changing the predicates of the model constraints, users can construct new classification-learning algorithms.
This enables users to find the algorithm that works best on a given data set.
Two algorithms were developed: BF, based on the best-first search idea, and CDP+, a modification of CDP.
BF

This algorithm is based on the Best-First search idea.
For the acceptability criteria, it includes A1 and A2 with a user-specified acceptable error rate.
The traversal strategy chosen is T3.
In Best-First, expandable leaf model atoms are ranked in decreasing order of the number of misclassified training cases (local error rate * size of subset training data set).
The traversal strategy expands the model atom that has the most misclassified training cases, thus reducing the overall error rate the most.
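The Best-First ranking can be sketched in a few lines: each expandable leaf is scored by its local error rate times the size of its training subset, and the highest-scoring leaf is expanded next. The dictionary keys and the sample numbers are assumptions for illustration only.

```python
def misclassified(atom):
    """Estimated number of misclassified training cases at this atom."""
    return atom["local_error_rate"] * atom["subset_size"]


def next_atom_to_expand(expandable_leaves):
    """Best-First choice: the leaf with the most misclassified cases."""
    return max(expandable_leaves, key=misclassified)


# Hypothetical expandable leaf model atoms.
leaves = [
    {"id": "x", "local_error_rate": 0.5, "subset_size": 10},
    {"id": "y", "local_error_rate": 0.1, "subset_size": 200},
    {"id": "z", "local_error_rate": 0.3, "subset_size": 40},
]

ranking = sorted(leaves, key=misclassified, reverse=True)
print([a["id"] for a in ranking])
print(next_atom_to_expand(leaves)["id"])
```

Leaf `x` has the highest local error rate but the fewest misclassified cases, so it is ranked last. This is why the ranking uses counts rather than rates.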
CDP +
CDP+ is a modification of CDP.
CDP has dynamic pruning using expandability constraint E3.
Here, the depth is modified according to the size of the training data set of the model atom.
B is the branching factor of the decision tree, t is the size of the training data set belonging to the model atom, and T is the whole training data set.
Comparison of different classification learning algorithms
Experiment

The new BF and CDP+ algorithms are compared with the C4.5 and CDP algorithms.
Various metrics are selected to compare the efficiency, accuracy and final decision tree size of the classification algorithms.
The generation efficiency is measured in terms of the total number of nodes generated.
To compare the accuracy of the algorithms, the mean classification error on the test data sets has been computed.
[Figure: classification error for 10 data sets]
[Figure: nodes generated for 10 data sets]
[Figure: final decision tree size]
Results/Conclusion

CDP has accuracy comparable to C4.5 while generating considerably fewer nodes.
CDP+ has accuracy comparable to C4.5 while generating considerably fewer nodes.
CDP+ outperformed CDP in error rate and number of nodes generated.
Considering all performance metrics together, CDP+ is the best overall algorithm.
Considering classification accuracy alone, C4.5P is the winner.
Conclusion

Different data sets require different algorithms for best results.
Diverse user requirements put different constraints on the final decision tree.
The experiment shows that the interactive visual data mining framework can help find the most suitable algorithm for a given data set and set of user requirements.
Data Mining for Selective Visualization of Large Spatial Datasets
Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'02), Washington, DC, USA, November 2002.
Shashi Shekhar, Chang-Tien Lu, Pusheng Zhang, Rulin Liu
Computer Science & Engineering Department University of Minnesota

Basic Terminology
Spatial databases: alphanumeric data + geographical coordinates
Spatial mining: mining of spatial databases
Spatial data warehouse: contains geographical data
Spatial outliers: observations that appear to be inconsistent with the remainder of that set of data
Spatial Cluster
Contribution

Propose and implement the CubeView visualization system
General data cube operations
Built on the concept of a spatial data warehouse to support data mining and data visualization
Efficient and scalable spatial outlier detection algorithms
Challenges in spatial data mining
First, classical data mining deals with numbers and categories, whereas spatial data consists of more complex, extended objects such as points, lines and polygons.
Second, classical data mining works with explicit inputs, whereas spatial predicates and attributes are often implicit.
Third, classical data mining treats each input independently of other inputs.
Application Domain
The Traffic Management Center of the Minnesota Department of Transportation (MNDOT) maintains a database that archives sensor network measurements.
The sensor network includes about nine hundred stations, each of which contains one to four loop detectors.
Measurements: volume and occupancy
Volume is the number of vehicles passing through a station in a 5-minute interval
Occupancy is the percentage of time a station is occupied by vehicles
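To make the two measures concrete, the sketch below derives volume and occupancy for one 5-minute interval from a list of vehicle presence intervals. The input format (start and end seconds of each vehicle's presence over a detector) is an assumption for illustration. The actual MNDOT data format is not described here.

```python
INTERVAL_SECONDS = 5 * 60  # one 5-minute measurement interval


def volume_and_occupancy(presence_intervals):
    """Return (volume, occupancy in percent) for one 5-minute interval.

    presence_intervals: list of (start_s, end_s) pairs, one per vehicle,
    giving the seconds during which the vehicle occupied the detector.
    """
    volume = len(presence_intervals)
    occupied_seconds = sum(end - start for start, end in presence_intervals)
    occupancy = 100.0 * occupied_seconds / INTERVAL_SECONDS
    return volume, occupancy


# Hypothetical readings: four vehicles occupying the detector for 30 s in total.
readings = [(0, 3), (10, 12), (100, 110), (200, 215)]
volume, occupancy = volume_and_occupancy(readings)
print(volume, occupancy)
```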
Basic Concepts

Spatial Data Warehouse
Spatial Data Mining
Spatial Outliers Detection
Spatial Data Warehouse

Employs a data cube structure
Outputs: albums of maps
Traffic data warehouse
Measures: volume and occupancy
Dimensions: time and space
Spatial Data Mining

Process of discovering interesting and useful but implicit spatial patterns
Key goal is to partially automate knowledge discovery
Search for nuggets of information embedded in very large quantities of spatial data
Spatial Outliers Detection

Suspiciously deviating observations
Local instability
Each station has:
Spatial attributes: time, space
Non-spatial attributes: volume, occupancy
Basic Structure CubeView
CubeView Visualization System
Each node in the cube is a visualization style
S: traffic volume of each station at all times
T_TD: time of the day
T_DW: day of the week
S T_TD: daily traffic volume of each station
T_TD T_DW S: traffic volume at each station at different times on different days
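Each node of the cube can be read as a group-by over a subset of the dimensions. The sketch below enumerates all subsets of three dimensions (station, time of day, day of week) and sums traffic volume for each. The record layout, the sample values, and the use of a sum as the aggregate are assumptions for illustration. CubeView itself renders each node as a visualization rather than a table.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("station", "time_of_day", "day_of_week")


def roll_up(records, keep):
    """Sum volume over every dimension that is not listed in keep."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[d] for d in keep)
        totals[key] += record["volume"]
    return dict(totals)


def dimension_lattice(records):
    """Compute one aggregate per subset of dimensions (2**3 = 8 nodes)."""
    return {
        keep: roll_up(records, keep)
        for r in range(len(DIMENSIONS) + 1)
        for keep in combinations(DIMENSIONS, r)
    }


# Hypothetical traffic records.
records = [
    {"station": "s1", "time_of_day": 8, "day_of_week": "Mon", "volume": 40},
    {"station": "s1", "time_of_day": 17, "day_of_week": "Mon", "volume": 55},
    {"station": "s2", "time_of_day": 8, "day_of_week": "Mon", "volume": 30},
    {"station": "s2", "time_of_day": 8, "day_of_week": "Tue", "volume": 35},
]

lattice = dimension_lattice(records)
print(lattice[("station",)])
print(lattice[("station", "time_of_day")])
```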
Dimension Lattice
Data Mining Algorithms for Visualization
Problem Definition

Given a spatial graph G = {S, E}
S: the nodes s1, s2, s3, s4, ...
E: edges (neighborhood of stations)
f(x): attribute value for a data record x
N(x): fixed-cardinality set of neighbors of x
$E_{y \in N(x)}[f(y)]$: average attribute value of the neighbors of x
$S(x) = f(x) - E_{y \in N(x)}[f(y)]$: difference between the attribute value of each data object and the average attribute value of its neighbors
Problem Definition cont
S(x): difference between the attribute value of each data object and the average attribute value of its neighbors
Test for detecting an outlier
Confidence level threshold
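One way to write such a test, given here as an assumption rather than as the paper's exact statement, is to standardize S(x) by its mean $\mu_s$ and standard deviation $\sigma_s$ over all locations and compare the result with a threshold $\theta$ chosen from the confidence level:

```latex
S(x) = f(x) - E_{y \in N(x)}\bigl[f(y)\bigr],
\qquad
\left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta
```

If S(x) is taken to be normally distributed, $\theta = 3$ corresponds to a confidence level of roughly 99.7%.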
Few points
First, the neighborhood can be selected based on a fixed cardinality, a fixed graph distance, or a fixed Euclidean distance.
Second, the neighborhood aggregate function can be the mean, the variance, or the auto-correlation.
Third, a location can be compared with its neighbors using either a single number or a vector of attribute values.
Finally, the base distribution of the statistic can be taken to be a normal distribution.
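The first choice, how the neighborhood is selected, can be illustrated with two small functions: one picks a fixed number of nearest neighbors, the other picks all neighbors within a fixed Euclidean distance. The station coordinates and function names are hypothetical, and the graph-distance variant is omitted.

```python
import math


def k_nearest(points, name, k):
    """Fixed-cardinality neighborhood: the k closest other points."""
    origin = points[name]
    others = [n for n in points if n != name]
    others.sort(key=lambda n: math.dist(origin, points[n]))
    return others[:k]


def within_distance(points, name, radius):
    """Fixed Euclidean distance neighborhood: all other points within radius."""
    origin = points[name]
    return [
        n for n in points
        if n != name and math.dist(origin, points[n]) <= radius
    ]


# Hypothetical station coordinates.
stations = {"s1": (0, 0), "s2": (1, 0), "s3": (0, 2), "s4": (5, 5)}
print(k_nearest(stations, "s1", 2))
print(within_distance(stations, "s1", 1.5))
```

A fixed cardinality gives every location the same number of neighbors, while a fixed distance can leave isolated locations with none.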
Algorithms
Test Parameters Computation (TPC) Algorithm
Route Outlier Detection (ROD) Algorithm
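The two names suggest a two-pass scheme: a first pass computes the test parameters (mean and standard deviation of the neighbor difference), and a second pass checks each station on a route against them. The sketch below is a simplified in-memory illustration under that reading, not the authors' algorithms. The function names, the chain-shaped neighborhood, the sample volumes, the use of the population standard deviation, and the threshold of 2 are all assumptions.

```python
import statistics


def neighbor_difference(values, neighbors, x):
    """S(x): value at x minus the average value of its neighbors."""
    average = sum(values[y] for y in neighbors[x]) / len(neighbors[x])
    return values[x] - average


def compute_test_parameters(values, neighbors):
    """First pass: mean and population standard deviation of S(x)."""
    diffs = [neighbor_difference(values, neighbors, x) for x in values]
    return statistics.mean(diffs), statistics.pstdev(diffs)


def detect_route_outliers(values, neighbors, route, mu, sigma, theta):
    """Second pass: flag stations whose standardized S(x) exceeds theta."""
    return [
        x for x in route
        if abs((neighbor_difference(values, neighbors, x) - mu) / sigma) > theta
    ]


# Hypothetical volumes at seven stations along one route.
values = {"s1": 50, "s2": 52, "s3": 51, "s4": 95, "s5": 49, "s6": 50, "s7": 51}
route = list(values)
# Assumed neighborhood: the adjacent stations along the route.
neighbors = {
    x: [route[j] for j in (i - 1, i + 1) if 0 <= j < len(route)]
    for i, x in enumerate(route)
}

mu, sigma = compute_test_parameters(values, neighbors)
print(detect_route_outliers(values, neighbors, route, mu, sigma, theta=2.0))
```

Station `s4` is flagged because its volume is far above that of both neighbors. Stations `s3` and `s5` also deviate because `s4` raises their neighborhood averages, but they stay below the threshold.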
Software
http://www.cs.umn.edu/research/shashigroup/vis/traffic_volumemap2.htm
http://www.cs.umn.edu/research/shashigroup/vis/DataCube.htm
Visualization and Data Mining techniques
Thank you!!!!