
CLUSTER

Q1. What is a cluster?

• A cluster is a subset of objects which are “similar”.
• A subset of objects such that the distance between any two objects in the cluster is
less than the distance between any object in the cluster and any object not located
inside it.
• A connected region of a multidimensional space containing a relatively high density
of objects.
• Clustering is the process of partitioning a set of data (or objects) into a set of
meaningful sub-classes, called clusters.
– Clustering helps users understand the natural grouping or structure in a data set.
Q2. Types of Clusterings

A clustering is a set of clusters.
Important distinction between hierarchical and partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each
data object is in exactly one subset
– Construct various partitions and then evaluate them by some criterion (we will
see an example called BIRCH)
– Nonhierarchical, each instance is placed in exactly one of K non-overlapping
clusters.
– Since only one set of clusters is output, the user normally has to input the
desired number of clusters K.
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
– Create a hierarchical decomposition of the set of objects using some
criterion

Q3. Clustering Algorithms

• Partitional: K-means
• Hierarchical clustering
K-means Clustering (Partitional):
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple:
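In outline, it alternates an assignment step and a centroid-update step until the centroids stop moving. A minimal NumPy sketch (the initialization scheme, iteration cap, and function name are illustrative assumptions, not a fixed specification):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Basic K-means on an (n, d) array X; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    # 1. Select k initial centroids (here: k distinct random data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points
        #    (assumes no cluster ever becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```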
Two different K-means clusterings:

[Figure omitted: the same data set plotted twice in the x-y plane; the left panel shows
the optimal clustering, the right panel a sub-optimal clustering.]

Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
– A tree like diagram that records the sequences of merges or splits

[Figure omitted: six points and their nested clusters, with the corresponding dendrogram;
the vertical axis records the distance at which each merge occurs.]
• Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by ‘cutting’ the
dendrogram at the proper level
• They may correspond to meaningful taxonomies
– Example in biological sciences (e.g., animal kingdom, phylogeny
reconstruction, …)

• Two main types of hierarchical clustering

– Agglomerative:
• Start with the points as individual (singleton) clusters (bottom-up)
• At each step, merge the closest pair of clusters, until only one cluster (or
k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster (top-down)
• At each step, split a cluster, until each cluster contains a single point (or
there are k clusters)
Agglomerative Clustering Algorithm
• The more popular hierarchical clustering technique
• Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

• Key operation is the computation of the proximity of two clusters


– Different approaches to defining the distance between clusters distinguish the
different algorithms
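As a concrete illustration, here is a minimal sketch using SciPy's hierarchical-clustering routines (the random data and the choice of single linkage, i.e. the nearest-neighbor criterion, are assumptions made for the example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((8, 2))              # 8 points in the plane

# Agglomerative clustering: 'single' linkage merges the two clusters
# whose closest members are nearest (the nearest-neighbor criterion).
Z = linkage(X, method='single')     # Z records the sequence of merges

# 'Cut' the dendrogram at the level that yields a desired number of
# clusters, e.g. k = 3.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)                       # cluster id of each of the 8 points
```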
Agglomerative HC Example:
[Figures omitted: a nearest-neighbor (single-link) merge sequence shown level by level;
each level merges one pair of clusters, so k drops from 7 at level 2 to 1 at level 8.]


Q4. BIRCH

Balanced Iterative Reducing and Clustering using Hierarchies
– Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data
structure summarizing cluster info; finds a good clustering with a single scan
– Applies multi-phase clustering; improves quality with a few additional scans
• Most of the existing algorithms DO NOT consider the case that datasets can be
too large to fit in main memory
• They DO NOT concentrate on minimizing the number of scans of the dataset,
even though I/O costs are very high
• The complexity of BIRCH is O(n), where n is the number of objects to be clustered.
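For orientation, scikit-learn ships a BIRCH implementation; a minimal usage sketch (the threshold and branching-factor values below are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = rng.random((1000, 2))           # data that is scanned only once

# threshold        ~ the maximum diameter T of a leaf sub-cluster
# branching_factor ~ the maximum number of CF entries per node
model = Birch(threshold=0.1, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)       # builds the CF tree, then clusters its leaves
```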
Q5. Clustering Feature (CF), CF-Tree & CFT

Clustering Feature (CF)
• Summary of the statistics for a given cluster: the 0th, 1st, and 2nd moments of
the cluster, from the statistical point of view
• Used to compute centroids, and to measure the compactness and distance of
clusters
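Concretely, a CF is the triple (N, LS, SS): the number of points, their linear sum, and the sum of their squared norms. A minimal sketch of how this summary supports the operations above (the class and method names are my own, for illustration):

```python
import numpy as np

class CF:
    """Clustering Feature: (N, LS, SS), the 0th, 1st, and 2nd moments."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), float(p @ p)

    def merge(self, other):
        # CFs are additive, so two clusters can be merged without
        # revisiting any of their points.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of members from the centroid,
        # a standard compactness measure: R^2 = SS/N - ||centroid||^2.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))
```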
CF-Tree
• A height-balanced tree
• Two parameters:
– B: the maximum number of entries in each node (the branching factor)
– T: the maximum diameter of the entries in a leaf node
• Leaf nodes are connected via prev and next pointers

Clustering Feature Tree (CFT):

• A clustering feature tree (CFT) is an alternative representation of a data set:
– Each non-leaf node is a cluster comprising the sub-clusters that correspond
to its entries (at most B entries per non-leaf node)
– Each leaf node is a cluster comprising the sub-clusters that correspond to
its entries (at most L entries per leaf node), and each sub-cluster's diameter
is at most T; when T is larger, the CFT is smaller
– Each node must fit in a memory page
CF Tree Insertion
To insert a point: descend from the root, at each level choosing the child whose centroid
is closest; at the leaf, absorb the point into the closest entry if that entry's diameter stays
at most T, otherwise add a new entry; if the leaf then exceeds L entries, split it and
propagate the split (and the updated CFs) up the tree.

Q6. BIRCH Algorithm

In outline (the standard multi-phase scheme):
Phase 1: scan the data once and build an in-memory CF tree summarizing the data.
Phase 2 (optional): condense the CF tree into a smaller one by rebuilding it with a
larger threshold T.
Phase 3: apply a global clustering algorithm to the leaf entries of the CF tree.
Phase 4 (optional): refine the clusters with additional scans, reassigning each point
to its closest centroid.
Q7. APRIORI

• APRIORI Algorithm
1. k = 1
2. Find the frequent itemset Lk from Ck, the set of all candidate k-itemsets
3. Form Ck+1 from Lk; k = k + 1
4. Repeat steps 2-3 until Ck is empty
• Details about steps 2 and 3
1. Step 2: scan D and count each itemset in Ck; if its support is at least minSup, it is
frequent
2. Step 3: form Ck+1 by joining Lk with itself and pruning every candidate that has
an infrequent k-subset
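A compact sketch of the loop above (assuming transactions are given as Python sets of items and minSup is an absolute count; the names are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return every frequent itemset (as a frozenset) with its support count."""
    candidates = {frozenset([i]) for t in transactions for i in t}  # C1
    frequent, k = {}, 1
    while candidates:
        # Step 2: scan D and count each candidate; keep those with
        # support >= min_sup.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(Lk)
        # Step 3: join Lk with itself to form C(k+1), then prune every
        # candidate that has an infrequent k-subset.
        prev = set(Lk)
        joined = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        candidates = {c for c in joined
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        k += 1
    return frequent

# Example: apriori([{'a','b'}, {'a','c'}, {'a','b','c'}], min_sup=2)
```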
OBJECT-DATABASE SYSTEMS
Object-oriented database systems (pure OO concepts)
• are proposed as an alternative to relational systems, aimed at application
domains where complex objects play a central role. An attempt to add DBMS
functionality to a programming-language environment.
Object-relational database systems
• can be thought of as an attempt to extend relational database systems with the
functionality necessary to support a broader class of applications.
Object-database Systems (ODBMS)
• RDBMS: Relational Database Management Systems
• OODBMS: Object-Oriented Database Management Systems
• ORDBMS: Object-Relational Database Management Systems

First Approach: Object-Oriented Model

• Concepts from OO programming languages
• ODL: Object Definition Language
• OQL: Object Query Language
Second Approach: Object-Relational Model
• Conceptual view
• Data Definition Language (creating types, tables, and relationships)
• Querying an object-relational database

Q7. Persistent Programming Languages



Persistent data
Data that continues to exist even after the program that created it has terminated.
Persistent programming language
A programming language extended with constructs to handle persistent data.
Difference between a PPL and embedded SQL
• No type conversion between language and database formats is needed
• No explicit fetching of data is needed
• The language is extended with constructs to handle persistent data
• The programmer can manipulate persistent data directly
– no need to fetch it into memory and store it back to disk (unlike embedded
SQL)
Approaches to making objects persistent:
• Persistence by class - explicit declaration of persistence for a class
• Persistence by creation - special syntax to create persistent objects
• Persistence by marking - mark objects as persistent after creation
• Persistence by reachability - an object is persistent if it is declared explicitly to
be so or is reachable from a persistent object
Object Identity and Pointers
• Degrees of permanence of object identity:
– Intraprocedure: only during execution of a single procedure
– Intraprogram: only during execution of a single program or query
– Interprogram: across program executions, but not if the data-storage format on
disk changes
– Persistent: interprogram, plus persistent across data reorganizations
Q8. Query Processing

• Two functionality issues:
– user-defined aggregates
– security
• Two efficiency issues:
– method caching
– pointer swizzling
User-defined aggregates
To register an aggregation function, a user must implement three methods:
• initialize: initializes the internal state for the aggregation
• iterate: updates that state for every tuple seen
• terminate: computes the aggregation result based on the final state and then
cleans up
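The same three-phase pattern appears, for example, in Python's sqlite3 driver, where __init__ plays the role of initialize, step of iterate, and finalize of terminate (a minimal sketch; the geometric-mean aggregate is an illustrative choice):

```python
import math
import sqlite3

class GeoMean:
    """User-defined aggregate: initialize / iterate (step) / terminate (finalize)."""
    def __init__(self):                 # initialize the internal state
        self.log_sum, self.count = 0.0, 0

    def step(self, value):              # update the state for every tuple seen
        self.log_sum += math.log(value)
        self.count += 1

    def finalize(self):                 # compute the result from the final state
        return math.exp(self.log_sum / self.count) if self.count else None

conn = sqlite3.connect(":memory:")
conn.create_aggregate("geomean", 1, GeoMean)
conn.execute("CREATE TABLE t(x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(2.0,), (8.0,)])
print(conn.execute("SELECT geomean(x) FROM t").fetchone())  # (4.0,)
```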
Method Security
• A buggy or malicious ADT method can bring down the database server or
corrupt the database.
• One way to prevent problems is to have the user methods be interpreted rather
than compiled.
• Another is to allow user methods to be compiled from a general-purpose
programming language such as C++, but to run those methods in a different
address space than the DBMS.
Method Caching
The idea is to cache the results of methods, in case they are invoked multiple times with
the same argument.
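In Python terms this is ordinary memoization; a sketch (the expensive_method body is a stand-in for a costly ADT method):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def expensive_method(arg):
    # Imagine a costly ADT method body here; the decorator caches the
    # result, so repeated calls with the same argument skip recomputation.
    return sum(i * i for i in range(arg))

expensive_method(10_000)   # computed
expensive_method(10_000)   # served from the cache
```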
Pointer Swizzling
• When an object O is brought into memory, the system checks each oid contained in O
and replaces oids of in-memory objects with in-memory pointers to those objects. This
technique is called pointer swizzling, and it makes references to in-memory objects
very fast.
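A toy sketch of the idea (the object table and field names are illustrative, not a real DBMS API):

```python
# In-memory object table: oid -> already-loaded object
loaded = {}

class Obj:
    def __init__(self, oid, ref_oids):
        self.oid = oid
        self.refs = list(ref_oids)      # on disk, references are stored as oids

def swizzle(obj):
    """Replace oids of in-memory objects with direct pointers to them."""
    obj.refs = [loaded.get(r, r) for r in obj.refs]  # pointer if loaded, else oid

a = Obj(1, []); loaded[1] = a
b = Obj(2, [1, 3])                      # refers to oid 1 (loaded) and oid 3 (not)
swizzle(b)                              # b.refs == [a, 3]
```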
Q9. QUERY OPTIMIZATION
Registering indexes with the optimizer:
• As new index structures are added to a system, either via external interfaces or via
built-in template structures, the optimizer must be informed of their existence and
their costs of access.
• For a given index structure, the optimizer must know:
(1) what WHERE-clause conditions are matched by that index
(2) what the cost of fetching a tuple is for that index
 Given this information, the optimizer can use any index structure in constructing a
query plan.
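As a sketch of those two pieces of information, here is a hypothetical registration API (not any particular DBMS's interface):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class IndexInfo:
    name: str
    matches: Callable[[str], bool]   # (1) does this WHERE-clause condition match?
    fetch_cost: float                # (2) estimated cost of fetching one tuple

registry: list[IndexInfo] = []

def register_index(info: IndexInfo) -> None:
    registry.append(info)            # the optimizer consults this when planning

register_index(IndexInfo(
    name="btree_on_age",
    matches=lambda cond: cond.startswith("age"),  # e.g. "age < 30"
    fetch_cost=1.2,
))
```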
Data Mining
Q10. What is Data Mining?

Data mining is an integral part of knowledge discovery in databases (KDD), which is
the overall process of converting raw data into useful information. This process consists
of a series of transformation steps, from preprocessing to postprocessing of the data
mining results.

Q11. Architecture of Data Mining



Step 1:
Communicates between users and the data mining system. Visualizes results or
performs exploration on the data and schemas.
Step 2: Tests for the interestingness of a pattern.
Step 3:
Performs functionalities like characterization, association, classification,
prediction, etc.
Step 4:
Is responsible for fetching relevant data based on the user's request.
Step 5:
This is usually the source of data. The data may require cleaning and integration.
Step 7:
This is the information about the domain we are mining, like concept hierarchies used
to organize attributes into various levels of abstraction.
Step 8:
Also contains user beliefs, which can be used to assess the interestingness of a pattern,
or thresholds.

Data Mining & Data Warehousing


• Data Warehouse: “is a repository (or archive) of information gathered from multiple
sources, stored under a unified schema, at a single site.” (Silberschatz)
– Collect data → store in a single repository
– Allows for easier query development, as a single repository can be queried.
• Data Mining:
– Analyzing databases or data warehouses to discover patterns in the data,
in order to gain knowledge.
– Knowledge is power.
Q12. Data Mining vs. Data Warehousing

• A major challenge in exploiting data mining is identifying suitable data to mine.
• Data mining requires a single, separate, clean, integrated, and self-consistent
source of data.
• A data warehouse is well equipped for providing data for mining.
• Data quality and consistency are a prerequisite for mining, to ensure the
accuracy of the predictive models. Data warehouses are populated with clean,
consistent data.
• It is advantageous to mine data from multiple sources to discover as many
interrelationships as possible. Data warehouses contain data from a number of
sources.
• Selecting relevant subsets of records and fields for data mining requires the query
capabilities of the data warehouse.
• The results of a data mining study are useful only if there is some way to further
investigate the uncovered patterns. Data warehouses provide the capability to go
back to the data source.
Q13. OLAP vs. Data Mining Tools

OLAP Tools:
• Ad hoc, shrink-wrapped tools that provide an interface to data
• Used when you have specific, known questions
• Look and feel like a spreadsheet that allows rotation, slicing, and graphing
• Can be deployed to a large number of users
Data Mining Tools:
• Methods for analyzing multiple data types: regression trees, neural networks,
genetic algorithms
• Used when you don't know what the questions are
• Usually textual in nature
• Usually deployed to a small number of analysts
Q14. DBMS, OLAP, and Data Mining

Q15. Difference between OLAP & Data Mining

Q16. The Knowledge Discovery Process

The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process
of finding knowledge in data, and emphasizes the "high-level" application of particular data
mining methods. It is of interest to researchers in machine learning, pattern recognition,
databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data
visualization.

Steps of a KDD Process


The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of
the task.
o Using dimensionality reduction or transformation methods to reduce the
effective number of variables under consideration or to find invariant
representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of the
KDD process.
7. Data mining.
o Searching for patterns of interest in a particular representational form or
a set of such representations, such as classification rules or trees, regression
models, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.

Q17. Data Mining: Classification Schemes



• Decisions in data mining
1. Kinds of databases to be mined
2. Kinds of knowledge to be discovered
3. Kinds of techniques utilized
4. Kinds of applications adapted
• Data mining tasks
1. Descriptive data mining
2. Predictive data mining
Data Mining Tasks
Data Mining is generally divided into two tasks.
1. Predictive tasks
2. Descriptive tasks
Predictive Tasks
• Objective: predict the value of a specific attribute (the target/dependent
variable) based on the values of other attributes (the explanatory/independent
variables).
Example: judging whether a patient has a specific disease based on his/her
medical test results (see the sketch below).
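A minimal sketch of such a predictive task using scikit-learn (the toy 'medical test' features and labels are invented purely for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy medical-test results (explanatory variables) and diagnoses (target).
X = [[120, 1], [140, 0], [160, 1], [130, 0]]   # e.g. [blood pressure, marker]
y = [0, 0, 1, 0]                               # 1 = has the disease

model = DecisionTreeClassifier().fit(X, y)     # learn from labeled records
print(model.predict([[155, 1]]))               # predict for a new patient
```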
Descriptive Tasks
• Objective: to derive patterns (correlations, trends, trajectories) that summarize
the underlying relationships in the data.
Example: identifying web pages that are accessed together (a human-interpretable
pattern).

Decisions in Data Mining


• Databases to be mined
– Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multimedia, heterogeneous, legacy, WWW,
etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural networks, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
Q18. Classification Techniques of Data Mining:

1. Decision Tree based Methods
2. Rule-based Methods
3. Neural Networks
4. Naïve Bayes and Bayesian Belief Networks
Neural networks
A neural network is an information-processing system consisting of a graph (the
processing structure) and the various algorithms that access that graph.
A neural network is a structured graph with many nodes (processing elements) and
arcs (interconnections) between them.
A NN is a directed graph with 3 layers: an input layer, a hidden layer, and an
output layer.
The development of NNs from neurobiological generalizations has required
some basic assumptions, which we list below:
1. “Information processing occurs at many simple elements called neurons [also
referred to as units, cells, or nodes].
2. Signals are passed between neurons over connection links.
3. Each connection link has an associated weight, which, in a typical neural net,
multiplies the signal transmitted.
4. Each neuron applies an activation function (usually nonlinear) to its net input
(weighted input signals) to determine its output signal.”
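Assumptions 2-4 amount to the following forward pass; a minimal NumPy sketch of one 3-layer network (the layer sizes and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))           # weights: input layer  -> hidden layer
W2 = rng.normal(size=(3, 2))           # weights: hidden layer -> output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # a typical nonlinear activation function

def forward(x):
    hidden = sigmoid(x @ W1)           # each neuron: activation(weighted inputs)
    return sigmoid(hidden @ W2)        # output signals of the net

print(forward(np.array([1.0, 0.5, -0.2, 0.3])))
```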
The tasks to which artificial neural networks are applied tend to fall within the
following broad categories:
1. Function approximation, or regression analysis, including time series prediction
and modeling.
2. Classification, including pattern and sequence recognition, novelty detection
and sequential decision making.
3. Data processing, including filtering, clustering, blind signal separation and
compression.
