
SKP Engineering College


Tiruvannamalai – 606611

A Course Material
on
Data Warehousing and Data Mining

By

K.Vijayakumar
Assistant Professor
Computer Science and Engineering Department


Quality Certificate

This is to certify that the electronic study material

Subject Code : IT6702

Subject Name : Data Warehousing and Data Mining

Year/Sem: III/VI

has been prepared by me and meets the knowledge requirements of the University curriculum.

Signature of the Author

Name: K.Vijayakumar

Designation: Assistant Professor

This is to certify that the course material prepared by Mr. K. Vijayakumar is of adequate
quality. He has referred to more than five books, one of which is by a foreign author.

Signature of HD Signature of the Principal

Name: Name: Dr.V.Subramania Bharathi

Seal: Seal:


IT6702 DATA WAREHOUSING AND DATA MINING L T P C 3 0 0 3

OBJECTIVES: The student should be made to:

• Be familiar with the concepts of data warehouse and data mining.

• Be acquainted with the tools and techniques used for Knowledge Discovery in
Databases.

UNIT I DATA WAREHOUSING 9

Data Warehousing Components – Building a Data Warehouse – Mapping the Data
Warehouse to a Multiprocessor Architecture – DBMS Schemas for Decision Support – Data
Extraction, Cleanup, and Transformation Tools – Metadata.

UNIT II BUSINESS ANALYSIS 9

Reporting and Query tools and Applications – Tool Categories – The Need for Applications
– Cognos Impromptu – Online Analytical Processing (OLAP) – Need – Multidimensional
Data Model – OLAP Guidelines – Multidimensional versus Multirelational OLAP –
Categories of Tools – OLAP Tools and the Internet.

UNIT III DATA MINING 9

Introduction – Data – Types of Data – Data Mining Functionalities – Interestingness of


Patterns – Classification of Data Mining Systems – Data Mining Task Primitives –
Integration of a Data Mining System with a Data Warehouse – Issues –Data Preprocessing.

UNIT IV ASSOCIATION RULE MINING AND CLASSIFICATION 9

Mining Frequent Patterns, Associations and Correlations – Mining Methods – Mining


various Kinds of Association Rules – Correlation Analysis – Constraint Based Association
Mining – Classification and Prediction - Basic Concepts - Decision Tree Induction -
Bayesian Classification – Rule Based Classification – Classification by Back propagation –
Support Vector Machines – Associative Classification – Lazy Learners – Other
Classification Methods – Prediction.

UNIT V CLUSTERING AND TRENDS IN DATA MINING 9

Cluster Analysis - Types of Data – Categorization of Major Clustering Methods – K-means–


Partitioning Methods – Hierarchical Methods - Density-Based Methods –Grid Based
Methods – Model-Based Clustering Methods – Clustering High Dimensional Data -
Constraint – Based Cluster Analysis – Outlier Analysis – Data Mining Applications.

OUTCOMES: After completing this course, the student will be able to:

• Apply data mining techniques and methods to large data sets.
• Use data mining tools.
• Compare and contrast the various classifiers.


TEXT BOOKS:

1. Alex Berson and Stephen J.Smith, “Data Warehousing, Data Mining and OLAP”, Tata
McGraw – Hill Edition, Thirteenth Reprint 2008.
2. Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques”, Third
Edition, Elsevier, 2012.

REFERENCES:

1. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, “Introduction to Data Mining”,
Pearson Education, 2007.
2. K.P. Soman, Shyam Diwakar and V. Ajay, “Insight into Data Mining Theory and Practice”,
Eastern Economy Edition, Prentice Hall of India, 2006.
3. G. K. Gupta, “Introduction to Data Mining with Case Studies”, Eastern Economy Edition,
Prentice Hall of India, 2006.
4. Daniel T.Larose, “Data Mining Methods and Models”, Wiley-Interscience, 2006.


CONTENTS

S.No Particulars

1 Unit – I

2 Unit – II

3 Unit – III

4 Unit – IV

5 Unit – V


Unit – I

Part – A

1.Define data warehouse? [CO1-L1]


A data warehouse is a repository of multiple heterogeneous data sources, organized under
a unified schema at a single site to facilitate management decision making. (Or) A data
warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data
in support of management's decision-making process.
2.What are operational databases? [CO1-L2]
The large databases that organizations maintain and update through daily transactions are
called operational databases.
3.Define OLTP? [CO1-L1]
If an online operational database system is used for efficient retrieval, efficient storage
and management of large amounts of data, then the system is said to perform online
transaction processing (OLTP).
4.Define OLAP? [CO1-L2]
Data warehouse systems serve users or knowledge workers in the role of data analysis
and decision making. Such systems can organize and present data in various formats.
These systems are known as online analytical processing (OLAP) systems.
5.How a database design is represented in OLTP systems? [CO1-L1]
Entity-relationship (ER) model
6. How a database design is represented in OLAP systems? [CO1-L1]
• Star schema • Snowflake schema • Fact constellation schema
7.Write short notes on multidimensional data model? [CO1-L2]
Data warehouses and OLAP tools are based on a multidimensional data model. This model
is used for the design of corporate data warehouses and departmental data marts. It
contains a star schema, snowflake schema or fact constellation schema. The core of the
multidimensional model is the data cube.
8.Define data cube? [CO1-L1]
It consists of a large set of facts (or) measures and a number of dimensions.

9.What are facts? [CO1-L1]

Facts are numerical measures. Facts can also be considered as the quantities by which we
can analyze the relationships between dimensions.


10.What are dimensions? [CO1-L2]

Dimensions are the entities or perspectives with respect to which an organization wants to
keep records; they are hierarchical in nature.

11.Define dimension table? [CO1-L1]

A dimension table is used for describing a dimension. (e.g.) A dimension table for item
may contain the attributes item_name, brand and type.

12.Define fact table? [CO1-L1]

A fact table contains the names of the facts (measures) as well as keys to each of the
related dimension tables.

13.What are lattice of cuboids? [CO1-L1]

In the data warehousing research literature, a cube is also called a cuboid. For different
subsets of dimensions, we can construct a lattice of cuboids, each showing the data at a
different level of summarization. The lattice of cuboids is also referred to as a data cube.
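As a worked example (a standard result from the Han and Kamber textbook, added here for reference, not part of the answer above): with n dimensions and no concept hierarchies the lattice has 2^n cuboids; if dimension i has L_i hierarchy levels, the total number of cuboids is

    T = \prod_{i=1}^{n} (L_i + 1)

so a cube with three dimensions having 3, 2 and 4 levels yields 4 × 3 × 5 = 60 cuboids.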

14.What is apex cuboid? [CO1-L1]

The 0-D cuboid which holds the highest level of summarization is called the apex cuboid.
The apex cuboid is typically denoted by all.

15.List out the components of star schema? [CO1-L1]

• A large central table (fact table) containing the bulk of data with no redundancy.
• A set of smaller attendant tables (dimension tables), one for each dimension.

16.What is snowflake schema? [CO1-L2]


The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized thereby further splitting the tables in to additional tables.
17.List out the components of fact constellation schema? [CO1-L1]
A fact constellation schema requires multiple fact tables to share dimension tables. This
kind of schema can be viewed as a collection of stars and is hence also known as a galaxy
schema (or) fact constellation schema.

18.Point out the major difference between the star schema and the snowflake
schema? [CO1-L2]

The dimension table of the snowflake schema model may be kept in normalized form to
reduce redundancies. Such a table is easy to maintain and saves storage space.


19.Which is popular in the data warehouse design, star schema model (or) snowflake
schema model? [CO1-L2]

The star schema model, because the snowflake structure can reduce query effectiveness,
since more joins are needed to execute a query.

20.Define concept hierarchy? [CO1-L1]

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level concepts, for example street < city < state < country.

PART – B

1). What are the Data Warehouse architecture components? [CO1-H2]

1. Data sourcing, cleanup, transformation, and migration tools


2. Metadata repository
3. Warehouse/database technology
4. Data marts
5. Data query, reporting, analysis, and mining tools
6. Data warehouse administration and management
7. Information delivery system

1.1 Architecture

• A data warehouse is an environment, not a product. It is based on a relational database
management system that functions as the central repository for informational data.
• The central information repository is surrounded by a number of key components
designed to make the environment functional, manageable and accessible.
• The data sources for the data warehouse are operational applications. The data
entered into the data warehouse is transformed into an integrated structure and format.
• The transformation process involves conversion, summarization, filtering and
condensation.

• The data warehouse must be capable of holding and managing large volumes of data,
as well as data of different structures, over time.

Seven major components:

1.2. Data warehouse database

This is the central part of the data warehousing environment (item number 2 in the
architecture diagram). It is implemented using RDBMS technology.

1.3. Sourcing, Acquisition, Clean up, and Transformation Tools

This is item number 1 in the architecture diagram. These tools perform conversions,
summarization, key changes, structural changes and condensation. The data
transformation is required so that the information can be used by decision support tools.
The transformation produces programs, control statements, JCL code, COBOL code, UNIX
scripts, SQL DDL code, etc., to move the data into the data warehouse from multiple
operational systems.

The functionalities of these tools are listed below (a small illustrative sketch follows this list):

• To remove unwanted data from operational databases
• Converting to common data names and attributes
• Calculating summaries and derived data
• Establishing defaults for missing data
• Accommodating source data definition changes
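A minimal Python sketch of this cleanup/transformation step, assuming a hypothetical CSV extract with fields CUST_NO, REGION and SALE_AMT (the field names, rules and file name are illustrative, not from a specific vendor tool):

    import csv
    from collections import defaultdict

    DEFAULT_REGION = "UNKNOWN"                        # default for missing data

    def clean_record(rec):
        """Convert to common names/attributes and drop unwanted operational fields."""
        return {
            "customer_id": rec["CUST_NO"].strip(),    # common data names
            "region": rec.get("REGION", "").strip() or DEFAULT_REGION,
            "amount": float(rec["SALE_AMT"]),         # data type conversion
        }

    def load_and_summarize(path):
        """Read an operational extract, clean each record and compute a derived summary."""
        totals = defaultdict(float)                   # summaries / derived data
        with open(path, newline="") as f:
            for rec in csv.DictReader(f):
                row = clean_record(rec)
                totals[row["region"]] += row["amount"]
        return dict(totals)

    # Example call: load_and_summarize("daily_sales_extract.csv")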

Issues to be considered during data sourcing, cleanup, extraction and transformation:

Database heterogeneity: refers to differences among DBMSs, such as different data
models, access languages, data navigation methods, operations, concurrency, integrity and
recovery processes, etc.

Data heterogeneity: refers to the different ways the data is defined and used in different
models. Example vendors: Prism Solutions, Evolutionary Technology Inc., Vality, Praxis and Carleton.

1.4 Meta data

Metadata is data about data. It is used for maintaining, managing and using the data
warehouse. It is classified into two categories:

Technical Meta data:

It contains information about data warehouse data used by the warehouse designer and
administrator to carry out development and management tasks. It includes:
• Information about data stores
• Transformation descriptions, that is, the mapping methods from the operational database
to the warehouse database
• Warehouse object and data structure definitions for target data
• The rules used to perform cleanup and data enhancement


• Data mapping operations
• Access authorization, backup history, archive history, information delivery history, data
acquisition history, data access, etc.

Business Meta data:

It contains information that gives users a business-oriented view of the information stored in
the data warehouse. It includes:
• Subject areas and information object types, including queries, reports, images, video and
audio clips, etc.
• Internet home pages
• Information related to the information delivery system
• Data warehouse operational information such as ownerships, audit trails, etc.

Metadata helps users to understand the content and find the data. Metadata is stored in a
separate data store known as the information directory or metadata repository, which helps
to integrate, maintain and view the contents of the data warehouse.

The following lists the characteristics of the information directory / metadata:

• It is the gateway to the data warehouse environment
• It supports easy distribution and replication of content for high performance and
availability
• It should be searchable by business-oriented keywords
• It should act as a launch platform for end users to access data and analysis tools
• It should support the sharing of information
• It should support scheduling options for requests
• It should support and provide interfaces to other applications
• It should support end-user monitoring of the status of the data warehouse
environment

1.5 Access tools

The purpose of access tools is to provide information to business users for decision making.
There are five categories:
• Data query and reporting tools
• Application development tools
• Executive information system (EIS) tools
• OLAP tools
• Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of
reporting tools:
• Production reporting tools, used to generate regular operational reports
• Desktop report writers, which are inexpensive desktop tools designed for end users

Managed query tools: used to generate SQL queries. They use a meta layer of software
between users and databases which offers point-and-click creation of SQL statements. This
tool is a preferred choice of users for performing segment identification, demographic
analysis, territory management, preparation of customer mailing lists, etc.


Application development tools: These provide a graphical data access environment which
integrates OLAP tools with the data warehouse and can be used to access all database
systems.

OLAP tools: are used to analyze the data in multidimensional and complex views. To
enable multidimensional properties they use an MDDB (multidimensional database) or an
MRDB (multirelational database).

Data mining tools: are used to discover knowledge from data warehouse data; they can also
be used for data visualization and data correction purposes.

1.6 Data marts

Data marts are departmental subsets that focus on selected subjects. They are independent
and are used by a dedicated user group. They are used for rapid delivery of enhanced
decision support functionality to end users. A data mart is used in the following situations:
• Extremely urgent user requirements
• The absence of a budget for a full-scale data warehouse strategy
• The decentralization of business needs
• The attraction of easy-to-use tools and a mind-sized project

Data marts present two problems:

1. Scalability: A small data mart can grow quickly in multiple dimensions, so while
designing it, the organization has to pay more attention to system scalability,
consistency and manageability issues.
2. Data integration

1.7 Data warehouse admin and management

The management of a data warehouse includes:

• Security and priority management
• Monitoring updates from multiple sources
• Data quality checks
• Managing and updating metadata
• Auditing and reporting data warehouse usage and status
• Purging data
• Replicating, subsetting and distributing data
• Backup and recovery
• Data warehouse storage management, which includes capacity planning, hierarchical
storage management, purging of aged data, etc.

1.8 Information delivery system

• It is used to enable the process of subscribing to data warehouse information.

• It delivers information to one or more destinations according to a specified scheduling
algorithm.


2). How can you build a data warehouse? [CO1-H2]

There are two factors that drive you to build and use a data warehouse:

Business factors:
• Business users want to make decisions quickly and correctly using all available data.
Technological factors:
• To address the incompatibility of operational data stores
• IT infrastructure is changing rapidly: its capacity is increasing and its cost is decreasing,
so building a data warehouse is easy.

There are several things to be considered while building a successful data warehouse.

2.1 Business considerations:

Organizations interested in the development of a data warehouse can choose one of the
following two approaches:

• Top-Down Approach (suggested by Bill Inmon)

• Bottom-Up Approach (suggested by Ralph Kimball)

Top - Down Approach

In the top down approach suggested by Bill Inmon, we build a centralized storage area to
house corporate wide business data. This repository (storage area) is called Enterprise
Data Warehouse (EDW). The data in the EDW is stored in a normalized form in order to
avoid redundancy.

The central repository for corporate wide data helps us maintain one version of truth of the
data.
The data in the EDW is stored at the most detailed level. The reason to build the EDW at the
most detailed level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to provide for future requirements.
The disadvantages of storing data at the detailed level are
1. The complexity of design increases with increasing level of detail.
2. It takes a large amount of space to store data at the detailed level, hence increased cost.
Implement the top-down approach when
1. The business has complete clarity on all or multiple subject areas' data warehouse
requirements.
2. The business is ready to invest considerable time and money.
The advantage of using the Top Down approach is that we build a centralized repository to
provide for one version of truth for business data. This is very important for the data to be
reliable, consistent across subject areas and for reconciliation in case of data related
contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented, followed by building
the data marts, before it can access its reports.


Bottom Up Approach

The bottom up approach suggested by Ralph Kimball is an incremental approach to build a


data warehouse. Here we build the data marts separately at different points of time as and
when the specific subject area requirements are clear. The data marts are integrated or
combined together to form a data warehouse. Separate data marts are combined through
the use of conformed dimensions and conformed facts. A conformed dimension and a
conformed fact is one that can be shared across data marts.

A conformed dimension has consistent dimension keys, consistent attribute names and
consistent values across separate data marts. A conformed dimension means exactly the
same thing with every fact table to which it is joined.

A Conformed fact has the same definition of measures, same dimensions joined to it and at
the same granularity across data marts.

The bottom up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when the requirements are clear. We do not have to wait to
know the overall requirements of the warehouse. We should implement the bottom up
approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity on only one
data mart.
The advantage of using the Bottom Up approach is that it does not require high initial costs
and has a faster implementation time; hence the business can start using the marts much
earlier than with the top-down approach.

The disadvantage of using the Bottom Up approach is that it stores data in a denormalized
format, hence there would be high space usage for detailed data. We have a tendency of not
keeping detailed data in this approach, hence losing out on the advantage of having detailed
data, i.e. the flexibility to easily cater to future requirements.

2.2 Design considerations

To be successful, a data warehouse designer must adopt a holistic approach, that is,
consider all data warehouse components as parts of a single complex system, and take
into account all possible data sources and all known usage requirements.

Most successful data warehouses that meet these requirements have these common
characteristics:
• Are based on a dimensional model
• Contain historical and current data
• Include both detailed and summarized data
• Consolidate disparate data from multiple sources while retaining consistency

A data warehouse is difficult to build due to the following reasons:

• Heterogeneity of data sources
• Use of historical data
• The growing nature of the database


The data warehouse design approach must be a business-driven, continuous and iterative
engineering approach. In addition to the general considerations, there are the following
specific points relevant to data warehouse design:

Data content

The content and structure of the data warehouse are reflected in its data model. The data
model is the template that describes how information will be organized within the integrated
warehouse framework. The data warehouse data must be detailed data. It must be
formatted, cleaned up and transformed to fit the warehouse data model.

Meta data

It defines the location and contents of data in the warehouse. Meta data is searchable by
users to find definitions or subject areas. In other words, it must provide decision support
oriented pointers to warehouse data and thus provides a logical link between warehouse
data and decision support applications.

Data distribution

One of the biggest challenges when designing a data warehouse is the data placement and
distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary to
decide how the data should be divided across multiple servers and which users should get
access to which types of data. The data can be distributed based on the subject area,
location (geographical region), or time (current, month, year).

Tools

A number of tools are available that are specifically designed to help in the implementation
of the data warehouse. All selected tools must be compatible with the given data
warehouse environment and with each other. All tools must be able to use a common Meta
data repository.

Design steps

The following nine-step method is followed in the design of a data warehouse:

1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models


2.3 Technical considerations

A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
• The hardware platform that would house the data warehouse
• The DBMS that supports the warehouse database
• The communication infrastructure that connects the data marts, operational systems and
end users
• The hardware and software to support the metadata repository
• The systems management framework that enables administration of the entire environment

2.4 Implementation considerations

The following logical steps are needed to implement a data warehouse:

• Collect and analyze business requirements
• Create a data model and a physical design
• Define data sources
• Choose the database technology and platform
• Extract the data from the operational databases, transform it, clean it up and load it into
the warehouse
• Choose database access and reporting tools
• Choose database connectivity software
• Choose data analysis and presentation software
• Update the data warehouse

2.4.1 Access tools

Data warehouse implementation relies on selecting suitable data access tools. The best
way to choose a tool is based on the type of data that can be selected using it and the kind
of access it permits for a particular user. The following lists the various types of data that
can be accessed:
• Simple tabular form data
• Ranking data
• Multivariable data
• Time series data
• Graphing, charting and pivoting data
• Complex textual search data
• Statistical analysis data
• Data for testing of hypotheses, trends and patterns
• Predefined repeatable queries
• Ad hoc user-specified queries
• Reporting and analysis data
• Complex queries with multiple joins, multi-level subqueries and sophisticated search
criteria

2.4.2 Data extraction, clean up, transformation and migration


Proper attention must be paid to data extraction, which represents a success factor for a
data warehouse architecture. When implementing a data warehouse, the following selection
criteria that affect the ability to transform, consolidate, integrate and repair the data should
be considered:
• Timeliness of data delivery to the warehouse
• The tool must have the ability to identify the particular data that can be read by the
conversion tool
• The tool must support flat files and indexed files, since corporate data is still in this form
• The tool must have the capability to merge data from multiple data stores
• The tool should have a specification interface to indicate the data to be extracted
• The tool should have the ability to read data from the data dictionary
• The code generated by the tool should be completely maintainable
• The tool should permit the user to extract only the required data
• The tool must have the facility to perform data type and character set translation
• The tool must have the capability to create summarization, aggregation and
derivation of records
• The data warehouse database system must be able to load data directly
from these tools

2.4.3 Data placement strategies

– As a data warehouse grows, there are at least two options for data placement. One
is to put some of the data warehouse data onto other storage media.
– The second option is to distribute the data in the data warehouse across multiple
servers.

2.4.4 User levels

The users of data warehouse data can be classified on the basis of their skill level in
accessing the warehouse. There are three classes of users:

Casual users: are most comfortable retrieving information from the warehouse in predefined
formats and running pre-existing queries and reports. These users do not need tools that
allow for building standard and ad hoc reports.

Power users: can use predefined as well as user-defined queries to create simple and ad
hoc reports. These users can engage in drill-down operations. These users may have
experience of using reporting and query tools.

Expert users: tend to create their own complex queries and perform standard analysis on
the information they retrieve. These users have knowledge about the use of query and
report tools.

2.5 Benefits of data warehousing

Data warehouse usage includes:

– Locating the right information
– Presentation of information
– Testing of hypotheses

– Discovery of information
– Sharing the analysis

The benefits can be classified into two categories:

• Tangible benefits (quantified / measurable). These include:

– Improvement in product inventory
– Decrease in production costs
– Improvement in selection of target markets
– Enhancement in asset and liability management

• Intangible benefits (not easily quantified). These include:

– Improvement in productivity by keeping all data in a single location and eliminating
rekeying of data
– Reduced redundant processing
– Enhanced customer relations

3). How can you map the data warehouse architecture to a multiprocessor architecture?
Explain. [CO1-H2]

3.1 Relational data base technology for data warehouse

The functions of a data warehouse are based on relational database technology, which is
implemented in a parallel manner. There are two advantages of having parallel relational
database technology for a data warehouse:

Linear speedup: refers to the ability to increase the number of processors to reduce
response time proportionally.
Linear scale-up: refers to the ability to provide the same performance on the same request
as the database size increases (with a proportional increase in resources).
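A compact way to state these two properties (standard definitions for parallel database systems, added here for reference): if T_1 is the elapsed time on one processor and T_N the elapsed time on N processors for the same job, then

    speedup(N) = T_1 / T_N            (linear speedup when speedup(N) = N)

and if both the job size and the system size grow by a factor of N,

    scaleup(N) = T_(small system, small job) / T_(N-times system, N-times job)    (linear scale-up when scaleup(N) = 1)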

3.1.1 Types of parallelism. There are two types of parallelism:

Inter query Parallelism: In which different server threads or processes handle multiple
requests at the same time.

Intra query Parallelism: This form of parallelism decomposes the serial SQL query into
lower level operations such as scan, join, sort etc. Then these lower level operations are
executed concurrently in parallel.

Intra query parallelism can be done in either of two ways:

Horizontal parallelism: means that the database is partitioned across multiple disks, and
parallel processing occurs within a specific task that is performed concurrently on different
processors against different sets of data.

Vertical parallelism: This occurs among different tasks. All query components such as scan,
join, sort etc are executed in parallel in a pipelined fashion. In other words, an output from
one task becomes an input into another task.


3.1.2 Data partitioning:

Data partitioning is the key component for effective parallel execution of database
operations. Partitioning can be done randomly or intelligently.

Random partitioning includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which each record is
placed on the next disk assigned to the database.

Intelligent partitioning assumes that the DBMS knows where a specific record is located and
does not waste time searching for it across all disks. The various intelligent partitioning
methods include (a small sketch of hash and key-range partitioning follows this list):

Hash partitioning: A hash algorithm is used to calculate the partition number based on the
value of the partitioning key for each row.

Key range partitioning: Rows are placed and located in the partitions according to the value
of the partitioning key; that is, all the rows with key values from A to K are in partition 1, L
to T are in partition 2, and so on.

Schema partitioning: an entire table is placed on one disk, another table is placed on a
different disk, etc. This is useful for small reference tables.

User-defined partitioning: allows a table to be partitioned on the basis of a user-defined
expression.
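A minimal Python sketch of hash and key-range partitioning (illustrative only; the partition count and key ranges are hypothetical, and a real DBMS implements this inside the storage engine):

    NUM_PARTITIONS = 4

    def hash_partition(key):
        """Hash partitioning: partition number computed from the partitioning key."""
        return hash(key) % NUM_PARTITIONS

    # Key-range partitioning: keys 'A'-'K' -> partition 0, 'L'-'T' -> partition 1, rest -> 2.
    RANGES = [("A", "K", 0), ("L", "T", 1)]

    def range_partition(key):
        """Place a row in a partition according to the first letter of its key."""
        first = key[0].upper()
        for lo, hi, part in RANGES:
            if lo <= first <= hi:
                return part
        return 2

    # Example calls: hash_partition("customer_42"), range_partition("Smith")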

3.2 Database architectures of parallel processing

There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything Architecture
2. Shared disk architecture
3. Shared nothing architecture


1.Shared Memory Architecture

Tightly coupled shared memory systems, illustrated in the following figure, have the following
characteristics:
• Multiple PUs share memory.
• Each PU has full access to all shared memory through a common bus.
• Communication between nodes occurs via shared memory.
• Performance is limited by the bandwidth of the memory bus.

Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP
nodes can be used with Oracle Parallel Server in a tightly coupled system, where memory
is shared among the multiple PUs, and is accessible by all the PUs through a memory bus.
Examples of tightly coupled systems include the Pyramid, Sequent, and Sun SparcServer.
Performance is potentially limited in a tightly coupled system by a number of factors. These
include various system components such as the memory bandwidth, PU to PU
communication bandwidth, the memory available on the system, the I/O bandwidth, and the
bandwidth of the common bus.

Parallel processing advantages of shared memory systems are these:

• Memory access is cheaper than inter-node communication. This means that internal
synchronization is faster than using the Lock Manager.
• Shared memory systems are easier to administer than a cluster.

A disadvantage of shared memory systems for parallel processing is as follows:

• Scalability is limited by bus bandwidth and latency, and by available memory.

2.Shared Disk Architecture

Shared disk systems are typically loosely coupled. Such systems, illustrated in the following
figure, have the following characteristics:
• Each node consists of one or more PUs and associated memory.
• Memory is not shared between nodes.
• Communication occurs over a common high-speed bus.
• Each node has access to the same disks and other resources.
• A node can be an SMP if the hardware supports it.
• The bandwidth of the high-speed bus limits the number of nodes (scalability) of the
system.


The cluster illustrated in the figure is composed of multiple tightly coupled nodes. A
Distributed Lock Manager (DLM) is required. Examples of loosely coupled systems are
VAXclusters or Sun clusters.

Since the memory is not shared among the nodes, each node has its own data cache.
Cache consistency must be maintained across the nodes and a lock manager is needed to
maintain the consistency. Additionally, instance locks using the DLM on the Oracle level
must be maintained to ensure that all nodes in the cluster see identical data.

There is additional overhead in maintaining the locks and ensuring that the data caches are
consistent. The performance impact is dependent on the hardware and software
components, such as the bandwidth of the high-speed bus through which the nodes
communicate, and DLM performance.

Parallel processing advantages of shared disk systems are as follows:

• Shared disk systems permit high availability. All data is accessible even if one node
dies.
• These systems have the concept of one database, which is an advantage over
shared nothing systems.
• Shared disk systems provide for incremental growth.

Parallel processing disadvantages of shared disk systems are these:

• Inter-node synchronization is required, involving DLM overhead and greater
dependency on high-speed interconnect.
• If the workload is not partitioned well, there may be high synchronization overhead.
• There is operating system overhead of running shared disk software.

3.Shared Nothing Architecture

Shared nothing systems are typically loosely coupled. In shared nothing systems only one
CPU is connected to a given disk. If a table or database is located on that disk, access
depends entirely on the PU which owns it. Shared nothing systems can be represented as
follows:


Shared nothing systems are concerned with access to disks, not access to memory.
Nonetheless, adding more PUs and disks can improve scaleup. Oracle Parallel Server can
access the disks on a shared nothing system as long as the operating system provides
transparent disk access, but this access is expensive in terms of latency.

Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
• Shared nothing systems provide for incremental growth.
• System growth is practically unlimited.
• MPPs are good for read-only databases and decision support applications.
• Failure is local: if one node fails, the others stay up.

Disadvantages
• More coordination is required.
• More overhead is required for a process working on a disk belonging to another
node.
• If there is a heavy workload of updates or inserts, as in an online transaction
processing system, it may be worthwhile to consider data-dependent routing to
alleviate contention.

3.3 Parallel DBMS features

• Scope and techniques of parallel DBMS operations
• Optimizer implementation
• Application transparency
• Parallel environment, which allows the DBMS server to take full advantage of the
existing facilities on a very low level


• DBMS management tools help to configure, tune, administer and monitor a parallel
RDBMS as effectively as if it were a serial RDBMS.
• Price/Performance: The parallel RDBMS can demonstrate a near-linear speedup
and scale-up at reasonable costs.

3.4 Parallel DBMS vendors

Oracle: Parallel Query Option (PQO)
    Architecture: shared disk architecture
    Data partitioning: key range, hash, round-robin
    Parallel operations: hash joins, scans and sorts
Informix: eXtended Parallel Server (XPS)
    Architecture: shared memory, shared disk and shared nothing models
    Data partitioning: round-robin, hash, schema, key range and user defined
    Parallel operations: INSERT, UPDATE, DELETE
IBM: DB2 Parallel Edition (DB2 PE)
    Architecture: shared nothing model
    Data partitioning: hash
    Parallel operations: INSERT, UPDATE, DELETE, load, recovery, index creation,
    backup, table reorganization
SYBASE: SYBASE MPP
    Architecture: shared nothing model
    Data partitioning: hash, key range, schema
    Parallel operations: horizontal and vertical parallelism

4). Write all the DBMS schemas for decision support. [CO1-H1]

The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is
a collection of related data items, consisting of measures and context data. It typically
represents business items or business transactions. A dimension is a collection of data that
describe one business dimension. Dimensions determine the contextual background for the
facts; they are the parameters over which we want to perform OLAP. A measure is a
numeric attribute of a fact, representing the performance or behavior of the business
relative to the dimensions.
Considering the relational context, there are three basic schemas that are used in
dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema

4.1 Star schema
The multidimensional view of data that is expressed using relational database semantics is
provided by the database schema design called the star schema. The basic premise of the
star schema is that information can be classified into two groups:
• Facts
• Dimensions
A star schema has one large central table (the fact table) and a set of smaller tables
(dimension tables) arranged in a radial pattern around the central table.
Facts are the core data elements being analyzed, while dimensions are attributes about the facts.


The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.

• Each dimension in a star schema is represented with only one dimension table.
• This dimension table contains the set of attributes.
• There is a fact table at the center. It contains the keys to each of the four dimensions.
• The fact table also contains the attributes, namely dollars sold and units sold.

The star schema architecture is the simplest data warehouse schema. It is called a star
schema because the diagram resembles a star, with points radiating from a center. The
center of the star consists of the fact table and the points of the star are the dimension tables.
Usually the fact tables in a star schema are in third normal form (3NF), whereas dimension
tables are de-normalized. Despite the fact that the star schema is the simplest architecture,
it is most commonly used nowadays and is recommended by Oracle.

Fact Tables

A fact table is a table that contains summarized numerical and historical data (facts) and a
multipart index composed of foreign keys from the primary keys of related dimension tables.

A fact table typically has two types of columns: foreign keys to dimension tables and
measures (columns that contain numeric facts). A fact table can contain fact data at the
detail or aggregated level.

Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year),
Region dimension (profit by country, state, city), Product dimension (profit for product1,
product2).

Typical fact tables store data about sales, while dimension tables store data about
geographic regions (markets, cities), clients, products, times and channels.
Measures are numeric data based on columns in a fact table. They are the primary data in
which end users are interested. E.g. a sales fact table may contain a profit measure which
represents the profit on each sale.
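A minimal Python sketch of these ideas using a tiny hypothetical sales star schema (the dimension contents, fact rows and the profit_by helper are illustrative, not a real warehouse design):

    time_dim = {                       # dimension table: time
        1: {"month": "Jan", "quarter": "Q1", "year": 2024},
        2: {"month": "Feb", "quarter": "Q1", "year": 2024},
    }
    product_dim = {                    # dimension table: product
        10: {"name": "Laptop", "brand": "Acme", "type": "electronics"},
        11: {"name": "Desk",   "brand": "Oak",  "type": "furniture"},
    }
    sales_fact = [                     # fact table: foreign keys + numeric measures
        {"time_key": 1, "product_key": 10, "units_sold": 3, "dollars_sold": 2400.0},
        {"time_key": 1, "product_key": 11, "units_sold": 1, "dollars_sold": 300.0},
        {"time_key": 2, "product_key": 10, "units_sold": 2, "dollars_sold": 1600.0},
    ]

    def profit_by(dim, key_field, attr):
        """Aggregate a measure over one dimension attribute (e.g. dollars sold by quarter)."""
        totals = {}
        for row in sales_fact:
            value = dim[row[key_field]][attr]
            totals[value] = totals.get(value, 0.0) + row["dollars_sold"]
        return totals

    # Example: profit_by(time_dim, "time_key", "quarter")  ->  {"Q1": 4300.0}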

The main characteristics of a star schema:

• Simple structure -> easy-to-understand schema
• Great query effectiveness -> small number of tables to join
• Relatively long time of loading data into dimension tables -> de-normalization and
redundancy mean that the size of the table could be large
• The most commonly used in data warehouse implementations -> widely supported
by a large number of business intelligence tools

Potential Performance Problems with star schemas.


The star schema suffers from the following performance problems.

1. Indexing

The multipart key presents some problems in the star schema model.

(day -> week -> month -> quarter -> year)

• It requires multiple metadata definitions (one for each component) to design a single table.

• Since the fact table must carry all key components as part of its primary key, addition or
deletion of levels in the hierarchy will require physical modification of the affected table,
which is a time-consuming process that limits flexibility.

• Carrying all the segments of the compound dimensional key in the fact table increases the
size of the index, thus impacting both performance and scalability.

2. Level Indicator

The dimension table design includes a level-of-hierarchy indicator for every record.
Every query that retrieves detail records from a table that stores details and aggregates
must use this indicator as an additional constraint to obtain a correct result.

If the user is not aware of the level indicator, or its values are incorrect, an otherwise valid
query may result in a totally invalid answer.

The alternative to using the level indicator is the snowflake schema. Aggregate fact tables
are created separately from detail tables. The snowflake schema contains separate fact
tables for each level of aggregation.

Other problems with the star schema design - the pairwise join problem

Joining 5 tables requires joining the first two tables, then joining the result with the third
table, and so on. The intermediate result of every join operation is used to join with the next
table. Selecting the best order of pairwise joins can rarely be solved in a reasonable amount
of time.
A five-table query has 5! = 120 combinations.
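In general (a standard combinatorial fact, stated here for reference), n tables admit n! = n × (n-1) × ... × 1 pairwise (left-deep) join orders, so the search space grows factorially: a six-table query already has 6! = 720 orderings.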

2. Snowflake schema: is the result of decomposing one or more of the dimensions. The
many-to-one relationships among sets of attributes of a dimension can be separated into
new dimension tables, forming a hierarchy. The decomposed snowflake structure visualizes
the hierarchical structure of dimensions very well.


3. Fact constellation schema: For each star schema it is possible to construct a fact
constellation schema (for example, by splitting the original star schema into more star
schemas, each of which describes facts at another level of the dimension hierarchies). The
fact constellation architecture contains multiple fact tables that share many dimension tables.

The main shortcoming of the fact constellation schema is a more complicated design,
because many variants for particular kinds of aggregation must be considered and
selected. Moreover, dimension tables are still large.

4.2 STAR join and STAR Index.

A STAR join is a high-speed, single-pass, parallelizable multi-table join method. It performs
many joins in a single operation using a technology called indexing. For query processing,
the indexes are used on columns and rows of the selected tables.

Red Brick's RDBMS indexes, called STAR indexes, are used for STAR join performance. The
STAR indexes are created on one or more foreign key columns of a fact table. A STAR index
contains information that relates the dimensions of a fact table to the rows that contain
those dimensions. STAR indexes are very space-efficient. The presence of a STAR index
allows Red Brick's RDBMS to quickly identify which target rows of the fact table are of
interest for a particular set of dimensions. Also, because STAR indexes are created over
foreign keys, no assumptions are made about the type of queries which can use the STAR
indexes.

4.3 Bit Mapped Indexing

• SYBASE IQ is an example of a product that uses a bit-mapped index structure for the data
stored in the SYBASE DBMS.
• Sybase released the SYBASE IQ database targeted as an "ideal" data mart solution for
handling multi-user ad hoc (unstructured) queries.
Overview:
• SYBASE IQ is a separate SQL database.
• Once loaded, SYBASE IQ converts all data into a series of bitmaps, which are then
highly compressed and stored on disk.
• SYBASE positions SYBASE IQ as a read-only database for data marts, with a practical
size limitation currently placed at 100 Gbytes.

Data cardinality: Bitmap indexes are used to optimize queries against low-cardinality data
— that is, data in which the total number of possible values is relatively low.

(Cardinality here means the number of distinct values.)

Fig: Bitmap index

For example, the cardinality of a pin-code field in address data may be 50 (50 possible
values), while gender cardinality is only 2 (male and female).

If the bit for a given index entry is "on", the value exists in that record. Here, a 10,000-row
employee table that contains the "gender" column is bitmap-indexed on this column.
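A minimal Python sketch of a bitmap index on a low-cardinality column (the five-row 'gender' column is hypothetical; products such as SYBASE IQ additionally compress these bitmaps):

    rows = ["M", "F", "F", "M", "F"]          # the indexed column, one value per row

    # One bitmap (list of 0/1 bits) per distinct value; bit i is 1 if row i has that value.
    bitmap_index = {}
    for i, value in enumerate(rows):
        bitmap_index.setdefault(value, [0] * len(rows))[i] = 1

    # A predicate such as gender = 'F' is answered by reading the 'F' bitmap only:
    matching_rows = [i for i, bit in enumerate(bitmap_index["F"]) if bit == 1]
    # matching_rows -> [1, 2, 4]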

Bitmap indexes can become bulky and even unsuitable for high cardinality data where the
range of possible values is high. For example, values like "income" or "revenue" may have
an almost infinite number of values.

SYBASE IQ uses a patented technique called Bit-wise technology to build bitmap indexes
for high-cardinality data.

Index types: The first release of SYBASE IQ provides five index techniques.


• Fast Projection index
• A low- or high-cardinality index
• Low Fast index, which involves functions like SUM, AVERAGE and COUNT
• Low Disk index, which is concerned with disk space usage
• High Group and High Non-Group indexes

SYBASE IQ advantages / performance:

• Bit-wise technology
• Compression
• Optimized memory-based processing
• Column-wise processing
• Low operating cost
• Large block I/O
• Operating-system-level parallelism
• Prejoin and ad hoc join capabilities

Disadvantages of SYBASE IQ indexing:

• No updates
• Lack of core RDBMS features
• Less advantageous for planned queries
• High memory usage
4.4 Column Local Storage

Thinking Machines Corporation developed the CM-SQL RDBMS product; this approach is
based on storing data column-wise, as opposed to traditional row-wise storage.

A traditional RDBMS approach to storing data in memory and on disk is to store it one row
at a time, and each row can be viewed and accessed as a single record. This approach
works well for OLTP environments, in which a typical transaction accesses one record at a time.

However, for the set-processing, ad hoc query environment of data warehousing, the goal
is to retrieve multiple values of several columns. For example, if the problem is to calculate
the average, maximum and minimum salary, the column-wise storage of the salary field
allows the DBMS to read only that one column rather than every full row.
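A minimal Python sketch contrasting the two layouts for this salary example (the employee data is hypothetical; real column stores add compression and vectorized execution):

    row_store = [                                  # one record at a time (OLTP style)
        {"emp_id": 1, "name": "Asha",  "dept": "CSE", "salary": 52000},
        {"emp_id": 2, "name": "Ravi",  "dept": "ECE", "salary": 61000},
        {"emp_id": 3, "name": "Meena", "dept": "CSE", "salary": 58000},
    ]

    # Column-wise layout: each column is stored (and read) independently.
    column_store = {
        "emp_id": [1, 2, 3],
        "name":   ["Asha", "Ravi", "Meena"],
        "dept":   ["CSE", "ECE", "CSE"],
        "salary": [52000, 61000, 58000],
    }

    # The salary statistics touch only the 'salary' column in the column store,
    # whereas the row store forces a scan of every full record.
    salaries = column_store["salary"]
    stats = {"avg": sum(salaries) / len(salaries), "max": max(salaries), "min": min(salaries)}
    # stats -> {'avg': 57000.0, 'max': 61000, 'min': 52000}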

5). Explain in detail Data Extraction, Cleanup, and Transformation Tools. [CO1-H2]

5.1 Tool Requirements

The tools that move data contents and formats from operational and external data stores
into the data warehouse perform the following tasks:

• Data transformation - from one format to another, accounting for possible differences
between the source and target platforms.
• Data transformation and calculation - based on the application of business rules.
• Data consolidation and integration - which include combining several source
records into a single record to be loaded into the warehouse.
• Metadata synchronization and management - which includes storing and/or
updating metadata definitions about source data files, transformation actions,
loading formats, events, etc.

The following criteria affect a tool's ability to transform, consolidate, integrate and repair the
data:

1. The ability to identify data - in the data source environments that can be read by the
conversion tool is important.
2. Support for flat files and indexed files is critical, e.g. VSAM, IMS and CA-IDMS.
3. The capability to merge data from multiple data stores is required in many
installations.
4. The specification interface to indicate the data to be extracted and the conversion
criteria is important.


5. The ability to read information from data dictionaries or import information from
warehouse products is desired.
6. The code generated by the tool should be completely maintainable from within the
development environment.
7. Selective data extraction of both data elements and records enables users to extract
only the required data.
8. A field-level data examination for the transformation of data into information is
needed.
9. The ability to perform data-type and character-set translation is a requirement when
moving data between incompatible systems.
10. The capability to create summarization, aggregation and derivation records and field
is very important.
11. Vendor stability and support for the product items must be carefully evaluated.

5.2 Vendor Approaches

Integrated solutions can fall into one of the categories described below.

• Code generators create modified 3GL/4GL programs based on source and target data
definitions, and data transformation and enhancement rules defined by the developer. This
approach reduces the need for an organization to write its own data capture,
transformation, and load programs.

• Database data replication tools utilize database triggers or a recovery log to capture
changes to a single data source on one system and apply the changes to a copy of the
source data located on a different system.

• Rule-driven dynamic transformation engines (data mart builders) capture data from a
source system at user-defined intervals, transform the data, and then send and load the
results into a target environment, typically a data mart.

5.3 Access to legacy Data

With Enterprise/Access, legacy systems on virtually any platform can be connected to a
new data warehouse via client/server interfaces without the significant time, cost, or risk
involved in reengineering application code.

Enterprise/Access provides a three-tiered architecture that defines applications partitioned
with near-term integration and long-term migration objectives.

• The data layer - provides data access and transaction services for management of
corporate data assets. This layer is independent of any current business process or user
interface application. It manages the data and implements the business rules for data
integrity.

• The process layer - provides services to manage automation and support for current
business processes. It allows modification of the supporting application logic independent of
the necessary data or user interface.

• The user layer - manages user interaction with process and/or data layer services. It
allows the user interface to change independently of the basic business processes.

5.4 Vendor Solution

• Prism Solutions
• SAS Institute
• Vality Corporation
• Information Builders

Prism Solutions: While Enterprise/Access focuses on providing access to legacy data,
Prism Warehouse Manager provides a solution for data warehousing by mapping source
data to a target DBMS to be used as a warehouse.

Prism Warehouse Manager can extract data from multiple source environments, including
DB2, IDMS, IMS, VSAM, RMS, and sequential files under UNIX or MVS. It has strategic
relationships with Pyramid and Informix.

SAS Institute:
SAS starts from the premise that critical data still resides in the data center and offers its
traditional SAS System tools to serve data warehousing functions. Its data repository
function can act to build the informational database.

SAS Data Access Engines serve as extraction tools to combine common variables,
transform data representation forms for consistency, consolidate redundant data, and use
business rules to produce computed values in the warehouse.

SAS engines can work with hierarchical and relational databases and sequential files.

Vality Corporation:

Vality Corporation's Integrity data reengineering tool is used to investigate, standardize,
transform and integrate data from multiple operational systems and external sources.
Integrity is a specialized, multipurpose data tool that organizations apply on projects such
as:
• Data audits
• Data warehouse and decision support systems
• Customer information files and householding applications
• Client/server business applications such as SAP R/3, Oracle and Hogan
• System consolidations

Information Builders:

A product that can be used as a component of a data extraction, transformation and legacy
access tool suite for building a data warehouse is EDA/SQL from Information Builders.

EDA/SQL implements a client/server model that is optimized for high performance.

EDA/SQL supports copy management, data quality management, data replication
capabilities, and standards support for both ODBC and the X/Open CLI.

5.5. Transformation Engines

1. Informatica:

This is a multicompany metadata integration initiative. Informatica joined forces with Andyne,
Brio, Business Objects, Cognos, Information Advantage, Infospace, IQ Software and
MicroStrategy to deliver a "back-end" architecture and publish API specifications supporting
its technical and business metadata.

2. Power Mart:
Informatica's flagship product — PowerMart suite — consists of the following components.
• Power Mart Designer
• Power Mart server
• The Informatica Server Manager
• The Informatica Repository
• Informatica Power Capture

3. Constellar:

The Constellar Hub consists of a set of components supporting distributed transformation management capabilities. The product is designed to handle the movement and transformation of data for both data migration and data distribution in an operational system, and for capturing operational data for loading into a data warehouse.

The transformation hub performs the tasks of data cleanup and transformation.

The Hub supports:

 Record reformatting and restructuring.
 Field-level data transformation, validation and table lookup.
 File and multi-file set-level data transformation and validation.
 The creation of intermediate results for further downstream transformation by the hub.


6). What do you mean by Metadata? [CO1-H2]

6.1 Metadata Defined

Metadata is one of the most important aspects of data warehousing. It is data about data
stored in the warehouse and its users.


Metadata contains :-

i. The location and description of warehouse system and data components (warehouse
objects).
ii. Names, definition, structure and content of the data warehouse and end user views.
iii. Identification of reliable data sources (systems of record).
iv. Integration and transformation rules - used to generate the data warehouse; these
include the mapping method from operational databases into the warehouse, and
algorithms used to convert, enhance, or transform data.
v. Integration and transformation rules - used to deliver data to end-user analytical
tools.
vi. Subscription information - for the information delivery to the analysis subscribers.
vii. Data warehouse operational information, - which includes a history of warehouse
updates, refreshments, snapshots, versions, ownership authorizations and extract audit
trail.
viii. Metrics - used to analyze warehouse usage and performance and end user usage patterns.
ix. Security - authorizations, access control lists, etc.
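To make items (iv) and (v) above more concrete, the sketch below shows one way a source-to-target transformation rule could be recorded in a relational metadata repository. The table and column names are hypothetical and purely illustrative; actual repository schemas vary from vendor to vendor.

-- Hypothetical metadata table recording how each warehouse column is derived
-- from an operational source (illustrative only; real repositories differ by vendor).
CREATE TABLE meta_source_to_target_map (
    map_id         INTEGER PRIMARY KEY,
    source_system  VARCHAR(30),    -- system of record, e.g. an operational DB2 application
    source_table   VARCHAR(64),
    source_column  VARCHAR(64),
    target_table   VARCHAR(64),    -- warehouse table
    target_column  VARCHAR(64),
    transform_rule VARCHAR(255),   -- conversion/enhancement algorithm applied during load
    last_refreshed DATE            -- supports the extract audit trail
);

-- One mapping rule: an operational amount held in cents becomes a dollar value.
INSERT INTO meta_source_to_target_map
VALUES (1, 'ORDERS_DB2', 'ORD_HDR', 'AMT_CENTS',
        'SALES_FACT', 'SALES_AMOUNT', 'AMT_CENTS / 100.0', DATE '2016-01-31');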

6.2 Metadata Interchange Initiative

In an environment such as a data warehouse, different tools must be able to freely and easily access, and in some cases manipulate and update, metadata that was created by other tools and stored in a variety of different storage facilities. To achieve this goal, a group of data warehousing vendors established at least a minimum common set of interchange standards and guidelines that the different vendors' tools can fulfill. This effort is known as the Metadata Interchange Initiative.

The metadata interchange standard defines two different meta models:

 The application meta model — the tables, etc., used to "hold" the metadata for a
particular application.
 The metadata meta model — the set of objects that the metadata interchange standard
can be used to describe.

These represent the information that is common to one or more classes of tools, such as
data extraction tools, replication tools, user query tools and database servers.

Metadata interchange standard framework – (architecture):


This framework defines three approaches:

• Procedural approach: An API (Application Program Interface) is used by tools to create, update, access, and interact with metadata. This approach ties the interchange standard to the development of a standard metadata implementation.

• ASCII batch approach: This approach relies on an ASCII file format that contains the description of the metadata components and the standardized access requirements that make up the interchange standard metadata model.

• Hybrid approach: A data-driven model in which a table-driven API supports only fully qualified references for each metadata element; a tool interacts with the API through the standard access framework and directly accesses just the specific metadata object needed.

The components of metadata interchange standard framework are:

• The standard metadata model, which refers to the ASCII file format used to represent the
metadata that is being exchanged.
• The standard access framework, which describes the minimum number of API functions a vendor must support.
• Tool profile, which is provided by each tool vendor. The tool profile is a file that describes
what aspects of the interchange standard metamodel a particular tool supports.
• The user configuration, which is a file describing the legal interchange paths for metadata
in the user's environment.


6.3 Metadata Repository (storage)

• The data warehouse architecture framework includes the metadata interchange framework as one of its components.
• It defines a number of components, all of which interact with each other via the architecturally defined layer of metadata.

Metadata repository management software can be used to map the source data to the
target database, generate code for data transformations, integrate and transform the data,
and control moving data to the warehouse.

Metadata defines the contents and location of data (data model) in the warehouse,
relationships between the operational databases and the data warehouse and the business
views of the warehouse data that are accessible by end-user tools.

A data warehouse design must ensure that there is a mechanism for maintaining the metadata repository, and that all access paths to the data warehouse have metadata as an entry point.

The figure below illustrates the variety of access paths available into the data warehouse and, at the same time, shows how many tool classes can be involved in the process.

Fig: 1.11- Data warehouse architecture


Metadata access and collection are indicated by double lines.

The warehouse design should prevent any direct access to the warehouse data without the use of metadata definitions.

Fig: - Tool landscape and metadata integration points

A metadata repository provides the following benefits:

 It provides a complete set of tools for metadata management.
 It reduces and eliminates information redundancy and inconsistency.
 It simplifies management and improves organization, control and accounting of information assets.
 It increases identification, understanding, coordination and utilization of enterprise-wide information assets.
 It provides effective data administration tools to manage corporate information assets with a full-function data dictionary.
 It increases flexibility, control, and reliability of the application development process and speeds up internal application development.
 It protects the investment in legacy systems with the ability to inventory and utilize existing applications.
 It provides a universal relational model for heterogeneous RDBMSs to interact and share information.
 It enforces CASE development standards and eliminates redundancy with the ability to share and reuse metadata.

6.4 Metadata Management

A major problem in data warehousing is the inability to communicate to the end user about
what information resides in the data warehouse and how it can be accessed.

• Metadata can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformation.
• Metadata needs to be collected as the warehouse is designed and built.
• Even though there are a number of tools available to help users understand and use the warehouse, these tools need to be carefully evaluated before any purchasing decision is made.

6.5 Implementation Examples

Vendors providing metadata repository implementations include Platinum Technology, R&O, Prism Solutions and Logic Works.

6.6 Metadata Trends

A growing trend in the data warehouse arena is the need to include external data within the data warehouse, in order to reduce costs, increase competitiveness and improve business agility.
The process of integrating external and internal data into the warehouse faces a number of challenges:
 Inconsistent data formats
 Missing or invalid data
 Different levels of aggregation
 Semantic inconsistency
 Unknown or questionable data quality and timeliness
Data warehouses also integrate various data types, such as alphanumeric data, text, voice, image, full-motion video, and web pages in HTML formats.


UNIVERSITY QUESTIONS
UNIT- I
Part A
1. Define the term ‘Data Warehouse’.
2. List out the functionality of metadata.
3. What are the nine decisions in the design of a data warehouse?
4. List out the two different types of reporting tools.
5. What are the technical issues to be considered when designing and implementing a
data warehouse environment?
6. What are the advantages of data warehousing?
7. Give the difference between horizontal and vertical parallelism.
8. Define star schema.
9. What are the steps to be followed to store data from an external source into the data
warehouse?
10. Define legacy data.
Part-B
1. Enumerate the building blocks of data warehouse. Explain the importance of
metadata in a data warehouse environment. [16]
2. Explain various methods of data cleaning in detail [8]
3. Diagrammatically illustrate and discuss the data warehousing architecture with briefly
explain components of data warehouse [16]
4. (i) Distinguish between Data warehousing and data mining. [8]
(ii)Describe in detail about data extraction, cleanup [8]
5. Write short notes on (i) Transformation [8]
(ii) Metadata [8]
6. List and discuss the steps involved in mapping the data warehouse to a
multiprocessor architecture. [16]
7. Explain in detail about different Vendor Solutions. [16]


UNIT II

Part A

1. Define schema hierarchy? [CO2-L1]

A concept hierarchy that is a total (or) partial order among attributes in a database schema
is called a schema hierarchy.

2.List out the OLAP operations in multidimensional data model? [CO2-L1]

• Roll-up
• Drill-down
• Slice and dice
• Pivot (or) rotate

3.What is roll-up operation? [CO2-L1]

The roll-up operation is also called drill-up operation which performs aggregation on a data
cube either by climbing up a concept hierarchy for a dimension (or) by dimension reduction.

4.What is drill-down operation? [CO2-L2]

Drill-down is the reverse of roll-up operation. It navigates from less detailed data to more
detailed data. Drill-down operation can be taken place by stepping down a concept
hierarchy for a dimension

5.What is slice operation? [CO2-L1]

The slice operation performs a selection on one dimension of the cube resulting in a sub
cube.

6.What is dice operation? [CO2-L1]

The dice operation defines a sub cube by performing a selection on two (or) more
dimensions.

7.What is pivot operation? [CO2-L1]

This is a visualization operation that rotates the data axes in an alternative presentation of
the data.

8.List out the views in the design of a data warehouse? [CO2-L1]

• Top-down view
• Data source view
• Data warehouse view
• Business query view

9.What are the methods for developing large software systems? [CO2-L1]

• Waterfall method
• Spiral method

10.How the operation is performed in waterfall method? [CO2-L2]

The waterfall method performs a structured and systematic analysis at each step before
proceeding to the next, which is like a waterfall falling from one step to the next.

11.List out the steps of the data warehouse design process? [CO2-L2]

• Choose a business process to model.


• Choose the grain of the business process
• Choose the dimensions that will apply to each fact table record.
• Choose the measures that will populate each fact table record.

12. Define ROLAP? [CO2-L1]

The ROLAP model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations.

13. Define MOLAP? [CO2-L2]

The MOLAP model is a special purpose server that directly implements multidimensional
data and operations.

14. Define HOLAP? [CO2-L2]

The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the
greater scalability of ROLAP and the faster computation of MOLAP,(i.e.) a HOLAP server
may allow large volumes of detail data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store.

15.What is enterprise warehouse? [CO2-L1]

An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one (or) more operational systems (or) external information providers. It contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes, terabytes (or) beyond. An enterprise data warehouse may be implemented on traditional mainframes, UNIX super servers (or) parallel architecture platforms. It requires business modeling and may take years to design and build.

16.What is data mart? [CO2-L2]

A data mart is a database that contains a subset of the data present in a data warehouse. Data marts are created to structure the data in a data warehouse according to issues such as
hardware platforms and access control strategies. We can divide a data warehouse into
data marts after the data warehouse has been created. Data marts are usually
implemented on low-cost departmental servers that are UNIX (or) windows/NT based. The
implementation cycle of the data mart is likely to be measured in weeks rather than months
(or) years.

17.What are dependent and independent data marts? [CO2-L2]

Dependent data marts are sourced directly from enterprise data warehouses. Independent data marts are data captured from one (or) more operational systems (or) external information providers, (or) data generated locally within a particular department (or) geographic area.

18.What is virtual warehouse? [CO2-L2]

A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.

19.Define indexing? [CO2-L2]

Indexing is a technique, which is used for efficient data retrieval (or) accessing data in a
faster manner. When a table grows in volume, the indexes also increase in size requiring
more storage.

20.Define metadata? [CO2-L2]

Metadata in a data warehouse is used for describing data about data, (i.e.) metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of a given warehouse.

Part - B

1. Describe all the reporting and query tools for data analysis. [CO2-H2]
The principal purpose of data warehousing is to provide information to business users for
strategic decision making. These users interact with the data warehouse using front-end
tools, or by getting the required information through the information delivery system.
1.1 Tool Categories
There are five categories of decision support tools
1. Reporting
2. Managed query
3. Executive information systems (EIS)
4. On-line analytical processing (OLAP)
5. Data mining (DM)
1.1.1.Reporting tools:
Reporting tools can be divided into production reporting tools and desktop report writers.

1.1 Production reporting tools: Companies use production reporting tools to generate regular operational reports or to support high-volume batch jobs, e.g. calculating and printing paychecks.

Production reporting tools include third-generation languages such as COBOL, specialized fourth-generation languages such as Information Builders, Inc.'s Focus, and high-end client/server tools such as MITI's SQR.


1.2 Report writers: These are inexpensive desktop tools designed for end users. Products such as Seagate Software's Crystal Reports allow users to design and run reports without having to rely on the IS department.

In general, report writers have graphical interfaces and built-in charting functions,
They can pull groups of data from a variety of data sources and integrate them in a single
report.

Leading report writers include Crystal Reports, Actuate and Platinum Technology,
Inc's Info Reports. Vendors are trying to increase the scalability of report writers by-
supporting three-tiered architectures in which report processing is done on a Windows NT
or UNIX server.
Report writers also are beginning to offer object-oriented interfaces for designing
and manipulating reports and modules for performing ad hoc queries and OLAP analysis.
Users and related activities:

User              | Activity          | Tools
Clerk             | Simple retrieval  | 4GL
Executive         | Exception reports | EIS
Manager           | Simple retrieval  | 4GL
Business analysts | Complex analysis  | Spreadsheets, OLAP, data mining

1.1.2.Managed query tools:


Managed query tools protect end users from the complexities of SQL and database
structures by inserting a metalayer between users and the database. Metalayer is the
software that provides subject-oriented views of a database and supports point-and-click
creation of SQL. Some vendors, such as Business objects, Inc., call this layer a "universe".
Managed query tools have been extremely popular because they make it possible for
knowledge workers to access corporate data without IS intervention.
Most managed query tools have embraced three-tiered architectures to improve scalability.
Managed query tool vendors are racing to embed support for OLAP and Data mining
features.
Other tools are IQ Software's IQ Objects, Andyne Computing Ltd.'s GQL, IBM's Decision Server, Speedware Corp.'s Esperant (formerly sold by Software AG), and Oracle Corp.'s Discoverer/2000.
1.1.3. Executive Information System tools:
Executive Information System (EIS) tools predate report writers and managed query tools; they were first installed on mainframes.
EIS tools allow developers to build customized, graphical decision support applications or
"briefing books".
• EIS applications highlight exceptions to normal business activity or rules by using color —
coded graphics.

EIS tools include Pilot Software, Inc.'s Lightship, Platinum Technology's Forest and Trees, Comshare, Inc.'s Commander Decision, Oracle's Express Analyzer and SAS Institute, Inc.'s SAS/EIS.
EIS vendors are moving in two directions.

 Many are adding managed query functions to compete head-on with other -decision
support tools.
 Others are building packaged applications that address horizontal functions, such as
sales budgeting, and marketing, or vertical industries such as financial services.
Ex: Platinum Technologies offers Risk Advisor.
1.1.4 OLAP tools:
OLAP tools provide an intuitive way to view corporate data.
These tools aggregate data along common business subjects or dimensions and then let users navigate through the hierarchies and dimensions with the click of a mouse button.
Some tools, such as Arbor Software Corp.'s Essbase and Oracle's Express, pre-aggregate data in special multidimensional databases.
Other tools work directly against relational data and aggregate data on the fly, such as MicroStrategy, Inc.'s DSS Agent or Information Advantage, Inc.'s Decision Suite.
Some tools process OLAP data on the desktop instead of a server.
Desktop OLAP tools include Cognos PowerPlay, Brio Technology, Inc.'s BrioQuery, Planning Sciences, Inc.'s Gentium, and Andyne's Pablo.
1.1.5 Data mining tools:
Data mining tools provide insights into corporate data that are not easily discerned with managed query or OLAP tools.
Data mining tools use a variety of statistical and artificial intelligence (AI) algorithms to analyze the correlation of variables in the data and search out interesting patterns and relationships to investigate.
Data mining tools, such as IBM's Intelligent Miner, are expensive and require statisticians to implement and manage.
Other tools include DataMind Corp.'s DataMind, Pilot's Discovery Server, and tools from Business Objects and SAS Institute.
These tools offer simple user interfaces that plug in directly to existing OLAP tools or databases and can be run directly against data warehouses.
For example, all end-user tools use metadata definitions to obtain access to data stored in the warehouse, and some of these tools (e.g., OLAP tools) may employ additional or intermediary data stores (e.g., data marts, multidimensional databases).


1.1.6 Applications
Organizations use a familiar application development approach to build a query and
reporting environment for the data warehouse. There are several reasons for doing this:

 A legacy DSS or EIS system is still being used, and the reporting facilities appear adequate.
 An organization has made a large investment in a particular application development
environment (eg., Visual C++, Power Builder).
 A new tool may require an additional investment in developers skill set, software, and the
infrastructure, all or part of which was not budgeted for in the planning stages of the project.
 The business users do not want to get involved in this phase of the project, and will continue to rely on the IT organization to deliver periodic reports in a familiar format.
 A particular reporting requirement may be too complicated for an available reporting tool to handle.
All these reasons are perfectly valid and in many cases result in a timely and cost-effective
delivery of a reporting system for a data warehouse.
2. What is the need for applications? [CO2-H2]
These tools and applications fit into the managed query and EIS categories. They are easy-to-use, point-and-click tools that either accept SQL or generate SQL statements to query relational data stored in the warehouse.
Some of these tools and applications can format the retrieved data in easy-to-read reports, while
others concentrate on the on-screen presentation.
The users of these business applications perform tasks such as:
 segment identification,
 demographic analysis,
 territory management, and
 customer mailing lists.

As the complexity of the questions grows, these tools may rapidly become inefficient. Consider the various access types to the data stored in a data warehouse:

 Simple tabular form reporting.
 Ad hoc user-specified queries.
 Predefined repeatable queries.
 Complex queries with multi-table joins, multi-level subqueries, and sophisticated search criteria.
 Ranking.
 Multivariable analysis.
 Time series analysis.
 Data visualization, graphing, charting and pivoting.
 Complex textual search.
 Statistical analysis.
 AI techniques for testing of hypotheses, trends discovery, and definition and validation of data clusters and segments.
 Information mapping (i.e., mapping of spatial data in geographic information systems).
 Interactive drill-down reporting and analysis.

The first four types of access are covered by the combined category of tools called query and reporting tools.
1. Creation and viewing of standard reports:
This is the main reporting activity: the routine delivery of reports based on predetermined measures.
2. Definition and creation of ad hoc reports:
These can be quite complex, and the trend is to off-load this time-consuming activity to the users. Reporting tools that allow managers and business users to quickly create their own reports and get quick answers to business questions are becoming increasingly popular.
3. Data exploration: With the newest wave of business intelligence tools, users can easily "surf" through data without a preset path to quickly uncover business trends or problems.
While reporting type 1 may appear relatively simple, types 2 and 3, combined with certain business requirements, often exceed existing tools' capabilities and may require building sophisticated applications to retrieve and analyze warehouse data.

This approach may be very useful for those data warehouse users who are not yet
comfortable with ad hoc queries.

3. Explain COGNOS Impromptu in detail with examples. [CO2-H2]


Overview
Impromptu is an interactive database reporting tool. It allows users to query data without programming knowledge. When using the Impromptu tool, no data is written or changed in the database; it is only capable of reading the data.

 Impromptu, from Cognos Corporation, is an interactive database reporting tool that delivers 1 to 1000+ seat scalability.
 Impromptu's object-oriented architecture ensures control and administrative consistency across all users and reports.
 Users access Impromptu through its easy-to-use graphical user interface.
 It offers a fast and robust implementation at the enterprise level, and features full administrative control, ease of deployment, and low cost of ownership.
 It can serve both as an enterprise database reporting tool and for single-user reporting on personal data.

The Impromptu Information Catalog

Impromptu stores metadata in subject-related folders. This metadata will be used to develop a query for a report. The metadata set is stored in a file called a catalog. The catalog does not contain any data; it just contains information about connecting to the database and the fields that will be accessible for reports.

A catalog contains:

 Folders—meaningful groups of information representing columns from one or more tables
 Columns—individual data elements that can appear in one or more folders
 Calculations—expressions used to compute required values from existing data
 Conditions—used to filter information so that only a certain type of information
is displayed
 Prompts—pre-defined selection criteria prompts that users can include in reports
they create
 Other components, such as metadata, a logical database name, join information,
and user classes
 Impromptu reporting begins with the information catalog, a LAN based repository
(Storage area) of business knowledge and data access rules.
 The catalog insulates users from such technical aspects of the database as SQL
syntax, table joins and hidden table and field names.
 Creating a catalog is a relatively simple task, so that an Impromptu administrator can
be anyone who's familiar with basic database query function.
 The catalog presents the database in a way that reflects how the business is
organized, and uses the terminology of the business.
 Impromptu administrators are free to organize database items such as tables and
fields into Impromptu subject-oriented folders, subfolders and columns.
 This enables business-relevant reporting through business rules, which can consist of shared calculations, filters and ranges for critical success factors.
Use of catalogs:

 view, run, and print reports
 export reports to other applications
 disconnect from and connect to the database
 create reports
 change the contents of the catalog
 add user classes
Object-oriented architecture

 Impromptu's object-oriented architecture drives inheritance-based administration and distributed catalogs.
 Impromptu implements management functionality through the use of governors.
 The governors allow administrators to control the enterprise's reporting environment. Activities and processes that governors can control are:

• Query activity
• Processing location
• Database connections
• Reporting permissions
• User profiles
• Client/server balancing

• Database transaction
• Security by value
• Field and table security

Reporting

 Impromptu is designed to make it easy for users to build and run their own reports.
 Impromptu's predefined report wise templates include templates for mailing labels,
invoices, sales reports, and directories. These templates are complete with
formatting, logic, calculations, and custom automation.
 The templates are database-independent; therefore, users simply map their data
onto the existing placeholders to quickly create reports.
 Impromptu provides users with a variety of page and screen formats, known as Head
starts.
 Impromptu offers special reporting options that increase the value of distributed
standard reports.
Picklists and prompts: Organizations can create standard Impromptu reports for which users can select from lists of values called picklists. Picklists and prompts make a single report flexible enough to serve many users.
Custom templates: Standard report templates with global calculations and business rules
can be created once and then distributed to users of different databases.
A template's standard logic, calculations and layout complete the report automatically in the
user's choice of format.
Exception reporting: Exception reporting is the ability to have reports highlight values that
lie outside accepted ranges. Impromptu offers three types of exception reporting.

 Conditional filters — Retrieve only those values that are outside an accepted threshold.
 Conditional highlighting — Create rules for formatting data on the basis of data
values.
 Conditional display — Display report objects under certain conditions
Interactive reporting: Impromptu unifies querying and reporting in a single interface. Users can perform both these tasks by interacting with live data in one integrated module.
Frames: Impromptu offers an interesting frame based reporting style.
Frames are building blocks that may be used to produce reports that are formatted with
fonts, borders, colors, shading etc.
Frames or combination of frames, simplify building even complex reports.
The data formats itself according to the type of frame selected by the user.

 List frames are used to display detailed information.
 Form frames offer layout and design flexibility.
 Cross-tab frames are used to show the totals of summarized data at selected
intersections.
 Chart frames make it easy for users to see their business data in 2-D and 3-D
displays using line, bar, ribbon, area and pie charts.

 Text frames allow users to add descriptive text to reports and display binary large
objects (BLOBS) such as product descriptions.
 Picture frames incorporate bitmaps to reports or specific records, perfect for visually
enhancing reports.
 OLE frames make it possible for user to insert any OLE object into a report.
Impromptu Request Server
The new request server allows clients to off-load the query process to the server. A PC user can now schedule a request to run on the server, and an Impromptu request server will execute the request, generating the result on the server. When done, the scheduler notifies the user, who can then access, view or print at will from the PC.
The Impromptu request server runs on HP/UX 9.X, IBM AIX 4.X and Sun Solaris 2.4. It
supports data maintained in ORACLE 7.X and SYBASE system 10/11.
Supported databases
Impromptu provides native database support for ORACLE, Microsoft SQL Server, SYBASE SQL Server, Omni SQL Gateway, SYBASE Net Gateway, MDI DB2 Gateway, Informix, CA-Ingres, Gupta SQLBase, Borland InterBase, Btrieve, dBASE, Paradox, and ODBC access to any database with an ODBC driver.
Impromptu features include:
* Unified query and reporting interface
* Object-oriented architecture
* Complete integration with PowerPlay
* Scalability
* Security and control
* Data presented in business context
* Over 70 predefined report templates
* Frame-based reporting
* Business-relevant reporting
* Database-independent catalogs

4. Write short notes on Online Analytical Processing (OLAP). [CO2-H1]


OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension
tables) to enable multidimensional viewing, analysis and querying of large amounts of data.
E.g. OLAP technology could provide management with fast answers to complex queries on
their operational data or enable them to analyze their company's historical data for trends
and patterns.

Online Analytical Processing (OLAP) applications and tools are those that are designed to ask "complex queries of large multidimensional collections of data." Because of this, OLAP is closely associated with data warehousing.

OLAP is an application architecture, not fundamentally a data warehouse or a database management system (DBMS). Whether or not it utilizes a data warehouse, OLAP is becoming an architecture that an increasing number of enterprises are implementing to support analytical applications.
The majority of OLAP applications are deployed in a "stovepipe" fashion, using specialized MDDBMS technology, a narrow set of data, and a preassembled application user interface.
5. What is the need for OLAP? [CO2-H2]
Business problems such as market analysis and financial forecasting require query-centric database schemas that are array-oriented and multidimensional in nature.
These business problems are characterized by the need to retrieve large numbers of records from very large data sets (hundreds of gigabytes and even terabytes). The multidimensional nature of the problems OLAP is designed to address is its key driver.
The result set may look like a multidimensional spreadsheet (hence the term multidimensional). Although all the necessary data can be represented in a relational database and accessed via SQL, the two-dimensional relational model of data and the Structured Query Language (SQL) have limitations for such complex real-world problems.
SQL limitations and the need for OLAP:

One of the limitations of SQL is that it cannot easily represent these complex problems. A single business question may have to be translated into several SQL statements, involving multiple joins, intermediate tables, sorting, aggregations and a large amount of temporary storage for intermediate results. These procedures require a lot of computation and therefore a long processing time.

The second limitation of SQL is its inability to embed mathematical models in SQL statements. If an analyst builds such complex analyses out of SQL statements, a large number of computations and a huge amount of memory are needed. Therefore the use of OLAP is preferable for solving this kind of problem.
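As a hedged illustration of the first limitation, the sketch below shows how a fairly ordinary multidimensional question ("compare each country's quarterly unit sales with the same quarter of the previous year") already requires multiple joins, repeated aggregation and derived intermediate tables when expressed in plain SQL. The tables sales_fact, time_dim and location_dim are hypothetical, not taken from any particular schema.

-- Assumed (hypothetical) tables:
--   sales_fact(item_key, location_key, time_key, units_sold)
--   time_dim(time_key, quarter, year_no)
--   location_dim(location_key, country)
SELECT cur.country, cur.year_no, cur.quarter,
       cur.total_units, prev.total_units AS prior_year_units
FROM (
    SELECT l.country, t.year_no, t.quarter, SUM(f.units_sold) AS total_units
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.country, t.year_no, t.quarter
) cur
LEFT JOIN (
    SELECT l.country, t.year_no, t.quarter, SUM(f.units_sold) AS total_units
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.country, t.year_no, t.quarter
) prev
  ON prev.country = cur.country
 AND prev.quarter = cur.quarter
 AND prev.year_no = cur.year_no - 1;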

6. Explain the Multidimensional Data Model with a neat diagram. [CO2-H2]

The multidimensional data model is an integral part of On-Line Analytical Processing, or


OLAP. Because OLAP is on-line, it must provide answers quickly; analysts create iterative
queries during interactive sessions, not in batch jobs that run overnight. And because OLAP
is also analytic, the queries are complex. The multidimensional data model is designed to
solve complex queries in real time.

The simplest way to view the multidimensional data model is as a cube. The table at the left contains detailed sales data by product, market and time. The cube on the right associates sales numbers (units sold) with the dimensions product type, market and time, with the unit variables organized as cells in an array.


Fig: 2.2 - Relational tables and multidimensional cubes

This cube can be expanded to include another array, price, which can be associated with all or only some dimensions. As the number of dimensions increases, the number of cube cells increases exponentially.

Dimensions are hierarchical in nature, i.e. the time dimension may contain hierarchies for years, quarters, months, weeks and days; GEOGRAPHY may contain country, state, city, etc.

In this cube we can observe that each side of the cube represents one of the elements of the question. The x-axis represents time, the y-axis represents the products and the z-axis represents the different centers. The cells in the cube represent the number of products sold, or can represent the price of the items.

This figure also gives a different understanding of the drill-down operation: the dimensions used for analysis need not be directly related to one another.

As the size of the dimensions increases, the size of the cube also increases exponentially. The response time of the cube depends on the size of the cube.
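In a relational (ROLAP) implementation, the cube described above is usually represented by a star schema: a fact table whose rows correspond to cube cells and whose foreign keys point to the dimension tables. The following is a minimal sketch with hypothetical names; it is the schema assumed by the OLAP operation examples later in this answer.

-- Dimension tables: one row per member of each cube dimension.
CREATE TABLE item_dim (
    item_key  INTEGER PRIMARY KEY,
    item_name VARCHAR(50)                     -- e.g. 'Mobile', 'Modem'
);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street VARCHAR(50), city VARCHAR(50),
    state VARCHAR(50), country VARCHAR(50)    -- hierarchy: street < city < state < country
);
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,
    day_date DATE, month_no INTEGER,
    quarter VARCHAR(2), year_no INTEGER       -- hierarchy: day < month < quarter < year
);

-- Fact table: each row is one cell of the cube.
CREATE TABLE sales_fact (
    item_key     INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    time_key     INTEGER REFERENCES time_dim(time_key),
    units_sold   INTEGER,
    unit_price   DECIMAL(10,2)
);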

OLAP Operations (Operations in Multidimensional Data Model:)


 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)
1.Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:

 By climbing up a concept hierarchy for a dimension


 By dimension reduction

The following diagram illustrates how roll-up works.


 Roll-up is performed by climbing up a concept hierarchy for the dimension location.
 Initially the concept hierarchy was "street < city < state < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
 The data is grouped into countries rather than cities.
 When roll-up is performed, one or more dimensions from the data cube are removed.

Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways:

 By stepping down a concept hierarchy for a dimension


 By introducing a new dimension.
The following diagram illustrates how drill-down works:


 Drill-down is performed by stepping down a concept hierarchy for the dimension time.
 Initially the concept hierarchy was "day < month < quarter < year."
 On drilling down, the time dimension is descended from the level of quarter to the level of month.
 When drill-down is performed, one or more dimensions from the data cube are added.
 It navigates the data from less detailed data to highly detailed data.
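Expressed against the same hypothetical star schema, drill-down simply groups at a finer level of the time hierarchy:

-- Drill-down on time: descend from quarter to month by grouping at a finer level.
SELECT l.city, t.year_no, t.month_no, SUM(f.units_sold) AS units
FROM sales_fact f
JOIN location_dim l ON f.location_key = l.location_key
JOIN time_dim t ON f.time_key = t.time_key
GROUP BY l.city, t.year_no, t.month_no;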

Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.


 Here Slice is performed for the dimension "time" using the criterion time = "Q1".
 It will form a new sub-cube by selecting one or more dimensions.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.

The dice operation on the cube based on the following selection criteria involves three
dimensions.
 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item =" Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows the
pivot operation.

In this example, the item and location axes of the 2-D slice are rotated.
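One hedged way to express this rotation in plain SQL is conditional aggregation, which turns the members of one dimension into columns (vendor-specific PIVOT syntax is deliberately avoided here); the schema is the same hypothetical star schema as before:

-- Pivot: location members become columns, items become rows (Q1 slice).
SELECT i.item_name,
       SUM(CASE WHEN l.city = 'Toronto' THEN f.units_sold ELSE 0 END) AS toronto_units,
       SUM(CASE WHEN l.city = 'Vancouver' THEN f.units_sold ELSE 0 END) AS vancouver_units
FROM sales_fact f
JOIN item_dim i ON f.item_key = i.item_key
JOIN location_dim l ON f.location_key = l.location_key
JOIN time_dim t ON f.time_key = t.time_key
WHERE t.quarter = 'Q1'
GROUP BY i.item_name;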


7. Name all the OLAP guidelines and rules for the implementation process. [CO2-H2]
Dr. E. F. Codd, the "father" of the relational model, created a list of rules to deal with OLAP systems.

These rules are:

1). Multidimensional conceptual view: The OLAP system should provide a suitable multidimensional business model that suits the business problems and requirements.

2). Transparency: (OLAP must be transparent to the input data for the users.)
The OLAP system's technology, the underlying database and computing architecture (client/server, mainframe gateways, etc.) and the heterogeneity of input data sources should be transparent to users, to preserve their productivity and proficiency with familiar front-end environments and tools (e.g., MS Windows, MS Excel).

3). Accessibility: (The OLAP tool should access only the data required for the analysis.)
The OLAP system should access only the data actually required to perform the analysis. The system should be able to access data from all heterogeneous enterprise data sources required for the analysis.

4). Consistent reporting performance: (The size of the database should not affect performance.)
As the number of dimensions and the size of the database increase, users should not perceive any significant decrease in performance.

5). Client/server architecture: (A client/server architecture ensures better performance and flexibility.)
The OLAP system has to conform to client/server architectural principles for maximum price/performance, flexibility, adaptivity and interoperability.

6). Generic dimensionality: Every data dimension should be equivalent in its structure and operational capabilities.

7). Dynamic sparse matrix handling: The OLAP tool should be able to manage sparse matrices and so maintain the level of performance.

8). Multi-user support: The OLAP system should allow several users to work concurrently on a specific model.

9). Unrestricted cross-dimensional operations: The OLAP system must be able to recognize dimensional hierarchies and automatically perform associated roll-up calculations within and across dimensions.

10). Intuitive data manipulation: Consolidation path reorientation, pivoting, drill-down and roll-up, and other manipulations should be accomplished via direct point-and-click and drag-and-drop operations on the cells of the cube.
11). Flexible reporting: The ability to arrange rows, columns, and cells in a fashion that facilitates analysis through intuitive visual presentation of analytical reports must exist.

12).Unlimited dimensions and aggregation levels: This depends on the kind of business,
where multiple dimensions and defining hierarchies can be made.

In addition to these guidelines an OLAP system should also support:

13). Comprehensive database management tools: These provide database management capabilities to control distributed enterprise databases.

14). The ability to drill down to the detail (source record) level: This requires that the OLAP tool allow smooth transitions in the multidimensional database.

15).Incremental database refresh: The OLAP tool should provide partial refresh.

16). Structured Query Language (SQL) interface: The OLAP system should be able to integrate effectively with the existing enterprise environment.

8. Differentiate Multidimensional OLAP and Multirelational OLAP in detail.

Multidimensional structure: "A variation of the relational model that uses multidimensional structures to organize data and express the relationships between data."
Multidimensional: MOLAP
MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP. MOLAP stores data in optimized multidimensional array storage, rather than in a relational database. It therefore requires the pre-computation and storage of information in the cube, an operation known as processing.
MOLAP analytical operations :-
Consolidation: involves the aggregation of data such as roll-ups or complex expressions
involving interrelated data. For example, branch offices can be rolled up to cities and rolled
up to countries.

Drill-Down: is the reverse of consolidation and involves displaying the detailed data that
comprises the consolidated data.

Slicing and dicing: refers to the ability to look at the data from different viewpoints. Slicing
and dicing is often performed along a time axis in order to analyze trends and find patterns.

Multi relational OLAP: ROLAP

ROLAP works directly with relational databases. The base data and the dimension tables
are stored as relational tables and new tables are created to hold the aggregated
information.
It depends on a specialized schema design.

This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality.
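As a minimal sketch of this idea (assuming the hypothetical star schema used in the earlier OLAP examples, and a database that supports CREATE TABLE ... AS SELECT), a ROLAP-style aggregate table could be precomputed like this:

-- Precomputed summary table holding quarterly, per-country aggregates.
CREATE TABLE sales_agg_country_quarter AS
SELECT l.country, t.year_no, t.quarter, SUM(f.units_sold) AS total_units
FROM sales_fact f
JOIN location_dim l ON f.location_key = l.location_key
JOIN time_dim t ON f.time_key = t.time_key
GROUP BY l.country, t.year_no, t.quarter;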

Comparison:

 MOLAP implementations are prone to database explosion, caused by factors such as large storage space requirements, high numbers of dimensions, pre-calculated results and sparse multidimensional data.
 MOLAP generally delivers better performance due to indexing and storage optimizations.
 MOLAP also needs less storage space compared to ROLAP because the specialized storage typically includes compression techniques.

 ROLAP is considered more scalable. However, large-volume pre-processing is difficult to implement efficiently, so it is frequently skipped.
 ROLAP query performance can therefore suffer considerably.
 Since ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can use.
A chart comparing capabilities of these two classes of OLAP tools.


The area of the circle implies data size.


Fig: - OLAP style comparison

   | MOLAP                                                               | ROLAP
 1 | Information retrieval is fast.                                      | Information retrieval is comparatively slow.
 2 | Uses a sparse array to store data sets.                             | Uses relational tables.
 3 | Best suited for inexperienced users, since it is very easy to use.  | Best suited for experienced users.
 4 | Maintains a separate database for data cubes.                       | May not require space other than that available in the data warehouse.
 5 | DBMS facility is weak.                                              | DBMS facility is strong.

9. Briefly explain the categories of OLAP tools. [CO2-H2]

1. MOLAP: Multidimensional OLAP
2. ROLAP: Relational OLAP
3. MQE: Managed query environment

1. MOLAP
These products use a specialized data structure, a multidimensional database management system (MDDBMS), to organize, navigate, and analyze data, typically in an aggregated form, and require a tight coupling with the application layer and presentation layer.


This architecture enables excellent performance when the data is utilized as designed, and predictable application response times for applications addressing a narrow breadth of data for a specific DSS requirement.
Applications requiring iterative and comprehensive time-series analysis of trends are well suited for MOLAP technology (e.g., financial analysis and budgeting). Examples include Arbor Software's Essbase and Oracle's Express Server.
There are, however, limitations in the implementation of applications with MOLAP products. First, there are limitations in the ability of the data structures to support multiple subject areas of data (a common trait of many strategic DSS applications) and the detail data required by many analysis applications. This has begun to be addressed in some products, utilizing basic "reach through" mechanisms that enable the MOLAP tools to access detail data maintained in an RDBMS (see Fig.).
Fig: - MOLAP architecture

MOLAP products require a different set of skills and tools for the database administrator to build and maintain the database, thus increasing the cost and complexity of support.
To address these limitations, some vendors offer hybrid solutions. These hybrid solutions have as their primary characteristic the integration of specialized multidimensional data storage with RDBMS technology, providing users with a facility that tightly "couples" the multidimensional data structures (MDDSs) with data maintained in an RDBMS.
This approach can be very useful for organizations with performance-sensitive multidimensional analysis requirements that have built, or are in the process of building, a data warehouse architecture that contains multiple subject areas.
For example, aggregations along selected dimensions (e.g., product and sales region) can be stored and maintained in a persistent structure. These structures can be automatically refreshed at predetermined intervals established by an administrator.
2. ROLAP
This is the fastest-growing style of OLAP technology, with new vendors (e.g., Sagent Technology) entering the market at an accelerating pace. Products in this group access the data directly through a dictionary layer of metadata, bypassing any requirement for creating a static multidimensional data structure.
Fig: - ROLAP architecture


This enables multiple multidimensional views of the two-dimensional relational tables to be created without the need to structure the data around the desired view.
Some of the products in this segment have developed strong SQL-generation engines to support the complexity of multidimensional analysis.
While flexibility is an attractive feature of ROLAP products, there are products in this segment that recommend, or require, the use of highly denormalized database designs (e.g., star schema).
The shift in technology emphasis is coming in two forms. First is the movement toward pure middleware technology that provides facilities to simplify the development of multidimensional applications. Second, there is a continued blurring of the lines that delineate ROLAP and hybrid-OLAP products. Examples include Information Advantage (AXSYS), MicroStrategy (DSS Agent/DSS Server), Platinum/Prodea Software (Beacon), Informix/Stanford Technology Group (MetaCube), and Sybase (HighGate Project).
3. Managed Query Environment (MQE)
This style of OLAP, which is beginning to see increased activity, provides users with the ability to perform limited analysis, either directly against RDBMS products or by using an intermediate MOLAP server (see Fig.).
Fig: - Hybrid/MQE architecture

Some products (e.g., Andyne's Pablo) that have their roots in ad hoc query have developed features to provide "datacube" and "slice and dice" analysis capabilities. This is achieved by first developing a query to select data from the DBMS, which then delivers the requested data to the desktop, where it is placed into a data cube. This data cube can be stored and maintained locally, to reduce the overhead required to create the structure each time the query is executed.
Once the data is in the data cube, users can perform multidimensional analysis (i.e., slice, dice, and pivot operations) against it. The simplicity of the installation and administration of such products makes them particularly attractive to organizations looking to provide seasoned users with more sophisticated analysis capabilities, without the significant cost and maintenance of more complex products.
While this mechanism allows each user the flexibility to build a custom datacube, the lack of data consistency among users and the relatively small amount of data that can be efficiently maintained are significant challenges facing tool administrators. Examples include Cognos Software's PowerPlay, Andyne Software's Pablo, Business Objects' Mercury project, Dimensional Insight's CrossTarget, and Speedware's Media.

10. In what way are OLAP tools used on the Internet? [CO2-H1]
The two important themes in computing are Internet/Web and data warehousing. The
reason for this trend is simple: the advantages in using the web for access are magnified
even further in a data warehouse.
• The internet is a virtually free resource which provides universal connectivity within and between companies.
• The web simplifies the complex administrative tasks of managing distributed environments.
• The web allows companies to store and manage both data and applications on servers
that can be centrally managed maintained and updated, thus eliminating problems with
software and data concurrency.
The general features of the web-enabled data access.
• The first-generation web sites used a static distribution model, in which clients access static HTML pages via web browsers. The decision support reports were stored as HTML documents and delivered to users on request. This model has some serious deficiencies, including the inability to provide web clients with interactive analytical capabilities such as drill-down.
• The second-generation web sites support interactive database queries by utilizing a multi-tiered architecture in which a web client submits a query in the form of an HTML-encoded request to a web server, which in turn transforms the request for structured data into a CGI (Common Gateway Interface) script, or a script written to a proprietary web-server API (e.g., Netscape server API, or NSAPI). The gateway submits SQL queries to the database, receives the results, translates them into HTML, and sends the pages to the requester. Requests for unstructured data (e.g., images, other HTML documents, etc.) can be sent directly to the unstructured data store.
• The emerging third-generation web sites replace HTML gateways with web-based application servers. These servers can download Java applets or ActiveX applications that execute on clients, or interact with corresponding applets running on servers (servlets).


The third-generation web servers provide users with all the capabilities of existing decision-
support applications without requiring them to load any client software except a web
browser. Vendors of decision support applications, especially query, reporting and OLAP tools, are rapidly converting their tools to work on the web.

Fig: - Web processing model

• HTML publishing: This approach involves transforming an output of a query into the
HTML page that can be downloaded into a browser.
• Helper applications: A tool is configured as a helper application that resides within a
browser. This is a case of a fat client, in which, once the data is downloaded, users can
take advantage of all capabilities of the tool to analyze data.
• Plug-ins: A variation on the previous approach, plug-ins are helper applications that are
downloaded from the web server prior to their initial use. Since the plug-ins are downloaded
from the server, their normal administration and installation tasks are significantly reduced.
• Server-centric components: In this approach the vendor rebuilds a desktop tool as a
server component, or creates a new server component that can be integrated with the web
via a web gateway (eg., CGI or NSAPI scripts).
• Java and ActiveX applications: This approach is for a vendor to redevelop all or portions of its tool in Java or ActiveX. The result is a true "thin" client model. It is promising and flexible.
Several OLAP Tools from a Perspective of Internet/Web Implementations
Arbor Essbase Web: Essbase is one of the most ambitious of the early web products. It includes not only OLAP manipulations, such as drill up, down, and across; pivot; slice and dice; and fixed and dynamic reporting, but also data entry, including full multi-user concurrent write capabilities, a feature that differentiates it from the others.


Information Advantage Web OLAP: Information Advantage uses a server-centric messaging architecture, which is composed of a powerful analytical engine that generates SQL to pull data from relational databases, manipulates the results, and transfers the results to a client.
MicroStrategy DSS Web: MicroStrategy's flagship product, DSS Agent, was originally a Windows-only tool, but MicroStrategy has smoothly made the transition, first with an NT-based server product, and now as one of the first OLAP tools to have a web-access product.

 DSS Server - a relational OLAP server
 DSS Architect - a data modeling tool
 DSS Executive - a design tool for building executive information systems

Brio Technology: Brio shipped a suite of new products called brio.web.warehouse. This suite
implements several of the approaches listed above for deploying decision support and
OLAP applications on the web. The key to Brio's strategy is a new server component called
brio-query.server.


UNIT III

Part- A

1.Define Data mining? [CO3-L2]

It refers to extracting or "mining" knowledge from large amounts of data. Data mining is a
process of discovering interesting knowledge from large amounts of data stored either in
databases, data warehouses, or other information repositories.

2.Give some alternative terms for data mining. [CO3-L2]

• Knowledge mining
• Knowledge extraction
• Data/pattern analysis.
• Data Archaeology
• Data dredging

3.What is KDD? [CO3-L1]

KDD-Knowledge Discovery in Databases.

4.What are the steps involved in KDD process? [CO3-L2]

• Data cleaning
• Data Mining
• Pattern Evaluation
• Knowledge Presentation
• Data Integration
• Data Selection
• Data Transformation

5.What is the use of the knowledge base? [CO3-L1]

Knowledge base is domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies used
to organize attributes/attribute values into different levels of abstraction.

6 What is the purpose of Data mining Technique? [CO3-L1]

It provides a way to use various data mining tasks.

7.Define Predictive model? [CO3-L1]

It is used to predict the values of data by making use of known results from a different set
of sample data.


8.Define descriptive model? [CO3-L1]

It is used to determine the patterns and relationships in sample data. Data mining tasks
that belong to the descriptive model: • Clustering • Summarization • Association rules •
Sequence discovery

9. Define the term summarization? [CO3-L1]

The summarization of a large chunk of data contained in a web page or a document.


Summarization = characterization=generalization

10. List out the advanced database systems? [CO3-L2]

• Extended-relational databases
• Object-oriented databases
• Deductive databases
• Spatial databases
• Temporal databases
• Multimedia databases
• Active databases
• Scientific databases
• Knowledge databases

11. Define cluster analysis? [CO3-L2]

Cluster analysis analyzes data objects without consulting a known class label. The class labels are
not present in the training data simply because they are not known to begin with.

12.Describe challenges to data mining regarding data mining methodology and user
interaction issues? [CO3-L2]

• Mining different kinds of knowledge in databases


• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation

13.Describe challenges to data mining regarding performance issues? [CO3-L1]

• Efficiency and scalability of data mining algorithms


• Parallel, distributed, and incremental mining algorithms

14.Describe issues relating to the diversity of database types? [CO3-L1]

• Handling of relational and complex types of data


• Mining information from heterogeneous databases and global information systems


15.What is meant by pattern? [CO3-L2]

A pattern represents knowledge if it is easily understood by humans; valid on test data with
some degree of certainty; and potentially useful, novel, or validates a hunch about which the
user was curious. Measures of pattern interestingness, either objective or subjective, can
be used to guide the discovery process.

16. How is a data warehouse different from a database? [CO3-L2]

A data warehouse is a repository of multiple heterogeneous data sources, organized under a
unified schema at a single site in order to facilitate management decision-making. A
database consists of a collection of interrelated data.

17. Define Association Rule Mining. [CO3-L1]

Association rule mining searches for interesting relationships among items in a given data
set.

18. When can we say the association rules are interesting? [CO3-L1]

Association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold. Users or domain experts can set such
thresholds.

19. Define support and confidence in Association rule mining. [CO3-L3]

Support s is the percentage of transactions in D that contain A ∪ B. Confidence c is the
percentage of transactions in D containing A that also contain B.
Support (A=>B) = P(A ∪ B)
Confidence (A=>B) = P(B | A)

20. How are association rules mined from large databases? [CO3-L1]

Step 1: Find all frequent itemsets.
Step 2: Generate strong association rules from the frequent itemsets.

21. Describe the different classifications of Association rule mining? [CO3-L2]

• Based on types of values handled in the Rule

• Based on the dimensions of data involved

• Based on the levels of abstraction involved

• Based on various extensions


PART –B

1. Explain Data Mining confluence of multiple disciplines. [CO3-H2]


Data mining is the non-trivial process of identifying valid, novel (original), potentially useful
and ultimately understandable patterns (models) in data.
Data mining techniques support automatic searching of data, try to bring out patterns and
trends in the data, and derive rules from these patterns that help the user to review and
examine decisions in some related business or scientific area.
Data mining refers to extracting or mining knowledge from large databases. Data mining and
knowledge discovery in databases is a new interdisciplinary field, merging ideas from
statistics, machine learning, databases and parallel computing.
Fig:1 - Data mining as a confluence of multiple disciplines

Fig:2 - Data mining — searching for knowledge (interesting patterns) in your data

KDD: Knowledge Discovery in Databases (KDD) was formalized as the process of seeking
knowledge from data.
Fayyad et al. distinguish between KDD and data mining by giving the following definitions:


Knowledge Discovery in Databases: KDD is the process of identifying a valid, potentially
useful and ultimately understandable structure in data. This process involves selecting or
sampling data from a data warehouse, cleaning or pre-processing it, transforming or
reducing it, applying a data mining component to produce a structure, and then evaluating
the derived structure.
Data mining is a step in the KDD process concerned with the algorithmic means by which
patterns or structures are enumerated from the data under acceptable computational
efficiency limitations.
Some of the definitions of data mining are:
1. Data mining is the non-trivial extraction of implicit, previously unknown and potentially useful
information from the data.
2. Data mining is the search for the relationships and global patterns that exist in large
databases but are hidden among vast amounts of data.
3. Data mining refers to using a variety of techniques to identify pieces of information or
decision-making knowledge in the database and extracting these in such a way that they can be
put to use in areas such as decision support, prediction, forecasting and estimation.
4. A data mining system self-learns from the previous history of the investigated system,
formulating and testing hypotheses about the rules which the system obeys.
5. Data mining is the process of discovering meaningful new correlations, patterns and trends
by sifting through large amounts of data stored in repositories, using pattern recognition
techniques as well as statistical and mathematical techniques.
Fig: 3 - Data mining as a step in the process of knowledge discover


Steps in KDD process:


Data cleaning: It is the process of removing noise and inconsistent data.
Data integration: It is the process of combining data from multiple sources.
Data selection: It is the process of retrieving relevant data from the databases.
Data transformation: In this process, data are transformed or consolidated into forms
suitable for mining by performing summary or aggregation operations.
Data mining: It is an essential process where intelligent methods are applied in order to
extract data patterns.
Pattern evaluation: The patterns obtained in the data mining stage are converted into
knowledge based on some interestingness measures.
Knowledge presentation: Visualization and knowledge representation techniques are used
to present the mined knowledge to the user.

Architecture of Data Mining System

Fig:4 - Architecture of a typical data mining system


Data mining is the process of discovering interesting knowledge from large amounts of data
stored either in databases, data warehouses or other information repositories.
Major components.
Database, data warehouse or other information repository: This is a single or a
collection of multiple databases, data warehouse, flat file spread sheets or other kinds of


information repositories. Data cleaning and data integration techniques may be performed
on the data.
Database or data warehouse server: The database or data warehouse server obtains the
relevant data, based on the user's data mining request.
Knowledge base: This is the domain knowledge that is used to guide the search or evaluate
the interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction. Other
knowledge, such as user beliefs, thresholds and metadata, can be used to assess a
pattern's interestingness.
Data mining engine: This is essential to the data mining system and ideally consists of a
set of functional modules for tasks such as characterization, association, classification,
cluster analysis, evolution and outlier analysis.
Pattern evaluation module: This component uses interestingness measures and interacts
with the data mining modules so as to focus the search towards interesting patterns. It may
use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module.
Graphical user interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a task or data mining
query, and to perform exploratory data mining based on intermediate data mining results.
This module also allows the user to browse database and data warehouse schemas or data
structures, evaluate mined patterns and visualize the patterns in different forms such as
maps, charts etc.
Data Mining — on What Kind of Data
Data mining should be applicable to any kind of information repository. This includes
 Flat files
 Relational databases,
 Data warehouses,
 Transactional databases,
 Advanced database systems,
 World-Wide Web.
Advanced database systems include
 Object-oriented and
 Object relational databases, and
 Special c application-oriented databases such as
 Spatial databases,
 Time-series databases,
 Text databases,
 Multimedia databases.
Flat files: Flat files are simple data files in text or binary format with a structure known by
the data mining algorithm to be applied. The data in these files can be transactions, time-
series data, scientific measurements, etc.
Relational databases: A relational database is a collection of tables. Each table consists of
a set of attributes (columns or fields) and a set of tuples (records or rows). Each tuple is
identified by a unique key and is described by a set of attribute values. Entity relationships


(ER) data model is often constructed for relational databases. Relational data can be
accessed by database queries written in a relational query language.
e.g Product and market table

Data warehouse:
A data warehouse is a repository of information collected from multiple sources, stored
under a unified schema, residing on a single site.
A data warehouse is modeled by a multidimensional database structure, where each
dimension corresponds to an attribute or a set of attributes in the schema.
Fig: - Architecture of data warehouse

Fig: - Data cube


A data warehouse is formed by data cubes. Each dimension is an attribute and each cell
represents an aggregate measure. A data warehouse collects information about subjects
that cover an entire organization, whereas a data mart focuses on selected subjects. The
multidimensional data views make Online Analytical Processing (OLAP) easier.
Transactional databases: A transactional database consists of a file where each record
represents a transaction. A transaction includes transaction identity number, list of items,
date of transactions etc.
Advanced databases:
Object oriented databases: Object oriented databases are based on object-oriented
programming concept. Each entity is considered as an object which encapsulates data and
code into a single unit objects are grouped into a class.
Object-relational databases: Object-relational databases are constructed based on an
object-relational data model, which extends the basic relational data model by handling
complex data types, class hierarchies and object inheritance.
Spatial databases: A spatial database stores a large amount of space-related data, such
as maps, preprocessed remote sensing or medical imaging data and VLSI chip layout data.
Spatial data may be represented in raster format, consisting of n-dimensional bit maps or
fixed maps.
Temporal Databases, Sequence Databases, and Time-Series Databases

A temporal database typically stores relational data that include time-related attributes.
A sequence database stores sequences of ordered events, with or without a concrete
notion of time, e.g., customer shopping sequences, Web click streams.
A time-series database stores sequences of values or events obtained over repeated
measurements of time (e.g., hourly, daily, weekly), e.g., stock exchange data, inventory control,
observations of temperature and wind.

Text databases and multimedia databases: Text databases contain word descriptions
of objects such as long sentences or paragraphs, warning messages, summary reports etc.
A text database consists of a large collection of documents from various sources. Data stored
in most text databases are semi-structured data.
A multimedia database stores and manages a large collection of multimedia objects such
as audio data, image, video, sequence and hypertext data.


Heterogeneous databases and legacy databases:


A heterogeneous database consists of a set of interconnected, autonomous component
databases. The components communicate in order to exchange information and answer
queries.
A legacy database is a group of heterogeneous databases that combines different kinds of
data systems, such as relational or object-oriented databases, hierarchical databases,
network databases, spreadsheets, multimedia databases, or file systems.

The heterogeneous databases in a legacy database may be connected by intra or inter-


computer networks.

The World Wide Web: The World Wide Web and its associated distributed information
services, such as Yahoo!, Google, America Online, and AltaVista, provide worldwide, on-
line information services. Capturing user access patterns in such distributed information
environments is called Web usage mining or Weblog mining.

2. Write down the Data mining functionalities. [CO3-H2]

Data Mining tasks: what kinds of patterns can be mined?


Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. In general, data mining tasks can be classified into two categories: descriptive
and predictive.

Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make
predictions.

Users may have no idea regarding what kinds of patterns are required, so they search for
several different kinds of patterns in parallel.
Data mining systems
- should be able to discover patterns at various granularities
- should help users find interesting patterns.
Data mining functionalities:

 Concept/class description - Characterization, Discrimination,


 Association and correlation analysis,
 Classification, prediction,
 Clustering,

 Outlier analysis
 Evolution analysis

Concept/Class Description: Characterization and Discrimination


Data can be associated with class or concepts, for example, in the all electronics store,
classes of items for sale include computers and printers and concepts of customers include
big spenders and budget spenders. Such descriptions of a class or a concept are called
concept /class descriptions.
These descriptions can be derived via.
1. Data characterization
2. Data discrimination
3. Both data characterization and discrimination

Data characterization: It is a summarization of the general characteristics of a class


(target class) of data. The data related to the user specified class are collected by a
database query. Several methods like OLAP roll up operation and attribute-oriented
technique are used for effective data summarization and characterization. The output of
data characterization can be presented in various forms like
• Pie charts • Bar charts • Curves • Multidimensional cubes • Multidimensional tables etc.
The resulting descriptions can be presented as generalized relations or in rule forms called
characteristics rules.
Data discrimination is a comparison of the general features of target class data objects
with the general features of objects from one or a set of contrasting classes. The output of
data discrimination can be presented in the same manner as data characterization.
Discrimination descriptions expressed in rule form are referred to as discriminant rules.
E.g., the user may like to compare the general features of software products whose sales
increased by 10% in the last year with those whose sales decreased by at least 30% during
the same period.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, are patterns that occur frequently in data.
Kinds of frequent patterns,
 Itemsets - refers to a set of items that frequently appear together in a transactional
data set – e.g milk and bread.
 Subsequences e.g- customers like to purchase first a PC, followed by a digital
camera, and then a memory card ,
 Substructures e.g graphs, trees, or lattices

Association analysis.

single-dimensional association rule.


A marketing manager of an AllElectronics shop may want to find which items are frequently
purchased together within the same transactions. An example of such a rule, mined from the
AllElectronics transactional database, is
buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]

Here, X is a variable representing a customer.
A confidence of 50% means that if a customer buys a computer, there is a 50% chance that he
will also buy software.
A support of 1% means that 1% of all the transactions under analysis showed that computer
and software were purchased together.

This association rule involves a single attribute or predicate (i.e., buys) that repeats.
Association rules that contain a single predicate are referred to as single-dimensional
association rules.
Also, the above rule can be written simply as
computer => software [1%, 50%]
Multidimensional association rule.


Consider an AllElectronics relational database relating to purchases.

A data mining system may find association rules like
age(X, "20...29") ∧ income(X, "20K...29K") => buys(X, "CD player") [support = 2%, confidence = 60%]
The rule indicates that of the AllElectronics customers under study, 2% are 20 to 29 years
of age with an income of 20,000 to 29,000 and have purchased a CD player at
AllElectronics. There is a 60% probability that a customer in this age and income group will
purchase a CD player. Note that this is an association between more than one attribute, or
predicate (i.e., age, income, and buys).

Classification and prediction

Classification -> process of finding a model (or function) that describes and differentiates
data classes or concepts, for the purpose of using the model to predict the class of objects
whose class label is unknown.
The derived model is based on the analysis of a set of training data (i.e., data objects
whose class label is known).
The derived model represented by
(i).classification (IF-THEN) rules, (ii). decision trees,(iii) neural networks


A decision tree is a flow-chart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and tree leaves represent
classes or class distributions.
A neural network is typically a collection of neuron-like processing units with
weighted connections between the units.
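
For illustration, the minimal Python sketch below (not part of the original text) shows how a classification model can be derived from training data and represented as a decision tree; the toy feature matrix, the class labels, and the use of the scikit-learn library are assumptions made only for this example.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: features are [age, income]; the class label says
# whether the customer buys a computer. Values are illustrative only.
X = [[25, 30000], [35, 60000], [45, 80000], [22, 20000], [50, 90000]]
y = ["no", "yes", "yes", "no", "yes"]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Print the learned tree as IF-THEN style rules and classify an unseen object.
print(export_text(model, feature_names=["age", "income"]))
print(model.predict([[30, 70000]]))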

Prediction
Prediction models calculate continuous-valued functions. Prediction is used to predict
missing or unavailable numerical data values. Prediction refers both numeric prediction and
class label prediction. Regression analysis is a statistical methodology is used for numeric
prediction. Prediction also includes the identification of distribution trends based on the
available data.

Clustering Analysis
Clustering analyzes data objects without consulting a known class label.
Objects are clustered or grouped based on the principle of maximizing the intra-class
similarity and minimizing the inter-class similarity.
Clustering is a method of grouping data into different groups, so that data in each group
share similar trends and patterns. The objectives of clustering are
* To uncover natural groupings
* To initiate hypothesis about the data
* To find consistent and valid organization of the data
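
As a small illustration of this idea (not from the original text), the Python sketch below groups a handful of two-dimensional points with k-means; the points, the choice of k, and the use of scikit-learn are assumptions for the example only.

from sklearn.cluster import KMeans

# Two visually separate groups of points; values are illustrative only.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(km.labels_)           # cluster id assigned to each object
print(km.cluster_centers_)  # representative centre of each cluster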


5. Outlier Analysis
A database may contain data objects that do not comply with the general behaviour or model
of the data. These data objects are called outliers.
Most data mining methods discard outliers as noise or exceptions.
In applications like credit card fraud detection, cell phone cloning fraud and detection of
suspicious activities, the rare events can be more interesting than the more regularly
occurring ones. The analysis of outlier data is referred to as outlier mining.
Outliers may be detected using statistical tests, distance measures, or deviation-based
methods.

6. Evolution Analysis
Data evolution analysis describes and models regularities or trends of objects whose
behaviour changes over time. Normally, evolution analysis is used to predict future
trends and thereby support effective decision making.
It includes characterization, discrimination, association and correlation analysis,
classification, prediction, or clustering of time-related data, time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis.

E.g., stock market (time-series) data of the last several years, available from the New York
Stock Exchange, may be analyzed when one would like to invest in shares of high-tech industrial companies.


3. Describe the Interestingness of patterns. [CO3-H1]

 A data mining system has the potential to generate thousands or even millions of
patterns, or rules.
 A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test
data with some degree of certainty, (3) potentially useful, and (4) novel.
 A pattern is also interesting if it validates a hypothesis that the user wanted to confirm.
An interesting pattern represents knowledge.

Objective measures of pattern interestingness.


These are based on the structure of discovered patterns and the statistics underlying them.
An objective measure for association rules of the form X => Y is rule support, representing
the percentage of transactions from a transaction database that the given rule satisfies.
This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains
both X and Y, that is, the union of itemsets X and Y.

Another objective measure for association rules is confidence, which measures the degree
of certainty of the detected association. This is taken to be the conditional probability
P(Y | X), that is, the probability that a transaction containing X also contains Y. More
formally, support and confidence are defined as
support(X => Y) = P(X ∪ Y)
confidence(X => Y) = P(Y | X)
For example, rules that do not satisfy a confidence threshold of, say, 50% can be considered
uninteresting. Rules below this threshold likely reflect noise, exceptions, or minority cases and
are probably of less value.
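
To make the two measures concrete, here is a minimal Python sketch (an illustration, not part of the text) that computes support and confidence for a candidate rule over a tiny, made-up transaction set.

# Each transaction is the set of items bought together; values are illustrative.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent) = support(A union B) / support(A).
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

A, B = {"computer"}, {"software"}
print(support(A | B, transactions))    # 0.5     -> support(A => B)
print(confidence(A, B, transactions))  # 0.666.. -> confidence(A => B)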

Subjective interestingness measures


These are based on user beliefs in the data. These measures find patterns interesting if
they are unexpected (opposing a user’s belief) or offer strategic information on which the
user can act. Patterns that are expected can be interesting if they confirm a hypothesis that
the user wished to validate, or resemble a user’s idea.

A data mining system generate all of the interesting patterns—refers to the completeness of
a data mining algorithm. It is often unrealistic and inefficient for data mining systems to
generate all of the possible patterns.

A data mining system that generates only interesting patterns—this is an optimization problem in data
mining. It is highly desirable for data mining systems to generate only interesting patterns.
This would be efficient for users and data mining systems, because neither would have to search
through all the patterns generated in order to identify the truly interesting ones.

4. What are the Classification of Data Mining Systems? [CO3-H2]


Data mining is an interdisciplinary field, merging a set of disciplines, including
database systems, statistics, machine learning, visualization, and information science.

Depending on the data mining approach used, techniques from other disciplines may be
applied, such as
o neural networks,
o fuzzy and/or rough set theory,


o knowledge representation,
o inductive logic programming,
o high-performance computing.

Depending on the kinds of data to be mined or on the given data mining application, the
data mining system may also integrate techniques from
o spatial data analysis,
o information retrieval,
o pattern recognition,
o image analysis,
o signal processing,
o computer graphics,
o Web technology,
o economics,
o business,
o bioinformatics,
o psychology

Data mining systems can be categorized according to various criteria, as follows:

Classification according to the kinds of databases mined: Database systems can be
classified according to different criteria (such as data models, or the types of data or
applications involved), each of which may require its own data mining technique. For example, if
classifying according to data models, we may have a relational, transactional, object-
relational, or data warehouse mining system. If classifying according to the special types of
data handled, we may have a spatial, time-series, text, stream data, or multimedia data mining
system, or a World Wide Web mining system.

Classification according to the kinds of knowledge mined:


o It is, based on data mining functionalities, such as characterization, discrimination,
association and correlation analysis, classification, prediction, clustering, outlier
analysis, and evolution analysis.

o A complete data mining system usually provides multiple and/or integrated data mining
functionalities.
o Moreover, data mining systems can be distinguished based on the granularity or levels of
abstraction of the knowledge mined, including generalized knowledge, primitive-level
knowledge, or knowledge at multiple levels.
o An advanced data mining system should facilitate the discovery of knowledge at multiple
levels of abstraction.
o Data mining systems can also be categorized as those that mine data regularities
(commonly occurring patterns) versus those that mine data irregularities (such as
exceptions, or outliers).
 In general, concept description, association and correlation analysis, classification,
prediction, and clustering mine data regularities, rejecting outliers as noise. These
methods may also help detect outliers.

Classification according to the kinds of techniques utilized: These techniques can be


described according to the degree of user interaction involved e.g. Autonomous systems,
interactive exploratory systems, query-driven systems or the methods of data analysis
employed.

e.g., database-oriented or data warehouse–oriented techniques, machine learning,


statistics, visualization, pattern recognition, neural networks, and so on.

Classification according to the applications adapted.


For e.g., data mining systems may be tailored specifically for finance, tele-
communications, DNA, stock markets, e-mail, and so on.

Different applications often require the integration of application-specific methods.


Therefore, a generic, all-purpose data mining system may not fit domain-specific mining
tasks.

5.Name all the Data Mining Task Primitives. [CO3-H1]

 A data mining task can be specified in the form of a data mining query, which is input to
the data mining system.
 A data mining query is defined in terms of data mining task primitives. These primitives
allow the user to interactively communicate with the data mining system during
discovery in order to direct the mining process, or examine the findings from different
angles or depths.

 The data mining primitives specify the following, as illustrated in Figure.


The set of task-relevant data to be mined: This specifies the portions of the database or the
set of data in which the user is interested. This includes the database attributes or data
warehouse dimensions of interest
The kind of knowledge to be mined: This specifies the data mining functions to be
performed,
such as characterization, discrimination, association or correlation analysis, classification,
prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about
the domain to be mined is useful for guiding the knowledge discovery process and
for evaluating the patterns found.
Concept hierarchies (shown in Fig 2) are a popular form of background knowledge, which
allow data to be mined at multiple levels of abstraction.


The interestingness measures and thresholds for pattern evaluation: They may be used to
guide the mining process or, after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures.

The expected representation for visualizing the discovered patterns: This refers to the
form in which discovered patterns are to be displayed ,which may include rules, tables,
charts, graphs, decision trees, and cubes.

A data mining query language can be designed to incorporate these primitives,


allowing users to flexibly interact with data mining systems.

This facilitates a data mining system’s communication with other information systems
and its integration with the overall information processing environment.


Integration of a Data Mining System with a Database or Data Warehouse System

When a DM system works in an environment that requires it to communicate with other


information system components, such as DB and DW systems, possible integration
schemes include
 No coupling,
 Loose coupling,
 Semi tight coupling,
 Tight coupling

No coupling: means that a DM system will not utilize any function of a DB or DW system.
It may fetch data from a file system, process data using some data mining algorithms, and
then store the mining results in another file.
Drawbacks.
First, a DB system provides flexibility and efficiency at storing, organizing, accessing, and
processing data. Without using a DB/DW system, a DM system spends more time finding,
collecting, cleaning, and transforming data. In DB/DW systems, data are well organized,
indexed, cleaned, integrated, or consolidated, so that finding the task-relevant, high-quality
data becomes an easy task.
Second, there are many tested, scalable algorithms and data structures implemented in DB
and DW systems. Without any coupling of such systems, a DM system will need to use
other tools to extract data, making it difficult to integrate such a system into an information
processing environment. Thus, no coupling represents a poor design.


Loose coupling: means that a DM system will use some facilities of a DB or DW system,
fetching data from a data repository managed by these systems, performing data mining,
and then storing the mining results either in a file or in a designated place in a database or
data warehouse.

Advantages: Loose coupling is better than no coupling because it can fetch any portion of
data stored in DBs or DWs by using query processing, indexing, and other system
facilities.
Drawbacks : However, many loosely coupled mining systems are main memory-based.
Because mining does not explore data structures and query optimization methods provided
by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good
performance with large data sets.

Semitight coupling: means that, besides linking a DM system to a DB/DW system, efficient
implementations of a few essential data mining primitives can be provided in the DB/DW
system.
These primitives can include sorting, indexing, aggregation, histogram analysis, multiway
join, and precomputation of some essential statistical measures, such as sum, count, max,
min, standard deviation, and so on.
Moreover, some frequently used intermediate mining results can be precomputed and
stored in the DB/DW system.

Tight coupling: means that a DM system is smoothly integrated into the DB/DW system.
This approach is highly desirable because it facilitates efficient implementations of data
mining functions, high system performance, and an integrated information processing
environment.

Data mining queries and functions are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of a DB or DW system.

By technology advances, DM, DB, and DW systems will integrate together as one
information system with multiple functionalities. This will provide a uniform information
processing environment.

6.What are the Major Issues in Data Mining? [CO3-H1]

Mining methodology and user interaction issues: These reflect the kinds of knowledge
mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge,
ad hoc mining, and knowledge visualization.
 Mining different kinds of knowledge in databases: Data mining should cover a wide spectrum
of data analysis and knowledge discovery tasks, including data characterization, discrimination,
association, classification, prediction, clustering, and outlier analysis.
 Interactive mining of knowledge at multiple levels of abstraction: The data mining process
should be interactive. Interactive mining allows users to focus the search for patterns,
providing and refining data mining requests based on returned results.

 Incorporation of background knowledge: Background knowledge may be used to guide the


discovery process and allow discovered patterns to be expressed in concise terms and at
different levels of abstraction.
 Data mining query languages and ad hoc mining: Relational query languages (such as
SQL) allow users to use ad hoc queries for data retrieval.
 Presentation and visualization of data mining results: Discovered knowledge should be
expressed in high-level languages, visual representations, or other expressive forms and
directly usable by humans.
 Handling noisy or incomplete data: When mining data regularities, these objects may
confuse the process, causing the knowledge model constructed to over fit the data.
 Pattern evaluation--the interestingness problem: A data mining system can uncover
thousands of patterns. Many of the patterns discovered may be uninteresting to the given
user, representing common knowledge or lacking newness.

Performance issues:

Efficiency and scalability of data mining algorithms: To effectively extract information from a
huge amount of data in databases, data mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms: The huge size of many databases,
the wide distribution of data, and the computational complexity of some data mining
methods are factors motivating the development of algorithms that divide data into
partitions that can be processed in parallel.

Issues relating to the diversity of database types:


 Handling of relational and complex types of data: Specific data mining systems should be
constructed for mining specific kinds of data.
 Mining information from heterogeneous databases and global information systems: Local-
and wide-area computer networks (such as the Internet) connect many sources of data,
forming huge, distributed, and heterogeneous databases.

7. What is Data Preprocessing and Why preprocess the data?. Also explain Data
cleaning ,Data integration and transformation, Data reduction, Discretization and
concept hierarchy generation. [CO3-H3]

I. Data Preprocessing :-

 Data in the real world is dirty


o incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
 e.g., occupation=“ ”
o noisy: containing errors or outliers
 e.g., Salary=“-10”
o inconsistent: containing discrepancies in codes or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records

Data Dirty reason


 Incomplete data may come from


o “Not applicable” data value when collected
o Different considerations between the time when the data was collected and when
it is analyzed.
o Human/hardware/software problems
 Noisy data (incorrect values) may come from
o Faulty data collection instruments
o Human or computer error at data entry
o Errors in data transmission
 Inconsistent data may come from
o Different data sources
o Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning

Data Preprocessing Important

 No quality data, no quality mining results!


o Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even misleading
statistics.
o Data warehouse needs consistent integration of quality data
 Data extraction, cleaning, and transformation comprises the majority of the work of
building a data warehouse

Major Tasks in Data Preprocessing

 Data cleaning
o Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
 Data integration
o Integration of multiple databases, data cubes, or files
 Data transformation
o Normalization and aggregation
 Data reduction
o Obtains reduced representation in volume but produces the same or similar
analytical results
 Data discretization
o Part of data reduction but with particular importance, especially for numerical
data


II. Data Cleaning

 Importance
o “Data cleaning is one of the three biggest problems in data warehousing”—Ralph
Kimball
o “Data cleaning is the number one problem in data warehousing”—DCI survey

 Data cleaning tasks


o Fill in missing values
o Identify outliers and smooth out noisy data
o Correct the inconsistent data
o Resolve redundancy caused by data integration

(i) .Missing Data

 Data is not always available


o E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data

 Missing data may be due to


o equipment malfunction
o inconsistent with other recorded data and thus deleted
o data not entered due to misunderstanding
o certain data may not be considered important at the time of entry
o not register history or changes of the data
 Missing data may need to be inferred.

Handle Missing Data

 Ignore the tuple: usually done when the class label is missing (assuming the task is
classification). This is not effective when the percentage of missing values per attribute varies
considerably.
 Fill in the missing value manually: tedious + infeasible?
 Fill in it automatically with
o a global constant : e.g., “unknown”, a new class?!
o the attribute mean
o the attribute mean for all samples belonging to the same class: smarter
o the most probable value: inference-based such as Bayesian formula or decision
tree
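
The options above can be sketched in a few lines of Python with pandas; the column names and values below are illustrative assumptions, not data from the text.

import pandas as pd

df = pd.DataFrame({
    "income": [45000, None, 52000, None, 61000],
    "class":  ["budget", "budget", "big", "big", "big"],
})

# Fill with the global attribute mean.
df["income_global_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of the samples belonging to the same class.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)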

(ii). Noisy Data

 Noise: random error or variance in a measured variable


 Incorrect attribute values may due to
o faulty data collection instruments
o data entry problems
o data transmission problems
o technology limitation
o inconsistency in naming convention
 Other data problems which requires data cleaning
o duplicate records
o incomplete data
o inconsistent data

Handling Noisy Data

 Binning
o first sort data and partition into (equal-frequency) bins
o then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
 Regression
o smooth by fitting the data into regression functions
 Clustering
o detect and remove outliers
 Combined computer and human inspection
o detect suspicious values and check by human (e.g., deal with possible outliers)

Simple Discretization Methods: Binning



 Equal-width (distance) partitioning


o Divides the range into N intervals of equal size: uniform grid
o if A and B are the lowest and highest values of the attribute, the width of intervals
will be: W = (B –A)/N.
o The most straightforward, but outliers may dominate presentation
o Skewed data is not handled well
 Equal-depth (frequency) partitioning
o Divides the range into N intervals, each containing approximately same number
of samples
o Good data scaling
o Managing categorical attributes can be tricky

Binning Methods for Data Smoothing

 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:


- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
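
The worked example above can be reproduced with the short Python sketch below (an illustration only); it partitions the sorted prices into equal-frequency bins and then smooths by bin means and by bin boundaries.

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer boundary.
def by_boundaries(b):
    lo, hi = b[0], b[-1]
    return [lo if (v - lo) <= (hi - v) else hi for v in b]

print(bins)                              # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)                          # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print([by_boundaries(b) for b in bins])  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]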

Cluster Analysis

III. Data Integration and Transformation

 Data integration:
o Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id , B.cust no.
o Integrate metadata from different sources

 Entity identification problem:


o Identify real world entities from multiple data sources,
o e.g., Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
o For the same real world entity, attribute values from different sources are different
o Possible reasons: different representations, different scales,
e.g., metric vs. British units

Handling Redundancy in Data Integration

 Redundant data occur often when integration of multiple databases


o Object identification: The same attribute or object may have different names in
different databases
o Derivable data: One attribute may be a “derived” attribute in another table, e.g.,
annual revenue
 Redundant attributes may be able to be detected by correlation analysis
 Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis (Numerical Data)

 Correlation coefficient (also called Pearson’s product moment coefficient)

rA,B = Σ(A − Ā)(B − B̄) / ((n − 1) σA σB) = (Σ(AB) − n Ā B̄) / ((n − 1) σA σB)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA
and σB are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB
cross-product.
 If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher, the
stronger correlation.
 rA,B = 0: independent; rA,B < 0: negatively correlated.
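
A direct Python implementation of the formula above is sketched below; the two attribute lists are illustrative values only.

import math

A = [2, 4, 6, 8, 10]
B = [1, 3, 5, 7, 11]

n = len(A)
mean_A, mean_B = sum(A) / n, sum(B) / n
std_A = math.sqrt(sum((a - mean_A) ** 2 for a in A) / (n - 1))
std_B = math.sqrt(sum((b - mean_B) ** 2 for b in B) / (n - 1))

# r_{A,B} = sum((A - mean_A)(B - mean_B)) / ((n - 1) * std_A * std_B)
r = sum((a - mean_A) * (b - mean_B) for a, b in zip(A, B)) / ((n - 1) * std_A * std_B)
print(round(r, 3))   # close to +1, so A and B are positively correlated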

Correlation Analysis (Categorical Data)

 Χ² (chi-square) test: Χ² = Σ (observed − expected)² / expected, summed over all cells of
the contingency table
 The larger the Χ2 value, the more likely the variables are related
 The cells that contribute the most to the Χ2 value are those whose actual count is
very different from the expected count
 Correlation does not imply causality
o No., of hospitals and no., of car-theft in a city are correlated
o Both are causally linked to the third variable: population
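
The test can be sketched in plain Python as below; the observed counts in the 2x2 contingency table are illustrative only (e.g., rows = likes fiction / does not, columns = plays chess / does not).

observed = [[250, 200],
            [50, 1000]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
total = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_tot[i] * col_tot[j] / total    # expected count for cell (i, j)
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # a large value suggests the two attributes are related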

Data Transformation

 Smoothing: remove noise from data



 Aggregation: summarization, data cube construction


 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
o min-max normalization
o z-score normalization
o normalization by decimal scaling
 Attribute/feature construction
o New attributes constructed from the given ones

Data Transformation: Normalization

 Min-max normalization: to [new_minA, new_maxA]
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
o Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then
$73,000 is mapped to (73,000 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 ≈ 0.709
 Z-score normalization (μ: mean, σ: standard deviation): v' = (v − μ) / σ
o Ex. Let μ = 54,000, σ = 16,000. Then $73,000 is mapped to (73,000 − 54,000)/16,000 ≈ 1.19
 Normalization by decimal scaling: v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1
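
The three methods can be checked with the short Python sketch below, reusing the income figures from the examples above.

v, min_a, max_a = 73000, 12000, 98000
new_min, new_max = 0.0, 1.0

minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
print(round(minmax, 3))    # ~0.709

mu, sigma = 54000, 16000
zscore = (v - mu) / sigma
print(zscore)              # 1.1875

j = 5                      # smallest integer j such that max(|v / 10**j|) < 1
print(v / 10 ** j)         # 0.73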

IV.Data reduction

 Data reduction necessity


o A database/data warehouse may store terabytes of data
o Complex data analysis/mining may take a very long time to run on the
complete data set
 Data reduction
o Obtain a reduced representation of the data set that is much smaller in
volume but yet produce the same (or almost the same) analytical results

Data reduction strategies

1. Data cube aggregation:


2. Dimensionality reduction — e.g., remove unimportant attributes
3. Data Compression
4. Numerosity reduction — e.g., fit data into models
5. Discretization and concept hierarchy generation

1. Data cube aggregation:


Attribute Subset Selection

 Feature selection (i.e., attribute subset selection):


o Select a minimum set of features such that the probability distribution of different
classes given the values for those features is as close as possible to the original
distribution given the values of all features
o reduces the number of attributes appearing in the discovered patterns, making them easier to understand
 Heuristic methods (due to the exponential number of choices):
o Step-wise forward selection
o Step-wise backward elimination
o Combining forward selection and backward elimination
o Decision-tree induction
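
A minimal sketch of step-wise forward selection is given below; it greedily adds the single attribute that most improves a cross-validated score and stops when no attribute helps. The dataset, the classifier, and the use of scikit-learn are assumptions made only for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected = []

def score(cols):
    # Cross-validated accuracy using only the attributes in `cols`.
    return cross_val_score(DecisionTreeClassifier(random_state=0), X[:, cols], y, cv=5).mean()

while remaining:
    best = max(remaining, key=lambda c: score(selected + [c]))
    if selected and score(selected + [best]) <= score(selected):
        break                                # stop: no attribute improves the score
    selected.append(best)
    remaining.remove(best)

print(selected)                              # indices of the chosen attribute subset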

2. Dimensionality Reduction: Wavelet Transformation

 Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis


 Compressed approximation: store only a small fraction of the strongest of the wavelet
coefficients
 Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
 Method:
o Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
o Each transform has 2 functions: smoothing, difference
o Applies to pairs of data, resulting in two set of data of length L/2
o Applies two functions recursively, until reaches the desired length


Principal Component Analysis (PCA)

 Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components)
that can be best used to represent data
 Steps
o Normalize input data: Each attribute falls within the same range
o Compute k orthonormal (unit) vectors, i.e., principal components
o Each input data (vector) is a linear combination of the k principal component vectors
o The principal components are sorted in order of decreasing “significance” or strength
o Since the components are sorted, the size of the data can be reduced by eliminating the weak
components, i.e., those with low variance. (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data
 Works for numeric data only
 Used when the number of dimensions is large
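
The steps listed above can be sketched with NumPy as follows; the small data matrix and the choice of k = 1 are illustrative assumptions only.

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)                    # 1. centre (normalize) each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                            # 2. orthonormal principal components
variance = S ** 2 / (len(X) - 1)           #    sorted by decreasing strength

k = 1                                      # keep only the strongest component
X_reduced = Xc @ components[:k].T          # 3. project the data onto k components

print(variance)
print(X_reduced)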

3. Data Compression

 String compression
o There are extensive theories and well-tuned algorithms
o Typically lossless
o But only limited manipulation is possible without expansion
 Audio/video compression
o Typically lossy compression, with progressive refinement
o Sometimes small fragments of signal can be reconstructed without reconstructing the
whole
 Time sequence is not audio
o Typically short and vary slowly with time


4. Numerosity Reduction
 Reduce data volume by choosing alternative, smaller forms of data representation
 Parametric methods
o Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
o Example: Log-linear models—obtain value at a point in m-D space as the product on
appropriate marginal subspaces
 Non-parametric methods
o Do not assume models
o Major families: histograms, clustering, sampling

Parametric methods

Regression and Log-Linear Models


 Linear regression: Data are modeled to fit a straight line
o Often uses the least-square method to fit the line
 Multiple regression: allows a response variable Y to be modeled as a linear function of
multidimensional feature vector
 Log-linear model: approximates discrete multidimensional probability distributions

Non-parametric methods

 Histograms,

 Divide data into buckets and store average (sum) for each bucket
 Partitioning rules:
o Equal-width: equal bucket range
o Equal-frequency (or equal-depth)
o V-optimal: with the least histogram variance (weighted sum of the original values that
each bucket represents)
o MaxDiff: set bucket boundaries between each pair of adjacent values for the pairs having the
β–1 largest differences


Clustering

 Partition data set into clusters based on similarity, and store cluster representation (e.g.,
centroid and diameter) only
 Can be very effective if data is clustered but not if data is “dirty”
 Can have hierarchical clustering and be stored in multi-dimensional index tree structures
 There are many choices of clustering definitions and clustering algorithms.

Sampling

 Sampling: obtaining a small sample s to represent the whole data set N


 Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
 Choose a representative subset of the data
o Simple random sampling may have very poor performance in the presence of skew
 Develop adaptive sampling methods
o Stratified sampling:
 Approximate the percentage of each class (or subpopulation of interest) in the
overall database
 Used in conjunction with skewed data
 Note: Sampling may not reduce database I/Os (page at a time)
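
A small pandas sketch of simple random sampling and stratified sampling is shown below; the column names and class proportions are illustrative only.

import pandas as pd

df = pd.DataFrame({
    "customer": range(10),
    "segment":  ["big"] * 2 + ["budget"] * 8,    # skewed class distribution
})

# Simple random sample: may under-represent the small "big" class.
srs = df.sample(frac=0.5, random_state=1)

# Stratified sample: roughly preserves the proportion of each class.
stratified = (df.groupby("segment", group_keys=False)
                .apply(lambda g: g.sample(frac=0.5, random_state=1)))

print(srs)
print(stratified)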

V. Discretization and concept hierarchy generation



 Three types of attributes:


o Nominal — values from an unordered set, e.g., colour, profession
o Ordinal — values from an ordered set, e.g., military or academic rank
o Continuous — real numbers, e.g., integer or real numbers
Discretization:

o Divide the range of a continuous attribute into intervals


o Some classification algorithms only accept categorical attributes.
o Reduce data size by discretization
o Prepare for further analysis
o Reduce the number of values for a given continuous attribute by dividing the range of
the attribute into intervals
o Interval labels can then be used to replace actual data values
o Supervised vs. unsupervised
o Split (top-down) vs. merge (bottom-up)
o Discretization can be performed recursively on an attribute

Concept hierarchy formation

o Recursively reduce the data by collecting and replacing low level concepts (such as
numeric values for age) by higher level concepts (such as young, middle-aged, or
senior)

Discretization and Concept Hierarchy Generation for Numeric Data

 Typical methods: All the methods can be applied recursively


o Binning (see above)
 Top-down split, unsupervised,
o Histogram analysis (see above)
 Top-down split, unsupervised
o Clustering analysis (see above)
 Either top-down split or bottom-up merge, unsupervised
o Entropy-based discretization: supervised, top-down split
o Interval merging by χ2 analysis: unsupervised, bottom-up merge
o Segmentation by natural partitioning: top-down split, unsupervised

Segmentation by natural partitioning: top-down split, unsupervised

A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural"
intervals.

o If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the
range into 3 equi-width intervals
o If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4
intervals
o If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into
5 intervals

Concept Hierarchy Generation for Categorical Data


 Specification of a partial/total ordering of attributes explicitly at the schema level by users or


experts
o street < city < state < country
 Specification of a hierarchy for a set of values by explicit data grouping
o {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
o E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by the analysis of the number of
distinct values
o E.g., for a set of attributes: {street, city, state, country}

Automatic Concept Hierarchy Generation


 Some hierarchies can be automatically generated based on the analysis of the number of
distinct values per attribute in the data set
o The attribute with the most distinct values is placed at the lowest level of the hierarchy
o Exceptions, e.g., weekday, month, quarter, year.
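A small Python sketch of this heuristic (the attribute names and distinct-value counts below are assumed for illustration): attributes are ordered by their number of distinct values, with the most distinct values placed at the bottom of the hierarchy.

# Assumed distinct-value counts for a location schema (illustrative numbers only)
distinct_counts = {"country": 15, "state": 365, "city": 3567, "street": 674339}

# Order from fewest to most distinct values: top of hierarchy -> bottom
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))   # street < city < state < country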

UNIT-III
University questions
PART A
1. Define data.
2. State why the data preprocessing an important issue for data warehousing and data mining.
3. What is the need for discretization in data mining?
4. What are the various forms of data preprocessing?
5. Define Data Mining.
6. List out any four data mining tools.
7. What do data mining functionalities include?
8. Define patterns.
PART-B
1. (i) Explain the various primitives for specifying Data mining Task. [6]

(ii) Describe the various descriptive statistical measures for data mining.[10]

2. Discuss about different types of data and functionalities. [16]


3. (i)Describe in detail about Interestingness of patterns. [6]

(ii)Explain in detail about data mining task primitives.[10]

4. (i)Discuss about different Issues of data mining. [6]


(ii)Explain in detail about data preprocessing. [10]

5. How data mining system are classified? Discuss each classification with an
example. [16]

6. How data mining system can be integrated with a data warehouse? Discuss
with an example. [16]


UNIT 4

PART – A

1. What is the purpose of Apriori Algorithm? [CO4-L2]

Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean association
rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent item set properties.

2. Define anti-monotone property? [CO4-L1]

If a set cannot pass a test, all of its supersets will fail the same test as well

3. How to generate association rules from frequent item sets? [CO4-L2]

Association rules can be generated as follows. For each frequent itemset l, generate all nonempty subsets of l. For every nonempty subset s of l, output the rule “s => (l − s)” if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.

4. Give few techniques to improve the efficiency of Apriori algorithm? [CO4-L2]

• Hash based technique


• Transaction Reduction
• Partitioning
• Sampling
• Dynamic item counting

5. What are the things suffering the performance of Apriori candidate generation
technique? [CO4-L2]

• Need to generate a huge number of candidate sets


• Need to repeatedly scan the database and check a large set of candidates by pattern matching

6. Describe the method of generating frequent item sets without candidate generation?
[CO4-L2]

Frequent-pattern growth(or FP Growth) adopts divide-and-conquer strategy. Steps:


• Compress the database representing frequent items into a frequent pattern tree or FP tree,
• Divide the compressed database into a set of conditional database,
• Mine each conditional database separately.

7. Mention few approaches to mining Multilevel Association Rules? [CO4-L2]

• Uniform minimum support for all levels (uniform support)
• Reduced minimum support at lower levels (reduced support)
• Level-by-level independent
• Level-cross filtering by single item
• Level-cross filtering by k-itemset

8. What are multidimensional association rules? [CO4-L2]


Association rules that involve two or more dimensions or predicates


• Inter dimension association rule: Multidimensional association rule with no repeated predicate
or dimension.
• Hybrid-dimension association rule: Multidimensional association rule with multiple
occurrences of some predicates or dimensions.

9. Define constraint-Based Association Mining? [CO4-L1]

Mining is performed under the guidance of various kinds of constraints provided by the user. The constraints include the following:
• Knowledge type constraints
• Data constraints
• Dimension/level constraints
• Interestingness constraints
• Rule constraints

10. Define the concept of classification? [CO4-L2]

Classification is a two-step process:
• A model is built describing a predefined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes.
• The model is then used for classification.

11 What is Decision tree? [CO4-L1]

A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node.

12. What is Attribute Selection Measure? [CO4-L1]

The information Gain measure is used to select the test attribute at each node in the decision
tree. Such a measure is referred to as an attribute selection measure or a measure of the
goodness of split.

13. Describe Tree pruning methods. [CO4-L1]

When a decision tree is built, many of the branches will reflect anomalies in the training data due
to noise or outlier. Tree pruning methods address this problem of over fitting the data.
Approaches:
• Pre pruning
• Post pruning

14. Define Pre Pruning[CO4-L1]


A tree is pruned by halting its construction early. Upon halting, the node becomes a leaf. The leaf
may hold the most frequent class among the subset samples.

15. Define Post Pruning. [CO4-L1]


Post pruning removes branches from a “Fully grown” tree. A tree node is pruned by removing its
branches. Eg: Cost Complexity Algorithm

16. What is meant by Pattern? [CO4-L1]

Pattern represents the knowledge.


17. Define the concept of prediction. [CO4-L2]

Prediction can be viewed as the construction and use of a model to assess the class of an
unlabeled sample or to assess the value or value ranges of an attribute that a given sample is
likely to have.

18 What is the use of Regression [CO4-L2]

Regression can be used to solve classification problems, but it can also be used for applications such as forecasting. Regression can be performed using many different types of techniques; in actuality, regression takes a set of data and fits the data to a formula.

19 What are the requirements of cluster analysis? [CO4-L2]

The basic requirements of cluster analysis are


• Dealing with different types of attributes.
• Dealing with noisy data.
• Constraints on clustering.
• Dealing with arbitrary shapes.
• High dimensionality
• Ordering of input data
• Interpretability and usability
• Determining input parameter and
• Scalability

20.What are the different types of data used for cluster analysis? [CO4-L1]

The different types of data used for cluster analysis are interval scaled, binary, nominal, ordinal
and ratio scaled data.

PART -B

1. Describe the Mining Frequent Patterns and Associations & Correlations. [CO4-H2]

Basic Concepts
Market Basket Analysis:
Frequent Itemsets, Closed Itemsets, and Association Rules
Frequent Pattern Mining: A Road Map

1. Basic Concepts

Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear
in
a data set frequently.
For example, a set of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card,
if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.

A substructure can refer to different structural forms, such as subgraphs, subtrees, or


sublattices, which may be combined with itemsets or subsequences. If a substructure occurs
frequently, it is called a (frequent) structured pattern.

2.Market Basket Analysis


Frequent itemset mining used to find associations and correlations of all items in large
transactional or relational data sets. With large amounts of data continuously collected and stored,
many industries are interested in mining such patterns from their databases. This can help in
many business decision-making processes, such as catalogue design, cross marketing, and
customer shopping behaviour analysis.
A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items that
customers place in their “shopping baskets” (Figure 5.1).
The discovery of such associations can help retailers develop marketing strategies by
which items are frequently purchased together by customers. For example, if customers are
buying milk, how many of them also buy bread on the same trip to the supermarket? Such
information can lead to increased sales by helping retailers do selective marketing and plan their
shelf space.

3.Frequent Itemsets, Closed Itemsets, and Association Rules

 A set of items is referred to as an itemset.


 An itemset that contains k items is a k-itemset.
 The set {computer, antivirus software} is a 2-itemset.
 The occurrence frequency of an itemset is the number of transactions that contain the itemset.
This is also known, simply, as the frequency, support count, or count of the itemset.


 The support of a rule A => B is the percentage of transactions that contain A U B, i.e., support(A => B) = P(A U B); its confidence is the percentage of transactions containing A that also contain B, i.e., confidence(A => B) = P(B|A).
 Note that the itemset support defined above is sometimes referred to as relative support, whereas the occurrence frequency is called the absolute support.

 From the above definitions we have confidence(A => B) = P(B|A) = support_count(A U B) / support_count(A).

 Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called Strong Association Rules.

In general, association rule mining can be viewed as a two-step process:

1.Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently
as a predetermined minimum support count, min_sup.

2. Generate strong association rules from the frequent itemsets: By definition, these rules must
satisfy minimum support and minimum confidence.

Closed Itemsets : An itemset X is closed in a data set S if there exists no proper super-itemset Y
such that Y has the same support count as X in S. An itemset X is a closed frequent itemset in set
S if X is both closed and frequent in S.

Maximal frequent itemset: An itemset X is a maximal frequent itemset (or max-itemset) in set S if X is frequent, and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in S.
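As a small illustration of these definitions, the following Python sketch (the five toy transactions are assumed, not taken from the syllabus material) computes the relative support and confidence of the rule {computer} => {antivirus software}:

# Toy transaction database (assumed for illustration)
transactions = [
    {"computer", "antivirus software"},
    {"computer", "printer"},
    {"computer", "antivirus software", "printer"},
    {"printer"},
    {"computer"},
]

A, B = {"computer"}, {"antivirus software"}

count = lambda items: sum(1 for t in transactions if items <= t)   # support count
support = count(A | B) / len(transactions)                         # P(A U B)
confidence = count(A | B) / count(A)                               # P(B | A)

print(f"support = {support:.0%}, confidence = {confidence:.0%}")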

4.Frequent Pattern Mining

Frequent pattern mining can be classified in various ways, based on the following criteria:

1. Based on the completeness of patterns to be mined: The following can be mined based on the
Completeness of patterns.
 Frequent itemsets, Closed frequent itemsets, Maximal frequent itemsets,
 Constrained frequent itemsets (i.e., those that satisfy a set of user-defined constraints),
 Approximate frequent itemsets (i.e., those that derive only approximate support counts for
the mined frequent itemsets),
 Near-match frequent itemsets (i.e., those that tally the support count of the near or almost
matching itemsets),
 Top-k frequent itemsets (i.e., the k most frequent itemsets for a user-specified value, k),

2. Based on the levels of abstraction involved in the rule set:


e.g “computer” is a higher-level abstraction of “laptop computer”

buys (X, “computer”) =>buys (X, “HP printer”)


buys (X, “laptop computer”) => buys (X, “HP printer”)

3. Based on the number of data dimensions involved in the rule:


buys (X, “computer”) =>buys (X, “HP printer”)


buys (X, “laptop computer”) => buys (X, “HP printer”)
buys (X, “computer”) => buys (X, “antivirus software”)

The above rules are single-dimensional association rules because they refer to only one dimension, buys. The following rule is an example of a multidimensional rule:

age(X, “20...29”) ^ occupation(X, “student”) => buys(X, “laptop”)
4. Based on the types of values handled in the rule:

Associations between the presence or absence of items, it is a Boolean association rule.e.g


Computer => antivirus software [support = 2%; confidence = 60%]
buys (X, “computer”) =>buys (X, “HP printer”)
buys (X, “laptop computer”) => buys (X, “HP printer”)
If the associations are between quantitative items or attributes, then it is a quantitative association rule, e.g.,
age(X, “30...39”) ^ income(X, “42K...48K”) => buys(X, “high resolution TV”)
5. Based on the kinds of rules to be mined:


e.g Association rules and Correlation rules.

6. Based on the kinds of patterns to be mined:


 Frequent itemset mining: mining of frequent itemsets (sets of items) from
transactional or
relational data sets.
 Sequential pattern mining: searches for frequent subsequences in a sequence data
set
 Structured pattern mining: searches for frequent substructuresin a structured data
set.

2. Write all the Mining Methods? [CO4-H3]

1. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation


2. Generating Association Rules from Frequent Itemsets
3. Improving the Efficiency of Apriori
4. Mining Frequent Itemsets without Candidate Generation
5. Mining Frequent Itemsets Using Vertical Data Format
6. Mining Closed Frequent Itemsets

1. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation

Apriori is an algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent
itemsets for Boolean association rules. The name of the algorithm is based on the fact that the
algorithm uses prior knowledge of frequent itemset properties.

Apriori uses an iterative approach known as a level-wise search, where k-itemsets are used
to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support. The resulting set is
denoted L1.

Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so
on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of
the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important
property called the Apriori property, presented below, is used to reduce the search space.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

E.g

Consider the table D with nine transactions |D| = 9.

1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets,
C1. The algorithm simply scans all of the transactions for count the number of occurrences of each
item.
2.Minimum support count is 2, i.e min sup = 2. The set of frequent 1-itemsets, L1, can then be
determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all
of the candidates in C1 satisfy minimum support.

3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join to generate
a candidate set of 2-itemsets, C2. C2 consists of 2-itemsets. Note that no candidates are
removed fromC2 during the prune step because each subset of the candidates is also frequent.


4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is
accumulated, as shown below

5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets
in C2 having minimum support.

6. The generation of the set of candidate 3-itemsets, C3, is as follows. From the join step, we get C3 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. The resulting pruned version of C3 is shown below.


7. The transactions in D are scanned in order to determine L3, consisting of those candidate
3-itemsets in C3 having minimum support (Figure 5.2).

8. The algorithm uses L3 join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {I1, I2, I3, I5}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.

 Pseudo-code:

Ck : candidate itemsets of size k
Lk : frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

1.The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation


The Apriori Algorithm: Basics
The Apriori Algorithm is an influential algorithm for mining frequent itemsets for boolean
association rules.
Key Concepts :
• Frequent Itemsets: The sets of item which has minimum support (denoted by L i for ith-
Itemset).
• Apriori Property: Any subset of frequent itemset must be frequent.
• Join Operation: To find Lk , a set of candidate k-itemsets is generated by joining Lk-1 with
itself.
The Apriori Algorithm Steps
 Find the frequent itemsets: the sets of items that have minimum support
 A subset of a frequent itemset must also be a frequent itemset
 i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemset
 Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
 Use the frequent itemsets to generate association rules.
 Join Step: Ck is generated by joining Lk-1with itself

 Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-
itemset

 Pseudo-code:
Ck : candidate itemsets of size k
Lk : frequent itemsets of size k
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
The Apriori Algorithm: Example


 Consider a database, D , consisting of 9 transactions.


 Suppose min. support count required is 2 (i.e. min_sup = 2/9 = 22 % )
 Let minimum confidence required is 70%.
 We have to first find out the frequent itemset using Apriori algorithm.
 Then, Association rules will be generated using min. support & min. confidence.

Step 1: Generating 1-itemset Frequent Pattern

• In the first iteration of the algorithm, each item is a member of the set of
candidate.
• The set of frequent 1-itemsets, L1 , consists of the candidate 1-itemsets
satisfying minimum support.

Step 2: Generating 2-itemset Frequent Pattern


 To discover the set of frequent 2-itemsets, L2 , the algorithm uses L1 Join L1 to


generate a candidate set of 2-itemsets, C2.
 Next, the transactions in D are scanned and the support count for each candidate
itemset in C2 is accumulated (as shown in the middle table).
 The set of frequent 2-itemsets, L2 , is then determined, consisting of those
candidate 2-itemsets in C2 having minimum support.
 Note: We haven’t used Apriori Property yet.

Step 3: Generating 3-itemset Frequent Pattern

 The generation of the set of candidate 3-itemsets, C3 , involves use of the Apriori
Property.
 In order to find C3, we compute L2 Join L2.
 C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
 Now, the Join step is complete and the Prune step will be used to reduce the size of C3. The Prune step helps to avoid heavy computation due to large Ck.
 Based on the Apriori property that all subsets of a frequent itemset must also be
frequent, we can determine that four latter candidates cannot possibly be frequent.
How ?
 For example , lets take {I1, I2, I3}. The 2-item subsets of it are {I1, I2}, {I1, I3} & {I2,
I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, We will keep {I1, I2, I3}
in C3.
 Lets take another example of {I2, I3, I5} which shows how the pruning is performed.
The 2-item subsets are {I2, I3}, {I2, I5} & {I3,I5}.
 BUT, {I3, I5} is not a member of L2 and hence it is not frequent violating Apriori
Property. Thus We will have to remove {I2, I3, I5} from C3.
 Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking for all members of result of
Join operation for Pruning.
 Now, the transactions in D are scanned in order to determine L3, consisting of those
candidates 3-itemsets in C3 having minimum support.

Step 4: Generating 4-itemset Frequent Pattern

 The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4.


Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset
{{I2, I3, I5}} is not frequent.


 Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.
This completes our Apriori Algorithm.
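The following self-contained Python sketch reproduces this Apriori level-wise search on the nine-transaction example above (min_sup = 2). It is a minimal teaching implementation of the join/prune/scan steps, not an optimized one.

from itertools import combinations

D = [                                  # the nine transactions of database D
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2

def support_count(itemset):
    return sum(1 for t in D if itemset <= t)

# L1: frequent 1-itemsets
items = sorted({i for t in D for i in t})
L = [{frozenset([i]) for i in items if support_count({i}) >= min_sup}]

k = 1
while L[-1]:
    # Join step: candidates of size k+1 from frequent k-itemsets
    candidates = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
    # Prune step: every k-subset of a candidate must itself be frequent (Apriori property)
    candidates = {c for c in candidates
                  if all(frozenset(s) in L[-1] for s in combinations(c, k))}
    # Scan D once and keep the candidates with minimum support
    L.append({c for c in candidates if support_count(c) >= min_sup})
    k += 1

for level, itemsets in enumerate(L[:-1], start=1):
    print(f"L{level}:", [sorted(s) for s in sorted(itemsets, key=sorted)])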

Step5.Generating Association Rules from Frequent Itemsets

Once the frequent itemsets from the transactions in a database D have been found, strong association rules can be generated from them (where strong association rules satisfy both minimum support and minimum confidence). This is done using the equation

confidence(A => B) = P(B|A) = support_count(A U B) / support_count(A)

where support_count(A U B) is the number of transactions containing the itemset A U B, and support_count(A) is the number of transactions containing the itemset A.

Based on this equation, association rules can be generated as follows:


 For each frequent itemset l, generate all nonempty subsets of l.
 For every nonempty subset s of l, output the rule “s => (l − s)” if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.

Back to e.g

L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3},
{I1,I2,I5}}.
 Lets take l = {I1,I2,I5}.
 Its all nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
 Let the minimum confidence threshold be, say, 70%.
 The resulting association rules are shown below, each listed with its confidence.
o Rule 1: I1 ^ I2 => I5
 Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%. R1 is rejected.
o Rule 2: I1 ^ I5 => I2
 Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%. R2 is selected.
o Rule 3: I2 ^ I5 => I1
 Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%. R3 is selected.
o Rule 4: I1 => I2 ^ I5
 Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%. R4 is rejected.
o Rule 5: I2 => I1 ^ I5
 Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%. R5 is rejected.
o Rule 6: I5 => I1 ^ I2
 Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%. R6 is selected.
In this way, we have found three strong association rules.
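This rule-generation step can be sketched in Python as follows; the sc dictionary simply hard-codes the support counts already derived in the worked example above, so nothing new is assumed except the code itself.

from itertools import combinations

# Support counts taken from the worked example above
sc = {frozenset(k): v for k, v in {
    ("I1",): 6, ("I2",): 7, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
    ("I1", "I2", "I5"): 2,
}.items()}

l = frozenset({"I1", "I2", "I5"})
min_conf = 0.70

for r in range(1, len(l)):                       # all nonempty proper subsets s of l
    for s in map(frozenset, combinations(l, r)):
        conf = sc[l] / sc[s]                     # confidence of s => (l - s)
        status = "selected" if conf >= min_conf else "rejected"
        print(f"{set(s)} => {set(l - s)}  confidence = {conf:.0%}  ({status})")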

3.Improving the Efficiency of Apriori

 Hash-based itemset counting: A k-itemset whose corresponding hashing bucket


count is below the threshold cannot be frequent.
 Transaction reduction: A transaction that does not contain any frequent k-itemset
is useless in subsequent scans.
 Partitioning: Any itemset that is potentially frequent in DB must be frequent in at
least one of the partitions of DB.
 Sampling: mining on a subset of given data, lower support threshold + a method
to determine the completeness.
 Dynamic itemset counting: add new candidate itemsets only when all of their
subsets are estimated to be frequent.

• Apriori Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.
• Apriori Disadvantages:
– Assumes transaction database is memory resident.
– Requires up to m database scans

4.Mining Frequent Itemsets without Candidate Generation

 Compress a large database into a compact, Frequent-Pattern tree (FP-tree)


structure
 highly condensed, but complete for frequent pattern mining
 avoid costly database scans
 Develop an efficient, FP-tree-based frequent pattern mining method
 A divide-and-conquer methodology: decompose mining tasks into smaller
ones
 Avoid candidate generation: sub-database test only!

FP-Growth Method: An Example


 Consider the same previous example of a database, D , consisting of 9 transactions.


 Suppose min. support count required is 2 (i.e. min_sup = 2/9 = 22 % )
 The first scan of database is same as Apriori, which derives the set of 1-itemsets &
their support counts.
 The set of frequent items is sorted in the order of descending support count.
 The resulting set is denoted as L = {I2:7, I1:6, I3:6, I4:2, I5:2}

FP-Growth Method: Construction of FP-Tree

 First, create the root of the tree, labeled with “null”.


 Scan the database D a second time. (First time we scanned it to create 1-itemset
and then L).
 The items in each transaction are processed in L order (i.e. sorted order).
 A branch is created for each transaction with items having their support count
separated by colon.
 Whenever the same node is encountered in another transaction, we just
increment the support count of the common node or Prefix.
 To facilitate tree traversal, an item header table is built so that each item points to
its occurrences in the tree via a chain of node-links.
 Now, The problem of mining frequent patterns in database is transformed to that
of mining the FP-Tree.
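A minimal Python sketch of the FP-tree construction step just described (only tree building and the header table; the recursive mining of conditional FP-trees is omitted). It uses the same nine transactions as the running example.

from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_sup):
    # First scan: frequent items and their support counts, sorted descending (the list L)
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    L = [i for i, c in sorted(counts.items(), key=lambda x: -x[1]) if c >= min_sup]

    root = Node(None, None)
    header = defaultdict(list)          # item -> list of nodes (the node-link chains)
    # Second scan: insert each transaction, with items processed in L order
    for t in transactions:
        node = root
        for item in [i for i in L if i in t]:
            if item in node.children:   # shared prefix: just increment the count
                node = node.children[item]
                node.count += 1
            else:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
                node = child
    return root, header

D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
     {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
root, header = build_fp_tree(D, min_sup=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})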

FP-Growth Method: Construction of FP-Tree


Mining the FP-Tree by Creating Conditional (sub) pattern bases

[See class notes for above two topics]

Steps:
1. Start from each frequent length-1 pattern (as an initial suffix pattern).


2. Construct its conditional pattern base which consists of the set of prefix paths in
the FP-Tree co-occurring with suffix pattern.
3. Then, Construct its conditional FP-Tree & perform mining on such a tree.
4. The pattern growth is achieved by concatenation of the suffix pattern with the
frequent patterns generated from a conditional FP-Tree.
5. The union of all frequent patterns (generated by step 4) gives the required
frequent itemset.

Table : Mining the FP-Tree by creating conditional (sub) pattern bases

Now, Following the above mentioned steps:

 Lets start from I5. The I5 is involved in 2 branches namely {I2 I1 I5: 1} and {I2 I1
I3 I5: 1}.
 Therefore considering I5 as suffix, its 2 corresponding prefix paths would be {I2
I1: 1} and {I2 I1 I3: 1}, which forms its conditional pattern base.
 Out of these, Only I1 & I2 is selected in the conditional FP-Tree because I3 is not
satisfying the minimum support count.
o For I1 , support count in conditional pattern base = 1 + 1 = 2
o For I2 , support count in conditional pattern base = 1 + 1 = 2
o For I3, support count in conditional pattern base = 1
o Thus support count for I3 is less than required min_sup which is 2 here.
 Now , We have conditional FP-Tree with us.
 All frequent pattern corresponding to suffix I5 are generated by considering all
possible combinations of I5 and conditional FP-Tree.
 The same procedure is applied to suffixes I4, I3 and I1.
 Note: I2 is not taken into consideration for suffix because it doesn’t have any
prefix at all.

Advantages of FP growth

 Performance study shows


 FP-growth is an order of magnitude faster than Apriori, and is also faster
than tree-projection
 Reasoning
 No candidate generation, no candidate test


 Use compact data structure


 Eliminate repeated database scan
 Basic operation is counting and FP-tree building

Disadvantages of FP-Growth
– FP-Tree may not fit in memory!!
– FP-Tree is expensive to build

Pseudo Code

Algorithm: FP growth. Mine frequent itemsets using an FP-tree by pattern fragment


growth.
Input:
• D, a transaction database;
• min sup, the minimum support count threshold.
Output: The complete set of frequent patterns.
Method:

1. The FP-tree is constructed in the following steps:


(a) Scan the transaction database D once. Collect F, the set of frequent items, and their
support counts. Sort F in support count descending order as L, the list of frequent items.
(b) Create the root of an FP-tree, and label it as “null.” For each transaction Trans in D do the following.
Select and sort the frequent items in Trans according to the order of L. Let the sorted frequent item list in Trans be [p | P], where p is the first element and P is the remaining list. Call insert_tree([p | P], T), which is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N’s count by 1; else create a new node N, let its count be 1, link its parent to T, and link it to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.

2. The FP-tree is mined by calling FP growth(FP tree, null), which is implemented as


follows.

procedure FP_growth(Tree, α)

(1) if Tree contains a single path P then
(2)     for each combination (denoted as β) of the nodes in the path P
(3)         generate pattern β ∪ α with support_count = minimum support count of the nodes in β;
(4) else for each ai in the header of Tree {
(5)     generate pattern β = ai ∪ α with support_count = ai.support_count;
(6)     construct β’s conditional pattern base and then β’s conditional FP-tree Treeβ;
(7)     if Treeβ ≠ ∅ then
(8)         call FP_growth(Treeβ, β); }


5 .Mining Frequent Itemsets Using Vertical Data Format

Both the Apriori and FP-growth methods mine frequent patterns from a set of
transactions
in TID-itemset format (that is,{ TID : itemset}), where TID is a transaction-id and itemset
is the set of items bought in transaction TID. This data format is known as horizontal
data format.

Alternatively, data can also be presented in item-TID-set format (that is,{item : TID-
set}), where item is an item name, and TID set is the set of transaction identifiers
containing the item. This format is known as vertical data format.
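A short Python sketch of the vertical data format idea, using the same nine transactions (the TID labels T100...T900 are assumed identifiers): each item is mapped to its TID set, and the support of a k-itemset is obtained by intersecting TID sets instead of rescanning the database.

from collections import defaultdict

horizontal = {                       # TID : itemset (horizontal data format)
    "T100": {"I1","I2","I5"}, "T200": {"I2","I4"}, "T300": {"I2","I3"},
    "T400": {"I1","I2","I4"}, "T500": {"I1","I3"}, "T600": {"I2","I3"},
    "T700": {"I1","I3"}, "T800": {"I1","I2","I3","I5"}, "T900": {"I1","I2","I3"},
}

# Convert to vertical data format: item -> TID set
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

# Support of {I1, I2} is the size of the intersection of the two TID sets
print(sorted(vertical["I1"] & vertical["I2"]))
print("support count of {I1, I2} =", len(vertical["I1"] & vertical["I2"]))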


********************************************************************************

3. What are the Various Kinds of Association Rules that can be Mined?

• Multilevel association rules involve concepts at different levels of abstraction.


• Multidimensional association rules involve more than one dimension or
predicate(e.g., rules relating what a customer buys as well as the customer’s age.)
• Quantitative association rules involve numeric attributes that have an implicit
ordering among values (e.g., age).

1.Mining Multilevel Association Rules - involve concepts at different levels of


abstraction

Let’s examine the following example.
The transactional data in the table show, for each transaction, the items purchased in an AllElectronics store. The concept hierarchy for the items is shown in the following figure.
A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher level, more general concepts.
Data can be generalized by replacing low-level concepts within the data by their
higher-level concepts, or ancestors, from a concept hierarchy.


Association rules generated from mining data at multiple levels of abstraction are
called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies
under a support-confidence framework. A top-down strategy is used, where counts are
collected for the calculation of frequent itemsets at each concept level, starting at the
concept level 1 and working downward in the hierarchy towards the specific concept
levels, until no more frequent itemsets can be found. For each level, any algorithm for
discovering frequent itemsets may be used, such as Apriori or its variations.

 Using uniform minimum support for all levels (referred to as uniform support):
The same minimum support threshold is used when mining at each level of abstraction. For example, in the following figure, a minimum support threshold of 5% is used throughout (e.g., for mining from “computer” down to “laptop computer”). Both “computer” and “laptop computer” are found to be frequent, while “desktop computer” is not.
The method is simple in that users are required to specify only one minimum support threshold. An Apriori-style optimization can be used, based on the knowledge that an ancestor is a superset of its descendants: the search avoids examining itemsets containing any item whose ancestors do not have minimum support.


 Using reduced minimum support at lower levels (referred to as reduced


support): Each level of abstraction has its own minimum support threshold. The
deeper the level of abstraction, the smaller the corresponding threshold is. For
example, in Figure, the minimum support thresholds for levels 1 and 2 are 5% and
3%, respectively. In this way, “computer,” “laptop computer,” and “desktop computer”
are all considered frequent.

 Using item or group-based minimum support (referred to as group-based


support):

When mining multilevel rules, users often know which item groups are more important than others, so it is useful to set up user-specific, item-based, or group-based minimum support thresholds.
For example, a user could set the minimum support thresholds based on product price or on items of interest, such as low support thresholds for laptop computers and flash drives, in order to obtain association patterns containing items in these categories.

2.Mining Multidimensional Association Rules from Relational Databases and


DataWarehouses (involves more than one dimension or predicate)


E.g., mining association rules containing a single predicate:

buys(X, “computer”) => buys(X, “HP printer”)

Following the terminology used in multidimensional databases, we refer to each distinct predicate in a rule as a dimension. Hence, we can refer to the rule above as a single-dimensional or intra-dimensional association rule because it contains a single distinct predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule).

Considering each database attribute or warehouse dimension as a predicate, we can therefore mine association rules containing multiple predicates, such as

age(X, “20...29”) ^ occupation(X, “student”) => buys(X, “laptop”)
Association rules that involve two or more dimensions or predicates can be


referred to as multidimensional association rules. Rule above contains three
predicates (age, occupation, and buys), each of which occurs only once in the rule.
Hence, it has no repeated predicates. Multidimensional association rules with no
repeated predicates are called inter dimensional association rules.
We can also mine multidimensional association rules with repeated predicates,
which contain multiple occurrences of some predicates. These rules are called hybrid-
dimensional association rules.
An example of such a rule is the following, where the predicate buys is repeated:

age(X, “20...29”) ^ buys(X, “laptop”) => buys(X, “HP printer”)
Note that database attributes can be categorical or quantitative.


Categorical attributes have a finite number of possible values, with no ordering
among the values (e.g., occupation, brand, color). Categorical attributes are also called
nominal attributes, because their values are “names of things.”
Quantitative attributes are numeric and have an implicit ordering among values
(e.g., age, income, price).

Techniques for mining multidimensional association rules can be categorized into two
basic approaches regarding the treatment of quantitative attributes.

(i).Mining Multidimensional Association Rules Using Static Discretization of Quantitative


Attributes
(ii).Mining Quantitative Association Rules

(i).Mining Multidimensional Association Rules Using Static Discretization of


Quantitative Attributes


Quantitative attributes are discretized before mining, using predefined concept hierarchies or data discretization techniques, where numeric values are replaced by interval labels.
The transformed multidimensional data may be used to construct a data cube.
Data cubes are well suited for the mining of multidimensional association rules.
They store aggregates (such as counts), in multidimensional space, which is
essential for computing the support and confidence of multidimensional association
rules.

Following Figure shows the lattice of cuboids defining a data cube for the dimensions
age, income, and buys. The cells of an n-dimensional cuboid can be used to store the
support counts of the corresponding n-predicate sets.
The base cuboid aggregates the task-relevant data by age, income, and buys;
the 2-D cuboid, (age, income), aggregates by age and income, and so on; the 0-D
(apex) cuboid contains the total number of transactions in the task-relevant data.

(ii). Mining Quantitative Association Rules

Quantitative association rules are multidimensional association rules in which the numeric attributes are dynamically discretized during the mining process so as to satisfy mining criteria, such as maximizing the confidence or compactness of the rules.
Here we consider quantitative association rules having two quantitative attributes on the left-hand side of the rule and one categorical attribute on the right-hand side, that is,

Aquan1 ^ Aquan2 => Acat

where Aquan1 and Aquan2 are tests on quantitative attribute intervals, and Acat tests a
categorical attribute from the task-relevant data. Such rules have been referred to as
two-dimensional quantitative association rules, because they contain two quantitative
dimensions.
An example of such a 2-D quantitative association rule is
age(X, “30...39”) ^ income(X, “42K...48K”) => buys(X, “high resolution TV”)

Association Rule Clustering System

This approach maps pairs of quantitative attributes onto a 2-D grid for tuples satisfying a
given categorical attribute condition. The grid is then searched for clusters of points
from which the association rules are generated. The following steps are involved in
ARCS:

Binning: Quantitative attributes can have a very wide range of values defining their domain. A very large 2-D grid would result if age and income were plotted as axes, with each possible value of age assigned a unique position on one axis and each possible value of income assigned a unique position on the other axis.
To keep grids down to a manageable size, we instead partition the ranges of quantitative attributes into intervals. This partitioning process is referred to as binning, and the intervals are called “bins.” Three common binning strategies are as follows:

 Equal-width binning, where the interval size of each bin is the same
 Equal-frequency binning, where each bin has approximately the same number of
tuples assigned to it.
 Clustering-based binning, where clustering is performed on the quantitative attribute
to group neighboring points (judged based on various distance measures) into the
same bin
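The first two binning strategies can be sketched in Python with pandas (the income values below are assumed purely for illustration; clustering-based binning is omitted here):

import pandas as pd

income = pd.Series([18, 22, 25, 31, 38, 41, 47, 52, 60, 75, 90, 120])  # in K, assumed

# Equal-width binning: every bin spans the same interval size
equal_width = pd.cut(income, bins=4)

# Equal-frequency binning: every bin holds roughly the same number of tuples
equal_freq = pd.qcut(income, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())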

Finding frequent predicate sets: Once the 2-D array containing the count distribution
for each category is set up, it can be scanned to find the frequent predicate sets (those
satisfying minimum support) that also satisfy minimum confidence. Strong association
rules can then be generated from these predicate sets, using a rule generation
algorithm.
Clustering the association rules: The strong association rules obtained in the
previous


step are then mapped to a 2-D grid. Following figure shows a 2-D grid for 2-D
quantitative
association rules predicting the condition buys (X, “HDTV”) on the rule right-hand side,
given the quantitative attributes age and income.

The four Xs correspond to the rules

The four rules can be combined or “clustered” together to form the following simpler
rule, which

subsumes and replaces the above four rules:

4. Explain in detail Correlation Analysis (correlation - relationship).

Strong Rules Are Not Necessarily Interesting: An Example

Consider analyzing transactions at an AllElectronics shop involving purchases of computer games and videos. Let game refer to the transactions containing computer games, and video refer to those containing videos.


Of the 10,000 transactions analyzed, 6,000 of the customer transactions included


computer games, while 7,500 included videos, and 4,000 included both computer
games and videos.

If a minimum support of 30% and a minimum confidence of 60% are given, then the following association rule is discovered:

buys(X, “computer games”) => buys(X, “videos”) [support = 40%, confidence = 66%]

Above Rule is a strong association rule since its support value of 4,000/10,000 =40%
and confidence value of 4,000/6,000 =66% satisfy the minimum support and minimum
confidence thresholds, respectively.

However, the rule is misleading because the probability of purchasing videos is 75%, which is even larger than 66%. The above example also illustrates that the confidence of a rule A => B can be misleading in that it is only an estimate of the conditional probability of itemset B given itemset A.
It does not measure the real strength of the correlation and implication between A and B.

From Association Analysis to Correlation Analysis.


As seen above, the support and confidence measures are insufficient for filtering out uninteresting association rules. To address this weakness, a correlation measure can be used to augment the support-confidence framework for association rules.

This leads to correlation rules of the form

A => B [support, confidence, correlation]

That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B.

Lift is a simple correlation measure that is given as follows. The occurrence of itemset A is independent of the occurrence of itemset B if P(A U B) = P(A) × P(B); otherwise, itemsets A and B are dependent and correlated as events.
This definition can easily be extended to more than two itemsets. The lift between the occurrence of A and B can be measured by computing

lift(A, B) = P(A U B) / (P(A) × P(B))

If the resulting value of the above equation is less than 1, then the occurrence of A is negatively correlated with the occurrence of B.
If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.


If the resulting value is equal to 1, then A and B are independent and there is no correlation between them.
The above equation is equivalent to P(B|A) / P(B), i.e., conf(A => B) / sup(B), which is also referred to as the lift of the association (or correlation) rule A => B.
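Applying the lift measure to the game/video example above can be sketched in a few lines of Python; the counts come directly from the 10,000 analyzed transactions described earlier.

n_total, n_game, n_video, n_both = 10_000, 6_000, 7_500, 4_000

p_game  = n_game  / n_total      # P(game)  = 0.60
p_video = n_video / n_total      # P(video) = 0.75
p_both  = n_both  / n_total      # P(game U video) = 0.40

lift = p_both / (p_game * p_video)
print(f"lift(game, video) = {lift:.2f}")   # 0.89 < 1: game and video are negatively correlated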

5. Write about Constraint-Based Association Mining. [CO4-H1]

1. Metarule-Guided Mining of Association Rules


2. Constraint Pushing: Mining Guided by Rule Constraints

A data mining process may uncover a very large number of rules, many of which are uninteresting to the users. A good practice is to have the users provide constraints to limit the search space. This strategy is known as constraint-based mining.

The constraints can include the following:

Knowledge type constraints: These specify the type of knowledge to be mined, such as
association or correlation.
Data constraints: These specify the set of task-relevant data.
Dimension/level constraints: These specify the desired dimensions (or attributes) of the
data, or levels of the concept hierarchies, to be used in mining.
Interestingness constraints: These specify thresholds on statistical measures of rule
interestingness, such as support, confidence, and correlation.
Rule constraints: These specify the form of rules to be mined. Such constraints may be
expressed as metarules

Metarule-guided mining.

E.g., consider a market analyst for AllElectronics who has data describing customers (such as customer age, address, and credit rating) as well as the list of customer transactions. The analyst is interested in finding associations between customer traits and the items that customers purchase. Instead of finding all of the association rules, the analyst wants to find only which pairs of customer traits promote the sale of office software.

An example of such a metarule is

P1 ( X, Y ) ^ P2( X, W ) => buys(X, “office software”), (1)

where P1 and P2 are variables that are instantiated to attributes from the given database
during the mining process, X is a variable representing a customer, and Y and W take
on values of the attributes assigned to P1 and P2, respectively.


The data mining system can then search for rules that match the given metarule. For
instance, Rule (2) matches or complies with Metarule (1).

age(X, “30……39”) ^ income(X, “41K….60K”) =>buys(X, “office software”) (2)

Constraint Pushing: Mining Guided by Rule Constraints

Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant initialization of variables, and constraints on aggregate functions.

6. Explain the basic concepts in Classification and Prediction[CO4-H2]

1.Basic Concepts

What Is Classification? What Is Prediction?

Databases are rich with hidden information that can be used for intelligent decision
making.
Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends.

Whereas classification predicts categorical (discrete) labels, prediction models continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as either safe or risky, or build a prediction model to predict the expenditures of customers on computer equipment given their income and occupation.

A bank loans officer needs analysis of her data in order to learn which loan applicants
are
“safe”and which are “risky” for the bank.
A marketing manager at AllElectronics needs data analysis to help guess whether a
customer with a given profile will buy a new computer.
A medical researcher wants to analyze breast cancer data in order to predict which one
of
three specific treatments a patient should receive.

In each of these examples, the data analysis task is classification, where a model or
classifier is constructed to predict categorical labels, such as “safe” or “risky” for the
loan application data; “yes” or “no” for the marketing data; or “treatment A,” “treatment
B,” or “treatment C” for the medical data.

Suppose the marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a categorical label. This model is a predictor.


Regression analysis is a statistical methodology that is used for numeric prediction, hence the two terms are often used synonymously.

Classification and numeric prediction are the two major types of prediction problems. Here, the term prediction is used to refer to numeric prediction.

How does classification work? Data classification is a two-step process, as shown for the loan application data of Figure 1.

Issues: Data Preparation


 Data cleaning
o Preprocess data in order to reduce noise and handle missing values
 Relevance analysis (feature selection)


o Remove the irrelevant or redundant attributes


 Data transformation
o Generalize and/or normalize data

Issues: Evaluating Classification Methods


 Accuracy
o classifier accuracy: predicting class label
o predictor accuracy: guessing value of predicted attributes
 Speed
o time to construct the model (training time)
o time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability:
o understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules.

7. Explain classification by Decision Tree Induction (decision tree generation). [CO4-H1]

Decision tree induction is the learning of decision trees from class-labelled training tuples.
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.


Decision trees are used for classification- Given a tuple, X, for which the associated
class label is unknown, the attribute values of the tuple are tested against the decision
tree. A path is traced from the root to a leaf node, which holds the class prediction for
that tuple. Decision trees can easily be converted to classification rules.

“Why are decision tree classifiers so popular?”


 The construction of decision tree classifiers does not require any domain
knowledge
 Decision trees can handle high dimensional data.
 The learning and classification steps of decision tree induction are simple and
fast.
 Decision tree classifiers have good accuracy.

Decision tree induction algorithms have been used for classification in many application
areas, such as medicine, manufacturing and production, financial analysis, astronomy,
and molecular biology.

Decision Tree Induction

Algorithm: Generate decision tree. Generate a decision tree from the training tuples of
data
partition D.
Input:
 Data partition, D, which is a set of training tuples and their associated class
labels;
 attribute list, the set of candidate attributes;


 Attribute selection method, a procedure to determine the splitting criterion that


“best” partitions the data tuples into individual classes. This criterion consists of a
splitting attribute and, possibly, either a split point or splitting subset.

Output: A decision tree.

Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C then
(3) return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute selection method(D, attribute list) to find the “best” splitting criterion;
(7) label node N with splitting criterion;
(8) if splitting attribute is discrete-valued and multiway splits allowed then // not
restricted to binary trees
(9) attribute_list ← attribute_list − splitting_attribute; // remove splitting attribute
(10) for each outcome j of splitting criterion
// partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate decision tree(Dj, attribute list) to node N;
endfor
(15) return N;

Algorithm

 Basic algorithm (a greedy algorithm)


o Tree is constructed in a top-down recursive divide-and-conquer manner
o At start, all the training examples are at the root
o Attributes are categorical (if continuous-valued, they are discretized in
advance)
o Examples are partitioned recursively based on selected attributes
o Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)

o Conditions for stopping partitioning


o All samples for a given node belong to the same class
o There are no remaining attributes for further partitioning – majority voting
is employed for classifying the leaf
o There are no samples left
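As a brief, hedged illustration of this top-down induction (not part of the original material), the scikit-learn library can grow such a tree on a small categorical data set. The tiny buys-computer-style table below is assumed for illustration, and the attributes are one-hot encoded because the library expects numeric inputs.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny illustrative training set (attribute values assumed, in the spirit of the
# buys_computer example used later in this answer)
data = pd.DataFrame({
    "age":     ["youth", "youth", "middle", "senior", "senior", "middle", "youth", "senior"],
    "student": ["no",    "yes",   "no",     "no",     "yes",    "yes",    "no",    "yes"],
    "buys":    ["no",    "yes",   "yes",    "no",     "yes",    "yes",    "no",    "yes"],
})

X = pd.get_dummies(data[["age", "student"]])     # categorical -> indicator attributes
y = data["buys"]

# criterion="entropy" corresponds to the information-gain measure discussed below
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))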


2. Attribute Selection Measures

o An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labelled training tuples into individual classes.
o If we were to split D into smaller partitions according to the outcomes of the splitting
criterion, ideally each partition would be pure.
o Attribute selection measures are also known as splitting rules because they
determine how the tuples at a given node are to be split.

The three popular attribute selection measures


1).Information gain, 2). Gain ratio, 3). Gini index
1). Information Gain
ID3 uses information gain as its attribute selection measure.


The expected information needed to classify a tuple in D is Info(D) = −Σi pi log2(pi), and the information still needed after partitioning D on attribute A into v partitions is InfoA(D) = Σj (|Dj| / |D|) × Info(Dj).
Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) − InfoA(D)

Above Table presents a training set, D, of class-labeled tuples randomly selected


from the AllElectronics customer database. In this example, each attribute is discrete-
valued. Continuous-valued attributes have been generalized. The class label attribute,
buys computer, has two distinct values (namely, {yes, no}); therefore, there are two
distinct classes (that is, m = 2).
Let class P correspond to yes and class N correspond to no. There are nine tuples of class yes and five tuples of class no. A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, compute the information gain of each attribute.
o Class P: buys_computer = “yes”
o Class N: buys_computer = “no”

Info(D) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940


age            pi   ni   I(pi, ni)
youth           2    3    0.971
middle-aged     4    0    0
senior          3    2    0.971

Infoage(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

The term (5/14) I(2, 3) means that “age = youth” has 5 out of 14 samples, with 2 yes’s and 3 no’s. Hence

Gain(age) = Info(D) − Infoage(D) = 0.940 − 0.694 = 0.246


Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
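The entropy and information-gain numbers above can be checked with a few lines of Python (a direct re-computation of the example; small differences in the last digit are only due to rounding of the intermediate values):

from math import log2

def info(*counts):
    # Expected information (entropy) I(c1, c2, ...) for the given class counts
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_D = info(9, 5)                                                         # 0.940
info_age = (5/14) * info(2, 3) + (4/14) * info(4, 0) + (5/14) * info(3, 2)  # 0.694
gain_age = info_D - info_age                                                # ~0.246

print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))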


2). Gain ratio

C4.5 (a successor of ID3) uses the gain ratio, a normalization of information gain, to overcome the bias of information gain towards attributes with many values. The gain ratio is

GainRatio(A) = Gain(A) / SplitInfo_A(D), where

SplitInfo_A(D) = −Σ_{j=1..v} (|Dj|/|D|) × log2(|Dj|/|D|), summed over the v partitions produced by splitting D on A.

E.g., computation of the gain ratio for the attribute income: with SplitInfo_income(D) = 0.926,

GainRatio(income) = 0.029 / 0.926 = 0.031

The attribute with the maximum gain ratio is selected as the splitting attribute


3. Gini index

The Gini index is used in CART. Using the notation described above, the Gini index measures the impurity of D, a data partition or set of training tuples, as Gini(D) = 1 − Σ_i p_i², where p_i is the probability that a tuple in D belongs to class Ci.

E.g., D has 9 tuples with buys_computer = “yes” and 5 with “no”.
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}. The Gini index is computed for every possible binary split on income, and the splitting subset giving the minimum value is chosen; here Gini_{medium,high} is 0.30 and is thus the best since it is the lowest.
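For reference, the quantities used in this example can be written out as follows (standard CART notation, not reproduced from the original figure). For a binary split of D by attribute A into D1 and D2, and for the 9 “yes” / 5 “no” class distribution of D above:

```latex
\[
Gini_A(D) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2),
\qquad
\Delta Gini(A) = Gini(D) - Gini_A(D)
\]
\[
Gini(D) = 1 - \left(\frac{9}{14}\right)^{2} - \left(\frac{5}{14}\right)^{2} = 0.459
\]
```

The attribute (and splitting subset) that minimizes Gini_A(D), i.e. maximizes the impurity reduction, is selected as the splitting criterion.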

Tree Pruning
 Overfitting: An induced tree may overfit the training data
o Too many branches, some may reflect differences due to noise or outliers
o Poor accuracy for unseen samples
 Two approaches to avoid overfitting
o Prepruning: Halt tree construction early—do not split a node if this would
result in the goodness measure falling below a threshold
 Difficult to choose an appropriate threshold
o Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
 Use a set of data different from the training data to decide which is the
“best pruned tree”

Scalable Decision Tree Induction Methods


 SLIQ
o Builds an index for each attribute and only class list and the current attribute
list reside in memory


 SPRINT
o Constructs an attribute list data structure
 PUBLIC
o Integrates tree splitting and tree pruning: stop growing the tree earlier
 RainForest
o Builds an AVC-list (attribute, value, class label)
 BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
o Uses bootstrapping to create several small samples

8. Explain Bayesian Classification with examples. [CO4-H3]


Bayes’ Theorem
Naïve Bayesian Classification
Bayesian Belief Networks
Training Bayesian Belief Networks

Need for Bayesian Classification

 A statistical classifier: performs probabilistic prediction, i.e., predicts class


membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian classifier, has
comparable performance with decision tree and selected neural network
classifiers
 Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct — prior knowledge can be combined with
observed data
 Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can
be measured

1. Bayes’ Theorem : Basics

 Let X be a data sample (“evidence”): class label is unknown


 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that the hypothesis holds
given the observed data sample X
 P(H) (prior probability), the initial probability
o E.g., X will buy computer, irrespective of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (the likelihood), the probability of observing the sample X, given
that the hypothesis holds
o E.g., Given that X will buy computer, the prob. that X is 31..40, medium
income


 Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows from Bayes’ theorem:

P(H|X) = P(X|H) P(H) / P(X)
 Informally, this can be written as
 posteriori = likelihood x prior/evidence
 Predicts that X belongs to Ci if the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
 Practical difficulty: require initial knowledge of many probabilities, significant
computational cost

Naïve Bayesian Classifier

 Let D be a training set of tuples and their associated class labels, and each tuple
is represented by an n-D attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
 This can be derived from Bayes’ theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
 A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

P(X|Ci) = Π_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
 This greatly reduces the computation cost: Only counts the class distribution
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided
by |Ci, D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))

and P(xk|Ci) is

P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
Naïve Bayesian Classifier: Training Dataset

Class:
C1:buys_computer = ‘yes’


C2:buys_computer = ‘no’
Data sample

X = (age = youth, income = medium, student = yes, credit_rating = fair)

age income student credit_rating buys_computer


Youth high no fair no
Youth high no excellent no
middle-aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle-aged low yes excellent yes
Youth medium no fair no
Youth low yes fair yes
senior medium yes fair yes
Youth medium yes excellent yes
middle-aged medium no excellent yes
middle-aged high yes fair yes
senior medium no excellent no

 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643


P(buys_computer = “no”) = 5/14= 0.357

 Compute P(X|Ci) for each class

P(age = “youth” | buys_computer = “yes”) = 2/9 = 0.222

P(age = “youth” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

 X = (age = youth, income = medium, student = yes, credit_rating = fair)

P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019


P(X|Ci) * P(Ci) :
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
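The whole calculation can be reproduced with a short Python sketch (not part of the original notes); it estimates P(Ci) and each P(xk|Ci) by counting over the training table above and picks the class that maximizes P(X|Ci) P(Ci):

```python
from collections import Counter

# Training tuples from the table above: (age, income, student, credit_rating) -> buys_computer
rows = [
    (("youth", "high", "no", "fair"), "no"), (("youth", "high", "no", "excellent"), "no"),
    (("middle-aged", "high", "no", "fair"), "yes"), (("senior", "medium", "no", "fair"), "yes"),
    (("senior", "low", "yes", "fair"), "yes"), (("senior", "low", "yes", "excellent"), "no"),
    (("middle-aged", "low", "yes", "excellent"), "yes"), (("youth", "medium", "no", "fair"), "no"),
    (("youth", "low", "yes", "fair"), "yes"), (("senior", "medium", "yes", "fair"), "yes"),
    (("youth", "medium", "yes", "excellent"), "yes"), (("middle-aged", "medium", "no", "excellent"), "yes"),
    (("middle-aged", "high", "yes", "fair"), "yes"), (("senior", "medium", "no", "excellent"), "no"),
]

def naive_bayes_classify(x):
    """Return the class Ci maximizing P(x|Ci) * P(Ci) for categorical attributes."""
    class_counts = Counter(label for _, label in rows)
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(rows)                       # P(Ci)
        for k, value in enumerate(x):                 # P(xk|Ci), conditional independence
            n_match = sum(1 for feats, label in rows
                          if label == c and feats[k] == value)
            score *= n_match / n_c
        print(f"P(X|{c}) * P({c}) = {score:.3f}")
        if score > best_score:
            best_class, best_score = c, score
    return best_class

X = ("youth", "medium", "yes", "fair")
print("Predicted class:", naive_bayes_classify(X))    # -> "yes" (0.028 vs 0.007)
```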

o Advantages
 Easy to implement
 Good results obtained in most of the cases
o Disadvantages
 Assumption: class conditional independence, therefore loss of accuracy
 Practically, dependencies exist among variables
 E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
 Dependencies among these cannot be modeled by Naïve Bayesian
Classifier
o To deal with these dependencies
 Bayesian Belief Networks are used.

Bayesian Belief Networks

o A Bayesian belief network allows a subset of the variables to be conditionally independent
o A graphical model of causal relationships
 Represents dependency among the variables
 Gives a specification of joint probability distribution
 Nodes: random variables
 Links: dependency
 X and Y are the parents of Z, and Y is the parent of P
 No dependency between Z and P
 Has no loops or cycles


Derivation of the probability of a particular combination of values of X, from the conditional probability tables (CPTs):

P(x1, ..., xn) = Π_{i=1}^{n} P(xi | Parents(Yi))

Training Bayesian Networks

o Several scenarios:
 Given both the network structure and all variables observable: learn only
the CPTs
 Network structure known, some hidden variables: gradient descent
(greedy hill-climbing) method, analogous to neural network learning
 Network structure unknown, all variables observable: search through the
model space to reconstruct network topology
 Unknown structure, all hidden variables: No good algorithms known for
this purpose

9. Describe Rule-Based Classification. [CO4-H1]

1. Using IF-THEN Rules for Classification


2. Rule Extraction from a Decision Tree


3. Rule Induction Using a Sequential Covering Algorithm

1) Using IF-THEN Rules for Classification

Rules are a good way of representing information or bits of knowledge. A rule-based


classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an
expression of the form
IF condition THEN conclusion.
An example is rule R1,
R1: IF age = youth AND student = yes THEN buys computer = yes.
The “IF”-part (or left-hand side) of a rule is known as the rule antecedent or precondition. The “THEN”-part (or right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (such as age = youth, and student = yes) that are logically ANDed. The rule’s consequent contains a class prediction (in this case, we are predicting whether a customer will buy a computer). R1 can also be written as

R1: (age = youth) ^ (student = yes) => (buys_computer = yes).

A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data set D, let ncovers be the number of tuples covered by R; ncorrect be the number of tuples correctly classified by R; and |D| be the number of tuples in D. We can define the coverage and accuracy of R as

coverage(R) = ncovers / |D|        accuracy(R) = ncorrect / ncovers

E.g., consider rule R1 above, which covers 2 of the 14 tuples. It can correctly classify both tuples. Therefore, coverage(R1) = 2/14 = 14.28% and accuracy(R1) = 2/2 = 100% (see the AllElectronics table in the previous question).
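A minimal Python sketch (not part of the original notes) of these two measures, evaluated for R1 over the 14 AllElectronics tuples used earlier:

```python
# The 14 AllElectronics tuples: (age, income, student, credit_rating, buys_computer)
D = [
    ("youth", "high", "no", "fair", "no"), ("youth", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"), ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"), ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"), ("middle-aged", "medium", "no", "excellent", "yes"),
    ("middle-aged", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]

# R1: IF age = youth AND student = yes THEN buys_computer = yes
covers = [t for t in D if t[0] == "youth" and t[2] == "yes"]
correct = [t for t in covers if t[4] == "yes"]

coverage = len(covers) / len(D)        # n_covers / |D| = 2/14 (about 14.28%)
accuracy = len(correct) / len(covers)  # n_correct / n_covers = 2/2 = 100%
print(f"coverage(R1) = {coverage:.4f}, accuracy(R1) = {accuracy:.2f}")
```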

 If more than one rule is triggered, need conflict resolution


o Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
o Class-based ordering: decreasing order of prevalence or misclassification
cost per class
o Rule-based ordering: (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts

2.Rule Extraction from a Decision Tree


To extract rules from a decision tree, one rule is created for each path from the root to a
leaf node. Each splitting criterion along a given path is logically ANDed to form the rule
antecedent (“IF” part). The leaf node holds the class prediction, forming the rule
consequent (“THEN” part).

E.g., extracting classification rules from a decision tree: the decision tree learned earlier for the buys_computer data can be converted to classification IF-THEN rules by tracing the path from the root node to each leaf node in the tree. The extracted rules are

R1: IF age = youth AND student = no THEN buys computer = no


R2: IF age = youth AND student = yes THEN buys computer = yes
R3: IF age = middle aged THEN buys computer = yes
R4: IF age = senior AND credit rating = excellent THEN buys computer = yes
R5: IF age = senior AND credit rating = fair THEN buys computer = no

3.Rule Induction Using a Sequential Covering Algorithm

IF-THEN rules can be extracted directly from the training data (i.e., without having to
generate a decision tree first) using a sequential covering algorithm. In this the rules are
learned sequentially (one at a time), where each rule for a given class will ideally cover
many of the tuples of that class (and none of the tuples of other classes).

Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.

Input:
D, a data set of class-labeled tuples;
Att vals, the set of all attributes and their possible values.

Output: A set of IF-THEN rules.

Method:
(1) Rule_set = { }; // initial set of rules learned is empty
(2) for each class c do
(3) repeat


(4) Rule = Learn_ One_ Rule(D, Att_ vals, c);


(5) remove tuples covered by Rule from D;
(6) until terminating condition;
(7) Rule_ set = Rule_ set +Rule; // add new rule to rule set
(8) endfor
(9) return Rule_ Set;
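The loop above can be sketched in Python as follows; learn_one_rule is a hypothetical helper (an assumption, not specified in the notes) that is expected to return one rule for class c together with the tuples it covers, or None when no acceptable rule can be found any more.

```python
def sequential_covering(D, classes, learn_one_rule, max_rules_per_class=20):
    """Skeleton of the sequential covering loop above."""
    rule_set = []                                   # (1) initial set of rules learned is empty
    for c in classes:                               # (2) for each class c
        remaining = list(D)
        for _ in range(max_rules_per_class):        # (3) repeat
            result = learn_one_rule(remaining, c)   # (4) learn one rule for class c
            if result is None:                      # (6) terminating condition
                break
            rule, covered = result
            remaining = [t for t in remaining if t not in covered]  # (5) remove covered tuples
            rule_set.append(rule)                   # (7) add new rule to rule set
    return rule_set                                 # (9) return the rule set
```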


10. Explain Classification by Backpropagation. [CO4-H1]

A Multilayer Feed-Forward Neural Network


Defining a Network Topology
Backpropagation
Inside the Black Box: Backpropagation and Interpretability

 Backpropagation: A neural network learning algorithm


 Started by psychologists and neurobiologists to develop and test computational
analogues of neurons
 A neural network: A set of connected input/output units where each connection has
a weight associated with it
 During the learning phase, the network learns by adjusting the weights so as to
be able to predict the correct class label of the input tuples
 Also referred to as connectionist learning due to the connections between units

Neural Network as a Classifier


 Weakness
o Long training time
o Require a number of parameters that are typically best determined empirically, e.g., the network topology or “structure”
o Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network
 Strength
o High tolerance to noisy data
o Ability to classify untrained patterns
o Well-suited for continuous-valued inputs and outputs
o Successful on a wide array of real-world data
o Algorithms are inherently parallel
o Techniques have recently been developed for the extraction of rules from
trained neural networks

A Neuron (= a perceptron)


For example,

y = sign( Σ_{i=0}^{n} wi xi + μk )

 The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping; μk here denotes the bias (threshold) of the unit.

A Multilayer Feed-Forward Neural Network

Working process of Multilayer Feed-Forward Neural Network

 The inputs to the network correspond to the attributes measured for each
training tuple


 Inputs are fed simultaneously into the units making up the input layer
 They are then weighted and fed simultaneously to a hidden layer
 The number of hidden layers is arbitrary, although usually only one
 The weighted outputs of the last hidden layer are input to units making up the
output layer, which emits the network's prediction
 The network is feed-forward in that none of the weights cycles back to an input
unit or to an output unit of a previous layer
 From a statistical point of view, networks perform nonlinear regression: Given
enough hidden units and enough training samples, they can closely approximate
any function

Defining a Network Topology

 First decide the network topology: number of units in the input layer, number of
hidden layers (if > 1), number of units in each hidden layer, and number of units
in the output layer
 Normalizing the input values for each attribute measured in the training tuples to
[0.0—1.0]
 One input unit per domain value, each initialized to 0
 Output, if for classification and more than two classes, one output unit per class
is used
 Once a network has been trained and its accuracy is unacceptable, repeat the
training process with a different network topology or a different set of initial
weights.

Backpropagation
 Iteratively process a set of training tuples & compare the network's prediction
with the actual known target value
 For each training tuple, the weights are modified to minimize the mean squared
error between the network's prediction and the actual target value
 Modifications are made in the “backwards” direction: from the output layer,
through each hidden layer down to the first hidden layer, hence
“backpropagation”
 Steps
o Initialize weights (to small random #s) and biases in the network
o Propagate the inputs forward (by applying activation function)
o Backpropagate the error (by updating weights and biases)
o Terminating condition (when error is very small, etc.)
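The steps above can be illustrated with a tiny numpy sketch (not part of the original notes): one hidden layer of sigmoid units trained by squared-error backpropagation on a toy XOR-style data set (the data set and network sizes are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set (XOR-like), purely illustrative: 2 inputs, 1 output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Initialise weights and biases to small random numbers (2 inputs -> 4 hidden -> 1 output).
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)
lr = 1.0

for epoch in range(20000):
    # Propagate the inputs forward.
    h = sigmoid(X @ W1 + b1)                  # hidden-layer outputs
    o = sigmoid(h @ W2 + b2)                  # network prediction

    # Backpropagate the error (gradient of the squared error).
    err_o = (o - y) * o * (1 - o)             # output-layer error term
    err_h = (err_o @ W2.T) * h * (1 - h)      # hidden-layer error term
    W2 -= lr * h.T @ err_o;  b2 -= lr * err_o.sum(axis=0)
    W1 -= lr * X.T @ err_h;  b1 -= lr * err_h.sum(axis=0)

    # Terminating condition: stop when the mean squared error is very small.
    if np.mean((o - y) ** 2) < 1e-3:
        break

print("predictions:", o.ravel().round(2))     # typically close to [0, 1, 1, 0]
```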

Backpropagation and Interpretability

 Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| × w) time, with |D| tuples and w weights, but the number of epochs can be exponential in n, the number of inputs, in the worst case.


 Rule extraction from networks: network pruning


o Simplify the network structure by removing weighted links that have the
least effect on the trained network
o Then perform link, unit, or activation value clustering
o The set of input and activation values are studied to derive rules
describing the relationship between the input and hidden unit layers
 Sensitivity analysis: assess the impact that a given input variable has on a
network output. The knowledge gained from this analysis can be represented in
rules

Support Vector Machines

 A new classification method for both linear and nonlinear data.


 It uses a nonlinear mapping to transform the original training data into a higher
dimension
 With the new dimension, it searches for the linear optimal separating hyperplane
(i.e., “decision boundary”)
 With a suitable nonlinear mapping to a sufficiently high dimension, data from two
classes can always be separated by a hyperplane
 SVM finds this hyperplane using support vectors (“essential” training tuples) and
margins (defined by the support vectors)
 Used both for classification and prediction
 Applications:
a. handwritten digit recognition, object recognition, speaker identification,
benchmarking time-series prediction tests
1. The Case When the Data Are Linearly Separable


 Let the data D be (X1, y1), (X2, y2), …, (X|D|, y|D|), where Xi is a training tuple and yi is its associated class label.
 There are infinite lines (hyperplanes) separating the two classes but find the best
one.
 SVM searches for the hyperplane with the largest margin, i.e., maximum marginal
hyperplane (MMH)
 A separating hyperplane can be written as
W . X+b = 0;
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
 For 2-D it can be written as
w0 + w1x1 +w2x2 = 0:
 The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the
margin) are support vectors
 This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
 That is, any tuple that falls on or above H1 belongs to class +1, and any tuple that
falls
on or below H2 belongs to class -1. Combining the two inequalities of above two
Equations


we get
yi (w0 + w1 x1 + w2 x2) ≥ 1, ∀ i.
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the “sides” defining the
margin) satisfy above Equation and are called support vectors.

2. The Case When the Data Are Linearly inseparable

Two steps for extending linear SVMs to nonlinear SVMs .


Step 1. Transform the original input data into a higher dimensional space using a
nonlinear mapping.
Step 2. Search for a linear separating hyperplane in the new space. This can be end up
with a quadratic optimization problem solved using the linear SVM formulation. The
maximal marginal hyperplane found in the new space corresponds to a nonlinear
separating hypersurface in the original space.
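As an illustration (a sketch only, assuming scikit-learn is available; the library is not mentioned in the original notes), a linear-kernel SVC finds the maximum marginal hyperplane and exposes its support vectors, while a nonlinear kernel such as the RBF kernel performs the implicit nonlinear mapping described in step 1:

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

# Two tiny, linearly separable classes in 2-D.
X = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# Linear kernel: finds the maximum marginal hyperplane W.X + b = 0.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:\n", clf.support_vectors_)
print("prediction for (2, 2):", clf.predict([[2.0, 2.0]])[0])   # falls in the -1 class

# For linearly inseparable data, the nonlinear mapping is applied implicitly
# through a kernel function, e.g. the RBF (Gaussian) kernel.
rbf_clf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)
```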

11. Associative Classification (Classification by Association Rule Analysis) [CO4-H1]

Association rules show strong associations between attribute-value pairs (or items) that
occur frequently in a given data set. Such analysis is useful in many decision-making
processes, such as product placement, catalog design, and cross-marketing.

Association rules are mined in a two-step process frequent itemset mining, and rule
generation.
The first step searches for patterns of attribute-value pairs that occur repeatedly
in a data set, where each attribute-value pair is considered an item. The resulting
attribute value pairs form frequent itemsets.
The second step analyses the frequent itemsets in order to generate association
rules.
Advantages


o It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
o It is more accurate than some traditional classification methods, such as C4.5

Classification: Based on evaluating a set of rules in the form of


p1 ^ p2 … ^ pi => Aclass = C (confidence, support)

where “^” represents a logical “AND.”

Typical Associative Classification Methods


1. CBA (Classification By Association)

 CBA uses an iterative approach to frequent itemset mining, where multiple passes are made over the data and the derived frequent itemsets are used to generate and test longer itemsets. In general, the number of passes made is equal to the length of the longest rule found. The complete set of rules satisfying the minimum confidence and minimum support thresholds is found and then included in the classifier.
 CBA uses a heuristic method to construct the classifier, where the rules are organized according to decreasing precedence based on their confidence and support. In this way, the set of rules making up the classifier forms a decision list.

2. CMAR (Classification based on Multiple Association Rules)

It uses several rule pruning strategies with the help of a tree structure for efficient
storage and retrieval of rules.
 CMAR adopts a variant of the FP-growth algorithm to find the complete set of rules
satisfying the minimum confidence and minimum support thresholds. FP-growth
uses a tree structure, called an FP-tree, to register all of the frequent itemset
information contained in the given data set, D. This requires only two scans of D.
The frequent itemsets are then mined from the FP-tree.
 CMAR uses an enhanced FP-tree that maintains the distribution of class labels
among tuples satisfying each frequent itemset. In this way, it is able to combine rule
generation together with frequent itemset mining in a single step.
 CMAR employs another tree structure to store and retrieve rules efficiently and to
prune rules based on confidence, correlation, and database coverage. Rule pruning
strategies are triggered whenever a rule is inserted into the tree.
 CMAR also prunes rules for which the rule antecedent and class are not positively correlated, based on a χ² (chi-square) test of statistical significance.

3. CPAR (Classification based on Predictive Association Rules)


 CPAR uses an algorithm for classification known as FOIL (First Order Inductive
Learner). FOIL builds rules to differentiate positive tuples ( having class buys
computer = yes) from negative tuples (such as buys computer = no).
 For multiclass problems, FOIL is applied to each class. That is, for a class, C, all
tuples of class C are considered positive tuples, while the rest are considered
negative tuples. Rules are generated to differentiate C tuples from all others. Each
time a rule is generated, the positive samples it satisfies (or covers) are removed
until all the positive tuples in the data set are covered.
 CPAR relaxes this step by allowing the covered tuples to remain under
consideration, but reducing their weight. The process is repeated for each class. The
resulting rules are merged to form the classifier rule set.


12. Lazy Learners (or Learning from Your Neighbours) [CO4-H2]

Eager learners
 Decision tree induction, Bayesian classification, rule-based classification,
classification by backpropagation, support vector machines, and classification based
on association rule mining—are all examples of eager learners.
 Eager learners - when given a set of training tuples, will construct a classification
model before receiving new tuples to classify.

Lazy Learners
o In a lazy approach, for a given training tuple, a lazy learner simply stores it (or does only a little minor processing) and waits until it is given a test tuple. Only when it sees the test tuple does it perform generalization, classifying the tuple based on its similarity to the stored training tuples.

o Lazy learners do less work when a training tuple is presented and more work when making a classification or prediction. Because lazy learners store the training tuples, or “instances,” they are also referred to as instance-based learners, even though all learning is essentially based on instances.

Examples of lazy learners:


 k-nearest neighbour classifiers
 case-based reasoning classifiers

1. k-nearest neighbour classifiers (K-NN classifier)

 Nearest-neighbour classifiers are based on a comparison, between given test tuple


with training tuples that are similar to it.
 The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. All of the training tuples are stored in an n-dimensional pattern space.
 When given an unknown tuple, a k-nearest-neighbour classifier searches the pattern
space for the k training tuples that are closest to the unknown tuple. These k training
tuples are the k “nearest neighbours” of the unknown tuple.
 “Closeness” is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

dist(X1, X2) = sqrt( Σ_{i=1}^{n} (x1i − x2i)² )


 For k-nearest-neighbour classification, the unknown tuple is assigned the most


common class among its k nearest neighbours. When k = 1, the unknown tuple is
assigned the class of the training tuple that is closest to it in pattern space.

 Nearest neighbour classifiers can also be used for prediction, that is, to return a real-
valued prediction for a given unknown tuple. In this case, the classifier returns the
average value of the real-valued labels associated with the k nearest neighbours of
the unknown tuple.
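A minimal Python sketch (not part of the original notes) of a k-nearest-neighbour classifier using the Euclidean distance defined above; the toy 2-D training tuples are illustrative:

```python
import math
from collections import Counter

def euclidean(a, b):
    """dist(X1, X2) = sqrt(sum_i (x1i - x2i)^2)"""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train, query, k=3):
    """Assign the majority class among the k training tuples closest to `query`."""
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy 2-D training tuples: (point, class)
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))   # -> "A"
```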


Case-Based Reasoning (CBR)

 Case-based reasoning classifiers use a database of problem solutions to solve new


problems. CBR stores the tuples or “cases” for problem solving as complex symbolic
descriptions. e.g Medical education - where patient case histories and treatments
are used to help diagnose and treat new patients.

 When given a new case to classify, a case-based reasoner will first check if an
identical training case exists. If one is found, then the associated solution to that
case is returned. If no identical case is found, then the case-based reasoner will
search for training cases having components that are similar to those of the new
case.

 Ideally, these training cases may be considered as neighbours of the new case. If
cases are represented as graphs, this involves searching for subgraphs that are
similar to subgraphs within the new case. The case-based reasoner tries to combine
the solutions of the neighbouring training cases in order to propose a solution for the
new case.

Challenges in case-based reasoning

 Finding a good similarity metric and suitable methods for combining solutions.
 The selection of salient features for indexing training cases and the development
of efficient indexing techniques.
 A balance between accuracy and efficiency evolves as the number of stored cases becomes very large. As this number increases, the case-based reasoner becomes more intelligent. After a certain point, however, the efficiency of the system will suffer as the time required to search for and process relevant cases increases.

13. Explain the following other Classification Methods. [CO4-H2]

1. Genetic Algorithms
2. Rough Set Approach
3. Fuzzy Set Approaches

1. Genetic Algorithms

 Genetic Algorithm: based on a comparison to biological evolution


 An initial population is created consisting of randomly generated rules
o Each rule is represented by a string of bits
o e.g., “IF A1 AND NOT A2 THEN C2” can be encoded as “100”
o Similarly, the rule “IF NOT A1 AND NOT A2 THEN C1” can be encoded as
“001.”


o If an attribute has k > 2 values, k bits can be used to encode the attribute’s
values
 Based on the notion of survival of the fittest, a new population is formed to consist of
the fittest rules and their offspring
 The fitness of a rule is represented by its classification accuracy on a set of training
examples
 Offspring are generated by crossover and mutation
 The process continues until a population P evolves when each rule in P satisfies a
pre-specified threshold
 Slow but easily parallelizable
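A small Python sketch (not part of the original notes) of the two genetic operators mentioned above, applied to the bit-string rule encodings from the example:

```python
import random

random.seed(1)

def crossover(rule1, rule2):
    """Single-point crossover: swap substrings of two bit-string encoded rules."""
    point = random.randint(1, len(rule1) - 1)
    return rule1[:point] + rule2[point:], rule2[:point] + rule1[point:]

def mutate(rule, p=0.1):
    """Flip each bit independently with a small probability p."""
    return "".join(("1" if b == "0" else "0") if random.random() < p else b for b in rule)

# Encodings from the text: "IF A1 AND NOT A2 THEN C2" -> "100",
#                          "IF NOT A1 AND NOT A2 THEN C1" -> "001"
parent1, parent2 = "100", "001"
child1, child2 = crossover(parent1, parent2)
print(child1, child2, mutate(child1))
```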

2. Rough Set Approach

o Rough sets are used to approximately or “roughly” define equivalent classes


o A rough set for a given class C is approximated by two sets: a lower
approximation (certain to be in C) and an upper approximation (cannot be
described as not belonging to C)
o Finding the minimal subsets (reducts) of attributes for feature reduction is NP-
hard but a discernibility matrix (which stores the differences between attribute
values for each pair of data tuples) is used to reduce the computation intensity

3.Fuzzy Set Approaches

o Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of
membership (such as using fuzzy membership graph)


o Attribute values are converted to fuzzy values


o e.g., income is mapped into the discrete categories {low, medium, high}
with fuzzy values calculated
o For a given new sample, more than one fuzzy value may apply
o Each applicable rule contributes a vote for membership in the categories
o Typically, the truth values for each predicted category are summed, and these
sums are combined

14. Explain the following Prediction methods (Numeric Prediction / Regression). [CO4-H2]

1. Linear Regression
2. Nonlinear Regression
3. Other Regression-Based Methods

Numeric prediction is the task of predicting continuous values for a given input, e.g., predicting the salary of an employee with 10 years of work experience, or the sales of a new product.

An approach for numeric prediction is regression, a statistical methodology. Regression


analysis can be used to model the relationship between one or more independent or
predictor variables and a dependent or response variable (which is continuous-valued).


The predictor variables are the attributes of the tuple. In general, the values of the predictor variables are known. The response variable is unknown and is what we want to predict.

Regression problems can be solved with software packages such as SAS (www.sas.com), SPSS (www.spss.com) and S-Plus (www.insightful.com), or with routines from Numerical Recipes in C.

1. Linear Regression

Straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is the simplest form of regression, and models y as a linear function of x. That is,

y = b + w x

where the variance of y is assumed to be constant, and b and w are regression coefficients specifying the y-intercept and slope of the line.

The regression coefficients, w and b, can also be thought of as weights, so that

y = w0 + w1 x.

These coefficients can be estimated by the method of least squares with the following equations:

w1 = Σ_{i=1}^{|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{|D|} (xi − x̄)²        w0 = ȳ − w1 x̄

where x̄ is the mean value of x1, x2, ..., x|D|, and ȳ is the mean value of y1, y2, ..., y|D|.

Example: straight-line regression using the method of least squares. The data (a table in the original source, not reproduced here) are a set of (x, y) pairs where x is the number of years of work experience of an employee and y is the corresponding salary of the employee.


The 2-D data can be graphed on a scatter plot; the plot suggests a linear relationship between the two variables, x and y.

We model the relationship between salary and the number of years of work experience with the equation y = w0 + w1 x.

Given the data, we compute x̄ = 9.1 and ȳ = 55.4. Substituting these values into the above equations, we get w1 = 3.5 and w0 = 55.4 − (3.5)(9.1) = 23.6.

Thus, the equation of the least squares line is estimated by y = 23.6+3.5x. Using this
equation, we can predict that the salary of a college graduate with, say, 10 years of
experience is $58,600.
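The computation can be reproduced in a few lines of Python. The (x, y) pairs below are assumed to be those of the textbook table referenced above (the table itself is an image and is not included in these notes); they reproduce x̄ = 9.1 and ȳ = 55.4.

```python
# (years of experience x, salary y in $1000s) -- assumed from the referenced textbook example.
data = [(3, 30), (8, 57), (9, 64), (13, 72), (3, 36),
        (6, 43), (11, 59), (21, 90), (1, 20), (16, 83)]

n = len(data)
x_bar = sum(x for x, _ in data) / n          # 9.1
y_bar = sum(y for _, y in data) / n          # 55.4

w1 = (sum((x - x_bar) * (y - y_bar) for x, y in data)
      / sum((x - x_bar) ** 2 for x, _ in data))      # slope, about 3.5
w0 = y_bar - w1 * x_bar                              # intercept, about 23.2
# (the notes round w1 to 3.5 before computing w0, which gives w0 = 23.6)

print(f"y = {w0:.1f} + {w1:.1f} x")
print("predicted salary for 10 years:", round(w0 + w1 * 10, 1))  # about 58.6, i.e. $58,600
```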

Multiple linear regression is an extension of straight-line regression that involves more than one predictor variable; it can be solved with the same least-squares approach.

2. Nonlinear Regression

 Some nonlinear models can be modeled by a polynomial function
 A polynomial regression model can be transformed into a linear regression model.
 For example, y = w0 + w1 x + w2 x² + w3 x³ can be converted to linear form with new variables x1 = x, x2 = x², x3 = x³, so that the above equation becomes y = w0 + w1 x1 + w2 x2 + w3 x3 (see the sketch after this list).

 Other functions, such as the power function, can also be transformed to a linear model.
 Some models are intractably nonlinear (e.g., a sum of exponential terms); for these it is still possible to obtain least-squares estimates through extensive calculation on more complex formulae.
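A brief numpy sketch (not part of the original notes) of the variable-substitution idea mentioned above: a cubic polynomial is fitted by ordinary linear least squares on the transformed variables x1 = x, x2 = x², x3 = x³ (the synthetic data and coefficients are only illustrative).

```python
import numpy as np

# y = w0 + w1*x + w2*x^2 + w3*x^3 is fitted as a *linear* model in x1, x2, x3.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 40)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3 + rng.normal(scale=0.1, size=x.size)

X = np.column_stack([np.ones_like(x), x, x**2, x**3])   # design matrix [1, x1, x2, x3]
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)          # ordinary least squares
print("estimated w0..w3:", np.round(coeffs, 2))         # close to [1.0, 2.0, -0.5, 0.3]
```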


3. Other Regression-Based Methods

(i).Generalized linear model:

o Foundation on which linear regression can be applied to modeling categorical


response variables.
o Variance of y is a function of the mean value of y, not a constant.
o Logistic regression: models the probability of some event occurring as a linear
function of a set of predictor variables
o Poisson regression: models the data that exhibit a Poisson distribution.

(ii).Log-linear models: (for categorical data)


o Approximate discrete multidimensional probability distributions
o Also useful for data compression and smoothing

(iii). Regression trees and model trees


o Trees to predict continuous values rather than class labels

 Regression tree: proposed in CART system


o CART: Classification And Regression Trees
o Each leaf stores a continuous-valued prediction
o It is the average value of the predicted attribute for the training tuples that
reach the leaf

 Model tree: proposed by Quinlan (1992)


o Each leaf holds a regression model—a multivariate linear equation for the
predicted attribute
o A more general case than regression tree

 Regression and model trees tend to be more accurate than linear regression when
the data are not represented well by a simple linear model

University Questions
Unit 4
Part A
1. What is meant by market Basket analysis?
2. What is the use of multilevel association rules?
3. What is meant by pruning in a decision tree induction?
4. Write the two measures of Association Rule.
5. With an example explain correlation analysis.
6. How are association rules mined from large databases?
7.What is tree pruning in decision tree induction?
8.What is the use of multi-level association rules?
9. What are the Apriori properties used in the Apriori algorithms?


PART-B
1. Decision tree induction is a popular classification method. Taking one typical decision
tree induction algorithm , briefly outline the method of decision tree classification. [16]

2. Consider the following training dataset and the original decision tree induction
algorithm (ID3). Risk is the class label attribute. The Height values have already been
discretized into disjoint ranges. Calculate the information gain if Gender is chosen as
the test attribute. Calculate the information gain if Height is chosen as the test attribute.
Draw the final decision tree (without any pruning) for the training dataset. Generate all
the “IF-THEN” rules from the decision tree.
Gender   Height       Risk
F        (1.5, 1.6)   Low
M        (1.9, 2.0)   High
F        (1.8, 1.9)   Medium
F        (1.8, 1.9)   Medium
F        (1.6, 1.7)   Low
M        (1.8, 1.9)   Medium
F        (1.5, 1.6)   Low
M        (1.6, 1.7)   Low
M        (2.0, ∞)     High
M        (2.0, ∞)     High
F        (1.7, 1.8)   Medium
M        (1.9, 2.0)   Medium
F        (1.8, 1.9)   Medium
F        (1.7, 1.8)   Medium
F        (1.7, 1.8)   Medium                                          [16]

(a) Given the following transactional database


1 C, B, H
2 B, F, S
3 A, F, G
4 C, B, H
5 B, F, G
6 B, E, O
(i) We want to mine all the frequent itemsets in the data using the Apriori algorithm.
Assume the minimum support level is 30%. (You need to give the set of frequent item
sets in L1, L2,… candidate item sets in C1, C2,…) [9]

(ii) Find all the association rules that involve only B, C, H (in either the left or the right hand side of the rule). The minimum confidence is 70%. [7]

3. Describe the multi-dimensional association rule, giving a suitable example. [16]

4. (a)Explain the algorithm for constructing a decision tree from training samples [12]
(b)Explain Bayes theorem. [4]

5. Develop an algorithm for classification using Bayesian classification. Illustrate the


algorithm with a relevant example. [16]

6. Discuss the approaches for mining multi level association rules from the transactional
databases. Give relevant example. [16]


7. Write and explain the algorithm for mining frequent item sets without candidate
generation. Give relevant example. [16]

8. How is attribute oriented induction implemented? Explain in detail. [16]

9. Discuss in detail about Bayesian classification [8]

A database has four transactions. Let min_sup = 60% and min_conf = 80%.

TID     DATE        ITEMS_BOUGHT
T100    10/15/07    {K, A, B}
T200    10/15/07    {D, A, C, E, B}
T300    10/19/07    {C, A, B, E}
T400    10/22/07    {B, A, D}


UNIT V

1. Define Clustering? [CO5-L1]

Clustering is a process of grouping the physical or conceptual data object into clusters.

2. What do you mean by Cluster Analysis? [CO5-L1]

Cluster analysis is the process of analyzing the various clusters to organize the different objects into meaningful and descriptive groups.

3. What are the fields in which clustering techniques are used? [CO5-L2]

• Clustering is used in biology to develop new plant and animal taxonomies.
• Clustering is used in business to enable marketers to develop new distinct groups of their customers and to characterize each customer group on the basis of purchasing patterns.
• Clustering is used in the identification of groups of automobile insurance policy holders.
• Clustering is used in the identification of groups of houses in a city on the basis of house type, cost and geographical location.
• Clustering is used to classify documents on the web for information discovery.

4.What are the requirements of cluster analysis? [CO5-L2]

The basic requirements of cluster analysis are:
• Dealing with different types of attributes
• Dealing with noisy data
• Constraints on clustering
• Dealing with arbitrary shapes
• High dimensionality
• Ordering of input data
• Interpretability and usability
• Determining input parameters
• Scalability

5.What are the different types of data used for cluster analysis? [CO5-L2]

The different types of data used for cluster analysis are interval scaled, binary, nominal,
ordinal and ratio scaled data.

6.What are interval scaled variables? [CO5-L1]

Interval-scaled variables are continuous measurements on a linear scale, for example, height and weight, weather temperature, or coordinates of any cluster. Distances between these measurements can be calculated using the Euclidean distance or the Minkowski distance.

7. Define Binary variables? And what are the two types of binary variables? [CO5-
L2]

A binary variable has only two states, 0 and 1, where state 0 means that the variable is absent and state 1 means that it is present. There are two types of binary variables,


symmetric and asymmetric binary variables. Symmetric binary variables are those whose states carry the same weight; asymmetric binary variables are those whose states do not carry the same weight.

8. Define nominal, ordinal and ratio scaled variables? [CO5-L1]

A nominal variable is a generalization of the binary variable. Nominal variable has more
than two states, For example, a nominal variable, color consists of four states, red,
green, yellow, or black. In Nominal variables the total number of states is N and it is
denoted by letters, symbols or integers. An ordinal variable also has more than two
states but all these states are ordered in a meaningful sequence. A ratio scaled variable
makes positive measurements on a non-linear scale, such as exponential scale, using
the formula Ae^(Bt) or Ae^(-Bt), where A and B are constants.

9. What do you mean by partitioning method? [CO5-L2]

In the partitioning method, a partitioning algorithm arranges all the objects into various


partitions, where the total number of partitions is less than the total number of objects.
Here each partition represents a cluster. The two types of partitioning method are k-
means and k-medoids.

10. Define CLARA and CLARANS? [CO5-L1]

Clustering in LARge Applications is called as CLARA. The efficiency of CLARA depends


upon the size of the representative data set. CLARA does not work properly if none of the selected representative data sets contains the best k-medoids. To overcome this drawback, a new algorithm, Clustering Large Applications based upon RANdomized search (CLARANS), was introduced. CLARANS works like
CLARA, the only difference between CLARA and CLARANS is the clustering process
that is done after selecting the representative data sets.

11. What is Hierarchical method? [CO5-L2]

Hierarchical method groups all the objects into a tree of clusters that are arranged in a
hierarchical order. This method works on bottom-up or top-down approaches.

12. Differentiate Agglomerative and Divisive Hierarchical Clustering? [CO5-L2]

The Agglomerative Hierarchical clustering method works on the bottom-up approach. In


Agglomerative hierarchical method, each object creates its own clusters. The single
Clusters are merged to make larger clusters and the process of merging continues until
all the singular clusters are merged into one big cluster that consists of all the objects.
Divisive Hierarchical clustering method works on the top-down approach. In this method
all the objects are arranged within a big singular cluster and the large cluster is
continuously divided into smaller clusters until each cluster has a single object.


13. What is CURE? [CO5-L1]

Clustering Using Representatives is called as CURE. The clustering algorithms


generally work on spherical and similar size clusters. CURE overcomes the problem of
spherical and similar size cluster and is more robust with respect to outliers.

14. Define Chameleon method? [CO5-L1]

Chameleon is another hierarchical clustering method that uses dynamic modeling.


Chameleon was introduced to overcome the drawbacks of the CURE method. In this method
two clusters are merged, if the interconnectivity between two clusters is greater than the
interconnectivity between the objects within a cluster.

15. Define Density based method? [CO5-L1]

Density based method deals with arbitrary shaped clusters. In density-based method,
clusters are formed on the basis of the region where the density of the objects is high.

16. What is a DBSCAN? [CO5-L2]

Density Based Spatial Clustering of Application Noise is called as DBSCAN. DBSCAN


is a density based clustering method that converts the high-density objects regions into
clusters with arbitrary shapes and sizes. DBSCAN defines the cluster as a maximal set
of density connected points.

17. What do you mean by Grid Based Method? [CO5-L1]

In this method objects are represented by the multi resolution grid data structure. All the
objects are quantized into a finite number of cells and the collection of cells build the
grid structure of objects. The clustering operations are performed on that grid structure.
This method is widely used because its processing time is very fast and that is
independent of number of objects.

18. What is a STING? [CO5-L1]

Statistical Information Grid is called as STING; it is a grid based multi resolution


clustering method. In STING method, all the objects are contained into rectangular cells,
these cells are kept into various levels of resolutions and these levels are arranged in a
hierarchical structure.

19. Define Wave Cluster? [CO5-L2]

It is a grid based multi resolution clustering method. In this method all the objects are
represented by a multidimensional grid structure and a wavelet transformation is applied


for finding the dense region. Each grid cell contains the information of the group of
objects that map into a cell. A wavelet transformation is a process of signaling that
produces the signal of various frequency sub bands.

20. What is Model based method? [CO5-L1]

Model-based methods are used to optimize the fit between a given data set and a mathematical model. They assume that the data are generated by underlying probability distributions. There are two basic approaches in this method: the statistical approach and the neural network approach.

21. What is the use of Regression? [CO5-L2]

Regression can be used to solve the classification problems but it can also be used for
applications such as forecasting. Regression can be performed using many different
types of techniques; in actuality, regression takes a set of data and fits the data to a formula.
22. What are the reasons for not using the linear regression model to estimate the
output data? [CO5-L2]

There are several reasons. One is that the data do not fit a linear model. It is also possible that the data do in fact represent a linear model, but the linear model generated is poor because noise or outliers exist in the data. Noise is erroneous data, and outliers are data values that are exceptions to the usual and expected data.

23. What are the two approaches used by regression to perform classification?
[CO5-L2]

Regression can be used to perform classification using the following approaches:
• Division: the data are divided into regions based on class.
• Prediction: formulas are generated to predict the output class value.

24. What do u mean by logistic regression? [CO5-L2]

Instead of fitting the data to a straight line, logistic regression uses a logistic curve. The formula for the univariate logistic curve is

p = e^(c0 + c1 x1) / (1 + e^(c0 + c1 x1))

The logistic curve gives a value between 0 and 1, so it can be interpreted as the probability of class membership.

25. What is Time Series Analysis? [CO5-L1]

A time series is a set of attribute values over a period of time. Time Series Analysis may
be viewed as finding patterns in the data and predicting future values.


26. What are the various detected patterns? [CO5-L2]

Detected patterns may include:
• Trends: systematic, non-repetitive changes to the values over time.
• Outliers: to assist in pattern detection, techniques may be needed to remove or reduce the impact of outliers.

27. What is Smoothing? [CO5-L1]

Smoothing is an approach that is used to remove the nonsystematic behaviors found in


time series. It usually takes the form of finding moving averages of attribute values. It is
used to filter out noise and outliers.

Part B
UNIT V CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING 8

1. Cluster Analysis

2. Types of Data

 Interval – scaled Variables


 Binary variables
 Categorical, Ordinal and Ratio scaled Variables
 Variables of mixed type
 Vector Objects

3. Categorization of Major Clustering Methods

1) Partitioning Methods
2) Hierarchical Methods
3) Density-Based Methods
4) Grid Based Methods
5) Model-Based Clustering Methods
6) Clustering High Dimensional Data
7) Constraint Based Cluster Analysis
8) Outlier Analysis


4. Data Mining Applications.


1. Briefly explain all the Cluster Analysis concepts with suitable examples [CO5-
H1]

Cluster. A cluster is a collection of data objects that are similar to one another within
the same cluster and are dissimilar to the objects in other clusters.

Clustering. The process of grouping a set of physical or abstract objects into classes of
similar objects is called clustering.

Cluster analysis has wide applications: market or customer segmentation, pattern recognition, biological studies, spatial data analysis, Web document classification, etc.

Cluster analysis can be used as a


 Stand-alone data mining tool to gain insight into the data distribution
 Serve as a pre-processing step for other data mining algorithms

General Applications of Clustering:

 Pattern Recognition
 Spatial Data Analysis
o Create thematic maps in Geographical information system by clustering
feature spaces
o Detect spatial clusters or for other spatial mining tasks
 Image Processing
 Economic Science (especially market research)
 World Wide Web
o Document classification
o Cluster Weblog data to discover groups of similar access patterns

Clustering Applications - Marketing, Land use, Insurance, City-planning, Earthquake studies.

Requirements of Clustering in Data Mining


 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability


2. What are the types of Data in cluster analysis? [CO5-H2]

 Interval – scaled Variables


 Binary variables
 Categorical, Ordinal and Ratio scaled Variables
 Variables of mixed type
 Vector Objects

Suppose that a data set to be clustered contains n objects, which may represent
persons, houses, documents, countries, and so on. The following two data structures are used.

Data matrix (or object-by-variable structure): This represents n objects, such as


persons, with p variables (also called measurements or attributes), such as age, height,
weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables):

Dissimilarity matrix (or object-by-object structure): This stores a collection of


proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:
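The two structures can be written out as follows (standard forms; the matrices appear as figures in the original notes):

```latex
\[
\text{Data matrix: }
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p}\\
\vdots &        & \vdots &        & \vdots\\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\qquad
\text{Dissimilarity matrix: }
\begin{bmatrix}
0      &        &        & \\
d(2,1) & 0      &        & \\
d(3,1) & d(3,2) & 0      & \\
\vdots & \vdots & \vdots & \ddots\\
d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}
\]
```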

Measure the Quality of Clustering

 Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function,


typically metric: d(i, j)
 There is a separate “quality” function that measures the “goodness” of a cluster.
 The definitions of distance functions are usually very different for interval-scaled,
boolean, categorical, ordinal ratio, and vector variables.
 Weights should be associated with different variables based on applications and
data semantics.


 It is hard to define “similar enough” or “good enough”


o the answer is typically highly subjective.

1.Interval – scaled Variables - Euclidean distance, Manhattan distance

Interval-scaled variables are continuous measurements of a roughly linear scale.


Examples -weight and height, latitude and longitude coordinates and weather
temperature.

After standardization, or without standardization in certain applications, the dissimilarity


or similarity between the objects described by interval-scaled variables is typically
computed based on the distance between each pair of objects.

1). The most popular distance measure is Euclidean distance, which is defined as

d(i, j) = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)² )

where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two n-dimensional data objects.

2). Another metric is Manhattan (city block) distance, defined as

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function:
1. d(i, j) ≥ 0: distance is a nonnegative number.
2. d(i, i) = 0: the distance of an object to itself is 0.
3. d(i, j) = d(j, i): distance is symmetric.
4. d(i, j) ≤ d(i, k) + d(k, j): the triangle inequality.
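A small Python sketch (not part of the original notes) of the two distance measures defined above:

```python
import math

def euclidean(i, j):
    """d(i, j) = sqrt(sum_f (x_if - x_jf)^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    """d(i, j) = sum_f |x_if - x_jf|"""
    return sum(abs(a - b) for a, b in zip(i, j))

x1, x2 = (1.0, 2.0), (3.0, 5.0)
print(euclidean(x1, x2))   # sqrt(13) = 3.605...
print(manhattan(x1, x2))   # 5.0
```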


2. Binary variables

 Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t)
 Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s)
 Jaccard coefficient (similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s)

Here q is the number of variables that equal 1 for both objects i and j, r is the number that equal 1 for i but 0 for j, s is the number that equal 0 for i but 1 for j, and t is the number that equal 0 for both.


 Dissimilarity between Binary Variables:

 gender is a symmetric attribute


 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0

3. Categorical, Ordinal and Ratio scaled Variables

A categorical variable is a generalization of the binary variable in that it can take on


more
than two states. For example, map colour is a categorical variable that may have, say,
five
states: red, yellow, green, pink, and blue.

The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

d(i, j) = (p − m) / p

where m is the number of matches (i.e., the number of variables for which i and j are
in the same state), and p is the total number of variables.

Dissimilarity between categorical variables


Considering only the object-identifier and the categorical variable test-1, the dissimilarities between the four objects can be computed using the above equation.

Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 The values of an ordinal variable can be mapped to ranks. For example, suppose
that an ordinal variable f has Mf states. These ordered states define the ranking 1,
….., Mf .

 Ordinal variables are handled as follows: (1) replace each value x_if by its rank r_if ∈ {1, ..., Mf}; (2) normalize each rank onto [0.0, 1.0] by z_if = (r_if − 1) / (Mf − 1); (3) compute the dissimilarity using a distance measure for interval-scaled variables on the z_if values.

 Dissimilarity between ordinal variables.


o From above table consider only the object-identifier and the continuous ordinal
variable, test-2, are available. There are 3 states for test-2, namely fair, good, and
excellent, that is Mf =3.
o For step 1, if we replace each value for test-2 by its rank, the four objects are
assigned the ranks 3, 1, 2, and 3, respectively.
o Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to
1.0.
o For step 3, we can use, say, the Euclidean distance, which results in
the following dissimilarity matrix:

Ratio scaled Variables

 A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula Ae^(Bt) or Ae^(−Bt), where A and B are positive constants and t typically represents time. Examples: the growth of a bacteria population, the decay of a radioactive element.

 A common method to handle ratio-scaled variables when computing the dissimilarity between objects is to apply a logarithmic transformation, y_if = log(x_if), and then treat the transformed values as interval-scaled.

 Dissimilarity between ratio-scaled variables.

o This time, from the above table, consider only the object-identifier and the ratio-scaled variable, test-3.
o Logarithmic transformation of the log of test-3 results in the values 2.65, 1.34, 2.21,
and 3.08 for the objects 1 to 4, respectively.
o Using the Euclidean distance on the transformed values, we obtain the following
dissimilarity matrix:

4. Variables of Mixed Types


 A database may contain all six types of variables:
o symmetric binary, asymmetric binary, nominal (categorical), ordinal, interval-scaled, and ratio-scaled
 One may use a weighted formula to combine their effects:

d(i, j) = ( Σ_f δ_ij^(f) · d_ij^(f) ) / ( Σ_f δ_ij^(f) )

where the indicator δ_ij^(f) = 0 if x_if or x_jf is missing (or if both values are 0 for an asymmetric binary variable), and 1 otherwise. The per-variable contribution d_ij^(f) depends on the type of variable f:

o f is interval-based: d_ij^(f) = |x_if - x_jf| / (max_h x_hf - min_h x_hf)
o f is binary or categorical: d_ij^(f) = 0 if x_if = x_jf, and 1 otherwise
o f is ordinal: compute the rank r_if and z_if = (r_if - 1) / (Mf - 1), and treat z_if as interval-scaled
o f is ratio-scaled: either apply a logarithmic transformation and treat the result as interval-scaled, or treat x_if as continuous ordinal data and use its rank
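The following hedged Python sketch combines variables of mixed types with the weighted formula above; the argument names (types, ranges, Mf) are illustrative assumptions, and ratio-scaled values are assumed positive so that the log transform applies:

import math

def mixed_dissimilarity(x, y, types, ranges=None, Mf=None):
    # types[f] is one of 'interval', 'binary', 'categorical', 'ordinal', 'ratio'.
    # ranges[f] = (min, max) over all objects for interval/ratio variables.
    # Mf[f] = number of ordered states for ordinal variables (values given as ranks 1..Mf).
    num, den = 0.0, 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:          # missing value: delta_ij^(f) = 0
            continue
        t = types[f]
        if t in ('binary', 'categorical'):
            d = 0.0 if a == b else 1.0
        elif t == 'ordinal':                # map rank to [0, 1], then treat as interval-scaled
            d = abs((a - 1) / (Mf[f] - 1) - (b - 1) / (Mf[f] - 1))
        elif t == 'ratio':                  # log-transform, then treat as interval-scaled
            lo, hi = ranges[f]
            d = abs(math.log(a) - math.log(b)) / (math.log(hi) - math.log(lo))
        else:                               # interval-scaled
            lo, hi = ranges[f]
            d = abs(a - b) / (hi - lo)
        num += d
        den += 1.0
    return num / den if den else 0.0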

5. Vector objects:

 Vector objects: keywords in documents, gene features in micro-arrays, etc.


 Broad applications: information retrieval, biologic taxonomy, etc.
 A similarity function, s(x, y), is defined to compare two vectors x and y.

Cosine measure: s(x, y) = (x^t · y) / (||x|| ||y||), where x^t is the transpose of x, ||x|| is the Euclidean norm of x, and s is the cosine of the angle between vectors x and y.

 A variant: Tanimoto coefficient: s(x, y) = (x^t · y) / (x^t · x + y^t · y - x^t · y)
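A minimal Python sketch of the cosine measure and the Tanimoto coefficient (the example vectors are made up keyword-occurrence vectors):

import math

def cosine_similarity(x, y):
    # s(x, y) = x . y / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def tanimoto_coefficient(x, y):
    # s(x, y) = x . y / (x . x + y . y - x . y)
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

x = [1, 1, 0, 1, 0]   # e.g., keyword occurrences in document 1
y = [1, 1, 0, 0, 1]   # e.g., keyword occurrences in document 2
print(cosine_similarity(x, y))     # ~0.667
print(tanimoto_coefficient(x, y))  # 0.5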


3. Categorize the major clustering methods in detail. [CO5-H2]

Clustering is a dynamic field of research in data mining. Many clustering algorithms have been developed. These can be categorized into (i) partitioning methods, (ii) hierarchical methods, (iii) density-based methods, (iv) grid-based methods, (v) model-based methods, (vi) methods for high-dimensional data, and (vii) constraint-based methods.

A partitioning method first creates an initial set of k partitions, where parameter k is the number of partitions to construct. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. Typical partitioning methods include k-means, k-medoids, CLARANS, and their improvements.

A hierarchical method creates a hierarchical decomposition of the given set of data objects. The method can be classified as being either agglomerative (bottom-up) or divisive (top-down), based on how the hierarchical decomposition is formed. To compensate for the rigidity of merge or split, the quality of hierarchical agglomeration can be improved by analyzing object linkages at each hierarchical partitioning (such as in ROCK and Chameleon), or by first performing microclustering (that is, grouping objects into "microclusters") and then operating on the microclusters with other clustering techniques, such as iterative relocation (as in BIRCH).

A density-based method clusters objects based on the notion of density. It either grows clusters according to the density of neighborhood objects (such as in DBSCAN) or according to some density function (such as in DENCLUE). OPTICS is a density-based method that generates an augmented ordering of the clustering structure of the data.

A grid-based method first quantizes the object space into a finite number of cells that
form a grid structure, and then performs clustering on the grid structure. STING is a
typical example of a grid-based method based on statistical information stored in grid
cells. WaveCluster and CLIQUE are two clustering algorithms that are both grid based
and density-based.

A model-based method hypothesizes a model for each of the clusters and finds the
best fit of the data to that model. Examples of model-based clustering include the EM
algorithm (which uses a mixture density model), conceptual clustering (such as
COBWEB), and neural network approaches (such as self-organizing feature maps).

Clustering high-dimensional data is of vital importance, because in many advanced applications, data objects such as text documents and microarray data are high-dimensional in nature. There are three typical methods to handle high-dimensional data sets: dimension-growth subspace clustering, represented by CLIQUE; dimension-reduction projected clustering, represented by PROCLUS; and frequent pattern-based clustering, represented by pCluster.

A constraint-based clustering method groups objects based on application-dependent or user-specified constraints. Typical examples include clustering with the existence of obstacle objects, clustering under user-specified constraints, and semi-supervised clustering based on "weak" supervision (such as pairs of objects labeled as belonging to the same or different clusters).

One person’s noise could be another person’s signal. Outlier detection and analysis
are very useful for fraud detection, customized marketing, medical analysis, and many
other tasks. Computer-based outlier analysis methods typically follow either a statistical
distribution-based approach, a distance-based approach, a density-based local outlier
detection approach, or a deviation-based approach.

Partitioning Methods

Given D, a data set of n objects, and k, the number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k ≤ n), where each partition represents
a cluster. The commonly used partitioning methods are (i). k-means, (ii). k-medoids.

Centroid-Based Technique: The k-Means Method

o k-means: each cluster's center is represented by the mean value of the objects in the cluster, i.e., each cluster is represented by its centroid.


o Algorithm

Input:
k: the number of clusters,
D: a data set containing n objects.

Output: A set of k clusters.

Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on
the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each
cluster;
(5) until no change;
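A minimal Python sketch of the k-means method above (an illustration, not the notes' own code); it assumes the data points are numeric tuples and that random initial centers are acceptable:

import random

def k_means(points, k, max_iter=100):
    centers = random.sample(points, k)              # (1) arbitrarily choose k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                            # (3) (re)assign each object to nearest center
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        new_centers = []
        for c, cluster in enumerate(clusters):      # (4) update the cluster means
            if cluster:
                new_centers.append(tuple(sum(vals) / len(cluster) for vals in zip(*cluster)))
            else:
                new_centers.append(centers[c])      # keep the old center for an empty cluster
        if new_centers == centers:                  # (5) until no change
            break
        centers = new_centers
    return centers, clusters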

o Strength: Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.


o Comment: Often terminates at a local optimum. The global optimum may be found
using techniques such as: deterministic annealing and genetic algorithms
o Weakness
 Applicable only when mean is defined, then what about categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers


 Not suitable to discover clusters with non-convex shapes

Representative Object-Based Technique: The k-Medoids Method

o The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

o k-medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

[Figure: two 2-D example plots (axes 0 to 10) illustrating the k-medoids idea; axis tick labels omitted.]

The K-Medoids Clustering Methods

Find representative objects, called medoids, in clusters


1. PAM (Partitioning Around Medoids, 1987)
o starts from an initial set of medoids and iteratively replaces one of the medoids
by one of the non-medoids if it improves the total distance of the resulting
clustering
o PAM works effectively for small data sets, but does not scale well for large data
sets
2. CLARA (Clustering LARge Applications): sampling-based method
3. CLARANS (Ng & Han, 1994): Randomized sampling

PAM (Partitioning Around Medoids)


Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central objects.

Input:
k: the number of clusters,
D: a data set containing n objects.

Output: A set of k clusters.

Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, Orandom;
(5) compute the total cost, S, of swapping representative object, Oj, with Orandom;
(6) if S < 0 then swap Oj with Orandom to form the new set of k representative objects;
(7) until no change
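A hedged Python sketch of a PAM-style procedure. It follows the swap-and-compare idea above, but exhaustively tries every (medoid, non-medoid) swap per pass instead of picking a single O_random at random; the helper names are illustrative:

import random

def total_cost(points, medoids, dist):
    # Sum of distances from each point to its nearest representative object
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_like(points, k, dist, max_iter=20):
    medoids = random.sample(points, k)                 # (1) arbitrary initial representatives
    for _ in range(max_iter):
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                # S = cost(after swap) - cost(before swap); accept the swap if S < 0
                if total_cost(points, candidate, dist) < total_cost(points, medoids, dist):
                    medoids, improved = candidate, True
        if not improved:                               # until no change
            break
    return medoids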

CLARA (Clustering LARge Applications) - Sampling based method

 PAM works efficiently for small data sets but does not scale well for large data sets.


 Built into statistical analysis packages, such as S+


 It draws multiple samples of the data set, applies PAM on each sample, and gives
the best clustering as the output
 Strength: deals with larger data sets than PAM
 Weakness:
o Efficiency depends on the sample size
o A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased

3. CLARANS (“Randomized” CLARA)

 CLARANS (A Clustering Algorithm based on Randomized Search)


 CLARANS draws a sample of neighbors dynamically.
 The clustering process can be presented as searching a graph where every node is
a potential solution, that is, a set of k medoids
 If the local optimum is found, CLARANS starts with new randomly selected node in
search for a new local optimum
 It is more efficient and scalable than both PAM and CLARA
 Focusing techniques and spatial access structures may further improve its
performance


4. Which hierarchical clustering method is called agglomerative? [CO5-H2]

A hierarchical clustering method works by grouping data objects into a tree of clusters.
Hierarchical clustering methods can be further classified as either agglomerative or
divisive,
depending on whether the hierarchical decomposition is formed in a bottom-up
(merging) or top-down (splitting) fashion.

There are two types of hierarchical clustering methods:

Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each


object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.

Divisive hierarchical clustering: This top-down strategy does the reverse of


agglomerative hierarchical clustering by starting with all objects in one cluster. It
subdivides the cluster into smaller and smaller pieces, until each object forms a cluster
on its own or until it satisfies certain termination conditions, such as a desired number of
clusters is obtained.

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired
level, then each connected component forms a cluster.
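An illustrative sketch (assuming SciPy is available) of agglomerative clustering and of cutting the resulting dendrogram at a desired level; the data and the choice of three clusters are made up:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
Z = linkage(X, method='average')                   # bottom-up (agglomerative) merging
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the dendrogram into 3 clusters
print(labels)                                      # e.g. [1 1 2 2 3] (cluster id per object)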


 Major weakness of agglomerative clustering methods


o do not scale well: time complexity of at least O(n^2), where n is the total number of objects
o can never undo what was done previously

 Integration of hierarchical with distance-based clustering


o BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
o ROCK (1999): clustering categorical data by neighbor and link analysis
o CHAMELEON (1999): hierarchical clustering using dynamic modeling

BIRCH (1996):
 Birch: Balanced Iterative Reducing and Clustering using Hierarchies
 Scales linearly: finds a good clustering with a single scan and improves the quality
with a few additional scans
 Weakness: handles only numeric data, and sensitive to the order of the data record.

Cluster Feature (CF)

 A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.

 Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:
o Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
o Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree


 A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
o A nonleaf node in a tree has descendants or “children”
o The nonleaf nodes store sums of the CFs of their children

 A CF tree has two parameters


o Branching factor: specify the maximum number of children.
o threshold: max diameter of sub-clusters stored at the leaf nodes
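A simplified Python sketch of a clustering feature with incremental insertion and the additive merge used when building a CF tree; here the sum of squares is kept per dimension, which is a simplification of the usual scalar definition:

class CF:
    # Clustering Feature for a set of d-dimensional points:
    # N = number of points, LS = linear sum, SS = sum of squares (kept per dimension here)
    def __init__(self, d):
        self.N = 0
        self.LS = [0.0] * d
        self.SS = [0.0] * d

    def add_point(self, x):            # incremental insertion (BIRCH phase 1)
        self.N += 1
        for i, v in enumerate(x):
            self.LS[i] += v
            self.SS[i] += v * v

    def merge(self, other):            # CF additivity: CF1 + CF2
        self.N += other.N
        self.LS = [a + b for a, b in zip(self.LS, other.LS)]
        self.SS = [a + b for a, b in zip(self.SS, other.SS)]

    def centroid(self):
        return [s / self.N for s in self.LS]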

ROCK: A Hierarchical Clustering Method

 ROCK: RObust Clustering using linKs


 Major ideas
o Use links to measure similarity/proximity
o Not distance-based
o Computational complexity: O(n^2 + n*m_m*m_a + n^2 log n), where m_m is the maximum number of neighbors and m_a the average number of neighbors for an object

 Algorithm: sampling-based clustering


o Draw random sample
o Cluster with links
o Label data in disk
 Experiments
o Congressional voting, mushroom data

 Similarity Measure in ROCK


o Example: Two groups (clusters) of transactions


o C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d},
{b, c, e}, {b, d, e}, {c, d, e}

o C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
o Jaccard co-efficient may lead to a wrong clustering result
o C1: similarities range from 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
o C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
o Jaccard co-efficient-based similarity function: sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
o Ex. Let T1 = {a, b, c}, T2 = {c, d, e}; then sim(T1, T2) = 1/5 = 0.2

 Link Measure in ROCK

o Links: no. of common neighbors


o C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d},
 {b, c, e}, {b, d, e}, {c, d, e}

o C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}

o Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}

o link(T1, T2) = 4, since they have 4 common neighbors


 {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
o link(T1, T3) = 3, since they have 3 common neighbors
 {a, b, d}, {a, b, e}, {a, b, g}
o Thus link is a better measure than Jaccard coefficient
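A hedged Python sketch of the link idea: two transactions are neighbors if their Jaccard similarity is at least a threshold theta (0.5 is an assumed value), and link(T1, T2) counts their common neighbors:

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def links(transactions, theta=0.5):
    # neighbor[i] = indices of transactions whose Jaccard similarity with i is >= theta
    n = len(transactions)
    neighbor = [[j for j in range(n) if j != i
                 and jaccard(transactions[i], transactions[j]) >= theta]
                for i in range(n)]
    # link(i, j) = number of common neighbors of i and j
    return {(i, j): len(set(neighbor[i]) & set(neighbor[j]))
            for i in range(n) for j in range(i + 1, n)}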

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)

 Measures the similarity based on a dynamic model


o Two clusters are merged only if the interconnectivity and closeness (proximity)
between two clusters are high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
o CURE ignores information about the interconnectivity of the objects, while ROCK ignores information about the closeness of two clusters
 A two-phase algorithm
o Use a graph partitioning algorithm: cluster objects into a large number of
relatively small sub-clusters


o Use an agglomerative hierarchical clustering algorithm: find the genuine clusters


by repeatedly combining these sub-clusters.

5. Explain density-based clustering methods. [CO5-H2]

Density-based clustering methods have been developed to discover clusters with arbitrary shape.

 Major features:
o Discover clusters of arbitrary shape
o Handle noise
o One scan
o Need density parameters as termination condition
Methods: (1) DBSCAN, (2) OPTICS, (3) DENCLUE

1). DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density

 DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm.
 The algorithm grows regions with sufficiently high density into clusters and
discovers clusters of arbitrary shape in spatial databases with noise. It defines a
cluster as a maximal set of density-connected points.
 Density-reachability and density-connectivity.
 Consider the figure for a given ε, represented by the radius of the circles, and MinPts = 3.
 Labeled points m, p, o, and r are core objects because each is in an ε-neighbourhood containing at least three points.
 q is directly density-reachable from m. m is directly density-reachable from p and vice versa.
 q is (indirectly) density-reachable from p because q is directly density-reachable from m and m is directly density-reachable from p. However, p is not density-reachable from q because q is not a core object. Similarly, r and s are density-reachable from o, and o is density-reachable from r.


 o, r, and s are all density-connected.

 DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database. If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with p as a core object is created.
 DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve the merging of a few density-reachable clusters. The process terminates when no new point can be added to any cluster.
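An illustrative run of density-based clustering, assuming scikit-learn is available; eps plays the role of the ε-neighbourhood radius and min_samples the role of MinPts, and the data are made up:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],   # first dense region -> one cluster
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],   # second dense region -> another cluster
              [4.0, 9.0]])                           # isolated point -> noise
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; -1 marks noise (objects assigned to no cluster)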

2). OPTICS: Ordering Points To Identify the Clustering Structure

OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis. The cluster ordering can be used to extract basic clustering information, such as cluster centers or arbitrary-shaped clusters, as well as provide the intrinsic clustering structure.

Fig : OPTICS terminology.


Core-distance and reachability-distance.

 Figure illustrates the concepts of core distance and reachability distance.


 Suppose that ε = 6 mm and MinPts = 5.
 The core-distance of p is the distance, ε', between p and the fourth closest data object.
 The reachability-distance of q1 with respect to p is the core-distance of p (i.e., ε' = 3 mm) because this is greater than the Euclidean distance from p to q1.
 The reachability-distance of q2 with respect to p is the Euclidean distance from p to q2 because this is greater than the core-distance of p.

Fig :Cluster ordering in OPTICS

For example, the above figure shows the reachability plot for a simple two-dimensional data set, which presents a general overview of how the data are structured and clustered.
The data objects are plotted in cluster order (horizontal axis) together with their
respective reachability-distance (vertical axis). The three Gaussian “bumps” in the plot
reflect three clusters in the data set.
3). DENCLUE (DENsity-based CLUstEring)
Clustering Based on Density Distribution Functions

DENCLUE is a clustering method based on a set of density distribution functions. The method is built on the following ideas:

(1) the influence of each data point can be formally modeled using a mathematical
function called an influence function, which describes the impact of a data point within
its neighborhood;


(2) the overall density of the data space can be modeled analytically as the sum of the
influence function applied to all data points.

(3) clusters can then be determined mathematically by identifying density attractors,


where density attractors are local maxima of the overall density function.

Fig. Possible density functions for a 2-D data set.
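A minimal Python sketch of the density modelling idea above, using a Gaussian influence function summed over all data points; sigma is an assumed smoothing parameter:

import math

def gaussian_density(x, data, sigma=1.0):
    # Overall density at x = sum over all data points p of their Gaussian influence
    # exp(-d(x, p)^2 / (2 * sigma^2)); density attractors are local maxima of this function.
    return sum(math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * sigma ** 2))
               for p in data)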

Advantages

 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
 Significantly faster than existing algorithms such as DBSCAN

6. Explain the grid-based method STING.

The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed.

STING: STatistical INformation Grid


 STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells. These cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level.
 Statistical parameters of higher-level cells can easily be computed from the
parameters of the lower-level cells.
 These parameters includes
o Attribute independent parameter, count;
o Attribute dependent parameters, mean, stdev (standard deviation), min ,
max.
o Attribute type of distribution such as normal, uniform, exponential, or none.
 When the data are loaded into the database, the parameters count, mean, stdev,
min, and max of the bottom-level cells are calculated directly from the data.
 The value of distribution may either be assigned by the user if the distribution
type is known beforehand or obtained by hypothesis tests such as the χ² (chi-square) test.
 The type of distribution of a higher-level cell can be computed based on the
majority of distribution types of its corresponding lower-level cells in conjunction with
a threshold filtering process.
 If the distributions of the lower level cells disagree with each other and fail the
threshold test, the distribution type of the high-level cell is set to none.
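A small Python sketch (not from the notes) showing how a higher-level STING cell's statistics could be computed from its child cells' parameters; the dictionary keys are illustrative:

import math

def combine_cells(children):
    # children: list of dicts with keys 'count', 'mean', 'stdev', 'min', 'max'
    n = sum(c['count'] for c in children)
    mean = sum(c['count'] * c['mean'] for c in children) / n
    # recover E[X^2] of the parent from each child's mean and stdev, then combine
    ex2 = sum(c['count'] * (c['stdev'] ** 2 + c['mean'] ** 2) for c in children) / n
    stdev = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return {'count': n,
            'mean': mean,
            'stdev': stdev,
            'min': min(c['min'] for c in children),
            'max': max(c['max'] for c in children)}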

WaveCluster: Clustering Using Wavelet Transformation


 WaveCluster is a multiresolution clustering algorithm that summarizes the data by imposing a multidimensional grid structure onto the data space.


 It then uses a wavelet transformation to transform the original feature space, finding
dense regions in the transformed space.
 A wavelet transform is a signal processing technique that decomposes a signal into
different frequency subbands.
 The wavelet model can be applied to d-dimensional signals by applying a one-dimensional wavelet transform d times.
 In applying a wavelet transform, data are transformed so as to preserve the relative distance between objects at different levels of resolution. This allows the natural clusters in the data to become more distinguishable.
 Clusters can then be identified by searching for dense regions in the new domain.

Advantages:

 It provides unsupervised clustering.


 The multiresolution property of wavelet transformations can help detect clusters at varying levels of accuracy.
 Wavelet-based clustering is very fast and can be made parallel.

7. Model-Based Clustering Methods [CO5-H2]

Model-based clustering methods attempt to optimize the fit between the given data and
some mathematical model. Such methods are often based on the assumption that the
data are generated by a mixture of underlying probability distributions.

 Typical methods
o Statistical approach
 EM (Expectation maximization), AutoClass
o Machine learning approach
 COBWEB, CLASSIT
o Neural network approach
 SOM (Self-Organizing Feature Map)

(i). Statistical approach : EM (Expectation maximization),

 EM — A popular iterative refinement algorithm


 An extension to k-means
o Assign each object to a cluster according to a weight (prob. distribution)
o New means are computed based on weighted measures
 General idea
o Starts with an initial estimate of the parameter vector
o Iteratively rescores the patterns against the mixture density produced by the
parameter vector
o The rescored patterns are used to update the parameter estimates
o Patterns are considered as belonging to the same cluster if they are placed by their scores in the same component
 Algorithm converges fast but may not be in global optima

The EM (Expectation Maximization) Algorithm

 Initially, randomly assign k cluster centers


 Iteratively refine the clusters based on two steps
o Expectation step: assign each data point x_i to cluster C_k with the probability
  P(x_i ∈ C_k) = p(C_k | x_i) = p(C_k) p(x_i | C_k) / p(x_i)
o Maximization step: re-estimate the model parameters from the current probability-weighted assignments, e.g.,
  m_k = (1/N) Σ_i [ x_i · P(x_i ∈ C_k) / Σ_j P(x_i ∈ C_j) ]
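An illustrative use of EM-based mixture clustering, assuming scikit-learn is available (GaussianMixture is fit with EM); the data and number of components are made up:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.concatenate([np.random.normal(0, 1, (100, 2)),   # two synthetic Gaussian groups
                    np.random.normal(6, 1, (100, 2))])
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.means_)                 # estimated cluster centers (maximization-step output)
print(gm.predict_proba(X[:3]))   # soft assignments P(cluster | point) (expectation step)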

(ii). Machine learning approach ( COBWEB)

 Conceptual clustering
o A form of clustering in machine learning
o Produces a classification scheme for a set of unlabeled objects
o Finds characteristic description for each concept (class)

 COBWEB (Fisher’87)
o A popular and simple method of incremental conceptual learning
o Creates a hierarchical clustering in the form of a classification tree
o Each node refers to a concept and contains a probabilistic description of that
concept

 Fig. A classification Tree for a set of animal data.


 Working method:

o For a given new object, COBWEB decides where to incorporate it into the classification tree. To do this, COBWEB descends the tree along an appropriate path, updating counts along the way, in search of the "best host" or node at which to classify the object.

o If the object does not really belong to any of the concepts represented in the tree, it may be better to create a new node for it. The object is then placed in an existing class, or a new class is created for it, based on the partition with the highest category utility value.

 Limitations of COBWEB
o The assumption that the attributes are independent of each other is often too
strong because correlation may exist
o Not suitable for clustering large database data – skewed tree and expensive
probability distributions

 . CLASSIT
o an extension of COBWEB for incremental clustering of continuous data
o suffers similar problems as COBWEB

(iii). Neural network approach - SOM (Self-Organizing Feature Map)

 Neural network approaches


o Represent each cluster as an exemplar, acting as a “prototype” of the cluster
o New objects are distributed to the cluster whose exemplar is the most similar
according to some distance measure
 Typical methods
o SOM (Self-Organizing Feature Map)
o Competitive learning
 Involves a hierarchical architecture of several units (neurons)


 Neurons compete in a "winner-takes-all" fashion for the object currently being presented

SOM (Self-Organizing Feature Map)


 SOMs, also called topologically ordered maps or Kohonen Self-Organizing Feature Maps (KSOMs)
 It maps all the points in a high-dimensional source space into a 2 to 3-d target
space, such that the distance and proximity relationship (i.e., topology) are
preserved as much as possible.
 Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the
feature space
 Clustering is performed by having several units competing for the current object
o The unit whose weight vector is closest to the current object wins
o The winner and its neighbors learn by having their weights adjusted
 SOMs are believed to resemble processing that can occur in the brain.
 Useful for visualizing high-dimensional data in 2- or 3-D space.

8. Clustering High-Dimensional Data [CO5-H2]

 Clustering high-dimensional data


o Many applications: text documents, DNA micro-array data
o Major challenges:
 Many irrelevant dimensions may mask clusters
 Distance measure becomes meaningless—due to equi-distance
 Clusters may exist only in some subspaces
 Methods
o Feature transformation: only effective if most dimensions are relevant
 PCA & SVD useful only when features are highly correlated/redundant
o Feature selection: wrapper or filter approaches
 useful to find a subspace where the data have nice clusters
o Subspace-clustering: find clusters in all the possible subspaces
 CLIQUE, ProClus, and frequent pattern-based clustering

(i).CLIQUE: A Dimension-Growth Subspace Clustering Method

 CLIQUE (CLustering InQUEst)


 CLIQUE automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
 CLIQUE can be considered as both density-based and grid-based
o It partitions each dimension into the same number of equal length interval
o It partitions an m-dimensional data space into non-overlapping rectangular
units
o A unit is dense if the fraction of total data points contained in the unit exceeds
the input model parameter


o A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps

 Partition the data space and find the number of points that lie inside each cell of the
partition.
 Identify the subspaces that contain clusters using the Apriori principle
 Identify clusters
o Determine dense units in all subspaces of interests
o Determine connected dense units in all subspaces of interests.
 Generate minimal description for the clusters
o Determine maximal regions that cover a cluster of connected dense units for
each cluster
o Determination of minimal cover for each cluster.
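A minimal Python sketch of CLIQUE's first step for a single dimension (finding dense one-dimensional units); the data, interval count, and density threshold are illustrative:

from collections import Counter

def dense_units_1d(points, dim, n_intervals, lo, hi, threshold):
    # Partition dimension `dim` of the space [lo, hi) into equal-length intervals and
    # keep the interval indices whose fraction of points exceeds the density threshold.
    width = (hi - lo) / n_intervals
    counts = Counter(min(int((p[dim] - lo) / width), n_intervals - 1) for p in points)
    return {u for u, c in counts.items() if c / len(points) > threshold}

data = [(25, 30000), (27, 32000), (28, 31000), (60, 90000)]
print(dense_units_1d(data, dim=0, n_intervals=10, lo=0, hi=100, threshold=0.25))
# e.g. {2}: the age interval [20, 30) is dense; candidate units of higher dimensionality
# are then formed by intersecting dense units of lower dimensionality (Apriori principle).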


Fig .Dense units found with respect to age for the dimensions salary and vacation are
intersected in order to provide a candidate search space for dense units of higher
dimensionality.

 Strength
o automatically finds subspaces of the highest dimensionality such that high
density clusters exist in those subspaces
o insensitive to the order of records in input and does not presume some
canonical data distribution
o scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
 Weakness
o The accuracy of the clustering result may be degraded at the expense of
simplicity of the method

(ii). PROCLUS: A Dimension-Reduction Subspace Clustering Method

 PROCLUS (PROjected CLUStering) is a typical dimension-reduction subspace clustering method.
 It starts by finding an initial approximation of the clusters in the high-dimensional attribute space.
 Each dimension is then assigned a weight for each cluster, and the updated weights are used in the next iteration to regenerate the clusters.
 This leads to the search for dense regions in all subspaces of some desired dimensionality and avoids the generation of a large number of overlapped clusters in projected dimensions of lower dimensionality.
 The PROCLUS algorithm consists of three phases: initialization, iteration, and
cluster refinement.

(iii). Frequent Pattern–Based Clustering Methods


o Frequent pattern mining can be applied to clustering, resulting in frequent pattern-based cluster analysis.
o Frequent pattern mining - searches for patterns (such as sets of items or objects)
that occur frequently in large data sets.
o Frequent pattern mining can lead to the discovery of interesting associations and
correlations among data objects.

Two forms of frequent pattern–based cluster analysis:


o Frequent term–based text clustering.
o Clustering by pattern similarity in microarray data analysis.

(a).Frequent term–based text clustering.

 Text documents are clustered based on the frequent terms they contain. A term
can be made up of a single word or several words. Terms are then extracted.
 A stemming algorithm is then applied to reduce each term to its basic stem. In
this way, each document can be represented as a set of terms. Each set is
typically large. Collectively, a large set of documents will contain a very large set
of different terms.
 Advantage: It automatically generates a description for the generated clusters in
terms of their frequent term sets.

(b). Clustering by pattern similarity in DNA microarray data analysis (pClustering)

o Figure.1 shows a fragment of microarray data containing only three genes (taken as
“objects” ) and ten attributes (columns a to j ).
o If two subsets of attributes, {b, c, h, j, e} and {f, d, a, g, i}, are selected and plotted as in Figure 2(a) and (b) respectively:
o Figure. 2(a) forms a shift pattern, where the three curves are similar to each other
with respect to a shift operation along the y-axis.
o Figure.2(b) forms a scaling pattern, where the three curves are similar to each other
with respect to a scaling operation along the y-axis.


Fig: Raw data from a fragment of microarray data containing only 3 objects and 10
attributes

Fig. 2: Objects in Figure 1 form (a) a shift pattern in subspace {b, c, h, j, e} and (b) a scaling pattern in subspace {f, d, a, g, i}.

9. Constraint-Based Cluster Analysis [CO5-H2]

Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt different approaches.

Different constraints in cluster analysis


i. Constraints on individual objects. (E.g Cluster on houses worth over $300K)
ii. Constraints on the selection of clustering parameters.


iii. Constraints on distance or similarity functions (e.g., weighted functions, obstacles such as rivers and lakes)
iv. User-specified constraints on the properties of individual clusters. (no.,of clusters,
MinPts)
v. Semi-supervised clustering based on “partial” supervision.(e.g., Contain at least
500 valued customers and 5000 ordinary ones)

I. Constraints on distance or similarity functions: Clustering with obstacle objects (an obstacle is a physical barrier, such as a river, that makes direct travel difficult)

Clustering with obstacle objects using a partitioning approach requires that the distance
between each object and its corresponding cluster center be re-evaluated at each
iteration whenever the cluster center is changed.

e.g A city may have rivers, bridges, highways, lakes, and mountains. We do not want to
swim across a river to reach an ATM.

Approach for the problem of clustering with obstacles.

Fig(a) :First, a point, p, is visible from another point, q, in Region R, if the straight line
joining p and q does not intersect any obstacles.


The shortest path between two points, p and q, will be a subpath of VG’ as shown in
Figure (a).
We see that it begins with an edge from p to either v1, v2, or v3, goes through some
path in VG, and then ends with an edge from either v4 or v5 to q.

Fig.(b).To reduce the cost of distance computation between any two pairs of objects,
microclusters techniques can be used. This can be done by first triangulating the region
R into triangles, and then grouping nearby points in the same triangle into microclusters,
as shown in Figure (b).

After that, precomputation can be performed to build two kinds of join indices based on
the shortest paths:
o VV index: indices for any pair of obstacle vertices
o MV index: indices for any pair of micro-cluster and obstacle indices

II. User-specified constraints on the properties of individual clusters

 e.g., A parcel delivery company with n customers would like to determine locations
for k service stations so as to minimize the traveling distance between customers
and service stations.
 The company’s customers are considered as either high-value customers (requiring
frequent, regular services) or ordinary customers (requiring occasional services).
 The manager has specified two constraints: each station should serve (1) at least
100 high-value customers and (2) at least 5,000 ordinary customers.

 Proposed approach to solve the above problem:


o Find an initial “solution” by partitioning the data set into k groups and satisfying
user-constraints
o Iteratively refine the solution by micro-clustering relocation (e.g., moving δ μ-
clusters from cluster Ci to Cj) and “deadlock” handling (break the microclusters
when necessary)
o Efficiency is improved by micro-clustering

III. Semi-supervised clustering


Clustering process based on user feedback or guidance constraints is called semi-
supervised clustering.

Methods for semi-supervised clustering can be categorized into two classes:


(1).constraint-based semi-supervised clustering
(2).distance-based semi-supervised clustering.

Constraint-based semi-supervised clustering relies on user-provided labels or constraints to guide the algorithm toward a more suitable data partitioning. This includes modifying the objective function based on constraints, or initializing and constraining the clustering process based on the labeled objects.

Distance-based semi-supervised clustering employs a distance measure that is trained to satisfy the constraints in the supervised data. A related method, CLTree (CLustering based on decision TREEs), integrates unsupervised clustering with the idea of supervised classification.

10.Outlier Analysis [CO5-H2]

 Data objects that are grossly different from or inconsistent with the remaining set of data are called outliers. Outliers can be caused by measurement or execution errors, e.g., the display of a person's age as 999.
 Outlier detection and analysis is an interesting data mining task, referred to as outlier
mining.
 Applications:
o Fraud Detection (Credit card, telecommunications, criminal activity in e-
Commerce)
o Customized Marketing (high/low income buying habits)
o Medical Treatments (unusual responses to various drugs)
o Analysis of performance statistics (professional athletes)
o Weather Prediction
o Financial Applications (loan approval, stock tracking)

1) Statistical Distribution-Based Outlier Detection


The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test.
A statistical discordancy test examines two hypotheses:
 a working hypothesis
 an alternative hypothesis.

 Working hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

H : o_i ∈ F, for i = 1, 2, ..., n

A discordancy test verifies whether an object, o_i, is significantly large (or small) in relation to the distribution F.
 Alternative hypothesis
An alternative hypothesis, H̄, which states that o_i comes from another distribution model, G, is adopted.
 There are different kinds of alternative distributions.
o Inherent alternative distribution
o Mixture alternative distribution
o Slippage alternative distribution

There are two basic types of procedures for detecting outliers:


o Block procedures
o Consecutive procedures
Drawbacks:
o most tests are for single attributes
o in many cases, the data distribution may not be known.

2) Distance-Based Outlier Detection

An object, O, in a data set, D, is a distance-based (DB) outlier with parameters pct and
dmin, that is, a DB(pct;dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a
distance greater than dmin from O.
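A naive Python sketch that checks the DB(pct, dmin) definition directly (quadratic time, for illustration only; the real algorithms listed below avoid this cost):

def db_outliers(points, pct, dmin, dist):
    # O is a DB(pct, dmin)-outlier if at least a fraction pct of the other
    # objects lie at a distance greater than dmin from O.
    outliers = []
    for o in points:
        others = [p for p in points if p is not o]
        far = sum(1 for p in others if dist(o, p) > dmin)
        if others and far / len(others) >= pct:
            outliers.append(o)
    return outliers

euclid = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
data = [(1, 1), (1.2, 0.9), (0.8, 1.1), (9, 9)]
print(db_outliers(data, pct=0.9, dmin=3.0, dist=euclid))   # [(9, 9)]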

Algorithms for mining distance-based outliers are


 Index-based algorithm, Nested-loop algorithm, Cell-based algorithm

 Index-based algorithm

Given a data set, the index-based algorithm uses multidimensional indexing structures,
such as R-trees or k-d trees, to search for neighbours of each object o within radius
dmin around that object.

o Nested-loop algorithm


This algorithm avoids index structure construction and tries to minimize the number of
I/Os. It divides the memory buffer space into two halves and the data set into several
logical blocks. I/O efficiency can be achieved by choosing the order in which blocks are
loaded into each half.

o Cell-based algorithm: A cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality.

3) Density-Based Local Outlier Detection

o An object is a local outlier if it is outlying relative to its local neighbourhood, particularly with respect to the density of the neighbourhood.
o In this view (referring to a figure with a dense cluster C2 and a sparser cluster C1), o2 is a local outlier relative to the density of C2. Object o1 is an outlier as well, and no objects in C1 are mislabelled as outliers. This forms the basis of density-based local outlier detection.

4) Deviation-Based Outlier Detection

It identifies outliers by examining the main characteristics of objects in a group. Objects that "deviate" from this description are considered outliers. Hence, the term deviations is used to refer to outliers.

Techniques
o Sequential Exception Technique
o OLAP Data Cube Technique

Sequential Exception Technique


o Simulates the way in which humans can distinguish unusual objects from among a series of supposedly similar objects. It exploits the implicit redundancy of the data.


o Given a data set, D, of n objects, it builds a sequence of subsets, {D1, D2, ..., Dm}, of these objects with 2 <= m <= n, such that D_{j-1} ⊂ D_j, where D_j ⊆ D.

The technique introduces the following key terms.

 Exception set: This is the set of deviations or outliers.

 Dissimilarity function: It is any function that, if given a set of objects, returns a low
value if the objects are similar to one another. The greater the dissimilarity among
the objects, the higher the value returned by the function.

 Cardinality function: This is typically the count of the number of objects in a given
set.

 Smoothing factor: This function is computed for each subset in the sequence. It
assesses how much the dissimilarity can be reduced by removing the subset from
the original set of objects.

OLAP Data Cube Technique

o An OLAP approach to deviation detection uses data cubes to identify regions of differences in large multidimensional data.
o A cell value in the cube is considered an exception if it is different from the
expected value, based on a statistical model.
o The method uses visual cues such as background colour to reflect the degree of
exception of each cell.
o The user can choose to drill down on cells that are flagged as exceptions.
o The measure value of a cell may reflect exceptions occurring at more detailed or
lower levels of the cube, where these exceptions are not visible from the current
level.

11.Data Mining Applications [CO5-H2]


 Data mining is an interdisciplinary field with wide and various applications
o There exist nontrivial gaps between data mining principles and domain-
specific applications
 Some application domains
o Financial data analysis
o Retail industry
o Telecommunication industry
o Biological data analysis

I. Data Mining for Financial Data Analysis


 Financial data collected in banks and financial institutions are often relatively
complete, reliable, and of high quality
 Design and construction of data warehouses for multidimensional data analysis and
data mining
o View the debt and revenue changes by month, by region, by sector, and by
other factors
o Access statistical information such as max, min, total, average, trend, etc.
 Loan payment prediction/consumer credit policy analysis
o feature selection and attribute relevance ranking
o Loan payment performance
o Consumer credit rating
 Classification and clustering of customers for targeted marketing
o multidimensional segmentation by nearest-neighbor, classification, decision
trees, etc. to identify customer groups or associate a new customer to an
appropriate customer group
 Detection of money laundering and other financial crimes
o integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)
o Tools: data visualization, linkage analysis, classification, clustering tools,
outlier analysis, and sequential pattern analysis tools (find unusual access
sequences)

II. Data Mining for Retail Industry

 Retail industry: huge amounts of data on sales, customer shopping history, etc.
 Applications of retail data mining
o Identify customer buying behaviors
o Discover customer shopping patterns and trends
o Improve the quality of customer service
o Achieve better customer retention and satisfaction
o Enhance goods consumption ratios
o Design more effective goods transportation and distribution policies

Examples

 Ex. 1. Design and construction of data warehouses based on the benefits of data mining
 Ex. 2. Multidimensional analysis of sales, customers, products, time, and region
 Ex. 3. Analysis of the effectiveness of sales campaigns
 Ex. 4. Customer retention: Analysis of customer loyalty
o Use customer loyalty card information to register sequences of purchases
of particular customers
o Use sequential pattern mining to investigate changes in customer
consumption or loyalty


o Suggest adjustments on the pricing and variety of goods


 Ex. 5. Purchase recommendation and cross-reference of items

III. Data Mining for Telecommunication Industry

 A rapidly expanding and highly competitive industry and a great demand for data
mining
o Understand the business involved
o Identify telecommunication patterns
o Catch fraudulent activities
o Make better use of resources
o Improve the quality of service

The following are a few scenarios for which data mining may improve
telecommunication services

 Multidimensional analysis of telecommunication data


o Intrinsically multidimensional: calling-time, duration, location of caller, location
of callee, type of call, etc.
 Fraudulent pattern analysis and the identification of unusual patterns
o Identify potentially fraudulent users and their atypical usage patterns
o Detect attempts to gain fraudulent entry to customer accounts
o Discover unusual patterns which may need special attention
 Multidimensional association and sequential pattern analysis
o Find usage patterns for a set of communication services by customer group,
by month, etc.
o Promote the sales of specific services
o Improve the availability of particular services in a region
 Mobile telecommunication services
 Use of visualization tools in telecommunication data analysis

IV. Biomedical Data Analysis

 DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C),
guanine (G), and thymine (T).
 Gene: a sequence of hundreds of individual nucleotides arranged in a particular
order
 Humans have around 30,000 genes
 Tremendous number of ways that the nucleotides can be ordered and sequenced to
form distinct genes

Data mining may contribute to biological data analysis in the following aspects

 Semantic integration of heterogeneous, distributed genome databases


o Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
o Data cleaning and data integration methods developed in data mining will help
 Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide/protein sequences
o Compare the frequently occurring patterns of each class (e.g., diseased and
healthy)
o Identify gene sequence patterns that play roles in various diseases
 Discovery of structural patterns and analysis of genetic networks and protein
pathways:
 Association analysis: identification of co-occurring gene sequences
o Most diseases are not triggered by a single gene but by a combination of
genes acting together
o Association analysis may help determine the kinds of genes that are likely to
co-occur together in target samples
 Path analysis: linking genes to different disease development stages
o Different genes may become active at different stages of the disease
o Develop pharmaceutical interventions that target the different stages
separately
 Visualization tools and genetic data analysis

V. Data Mining in Other Scientific Applications

 Vast amounts of data have been collected from scientific domains (including
geosciences, astronomy, and meteorology) using sophisticated telescopes,
multispectral high-resolution remote satellite sensors, and global positioning
systems.

 Large data sets are being generated due to fast numerical simulations in various
fields, such as climate and ecosystem modeling, chemical engineering, fluid
dynamics, and structural mechanics.

 Some of the challenges brought about by emerging scientific applications of data mining include the following:
o Data warehouses and data preprocessing:
o Mining complex data types:
o Graph-based mining:
o Visualization tools and domain-specific knowledge:

VI. Data Mining for Intrusion Detection


 The security of our computer systems and data is at constant risk. The extensive
growth of the Internet and increasing availability of tools and tricks for interrupting
and attacking networks have prompted intrusion detection to become a critical
component of network administration.

 An intrusion can be defined as any set of actions that threaten the integrity,
confidentiality, or availability of a network resource .

The following are areas in data mining technology applied or further developed for
intrusion detection:

o Development of data mining algorithms for intrusion detection


o Association and correlation analysis, and aggregation to help select and build
discriminating attributes
o Analysis of stream data
o Distributed data mining
o Visualization and querying tools

UNIT-V
University Questions
PART A
1. What are the requirements of clustering?
2. What are the applications of spatial data bases?
3. What is text mining?
4. Distinguish between classification and clustering.
5. Define a Spatial database.
7. What is the objective function of K-means algorithm?
8. Mention the advantages of Hierarchical clustering.
9. What is an outlier? Give example.
10. What is audio data mining?
11. List two application of data mining.

PART-B
1. BIRCH and CLARANS are two interesting clustering algorithms that perform effective
clustering in large data sets.
(i) Outline how BIRCH performs clustering in large data sets. [10] (ii) Compare and
outline the major differences of the two scalable clustering algorithms BIRCH and
CLARANS. [6]
2. Write a short note on web mining taxonomy. Explain the different activities of text
mining.
3. Discuss and elaborate the current trends in data mining. [6+5+5]
4. Discuss spatial data bases and Text databases [16]
5. What is a multimedia database? Explain the methods of mining multimedia
database? [16]


6. (a) Explain the following clustering methods in detail.


(a) BIRCH (b) CURE [16]
7. Discuss in detail about any four data mining applications. [16]
8. Write short notes on
(i) Partitioning methods [8] (ii) Outlier analysis [8]
9. Describe K means clustering with an example. [16]
10. Describe in detail about Hierarchical methods.
