BCA VI SEM
2/16/2019 2
Data Cube: Multidimensional Model
Introduction
Schemas
Star Schema
Snowflake Schema
Fact Constellation
Data Warehouse Usage
Data Warehouse Implementation
Data Cube Materialization
OLAP Operations – Roll-up, Drill-Down, Slice, Dice, Pivot
OLAP Models
OLAM
Data Mining Architecture
Mining Frequent Patterns
Association Rule Mining
Apriori Algorithm
FP Growth
What is Data Warehouse?
The term "Data Warehouse" was first coined by Bill
Inmon in 1990.
According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data.
A data warehouse refers to a data repository that is
maintained separately from an organization’s
operational databases. Data warehouse systems allow
for integration of a variety of application systems. They
support information processing by providing a solid
platform of consolidated historic data for analysis.
What is data warehouse?
Data warehousing provides architectures and
tools for business executives to systematically
organize, understand, and use their data to make
strategic decisions.
The data warehouse is the core of the BI system which
is built for data analysis and reporting.
Data warehouses generalize and consolidate data in
multidimensional space. The construction of data
warehouses involves data cleaning, data integration,
and data transformation, and can be viewed as an
important preprocessing step for data mining.
Moreover, data warehouses provide online analytical
processing (OLAP) tools for the interactive analysis of
multidimensional data of varied granularities, which
facilitates effective data generalization and data
mining.
How are organizations using the information from a data warehouse?
Many organizations use this information to support business
decision-making activities, including:
Database design: An OLTP system usually adopts an
entity-relationship (ER) data model and an
application-oriented database design. An OLAP system
typically adopts either a star or a snowflake model and
a subject-oriented database design.
Access patterns: The access patterns of an OLTP
system consist mainly of short, atomic transactions.
Such a system requires concurrency control and
recovery mechanisms. However, accesses to OLAP
systems are mostly read-only operations (because
most data warehouses store historic rather than up-to-
date information), although many could be complex
queries.
Data Warehousing: A multitiered
Architecture(previous year paper)
Data warehouses often adopt a three-tier architecture:
Data warehouse Models
Enterprise warehouse: An enterprise warehouse collects
all of the information about subjects spanning the entire
organization.
It provides corporate-wide data integration, usually from one
or more operational systems or external information
providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data,
and can range in size from a few gigabytes to hundreds of
gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on
traditional mainframes, computer superservers, or parallel
architecture platforms. It requires extensive business
modeling and may take years to design and build.
Data mart: A data mart contains a subset of corporate-wide
data that is of value to a specific group of users. The scope
is confined to specific selected subjects. For example, a
marketing data mart may confine its subjects to customer, item,
and sales. The data contained in data marts tend to be
summarized.
Data marts are usually implemented on low-cost departmental
servers that are Unix/Linux or Windows based. The
implementation cycle of a data mart is more likely to be measured
in weeks rather than months or years. However, it may involve
complex integration in the long run if its design and planning were
not enterprise-wide.
Depending on the source of data, data marts can be categorized
as independent or dependent.
Independent data marts are sourced from data captured from one
or more operational systems or external information providers, or
from data generated locally within a particular department or
geographic area.
Dependent data marts are sourced directly from enterprise data
warehouses.
Virtual warehouse: A virtual warehouse is a set of
views over operational databases.
For efficient query processing, only some of the
possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess
capacity on operational database servers.
Difference Between Data Warehouse and Data Mart
Data Warehouse Applications
A data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as an integral part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
How does a data warehouse work?
By merging all of this information in one place, an
organization can analyze its customers more
holistically.
This helps to ensure that it has considered all the
information available.
Data warehousing makes data mining possible. Data mining is the search for patterns in the data that may lead to higher sales and profits.
Advantages of Data Warehouse:
A data warehouse allows business users to quickly access critical data from several sources in one place.
A data warehouse provides consistent information on various cross-functional activities. It also supports ad hoc reporting and queries.
A data warehouse helps to integrate many sources of data to reduce stress on the production system.
A data warehouse helps to reduce total turnaround time for analysis and reporting.
Restructuring and integration make the data easier to use for reporting and analysis.
Because users can access critical data from a number of sources in a single place, a data warehouse saves their time in retrieving data from multiple sources.
A data warehouse stores a large amount of historical data. This helps users to analyze different time periods and trends to make future predictions.
Disadvantages of Data Warehouse:
Not an ideal option for unstructured data.
Creation and implementation of a data warehouse is a time-consuming affair.
A data warehouse can become outdated relatively quickly.
It is difficult to make changes in data types and ranges, data source schema, indexes, and queries.
The data warehouse may seem easy, but it is actually too complex for the average user.
Sometimes warehouse users will develop different business rules.
Organizations need to spend a lot of resources on training and implementation.
ETL-Extraction, Transformation
and Loading(previous year paper)
Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and utilities include the following functions:
Metadata Repository(previous
year paper)
Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. The metadata repository lies within the bottom tier of the data warehousing architecture. Metadata are created for the data names and definitions of the given warehouse.
A metadata repository should contain the following:
A description of the data warehouse structure, which
includes the warehouse schema, view, dimensions,
hierarchies, and derived data definitions, as well as data
mart locations and contents.
Role of Metadata
Metadata play a very different role than other data warehouse
data and are important for many reasons.
For example, metadata are used as a directory to help the
decision support system analyst locate the contents of the
data warehouse, and as a guide to the data mapping when
data are transformed from the operational environment to
the data warehouse environment.
Metadata also serve as a guide to the algorithms used for
summarization between the current detailed data and the lightly
summarized data, and between the lightly summarized data and
the highly summarized data.
Metadata should be stored and managed persistently (i.e., on
disk).
Data Cube: A Multidimensional Data
Model
“What is a data cube?” A data cube allows data to be modeled and
viewed in multiple dimensions.
It is defined by dimensions and facts.
In general terms, dimensions are the perspectives or entities
with respect to which an organization wants to keep records.
For example, AllElectronics may create a sales data warehouse in
order to keep records of the store’s sales with respect to the
dimensions time, item, branch, and location. These dimensions
allow the store to keep track of things like monthly sales of items and
the branches and locations at which the items were sold.
Each dimension may have a table associated with it, called a
dimension table, which further describes the dimension.
For example, a dimension table for item may contain the attributes
item name, brand, and type. Dimension tables can be specified by
users or experts, or automatically generated and adjusted based on
data distributions.
A multidimensional data model
A multidimensional data model is typically organized
around a central theme, such as sales. This theme is
represented by a fact table.
Facts are numeric measures. Think of them as the
quantities by which we want to analyze
relationships between dimensions.
Examples of facts for a sales data warehouse
include dollars sold (sales amount in dollars),
units sold (number of units sold), and amount
budgeted.
The fact table contains the names of the facts, or
measures, as well as keys to each of the related
dimension tables.
A multidimensional data model
A multidimensional data model stores data in the form of a data cube. Although the term "cube" suggests three dimensions, a data cube may have any number of dimensions.
A data cube allows data to be viewed in multiple dimensions. Dimensions are the entities with respect to which an organization wants to keep records. For example, in a store's sales records, dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold.
A multidimensional database helps to provide data-related answers to complex business queries quickly and accurately.
Schemas for Multidimensional
Data Model are
Star Schema
Snowflakes Schema
Fact Constellations Schema
Star Schemas for Multidimensional Model
The simplest data warehouse schema is the star schema, because its structure resembles a star.
A star schema consists of data in the form of facts and dimensions.
The fact table sits at the center of the star, and the points of the star are the dimension tables.
In a star schema, the fact table holds the bulk of the data, while the denormalized dimension tables contain redundancy.
Each dimension table is joined to the fact table through a primary key–foreign key relationship.
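The layout above can be sketched with SQLite (Python's standard-library `sqlite3`); the table and column names below are illustrative, not taken from the slides:

```python
import sqlite3

# Hypothetical star schema: one central fact table, two dimension tables.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_location (loc_key  INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),    -- foreign keys point at
    loc_key  INTEGER REFERENCES dim_location(loc_key), -- the dimension tables
    dollars_sold REAL,
    units_sold   INTEGER
);
INSERT INTO dim_item     VALUES (1, 'Laptop', 'Acme'), (2, 'Printer', 'Acme');
INSERT INTO dim_location VALUES (1, 'New York', 'USA'), (2, 'Chicago', 'USA');
INSERT INTO fact_sales   VALUES (1, 1, 1200.0, 2), (2, 1, 300.0, 3), (1, 2, 600.0, 1);
""")

# Each dimension joins to the fact table on its key; measures are aggregated.
rows = cur.execute("""
    SELECT d.item_name, l.city, SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_item     d ON f.item_key = d.item_key
    JOIN dim_location l ON f.loc_key  = l.loc_key
    GROUP BY d.item_name, l.city
""").fetchall()
print(rows)
```

Note how `city` and `country` live together in one denormalized `dim_location` row; that is exactly the redundancy the snowflake schema later removes.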
An example of E-Commerce Website
Dimension Tables
As can be observed from the previous slides, dimension tables contain a lot of redundant data, so we need a schema in which tables can be normalized to reduce redundancy.
ANOTHER EXAMPLE OF STAR SCHEMA
Snowflake Schemas for Multidimensional
Model
The snowflake schema is more complex than the star schema because the dimension tables of the snowflake are normalized.
The snowflake schema is represented by a centralized fact table connected to multiple dimension tables, and these dimension tables can themselves be normalized into additional dimension tables.
The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model are normalized to reduce redundancies.
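Continuing the hypothetical SQLite sketch from the star schema, normalizing the location dimension moves the repeated country value into its own table:

```python
import sqlite3

# Snowflake version of the location dimension: dim_location keeps only
# city-level data and references a separate dim_country table.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE dim_country  (country_key INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_location (loc_key INTEGER PRIMARY KEY, city TEXT,
                           country_key INTEGER REFERENCES dim_country(country_key));
INSERT INTO dim_country  VALUES (1, 'USA');
INSERT INTO dim_location VALUES (1, 'New York', 1), (2, 'Chicago', 1);
""")

# 'USA' is stored once instead of being repeated in every location row,
# at the cost of an extra join when querying.
cities = cur.execute("""
    SELECT l.city, c.country
    FROM dim_location l JOIN dim_country c ON l.country_key = c.country_key
""").fetchall()
print(cities)
```

This is the trade-off the slides describe: less redundancy, more joins.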
SNOWFLAKE SCHEMA FOR THE SAME EXAMPLE
Normalized Dimensional Tables
Another Example of Snowflake Schema
Difference between the snowflake
and star schema models
The major difference between the snowflake and star
schema models is that the dimension tables of the
snowflake model may be kept in normalized form to
reduce redundancies.
FACT CONSTELLATION
Data Warehouse Usage for
Information Processing
Data warehouses and data marts are used in a wide
range of applications. Business executives use the data
in data warehouses and data marts to perform data
analysis and make strategic decisions.
There are three kinds of data warehouse applications:
information processing, analytical processing, and data
mining.
Data mining supports knowledge discovery by
finding hidden patterns and associations,
constructing analytical models, performing
classification and prediction, and presenting the
mining results using visualization tools.
OLAP vs. Data Mining
The functionalities of OLAP and data mining can be
viewed as disjoint:
Data Warehouse Implementation
A data cube is a lattice of cuboids. Suppose that you want to
create a data cube for AllElectronics sales that contains the
following: city, item, year, and sales in dollars. You want to be
able to analyze the data, with queries such as the following:
Curse of Dimensionality
Online analytical processing may need to access different
cuboids for different queries. Therefore, it may seem like a
good idea to compute in advance all or at least some of the
cuboids in a data cube. Precomputation leads to fast
response time and avoids some redundant
computation.
A major challenge related to this precomputation,
however, is that the required storage space may explode
if all the cuboids in a data cube are precomputed,
especially when the cube has many dimensions. The
storage requirements are even more excessive when many
of the dimensions have associated concept hierarchies,
each with multiple levels.
This problem is referred to as the curse of
dimensionality.
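The size of this explosion can be computed directly: if dimension i has L_i concept-hierarchy levels (excluding the virtual top level "all"), the total number of cuboids is the product of (L_i + 1) over all dimensions. A quick sketch:

```python
from math import prod

def total_cuboids(levels_per_dim):
    """Total cuboids in a cube whose i-th dimension has L_i concept-hierarchy
    levels (the virtual 'all' level adds the +1 for each dimension)."""
    return prod(l + 1 for l in levels_per_dim)

# With no hierarchies (one level per dimension), an n-D cube has 2**n cuboids.
print(total_cuboids([1] * 10))   # 1024

# Ten dimensions with 4 hierarchy levels each: storage explodes to 5**10 cuboids.
print(total_cuboids([4] * 10))   # 9765625
```

Even a modest cube therefore has millions of candidate cuboids, which is why full precomputation quickly becomes infeasible.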
Data Cube Materialization
There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the “nonbase”
cuboids. This leads to computing expensive multidimensional aggregates
on-the-fly, which can be extremely slow.
2. Full materialization: Precompute all of the cuboids. The resulting
lattice of computed cuboids is referred to as the full cube. This choice
typically requires huge amounts of memory space in order to store all of the
precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the
whole set of possible cuboids. Alternatively, we may compute a subset of
the cube, which contains only those cells that satisfy some user-specified
criterion, such as where the tuple count of each cell is above some
threshold. We will use the term subcube to refer to the latter case, where
only some of the cells may be precomputed for various cuboids. Partial
materialization represents an interesting trade-off between storage space
and response time.
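Full materialization can be sketched in a few lines: starting from a toy base cuboid (hypothetical city/item/year data, not from the slides), every subset of the dimensions yields one aggregated cuboid, giving the full cube of 2^n cuboids:

```python
from itertools import combinations
from collections import defaultdict

# Base cuboid: (city, item, year) -> dollars sold. Toy data, hypothetical names.
base = {
    ("Vancouver", "TV",    2018): 400,
    ("Vancouver", "Phone", 2018): 200,
    ("Chicago",   "TV",    2018): 300,
}
dims = ("city", "item", "year")

# Full materialization: aggregate the base cuboid for every subset of dimensions.
cube = {}
for k in range(len(dims) + 1):
    for group in combinations(range(len(dims)), k):
        cuboid = defaultdict(int)
        for key, measure in base.items():
            # Keep only the dimensions in this cuboid; sum out the rest.
            cuboid[tuple(key[i] for i in group)] += measure
        cube[tuple(dims[i] for i in group)] = dict(cuboid)

print(len(cube))        # 2**3 = 8 cuboids
print(cube[()])         # apex cuboid: grand total {(): 900}
print(cube[("city",)])  # city-level cuboid
```

Partial materialization would simply skip some of these subsets, or keep only cells whose count exceeds a threshold.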
Data warehouse Application/usage
Most common sectors where a data warehouse is used:
Airline:
In the airline system, it is used for operational purposes like crew assignment, analyses of route profitability, frequent-flyer program promotions, etc.
Banking:
It is widely used in the banking sector to manage resources effectively. A few banks also use it for market research and for performance analysis of products and operations.
Healthcare:
The healthcare sector also uses data warehouses to strategize and predict outcomes, generate patients' treatment reports, share data with tie-in insurance companies, medical aid services, etc.
Public sector:
In the public sector, a data warehouse is used for intelligence gathering. It helps government agencies to maintain and analyze tax records and health policy records for every individual.
Investment and insurance sector:
In this sector, warehouses are primarily used to analyze data patterns and customer trends, and to track market movements.
Retail chain:
In retail chains, a data warehouse is widely used for distribution and marketing. It also helps to track items, customer buying patterns, and promotions, and is used for determining pricing policy.
Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions, and distribution decisions.
Hospitality industry:
This industry utilizes warehouse services to design and estimate advertising and promotion campaigns, targeting clients based on their feedback and travel patterns.
Online Analytical Processing Server (OLAP)
-Definition
On-Line Analytical Processing (OLAP) is a category of
software technology that enables analysts, managers
and executives to gain insight into data through fast,
consistent, interactive access in a wide variety of
possible views of information that has been
transformed from raw data to reflect the real
dimensionality of the enterprise as understood by the
user.
Analysts frequently need to group, aggregate and join
data. These operations in relational databases are
resource intensive. With OLAP data can be pre-
calculated and pre-aggregated, making analysis faster.
At the core of the OLAP concept is the OLAP cube. The OLAP cube is a data structure optimized for very quick data analysis.
The OLAP cube consists of numeric facts, called measures, which are categorized by dimensions. The OLAP cube is also called the hypercube.
Usually, data operations and analysis are performed using a simple spreadsheet, where data values are arranged in row and column format. This is ideal for two-dimensional data. However, OLAP contains multidimensional data, usually obtained from different and unrelated sources, so a spreadsheet is not an optimal option. The cube can store and analyze multidimensional data in a logical and orderly manner.
How does OLAP work?
A data warehouse extracts information from multiple data sources and formats, such as text files, Excel sheets, multimedia files, etc.
The extracted data is cleaned and transformed. Data is
loaded into an OLAP server (or OLAP cube) where
information is pre-calculated in advance for further
analysis.
Basic analytical operations of
OLAP
Four types of analytical operations in OLAP are:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up is also known as "consolidation" or "aggregation." The roll-up operation can be performed in two ways:
Reducing dimensions
Climbing up a concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.
Roll-up
In this example, the cities New York and Chicago are rolled up into the country USA.
The sales figures of New York and Chicago are 440 and 1560, respectively; they become 2000 after roll-up.
In this aggregation process, data in the location hierarchy moves up from city to country.
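This climb up the location hierarchy is a plain group-and-sum. The sketch below reuses the slide's New York/Chicago figures and adds two hypothetical Canadian cities for contrast:

```python
from collections import defaultdict

# City-level sales (New York 440 and Chicago 1560 are from the slide;
# the Canadian figures are made up for illustration).
city_sales = {"New York": 440, "Chicago": 1560, "Vancouver": 500, "Toronto": 700}

# Concept hierarchy: city -> country.
country_of = {"New York": "USA", "Chicago": "USA",
              "Vancouver": "Canada", "Toronto": "Canada"}

# Roll-up: aggregate city-level cells up to the country level.
country_sales = defaultdict(int)
for city, amount in city_sales.items():
    country_sales[country_of[city]] += amount

print(dict(country_sales))  # USA: 440 + 1560 = 2000
```

Drill-down is the inverse navigation: it moves back from country-level totals to the finer city-level cells.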
Drill-down
Drill Down
Quarter Q1 is drilled down to the months January, February, and March. Corresponding sales are also registered.
In this example, the month level of the time dimension is added.
Slice
Dice
Pivot
Types of OLAP Servers/Models
MOLAP
In the MOLAP model, data for analysis is stored in
specialized multidimensional databases. Large
multidimensional arrays form the storage structures.
Precalculated and prefabricated multidimensional
data cubes are stored in multidimensional databases.
The MOLAP engine in the application layer pushes a
multidimensional view of the data from the MDDBs to
the users.
The users who need summarized data enjoy fast
response times from the preconsolidated data.
ROLAP
ROLAP works with data that exists in a relational database. Facts and dimension tables are stored as relational tables. It also allows multidimensional analysis of data and is the fastest-growing type of OLAP.
Local hypercubing is a variation of ROLAP provided
by vendors. This is how it works:
1. The user issues a query.
2. The results of the query get stored in a small, local,
multidimensional database.
3. The user performs analysis against this local database.
4. If additional data is required to continue the analysis, the user issues another query and the analysis continues.
ROLAP
Drawbacks of the ROLAP model:
Demand for higher resources: ROLAP needs high utilization of manpower, software, and hardware resources.
Aggregate data limitations: ROLAP tools use SQL for all calculations on aggregate data, but SQL is not well suited to complex computations.
Slow query performance: query performance in this model is slow when compared with MOLAP.
HOLAP – Refers to hybrid OLAP. This model attempts to combine the strengths and features of ROLAP and MOLAP.
DOLAP – Refers to desktop OLAP. It is meant to provide OLAP functionality to desktop users. In this methodology, multidimensional datasets are created and transferred to the desktop machine, requiring only the DOLAP software to exist on that machine.
WEB OLAP – Refers to OLAP where the OLAP data is accessible from a web browser.
Advantages of OLAP
OLAP is a platform for all types of business analysis, including planning, budgeting, reporting, and analysis.
Information and calculations are consistent in an OLAP cube. This is a crucial benefit.
Quickly create and analyze "what if" scenarios.
Easily search the OLAP database for broad or specific terms.
OLAP provides the building blocks for business modeling tools, data mining tools, and performance reporting tools.
Allows users to slice and dice cube data by various dimensions, measures, and filters.
It is good for analyzing time series.
Finding clusters and outliers is easy with OLAP.
It is a powerful online analytical processing and visualization system that provides fast response times.
Disadvantages of OLAP
Selection Criteria for OLAP Tools
Multidimensional representation of data.
Aggregation, summarization, precalculation, and
derivations.
Formulas and complex calculations in an extensive
library.
Cross-dimensional calculations.
Time intelligence such as year-to-date, current and
past fiscal periods, moving averages, and moving
totals.
Pivoting, cross-tabs, drill-down, and roll-up along
single or multiple dimensions.
OLAP Tools
IBM Cognos
SAP NetWeaver BW
Essbase
icCube
Oracle Database OLAP option
Starnet Query Model(previous year
paper)
The querying of multidimensional databases can be based on a
starnet model, which consists of radial lines emanating from a
central point, where each line represents a concept hierarchy for
a dimension. Each abstraction level in the hierarchy is called a
footprint. These represent the granularities available for use by
OLAP operations such as drill-down and roll-up.
A concept hierarchy may involve a single attribute
(e.g., date for the time hierarchy) or several attributes
(e.g., the concept hierarchy for location involves the
attributes street, city, province or state, and country).
Data Mining
Data Mining is defined as extracting information from
huge sets of data.
In other words, we can say that data mining is the
procedure of mining knowledge from data.
Data mining is the process of extracting useful information stored in large databases.
It is a powerful tool that helps organizations retrieve useful information from available data warehouses.
Data mining can be applied to relational databases, object-
oriented databases, data warehouses, structured-
unstructured databases, etc.
Data mining is used in numerous areas like banking,
insurance companies, pharmaceutical companies etc.
Data Mining Architecture(previous
year paper)
Components of Data Mining Architecture
Data Sources
Operational database, World Wide Web (WWW), text files and other
documents are the actual sources of data. You need large volumes of
historical data for data mining to be successful. Organizations usually
store data in databases or data warehouses. Data warehouses may
contain one or more databases, text files, spreadsheets or other kinds of
information repositories.
Different Processes
The data needs to be cleaned, integrated and selected before passing it
to the database or data warehouse server. As the data is from different
sources and in different formats, it cannot be used directly for the data
mining process because the data might not be complete and reliable.
So, first data needs to be cleaned and integrated. Again, more data than
required will be collected from different data sources and only the data
of interest needs to be selected and passed to the server. These
processes are not as simple as we think. A number of techniques may
be performed on the data as part of cleaning, integration and selection.
Data Warehouse:
A data warehouse is a repository that stores information collected from multiple sources under a unified schema. Information stored in a data warehouse is critical to organizations for the process of decision making.
Data Mining Engine:
Data Mining Engine is the core component of data mining
process which consists of various modules that are used to
perform various tasks like clustering, classification,
prediction and correlation analysis.
Pattern Evaluation:
The patterns generated by the data mining engine are evaluated by the pattern evaluation module, which measures the interestingness of each pattern using a threshold value. It interacts with the data mining engine to focus the search towards interesting patterns. For example, support and confidence are used to judge whether the association rules derived from market basket analysis are interesting or not.
e) Graphical User Interface
The graphical user interface module communicates between
the user and the data mining system. This module helps the
user use the system easily and efficiently without knowing the
real complexity behind the process. When the user specifies a
query or a task, this module interacts with the data mining
system and displays the result in an easily understandable
manner.
f) Knowledge Base
The knowledge base is helpful in the whole data mining
process. It might be useful for guiding the search or
evaluating the interestingness of the result patterns. The
knowledge base might even contain user beliefs and data from
user experiences that can be useful in the process of data
mining. The data mining engine might get inputs from the
knowledge base to make the result more accurate and reliable.
The pattern evaluation module interacts with the knowledge
base on a regular basis to get inputs and also to update it.
OLAM = OLAP + DATA MINING
Online analytical mining (OLAM) integrates online analytical processing (OLAP) with data mining, mining knowledge in multidimensional databases.
Since data mining needs to work on preprocessed data stored in data warehouses, and OLAP cubes are computed over data warehouses, applying data mining techniques on OLAP cubes makes the best use of the available infrastructure rather than building everything from scratch.
Effective data mining needs exploratory data analysis. A
user will often want to traverse through a database, select
portions of relevant data, analyze them at different
granularities, and present knowledge/ results in different
forms.
Mining frequent patterns
Frequent patterns are patterns (e.g., itemsets, subsequences, or
substructures) that appear frequently in a data set.
If a substructure occurs frequently, it is called a
(frequent) structured pattern. Finding frequent
patterns plays an essential role in mining associations,
correlations, and many other interesting relationships
among data.
Support and Confidence(previous year
paper)
The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B. The rule A ⇒ B has confidence c in D, where c is the percentage of transactions in D containing A that also contain B.
Market Basket Analysis
Frequent itemset mining leads to the discovery of
associations and correlations among items in large
transactional or relational data sets. With massive amounts
of data continuously being collected and stored, many
industries are becoming interested in mining such patterns
from their databases.
A typical example of frequent itemset mining is
market basket analysis. This process analyzes
customer buying habits by finding associations
between the different items that customers place in
their “shopping baskets”.
Market Basket Analysis
“Which groups or sets of items are customers likely to
purchase on a given trip to the store?”
Market basket analysis can also help retailers plan
which items to put on sale at reduced prices. If
customers tend to purchase computers and printers
together, then having a sale on printers may encourage
the sale of printers as well as computers.
Apriori Algorithm: Finding Frequent
Itemsets by Confined Candidate Generation
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules.
The name of the algorithm reflects the fact that it uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted by L1.
Next, L1 is used to find L2, the set of frequent 2-
itemsets, which is used to find L3, and so on, until no
more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the
database.
To improve the efficiency of the level-wise generation
of frequent itemsets, an important property called the
Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a
frequent itemset must also be frequent.
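The level-wise search can be sketched compactly in Python. This is a simplified sketch on toy transaction data, not the paper's optimized implementation; in particular, the join step below unions pairs of frequent (k-1)-itemsets rather than using the original ordered-prefix join:

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Level-wise search: frequent k-itemsets generate (k+1)-candidates,
    and the Apriori property prunes candidates with an infrequent subset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # L1: frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    Lk = {c for c, n in count(items).items() if n >= min_support_count}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c for c, n in count(candidates).items() if n >= min_support_count}
        frequent |= Lk
        k += 1
    return frequent

# Toy dataset of nine transactions, minimum support count 2.
T = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
result = apriori(T, min_support_count=2)
print(len(result))  # 13 frequent itemsets in total
```

Each pass over `while Lk` corresponds to one full scan of the database for the counts of the Lk candidates.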
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.
Association Rules for Example in slide 116
Refer following pages for
Apriori Algorithm and
FP-Growth Algorithm
From Association Mining to
Correlation Analysis
• After generating the association rules using the Apriori or FP-Growth algorithm, based on the minimum support and minimum confidence interestingness measures, we can conclude which rules are strong and which are not.