
UNIT-2

BCA VI SEM

Ms. Sushma Malik, Asst. Prof., IINTM
Ms. Deepti Gupta, Asst. Prof., IINTM
Contents
 Data Warehouse(DW)
 Features of DW
 OLTP and OLAP
 Difference
 Data Warehouse
 Architecture
 Models
 Applications
 Advantages and Disadvantages
 ETL
 Metadata
 Starnet Query Model

2/16/2019 2
 Data Cube: Multidimensional Model
 Introduction
 Schemas
 Star Schema
 Snowflake Schema
 Fact Constellation
 Data Warehouse Usage
 Data Warehouse Implementation
 Data Cube Materialization
 OLAP Operations – Roll-up, Drill-Down, Slice, Dice, Pivot
 OLAP Models
 OLAM
 Data Mining Architecture
 Mining Frequent Patterns
 Association Rule Mining
 Apriori Algorithm
 FP Growth

What is Data Warehouse?
 The term "Data Warehouse" was first coined by Bill
Inmon in 1990.
 According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data.
 A data warehouse refers to a data repository that is
maintained separately from an organization’s
operational databases. Data warehouse systems allow
for integration of a variety of application systems. They
support information processing by providing a solid
platform of consolidated historic data for analysis.

What is data warehouse?
 Data warehousing provides architectures and
tools for business executives to systematically
organize, understand, and use their data to make
strategic decisions.
 The data warehouse is the core of the BI system which
is built for data analysis and reporting.
 Data warehouses generalize and consolidate data in
multidimensional space. The construction of data
warehouses involves data cleaning, data integration,
and data transformation, and can be viewed as an
important preprocessing step for data mining.

 Moreover, data warehouses provide online analytical
processing (OLAP) tools for the interactive analysis of
multidimensional data of varied granularities, which
facilitates effective data generalization and data
mining.

 Many other data mining functions, such as association,


classification, prediction, and clustering, can be integrated
with OLAP operations to enhance interactive mining of
knowledge at multiple levels of abstraction. Hence, the
data warehouse has become an increasingly important
platform for data analysis and OLAP and will provide an
effective platform for data mining.
Data Warehouse Features (previous year paper)
 Subject Oriented − A data warehouse is organized around
major subjects such as customer, supplier, product, and
sales. Rather than concentrating on the day-to-day
operations and transaction processing of an organization, a
data warehouse focuses on the modeling and analysis of
data for decision makers. Hence, data warehouses typically
provide a simple and concise view of particular subject
issues by excluding data that are not useful in the decision
support process.

 Integrated − A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
Data Warehouse Features
 Time Variant − Data are stored to provide information
from an historic perspective (e.g., the past 5–10 years). Every
key structure in the data warehouse contains, either
implicitly or explicitly, a time element.

 Non-volatile − Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.

How are organizations using the information from a data warehouse?
Many organizations use this information to support business
decision-making activities, including:

 (1) Increasing customer focus, which includes the analysis of customer buying patterns (such as buying preference, buying time, budget cycles, and appetites for spending);
 (2) Repositioning products and managing product
portfolios by comparing the performance of sales by
quarter, by year, and by geographic regions in order to fine-
tune production strategies;
 (3) Analyzing operations and looking for sources of profit
 (4) Managing customer relationships, making
environmental corrections, and managing the cost of
corporate assets.
Data Warehouse - An Update-Driven Approach
 Data warehousing employs an update driven approach in which
information from multiple, heterogeneous sources is integrated
in advance and stored in a warehouse for direct querying and
analysis.

 Unlike online transaction processing databases, data warehouses do not contain the most current information. However, a data warehouse brings high performance to the integrated heterogeneous database system because data are copied, preprocessed, integrated, annotated, summarized, and restructured into one semantic data store.

 Furthermore, query processing in data warehouses does not interfere with the processing at local sources. Moreover, data warehouses can store and integrate historic information and support complex multidimensional queries.
OLTP and OLAP
 Because most people are familiar with commercial relational
database systems, it is easy to understand what a data warehouse
is by comparing these two kinds of systems.

 The major task of online operational database systems is to perform online transaction and query processing. These systems are called online transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.

 Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are known as online analytical processing (OLAP) systems.
Major Distinguishing Features between OLTP and OLAP (previous year paper)
The major distinguishing features of OLTP and OLAP are
summarized as follows:
 Users and system orientation: An OLTP system is customer-
oriented and is used for transaction and query processing by clerks,
clients, and information technology professionals. An OLAP system
is market-oriented and is used for data analysis by knowledge
workers, including managers, executives, and analysts.
 Data contents: An OLTP system manages current data that,
typically, are too detailed to be easily used for decision making. An
OLAP system manages large amounts of historic data, provides
facilities for summarization and aggregation, and stores and manages
information at different levels of granularity. These features make the
data easier to use for informed decision making.

 Database design: An OLTP system usually adopts an
entity-relationship (ER) data model and an
application-oriented database design. An OLAP system
typically adopts either a star or a snowflake model and
a subject-oriented database design.

 View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historic data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

 Access patterns: The access patterns of an OLTP
system consist mainly of short, atomic transactions.
Such a system requires concurrency control and
recovery mechanisms. However, accesses to OLAP
systems are mostly read-only operations (because
most data warehouses store historic rather than up-to-
date information), although many could be complex
queries.

Data Warehousing: A Multitiered Architecture (previous year paper)

 Data warehouses often adopt a three-tier architecture:

 1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (e.g., customer profile information provided by external consultants).

 These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.

 2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations); or (2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).

 3. The top tier is a front-end client layer, which contains
query and reporting tools, analysis tools, and/or data
mining tools (e.g., trend analysis, prediction, and so on).

Data warehouse Models
 Enterprise warehouse: An enterprise warehouse collects
all of the information about subjects spanning the entire
organization.
 It provides corporate-wide data integration, usually from one
or more operational systems or external information
providers, and is cross-functional in scope.
 It typically contains detailed data as well as summarized data,
and can range in size from a few gigabytes to hundreds of
gigabytes, terabytes, or beyond.
 An enterprise data warehouse may be implemented on
traditional mainframes, computer superservers, or parallel
architecture platforms. It requires extensive business
modeling and may take years to design and build.

 Data mart: A data mart contains a subset of corporate-wide
data that is of value to a specific group of users. The scope
is confined to specific selected subjects. For example, a
marketing data mart may confine its subjects to customer, item,
and sales. The data contained in data marts tend to be
summarized.
Data marts are usually implemented on low-cost departmental
servers that are Unix/Linux or Windows based. The
implementation cycle of a data mart is more likely to be measured
in weeks rather than months or years. However, it may involve
complex integration in the long run if its design and planning were
not enterprise-wide.
 Depending on the source of data, data marts can be categorized
as independent or dependent.
 Independent data marts are sourced from data captured from one
or more operational systems or external information providers, or
from data generated locally within a particular department or
geographic area.
 Dependent data marts are sourced directly from enterprise data
warehouses.

 Virtual warehouse: A virtual warehouse is a set of
views over operational databases.
 For efficient query processing, only some of the
possible summary views may be materialized.
 A virtual warehouse is easy to build but requires excess
capacity on operational database servers.

Difference Between Data Warehouse and Data Mart

Data Warehouse Applications
A data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as a key part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:
 Financial services
 Banking services
 Consumer goods
 Retail sectors
 Controlled manufacturing
How does a data warehouse work?
 By merging all of this information in one place, an
organization can analyze its customers more
holistically.
 This helps to ensure that it has considered all the
information available.
 Data warehousing makes data mining possible. Data
mining is looking for patterns in the data that may
lead to higher sales and profits.

Advantages of Data Warehouse:
 Data warehouse allows business users to quickly access critical data from many sources all in one place.
 Data warehouse provides consistent information on various cross-functional activities. It also supports ad hoc reporting and queries.
 Data Warehouse helps to integrate many sources of data to
reduce stress on the production system.
 Data warehouse helps to reduce total turnaround time for
analysis and reporting.
 Restructuring and Integration make it easier for the user to
use for reporting and analysis.
 Data warehouse allows users to access critical data from a number of sources in a single place. Therefore, it saves the user's time of retrieving data from multiple sources.
 Data warehouse stores a large amount of historical data.
This helps users to analyze different time periods and
trends to make future predictions.
Disadvantages of Data Warehouse:
 Not an ideal option for unstructured data.
 Creation and implementation of a data warehouse is surely a time-consuming affair.
 A data warehouse can become outdated relatively quickly.
 Difficult to make changes in data types and ranges, data source schema, indexes, and queries.
 The data warehouse may seem easy, but actually it is too complex for the average user.
 Sometimes warehouse users will develop different business rules.
 Organizations need to spend lots of their resources for training and implementation purposes.

ETL - Extraction, Transformation and Loading (previous year paper)
Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and utilities include the following functions:

 Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
 Data cleaning, which detects errors in the data and rectifies
them when possible.
 Data transformation, which converts data from legacy or host
format to warehouse format.
 Load, which sorts, summarizes, consolidates, computes views,
checks integrity, and builds indices and partitions.
 Refresh, which propagates the updates from the data sources to
the warehouse.
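The five functions above can be sketched in a few lines of Python. This is a minimal illustrative sketch only, not a real ETL tool; the source records, field names, and region codes below are all made up for the example.

```python
# Minimal ETL sketch (all data and names below are hypothetical).
legacy_rows = [
    {"cust": " Alice ", "amount": "120.50", "region": "N"},
    {"cust": "Bob",     "amount": "bad",    "region": "S"},  # dirty record
    {"cust": "Carol",   "amount": "75.00",  "region": "N"},
]

def extract(source):
    """Extraction: gather records from the (heterogeneous) source."""
    return list(source)

def clean(rows):
    """Cleaning: detect erroneous records (unparseable amounts) and drop them."""
    good = []
    for r in rows:
        try:
            float(r["amount"])
            good.append(r)
        except ValueError:
            pass  # in practice: log, quarantine, or rectify the record
    return good

def transform(rows):
    """Transformation: convert the legacy format to the warehouse format,
    enforcing consistent naming and encoding."""
    region_names = {"N": "North", "S": "South"}
    return [{"customer": r["cust"].strip(),
             "amount": float(r["amount"]),
             "region": region_names[r["region"]]} for r in rows]

warehouse = []

def load(rows):
    """Load: consolidate the transformed rows into the warehouse store."""
    warehouse.extend(rows)

load(transform(clean(extract(legacy_rows))))
print(len(warehouse))                       # 2 clean rows loaded
print(sum(r["amount"] for r in warehouse))  # 195.5
```

A refresh step would rerun this pipeline against only the new or changed source records.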

Metadata Repository (previous year paper)
 Metadata are data about data. When used in a data
warehouse, metadata are the data that define
warehouse objects. Metadata repository lies within the
bottom tier of the data warehousing architecture.
Metadata are created for the data names and
definitions of the given warehouse.

 A metadata repository should contain the following:
 A description of the data warehouse structure, which
includes the warehouse schema, view, dimensions,
hierarchies, and derived data definitions, as well as data
mart locations and contents.

 Operational metadata, which include data lineage (history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, and audit trails).

 The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports.
 Mapping from the operational environment to the
data warehouse, which includes source databases and
their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults,
data refresh and purging rules, and security (user
authorization and access control).

 Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.

 Business metadata, which include business terms and definitions, data ownership information, and charging policies.

Role of Metadata
 Metadata play a very different role than other data warehouse
data and are important for many reasons.
 For example, metadata are used as a directory to help the
decision support system analyst locate the contents of the
data warehouse, and as a guide to the data mapping when
data are transformed from the operational environment to
the data warehouse environment.
 Metadata also serve as a guide to the algorithms used for
summarization between the current detailed data and the lightly
summarized data, and between the lightly summarized data and
the highly summarized data.
 Metadata should be stored and managed persistently (i.e., on
disk).

Data Cube: A Multidimensional Data
Model
 “What is a data cube?” A data cube allows data to be modeled and
viewed in multiple dimensions.
 It is defined by dimensions and facts.
 In general terms, dimensions are the perspectives or entities
with respect to which an organization wants to keep records.
 For example, AllElectronics may create a sales data warehouse in
order to keep records of the store’s sales with respect to the
dimensions time, item, branch, and location. These dimensions
allow the store to keep track of things like monthly sales of items and
the branches and locations at which the items were sold.
 Each dimension may have a table associated with it, called a
dimension table, which further describes the dimension.
 For example, a dimension table for item may contain the attributes
item name, brand, and type. Dimension tables can be specified by
users or experts, or automatically generated and adjusted based on
data distributions.

A multidimensional data model
 A multidimensional data model is typically organized
around a central theme, such as sales. This theme is
represented by a fact table.
 Facts are numeric measures. Think of them as the
quantities by which we want to analyze
relationships between dimensions.
 Examples of facts for a sales data warehouse
include dollars sold (sales amount in dollars),
units sold (number of units sold), and amount
budgeted.
 The fact table contains the names of the facts, or
measures, as well as keys to each of the related
dimension tables.
A multidimensional data model
 The multidimensional data model stores data in the form of a data cube. Mostly, data warehousing supports two- or three-dimensional cubes.
 A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with respect to which an organization wants to keep records. For example, in a store sales record, dimensions allow the store to keep track of things like monthly sales of items and the branches and locations.
 A multidimensional database helps to provide data-related answers to complex business queries quickly and accurately.

Schemas for Multidimensional
Data Model are

 Star Schema
 Snowflakes Schema
 Fact Constellations Schema

Star Schemas for Multidimensional Model
 The simplest data warehouse schema is the star schema because its structure resembles a star.
 Star schema consists of data in the form of facts and
dimensions.
 The fact table present in the center of star and points
of the star are the dimension tables.
 In a star schema, the fact table contains a large amount of data, and the denormalized dimension tables contain redundancy.
 Each dimension table is joined with the fact table using a foreign key that references the dimension table's primary key.
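The layout above can be sketched with SQLite. The table names, columns, and sales figures below are hypothetical; the point is the shape: one central fact table holding dimension keys plus numeric measures, star-joined to its dimension tables.

```python
import sqlite3

# Hypothetical star schema: fact_sales at the center, two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_location (loc_key  INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),
    loc_key  INTEGER REFERENCES dim_location(loc_key),
    dollars_sold REAL,
    units_sold   INTEGER);
INSERT INTO dim_item     VALUES (1, 'TV', 'Acme'), (2, 'Phone', 'Acme');
INSERT INTO dim_location VALUES (1, 'New York', 'USA'), (2, 'Chicago', 'USA');
INSERT INTO fact_sales   VALUES (1, 1, 440, 2), (2, 2, 1560, 6);
""")

# Star join: the fact table joined to a dimension table on its key.
rows = con.execute("""
    SELECT l.city, SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_location l ON f.loc_key = l.loc_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
print(rows)   # [('Chicago', 1560.0), ('New York', 440.0)]
```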

An example of E-Commerce Website

Dimension Tables

As can be observed from the previous slides, the dimension tables contain a lot of redundant data, so we need a schema where tables can be normalized to reduce redundancy.

ANOTHER EXAMPLE OF STAR SCHEMA

Snowflake Schemas for Multidimensional
Model
 The snowflake schema is more complex than the star schema because the dimension tables of the snowflake are normalized.
 The snowflake schema is represented by a centralized fact table connected to multiple dimension tables, and these dimension tables can be normalized into additional dimension tables.
 The major difference between the snowflake and
star schema models is that the dimension tables
of the snowflake model are normalized to reduce
redundancies.
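Snowflaking can be shown by normalizing one dimension of the star example. The names below (dim_item, dim_supplier) and the figures are hypothetical; note the extra join that the normalized dimension forces at query time.

```python
import sqlite3

# Hypothetical snowflake: the item dimension is normalized into dim_item
# plus dim_supplier, so supplier attributes are stored only once.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_supplier (supplier_key INTEGER PRIMARY KEY, supplier_name TEXT);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       supplier_key INTEGER REFERENCES dim_supplier(supplier_key));
CREATE TABLE fact_sales (item_key INTEGER, dollars_sold REAL);
INSERT INTO dim_supplier VALUES (1, 'Acme Corp');
INSERT INTO dim_item VALUES (1, 'TV', 1), (2, 'Phone', 1);
INSERT INTO fact_sales VALUES (1, 440), (2, 1560);
""")

# The normalized dimension costs one extra join per query.
total = con.execute("""
    SELECT s.supplier_name, SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_item     i ON f.item_key = i.item_key
    JOIN dim_supplier s ON i.supplier_key = s.supplier_key
    GROUP BY s.supplier_name
""").fetchone()
print(total)   # ('Acme Corp', 2000.0)
```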

SNOWFLAKE SCHEMA FOR SAME EXAMPLE

Normalized Dimensional Tables

Another Example of Snowflake Schema

Difference between the snowflake
and star schema models
 The major difference between the snowflake and star
schema models is that the dimension tables of the
snowflake model may be kept in normalized form to
reduce redundancies.

 Such a table is easy to maintain and saves storage space. However, this space savings is negligible in comparison to the typical magnitude of the fact table.

 Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query.
Fact constellation Schemas for Multidimensional
Model
 A fact constellation can have multiple fact tables that
share many dimension tables.
 This type of schema can be viewed as a collection of stars and snowflakes, and hence is called a galaxy schema or a fact constellation.

 The main disadvantage of fact constellation schemas is their more complicated design.

FACT CONSTELLATION

Data Warehouse Usage for
Information Processing
 Data warehouses and data marts are used in a wide
range of applications. Business executives use the data
in data warehouses and data marts to perform data
analysis and make strategic decisions.

 Data warehouses are used extensively in banking and financial services, consumer goods and retail distribution sectors, and controlled manufacturing such as demand-based production.

 There are three kinds of data warehouse applications:
information processing, analytical processing, and data
mining.

 Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. A current trend in data warehouse information processing is to construct low-cost web-based accessing tools that are then integrated with web browsers.

 Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It generally operates on historic data in both summarized and detailed forms. The major strength of online analytical processing over information processing is the multidimensional data analysis of data warehouse data.

 Data mining supports knowledge discovery by
finding hidden patterns and associations,
constructing analytical models, performing
classification and prediction, and presenting the
mining results using visualization tools.

OLAP vs. Data Mining
 The functionalities of OLAP and data mining can be
viewed as disjoint:

 OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.

 OLAP tools are targeted toward simplifying and supporting interactive data analysis, whereas the goal of data mining tools is to automate as much of the process as possible, while still allowing users to guide the process.

Data Warehouse Implementation
 A data cube is a lattice of cuboids. Suppose that you want to
create a data cube for AllElectronics sales that contains the
following: city, item, year, and sales in dollars. You want to be
able to analyze the data, with queries such as the following:

 “Compute the sum of sales, grouping by city and item.”
 “Compute the sum of sales, grouping by city.”
 “Compute the sum of sales, grouping by item.”

 What is the total number of cuboids, or group-by’s, that can be computed for this data cube?

 The total number of cuboids, or group-by’s, that can be computed for this data cube is 2^3 = 8: a cube with n dimensions and no concept hierarchies has 2^n cuboids, one per subset of the dimensions, including the apex cuboid.
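The count can be verified by enumerating the group-by's directly: each cuboid corresponds to one subset of the dimension set {city, item, year}, from the empty subset (the apex cuboid) up to the full set. A small sketch:

```python
from itertools import combinations

dims = ["city", "item", "year"]

# One cuboid per subset of dimensions, from the apex () up to (city, item, year).
cuboids = [subset
           for k in range(len(dims) + 1)
           for subset in combinations(dims, k)]

print(len(cuboids))   # 8 == 2**3
```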
The compute cube operator and the
Curse of Dimensionality

One approach to cube computation extends SQL so as to include a compute cube operator. First, the cube is defined using the SQL statement define cube, and then the compute cube operator computes aggregates over all subsets of the dimensions specified in the operation.
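The effect of the compute cube operator can be imitated in plain Python: aggregate a toy sales table over every subset of its dimensions. This is only a model of what the operator precomputes, not its SQL syntax, and the sales figures are invented.

```python
from itertools import combinations

sales = [  # toy fact data (made up for illustration)
    {"city": "Chicago",  "item": "TV",    "year": 2018, "dollars": 1560},
    {"city": "New York", "item": "TV",    "year": 2018, "dollars": 440},
    {"city": "New York", "item": "Phone", "year": 2019, "dollars": 300},
]
dims = ("city", "item", "year")

cube = {}
for k in range(len(dims) + 1):
    for group in combinations(dims, k):          # one group-by per subset
        agg = {}
        for row in sales:
            key = tuple(row[d] for d in group)
            agg[key] = agg.get(key, 0) + row["dollars"]
        cube[group] = agg

print(cube[()][()])                    # apex cuboid: total sales = 2300
print(cube[("city",)][("New York",)])  # sales grouped by city: 740
```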

Curse of Dimensionality
 Online analytical processing may need to access different
cuboids for different queries. Therefore, it may seem like a
good idea to compute in advance all or at least some of the
cuboids in a data cube. Precomputation leads to fast
response time and avoids some redundant
computation.
 A major challenge related to this precomputation,
however, is that the required storage space may explode
if all the cuboids in a data cube are precomputed,
especially when the cube has many dimensions. The
storage requirements are even more excessive when many
of the dimensions have associated concept hierarchies,
each with multiple levels.
 This problem is referred to as the curse of
dimensionality.
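This explosion is easy to quantify: for an n-dimensional cube where dimension i has L_i concept-hierarchy levels (excluding the virtual top level "all"), the total number of cuboids is the product of (L_i + 1) over all dimensions. The 10-dimension, 5-level example below is just an illustration of the growth:

```python
from math import prod

def total_cuboids(levels):
    """Total cuboids for a cube whose i-th dimension has levels[i]
    concept-hierarchy levels (the +1 adds the virtual top level 'all')."""
    return prod(L + 1 for L in levels)

print(total_cuboids([1, 1, 1]))   # 8, matching 2**3 with no hierarchies
print(total_cuboids([5] * 10))    # 60466176 cuboids for 10 dims, 5 levels each
```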

Data Cube Materialization
 There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the “nonbase”
cuboids. This leads to computing expensive multidimensional aggregates
on-the-fly, which can be extremely slow.
2. Full materialization: Precompute all of the cuboids. The resulting
lattice of computed cuboids is referred to as the full cube. This choice
typically requires huge amounts of memory space in order to store all of the
precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the
whole set of possible cuboids. Alternatively, we may compute a subset of
the cube, which contains only those cells that satisfy some user-specified
criterion, such as where the tuple count of each cell is above some
threshold. We will use the term subcube to refer to the latter case, where
only some of the cells may be precomputed for various cuboids. Partial
materialization represents an interesting trade-off between storage space
and response time.
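Choice 3 with a tuple-count threshold (an iceberg-style subcube) can be sketched as follows; the rows and the threshold are made up for illustration, and only cells whose count meets the threshold are materialized:

```python
from itertools import combinations

rows = [("Chicago", "TV"), ("Chicago", "TV"),
        ("New York", "TV"), ("New York", "Phone")]   # toy (city, item) tuples
dims = ("city", "item")
min_count = 2   # user-specified threshold

iceberg = {}
for k in range(len(dims) + 1):
    for group in combinations(range(len(dims)), k):  # subsets of dim indices
        counts = {}
        for r in rows:
            key = tuple(r[i] for i in group)
            counts[key] = counts.get(key, 0) + 1
        for key, c in counts.items():
            if c >= min_count:          # keep only "heavy" cells
                iceberg[(group, key)] = c

print(len(iceberg))   # 5 cells survive out of 8 candidate cells
```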

Data warehouse Application/usage
Most common sectors where Data warehouse is used:
 Airline:
In the airline system, it is used for operational purposes like crew assignment, analysis of route profitability, frequent flyer program promotions, etc.
 Banking:
It is widely used in the banking sector to manage the resources available on desk effectively. A few banks also use it for market research and performance analysis of products and operations.
 Healthcare:
The healthcare sector also uses the data warehouse to strategize and predict outcomes, generate patient treatment reports, and share data with tie-in insurance companies, medical aid services, etc.
 Public sector:
In the public sector, the data warehouse is used for intelligence gathering. It helps government agencies to maintain and analyze tax records and health policy records for every individual.

 Investment and Insurance sector:
In this sector, the warehouses are primarily used to analyze data
patterns, customer trends, and to track market movements.
 Retail chain:
In retail chains, the data warehouse is widely used for distribution and marketing. It also helps to track items, customer buying patterns, and promotions, and is used for determining pricing policy.
 Telecommunication:
A data warehouse is used in this sector for product promotions,
sales decisions and to make distribution decisions.
 Hospitality Industry:
This Industry utilizes warehouse services to design as well as
estimate their advertising and promotion campaigns where they
want to target clients based on their feedback and travel
patterns.
Online Analytical Processing Server (OLAP)
-Definition
 On-Line Analytical Processing (OLAP) is a category of
software technology that enables analysts, managers
and executives to gain insight into data through fast,
consistent, interactive access in a wide variety of
possible views of information that has been
transformed from raw data to reflect the real
dimensionality of the enterprise as understood by the
user.

 Analysts frequently need to group, aggregate, and join data. These operations in relational databases are resource intensive. With OLAP, data can be pre-calculated and pre-aggregated, making analysis faster.

 OLAP databases are divided into one or more cubes. The cubes are designed in such a way that creating and viewing reports becomes easy.

 At the core of the OLAP concept is the OLAP cube. The OLAP cube is a data structure optimized for very quick data analysis.
 The OLAP cube consists of numeric facts called measures which are categorized by dimensions. The OLAP cube is also called the hypercube.
 Usually, data operations and analysis are performed using a simple spreadsheet, where data values are arranged in row and column format. This is ideal for two-dimensional data. However, OLAP contains multidimensional data, with data usually obtained from different and unrelated sources. Using a spreadsheet is not an optimal option. The cube can store and analyze multidimensional data in a logical and orderly manner.

How does OLAP work?
 A data warehouse would extract information from multiple data sources and formats like text files, Excel sheets, multimedia files, etc.
 The extracted data is cleaned and transformed. Data is
loaded into an OLAP server (or OLAP cube) where
information is pre-calculated in advance for further
analysis.

Basic analytical operations of
OLAP
 Four types of analytical operations in OLAP are:
 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)

Roll-up
 Roll-up is also known as "consolidation" or
"aggregation." The Roll-up operation can be
performed in 2 ways
 Reducing dimensions
 Climbing up concept hierarchy. Concept hierarchy is a
system of grouping things based on their order or
level.

Roll-up
 In this example, cities New York and Chicago are rolled
up into country USA
 The sales figures of New York and Chicago are 440 and 1560 respectively. They become 2000 after roll-up.
 In this aggregation process, data in the location hierarchy moves up from city to country.
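That roll-up is just a grouped sum driven by the concept hierarchy, as this small sketch shows (figures taken from the example above):

```python
city_sales = {"New York": 440, "Chicago": 1560}
city_to_country = {"New York": "USA", "Chicago": "USA"}   # location hierarchy

# Climb the hierarchy: re-key each city's sales by its country and sum.
country_sales = {}
for city, amount in city_sales.items():
    country = city_to_country[city]
    country_sales[country] = country_sales.get(country, 0) + amount

print(country_sales)   # {'USA': 2000}
```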

Drill-down

 In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process.
 It can be done via moving down the concept hierarchy
 Introducing an additional dimension

Drill Down
 Quarter Q1 is drilled down to the months January, February, and March. Corresponding sales are also registered.
 In this example, the dimension month is added.

Slice

 Here, one dimension is selected, and a new sub-cube is created.
 Dimension Time is Sliced with Q1 as the filter.
 A new cube is created altogether.
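The slice can be sketched over a tiny cube stored as a Python dict keyed by (time, item, city); all names and figures here are made up.

```python
# Slice: fix one dimension (time = "Q1"), yielding a 2-D
# sub-cube keyed by the remaining dimensions (item, city).
cube = {
    ("Q1", "mobile", "Delhi"): 100, ("Q1", "mobile", "Mumbai"): 150,
    ("Q1", "laptop", "Delhi"): 200, ("Q2", "mobile", "Delhi"): 120,
}

def slice_cube(cube, time_value):
    """Filter the cube on one time value and drop the time dimension."""
    return {(item, city): v
            for (t, item, city), v in cube.items() if t == time_value}

print(slice_cube(cube, "Q1"))
# {('mobile', 'Delhi'): 100, ('mobile', 'Mumbai'): 150, ('laptop', 'Delhi'): 200}
```

A dice works the same way, except the filter keeps chosen values on two or more dimensions instead of fixing a single one.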

2/16/2019 85
2/16/2019 86
Dice

 This operation is similar to a slice. The difference is
that in dice you select 2 or more dimensions, resulting
in the creation of a sub-cube.

2/16/2019 87
2/16/2019 88
Pivot

 In pivot, you rotate the data axes to provide an
alternative presentation of the data.
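The rotation can be sketched as a plain transpose of a 2-D view (the figures are made up):

```python
# Pivot (rotate): swap the row and column axes of a 2-D view to get
# an alternative presentation of the same data.
view = [[440, 1560],
        [300,  700]]          # rows = items, columns = cities

def pivot(table):
    """Transpose rows and columns."""
    return [list(col) for col in zip(*table)]

print(pivot(view))  # [[440, 300], [1560, 700]] -- rows = cities now
```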

2/16/2019 89
2/16/2019 90
2/16/2019 91
Types of OLAP Servers/Models

 MOLAP – Multidimensional OLAP
 ROLAP – Relational OLAP
 HOLAP – Hybrid OLAP
 DOLAP – Desktop OLAP
 WOLAP – Web OLAP

2/16/2019 92
MOLAP
 In the MOLAP model, data for analysis is stored in
specialized multidimensional databases. Large
multidimensional arrays form the storage structures.
 Precalculated and prefabricated multidimensional
data cubes are stored in multidimensional databases.
 The MOLAP engine in the application layer pushes a
multidimensional view of the data from the MDDBs to
the users.
 The users who need summarized data enjoy fast
response times from the preconsolidated data.

2/16/2019 93
2/16/2019 94
ROLAP
 ROLAP works with data that exist in a relational
database. Facts and dimension tables are stored as
relational tables. It also allows multidimensional
analysis of data and is the fastest growing OLAP.

 True ROLAP has three distinct characteristics:
 Supports all the basic OLAP features and functions
discussed earlier
 Stores data in a relational form
 Supports some form of aggregation
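Because ROLAP stores facts relationally, aggregation is expressed in SQL. A minimal sketch with Python's built-in sqlite3 (the table and column names are illustrative):

```python
import sqlite3

# ROLAP sketch: facts live in a relational table and the relational
# engine performs the roll-up via GROUP BY.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_fact (city TEXT, country TEXT, amount INT)")
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                [("New York", "USA", 440), ("Chicago", "USA", 1560)])

# Roll-up from city to country, expressed in SQL:
row = con.execute(
    "SELECT country, SUM(amount) FROM sales_fact GROUP BY country"
).fetchone()
print(row)  # ('USA', 2000)
```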

2/16/2019 95
 Local hypercubing is a variation of ROLAP provided
by vendors. This is how it works:
 1. The user issues a query.
 2. The results of the query get stored in a small, local,
multidimensional database.
 3. The user performs analysis against this local database.
 4. If additional data is required to continue the analysis,
the user issues another query and the analysis continues

2/16/2019 96
ROLAP
Drawbacks of ROLAP model:
 Demand for higher resources: ROLAP needs high
utilization of manpower, software, and hardware
resources.
 Aggregate data limitations: ROLAP tools use SQL for
all calculations on aggregate data. However, SQL is not
well suited to complex, multi-step computations.
 Slow query performance: query performance in this
model is slow compared with MOLAP.

2/16/2019 97
2/16/2019 98
2/16/2019 99
 HOLAP – Refers to Hybrid OLAP. This model attempts
to combine the strengths and features of ROLAP and
MOLAP.
 DOLAP – Refers to Desktop OLAP. It is meant to
provide portability to OLAP users. In this methodology,
multidimensional datasets are created and transferred
to the desktop machine, requiring only the DOLAP
software to exist on that machine.
 WOLAP – Refers to Web OLAP, where OLAP data is
accessible from a web browser.

2/16/2019 100
Advantages of OLAP
 OLAP is a platform for all types of business planning,
budgeting, reporting, and analysis.
 Information and calculations are consistent in an OLAP cube.
This is a crucial benefit.
 Quickly create and analyze "what if" scenarios.
 Easily search the OLAP database for broad or specific terms.
 OLAP provides the building blocks for business modeling tools,
data mining tools, and performance reporting tools.
 Allows users to slice and dice cube data by various
dimensions, measures, and filters.
 It is good for analyzing time series.
 Finding clusters and outliers is easy with OLAP.
 It is a powerful online analytical processing system with
visualization that provides fast response times.

2/16/2019 101
Disadvantages of OLAP

 OLAP requires organizing data into a star or snowflake
schema. These schemas are complicated to implement
and administer.
 You cannot have a large number of dimensions in a
single OLAP cube.
 Transactional data cannot be accessed with an OLAP
system.
 Any modification to an OLAP cube needs a full update
of the cube. This is a time-consuming process.

2/16/2019 102
Selection Criteria for OLAP Tools
 Multidimensional representation of data.
 Aggregation, summarization, precalculation, and
derivations.
 Formulas and complex calculations in an extensive
library.
 Cross-dimensional calculations.
 Time intelligence such as year-to-date, current and
past fiscal periods, moving averages, and moving
totals.
 Pivoting, cross-tabs, drill-down, and roll-up along
single or multiple dimensions.

2/16/2019 103
OLAP Tools
 IBM Cognos
 SAP NetWeaver BW
 Essbase
 icCube
 Oracle Database OLAP option

2/16/2019 104
Starnet Query Model (previous year
paper)
 The querying of multidimensional databases can be based on a
starnet model, which consists of radial lines emanating from a
central point, where each line represents a concept hierarchy for
a dimension. Each abstraction level in the hierarchy is called a
footprint. These represent the granularities available for use by
OLAP operations such as drill-down and roll-up.

 A starnet query model for the AllElectronics data warehouse is
shown. This starnet consists of four radial lines, representing concept
hierarchies for the dimensions location, customer, item, and time,
respectively. Each line consists of footprints representing abstraction
levels of the dimension. For example, the time line has four
footprints: "day," "month," "quarter," and "year."

2/16/2019 105
 A concept hierarchy may involve a single attribute
(e.g., date for the time hierarchy) or several attributes
(e.g., the concept hierarchy for location involves the
attributes street, city, province or state, and country).

 In order to examine the item sales at AllElectronics,
users can roll up along the time dimension from month
to quarter, or, say, drill down along the location
dimension from country to city.

2/16/2019 106
2/16/2019 107
Data Mining
 Data Mining is defined as extracting information from
huge sets of data.
 In other words, we can say that data mining is the
procedure of mining knowledge from data.
 Data mining is the process of extracting useful
information stored in large databases.
 It is a powerful tool that helps organizations
retrieve useful information from available data
warehouses.
 Data mining can be applied to relational databases, object-
oriented databases, data warehouses, structured-
unstructured databases, etc.
 Data mining is used in numerous areas like banking,
insurance companies, pharmaceutical companies etc.
2/16/2019 108
Data Mining Architecture (previous
year paper)

2/16/2019 109
Components of Data Mining Architecture
 Data Sources
 Operational database, World Wide Web (WWW), text files and other
documents are the actual sources of data. You need large volumes of
historical data for data mining to be successful. Organizations usually
store data in databases or data warehouses. Data warehouses may
contain one or more databases, text files, spreadsheets or other kinds of
information repositories.

 Different Processes
 The data needs to be cleaned, integrated and selected before passing it
to the database or data warehouse server. As the data is from different
sources and in different formats, it cannot be used directly for the data
mining process because the data might not be complete and reliable.
So, first data needs to be cleaned and integrated. Again, more data than
required will be collected from different data sources and only the data
of interest needs to be selected and passed to the server. These
processes are not as simple as we think. A number of techniques may
be performed on the data as part of cleaning, integration and selection.

2/16/2019 110
 Data Warehouse:
 A data warehouse is a place which stores information collected
from multiple sources under a unified schema. Information
stored in a data warehouse is critical to organizations for the
process of decision-making.
 Data Mining Engine:
 Data Mining Engine is the core component of data mining
process which consists of various modules that are used to
perform various tasks like clustering, classification,
prediction and correlation analysis.
 Pattern Evaluation:
 The patterns generated by Data Mining Engine are evaluated
by the pattern evaluation module for the measure of
interestingness of the pattern by using a threshold value. It
interacts with the data mining engine to focus the search
towards interesting patterns. For example, support and
confidence are used to judge whether the association rules
derived from Market Basket Analysis are interesting or not.

2/16/2019 111
 Graphical User Interface
 The graphical user interface module communicates between
the user and the data mining system. This module helps the
user use the system easily and efficiently without knowing the
real complexity behind the process. When the user specifies a
query or a task, this module interacts with the data mining
system and displays the result in an easily understandable
manner.
 Knowledge Base
 The knowledge base is helpful in the whole data mining
process. It might be useful for guiding the search or
evaluating the interestingness of the result patterns. The
knowledge base might even contain user beliefs and data from
user experiences that can be useful in the process of data
mining. The data mining engine might get inputs from the
knowledge base to make the result more accurate and reliable.
The pattern evaluation module interacts with the knowledge
base on a regular basis to get inputs and also to update it.
2/16/2019 112
OLAM = OLAP + DATA MINING
 Online Analytical Mining (OLAM) integrates online
analytical processing (OLAP) with data mining, mining
knowledge in multidimensional databases.
 Since data mining works on preprocessed data stored
in data warehouses, and OLAP cubes are computed
over data warehouses, applying data mining
techniques to OLAP cubes makes the best use of
available infrastructure rather than building
everything from scratch.

2/16/2019 113
 Effective data mining needs exploratory data analysis. A
user will often want to traverse through a database, select
portions of relevant data, analyze them at different
granularities, and present knowledge/results in different
forms.

 Multidimensional data mining provides facilities for
mining on different subsets of data and at varying levels of
abstraction—by drilling, pivoting, filtering, dicing, and
slicing on a data cube and/or intermediate data mining
results.

 This, together with data/knowledge visualization tools,
greatly enhances the power and flexibility of data mining.

2/16/2019 114
Mining frequent patterns
 Frequent patterns are patterns (e.g., itemsets, subsequences, or
substructures) that appear frequently in a data set.

 For example, a set of items, such as milk and bread, that
appear frequently together in a transaction data set is a
frequent itemset.

 A subsequence, such as buying first a PC, then a digital
camera, and then a memory card, if it occurs frequently in a
shopping history database, is a (frequent) sequential pattern.

 A substructure can refer to different structural forms, such as
subgraphs, subtrees, or sublattices, which may be combined
with itemsets or subsequences.

2/16/2019 115
 If a substructure occurs frequently, it is called a
(frequent) structured pattern. Finding frequent
patterns plays an essential role in mining associations,
correlations, and many other interesting relationships
among data.

 Moreover, it helps in data classification, clustering,
and other data mining tasks. Thus, frequent pattern
mining has become an important data mining task
and a focused theme in data mining research.

 Frequent pattern mining searches for recurring
relationships in a given data set.

2/16/2019 116
Support and Confidence (previous year
paper)
 The rule A- B holds in the transaction set D with
support s, where s is the percentage of transactions in
D that contains A U B.

 The rule A-- B has confidence c in the transaction


set D, where c is the percentage of transactions in D
containing A that also contains B.
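These two definitions translate directly into code; the toy transactions below are illustrative.

```python
# Support and confidence computed straight from their definitions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B):
    """Fraction of transactions containing A that also contain B."""
    return support(A | B) / support(A)

print(support({"milk", "bread"}))       # 0.5 -> support of milk => bread
print(confidence({"milk"}, {"bread"}))  # 2/3 of milk buyers also buy bread
```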

2/16/2019 117
2/16/2019 118
Market Basket Analysis
 Frequent itemset mining leads to the discovery of
associations and correlations among items in large
transactional or relational data sets. With massive amounts
of data continuously being collected and stored, many
industries are becoming interested in mining such patterns
from their databases.

 The discovery of interesting correlation relationships
among huge amounts of business transaction records can
help in many business decision-making processes such as
catalog design, cross-marketing, and customer shopping
behavior analysis.

2/16/2019 119
 A typical example of frequent itemset mining is
market basket analysis. This process analyzes
customer buying habits by finding associations
between the different items that customers place in
their “shopping baskets”.

 The discovery of these associations can help retailers
develop marketing strategies by gaining insight into
which items are frequently purchased together by
customers. For instance, if customers are buying milk,
how likely are they to also buy bread (and what kind of
bread) on the same trip?

2/16/2019 120
Market Basket Analysis
 “Which groups or sets of items are customers likely to
purchase on a given trip to the store?”

 To answer your question, market basket analysis may be
performed on the retail data of customer transactions at your
store. You can then use the results to plan marketing or
advertising strategies, or in the design of a new catalog.

 For instance, market basket analysis may help you design
different store layouts. In one strategy, items that are frequently
purchased together can be placed in proximity to further
encourage the combined sale of such items. If customers who
purchase computers also tend to buy antivirus software at the
same time, then placing the hardware display close to the
software display may help increase the sales of both items.

2/16/2019 121
2/16/2019 122
 Market basket analysis can also help retailers plan
which items to put on sale at reduced prices. If
customers tend to purchase computers and printers
together, then having a sale on printers may encourage
the sale of printers as well as computers.

 Buying patterns reflect items that are frequently
associated or purchased together. These patterns can
be represented in the form of association rules.

 Association rules are if-then statements that help
to show the probability of relationships
between data items within large data sets in
various types of databases.
2/16/2019 123
Association Rule Mining (previous year
paper)
 Association rule learning is a rule-based machine
learning method for discovering interesting
relations between variables in large databases. It
is intended to identify strong rules discovered in
databases using some measures of
interestingness.

 Market Basket Analysis is one of the applications of
Association Rule Mining that can be used to find
frequent itemsets, that is, items that are frequently
purchased together.

2/16/2019 124
Apriori Algorithm: Finding Frequent
Itemsets by Confined Candidate Generation
 Apriori is a seminal algorithm proposed by R. Agrawal and
R. Srikant in 1994 for mining frequent itemsets for Boolean
association rules.
 The name of the algorithm is based on the fact that the
algorithm uses prior knowledge of frequent itemset
properties.
 Apriori employs an iterative approach known as a level-wise
search, where frequent k-itemsets are used to explore (k + 1)-
itemsets.
 First, the set of frequent 1-itemsets is found by scanning
the database to accumulate the count for each item, and
2/16/2019 125
 collecting those items that satisfy minimum support.
The resulting set is denoted by L1.
 Next, L1 is used to find L2, the set of frequent 2-
itemsets, which is used to find L3, and so on, until no
more frequent k-itemsets can be found.
 The finding of each Lk requires one full scan of the
database.
 To improve the efficiency of the level-wise generation
of frequent itemsets, an important property called the
Apriori property is used to reduce the search space.
 Apriori property: All nonempty subsets of a
frequent itemset must also be frequent.
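The level-wise search above can be sketched compactly; the toy data is illustrative, min_sup is an absolute count here, and the join/prune steps are simplified.

```python
from itertools import combinations

# Apriori sketch: L1 seeds a level-wise loop in which frequent
# k-itemsets generate candidate (k+1)-itemsets, pruned with the
# Apriori property before support counting.
def apriori(transactions, min_sup):
    items = {i for t in transactions for i in t}
    def count(c):
        return sum(c <= t for t in transactions)
    Lk = [frozenset([i]) for i in items if count(frozenset([i])) >= min_sup]
    frequent = list(Lk)
    k = 2
    while Lk:
        # Join step: combine members of Lk, keeping candidates of size k.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (Apriori property) plus support counting.
        Lk = [c for c in candidates
              if all(frozenset(s) in frequent for s in combinations(c, k - 1))
              and count(c) >= min_sup]
        frequent += Lk
        k += 1
    return frequent

T = [{"milk", "bread"}, {"milk", "bread", "butter"},
     {"bread", "butter"}, {"milk", "bread"}]
result = apriori(T, 2)
print(sorted(tuple(sorted(f)) for f in result))
```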

2/16/2019 126
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur
at least as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By
definition, these rules must satisfy minimum support and minimum
confidence.

2/16/2019 127
2/16/2019 128
2/16/2019 129
Association Rules for Example in slide 116

2/16/2019 130
Refer following pages for
Apriori Algorithm and
FP-Growth Algorithm

2/16/2019 131
From Association Mining to
Correlation Analysis
• After generating the association rules using the
Apriori or FP-Growth algorithm based on the
minimum support and minimum confidence
interestingness measures, it can be concluded
which rules are strong and which are not.

• But not all strong rules (ones that have support and
confidence greater than the minimum support and
confidence) are interesting and impactful for
business.
Correlation Analysis
• To decide whether two items in a rule A ⇒ B
are positively, negatively or not correlated,
another interestingness measure that is used
apart from support and confidence is lift,
which can be computed as:
• lift(A, B) = P(A ∪ B) / (P(A) × P(B))
• If the resulting value of lift is less than 1, then the
occurrence of A is negatively correlated with the
occurrence of B, meaning that the occurrence of
one likely leads to the absence of the other one.

• If the resulting value is greater than 1, then A
and B are positively correlated, meaning that the
occurrence of one implies the occurrence of the
other.

• If the resulting value is equal to 1, then A and B
are independent and there is no correlation
between them.
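Lift follows directly from its definition; the toy transactions below are illustrative.

```python
# Lift computed from its definition: lift(A, B) = P(A and B) / (P(A) * P(B)).
# A value > 1, < 1, or = 1 indicates positive, negative, or no correlation.
transactions = [
    {"tea", "biscuits"},
    {"tea", "biscuits"},
    {"tea"},
    {"coffee"},
]

def p(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(A, B):
    return p(A | B) / (p(A) * p(B))

print(lift({"tea"}, {"biscuits"}))  # > 1: tea and biscuits are positively correlated
```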
EXAMPLE
