
DATA WAREHOUSE AND

DATA MINING
Team Members
• Rohan Cambell (11)
• Vishal Chaudhari (12)
• Namitha Chitnis (13)
• Anurag Chivilkar (14)
• Poonam Dhakol (15)
Introduction to Data
Warehousing
• A single, complete and consistent
store of data obtained from a variety
of different sources, made available
to end users in a way they can
understand and use in a business context
Introduction to Data
Warehousing
A data warehouse is a

• subject-oriented
• integrated
• time-varying
• non-volatile
 collection of data that is used
primarily in organizational
decision making. – Bill Inmon,
Building the Data Warehouse
1996
Subject Oriented
• Data is arranged and optimized to
provide answers to questions from
diverse functional areas.
• Data is organized and summarized by topic:
Sales, Marketing, Finance, Distribution, etc.
Time Variant
• The Data Warehouse
represents the flow
of data through
time
• Can contain projected
data from statistical
models
• Data is periodically
uploaded then time-
dependent data is
Non Volatile
• Once data is entered it is NEVER
removed
• Represents the company’s
entire history
– Near term history is continually
added to it
– Always growing
– Must support terabyte databases
and multiprocessors
• Read-Only database for data
WHY DATA WAREHOUSING?
A Producer wants to know...
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which is the most effective distribution channel?
• Which product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue?
Comparing a Data Warehouse
and an Operational Database

Data Warehouse      Operational Database
Subject oriented    Application oriented
Integrated          Multiple diverse sources
Time-variant        Real time
Non-volatile        Updateable

Warehouse Architecture

• Architecture, in the context of
an organization's data
warehousing efforts, is a
conceptualization of how the
data warehouse is built.
• The worthiness of the
architecture can be judged
from how the
conceptualization aids in the
building, maintenance, and
Warehouse Architecture
[Diagram, top to bottom: Clients; Query & Analysis; Warehouse with Metadata; Integration; multiple Sources]


• A data warehouse architecture
consists of the following
interconnected layers:
1) Operational Database Layer
The source data for the data warehouse.
An organization's Enterprise Resource
Planning systems fall into this layer.
2) Data Access Layer
The interface between the operational and
informational access layers. Tools to extract,
transform, load data into the

3) Metadata Layer
• The data directory -
This is usually more
detailed than an
operational system
data directory.
• There are dictionaries
for the entire
warehouse and
sometimes
dictionaries for the
data that can be
accessed by a

4) Informational Access Layer
• The data accessed for reporting
and analysis, and the tools used
to report on and analyze that data.
Business Intelligence tools
fall into this layer.

DESIGN PROCESS OF
DATA WAREHOUSE
The actual design process for
developing a data warehouse
• talk to the users
• determine their needs in terms
that can be measured
• design a database to support
those needs
• document the data descriptions
and other attributes
• design the logic for translating
data from various sources into
an integrated data store
The actual design process for
developing a data warehouse
• write code to extract data
from the various sources and
transform it for the data
warehouse, with updates to the
metadata
• finally, package the procedures
to handle scheduling,
management and maintenance
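The extract, transform and load steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source layouts (`SOURCE_A`, `SOURCE_B`), the `sales` table and the `load_metadata` table are hypothetical names invented for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical extracts from two operational systems with different layouts.
SOURCE_A = [{"cust": "C1", "amount": "120.50", "date": "2010-02-01"}]
SOURCE_B = [{"customer_id": "C2", "amt_cents": 9900, "day": "2010-02-01"}]


def transform(rows_a, rows_b):
    """Translate both source layouts into one integrated row format."""
    rows = []
    for r in rows_a:
        rows.append((r["cust"], float(r["amount"]), r["date"], "A"))
    for r in rows_b:
        rows.append((r["customer_id"], r["amt_cents"] / 100, r["day"], "B"))
    return rows


def load(conn, rows):
    """Append rows to the warehouse table and record the load in the metadata."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer TEXT, amount REAL, sale_date TEXT, source TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_metadata "
        "(table_name TEXT, rows_loaded INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.execute("INSERT INTO load_metadata VALUES (?, ?)", ("sales", len(rows)))
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
load(conn, transform(SOURCE_A, SOURCE_B))
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)
```

In a real project the scheduling, management and maintenance procedures would wrap functions like these; the sketch only shows the data path.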

Flow Diagram
Example
Data Warehouse
Components
• LOAD MANAGEMENT
– relates to the collection of
information from disparate
internal or external sources.
– the loading process includes
summarizing, manipulating and
changing data structures into a
format that lends itself to
analytical processing.
– Actual raw data kept
Data Warehouse
Components
• WAREHOUSE MANAGEMENT
– relates to day-to-day
management of the data
warehouse.
– management tasks associated
with the warehouse include
ensuring its availability, the
effective backup of its contents,
and its security.
Data Warehouse
Components
• QUERY MANAGEMENT
– relates to the provision of access
to the contents of the warehouse
– includes partitioning of information
into different areas with different
privileges for different users.
– Access may be provided
through custom-built
applications or ad hoc
query tools.
Types of Data
Warehousing Applications
• Data warehousing systems
target three different types of
applications:
– personal productivity
– query and reporting
– planning and analysis
Personal Productivity
Applications
• Spreadsheets, statistical
packages and graphics tools
are useful for manipulating and
presenting data on individual
PCs.
• Developed for a standalone
environment, these tools
address applications requiring
only small volumes of
warehouse data.
Data query and reporting
applications
• deliver warehouse-wide data
access through simple,
list-oriented queries and the
generation of basic reports.
• These reports provide a view of
historical data,
• but they do not address the
enterprise need for in-depth
analysis and planning.
Planning and analysis
applications
address essential business
requirements such as
• budgeting,
• forecasting,
• product line and customer
profitability,
• sales analysis,
• financial consolidations
• manufacturing mix analysis
• --applications that use
historical, projected and
Technologies Involved In Data
Warehousing
• source system identification
• data warehouse design and creation
• changed data capture
• data acquisition
• data cleansing
• data aggregation
• multi-dimensional analysis tools
• business intelligence (BI)
• metadata management
• data mining tools
• data visualization tools
• query tools
Benefits of Data
Warehouse
• A data warehouse provides a
common data model for all data
of interest regardless of the
data's source.
• Prior to loading data into the
data warehouse,
inconsistencies are identified
and resolved. This greatly
simplifies reporting and
analysis.
• Information can be stored for
long

Benefits of Data
Warehouse
• Data warehouses provide retrieval
of data without slowing down
operational systems, because they
are separate from those systems.
• Data Warehouse can work in
conjunction with operational
business applications such as
CRM.
• Data warehouses facilitate decision
 support system applications such
as
Disadvantages of Data
Warehouse
• Data warehouses are not
the optimal environment
for unstructured data.
• Because data must be
extracted, transformed
and loaded into the
warehouse, there is an
element of latency in data
warehouse data.
• Over their life, data
Disadvantages of Data
Warehouse
• Data warehouses can get
outdated relatively
quickly. There is a cost of
delivering suboptimal
information to the
organization.
• There is a fine line
between data warehouse
and operational systems.
So the functionality
developed may be
Very Large Databases
Terabytes -- 10^12 bytes : Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes : Geographic Information Systems
Exabytes -- 10^18 bytes : National Medical Records
Zettabytes -- 10^21 bytes : Weather images
Yottabytes -- 10^24 bytes : Intelligence Agency Videos

DATA MINING
Data Mining
• Discovers previously unknown
data characteristics,
relationships, dependencies,
or trends
• Typical data analysis relies on
end users to:
– Define the Problem
– Select the Data
– Initiate the Data Analysis
Data Mining
• Proactive
• Automatically searches
– Anomalies
– Possible Relationships
– Identify Problems before the
end-user
• Data Mining tools analyze the
data, uncover problems or
opportunities hidden in data
relationships, form computer
models based on their findings,
and then use the models to
Four Phases of Data
Mining
• Data Preparation
– Identify the main data sets to
be used by the data mining
operation (usually the data
warehouse)
• Data Analysis and Classification
– Study the data to identify
common data
characteristics or patterns
– Data groupings,
classifications, clusters,
sequences
– Data dependencies, links, or
relationships
Four Phases of Data
Mining
• Knowledge Acquisition
– Uses the Results of the Data
Analysis and Classification
phase
– Data mining tool selects the
appropriate modeling or
knowledge-acquisition
algorithms
• Neural Networks
• Decision Trees
• Rules Induction
• Genetic algorithms
• Memory-Based Reasoning
Four Phases of Data
Mining
• Prognosis
– Predict Future Behavior
– Forecast Business
Outcomes
• 65% of customers who
did not use a
particular credit card
in the last 6 months
are 88% likely to
cancel the account.

Some Basic
Operations
• Predictive:
– Regression
– Classification
• Descriptive:
– Clustering / similarity
matching
– Association rules and variants
– Deviation detection
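Deviation detection, the last operation in the list above, can be illustrated with a simple z-score check. This is a minimal sketch under assumptions: the `daily_sales` figures and the threshold of 2 standard deviations are invented for the example, and real tools use more robust methods.

```python
from statistics import mean, pstdev


def deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - m) / s > threshold]


# Hypothetical daily sales figures with one unusual day.
daily_sales = [100, 102, 98, 101, 99, 100, 250]
print(deviations(daily_sales))
```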
Classification
• Given old data about customers
and payments, predict new
applicant’s loan eligibility.
[Diagram: Previous Customers (Age, Salary, Profession, Location, Customer Type) feed the Classifier, which produces Decision Rules (Salary > 5L; Prof. = Exec); the rules are applied to the New Applicant's Data to label it Good / Bad]
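A rule such as "Salary > 5L" can be learned from old customer data with a very small classifier. The sketch below is an assumption-laden illustration, not the method used on the slide: it fits a one-feature threshold rule (a decision stump) on hypothetical `history` records of salary in lakhs and a Good/Bad label.

```python
def fit_stump(records):
    """Pick the salary threshold t whose rule 'salary > t means Good'
    makes the fewest mistakes on the labelled records."""
    candidates = sorted({salary for salary, _ in records})
    best_t, best_errors = None, len(records) + 1
    for t in candidates:
        errors = sum(
            ("Good" if salary > t else "Bad") != label
            for salary, label in records
        )
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def predict(threshold, salary):
    """Apply the learned rule to a new applicant."""
    return "Good" if salary > threshold else "Bad"


# Hypothetical previous customers: (salary in lakhs, payment record).
history = [(2.0, "Bad"), (3.5, "Bad"), (5.0, "Bad"),
           (6.0, "Good"), (8.0, "Good"), (12.0, "Good")]
threshold = fit_stump(history)
print(threshold, predict(threshold, 7.0))
```

Real classifiers combine several attributes (profession, location and so on); a single threshold is used here only to keep the example short.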
• Regression:- Attempts to find a
function which models the data
with the least error.
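As a minimal sketch of regression, the code below fits a straight line by ordinary least squares, that is, it picks the slope and intercept that minimize the sum of squared errors. The sample points are invented for the example, and a straight line is only one possible choice of function.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```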

• Clustering:- Is like classification
but the groups are not
predefined, so the algorithm
will try to group similar items
together.
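One common way to group similar items without predefined groups is k-means. The sketch below is a simplified one-dimensional version under assumptions: the `values`, the two starting centers and the fixed iteration count are all chosen for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Repeatedly assign each value to its nearest center,
    then move each center to the mean of its group."""
    centers = list(centers)
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical values forming two obvious groups.
values = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, groups)
```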

Association Rule
Learning
• Searches for relationships
between variables.
• For example a supermarket
might gather data on
customer purchasing habits.
Using association rule
learning, the supermarket can
determine which products are
frequently bought together
and use this information for
marketing purposes.
• This is sometimes referred to as
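The supermarket example can be made concrete with the two standard measures behind association rules, support and confidence. The baskets below are hypothetical, and the sketch only scores rules that are given to it rather than searching for them.

```python
# Hypothetical shopping baskets, one set of products per customer visit.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(lhs, rhs):
    """Of the baskets containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)


print(support({"bread", "butter"}))       # how often the pair is bought together
print(confidence({"butter"}, {"bread"}))  # strength of the rule butter -> bread
```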
Notable Uses
• Games
• Business:-
– Customer Relationship Management
– Market Basket Analysis
– Study of Consumer Behaviour
• Science and Engineering
– Bioinformatics, genetics, medicine,
education and electrical power
engineering

Data Mining
• Still a New Technique
• May find many Meaningless
Relationships
• Good at finding Practical
Relationships
– Define Customer Buying Patterns
– Improve Product Development
and Acceptance
– Etc.
• Potential of becoming the next
Data Mining V/s Data
Warehousing
• Data Warehousing supports
decision making with data.
• Data Mining can be used in
conjunction with a data
warehouse to help with certain
types of decisions.
• To make data mining more
efficient, the data warehouse
should have an aggregated or
summarized collection of data
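An aggregated or summarized collection can be as simple as a table built with GROUP BY. The sketch below assumes a hypothetical `sales` table and uses an in-memory SQLite database in place of a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "A", 100.0), ("East", "B", 50.0), ("West", "A", 70.0)],
)

# Build the summary once so mining tools can read it without rescanning the detail rows.
conn.execute(
    "CREATE TABLE sales_by_region AS "
    "SELECT region, SUM(amount) AS total, COUNT(*) AS n "
    "FROM sales GROUP BY region"
)
summary = {
    region: total
    for region, total, _ in conn.execute(
        "SELECT region, total, n FROM sales_by_region"
    )
}
print(summary)
```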

Data Mining V/s Data
Warehousing
• Data mining helps in extracting
meaningful new patterns that
cannot necessarily be found by
merely querying or processing
data or metadata in the data
warehouse.
• In short we can say that
successful
 use of data mining applications
will

Walmart Case Study
• Founded by Sam Walton
• One of the largest
supermarket chains in the US

• Walmart = 2000+ retail
stores
• SAM's Clubs = 100+
wholesale stores
Walmart Case Study
Old Paradigm:-
• WalMart
– Inventory Management
– Merchandise Accounts Payable
– Purchasing
– Supplier Promotions: National, Region, Store Level
• Suppliers
– Accept Orders
– Promote Products
– Provide special Incentives
– Monitor and Track the Incentives
– Bill and Collect Receivables
– Estimate Retailer
Walmart Case Study
New (Just-in-Time) retail
paradigm:-
• No more deals
• Shelf-Pass Through (POS Application)
– One Unit Price
• Suppliers paid once a week on ACTUAL
items sold
– WalMart Manager
• Daily Inventory Restock
• Suppliers (sometimes same day) ship to
WalMart
• Warehouse-Pass Through
– Stock some Large Items
• Delivery may come from supplier
– Distribution Center
• Supplier's merchandise unloaded
directly onto Wal*Mart Trucks
Walmart Case Study
Walmart System:-
• NCR 5100M, 96 Nodes; 24 TB Raw Disk; 700-1000 Pentium CPUs
Number of Rows: > 5 Billion
Historical Data: 65 weeks (5 Quarters)
New Daily Volume: Current Apps: 75 Million; New Apps: 100 Million+
Number of Users: Thousands
Number of Queries: 60,000 per week
Thank you
&
Happy Valentine
Day
