Sunteți pe pagina 1din 28

Retail Sales

Overview

Four-step dimensional design process


Transaction-level fact tables
Additive and non-additive facts
Sample dimension table attributes
Causal dimensions
Degenerate dimensions
Extending an existing dimension model
Snowflaking dimension attributes
Avoiding the too many dimensions trap
Surrogate keys

Four-Step Dimensional Design Process


1.

Select the business process to model.

2.

not business department or function


E.g., purchasing, ordering, shipping,
invoicing, inventorying

Declare the grain of the business


process.

Specifies individual fact table row


E.g., individual line item on sales ticket, daily
snapshot of the inventory levels for a product

Four-Step Dimensional Design Process


3.

Choose the dimensions that apply for each fact


table row.

4.

Identify the numeric (measured) facts that will


populate each fact table row.

Q: How do business people describe the data that


results from the business process?
E.g., date, product, store, customer, transaction
type
Q: What are we measuring?
Typical facts are numeric additive figures
E.g., quantity ordered, dollar cost amount

In making decisions regarding the 4 steps,


consider both the user requirements as well as
the realities of the source data

Retail Case Study

Large grocery chain: 100 grocery stores over


10regions
Each store:

Departments: grocery, frozen foods, dairy, meat,


produce, bakery, floral, health/beauty aids, etc.
60,000 products (SKUs = stock keeping units) on
shelves
55,000 SKUs with UPCs
5,000 SKUs without UPCs but with assigned SKU
numbers

Data is collected:

from cash registers into a point-of-sale (POS)


system
at back door where vendors make deliveries

Retail Case Study Contd

Management concerns

Logistics of ordering, stocking, and selling


products
Maximizing profit
Product pricing
Lowering cost of acquisition and overhead
Use of promotions to increase sales
temporary price reductions
newspaper ads
grocery store displays
coupons

Step 1. Select the Business Process

Decide what business process to model, by


combining an understanding of the business
requirements with an understanding of data
realities.
The first dimensional model built should be the
one

with the most impact,


that answers the most pressing business questions,
is readily accessible for data extraction.

In retail case study: POS retail sales


Business Question: What products are selling in
which stores on what days and under what
promotional conditions?

Step 2. Declare the Grain

What level of data detail should be made


available in the dimensional model?
Choose the most atomic information
captured by the business process.

Atomic data
Most detailed, cannot be subdivided
Facilitates ad hoc, unexpected usage and
ability to drill down to details

Case study grain: individual line item on


a POS transaction

Step 3. Choose the Dimensions

A careful grain statement determines the


primary dimensions.
It is then usually possible to add
additional dimensions.
If an additional desired dimension violates
the grain by causing additional fact rows
to be generated, then the grain statement
must be revised to accommodate this
dimension.
Case study dimensions: date, product,
store, promotion

Preliminary Retail Sales Schema

POS Sales Transaction Fact

Product Dimension

Promotion Key (PK)


Promotion attributes TBD

Date Dimension

Product Key (PK)


Product attributes TBD

Promotion Dimension

Date Key (FK)


Product Key (FK)
Store Key (FK)
Promotion Key (FK)
POS Transaction Number
Other facts TBD

Date Key (PK)


Date attributes TBD

Store Dimension

Store Key (PK)


Store attributes TBD

Step 4. Identify the Facts

Picking the business measurements for the fact


table: true to the grain.
Case study - Facts collected by POS system:

Sales quantity, sales price/unit, sales $ amount,


standard cost $ amount
Gross Profit = cost sales

Recommendation: Include in fact table even


though it can be calculated. Eliminates the
possibility of user error.

For non-additive measurements such as


percentages and ratios (e.g., gross margin) store
the numerator (gross profit) and denominator ($
revenue) in the fact table. The ratio can be
calculated in a data access tool for any slice of the
fact table. Caution: Calculate the ratio of the
sums, not the sum of the ratios

Date Dimension

Ubiquitous in every data mart


Use verbose, self-explanatory values rather than
coded values. They are used as column headers
in reports. By decoding in the database, we
ensure consistency across different application
environments.

E.g., Holiday Indicator use values: Holiday,


Nonholiday; as opposed to Y/N

Date Key should be an integer rather than a date


data type
Data warehouses need an explicit date dimension
table to describe fiscal periods, seasons, holidays,
weekends, and other calendar calculations that
are not supported by the SQL date function.
If transaction time is of interest, we may need a
separate Time Dimension table

Product Dimension

Describes every SKU in the store


Fill this dimension with as many descriptive
attributes as possible.
Robust dimension attributes deliver robust
analytic slicing and dicing capabilities.
Hierarchies = groups of attributes
Merchandise hierarchy

SKUs roll up to brands to categories to


departments.
Each is a many-to-one relationship

Although there will be redundancy, no need to


normalize. Given the relative size of the
dimension (as compared to the fact table) space
saving is minimal.

Store Dimension

The store dimension: Store Key


(PK), Store Name, Store Number
(Natural Key), Store Address,
Possible to represent multiple
hierarchies in a dimension table

Store to any geographic attribute (e.g.,


ZIP, county, state)
Store to store district to region

Promotion Dimension

Describes the promotion conditions under which a


product is sold
Called a causal dimension describes factors
thought to cause a change in product sales (price
reductions, ads, displays, coupons)
Could keep all 4 causal mechanisms in a single
dimension

or split into 4 separate dimensions

They are highly correlated, so not much difference in


space requirements
More efficient browsing for finding out how various
promotions are used together
May be more understandable to business
Administration may be more straightforward

To avoid null keys in the fact table (violation of


referential integrity), for line items not being
promoted include a row in the promotion dimension
to indicate No Promotion in Effect

Factless Fact Table

Q: Which products were under promotion but did


not sell?
Cannot answer yet. POS sales fact table has only
products that were sold
Answer: Create Promotion Coverage Factless
Fact Table

Factless Fact Table = has no measurement metrics


Contains date, product, store, and promotion keys

Two-step process to answer Q:

Query Promotion Coverage table: products under


promotion on given date
From POS Sales Fact table: products sold
Answer is the set difference of above

Degenerate Dimension (DD)

Dimension keys used in fact table without


corresponding dimension tables
In case study: POS Transaction #
Still useful for grouping by transaction
Common DDs: order numbers, invoice
numbers
Fact table primary key: Product Key and
POS Transaction Number

Retail Schema Extensibility

Original schema extends gracefully


because POS transaction data was
modeled at its most granular level.
Premature aggregation limits ability to
extend if new dimensions do not apply to
higher grain
Case study new dimensions:

Frequent Shopper
Clerk
Time of Day

Schema Extensibility

Dimensional models can handle extensions without


invalidating existing applications:
New dimension attributes simply add columns
to dimension table. If new attribute is only available
after point in time, populate old dimension records
with something like Not Available
New dimensions add foreign field keys to fact
table
New measured facts add to fact table. If not at
the same grain, then need separate fact table
Dimension becoming more granular create
new dimension. May imply more granular fact table,
in which case, may have to rebuild the fact table.
Addition of a completely new data source
involving existing and new dimensions usually
needs new fact table

Resisting Dimension Normalization

Snowflaking = Dimension table normalization

Redundant attributes are removed from the denormalized


dimension table and are placed in normalized secondary
dimension tables
Fully snowflaked schema = 3NF ER diagram

The dimension tables must not be normalized, and should


remain as flat tables.
Numerous tables and joins usually translate into slower
query performance.
Efforts to normalize any of the tables in a dimensional
database solely in order to save disk space are a waste of
time. Disk space savings gained by normalizing the
dimension tables are typically less than one percent of
the total disk space needed for the overall schema.
Normalized dimension tables destroy the ability to browse
within a dimension or across dimensions (e.g., list
package types for each brand in a category). SQL needed
becomes too complex.
The fact table is naturally normalized.

10

Too Many Dimensions

Too many dimensions increase space


requirements for the fact table.
A very large number of dimensions
typically means that several dimensions
are not completely independent and
should be combined.
A single hierarchy should not be captured
in separate dimensions.

Surrogate Keys

Surrogate keys are integers assigned sequentially as


needed to populate a dimension. They serve to join
dimension tables to the fact table.
Avoid embedding intelligence in the data warehouse
keys.
Benefits:

Surrogate keys buffer the DW environment from


operational changes. What happens when operations
decide to recycle account numbers after some period of
inactivity? Fine for operational systems, but problematic
for DW if it is using account numbers as a PK.
Can more easily integrate data from multiple operational
systems, even if they lack consistent source keys.
Performance advantages because small size of surrogate
keys leads to smaller fact tables
Surrogate keys are used to support one of the primary
techniques for handling changes in dimension table
attributes.

11

Inventory

Overview

Value chain implications


Inventory periodic snapshot model,
transaction and accumulating
snapshot models
Semi-additive facts
Enhanced inventory facts
Data Warehouse bus architecture
and matrix
Conformed dimensions and facts

12

Value Chain

The value chain identifies the natural,


logical flow of an organizations primary
activities.
Operational source systems produce
transactions or snapshots at each step in
the value chain. They generate
interesting performance metrics along the
way.
Each business process generates one or
more fact tables.

Inventory Models

Inventory periodic snapshot

Inventory transactions

Inventory level of each product measured daily


(or weekly) represented as a separate row in
a fact table
As products move through the warehouse, all
transactions with impact on inventory levels
are recorded

Inventory accumulating snapshot

One fact table row for each product updated as


the product moves through the warehouse

13

Inventory Periodic Snapshot Model

Business need

Analysis of daily quantity-on-hand inventory


levels by product and store

Business process

Granularity

Dimensions

Retail store inventory


Daily inventory by product at each store
Date, product, store

Fact

Quantity on hand

Inventory Periodic Snapshot Model Challenge

Very dense (huge) fact table

As opposed to retail sales, which was sparse


because only about 10% of products sell each day

60,000 items in 100 stores = 6,000,000 rows


If 14 bytes per row: 84MB per day
One-year period: 365 x 84MB = 30GB
Solution: Reduce snapshot frequencies over time

Last 60 days at daily level


Weekly snapshots for historical data
For a 3-year period =208 snapshots vs.
3x365=1095 snapshots; reduction by a factor of 5

14

Semiadditive Facts
Inventory levels (quantity on hand) are
additive across products or stores, but
NOT across dates = semi-additive facts
Compare to Retail Sales:

once the product is sold it is not counted again

Static level measurements (inventory,


balances) are not additive across date
dimension; to aggregate over time use
average over number of time periods.

Enhanced Inventory Facts

Number of turns = total quantity sold / daily average


quantity on hand
Days supply = final quantity on hand / average quantity
sold
Gross profit = value at latest selling price - value at cost
Gross margin = gross profit / value at latest selling price
GMROI (Gross Margin Return On Inventory)

Need additional facts:

GMROI = number of turns * gross margin


measures effectiveness of inventory investment
high = lot of turns and more profit, low = low turns and low
profit
quantity sold, value at cost, value at latest selling price

GMROI is not additive and, therefore, is not stored in


enhanced fact table. It is calculated from the constituent
columns.

15

Inventory Transactions Model

Record every transaction that affects inventory

Receive product
Place product into inspection hold
Release product from inspection hold
Return product to vendor due to inspection failure
Place product in bin
Authorize product for sale
Pick product from bin
Package product for shipment
Ship product to customer
Receive product from customer
Return product to inventory from customer return
Remove product from inventory

Inventory Transactions Model - Cont

Dimensions: date, warehouse, product, vendor,


inventory transaction type.
The transaction-level fact table contains the most
detailed information possible about the inventory.
It is useful for measuring the frequency and
timing of specific transaction types.
It is impractical for broad data warehouse
questions that span dates or products.
To give a more cumulative view of a process,
some form of snapshot table often accompanies a
transaction fact table.

16

Inventory Accumulating Snapshot


Model

Build one record in the fact table for each product delivery to
the warehouse
Track disposition of a product until it leaves the warehouse

Receiving
Inspection
Bin placement
Authorization to sell
Picking
Boxing
Shipping

The philosophy of the inventory accumulating snapshot fact


table is to provide an updated status of the product shipment
as it moves through above milestones.
Rarely used in long-running, continuously replenished
inventory processes.
More on this in chapter 5.

Value Chain Integration

Both business and IT organizations are interested


in value chain integration
Desire to look across the business to better
evaluate overall performance
Data marts may correspond to different business
processes
Need to look consistently at dimensions shared
between business processes
Need an integrated data warehouse architecture
If dimension table attributes in various marts are
identical, each mart is queried separately; the
results are then outer-joined based on a common
dimension attribute = drill across

17

Data Warehouse Bus Architecture

Cannot built the enterprise data warehouse in one


step.
Building isolated pieces will defeat consistency
goal.
Need an architected incremental approach
data warehouse bus architecture.
By defining a standard bus interface for the data
warehouse environment, separate data marts can
be implemented by different groups at different
times. The separate data marts can be plugged
together and usefully coexist if they adhere to the
standard.

Data Warehouse Bus Architecture


Contd

During architecture phase, team


designs a master suite of
standardized dimensions and facts
that have uniform interpretation
across the enterprise.
Separate data marts are then
developed adhering to this
architecture.

18

Data Warehouse Bus Matrix

The rows of the bus matrix correspond to


business processes data marts
Separate rows should be created if:

the sources are different,


the processes are different, or
a row represents more than what can be tackled in
a single implementation iteration.

Creating the DW bus matrix is a very important


up-front deliverable of a DW implementation. The
DW bus matrix is a hybrid resource: technical
design tool, project management tool, and
communication tool.

Conformed Dimensions

Conformed dimensions are:

Conformed dimensions have consistent

identical, or
strict mathematical subsets of the most granular,
detailed dimension.
Dimension keys
Attribute column names
Attribute definitions
Attribute values

If two marts have dimensions (e.g., customer,


product) that are not conformed, then they
cannot be used together

19

Types of Dimension Conformity

Mean same thing

Rolled-up level of granularity

Roll-up dimensions conform to the base-level


atomic dimension if they are a strict subset of that
atomic dimension. (see Fig. 3.9)

Dimension subset at same level of granularity

Single shared table or physical copy


Consistent data content, data interpretation, user
presentation

At same level but one represents only a subset of


rows

Combination of above

Centralized Dimension Authority

The major responsibility of the centralized


dimension authority is to:

establish,
maintain, and
publish the conformed dimensions to all client
data marts.

90% of up-front data architecture effort


Political challenge

20

Conformed Facts

In general, facts table data is not


duplicated explicitly in multiple data
marts.
If facts live in more than one location,
then their definitions and equations must
be the same and they must be called the
same.
If it is impossible to conform a fact
exactly, then different names should be
given to different interpretations. This
will make it less likely that incompatible
facts will not be used in a calculation.

Procurement

21

Overview

Value chain reinforcement


Blended versus separate transaction
schema
Slowly changing dimension techniques

Procurement Case Study

Procurement involves a wide range of activities:

Negotiation of contracts
Issuing of purchase requisitions & POs
Tracking receipts
Authorizing payments

Common analytic requirements:

Which materials/products are purchased most


frequently? Who supplies them? At what prices?
Across the enterprise, are there opportunities to
negotiate contracts by consolidating suppliers,
single sourcing?
Are we purchasing from the preferred vendors?
How are vendors performing?

22

Procurement Transactions
1.

Business process to model: Procurement

2.
3.

Grain: 1 row per procurement transaction


Key dimensions:

Transaction date
Product
Vendor (1 row for each vendor)
Contract terms (1 row for each set of terms negotiated with a vendor)
Procurement transaction type

Transaction date and product are conformed dimensions

4.

Transactions: purchase requisitions, purchase orders, shipping notifications,


receipts, and payments

Measured facts:

Procured units
Transaction amount

Multiple- vs. Single-Transaction Fact


Tables: Issues

Business users view purchase orders, shipping notices,


warehouse receipts, and vendor payments as separate &
unique processes
Several procurements transactions come from different
source systems

Several transaction types have different dimensionality

Purchasing system: purchase requisitions and purchase orders


Warehousing system: shipping notices and warehouse receipts
Accounts payable system: vendor payments
E.g., discounts applicable to vendor payments but not to other
types of transactions

Control numbers such as PO # generated during procurement


process are candidates for degenerate dimensions
Design decision:

Build separate fact tables for each transaction type, or


Build a blended transaction fact table with a transaction type
dimension

No simple answer.

23

Multiple- vs. Single-Transaction Fact


Tables - Cont.

Questions to ask:

What are the users analytic requirements?


How do users usually analyze data?
Do they analyze multiple transaction types together
or do they usually look at a single transaction type?
Are there multiple unique business processes?
Yes leaning towards separate tables
Are multiple source systems involved?
Yes leaning towards separate tables
What is the dimensionality of the facts? Do some
dimensions apply only to some transaction types?
Yes leaning towards separate tables

Solution: multiple transaction fact tables

Multiple Fact Tables

Multiple fact tables for procurement


processes
Advantages:

Richer, more descriptive dimensions and


attributes
Simplified staging activities, since operational
data exist in separate source systems
Loading data into separate fact tables will be
less complex than attempting to integrate from
multiple sources

Disadvantages:

More time to manage and administer: more


tables to load, index, and aggregate

24

Slowly Changing Dimensions

We have assumed dimensions to be


independent of time, but some
dimensions (other than natural keys) may
change slowly with time
Need to track change, without full-blown
normalized structure; without making
every dimension time-dependent
For each attribute in our dimension
tables, we must specify a strategy to
handle change

Techniques for Dealing with Dimension


Change Type 1: Overwrite the Value

Overwrite the old attribute value in the


dimension row, replacing it with the
current value. the attribute always
reflects the most recent assignment
The type 1 response is easy to
implement, but it does not maintain any
history of prior attribute values
Question: Is there a business need for
retaining the old attribute value?

25

Techniques for Dealing with Dimension


Change Type 2: Add a Dimension Row

Create a new dimension row reflecting the new attribute


Two separate surrogate keys one for old row and one for
new

Could also use a most recent row indicator to tell us which


of the two rows is the current
Fact table is again untouched
Adding a dimension row is the primary technique for
accurately tracking SCD attributes
Advantage:

Product Key is used as the primary key instead of the SKU


number, which is the natural key and is the same for both rows

New dimension row automatically partitions history in fact table


pre-change fact rows use the pre-change surrogate key
No need to revisit preexisting aggregation tables

Disadvantage:

Accelerated dimension table growth


Does not allow us to associate the new attribute value with the
old fact file or vice versa

Techniques for Dealing with Dimension


Change Type 3: Add a Dimension Column

Add a new dimension column containing the old


attribute value (E.g., Prior Department)
Overwrite the old value with the new
More appropriate when there is a need to associate
new attribute values with old fact history

E.g., business need to track both old and new values of


department attribute both forward and backward

Management can use either value for analysis


Allows for observing new and historical fact data by
either the new or prior attribute values
Used less frequently
Inappropriate to track numerous intermediate attribute
values

26

Hybrid Slowly Changing Dimension


Techniques

Two approaches that combine the


basic SCD techniques:

Predictable changes with multiple


version overlays
Unpredictable changes with singleversion overlay

These approaches provide more


flexibility at the cost of greater
complexity

Predictable Changes with Multiple


Version Overlays

Used in cases of sales organization realignments


Example: Over a 5-year period the sales organization
is reorganized five times.
At first sight, candidate for Type 2 approach (add
dimension row), but more complex business
requirements. E.g.,

Report each years sales using the district map for that
year
Report each years sales using the district map from an
arbitrary different year
Report an arbitrary span of years sales using a single
district map from a chosen year.

Type 3 is also inappropriate because >2 district maps


Because changes are predictable, an extension of Type
3 is possible Multiple District columns:

Current District; District 2001; District 2002;

27

Unpredictable Changes with SingleVersion Overlay

Preserve historical accuracy surrounding


unpredictable attribute changes while
supporting the ability to report historical
data according to the current values
Issue a new dimension row (type 2) to
capture the change and add a new
dimension column to track the historical
value (type 3). Also, overwrite Current
Department value (Type 1).

More Rapidly Changing Dimensions

Break off the rapidly changing


attributes into one or more
separate dimensions
Two foreign keys in fact table:
1.
2.

Primary dimension table


Rapidly changing attribute(s)

28

S-ar putea să vă placă și