
Lessons & Challenges from Mining Retail E-Commerce Data

Kohavi et al. (2004)

Motivation
- E-commerce is an important domain for data mining
- Massive amounts of data are collected
- Data collection is automatic and not prone to errors
- The data is rich, with a lot of potential for discovering patterns
- Three types of data: clickstream data, transactional data and user profile data
- Combined mining of these three types of data is possible

The E-Commerce Data Mining Suite


- E-commerce data mining suite developed by Blue Martini Software
- Purchased and used by many brand-name retailers: Debenhams, Harley Davidson, Sainsbury's, Sprint etc.
- System designed specifically for BI
- End-to-end solution:
  - Data collection
  - Data warehousing
  - Data transformations
  - Visualization
  - Data mining

The Business Intelligence Process


[Figure: the BI process as a sequence of stages. Data Sources -> Data Cleaning and Data Integration -> Data Warehouse -> Selection and Reduction -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

The Experience Shared


- Both business lessons and technical lessons are shared
- Data mining projects executed for more than 20 clients
- Clients from different industry verticals with varying business models
- Clients spread over the US, Europe, Asia and Africa
- Data sizes of up to 100 million records
- Diverse data:
  - Clickstream
  - User profile
  - Demographic
  - Response to mail campaigns
  - Orders placed through website / telephone / in-store

Business Lessons

Requirements Gathering is Challenging


- Clients are reluctant to list business questions
  - They may not know what questions to ask
  - They do not understand the underlying technology and how much it can do
- Clients present standard reporting-type questions, e.g.
  - What is the gender-wise distribution of customers?
  - What is the region-wise response rate of the mail campaign?
- Instead of asking questions like:
  - What are the characteristics of customers who spend more than $500?
  - What kind of people responded to the mail campaign?
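The gap between the two kinds of question can be sketched in code. This is a minimal illustration on a made-up customer table: the records, the field names and the `lift_by_value` helper are assumptions for the example, not part of the original study. The reporting question is a plain count by gender; the mining question asks which attribute values are over-represented among customers who spend more than $500.

```python
from collections import Counter

# Hypothetical toy records; field names and values are illustrative only.
customers = [
    {"gender": "F", "region": "North", "spend": 720},
    {"gender": "M", "region": "North", "spend": 650},
    {"gender": "F", "region": "South", "spend": 120},
    {"gender": "M", "region": "South", "spend": 90},
    {"gender": "F", "region": "North", "spend": 540},
    {"gender": "M", "region": "South", "spend": 300},
]

# Reporting-type question: gender-wise distribution of customers.
gender_distribution = Counter(c["gender"] for c in customers)


def is_high_spender(row):
    """Target definition for the mining question: spend above $500."""
    return row["spend"] > 500


def lift_by_value(rows, attribute, is_target):
    """For each value of `attribute`, return P(target | value) / P(target).

    Assumes at least one row satisfies `is_target`.
    """
    base_rate = sum(1 for r in rows if is_target(r)) / len(rows)
    totals = Counter(r[attribute] for r in rows)
    hits = Counter(r[attribute] for r in rows if is_target(r))
    return {value: (hits[value] / totals[value]) / base_rate for value in totals}


# Mining-type question: which attribute values characterize high spenders?
region_lift = lift_by_value(customers, "region", is_high_spender)
gender_lift = lift_by_value(customers, "gender", is_high_spender)
```

A lift well above 1 for a value (here, one of the regions) points at a characteristic worth investigating; a lift near 1 means the attribute says little about high spenders.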

Educating the Users


- Involving the users is critical for success
  - Understanding the business
  - Uncovering the real needs
- Users will have to be educated
  - What can be achieved by BI
  - Prototypes / demo systems
  - Case studies

Business Events
- The architecture records:
  - Every customer search and the number of results returned: too many rows, no rows
  - Shopping cart events: add to cart, change quantity, delete
  - Registration, log-in, checkout, payment, order confirmation
  - Any failures / crashes
  - The user's time zone
  - Technical capabilities of the user's computer
- These details are collected specifically because they are useful for analysis
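One way to picture such a business event is as a small structured record written by the application itself. The sketch below is an assumption about what a record of this kind could look like; the class name, field names and the `search_event` helper are hypothetical and do not describe the Blue Martini schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class BusinessEvent:
    """A single business-level event tied to a session (hypothetical layout)."""
    session_id: str
    event_type: str  # e.g. "search", "add_to_cart", "checkout"
    customer_id: Optional[str] = None  # None for anonymous visitors
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    details: Dict[str, Any] = field(default_factory=dict)


def search_event(session_id, query, result_count, customer_id=None):
    """Record a search together with the number of results returned."""
    return BusinessEvent(
        session_id=session_id,
        event_type="search",
        customer_id=customer_id,
        details={
            "query": query,
            "result_count": result_count,
            # Flag failed searches so analysts can find them directly.
            "no_results": result_count == 0,
        },
    )


event = search_event("session-1", "red shoes", 0)
```

Recording the result count at the moment of the search is what makes "no rows" and "too many rows" searches easy to analyze later; that information is not recoverable from a plain URL log.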

Data Collection
- Usual methods of data collection:
  - Logging stateless HTTP requests from multiple web servers
  - Parsing and loading them session-wise and user-wise
  - Difficult: web logs were designed for debugging web servers, not for providing data for BI
- Blue Martini architecture was designed for BI
  - Session and user data are collected and linked together at the application server level
  - Transactions are automatically tied to sessions
  - All data is automatically recorded in a database
  - Pre-processing and data cleaning are not required
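To show why the web-log route is laborious, here is a minimal sketch of the sessionization step it requires: hits from several servers are merged, grouped per visitor and cut into sessions by an inactivity timeout. The 30-minute timeout, the `(visitor_key, timestamp, url)` hit layout and the function name are assumptions for illustration; the source does not prescribe them.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed cut-off, not from the source


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (visitor_key, timestamp, url) hits into per-visitor sessions.

    Returns a list of (visitor_key, [(timestamp, url), ...]) tuples.
    A new session starts when the gap between consecutive hits of the
    same visitor exceeds `timeout`.
    """
    by_visitor = defaultdict(list)
    for visitor_key, timestamp, url in hits:
        by_visitor[visitor_key].append((timestamp, url))

    sessions = []
    for visitor_key, visitor_hits in by_visitor.items():
        visitor_hits.sort()  # hits from different servers arrive unordered
        current = [visitor_hits[0]]
        for previous, hit in zip(visitor_hits, visitor_hits[1:]):
            if hit[0] - previous[0] > timeout:
                sessions.append((visitor_key, current))
                current = []
            current.append(hit)
        sessions.append((visitor_key, current))
    return sessions


start = datetime(2004, 1, 1, 10, 0)
example_hits = [
    ("A", start, "/home"),
    ("B", start + timedelta(minutes=5), "/home"),
    ("A", start + timedelta(minutes=60), "/cart"),
    ("A", start + timedelta(minutes=10), "/search"),
]
example_sessions = sessionize(example_hits)
```

Even this sketch glosses over identifying the visitor in the first place (shared IP addresses, proxies, cookies), which is part of what collection at the application server level avoids.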

Data Collection Lessons


- Collect the right data upfront
  - All data that could be useful should be collected and integrated
  - Stored in a database / data warehouse
- Integrate with external events
  - Marketing events like promotions
  - These cannot be captured by the data collection systems
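Integrating external events can be as simple as tagging each order with the promotions that were running on its date. The following sketch assumes a hand-maintained promotion calendar; the promotion names, dates, order fields and helper functions are all hypothetical.

```python
from datetime import date

# Hypothetical promotion calendar maintained by marketing, outside the site.
promotions = [
    {"name": "spring_sale", "start": date(2004, 3, 1), "end": date(2004, 3, 7)},
    {"name": "free_shipping", "start": date(2004, 3, 5), "end": date(2004, 3, 10)},
]


def active_promotions(day, calendar):
    """Return the names of all promotions running on `day` (dates inclusive)."""
    return [p["name"] for p in calendar if p["start"] <= day <= p["end"]]


def tag_orders(orders, calendar):
    """Return copies of the orders with an added `promotions` attribute."""
    return [
        {**order, "promotions": active_promotions(order["order_date"], calendar)}
        for order in orders
    ]


orders = [
    {"order_id": 1, "order_date": date(2004, 3, 6)},
    {"order_id": 2, "order_date": date(2004, 3, 20)},
]
tagged_orders = tag_orders(orders, promotions)
```

Without such a join, a spike in orders during a promotion looks like an unexplained pattern in the clickstream and transaction data.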

Creating the Data Warehouse


- DW creation requires substantial data transformations
- It can take 80% of the time needed to complete the BI exercise
- Requires integration of several data sources:
  - Website
  - Payment gateway
  - Call center
  - POS terminals / shop systems
  - External systems / inputs (e.g. promotions / campaigns data)

Logical DW Architecture

Data Warehousing: Challenges


- Loading and maintaining consistent data
- Loading and storing large volumes of data
- Coping with changes in operational definitions
- Providing reasonable response times
- For an e-commerce site, the website itself will be outside the firewall, so data will have to be copied across the firewall

Business Intelligence Tools


- The software provided: reports, visualization and data mining
- Data mining algorithms included:
  - Rule induction
  - Anomaly (outlier) detection
  - Entropy-based statistics
  - Association rules
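For readers unfamiliar with association rules, the standard measures for a rule "antecedent -> consequent" are support, confidence and lift. The sketch below computes them on made-up shopping baskets; it uses the textbook definitions and is not a description of the suite's own implementation. The basket contents and the function name are assumptions.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    support    = P(antecedent and consequent)
    confidence = P(consequent | antecedent)
    lift       = confidence / P(consequent)
    Assumes both item sets occur in at least one transaction.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_consequent = sum(1 for t in transactions if consequent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))

    support = n_both / n
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n)
    return support, confidence, lift


# Hypothetical baskets, for illustration only.
baskets = [
    {"jeans", "belt"},
    {"jeans", "belt", "socks"},
    {"jeans"},
    {"socks"},
]
support, confidence, lift = rule_metrics(baskets, {"jeans"}, {"belt"})
```

A lift above 1 means the consequent appears more often alongside the antecedent than it does overall.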

Business Intelligence Lessons (1)


- Operational transactions have higher priority than BI
  - BI can be taken up after the system stabilizes
  - It can take several months to get started
- Users are happy with basic reports / MIS
  - Unexpectedly insightful findings capture their interest
  - This can start the BI process

Business Intelligence Lessons (2)


- Trained data analysts are required
  - Domain knowledge is important
  - Technical know-how is essential
- Terminology needs to be defined
  - Users can misinterpret results
  - Potentially useful findings may be ignored, or unrealistic expectations can arise

Business Intelligence: Challenges


- Designing a user-friendly interactive interface
- Automatic feature construction
- Building models that users can interpret
- Making users understand that correlation does not imply causality
- Explaining insights
- Linking ROI to insights

Deployment
- Insights need to be shared
  - Insights obtained by data mining need to be shared across the organization
  - Easy-to-use tools for capturing and communicating them (e.g. by e-mail) will help
- Taking action
  - Business users must see the value
  - Acting on the results may be difficult (e.g. designing a campaign for a special segment of customers)
  - A good architecture would help

Technical Lessons

Data Collection and Management Lessons


- Collect data at the right level
  - Data was collected at the application server level
  - This reduced pre-processing of weblog data
- Design the GUI with data mining in mind
  - All useful data can be captured
  - Default values should be avoided
  - Validate data to reduce cleaning effort

Data Collection and Management Challenges

- Should data be sampled?
  - E-commerce data is huge in volume
  - Is it necessary to store all the data?
  - Will rare events be missed if sampling is done?
- Slowly changing dimensions
  - Customers evolve (e.g. lifetime changes, lifestyle changes)
  - Products evolve (e.g. new lines, new technology)
- Frequency of DW uploads
  - DW uploads take time and processing power
  - They should not disrupt the BI analysts' work
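On the sampling question raised above, a back-of-the-envelope calculation helps. Assuming records are sampled independently and the event occurs in a fraction p of records, a sample of n records misses the event entirely with probability (1 - p)^n. The sketch below encodes that formula; the function names and the example rate are assumptions for illustration.

```python
import math


def prob_event_missed(event_rate, sample_size):
    """Probability that an independent random sample contains no occurrence."""
    return (1.0 - event_rate) ** sample_size


def sample_size_to_observe(event_rate, confidence=0.95):
    """Smallest sample size that contains at least one occurrence
    with probability `confidence`, under the independence assumption."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - event_rate))


# Example: an event that occurs in 1 out of 1,000 records (assumed rate).
needed = sample_size_to_observe(0.001, confidence=0.95)
missed_probability = prob_event_missed(0.001, needed)
```

Seeing one occurrence is far from enough to model a rare event, so in practice the required sample is much larger than this lower bound, which is one reason sampling e-commerce data needs care.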

Data Cleaning and Pre-processing Lessons & Challenges

- Time-outs, incomplete sessions, crashes
  - Need to be detected
  - What to do with such data?
- Duplicates
  - Same customer with more than one ID
  - Same account used by multiple customers
  - Guest log-ins
- Missing, unknown, not applicable or default values
- Hierarchical attributes
  - Most algorithms cannot handle hierarchical attributes

An Attribute Hierarchy
[Figure: a location hierarchy]

Levels: all > region > country > city > office

all
  Europe
    Germany: Frankfurt, ...
    Spain
    ...
  North_America
    Canada: Vancouver, ..., Toronto
    Mexico
    ...

Office level (examples): L. Chan, ..., M. Wind
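Since most algorithms cannot use a hierarchy directly, a common workaround is to roll each value up to a chosen level before mining. The sketch below does this with a child-to-parent lookup built from the place names in the hierarchy above; the parent links follow ordinary geography, and the `PARENT` table, `LEVELS` list and `roll_up` function are hypothetical names introduced for the example.

```python
from collections import Counter

# Child -> parent links (city -> country -> region -> all).
PARENT = {
    "Frankfurt": "Germany",
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Germany": "Europe",
    "Spain": "Europe",
    "Canada": "North_America",
    "Mexico": "North_America",
    "Europe": "all",
    "North_America": "all",
}
LEVELS = ["city", "country", "region", "all"]


def roll_up(value, from_level, to_level):
    """Replace `value` by its ancestor at `to_level` in the hierarchy."""
    steps = LEVELS.index(to_level) - LEVELS.index(from_level)
    if steps < 0:
        raise ValueError("can only roll up to a coarser level")
    for _ in range(steps):
        value = PARENT[value]
    return value


# Hypothetical order records keyed by city, aggregated at country level.
order_cities = ["Toronto", "Vancouver", "Frankfurt", "Toronto"]
orders_by_country = Counter(roll_up(c, "city", "country") for c in order_cities)
```

Choosing the level is a modeling decision: too fine and each value has too few records, too coarse and real differences are averaged away.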

Analysis Lessons & Challenges


- Enriching the data
  - Add demographic attributes
  - Create derived attributes
  - Calculate weighted averages, moving averages
- Exploration
  - Visualization
  - Domain knowledge can help in gaining insight
  - Customer propensity scoring
- Building models
  - Start with simple models (easy to explain to users)
  - Build models at the right level of the attribute hierarchy
  - Address scalability issues (to maintain users' interest and confidence)
  - Test and validate the models
  - Estimate accuracy levels
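The derived attributes mentioned under enriching the data can be very simple. As a minimal sketch, the functions below compute a moving average and a weighted average over a customer's order values; the function names, the sample amounts and the recency weights are assumptions for illustration.

```python
def moving_average(values, window):
    """Simple moving average; returns one value per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def weighted_average(values, weights):
    """Weighted average, e.g. with larger weights for more recent orders."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical monthly order totals for one customer, oldest first.
monthly_spend = [10, 20, 30, 40]
smoothed_spend = moving_average(monthly_spend, window=2)
recency_weighted_spend = weighted_average([100, 200], weights=[1, 3])
```

Attributes like these are added as extra columns per customer before modeling, alongside any demographic attributes.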
