
SQL Server 2008 for Business Intelligence

UTS Short Course

Eric Phan SA @ SSW


w: ericphan.info | e: EricPhan@ssw.com.au | t: @ericphan

Loves C# and .NET

Specializes in

Application architecture and design
SQL Performance Tuning and Optimization
Agile, Scrum
Certified Scrum Trainer
Technology aficionado
Silverlight
ASP.NET
Windows Forms

Admin Stuff

Attendance

You initial sheet
You get me to initial sheet

Hands On Lab

Homework

Certificate

At end of 5 sessions
If I say you have completed successfully

Course Website

Course Timetable & Materials

http://bit.ly/UTSSQL

Resources

http://sharepoint.ssw.com.au/Training/UTSSQL/

Course Overview
Session  Date                Time           Topic
1        Tuesday 01-05-2012  18:00 - 21:00  SSIS and Creating a Data Warehouse
2        Tuesday 08-05-2012  18:00 - 21:00  OLAP Creating Cubes and Cube Issues
3        Tuesday 15-05-2012  18:00 - 21:00  Reporting Services
4        Tuesday 22-05-2012  18:00 - 21:00  Alternative Cube Browsers
5        Tuesday 29-05-2012  18:00 - 21:00  Data Mining

Last week(s)
1. Other cube browsers

Microsoft Data Analyzer
ProClarity
Excel 2003/2007/2010
Excel Services
Thinslicer
PerformancePoint
PowerPivot

The plan

Step by step to BI
1. Create Data Warehouse
2. Copy data to data warehouse
3. Create OLAP Cubes
4. Create Reports
5. Browse the cube
6. Do some Data Mining

Discovering relationships
Predict future events

Agenda
1. What is Data Mining?
2. Why?
3. Uses
4. Algorithms
5. Demo
6. Hands on Lab

What is Data Mining?


Data mining is the use of powerful software tools to discover significant traits or relationships in databases or data warehouses. It is often used to predict future events.

What is Data Mining?


It exploits statistical algorithms
Once the knowledge is extracted it:

Can be used to discover
Can be used to predict values of other cases

Why Data Mining?

Marketing

Who picks the movie? The kids, the wife, me
Who are our customers and what sort of films do they hire?
Is a 30-year-old woman with 2 children going to hire Arnie's latest film?

Validation

Is this data sensible? Terminator 2 and Toy Story

Prediction

Sales next year

Why? It's all about money

1. Get new information from data: future trends, past trends, outliers, maximums, minimums
2. Analyse data from different perspectives and summarize it into useful information
3. Use the new information to increase revenue, cut costs, or both :-)

Which Questions are Data Mining?


Who are our biggest customers?
What are customers buying with cigars?
What are the customer retention levels of our branches?
Which customers have bought olives, feta cheese but no ciabatta bread?
Which regions have the highest male/female ratio of single 20 somethings?
Which region has lowest customer retention levels and list out lost customers?

What's not data mining


Ad hoc query
Drill through to details
Business Intelligence tool

Data - Uncover patterns in samples


Huge amount of data
Good raw material means good data mining
Samples should be representative
Samples "similar" to domain
Not an all-seeing crystal ball
Verify and Validate!

OLAP versus Data Mining

OLAP

Is about fast ad hoc querying
Analysis by dimensions and measures
Gives precise answers

Data Mining

May use RDBMS or OLAP source
Is about discovering and predicting
Gives imprecise answers

OLAP is not a prerequisite for data mining, but it almost always comes first

(learning to ride a bike before a car)

Types of Data Mining Algorithms

Classification algorithms

predict one or more discrete variables, based on the other attributes in the dataset

Regression algorithms

predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset

Segmentation algorithms

divide data into groups, or clusters, of items that have similar properties

Association algorithms

find correlations between different attributes in a dataset

Sequence analysis algorithms

summarize frequent sequences or episodes in data, such as a Web path flow

Complete Set Of Algorithms

Ways to analyze your data

Decision Trees

Clustering

Time Series

Neural Network

Association

Naive Bayes

Sequence Clustering

Linear Regression

Logistic Regression

Decision trees

Split data
Each branch is like an attribute
Brightness = amount of data

Decision Trees (1)

Decision Trees assign (classify) each case to one of a few (discrete) broad categories of the selected attribute (variable) and explain the classification with a few selected input variables
The process of building is recursive partitioning: splitting data into partitions and then splitting those up further
Initially all cases are in one big box

Decision Trees (2)

The algorithm tries all possible breaks in classes using all possible values of each input attribute; it then selects the split that partitions data to the purest classes of the searched variable

Several measures of purity

Then it repeats splitting for each new class

Again testing all possible breaks

Branches of the tree that are not useful can be pre-pruned or post-pruned
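The split selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Microsoft Decision Trees implementation: Gini impurity is picked as one possible purity measure, the helper names `gini` and `best_split` are made up for this example, and the rental data is hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels; 0.0 means perfectly pure."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, target):
    """Try every (attribute, value) equality test and keep the purest split."""
    best = None
    attributes = [a for a in rows[0] if a != target]
    for attr in attributes:
        for value in sorted({r[attr] for r in rows}):
            left = [r[target] for r in rows if r[attr] == value]
            right = [r[target] for r in rows if r[attr] != value]
            if not left or not right:
                continue  # a split must produce two non-empty partitions
            # Weighted average impurity of the two partitions
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, value)
    return best

# Hypothetical rental records: which attribute best explains the genre?
rows = [
    {"age_group": "child", "gender": "F", "genre": "animation"},
    {"age_group": "child", "gender": "M", "genre": "animation"},
    {"age_group": "adult", "gender": "F", "genre": "action"},
    {"age_group": "adult", "gender": "M", "genre": "action"},
    {"age_group": "adult", "gender": "M", "genre": "action"},
    {"age_group": "child", "gender": "F", "genre": "animation"},
]
score, attr, value = best_split(rows, "genre")
print(attr, value, score)
```

A full tree would call `best_split` again on each resulting partition until the partitions are pure or too small, which is the recursive partitioning step.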

Decision Trees (3)


Decision trees are used for classification and prediction
Typical questions:

Predict which customers will leave
Help in mailing and promotion campaigns
Explain reasons for a decision
What are the movies young female customers like to buy?

Decision Trees: Who Decides

Naive Bayes

Bayes Formula
Uses statistics to say, with a probability, whether a case falls into a certain category or not
Spam filtering: score of spam (Bayes)
Testing only a particular attribute

Naive Bayes

Quickly builds mining models that can be used for classification and prediction It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute

This can later be used to predict an outcome of the predicted attribute based on the known input attributes

This makes the model a good option for exploring the data
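As a rough sketch of the calculation described above, the following Python counts how often each input state occurs for each state of the predictable attribute and combines those probabilities for a new case. The function names, the add-one smoothing, and the spam data are assumptions made for this illustration; they are not taken from the Microsoft Naive Bayes algorithm.

```python
from collections import Counter, defaultdict

def train(rows, target):
    """Count cases per class and per (attribute, class, value) combination."""
    class_counts = Counter(r[target] for r in rows)
    cond_counts = defaultdict(Counter)  # (attr, class) -> Counter of values
    values = defaultdict(set)           # attr -> set of observed states
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                cond_counts[(attr, r[target])][value] += 1
                values[attr].add(value)
    return class_counts, cond_counts, values

def predict(model, case):
    """Return a normalized probability for each class given the inputs."""
    class_counts, cond_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        p = n / total  # prior probability of the class
        for attr, value in case.items():
            # Add-one smoothing (an assumption) avoids zero probabilities
            p *= (cond_counts[(attr, cls)][value] + 1) / (n + len(values[attr]))
        scores[cls] = p
    norm = sum(scores.values())
    return {cls: p / norm for cls, p in scores.items()}

# Hypothetical e-mail data for the spam filtering example
emails = [
    {"has_link": "yes", "mentions_prize": "yes", "label": "spam"},
    {"has_link": "yes", "mentions_prize": "yes", "label": "spam"},
    {"has_link": "yes", "mentions_prize": "no", "label": "spam"},
    {"has_link": "no", "mentions_prize": "no", "label": "ham"},
    {"has_link": "no", "mentions_prize": "no", "label": "ham"},
    {"has_link": "yes", "mentions_prize": "no", "label": "ham"},
]
model = train(emails, "label")
result = predict(model, {"has_link": "yes", "mentions_prize": "yes"})
print(result)
```

The "naive" part is the multiplication inside `predict`: each input attribute is treated as independent of the others given the class.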

Cluster Analysis (1)

Grouping data into clusters

Objects within a cluster have high similarity based on the attribute values

The class label of each object is not known


Several techniques

Partitioning methods
Hierarchical methods
Density based methods
Model based methods
And more

Cluster Analysis (2)

Segments a heterogeneous population into a number of more homogeneous subgroups or clusters
Some typical questions:

Discover distinct groups of customers
Identification of groups of houses in a city
In biology, derive animal and plant taxonomies
Find outliers

Clustering

[Figure: customers plotted by Age and Annual Income, grouped into clusters]
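To make the idea of grouping customers by age and annual income concrete, here is a minimal k-means sketch in Python. K-means is just one common partitioning method and is used here as an assumption for illustration; the customer values, the starting centers, and the helper names are hypothetical.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical customers as (age, annual income in thousands)
customers = [(22, 25), (25, 30), (27, 28), (45, 90), (50, 95), (48, 85)]
centers, clusters = kmeans(customers, centers=[customers[0], customers[3]])
print(centers)
print(clusters)
```

In practice the attributes would usually be scaled first, since income would otherwise dominate the distance simply because its numbers are larger.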

Time series

Time-based data prediction
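A very small example of time-based prediction is fitting a straight-line trend to past values and extending it forward. The sketch below is a stand-in chosen for simplicity and is not how the Microsoft Time Series algorithm works; the sales figures and function names are hypothetical.

```python
def fit_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def forecast(values, steps):
    """Extend the fitted trend line for the next `steps` periods."""
    slope, intercept = fit_trend(values)
    n = len(values)
    return [intercept + slope * (n + i) for i in range(steps)]

# Hypothetical yearly sales
sales = [100, 110, 120, 130]
next_years = forecast(sales, 2)
print(next_years)
```

Real series usually have seasonality and noise that a single straight line cannot capture, which is why dedicated time series algorithms exist.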

Sequence clustering

Numbers
Orders
Stronger associations
Direction of association (not necessarily the other direction)

Association

If you own certain stocks, you may own other ones as well
Probability = thickness of line
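The strength of a rule such as "owns AAA and BBB, so probably owns CCC" is commonly expressed as support and confidence. The sketch below computes both in plain Python; the portfolios, the ticker names, and the function names are hypothetical and only illustrate the idea.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical stock portfolios
portfolios = [
    {"AAA", "BBB", "CCC"},
    {"AAA", "BBB"},
    {"AAA", "BBB", "CCC"},
    {"DDD"},
    {"AAA", "CCC"},
]
rule_support = support(portfolios, {"AAA", "BBB", "CCC"})
rule_confidence = confidence(portfolios, {"AAA", "BBB"}, {"CCC"})
print(rule_support, rule_confidence)
```

Confidence is directional: the confidence of "AAA and BBB, so CCC" is generally different from the confidence of "CCC, so AAA and BBB".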

Neural Nets

Let system learn how to classify data
Neural Network adapts to the new data
Formulate statement/hypothesis

Outcome is known (Data / Surveys)

1. 70% of data to train network (outcome is known)
2. 30% of data to test network (outcome is known)
3. New data (no survey needed, predict from network)
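The 70/30 workflow above can be sketched as follows. To keep the example short, a simple learned threshold stands in for the neural network; that substitution, the survey data, and all function names are assumptions made for illustration.

```python
import random

def split_train_test(cases, train_fraction=0.7, seed=42):
    """Shuffle the cases and split them into training and test sets."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_threshold(cases):
    """Stand-in 'model': the midpoint between the two class averages."""
    yes = [x for x, outcome in cases if outcome]
    no = [x for x, outcome in cases if not outcome]
    return (sum(yes) / len(yes) + sum(no) / len(no)) / 2

def accuracy(predict, cases):
    """Fraction of cases where the prediction matches the known outcome."""
    correct = sum(1 for features, outcome in cases if predict(features) == outcome)
    return correct / len(cases)

# Hypothetical survey: (films hired per year, is a frequent customer)
survey = [(x, x > 10) for x in (1, 2, 3, 4, 5, 11, 12, 13, 14, 15)]

train_set, test_set = split_train_test(survey)   # steps 1 and 2
threshold = train_threshold(train_set)

def predict(films_per_year):
    return films_per_year > threshold

test_accuracy = accuracy(predict, test_set)
print(len(train_set), len(test_set), test_accuracy)
print(predict(20))  # step 3: new data, outcome not known in advance
```

The test set matters because its outcomes are known but were never shown to the model, so the measured accuracy is an honest estimate of how the model will behave on new data.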

Other example: OCR

Conclusion: When To Use What


Task: Predicting a discrete attribute.
Example: Predict whether the recipient of a targeted mailing campaign will buy a product.
Microsoft algorithms to use:
Microsoft Decision Trees Algorithm
Microsoft Naive Bayes Algorithm
Microsoft Clustering Algorithm
Microsoft Neural Network Algorithm

Task: Predicting a continuous attribute.
Example: Forecast next year's sales.
Microsoft algorithms to use:
Microsoft Decision Trees Algorithm
Microsoft Time Series Algorithm

Task: Predicting a sequence.
Example: Perform a clickstream analysis of a company's Web site.
Microsoft algorithms to use:
Microsoft Sequence Clustering Algorithm

Task: Finding groups of common items in transactions.
Example: Use market basket analysis to suggest additional products to a customer for purchase.
Microsoft algorithms to use:
Microsoft Association Algorithm
Microsoft Decision Trees Algorithm

Task: Finding groups of similar items.
Example: Segment demographic data into groups to better understand the relationships between attributes.
Microsoft algorithms to use:
Microsoft Clustering Algorithm
Microsoft Sequence Clustering Algorithm

There is more...
Visual Numerics

3rd party algorithms

http://www.vni.com/company/whitepapers/MicrosoftBIwithNumericalLibraries.pdf

Excel Data Mining

Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007
http://www.microsoft.com/downloads/en/details.aspx?familyid=896A493A-2502-4795-94AE-E00632BA6DE7&displaylang=en

Other usages of data mining

Find patterns - Profiling

Train station / airport

Who is the bad guy

Farmers

Find the best crops

Supermarket

Figure out how to get you to buy more, and where to put the expensive items

Tip

SSIS 2008 - Data profiling task
Get a profile of the data in a table

potential candidate keys
length of data values in columns
Null percentage of rows
distribution of values
....
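To show what such a profile contains, the sketch below computes the same kinds of statistics for one column in plain Python. It does not use or imitate the SSIS Data Profiling task itself; the function name, the result keys, and the customer rows are hypothetical.

```python
from collections import Counter

def profile_column(rows, column):
    """Basic profile of one column: nulls, value lengths, distribution,
    and whether the column could serve as a candidate key."""
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in non_null]
    return {
        "null_percentage": 100.0 * (len(values) - len(non_null)) / len(values),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "distribution": Counter(non_null),
        # Candidate key: no nulls and no repeated values
        "candidate_key": len(non_null) == len(values)
                         and len(set(values)) == len(values),
    }

# Hypothetical customer table
customers = [
    {"id": 1, "state": "NSW"},
    {"id": 2, "state": "NSW"},
    {"id": 3, "state": "VIC"},
    {"id": 4, "state": None},
]
id_profile = profile_column(customers, "id")
state_profile = profile_column(customers, "state")
print(id_profile)
print(state_profile)
```

Running a profile like this before building a mining model is a cheap way to spot columns that are mostly empty or that contain unexpected values.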

Resources 1

Video: Simple data mining model

http://www.sqlservercentral.com/articles/Video/65055/

Video: Data mining and Reporting Services

http://www.sqlservercentral.com/articles/Video/64190/

Data Mining Algorithms

http://msdn.microsoft.com/en-us/library/ms175595.aspx

Resources 2

Jamie MacLennan

http://blogs.msdn.com/b/jamiemac/

Richard Lees on BI

http://richardlees.blogspot.com/

Book: Data Mining with Microsoft SQL Server 2008


http://www.amazon.com/gp/product/0470277742?ie=UTF8&tag=sqlserverda0920&linkCode=as2&camp=1789&creative=9325&creativeASIN=0470277742

Summary

Why Data Mining?
Uses
Algorithms
Demo
Hands on Lab

3 things

EricPhan@ssw.com.au
http://ericphan.info/
twitter.com/ericphan

Thank You!
Gateway Court
Suite 10
81 - 91 Military Road
Neutral Bay, Sydney NSW 2089
AUSTRALIA
ABN: 21 069 371 900

Phone: + 61 2 9953 3000
Fax: + 61 2 9953 3105

info@ssw.com.au
www.ssw.com.au
