Published: July 2011 Applies to: SQL Server 2008 R2, SQL Server 2008, SQL Server 2005
Summary: This article is a walkthrough that illustrates how to build multiple related data models by using the tools that are provided with Microsoft SQL Server Integration Services. In this walkthrough, you will learn how to automatically build and process multiple data mining models based on a single mining structure, how to create predictions from all related models, and how to save the results to a relational database for further analysis. Finally, you view and compare the predictions, historical trends, and model statistics in SQL Server Reporting Services reports.
Copyright
This document is provided as-is. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it. Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred. This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes. © 2011 Microsoft. All rights reserved.
Contents
Introduction
Automating the Creation of Data Mining Models
Solution Walkthrough
    Scope
    Overall Process
Phase 1 - Preparation
    Create the Forecasting mining structure and default time series mining model
    Extract and edit the XMLA statement
    Prepare the replacement parameters
Phase 2 - Model Creation
    Create package and variables (CreateParameterizedModels.dtsx)
    Configure the Execute SQL task (Get Model Parameters)
    Configure the Foreach Loop container (Foreach Model Definition)
Phase 3 - Process the Models
    Create package and variables (ProcessEmptyModels.dtsx)
    Add a Data Mining Query task (Execute DMX Query)
    Create Execute SQL task (List Unprocessed Models)
    Create a Foreach Loop container (Foreach Model in Variable)
    Add an Analysis Services Processing task to the Foreach Loop (Process Current Model)
    Add a Data Mining Query task after the Foreach Loop (Update Processing Status)
Phase 4 - Create Predictions for All Models
    Create package and variables (PredictionsAllModels.dtsx)
    Create Execute SQL task (Get Processed Models)
    Create Execute SQL task (Get Series Names)
    Create Foreach Loop container (Predict Foreach Model)
    Create variables for the Foreach Loop container
    Create the Data Mining Query task (Predict Amt)
    Create the second Data Mining Query task (Predict Qty)
    Create Data Flow tasks to Archive the Results
    Create Data Flow tasks (Archive Results Qty, Archive Results Amt)
    Run, Debug, and Audit Packages
Phase 5 - Analyze and Report
    Using the Data Mining Viewer
    Using Reporting Services for Data Mining Results
    Interpreting the Results
Discussion
    The Case for Ensemble Models
    Closing the Loop: Interpreting and Getting Feedback on Models
Conclusion
Resources
Acknowledgements
Code for Script Task
Introduction
This article is a walkthrough that illustrates how to use the data mining tools that are provided with Microsoft SQL Server Integration Services. If you are an experienced data miner, you probably already use the tools provided in Business Intelligence Development Studio or the Data Mining Client Add-in for Microsoft Excel for building or browsing mining models. However, Integration Services helps you automate many processes. This solution also introduces the concept of ensemble models for data mining, which are sets of multiple related models. For most data mining projects, you need to create several models, analyze the differences, and compare outputs before you can select a best model to use operationally. Integration Services provides a framework within which you can easily generate and manage ensemble models. In this series, you will learn how to:
- Configure the Integration Services components that are provided for data mining.
- Automatically build and update mining models by using Integration Services.
- Store mining model parameters and prediction results in the database engine.
- Integrate reporting requirements in the model design workflow.
Note that these are just a few of the ways that you can use Integration Services to incorporate data mining into analytic and data handling workflows. Hopefully these examples will help you get more mileage out of existing installations of Integration Services and SQL Server Analysis Services.
Solution Walkthrough
This section describes the complete solution, which builds multiple models and creates queries that return predictions from each model. It contains these parts:
[1] Analysis Services project: To create this project, follow the instructions in the Data Mining tutorial on MSDN (http://msdn.microsoft.com/en-us/library/ms169846.aspx) to create a Forecasting mining structure and a default time series mining model.
[2] Integration Services project: You will create a new project that contains multiple packages:
- A package that builds multiple models, using the Analysis Services Execute DDL task
- A package that processes multiple models, using the Analysis Services Processing task
- A package that creates predictions from all models, using the Data Mining Query task
Scope
The following Integration Services tasks and components are used in this walkthrough. For more information from SQL Server Books Online, click the link in the Task or component column.

Task or component                     Used for
Execute SQL task                      Gets variable values, and creates tables to store results
Analysis Services Execute DDL task    Creates individual models
Analysis Services Processing task     Populates the models with data
Foreach Loop container                Builds and processes multiple data mining models
Script task                           Builds the required XMLA commands
Data Mining Query task                Creates predictions from each model
Data Flow task                        Manages and merges prediction results
OLE DB source                         Gets data from temporary prediction table
OLE DB destination                    Writes predictions to permanent table
Derived Column transformation         Adds metadata about predictions
Even though the following Integration Services components are also very useful for data mining, they are not used in this walkthrough; look for examples in a later paper:
- Data Profiling task
- Conditional Split transformation
- Percentage Sampling transformation
- Lookup and Fuzzy Lookup transformations
- Data Mining Training destination
Note: The SQL Server Reporting Services project containing the reports that compare models is not included here, even though this project generates all the data required for the reports. That is because the report creation process is somewhat lengthy to describe, especially if you are not familiar with Reporting Services. Moreover, because all the prediction data is stored in the relational database, you can use other reporting clients, including Microsoft PowerPivot for Excel and Project Crescent. However, we hope to describe the process in a separate article later on the TechNet Wiki (http://social.technet.microsoft.com/wiki/contents/articles/default.aspx).
Overall Process
Phase 1 - Preparation: The definition of the models you want to create is stored in SQL Server as a set of parameters, values, and model names.
Phase 2 - Model creation: Integration Services retrieves the model definitions and passes the parameter values to a Foreach Loop that builds and then executes the XML for Analysis (XMLA) statement for each model.
Phase 3 - Model processing: Integration Services retrieves a list of available models, and then it processes each model by populating it with data.
Phase 4 - Prediction: Integration Services issues a prediction query to each processed model. Each set of predictions is saved to a SQL Server table.
Phase 5 - Reporting and analysis: The prediction trends for each model are compared by using reports (created by Reporting Services, PowerPivot, or your favorite reporting client) that use the data in the relational table.
Phase 1 - Preparation
In this phase, you set up the structure, sample data, and parameters that your packages will use. Before you build the working packages, you need to complete the following tasks:
- Create the Forecasting mining structure used by all mining models.
- Generate the sample XMLA that represents the default time series mining model, to use as a template.
- Create a table that stores replacement parameters for the new models, and then insert the parameter values.
Create the Forecasting mining structure and default time series mining model
To create multiple models based on a single mining structure, you need to create the Forecasting mining structure first. Based on that mining structure, you also need to create a time series model that can be used as the template for generating other models. If you do not already have a mining structure capable of supporting a time series model, you can build one by following the steps described in the Microsoft Time Series tutorial (http://msdn.microsoft.com/en-us/library/ms169846.aspx) in SQL Server Books Online.
After you generate XMLA for the default time series model by using the script option, it looks like the following code. (The XMLA statement for models can be lengthy, so only an excerpt is shown here.) The XMLA statement always includes the database, the mining structure, metadata such as the model name, and the algorithm used for analysis. It can optionally include multiple parameters.
<ParentObject>
  <DatabaseID>Forecasting Models</DatabaseID>
  <MiningStructureID>Forecasting</MiningStructureID>
</ParentObject>
<ObjectDefinition>
  <MiningModel>
    <ID>ARIMA_1-10-30</ID>
    <Name>ARIMA_1-10-30</Name>
    <Algorithm>Microsoft_Time_Series</Algorithm>
    <AlgorithmParameters>
      <AlgorithmParameter>
        <Name>FORECAST_METHOD</Name>
        <Value xsi:type="xsd:string">ARIMA</Value>
      </AlgorithmParameter>
      <AlgorithmParameter>
        <Name>PERIODICITY_HINT</Name>
        <Value xsi:type="xsd:string">{1,10,30}</Value>
Next, make the following changes to the command text that you extracted:
- Add the parameters that you want to change, if they are not already present in the model. Default parameters are not included in the XMLA output, so if your base model does not set any parameters explicitly, you will need to add the XMLA section that contains parameters.
- Remove unnecessary white space and all line breaks. For this walkthrough, the XMLA is stored as a string in a variable, which cannot contain line breaks. If you leave in any line breaks, the problem is not detected during package validation; at run time, the Analysis Services engine attempts to execute the XMLA and fails with an error.
To clean up the file, use your favorite text editor. White space such as tabs and multiple space characters is okay, but you can remove it if you like, to shorten the string variable. There is no limit on the size of string variables, but there is a 4,000-character limit in the expression editor.
5. If your model does not already contain the parameters FORECAST_METHOD and PERIODICITY_HINT, copy the XML node from the code listed earlier that begins with <AlgorithmParameters> and ends with </AlgorithmParameters>. Paste it into the text file containing the XMLA command, directly below the line that defines the algorithm and before the section that defines the columns.
6. Edit the entire XMLA statement to remove line breaks. You can use any text editor that you like, as long as you verify that the result is a single line of text.
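For illustration, after this cleanup the fragment shown earlier collapses into a single line like the following. (This is an excerpt only; the full command is longer and includes the column definitions.)

```xml
<ParentObject><DatabaseID>Forecasting Models</DatabaseID><MiningStructureID>Forecasting</MiningStructureID></ParentObject><ObjectDefinition><MiningModel><ID>ARIMA_1-10-30</ID><Name>ARIMA_1-10-30</Name><Algorithm>Microsoft_Time_Series</Algorithm><AlgorithmParameters><AlgorithmParameter><Name>FORECAST_METHOD</Name><Value xsi:type="xsd:string">ARIMA</Value></AlgorithmParameter></AlgorithmParameters>...
```

This single-line form is what you will store in the string variable.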
Integration Services is extremely flexible, so there are many different ways to store the parameters and insert them into the model XMLA. For example, you could:
- Store the parameters as text in a SQL Server table, and then insert them into the XMLA within a Foreach Loop, by using an ADO.NET iterator.
- Save the XMLA command as a text file and read it by using a Flat File connection.
- Save the variables in a configuration file and apply them at run time.
- Save the XMLA command as an .xml file, and then read it into a package by using an XML source. Insert the variables into the XML by using the properties and methods of the XML task.
- Create multiple XMLA files in advance, and then read the files with a combination of a Foreach Loop and an XML source connection.
However, for this scenario, you need to be able to easily add new sets of parameters, and to view and update the complete list of models and parameters. Therefore, you'll use the first method: create the parameter-value pairs as records in a SQL Server database, and then read in the new values at run time by using a Foreach Loop container. This way, you can easily view or update the parameters by using SQL queries. Run the following statement to create the parameters table.
USE [DMReporting] -- substitute your database name here
GO
/****** Object: Table [dbo].[ModelParameters]   Script Date: 11/09/2010 10:56:26 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[ModelParameters](
    [RecordID] [int] IDENTITY(1,1) NOT NULL,
    [RecordDate] [datetime] NULL,
    [ModelID] [nvarchar](50) NULL,
    [ModelName] [nvarchar](50) NULL,
    [ForecastMethod] [nvarchar](50) NULL,
    [PeriodicityHint] [nvarchar](50) NULL  -- reconstructed: the script was truncated in the source
) ON [PRIMARY]
GO
The following table lists the parameters that are used to build the models in this walkthrough. Insert these values into the parameters table that you created by using the script.

ModelID          ForecastMethod   PeriodicityHint
ARIMA_1-7-10     ARIMA            {1,7,10}
ARIMA_1-10-30    ARIMA            {1,10,30}
ARIMA_nohints    ARIMA            {1}
MIXED_1-7-10     MIXED            {1,7,10}
MIXED_1-10-30    MIXED            {1,10,30}
MIXED_nohints    MIXED            {1}
ARTXP_1-7-10     ARTXP            {1,7,10}
ARTXP_1-10-30    ARTXP            {1,10,30}
ARTXP_nohints    ARTXP            {1}
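A T-SQL sketch for loading these rows follows. It assumes the [ModelParameters] table includes a [PeriodicityHint] column and that [ModelName] simply repeats [ModelID]; neither assumption is confirmed by the truncated script above. The multi-row VALUES syntax requires SQL Server 2008 or later; on SQL Server 2005, use one INSERT statement per row.

```sql
-- Populate dbo.ModelParameters with the nine model definitions.
-- Assumes a [PeriodicityHint] column exists and ModelName = ModelID.
INSERT INTO [dbo].[ModelParameters]
    ([RecordDate], [ModelID], [ModelName], [ForecastMethod], [PeriodicityHint])
VALUES
    (GETDATE(), N'ARIMA_1-7-10',  N'ARIMA_1-7-10',  N'ARIMA', N'{1,7,10}'),
    (GETDATE(), N'ARIMA_1-10-30', N'ARIMA_1-10-30', N'ARIMA', N'{1,10,30}'),
    (GETDATE(), N'ARIMA_nohints', N'ARIMA_nohints', N'ARIMA', N'{1}'),
    (GETDATE(), N'MIXED_1-7-10',  N'MIXED_1-7-10',  N'MIXED', N'{1,7,10}'),
    (GETDATE(), N'MIXED_1-10-30', N'MIXED_1-10-30', N'MIXED', N'{1,10,30}'),
    (GETDATE(), N'MIXED_nohints', N'MIXED_nohints', N'MIXED', N'{1}'),
    (GETDATE(), N'ARTXP_1-7-10',  N'ARTXP_1-7-10',  N'ARTXP', N'{1,7,10}'),
    (GETDATE(), N'ARTXP_1-10-30', N'ARTXP_1-10-30', N'ARTXP', N'{1,10,30}'),
    (GETDATE(), N'ARTXP_nohints', N'ARTXP_nohints', N'ARTXP', N'{1}');
```

Because the definitions live in an ordinary table, adding another model later is just one more INSERT.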
This scenario uses the parameters FORECAST_METHOD and PERIODICITY_HINT because they are among the most important parameters for time series models (and also because they are string values and easy to change!). However, the parameters that you change will be completely different for other algorithms. For example, if you build a clustering model, you might decide to change the CLUSTERING_METHOD parameter and build models using each of the four clustering methods, such as K-Means. You might also try altering the MINIMUM_SUPPORT parameter, or trying a variety of cluster seeds. For a list of the parameters provided by the different algorithms, see the algorithm technical reference topics (http://msdn.microsoft.com/en-us/library/cc280427.aspx) on MSDN.
Important note for data miners: Altering parameter values can strongly affect the model results. Therefore, you should have some sort of plan for analyzing the results and weeding out badly fitted models. For example, because the time series algorithm is very sensitive to periodicity hints, it can produce poor results if you provide the wrong hint. If you specify that the data contains weekly cycles and it actually contains monthly cycles, the algorithm attempts to fit the data to the suggested weekly cycle and might produce odd results. Some of the models generated by this automation process demonstrate this behavior. There are many ways that you can check the validity of models:
- Use descriptive statistics and metadata for the individual models to eliminate models that have characteristics of overfitting or poor fit.
- Validate data sets and models by using cross-validation or one of the other accuracy measures provided by SQL Server. For more information, see Validating Data Mining Models (http://technet.microsoft.com/en-us/library/ms174493.aspx) in SQL Server Books Online.
- Choose only parameters and values that make sense for the business problem; use business rules to guide your modeling.
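As one concrete example of inspecting model statistics and metadata, a DMX content query returns the nodes of a mining model, along with the captions and descriptions that summarize the patterns it found. This is a sketch only; [ARIMA_1-10-30] is one of the model names defined in this walkthrough.

```sql
-- DMX content query, run against the Analysis Services database:
-- returns one row per node in the model, which you can screen
-- for signs of poor fit before trusting the model's predictions.
SELECT MODEL_NAME, NODE_TYPE, NODE_CAPTION, NODE_DESCRIPTION
FROM [ARIMA_1-10-30].CONTENT
```

You can run this query from SQL Server Management Studio in a DMX query window, or through the Data Mining Query task, and archive the results alongside the predictions.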
This completes the preparations, and you can now build the three packages. The instructions for each package begin with a diagram that illustrates the package workflow and briefly describes the package components. The diagram is followed by steps that you can follow to configure each task or destination.
4. For the ResultSet property, choose Full row set. This enables you to store a multi-row result in a variable.
5. In the Result Set pane, for Result Name, type 0 (zero), and then assign the variable User::objAllModelParameters.
9. Add an Analysis Services Execute DDL task inside the Foreach Loop container and name it Execute Model XMLA. (This is the Execute DDL task, not the Processing task; it is the task that exposes the DDL tab used in the next steps.)
10. For the Connection property, specify the instance of Analysis Services where your models are stored.
11. On the DDL tab, for SourceType, choose Variable, and then select the variable User::strModelXMLA.
This completes the package that creates the models. You can now execute just this package by right-clicking the package in Solution Explorer and then clicking Execute Now. After the package runs, if there are no errors, you can connect to your Analysis Services database by using SQL Server Management Studio and see the list of new models that were created. However, you cannot browse the models or build prediction queries yet, because the models are just metadata until they are processed; they contain no data or patterns. In the next package, you will process the models.
Not all of these columns are needed for processing, but you can add the columns now and update the information later.
5. On the Output tab, for Connection, select the relational database where you will store the results. For this solution, it is <local server name>/DM_Reporting.
6. For Output table, type a temporary table name (in this solution, tmpProcessingStatus), and then select the option Drop and re-create the output table.
6. On the Result Set tab, assign the columns in the result set to variables. There is only one column in the result set, so you assign the variable, User::objModelList, to ResultSet 0 (zero).
Add an Analysis Services Processing task to the Foreach Loop (Process Current Model)
The editor for this task requires that you first connect to an Analysis Services database and then choose from a list of objects that can be processed. However, because you need to automate this task, you can't use the interface to choose the objects to process. So how do you iterate through a list of objects for processing?
The solution is to use an expression to alter the contents of the ProcessingCommand property. You use the variable strXMLAProcess1, which you set up earlier, to store the basic XMLA for processing a model, but you insert a placeholder that you can modify later when you read the variable. You alter the command by using an expression and write the new XMLA out to a second variable, strXMLAProcess2.
1. Drag a new Analysis Services Processing task into the Foreach Loop container you just created. Name it Process Current Model.
2. With the Foreach Loop selected, open the Variables window, and then select the variable User::strXMLAProcess2.
3. In the Properties pane, set Evaluate as expression to True.
4. For the value of the variable, type or build this expression.
REPLACE( @[User::strXMLAProcess1] , "ModelNameHere", @[User::strModelName1] )
5. In the Analysis Services Processing Task Editor, click Expressions, and then expand the list of expressions.
6. Select ProcessingCommand, and then type the variable name as follows: @[User::strXMLAProcess2]
Another way to train the models would be to add a processing task within the same Foreach Loop that you used to create each model. However, there are good reasons to build and process the models in separate packages. For example:
- Processing can be time-consuming, and it depends on connections to source data.
- It is easier to debug problems when model creation and processing are in separate packages.
Moreover, the Data Mining Query task that is provided in the Control Flow can be used to execute many different types of queries against an Analysis Services data source. You can use schema rowset queries within this task to get information about other Analysis Services objects, including cubes and tabular models, or even run Data Mining Extensions (DMX) DDL statements. (In contrast, the Data Mining Query Transformation component, available in the Data Flow, can only be used to create predictions against an existing mining model.) The final step in this phase is to add a task that updates the status of your mining models. You can now execute this package as before.
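For example, a schema rowset query such as the following sketch, run through the Data Mining Query task, lists every mining model in the current Analysis Services database along with its algorithm and the date it was last processed:

```sql
-- Schema rowset query against Analysis Services:
-- one row per mining model, including processing metadata.
SELECT MODEL_NAME, SERVICE_NAME, LAST_PROCESSED
FROM $system.DMSCHEMA_MINING_MODELS
```

A query of this shape is one way to capture the model status that the next task writes to the relational table.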
Add a Data Mining Query task after the Foreach Loop (Update Processing Status)
This task uses the Data Mining Query task to get the updated status of the mining models and write it to a relational table.
1. Right-click the Data Mining Query task you created earlier (it already has the right connections and the correct query text), and then click Copy.
2. Paste the task after the Foreach Loop container and connect it to the loop.
3. Rename the task Update Processing Status.
4. Open the Data Mining Query Task Editor, click the Output tab, and verify that the option Drop and re-create the output table is selected.
This completes the package. You can now execute this package as before. When you execute this package, the actual processing of each model can take a fairly long time, depending on how many models are available. You might want to add logging to the package to track the time used for processing each model.
6. On the Result Set tab, assign the columns in the result set to variables. There is only one column in the result set, so you assign the variable User::objProcessedModels to Result Set 0 (zero).
Tip: When you are working with data mining models, and especially when you are building complex queries, we recommend that you build DMX queries beforehand by opening the model directly in Business Intelligence Development Studio and using Prediction Query Builder, or by launching Prediction Query Builder from SQL Server Management Studio. The reason is that when you build queries by using the data mining designers in SQL Server Management Studio or Business Intelligence Development Studio, Analysis Services does some validation, which enables you to browse and select valid objects. However, the query builder provided in the Data Mining Query task does not have this context and cannot validate or help with your selections.
2. Connect it to the previous task.
3. In the Execute SQL Task Editor, choose OLE DB connection, and for Connection, type <local server name>/DM_Reporting.
4. For Result set, select None.
5. For SQLSourceType, select Direct input.
6. For SQL Statement, type the following query text.
IF EXISTS (SELECT [ModelRegion] FROM DMReporting.dbo.tmpModelRegions)
BEGIN
    TRUNCATE TABLE DMReporting.dbo.tmpModelRegions
    INSERT DMReporting.dbo.tmpModelRegions
    SELECT DISTINCT [ModelRegion]
    FROM AdventureWorksDW2008R2.dbo.vTimeSeries
END
strModelName, which is in turn used to update the prediction query in the next set of tasks.
Important: The query here is formatted for readability, but it will not work if you copy and paste these statements into the variable as-is. You must copy the statement into a text editor first and remove all line breaks. Unfortunately, the Integration Services editors do not detect line breaks or raise any errors while you are editing the task, but when you run the package, you will get an error. So be sure to remove the line breaks first!
3. For the value of the variable strQueryBaseQty, type the following query after removing the line breaks.
SELECT FLATTENED
  'ModelNameHere' AS [Model Name],
  [ModelNameHere].[Model Region] AS [Model and Region],
  (SELECT $TIME AS NewTime,
          [Quantity] AS NewValue,
          PredictStDev([Quantity]) AS ValueStDev,
          PredictVariance([Quantity]) AS ValueVariance
   FROM PredictTimeSeries([ModelNameHere].[Quantity], 10)
  ) AS Predictions
FROM [ModelNameHere]
Notice the placeholder, ModelNameHere, in this query. This placeholder will be replaced with a valid model name, which the package gets from the variable strModelName. The next steps explain how to create an expression that updates the query text each time the loop is executed.
4. In the Variables window, select the variable strPredictQty, and then open the Properties window to see the extended properties of the variable.
5. Locate Evaluate as Expression and set the value to True.
6. Locate Expression, and type or paste in the following expression.
REPLACE( @[User::strQueryBaseQty] , "ModelNameHere", @[User::strModelName2] )
7. Repeat this process for the variable strPredictAmt, using the following expression.
REPLACE( @[User::strQueryBaseAmount] , "ModelNameHere", @[User::strModelName2] )
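The base query for Amount, stored in the variable strQueryBaseAmount, follows the same shape as the Quantity query. The following is a sketch inferred from the Quantity query and the Amount fragments discussed later; as before, remove all line breaks before storing it in the variable.

```sql
-- Sketch of strQueryBaseAmount, mirroring the Quantity base query.
SELECT FLATTENED
  'ModelNameHere' AS [Model Name],
  [ModelNameHere].[Model Region] AS [Model and Region],
  (SELECT $TIME AS NewTime,
          [Amount] AS NewValue,
          PredictStDev([Amount]) AS ValueStDev,
          PredictVariance([Amount]) AS ValueVariance
   FROM PredictTimeSeries([ModelNameHere].[Amount], 10)
  ) AS Predictions
FROM [ModelNameHere]
```

The same ModelNameHere placeholder is replaced at run time by the REPLACE expression shown above.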
6. On the Query tab, for Build query, you can paste in the base query temporarily; it will be replaced with the contents of a variable. After you run the package once, you should see the text of the base query.
7. With the Predict Amt task selected, open the Properties pane and locate the Expressions property.
8. Expand the list of expressions, and add the variable @[User::strPredictAmt] as the value for the QueryString property. You can also select the value from a list by clicking the Browse (...) button.
These brackets will be populated by a variable that supplies the model name at run time. To summarize all the variable activity at run time:
- The package gets a variable with a list of models.
- The loop gets the name of one model from the list.
- The loop gets a prediction query from a variable, inserts the model name, and writes out a new prediction query.
- The query task executes the updated prediction query.
Note that these prediction queries all write their results to the same temporary table, which is dropped and then rebuilt during each loop. Therefore, you need to add a Data Flow task in between, which moves the results to the archive table and also adds some metadata.
You can also add any extra metadata that might be useful later, such as the date the predictions were generated, a job ID, and so forth. Let's take another look at the DMX query statements used to generate the predictions.
SELECT $TIME AS NewTime,
       Amount AS NewValue,
       PredictStDev([Amount]) AS ValueStDev,
       PredictVariance([Amount]) AS ValueVariance
FROM PredictTimeSeries
Ordinarily, the column names generated in a prediction query are based on the predictable column name, so the names would be something like PredictAmount and PredictQuantity. However, you can use a column alias in the output (here, it is NewValue) to make it easier to combine predicted values. Again, because Integration Services is so flexible, there are many ways you might accomplish this task:
- Store results in memory and merge them before writing to the archive table.
- Store the results in different columns, one for each prediction type.
- Write the results to temporary tables and merge them later.
- Use the Integration Services raw file format to quickly write out and then read back the interim results.
However, in this scenario, you want to verify the prediction data that is generated by each query. So you use the following approach:
- Write predictions to a temporary table.
- Use an OLE DB source component to get the predictions that were written to the temporary table.
- Use a Derived Column transformation to clean up the data and add some simple metadata.
- Save the results to the archive table that is used for reporting on all models.
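Conceptually, each Data Flow task performs the equivalent of the following T-SQL. This is a sketch only: the actual columns of tmpPredictionResults depend on your prediction query, and the literal N'Amount' would be N'Quantity' in the quantity flow.

```sql
-- Rough T-SQL equivalent of the Archive Results Amt data flow:
-- copy the current batch of predictions into the permanent archive,
-- stamping each row with the run date and the prediction type.
INSERT INTO dbo.ArchivedPredictions
SELECT p.*,
       GETDATE() AS PredictionDate,   -- added by the Derived Column transformation
       N'Amount' AS PredictedValue    -- added by the Derived Column transformation
FROM dbo.tmpPredictionResults AS p;
```

Using the Data Flow instead of a query like this keeps the archiving step visible and auditable in the package, and lets you inspect the rows in the designer while debugging.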
The graphic illustrates the overall task flow within each Data Flow task.
Create Data Flow tasks (Archive Results Qty, Archive Results Amt)
1. Within the loop Predict Foreach Model, create two Data Flow tasks and name them Archive Results Qty and Archive Results Amt.
2. Connect each Data Flow task to its related Data Mining Query task, in the order shown in the earlier control flow diagram for Package 3.
   Note: You must run these tasks in sequence, because they use the same temporary table and archive table. If Integration Services executes the tasks in parallel, the processes could conflict when attempting to access the same table.
3. In each Data Flow task, add the following three components:
   - An OLE DB source that reads from tmpPredictionResults
   - A Derived Column transformation, as defined in the following table
   - An OLE DB destination that writes to the table ArchivedPredictions
4. In each Derived Column transformation, create expressions that generate the data for the new columns, as follows.

   Task name            Derived column name   Data type   Value
   Archive Results Amt  PredictionDate        datetime    GETDATE()
   Archive Results Amt  PredictedValue        string      Amount
   Archive Results Qty  PredictionDate        datetime    GETDATE()
   Archive Results Qty  PredictedValue        string      Quantity
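To make the Derived Column logic concrete, the following T-SQL sketch shows what each data flow effectively writes. The column lists for tmpPredictionResults and ArchivedPredictions are assumptions, because the walkthrough names the tables but does not spell out their schemas.

```sql
-- Assumed, illustrative schemas: only the table names come from the walkthrough.
CREATE TABLE tmpPredictionResults (
    ModelName nvarchar(100),
    TimeIndex datetime,
    NewValue  float
);

CREATE TABLE ArchivedPredictions (
    ModelName      nvarchar(100),
    TimeIndex      datetime,
    NewValue       float,
    PredictionDate datetime,       -- added by the Derived Column transformation
    PredictedValue nvarchar(20)    -- 'Amount' or 'Quantity'
);

-- What the Archive Results Amt data flow effectively does:
INSERT INTO ArchivedPredictions
    (ModelName, TimeIndex, NewValue, PredictionDate, PredictedValue)
SELECT ModelName, TimeIndex, NewValue, GETDATE(), 'Amount'
FROM tmpPredictionResults;
```

The Archive Results Qty flow is identical except that its Derived Column expression supplies the literal 'Quantity'.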
Tip: Isolating the data flows for each prediction type has another advantage: it is much easier to modify the package later. For example, you might decide that there is no good reason to create a separate prediction for quantity. Instead of editing your query or the output, you can simply disable that part of the package, and it will still run without modification; you just won't have predictions for Quantity.
However, some users are not comfortable using Business Intelligence Development Studio. Even with the Data Mining Add-ins for Microsoft Office, which provide a Microsoft Visio viewer and a time series browser, the amount of detail in the time series viewer can be overwhelming. In contrast, analysts typically want even more detail, including statistics embedded in the model content, together with the metadata you captured about the source and the model parameters. It's impossible to please everyone! Fortunately, Reporting Services lets you pick the data you want, add extra data sets and linked reports, filter, and group, so that you can create reports that meet the needs of each set of users.
Additional requirements might include:
- A chart showing historical values along with predictions.
- Statistics derived from comparison of prediction values.
- Metadata about each model in a linked report.
As the analyst, you might want even more detail:
- First and last dates used for training each model
- A list of the algorithm parameters and pattern formulas
- Descriptive statistics that summarize the variability and range of the source data in each series, or across series
However, for the purposes of this walkthrough, there is already plenty of detail for comparing models. You can always add this data later and present it in linked reports. The following graphic shows the Reporting Services report that compares the prediction results for each model:
Notice that you can configure a report to show all kinds of information in ToolTips. In this example, as you pause the mouse over a prediction, you see the standard deviation and variance for the predictions. The next graphic shows a series of charts that have been copied into a matrix. By using a matrix, you can create a set of filtered charts. This series of graphs shows predictions for Amount for all models.
[Matrix of charts: Amount predictions for the M200, R750, and T1000 models in the Europe, North America, and Pacific regions]
The following trend lines are interesting because they illustrate some problems you might see with models. The results might indicate that the data is bad, that there is inadequate data, or that the data is too variable to fit.
When you see wildly varying trends from models on the same data, you should of course reexamine the model parameters, but you might also use cross-prediction or aggregate your data differently, to avoid being influenced too strongly by a single data series:
- With cross-prediction, you can build a reliable model from aggregated data or from a series with solid data, and then make predictions based on that model for all series. ARTXP models and mixed models support cross-prediction.
- If you do not have enough data to meaningfully analyze each region or product line separately, you might get better results by aggregating by product or region or both, and creating predictions from the aggregate model.
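As an illustrative sketch only, a cross-prediction in DMX supplies a new series through a prediction join and asks the model to apply its learned patterns to that series. Every name below ([AggregateModel], [ReportingDate], [Amount], the linked data source, and the view) is an assumption, not part of the walkthrough.

```sql
-- Illustrative DMX sketch: all object names here are assumed.
-- REPLACE_MODEL_CASES tells PredictTimeSeries to apply the model's patterns
-- to the supplied series rather than to its original training cases.
SELECT
    PredictTimeSeries([AggregateModel].[Amount], 6, REPLACE_MODEL_CASES) AS NewValue
FROM
    [AggregateModel]
PREDICTION JOIN
    OPENQUERY([SourceData],
        'SELECT ReportingDate, Amount FROM vSeriesToPredict') AS t
ON
    [AggregateModel].[ReportingDate] = t.[ReportingDate]
    AND [AggregateModel].[Amount] = t.[Amount]
```

This is the pattern that lets one well-trained aggregate model produce forecasts for many thin or noisy series.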
Discussion
Data mining can be a labor-intensive process. From data acquisition and preparation to modeling, testing, and exploration of the results, much effort is needed to ensure that the data supports the intended analysis and that the output of the model is meaningful. Some parts of the model-building process will always require human intervention; understanding the results, for example, requires careful review by an expert who can assess whether the numbers make sense. However, by automating some parts of the data mining process, Integration Services can not only speed the process but also potentially improve the results. For example, if you don't know which mixture of algorithms produces the best results, or what the possible time cycles are in your data, you can use automation to experiment. Moreover, there are benefits beyond simple time savings.
There has been much research in recent years on the best methods for combining the estimates from ensemble models: merging, bagging, voting, averaging, weighting by posterior evidence, gating, and so forth. A discussion of ensemble models is beyond the scope of this paper, and we did not attempt to combine prediction results here; we only presented them for comparison. However, we encourage you to read the linked resources to learn more about these techniques.
Fortunately, because you have created an extensible framework for incorporating data mining into analysis by using Integration Services, it will be relatively easy to collect more data, update models, and refine your presentation.
Conclusion
This paper introduced a framework for automation of data mining, with results saved to a relational data store, to encourage a systematic approach to predictive analytics. This walkthrough showed that it is relatively easy to set up Integration Services packages that create data mining models and generate predictions from them. A framework like the one demonstrated here could be extended to support further parameterization, encourage the use of ensemble models, and incorporate data mining in other analytic workflows.
Resources
[1] Jamie MacLennan: Walkthrough of SQL Server 2005 Integration Services for data mining. http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=96&Id=338
[2] Microsoft Research: Ensemble models. http://academic.research.microsoft.com/Paper/588724.aspx
[3] Reporting Services tutorials. http://msdn.microsoft.com/en-us/library/bb522859.aspx
[4] Michael Ohler: Assessing forecast accuracy. http://www.isixsigma.com/index.php?option=com_k2&view=item&id=1550:assessing-forecastaccuracy-be-prepared-rain-or-shine&Itemid=1&tmpl=component&print=1
[5] Statistical methods for assessing mining models. http://ms-olap.blogspot.com/2010/12/do-you-trust-your-data-mining-results.html
[6] John Maindonald: Data Mining from a Statistical Perspective. http://maths.anu.edu.au/~johnm/dm/dmpaper.html
Acknowledgements
I am indebted to my coworkers for their assistance and encouragement. Carla Sabotta (technical writer, Integration Services) provided invaluable feedback on the steps in each of the SSIS packages, ensuring that I didn't leave out anything. Ranjeeta Nanda of the Integration Services test team kindly reviewed the code in the Script task. Mary Lingel (technical writer, Reporting Services) took my complex data source and developed a set of reports that made it look simple.
'create local variables and fill them with values from the SQL query
Dim txtModelID As String = Dts.Variables("strModelID").Value.ToString()
Dim txtModelName As String = Dts.Variables("strModelName").Value.ToString()
Dim txtForecastMethod As String = Dts.Variables("strForecastMethod").Value.ToString()
Dim txtPeriodicityHint As String = Dts.Variables("strPeriodicityHint").Value.ToString()

'first update the base XMLA with the new model ID and model name
'the base definition contains these default elements:
'    <ID>ForecastingDefault</ID>
'    <Name>ForecastingDefault</Name>
Dim txtNewID As String = "<ID>" & txtModelID & "</ID>"
Dim txtNewName As String = "<Name>" & txtModelName & "</Name>"

'strXMLABaseDef and strXMLANewDef hold the base and updated XMLA definitions
'(declared and loaded earlier in the script); swap in the new ID and name
strXMLANewDef = strXMLABaseDef.Replace("<ID>ForecastingDefault</ID>", txtNewID)
strXMLANewDef = strXMLANewDef.Replace("<Name>ForecastingDefault</Name>", txtNewName)

'display the new model ID and name, for troubleshooting only
MessageBox.Show(strXMLANewDef, "Verify new model ID and name")

'create temporary variables for the replacement operations
Dim strParameterName As String = ""
Dim strParameterValue As String = ""

'update the value for FORECAST_METHOD
'because all possible values have exactly five characters, a simple replace works
strParameterName = "FORECAST_METHOD"
strParameterValue = "MIXED" 'default value

If strXMLABaseDef.Contains(strParameterValue) Then
    'replace the default value MIXED with whatever is in the variable from the SQL Server query
    strXMLANewDef = strXMLANewDef.Replace(strParameterValue, txtForecastMethod)
    'display the Forecast Method parameter value, for troubleshooting only
    MessageBox.Show(strXMLANewDef, "Check Forecast Method", MessageBoxButtons.OK)
Else
    MessageBox.Show("The XMLA definition does not include the parameter " & strParameterName, _
        "Problem with base XMLA", MessageBoxButtons.OK)
End If

'look for a PERIODICITY_HINT value
strParameterName = "PERIODICITY_HINT"
strParameterValue = "{1}" 'default value

If strXMLABaseDef.Contains(strParameterName) Then
    'replace the default value {1} with whatever is in the variable
    strXMLANewDef = strXMLANewDef.Replace(strParameterValue, txtPeriodicityHint)
    MessageBox.Show(strXMLANewDef, "Check Periodicity Hint", MessageBoxButtons.OK)
Else
    MessageBox.Show("The XMLA definition does not include the parameter " & strParameterName, _
        "Problem with base XMLA", MessageBoxButtons.OK)
End If
For more information:
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this paper help you? Please give us your feedback. On a scale of 1 (poor) to 5 (excellent), how would you rate this paper, and why? For example:
Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason? Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release. Send feedback.