
Ministry of Higher Education and Scientific Research

University of Carthage
*-*-*-*-*

Engineering School of Statistics and Data Analysis

End of Studies Project


For obtaining the

National Diploma of Engineering in Statistics and Data Analysis

Temporal Data Analysis and Machine Learning
for Decision Support Applications

Performed by
Slimaine Ben Attia
Hosting Company: Integration Objects

ESSAI tutor: Mrs. Fatma CHAKER-KHARRAT
Company supervisor: Mrs. Imen Majed
Supported by: Mr. Mohamed Mehdi Sidommou

Academic Year: 2013/2014

To my beloved mother, for her prayers, who emphasized the
importance of education and helped me with my lessons
throughout her life.
To my father, the first to teach me and who makes me want to
be a better person.
To my brother and my sister who have been my emotional
anchors through not only the vagaries of graduate school, but
my entire life.
To Zeineb, for her support throughout these five years and each
step of my way.
To my friends, for their presence and encouragement.

To all of you, I dedicate this work


Slimaine

Acknowledgments

I would like to take this opportunity to express my gratitude to everyone who contributed to the realization of this project.

First, I owe thanks to Mr. Samy Achour, CEO of Integration Objects; it has been a great privilege to carry out this internship at his company.

I would like to thank Mrs. Imen Majed and Mr. Mehdi Sidommou, my supervisors
from Integration Objects, for providing all facilities and support to meet our project
requirements.

I would like to express my deepest gratitude to my tutor from ESSAI, Mrs. Fatma
Chaker, for her help, guidance and willingness to share her vast knowledge.

My thanks go also to the highly esteemed members of the jury, Mrs. Hella Ouaili Mallek and Mrs. Ben Slama Nawel, for accepting to evaluate this project.

Furthermore, I would like to thank my colleagues at Integration Objects for providing a friendly environment, which helped me in the achievement of this work.

Table of Contents

Acknowledgments
Table of Contents
List of Tables
List of Figures
General Introduction
Chapter I GENERAL PRESENTATION
  1 Hosting Company: Integration Objects
    1.1 Overview
    1.2 Expertise
    1.3 Industry Participation and Certification
    1.4 Technical Department
  2 Project Overview
    2.1 Functional Scope
    2.2 Project Challenges
    2.3 Project Goals
    2.4 Project Planning
Chapter II Preliminary Study
  1 State of the Art
    1.1 EVIEWS
    1.2 IBM SPSS
    1.3 GRETL
    1.4 R
    1.5 SAS
  2 Comparative Table
  3 Statistical Frameworks
    3.1 R.NET
    3.2 SAS Integration
    3.3 Integrate Accord.NET framework
Chapter III Requirements Analysis and Specification
  1 General Specifications
    1.1 User characteristics
    1.2 Design and implementation constraints
  2 System features
    2.1 Transformation
      2.1.1 LAG Transformation
      2.1.2 LEAD Transformation
      2.1.3 Power Transformation
      2.1.4 Integrate Transformation
      2.1.5 Seasonal Differencing
      2.1.6 Box-Cox Transformation
      2.1.7 Exponential Smoothing
    2.2 Statistical Test
      2.2.1 Dickey-Fuller Test
      2.2.2 Jarque-Bera Test
      2.2.3 Shapiro-Wilk Test
    2.3 Models and Prediction
      2.3.1 Temporal PLS
      2.3.2 ARMAX model
      2.3.3 ARIMA model
      2.3.4 Linear Prediction
    2.4 Graph
      2.4.1 Box-Plot
      2.4.2 ACF Graph
      2.4.3 PACF Graph
  3 Use Case Model
    3.1 Global Use Case
    3.2 Manage Project
    3.3 Missing Values Use Case
Chapter IV Design
  1 Global Architecture of the System
  2 System Diagrams
    2.1 Package diagram
    2.2 Class diagram
    2.3 Sequence diagram
      2.3.1 Load Data
      2.3.2 Apply algorithm
Chapter V Implementation and Test
  1 Development environment
    1.1 Software Environment
    1.2 Hardware environment
  2 Achieved Work
    2.1 Management of missing values
    2.2 Data description
    2.3 Transformation
    2.4 Modeling
  3 Performance Tests
Conclusion and Perspectives
Bibliography
Netography

List of Tables

Table 1. Comparative table
Table 2. Inputs/Outputs LAG
Table 3. Inputs/Outputs LEAD
Table 4. Inputs/Outputs Power
Table 5. Inputs/Outputs Integrate
Table 6. Inputs/Outputs Seasonal Differencing
Table 7. Inputs/Outputs Box-Cox
Table 8. Inputs/Outputs SES
Table 9. Inputs/Outputs HS
Table 10. Parameters of Winters Smoothing
Table 11. Inputs/Outputs WS
Table 12. Inputs/Outputs ADF
Table 13. Inputs/Outputs Jarque-Bera
Table 14. Inputs/Outputs Shapiro-Wilk
Table 15. Inputs/Outputs PLS
Table 16. Inputs/Outputs ARMAX
Table 17. Inputs/Outputs ARIMA
Table 18. Inputs/Outputs Linear Prediction
Table 19. Inputs/Outputs Box Plot
Table 20. Inputs/Outputs ACF
Table 21. Inputs/Outputs PACF
Table 22. Hardware Environment
Table 23. Performance Tests

List of Figures

Figure 1. IO Services Manufacturing Operation Management [N1]
Figure 2. KnowledgeNet Architecture [N2]
Figure 3. Modeling cycle
Figure 4. Project Planning
Figure 5. EVIEWS interface [N3]
Figure 6. SPSS interface [N4]
Figure 7. GRETL interface [N5]
Figure 8. R interface [N6]
Figure 9. SAS interface [N7]
Figure 10. Time Series
Figure 11. Integrated Time Series
Figure 12. Global Use Case
Figure 13. Manage Project Use Case
Figure 14. Missing Values Use Case
Figure 15. Global architecture of the system
Figure 16. Package diagram
Figure 17. Class diagram
Figure 18. Load Data
Figure 19. Select method
Figure 20. Microsoft Office Project Logo
Figure 21. Accord.Net Logo [N11]
Figure 22. MVS Logo [N12]
Figure 23. Enterprise Architect Logo [N13]
Figure 24. DevExpress Logo [N14]
Figure 25. Main Interface
Figure 26. File bar
Figure 27. Home Interface
Figure 28. Data bar
Figure 29. Summary Interface
Figure 30. Impute Interface
Figure 31. Methods of Impute
Figure 32. Descriptive Statistics Impute Interface
Figure 33. Data Description menu
Figure 34. Line and Bar Chart
Figure 35. Correlogram chart
Figure 36. Box Plot chart
Figure 37. Descriptive Statistics Interface
Figure 38. Shapiro-Wilk Test Interface
Figure 39. ADF Test Interface
Figure 40. Transformation menu
Figure 41. Integrate Interface
Figure 42. Smoothing Interface
Figure 43. Models menu
Figure 44. PLS main interface
Figure 45. Factors Interface
Figure 46. Projection Interface
Figure 47. Regression Interface
Figure 48. Forecast menu
Figure 49. Linear Regression Interface
Figure 50. Holt's Smoothing Interface

General Introduction

Knowledge discovery is one of the most recent and fastest growing fields of research in computer science. It combines techniques from machine learning and database technology to uncover meaningful knowledge from large, real-world databases. However, most real-world data are time based: for example, stock prices, dairy cow milk production figures or meteorological data, and especially data from the process industry. Most current knowledge discovery systems use similarity-based machine learning methods (learning from examples) which do not generally suit this type of data. Time-series analysis techniques are used extensively in signal processing and sequence identification applications such as speech recognition, but have not often been considered for knowledge discovery tasks.
The popularity of time-series databases in many applications has created an increasing demand for performing data-mining tasks (description, transformation, modeling, etc.) on time-series data. Currently, however, almost no single system or library exists that specializes in providing efficient implementations of data-mining techniques for time-series data, supports the necessary concepts of representations, statistical tests and forecasting, and can be used by both experts and non-experts in statistics.
Integration Objects deals with heterogeneous types of temporal data coming from different equipment such as sensors, data feeds, etc. This large amount of time series data challenges the way it is analyzed, interpreted, modeled and predicted, and calls for models that are both accurate and user-friendly.
For these reasons, our project, developed within the Integration Objects company, is a solution that can perform analysis of temporal data. It aims to offer a rich environment that meets the standards and the expectations of the company's customers, which was the scope of our end of studies project.
The following report details the different steps we went through in our project. It comprises five main chapters.

In the first chapter, we introduce the project environment by presenting the hosting
company, the project challenges and goals as well as the project management methodology
applied during the project lifecycle.
In the second chapter, we present the state of the art based on the concepts of time
series analysis and a description of the competitors.
The specification and analysis of the requirements are presented in the third chapter, in which the functional and non-functional requirements, as well as the design of these needs, are described in detail.
The fourth chapter covers the architecture and design phase of the solution. The fifth chapter details the aspects of the implementation, illustrated by a real case example. Finally, we complete this report with a conclusion and present the project perspectives.

Chapter I
GENERAL PRESENTATION


Introduction
In this chapter, we start by covering the internship environment and by presenting the hosting
company. Then, we focus on the project, by detailing its environment, goals and challenges.

1 Hosting Company: Integration Objects


This section presents Integration Objects, describing its profile, expertise and activities.

1.1 Overview
Integration Objects is a software development firm created in 2002, based in Tunisia with
sales representatives in Houston, Texas and Genoa. It is a world leading systems integrator
and solutions provider for knowledge management, advanced analytics, automation, plant
information management, root cause analysis, performance management and decision support
applications for the process industry.

1.2 Expertise
Integration Objects specializes in the development of software solutions for the industry and energy sectors, including oil and chemicals. The software developed by Integration Objects focuses on Manufacturing Operation Management, whose objective is the management and optimization of production under operational constraints: the safety of staff and assets, production goals, and costs.

Figure 1. IO Services Manufacturing Operation Management [N1]

Integration Objects offers highly scalable and reliable solutions that allow real-time data
collection from multiple plant systems and various enterprise networks.

This enables companies to turn data, information, and knowledge into operational intelligence, thereby
optimizing their business and manufacturing processes.
One of these solutions is KnowledgeNet™ (KNet). It is an innovative intelligent framework
application specialized in collecting real-time data, detecting abnormal conditions, automating
root cause analysis, and applying best practices through the workflow engine.

Figure 2: KnowledgeNet Architecture [N2]


KNet is primarily used to empower operations in the chemical, oil and gas, power, and utilities industries in making timely business decisions to increase production uptime and safety. Users may include operators, shift supervisors, process engineers, and plant managers.

1.3 Industry Participation and Certification


As an active member of the OPC Foundation, MIMOSA, and ISA, Integration Objects is
dedicated to providing products and services that incorporate industry standards and enable
interoperability between different applications, systems, and vendors. Its quality and
management standards are reflected in its status as an ISO 9001:2008 certified company.
Its customers are located on five continents and include the largest industrial companies in the world, such as ExxonMobil, Chevron, Saudi Aramco and Solvay.

1.4 Technical Department


To ensure the best performance and results, the Integration Objects technical department is divided into three main teams:
- The development team: This team is responsible for the design, development and maintenance of the software solutions provided by Integration Objects for the process industry, including plug-and-play connectivity products and knowledge management products.
- The automation team: This team is responsible for all automation, installation and deployment activities at customer sites. Automation engineers act as end users for the products delivered by the development team and are thus responsible for the testing and validation of Integration Objects software products.
- The process team: This team deals with more advanced applications used in the process industry, such as data validation and reconciliation applications, oil movement applications, expert systems, diagnosis applications, etc.

2 Project Overview
2.1 Functional Scope
Time series analysis comprises methods for analyzing time series data in order to extract
meaningful statistics and other characteristics of data. Time series forecasting is the use of
a model to predict future values based on previously observed values, while regression
analysis is often employed in such a way as to test theories that the current values of one or more
independent time series affect the current value of another time series.


Time series data have a natural temporal ordering. This makes time series analysis distinct
from other common data analysis problems, in which there is no natural ordering of the
observations (explaining people's wages by reference to their respective education levels,
where the individuals' data could be entered in any order).
Time series analysis is also distinct from spatial data analysis where the observations typically
relate to geographical locations (accounting for house prices by the location as well as the
intrinsic characteristics of the houses). A stochastic model for a time series will generally
reflect the fact that observations close together in time will be more closely related than
observations further apart.
In addition, time series models will often make use of the natural one-way ordering of time so
that values for a given period will be expressed as deriving in some way from past values,
rather than from future values.

2.2 Project Challenges


Our project tries to find an efficient way to build an application for decision support systems. By providing friendly interfaces and several algorithms, our solution offers its users functions to find out the degree of dependence between the values of a time series, to discover trends (seasonal or not), to apply specific pre-treatments, and finally to build predictive models such as the autoregressive moving average (ARMA) variants.


Figure 3. Modeling cycle


Our solution also allows users to take explanatory variables into account through a linear model using Partial Least Squares (PLS), a statistical method that tries to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space.

2.3 Project Goals


There are two main goals of our application:
- Identifying the nature of the phenomenon represented by the sequence of observations.
- Forecasting (predicting future values of the time series variable).
Both of these goals require the time series pattern to be identified and formally described.
Our project consists in designing and implementing an analytics module allowing non-specialist users to apply several analysis algorithms in order to better treat their time series according to their needs.


2.4 Project Planning


The figure below presents our project planning.

Figure 4. Project Planning


This schedule was updated gradually during the project period. This approach helped us to estimate each part of the project and to optimize the working time in order to deliver the project milestones on time.
Conclusion
In this chapter we have presented the host company as well as the general context of the
project. In the next chapter, we are going to present the preliminary study that will allow a
better understanding of our goal.


Chapter II
Preliminary study


Introduction
In this chapter, we start by defining the time series analysis concept. We continue by presenting the principal market players and our proposed solution. Finally, we present the statistical frameworks.

1 State of the Art


In order to develop a time series analysis application, we need to review the best-known solutions on the market. The solutions we present in the next sections are: EVIEWS, GRETL, IBM SPSS, R, and SAS.

1.1 EVIEWS
EVIEWS (Econometric Views) is a statistical package for Windows, used mainly for time-series-oriented econometric analysis. It is developed by Quantitative Micro Software (QMS), now a part of IHS. Version 1.0 was released in March 1994 and replaced MicroTSP. The current version of EVIEWS is 8.0, released in March 2013.
EVIEWS can be used for general statistical analysis and econometric analyses, such as cross-section and panel data analysis and time series estimation and forecasting.

Figure 5. EVIEWS interface [N3]



1.2 IBM SPSS


SPSS Statistics (Statistical Package for the Social Sciences) is a software package used for statistical analysis. Long produced by SPSS Inc., it was acquired by IBM in 2009. The current versions (2014) are officially named IBM SPSS Statistics.
Companion products in the same family are used for survey authoring and deployment (IBM
SPSS Data Collection), data mining (IBM SPSS Modeler), text analytics, and collaboration
and deployment (batch and automated scoring services).
SPSS is a widely used program for statistical analysis in social science. It is also used by
market researchers, health researchers, survey companies, government, education researchers,
marketing organizations, data miners, and others.
Statistics included in the base software:
- Descriptive statistics: cross tabulation, frequencies, descriptives, explore, descriptive ratio statistics
- Bivariate statistics: means, t-test, ANOVA, correlation (bivariate, partial, distances), nonparametric tests
- Prediction for numerical outcomes: linear regression
- Prediction for identifying groups: factor analysis, cluster analysis (two-step, K-means, hierarchical), discriminant analysis

Figure 6. SPSS interface [N4]



1.3 GRETL
Gretl is an open-source statistical package, mainly for econometrics. The name is an acronym for GNU Regression, Econometrics and Time-series Library. It has a graphical user interface and can be used together with X-12-ARIMA, TRAMO/SEATS, R, Octave, and Ox. It is written in C, uses GTK as the widget toolkit for its GUI, and uses gnuplot for generating graphs. As a complement to the GUI, it also has a command-line interface.

Figure 7. GRETL interface [N5]

1.4 R
R is a free software programming language and environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls and surveys of data miners show that R's popularity has increased substantially in recent years.
R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.


Figure 8. R interface [N6]

1.5 SAS
SAS (Statistical Analysis System) is a software suite developed by SAS Institute for advanced
analytics, business intelligence, data management, and predictive analytics. It is the largest
market-share holder for advanced analytics.
SAS is a software suite that can mine, alter, manage and retrieve data from a variety of
sources and perform statistical analysis on it. SAS provides a graphical point-and-click user
interface for non-technical users and more advanced options through the SAS programming
language. SAS programs have a DATA step, which retrieves and manipulates data, usually
creating a SAS data set, and a PROC step, which analyzes the data.

Figure 9. SAS interface [N7]



2 Comparative Table
A brief study of these five solutions led us to prepare the comparative table below, which shows clearly the features provided by each package.

                               EVIEWS  GRETL  SAS  SPSS  R

Transformation
  LAG/LEAD                     Yes     Yes    Yes  Yes   Yes
  Box-Cox                      No      No     Yes  Yes   Yes
  Smoothing                    Yes     No     Yes  Yes   Yes
  Holt's Smoothing             No      No     No   Yes   Yes
  Seasonal Differencing        Yes     Yes    No   Yes   No
  Integrate                    No      No     No   No    No

Models
  ARMAX                        Yes     No     Yes  No    No
  Linear Regression            Yes     No     Yes  Yes   Yes
  Partial Least Squares (PLS)  No      No     Yes  Yes   No

Statistical Test
  Augmented Dickey-Fuller      Yes     Yes    No   No    No
  Shapiro-Wilk                 Yes     Yes    No   No    No
  Mean test                    Yes     No     No   Yes   No

Charts
  ACF (Correlogram)            Yes     Yes    Yes  Yes   Yes
  PACF (Correlogram)           Yes     Yes    Yes  Yes   Yes
  Box plot                     Yes     Yes    Yes  Yes   No
  Bar                          Yes     Yes    Yes  Yes   Yes
  Line                         Yes     Yes    Yes  Yes   Yes
  Points                       Yes     Yes    Yes  Yes   Yes

Missing Values
  Summary                      No      No     No   Yes   No
  Linear prediction            No      No     No   Yes   No
  K-Nearest Neighbors          No      No     No   No    No
  Descriptive analysis         No      No     No   Yes   Yes

Table 1. Comparative table

3 Statistical Frameworks
For a better design and development of our project, we went through a research phase to identify the best statistical framework. Our application will perform computations on large amounts of data, which is why we need a tool that provides a rich set of mathematical functions. This research phase showed that our solution could be built either by integrating statistical software such as R or SAS, or by integrating the Accord.NET framework.

3.1 R.NET
R.NET enables the .NET Framework to interoperate with the R statistical language in the
same process. [N8]

3.2 SAS Integration


SAS Integration Technologies, in combination with other SAS software and solutions,
enables you to make information delivery and decision support a part of the information
technology architecture for your enterprise.
SAS Integration Technologies provides you with the enabling software to build a secure
client-server infrastructure on which to implement SAS distributed processing solutions. With
SAS Integration Technologies, you can integrate SAS with other applications in your
enterprise; provide proactive delivery of information from SAS throughout the enterprise;
extend the capabilities of SAS to meet your organization's specific needs; and develop your
own distributed applications that leverage the analytic and reporting powers of SAS. [N9]

3.3 Integrate Accord.NET framework


The Accord.NET Framework is a complete framework for building machine learning,
computer vision, computer audition, signal processing and statistical applications. Sample
applications provide a fast start to get up and running quickly, and an extensive
documentation helps fill in the details. [N10]
Conclusion
In this chapter, we have presented some basic concepts that are necessary for the
understanding of our project and its context. We have also presented some of the existing
solutions. The next chapter describes the specification phase we went through.


Chapter III
Requirements Analysis and Specification


Introduction
In this chapter, we describe the global characteristics of the solution. Then, we analyze the
functional and non-functional requirements of the solution, and identify the different use cases
of the application.

1 General Specifications
1.1 User characteristics
Our solution can be used by both experts and non-experts in statistics, such as chemists, industrial engineers and automation engineers.

1.2 Design and implementation constraints


All application software shall be modularized into classes using object-oriented design
principles.
The application has to provide users with an easy way to apply several analysis algorithms in
order to better treat their time series according to their needs.

2 System features
2.1 Transformation
2.1.1 LAG Transformation
Description
In time series analysis, the lag (or backshift) operator operates on an element of a time series to produce the previous element. For example, given some time series

X = {X_1, X_2, ...}

then

L X_t = X_{t-1}    (1)

or, equivalently,

X_t = L X_{t+1}    (2)

where L is the lag operator. Sometimes the symbol B (for backshift) is used instead. Note that the lag operator can be raised to arbitrary integer powers, so that

L^{-1} X_t = X_{t+1}

and

L^k X_t = X_{t-k}    (3)

Inputs:
- Initial series
- Order of lag

Outputs:
- Backward series

Table 2. Inputs/Outputs LAG

2.1.2 LEAD Transformation


It is an operator that forwards the series by a specified order: L^{-k} X_t = X_{t+k}.

Inputs:
- Initial series
- Order of lead

Outputs:
- Forwarded series

Table 3. Inputs/Outputs LEAD
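To make the two operators concrete, here is a minimal illustrative sketch in Python (the solution itself is implemented on the .NET platform; the function names lag and lead are ours):

```python
import numpy as np

def lag(series, order=1):
    """Backshift: position t of the result holds series[t - order];
    the first `order` positions have no predecessor and become NaN."""
    out = np.full(len(series), np.nan)
    out[order:] = series[:-order]
    return out

def lead(series, order=1):
    """Forward shift: position t of the result holds series[t + order]."""
    out = np.full(len(series), np.nan)
    out[:-order] = series[order:]
    return out

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(lag(x, 1))   # [nan  1.  2.  3.  4.]
print(lead(x, 1))  # [ 2.  3.  4.  5. nan]
```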

2.1.3 Power Transformation


In statistics, a power transform belongs to a family of functions applied to create a rank-preserving transformation of data using power functions. It is a useful data transformation technique used to stabilize variance and make the data more normal-distribution-like.

Inputs:
- Initial series
- Power degree

Outputs:
- New series (variance stabilized)

Table 4. Inputs/Outputs Power



2.1.4 Integrate Transformation


Most real time series are not stationary, and their average level varies over time. The first figure below shows a series with a clearly decreasing trend, which is therefore not stationary.

Figure 10. Time Series

The second figure shows the first difference of this series, that is, the series of variations from one week to the next. We see that the values of this new series oscillate around a constant mean and seem to correspond to a stationary series.

Figure 11. Integrated Time Series

We conclude that the original series seems to be an integrated series, which is transformed into a stationary one by means of differencing. We say then that it is integrated of order one, the number of differences needed to obtain a stationary process being the order of integration.

Inputs:
- Non-stationary series

Outputs:
- Stationary series
- Order of integration

Table 5. Inputs/Outputs Integrate

2.1.5 Seasonal Differencing

The seasonal difference of a time series is the series of changes from one season to the next. For monthly data, in which there are 12 periods in a season, the seasonal difference of Y at period t is

Y_t - Y_{t-12}

Inputs:
- Initial series
- Order of differencing
- Order of seasonality

Outputs:
- Series without seasonality

Table 6. Inputs/Outputs Seasonal Differencing
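As an illustration of both the Integrate transformation and seasonal differencing, the following Python sketch (ours, not the product code) removes a seasonal pattern of period 12 from a synthetic monthly series:

```python
import numpy as np

def difference(x, order=1):
    """Ordinary differencing applied `order` times (the Integrate transform)."""
    return np.diff(x, n=order)

def seasonal_difference(x, s=12):
    """Seasonal differencing: y_t - y_{t-s}."""
    return x[s:] - x[:-s]

# Linear trend plus a seasonal pattern repeating every 12 periods
t = np.arange(36, dtype=float)
y = t + 10 * np.sin(2 * np.pi * t / 12)
print(seasonal_difference(y, 12))  # constant 12.0: the seasonality is removed
```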

2.1.6 Box-Cox Transformation


The Box-Cox transformation maps non-normally distributed data to a set of data that is approximately normally distributed. It is a family of power transformations indexed by a parameter λ:

If λ ≠ 0, then y(λ) = (y^λ - 1) / λ    (4)
If λ = 0, then y(λ) = log(y)    (5)

The logarithm is the natural logarithm (log base e). The algorithm calls for finding the value of λ that maximizes the log-likelihood function (LLF).

Inputs:
- Initial series
- Lambda parameter

Outputs:
- New series (normal distribution)

Table 7. Inputs/Outputs Box-Cox
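As a hedged illustration of the transformation and of the λ search (using SciPy here, rather than the routines actually used in the product):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.8, size=200)  # positive, right-skewed data

# With no lambda supplied, boxcox searches for the value maximizing the LLF
y_bc, lam = stats.boxcox(y)
print(f"lambda maximizing the log-likelihood: {lam:.3f}")

# The same transform with a fixed, user-supplied lambda
y_fixed = stats.boxcox(y, lmbda=0.5)
```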

2.1.7 Exponential Smoothing


Smoothing is a technique that can be applied to time series data, either to produce smoothed
data for presentation, or to make forecasts. The time series data themselves are a sequence of
observations. The observed phenomenon may be an essentially random process, or it may be
an orderly, but noisy, process. Whereas in the simple moving average the past observations
are weighted equally, exponential smoothing assigns exponentially decreasing weights over
time.
2.1.7.1 Simple Exponential Smoothing
Exponential smoothing is commonly applied to financial market and economic data, but it can be used with any discrete set of repeated measurements. The raw data sequence is often represented by {x_t}, and the output of the exponential smoothing algorithm is commonly written as {s_t}, which may be regarded as a best estimate of what the next value of x will be. When the sequence of observations begins at time t = 0, the simplest form of exponential smoothing is given by the formulas:

s_0 = x_0
s_t = α x_t + (1 - α) s_{t-1},  t > 0    (6)

where α is the smoothing factor, and 0 < α < 1.

Inputs:
- Data
- Smoothing parameter α

Outputs:
- Smoothed series

Comments:
- No trend
- No seasonality

Table 8. Inputs/Outputs SES
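Formula (6) translates directly into a short recursive routine; the following Python sketch is illustrative only (the product implements this in .NET):

```python
import numpy as np

def simple_exponential_smoothing(x, alpha):
    """Apply formula (6): s_0 = x_0, s_t = alpha*x_t + (1-alpha)*s_{t-1}."""
    s = np.empty_like(x, dtype=float)
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1.0 - alpha) * s[t - 1]
    return s

x = np.array([3.0, 10.0, 12.0, 13.0, 12.0, 10.0, 12.0])
print(simple_exponential_smoothing(x, alpha=0.5))
```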


2.1.7.2 Holt Smoothing
Holt (1957) extended simple exponential smoothing to allow forecasting of data with a trend. This method involves a forecast equation and two smoothing equations (one for the level and one for the trend):

Forecast equation:  F_{t+h} = L_t + h b_t    (7)
Level equation:     L_t = α y_t + (1 - α)(L_{t-1} + b_{t-1})    (8)
Trend equation:     b_t = β (L_t - L_{t-1}) + (1 - β) b_{t-1}    (9)

where L_t denotes an estimate of the level of the series at time t, b_t denotes an estimate of the trend (slope) of the series at time t, α is the smoothing parameter for the level, 0 ≤ α ≤ 1, and β is the smoothing parameter for the trend, 0 ≤ β ≤ 1.

Inputs:
- Data
- Level smoothing parameter α
- Trend smoothing parameter β

Outputs:
- Smoothed series

Comments:
- With trend
- No seasonality

Table 9. Inputs/Outputs HS


2.1.7.3 Winters Smoothing


The Winters exponential smoothing model is the second extension of the basic exponential smoothing model. It is used for data that exhibit both trend and seasonality. It is a three-parameter model that extends Holt's method: an additional equation adjusts the model for the seasonal component.
The four equations necessary for Winters' multiplicative method are:

The exponentially smoothed series (level):
L_t = α y_t / S_{t-s} + (1 - α)(L_{t-1} + b_{t-1})    (10)

The trend estimate:
b_t = β (L_t - L_{t-1}) + (1 - β) b_{t-1}    (11)

The seasonality estimate:
S_t = γ y_t / L_t + (1 - γ) S_{t-s}    (12)

Forecast m periods into the future:
F_{t+m} = (L_t + m b_t) S_{t-s+m}    (13)

- L_t = level of the series
- α = smoothing constant for the data
- y_t = new observation or actual value in period t
- β = smoothing constant for the trend estimate
- b_t = trend estimate
- γ = smoothing constant for the seasonality estimate
- S_t = seasonal component estimate
- m = number of periods in the forecast lead time
- s = length of seasonality (number of periods in the season)
Table 10. Parameters of Winters Smoothing

Inputs:
- Data
- Level smoothing parameter α
- Trend smoothing parameter β
- Seasonality smoothing parameter γ

Outputs:
- Smoothed series

Comments:
- No trend, with seasonality
- With trend, with seasonality

Table 11. Inputs/Outputs WS
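For reference, equations (7)-(13) correspond to the additive-trend, multiplicative-seasonality variant available in common libraries; a hedged illustration with statsmodels (not the product's own implementation):

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series with an additive trend and multiplicative seasonality
rng = np.random.default_rng(1)
t = np.arange(120)
y = (50 + 0.5 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12)) + rng.normal(0, 1, 120)

model = ExponentialSmoothing(y, trend="add", seasonal="mul", seasonal_periods=12)
fit = model.fit()        # alpha, beta, gamma chosen by the optimizer
print(fit.params)        # includes the fitted smoothing constants
print(fit.forecast(12))  # m-step-ahead forecasts, as in formula (13)
```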

2.2 Statistical Test


2.2.1 Dickey-Fuller Test
Description
A simple AR(1) model is

y_t = ρ y_{t-1} + u_t    (14)

where y_t is the variable of interest, t is the time index, ρ is a coefficient, and u_t is the error term. A unit root is present if ρ = 1; the model would be non-stationary in this case.
The regression model can be written as

Δy_t = δ y_{t-1} + u_t    (15)

where Δ is the first difference operator. This model can be estimated, and testing for a unit root is equivalent to testing δ = 0 (where δ = ρ - 1). Since the test is done over the residual term rather than raw data, it is not possible to use the standard t-distribution to provide critical values. Therefore this statistic has a specific distribution, simply known as the Dickey-Fuller table.
There are three main versions of the test:

- Test for a unit root:
  Δy_t = δ y_{t-1} + u_t    (16)
- Test for a unit root with drift:
  Δy_t = a_0 + δ y_{t-1} + u_t    (17)
- Test for a unit root with drift and deterministic time trend:
  Δy_t = a_0 + a_1 t + δ y_{t-1} + u_t    (18)

Inputs:
- Initial series
- Order of lag

Outputs:
- F-statistic
- P-value

Table 12. Inputs/Outputs ADF
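An illustrative use of the test via statsmodels (the product exposes the same inputs and outputs through its own interface):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(2)
random_walk = np.cumsum(rng.normal(size=500))  # unit-root (non-stationary) process

stat, pvalue, usedlag, nobs, crit, icbest = adfuller(random_walk)
print(f"ADF statistic = {stat:.3f}, p-value = {pvalue:.3f}")
# A large p-value means the unit-root null hypothesis cannot be rejected,
# so the series is treated as non-stationary.
```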

2.2.2 Jarque-Bera Test

In statistics, the Jarque-Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution.
The test statistic JB is defined as

JB = (n / 6) (S^2 + (K - 3)^2 / 4)    (19)

where n is the number of observations (or degrees of freedom in general), S is the sample skewness, and K is the sample kurtosis.

Inputs:
- Initial series

Outputs:
- Jarque-Bera test statistic
- Kurtosis
- Mean
- Skewness
- Standard deviation
- Variance
- Variance MLE

Table 13. Inputs/Outputs Jarque-Bera
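Formula (19) can be checked directly against a library implementation; a small illustrative sketch in Python:

```python
import numpy as np
from scipy import stats

def jarque_bera_stat(x):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4), following formula (19)."""
    n = len(x)
    s = stats.skew(x)
    k = stats.kurtosis(x, fisher=False)  # raw kurtosis: K is about 3 for normal data
    return n / 6.0 * (s**2 + (k - 3.0) ** 2 / 4.0)

x = np.random.default_rng(3).normal(size=1000)
print(jarque_bera_stat(x))      # close to 0 for normal data
print(stats.jarque_bera(x)[0])  # SciPy's value, for comparison
```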


2.2.3 Shapiro-Wilk Test


The Shapiro-Wilk test is a test of normality in frequentist statistics. It uses the null hypothesis principle to check whether a sample (x_1, ..., x_n) came from a normally distributed population. The test statistic is:

W = (Σ_{i=1}^{n} a_i x_{(i)})^2 / Σ_{i=1}^{n} (x_i - x̄)^2    (20)

where x_{(i)} is the i-th order statistic (the i-th smallest value in the sample) and x̄ is the sample mean. The constants a_i are given by

(a_1, ..., a_n) = m^T V^{-1} / (m^T V^{-1} V^{-1} m)^{1/2}

where m = (m_1, ..., m_n)^T are the expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution, and V is the covariance matrix of those order statistics. The user may reject the null hypothesis if W is below a predetermined threshold.

Inputs:
- Initial series

Outputs:
- Shapiro-Wilk test statistic
- Kurtosis
- Mean
- Skewness
- Standard deviation
- Variance
- Variance MLE

Table 14. Inputs/Outputs Shapiro-Wilk
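In practice the constants a_i and the rejection threshold are handled by library routines; an illustrative call with SciPy:

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(4).normal(loc=5.0, scale=2.0, size=200)
w, pvalue = stats.shapiro(x)
print(f"W = {w:.4f}, p-value = {pvalue:.4f}")
# A small p-value leads to rejecting the null hypothesis of normality.
```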

2.3 Models and Prediction


2.3.1 Temporal PLS
Description

Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of minimum variance between the response and independent variables, it finds a linear regression model by projecting both the predicted variables and the observable variables to a new space.
As in multiple linear regression, the main purpose of partial least squares regression is to
build a linear model, Y=XB+E, where Y is an n cases by m variables response matrix, X is an
n cases by p variables predictor (design) matrix, B is a p by m regression coefficient matrix,
and E is a noise term for the model which has the same dimensions as Y.
For establishing the model, partial least squares regression produces a p by c weight matrix W
for X such that T=XW, i.e., the columns of W are weight vectors for the X columns
producing the corresponding n by c factor score matrix T. These weights are computed so that
each of them maximizes the covariance between responses and the corresponding factor
scores. Ordinary least squares procedures for the regression of Y on T are then performed to
produce Q, the loadings for Y (or weights for Y) such that Y=TQ+E. Once Q is computed, we
have Y=XB+E, where B=WQ, and the prediction model is complete.
One of the most important steps in the application of the PLS regression is the determination
of the correct number of dimensions to use in order to avoid over-fitting, and therefore to
obtain a robust predictive model.
Comparison between PCR and PLS
Principal components regression and partial least squares regression differ in the methods
used in extracting factor scores. In short, principal components regression produces the
weight matrix W reflecting the covariance structure between the predictor variables, while
partial least squares regression produces the weight matrix W reflecting the covariance
structure between the predictor and response variables.
Temporal approach
The aim of this work is to propose a new technique for the application of PLS regression to
time series. This technique is based on the Exponential smoothing of the loadings weights
vectors (w) obtained at each iteration step. This smoothing progressively displaces the random
or quasi-random variations from earlier (most important) to later (less important) PLS latent
variables.

39
`

Chapter III

Requirements Analysis and Specification

Inputs:
- Data of predictors
- Response variable

Outputs:
- Estimators

Table 15. Inputs/Outputs PLS
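A hedged sketch of the (non-temporal) PLS regression step using scikit-learn; the temporal variant described above additionally smooths the weight vectors w at each iteration:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))                                      # predictors (n x p)
Y = X @ rng.normal(size=(6, 2)) + 0.1 * rng.normal(size=(100, 2))  # responses (n x m)

# The number of latent components is the key choice to avoid over-fitting
pls = PLSRegression(n_components=3).fit(X, Y)
print(pls.coef_.shape)   # the regression coefficient matrix B in Y = XB + E
print(pls.predict(X[:5]))
```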

2.3.2 ARMAX model


ARMAX models are useful when you have dominating disturbances that enter early in the process, such as at the input. For example, a wind gust affecting an aircraft is a dominating disturbance early in the process.
ARMAX modeling treats the given signals x, y, z as an Auto-Regressive Moving-Average with eXtra/eXternal input (ARMAX) process, according to a model of the form

y_t = Σ_{i=1}^{P} a_i y_{t-i} + Σ_{j=0}^{Q} b_j x_{t-j} + Σ_{k=0}^{R} c_k z_{t-k} + r_t    (21)

where x is the input signal (usually a noise signal), y is the output signal and z is the external input signal. The model coefficients of the given orders are estimated and the residual r (the estimation error) is returned. Input parameters are the order P of the AR process, the order Q of the MA process (choose Q = 0 for an ARX model) and the order R of the eXternal process.

Inputs:
- Estimation data
- Order P
- Order Q
- Order R

Outputs:
- Identified ARMAX structure (polynomial model)

Table 16. Inputs/Outputs ARMAX

2.3.3 ARIMA model


ARIMA(p,d,q): ARIMA models are, in theory, the most general class of models for
forecasting a time series which can be stationarized by transformations such as differencing
and logging.


The acronym ARIMA stands for "Auto-Regressive Integrated Moving Average." Lags of the
differenced series appearing in the forecasting equation are called "auto-regressive" terms,
lags of the forecast errors are called "moving average" terms, and a time series which needs to
be differenced to be made stationary is said to be an "integrated" version of a stationary
series. Random-walk and random-trend models, autoregressive models, and exponential
smoothing models are all special cases of ARIMA models.
A non-seasonal ARIMA model is classified as an "ARIMA(p,d,q)" model, where:
- p is the number of autoregressive terms,
- d is the number of non-seasonal differences, and
- q is the number of lagged forecast errors in the prediction equation.

Inputs:
- Estimation data
- Order p
- Order d
- Order q

Outputs:
- Identified ARIMA structure (polynomial model)

Table 17. Inputs/Outputs ARIMA
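An illustrative fit with statsmodels on a simulated ARIMA(1,1,0) series (the product identifies the structure through its own interface):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(6)
# An AR(1) process integrated once gives an ARIMA(1,1,0) series
eps = rng.normal(size=300)
ar = np.zeros(300)
for t in range(1, 300):
    ar[t] = 0.6 * ar[t - 1] + eps[t]
y = np.cumsum(ar)

fit = ARIMA(y, order=(1, 1, 0)).fit()
print(fit.params)        # estimated AR coefficient and innovation variance
print(fit.forecast(10))  # 10-step-ahead forecast
```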

2.3.4 Linear Prediction


Linear prediction is a mathematical operation where future values of a discrete-time signal are estimated as a linear function of previous samples. The most common representation is

x̂(n) = Σ_{i=1}^{p} a_i x(n - i)

where x̂(n) is the predicted signal value, x(n - i) the previous observed values, and a_i the predictor coefficients. The error generated by this estimate is

e(n) = x(n) - x̂(n)

where x(n) is the true signal value.

Inputs:
- Initial series
- Horizon

Outputs:
- Predicted series

Table 18. Inputs/Outputs Linear Prediction
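The predictor coefficients a_i can be estimated by ordinary least squares over lagged samples; a minimal illustrative sketch:

```python
import numpy as np

def fit_predictor(x, p):
    """Estimate coefficients a_1..a_p by least squares on lagged samples."""
    # Each row holds [x(n-1), ..., x(n-p)]; the target is x(n)
    rows = [x[n - p:n][::-1] for n in range(p, len(x))]
    A, b = np.array(rows), x[p:]
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs

def predict_next(x, coeffs):
    """One-step prediction: x_hat(n) = sum_i a_i * x(n - i)."""
    p = len(coeffs)
    return float(coeffs @ x[-1:-p - 1:-1])

x = np.sin(0.3 * np.arange(100))  # a predictable signal
a = fit_predictor(x, p=4)
print(predict_next(x, a), np.sin(0.3 * 100))  # prediction vs. true next value
```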

2.4 Graph
2.4.1 Box-Plot
A box plot is a convenient way of graphically depicting groups of numerical data through
their quartiles.

Inputs:
- Series

Outputs:
- Box plot graph

Table 19. Inputs/Outputs Box Plot

2.4.2 ACF Graph


Autocorrelation can detect regularities and repeated patterns in a signal, such as a periodic signal disturbed by a lot of noise, or the fundamental frequency of a signal that does not actually contain that fundamental but several of its harmonics. The autocorrelation at lag k is

r_k = Σ_{t=1}^{n-k} (x_t - x̄)(x_{t+k} - x̄) / Σ_{t=1}^{n} (x_t - x̄)^2    (22)

where x̄ is the average of the n observations.

Inputs:
- Series
- Number of lags

Outputs:
- Table of autocorrelations
- Correlogram

Table 20. Inputs/Outputs ACF
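Formula (22) in a few lines of illustrative Python:

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelations r_1..r_nlags following formula (22)."""
    xc = x - x.mean()
    denom = np.sum(xc**2)
    return np.array([np.sum(xc[:-k] * xc[k:]) / denom
                     for k in range(1, nlags + 1)])

x = np.sin(0.5 * np.arange(200)) + np.random.default_rng(7).normal(0, 0.3, 200)
print(np.round(acf(x, 5), 3))
```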


2.4.3 PACF Graph


In time series analysis, the partial autocorrelation function (PACF) plays an important role in data analyses aimed at identifying the extent of the lag in autoregressive models. The partial autocorrelations φ_kk can be computed recursively from the autocorrelations r_k:

φ_11 = r_1
φ_kk = (r_k - Σ_{j=1}^{k-1} φ_{k-1,j} r_{k-j}) / (1 - Σ_{j=1}^{k-1} φ_{k-1,j} r_j),  k = 2, ..., n
φ_kj = φ_{k-1,j} - φ_kk φ_{k-1,k-j},  j = 1, 2, ..., k-1    (23)

Inputs:
- Series
- Number of lags

Outputs:
- Matrix of partial autocorrelations
- Correlogram

Table 21. Inputs/Outputs PACF

3 Use case Model


A use case diagram, at its simplest, is a representation of a user's interaction with the system, depicting the specifications of the use cases. A use case diagram can portray the different types of users of a system and the various ways in which they interact with the system.

3.1 Global Use Case


This use case presents the global interactions between the system and the actors.


Figure 12. Global Use Case


Description

UC: Interact with application


Scope: TSAnalytics
Actor: User
Pre-Condition: Application executed
Main Scenario
1. Create new project.
2. Import data.
3. Choose method.
4. Save/Exit without Save/Choose another method.

Alternative Scenario
1. Open project.
2. Choose method.
3. Save/Exit without Save/Choose another method.

3.2 Manage Project


The system allows users to manage workspace by creating new projects, to save and load
projects.

Figure 13. Manage Project Use Case


Description
UC: Manage project
Scope: TSAnalytics
Actor: User
Pre-Condition: Application executed

Main Scenario
1. The user requests to manage a project.
2. The user chooses to open an existing project or to create a new one.
Post-Condition
Existence of a project

3.3 Missing values Use Case


The system allows users to summarize the missing values and to complete the data with several statistical methods.

Figure 14. Missing Values Use Case


Description
UC: Impute missing values
Scope: TSAnalytics
Actor: User
Pre-Condition: Missing data

Main Scenario
1. Choose to describe the missing data.
2. Choose to impute the missing data.
3. Choose the imputation method.
4. Save the completed data.
Post-Condition:
Completed data
Conclusion
Throughout this chapter, we have detailed the functional and non-functional requirements of the solution as well as the use cases. In the next chapter, we begin the analysis and design of these specifications.


Chapter IV
Design

Introduction
Design is a creative process and a crucial phase of a development project. Supporting this phase with appropriate techniques and tools is important to produce a high-quality application. To present our design, we begin this section by giving a global view of our solution's architecture; after that, we detail our design choices through the package, class and sequence diagrams.

1 Global Architecture of the System


An application architecture describes the structure and behavior of applications used in a
business, focused on how they interact with each other and with users. It is focused on the
data consumed and produced by applications rather than their internal structure.
This involves defining the interaction between application packages, databases, and middleware systems in terms of functional coverage. This helps identify any integration problems or gaps in functional coverage. For our application, we opted for a three-layer architecture:

Figure 15.Global architecture of the system

These main layers are:

Human Machine Interaction: the HMI layer aims to improve the interactions between users and computers by making the application more usable and responsive to users' needs. It is divided here into two sub-layers: the graphical interface and the controls.
Algorithms: this layer is composed of the services requested by the user and contains all the functional requirements. It consists of four packages: Transformation, Models, Test and Graph.
Data Source: this layer allows our solution to communicate with other systems and applications. It supports two types of access: databases and files.
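As a purely illustrative C# sketch of this separation (the interface and class names below are our assumptions, not the solution's actual types), each layer can be expressed as a small contract consumed by the layer above:

```csharp
using System.Collections.Generic;

// Data Source layer: uniform access to databases and files.
public interface IDataSource
{
    // Loads a named series, e.g. from a CSV file or a database table.
    IReadOnlyList<double> LoadSeries(string name);
}

// Algorithms layer: one contract shared by the Transformation,
// Model, Test and Graph services.
public interface IAlgorithm
{
    string Name { get; }
    IReadOnlyList<double> Run(IReadOnlyList<double> input);
}

// HMI layer: a control asks the algorithm layer to run on data
// fetched from the data source layer, then displays the result.
public sealed class AlgorithmControl
{
    private readonly IDataSource _source;
    private readonly IAlgorithm _algorithm;

    public AlgorithmControl(IDataSource source, IAlgorithm algorithm)
    {
        _source = source;
        _algorithm = algorithm;
    }

    public IReadOnlyList<double> Execute(string seriesName)
        => _algorithm.Run(_source.LoadSeries(seriesName));
}
```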

2 System Diagrams
2.1 Package diagram
Package diagram is UML structure diagram which shows packages and dependencies between
the packages. Our application is composed by five packages:
The package is the one that interact with the different other packages in order to flow the
execution from data into visualization.

Figure 16.Package diagram

TSAnalytics
Contains the form that hosts the main window of the solution and graphical charts like the box plot, the correlogram, etc.
Transformation
Provides the transformation algorithms requested by users, like Lag, Integrate and Exponential Smoothing.
Test
Provides the test algorithms requested by users, like Dickey-Fuller and Shapiro-Wilk.
Model
Provides the modeling algorithms requested by users, like ARMAX, ARIMA and PLS.
Graph
Provides the graph algorithms requested by users, like Box Plot and Correlogram.

2.2 Class diagram


The class diagram is a type of static structure diagram that describes the structure of a system
by showing the classes of the system, their attributes, operations (or methods), and the
relationships among objects.
This section presents the class diagrams for the different modules of our solution.

Figure 17.Class diagram


2.3 Sequence diagram

A sequence diagram is an interaction diagram that shows the order in which objects interact with one another. It describes the objects and classes involved in the scenario and the sequence of messages exchanged between them to carry out the functionality of the scenario. Sequence diagrams are typically associated with use case realizations in the Logical View of the system under development.
In this part, we present some sequence diagrams to describe the interactions between the user and the application.

2.3.1 Load Data

Figure 18. Load Data


The user requests to load data, and the main interface opens the file dialog for the user to choose a file. Once the data is selected, the data source loads the selected file and then requests to display it.

2.3.2 Apply algorithm

Figure 19.Select method


The user selects a method; the main interface loads the corresponding user control, and the user configures the selected method. Once the settings are complete, the user control runs the algorithm and retrieves the result to display it.

Conclusion
Throughout this chapter, we have presented a conceptual view of our project and detailed the software architecture of the solution in the form of modules. In the fifth and final chapter, we describe the project implementation step.


Chapter V
Implementation and Test

Introduction
In this chapter, we devote the first part to the presentation of the development environment; we then focus on the presentation of the implemented solution and the performed tests.

1 Development environment
1.1 Software Environment

Microsoft Office Project Professional

Microsoft Office Project Professional 2007 is a mature software package that includes features for project management. It allows projects to be monitored by supporting tasks such as scheduling and job tracking.

Figure 20.Microsoft Office Project Logo

Accord.NET

We had to choose a third-party library for the analytics algorithms. This phase led us to the scientific computing framework Accord.NET. We chose this framework for its performance and for the possibility of configuring and adapting it to our needs during the implementation of the solution. Accord.NET is based on the mathematical framework AForge.NET. It is composed of a variety of libraries covering statistics, machine learning, pattern recognition, etc.

Figure 21.Accord.Net Logo [N11]


Microsoft Visual Studio

Visual Studio is an integrated development environment (IDE) providing a set of tools and services to develop desktop, web, or mobile applications. It supports several languages such as C#, C++, J# and F#. It is used to develop and test our solution.

Figure 22.MVS Logo [N12]

Enterprise Architect

"Enterprise Architect is a comprehensive UML analysis and design tool for UML, SysML,
BPMN and many other technologies. Covering software development from requirements
gathering through to the analysis stages, design models, testing and maintenance.

Figure 23.Entreprise Architect Logo [N13]

DevExpress

DevExpress is a software development toolset for .NET developers. It includes a complete range of controls and libraries for all major Microsoft platforms, including WinForms, ASP.NET, WPF, Silverlight, and Windows 8.

Figure 24.DevExpress Logo [N14]

1.2 Hardware environment


During the development of our application we have used the hardware environment described
in the table below.
CPU: Intel Core i5-4200U, 1.6 GHz
Memory: 6 GB
OS: Windows 7, 64-bit

Table 22. Hardware Environment

2 Achieved Work
In this section, we are going to present our solution.
Main Interface
When the end user launches the solution, he is led to the main screen presented in the figure below.

Figure 25.Main Interface

Load interface

Figure 26: File bar


When the user loads a dataset, it is displayed automatically.

Figure 27.Home Interface


Our solution is composed of four main categories that correspond to the basic and important phases of time series analysis: management of missing values, data description, transformation, and modeling/forecasting.

2.1 Management of missing values


If the data contains missing values or non-numeric values, our solution offers the possibility to produce a "Summary" of these values and to perform their "Imputation" by one of three methods (a sketch of the simplest method follows the list):
- Statistical description (min, max, mean, 1st quartile, median, 3rd quartile)
- Linear Prediction
- K-Nearest Neighbors
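The following C# sketch illustrates mean imputation, one plausible variant of the statistical-description method; the helper name and the choice of the mean are our assumptions:

```csharp
using System;
using System.Linq;

static class ImputationSketch
{
    // Replaces missing entries (encoded here as double.NaN) in one column
    // by the mean of the observed values.
    public static double[] ImputeWithMean(double[] column)
    {
        double[] observed = column.Where(v => !double.IsNaN(v)).ToArray();
        if (observed.Length == 0)
            throw new ArgumentException("Column contains no observed values.");

        double mean = observed.Average();
        return column.Select(v => double.IsNaN(v) ? mean : v).ToArray();
    }
}
```

The median or a quartile can be substituted for the mean in exactly the same way.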
The next figure presents the "Treatment of missing values" tools in the Data menu bar:

Figure 28.Data bar


Summary interface
The next figure presents the "Summary of missing values" interface, which describes the incomplete data through a summary table and a bar plot. It provides the user with the number and percentage of missing values for each column.

Figure 29.Summary Interface

Impute interface
The next figure shows the "Impute missing values" interface.

Figure 30.Impute Interface

The user can choose one of the following three methods for imputation:

Figure 31.Methods of Impute

The next figure shows the "Impute missing values" interface for the descriptive-statistics method. The user can choose the given method for each column or for all columns; clicking on the Impute button leads the user to a table that no longer contains missing values.

Figure 32.Descriptive Statistics Impute Interface

2.2 Data description


The figure below presents the description menu. We can describe the data by plotting it, computing descriptive statistics, or running statistical tests to identify the behavior and structure of the series.

Figure 33. Data Description menu


Chart
Line and bar interface

Figure 34. Line and bar Chart


Correlogram
The next figure presents the correlogram, which plots the autocorrelation function.

Figure 35.Correlogram chart

Box plot interface


Figure 36.Box Plot chart


Descriptive statistics interface
This figure presents the descriptive statistics for a chosen variable.

Figure 37.Descriptive Statistics Interface

Statistical Test interface


The next figure presents the result of the "Shapiro-Wilk" test for checking the normality of a series:
Figure 38.Shapiro Wilk Test Interface


The next figure presents the result of the "Augmented Dickey-Fuller" test for checking the stationarity of a series:

Figure 39.ADF Test Interface
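For reference, the Augmented Dickey-Fuller test fits a regression of the following standard form (the textbook formulation, not reproduced from the solution's code) and tests the null hypothesis \(\gamma = 0\), i.e. the presence of a unit root and hence non-stationarity:

\[
\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t
\]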

2.3 Transformation
The next figure presents the Transformation menu, which contains several transformations that can be applied to a series.

Figure 40.Transformation menu


Integrate interface

65
`

Chapter V

Implementation and Test

To make a series stationary with a single transformation and find the necessary order of differencing, our solution offers the "Integrate" transformation:

Figure 41.Integrate Interface
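A minimal C# sketch of the underlying operation, first-order differencing applied d times (the names are illustrative, not the solution's code):

```csharp
static class DifferencingSketch
{
    // Applies first-order differencing d times: y_t = x_t - x_{t-1}.
    // Each pass shortens the series by one observation.
    public static double[] Difference(double[] series, int d)
    {
        double[] current = series;
        for (int pass = 0; pass < d; pass++)
        {
            var next = new double[current.Length - 1];
            for (int t = 1; t < current.Length; t++)
                next[t - 1] = current[t] - current[t - 1];
            current = next;
        }
        return current;
    }
}
```

In practice, the necessary order d is the smallest value for which the differenced series passes a stationarity check such as the ADF test shown above.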

Smoothing interface
For smoothing and forecasting we can use "Simple Exponential Smoothing":

Figure 42.Smoothing Interface
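Simple exponential smoothing follows the recursion \(s_t = \alpha x_t + (1 - \alpha) s_{t-1}\); a minimal C# sketch, where the initialization choice is our assumption:

```csharp
static class SmoothingSketch
{
    // Simple exponential smoothing with smoothing factor alpha in (0, 1].
    public static double[] Smooth(double[] series, double alpha)
    {
        var s = new double[series.Length];
        s[0] = series[0]; // initialize with the first observation
        for (int t = 1; t < series.Length; t++)
            s[t] = alpha * series[t] + (1 - alpha) * s[t - 1];
        return s;
    }
}
```

The one-step-ahead forecast is then simply the last smoothed value.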

2.4 Modeling
The next figure presents the models menu, which contains several modeling methods. These methods can be applied to univariate or multivariate series:

Figure 43.Models menu


Temporal Partial Least Squares

Main interface

When we first launch the PLS control, we are led to the home screen presented in the figure below:

Figure 44.PLS main interface


The PLS algorithm performed by our solution provides users with the following results (the underlying decomposition is sketched after the list):
- Factors
- Loadings matrix
- Weights matrix
- Model
- Projection
- Regression
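These outputs correspond to the standard PLS decomposition; the notation below is the usual textbook one, given for orientation rather than taken from the solution's code:

\[
X = T P^{\top} + E, \qquad Y = U Q^{\top} + F
\]

where T holds the factors (scores) extracted from the predictors X through the weights matrix W, P and Q are the loadings matrices, and E, F are the residuals; the regression step then predicts Y from the scores T.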

Factors interface

Figure 45.Factors Interface

Projection interface

Figure 46.Projection Interface


Regression interface

Figure 47.Regression Interface


2.5 Forecasting
The next figure presents the forecast menu, which contains two algorithms: the linear model and Holt's smoothing.

Figure 48.Forecast menu


For the two prediction methods, we provide a friendly interface that helps users easily change the inputs and outputs. The results are displayed both as charts and in a data table.

Linear regression interface

Figure 49: Linear Regression Interface
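The linear model fits a trend line to the series by ordinary least squares; the following is the standard formulation (our notation, not the solution's code):

\[
\hat{y}_t = a + b\,t, \qquad
b = \frac{\sum_{t=1}^{n} (t - \bar{t})(y_t - \bar{y})}{\sum_{t=1}^{n} (t - \bar{t})^2}, \qquad
a = \bar{y} - b\,\bar{t}
\]

Forecasts are obtained by evaluating the fitted line at the future time indices t = n+1, ..., n+h.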

Holt's Smoothing interface

Figure 50: Holt's Smoothing Interface
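Holt's smoothing extends simple exponential smoothing with a trend component; a minimal C# sketch under the usual two-parameter formulation (the initialization choices are our assumptions):

```csharp
static class HoltSketch
{
    // Holt's linear (double) exponential smoothing; requires series.Length >= 2.
    // alpha: level smoothing factor, beta: trend smoothing factor, both in (0, 1].
    public static double[] Forecast(double[] series, double alpha, double beta, int horizon)
    {
        double level = series[0];
        double trend = series[1] - series[0]; // simple trend initialization

        for (int t = 1; t < series.Length; t++)
        {
            double prevLevel = level;
            level = alpha * series[t] + (1 - alpha) * (prevLevel + trend);
            trend = beta * (level - prevLevel) + (1 - beta) * trend;
        }

        // h-step-ahead forecasts: last level plus h times the last trend.
        var forecasts = new double[horizon];
        for (int h = 1; h <= horizon; h++)
            forecasts[h - 1] = level + h * trend;
        return forecasts;
    }
}
```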


3 Performance Tests
After the implementation phase, we went through a testing phase of the application. The test phase is needed to detect anomalies and validate our application. It ensures that our solution reacts as intended and that the quality of the code is in line with expectations.
We performed some stress tests to check the performance and response time of our application. The next table presents some of the executed stress tests.
Test case | Inputs | Duration
Load data | Table with 70 columns and 13,500 rows | 2.5 seconds
Load data | Table with 40 columns and 5,300 rows | 1 second
Transformation | Table with 70 columns and 13,500 rows | 1 second
Description Analysis | Table with 1 column and 13,500 rows | 1 second
Partial Least Squares | Table with 40 columns and 5,300 rows | 9 seconds
Partial Least Squares | Table with 1 column and 13,500 rows | 5.5 seconds
Linear Prediction | Table with 70 columns and 13,500 rows | 10 seconds
Exponential Smoothing | Table with 70 columns and 13,500 rows | 6 seconds
ARMAX model | Table with 1 column and 13,500 rows | 7 seconds
Augmented Dickey-Fuller test | |

Table 23. Performance Tests

Conclusion
In this chapter, we have presented the implementation phase of the solution. We started by describing the different tools and libraries used throughout the project. Then, we presented the most important features offered by our application by showing its most important interfaces. Finally, we performed some tests to validate our application.


Conclusion and Perspectives

Traditionally, data mining and time series analysis have been seen as separate approaches to analyzing enterprise data. However, much of the data used by business processes is time-stamped. Time series analysis is a mixture of forecasting and traditional data mining techniques that uses time dimensions and predictive analytics to make better business decisions.


Our project, developed within the Integration Objects company, is a data mining
solution that can enhance the capabilities of the user in the area of time series analysis and
data preparation. Finding time series that exhibit similar statistical characteristics allows
analysts to easily identify customer or process behaviors of interest in large volumes of time
series data. With the wealth of enterprise data stored in time series, the power to integrate this
data into analysis workflows will help users to easily build valuable models.

In our project, we started by focusing on the understanding of the discipline by studying the
concept of time series analysis and reviewing the existing tools. The next step was to study
and analyze the features to design and implement in our solution and to bring out the functional and non-functional requirements of our project. We then proceeded with the design phase, detailing the architecture of our application as well as its static and dynamic design through the development of package and class diagrams.
Finally, we concluded the report by presenting the implementation and test phase of our project. That chapter describes the tools and frameworks used to achieve our solution and exposes the work done through screenshots covering the most important features of the solution.
Much of the data that are used in the operational side of a business have a built-in time
dimension. One of the challenges of developing this solution is the complexity of handling a
large number of time series.

In addition to the acquired technical knowledge, this internship has been an opportunity for me to adapt to and integrate into a professional environment, and to improve my communication and collaboration skills with the Integration Objects team.
To conclude, we have met the initial objectives, but the project remains open to
several enhancements. Firstly, our application can be easily extended with new modeling and
forecasting algorithms.
In addition, one of the enhancements that can be applied to our application is the optimization of the current algorithms, in order to improve the response time and to find better methods for handling and loading big data.



Netography
[N1] http://www.integrationobjects.com/services.php
[N2] http://www.integrationobjects.com/knowledgenet.php
[N3] http://www.eviews.com/home.html
[N4] http://www-01.ibm.com/software/analytics/spss/
[N5] http://gretl.sourceforge.net/
[N6] http://www.r-project.org/
[N7] http://www.sas.com/en_us/software/analytics.html
[N8] http://rdotnet.codeplex.com/
[N9] http://www.sas.com/en_us/software/integration-technologies.html
[N10] http://accord-framework.net/intro.html
[N11] https://code.google.com/p/accord/
[N12] http://www.microsoft.com/visualstudio/fra
[N13] http://www.sparxsystems.com.au/
[N14] http://www.devexpress.com/
