
A producer wants to know…

 Which are our lowest/highest margin customers?
 Who are my customers and what products are they buying?
 What is the most effective distribution channel?
 What product promotions have the biggest impact on revenue?
 Which customers are most likely to go to the competition?
 What impact will new products/services have on revenue and margins?
Data, Data everywhere, yet ...
 I can’t find the data I need
 data is scattered over the network
 many versions, subtle differences
 I can’t get the data I need
 need an expert to get the data
 I can’t understand the data I found
 available data is poorly documented
 I can’t use the data I found
 results are unexpected
 data needs to be transformed from one form to another
Data Mining
 Data mining refers to extracting or “mining” knowledge from large amounts of data.
 Many terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, and data archeology.
Data Mining: An overview
 Data mining looks for relations and associations between phenomena that are not known beforehand.
 Online analytical processing (OLAP) is usually a graphical instrument used to highlight relations between the available variables.
 OLAP is an important tool for business intelligence.
 Query and reporting tools describe what a database contains, but OLAP is used to explain why certain relations exist.
 OLAP is used in the preprocessing stage of data mining.
 It makes understanding the data easier because it focuses on the relevant data, identifies special cases, and looks for principal interrelations.
 The final data mining results are expressed using specific summary variables.
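To make the OLAP idea above concrete, here is a minimal sketch, not taken from the slides, of an OLAP-style summary in Python with pandas; the sales table and its columns (region, product, quarter, revenue) are assumed purely for illustration.

```python
# A minimal sketch (not from the slides) of an OLAP-style summary using pandas.
# The table and column names (region, product, quarter, revenue) are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "revenue": [100.0, 150.0, 80.0, 120.0, 90.0],
})

# OLAP-style "slice and dice": summarize revenue by region and quarter.
cube = sales.pivot_table(index="region", columns="quarter",
                         values="revenue", aggfunc="sum", fill_value=0)
print(cube)
```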
Data mining and Statistics
 Differences:
 Statistical methods are developed in relation to the data being analyzed, but also according to a conceptual frame of reference.
 Statistical methods have been limited in their ability to adapt quickly to the new methodologies arising from new technology and new machine learning techniques.
 Statistical methods analyze primary data that has been collected to check research hypotheses.
 Statistical data may be experimental data.
 Two approaches: top down (confirmative) and bottom up (explorative).
 Top down: confirms or rejects a hypothesis and tries to widen our knowledge of a partially understood phenomenon; it achieves this with traditional statistical methods.
 Bottom up: an explorative analysis in which the user looks for useful information previously unnoticed, searching through the data and looking for ways to create hypotheses.
 The bottom up approach is data mining.
 Bottom up analysis identifies important relations and tendencies, but it cannot explain why these discoveries are useful or to what extent they are valid.
 The confirmative tools of top down analysis can then be used to confirm the discoveries and evaluate the quality of decisions based on them.
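As a hedged illustration of the two approaches (not part of the original material), the sketch below contrasts a top down, confirmative check of one pre-stated hypothesis with a bottom up scan of all variable pairs for previously unnoticed relations; the variables and the 0.8 threshold are arbitrary assumptions.

```python
# Illustrative sketch (not from the slides) contrasting the two approaches.
# Variable names and the 0.8 threshold are arbitrary choices for the example.
from itertools import combinations

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

data = {
    "ad_spend": [10, 20, 30, 40, 50],
    "revenue":  [12, 25, 31, 44, 48],
    "returns":  [5, 4, 6, 5, 7],
}

# Top down (confirmative): test one hypothesis chosen beforehand.
print("ad_spend vs revenue:", round(pearson(data["ad_spend"], data["revenue"]), 3))

# Bottom up (explorative): scan every pair of variables for strong,
# previously unnoticed relations.
for a, b in combinations(data, 2):
    r = pearson(data[a], data[b])
    if abs(r) > 0.8:
        print(f"candidate relation: {a} ~ {b} (r = {r:.3f})")
```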
Three aspects distinguish statistical data analysis from data mining
 Data mining analyzes great masses of data.
 Many databases do not lead to the classical forms of statistical data organization.
 Data mining results must be of some consequence. This means that constant attention must be given to the results achieved with the data analysis methods.
Knowledge discovery process
Steps: Knowledge discovery
 Database, data warehouse, WWW, or other information repository: one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
 Database or data warehouse server: responsible for fetching the relevant data, based on the user’s data mining request.
 Knowledge base: used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies used to organize attributes or attribute values into different levels of abstraction.
Organization of data
 Data is organized into an ordered database.
 Data is brought together in a unified information system called a data warehouse.
Data warehouse
 A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
 Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
 A data warehouse is an integrated collection of data.
 Characteristics of a data warehouse:
 Data in a data warehouse is divided according to subject rather than business process.
 Data is integrated.
 A data warehouse is non-volatile because data is added rather than updated.
 This means a data warehouse is like a container of all the data needed to carry out business intelligence.
Classification of data
 Data should be classified according to two principles:
 Statistical units: the elements in the reference population that are considered important for the aims of the analysis (e.g., the supply companies, the people who visit the site).
 Statistical variables: the characteristics measured for each statistical unit (e.g., the amount the customers buy, the payment method they use, the socio-demographic profile of such customers).
 Statistical variables are the main source of information to work on in order to extract conclusions about the observed units and eventually to extend these conclusions to a wider population.
 Once the units and the variables of interest for the statistical analysis of the data have been established, each observation is related to a statistical unit, and a distinct value for each variable is assigned. This process is known as classification.
 There are two different types of variables:
 Qualitative: expressed as adjectival phrases, so they are classified into levels, sometimes known as categories. They are nominal if the categories have no intrinsic order, and ordinal if the categories have an order that is either explicit or implicit. Nominal examples: postal address, eye color.
 Quantitative: expressed numerically. Examples: age, income.
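A small illustrative sketch, with entirely hypothetical customer data, of how qualitative (nominal or ordered) and quantitative variables might be declared once the statistical units have been classified; the education column and its ordering are assumptions added for the example.

```python
# A small sketch (hypothetical data) showing how the two variable types might be
# declared once the statistical units (here: customers) have been classified.
import pandas as pd

customers = pd.DataFrame({
    "eye_color": ["brown", "blue", "brown", "green"],              # qualitative (nominal)
    "education": ["primary", "secondary", "degree", "secondary"],  # qualitative, implicit order
    "age":       [34, 28, 45, 52],                                 # quantitative
    "income":    [32000, 41000, 55000, 47000],                     # quantitative
})

# Nominal: categories without order.
customers["eye_color"] = pd.Categorical(customers["eye_color"])
# Ordered categories: an explicit order is declared.
customers["education"] = pd.Categorical(
    customers["education"],
    categories=["primary", "secondary", "degree"], ordered=True)

print(customers.dtypes)
```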
Reasons for growing data mining
research
 The amount of digital data has exploded in the last decade, while the number of scientists, engineers, and analysts available to analyze the data has remained static.
 Bridging this gap requires the solution of fundamentally new research problems, which include the following challenges:
 Developing algorithms and systems to mine large, massive, and high-dimensional data sets
 Developing algorithms and systems to mine new types of data
 Developing algorithms, protocols, and other infrastructure to mine distributed data
 Improving the ease of use of data mining systems
 Developing appropriate privacy and security models for data mining
Data mining : benefits
 Keeps businesses profitable.
 According to IBM: data mining offers firms in many industries the ability to discover hidden patterns that can help them understand customer behavior and market trends.
 Data mining saves money and improves efficiency.
 It adds automation to your existing software and hardware.
Types of System
 An information system (IS) is a system composed of people and computers that processes or interprets information.
 Information systems differ in their business needs, and the information varies depending upon the different levels in an organization.
 Information systems can be broadly categorized into the following:
 Operational systems
 Informational systems
Operational system

 An operational system is a system that is used to process the day-to-day transactions of an organization.
 These are the backbone systems of any enterprise, such as order entry, inventory, manufacturing, payroll, and accounting.
 An operational system is process-oriented or process-driven: focused on specific business processes or tasks. Example tasks include billing, registration, etc.
 Operational systems maintain records of daily transactions.
 Operational systems are generally concerned with current data.
 Data within operational systems is generally updated regularly according to need.
 Operational systems are generally optimized to perform fast inserts and updates of relatively small volumes of data.
Informational system

 An informational system is an integrated set of components for collecting, storing, and processing data and for delivering information and knowledge.
 It supports analyzing data and making decisions, often major decisions about how the enterprise will operate now and in the future.
Comparison: Operational and
information system
 An operational system records the
 sales,
 updates the customer account balance, and
 makes deductions from inventory.
 Using this information, the informational system can produce reports that
 recap daily sales activity,
 list customers with past-due account balances,
 graph slow and fast selling products, and
 highlight items that need reordering.
OLTP and DSS systems
 OLTP stands for Online Transaction Processing.
 An OLTP system is used in the operational environment; when a transaction executes, the execution entails little data.
 As few as two or three rows of data may be required for the execution of an operational transaction.
 A decision support system (DSS) is a computer-based information system that supports business or organizational decision-making activities.
 DSSs serve the management, operations, and planning levels of an organization and help to make decisions, which may be rapidly changing and not easily specified in advance.
 A transaction run in the DSS environment may access thousands of rows of data.
 The response time in the DSS environment is different from the response time in OLTP.
 Response time may vary from seconds to hours.
 In a DSS there are two response times:
 One response time is the length of time from the moment the transaction is initiated until the first of the results are returned.
 The measurable response time is the length of time from the moment of initiation of the transaction until the moment the last of the results are returned.
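The following self-contained sketch (assumed schema, SQLite in memory) contrasts the two workloads: an OLTP-style transaction that touches a single row versus a DSS-style query that scans and aggregates thousands of rows.

```python
# A minimal, self-contained sketch (not the slides' own example) contrasting an
# OLTP-style and a DSS-style query on a hypothetical orders table, using SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
             "amount REAL, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [(i, i % 100, 10.0 + i % 7, f"2023-{1 + i % 12:02d}-15")
                  for i in range(10_000)])

# OLTP: a transaction touches only a few rows (current data, fast insert/update).
conn.execute("UPDATE orders SET amount = amount + 5 WHERE order_id = 42")

# DSS: an analysis query may scan thousands of rows and aggregate them.
monthly = conn.execute(
    "SELECT substr(order_date, 1, 7) AS month, SUM(amount) "
    "FROM orders GROUP BY month ORDER BY month").fetchall()
print(monthly[:3])
```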
What is a Data warehouse?
 A single, complete, and consistent store of data, obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
 A data warehouse is a collection of integrated databases designed to support a DSS.
 A data warehouse is a special type of database.
 It is used to store large amounts of data, such as analytics, historical, or customer data, and then to build large reports and run data mining against it.
 The data warehouse is an information environment that
 provides an integrated and total view of the enterprise;
 makes the enterprise's current and historical information easily available for decision making;
 makes decision support transactions possible without hindering operational systems;
 renders the organization's information consistent;
 presents a flexible and interactive source of strategic information.
Data Warehouse
 A data warehouse is a
 subject-oriented
 integrated
 time-varying
 non-volatile

collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse, 1996

Characteristics and functioning
of Data Warehousing
 A common way of introducing data
warehousing is to refer to the
characteristics of a data warehouse.

 Subject Oriented
 Integrated
 Nonvolatile
 Time Variant
Subject Oriented:
 Data warehouses are designed to help you analyze data, for example customer, product, or sales data.
 For example: to learn more about your company's sales data, you can build a warehouse that concentrates on sales.
 Using this warehouse, you can answer questions like
1. "Who was our best customer for this item last year?“
2. Who is likely to be our best customer next year?

 This ability to define a data warehouse by subject


matter, sales in this case, makes the data warehouse
subject oriented.

Prepared By :mitali mistry (PICA) 40
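As a hedged sketch of the slide's sample question, the snippet below runs "Who was our best customer for this item last year?" against a hypothetical subject-oriented sales table; the schema, item id, and year are assumptions.

```python
# A hedged sketch of the kind of question the slide mentions, against a
# hypothetical subject-oriented sales schema (table and column names assumed).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (customer_id INTEGER, item_id INTEGER, sale_year INTEGER, amount REAL);
INSERT INTO sales VALUES (1, 7, 2023, 500), (2, 7, 2023, 900), (2, 7, 2022, 100), (3, 8, 2023, 700);
""")

# "Who was our best customer for this item last year?"
best = conn.execute("""
    SELECT customer_id, SUM(amount) AS total
    FROM sales
    WHERE item_id = 7 AND sale_year = 2023
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 1
""").fetchone()
print(best)   # -> (2, 900.0)
```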


Integrated:
 Integration is closely related to subject orientation.
 Data warehouses must put data from disparate sources
into a consistent format.
 They must resolve such problems as naming conflicts
and inconsistencies among units of measure.
 When they achieve this, they are said to be integrated.
 Constructed by integrating multiple, heterogeneous data
sources
 relational databases, flat files, on-line transaction records
 Data cleaning and data integration techniques are applied.
 Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
○ E.g., Hotel price: currency, tax, breakfast covered, etc.
 When data is moved to the warehouse, it is converted.

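A minimal sketch of the integration step, reusing the hotel-price example above; the two source record layouts, field names, and the fixed exchange rate are assumptions made for illustration.

```python
# A small sketch of the integration idea, using the slide's hotel-price example.
# The source records, field names, and exchange rate are hypothetical.
EUR_TO_USD = 1.1  # assumed fixed rate, purely for illustration

source_a = [{"hotel": "Alpha", "price": 120, "currency": "USD", "tax_included": True}]
source_b = [{"hotel": "Beta", "preis": 100, "waehrung": "EUR", "mwst": False}]

def integrate(record, mapping, currency_field):
    """Rename fields to a common schema and convert prices to USD."""
    unified = {mapping[k]: v for k, v in record.items() if k in mapping}
    rate = EUR_TO_USD if record[currency_field] == "EUR" else 1.0
    unified["price_usd"] = round(unified.pop("price") * rate, 2)
    return unified

warehouse_rows = (
    [integrate(r, {"hotel": "hotel", "price": "price", "tax_included": "tax_included"},
               "currency") for r in source_a] +
    [integrate(r, {"hotel": "hotel", "preis": "price", "mwst": "tax_included"},
               "waehrung") for r in source_b]
)
print(warehouse_rows)
```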


Nonvolatile:
 Nonvolatile means that, once entered into
the warehouse, data should not change.
So, historical data in a data warehouse
should never be altered.



Time Variant:
 In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive.
 A data warehouse's focus on change over time is what is meant by the term time variant.



 Typically, data flows from one or more online
transaction processing (OLTP) databases into a
data warehouse on a monthly, weekly, or daily
basis.
 Historical data is kept in a data warehouse. For
example, one can retrieve data from 3 months,
6 months, 12 months, or even older data from a
data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

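The sketch below (assumed schema) illustrates the address example: the operational table keeps only the latest address, while the warehouse dimension keeps every address with validity dates.

```python
# A minimal sketch (assumed schema) of how a warehouse can keep every address a
# customer has had, while a transaction system keeps only the latest one.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Operational table: one current address per customer.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, address TEXT);
-- Warehouse dimension: full address history with validity dates.
CREATE TABLE customer_dim (customer_id INTEGER, address TEXT,
                           valid_from TEXT, valid_to TEXT);
""")

def change_address(cust_id, new_address, change_date):
    # Transaction system: overwrite the single current row.
    conn.execute("INSERT OR REPLACE INTO customer VALUES (?, ?)", (cust_id, new_address))
    # Warehouse: close the previous version and add a new one (history is kept).
    conn.execute("UPDATE customer_dim SET valid_to = ? "
                 "WHERE customer_id = ? AND valid_to IS NULL", (change_date, cust_id))
    conn.execute("INSERT INTO customer_dim VALUES (?, ?, ?, NULL)",
                 (cust_id, new_address, change_date))

change_address(1, "12 Old Street", "2022-01-01")
change_address(1, "34 New Avenue", "2023-06-15")
print(conn.execute("SELECT * FROM customer").fetchall())      # only the latest address
print(conn.execute("SELECT * FROM customer_dim").fetchall())  # every address, with dates
```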


Data Warehouse S/W and H/W
Architecture
 The architecture of a data warehouse is complex and includes many elements.
 A data warehouse is an amalgamation of different systems.
 All software development projects require the selection of a technical infrastructure.
 The basic technical infrastructure includes the operating system, hardware platform, database management system, and network.
Steps to develop DW Architecture
 Enlist the full support and commitment of the project sponsor/executive of the company.
 Staff appointed to the architecture team must be strongly skilled.
 Prototype/benchmark all technologies you are interested in using. Design and develop a prototype that can be used to test all of the different technologies being considered.
 The architecture team should be given enough time to build the architecture before development begins.
 The development staff must be trained on the use of the architecture before development begins.
 Give the architecture team the freedom to enhance and improve the architecture as the project moves forward.
Architectural components of a data warehouse
 The architecture includes everything that is needed to prepare the data and store it.
 The architecture is further composed of the rules, procedures, and functions that enable your data warehouse to work and fulfill business requirements.
 Finally, the architecture is made up of a number of interconnected components.
Data Warehouse Architecture

 Figure shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.
Components
 Operational Data:
 The source of the data.
 The sources include data from mainframe systems in the traditional network and hierarchical formats, and from relational DBMSs like Oracle.
 In addition to this internal data, operational data also includes external data obtained from commercial databases.
 Load Manager:
 Responsible for collecting data from operational systems and converting it into a form usable by the users.
 Performs all the operations associated with extracting data and loading it into the data warehouse.
 Tasks:
○ Identification of data
○ Validation of data for accuracy
○ Cleansing of data by eliminating meaningless values
○ Data formatting
○ Data standardization by getting data into a consistent form
○ Data merging by taking data from different sources and consolidating it into one place
○ Enabling referential integrity
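A hedged sketch of several load manager tasks from the list above (validation, cleansing, standardization, and merging); the raw source records and field names are hypothetical.

```python
# A hedged sketch of the load-manager tasks listed above (identification,
# validation, cleansing, standardization, merging). Field names are assumed.
RAW_SOURCE_A = [{"cust": " Alice ", "spend": "120.5"}, {"cust": "", "spend": "x"}]
RAW_SOURCE_B = [{"customer_name": "BOB", "amount_eur": 90.0}]

def load_manager(source_a, source_b):
    staged = []
    # Extract + standardize source A into a common form.
    for rec in source_a:
        name = rec["cust"].strip().title()
        try:
            amount = float(rec["spend"])          # validation: must be numeric
        except ValueError:
            continue                              # cleansing: drop meaningless values
        if not name:
            continue                              # cleansing: drop empty names
        staged.append({"customer": name, "amount": amount})
    # Extract + standardize source B, then merge into the same place.
    for rec in source_b:
        staged.append({"customer": rec["customer_name"].title(),
                       "amount": float(rec["amount_eur"])})
    return staged

print(load_manager(RAW_SOURCE_A, RAW_SOURCE_B))
# -> [{'customer': 'Alice', 'amount': 120.5}, {'customer': 'Bob', 'amount': 90.0}]
```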
Warehouse Manager
 Performs all operations associated with the management of data in the warehouse.
 The operations performed by the warehouse manager include
 Analysis of data to ensure consistency
 Transformation and merging of source data from temporary storage into the data warehouse tables
 Creation of indexes and views on the base tables
 De-normalization
 Generation of aggregations
 Backing up and archiving data
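As an illustration of the "generate aggregation" task, the sketch below pre-computes a monthly summary table from detailed sales rows, using an assumed schema and SQLite.

```python
# A small sketch of the "generate aggregation" task: the warehouse manager
# pre-computes a monthly summary table from detailed sales rows (names assumed).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_detail (sale_date TEXT, product TEXT, amount REAL);
INSERT INTO sales_detail VALUES
  ('2023-08-01', 'A', 10), ('2023-08-14', 'A', 20),
  ('2023-08-20', 'B', 5),  ('2023-09-02', 'A', 30);
""")

# Pre-compute the aggregate once, so "August sales" style queries read a tiny table.
conn.executescript("""
CREATE TABLE sales_monthly AS
SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS total
FROM sales_detail
GROUP BY month, product;
""")
print(conn.execute("SELECT * FROM sales_monthly WHERE month = '2023-08'").fetchall())
```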
 Query Manager
 Performs all operations necessary to support the query management process.
 It directs queries to the appropriate data tables and schedules the execution of user queries.
 Detailed data
 Stores detailed data in the database schema.
 Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales.
 Archive and backup data
 Stores detailed and summarized data for the purposes of archiving and backup.
 Data is transferred to archive storage such as magnetic tape or optical disc.
 Meta data
 Data about data.
 To provide universal data access, it is necessary to maintain some form of data dictionary of metadata information.
 End user access tools
 The purpose of a data warehouse is to provide information to business managers for strategic decision making.
 These users interact with the warehouse using end user tools.
 Example end user access tools:
○ Reporting and query tools
○ Application development tools
○ Executive information system tools
○ Online analytical processing tools
○ Data mining tools
Single-Layer Architecture

 A single-layer architecture is not frequently used in practice. Its goal is to minimize the amount of data stored; to reach this goal, it removes data redundancies.

 In this case, data warehouses are virtual.
 This means that a data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.



 The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing.
 Analysis queries are submitted to operational data after the middleware interprets them.
 In this way, the queries affect regular transactional workloads.
 In addition, although this architecture can meet the requirements for integration and correctness of data, it cannot log more data than its sources do. For these reasons, a virtual approach to data warehouses can be successful only if analysis needs are particularly restricted and the data volume to analyze is not huge.
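A minimal sketch of the single-layer (virtual) idea: analysis goes through a view defined directly over the operational table, so there is no separately stored warehouse; the orders schema is assumed.

```python
# A hedged sketch of the single-layer idea: the "warehouse" is only a virtual,
# multidimensional view defined over operational data (schema names assumed).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                     product TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES (1, 10, 'A', '2023-08-01', 100),
                          (2, 11, 'B', '2023-08-03', 60),
                          (3, 10, 'A', '2023-09-05', 40);

-- No separate storage: analysis runs through a view, i.e. middleware on top of
-- the operational table, so every analytical query hits the OLTP data directly.
CREATE VIEW sales_cube AS
SELECT substr(order_date, 1, 7) AS month, product, SUM(amount) AS revenue
FROM orders GROUP BY month, product;
""")
print(conn.execute("SELECT * FROM sales_cube").fetchall())
```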
Two-Layer Architecture

 A two-layer architecture highlights the separation between the physically available sources and the data warehouse.
 It actually consists of four subsequent data flow stages:
1. Source layer
2. Data staging
3. Data warehouse layer
4. Analysis


 The component marked as a data warehouse in the figure is also often called the primary data warehouse or corporate data warehouse.
 It acts as a centralized storage system for all the data being summed up.



Source layer

 A data warehouse system uses heterogeneous sources of data. That data is originally stored in corporate relational databases, or it may come from information systems outside the corporate walls or from flat files.



Data staging
 The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one common schema.
 The so-called Extraction, Transformation, and Loading (ETL) tools can merge heterogeneous schemata and extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
 This stage deals with problems that are typical of distributed information systems, such as inconsistent data management and incompatible data structures.



Data warehouse layer

 Information is stored in one logically centralized single repository: a data warehouse.
 The data warehouse can be directly accessed, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments.
 Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemata, and so on.
Data Marts

 A data mart is a subset or an aggregation of the data stored in a primary data warehouse. It includes a set of information pieces relevant to a specific business area, corporate department, or category of users.
 Data marts can be viewed as small, local data warehouses replicating (and summing up as much as possible) the part of a primary data warehouse required for a specific application domain.



 The data marts populated from a primary data warehouse are often called dependent. Although data marts are not strictly necessary, they are very useful for data warehouse systems in midsize to large enterprises.

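A short sketch of a dependent data mart: a department-specific subset, summed up, copied from an assumed primary warehouse table.

```python
# A minimal sketch of a dependent data mart: a department-specific subset copied
# (and summarized) from the primary warehouse. Table and column names are assumed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse_sales (department TEXT, product TEXT, month TEXT, revenue REAL);
INSERT INTO warehouse_sales VALUES
  ('marketing', 'A', '2023-08', 100), ('marketing', 'B', '2023-08', 50),
  ('logistics', 'A', '2023-08', 70),  ('marketing', 'A', '2023-09', 120);

-- The marketing data mart replicates only the slice of the primary warehouse
-- that the marketing department needs, summed up by product and month.
CREATE TABLE marketing_mart AS
SELECT product, month, SUM(revenue) AS revenue
FROM warehouse_sales
WHERE department = 'marketing'
GROUP BY product, month;
""")
print(conn.execute("SELECT * FROM marketing_mart").fetchall())
```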


Advantages of data marts
 They are used as building blocks while incrementally developing data warehouses.
 They mark out the information required by a specific group of users to solve queries.
 They can deliver better performance because they are smaller than primary data warehouses.



Analysis
 In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios.
 Technologically speaking, it should feature aggregate data navigators, complex query optimizers, and user-friendly GUIs.



 Sometimes, mainly for organizational and policy purposes, a different architecture is used in which the sources directly populate data marts. These data marts are called independent.



Benefits of a two-layer architecture
 In data warehouse systems, good quality information is always available, even when access to sources is denied temporarily for technical or organizational reasons.
 Data warehouse analysis queries do not affect the
management of transactions, the reliability of which is vital for
enterprises to work properly at an operational level.
 Data warehouses are logically structured according to the
multidimensional model, while operational sources are
generally based on relational or semi-structured models.
 A mismatch in terms of time and granularity occurs between
OLTP systems, which manage current data at a maximum level
of detail, and OLAP systems, which manage historical and
summarized data.
 Data warehouses can use specific design solutions aimed at
performance optimization of analysis and report applications.



Two-layer architecture (figure)


Three-layer architecture
 This architecture contains three layers:
1. Source layer
2. Reconciled layer
3. Data Warehouse layer


Reconciled data layer
 This layer materializes operational data obtained after integrating and cleansing source data.
 As a result, this data is integrated, consistent, correct, current, and detailed.
 Here the data warehouse is not populated from its sources directly, but from the reconciled data.



 The main advantage of the reconciled
data layer is that it creates a common
reference data model for a whole
enterprise
 At the same time, it sharply separates
the problems of source data extraction
and integration from those of data
warehouse population.



 The reconciled layer is also directly used to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration.
 Reconciled data leads to more redundancy of the operational source data.
 Note that we may assume that even two-layer architectures can have a reconciled layer that is not specifically materialized, but only virtual, because it is defined as a consistent integrated view of operational source data.



Data warehouse components (figure)


Data Warehouse components
 The different components of the Data Warehouse architecture are:
1. Data Source
2. Data Staging Area
3. Data Presentation Area
4. Data Access Tools



1. Source system
 These are the traditional OLTP systems which store the transaction data of the organization's business. They generally work on one record at a time and do not necessarily store the history of the organization's information. Operational source systems are generally not used for reporting in the way a data warehouse is.
 The sources can be quite diverse:
 Production databases like Oracle, Sybase, SQL Server.
 Excel sheets.
 Databases of small applications, e.g., in MS Access.
 ASCII/flat data files.



2. Data Staging Area
 The data staging area is the storage area, as well as the set of ETL processes, that extracts data from the source systems. It is everything between the source systems and the data warehouse.
 The data staging area should never be used for reporting purposes. Data is extracted from the source systems and stored, cleansed, and transformed in the staging area before being loaded into the data warehouse.
 The staging area is not necessarily a DBMS; it could also be flat files. The staging area can be structured like the normalized source systems. It totally depends on the choices and needs of the development process.



 Data staging covers most of the 'back-bone' activities of a data warehouse.
 These activities are 'Extraction', 'Transformation', and 'Loading'.
 ETL - Data Extraction
 Data extraction is an activity which pulls the data from various data sources. Most of these sources are production systems or are used for transaction-level work.
 ETL - Data Transformation
 The transformation makes sure that the transaction-level raw data is transformed into a form (while still being detailed) so that it can be loaded into the 'presentation/loaded' area.
 ETL - Presentation/Loaded Area
 This is the repository where the data is finally loaded after going through all the work of extraction and transformation. This becomes the ultimate source of information for various purposes, ranging from queries to advanced data modeling.

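A compact, self-contained sketch of the three staging activities just described; the flat-file style source rows, the validation rule, and the target table are hypothetical.

```python
# A compact, self-contained sketch of the three staging activities described
# above. The CSV-like source rows and the target schema are hypothetical.
import sqlite3

# Extraction: pull raw, transaction-level rows from a flat-file-style source.
raw_rows = [
    "1001,alice,2023-08-01,120.50",
    "1002,bob,2023-08-02,not_a_number",   # bad record, to be filtered out
    "1003,carol,2023-08-02,80.00",
]

# Transformation: parse, validate, and reshape into the warehouse's format.
def transform(line):
    order_id, customer, date, amount = line.split(",")
    try:
        return (int(order_id), customer.title(), date, float(amount))
    except ValueError:
        return None   # cleansing: drop rows that fail validation

clean_rows = [r for r in (transform(l) for l in raw_rows) if r is not None]

# Loading: write the cleansed rows into the presentation/loaded area.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, "
             "order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", clean_rows)
print(conn.execute("SELECT * FROM fact_orders").fetchall())
```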


3. Data Presentation Area
 The data presentation area is generally called the data warehouse. It is the place where cleaned, transformed data is stored in a dimensionally structured warehouse and made available for analysis purposes.



4. Data Access Tools
 Once data is available in the presentation area, it is accessed using data access tools like Business Objects.

