Data Warehousing and Data Mining


Q1) Explain the Top-Down and Bottom-Up Data Warehouse development






Although Data Warehouses can be designed in a number of different
ways, they all share a number of important characteristics.
Most Data Warehouses are Subject Oriented. This means that the information
in the Data Warehouse is stored in a way that allows it to be connected to
objects or events that occur in reality.
Another characteristic that is frequently seen in Data Warehouses is called Time
Variant. A time variant Data Warehouse will allow changes in the information to
be monitored and recorded over time.
Data Warehouses are also Integrated. Data from all the programs used by a
particular institution is stored in the Data Warehouse in a consistent, integrated form.
The first Data Warehouses were developed in the 1980s.
As societies entered the information age, there was a large demand for efficient
methods of storing information.
Many of the systems that existed in the 1980s were not powerful enough to store
and manage large amounts of data.
The systems that existed at the time took too long to report and process
information. Many of these systems were not designed to analyze or report
information. In addition to this, the computer programs that were necessary for reporting
information were both costly and slow. To solve these problems, companies
began designing computer databases that placed an emphasis on managing and
analyzing information. These were the first Data Warehouses, and they could
obtain data from a variety of different sources, including PCs and
mainframes.
Spreadsheet programs have also played an important role in the development of
Data Warehouses. By the end of the 1990s, the technology had greatly
advanced, and was much lower in cost.
The technology has continued to evolve to meet the demands of those who are
looking for more functions and speed.
There are four advances in Data Warehouse technology that have allowed it to
evolve. These advances are offline operational databases, real time Data
Warehouses, offline Data Warehouses, and integrated Data Warehouses.
The offline operational database is a system in which the information within the
database of an operational system is copied to a server that is offline.
When this is done, the operational system will perform at a much higher level. As
the name implies, a real time Data Warehouse system will be updated every time
an event occurs. For example, if a customer orders a product, a real time Data
Warehouse will automatically update the information in real time.
Another important concept that is related to Data Warehouses is called data
transformation. As the name suggests, data transformation is a process in which
information transferred from specific sources is cleaned and loaded into a

Q2) Explain the Functionalities and advantages of Data Warehouses

Data Warehouses exist to facilitate complex, data-intensive and frequent ad hoc
queries. Accordingly, Data Warehouses must provide far greater and more efficient
query support than is demanded of transactional databases.
Data Warehouses provide the following functionality:
o Roll-up: Data is summarized with increased generalization.
o Drill-down: Increasing levels of detail are revealed.
o Pivot: Cross tabulation (also referred to as rotation) is performed.
o Slice and Dice: Performing projection operations on the dimensions.
o Sorting: Data is sorted by ordinal value.
o Selection: Data is available by value or range.
o Derived or Computed Attributes: Attributes are computed by operations on
stored data and values are derived.
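
The operations listed above can be sketched on a tiny in-memory fact table. This is a minimal illustration only: the rows, the dimension names and the helper functions (aggregate, select, pivot) are assumptions made for the example and do not come from any particular Data Warehouse product.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, item, sales).
FACTS = [
    (1997, "Q1", "East", "TV", 100),
    (1997, "Q2", "East", "TV", 150),
    (1997, "Q1", "West", "Phone", 80),
    (1998, "Q1", "East", "Phone", 120),
    (1998, "Q2", "West", "TV", 200),
]
DIMS = ("year", "quarter", "region", "item")


def aggregate(facts, keep):
    """Sum sales grouped by the dimensions named in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def select(facts, dim, value):
    """Selection: keep only the rows where dimension `dim` equals `value`."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]


def pivot(facts, row_dim, col_dim):
    """Cross tabulation: table[row][col] holds the total sales."""
    table = defaultdict(dict)
    for (r, c), total in aggregate(facts, (row_dim, col_dim)).items():
        table[r][c] = total
    return dict(table)


roll_up = aggregate(FACTS, ("year",))                # more general: totals per year
drill_down = aggregate(FACTS, ("year", "quarter"))   # more detail: totals per quarter
east_only = select(FACTS, "region", "East")          # selection by value
by_region_item = pivot(FACTS, "region", "item")      # regions as rows, items as columns
ranked = sorted(roll_up.items(), key=lambda kv: kv[1], reverse=True)  # sorting
```

Roll-up and drill-down use the same grouping function here; the only difference is how many dimensions are kept, which is why the two operations are usually described as inverses of each other.
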
A Data Warehouse provides a common data model for data, regardless of the
data source.
This makes it easier to report and analyze information than it would be if multiple
data models from disparate sources were used to retrieve information such as
sales invoices, order receipts, general ledger charges, etc.
Prior to loading data into the Data Warehouse, inconsistencies are identified and
resolved. This greatly simplifies reporting and analysis.
Information in the Data Warehouse is under the control of Data Warehouse users
so that, even if the source system data is purged over time, the information in the
warehouse can be stored safely for extended periods of time.
Because they are separate from operational systems, Data Warehouses provide
fast retrieval of data without slowing down operational systems.
Data Warehouses facilitate Decision Support System applications such as trend
reports (e.g., the items with the most sales in a particular area within the last two
years), exception reports, and reports that show actual performance versus

Q3) Describe Hypercube and Multicube.

Multidimensional databases can present their data to an application using
two types of cubes:
- hypercubes and
- multicubes.
The hypercube illustrated here is a cube with four dimensions. In the hypercube model,
as shown in the following illustration, all data appears logically as a single cube.

MDS of four dimensions

Page display for 4 dimensional data

This intuitive representation is a hypercube, a representation that accommodates more
than three dimensions. At a lower level of simplification, a Hypercube can very well
accommodate three dimensions. A hypercube is a general metaphor for representing
multidimensional data. Often, Multi Dimensional Structures (MDS) are used to represent
such data.
Multicube: In the multicube model, data is segmented into a set of smaller cubes, each
of which is composed of a subset of the available dimensions. The same data can
therefore be viewed through several cubes, each covering different dimensions.
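
The difference between the two models can be sketched with plain dictionaries. This is only an illustration under assumed names: the four dimensions (product, region, quarter, channel), the cell values and the project helper are hypothetical, and the smaller cubes are derived here by summing purely for demonstration, whereas a real multidimensional database would typically store each smaller cube separately.

```python
from collections import defaultdict

# Hypercube: one logical cube; every cell is addressed by all four dimensions.
HYPER_DIMS = ("product", "region", "quarter", "channel")
hypercube = {
    ("TV", "East", "Q1", "Store"): 100,
    ("TV", "East", "Q1", "Online"): 40,
    ("Phone", "West", "Q2", "Store"): 80,
}


def project(cube, dims, keep):
    """Build a smaller cube over the subset `keep` of `dims` by summing cells."""
    idx = [dims.index(d) for d in keep]
    small = defaultdict(int)
    for key, value in cube.items():
        small[tuple(key[i] for i in idx)] += value
    return dict(small)


# Multicube: a set of smaller cubes, each composed of a subset of the dimensions.
multicube = {
    ("product", "region"): project(hypercube, HYPER_DIMS, ("product", "region")),
    ("quarter", "channel"): project(hypercube, HYPER_DIMS, ("quarter", "channel")),
}
```
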


Q4) List and explain the Strategies for data reduction.

Strategies for data reduction include the following:
Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
Dimension reduction, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
Data compression, where encoding mechanisms are used to reduce the data set size.
Numerosity reduction, where the data are replaced or estimated by alternative,
smaller data representations such as parametric models, or nonparametric methods
such as clustering, sampling, and the use of histograms.
Discretization and concept hierarchy generation, where raw data values for
attributes are replaced by ranges or higher conceptual levels.
Data Cube Aggregation
Imagine that you have collected the data for your analysis. These data consist of the All
Electronics sales per quarter, for the years 1997 to 1999. You are however interested in
the annual sales (total per year), rather than the total per quarter. Thus the data can be
aggregated so that the resulting data summarize the total sales per year instead of per
quarter. This aggregation is illustrated in Fig. 10.4. The resulting data set is smaller in
volume, without loss of the information necessary for the data analysis task.
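
As a minimal sketch of this aggregation, the snippet below collapses quarterly figures into annual totals. The sales amounts are hypothetical placeholders, not the values from the figure referenced above.

```python
from collections import defaultdict

# Hypothetical quarterly sales as (year, quarter, amount); twelve rows in total.
quarterly = [
    (1997, "Q1", 224), (1997, "Q2", 408), (1997, "Q3", 350), (1997, "Q4", 586),
    (1998, "Q1", 300), (1998, "Q2", 410), (1998, "Q3", 380), (1998, "Q4", 600),
    (1999, "Q1", 320), (1999, "Q2", 450), (1999, "Q3", 400), (1999, "Q4", 650),
]

# Aggregate away the quarter dimension: one total per year.
annual = defaultdict(int)
for year, _quarter, amount in quarterly:
    annual[year] += amount
annual = dict(annual)
```

Twelve quarterly rows are reduced to three annual rows, and the overall total is unchanged, which is all the annual analysis needs.
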

The first figure shows a data cube for multidimensional analysis of sales data with respect to
annual sales per item type for each All Electronics branch. Each cell holds an aggregate
data value, corresponding to the data point in multidimensional space.
Dimensionality Reduction
Basic heuristic methods of attribute subset selection include the following techniques,
some of which are illustrated in the figure below.


Stepwise forward selection: The procedure starts with an empty set of attributes.
The best of the original attributes is determined and added to the set. At each
subsequent step, the best of the remaining attributes is added to the set.
Stepwise backward elimination: The procedure starts with the full set of attributes.
At each step, it removes the worst attribute remaining in the set.
Combination of forward selection and backward elimination: The stepwise
forward selection and backward elimination methods can be combined so that, at
each step, the procedure selects the best attribute and removes the worst from
among the remaining attributes.
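
The first two heuristics can be sketched as greedy loops. This is a sketch under stated assumptions: the text does not say how "best" and "worst" are measured, so the caller supplies a hypothetical score function over attribute subsets, and both stopping rules (stop when the score no longer improves, or when a removal would lower it) are assumptions of this example. The USEFULNESS weights and toy_score are invented for the demonstration.

```python
def forward_selection(attributes, score):
    """Start with an empty set and add the best remaining attribute at each step."""
    selected, remaining = [], list(attributes)
    best_score = score(selected)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:  # assumed stopping rule: no improvement
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


def backward_elimination(attributes, score):
    """Start with the full set and remove the worst remaining attribute at each step."""
    selected = list(attributes)
    best_score = score(selected)
    while selected:
        # The worst attribute is the one whose removal leaves the highest score.
        worst = max(selected, key=lambda a: score([x for x in selected if x != a]))
        reduced = [x for x in selected if x != worst]
        reduced_score = score(reduced)
        if reduced_score < best_score:  # assumed stopping rule: removal hurts
            break
        selected, best_score = reduced, reduced_score
    return selected


# Hypothetical usefulness of each attribute, with a cost of 1.0 per attribute kept.
USEFULNESS = {"A1": 5.0, "A2": 0.2, "A3": 3.0, "A4": 0.1, "A5": 0.0, "A6": 2.0}


def toy_score(subset):
    return sum(USEFULNESS[a] for a in subset) - 1.0 * len(subset)


forward_result = forward_selection(list(USEFULNESS), toy_score)
backward_result = backward_elimination(list(USEFULNESS), toy_score)
```

In practice the score would come from a statistical test or a measure such as information gain; the greedy structure stays the same.
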

Numerosity Reduction
In numerosity reduction, data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need to store only the model
parameters instead of the actual data), or nonparametric methods such as clustering,
sampling, and the use of histograms.
Sampling can be used as a data reduction technique since it allows a large data set
to be represented by a much smaller random sample (or subset) of the data.
Suppose that a large data set, D, contains N tuples.
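
A minimal sketch of sampling as data reduction follows. The two variants shown, a simple random sample without replacement and one with replacement, are common choices but are not named in the text above; the function names, the sample size s and the data set contents are assumptions of the example.

```python
import random


def srswor(data, s, seed=None):
    """Simple random sample WITHOUT replacement: s distinct tuples (s <= N)."""
    return random.Random(seed).sample(data, s)


def srswr(data, s, seed=None):
    """Simple random sample WITH replacement: a tuple may be drawn more than once."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(s)]


D = list(range(1000))            # hypothetical data set with N = 1000 tuples
sample = srswor(D, 50, seed=0)   # 50 tuples stand in for the full 1000
```
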

Q5) Describe K-means method for clustering. List its advantages and drawbacks.
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms
that solve the well known clustering problem.
The procedure follows a simple and easy way to classify a given data set through a
certain number of clusters (assume k clusters) fixed a priori.
The main idea is to define k centroids, one for each cluster.
The basic step of k-means clustering is simple. In the beginning we determine the
number of clusters K and we assume the centroid or center of each cluster. We can
take any random objects as the initial centroids, or the first K objects in sequence can
also serve as the initial centroids.
Then the K-means algorithm repeats the three steps below until convergence, that is,
until the grouping is stable (no object moves to a different group):
o Determine the centroid coordinates.
o Determine the distance of each object to the centroids.
o Group the objects based on minimum distance.
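
The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, takes the first K objects as initial centroids (one of the two options mentioned above), and the function name kmeans, the max_iter safeguard and the sample points are assumptions of this example.

```python
import math


def kmeans(points, k, max_iter=100):
    """K-means sketch: the first k points serve as the initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Distance of each object to the centroids, then group by minimum distance.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # stable: no object moved group
            break
        assignment = new_assignment
        # Determine the new centroid coordinates as the mean of each group.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a group ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


# Hypothetical two-dimensional objects.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centroids, assignment = kmeans(points, k=2)
```

With these points the first two objects end up in one group and the remaining five in the other. A different choice of initial centroids can lead to a different final grouping.
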

The K-means method has the following advantages:
With a large number of variables, K-Means may be computationally faster than
hierarchical clustering (if K is small).
K-Means may produce tighter clusters than hierarchical clustering, especially if the
clusters are globular.
The K-means method as described has the following drawbacks:
It does not do well with overlapping clusters.
The clusters are easily pulled off-center by outliers.

Each record is either inside or outside of a given cluster.

Q6) Describe Multilevel Databases and Web Query Systems

Multilevel Databases
Several researchers have proposed a multilevel database approach to organizing
Web-based information.
The main idea behind these proposals is that the lowest level of the database
contains primitive semi-structured information stored in various web repositories,
such as hypertext documents.
At the higher level(s), metadata or generalizations are extracted from lower levels
and organized in structured collections such as relational or object-oriented
databases.

Web Query Systems
There have been many Web-based query systems and languages developed recently
that attempt to utilize standard database query languages such as SQL, structural
information about web documents, and even natural language processing for
accommodating the types of queries that are used in World Wide Web searches.
We mention a few examples of these Web-based query systems here. W3QL
combines structure queries, based on the organization of hypertext documents, and
content queries, based on information retrieval techniques.
WebLog is a logic-based query language for restructuring extracted information from
Web information sources.
Lorel and UnQL support querying of heterogeneous and semi-structured information
on the Web using a labeled graph data model.
TSIMMIS helps to extract data from heterogeneous and semi-structured information
sources and correlates them to generate an integrated database representation of
the extracted information.