
Data Warehousing and Data Mining

Unit-1
Estimated Time: 5 hrs.
Syllabus(Unit-1)
• Concepts of Data Warehouse and Data Mining
including its functionalities
• Application of Data Warehouse and Data
Mining
• Issues in Data Warehouse and Data Mining
• Stages of Knowledge Discovery in
Databases (KDD)
• Setting up a KDD environment
What is Data?
• A representation of facts, concepts, or
instructions in a formal manner suitable for
communication, interpretation, or processing
by human beings or by computers.
[Figure: hierarchy, bottom to top: Data, Information, Knowledge, Wisdom, ??]
Problems with Data
• The Explosive Growth of Data: from terabytes
to petabytes
• High-dimensionality of data
• High complexity of data
• New and sophisticated applications
• Fast developing Computer Science and
Engineering generates new demands
Evolution of Database Technology
• 1960s: Data collection, database creation, IMS
and network DBMS
• 1970s: Relational data model, relational DBMS
implementation
• 1980s: RDBMS, advanced data models
(extended-relational, OO, deductive, etc.) and
application-oriented DBMS (spatial, scientific,
engineering, etc.)
• 1990s—2000s: Data mining and data
warehousing, multimedia databases, and Web
databases
Size of Databases
• Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
• Petabytes -- 10^15 bytes: Geographic Information
Systems
• Exabytes -- 10^18 bytes: National Medical Records
• Zettabytes -- 10^21 bytes: Weather images
• Yottabytes -- 10^24 bytes: Intelligence Agency Videos
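
The unit sizes above are plain powers of ten, so conversions are simple arithmetic. A minimal sketch in Python; the `UNITS` table and the `to_bytes` helper are illustrative names, not part of the original slides:

```python
# Decimal (SI) storage units expressed as powers of ten bytes.
UNITS = {
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
    "yottabyte": 10**24,
}


def to_bytes(amount, unit):
    """Convert an amount in the given unit to bytes."""
    return amount * UNITS[unit]


# The Walmart figure quoted on the slide: 24 terabytes.
walmart_bytes = to_bytes(24, "terabyte")
print(walmart_bytes)

# Each unit in the list is 1000 times the previous one.
print(UNITS["petabyte"] // UNITS["terabyte"])
```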

We are drowning in data, but starving for knowledge!


Data Mining
Art/Science of extracting non-trivial, implicit,
previously unknown, valuable, and potentially
useful information from a large database

The key properties of data mining are


• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases
Architecture of Data Mining
Data Mining is..
• A technique that finds patterns in data
• A technique that finds relationships that have
not previously been discovered
• A relatively easy task that requires knowledge
of the business problem/subject matter
expertise
Data Mining is not..
• Brute-force crunching of bulk data
• “Blind” application of algorithms
• Presenting data in different ways
• Once the patterns are found, the data mining
process is finished. Queries to the database
are not data mining.
Applications of Data Mining
• Financial Data Analysis
• Telecommunications
• Biological Data Analysis, e.g., DNA
• Stock Markets
• E-mail, etc.
Applications of Data Mining
• Identify unexpected shopping patterns in supermarkets.
• Optimize website profitability by making appropriate offers
to each visitor.
• Predict customer response rates in marketing campaigns.
• Defining new customer groups for marketing purposes.
• Predict customer defections: which customers are likely to
switch to an alternative supplier in the near future.
• Distinguish between profitable and unprofitable customers.
• Identify suspicious (unusual) behavior, as part of a fraud
detection process.
• Data analysis and decision support
Applications of Data Mining
• Market analysis and management
• Target marketing, customer relationship management
(CRM), market basket analysis, cross selling, market
segmentation
• Risk analysis and management
• Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
• Fraud detection and detection of unusual patterns
(outliers)
• Other Applications
-Text mining (news group, email, documents) and Web mining
-Stream data mining
Issues in Data Mining
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, stream, and incremental mining methods
• Handling high-dimensionality
• Handling noise, uncertainty, and incompleteness of data
• Incorporation of constraints, expert knowledge, and
background knowledge in data mining
• Pattern evaluation and knowledge integration
• Mining diverse and heterogeneous kinds of data
• Application-oriented and domain-specific data mining
• Invisible data mining (embedded in other functional modules)
• Protection of security, integrity, and privacy in data mining
Data Warehouse
• According to W. H. Inmon, a data warehouse is a
subject-oriented, integrated, time-variant, nonvolatile
collection of data in support of management decisions.

• “A data warehouse is a copy of transaction data
specifically structured for querying and reporting” –
Ralph Kimball

• Data Warehousing is a process of transforming data
into information and making it available to users in a
timely enough manner to make a difference
Data Warehouse
• Subject-Oriented: A data warehouse can be used
to analyze a particular subject area. For example,
"sales" can be a particular subject.

• Integrated: A data warehouse integrates data
from multiple data sources. For example, source
A and source B may have different ways of
identifying a product, but in a data warehouse,
there will be only a single way of identifying a
product.
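
The integration idea can be sketched in a few lines of Python. Everything here is hypothetical: the two source layouts, the field names, and the mapping tables `KEY_FROM_A` and `KEY_FROM_B` are assumptions made for illustration only.

```python
# Hypothetical records from two source systems that identify
# the same product in different ways.
source_a = [{"sku": "A-100", "name": "Tea 250g", "units": 5}]
source_b = [{"item_code": 7001, "title": "Tea 250g", "qty": 3}]

# Assumed mapping tables from each source identifier to one warehouse key.
KEY_FROM_A = {"A-100": "P001"}
KEY_FROM_B = {7001: "P001"}


def integrate(rows_a, rows_b):
    """Load both sources into one list that uses a single product key."""
    warehouse = []
    for r in rows_a:
        warehouse.append(
            {"product_key": KEY_FROM_A[r["sku"]], "units": r["units"], "source": "A"}
        )
    for r in rows_b:
        warehouse.append(
            {"product_key": KEY_FROM_B[r["item_code"]], "units": r["qty"], "source": "B"}
        )
    return warehouse


rows = integrate(source_a, source_b)
# With one shared key, sales from both sources can be summed directly.
total = sum(r["units"] for r in rows if r["product_key"] == "P001")
print(total)  # 8
```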
Data Warehouse
• Time-Variant: Historical data is kept in a data
warehouse. For example, one can retrieve data from 3
months, 6 months, or 12 months ago, or even older,
from a data warehouse. By contrast, a transaction system
may hold only the most recent address of a customer,
whereas a data warehouse can hold all addresses
associated with that customer.

• Non-volatile: Once data is in the data warehouse, it
will not change. So, historical data in a data warehouse
should never be altered.
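
One way to picture the time-variant and non-volatile properties together is an append-only history table. The sketch below is a toy model, not a real warehouse design; the `AddressHistory` class and the sample addresses are invented for illustration.

```python
from datetime import date


class AddressHistory:
    """Append-only address history: rows are added, never updated or deleted."""

    def __init__(self):
        self._rows = []  # each row: (customer_id, address, valid_from)

    def load(self, customer_id, address, valid_from):
        # Non-volatile: new facts are appended; old rows stay untouched.
        self._rows.append((customer_id, address, valid_from))

    def all_addresses(self, customer_id):
        # Time-variant: every address ever loaded is still available.
        return [a for c, a, _ in self._rows if c == customer_id]

    def address_on(self, customer_id, day):
        # Latest address whose valid_from date is on or before the given day.
        candidates = [
            (d, a) for c, a, d in self._rows if c == customer_id and d <= day
        ]
        return max(candidates)[1] if candidates else None


history = AddressHistory()
history.load(1, "12 Lake Road", date(2020, 1, 1))
history.load(1, "5 Hill Street", date(2022, 6, 1))

print(history.all_addresses(1))
print(history.address_on(1, date(2021, 3, 1)))  # 12 Lake Road
```

A transaction system would typically overwrite the first address with the second; here both remain queryable.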
Data Warehouse Architecture
Why Data Warehouse?
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue
and margins?
Advantages of Data Warehouse
• Queries do not impact Operational systems
• Provides quick response to queries for reporting
• Enables Subject Area Orientation
• Integrates data from multiple, diverse sources
• Enables multiple interpretations of same data by
different users or groups
• Provides thorough analysis of data over a period
of time
• Accuracy of Operational systems can be checked
• Provides analysis capabilities to decision makers
Advantages of Data Warehouse
• Increase customer profitability
• Cost effective decision making
• Manage customer and business partner relationships
• Manage risk, assets and liabilities
• Identify developing trends and reduce time to market
• Strategic advantage over competitors
• Potential high returns on investment
• Increased productivity of corporate decision-makers
• Provide reliable, High performance access
Issues with Data Warehouse
Underestimation of resources of data loading
• Sometimes we underestimate the time required to
extract, clean, and load the data into the warehouse. It
may take a significant proportion of the total
development time.
Hidden problems with source systems
• Sometimes hidden problems associated with the
source systems feeding the data warehouse are
identified only after years of going undetected.
Required data not captured
• In some cases, data that is very important for the
purposes of the data warehouse is not captured by the
source systems.
Issues with Data Warehouse
Increased end-user demands
• After some of the end users' queries have been satisfied, requests for
support from staff may increase rather than decrease. This is caused by
the users' increasing awareness of the capabilities of the data
warehouse.
Data homogenization
• Data warehousing makes the data formats of different data sources
similar. This can result in the loss of some important value of the
data.
High demand for resources
• The data warehouse requires substantial resources, such as storage, to hold large amounts of data.
Data ownership
• Sensitive data that is owned by one department has to be loaded into
the data warehouse for decision-making purposes. Sometimes that
department is reluctant to do so because it hesitates to share the
data with others.
Issues with Data Warehouse
High maintenance
• Data warehouses are high-maintenance systems. Any
reorganization of the business processes and the source
systems may affect the data warehouse, which results in high
maintenance costs.
Long-duration projects
• Building a warehouse can take years, which is why
some organizations are reluctant to invest in a data
warehouse.
Complexity of integration
• An organization must spend a significant amount of time
determining how well the various data warehousing
tools can be integrated into the overall solution that is needed.
Knowledge Discovery in Database
• Knowledge Discovery in Databases is the process of
searching for hidden knowledge in the massive
amounts of data that we are technically capable of
generating and storing.

• Knowledge discovery in databases (KDD) is the process
of discovering useful knowledge from a collection of
data.

• Major KDD application areas include marketing, fraud
detection, telecommunication and manufacturing.
Knowledge Discovery in Database
• “KDD refers to the overall process of
discovering useful knowledge from data, and
data mining refers to a particular step in this
process.”
• “Data mining is the application of specific
algorithms for extracting patterns from data.”
Steps in KDD
• Data selection
• Cleaning
• Enrichment (Integration)
• Coding (Transformation)
• Data Mining
• Reporting (Evaluation)
Data Selection
• Once you have formulated your informational
requirements, the next logical step is to collect
and select the data you need.
• In this step, data relevant to the analysis task
are retrieved from the database.
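
As a rough illustration of the selection step, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` table, its columns, and the sample rows are all hypothetical; the point is only that selection keeps the rows and columns relevant to the analysis task.

```python
import sqlite3

# Hypothetical operational table held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(id INTEGER, name TEXT, city TEXT, birthdate TEXT, owns_car TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Asha", "Pune", "1990-05-17", "yes"),
        (2, "Ravi", "Kolkata", "1985-11-02", "no"),
        (3, "Meera", "Pune", "1978-01-30", "no"),
    ],
)

# Data selection: retrieve only the rows and columns the analysis needs.
selected = conn.execute(
    "SELECT id, birthdate, owns_car FROM customers WHERE city = ? ORDER BY id",
    ("Pune",),
).fetchall()
conn.close()

print(selected)
```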
Data Cleaning
• Before we start the data mining process, we
have to clean up the data as much as possible.
• In this step, noise and inconsistent data are
removed.
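
What counts as noise depends on the data set. The following sketch assumes three simple rules (normalise name formatting, drop rows with missing values, drop implausible ages and duplicates); the rules, the `clean` function, and the sample records are illustrative assumptions rather than a prescribed method.

```python
def clean(records):
    """Drop rows with missing values, implausible ages, and duplicates."""
    seen = set()
    cleaned = []
    for r in records:
        # Normalise inconsistent formatting before comparing rows.
        name = (r.get("name") or "").strip().title()
        age = r.get("age")
        if not name or age is None:
            continue  # incomplete row
        if not 0 <= age <= 120:
            continue  # noise: implausible age (assumed valid range)
        key = (name, age)
        if key in seen:
            continue  # duplicate after normalisation
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


raw = [
    {"name": "asha rao", "age": 34},
    {"name": "Asha Rao ", "age": 34},  # same person, inconsistent formatting
    {"name": "R. Mehta", "age": 340},  # noisy age
    {"name": "", "age": 28},           # missing name
    {"name": "J. Paul", "age": None},  # missing age
]

print(clean(raw))  # [{'name': 'Asha Rao', 'age': 34}]
```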
Data Enrichment(Integration)
• In this step multiple data sources are
combined.
• In a relational environment, we can simply
join this information with our original data.
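
The join mentioned above can be mimicked in plain Python with a dictionary lookup. This is a minimal sketch of a left join on a shared key; the `customers` and `demographics` data and the `enrich` helper are made up for the example.

```python
# Original data and a hypothetical second source keyed by customer id.
customers = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}, {"id": 3, "name": "Meera"}]
demographics = {
    1: {"income_band": "high"},
    2: {"income_band": "medium"},
}


def enrich(rows, extra, key="id"):
    """Left join: keep every original row, add extra fields where they exist."""
    return [{**row, **extra.get(row[key], {})} for row in rows]


enriched = enrich(customers, demographics)
for row in enriched:
    print(row)
```

Customer 3 has no matching demographic record, so that row is kept without the extra field, as a left join would do.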
Data Coding(Transformation)
• In this step data are transformed or consolidated
into forms appropriate for mining by performing
summary or aggregation operations.

• We can apply the following coding techniques:
-Address to regions
-Birthdate to age
-Convert cars yes/no to 1/0
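
The three coding techniques listed above can be sketched as follows. The city-to-region table `REGION_BY_CITY`, the field names, and the reference date are assumptions chosen for the example.

```python
from datetime import date

# Assumed lookup table from city (part of the address) to region.
REGION_BY_CITY = {"Pune": "West", "Kolkata": "East"}


def age_on(birthdate, today):
    """Age in whole years on the given date."""
    years = today.year - birthdate.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


def encode(record, today):
    """Apply the three codings: address to region, birthdate to age, yes/no to 1/0."""
    return {
        "region": REGION_BY_CITY.get(record["city"], "Unknown"),
        "age": age_on(record["birthdate"], today),
        "owns_car": 1 if record["owns_car"].strip().lower() == "yes" else 0,
    }


record = {"city": "Pune", "birthdate": date(1990, 5, 17), "owns_car": "yes"}
coded = encode(record, today=date(2024, 5, 16))
print(coded)  # {'region': 'West', 'age': 33, 'owns_car': 1}
```

Passing `today` explicitly keeps the result reproducible; a real pipeline might use the load date instead.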
Data Mining
• It is the discovery stage of the KDD process.
• In this step intelligent methods are applied in
order to extract data patterns.
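
The slides do not name a specific mining method. As one possible illustration, the sketch below counts item pairs that occur together in shopping baskets and keeps those above a minimum support, a simplified form of market basket analysis. The basket data and the `frequent_pairs` function are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs whose share of baskets is at least min_support."""
    counts = Counter()
    for basket in baskets:
        # Count each unordered pair once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


baskets = [
    ["bread", "milk"],
    ["bread", "milk", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
]

patterns = frequent_pairs(baskets, min_support=0.5)
print(patterns)  # {('bread', 'milk'): 0.75}
```

Bread and milk appear together in three of four baskets, so only that pair passes the 50% threshold.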
Reporting(Evaluation)
• It uses two functions:
1. Analysis of the results
2. Application of results

• Visualization and knowledge representation
techniques are used to present the mined
knowledge to the user.
Knowledge Discovery in Database
Setting up a KDD environment
• Setting up a KDD environment is not a trivial
task. It deals with the following questions:

1. What is the application domain?
2. What is the goal of the process?
3. Which techniques are suitable for my data?
4. How do I set up a data mining environment?
5. How do I justify the costs?
Applications of KDD
• SKICAT (Sky Image Cataloging and Analysis Tool), a
system which automatically detects and classifies sky
objects in image data resulting from a major astronomical
sky survey.

• Market Basket Analysis (MBA) has incorporated
discovery-driven data mining techniques to gain insights
about customer behavior.

• Other applications are in use in Molecular
Biology, Global Climate Change Modeling and other
fields where the volume of data exceeds our
ability to decipher its meaning.
