
Course : COMP6140 – Data Mining

Effective Period : September 2017

Introduction
Session #1
Acknowledgement
These slides have been adapted from:
Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and
Techniques (3rd ed.). San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc.

Learning Objective
LO1: Explain the concept of data and data preprocessing

Outline
• Science history
• Database technology history
• Data mining definition
• Knowledge discovery (KDD) process
• Multi-dimensional view of data mining
• Data mining technologies and applications
• Major issues in data mining

Science History

• Empirical science: before 1600
• Theoretical science: 1600-1950s
• Computational science: 1950s-1990s
• Data science: 1990-present

Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002

Database Technology History
• 1960s
– Data collection, database creation, IMS and network DBMS
• 1970s
– Relational data model, relational DBMS implementation
• 1980s
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems

Data Mining Definition
• Data mining is the extraction of interesting (non-trivial, implicit,
previously unknown, and potentially useful) patterns or knowledge
from huge amounts of data

• Also known as:
– Knowledge discovery (mining) in databases (KDD)
– Knowledge extraction
– Data/pattern analysis
– Data archeology
– Data dredging
– Information harvesting
– Business intelligence

Knowledge Discovery (KDD) Process
• Based on the database systems and data warehousing communities
• Data mining plays an essential role in the KDD process

Knowledge Discovery (KDD) Process (2)
• Based on the machine learning (ML) and statistics communities

Input data → Data pre-processing → Data mining → Post-processing → Pattern, information, knowledge

• Data pre-processing
– Data integration
– Normalization
– Feature selection
– Dimension reduction
• Data mining
– Pattern discovery
– Association & correlation
– Classification
– Clustering
– Outlier analysis
– …
• Post-processing
– Pattern evaluation
– Pattern selection
– Pattern interpretation
– Pattern visualization
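The three stages of this view (pre-processing, data mining, post-processing) can be illustrated with a minimal Python sketch. It is only one possible instance of the process: the toy transactions, the function names, and the 0.5 support threshold are assumptions made for illustration and are not taken from the slides or from Han et al.

```python
from collections import Counter
from itertools import combinations


def preprocess(raw_transactions):
    """Pre-processing: normalize item names and drop empty entries and records."""
    cleaned = []
    for record in raw_transactions:
        items = {item.strip().lower() for item in record if item and item.strip()}
        if items:
            cleaned.append(frozenset(items))
    return cleaned


def mine_pairs(transactions):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


def postprocess(pair_counts, n_transactions, min_support=0.5):
    """Post-processing: keep only pairs whose support meets the threshold."""
    selected = {
        pair: count / n_transactions
        for pair, count in pair_counts.items()
        if count / n_transactions >= min_support
    }
    # Order by descending support, then alphabetically, for stable presentation.
    return dict(sorted(selected.items(), key=lambda kv: (-kv[1], kv[0])))


# Hypothetical raw input with inconsistent casing, stray spaces, and an empty item.
raw = [
    ["Bread", "milk "],
    ["bread", "Milk", "eggs"],
    ["milk", "", "eggs"],
    ["bread", "butter"],
]
transactions = preprocess(raw)
patterns = postprocess(mine_pairs(transactions), len(transactions))
print(patterns)  # {('bread', 'milk'): 0.5, ('eggs', 'milk'): 0.5}
```

Keeping each stage in its own function mirrors the diagram: the mining step can be swapped for classification or clustering without touching the pre- or post-processing code.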
Data Mining in Business Intelligence
Layers from top to bottom (the potential to support business decisions increases toward the top):
• Decision making (End User)
• Data presentation: visualization techniques (Business Analyst)
• Data mining: information discovery (Data Analyst)
• Data exploration: statistical summary, querying, and reporting
• Data preprocessing/integration, data warehouses
• Data sources: paper, files, Web documents, scientific experiments, database systems (DBA)
Examples of KDD Process
Based on the community of:
• Database systems and data warehousing
– Web mining framework
– Business intelligence
• Machine learning and statistics
– Health care and medical data mining

Multi-Dimensional View of Data Mining
Data to be mined:
• Database-oriented datasets and applications
– Relational database, data warehouse, transactional database
• Advanced datasets and applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks, and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia and/or text databases
– The World Wide Web
Multi-Dimensional View of Data Mining (2)
Knowledge to be mined (data mining functions):
• Generalization
– Information integration and data warehouse construction
– Data cube technology: OLAP (online analytical processing)
– Multidimensional concept description: Characterization and
discrimination
• Association and correlation analysis
• Classification and label prediction
– Methods: Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-based
classification, logistic regression, etc.
• Cluster analysis
– Unsupervised learning, group data to form new categories
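As an illustration of cluster analysis, the sketch below groups one-dimensional values with a basic k-means loop. The slides do not prescribe an algorithm; k-means, the sample ages, and the starting centroids are assumptions chosen to keep the example small and deterministic.

```python
def k_means_1d(values, centroids, max_iter=100):
    """Assign each value to its nearest centroid, then recompute centroids,
    repeating until the centroids stop changing or max_iter is reached."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customer ages; no class labels are given (unsupervised).
ages = [18, 21, 22, 45, 47, 50]
centroids, clusters = k_means_1d(ages, centroids=[18, 50])
print(clusters)  # [[18, 21, 22], [45, 47, 50]]
```

The two resulting groups are the "new categories" the slide refers to: they come from the data itself rather than from predefined labels, which is what separates clustering from classification.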
Multi-Dimensional View of Data Mining (3)
• Outlier analysis: by-product of clustering or regression analysis
• Descriptive vs. predictive data mining
• Multiple/integrated functions and mining at multiple levels
Techniques utilized:
• Data-intensive computing, data warehousing (OLAP), machine learning, statistics,
pattern recognition, visualization, high-performance computing, etc.
Applications adapted:
• Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, web mining, etc.

Evaluation of Knowledge
• Is all mined knowledge interesting?
– One can mine a tremendous number of “patterns” and pieces of knowledge
– Some may fit only a certain dimension space (time, location, etc.)
– Some may not be representative, may be transient, etc.
• Evaluation of mined knowledge → directly mine only interesting
knowledge?
– Descriptive vs. predictive
– Coverage
– Typicality vs. novelty
– Accuracy
– Timeliness
– …
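Two of the criteria above, coverage and accuracy, can be made concrete for a single classification rule. The sketch below assumes the common rule-quality definitions (coverage is the fraction of records the rule applies to; accuracy is the fraction of covered records it labels correctly). The slide only names the criteria, so these definitions, the records, and the rule itself are illustrative assumptions.

```python
def rule_coverage_and_accuracy(records, condition, predicted_label):
    """Return (coverage, accuracy) of the rule 'IF condition THEN predicted_label'.

    coverage = covered records / all records
    accuracy = correctly labelled covered records / covered records
    """
    covered = [r for r in records if condition(r)]
    if not covered:
        return 0.0, 0.0
    correct = [r for r in covered if r["label"] == predicted_label]
    return len(covered) / len(records), len(correct) / len(covered)


# Hypothetical records: does a customer buy the product?
records = [
    {"age": 25, "label": "yes"},
    {"age": 28, "label": "yes"},
    {"age": 22, "label": "no"},
    {"age": 40, "label": "no"},
    {"age": 55, "label": "yes"},
]

# Hypothetical rule: IF age < 30 THEN buys = "yes"
coverage, accuracy = rule_coverage_and_accuracy(
    records, condition=lambda r: r["age"] < 30, predicted_label="yes"
)
print(f"coverage={coverage:.2f}, accuracy={accuracy:.2f}")  # coverage=0.60, accuracy=0.67
```

A rule with high accuracy but very low coverage describes only a few records, so both measures are usually read together when deciding whether a pattern is interesting.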
Confluence of Disciplines in Data Mining
Data mining draws on:
• Machine learning
• Pattern recognition
• Statistics
• Visualization
• Database systems
• Data warehouse
• Algorithm
• Information retrieval
• High-performance computing
• Applications
Confluence of Disciplines in Data Mining (2)
• Tremendous amount of data
– Algorithms must be highly scalable to handle data volumes such as terabytes
• High-dimensionality of data
– Microarray data may have tens of thousands of dimensions
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks, and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
– Software programs, scientific simulations
• New and sophisticated applications
Applications of Data Mining
Data mining applications:
• Business intelligence
• Web search engine
• Recommender systems
• Biological & medical data analysis
• Collaborative analysis
Major Issues in Data Mining
• Mining methodology
– Mining various and new kinds of knowledge
– Mining knowledge in multi-dimensional space
– Data mining: an interdisciplinary effort
– Boosting the power of discovery in a networked environment
– Handling noise, uncertainty, and incompleteness of data
– Pattern evaluation and pattern- or constraint-guided mining
• User interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data mining results
Major Issues in Data Mining (2)
• Efficiency and Scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
– Handling complex types of data
– Mining dynamic, networked, and global data repositories
• Data mining and society
– Social impacts of data mining
– Privacy-preserving data mining
– Invisible data mining

Summary
• Data mining definition
• Evolution of database technology
• KDD process
• Mining data
• Data mining functionalities
• Data mining technologies and applications
• Major issues in data mining

Exercise
1. Which one is NOT an alias of data mining?
a. Knowledge discovery
b. Data analysis
c. Data replacement
d. Data dredging

2. What are the functionalities of data mining?


a. Generalization, association analysis, classification, cluster analysis
b. Generalization, correlation analysis, classification, cluster analysis
c. Characterization, discrimination, classification, outlier analysis
d. All choices are data mining functionalities
References
Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and
Techniques (3rd ed.). San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc.

Thank You

