Introduction
Session #1
Acknowledgement
These slides have been adapted from:
Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and
Techniques (3rd ed.). San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc.
Learning Objective
LO1: Explain the concept of data and data preprocessing
Outline
• Science history
• Database technology history
• Data mining definition
• Knowledge discovery (KDD) process
• Multi-dimensional view of data mining
• Data mining technologies and applications
• Major issues in data mining
Science History
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Database Technology History
• 1960s: Data collection, database creation, IMS and network DBMS
Data Mining Definition
• Data mining is the extraction of interesting (non-trivial, implicit,
previously unknown, and potentially useful) patterns or knowledge
from huge amounts of data
• Also known as:
– Knowledge discovery (mining) in databases (KDD)
– Knowledge extraction
– Data/pattern analysis
– Data archeology
– Data dredging
– Information harvesting
– Business intelligence
Knowledge Discovery (KDD) Process
• Based on database systems and data warehousing communities
• Data mining plays an essential role in the KDD process
Knowledge Discovery (KDD) Process (2)
• Based on machine learning (ML) and statistics communities
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing → Pattern, Information, Knowledge]
Data exploration: statistical summary, querying, and reporting
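The pre-processing, mining, and post-processing stages can be illustrated with a minimal Python sketch. The function names, the toy shopping data, and the `min_count` threshold are all assumptions made for illustration; counting value frequencies merely stands in for a real mining algorithm.

```python
from collections import Counter

def preprocess(records):
    """Pre-processing: drop missing or blank values and normalize the text."""
    return [r.strip().lower() for r in records if r and r.strip()]

def mine(records):
    """Data mining: count how often each value occurs (a trivial 'pattern')."""
    return Counter(records)

def postprocess(patterns, min_count=2):
    """Post-processing: keep only patterns frequent enough to be interesting."""
    return {item: n for item, n in patterns.items() if n >= min_count}

# Hypothetical raw input with noise: mixed case, stray spaces, missing values.
raw = ["Milk", " bread", None, "milk", "", "Bread ", "eggs", "MILK"]
knowledge = postprocess(mine(preprocess(raw)))
print(knowledge)
```

Each stage consumes the output of the previous one, which mirrors the left-to-right flow from input data to knowledge.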
Multi-Dimensional View of Data Mining
Data to be mined:
• Database-oriented datasets and applications
– Relational database, data warehouse, transactional database
• Advanced datasets and applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks, and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia and/or text databases
– The World Wide Web
Multi-Dimensional View of Data Mining (2)
Knowledge to be mined (data mining functions):
• Generalization
– Information integration and data warehouse construction
– Data cube technology: OLAP (online analytical processing)
– Multidimensional concept description: Characterization and
discrimination
• Association and correlation analysis
• Classification and label prediction
– Methods: Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-based
classification, logistic regression, etc.
• Cluster analysis
– Unsupervised learning: grouping data to form new categories
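As an illustration of cluster analysis, the sketch below groups one-dimensional values around the nearest centroid using a naive k-means loop. K-means is only one of many clustering methods and is chosen here as an assumption. The function name, the sample ages, and the starting centroids are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group 1-D values around the nearest centroid (naive k-means sketch)."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign the value to the index of the closest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical customer ages; no class labels are supplied (unsupervised).
ages = [18, 21, 22, 45, 47, 50]
centroids, clusters = kmeans_1d(ages, centroids=[20.0, 40.0])
print(clusters)
```

No labels are given to the algorithm. The two groups emerge from the data alone, which is what distinguishes clustering from classification.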
Multi-Dimensional View of Data Mining (3)
• Outlier analysis: a by-product of clustering or regression analysis
• Descriptive vs. predictive data mining
• Multiple/integrated functions and mining at multiple levels
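To show how outliers can fall out of regression analysis as a by-product, the sketch below fits a least-squares line and flags points whose residual is unusually large. The helper names, the sample data, and the threshold of two residual standard deviations are assumptions for illustration, not a prescribed method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residual_outliers(xs, ys, threshold=2.0):
    """Return indices whose residual exceeds `threshold` times the residual spread."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Root-mean-square residual, used here as a simple measure of spread.
    spread = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    return [i for i, r in enumerate(residuals)
            if spread > 0 and abs(r) > threshold * spread]

# Hypothetical data following y = 2x, except for one corrupted reading at index 4.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 30, 12, 14, 16]
outliers = residual_outliers(xs, ys)
print(outliers)
```

The regression model is built for prediction, and the outlier list comes out of the same fit at almost no extra cost.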
Techniques utilized:
• Data-intensive, data warehouse (OLAP), machine learning, statistics,
pattern recognition, visualization, high-performance computing, etc.
Applications adapted:
• Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, web mining, etc.
Evaluation of Knowledge
• Is all mined knowledge interesting?
– One can mine a tremendous amount of “patterns” and knowledge
– Some may fit only a certain dimension space (time, location, etc.)
– Some may not be representative, may be transient, etc.
• Evaluation of mined knowledge → directly mine only interesting
knowledge?
– Descriptive vs. predictive
– Coverage
– Typicality vs. novelty
– Accuracy
– Timeliness
– …
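Two of these criteria, coverage and accuracy, can be computed directly for a simple rule. The sketch below assumes a common rule-based reading: coverage is the share of records the rule applies to, and accuracy is the share of covered records that the rule labels correctly. The function name, the record layout, and the sample customers are hypothetical.

```python
def rule_coverage_and_accuracy(records, condition, predicted_label):
    """Coverage: share of records the rule applies to.
    Accuracy: share of covered records whose label matches the rule's prediction."""
    covered = [r for r in records if condition(r)]
    coverage = len(covered) / len(records)
    correct = sum(1 for r in covered if r["label"] == predicted_label)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# Hypothetical labeled records.
customers = [
    {"age": 25, "student": True, "label": "buys"},
    {"age": 30, "student": True, "label": "buys"},
    {"age": 45, "student": False, "label": "skips"},
    {"age": 22, "student": True, "label": "skips"},
    {"age": 50, "student": False, "label": "buys"},
]

# Rule under evaluation: IF student THEN buys.
coverage, accuracy = rule_coverage_and_accuracy(
    customers, lambda r: r["student"], "buys"
)
print(coverage, accuracy)
```

A rule with high accuracy but very low coverage describes only a sliver of the data, so the two measures are usually read together.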
Confluence of Disciplines in Data Mining
[Figure: Data mining at the confluence of machine learning, pattern recognition, statistics, database systems, visualization, data warehouse, algorithms, information retrieval, high-performance computing, and applications]
Confluence of Disciplines in Data Mining (2)
• Tremendous amount of data
– Algorithms must be highly scalable to handle terabytes of data
• High dimensionality of data
– Microarray data may have tens of thousands of dimensions
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks, and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
– Software programs, scientific simulations
• New and sophisticated applications
Applications of Data Mining
• Business intelligence
• Biological & medical data analysis
• Collaborative analysis
Major Issues in Data Mining
• Mining methodology
– Mining various and new kinds of knowledge
– Mining knowledge in multi-dimensional space
– Data mining: an interdisciplinary effort
– Boosting the power of discovery in a networked environment
– Handling noise, uncertainty, and incompleteness of data
– Pattern evaluation and pattern- or constraint-guided mining
• User interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data mining results
Major Issues in Data Mining (2)
• Efficiency and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
– Handling complex types of data
– Mining dynamic, networked, and global data repositories
• Data mining and society
– Social impacts of data mining
– Privacy-preserving data mining
– Invisible data mining
Summary
• Data mining definition
• Evolution of database technology
• KDD process
• Mining data
• Data mining functionalities
• Data mining technologies and applications
• Major issues in data mining
Exercise
1. Which one is NOT an alias of data mining?
a. Knowledge discovery
b. Data analysis
c. Data replacement
d. Data dredging
Thank You