
Data Mining: Concepts and Techniques

(3rd ed.)

Chapter 1
Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign & Simon Fraser University

©2013 Han, Kamber & Pei. All rights reserved.

Chapter 1. Introduction

Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary

Why Data Mining?


The amount of information in the world doubles every 20 months, and both the size and the number of databases are increasing even faster.

Why Data Mining?


Data storage units are defined as follows (binary convention):

1 byte = 8 bits
1 kilobyte (K/KB) = 2^10 bytes = 1,024 bytes
1 megabyte (M/MB) = 2^20 bytes = 1,048,576 bytes
1 gigabyte (G/GB) = 2^30 bytes = 1,073,741,824 bytes
1 terabyte (T/TB) = 2^40 bytes = 1,099,511,627,776 bytes
1 petabyte (P/PB) = 2^50 bytes = 1,125,899,906,842,624 bytes
1 exabyte (E/EB) = 2^60 bytes = 1,152,921,504,606,846,976 bytes
1 zettabyte (Z/ZB) = 2^70 bytes = 1,180,591,620,717,411,303,424 bytes
1 yottabyte (Y/YB) = 2^80 bytes = 1,208,925,819,614,629,174,706,176 bytes
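The binary unit sizes can be checked with a few lines of Python. This is a minimal sketch under the binary convention used for kilobyte through exabyte, where each step multiplies by 2^10; under the decimal (SI) convention a zettabyte would instead be 10^21 bytes. The names `UNITS` and `unit_size` are introduced here for illustration only.

```python
# Binary storage units: each step up multiplies the size by 2**10 = 1,024.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def unit_size(name: str) -> int:
    """Return the number of bytes in one binary unit, e.g. 2**20 for a megabyte."""
    exponent = 10 * (UNITS.index(name) + 1)
    return 2 ** exponent


if __name__ == "__main__":
    for name in UNITS:
        print(f"1 {name} = {unit_size(name):,} bytes")
```

Python integers have arbitrary precision, so even the yottabyte value is computed exactly.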

Why Data Mining?

Information is at the heart of business operations and of decision making.
Database management systems gave access to the stored data, but this was only a small part of what could be gained from the data.
Online transaction processing systems (OLTPs) are good at putting data into databases quickly, safely, and efficiently, but are not good at delivering meaningful analysis in return.
Analyzing data can provide further knowledge about a business.
With that knowledge, decisions can be made efficiently and effectively.


What Is Data Mining?

Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  Data mining: a misnomer?

Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: Is everything data mining?
  Simple search and query processing
  (Deductive) expert systems


Knowledge Discovery (KDD) Process

This is a view from typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.

[Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
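The stages of the KDD process can be sketched as a chain of small functions. This is a toy illustration, not a real system: the purchase records, the function names, and the use of simple item counting as the "mining" step are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy data: two sources of (customer, item) purchase records.
source_a = [("ann", "milk"), ("bob", None), ("ann", "bread")]
source_b = [("cat", "milk"), ("bob", "milk"), ("cat", "bread")]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if None not in r]


def integrate(*sources):
    """Data integration: combine records from multiple sources."""
    return [r for source in sources for r in source]


def select(records):
    """Data selection: keep only the task-relevant attribute (the item)."""
    return [item for _customer, item in records]


def mine(items):
    """Data mining: here, simply count how often each item occurs."""
    return Counter(items)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a minimum count."""
    return {item: n for item, n in patterns.items() if n >= min_count}


def kdd(min_count=3):
    """Run the stages in order and return the surviving patterns."""
    records = integrate(clean(source_a), clean(source_b))
    return evaluate(mine(select(records)), min_count)
```

Each stage only consumes the output of the previous one, which mirrors the left-to-right flow of the process diagram.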

Example: A Web Mining Framework

Web mining usually involves:
  Data cleaning
  Data integration from multiple sources
  Warehousing the data
  Data cube construction
  Data selection for data mining
  Data mining
  Presentation of the mining results
  Patterns and knowledge to be used or stored in a knowledge base

Data Mining in Business Intelligence


[Figure: pyramid of layers; the potential to support business decisions increases toward the top. Layers, top to bottom:]
  Decision Making
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Summary, Querying, and Reporting
  Data Preprocessing/Integration, Data Warehouses
  Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

Data warehousing

Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.
Data warehouse systems are valuable tools in today's competitive, fast-evolving world.
According to William H. Inmon, a leading architect in the construction of data warehouse systems, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process."

Which View Do You Prefer?

Which view do you prefer?
  KDD vs. ML/Stat. vs. Business Intelligence
  Depending on the data, applications, and your focus

Data Mining vs. Data Exploration
  Business intelligence view: warehouse, data cube, reporting, but not much mining
  Business objects vs. data mining tools
  Supply chain example: mining vs. OLAP vs. presentation tools
  Data presentation vs. data exploration


Multi-Dimensional View of Data Mining

Data to be mined
  Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multimedia, graphs and social and information networks

Knowledge to be mined (or: data mining functions)
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  Descriptive vs. predictive data mining
  Multiple/integrated functions and mining at multiple levels

Techniques utilized
  Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.

Applications adapted
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.


Data Mining: On What Kinds of Data?

Database-oriented data sets and applications
  Relational database, data warehouse, transactional database (we focus on this category in this course)
  Object-relational databases, heterogeneous databases

A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
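The table, attribute, and tuple terminology can be made concrete with Python's built-in `sqlite3` module. This is a minimal sketch; the `customer` table, its columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3


def build_db():
    """Create an in-memory database with one uniquely named table."""
    conn = sqlite3.connect(":memory:")
    # The table has three attributes (columns): cust_id, name, city.
    conn.execute(
        "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    # Each inserted row is one tuple (record).
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?, ?)",
        [(1, "Ann", "Vancouver"), (2, "Bob", "Urbana"), (3, "Cat", "Urbana")],
    )
    conn.commit()
    return conn


def customers_in(conn, city):
    """Query the table: names of customers living in the given city."""
    rows = conn.execute(
        "SELECT name FROM customer WHERE city = ? ORDER BY name", (city,)
    )
    return [name for (name,) in rows]
```

A query such as `customers_in(build_db(), "Urbana")` is ordinary retrieval over stored tuples; mining would go further and look for patterns across many such tuples.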

Data Mining: On What Kinds of Data?

Data warehouse
  A repository of information collected from multiple sources
  Constructed via a process of data cleaning, data integration, data transformation, and periodic data refreshing
  To facilitate decision making, the data are subject oriented. Ex.: major subjects are customer, item, supplier, and activity.
  Usually modeled by a multidimensional structure called a data cube, in which each dimension corresponds to an attribute

Multidimensional data mining is also called exploratory data mining.
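A data cube can be pictured as sales totals indexed by several dimensions, with a roll-up summing over the dimensions that are dropped. The following is a small sketch in plain Python; the fact records, the dimension names, and the `roll_up` helper are assumptions made for illustration, not part of any warehouse product.

```python
from collections import defaultdict

# Hypothetical fact records: (item, city, quarter, sales_amount).
SALES = [
    ("PC", "Urbana", "Q1", 100),
    ("PC", "Vancouver", "Q1", 150),
    ("camera", "Urbana", "Q1", 40),
    ("PC", "Urbana", "Q2", 120),
]
DIMENSIONS = ("item", "city", "quarter")


def roll_up(facts, keep):
    """Aggregate sales over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for *dims, amount in facts:
        key = tuple(dims[i] for i in idx)
        totals[key] += amount
    return dict(totals)
```

Calling `roll_up(SALES, ("item",))` gives totals per item, while `roll_up(SALES, ())` collapses every dimension into a single grand total.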

Data Mining: On What Kinds of Data?


Transactional database
  Each record captures a transaction, such as a customer's purchase, a flight booking, or a user's clicks on a web page.
  A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store).

Data Mining: On What Kinds of Data?


Advanced data sets and advanced applications
  Data streams and sensor data
  Time-series data, temporal data, sequence data (incl. bio-sequences)
  Structure data, graphs, social networks and information networks
  Spatial data and spatiotemporal data
  Multimedia database
  Text databases
  The World-Wide Web

Data Mining: On What Kinds of Data?


Advanced data sets and advanced applications
  Time-related or sequence data: historical data, stock exchange data, biological sequence data
  Data streams: video surveillance and sensor data, which are continuously transmitted
  Spatial data: maps
  Graph and network data: social and information networks
  Temporal data: helps bank tellers plan schedules according to customer traffic
  Spatial data: helps reveal patterns in the poverty level of different areas


What Kinds of Patterns Can Be Mined?

Data mining functionalities
  Characterization and discrimination
  Frequent patterns, association, and correlation
  Classification and regression
  Clustering analysis
  Outlier analysis

Data mining functionalities specify the kinds of patterns to be found.

Two categories
  Descriptive
  Predictive

Characterization and Discrimination

Data characterization is a summarization of the general characteristics or features of a target class of data.
  Ex.: study the characteristics of software products whose sales increased by 10% in the last year.

Data discrimination is a comparison of the general characteristics or features of a target class with those of one or more comparison classes (often called the contrasting classes).
  Ex.: compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period.

Frequent pattern, association and correlation

Frequent patterns are patterns that occur frequently in data. They include itemsets, subsequences, and substructures.
  Frequent itemset: a set of items that frequently appear together in a transactional data set, such as milk and bread.
  Frequent subsequence: a frequently occurring ordered pattern, e.g., customers tend to purchase first a PC, followed by a digital camera, and then a memory card.
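Finding frequent itemsets amounts to counting how many transactions contain each candidate set and keeping those above a support threshold. The sketch below does this by brute-force enumeration, which is only practical for tiny data and is not the Apriori algorithm or any other optimized method; the transactions and the `frequent_itemsets` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional data: trans ID -> set of items purchased.
TRANSACTIONS = {
    "T100": {"milk", "bread", "eggs"},
    "T200": {"milk", "bread"},
    "T300": {"bread", "butter"},
    "T400": {"milk", "bread", "butter"},
}


def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of the given size with support >= min_support.

    Support is the fraction of transactions that contain the itemset.
    """
    counts = Counter()
    for items in transactions.values():
        # Sorting makes each itemset a canonical, hashable tuple.
        for itemset in combinations(sorted(items), size):
            counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}
```

With a minimum support of 0.5, the pair (bread, milk) qualifies because it appears in three of the four transactions.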

Classification and regression

Classification is the process of finding a model that describes and distinguishes data classes.
  The model is based on the analysis of a set of training data.
  The model is then used to predict the class of objects whose class label is unknown.
  The model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.
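The idea of learning a model from labeled training data and then predicting unknown labels can be shown with a deliberately simple rule learner. This sketch learns one IF-THEN rule per attribute value by majority vote; it is a toy technique chosen for brevity, and the training data, `learn_rules`, and `predict` are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical training set: (age_group, class_label) pairs.
TRAINING = [
    ("youth", "buys"), ("youth", "buys"), ("youth", "no"),
    ("senior", "no"), ("senior", "no"), ("middle", "buys"),
]


def learn_rules(training):
    """Learn rules of the form IF age_group = v THEN class = majority label."""
    by_value = defaultdict(Counter)
    for value, label in training:
        by_value[value][label] += 1
    return {
        value: counts.most_common(1)[0][0]
        for value, counts in by_value.items()
    }


def predict(rules, value, default="no"):
    """Apply the learned rules to an object whose class label is unknown."""
    return rules.get(value, default)
```

The learned dictionary is the model: each entry reads as an IF-THEN rule, and `predict` falls back to a default class for attribute values never seen in training.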

Clustering

Objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are rather dissimilar to objects in other clusters.
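One common way to form such clusters is k-means, which alternates between assigning each object to its nearest centroid and recomputing the centroids. The slides do not name a specific algorithm, so the following one-dimensional version is only an illustrative sketch; `kmeans_1d` and its fixed starting centroids are assumptions.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd-style k-means on numbers, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: put each point in the cluster of its nearest centroid.
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(p - centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids
```

Passing explicit starting centroids keeps the result deterministic; practical implementations usually choose them at random and stop once assignments no longer change.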

Outlier analysis

A database may contain objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions.
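A simple way to flag objects that do not comply with the general behavior of the data is a z-score rule: mark values that lie more than a chosen number of standard deviations from the mean. This is one of many possible approaches and is not prescribed by the slides; the `find_outliers` helper and the threshold of 2.0 are assumptions.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]
```

Because the mean and standard deviation are themselves pulled toward extreme values, this rule works best when outliers are few.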



Data Mining: Confluence of Multiple Disciplines


[Figure: Data Mining at the center, drawing on Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithms, Database Technology, and High-Performance Computing]

Why Confluence of Multiple Disciplines?

Tremendous amount of data
  Algorithms must be scalable to handle big data

High dimensionality of data
  Micro-array data may have tens of thousands of dimensions

High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structure data, graphs, social and information networks
  Spatial, spatiotemporal, multimedia, text and Web data
  Software programs, scientific simulations


What Kinds of Applications Are Targeted?

Business Intelligence (BI)
  BI technologies provide historical, current, and predictive views of business operations.
  Data mining is the core of business intelligence.
  OLAP tools rely on data warehousing and multidimensional data mining.
  Classification is the core of predictive analytics in BI.
  Clustering plays a central role in customer relationship management, grouping customers based on their similarities.

Applications of Data Mining

Web page analysis: from web page classification and clustering to PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
Data mining and software engineering
From major dedicated data mining systems/tools (e.g., SAS, MS SQLServer Analysis Manager, Oracle Data Mining Tools) to invisible data mining

Summary

Data mining: discovering interesting patterns and knowledge from massive amounts of data
A natural evolution of science and information technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.
Data mining technologies and applications
Major issues in data mining

April 17, 2014

Data Mining: Concepts and Techniques


Major Issues in Data Mining (1)

Mining Methodology
  Mining various and new kinds of knowledge
  Mining knowledge in multi-dimensional space
  Data mining: an interdisciplinary effort
  Boosting the power of discovery in a networked environment
  Handling noise, uncertainty, and incompleteness of data
  Pattern evaluation and pattern- or constraint-guided mining

User Interaction
  Interactive mining
  Incorporation of background knowledge
  Presentation and visualization of data mining results

Major Issues in Data Mining (2)

Efficiency and Scalability
  Efficiency and scalability of data mining algorithms
  Parallel, distributed, stream, and incremental mining methods

Diversity of data types
  Handling complex types of data
  Mining dynamic, networked, and global data repositories

Data mining and society
  Social impacts of data mining
  Privacy-preserving data mining
  Invisible data mining

Conferences and Journals on Data Mining

KDD conferences
  ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  SIAM Data Mining Conf. (SDM)
  (IEEE) Int. Conf. on Data Mining (ICDM)
  European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
  Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
  Int. Conf. on Web Search and Data Mining (WSDM)

Other related conferences
  DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, ...
  Web and IR conferences: WWW, SIGIR, WSDM
  ML conferences: ICML, NIPS
  PR conferences: CVPR, ...

Journals
  Data Mining and Knowledge Discovery (DAMI or DMKD)
  IEEE Trans. on Knowledge and Data Eng. (TKDE)
  KDD Explorations
  ACM Trans. on KDD

Where to Find References? DBLP, CiteSeer, Google

Data mining and KDD (SIGKDD: CDROM)
  Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD

Database systems (SIGMOD: ACM SIGMOD Anthology CD ROM)
  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.

AI & Machine Learning
  Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.

Web and IR
  Conferences: SIGIR, WWW, CIKM, etc.
  Journals: WWW: Internet and Web Information Systems

Statistics
  Conferences: Joint Stat. Meeting, etc.
  Journals: Annals of Statistics, etc.

Visualization
  Conference proceedings: CHI, ACM-SIGGraph, etc.
  Journals: IEEE Trans. Visualization and Computer Graphics, etc.
