
Introduction

CZ4032 / CPE489 / CSC489
Data Analytics and Mining
[Data Mining]

Outline

Motivation of Data Mining
Evolution of Data Mining
Definitions of Data Mining and KDD
Data Sources
Data Mining Tasks
Summary

Motivation: The Age of Big Data

Motivation: Why Mine Data?
Explosive Growth of Data: from Terabytes to Petabytes
Data collection and data storage technology
Automated data collection tools, database systems, Web, computerized society, mobile devices, social networks, etc.

Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, ...
Science: Remote sensing, bioinformatics, scientific simulation, ...
Society and everyone: news, digital cameras, social media, ...

Motivation: Why Mine Data?
Example:

Facebook
800 million active users
60 billion photos in total, 250 million photos uploaded per day
80 groups/events per user (till Feb 2011)
Flickr
60 million users
Five billion photos
10 million groups (till Feb 2011)
Twitter
175 million users (registered)
140 million tweets per day

Weibo
200 million users
(till June 2011)

We are drowning in data, but starving for knowledge!
Necessity is the mother of invention
Data Mining: Automated analysis of massive data sets

Motivation: Why Mine Data?
From Commercial Viewpoint
Lots of data are being collected and warehoused
Web data, e-commerce
Purchases at department/grocery stores
Bank/Credit Card transactions
Social networks
Computers have become cheaper and more powerful
Competitive pressure is strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Motivation: Why Mine Data?
From Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene expression data in biology
scientific simulations generating terabytes of data

Traditional techniques infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in Hypothesis Formation

Evolution: from DB to DM to DS
Data collection, Database creation
1960s: From primitive data collection systems to sophisticated and powerful database systems (Navigational DBMS)

Data management, Advanced data analysis
1970s: From early hierarchical and network models to relational models (OLTP), SQL DBMS
1980s: Further research into object-oriented database systems, Internet, application-oriented, etc.
1990s: Cheaper hardware, advanced DBMS, Data Warehouse, OLAP (On-Line Analytical Processing)
Late 1990s/present: Data-rich but information-poor situation required powerful analytical tools, which motivates Data Mining technology

Data Science, Knowledge Discovery in Databases

Definitions: KDD & Data Mining
KDD (Knowledge Discovery in Databases)
The overall process of non-trivial extraction of implicit, previously unknown and potentially useful knowledge from large amounts of data
KDD also stands for Knowledge Discovery and Data Mining

Data Mining: A KDD Process
Data Mining: The core steps of KDD
Application of specific algorithms for extracting patterns from data

What is (not) Data Mining?
What is NOT Data Mining?
Look up a phone number in a phone directory

Query a Web search engine for information about "Amazon"

What is Data Mining?
Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in Boston area)
Group together similar documents returned by a search engine according to their content/context (e.g. Amazon rainforest, Amazon.com, ...)

Origins of Data Mining
Draws ideas from statistics/AI, machine learning/pattern recognition, and database systems, etc.
Traditional Techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
[Figure: Data Mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

Why not use classical data analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle massive data, such as terabytes of data

High dimensionality of data
E.g., microarray data may have tens of thousands of dimensions

High complexity of data

Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations

New and sophisticated applications

Major Steps of Data Mining (KDD)
Input data --> Data Preprocessing --> Data Mining --> Postprocessing --> Knowledge

1. Data Preprocessing
A. Data Integration
Combine multiple data sources

B. Data Cleaning
Remove noise and inconsistent data

C. Data Selection
Select task-relevant data

D. Data Transformation
Transform/consolidate selected data for further analysis
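
The four preprocessing sub-steps above can be sketched on toy data. This is a minimal illustration only: the two sources, the field names (customer, age, spend), the cleaning rule, and the min-max scaling are all assumptions made for the example, not part of the slides.

```python
# Hypothetical records from two sources; field names are illustrative only.
store_a = [
    {"customer": "c1", "age": 34, "spend": 120.0},
    {"customer": "c2", "age": None, "spend": 80.0},   # missing age
]
store_b = [
    {"customer": "c3", "age": 51, "spend": 300.0},
    {"customer": "c4", "age": 29, "spend": -5.0},     # inconsistent (negative) spend
]

def preprocess(*sources):
    # A. Integration: combine multiple data sources into one collection.
    records = [r for src in sources for r in src]
    # B. Cleaning: drop records with missing or inconsistent values.
    records = [r for r in records if r["age"] is not None and r["spend"] >= 0]
    # C. Selection: keep only the attributes assumed relevant to the task.
    rows = [(r["age"], r["spend"]) for r in records]
    # D. Transformation: min-max scale each attribute to the range [0, 1].
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        scaled_columns.append([(v - lo) / span for v in col])
    return [tuple(vals) for vals in zip(*scaled_columns)]

prepared = preprocess(store_a, store_b)
print(prepared)
```

Real pipelines usually repair or impute bad values rather than drop whole records; dropping is used here only to keep the sketch short.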

Major Steps of Data Mining (KDD)
2. Data Mining
Apply data mining & machine learning methods (e.g., classification/clustering) to extract patterns from data

3. Pattern Evaluation (Post-Processing)
Evaluate and identify truly interesting patterns

4. Visualization (Post-Processing)
Present the mined patterns to users

Although Data Mining is just one of the many steps, it is usually used to refer to the whole process of KDD, as other steps are implied.

The Architecture of a Typical Data Mining System
User
Visualization
Post-Processing: Interesting Patterns
Data Mining Engine
Data Preprocessing (Data Warehouse, DB Server)
Cleaning / Integration / Selection / Reduction / Transformation
Databases

Data Mining & Business Intelligence

Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future values of other variables.

Description Methods
Find human-interpretable patterns that describe the data.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Taxonomy
Data Mining Tasks
Descriptive: Association Rule Mining, Clustering, Sequence Pattern Mining
Predictive: Classification, Regression, Outlier Detection

This taxonomy is based on the kinds of patterns output by data mining tasks.

Association Rule Mining: Definition
Given a set of records each of which contains some number of items from a given collection,
Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

Market transactional DB:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
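
One way to check how strongly the five transactions back the two discovered rules is to compute support and confidence. These two measures are the usual ones for association rules, but the slide does not name them, so treat the helper functions below as an illustrative assumption rather than the course's method.

```python
# The five market transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support", support(lhs, rhs),
          "confidence", round(confidence(lhs, rhs), 2))
```

On this data, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} in 2 of the 3 that contain both Diaper and Milk.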

Association Rule Mining: Application 1
Marketing and Sales Promotion
Let the rule discovered be
{Coke, ...} --> {Potato Chips}
Potato Chips as consequent
Can be used to determine what should be done to boost its sales.
Coke in the antecedent
Can be used to see which products would be affected if the store discontinues selling Coke.
Coke in antecedent and Potato Chips in consequent
Can be used to see what products should be sold with Coke to promote sale of Potato Chips!

Association Rule Mining: Application 2
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
Here is a classical rule: {diaper, milk} --> {beer}
If a customer buys diapers and milk, then he is very likely to buy beer.
So, don't be surprised if you find six-packs stacked next to diapers!

Another possible rule may be: {flowers, beer} --> {condoms}
If a customer buys flowers and beer, then he could very likely buy condoms.

Classification/Regression/Prediction
Classification
From the training data with class labels, a classification model is learnt to distinguish data instances between different classes
The model can be represented as classification rules, decision trees, mathematical formulae, neural networks, etc.
When new (test) data without class labels comes, the model is used to predict its class labels

Regression
Used to map a data item to a real-valued prediction variable
Involves the learning of the function that does the mapping

Regression learns a model to predict a continuous target, whereas classification learns a model to predict categorical labels.

Classification: Definition
Given a collection of records (training set), each record contains a set of attributes, one of which is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: Previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
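
The train/test procedure above can be sketched as follows. The slide does not prescribe a model, so a 1-nearest-neighbour classifier is swapped in here purely because it is short; the records, attribute values, and labels are hypothetical.

```python
import math

# Hypothetical labelled records: (attribute vector, class label).
training_set = [
    ((1.0, 1.0), "no"),
    ((1.5, 2.0), "no"),
    ((5.0, 5.0), "yes"),
    ((6.0, 5.5), "yes"),
]
test_set = [
    ((1.2, 1.4), "no"),
    ((5.5, 4.8), "yes"),
    ((2.0, 1.0), "no"),
]

def predict(attributes):
    """1-nearest-neighbour: return the class of the closest training record."""
    _, label = min(training_set, key=lambda rec: math.dist(rec[0], attributes))
    return label

def accuracy(records):
    """Fraction of records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in records if predict(attributes) == label)
    return correct / len(records)

print("test accuracy:", accuracy(test_set))
```

The model is built only from training_set; test_set plays the role of previously unseen records, which is why accuracy is measured on it rather than on the training data.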

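Example of a Classification Task
[Figure not reproduced; callout: Class Label]

Classification Rules
[Figure not reproduced; callout: Class Labels]

Decision Tree
[Figure not reproduced; callouts: Attributes, Attribute Values, Class Labels]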

Classification: Application 1
Direct Marketing
Goal:
Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

Approach:
Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model.

Classification: Application 2
Fraud Detection
Goal:
Predict fraudulent cases in credit card transactions.

Approach:
Use credit card transactions and the information on the account holder as attributes.
When does a customer buy, what does he buy, how often does he pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.

Regression
Goal:
Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

Extensively studied in statistics, neural network fields.
Examples:
Predicting sales amounts of new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.

Prediction & Regression: Example 1
Relationship between systolic blood pressure (y), birthweight (x1), and age in days (x2)

i    Birthweight in oz (x1)    Age in days (x2)    Systolic BP in mm Hg (y)
1    135                       ...                 89
2    120                       ...                 90
3    100                       ...                 83
4    105                       ...                 77
5    130                       ...                 92
6    125                       ...                 98
7    125                       ...                 82
8    105                       ...                 85
9    120                       ...                 96
10   90                        ...                 95
11   120                       ...                 80
12   95                        ...                 79
13   120                       ...                 86
14   150                       ...                 97
15   160                       ...                 92
16   125                       ...                 88

Example of multiple regression:
Use least squares method to determine the regression eqn.

y = 53.45 + 0.126 x1 + 5.89 x2

Prediction:
To predict the systolic BP of a baby with birthweight 8 lb (128 oz) measured at 3 days of life

y = 53.45 + 0.126(128) + 5.89(3) = 87.2 mm Hg
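
The prediction step can be reproduced directly from the fitted coefficients. The sketch below assumes the three terms of the regression equation are added, which is the reading consistent with the 87.2 mm Hg result on the slide; the function name is hypothetical.

```python
def predict_systolic_bp(birthweight_oz, age_days):
    """Evaluate the fitted regression y = 53.45 + 0.126*x1 + 5.89*x2,
    where x1 is birthweight in ounces and x2 is age in days."""
    return 53.45 + 0.126 * birthweight_oz + 5.89 * age_days

# Baby with birthweight 8 lb (128 oz) measured at 3 days of life.
bp = predict_systolic_bp(128, 3)
print(round(bp, 1), "mm Hg")  # 87.2 mm Hg
```

The coefficients themselves come from a least-squares fit over the 16 rows of the table; that fit is not repeated here.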

Prediction & Regression: Example 2


Stock Market Prediction
Black dots: training data
Red line (continuous and dashed): predictions
Blue dots: test (unseen) actual data
http://www.goldeagle.com/editorials_03/sornette112403.html

Clustering: Definition
Given a set of data points, each having a set of attributes, and a proximity measure among them, the goal of clustering is to find clusters such that
Data points in the same cluster are more similar to one another.
Data points in different clusters are less similar to one another.
Proximity Measures:
Euclidean Distance if attributes are continuous.
Cosine Similarity for document data
Other Problem-specific Measures.
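
The two named proximity measures can be written out in a few lines. This is a plain-Python sketch with hypothetical function names; the document vectors are assumed to be term-count vectors of equal length.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors, e.g. term counts
    of two documents. 1.0 means same direction, 0.0 means no shared terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(euclidean_distance((0, 0), (3, 4)))      # 5.0
print(cosine_similarity((1, 0), (0, 1)))       # 0.0
```

Note that Euclidean distance is a dissimilarity (smaller means closer) while cosine similarity is a similarity (larger means closer), so a clustering routine has to treat the two in opposite directions.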

Clustering: Principle
Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data without class labels
Could help to determine class labels
Objects are clustered based on the principle: minimize the intra-cluster distance and maximize the inter-cluster distance
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
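
To make the principle concrete, the sketch below measures both quantities for a given grouping of points. The two clusters and their coordinates are hypothetical, and averaging pairwise Euclidean distances is only one of several ways to define intra- and inter-cluster distance.

```python
import math
from itertools import combinations

# Hypothetical 2-D points already assigned to two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}

def mean_intra_cluster_distance(clusters):
    """Average distance over all pairs of points that share a cluster."""
    dists = [math.dist(p, q)
             for points in clusters.values()
             for p, q in combinations(points, 2)]
    return sum(dists) / len(dists)

def mean_inter_cluster_distance(clusters):
    """Average distance over all pairs of points from different clusters."""
    dists = [math.dist(p, q)
             for (_, a), (_, b) in combinations(clusters.items(), 2)
             for p in a for q in b]
    return sum(dists) / len(dists)

intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print("intra:", round(intra, 2), "inter:", round(inter, 2))
```

A good clustering shows a small intra value relative to the inter value; a clustering algorithm searches for the assignment that achieves this.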

Clustering: Application 1
Market Segmentation:
Goal:
To subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

Approach:
Collect different attributes of customers based on their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

Clustering: Application 2
Document Clustering:
Goal:
To find groups of documents that are similar to each other based on the important terms appearing in them.

Approach:
To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

Gain/Consequence:
Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Clustering: Application 3
Clustering of S&P 500 Stock Data
Observe Stock Movements every day.
Clustering points: Stock-{UP/DOWN}
Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
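
The slide does not give a formula for this similarity measure. One possible reading, sketched below as an assumption, is the share of days on which both events occurred out of the days on which either occurred (the Jaccard coefficient). The event names and day numbers are made up for illustration.

```python
# Hypothetical data: for each Stock-{UP/DOWN} event, the trading days on which it occurred.
event_days = {
    "Stock-A-UP": {1, 2, 4, 5, 7},
    "Stock-B-UP": {1, 2, 4, 6, 7},
    "Stock-C-DOWN": {3, 6, 8},
}

def co_occurrence_similarity(event_1, event_2):
    """Days on which both events happened, divided by days on which
    at least one of them happened (assumed measure, not from the slide)."""
    days_1, days_2 = event_days[event_1], event_days[event_2]
    return len(days_1 & days_2) / len(days_1 | days_2)

print(round(co_occurrence_similarity("Stock-A-UP", "Stock-B-UP"), 2))
print(co_occurrence_similarity("Stock-A-UP", "Stock-C-DOWN"))
```

Under this measure the two UP events, which share four of six days, would be grouped together well before either is grouped with the DOWN event.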

Outlier/Anomaly Detection
Outlier: a data point that does not comply with the general behavior of the data
Goal: To detect significant deviations (outliers) from the normal behavior
Although in many applications outliers are unwanted, in some applications they are very useful
Fraud detection in credit card purchase, authentication (password), network intrusion detection

Outlier/Anomaly Detection
Applications:
Credit Card Fraud Detection
Network Intrusion Detection
Approach:
Outliers can be detected by assuming a distribution for the general behavior; any point that falls outside it is considered an outlier
Example
Credit card purchase: check the amount, place of purchase, purchase frequency
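
A minimal sketch of the distribution-based approach: assume purchase amounts are roughly normally distributed and flag any amount more than three standard deviations from the mean. The normality assumption, the threshold of 3, and the sample amounts are all choices made for this example, not values from the slides.

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Return the values whose distance from the mean exceeds
    threshold population standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical credit card purchase amounts for one account.
amounts = [20, 25, 22, 19, 24, 21, 23, 20, 22, 24, 21, 950]
print(find_outliers(amounts))  # [950]
```

Because the extreme value itself inflates the mean and standard deviation, this simple rule needs enough normal observations to work; with very few points no value can exceed the threshold.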

Summary
Data mining: Discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

Challenges of Data Mining

Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data

Career in Data Mining

http://www.kdnuggets.com/2015/03/salaryanalyticsdatasciencepollwellcompensated.html 2015 Salary survey (US$)

A Brief History of Data Mining Society

1989 IJCAI Workshop on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM Data Mining (2001), (IEEE) ICDM (2001), etc.
ACM Transactions on KDD starting in 2007

Conferences and Journals on Data Mining

KDD Conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
Conf. on Principles and practices of Knowledge Discovery and Data Mining (PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)

Other related conferences
ACM SIGMOD
VLDB
(IEEE) ICDE
WWW, SIGIR
ICML, CVPR, NIPS

Journals
Data Mining and Knowledge Discovery (DAMI or DMKD)
IEEE Trans. On Knowledge and Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD
