
Machine Learning on Streams

Dr. Mikio Braun, @mikiobraun, TU Berlin & streamdrill

Big Data Beers, April 15, 2014, Berlin

2014 by Mikio L. Braun

Where does the data come from?


Finance
Gaming
Monitoring
Advertisement
Sensor Networks
Social Media

Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie


Static Data vs. Streaming


Mostly it's not even ML

Turn event stream into some statistics (events/second, trends, counting)
Churn prediction (but really just training on data)
Clustering of news stories (ok)
Outlier detection for monitoring (ok, yeah)


Still it's hard!

Many Events
Many Objects

100 events / second
360k per hour
8.6M per day
260M per month
3.2B per year

http://www.flickr.com/photos/arenamontanus/269158554/
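The volumes above follow from simple arithmetic. A quick check, assuming a 30-day month and a 365-day year (both are assumptions, the slide does not state them):

```python
# Back-of-the-envelope event volumes for a stream of 100 events per second.
# The 30-day month and 365-day year are assumptions for illustration.
EVENTS_PER_SECOND = 100


def event_volumes(rate_per_second):
    """Return the number of events per hour, day, month and year."""
    per_hour = rate_per_second * 3600
    per_day = per_hour * 24
    return {
        "hour": per_hour,
        "day": per_day,
        "month": per_day * 30,
        "year": per_day * 365,
    }


volumes = event_volumes(EVENTS_PER_SECOND)
for period, count in volumes.items():
    print(f"{count:,} events per {period}")
```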


Time for response

Weekly reports: a day
Recommendations: a few hours
Web Analytics: seconds to a few minutes
Ad Optimization: milliseconds

It's only real-time if you can react in real-time.


Approaches to stream mining

streaming feature extraction
learning:
  update iteratively
  local ops only
  nothing beyond O(n^2)
query is often ok (cache)


Isn't Machine Learning easily Online?

Stochastic gradient descent

converges, e.g. if [equation not preserved]

http://leon.bottou.org/research/stochastic
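The convergence condition shown on the slide did not survive extraction. A commonly used sufficient condition (assumed here, not taken from the slide) is a learning rate that decreases so that its sum diverges while the sum of its squares stays finite, for example eta_t = eta0 / (1 + decay * t). A minimal sketch of online stochastic gradient descent for a least-squares linear model under that assumption; `sgd_online`, `synthetic_stream` and all parameter values are illustrative names and choices:

```python
import random


def sgd_online(stream, dim, eta0=0.1, decay=0.001):
    """Least-squares linear model trained one (x, y) event at a time."""
    w = [0.0] * dim
    for t, (x, y) in enumerate(stream):
        eta = eta0 / (1.0 + decay * t)  # decreasing learning rate (assumed schedule)
        error = sum(wi * xi for wi, xi in zip(w, x)) - y
        # The gradient of 0.5 * error**2 with respect to w is error * x.
        w = [wi - eta * error * xi for wi, xi in zip(w, x)]
    return w


def synthetic_stream(n, true_w, noise=0.1, seed=0):
    """Yield n events (x, y) with y = true_w . x + Gaussian noise."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        y = sum(wi * xi for wi, xi in zip(true_w, x)) + rng.gauss(0.0, noise)
        yield x, y


w = sgd_online(synthetic_stream(5000, [2.0, -1.0]), dim=2)
print(w)  # should end up close to [2.0, -1.0]
```

Each event is touched once and then discarded, which is what makes the method attractive for streams.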



Time horizons vs. Learning rate

You can't just do online learning on event data!



Batch: Store and Crunch

Just like you would with static data, in a BIG FOR LOOP
data management?
huge latency!


Map Reduce Process


Map Reduce and ML

Neural Information Processing Systems Conference, 2006

Covers Locally Weighted Linear Regression, Naive Bayes, Gaussian Discriminative Analysis, k-Means, Logistic Regression, Neural Networks, Principal Component Analysis, Independent Component Analysis, Expectation Maximization, Support Vector Machines

Apache Hadoop

Industry Standard for Map Reduce
Originally developed at Yahoo
A faithful embodiment of the Java Enterprise Mindset, bindings & tools for everything
Core components: HDFS + MapReduce


Hadoop Facts

You shouldn't be afraid of Java
HDFS like a Poor Man's FS (read/write, no seek) to distribute data
Map splits essentially on file level (some file formats are understood)
Job startup takes quite some time (minutes)
Map/Reduce jobs can be scripts => interface to every language in principle
Very extensible (in Java): User Defined Functions, etc.


Parallelize: Stream Processing

Split work into small pieces of code handling a single event
Parallelize
Good latency
No Query Layer!


Actor based concurrency

"Classical" multi-threaded programming considered harmful (locking, concurrent access)
Interacting actors (each running single-threaded)
Emphasis on functional programming & immutable data structures


Apache Storm

Basically scale the actor based model


Parallelize: Micro-Batches

Event-oriented processing can be tedious
Apache Spark: Micro-Batches


Apache Spark

Stream processing
not just map/reduce but more complete (functional collection style API), also for streams
in memory
hyped Hadoop competitor
developed with support by UC Berkeley


Approximate: Stream Mining

Scaling is nice, but WE ALL KNOW:
  Data is noisy
  Not every data point is important
  Methods are noisy, too
  Absolute numbers are often not important, too


Probabilistic Data Structures

Probabilistic Algorithm: With n space, perform task with error e = f(n) with e → 0 as n → ∞
Mother of all algorithms: Bloom filter
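A minimal Bloom filter sketch to make the space/error trade-off concrete. The class name, the bit-array size, the number of hash functions and the use of salted SHA-256 as hash family are all assumptions for illustration, not details from the slides:

```python
import hashlib


class BloomFilter:
    """Set membership with false positives but no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for user in ["alice", "bob", "carol"]:
    seen.add(user)
print("alice" in seen)    # True: added items are always reported as present
print("mallory" in seen)  # usually False; false positives are possible
```

More bits lower the false-positive rate, which is the "error goes to 0 as space grows" behaviour described above.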


Heavy Hitters (a.k.a. Top-k)

Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users)
Interested in most active elements only.

[Figure: fixed tables of counts. Case 1: element already in data base. Case 2: new element.]

Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
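The two cases can be sketched as follows, assuming the scheme of the cited paper in which a new element replaces the entry with the smallest count and inherits that count plus one. The class name, the capacity and the dictionary-based table are illustrative choices; the paper describes a more efficient data structure than the linear minimum search used here:

```python
import random


class SpaceSaving:
    """Approximate top-k counting with a table of at most `capacity` entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def update(self, item):
        if item in self.counts:
            # Case 1: element already in the table, just increment.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Case 2: new element and the table is full. Evict the smallest
            # entry; the newcomer inherits its count plus one (an overestimate).
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, k):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]


stream = ["a"] * 50 + ["b"] * 30 + [f"rare{i}" for i in range(20)]
random.Random(0).shuffle(stream)
ss = SpaceSaving(capacity=10)
for item in stream:
    ss.update(item)
print(ss.top(2))  # "a" and "b" lead; their counts may be overestimated
```

Memory stays fixed at `capacity` entries no matter how many distinct items the stream contains.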


Count-Min Sketches

Summarize histograms over large feature sets
Like bloom filters, but better

[Figure: table of counts with m bins per row and n different hash functions, one row per hash function; updates for a new entry; query result in the example: 1]

Query: Take minimum over all hash functions

G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
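A minimal count-min sketch along these lines: n hash functions, each with its own row of m bins, increments on update and the minimum over rows on query. The class name, the default sizes and the salted SHA-256 hash family are assumptions for illustration:

```python
import hashlib


class CountMinSketch:
    """Approximate counts; estimates can be too high but never too low."""

    def __init__(self, num_hashes=4, num_bins=256):
        self.num_hashes = num_hashes  # "n different hash functions"
        self.num_bins = num_bins      # "m bins" per hash function
        self.table = [[0] * num_bins for _ in range(num_hashes)]

    def _bin(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bins

    def update(self, item, amount=1):
        for row in range(self.num_hashes):
            self.table[row][self._bin(row, item)] += amount

    def query(self, item):
        # Collisions only ever add to a bin, so the minimum is the best guess.
        return min(self.table[row][self._bin(row, item)]
                   for row in range(self.num_hashes))


cms = CountMinSketch()
for page in ["page/a"] * 5 + ["page/b"] * 2:
    cms.update(page)
print(cms.query("page/a"))  # at least 5; the sketch never undercounts
```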

Clustering with count-min Sketches

Online clustering
For each data point:
  Map to closest centroid (= compute distances)
  Update centroid
count-min sketches to represent sum over all vectors in a class

Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering, 2009


Wait a minute? Only Counting?

Well, getting the top most active items is already useful.

Web analytics, Users, Trending Topics

Counting is statistics!


Counting is Statistics

Empirical mean:

Correlations:

Principal Component Analysis:
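The formulas on this slide were not preserved. As a hedged illustration of the idea, empirical means and correlations can be computed purely from running counts and sums that are updated per event. The class and attribute names are assumptions; a covariance matrix for PCA could be accumulated from the same kind of sums, which is not shown here:

```python
import math


class RunningStats:
    """Mean and correlation of (x, y) pairs from running sums only."""

    def __init__(self):
        self.n = 0
        self.sum_x = self.sum_y = 0.0
        self.sum_xx = self.sum_yy = self.sum_xy = 0.0

    def update(self, x, y):
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_yy += y * y
        self.sum_xy += x * y

    def mean(self):
        return self.sum_x / self.n, self.sum_y / self.n

    def correlation(self):
        mean_x, mean_y = self.mean()
        cov = self.sum_xy / self.n - mean_x * mean_y
        var_x = self.sum_xx / self.n - mean_x * mean_x
        var_y = self.sum_yy / self.n - mean_y * mean_y
        return cov / math.sqrt(var_x * var_y)


stats = RunningStats()
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]:
    stats.update(x, y)
print(stats.mean())         # (2.5, 5.0)
print(stats.correlation())  # 1.0 for this perfectly linear toy data
```

Sums of squares can lose precision for large values, so a production version would likely use a numerically safer update.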


More: Maximum-Likelihood

Estimate probabilistic models

based on

which is slightly biased, but simpler

Outlier detection

Once you have a model, you can compute p-values (based on recent time frames!)
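One way to read this, sketched under explicit assumptions: the model is a single Gaussian whose mean and variance are exponentially weighted towards recent events, and the p-value is the two-sided tail probability of a new value. The class name, the weight `alpha` and the Gaussian choice are illustrative, not taken from the slides:

```python
import math


class DecayingGaussian:
    """Gaussian model of recent values with exponentially weighted moments."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha  # weight of the newest observation (assumed value)
        self.mean = None
        self.var = 0.0

    def p_value(self, x):
        if self.mean is None or self.var == 0.0:
            return 1.0  # no usable model yet
        z = abs(x - self.mean) / math.sqrt(self.var)
        return math.erfc(z / math.sqrt(2.0))  # two-sided Gaussian tail

    def update(self, x):
        if self.mean is None:
            self.mean = x
            return
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)


model = DecayingGaussian(alpha=0.05)
for i in range(200):
    model.update([10.0, 11.0, 9.0, 10.0][i % 4])
print(model.p_value(10.5))  # an ordinary value: large p-value
print(model.p_value(25.0))  # far outside the recent range: p-value near zero
```

Scoring an event before feeding it to `update` keeps an outlier from influencing its own p-value.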

Online TF-IDF


Classification with Naïve Bayes

Naive Bayes is also just counting, right?

frequency of word in document
Number of times word appears in class

class priors

Priors

Multinomial naïve Bayes

Total number of words in class
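The quantities labelled above are all counts, so a multinomial naive Bayes classifier can be kept up to date one document at a time. A minimal sketch; the class name, the additive smoothing constant and the toy documents are assumptions for illustration:

```python
import math
from collections import Counter, defaultdict


class StreamingNaiveBayes:
    """Multinomial naive Bayes kept as plain counts, updated per document."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing               # additive prior (assumed value)
        self.doc_counts = Counter()              # documents per class
        self.word_counts = defaultdict(Counter)  # times word appears in class
        self.total_words = Counter()             # total number of words in class
        self.vocabulary = set()

    def update(self, words, label):
        self.doc_counts[label] += 1
        for word in words:
            self.word_counts[label][word] += 1
            self.total_words[label] += 1
            self.vocabulary.add(word)

    def predict(self, words):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / n_docs)  # class prior
            denom = self.total_words[label] + self.smoothing * vocab_size
            for word in words:
                num = self.word_counts[label][word] + self.smoothing
                score += math.log(num / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = StreamingNaiveBayes()
nb.update("buy cheap pills now".split(), "spam")
nb.update("cheap pills cheap".split(), "spam")
nb.update("meeting agenda for monday".split(), "ham")
nb.update("lunch on monday".split(), "ham")
print(nb.predict("cheap pills".split()))     # spam
print(nb.predict("monday meeting".split()))  # ham
```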


Classification with Naive Bayes

ICML 2003

Classification with Naive Bayes

7 Steps to improve NB:

  transform TF to log(· + 1)
  IDF-style normalization
  square length normalization
  use complement probability
  another log
  normalize those weights again
  Predict linearly using those weights
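A minimal sketch of the seven steps above, assuming they follow the transformed weight-normalized complement naive Bayes of Rennie et al. (ICML 2003). The function names, the smoothing constant `alpha` and the toy documents are assumptions for illustration, and this batch version ignores how the counts would be maintained on a stream:

```python
import math
from collections import defaultdict


def train_twcnb(docs, labels, alpha=1.0):
    """docs: list of {word: count}. Returns {class: {word: weight}}."""
    n_docs = len(docs)
    doc_freq = defaultdict(int)
    for doc in docs:
        for word in doc:
            doc_freq[word] += 1
    vocab = sorted(doc_freq)

    transformed = []
    for doc in docs:
        d = {w: math.log(c + 1.0) for w, c in doc.items()}                 # step 1
        d = {w: v * math.log(n_docs / doc_freq[w]) for w, v in d.items()}  # step 2
        norm = math.sqrt(sum(v * v for v in d.values())) or 1.0            # step 3
        transformed.append({w: v / norm for w, v in d.items()})

    weights = {}
    for c in set(labels):
        complement = defaultdict(float)                                    # step 4
        for d, y in zip(transformed, labels):
            if y != c:
                for w, v in d.items():
                    complement[w] += v
        total = sum(complement.values()) + alpha * len(vocab)
        log_theta = {w: math.log((complement[w] + alpha) / total)
                     for w in vocab}                                       # step 5
        scale = sum(abs(v) for v in log_theta.values())                    # step 6
        weights[c] = {w: v / scale for w, v in log_theta.items()}
    return weights


def predict_twcnb(weights, doc):
    # Step 7: linear score; pick the class whose complement fits the document worst.
    def score(c):
        return sum(count * weights[c].get(w, 0.0) for w, count in doc.items())
    return min(weights, key=score)


docs = [
    {"buy": 1, "cheap": 1, "pills": 1, "now": 1},
    {"cheap": 2, "pills": 1},
    {"meeting": 1, "agenda": 1, "for": 1, "monday": 1},
    {"lunch": 1, "on": 1, "monday": 1},
]
labels = ["spam", "spam", "ham", "ham"]
model = train_twcnb(docs, labels)
print(predict_twcnb(model, {"cheap": 1, "pills": 1}))
print(predict_twcnb(model, {"monday": 1, "meeting": 1}))
```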


So much more to do with trends

Least Recently Used Caches
Sparse Vectors
Sparse Matrices
Conditional Probabilities (Histograms)
Accumulators
...


Streamdrill

Real-Time Analysis Solutions

Core Engine:
  Heavy Hitters counting + exponential decay
  Instant counts & top-k results over time windows
  In-Memory

Modules:
  Profiling and Trending
  Recommendations
  Count Distinct
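To illustrate what counting with exponential decay can look like, here is a small sketch of a single decayed counter. It is not streamdrill's implementation; the class name, the half-life parametrisation and the explicit timestamps are assumptions for illustration:

```python
import math


class DecayingCounter:
    """Counter whose value halves every `half_life` time units without hits."""

    def __init__(self, half_life):
        self.decay_rate = math.log(2.0) / half_life
        self.value = 0.0
        self.last_time = None

    def read(self, now):
        """Return the decayed value at time `now` without changing state."""
        if self.last_time is None:
            return 0.0
        return self.value * math.exp(-self.decay_rate * (now - self.last_time))

    def hit(self, now, amount=1.0):
        """Decay the stored value up to `now`, then add the new event."""
        self.value = self.read(now) + amount
        self.last_time = now


counter = DecayingCounter(half_life=60.0)
counter.hit(now=0.0)
print(counter.read(now=60.0))   # about 0.5: one half-life after a single hit
counter.hit(now=60.0)
print(counter.read(now=120.0))  # about 0.75: (0.5 + 1) decayed by one half-life
```

Because only a value and a timestamp are stored per item, such counters combine naturally with a fixed-size heavy-hitters table.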


Architecture Overview


Example: Twitter Stock Analysis

http://play.streamdrill.com/vis/

Example: Twitter Stock Analysis

Trends:

symbol:combinations   $AAPL:$GOOG
symbol:hashtag        $AAPL:#trading
symbol:keywords       $GOOG:disruption
symbol:mentions       $GOOG:WallStreetCom
symbol trend          $AAPL
symbol:url            $FB:http://on.wsj.com/15fGaU3


Example: Twitter Stock Analysis

[Architecture diagram: Twitter → tweets → Tweet Analyzer → updates → streamdrill; JavaScript via REST]


ML on Streams

Constantly deal with new data
It's often not that complex, really
High data rate & event range
Batch: High latency → Hadoop
Stream: Low latency → Storm / Spark
Approximate: Good enough → Streamdrill
