A Symbolic Representation of Time Series
Outline of Talk
Prologue: Background on Time Series Data Mining
The importance of the right representation
A new symbolic representation
Motif discovery
Anomaly detection
Visualization
Appendix: classification, clustering, indexing
[Figure: an example time series, shown both as a column of raw readings (25.1750, 25.2250, ..., 24.7500) and as a plot of about 500 points in the range 23-28.]
[Figure: two motion-captured gestures. Point: hand at rest; hand moving to shoulder level; steady pointing. Gun-Draw: hand at rest; hand moving above holster; hand moving down to grasp gun; hand moving to shoulder level; steady pointing.]
[Figure: the major time series data mining tasks: Query by Content, Classification, Motif Discovery, Rule Discovery (s = 0.5, c = 0.3), and Novelty Detection.]
Given two time series Q = q1...qn and C = c1...cn, their Euclidean distance is defined as:

D(Q,C) = √( Σᵢ₌₁ⁿ (qᵢ − cᵢ)² )
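The definition can be sketched in a few lines (a minimal illustration; the function name is mine, not from the talk):

```python
import math

def euclidean_distance(Q, C):
    """Euclidean distance between two equal-length time series."""
    assert len(Q) == len(C)
    return math.sqrt(sum((q - c) ** 2 for q, c in zip(Q, C)))

print(euclidean_distance([0, 1, 2, 3], [0, 1, 2, 5]))  # 2.0
```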
But which approximation should we use?
Time Series Representations
[Figure: a taxonomy of time series representations. Data adaptive: Sorted Coefficients; Piecewise Polynomial (Piecewise Linear Approximation: interpolation, regression; Adaptive Piecewise Constant Approximation); Singular Value Decomposition; Symbolic (Natural Language, Strings); Trees. Non-data adaptive: Wavelets (Symlets); Random Mappings; Spectral (Fourier Transform, Discrete Cosine Transform); Piecewise Aggregate Approximation.]
[Figure: the same time series of length 128 approximated by DFT, DWT, SVD, APCA, PAA, PLA, and SYM; the symbolic (SYM) version is the string UUCUCUCD.]
This only works if the approximation allows lower bounding.
D(Q,S) ≥ DLB(Q,S)

D(Q,S) = √( Σᵢ₌₁ⁿ (qᵢ − sᵢ)² )

DLB(Q,S) = √( Σᵢ₌₁ᴹ (srᵢ − srᵢ₋₁)(qvᵢ − svᵢ)² )
[Figure: the taxonomy of representations again, now highlighting the Symbolic / Strings branch.]
[Figure: a time series C of length 128 discretized to the SAX string baabccbc over the alphabet {a, b, c}.]
Why Gaussian?
[Figure: a normal probability plot (probabilities 0.001-0.99), motivating the Gaussian assumption behind the SAX breakpoints.]
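The discretization can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes the series length is divisible by w, and uses the standard breakpoints for an alphabet of size 3 (−0.43 and 0.43), which cut the standard normal curve into three equiprobable regions:

```python
import bisect
import math

# Breakpoints for alphabet size 3: they cut the standard normal curve
# into three equiprobable regions (standard SAX breakpoint table).
BREAKPOINTS_A3 = [-0.43, 0.43]

def znorm(ts):
    """Z-normalize (assumes a non-constant series)."""
    mu = sum(ts) / len(ts)
    sd = math.sqrt(sum((x - mu) ** 2 for x in ts) / len(ts))
    return [(x - mu) / sd for x in ts]

def paa(ts, w):
    """Piecewise Aggregate Approximation (assumes len(ts) divisible by w)."""
    n = len(ts)
    return [sum(ts[i * n // w:(i + 1) * n // w]) * w / n for i in range(w)]

def sax(ts, w, breakpoints=BREAKPOINTS_A3, alphabet="abc"):
    """Convert a time series to a SAX word of length w."""
    return "".join(alphabet[bisect.bisect(breakpoints, v)]
                   for v in paa(znorm(ts), w))

print(sax([0, 0, 1, 1, 2, 2, 3, 3], 4))  # aacc
```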
Visual Comparison
[Figure: a raw time series compared with its DFT, PLA, Haar, and APCA approximations, and with the SAX symbols a-f.]
Euclidean Distance

D(Q,C) = √( Σᵢ₌₁ⁿ (qᵢ − cᵢ)² )

[Figure: two time series Q and C of length 128 overlaid.]
The PAA distance lower-bounds the Euclidean distance:

DR(Q̄,C̄) = √(n/w) · √( Σᵢ₌₁ʷ (q̄ᵢ − c̄ᵢ)² )

[Figure: Q and C reduced to w = 8 PAA segments each; their SAX strings are Q̂ = babcacca and Ĉ = baabccbc.]
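Under the same assumptions as before (series length divisible by w; names mine), the PAA reduction and its lower-bounding distance look like:

```python
import math

def paa(ts, w):
    """Piecewise Aggregate Approximation: w segment means.
    Assumes len(ts) is divisible by w."""
    n = len(ts)
    return [sum(ts[i * n // w:(i + 1) * n // w]) * w / n for i in range(w)]

def euclidean(Q, C):
    return math.sqrt(sum((q - c) ** 2 for q, c in zip(Q, C)))

def paa_distance(q_bar, c_bar, n):
    """DR(Qbar, Cbar) = sqrt(n/w) * sqrt(sum (qbar_i - cbar_i)^2);
    lower-bounds the Euclidean distance of the original length-n series."""
    w = len(q_bar)
    return math.sqrt(n / w) * math.sqrt(
        sum((q - c) ** 2 for q, c in zip(q_bar, c_bar)))

Q = [0, 1, 2, 3, 4, 5, 6, 7]
C = [1, 1, 3, 2, 4, 6, 5, 8]
dr = paa_distance(paa(Q, 4), paa(C, 4), len(Q))
d = euclidean(Q, C)
assert dr <= d  # the lower-bounding property holds
print(dr, d)
```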
MINDIST(Q̂,Ĉ) = √(n/w) · √( Σᵢ₌₁ʷ dist(q̂ᵢ, ĉᵢ)² )
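The symbol-to-symbol dist() comes from a lookup table built from the breakpoints: equal or adjacent symbols are at distance 0, otherwise the distance is the gap between the breakpoints that separate them. A sketch for alphabet size 3 (breakpoints −0.43 and 0.43; function names mine):

```python
import math

BREAKPOINTS = [-0.43, 0.43]   # alphabet size 3: symbols a, b, c
ALPHABET = "abc"

def symbol_dist(s1, s2, breakpoints=BREAKPOINTS, alphabet=ALPHABET):
    """Distance between two SAX symbols via the lookup-table rule:
    equal or adjacent symbols have distance 0."""
    i, j = sorted((alphabet.index(s1), alphabet.index(s2)))
    if j - i <= 1:
        return 0.0
    return breakpoints[j - 1] - breakpoints[i]

def mindist(word_q, word_c, n):
    """MINDIST between two SAX words of length w, for original length n;
    lower-bounds the Euclidean distance of the original series."""
    w = len(word_q)
    return math.sqrt(n / w) * math.sqrt(
        sum(symbol_dist(a, b) ** 2 for a, b in zip(word_q, word_c)))

# The two words from the slide, for original length n = 128:
print(mindist("babcacca", "baabccbc", 128))  # about 4.865
```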
Novelty Detection
Fault detection
Interestingness detection
Anomaly detection
Surprisingness detection
Note that this problem should not be confused with the relatively simple problem of outlier detection. Remember Hawkins' famous definition of an outlier...
...an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated from a different mechanism...
Thanks Doug, the check is in the mail.
We are not interested in finding individually surprising datapoints, we are interested in finding surprising patterns.
Douglas M. Hawkins
Lots of good folks have worked on this, and closely related problems.
It is referred to as the detection of Aberrant Behavior¹, Novelties², Anomalies³, Faults⁴, Surprises⁵, Deviants⁶, Temporal Change⁷, and Outliers⁸.
Arrr... what be wrong with current approaches?
The blue time series at the top is a normal healthy human electrocardiogram with an artificial flatline added. The sequence in red at the bottom indicates how surprising local subsections of the time series are under the measure introduced in Shahabi et al.
Our Solution
Based on the following intuition: a pattern is surprising if its frequency of occurrence differs greatly from that which we expected, given previous experience.
This is a nice intuition, but useless unless we can more formally define it, and calculate it efficiently.
Note that unlike all previous attempts to solve this problem, our notion of the surprisingness of a pattern is not tied exclusively to its shape. Instead it depends on the difference between the shape's expected frequency and its observed frequency.
For example consider the familiar head and shoulders pattern shown below...
The existence of this pattern in a stock market time series should not be considered surprising, since such patterns are known to occur (even if only by chance). However, if it occurred ten times this year, as opposed to occurring an average of twice a year in previous years, our measure of surprise will flag the shape as being surprising. Cool eh?
The pattern would also be surprising if its frequency of occurrence is less than expected. Once again our definition would flag such patterns.
Tarzan(R) is a registered trademark of Edgar Rice Burroughs, Inc.
We begin by defining some terms. Professor Frink?
Definition 1: A time series pattern P, extracted from database X, is surprising relative to a database R, if the probability of its occurrence is greatly different to that expected by chance, assuming that R and X are created by the same underlying process.
But you can never know the probability of a pattern you have never seen!
And probability isn't even defined for real-valued time series!
If x = principalskinner
  the alphabet of x is {a,c,e,i,k,l,n,p,r,s}
  |x| is 16
  prin is a prefix of x
  skin is a substring of x
  ner is a suffix of x
If y = in, then fx(y) = 2
If y = pal, then fx(y) = 1
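These string notions are easy to check in code (a toy illustration; f here counts possibly-overlapping occurrences):

```python
def f(x, y):
    """Number of (possibly overlapping) occurrences of string y in x."""
    return sum(1 for i in range(len(x) - len(y) + 1) if x[i:i + len(y)] == y)

x = "principalskinner"
assert len(x) == 16
assert x.startswith("prin")   # prin is a prefix of x
assert "skin" in x            # skin is a substring of x
assert x.endswith("ner")      # ner is a suffix of x
print(f(x, "in"), f(x, "pal"))  # 2 1
```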
Experimental Evaluation
We would like to demonstrate two
features of our proposed approach
Sensitivity (High True Positive Rate)
The algorithm can find truly surprising
patterns in a time series.
Selectivity (Low False Positive Rate)
The algorithm will not find spurious
surprising patterns in a time series
Sensitive and Selective, just like me.
[Figure: test data (subset), with Tarzan's level of surprise plotted beneath.]
[Figure: test data (subset) with Tarzan's level of surprise beneath; annotated regions: Normal sequence; Actor misses holster; Laughing and flailing hand; Normal sequence.]
Demand for Power? Excellent!
[Figure: the power demand dataset, about 2000 points.]
The first 3 weeks of the power demand dataset. Note the repeating pattern of a strong peak for each of the five weekdays, followed by relatively quiet weekends.
[Figure: surprise measures on the power demand data: Tarzan, TSA Tree, and IMM.]
NASA recently said TARZAN holds great promise for the future*.
There is now a journal version of TARZAN (under review); if you would like a copy, just ask.
In the meantime, let us consider motif discovery.
* Isaac, D. and Christopher Lynnes, 2003. Automated Data Quality Assessment in the Intelligent
Archive, White Paper prepared for the Intelligent Data Understanding program.
[Figure: the Winding Dataset, about 2500 points.]
Motif Discovery
To find these 3 motifs would require about
6,250,000 calls to the Euclidean distance function.
[Figure: three motifs A, B, and C found in the Winding Dataset, each of length about 140 points.]
Trivial Matches
[Figure: a subsequence C (Inertial Sensor data, length about 1000) together with the trivial matches that surround it.]
Definition 1. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) ≤ R, then M is called a matching subsequence of C.
Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence M′ beginning at q′ such that D(C, M′) > R, and either q < q′ < p or p < q′ < q.
Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has the highest count of non-trivial matches (ties are broken by choosing the motif whose matches have the lower variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R)) is the subsequence CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 ≤ i < K.
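The definitions can be turned into a brute-force search. The sketch below (names mine) simplifies trivial-match exclusion to "ignore any match that overlaps the candidate", which is cruder than Definition 2 but exhibits the quadratic number of distance calls mentioned earlier:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_motif(T, n, R):
    """Brute-force 1-Motif(n, R): return (start, count) for the
    subsequence with the highest count of non-overlapping matches.
    A sketch only; Definition 2 excludes trivial matches more carefully."""
    m = len(T) - n + 1
    best_start, best_count = 0, -1
    for p in range(m):
        C = T[p:p + n]
        count = sum(1 for q in range(m)
                    if abs(q - p) >= n and euclidean(C, T[q:q + n]) <= R)
        if count > best_count:
            best_start, best_count = p, count
    return best_start, best_count

# A planted repeat at positions 0 and 8:
print(one_motif([0, 9, 0, 9, 5, 5, 5, 5, 0, 9, 0, 9], 4, 1.0))  # (0, 1)
```

The double loop makes the quadratic cost explicit, which is why the approximate algorithm below matters.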
[Figure: a sliding window over the time series (m = 1000) converts each subsequence of length n = 16 into a SAX word with w = 4 and alphabet size a = 3 ({a,b,c}); for example Ĉ1 = acba. The words are stored in a table S indexed by starting position (1, 2, ..., 58, ..., 985). Random projection repeatedly hashes the words on randomly chosen columns and increments a collision matrix entry for every pair of positions that agree; here positions 58 and 985 collide.]
E(k,a,w,d,t) = t · C(k,2) · Σᵢ₌₀ᵈ C(w,i) (1 − 1/a)ⁱ (1/a)ʷ⁻ⁱ

Suppose E(k,a,w,d,t) = 2.
[Figure: in the final collision matrix the entry for positions (58, 985) is 27, far above the expected value, so that pair is examined as a candidate motif.]
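The collision-counting step can be sketched as follows (a toy illustration of the idea, not the authors' implementation; parameter names are mine):

```python
import random
from collections import defaultdict

def collision_matrix(words, num_cols, iterations, seed=0):
    """Random projection: in each iteration, hash every SAX word on the
    same randomly chosen subset of columns, and increment a collision
    count for every pair of word positions that agree on those columns."""
    rng = random.Random(seed)
    w = len(words[0])
    collisions = defaultdict(int)
    for _ in range(iterations):
        cols = rng.sample(range(w), num_cols)
        buckets = defaultdict(list)
        for pos, word in enumerate(words):
            buckets[tuple(word[c] for c in cols)].append(pos)
        for bucket in buckets.values():
            for i in range(len(bucket)):
                for j in range(i + 1, len(bucket)):
                    collisions[(bucket[i], bucket[j])] += 1
    return collisions

# Toy word table: positions 0, 2 and 3 hold identical words, so those
# pairs collide in every iteration; position 1 shares at most one
# column with them and can never collide on two sampled columns.
hits = collision_matrix(["acba", "cabc", "acba", "acba"], 2, 10)
print(hits[(0, 2)], hits.get((0, 1), 0))  # 10 0
```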
A Simple Experiment
Let's embed two motifs into a random walk time series, and see if we can recover them.
[Figure: a random walk time series of length about 1200 with the embedded subsequences marked; below, the planted motifs are shown next to the motifs recovered by the algorithm.]
[Figure: running time in seconds for Brute Force vs. TS-P as the time series length grows from 1000 to 5000.]
SAX Summary
For most classic data mining tasks
(classification, clustering and
indexing), SAX is at least as good as
the raw data, DFT, DWT, SVD etc.
SAX allows the best anomaly
detection algorithm.
SAX is the engine behind the only
realistic time series motif discovery
algorithm.
Conclusions
SAX is poised to make major contributions to
time series data mining in the next few years.
A more general conclusion: if you want to
solve your data mining problem, think
representation, representation, representation.
Experimental Validation
Clustering
Hierarchical
Partitional
Classification
Nearest Neighbor
Decision Tree
Indexing
VA File
Clustering
Hierarchical Clustering
Compute pairwise distance, merge similar
clusters bottom-up
Compared with Euclidean, IMPACTS, and
SDA
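The bottom-up merge can be sketched in a few lines (single linkage on a toy 1-D distance matrix; this is an illustration of the procedure, not the evaluation code from the talk):

```python
def single_linkage(dist, k):
    """Agglomerative clustering sketch: repeatedly merge the two closest
    clusters (single linkage) until k clusters remain.  `dist` is a
    symmetric pairwise-distance matrix (list of lists)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

# Points on a line: {0, 1} and {10, 11} should form two clusters.
pts = [0, 1, 10, 11]
dist = [[abs(p - q) for q in pts] for p in pts]
print(single_linkage(dist, 2))  # [[0, 1], [2, 3]]
```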
Hierarchical Clustering
[Figure: dendrograms produced by Euclidean distance, IMPACTS (alphabet = 8), SDA, and SAX.]
Clustering
Partitional Clustering
K-means
Optimize the objective function by minimizing the sum
of squared intra-cluster errors
Compared with Raw data
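The objective being minimized is just the sum of squared errors to each cluster centroid; a one-dimensional sketch (names mine):

```python
def sse(clusters):
    """k-means objective: sum of squared distances from each point to
    its cluster centroid (1-D points for simplicity)."""
    total = 0.0
    for pts in clusters:
        centroid = sum(pts) / len(pts)
        total += sum((p - centroid) ** 2 for p in pts)
    return total

print(sse([[0, 1], [10, 11]]))  # 1.0
```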
[Figure: partitional (k-means) clustering: the objective function over the number of iterations (1-11), comparing raw data with our symbolic SAX approach.]
Classification
Nearest Neighbor
Leaving-one-out cross validation
Compared with Euclidean Distance, IMPACTS,
SDA, and LP
Datasets: Control Charts & CBF (Cylinder,
Bell, Funnel)
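Leaving-one-out 1-NN is simple enough to sketch directly (a toy illustration with made-up series; names mine):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def loocv_1nn_error(series, labels):
    """Leaving-one-out 1-NN: classify each series by its nearest
    neighbour among all the others; return the error rate."""
    errors = 0
    for i, s in enumerate(series):
        j = min((j for j in range(len(series)) if j != i),
                key=lambda j: euclidean(s, series[j]))
        errors += labels[j] != labels[i]
    return errors / len(series)

series = [[0, 0, 1], [0, 1, 1], [5, 5, 6], [5, 6, 6]]
labels = ["flat", "flat", "high", "high"]
print(loocv_1nn_error(series, labels))  # 0.0
```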
Nearest Neighbor
[Figure: 1-NN error rate vs. alphabet size on the Cylinder-Bell-Funnel and Control Chart datasets, for IMPACTS, SDA, Euclidean, LPmax, and SAX.]
Classification
Decision Tree
Decision trees are defined for real-valued data, but attempting to use one on raw time series data would be a mistake.
[Figure: a time series and its Adaptive Piecewise Constant Approximation.]
Dataset | SAX         | Regression Tree
CC      | 3.04 ± 1.64 | 2.78 ± 2.11
CBF     | 0.97 ± 1.41 | 1.14 ± 1.02
Indexing
Indexing scheme similar to VA (Vector
Approximation) File
Dataset is large and disk-resident
Reduced dimensionality could still be too high
for R-tree to perform well
Indexing
[Figure: indexing performance of SAX vs. DWT (Haar) on the Ballbeam, Chaotic, Memory, and Winding datasets.]