Auditory Scene Analysis: Phenomena, Theories and Computational Models

July 1998
Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>
Outline
1 The computational theory of ASA
2 Cues & grouping
3 Expectations & inference
4 Big issues
ASA - Dan Ellis 1998jul11 - 1

Auditory Scene Analysis
What does our sense of hearing do?
- recover useful information
... about objects of interest
... in a wide range of circumstances
Measuring objects in an auditory scene:

Subjective analysis of auditory scenes
[Figure: spectrogram of the "City" scene; f/Hz (200 to 4000) against time (0 to 9 s), with the events marked by subjects S1 to S10]
Subjects' labels for each event (number of subjects reporting it):
- Horn1 (10/10): S9 "horn 2"; S10 "car horn"; S4 "horn1"; S6 "double horn"; S2 "first double horn"; S7 "horn", "horn2"; S3 "1st horn"; S5 "Honk"; S8 "car horns"; S1 "honk, honk"
- Crash (10/10): S7 "gunshot"; S8 "large object crash"; S6 "slam"; S9 "door Slam?"; S2 "crash"; S4 "crash"; S10 "door slamming"; S5 "Trash can"; S3 "crash (not car)"; S1 "slam"
- Horn2 (5/10): S9 "horn 5"; S8 "car horns"; S2 "horn during crash"; S6 "doppler horn"; S7 "horn3"
- Truck (7/10): S8 "truck engine"; S2 "truck accelerating"; S5 "Acceleration"; S1 "rev up/passing"; S6 "acceleration"; S3 "closeup car"; S10 "wheels on road"
- Horn3 (5/10): S7 "horn4"; S9 "horn 3"; S8 "car horns"; S3 "2nd horn"; S10 "car horn"
• Subjects identify structures in dense scenes with high agreement

Outline
1 The computational theory of ASA
- ASA and CASA
- The grouping paradigm
- Marr’s three levels of explanation
2 Cues & grouping
3 Expectations & inference
4 Big issues

Auditory Scene Analysis (ASA)
“The organization of sound scenes
according to their inferred sources”
• Real-world sounds rarely occur in isolation
→a useful sense of hearing must be able to
segregate mixtures
- people (and ...) do this very well;
unexpectedly difficult to model
- depends on:
subjective definition of relevant sources
regularity/constraints of real-world sounds
• Studied via experimental psychology
- characterize ‘rules’ for organizing simple pieces
(tones, noise bursts, clicks)
i.e. ‘reductive’ approach

Computational Auditory Scene Analysis
(CASA)
• Psychological ‘rules’ suggest computer
implementation
- .. but many practical problems arise!
• Motivations:
Practical applications
- real-world interactive systems
- indexing of media databases
- hearing prostheses
Crossover opportunities
- unknown signal/information processing
principles?
Benefits for theory
- implementations are very revealing

The grouping paradigm
• Standard theory of ASA (Bregman, Darwin &c):
- sound mixture is broken up into small elements
e.g. time-frequency ‘cells’
- each element has a number of feature
dimensions (amplitude, ITD, period)
- elements are grouped together according to their
features to form larger structures
- resulting groups have overall attributes (pitch,
location)
(from Darwin 1996)
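
The grouping steps above can be sketched in a few lines. This is only an illustration: the `Element` fields, the tolerances and the greedy matching rule are assumptions made for the example, not part of the theory as stated on the slide.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One time-frequency element with a few feature dimensions (hypothetical fields)."""
    freq_hz: float
    onset_s: float
    period_ms: float


def similar(a, b, onset_tol_s=0.03, period_tol_ms=0.2):
    """Two elements are 'similar' if onset and period agree (tolerances are assumptions)."""
    return (abs(a.onset_s - b.onset_s) <= onset_tol_s
            and abs(a.period_ms - b.period_ms) <= period_tol_ms)


def group_elements(elements):
    """Greedy grouping: add each element to the first group holding a similar element."""
    groups = []
    for el in elements:
        for g in groups:
            if any(similar(el, other) for other in g):
                g.append(el)
                break
        else:
            groups.append([el])
    return groups


def group_pitch_hz(group):
    """Overall attribute of a group: pitch taken from the mean period of its elements."""
    mean_period_ms = sum(e.period_ms for e in group) / len(group)
    return 1000.0 / mean_period_ms


elements = [
    Element(200, 0.10, 5.0), Element(400, 0.11, 5.0), Element(600, 0.10, 5.0),
    Element(330, 0.50, 3.0), Element(660, 0.51, 3.0),
]
groups = group_elements(elements)
print([len(g) for g in groups], [round(group_pitch_hz(g), 1) for g in groups])
```

A greedy pass like this never merges two groups once they exist, which is one of the practical problems that shows up as soon as the cues disagree.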

Marr’s levels-of-explanation
of information processing
• Three distinct aspects to info. processing
- Computational theory: ‘what’ and ‘why’; the overall goal
(in audition: sound source organization)
- Algorithm: ‘how’; an approach to meeting the goal
(in audition: auditory grouping)
- Implementation: practical realization of the process
(in audition: feature calculation & binding)
• Why bother?
- to help organize understanding
- avoid confusion/wasted effort
→ use as an analysis tool...

Level 1: Computational theory
• The underlying regularities that make the
problem possible
- i.e. the ‘ecological’ facts
• Implicit definition of “what is a source?”:
Independence of attributes between sources
Continuity of attributes for each source
+ other source-specific constraints

Level 2: Algorithm
• A particular approach to exploiting the
constraints of the computational theory
- both process & representation
• Audition:
the “elements-then-grouping” approach
- could have been otherwise e.g. templates
• Often the focus of analysis
- but: debate is muddled without a clear
computational theory

Level 3: Implementation
• A specific realization of the algorithm
- computer programs
- neurons
- ...
• Can be analyzed separately?
- provided epiphenomena are correctly assigned
• Needs context of algorithm,
computational theory
“You cannot understand stereopsis simply by
thinking about neurons”

The advantage of the appropriate level
• Computational theory
- determines the purpose of the process;
provides focus necessary for analysis
e.g. biosonar: benefit of hyperresolution
• Algorithm
- abstraction that is still specific, transferable
e.g. autocorrelation for pitch
• Implementation
- explain ‘epiphenomena’
e.g. ‘subjective octave’ from refractory period
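
As an algorithm-level example, autocorrelation for pitch can be written independently of any neural or hardware realization. The sketch below is a minimal version under stated assumptions: the function name, the 50 to 500 Hz search range and the test signal are all chosen for illustration.

```python
import math


def autocorr_pitch_hz(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate pitch as the lag with the largest (unnormalized) autocorrelation.

    The search range fmin..fmax is an assumption of this sketch.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(signal) - 1)
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Delay-and-multiply, summed over the whole signal.
        val = sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag


# Test signal: three harmonics of 200 Hz, 0.1 s at 8 kHz.
sr = 8000
tone = [sum(math.sin(2 * math.pi * 200 * h * n / sr) for h in (1, 2, 3))
        for n in range(800)]
print(autocorr_pitch_hz(tone, sr))
```

Because the sum is not normalized by the number of overlapping samples, longer lags are slightly penalized, which here helps the true period win over its multiples.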

An example: Neural inhibition
- Computational theory: frequency-domain processing
[Figure: spectrum X(f) plotted against f]
- Algorithm: discrete-time filtering (subtraction)
- Implementation: neurons with GABAergic inhibitions
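
One way to read the three levels in this example: subtracting a delayed copy of the input (the algorithm) reshapes the spectrum (the computational goal), whatever carries out the subtraction. The sketch below assumes a first-order filter y[n] = x[n] - a·x[n-1]; the coefficient `a` and both function names are illustrative assumptions.

```python
import cmath
import math


def inhibit(x, a=0.9):
    """y[n] = x[n] - a * x[n-1]: each output is the input minus a scaled, delayed copy."""
    y = []
    prev = 0.0
    for v in x:
        y.append(v - a * prev)
        prev = v
    return y


def gain(a, freq_hz, sample_rate):
    """Magnitude response |1 - a * exp(-j*w)| of the filter above at freq_hz."""
    w = 2 * math.pi * freq_hz / sample_rate
    return abs(1 - a * cmath.exp(-1j * w))


print(inhibit([1.0, 1.0, 1.0, 1.0], a=0.5))            # steady input is attenuated
print(gain(0.9, 0.0, 8000), gain(0.9, 4000.0, 8000))   # low vs. high frequency gain
```

The same description holds for a computer program or for inhibitory neurons, which is the point of separating the algorithm from its implementation.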

Summary 1
• Acoustic scenes are very complex
• .. but the auditory system extracts useful
information
• Grouping is the main focus of Auditory Scene
Analysis
• .. but it fits into a larger Marrian framework

Outline
1 The computational theory of ASA
2 Cues & grouping
- Cue analysis
- Simple scenes
- Models
- Complications: interaction, ambiguity, time
3 Expectations & inference
4 Big issues

Cues to grouping
• Common onset/offset/modulation (“fate”)
• Common periodicity (“pitch”)
Common onset:
- Computational theory: acoustic consequences tend to be synchronized
- Algorithm: group elements that start in a time range
- Implementation: onset detector cells; synchronized osc’s?
Periodicity:
- Computational theory: (nonlinear) cyclic processes are common
- Algorithm: ? place patterns ? autocorrelation
- Implementation: ? delay-and-mult ? modulation spect
• Spatial location (ITD, ILD, spectral cues)
• Sequential cues...
• Source-specific cues...
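
For the spatial-location cue, an interaural time difference (ITD) can be estimated by finding the lag that best aligns the two ear signals. Cross-correlation is a standard choice but is not named on the slide, so treat it as a swapped-in technique; the function name, lag range and toy signal are assumptions.

```python
def estimate_itd_s(left, right, sample_rate, max_lag=20):
    """Return the lag (in seconds) at which `right` best matches `left`.

    Positive values mean the right-ear signal is delayed relative to the left.
    """
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = 0.0
        for n, sample in enumerate(left):
            m = n + lag
            if 0 <= m < len(right):
                val += sample * right[m]
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag / sample_rate


# Toy input: a decaying burst that reaches the right ear 3 samples late.
burst = [0.0] * 10 + [0.8 ** k for k in range(30)] + [0.0] * 10
left = burst
right = [0.0] * 3 + burst[:-3]
print(estimate_itd_s(left, right, sample_rate=10000))
```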

Simple grouping
• E.g. isolated tones
[Figure: time-frequency schematic of isolated tones; axes freq vs. time]
Computational theory:
• common onset
• common period (harmonicity)
Algorithm:
• locate elements (tracks)
• group by shared features
Implementation:
? exhaustive search
• evolution in time
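
Grouping tracks by common period can be sketched as an exhaustive search over candidate fundamentals. The 3% tolerance, the candidate grid and the tie-break toward the highest candidate are assumptions made for this example only.

```python
def harmonic_fit(track_freqs_hz, f0_hz, tol=0.03):
    """Return the tracks lying within a fractional tolerance of a multiple of f0_hz."""
    matched = []
    for f in track_freqs_hz:
        n = max(1, round(f / f0_hz))          # nearest harmonic number
        if abs(f - n * f0_hz) <= tol * n * f0_hz:
            matched.append(f)
    return matched


def best_f0(track_freqs_hz, candidates_hz, tol=0.03):
    """Exhaustive search: pick the candidate f0 that explains the most tracks.

    Ties go to the highest candidate so that sub-harmonics (f0/2, ...) do not win.
    """
    best = max(candidates_hz,
               key=lambda c: (len(harmonic_fit(track_freqs_hz, c, tol)), c))
    return best, harmonic_fit(track_freqs_hz, best, tol)


tracks = [200, 400, 600, 730, 800]            # 730 Hz does not fit the series
f0, group = best_f0(tracks, range(80, 401, 10))
print(f0, group)
```

The track left out of the group (730 Hz here) would be passed on as a separate element rather than discarded.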

Computer models of grouping
• “Bregman at face value” (e.g. Brown 1992):
[Block diagram: input mixture → Front end → signal features (maps) → Object formation → discrete objects → Grouping rules → Source groups]
[Figure: feature maps over freq and time: onset, period, frq.mod]
- feature maps
- periodicity cue
- common-onset boost
- resynthesis

Grouping model results
• Able to extract voiced speech:
[Figure: two spectrograms, brn1h.aif and brn1h.fi.aif; frq/Hz (100 to 3000) against time/s (0.2 to 1.0)]
• Periodicity is the primary cue
- how to handle aperiodic energy?
• Limitations
- resynthesis via filter-mask
- only periodic targets
- robustness of discrete objects
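
The filter-mask idea, and why it only suits periodic targets, can be shown on a toy time-frequency grid. The grid layout, the `None` marker for aperiodic cells and the tolerance are assumptions of this sketch, and the actual resynthesis step is not shown.

```python
def periodicity_mask(period_map_ms, target_period_ms, tol_ms=0.2):
    """True where a cell's dominant period matches the target; None marks aperiodic cells."""
    return [[p is not None and abs(p - target_period_ms) <= tol_ms for p in row]
            for row in period_map_ms]


def apply_mask(energy, mask):
    """Keep the energy of selected cells and zero the rest."""
    return [[e if keep else 0.0 for e, keep in zip(erow, mrow)]
            for erow, mrow in zip(energy, mask)]


energy = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
periods = [[5.0, 5.1, None],     # None: no clear period in that cell
           [3.0, 5.0, None]]
mask = periodicity_mask(periods, target_period_ms=5.0)
print(apply_mask(energy, mask))
```

Every aperiodic cell is dropped whichever source it belongs to, which is the limitation noted above.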

Complications for grouping:
1: Cues in conflict
• Mistuned harmonic (Moore, Darwin..):
[Figure: time-frequency schematic; axes freq vs. time]
- harmonic usually groups by onset & periodicity
- can alter frequency and/or onset time
- ‘degree of grouping’ from overall pitch match
• Gradual, various results:
[Figure: pitch shift plotted against mistuning, with 3% marked]
- heard as separate tone, still affects pitch

Complications for grouping:
2: The effect of time
• Added harmonics:
[Figure: time-frequency schematic; axes freq vs. time]
- onset cue initially segregates; periodicity eventually fuses
• The effect of time
- some cues take time to become apparent
- onset cue becomes increasingly distant...
• What is the impetus for fission?
- e.g. double vowels
- depends on what you expect .. ?

Summary 2
• Known grouping cues make sense
• Simple examples are straightforward
• Models can be implemented directly
• .. but problematic situations abound

Outline
1 The computational theory of ASA
2 Cues & grouping
3 Expectations & inference
- “Old-plus-new”
- Streaming
- Restoration & illusions
- Top-down models
4 Big issues

The effect of context
• Context can create an ‘expectation’:
i.e. a bias towards a particular interpretation
• e.g. Bregman’s “old-plus-new” principle:
A change in a signal will be interpreted as an
added source whenever possible
[Figure: spectrogram; freq/kHz (0 to 2) against time/s (0.0 to 1.2)]
- a different division of the same energy depending on what preceded it

Streaming
• Successive tone events form separate streams
[Figure: tone sequence around 1 kHz, freq. vs. time; TRT (tone repetition time): 60-150 ms; ∆f: ±2 octaves]
• Order, rhythm &c within, not between, streams
Computational theory:
• consistency of properties for successive source events
Algorithm:
• ‘expectation window’ for known streams (widens with time)
Implementation:
• competing time-frequency affinity weights...
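
The ‘expectation window’ idea can be sketched as follows: each known stream accepts a new tone only if it falls within a frequency window that widens with the time since that stream's last event. The window parameters are arbitrary illustration values, not perceptual measurements, and the function name is an assumption.

```python
import math


def assign_streams(tones, base_octaves=0.25, widen_per_s=1.0):
    """Assign each (time_s, freq_hz) tone, given in time order, to a stream index."""
    streams = []   # per stream: (time of last tone, frequency of last tone)
    labels = []
    for t, f in tones:
        best, best_dist = None, None
        for idx, (last_t, last_f) in enumerate(streams):
            dist = abs(math.log2(f / last_f))              # distance in octaves
            window = base_octaves + widen_per_s * (t - last_t)
            if dist <= window and (best is None or dist < best_dist):
                best, best_dist = idx, dist
        if best is None:                                   # no stream expects this tone
            streams.append((t, f))
            best = len(streams) - 1
        else:
            streams[best] = (t, f)
        labels.append(best)
    return labels


fast = [(0.0, 1000), (0.1, 2000), (0.2, 1000), (0.3, 2000)]
slow = [(0.0, 1000), (1.0, 2000), (2.0, 1000), (3.0, 2000)]
print(assign_streams(fast), assign_streams(slow))
```

With these settings the fast alternation splits into two streams while the slow one stays as a single stream, the qualitative dependence on repetition time and ∆f.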

Restoration & illusions
• Direct evidence may be masked or distorted
→make best guess using available information
• E.g. the ‘continuity illusion’:
[Figure: spectrogram "ptshort"; f/Hz (1000 to 4000) against time/s (0.0 to 1.4)]
- tone alternates with noise bursts
- noise is strong enough to mask tone
... so listener cannot discriminate presence
- continuous tone distinctly perceived
for gaps ~100s of ms
→ Inference acts at low, preconscious level
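
The masking condition behind the continuity illusion can be expressed as a per-frame consistency check: the tone may be inferred present whenever the energy in its band is at least what the tone alone would produce. The dB values, the margin and the function name are assumptions for illustration.

```python
def tone_consistent(band_energy_db, tone_level_db, margin_db=1.0):
    """Per frame: could a tone at tone_level_db be present, given the band energy?"""
    return [e >= tone_level_db - margin_db for e in band_energy_db]


tone_db = 60.0
with_noise = [60.0, 60.0, 75.0, 75.0, 60.0, 60.0]   # noise burst covers the gap
with_gap = [60.0, 60.0, 20.0, 20.0, 60.0, 60.0]     # silent gap instead
print(all(tone_consistent(with_noise, tone_db)))    # continuity is a valid inference
print(all(tone_consistent(with_gap, tone_db)))      # the tone must have stopped
```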

Speech restoration
• Speech provides very strong bases for
inference (coarticulation, grammar, semantics):
• Phonemic restoration
[Figure: spectrogram nsoffee.aif; frq/Hz (0 to 3500) against time/s (1.2 to 1.7)]
• Temporal compounds
[Figure: "Temporal compound (1998jul10)"; vertical axis 20 to 120, time / ms (50 to 550)]
• Sinewave speech (duplex?)
[Figure: "S1−env.pf:0"; f/Bark (5 to 15), time axis 0.0 to 1.8, level scale 40 to 80]

Models of top-down processing
Perception as a search for plausible explanations
• ‘Prediction-driven’ CASA (PDCASA):
[Block diagram: input mixture → Front end → signal features → Compare & reconcile → prediction errors → Hypothesis management → hypotheses (noise components, periodic components) → Predict & combine → predicted features → Compare & reconcile]
• An approach as well as an implementation...
• Key features:
- ‘complete explanation’ of all scene energy
- vocabulary of periodic/noise/transient elements
- multiple hypotheses
- explanation hierarchy

PDCASA for old-plus-new
• Incremental analysis
[Figure: input signal with analysis times t1, t2, t3 marked]
- Time t1: initial element created
- Time t2: additional element required
- Time t3: second element finished
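
The incremental analysis can be sketched as a prediction-reconciliation loop over a single channel of energy. This is a deliberately reduced illustration and not the PDCASA system itself: it keeps one hypothesis instead of several, uses additive linear energy, and ends the most recently created element when the prediction is too high. All names and the threshold are assumptions.

```python
def pd_analysis(frames, threshold=0.5):
    """Explain a sequence of observed energies with start/end/level elements."""
    active, finished = [], []
    for t, observed in enumerate(frames):
        predicted = sum(e["level"] for e in active)       # predict & combine
        error = observed - predicted                      # compare
        if error > threshold:
            # Unexplained energy: an additional element is required.
            active.append({"start": t, "end": None, "level": error})
        while active and error < -threshold:
            # Prediction too high: finish the most recent element (a simplification).
            e = active.pop()
            e["end"] = t
            finished.append(e)
            error += e["level"]
    for e in active:                                      # still sounding at the end
        e["end"] = len(frames)
        finished.append(e)
    return sorted(finished, key=lambda e: e["start"])


# t1: energy appears; t2: extra energy is added; t3: the extra energy stops.
for element in pd_analysis([1.0, 1.0, 3.0, 3.0, 1.0, 1.0]):
    print(element)
```

The added energy is explained as a second element on top of the first, which continues throughout: the old-plus-new interpretation.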

PDCASA for the continuity illusion
• Subjects hear the tone as continuous
... if the noise is a plausible masker
[Figure: spectrogram "ptshort"; f/Hz (1000 to 4000) against time (0.0 to 1.4 s)]
• Data-driven analysis gives just visible portions
• Prediction-driven can infer masking

PDCASA analysis of a complex scene
[Figure: spectrogram of the "City" scene, f/Hz (200 to 4000) against time/s (0 to 9), above panels of the elements extracted by the system; the weft panel has a second axis from 50 to 1000 Hz; level scale −70 to −40 dB. Each panel is annotated with the subjective events it corresponds to (number of subjects reporting each):]
- Wefts1−4, Weft5, Wefts6,7, Weft8, Wefts9−12: Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10)
- Noise2, Click1: Crash (10/10)
- Noise1: Squeal (6/10), Truck (7/10)

Marrian analysis of PDCASA
• Marr invoked to separate high-level function
from low-level details
Computational theory:
• Objects persist predictably
• Observations interact irreversibly
Algorithm:
• Build hypotheses from generic elements
• Update by prediction-reconciliation
Implementation: ???
“It is not enough to be able to describe the response of single
cells, nor predict the results of psychophysical experiments.
Nor is it enough even to write computer programs that perform
approximately in the desired way:
One has to do all these things at once, and also be very aware
of the computational theory...”

Summary 3
• Perceptual processing is highly
context-dependent
• Auditory system will use prior knowledge
to fill in gaps (subconsciously)
• Prediction-reconciliation models can
encompass this behavior

Outline
1 The computational theory of ASA
2 Cues & grouping
3 Expectations & inference
4 Big issues
- the state of ASA and CASA
- outstanding issues
- discussion points

The current state of ASA and CASA
• ASA
- detailed descriptions of “in vitro” tests
- some quite subtle effects explained (DV beats)
but: how to extend to complex scenarios?
• CASA
- numerous models, some convergence
(mainly periodicity-based)
- best results sound impressive
(least plausible systems!)
- applications in speech recognition?
but: domains limited, poor robustness

Big issues in CASA:
• Plausibility
- correct level for human correspondence?
- which phenomena are important to match?
- how to implement symbolic-style processing?
• Top-down vs. bottom-up
- different approaches to ambiguity, latency
- how far down for top-down?
- how far ‘up’ for high level?
- choice between extraction & inference?
• Integrating multiple cues (e.g. binaural)
• Other debates:
- what is the real goal?
- resynthesis
- evaluation

Big issues in ASA & CASA:
• Knowledge:
how to acquire, represent & store ...
- short-term: context
- long-term: memories
- abstract: classes, generalities
• Attention:
- what does it mean in these models?
- limitation or important principle?

Conclusions
• Real-world sounds are complex;
scene-analysis is required
• We know certain cues & some rules,
but real situations raise contradictions
• Current models handle ‘obvious’ cases;
robustness & generality are hard
• Many issues remain

Discussion points
• Are Marr’s levels important? Useful?
Can you study levels in isolation?
• What do restoration phenomena imply about
internal representations?
• Do we have an adequate account of an ASA
algorithm? e.g. where do hypotheses come
from?
• How important/challenging are phenomena like
duplex perception, sinewave speech etc.?
