
A Monophonic Pitch Tracking Algorithm
D. Cooper

Department of Music,
The University of Leeds, Leeds LS2 9JT, United Kingdom
Email : mus6dc@sun.leeds.ac.uk

and

K. C. Ng

Division of Artificial Intelligence, School of Computer Studies,


The University of Leeds, Leeds LS2 9JT, United Kingdom
Email : kia@scs.leeds.ac.uk
May 10, 1994

1 Introduction
In many musical situations current MIDI controllers prove inflexible or artificial to musicians unfamiliar with their ergonomic eccentricities. Our aim is to provide a means whereby conventional instruments can be used to control musical processes in a way that is transparent to the performer. In this paper we propose an approach to monophonic pitch tracking using the pattern and shape of the digitised sound wave. We discuss the strategy of segmenting the sound wave and a method for finding the shortest distance between two shape-similar repeating segments. A new composition which makes use of this algorithm is reported.

2 Related work
Methods of pitch detection previously cited in the literature include filtering, spectral analysis, and autocorrelation.
Complex signals may be passed through a parallel bank of sharp narrow-band filters whose upper and lower cutoff frequencies lie one equal-tempered semitone apart. The lowest-frequency output of the filterbank is adjudged to be the fundamental frequency component of the signal, and therefore the musical pitch of the input signal. If implemented in software, this requires substantial computation: to cover the 7-octave range of a piano keyboard, 85 separate filters would be required, and the outputs of the filters would need to be regularly monitored. Lane (1990) and Kuhn (1990) have detailed algorithms utilising filtering techniques which considerably reduce the amount of processing, with high reported rates of accuracy.
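The filter count quoted above follows from simple arithmetic: one filter per semitone over seven octaves. A short sketch, in which the starting note (A0 at 27.5 Hz) is our illustrative assumption rather than a value given in the text:

```python
# Back-of-envelope check of the filterbank size: one narrow-band filter
# per semitone over 7 octaves.  f_low = 27.5 Hz (A0) is an assumed
# illustrative starting note; the paper gives no band edges.
f_low = 27.5
n_octaves = 7
n_filters = n_octaves * 12 + 1          # 84 semitone steps -> 85 filters
centres = [f_low * 2 ** (k / 12) for k in range(n_filters)]
print(n_filters, centres[12])           # 85 filters; one octave up is 55.0 Hz
```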
Fourier techniques allow the analysis of a complex signal into its discrete sinusoidal components, the lowest peak in the resulting amplitude/frequency data being taken to be the fundamental. From the Discrete Fourier Transform (Lockhart 1989)

    X[k] = Σ_{n=0}^{N−1} x[n] e^{−j2πnk/N},    k = 0, 1, ..., N − 1


it will be noted that the resolution is 1/(Nt) Hz, where N is the window length, t is the sampling period, k is the target index, and n is the sample index. Thus a 0.5 Hz resolution requires a 2-second window. Given overlapping windows of a much shorter length it is possible, however, to get very accurate results by comparing the output phases from the same Fourier bins in neighbouring windows (Bailey 1994). Even allowing for this, and for Fast Fourier algorithms, this method involves very considerable computational effort.
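The resolution relation above is worth making concrete: the window duration Nt alone fixes the bin spacing, whatever the sampling rate. A minimal sketch (the 32 kHz rate is borrowed from the system described later in this paper, purely for illustration):

```python
# DFT bin spacing is 1/(N*t) Hz, where t is the sampling period, so the
# window duration N*t alone determines the achievable resolution.
fs = 32000                         # sampling rate in Hz (illustrative)
t = 1.0 / fs
target_res = 0.5                   # desired resolution in Hz
N = round(1.0 / (target_res * t))  # samples needed in the window
duration = N * t                   # seconds, independent of fs
print(N, duration)                 # a 0.5 Hz resolution needs a 2 s window
```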
The Phase Vocoder described by Portnoff (Portnoff 1976) has possible applications as a sophisticated pitch tracker (Bailey 1993). Whilst this is clearly effective, the implementation discussed required a parallel processor to carry the heavy load, and was certainly not appropriate for the project discussed in this article.
Autocorrelation algorithms have been successfully used in speech processing for some time. The autocorrelation function (Rabiner 1977)

    φ(m) = (1/N) Σ_{n=0}^{N−1} [x(n+l)w(n)][x(n+l+m)w(n+m)],    0 ≤ m ≤ M₀ − 1

will produce maxima at the period of the input signal. N is the sample window length, n is the index of the current sample, m is the index of the delayed sample, M₀ is the number of delays examined, w(·) is the window function, and l is the frame offset. However, Kuhn has shown that, given a sampling rate of 10 kHz, a window length of 50 msec, and a pitch sample rate of 20 Hz, 1,000,000 multiplications and additions are needed each second (Kuhn 1990).
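Kuhn's cost figure can be reproduced with one plausible accounting. The number of lags examined (100, corresponding to a 100 Hz lowest trackable pitch at 10 kHz) is our assumption, chosen so that the totals match; Kuhn (1990) gives the exact breakdown:

```python
# Rough cost accounting for autocorrelation pitch tracking using Kuhn's
# figures.  The lag count (100) is an assumption made here to reproduce
# the quoted total, not a value stated in this paper.
fs = 10_000                    # 10 kHz sampling rate
window = int(0.050 * fs)       # 50 ms window -> 500 samples
lags = 100                     # assumed number of delays m examined
rate = 20                      # pitch estimates per second
ops_per_second = window * lags * rate
print(ops_per_second)          # 1,000,000 multiply-adds per second
```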

3 Method
A digitised sound wave can be viewed as a continuous stream of points in two
dimensional space, time and amplitude, uctuating around a silent threshold
with a repeating pattern and shape. Looking at a snapshot of a sound source
through an oscilloscope, we can usually determine the points where a cycle
repeats itself visually, and hence determine the frequency of the wave if the
sampling frequency is known. Thus, we propose a fast algorithm which uses
the shape of the digitised sound wave to nd its pitch.
A fixed-length snapshot window, 800 samples in this case, sampled at 32 kHz, is taken from an A/D converter. Firstly, the positive-going zero crossing points are located. Traversing through the samples, at each point, if the amplitude of the sample (N) is less than the silent threshold, and the amplitude of the sample after it (N + 1) is greater than or equal to the silent threshold, then the sample point at (N + 1) is a zero crossing point. An illustration of the upward zero crossing points can be found in Figure 1.

Figure 1: Upward zero crossing points.
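The zero-crossing test just described can be sketched in a few lines of Python; the function name and the plain-list representation of the snapshot are ours, not the authors':

```python
def upward_zero_crossings(samples, threshold=0):
    """Indices of positive-going crossings of the silent threshold.

    Sketch of the test in the text: if sample N is below the threshold
    and sample N+1 is at or above it, point N+1 is an upward zero
    crossing.  Representation and names are illustrative.
    """
    return [n + 1
            for n in range(len(samples) - 1)
            if samples[n] < threshold <= samples[n + 1]]

# A coarse sine-like sequence crosses upward twice:
print(upward_zero_crossings([-3, -1, 2, 4, 1, -2, -4, -1, 3]))  # [2, 8]
```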


To obtain higher accuracy, it is possible to extend the length of the snapshot window, find two sample points where the first has a negative amplitude and the other a positive amplitude, and calculate the zero crossing point by interpolation. In practice, however, the first method is sufficient.
With a pure sine wave, clearly, the pitch can be calculated from the distance between two upward zero crossing points, but the sound waves produced by real musical instruments are more complex, depending on the number of component harmonics the instrument generates.
The section between two upward zero crossing points is referred to as a segment. A shape description of a segment is estimated by dividing the segment into eight equal-length subsegments, and taking the first and last three amplitude values of the subsegments respectively as the six landmark points. These six landmarks, as shown in Figure 2, provide a simplified shape, and are used to determine the similarity measure between segments. The
Figure 2: The setting of landmarks.


normalised distance, which is used to provide the similarity ratio between two segments, a and b, can be written as:

    similarity ratio = (a·b) / ((a·a) + (b·b) − (a·b))

where the dot product is defined as:

    a·b = Σ_{i=1}^{6} a_i b_i

A similarity ratio of 1 indicates a perfect match, and 0 suggests that the two segments are completely different.
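A minimal sketch of the similarity computation, assuming the six landmark amplitudes are held in plain Python lists (function names are illustrative):

```python
def dot(a, b):
    """Dot product of two equal-length landmark vectors."""
    return sum(x * y for x, y in zip(a, b))

def similarity_ratio(a, b):
    """Normalised similarity (a.b) / ((a.a) + (b.b) - (a.b)):
    1.0 for identical six-point shapes, near 0 for unrelated ones."""
    return dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b))

a = [3, 5, 2, -2, -5, -3]          # illustrative landmark amplitudes
print(similarity_ratio(a, a))      # 1.0: a segment matches itself exactly
```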
Note that incomplete segments at the head and at the tail of the snapshot window are ignored. The length of the largest segment is compared with all other segments, and the distances and the similarity ratios between them are calculated. The number of sample points from which the cycle length is calculated is determined by three conditions. Firstly, the difference in length of the two segments must be less than a definable threshold, which is, by default, set to 5 in this implementation. Secondly, the similarity ratio of the two segments should be high, for example, at least 0.75 in our case. Lastly, the similar segment must be as near as possible to the largest segment.
On locating the next similar segment, the frequency of that snapshot can be calculated by:

    frequency = sampling frequency / distance (in samples) between two similar segments
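For instance, the frequency calculation above reduces to a single division; the 160-sample distance below is an illustrative value, not one from the paper:

```python
# Pitch from the distance (in samples) between two shape-similar
# segments.  The distance value is illustrative.
fs = 32_000          # sampling frequency used in this implementation
distance = 160       # samples between matching upward zero crossings
frequency = fs / distance
print(frequency)     # 200.0 Hz
```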
For the concert reported in a later section, an automatic self-confidence test was implemented. When a pitch is tracked, a confidence counter is increased each time the same pitch is recognised at subsequent snapshots, and decreased otherwise. If the confidence counter is 0, the current pitch is tracked. The system responds after a confidence count of 3.
The self-confidence threshold may be fine-tuned to cope with different kinds of instruments. Percussion instruments, for example, have rapid decays, and low confidence thresholds, characteristically 2, may be needed. On the other hand, sustaining instruments such as the violin with vibrato may need larger thresholds, typically 5, to determine their actual centre pitch.
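One way the self-confidence test could be realised is sketched below. The class and method names are ours, and the exact reset behaviour when the counter reaches 0 is our reading of the description above, not code from the paper:

```python
class ConfidenceTracker:
    """Sketch of the self-confidence test: the counter grows while
    consecutive snapshots agree on a pitch and shrinks otherwise;
    the system responds once the counter reaches the threshold.
    Names and the reset-on-zero detail are our assumptions."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.pitch = None        # currently tracked pitch
        self.count = 0           # confidence counter

    def update(self, pitch):
        """Feed one snapshot's pitch; return True when the system
        should respond."""
        if pitch == self.pitch:
            self.count += 1
        else:
            self.count -= 1
            if self.count <= 0:              # confidence exhausted:
                self.pitch, self.count = pitch, 1   # track the new pitch
        return self.count >= self.threshold

tracker = ConfidenceTracker(threshold=3)
print([tracker.update(p) for p in [440, 440, 440, 220]])
# responds on the third agreeing snapshot, not on the stray 220 Hz one
```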

Figure 3: An illustration of the prototype system.


Figure 3 shows the prototype working on a Silicon Graphics Indy workstation. The vertical lines show the boundaries of the segments. The average response time is below 1 second.

4 Application
The algorithm (see also Cooper and Ng, 1992) was developed and refined as part of a commission for a new work for Brass Band and electronics by the composer Philip Wilby, for the National Youth Brass Band of Great Britain, with funding from The Arts Council of Great Britain. The composition `Dance before the Lord' was given its premiere in Gloucester Cathedral, England, in April 1994 as the climax of the band's annual Easter course.
[Figure 4 (flowchart): Begin; A/D convertor; Pitch tracker; "Pitch tracked with confidence?"; "Pitch of interest?"; Respond via MIDI; "Key `Q' pressed?"; End.]

Figure 4: Basic system flow.


The composer had felt that the instruments should somehow control the electronics rather than merely passively sharing the acoustic space with them. As an element of the electro-acoustic content of the piece, it was suggested that the pitch tracking algorithm could form the basis of a program which would identify predetermined control notes from melodic lines played by soloists into a microphone, and respond to them by outputting files stored in a proprietary format (see Figure 4), which was created by a program using a non-linear dynamical system as a compositional generator (also designed by the authors).

The program was written in C for a low-specification computer (an Atari STE running at 8 MHz) with an 8-bit analogue-to-digital converter for data acquisition, using a stripped-down version of the algorithm which uses only two landmark points. The interface was minimal, providing an oscilloscope function (primarily for adjusting input levels) which could be switched in and out, a snapshot of the sampler's input buffer in signed character format, a display of tracked output (both raw and confidence-tested), and indications of program and MIDI status. Control of the above functions, manual override, and MIDI termination was provided via keyboard character input.
[Figure 5 (block diagram): Mic, A/D, Pitch tracker, Synthesiser, FX, Mixer, Amplifier, Speaker.]

Figure 5: Concert configuration.


Philip Wilby felt that the sound of the soloist should be transformed and projected with the synthesised response, so the arrangement shown in Figure 5 was adopted. Brass instruments can produce high sound pressure levels, and to deal with this, and to minimise problems of sound isolation, a robust dynamic microphone with a hypercardioid pattern was used. Care had been taken in the score to avoid the trigger tones in passages leading up to control events, and the microphone channel was kept closed in the sections of the work which did not require pitch identification.
Although a limited set of pitches common to the 3 solo instruments (euphonium, trombone and soprano cornet) was selected for control purposes, tracking was complicated both by the differences in their partial content and by their relative intensities. Wilby required the synthesised responses to appear at the end of crescendi, and care had to be taken that the threshold for tracking was not reached too soon. In order to avoid this, and to reinforce the theatricality of the gesture, a technique of initially holding the bell away from the microphone and gradually turning it towards it in the course of the crescendo was adopted.
Other possible applications utilising the pitch tracking algorithm include ones to give assistance in the acquisition of accurate vocal intonation, `melograph'-type programs which give a graphic representation of a melodic line, and the development of complex interactive performance situations where the computer might act as a `live musician'.

5 Future directions
For a possibly more consistent shape estimation, landmark points could be taken as the average over an ensemble, rather than the single amplitude values used in this implementation. Automation could include flexible landmark positioning which takes into account the significant features of a shape, self-confidence threshold decision-making which depends upon the rate of decay of the input sound wave, and sampling frequency selection varying according to the nature of the sound being sampled.
Currently the segmentation process is zero-crossing orientated. If a method such as chain coding (Gonzalez 1987, Boyle 1993) were used to provide a compact way to describe shape, it is possible that a segmentation-free algorithm could be achieved.
A graphical interface which allows the user to define the system's response to the detection of pitches and threshold loudness values (which may also be used as control parameters) remains to be written.

6 Conclusion
Experiments have been carried out with many different acoustic instruments as well as synthesised sound waves. The results from the visually driven algorithm are extremely satisfactory. With the self-confidence threshold set to 3, a 32 kHz sampling rate, and an 800-sample snapshot window, a recognition rate of more than 90% was achieved.
The capabilities of the algorithm remain to be further explored. It may offer a new direction in electronic composition in which composers and players can command a computer to respond using live instruments as trigger sources.
The possible use of the shape description method from pattern recognition, allowing a segmentation-free algorithm, represents work in hand. Future developments will include faster response times and higher resolution.
Other applications might include the use of the tracked amplitude level to control response parameters, for example by the association of a series of threshold levels with timbral types.
Our dream is of a program which, whilst making appropriate responses to control pitches, would continuously track the surrounding sounds and make fine adjustments to performance parameters such as timing: a virtual musician who listens, responds and communicates! One could imagine a group of machines, or even just a number of processes on a parallel machine, tracking each other: machines playing music together!

References
Bailey, N. J. 1994. Private Email communication.
Bailey, N. J. et al. 1993. "Applications of the Phase Vocoder in the Control of Real-time Electronic Musical Instruments." Interface 22(3): 259-275.
Cooper, D., and Ng, K. 1992. "A Computationally Non-Intensive Algorithm For Pitch Recognition." ARRAY 12(2): 6-9.
Gonzalez, R. C., and Wintz, P. 1987. Digital Image Processing. Addison-Wesley Publishing Company.
Kuhn, W. B. 1990. "A Real-Time Pitch Recognition Algorithm for Music Applications." Computer Music Journal 14(3): 60-71.
Lane, J. E. 1990. "Pitch Detection using a Tunable IIR Filter." Computer Music Journal 14(3): 46-59.
Lockhart, G. B., and Cheetham, B. M. G. 1989. BASIC Digital Signal Processing. London: Butterworths.
Portnoff, M. R. 1976. "Implementation of the Digital Phase Vocoder Using the Fast Fourier Transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 24(3): 243-248.
Rabiner, L. R. 1977. "On the Use of Autocorrelation Analysis for Pitch Detection." IEEE Transactions on Acoustics, Speech, and Signal Processing 25(1): 24-33.
Sonka, M., Hlavac, V., and Boyle, R. 1993. Image Processing, Analysis and Machine Vision. Chapman & Hall Computing.
