Abstract
Acknowledgements
I would like to thank my supervisors, Joan Serrà from the Music Technology
Group (MTG) and Pedro Cano from Barcelona Music & Audio Technologies
(BMAT), for all the support given throughout this work, and especially for
always being close by to give advice, at any time and in such a positive way.
Cristian.
Contents
Abstract
Acknowledgements
Chapter 1: Introduction
1.1. Motivation
1.2. Audio fingerprinting
1.2.1. Properties of audio fingerprinting
1.2.2. Requirements for audio fingerprinting
1.3. Initial problem statement and hypotheses
1.4. Goals and organization of the thesis
Chapter 2: State of the art
Chapter 3: Problem statement
Chapter 4: Problem solution
Chapter 5: Methodology
5.1. Test set
5.2. Evaluation measures
Chapter 6: Results
6.1. Results: system previous to modifications
6.2. Results: new matching algorithm versus existing matching method
Chapter 7: Conclusions
Chapter 8: References
List of figures
Chapter 1
Introduction
1.1. Motivation
With the rise of digital media and of the Internet as a distribution channel
for audio files, large amounts of audio belonging to companies, author
societies, consumers, etc. circulate beyond the control that their owners
might want to have over them. Possible examples are copyright holders of a
catalogue of audio who want to monitor television, radio or festivals to know
whether any of their copyrighted audio is being broadcast, either simply to
keep track of it or to collect the legal royalties. Another example is
companies that place advertisements on various communication channels and
want to know whether they were played at the contracted time and with what
emission quality. Automatic monitoring thus becomes a potential need that is
likely to grow in the future, given the entropy of audio files on the network
and the many and varied distribution channels.
The main motivation for this work has been to make a contribution to the
Music Information Retrieval field. As the number of audio files grows every
day, the need to keep track of what happens to these files around the web
grows with it. Fingerprinting technologies are increasingly used and the
services built on them increasingly valuable, so it was interesting to work
with them and learn more about them. Finally, a knowledge gap seemed to be
identified in the detection of electronic music with Vericast, an automatic
audio identification system by fingerprinting developed by BMAT. There were
thus several important reasons that drove the work to its final composition.
The reasons why there was a problem in the operation of the product were
quite simple. The songs to detect had mainly been non-electronic music, with
consistently good results and success rates around 95%. It must also be said
that there was no formalized testing with genres treated separately,
especially with electronic music. Once this technology was needed to identify
electronic music, it did not seem to reach the success rates previously
achieved for other musical styles. These observations were initially reported
by customers and then confirmed by BMAT developers.
The algorithm works on audio frames. For every audio frame of the unlabeled
audio input, the system gives the most probable songs together with the
temporal points at which the input audio is supposed to occur within them.
Throughout this work, these probable songs with their temporal points will be
called candidates. The problem encountered in the matching algorithm was that
it was not obtaining correct temporal trajectories between candidates: the
song was detected, but the temporal sequences between candidates had no clear
order. Techno and many other styles within electronic music are generally
built on loops, i.e., repetitions of a piece of audio. The algorithm detects
these loops in the various parts of the song where they are repeated.
Sometimes it does not pick the correct occurrence, as there are others that
are very similar or completely identical, causing bad temporal sequences.
After this observation the problem was redefined: it was not electronic music
itself that caused it, but rather 'repetitive music'.
To formalize the problem and check that it really existed, a test was
developed. The test consisted of songs from different genres, tagged as pop,
rhythm 'n' blues, hip hop, dance, jazz, classical and techno, the last one
representing the electronic music style. The songs were divided by genre,
with the same number of excerpts per genre. Every genre was tested separately
to check whether the accuracy for techno music (and perhaps for variants like
dance) was lower than for the other genres.
The test showed that the system indeed responded with less accuracy for the
songs tagged as 'techno' and, secondly, for the 'dance' songs, a style that
has similarities with 'techno' (see section 6.1).
Several hypotheses were put forward to (a) explain and (b) fix the problem.
The options considered were basically: (1) revising the similarity distance,
but it had already been reviewed and the one developed in-house seemed
efficient; (2) tuning the parameters differently, which might give good
results for concrete cases but makes it difficult to find a general pattern;
(3) changing the features, but works like [Logan2000] show that MFCCs (the
kind of features used to fill the acoustic vector) can satisfactorily model
the human perception of sound; and (4) revising the matching algorithm and
proposing improvements, adding new phases to it or writing an entirely new
algorithm. Of these possible solutions, the one most likely to change the
results was (4).
Once the testing was formalized, the next step was to identify the different
hypotheses that could explain what was causing the problem, then take the
hypothesis most likely to improve the results and ultimately develop a
solution based on it. Finally, the solution taken was to develop a new
matching method (chapter 4); all of this is covered in detail in the chapters
devoted to it specifically.
Chapter 2
State of the art
In this chapter the state of the art will be reviewed. First, the foundations
of fingerprinting systems and their common applications will be shown. Then,
for the audio-identification-by-fingerprinting scenario, the main differences
between systems will be reviewed, which comprise the features used to fill
the fingerprint signature and the algorithms implemented. The last section of
the chapter is an explanation of how the system works.
2.1.1. Uses
Audio fingerprinting is a technology with a wide range of possible uses. The
main ones can be divided into those exposed below.
In the context of watermarking support, perceptual hashing can help generate
input-dependent keys for each piece of audio. [Haitsma2001] suggest audio
fingerprinting to enhance the security of watermarks in the context of copy
attacks. Copy attacks estimate a watermark from watermarked content and
transplant it to unmarked content. Binding the watermark to the content can
help to defeat this type of attack. In addition, fingerprinting can be useful
against insertion/deletion attacks that cause desynchronization of the
watermark detection: by using the fingerprint, the detector is able to find
anchor points in the audio stream and thus to resynchronize at these
locations [Mihçak2001].
2.1.2. Applications
Common applications derive from the uses that fingerprinting systems have.
In spite of the different rationales behind the identification task, methods share
certain aspects. There are two fundamental processes: the fingerprint
extraction and the matching algorithm. The fingerprint extraction derives a set
of relevant perceptual characteristics of a recording in a concise and robust
form.
Fast: Sequential scanning and similarity calculation can be too slow for
huge databases.
Correct: Should return the qualifying objects without missing any, i.e.,
a low False Rejection Rate (FRR).
Memory efficient: The memory overhead of the search method should be
relatively small.
Easily updatable: Insertion, deletion and updating of objects should be
easy.
The last block of the system - the hypothesis testing (see Fig. 2.1) - computes
a reliability measure indicating how confident the system is about
identification.
2.1.3.1. Front-End
• Dimensionality reduction
• Perceptually meaningful parameters (similar to those used by the
human auditory system)
• Invariance / robustness (to channel distortions, background noise, etc.)
• Temporal correlation (systems that capture spectral dynamics).
2.1.3.2. Preprocessing
In a first step, the audio is digitized (if necessary) and converted to a
common format, e.g., mono PCM (16 bits) with a fixed sampling rate (ranging
from 5 to 44.1 kHz). Sometimes the audio is preprocessed to simulate the
channel, e.g., band-pass filtered in a telephone identification task. Other
types of preprocessing are a GSM coder/decoder in a mobile phone
identification system, pre-emphasis, and amplitude normalization (bounding
the dynamic range to (-1,1)).
The idea behind linear transforms is the projection of the set of measurements
to a new set of features. If the transform is suitably chosen, the redundancy is
significantly reduced. There are optimal transforms in the sense of information
packing and decorrelation properties, like Karhunen-Loève (KL) or Singular
Value Decomposition (SVD) [Theodoris1999]. These transforms, however, are
problem dependent and computationally complex. For that reason, lower
complexity transforms using fixed basis vectors are more common. Most
CBID methods therefore use standard transforms from time to frequency
domain to facilitate efficient compression, noise removal and subsequent
processing. [Lourens1990] and [Kurth2002] use power measures for
computational simplicity and to model highly distorted sequences, where the
time-frequency analysis exhibits distortions, respectively. The power can still
be seen as a simplified time-frequency distribution, with only one frequency
bin.
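As an illustration of this projection idea, the following is a minimal MATLAB
sketch that decorrelates a set of feature vectors with an SVD-derived
(KL-style) basis and keeps only the strongest components; all dimensions are
illustrative, not values from any cited system.

% Minimal sketch of transform-based redundancy reduction: project
% centered feature vectors onto an SVD basis and keep the first
% components. Dimensions are illustrative.
X = randn(1000, 20);                        % 1000 frames of 20-dim features
X = X - repmat(mean(X, 1), size(X, 1), 1);  % center the data
[~, ~, V] = svd(X, 'econ');                 % right singular vectors = basis
Y = X * V(:, 1:8);                          % reduced, decorrelated features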
The audio features try to describe the acoustic content of the audio excerpt.
Every feature provides a different 'view' of the analyzed audio. The main
features used in audio fingerprinting systems are MFCCs, spectral flatness,
chroma features, pitch, bass, frequency modulation, energy filter banks and
other high-level descriptors.
Sukittanon and Atlas claim that spectral estimates and related features only
are inadequate when audio channel distortion occurs [Sukittanon2002];
[Sukittanon2004]. They propose modulation frequency analysis to
characterize the time-varying behavior of audio signals. In this case, features
correspond to the geometric mean of the modulation frequency estimation of
the energy of 19 bark-spaced band-filters.
Fig. 2.4. Audio feature extraction. Figure reproduced from [Cano2002a] with
permission of the author.
The most common features are MFCCs. MFCCs are short-term spectral-based
features and are the dominant features used for speech recognition; they are
also used in music modeling applications. Their success has been due to their
ability to represent the amplitude spectrum in a compact form. Each step in
the process of creating MFCC features is motivated by perceptual or
computational considerations [Logan2000]. For a block diagram of MFCC feature
extraction see fig. 2.5.
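To make these steps concrete, the following is a minimal MATLAB sketch of the
standard pipeline (window, FFT, Mel-spaced triangular filters, logarithm,
DCT). The filter and coefficient counts are illustrative defaults, not values
taken from any particular system described here.

% Minimal sketch of MFCC extraction for one audio frame x at rate fs.
function c = mfcc_frame(x, fs, nfilt, ncoef)
    x = x(:) .* hamming(length(x));           % windowing
    S = abs(fft(x)).^2;                       % power spectrum
    S = S(1:floor(length(x)/2)+1);            % keep the positive bins
    H = mel_filterbank(nfilt, length(S), fs); % triangular Mel-spaced bins
    e = log(H * S + eps);                     % log filter-bank energies
    c = dct(e);                               % decorrelate with the DCT
    c = c(1:ncoef);                           % keep the first coefficients
end

function H = mel_filterbank(nfilt, nbins, fs)
    mel  = @(f) 2595 * log10(1 + f/700);      % Hz -> Mel
    imel = @(m) 700 * (10.^(m/2595) - 1);     % Mel -> Hz
    f = imel(linspace(0, mel(fs/2), nfilt+2));   % band edges in Hz
    b = round(f/(fs/2)*(nbins-1)) + 1;           % edges as bin indices
    H = zeros(nfilt, nbins);
    for i = 1:nfilt                              % triangular weights
        H(i, b(i):b(i+1))   = linspace(0, 1, b(i+1)-b(i)+1);
        H(i, b(i+1):b(i+2)) = linspace(1, 0, b(i+2)-b(i+1)+1);
    end
end

For example, c = mfcc_frame(x, 8000, 20, 13) would return 13 coefficients for
one frame x sampled at 8 kHz.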
The main difference and controversy between speech and music recognition lies
in the scaling and smoothing step. The blocks before it do not have to change
much between music and speech, as the portion of audio taken is always short
enough to be considered stationary (typically 20 ms), so the first steps can
stay the same in both cases. Regarding scaling, Mel-scaling works well for
speech recognition, as for this purpose the information in the high
frequencies is less important than in the low and mid ones. This does not
hold for music, where the high frequencies can also carry a lot of
information. In [Logan2000] experiments were made to test which kind of
scaling, Mel or linear, worked better for music recognition, with no
conclusive results.
Exploiting redundancy in the time vicinity, inside a recording and across the
whole database, is useful to further reduce the fingerprint size. The type of
model chosen conditions the distance metric and also the design of indexing
algorithms for fast retrieval.
Similarity measures
Similarity measures are very much related to the type of model chosen. When
comparing vector sequences, a correlation metric is common. The Euclidean
distance, or slightly modified versions that deal with sequences of different
lengths, are used for instance in [Blum1999]. In [Sukittanon2002], the
classification is Nearest Neighbor using cross entropy estimation. In the
systems where the vector feature sequences are quantized, a Manhattan
distance (or Hamming when the quantization is binary) is common
[Haitsma2002b; Richly2000]. [Mihçak2001] suggest that another error metric,
which they call “Exponential Pseudo Norm” (EPN), could be more appropriate
to better distinguish between close and distant values with an emphasis
stronger than linear. In [Wang2003] a time-frequency analysis is performed,
marking the coordinates of local maxima of a spectrogram. This reduces the
search problem to one similar to astronavigation, in which a small patch of
time-frequency constellation points must be quickly located within a large
universe of points in a strip-chart universe with dimensions of bandlimited
frequency versus nearly a billion seconds in the database.
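As a toy illustration of the quantized case, the MATLAB fragment below
compares two binary fingerprints with a normalized Hamming distance (bit
error rate); the sizes, the simulated distortion and the decision threshold
are illustrative assumptions, not values taken from the cited systems.

f1 = rand(256, 32) > 0.5;               % a binary fingerprint block
flips = rand(size(f1)) < 0.05;          % simulate channel noise: 5% bit flips
f2 = xor(f1, flips);                    % distorted copy of the fingerprint
ber = sum(f1(:) ~= f2(:)) / numel(f1);  % normalized Hamming distance
is_match = ber < 0.35;                  % accept if the bit error rate is low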
Matching methods
The idea of using a simpler distance to quickly eliminate many hypotheses and
the use of indexing methods to overcome the brute-force exhaustive matching
with a more expensive distance is found in the CBID literature, e.g: in
[Kenyon1993]. [Haitsma2001] proposed an index of possible pieces of a
fingerprint that points to the positions in the songs. Provided that a piece of a
query’s fingerprint is free of errors (exact match), a list of candidate songs and
positions can be efficiently generated to exhaustively search through
[Haitsma2001]. In [Cano2002c], heuristics similar to those used in
computational biology for the comparison of DNA are used to speed up a
search in a system where the fingerprints are sequences of symbols.
[Kurth2002] presents an index that uses code words extracted from binary
sequences representing the audio. These approaches, although very fast,
make assumptions on the errors permitted in the words used to build the
index which could result in false dismissals. As demonstrated in
[Faloutsos1994], in order to guarantee no false dismissals, the simple
(coarse) distance used for discarding unpromising hypotheses must lower
bound the more expensive (fine) distance.
Features occurring in one file should also occur in the matching file with
the same relative time sequence. The problem of deciding whether a match has
been found reduces to detecting a significant cluster of points forming a
diagonal line within the scatterplot. Various techniques could be used to
perform the detection, for example a Hough transform or other robust
regression techniques, but these are overly general, computationally
expensive, and susceptible to outliers. Due to the rigid constraints of the
problem, the following technique solves the problem in approximately
N*log(N) time, where N is the number of points appearing on the scatterplot.
For the purposes of this discussion, we may assume that the slope of the
diagonal line is 1.0.
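A minimal MATLAB sketch of this idea follows. Under the stated assumption
that the slope is 1.0, every point on the diagonal shares the same offset
tref - tq, so detecting the diagonal reduces to sorting the offsets (the
N*log(N) part) and finding a sufficiently populated histogram bin. Names, bin
width and threshold are illustrative, not the cited implementation.

function [found, offset] = detect_diagonal(tq, tref, min_points)
    % tq, tref: times of matching features in the query and in the file
    d = sort(tref(:) - tq(:));      % offsets; sorting costs O(N log N)
    binw = 0.5;                     % tolerance around the diagonal, seconds
    edges = d(1):binw:(d(end) + binw);
    counts = histc(d, edges);       % histogram of the offsets
    [peak, idx] = max(counts);
    found = (peak >= min_points);   % significant cluster => match
    offset = edges(idx);            % time offset of the detected diagonal
end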
2.1.3.7. Post-processing
2.2. Vericast
We can see music as a sequence of audio events. The simplest example is a
monophonic piece of music: each note can be seen as an acoustic event and,
from this point of view, the piece is a sequence of events. Polyphonic music
is much more complicated, since several events occur simultaneously. In this
case we can define a set of abstract events that do not have any physical
meaning but that mathematically describe the sequence of complex music. With
this approach, we can build a database with the sequences of audio events of
all the music we want to identify.
The first step in a pattern matching system is the extraction of some features
from the raw audio pattern. We choose the parameter extraction method
depending on the nature of the audio signal as well as the application. Since
the aim of our system is to identify music behaving as close as possible to a
human being, it is sensible to approximate the human inner ear in the
parametrization stage. Therefore, we use a filter-bank based analysis
procedure. In speech recognition technology, mel-cepstrum coefficients
(MFCC) are well known and their behavior leads to high performance of the
systems [Batlle1998]. It can also be shown that MFCCs are well suited for
music analysis [Logan2000].
Channel estimation
Since we only have access to the distorted data and, due to the nature of the
problem, we cannot know what the distortion was, we need a method to recover
the original audio characteristics from the distorted signal without having
access to the manipulations this audio has suffered. Here we define the
channel as the combination of all possible distortions, like equalizations,
noise sources and DJ manipulations.
By filtering the parameters of the distorted audio with this filter, they are
converted, as close as possible, to the clean version. By removing this
channel effect from the received signal the identification performance is
greatly improved because all the distortions caused by any equalization and
transmission are removed. Therefore the system is able to deal not only with
clean CD audio but also with noisy broadcast audio.
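The exact channel-estimation filter is not detailed here; as an illustration
of the general idea, cepstral mean subtraction is one standard way to remove
a stationary convolutional channel from MFCC-like parameters, since a linear
filter becomes an approximately additive constant in the log-cepstral domain.
A one-line MATLAB sketch, assuming F is a coefficients-by-frames matrix (not
necessarily Vericast's method):

% Cepstral mean subtraction: a stationary convolutional channel adds a
% constant to each cepstral coefficient, so subtracting the per-coefficient
% mean over time removes it.
F_clean = F - repmat(mean(F, 2), 1, size(F, 2));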
Training
where O are the samples from the incomplete sample space, as opposed to the
samples of the complete sample space. We also suppose that there is at least
one transformation from the space of complete samples to the space of
incomplete samples.
Therefore, the training stage in our system is done in an iterative way
similar to the [Baum1967] algorithm widely used in speech recognition
systems. Speech systems use HMMs to model phonemes (or phonetically derived
units) but, unfortunately, in music identification systems we do not have any
clear kind of units to use. That is why at each iteration a new set of units
is estimated as part of the incomplete data, in order to jointly find the
sequence of probabilities and the set of abstract units that best describe
complex music. After some experimental results we found that a good set of
units is completely estimated after 25-30 iterations.
Audio identification
HMM training described in the previous section was aimed at obtaining the
maximum distance between all possible song models in order to increase
speed and reliability during the audio identification phase. Once the HMMs
are trained, the next steps toward building the entire system consist in getting
the song models and matching them against streaming audio signals.
Signature generation
Identification algorithm
Before the first stage of the system, it must be mentioned that the audio the
system accepts has to be in wav format at an 8 kHz sample rate. This means
the system has information up to 4 kHz to extract the fingerprint. This is
done to reduce the computational complexity, as the audio files are smaller
and have less information to be analyzed. It has been empirically shown that
with this audio format the system has optimal accuracy, taking into account
the compromise between time consumption and accuracy.
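As an illustration, the conversion to the required input format can be
sketched in MATLAB as follows ('input.wav' is a placeholder file name):

[x, fs0] = audioread('input.wav');   % read the original audio
x = mean(x, 2);                      % mix down to mono
x = resample(x, 8000, fs0);          % fixed 8 kHz rate (4 kHz bandwidth)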
The first stage in the system is to obtain a set of values that represent the
main characteristics of the audio samples from a database given by the
customer. A key assumption made at this step is that the signal can be
regarded as stationary over an interval of a few milliseconds. Thus, the
prime function of the front-end parameterization stage is to divide the input
sound into blocks and to derive some features from each block.
The spacing between blocks is 30 ms. As with all processing of this type, a
Hamming window function is applied to each block so as to minimize the signal
discontinuities at the beginning and end of each frame [Oppenheim1989]. After
that, the required spectral estimates are computed via Fourier analysis for
every 30 ms block, and then the MFCC coefficients are computed. The Fourier
spectrum is smoothed by integrating the spectral coefficients within
triangular frequency bins arranged on a non-linear scale called the
Mel-scale. The Mel-scale is designed to approximate the frequency resolution
of the human ear, being linear up to 1,000 Hz and logarithmic thereafter
[Ruggero1992]. In order to make the statistics of the estimated song power
spectrum approximately Gaussian, a logarithmic (compression) conversion is
applied to the filter-bank output. As said in [Logan2000], this kind of
spectral smoothing is demonstrated to be robust for speech recognition; this
has not been demonstrated for music, but it should not be harmful. The
computed MFCCs are then stored in a feature vector.
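Putting the described parameters together, a minimal sketch of the front-end
loop could look as follows, where audio is a vector of samples at 8 kHz and
mfcc_frame stands for any per-frame MFCC routine such as the earlier sketch;
the non-overlapping block length and the coefficient count are assumptions.

fs  = 8000;                       % system input format: 8 kHz
hop = round(0.030 * fs);          % 30 ms spacing between blocks
blk = hop;                        % assumed block length (no overlap)
nfrm = floor((length(audio) - blk) / hop) + 1;
F = zeros(13, nfrm);              % one feature vector per block
for n = 1:nfrm
    seg = audio((n-1)*hop + (1:blk));
    F(:, n) = mfcc_frame(seg, fs, 20, 13);  % Hamming window applied inside
end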
Once the fingerprint database is created, audio inputs can be fed to the
system to be identified. The format must be converted to wav at 8 kHz, as
said before. A fingerprint is automatically extracted for this input, and the
process is the same as the extraction of fingerprints for the database, up to
the creation of the feature vector. Then these vectors are compared against
the fingerprint database and the best candidates for this piece of audio are
given by the system. If the candidates are 'good' in terms of the system, a
match is given in the output (fig. 2.9).
The matching algorithm is the last block of the system framework. The
distances between the given candidates and the fingerprints of the database
are calculated. The matching step has the aim of returning the final output,
labeling the unlabeled audio in the input.
There is a very important fact regarding the matching method. Broken
sequences are very fragile and must be handled with care: a sequence can
break at some point but continue after that gap, because the algorithm
somehow failed when giving the candidates. To deal with these gaps, the
system has a parameter that can be turned on or off as appropriate. This
parameter, called 'segment search', allows choosing between two types of
matching methods; if one is enabled, the other is disabled.
As exposed before, the system is fragile against gaps produced in the
temporal sequence. Many temporal sequences are broken at the moment the audio
finishes, that is, at the correct moment. In other cases, though, the
sequence is broken before the end of the audio stream. After this gap the
temporal sequence is created again, following a good pattern, but too late,
as it was broken some time before. To deal with these gaps, the two matching
methods mentioned above work as follows.
(a) First matching method: one temporal point per candidate at each step of
the algorithm.
As said, at every step of the algorithm the system gives 20 candidates (by
default). With this method, one temporal point is given per candidate. That
temporal point is the instant of the song where the candidate has been found;
for example, the candidate could be 'Bob Marley – Is this love' and the
temporal point 2 minutes 31 seconds. This point is the most probable one
along the song. The temporal sequence must be generated from those temporal
points. To clarify how this works, see fig. 2.8 and fig. 2.9.
Fig. 2.9. Example of a temporal sequence with one temporal point (the most
probable) per candidate.
Fig. 2.10 shows the probability curve along a song and marks the most
probable point along it. That is the temporal point that will be taken into
account to generate the temporal sequence.
Fig. 2.11 shows an example of a candidate and the temporal points given for
it during 12 steps of the algorithm. As can be seen, a good temporal sequence
is created, but in the 11th step the sequence is broken. These temporal
sequences are created by consecutive candidates. If the sequence breaks but
continues later on, the parts cannot be linked, as the sequence is created
only by consecutive candidates, with no possibility of jumping the gaps. The
beginning time is then set at the beginning of the temporal sequence and the
end time at the last point of the good temporal sequence.
(b) Second matching method: many temporal points per candidate at each step
of the algorithm.
Here, many temporal points (100 by default) are given for each candidate at
every algorithm step (20 candidates by default per step). The algorithm then
tries to generate a good temporal sequence between these points, taking only
the ones that create a good sequence. As the sequence is created by
consecutive candidates, when a consecutive candidate does not have a temporal
point in the good temporal sequence, the sequence is broken and, as in the
first matching method, a match is given, with the beginning time set to the
first temporal point in the sequence and the end time to the last one. To
clarify how this works, see fig. 2.10 and fig. 2.11.
Fig. 2.11. Temporal gap present with many temporal points per candidate.
In fig. 2.10 we can see a probability curve along a song, with some points
marked that have a high probability of being the correct ones. In fig. 2.11
the temporal sequence is created from some of the probable points taken from
the probability curve. A temporal gap is created at some point. The temporal
sequence then continues correctly some time after but, as the temporal
sequence must be created by consecutive candidates, the temporal gap cannot
be avoided. The temporal sequences cannot be linked and the match is given
with a lower quality.
The system has some parameters that can be modified to improve the
identifications if necessary.
Chapter 3
Problem statement
The problem faced with Vericast has been around the identification of some
types of electronic music; as we will see, it is more related to repetitive
music than to electronic music itself. The question of why Vericast finds it
more difficult to identify pieces of electronic music could have multiple
answers. The present work proposes and develops one of these possible
answers, based on the problem that has been detected, which need not be the
only existing one.
As we have seen in depth in chapter 2, the audio input to the system is
divided so the system works frame by frame. From every frame a fingerprint
signature is extracted, and the acoustic vector that composes this signature
is filled with MFCC features. This extracted fingerprint is then compared
with the ones in a database, previously created offline with the fingerprints
of the audios that the user wants to control in some sense. If the audio
input is found in the database, the system returns a match as the output,
with the required information about it.
Some types of electronic music are based on loops that are repeated several
or many times during the same song. We hypothesize that in other types of
music, although there are parts of songs that are repeated (like the chorus),
they are played by humans and therefore have subtle variations that make them
similar but not exactly the same, so the system does not have the problem
explained before. In the case of electronic music, loops are exactly the same
and repeated many times through the song. This fact can confuse Vericast
because, when this type of song has to be detected, it is very likely to find
matches in many parts of the song for one precise part. The matching thus
jumps around, picking candidates that fail to be correct and therefore do not
have a good temporal continuation between them, so the system sometimes
cannot find a strong enough temporal relationship. This fact is shown in
fig. 3.1.
To find a match the system needs to find a series of candidates with good
temporal continuity within the song. As an explanatory example, consider a
song in our fingerprint database that has three choruses that are exactly the
same. The first chorus, isolated from the rest of the song, enters the system
as input; the system extracts the input fingerprint and looks for it in the
database. Having three equal choruses in the database, the system would
return randomized candidates from temporal points in each of these three
choruses. For each point of the input we have three probable points in the
database. The system might not find a good level of continuity between
candidates, as they jump from one chorus to another. The system may be able
to 'say' what song it is, but it might have problems guessing where it begins
and where it ends. The case of many styles of electronic music is exactly
this, and harder, because the loops are normally repeated more than three
times and this happens in nearly every song (see fig. 3.1).
However, the current system has a solution for this, as explained in chapter
2 (see 2.2.5, Matching method). Applying this solution we can have a
substantial improvement, but we can still have the same problem: the system
requires immediately consecutive candidates in good temporal sequence to find
matches, and sometimes this does not happen, due to the very nature of the
loops of electronic music. This problem is shown in fig. 3.2, with the
proposed solution applied: the temporal sequence is stronger, but a gap still
appears, breaking the sequence. Our solution aims to be more robust than the
one already implemented, taking into account that trying to find a temporal
sequence between consecutive candidates can be weak. If we could use more
than two consecutive temporal points to search for trajectories, we could
jump over the gaps in cases like the one shown in fig. 3.2.
Fig. 3.2. Above, the system with only one temporal point per candidate;
below, the system with the solution applied, where the gap is still present.
Chapter 4
Problem solution
4.1. Introduction
The system has the problem just explained in the previous chapter but,
through off-line monitoring of the information that the system gives at
intermediate points (before the output is produced), it can be seen that this
problem is solvable, and that the information already contained in the system
should be enough. It can be observed that there actually is good continuity
of candidates. In general, good temporal sequences are created, but sometimes
they are not continuous, because they require one candidate right after the
other, with no temporal gaps.
For example, imagine we have four candidates given by the system. The first
two candidates are correlated to the same part of the song, so they have good
continuity between them (see fig. 4.1). The next candidate, the third, is
located in another part of the song, breaking the continuity of the previous
two. In contrast, the fourth point has good continuity with the first two.
That is, we have a path consisting of three candidates with a gap in the
middle (the third). This can be extrapolated: the system gives candidates
that are really correlated with others but do not have to follow one right
after the other. There may be gaps between the best possible continuations.
Good trajectories are formed, so a solution might be to give more flexibility
to the algorithm that seeks continuity among candidates, to make it more
robust to those gaps/continuity breaks.
In the case exposed before, the system previous to the modifications, after
'seeing' the gap, would give a match with only the first two points. The new
modifications to the matching method can jump the gap and link the fourth
point, which is in good temporal sequence, with the first two (see fig. 4.1).
Another matching method has been designed and implemented to solve such
problems. It does not require continuity between consecutive candidates, but
continuity over time. The algorithm gives the system some time (many
match_time periods) to produce candidates, finds the trajectories among them,
computes the good continuities and keeps the best. The time given to seek
candidates with good continuity is a parameter of the new algorithm: the
larger the amount of time, the more candidates there will be, and the greater
the probability of finding a good continuity between them.
The maximum time we could let the algorithm take to find matches is the
duration of the input. Taking less time, we can also have good results, with
the benefit of obtaining them in real time. Here, real time means that we can
be identifying the input while it is sounding. This matters when the length
of the input is not known, for example when monitoring what is being played
in a radio broadcast. In this scenario, Vericast is supposed to give matches
in 'real time' for what is being played, since the user wants to know what is
being heard as soon as possible. If we were monitoring a radio where songs
sounded one after another, we could take a time of about 10 seconds, allowing
the algorithm to give us matches at this rate and thus keep an updated
playlist of what is being played.
Recall the second matching method described in chapter 2: that solution was
to give many probable temporal points for each candidate. The new algorithm
has been developed without taking those 'extra' points per candidate into
account, but if they were taken it might be able to find more good
trajectories and continuations. The complexity of the path search would
increase a lot, as where we had just one temporal point we would then have
many of them. This implementation has not been carried out, but it remains a
potential improvement.
Waiting step
This first step does not appear in the block diagram, as the only thing that
happens here is that the algorithm lets the system give candidates for some
time; we call this time the 'step time of the algorithm'. The 'step time of
the system', on the other hand, is the time at which the system gives each
new candidate (the match_time parameter). It must always be less than the
step time of the algorithm to be meaningful, as a number of candidates is
needed to search for good continuations. We will refer to these concepts as
'step of the algorithm' and 'step of the system', respectively.
Store candidates
This way we have a big matrix with the candidates (their IDs) by rows. Every
candidate has a row, with its ID in the first position and its temporal
points in the following columns. If a candidate that has appeared before does
not appear in the current step of the algorithm, the corresponding cell for
that temporal point is left blank. The form of the candidate matrix can be
seen in table 4.1. We find the candidates (all different) in the first
column, and then the temporal points (t), where the first index refers to the
candidate of that temporal point and the second index to the position of this
temporal point.
ID_1   t(1,1)   t(1,2)   ...   t(1,n)
ID_2   t(2,1)            ...   t(2,n)
 ...     ...      ...    ...     ...
ID_m   t(m,1)   t(m,2)   ...   t(m,n)

Table 4.1. Form of the candidate matrix: one row per candidate, with the ID
in the first column and one temporal-point column per step of the algorithm.
In the table we can see every different candidate in one single row. The
columns of a candidate contain the temporal points found for it. For every
step of the algorithm a new column is created. A blank cell means that in
that step of the algorithm the candidate was not found.
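A minimal MATLAB sketch of this bookkeeping, assuming the candidates of each
step arrive as a vector of IDs with their temporal points; absent candidates
simply keep a zero in that column, playing the role of the blank cell (names
are illustrative, not the original implementation):

function M = store_candidates(M, ids, tpoints, step)
    % M: candidate matrix (column 1 = id, column 1+step = temporal point)
    for k = 1:length(ids)
        row = find(M(:,1) == ids(k), 1);   % candidate already stored?
        if isempty(row)
            row = size(M,1) + 1;           % new candidate: open a new row
            M(row, 1) = ids(k);
        end
        M(row, 1 + step) = tpoints(k);     % temporal point of this step
    end
end

% Usage example:
M = zeros(0, 1);                           % empty candidate matrix
M = store_candidates(M, [12 7], [31.2 95.0], 1);
M = store_candidates(M, 12, 31.5, 2);      % candidate 7 absent: cell stays 0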
Split candidates
After the step time of the algorithm (20 s by default), all the candidates
are taken and treated separately. The candidates can be split easily, as
every row holds a single candidate. Every isolated candidate (every row),
with its temporal points, is then sent to the block where the trajectories
and the good temporal sequences are searched. This is the moment when the
points in the matrix are translated to coordinates, as explained in the next
point.
Calculate trajectories
The mechanism for searching good continuations for each candidate is itself
another function inside the proposed solution. The input to this block is a
set of coordinates for each individual isolated candidate. These coordinates
are formed, on one side, by the temporal points detected (query time Tq) and,
on the other, by the corresponding temporal points in reference to the audio
input to the system (reference time Tref). That is, imagine we have a
recording of unknown length. At minute 5:00 a whole song starts, lasting 3
minutes, and therefore ending at minute 8:00 of the recording. In this case
Tq is ideally formed by temporal points from 0 seconds to 3 minutes, and Tref
would have temporal points from minute 5:00 to minute 8:00. This way, we
would have a representation as in fig. 4.3.
So, ideally we should have a 1:1 relationship between Tq and Tref, with an
offset on Tref. Thus, if we represent the coordinates on axes, good
continuities should have trajectories with a slope ~= 1. This can vary a bit,
since the input audio to the system may have suffered distortions like time
stretching or pitch shifting (as in DJ sessions).
To calculate these trajectories we start from the first coordinate. The
trajectories between this first coordinate and the next ones are calculated.
To be considered a good trajectory, a trajectory must have a slope equal to
1 +/- the admitted error; in that case the trajectory (characterized by the
slope and the offset point) is stored, otherwise it is discarded. After
having calculated all the trajectories between the first coordinate and the
others, the same is done with the second coordinate: every trajectory between
this second coordinate and the ones after it is calculated. Then the third,
and so on, until the last one. Therefore, if we had N coordinates, we would
have to calculate (N-1) + (N-2) + ... + 1 = N(N-1)/2 trajectories.
For every trajectory there is a counter for the number of points belonging to
it. If a trajectory is repeated (same slope and same offset point), the
counter is incremented (+1) for that trajectory. To be considered a good
trajectory, a threshold on the number of points that have to belong to the
trajectory is given. If that threshold is reached, the candidate and the
temporal points creating the good temporal sequence are considered a partial
match. So at every step of the algorithm, the mechanism returns partial
matches whenever any candidate has good continuity. We call it a partial
match because the trajectory created may not be complete: we have only taken
a few temporal points that may have a continuation later on. Remember that in
the waiting step the system waits, for example, 20 seconds, and a song may
last 180 seconds. We would then have 18 partial matches (9*2, because of the
half-step overlap) for only one song. As we will see in the next block, these
partial matches need to be linked.
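A minimal MATLAB sketch of this counting scheme, with illustrative names and
tolerances (the original function is reproduced in part further below):

function traj = count_trajectories(Tq, Tref, slope_error)
    % Count, for every (slope, offset) pair with slope close to 1, how many
    % coordinate pairs support it; keys are rounded to unify near-equal values.
    traj = containers.Map('KeyType', 'char', 'ValueType', 'double');
    N = length(Tq);
    for i = 1:N-1
        for j = i+1:N                      % the N(N-1)/2 pairs of the text
            if Tq(j) == Tq(i), continue; end
            m = (Tref(j) - Tref(i)) / (Tq(j) - Tq(i));  % slope
            if abs(m - 1) <= slope_error
                offset = Tref(i) - m * Tq(i);           % offset point
                key = sprintf('%.2f_%.2f', m, offset);
                if isKey(traj, key)
                    traj(key) = traj(key) + 1;   % one more supporting point
                else
                    traj(key) = 2;   % a trajectory starts with two points
                end
            end
        end
    end
end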
Two partial matches must be linked if they belong to the same candidate and
the temporal trajectories of both have good temporal continuity. We should
also mention that there is an overlap of half the step time of the algorithm,
to avoid discontinuities between partial matches.
Therefore, when a partial match has the same candidate and good temporal
continuity with another from the next step of the algorithm, they are linked,
forming one bigger partial match. When, in the next step of the algorithm,
the candidate of the partial match does not repeat, or does not have good
continuity, it becomes a final match, which will be displayed in the output.
When we merge two partial matches belonging to the same candidate, we keep
this candidate; what has to be changed is the end time of the match. The
beginning time of the match is set by the beginning time of the first partial
match, and the end time by the final partial match that can be linked,
according to the linking rule explained above.
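A minimal sketch of this linking rule in MATLAB, assuming each partial match
is a struct with the candidate id (mid), its trajectory offset and its
begin/end times; the field names are illustrative:

function m = link_partial_matches(m, p, offset_error)
    % m: currently open match; p: partial match from the next algorithm step
    if (m.mid == p.mid) && (abs(m.offset - p.offset) <= offset_error)
        m.tend = p.tend;   % same candidate, good continuity: extend the match
    else
        % different candidate or broken continuity: m becomes a final match
        fprintf('final match %d: %.1f - %.1f s\n', m.mid, m.tbegin, m.tend);
        m = p;             % the new partial match opens the next match
    end
end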
Let us take again the example where the input is a 180-second song
immediately followed by another song, with a step time of the algorithm of 60
seconds. In principle we should have 6 partial matches (180/60 * 2, for the
mentioned overlap) for the candidate of the first song, with good continuity
between them. Then we should have a partial match from a different candidate,
corresponding to the next song. This block should merge those six partial
matches, forming a final match to be shown in the output.
%%%%%%%%%%%%%%%%%% SPLIT CANDIDATES %%%%%%%%%%%%%%%%%%
% The coordinates (counting from the second position: the first is the id)
points = [tref(frame-step+2:frame+1) taudio(frame-step+2:frame+1)];
%%%%%%%%%%%%%%%%%% CALCULATING TRAJECTORIES %%%%%%%%%%%%%%%%%%
[to,tf,offsethist] = trajectories_searching_mechanism(points);
if ~isempty(offsethist)
    pos = find(offsethist(:,3) >= points_number);
    if ~isempty(pos)
        nPartialMatches = nPartialMatches + 1;
        partialMatches(nPartialMatches).mid = mid;
        partialMatches(nPartialMatches).offsethist = offsethist(pos,:);
        partialMatches(nPartialMatches).step = frame/step;
        if ~isempty(matches)
            pos_mid = -1;                     % default: mid not found
            for t = 1:length(matches)
                if (matches(t).mid == mid)
                    pos_mid = t;
                end
            end
            if (pos_mid >= 1)                 % this id exists
                for k = 1:length(offsethist(:,1))
                    % Adding some error to unify 'very similar' offsets and
                    % m's: same offset and m, within a controlled error
                    pos_offsethist = find( ...
                        (round(matches(pos_mid).offsethist(:,1)*100000 + 0.00005) < round((offsethist(k,1) + offset_error)*100000 + 0.00005)) & ...
                        (round(matches(pos_mid).offsethist(:,1)*100000 + 0.00005) > round((offsethist(k,1) - offset_error)*100000 + 0.00005)) & ...
                        (round(matches(pos_mid).offsethist(:,2)*100000 + 0.00005) < round((offsethist(k,2) + m_error)*100000 + 0.00005)) & ...
                        (round(matches(pos_mid).offsethist(:,2)*100000 + 0.00005) > round((offsethist(k,2) - m_error)*100000 + 0.00005)) );
                    if ~isempty(pos_offsethist)                  % found
                        matches(pos_mid).offsethist(pos_offsethist,5) = offsethist(k,5);  % update 'te'
                        matches(pos_mid).offsethist(pos_offsethist,7) = offsethist(k,7);  % update 'teRef'
                        matches(pos_mid).offsethist(pos_offsethist,3) = matches(pos_mid).offsethist(pos_offsethist,3) + offsethist(k,3);  % add the counts
                    else
                        matches(pos_mid).offsethist = [matches(pos_mid).offsethist; offsethist(k,:)];
                    end
                end
            else
                nMatches = nMatches + 1;
                matches(nMatches).mid = mid;
                matches(nMatches).offsethist = offsethist(pos,:);
                matches(nMatches).step = frame/step;
            end
        else
            nMatches = nMatches + 1;          % first match: initialize
            matches(nMatches).mid = mid;
            matches(nMatches).offsethist = offsethist(pos,:);
            matches(nMatches).step = frame/step;
        end
    end
end
i = i + 2;
end  % end of the enclosing loop over candidate rows (header not shown here)
for j = 1:(cand_row_length/2)
    if (i+1 > cand_row_length || i > cand_row_length)
        % do nothing: no complete row pair left
    else
        taudio = (candidate_matrix(i,:))';    % taking the rows
        tref   = (candidate_matrix(i+1,:))';  % taking the rows, tref
        mid = taudio(1);
        % counting from the second position: the first position is the id
        points = [tref(frame-framesLeft+2:frame+1) taudio(frame-framesLeft+2:frame+1)];
        [to,tf,offsethist] = matchDiagonals(points);
        if ~isempty(offsethist)
            pos = find(offsethist(:,3) >= points_number);
            if ~isempty(pos)
                nPartialMatches = nPartialMatches + 1;
                partialMatches(nPartialMatches).mid = mid;
                partialMatches(nPartialMatches).offsethist = offsethist(pos,:);
                partialMatches(nPartialMatches).step = frame/step;
            end
        end
        i = i + 2;
    end
end
The trajectory-searching function returns the trajectories found, with the
beginning time of the match (to) and the end time of the match (tf). The
parameters that characterize a trajectory, namely the slope (m), the offset
point (offset), the number of points belonging to the trajectory (counts) and
the beginning time (tb) and end time (te) of the trajectory, are stored in
the variable offsethist.
function [to,tf,offsethist] = trajectories_searching_mechanism(points)
%|-----------------------------------------------------------|
%| trajectories_searching_mechanism (matchDiagonals)         |
%|                                                           |
%| trajectories_matrix = matrix of the possible trajectories |
%| to = initial time of the match                            |
%| tf = end/final time of the match                          |
%| offsethist = [offset m counts tb te]                      |
%|   --> histogram of the trajectories with:                 |
%|       offset = offset point                               |
%|       m = slope                                           |
%|       counts = number of points belonging to that         |
%|                trajectory                                  |
%|       tb = beginning time of the trajectory               |
%|       te = end time of the trajectory                     |
%|-----------------------------------------------------------|
% To be a match there has to be more than one point in the diagonal;
% the beginning and end times of the match are returned.
if (max(sim) > counted_matches)
    % the position of the first point is given by the row with the
    % maximum number of similitudes
    po = max_sim;
    to = points(po,2);   % 2nd column, where the time_input is
    % find the last valid match (slope close to 1)
    pf = find(trajectories_matrix(max_sim,:) > 0.9 & trajectories_matrix(max_sim,:) < 1.1, 1, 'last');
    % from relative to absolute position (adding the position of the row
    % with the maximum number of similitudes)
    pf = pf + max_sim;
    tf = points(pf,2);   % 2nd column, where the time_input is
else
    to = -1;
    tf = -1;
end
Trec: the time (in seconds) that the algorithm waits for the system to give
candidates.
These two parameters are selected a priori and do not depend on the length
(n) of the fingerprint database. Usually, the complexity of fingerprinting
algorithms depends basically on the length of the fingerprint database where
the comparisons are done, and this database can have more than one million
fingerprints. As the proposed solution does not depend on n, its complexity
does not represent an important overhead with respect to the complexity the
system already had:

O' = O(n) + K

where O(n) is the complexity of the existing algorithm and K is the constant
added by the new algorithm, with K << O(n).
Chapter 5
Methodology
To perform the testing we have divided the audio files into portions of 30
seconds, to stress the system a little more, given the nature of the problem
at hand. That way the system does not have the whole song available to find a
suitable match, and jumps between candidates should be more evident. The
shorter the audio stream to be detected, the more difficult it is for the
system to find a strong temporal sequence for it.
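As an illustration, the preparation of the 30-second portions can be sketched
in MATLAB as follows ('song.wav' and the output names are placeholders):

[x, fs] = audioread('song.wav');
plen = 30 * fs;                            % 30-second portions
for k = 1:floor(length(x) / plen)
    seg = x((k-1)*plen + (1:plen), :);     % keep all channels
    audiowrite(sprintf('portion_%02d.wav', k), seg, fs);
end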
(a) Accuracy (song detection): the song found for the piece at a certain
point and the song in the groundtruth at the same time coincide.
(b) Portion accuracy (portion detection): in addition to the song detection,
the correct portion is also found.
(c) Portion error (error in the portion detection): the error in the
detection of the piece is calculated (in seconds) as

Eq = |Tiq - Tig| + |Tfq - Tfg|

where Tiq and Tfq are the initial and final temporal points detected for a
certain query (q), and Tig and Tfg are the initial and final temporal points
in the groundtruth (g), that is, the real content of the audio.
Chapter 6
Results
The first thing to remark is that, as expected, the lowest accuracy found is
in techno music, followed by classical and dance with the same level of
accuracy. The low accuracy on dance music was also expected, as it has a
structure similar to techno, with highly repetitive audio loops. The
hypotheses to explain the lower accuracy obtained in classical music are not
so clear. Listening to the excerpts, many silences appear along them; this
could be one of the aspects that confuse the matching system. Another
hypothesis is the great precision of classical musicians: we can hypothesize
that if there are two equal parts to be played, they can play them nearly
exactly the same and confuse the system. As can be seen, the accuracies for
the rest of the music styles are quite similar, between 92% and 95%. The
portion error is also higher for the styles with lower accuracy and, in this
case, techno and dance are quite differentiated from the other genres.
Given the way the algorithm was designed and works, we expected it to raise
the accuracy on those music genres with a high level of repetitiveness. But
as can be seen in the table comparing the old system with the new algorithm,
the accuracy of every involved genre was raised. We can thus deduce that the
problem found in techno music also appears in the other genres, although to a
lesser extent. The largest improvement has been in classical music, around
8%; as with the low accuracy of classical music under the old system, it is
difficult to hypothesize why there is such an improvement. The next largest
improvement is in techno music, which was expected, as the new algorithm
focused on solving the problem around repetitive music. It is also remarkable
that the improvement in jazz music has been similarly high, close to the one
obtained in techno, reaching an accuracy close to 100%.
The accuracy of the old system was 91.55% on average. With the new algorithm
implemented, the accuracy has risen to 95.83% on average. The portion
accuracy has, in general, been maintained. The portion error has improved
only for techno music. This may be due to a bug that affects the limits of
the detected portions (beginning and end times); it should be revised so as
to at least reach the level of the previous system. In spite of that, it must
be said that the portion error is only relevant for certain applications, and
a value between 2 and 4 seconds is not so significant.
Chapter 7
Conclusions
The state of the art of fingerprints, fingerprinting systems and the Vericast
audio identification product has been reviewed. The problem statement has
been explained in detail, and a solution to this problem has been developed
and explained in depth.
A test set has been formalized and developed, confirming that the system had
lower accuracy on techno music identification. As explained in this work, the
problem revolves around the repetitiveness of some electronic genres, so it
was refocused not on electronic music but on repetitive music. A solution has
been proposed and implemented, consisting of a matching algorithm situated at
the end of the framework of Vericast, the audio identification product we
were trying to improve. This new matching method has been evaluated for
accuracy, and the results show that the implemented solution increased the
accuracy in all genres, between 2.58% (dance music) and 8.37% (classical
music). The average accuracy has grown from 91.55% to 95.83%. In particular,
for techno music the increase has been 5.24%.
Chapter 8
References
Allamanche, E., Herre, J., Helmuth, O., Fröba, B., Kasten, T., and Cremer, M.
(2001). Content-based identification of audio material using mpeg-7 low level
description. In Proc. of the International Symp. of Music Information Retrieval
(ISMIR).
Batlle, E., Massip, J., Guaus, E. (2002). Automatic song identification in noisy
broadcast audio. In Proc. of the Signal and Image Processing (SIP).
Blum, T., Keislar, D., Wheaton, J., and Wold, E. (1999). Method and article of
manufacture for content-based analysis, storage, retrieval and segmentation
of audio information.
Boney, L., Tewfik, A., Hamdy, K. (1996). Digital watermarks for audio signals.
In IEEE Proceedings Multimedia, pp. 473-480.
Burges, C., Platt, J., Jana, S. (2003). Distortion discriminant analysis for audio
fingerprinting. In IEEE Transactions on Speech and Audio Processing, vol. 11,
no. 3, pp. 165-174.
Cano, P., Batlle, E., Kalker, T., Haitsma, J. (2002a). A review of algorithms for
audio fingerprinting. In Proc. of the IEEE Workshop on Multimedia Signal
Processing, St. Thomas, V.I.
Cano, P., Kaltenbrunner, M., Gouyon, F., Batlle, E. (2002b). On the use of
fastmap for audio information retrieval. In Proc. of the International Symp. on
Music Information Retrieval (ISMIR).
Cano, P., Batlle, E., Mayer, H., Neuschmied, H. (2002c). Robust sound
modeling for song detection in broadcast audio. In Proc. AES 112th
International Conv.
Cano, P., Batlle, E., Gómez, E., Gomes, L., Bonnet, M. (2005). Audio
Fingerprinting: concepts and applications. Studies in Computational
Intelligence (SCI), pp. 233-245.
Dannenberg, R., Foote, J., Tzanetakis, G., Weare, C. (2001). Panel: New
directions in music information retrieval. In Proc. of the International
Computing Music Conference.
Gomes, L., Cano P., Gómez, E., Bonnet, M., Batlle, E. (2003). Audio
watermarking and fingerprinting: For which applications? In Journal of New
Music Research, 32, pp. 65-82.
Gómez, E., Cano, P., Gomes, L., Batlle, E., Bonnet, M. (2002). Mixed
watermarking and fingerprinting approach for integrity verification of audio
recordings. In Proc. of the International Telecommunications Symp.
Haitsma, J., Kalker, T., Oostveen, J. (2001). Robust audio hashing for
content identification. In Proc. of the Content-Based Multimedia Indexing
(CBMI).
Kimura, A., Kashino, K., Kurozumi, T., Murase, H. (2001). Very quick audio
searching: introducing global pruning to the time-series active search. In Proc.
of International Conference on Computational Intelligence and Multimedia
Applications.
Luo, H., Chu, S., Lu, Z. (2008). Self embedding watermarking using halftoning
technique. Circuits, Systems, and Signal Processing, vol. 27, no. 2,
pp. 155-170.
Park, M., Kim, H., Yang, S. (2006). Frequency-temporal filtering for a robust
audio fingerprinting scheme in real-noise environments. Electronics and
Telecommunications Research Institute (ETRI) Journal, pp. 509-512.
Richly, G., Varga, L., Kovács, F., Hosszú, G. (2000). Short-term sound stream
characterization for reliable, real-time occurrence monitoring of given
soundprints. In Proc. 10th Mediterranean Electrotechnical Conference,
MEleCon.
Riley, M., Heinen, E., Ghosh, J. (2008). A text retrieval approach to content-
based audio retrieval. In International Symp. on Music Information Retrieval
(ISMIR), pp. 295-300.
Seo, J., Jin, M., Lee, S., Jang, D., Lee, S., Yoo, C. (2005). Audio fingerprinting
based on normalized spectral subband centroids. In Proc. of the International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
Philadelphia, PA.
Sukittanon, S., Atlas, L., Pitton, J. (2004). Modulation scale analysis for
content identification. In IEEE Transactions on Signal Processing,
pp. 3023-3035.
Viterbi, A. (1967). Error bounds for convolutional codes and an
asymptotically optimum decoding algorithm. In IEEE Trans. Info. Theory,
vol. 13, no. 2, pp. 260-269.