
Quality assessment and enhancement of an industrial-strength audio fingerprinting system

Cristian Quirante Catalán

MASTER THESIS UPF / 2009


Master in Sound and Music Computing

Master thesis supervisors:


Joan Serrà and Pedro Cano
Department of Information and Communication Technologies
Universitat Pompeu Fabra, Barcelona

Abstract

This master thesis presents work aimed at improving an industrial-strength
audio recognition product based on fingerprinting, which appeared to perform
worse on electronic music than on other musical genres. Until now, testing on
genres treated separately (including electronic music) had not been
formalized; the assumption that the system performed worst on electronic
music was based on informal observation of specific cases reported by users,
customers and developers/testers. This work (a) formalizes accuracy tests for
the recognition of songs of different genres, (b) empirically reproduces the
aforementioned observations regarding electronic music, (c) identifies why the
system had problems detecting this kind of music, and (d) proposes, develops,
and evaluates a solution to the encountered problems. This solution achieves
a 4.28% increase in recognition accuracy and does not introduce a significant
algorithmic complexity overhead.

Acknowledgements

I would like to thank my supervisors Joan Serrà, from the Music Technology
Group (MTG), and Pedro Cano, from Barcelona Music & Audio Technologies
(BMAT), for all the support given during the whole work, and especially for
always being close by to give advice and provide the necessary help, at any
time and in such a positive way.

Cristian.
Contents

Abstract............................................................................................................. i

Acknowledgments............................................................................................. ii

Chapter 1: Introduction 1
1.1. Motivation................................................................................................... 1
1.2. Audio fingerprinting..................................................................................... 1
• 1.2.1. Properties of audio fingerprinting.................................................. 2
• 1.2.2. Requirements for audio fingerprinting........................................... 3
1.3. Initial problem statement and hypotheses.................................................. 4
1.4. Goals and organization of the thesis.......................................................... 5

Chapter 2: State Of The Art 7


2.1. Audio fingerprinting systems....................................................................... 7
• 2.1.1. Uses.............................................................................................. 8
• 2.1.2. Applications................................................................................... 9
• 2.1.3. General fingerprinting framework................................................. 12
2.2. Vericast....................................................................................................... 23
• 2.2.1. System overview........................................................................... 23
• 2.2.2. Framework overview..................................................................... 23
• 2.2.3. Steps for the identification............................................................ 24
• 2.2.4. Fingerprint database creation....................................................... 27
• 2.2.5. Input audio to be identified and matching process....................... 28
• 2.2.6. Matching method.......................................................................... 29
• 2.2.7. Configuration parameters............................................................. 33

Chapter 3: Problem statement 34


Chapter 4: Problem solution 37
4.1. Introduction................................................................................................. 37
4.2. Implemented algorithm explanation............................................................ 39
4.3. The code..................................................................................................... 44
• 4.3.1. Function: new_matching_algorithm.............................................. 44
• 4.3.2. Function: trajectories_searching_mechanism.............................. 47
4.4. Complexity overhead of the proposed solution.......................................... 48

Chapter 5: Methodology 50
5.1. Test set....................................................................................................... 50
5.2. Evaluation measures.................................................................................. 51

Chapter 6: Results 52
6.1. Results: system previous to modifications.................................................. 52
6.2. Results: new matching algorithm versus existing matching method.......... 53

Chapter 7: Conclusions 55

Chapter 8: References 56
List of figures

2.1 Fingerprint system 13


2.2 Content-based audio identification framework 14
2.3 Fingerprint extraction framework: front-end (top)
and fingerprint modeling (bottom) 15
2.4 Audio features extraction 17
2.5 MFCC extraction 18
2.6 Feature extraction in Vericast 28
2.7 Fingerprinting matching system 29
2.8 Taking the most probable point 30
2.9 Example of a temporal sequence with one temporal point
(the most probable) per candidate 30
2.10 Taking many probable points 31
2.11 Temporal gap present with many temporal points per candidate 32
3.1 'Loop 2' is detected in many points of the song 35
3.2 Top: the system with only one temporal point per candidate;
bottom: the solution of the system is applied, but the gap is still present. 36
4.1 Example of good continuity with a gap 38
4.2 Block diagram of the algorithm 39
4.3 Example of ideal input 41
4.4 Characterizing the trajectories 42
List of tables

4.1 Candidate matrix (candidate_matrix in the code) 40


6.1 Results of the system previous to modifications 52
6.2 Results of the system previous to modifications versus new
matching method 53

Chapter 1
Introduction

1.1. Motivation
With the rise of digital media and the Internet as a means of audio file
distribution, large amounts of audio files belonging to companies, authors'
societies, consumers, etc. fall beyond the control that their owners might want
to have over them. Possible examples are rights holders who own the copyright
of a collection of audio recordings and want to monitor television, radio or
festivals to know whether any of their copyrighted audio is being broadcast,
either simply to keep track of it or to collect the legal royalties. Another
example is companies that place advertisements in various communication
channels and want to know whether they were aired at the contracted time and
with what emission quality. Automatic monitoring thus becomes a potential
need that might grow in the future, given the entropy of audio files on the
network and the many and varied distribution channels.

The main motivation for this work has been to make a contribution to the
Music Information Retrieval field. As the amount of audio files grows every
day, the need to keep track of what is going on around the web with these
files also grows. Nowadays fingerprinting technologies are increasingly used
and the service they provide more valuable, so it was interesting to work with
them and learn more about them. Finally, a gap seemed to exist in the
detection of electronic music with Vericast, the automatic audio identification
system by fingerprinting of the company BMAT. So there were several
important reasons that drove this work to its final form.

1.2. Audio fingerprinting


As the amount of audio files around the web increases exponentially,
fingerprinting is a technology that tries to provide a solution to automatic audio
identification, and it has become an established technology for solving this kind
of problem. A fingerprint is a signature derived from an audio file that contains
information about its perceptually relevant aspects.

These signatures can be stored in a database. If an unlabeled audio is input
to a fingerprinting system, its fingerprint signature is extracted and compared
to the ones in the database. If it is there, the system can label the previously
unknown audio.

An audio fingerprint is a content-based, unique, and compact signature
derived from perceptually relevant aspects of a recording [Cano2006]. Such a
signature summarizes an entire audio recording. Fingerprinting technologies
allow the monitoring of audio content without the need for metadata (title,
artist, year, etc.) or watermark embedding, i.e. a low-level noise-like signal
added to an audio file so that it can be detected by other systems and related
to some aspects of that audio [Gomes2003]. However, additional uses exist for
audio fingerprinting, like content-based recommendations [Cano2006] or cover
detection [Riley2008].

The extracted acoustically relevant characteristics of a piece of audio content
(the signature) are stored in a database. When presented with an unidentified
piece of audio content, characteristics of that piece are calculated and
matched against those stored in the database. Using fingerprints and efficient
matching algorithms, distorted versions (equalization, noise, speaker over
audio, re-recording, re-sampling, etc.) of a single recording can be identified
as the same music title.
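
As an illustration of this identification step, the following is a minimal brute-force
sketch in Python. It assumes that a fingerprint is simply a sequence of fixed-length
feature vectors, that the database is a mapping from track identifiers to such
fingerprints, and that a mean squared distance with an arbitrary rejection threshold
is used; real systems rely on the far more efficient matching and indexing methods
reviewed in chapter 2.

import numpy as np

def identify(query_fp, reference_db, threshold=0.15):
    """Compare a query fingerprint (sequence of feature vectors) against every
    reference fingerprint and return the closest match, or None if even the
    best distance is above the (arbitrary) rejection threshold."""
    best_id, best_dist = None, np.inf
    q = np.asarray(query_fp)
    for track_id, ref_fp in reference_db.items():
        r = np.asarray(ref_fp)
        # slide the (shorter) query over the reference and keep the best alignment
        dists = [np.mean((r[i:i + len(q)] - q) ** 2)
                 for i in range(0, len(r) - len(q) + 1)]
        d = min(dists) if dists else np.inf
        if d < best_dist:
            best_id, best_dist = track_id, d
    return best_id if best_dist < threshold else None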

1.2.1. Properties of audio fingerprinting


According to [Cano2006] the properties of an audio fingerprinting system
should be:

• Accuracy: The number of correct identifications (true positives), missed
identifications (false negatives), and wrong identifications (false positives).

• Reliability: Methods for assessing whether a query is present or not in the
database are of major importance in play list generation for copyright
enforcement organizations. In such cases, if a song has not been broadcast, it
should not be identified as a match, even at the cost of missing actual
matches. Approaches to deal with false positives have been treated for
instance in [Cano2001]. In other applications, like automatic labeling of MP3
files, avoiding false positives is not such a mandatory requirement.

• Robustness: Ability to accurately identify an item, regardless of the level of
compression and distortion or interference in the transmission channel. Ability
to identify whole music titles from excerpts of a few seconds (known as
cropping or granularity), which requires methods for dealing with lack of
synchronization. Other sources of degradation are re-sampling, re-recording,
time stretching and expansion, pitching, equalization, background noise,
D/A-A/D conversion, audio coders (such as GSM and MP3), etc. [Batlle2002].

• Security: Vulnerability of the solution to cracking or tampering (manipulating
the files so as to affect the audio signal). In contrast with the robustness
requirement, the manipulations to deal with are designed to fool the fingerprint
identification algorithm.

• Versatility: Ability to identify audio regardless of the audio format (WAV,
AIFF, MP3, OGG, etc.). Ability to use the same database for different
applications.

• Scalability: Performance with very large databases of titles or a large
number of concurrent identifications. This affects the accuracy and the
complexity of the system.

• Complexity: This refers to the computational costs of the fingerprint
extraction, the size of the fingerprint, the complexity of the search, the
complexity of the fingerprint comparison, the cost of adding new items to the
database, etc.; in the end, the number of operations that the system has to
perform to identify an audio excerpt.

• Fragility: Some applications, such as content-integrity verification systems,
may require the detection of changes in the content. This is contrary to the
robustness requirement, as the fingerprint should be robust to
content-preserving transformations but not to other distortions. It would be
better to regard fragility as a fact to take into account rather than a property of
audio fingerprinting.

1.2.2. Requirements for audio fingerprinting


Improving a certain requirement often implies losing performance in some
other. Generally, a fingerprint should be:

• A perceptual digest of the recording. The fingerprint must retain the
maximum of acoustically relevant information. This digest should allow
discrimination over a large number of fingerprints. This may conflict with other
requirements, such as complexity and robustness.

• Invariant to distortions. This derives from the robustness requirement.
Content-integrity applications, however, relax this constraint for
content-preserving distortions in order to detect deliberate manipulations.

• Compact. A small-sized representation is interesting for complexity, since a
large number (maybe millions) of fingerprints need to be stored and
compared. An excessively short representation, however, might not be
sufficient to discriminate among recordings, affecting accuracy, reliability and
robustness.

• Easily computable. For complexity reasons, the extraction of the fingerprint
should not be excessively time-consuming.

1.3. Initial problem statement and hypotheses


A product intended to provide an audio fingerprinting identification solution is
Vericast, from the company BMAT (Barcelona Music & Audio Technologies).
This is a technology that allows the identification of audio if it belongs to a
database defined by the user of the product. This work revolves around an
improvement made to Vericast concerning the detection of electronic music,
where some problems had been noticed.

The reasons why this problem in the operation of the product had gone
unnoticed were quite simple. The songs to be detected were usually
non-electronic music, always with good results, with success rates around
95%. It must also be said that there was no formalized testing with genres
treated separately, especially with electronic music. Once there was a need to
use this technology to identify electronic music, it seemed not to reach the
success rates previously achieved for other musical styles. These
observations were basically submitted by customers and then confirmed by
BMAT developers.

The algorithm works on audio frames. For every audio frame (from the
unlabeled audio input), the system gives the most probable songs and the
temporal points within them that are supposedly entering the system to be
identified. These probable songs and their temporal points will be called
candidates throughout this work. The problem encountered in the matching
algorithm was that it was not receiving correct temporal trajectories between
candidates: the song was detected, but the temporal sequences between
candidates did not follow a clear order. Techno music, and many styles within
electronic music in general, is built on loops, i.e. repetitions of a piece of
audio. The algorithm detects these loops at various points of the song, as they
are repeated throughout it. Sometimes it does not detect the correct one, as
there are others that are very similar or completely identical, causing bad
temporal sequences.

After this observation the problem was redefined, as the problem was not in
the electronic music itself but rather related to 'repetitive music'.
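
To make the notion of a temporal trajectory concrete, the following is a small
illustrative sketch (it is not the actual Vericast matching code): for each query
frame we keep the reported candidate (song, temporal point), and a song is only
accepted when its temporal points advance roughly in step with the query frames.

from collections import defaultdict

def consistent_trajectory(candidates, hop_s=0.1, tolerance_s=0.5):
    """candidates: one (song_id, time_point_s) pair per query frame, i.e. the most
    probable candidate reported for that frame (hop_s and tolerance_s are
    illustrative values). A song scores when its time points follow the expected
    advance of one hop per frame."""
    scores, last_seen = defaultdict(int), {}
    for frame_idx, (song, t) in enumerate(candidates):
        if song in last_seen:
            prev_idx, prev_t = last_seen[song]
            expected = prev_t + (frame_idx - prev_idx) * hop_s
            if abs(t - expected) <= tolerance_s:  # coherent temporal trajectory
                scores[song] += 1
        last_seen[song] = (frame_idx, t)
    return max(scores, key=scores.get) if scores else None

With loop-based material the reported time points tend to jump between repetitions
of the same loop, so a coherence score of this kind stays low even though the song
is actually playing; this is exactly the kind of behavior that the new matching
method of chapter 4 addresses.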

To formalize the problem and check that it really existed, it was decided to
develop a test. The test basically consisted of many songs from different
genres of music with the tags pop, rhythm 'n' blues, hip hop, dance, jazz,
classical and techno, the last one representing the electronic music style. The
songs were divided by genre, with the same amount of excerpts for every
genre. Every genre was tested separately to check whether the accuracy for
techno music (and maybe for some variants like dance) was lower than for the
other genres.

After the test it was noticed that the system was actually responding with less
accuracy for the songs tagged as 'techno' and secondly for the ‘dance’ songs,
a style that also has similarities with 'techno' (see section 6.1).

Several hypotheses were put forward to (a) explain and (b) fix the problem
that was occurring. The proposed solutions were basically:

1. To compute the similarity distance in another way
2. To tune the parameters of the system in another way
3. To use other features to fill the acoustic vector
4. To revise the matching algorithm.

(1) The similarity distance had already been thoroughly revised and the one
developed in-house seemed to be an efficient one. (2) Tuning the parameters
differently might give good results for specific cases, but it is difficult to find a
general pattern. (3) Works like [Logan2000] show that MFCCs (the kind of
features used to fill the acoustic vector) can satisfactorily model the human
perception of sound. In conclusion, of these possible solutions, the one that
might change the results the most was (4) to revise the matching algorithm
and propose some improvements, either adding new phases to the algorithm
or writing an entirely new one.

1.4. Goals and organization of the thesis


The first goal of the thesis is to formalize a test to compare the accuracy for
electronic music with that for other musical styles, so that we can actually
state whether the observations that the system has a worse accuracy for this
type of music are true.

Once the testing is formalized, the next step is to identify the different
hypotheses that could explain what is causing the problem, then take the
hypothesis that could improve the results the most and ultimately develop a
solution based on it. In the end, the solution taken was to develop a new
matching method (chapter 4); all of this is detailed in the chapters specifically
devoted to it.

For a proper understanding of this work, it is divided into the following
chapters: the State of the Art, where we will mainly review audio
fingerprinting and the operation of Vericast. Then, in the third chapter, the
problem statement will be described in detail. Fourthly, we will find the
proposed solution explained in depth. The methodology followed to test the
system modifications appears in the fifth chapter. The results will come out in
the sixth chapter. Finally, the conclusions of the work will be presented in the
seventh. The final chapter is for the references.

Chapter 2
State of the art

In this chapter the State of the Art will be reviewed. Firstly, the foundations of
fingerprinting systems and their common applications will be shown. Then, for
the audio identification scenario by fingerprinting, the main differences
between systems will be reviewed, comprising the features used to fill the
fingerprint signature and the algorithms implemented. The last section of the
chapter is an explanation of how Vericast works.

2.1. Audio fingerprinting systems


Audio fingerprinting has attracted a lot of attention for its audio monitoring
capabilities. An audio fingerprint is a content-based compact signature that
summarizes an audio recording. Audio fingerprinting or content-based
identification (CBID) technologies extract acoustic relevant characteristics of a
piece of audio content and store them in a database. When presented with an
unidentified piece of audio content, characteristics of that piece are calculated
and matched against those stored in the database. Using fingerprints and
efficient matching algorithms, distorted versions of a single recording can be
identified as the same music title (http://www.ifpi.org/site-
content/press/20010615.html).

The approach differs from an alternative existing solution to monitor audio


content: Audio Watermarking [Boney1996]. In audio watermarking, research
on psychoacoustics is conducted so that an arbitrary message, the
watermark, can be embedded in a recording without altering the perception of
the sound. Compliant devices can check for the presence of the watermark
before proceeding to operations that could result in copyright infringement. In
contrast, in audio fingerprinting, the message is automatically derived from the
perceptually most relevant components of sound. Compared to watermarking,
it is ideally less vulnerable to attacks and distortions since trying to modify this
message, the fingerprint, means alteration of the quality of the sound. It is
also suitable to deal with legacy content, that is, with audio material released
without watermark. In addition, it requires no modification of the audio content.
As a drawback, the complexity of fingerprinting is higher than that of
watermarking and there is the need for a connection to a fingerprint repository.
Contrary to watermarking, the message is not independent from the content. It
is therefore, for example, not possible to distinguish between perceptually
identical copies of a recording. For more detail on the relation between
fingerprinting and watermarking we refer to [Gomes2003].

2.1.1. Uses
Audio fingerprinting is a technology with a wide range of possible uses.
We could say that the main ones can be divided into those described below.

2.1.1.1. Audio Identification

Independently of the specific approach to extract the content-based compact
signature, a common architecture can be devised to describe the functionality
of fingerprinting when used for identification (http://www.ifpi.org/site-
content/press/20010615.html). The overall functionality mimics the way
humans perform the task. A memory of the works to be recognized is created
off-line (top); in the identification mode (bottom), unlabeled audio is presented
to the system to look for a match. The system compares the input audio with
the ones in the database. If the system finds that this audio is in the database,
it returns the required information about it as the output.

2.1.1.2. Integrity verification

Integrity verification aims at detecting the alteration of data. First, a fingerprint


is extracted from the original audio. In the verification phase, the fingerprint
extracted from the test signal is compared with the fingerprint of the original.
As a result, a report indicating whether the signal has been manipulated is
outputted. Optionally, the system can indicate the type of manipulation and
where in the audio it occurred. The verification data, which should be
significantly smaller than the audio data, can be sent along with the original
audio data (e.g. as a header) or stored in a database. A technique known as
self-embedding [Luo2008] avoids the need of a database or an especially
dedicated header, by embedding the content-based signature into the audio
data using watermarking. An example of such a system is described in
[Gómez2002].

2.1.1.3. Watermarking support

Audio fingerprinting can assist watermarking. Audio fingerprints can be used


to derive secret keys from the actual content. As described in [Mihçak2001],
using the same secret key for a number of different audio items may
compromise security, since each item may leak partial information about the
key. Perceptual hashing can help generate input-dependent keys for each
piece of audio. [Haitsma2001] suggest audio fingerprinting to enhance the
security of watermarks in the context of copy attacks. Copy attacks estimate a
watermark from watermarked content and transplant it to unmarked content.
Binding the watermark to the content can help to defeat this type of attack. In
addition, fingerprinting can be useful against insertion/deletion attacks that
cause desynchronization of the watermark detection: by using the fingerprint,
the detector is able to find anchor points in the audio stream and thus to
resynchronize at these locations [Mihçak2001].

2.1.1.4. Content-based audio retrieval and processing

Deriving compact signatures from complex multimedia objects and powerful


indexes to search media assets is an essential issue in Multimedia
Information Retrieval. Fingerprinting can extract information from the audio
signal at different abstraction levels, from low level descriptors to higher level
descriptors. Especially, higher level abstractions for modeling audio hold the
possibility to extend the fingerprinting usage modes to content-based
navigation search by similarity, content-based processing and other
applications of Music Information Retrieval [Dannenberg2003]. Adapting
existing efficient fingerprinting systems from identification to similarity
browsing can have a significant impact on the music distribution industry; for
example, iTunes (www.itunes.com) already provides such a service. At the
moment, however, most on-line music providers offer searching by editorial
data (artist, song name, and so on) or following links generated through
collaborative filtering. In a query-by-example scheme, the fingerprint of the
song can be used to retrieve not only the original version but also “similar”
ones [Cano2002b].

2.1.2. Applications
These are the common applications derived from the uses that fingerprinting
systems have:

2.1.2.1. Audio content monitoring and tracking

With audio fingerprinting technology an unlabeled audio can be recognized (if
it is already in a database). One of the main applications of that use is audio
monitoring of any type. It is possible to keep track of the audio of any
broadcasting station by making the fingerprinting technology 'listen' to what is
being played. The monitoring capabilities can be organized as follows:

(a) Monitoring at the distributor end

Content distributors may need to know whether they have the rights to
broadcast the content to consumers.
Fingerprinting can help identify unlabeled audio in TV and Radio channels
repositories. It can also identify unidentified audio content recovered from CD
plants and distributors in anti-piracy investigations (e.g. screening of master
recordings at CD manufacturing plants).

(b) Monitoring at the transmission channel

In many countries, radio stations must pay royalties for the music they air.
Rights holders need to monitor radio
transmissions in order to verify whether royalties are being properly paid.
Even in countries where radio stations can freely air music, rights holders are
interested in monitoring radio transmissions for statistical purposes.
Advertisers also need to monitor radio and TV transmissions to verify whether
commercials are being broadcast as agreed. The same is true for web
broadcasts. Other uses include chart compilations for statistical analysis of
program material or enforcement of “cultural laws” (e.g. French titles in
France). Fingerprinting-based monitoring systems are being used for this
purpose. The system “listens” to the radio and continuously updates a play list
of songs or commercials broadcast by each station. Of course, a database
containing fingerprints of all songs and commercials to be identified must be
available to the system, and this database must be updated as new songs
come out.

(c) Monitoring at the consumer end

In usage-policy monitoring applications, the goal is to avoid misuse of audio
signals by the consumer. We can conceive a
system where a piece of music is identified by means of a fingerprint and a
database is contacted to retrieve information about the rights. This information
dictates the behavior of compliant devices (e.g. CD and DVD players and
recorders, MP3 players or even computers) in accordance with the usage
policy. Compliant devices are required to be connected to a network in order
to access the database.

2.1.2.2. Added-Value Services

Content information is defined as information about an audio excerpt that is


relevant to the user or necessary for the intended application. Depending on
the application and the user profile, several levels of content information can
be defined. Here are some of the situations we can imagine:

• Content information describing an audio excerpt, such as rhythmic, melodic
or harmonic descriptions.
• Meta-data describing a musical work, how it was composed and how it was
recorded. For example: composer, year of composition, performer, date of
performance, studio recording/live performance.
• Other information concerning a musical work, such as album cover image,
album price, artist biography, information on the next concerts, etc.

Different user profiles can be defined. Common users would be interested in


general information about a musical work, such as title, composer, label and
year of edition; musicians might want to know which instruments were played,
while sound engineers could be interested in information about the recording
process. Content information can be structured by means of a music
description scheme (MusicDS), which is a structure of meta-data used to
describe and annotate audio data. The MPEG-7 standard proposes a
description scheme for multimedia content based on the XML metalanguage,
providing for easy data interchange between different equipment.

Some systems store content information in a database that is accessible


through the Internet. Fingerprinting can then be used to identify a recording
and retrieve the corresponding content information, regardless of support
type, file format or any other particularity of the audio data. For example,
MusicBrainz, Id3man or Moodlogic (www.musicbrainz.org, www.id3man.com,
www.moodlogic.com) automatically label collections of audio files; the user
can download a compatible player that extracts fingerprints and submits them
to a central server from which meta data associated to the recordings is
downloaded. Gracenote (www.gracenote.com), who has been providing
linking to music meta-data based on the TOC (Table of Contents) of a CD,
started offering audio fingerprinting technology to extend the linking from CD’s
TOC to the song level. Their audio identification method is used in
combination with text-based classifiers to improve accuracy. Another example
is the identification of a tune through mobile devices, e.g. a cell phone; this is
one of the most demanding situations in terms of robustness, as the audio
signal goes through radio distortion, D/A-A/D conversion, background noise
and GSM coding, mobile communication’s channel distortions and only a few
seconds of audio are available (e.g: www.shazam.com) [Wang2003].

2.1.2.3. Integrity verification systems

In some applications, the integrity of audio recordings must be established


before the signal can actually be used, i.e. one must assure that the recording
has not been modified or that it is not too distorted. If the signal undergoes
lossy compression, D/A-A/D conversion or other content-preserving
transformations in the transmission channel, integrity cannot be checked by
means of standard hash functions, since a single bit flip is sufficient for the
output of the hash function to change. Methods based on fragile watermarking
can also provide false alarms in such a context. Systems based on audio
fingerprinting, sometimes combined with watermarking, are being researched
to tackle this issue. Among some possible applications [Gómez2002], we can
name: checking that commercials are broadcast with the required length and
quality, verifying that a suspected infringing recording is in fact the same as
the recording whose ownership is known, etc.

2.1.3. General fingerprinting framework


In this section we will briefly review audio fingerprinting algorithms. In the
literature, the different approaches to fingerprinting are usually described with
different rationales and terminology depending on the background: Pattern
matching, Multimedia (Music) Information Retrieval or Cryptography (Robust
Hashing). In this section, we review different techniques mapping functional
parts to blocks of a unified framework. For a further review we refer the
interested reader to [Cano2006].

In spite of the different rationales behind the identification task, methods share
certain aspects. There are two fundamental processes: the fingerprint
extraction and the matching algorithm. The fingerprint extraction derives a set
of relevant perceptual characteristics of a recording in a concise and robust
form.

The solutions proposed to fulfill the above requirements imply a trade-off


between dimensionality reduction and information loss. The fingerprint
extraction consists of a front-end and a fingerprint modeling block (see fig.
2.1). The front-end computes a set of measurements from the signal. The
fingerprint model block defines the final fingerprint representation, e.g: a
vector, a trace of vectors, a codebook, a sequence of indexes to HMM sound
classes, a sequence of error correcting words or musically meaningful high-
level attributes. Given a fingerprint derived from a recording, the matching
algorithm searches a database of fingerprints to find the best match. A way of
comparing fingerprints, that is a similarity measure, is therefore needed. Since
the number of fingerprint comparisons is high in a large database and the
similarity can be expensive to compute, methods that speed up the search are
required. Some fingerprinting systems use a simpler similarity measure to
quickly discard candidates and the more precise but expensive similarity
measure for the reduced set of candidates. There are also methods that pre-
compute some distances off-line and build a data structure that allows
reducing the number of computations to do on-line.

According to [Baeza1999], good searching methods should be:

• Fast: Sequential scanning and similarity calculation can be too slow for
huge databases.
• Correct: Should return the qualifying objects, without missing any, i.e: a low
False Rejection Rate (FRR).
• Memory efficient: The memory overhead of the search method should be
relatively small.
• Easily updatable: Insertion, deletion and updating of objects should be easy.

The last block of the system - the hypothesis testing (see Fig. 2.1) - computes
a reliability measure indicating how confident the system is about
identification.

Fig. 2.1. Fingerprint system. Figure reproduced from [Cano2002a] with


permissions of the author.

The basic steps for the identification are:

(a) Database creation: The collection of works to be recognized is presented


to the system for the extraction of their fingerprint. The fingerprints are stored
in a database and can be linked to a tag or other meta-data relevant to each
recording.

Fig. 2.2. Content-based audio identification framework. Figure reproduced


from [Cano2005] with permissions of the author.

(b) Identification: The unlabeled audio is processed in order to extract the


fingerprint. The fingerprint is then compared with the fingerprints in the
database. If a match is found, the tag (information related to the fingerprint
like title, artist, duration, etc.) associated with the work is obtained from the
database. A reliability measure of the match can also be provided.

2.1.3.1. Front-End

The front-end converts an audio signal into a sequence of relevant features to


feed the fingerprint model block (see Fig. 2.2.). Several driving forces co-exist
in the design of the front-end [Cano2006]:

• Dimensionality reduction
• Perceptually meaningful parameters (similar to those used by the
human auditory system)
• Invariance / robustness (to channel distortions, background noise, etc.)
• Temporal correlation (systems that capture spectral dynamics).

In some applications, where the audio to identify is coded, for instance in


mp3, it is possible to by-pass some of the following blocks and extract the
features from the audio coded representation.

2.1.3.2. Preprocessing

In a first step, the audio is digitized (if necessary) and converted to a general
format, e.g: mono PCM (16 bits) with a fixed sampling rate (ranging from 5 to
44.1 kHz). Sometimes the audio is preprocessed to simulate the channel, e.g:
band-pass filtered in a telephone identification task. Other types of
preprocessing are a GSM coder/decoder in a mobile phone identification
system, pre-emphasis, and amplitude normalization (bounding the dynamic
range to (-1,1)).

2.1.3.3. Framing and Overlap

A key assumption in the measurement of characteristics of an audio signal is


that the signal can be regarded as stationary over an interval of a few
milliseconds. This allows us to work in the Fourier domain. Therefore, the
signal is divided into frames of a size comparable to the variation velocity of
the underlying acoustic events. The number of frames computed per second
is called frame rate. A tapered window function is applied to each block to
minimize the discontinuities at the beginning and end of the frame. Overlap
between windows must be applied to assure robustness to shifting (i.e. when
the input data is not perfectly aligned to the recording that was used for
generating the fingerprint). There is a trade-off between the robustness to
shifting and the computational complexity of the system: the higher the frame
rate, the more robust to shifting the system is but at a cost of a higher
computational load.
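
A minimal sketch of this framing step, assuming a mono signal stored as a NumPy
array; the frame length, overlap and window type are illustrative choices rather
than the parameters of any particular system.

import numpy as np

def frame_signal(signal, sr, frame_ms=30, overlap=0.5):
    """Split a mono signal into overlapping, tapered frames.
    Returns an array of shape (n_frames, frame_len); the frame rate in
    frames per second is sr / hop."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(frame_len * (1.0 - overlap))
    window = np.hanning(frame_len)  # tapered window to reduce edge discontinuities
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([signal[s:s + frame_len] * window for s in starts])

Increasing the overlap raises the frame rate and hence the robustness to shifting,
at the cost of more frames to process, which is precisely the trade-off described
above.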

Fig. 2.3. Fingerprint extraction framework: front-end (top) and fingerprint


modeling (bottom). Figure reproduced from [Cano2002a] with permissions of
the author.

2.1.3.4. Linear Transforms: Spectral Estimates

The idea behind linear transforms is the projection of the set of measurements
to a new set of features. If the transform is suitably chosen, the redundancy is
significantly reduced. There are optimal transforms in the sense of information
packing and decorrelation properties, like Karhunen-Loève (KL) or Singular
Value Decomposition (SVD) [Theodoris1999]. These transforms, however, are
problem dependent and computationally complex. For that reason, lower
complexity transforms using fixed basis vectors are more common. Most
CBID methods therefore use standard transforms from time to frequency
domain to facilitate efficient compression, noise removal and subsequent
processing. [Lourens1990] and [Kurth2002] use power measures for
computational simplicity and to model highly distorted sequences, where the
time-frequency analysis exhibits distortions, respectively. The power can still
be seen as a simplified time-frequency distribution, with only one frequency
bin.

The most common transformation is the Discrete Fourier Transform (DFT).


Some other transforms have been proposed: the Discrete Cosine Transform
(DCT), the Haar Transform or the Walsh-Hadamard Transform
[Subramanya1999]. [Richly2000] did a comparison of the DFT and the Walsh-
Hadamard Transform that revealed that the DFT is generally less sensitive to
shifting [Richly2000]. The Modulated Complex Transform (MCLT) used by
[Mihçak2001] and also by [Burges2003] exhibits approximate shift invariance
properties [Mihçak2001].

2.1.3.5. Feature Extraction

The audio features try to define the acoustic content of the audio excerpt. For
every feature we will have a different 'definition' of the analyzed audio. The
main features used in audio fingerprinting systems are MFCCs, spectral
flatness, chroma features, pitch, bass, frequency modulation, energy filter
banks and other high-level descriptors.

Once on a time-frequency representation, the most common representation,


additional transformations are applied in order to generate the final acoustic
vectors. In this step, we find a great diversity of algorithms. The objective is
again to reduce the dimensionality and, at the same time, to increase the
invariance to distortions. It is very common to include knowledge of the
transduction stages of the human auditory system to extract more
perceptually meaningful parameters. Therefore, many systems extract several
features performing a critical-band analysis of the spectrum (see fig. 2.4). In
[Cano2002a] and [Blum1999], Mel-Frequency Cepstrum Coefficients (MFCC)
are used. In [Allamanche2001], the choice is the Spectral Flatness Measure
(SFM), which is an estimation of the tone-like or noise-like quality for a band
in the spectrum. [Papaodysseus2001] presented the “band representative
vectors”, which are an ordered list of indexes of bands with prominent tones
(i.e. with peaks with significant amplitude). Energy of each band is used by
[Kimura2001]. Normalized spectral subband centroids are proposed by
[Seo2005]. [Haitsma2001] use the energies of 33 bark-scaled bands to obtain
their “hash string”, which is the sign of the energy band differences (both in
the time and the frequency axis) [Haitsma2001]; [Haitsma2002b].
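
The following is a rough sketch of this kind of binary hash string, following the
idea of taking the sign of band-energy differences along both the frequency and the
time axis; the band analysis itself is assumed to be done already, and the exact
parameters of [Haitsma2001] are not reproduced here.

import numpy as np

def hash_string(band_energies):
    """band_energies: array of shape (n_frames, n_bands), e.g. energies of
    bark-scaled bands per frame. Returns a binary matrix of shape
    (n_frames - 1, n_bands - 1): each bit is the sign of the energy difference
    taken first along the frequency axis and then along the time axis."""
    freq_diff = np.diff(band_energies, axis=1)   # difference between adjacent bands
    time_freq_diff = np.diff(freq_diff, axis=0)  # difference between consecutive frames
    return (time_freq_diff > 0).astype(np.uint8)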

Sukittanon and Atlas claim that spectral estimates and related features only
are inadequate when audio channel distortion occurs [Sukittanon2002];
[Sukittanon2004]. They propose modulation frequency analysis to
characterize the time-varying behavior of audio signals. In this case, features
correspond to the geometric mean of the modulation frequency estimation of
the energy of 19 bark-spaced band-filters.

Approaches from music information retrieval include features that have


proved valid for comparing sounds: harmonicity, bandwidth, loudness
[Blum1999]. [Burges2002] point out that the features commonly used are
heuristic, and as such, may not be optimal. For that reason, they use a
modified Karhunen-Loève transform, the Oriented Principal Component
Analysis (OPCA), to find the optimal features in an “unsupervised” way. If
Principal Component Analysis (PCA) finds a set of orthogonal directions which
maximize the signal variance, Oriented Principal Component Analysis (OPCA)
obtains a set of possibly non-orthogonal directions which take some
predefined distortions into account. In fig. 2.4 we can see a diagram block of
how the audio features are extracted from the spectrum of the signal input.

Fig. 2.4. Audio features extraction. Figure reproduced from [Cano2002a] with
permissions of the author.

Mel Frequency Cepstral Coefficients (MFCCs)

The most common features are MFCCs. MFCCs are short-term spectral-
based features and are the dominant features used for speech recognition,
and also used in applications of music modeling. Their success has been due
to their ability to represent amplitude spectrum in a compact form. Each step
in the process of creating MFCC features is motivated by perceptual or
computational considerations [Logan2000]. For a block diagram of an MFCC
feature extraction see fig. 2.5.

Fig. 2.5. MFCC extraction

The main difference and controversy between speech and music recognition
lies in the scaling and smoothing step. The preceding blocks do not have to
change much between music and speech, since the portion of audio taken is
always short enough to be considered stationary (typically 20 ms), so the first
steps can stay the same in both cases. Regarding scaling, Mel-scaling works
well for speech recognition, as the information in the high frequencies is less
important for this aim than in the low and mid ones. This does not hold for
music, where the high frequencies can also carry a lot of information; in
[Logan2000] some experiments were made to test which kind of scaling, Mel
or linear, worked better for music recognition, with no conclusive results.
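
A compact sketch of the MFCC pipeline of fig. 2.5 (power spectrum, Mel filterbank,
logarithm, DCT), written with NumPy/SciPy for a single, already windowed frame; the
number of filters and of retained coefficients are typical but arbitrary choices.

import numpy as np
from scipy.fftpack import dct

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)     # Hz -> Mel

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)   # Mel -> Hz

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """MFCCs of a single windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2     # power spectrum
    n_bins = len(spectrum)
    # triangular filters equally spaced on the Mel scale
    edges = mel_inv(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_bins - 1) * edges / (sr / 2.0)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        if mid > lo:
            fbank[i, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        if hi > mid:
            fbank[i, mid:hi] = np.linspace(1.0, 0.0, hi - mid)
    log_energies = np.log(fbank @ spectrum + 1e-10)  # perceptual compression
    return dct(log_energies, type=2, norm='ortho')[:n_coeffs]

Replacing the Mel-spaced edges with linearly spaced ones is all that would be
needed to reproduce the Mel-versus-linear comparison mentioned above.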

2.1.3.6. Fingerprint models

The fingerprint modeling block usually receives a sequence of feature vectors


calculated on a frame-by-frame basis. Exploiting redundancies in the frame
time vicinity, inside a recording and across the whole database, is useful to
further reduce the fingerprint size. The type of model chosen conditions the
distance metric and also the design of indexing algorithms for fast retrieval.

A very concise form of fingerprint is achieved by summarizing the


multidimensional vector sequences of a whole song (or a fragment of it) in a
single vector. Etantrum [www.freshmeat.net/projects/songprint] calculates the
vector out of the means and variances of, for instance, the 16 bank-filtered
energies corresponding to 30 sec of audio ending up with a signature of 512
bits. The signature along with information on the original audio format is sent
to a server for identification. Musicbrainz’ TRM signature
[http://musicbrainz.org/docs/20031108-2.html] includes in a vector: the
average zero crossing rate, the estimated beats per minute (BPM), an
average spectrum and some more features to represent a piece of audio
(corresponding to 26 sec). The two examples above are computationally
efficient and produce a very compact fingerprint. They have been designed for
applications like linking mp3 files to metadata (title, artist, etc.) and are more
tuned for low complexity (both on the client and the server side) than for
robustness (cropping or broadcast streaming audio).
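
A toy sketch of this single-vector kind of fingerprint: means and variances of
band-filtered energies over a fixed-length excerpt. The 16 bands and 30 seconds
are the figures quoted above for Etantrum; the rest (frame size, hop, band layout)
is an illustrative choice.

import numpy as np

def summary_fingerprint(signal, sr, n_bands=16, excerpt_s=30,
                        frame_len=2048, hop=1024):
    """Summarize an excerpt in a single vector: per-band log-energy means
    followed by per-band variances."""
    excerpt = signal[:excerpt_s * sr]
    energies = []
    for start in range(0, len(excerpt) - frame_len, hop):
        spectrum = np.abs(np.fft.rfft(excerpt[start:start + frame_len])) ** 2
        bands = np.array_split(spectrum, n_bands)
        energies.append([b.sum() for b in bands])
    energies = np.log(np.array(energies) + 1e-10)
    return np.concatenate([energies.mean(axis=0), energies.var(axis=0)])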

Fingerprints can also be sequences (traces, trajectories) of features. This


fingerprint representation is found in [Blum1999], and also in [Haitsma2001]
as binary vector sequences. The fingerprint in [Papaodysseus2001], which
consists of a sequence of “band representative vectors”, is binary encoded
for memory efficiency. Some systems include high-level musically meaningful
attributes, like rhythm (BPM) or prominent pitch (see
[http://musicbrainz.org/docs/20031108-2.htm] and [Blum1999]).

Following the reasoning on the possible sub-optimality of heuristic features,


[Burges2002] employ several layers of OPCA to decrease the local statistical
redundancy of feature vectors with respect to time. Besides reducing
dimensionality, extra robustness requirements against shifting and pitching are
accounted for in the transformation.

“Global redundancies” within a song are exploited in [Allamanche2002]. If we


assume that the features of a given audio item are similar among them, a
compact representation can be generated by clustering the feature vectors.
The sequence of vectors is thus approximated by a much lower number of
representative code vectors, a codebook. The temporal evolution of audio is
lost with this approximation. Also in [Allamanche2002], short-time statistics
are collected over regions of time. This results in both higher recognition,
since some temporal dependencies are taken into account, and a faster
matching, since the length of each sequence is also reduced.
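
A minimal sketch of this codebook idea, using k-means clustering (scikit-learn is
used here purely as an illustrative choice) to approximate the sequence of feature
vectors of a recording by a small set of code vectors, together with an accumulated
quantization error of the kind described later in this section for [Allamanche2001].
The temporal order of the frames is discarded, as noted above.

import numpy as np
from sklearn.cluster import KMeans

def codebook_fingerprint(feature_vectors, n_codes=16):
    """feature_vectors: array (n_frames, n_dims), e.g. MFCC frames of one recording.
    Returns a codebook of n_codes vectors that summarizes the recording."""
    km = KMeans(n_clusters=n_codes, n_init=10, random_state=0).fit(feature_vectors)
    return km.cluster_centers_

def codebook_error(query_vectors, codebook):
    """Accumulated quantization error of the query frames against one codebook;
    the unknown item is assigned to the reference with the lowest error."""
    dists = np.linalg.norm(query_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).sum()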

[Cano2002c] and [Batlle2002] use a fingerprint model that further exploits


global redundancy. The rationale is very much inspired by speech research. In
speech, an alphabet of sound classes, i.e. phones, can be used to segment a


collection of raw speech data into text achieving a great redundancy reduction
without “much” information loss. Similarly, we can view a corpus of music as
sentences constructed by concatenating sound classes of a finite alphabet.
“Perceptually equivalent” drum sounds, for instance, occur in a great number
of pop songs. This approximation yields a fingerprint which consists in
sequences of indexes to a set of sound classes representative of a collection
of audio items. The sound classes are estimated via unsupervised clustering
and modeled with Hidden Markov Models (HMMs). Statistical modeling of the
signal’s time course allows local redundancy reduction. The fingerprint
representation as sequences of indexes to the sound classes retains the
information on the evolution of audio through time.

In [Mihçak2001], discrete sequences are mapped to a dictionary of error


correcting words. In [Kurth2002], the error correcting codes are at the basis of
their indexing method.

Similarity measures

Similarity measures are very much related to the type of model chosen. When
comparing vector sequences, a correlation metric is common. The Euclidean
distance, or slightly modified versions that deal with sequences of different
lengths, are used for instance in [Blum1999]. In [Sukittanon2002], the
classification is Nearest Neighbor using cross entropy estimation. In the
systems where the vector feature sequences are quantized, a Manhattan
distance (or Hamming when the quantization is binary) is common
[Haitsma2002b; Richly2000]. [Mihçak2001] suggest that another error metric,
which they call “Exponential Pseudo Norm” (EPN), could be more appropriate
to better distinguish between close and distant values with an emphasis
stronger than linear. In [Wang2003] a time-frequency analysis is performed,
marking the coordinates of local maxima of a spectrogram. This reduces the
search problem to one similar to astronavigation, in which a small patch of
time-frequency constellation points must be quickly located within a large
universe of points in a strip-chart universe with dimensions of bandlimited
frequency versus nearly a billion seconds in the database.

So far we have presented an identification framework that follows a template


matching paradigm [Theodoris1999]: both the reference patterns – the
fingerprints stored in the database – and the test pattern – the fingerprint
extracted from the unknown audio – are in the same format and are compared
according to some similarity measure, e.g: hamming distance, a correlation
and so on. In some systems, only the reference items are actually
“fingerprints” – compactly modeled as a codebook or a sequence of indexes
to HMMs [Allamanche2001], [Batlle2002]. In these cases, the similarities are
computed directly between the feature sequence extracted from the unknown
audio and the reference audio fingerprints stored in the repository. In
[Allamanche2001], the feature vector sequence is matched to the different


codebooks using a distance metric. For each codebook, the errors are
accumulated. The unknown item is assigned to the class which yields the
lowest accumulated error. In [Batlle2002], the feature sequence is run against
the fingerprints (a concatenation of indexes pointing at HMM sound classes)
using the Viterbi algorithm. The most likely passage in the database is
selected.

Matching methods

Besides the definition of a distance metric for fingerprint comparison, a


fundamental issue for the usability of a system is how to efficiently perform the
comparisons of the unknown audio against the possibly millions of fingerprints.
The method depends on the fingerprint representation. Vector spaces allow
the use of efficient existing spatial access methods. The general goal is to
build a data structure, an index, to reduce the number of distance evaluations
when a query is presented. As stated by [Chávez2001] most indexing
algorithms for proximity searching build sets of equivalence classes, discard
some classes and search exhaustively the rest [Chávez2001] (see for
example [Kimura2001]).

The idea of using a simpler distance to quickly eliminate many hypotheses and
the use of indexing methods to overcome the brute-force exhaustive matching
with a more expensive distance is found in the CBID literature, e.g: in
[Kenyon1993]. [Haitsma2001] proposed an index of possible pieces of a
fingerprint that points to the positions in the songs. Provided that a piece of a
query’s fingerprint is free of errors (exact match), a list of candidate songs and
positions can be efficiently generated to exhaustively search through
[Haitsma2001]. In [Cano2002c], heuristics similar to those used in
computational biology for the comparison of DNA are used to speed up a
search in a system where the fingerprints are sequences of symbols.
[Kurth2002] present an index that uses code words extracted from binary
sequences representing the audio. These approaches, although very fast,
make assumptions on the errors permitted in the words used to build the
index which could result in false dismissals. As demonstrated in
[Faloutsos1994], in order to guarantee no false dismissals, the simple
(coarse) distance used for discarding unpromising hypotheses must lower
bound the more expensive (fine) distance.
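
The following is a minimal sketch of an inverted index in the spirit of
[Haitsma2001]: each sub-fingerprint (any hashable value here) points to the
(song, position) pairs where it occurs, so an exhaustive comparison only needs to
be run on the candidates it suggests. Names and data structures are illustrative.

from collections import defaultdict

def build_index(reference_fingerprints):
    """reference_fingerprints: dict song_id -> list of sub-fingerprints (hashable)."""
    index = defaultdict(list)
    for song_id, subfps in reference_fingerprints.items():
        for position, subfp in enumerate(subfps):
            index[subfp].append((song_id, position))
    return index

def candidate_positions(index, query_subfps):
    """Return (song, start offset) candidates suggested by exact sub-fingerprint
    hits, ranked by the number of hits; only these are then checked exhaustively."""
    candidates = defaultdict(int)
    for q_pos, subfp in enumerate(query_subfps):
        for song_id, ref_pos in index.get(subfp, []):
            candidates[(song_id, ref_pos - q_pos)] += 1  # hits agreeing on one offset
    return sorted(candidates.items(), key=lambda kv: -kv[1])

As noted above, this only works when at least one piece of the query fingerprint is
free of errors; otherwise the true song may be missed (a false dismissal).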

In [Wang2003] the information of the input is stored in sample hashes. After
each sample hash has been used to search in the database to form matching
time pairs, the bins are scanned for matches. Within each bin the set of time
pairs represents a scatterplot of association between the sample and
database sound files. If the files match, matching features should occur at
similar relative offsets from the beginning of the file, i.e. a sequence of hashes
in one file should also occur in the matching file with the same relative time
sequence. The problem of deciding whether a match has been found reduces
to detecting a significant cluster of points forming a diagonal line within the
scatterplot. Various techniques could be used to perform the detection, for
example a Hough transform or other robust regression technique. Such
techniques are overly general, computationally expensive, and susceptible to
outliers. Due to the rigid constraints of the problem, the following technique
solves the problem in approximately N*log(N) time, where N is the number of
points appearing on the scatterplot. For the purposes of this discussion, we
may assume that the slope of the diagonal line is 1.0.
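
A sketch of this fast matching step: since the diagonal has slope 1.0, the time
pairs of a true match share a nearly constant offset t_database - t_query, so a
simple histogram of offsets replaces a Hough transform or a robust regression.
The bin width is an arbitrary illustrative value.

from collections import Counter

def diagonal_score(time_pairs, bin_width=0.05):
    """time_pairs: list of (t_query, t_database) pairs in seconds for one candidate
    file. A true match produces a strongly populated offset bin (the diagonal of
    the scatterplot); returns (size of the largest bin, estimated start offset)."""
    offsets = Counter(round((t_db - t_q) / bin_width) for t_q, t_db in time_pairs)
    if not offsets:
        return 0, None
    best_bin, count = offsets.most_common(1)[0]
    return count, best_bin * bin_width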

2.1.3.7. Post-processing

Most of the features described so far are absolute measurements. In order to


better characterize temporal variations in the signal, higher order time
derivatives are added to the signal model. In [Cano2002a] and [Batlle2002],
the feature vector is the concatenation of MFCCs, their derivative (delta) and
the acceleration (delta-delta), as well as the delta and delta-delta of the
energy. Some systems only use the derivative of the features, not the
absolute features [Allamanche2001; Kurth2002]. Using the derivative of the
signal measurements tends to amplify noise [Picone1993] but, at the same
time, filters the distortions produced in linear time invariant, or slowly varying
channels (like an equalization). Cepstrum Mean Normalization (CMN) is used
to reduce linear slowly varying channel distortions in [Batlle2002]. If Euclidean
distance is used, mean subtraction and component wise variance
normalization are advisable. [Park2006] propose a frequency-temporal
filtering for robust audio fingerprinting schemes in real-noise environments.
Some systems compact the feature vector representation using transforms
(e.g: PCA [Cano2002a; Batlle2002]).
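
A small sketch of adding time derivatives to the feature vectors: a simple
first-order difference is used here for the delta and applied twice for the
delta-delta, whereas real systems often use a regression over several neighboring
frames.

import numpy as np

def add_deltas(features):
    """features: array (n_frames, n_dims), e.g. MFCCs per frame.
    Returns an array (n_frames, 3 * n_dims): static, delta and delta-delta
    coefficients concatenated per frame."""
    delta = np.diff(features, axis=0, prepend=features[:1])   # first derivative
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])        # acceleration
    return np.hstack([features, delta, delta2])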

It is quite common to apply a very low resolution quantization to the features:


For instance ternary [Richly2000] or binary [Haitsma2002b; Kurth2002]. The
purpose of quantization is to gain robustness against distortions
[Haitsma2002b; Kurth2002], normalize [Richly2000], ease hardware
implementations, reduce the memory requirements and for convenience in
subsequent parts of the system. In addition binary sequences are required to
extract error correcting words utilized in [Mihçak2001; Kurth2002]. In
[Mihçak2001], the discretization is designed to increase randomness in order
to minimize fingerprint collision probability.

2.2. Vericast

2.2.1. System overview


BMAT's Vericast is a high-performance audio monitoring solution for TV, radio,
Internet or any other type of broadcast. It provides continuous and precise
content tracking, guaranteeing accurate emission reports and statistics, and is
mainly used by media companies working with consumer statistics, Digital
Rights Management and Broadcast Monitoring. Vericast continuously
observes a broadcast audio stream and matches its content against a
database of previously extracted reference audio fingerprints. The system
periodically returns the information required to create an identification report
of the media reproduced in the monitored source. Results are stored directly
into a database. The system is based on [Batlle2002].

Vericast is robust against distortions in the audio signal: Automatic


identification of music titles and copyright enforcement of audio material has
become a topic of great interest. One of the main problems with broadcast
audio is that the received audio suffers several transformations before
reaching the listener (equalizations, noise, speaker over the audio, parts of
the songs are changed or removed, etc.) and, therefore, the original and the
broadcast songs are very different from the signal point of view.

2.2.2. Framework overview


The identification system is built on a well-known stochastic pattern matching
technique known as Hidden Markov Models (HMM) [Rabiner1989]. HMMs
have proven to be a very powerful tool to statistically model a process that
varies in time. The idea behind them is very simple. Consider a stochastic
process from an unknown source and consider also that we only have access
to its output in time. Then, HMMs are well suited to model this kind of event.
From this point of view, HMMs can be seen as a doubly embedded stochastic
process with a process that is not observable (hidden process) and can only
be observed through another stochastic process (observable process) that
produces the time set of observations.
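
To make the idea of a doubly embedded stochastic process concrete, the following
toy MATLAB sketch (unrelated to the actual models used in Vericast) generates
observations from a two-state HMM: the hidden state follows a transition matrix,
and each state emits values through its own Gaussian distribution. All numbers
are arbitrary placeholders.

A  = [0.9 0.1; 0.2 0.8];                % hidden-state transition probabilities
mu = [0 5]; sigma = [1 1];              % emission mean and deviation per state
T = 100; state = 1; obs = zeros(1, T);
for t = 1:T
    obs(t) = mu(state) + sigma(state)*randn;       % observable process
    state = find(rand <= cumsum(A(state,:)), 1);   % hidden process (never observed directly)
end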

We can see music as a sequence of audio events. The simplest way to show
an example of this is in a monophonic piece of music. Each note can be seen
as an acoustic event and, therefore, from this point of view the piece is a
sequence of events. However, polyphonic music is much more complicated
since several events occur simultaneously. In this case we can define a set of
abstract events that do not have any physical meaning but that mathematically
describe the sequence of complex music. With this approach, we can build a
database with the sequences of audio events of all the music we want to
identify.

To identify a fragment of a piece of music in a stream of audio, the system
continuously finds the probability that the events of the pieces of music stored
in the database are the generators of this unknown broadcast audio. This is
done by using the HMMs as a generator of observations instead of decoding
the audio into a sequence of HMMs.

2.2.3. Steps for the identification

Fingerprint extraction (MFCC) from the audios in the user database

The first step in a pattern matching system is the extraction of some features
from the raw audio pattern. We choose the parameter extraction method
depending on the nature of the audio signal as well as the application. Since
the aim of our system is to identify music behaving as closely as possible to a
human being, it is sensible to approximate the human inner ear in the
parametrization stage. Therefore, we use a filter-bank based analysis
procedure. In speech recognition technology, mel-cepstrum coefficients
(MFCC) are well known and their behavior leads to high performance of the
systems [Batlle1998]. It can also be shown that MFCCs are well suited for
music analysis [Logan2000].
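
As a rough, self-contained illustration of this kind of parametrization (the exact
analysis parameters used by the system are given in section 2.2.4), the following
MATLAB sketch computes MFCC-like coefficients for a single frame: Hamming window,
magnitude spectrum, triangular mel filter bank up to 4 kHz, logarithmic
compression and a DCT. Frame length, number of bands and number of coefficients
are placeholder choices.

fs = 8000; N = 240;                      % 30 ms frame at 8 kHz
x = randn(N,1);                          % placeholder audio frame
X = abs(fft(x .* hamming(N), 256));      % windowed magnitude spectrum
X = X(1:129);                            % keep bins up to fs/2
nBands = 20; nCoeffs = 13;
mel  = @(f) 2595*log10(1 + f/700);       % Hz -> mel
imel = @(m) 700*(10.^(m/2595) - 1);      % mel -> Hz
edges = imel(linspace(mel(0), mel(fs/2), nBands+2));  % band edges on the mel scale
bins  = round(edges/(fs/2)*128) + 1;     % map band edges to FFT bins
H = zeros(nBands, 129);
for b = 1:nBands                         % triangular filters
    H(b, bins(b):bins(b+1))   = linspace(0, 1, bins(b+1)-bins(b)+1);
    H(b, bins(b+1):bins(b+2)) = linspace(1, 0, bins(b+2)-bins(b+1)+1);
end
logE = log(H*X + eps);                   % log filter-bank energies
c = dct(logE); mfcc = c(1:nCoeffs);      % keep the first cepstral coefficients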

Channel estimation

Techniques for dealing with known distortions are straightforward. However, in
real radio broadcasts, the distortions over the audio signal are not very
predictable. To remove some effects of these distortions, one can assume that
they are produced by a linear time-invariant (or slowly varying) channel. With
this approach it is assumed that all the distortion can be approximated by a
linear filter H(f) that slowly changes in time. Thus, if we define y(n) as the
audio signal received, x(n) as the original signal and F() as the Fourier
transform, we can write:

F(y(n)) = F(x(n)) · H(f)

And in the logarithmic space:

log F(y(n)) = log F(x(n)) + log H(f)

Since we only have access to the distorted data and due to the nature of the
problem we cannot know how the distortion was, we need a method to
recover the original audio characteristics from the distorted signal without
having access to the manipulations this audio has suffered. Here we define
the channel as a combination of all possible distortions like equalizations,
noise sources and DJ manipulations.

If the distorting channel H(f) is slowly varying we can design a filter that,
applied to the time sequence of parameters, is able to remove the effects of
the channel. The filter designed for the system is:

By filtering the parameters of the distorted audio with this filter, they are
converted, as closely as possible, to the clean version. By removing this
channel effect from the received signal, the identification performance is
greatly improved because the distortions caused by equalization and
transmission are largely removed. Therefore the system is able to deal not
only with clean CD audio but also with noisy broadcast audio.
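
The specific filter used by the system is not reproduced here, but the idea can
be illustrated with a simple sketch: a convolutive, slowly varying channel
appears as an additive, slowly varying bias in the log/cepstral domain, so
subtracting a running mean of the parameter trajectories (a moving-average
variant of the Cepstrum Mean Normalization mentioned earlier in this chapter)
attenuates it. Matrix sizes and the window length are placeholder assumptions.

C = randn(200, 13);                      % placeholder sequence of cepstral vectors (frames x coeffs)
winLen = 51;                             % frames used to estimate the slowly varying channel
channelEstimate = movmean(C, winLen, 1); % slowly varying component of each coefficient
Cclean = C - channelEstimate;            % channel-compensated parameter trajectories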

Training

In our approach, HMMs represent generic acoustic generators. Each HMM
models one generic source of audio. For example, if the audio we model has
a piano and a trumpet, we will have one HMM to model the piano and another
one to model the trumpet. However, commercial pop music has a very
complex variety and mixture of sounds and so it is almost impossible to
assign a defined sound source to each HMM. Therefore, each HMM in the
system models abstract audio generators, that is, each HMM is calculated to
maximize the probability that, if it really were a sound generator, it would
generate that sound (complex or not). Thus, HMMs are calculated in such a way
that the probability that a given sequence of them generates a particular song
is maximized and that, given all possible songs, we can find a sequence of
HMMs for each of them that generates it reasonably well. To derive the formulas
to calculate the parameters of each HMM we used a modification of the
Expectation-Maximization algorithm where the incomplete data are not only the
parameters of the HMMs but also their correct sequences for each song. If we
suppose that there exists a probability density function f(φ|λ) that is related
to the probability density function of the incomplete data, then we can relate
them with:

where O are the samples from the incomplete sample space and φ are the
samples of the complete sample space. We also suppose that there is at
least one transformation from the space of complete samples to the space of
incomplete samples.

Therefore, the training stage in our system is done in an iterative way similar
to the [Baum1967] algorithm widely used in speech recognition systems.
Speech systems use HMMs to model phonemes (or phonetic derived units)
but, unfortunately, in music identification systems we do not have any clear
kind of units to use. That is why at each iteration a new set of units is
estimated as a part of the incomplete data in order to jointly find the sequence
of probabilities and also the set of abstract units that best describes complex
music. After some experimental results we found that a good set of units is
completely estimated after 25-30 iterations.

Audio identification

HMM training described in the previous section was aimed at obtaining the
maximum distance between all possible song models in order to increase
speed and reliability during the audio identification phase. Once the HMMs
are trained, the next steps toward building the entire system consist in getting
the song models and matching them against streaming audio signals.

Signature generation

Signature generation consists in obtaining a sequence of HMMs for each
song that uniquely identifies it among the others. The song signatures are
generated using a Viterbi algorithm [Viterbi1967]. The Viterbi algorithm
computes the highest probability path between HMMs on a complete HMM
graph model. All the song signatures are stored in a signature database.
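
The Viterbi recursion itself is standard; a generic log-domain sketch is given
below for reference (it is not the Vericast implementation). Here logPi holds the
initial log-probabilities, logA the log transition probabilities between HMMs and
logB(t,s) the log-likelihood of frame t under model s.

function path = viterbi_path(logPi, logA, logB)
    [T, S] = size(logB);                 % T frames, S models/states
    delta = -inf(T, S); psi = zeros(T, S);
    delta(1,:) = logPi + logB(1,:);      % initialization
    for t = 2:T                          % recursion: best predecessor for each state
        [best, arg] = max(delta(t-1,:)' + logA, [], 1);
        delta(t,:) = best + logB(t,:);
        psi(t,:) = arg;
    end
    [~, path(T)] = max(delta(T,:));      % termination and backtracking
    for t = T-1:-1:1
        path(t) = psi(t+1, path(t+1));
    end
end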

Identification algorithm

The identification algorithm is in charge of matching all the signatures against
the input streaming audio signals to determine whether a song section has
been detected. The Viterbi algorithm is used again with the purpose of
exploiting the observation capabilities of the HMM models contained in the
signature sequences. Nevertheless, this time the graph model is not a
complete graph but a cyclic HMM model. This model is built by linking all song
HMM sequences from the identity signature database in a ring structure
where each HMM only has two links, one to itself and one toward its
immediate neighbor. Nevertheless, the Viterbi algorithm is periodically allowed
to use internal ring links in order to allow jumps between different song
sections.
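
As a minimal sketch of this ring topology (with placeholder probability values),
the transition matrix below gives every HMM in the concatenated signature
sequences a self-loop and a single link to its immediate neighbor, wrapping
around at the end.

S = 6;                                   % total number of HMMs in the concatenated signatures
pSelf = 0.7; pNext = 0.3;                % illustrative transition probabilities
A = pSelf * eye(S);                      % self-loops
for s = 1:S
    A(s, mod(s, S) + 1) = pNext;         % link to the next HMM, wrapping around (ring)
end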

2.2.4. Fingerprint database creation

Before describing the first stage of the system it must be mentioned that the
audio files that the system accepts have to be in WAV format at an 8 kHz
sample rate. That means the system will have information up to 4 kHz to
extract the fingerprint. This is done to reduce the computational complexity,
as the audio files are smaller and have less information to be analyzed. It has
been empirically observed that with this audio format the system reaches an
optimal accuracy, taking into account the trade-off between computation time
and accuracy.

The first stage in the system is obtaining a set of values that
represent the main characteristics of the audio samples from a database
given by the customer. A key assumption made at this step is that the signal
can be regarded as stationary over an interval of a few milliseconds. Thus, the
prime function of the front-end parameterization stage is to divide the input
sound into blocks and from each block derive some features.

The spacing between blocks is 30 ms. As with all processing of this type, a
Hamming window function is applied to each block so as to minimize the
signal discontinuities at the beginning and end of each frame
[Oppenheim1989]. After that, the required spectral estimates are computed via
Fourier analysis for each of these 30 ms blocks. Then the MFCC coefficients
are computed. The Fourier spectrum is smoothed by integrating the spectral
coefficients within triangular frequency bins arranged on a non-linear scale
called the Mel-scale. The Mel-scale is designed to approximate the frequency
resolution of the human ear being linear up to 1,000 Hz and logarithmic
thereafter [Ruggero1992]. In order to make the statistics of the estimated
song power spectrum approximately Gaussian, logarithmic (compression)
conversion is applied to the filter-bank output. As stated in [Logan2000], this
kind of spectral smoothing has been demonstrated to be robust for speech
recognition; it has not been demonstrated for music, but it should not be
harmful. These computed
MFCCs are then stored in a feature vector.

To have a longer analysis window, these 30 ms windows are overlapped and
a bigger window of 75 ms (5 overlapped 30 ms windows) is then formed to be
analyzed (see fig. 2.6). An average of the features of these 5 linked windows
is then made. That fact introduces some noise, as an average is made. For
every audio excerpt, those vectors are stored in a database of fingerprints.
Every different audio will have a unique identification.
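
A small sketch of this grouping step is shown below (all sizes are placeholders):
per-window MFCC vectors are averaged in groups of five to obtain the coarser
feature vectors that are finally stored as the fingerprint.

mfccFrames = randn(100, 13);             % placeholder per-window MFCC vectors (windows x coeffs)
group = 5;                               % number of overlapped 30 ms windows averaged together
nGroups = floor(size(mfccFrames,1)/group);
fingerprint = zeros(nGroups, size(mfccFrames,2));
for g = 1:nGroups
    idx = (g-1)*group + (1:group);       % five consecutive analysis windows
    fingerprint(g,:) = mean(mfccFrames(idx,:), 1);   % averaged feature vector for ~75 ms
end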

Fig. 2.6. Feature extraction in Vericast.

2.2.5. Input audio to be identified & matching process

Once the fingerprint database is created, audio inputs can be fed to the
system to be identified. The format must be converted to WAV at 8 kHz, as
said before. A fingerprint is automatically extracted for this input and the
process is the same as with the extraction of fingerprints for the database,
up to the creation of the feature vector. Then these vectors are compared
against the fingerprint database and the best candidates for this piece of
audio are given by the system. Then, if the candidates are 'good' in terms of
the system, a match is given in the output (fig. 2.7).

Fig. 2.7. Fingerprinting matching system.

2.2.6. Matching method

The matching algorithm is the last block of the system framework. The
distances between the given candidates and the fingerprints of the database
are calculated. The matching step has the aim of returning the final output,
labeling the unlabeled audio in the input.

The matching process searches for good temporal sequences between
consecutive candidates. If 3 consecutive candidates (by default, but
configurable) are temporally sequenced, the system takes this as a match.
When this temporal sequence is broken, the first point gives the beginning
time, and the last one in the good sequence sets the end time for the match.
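
A toy illustration of this rule is sketched below (all names and values are
placeholders, not the actual implementation): consecutive candidates form a good
temporal sequence when their reference times advance by match_time within a
small tolerance, and a match is declared when enough consecutive candidates do
so.

match_time = 3;                          % seconds between candidate reports
tolerance  = 0.5;                        % allowed timing error in seconds
refTimes = [120 123 126 40 129];         % reference times of consecutive candidates (toy data)
goodSteps = abs(diff(refTimes) - match_time) <= tolerance;  % step-by-step continuity
runLength = 1; best = 1;
for k = 1:length(goodSteps)              % longest run of temporally sequenced candidates
    if goodSteps(k), runLength = runLength + 1; else, runLength = 1; end
    best = max(best, runLength);
end
isMatch = best >= 3;                     % 3 consecutive candidates (default) give a match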

There is a very important fact regarding the matching method. The broken
sequences are very fragile and must be handled with a lot of care.
A sequence can be broken at some point but it may continue after that gap
because, in some way, the algorithm failed when giving the candidates. To deal
with these gaps, the system has a parameter that can be turned on or off as
appropriate. This parameter, called 'segment search', allows choosing between
two types of matching methods: if one is enabled, the other is disabled.

As exposed before, the system is fragile against the gaps produced in the
temporal sequence. Many temporal sequences are broken at the moment the
audio finishes, that is, at the correct moment. But in other cases
the sequence is broken before the end of the audio stream. After this gap the
temporal sequence is created again, following a good pattern, but it is too late
as it has been broken some time before. To prevent these gaps,
the system has a solution: a second matching method, created to be
more robust against this type of situation. We will now explain how these two
matching methods work.

(a) First matching method: one temporal point per candidate at each step of
the algorithm.

As said before, at every step of the algorithm the system gives 20 candidates
(by default). If this method is chosen, one temporal point per candidate
is given. These temporal points are the instants of the song where the
candidate has been found. For example the candidate could be 'Bob Marley –
Is this love' and the temporal point 2 minutes 31 seconds. This point is
the most probable one along the song. The temporal sequence must be
generated with those temporal points. To clarify how this works, see fig. 2.8
and fig. 2.9.

Fig. 2.8. Taking the most probable point.

Fig. 2.9. Example of a temporal sequence with one temporal point (the most
probable) per candidate.

Fig. 2.8 shows the probability curve along a song and marks the most
probable point along it. That is the temporal point that will be taken into
account to generate the temporal sequence.

Fig. 2.9 shows an example of a candidate and the temporal points given for it
during 12 steps of the algorithm. As can be seen, the good temporal sequence
is created, but in the 11th step the sequence is broken. These temporal
sequences are created by consecutive candidates. If the sequence is broken
but continues later on, it cannot be linked, as the sequence is created only by
consecutive candidates, with no possibility of jumping over the gaps. Then the
beginning time is set at the beginning of the temporal sequence and the end
time is set at the last point in the good temporal sequence.

(b) Second matching method: many temporal points per candidate at each
step of the algorithm.

Here, many temporal points (100 by default) are given for each candidate at
every algorithm step (20 candidates by default per step). Then the
algorithm tries to generate good temporal sequences between these points,
taking only those that create a good sequence. As the sequence is
created by consecutive candidates, when a following candidate does not have a
temporal point in the good temporal sequence, the sequence is broken and, as
in the other case (the first matching method), a match is given, with the
beginning time set as the first temporal point in the sequence and the end time
as the last temporal point in the sequence. To clarify how this works, see fig.
2.10 and fig. 2.11.

Fig. 2.10. Taking many probable points.



Fig. 2.11. Temporal gap present with many temporal points per candidate.

In fig. 2.10 we can see a probability curve along a song, with some marked
points that have a high probability of being the correct ones. In fig. 2.11 the
temporal sequence is created by some of the probable points taken from the
probability curve. A temporal gap is created at some point. The temporal
sequence then continues correctly some time later but, as the temporal
sequence must be created by consecutive candidates, the temporal gap cannot
be avoided. The temporal sequences cannot be linked and the match is given
with a lower quality.

Having reviewed the two matching methods, it is time to compare
them. The first one, with only one temporal point per candidate, is
simpler and faster, but fragile against 'bad quality' matches, with lots of
temporal gaps in them. With the second matching method, taking many
temporal points per candidate, the block is more complex, but more robust
against 'bad quality' matches. The complexity is also reflected in the
computation time.

2.2.7. Configuration parameters

The system has some parameters that can be modified to improve the
identifications if necessary:

 Sensitivity: how many consecutive frames have to be detected to give a
match. If the sensitivity is small, matches are created more easily but they
are more fragile. If it is too big, the match is very difficult to create because
a lot of consecutive frames have to be detected; but, if created, the match is
very strong and the confidence is higher, although some good matches may
be missed.

 Match_time: time for the system to give candidates (in seconds). If it is
small, the fingerprint is averaged over this short time, which translates into a
more fragile fingerprint. The longer the Match_time, the stronger and more
reliable the extracted fingerprint. If it is lower, consecutive detected frames
(matches) appear more easily. If it is longer, as we have fewer frames, these
consecutive frames are more difficult to find but, if found, they are stronger.

 Hop_time: hop for the match time (if equal to Match_time there is no
overlapping).

 Match_tolerance: a kind of jitter. If 0, the match must be consecutive
'on time'. If a value is given, the matching allows a time detection error of
this value (in milliseconds).

 Segment_search: if enabled (1), a probability for every time point of the
signature of every candidate is calculated. Then the system uses the highest
probability points to look for matches. If disabled (0), only the most probable
time point is taken.

Chapter 3
Problem statement

The problem faced with Vericast has been around the identification of some
types of electronic music; as we will see later, it is more related to repetitive
music than to electronic music itself. The question of why Vericast finds more
difficulties in identifying pieces of electronic music could have multiple
answers. The present work proposes and develops one of these possible
answers, based on the problem that has been detected, which does not need
to be the only existing one.

As we have seen in detail in chapter 2, the audio input to the system is divided
so that the system works frame-by-frame. From every frame a fingerprint
signature is extracted, and the acoustic vector that composes this signature is
filled with MFCC features. This extracted fingerprint is then compared with
those in a database that has previously been created offline with the
fingerprints of the audio files that the user wants to monitor. If the
audio input is found in the database the system returns a match as the output,
with the required information about it.

Some types of electronic music are based on loops that are repeated several
or many times during the same song. We hypothesize that in other types of
music, although there are parts of songs that are repeated along them (like
choruses), they might be played by humans and therefore have subtle
variations that make them not exactly the same but only similar, so the system
does not have the problem explained before. In the case of electronic music,
loops are exactly the same and repeated many times throughout the song. It
seems that this fact can confuse Vericast because, when this type of song has
to be detected, it is very likely to find matches in many parts of the song for
one precise part. The system matching then jumps, picking candidates that fail
to be correct and therefore do not have a good temporal continuation between
them, so the system sometimes cannot find a strong enough temporal
relationship. This fact is shown in fig. 3.1.

Fig. 3.1. 'Loop 2' is detected in many points of the song

We will define the concept of 'good continuity between candidates' to make it
clearer, because it is an important aspect of this work. Vericast gives
candidates for the input that has to be identified from time to time (every
match_time, a parameter of the system; a typical value could be between 1
and 3 seconds). Let us say it gives candidates every 3 seconds. Two
candidates with good continuity would be those where the first candidate
corresponds to time x of a song from our database and the candidate
immediately following it (given by the system after 3 seconds) corresponds to
time x + 3 seconds of the same song. If this were not the case, there would
not exist good continuity between these two candidates. If many candidates
and their temporal points are in good continuity they create a good temporal
sequence that can be converted into a match.

To find a match the system needs to find a series of candidates with good
temporal continuity within the song. As an explanatory example we might think
of a song in our fingerprint database that has three choruses that are exactly
the same. The first chorus, isolated from the rest of the song, enters the
system as input; the system extracts the input fingerprint and looks for it in the
database. Having three equal choruses in the database, the system would
return candidates from temporal points scattered over each of these three
choruses. For each point of the input we have three probable points in the
database. The system might not find a good level of continuity between
candidates, as they jump from one chorus to another. The system may be able
to 'say' what song it is but it might have problems guessing where it begins
and where it ends. The case of many styles of electronic music is exactly this,
and harder, because the loops are normally repeated more than 3 times and
this happens in nearly every song (see fig. 3.1).

However, the current system has a solution for this, as explained in chapter 2
(see 2.2.6, Matching method). Applying this solution we can obtain a
substantial improvement, but we could still have the same problem. The
system requires immediately consecutive candidates in good temporal
sequence to find matches, and sometimes this may not happen because of the
very nature of the loops of electronic music. This problem is shown in fig. 3.2,
where the existing solution is applied: the temporal sequence is stronger but
there is still a gap that breaks the sequence. Our solution aims to be more
robust than the one already implemented, taking into account that trying to
find a temporal sequence between consecutive candidates can be weak. If we
could use more than two consecutive temporal points to search for trajectories
we could jump over the gaps in cases like the one shown in fig. 3.2.

Fig. 3.2. Top: the system with only one temporal point per candidate. Bottom:
the existing solution of the system is applied, but the gap is still present.

Chapter 4
Problem solution

4.1. Introduction
The system has the problem that we have just explained in the previous
chapter but, through off-line monitoring of the information that the system
gives at intermediate points (before the output is produced), it can be seen
that this problem is solvable and that the information already contained in the
system should be enough. It can be observed that there actually is good
continuity between candidates. In general, good temporal sequences are
created, but sometimes they are not continuous because they require one
candidate right after the other, with no temporal gaps.

For example, imagine we have four candidates given by the system. The first
two candidates are correlated with the same part of the song, so they have good
continuity between them (see fig. 4.1). The next candidate, the third, is located
in another part of the song, breaking the continuity that the previous two had.
In contrast, the fourth point has good continuity with the first two. That is, we
have a path consisting of three candidates that have a gap in the middle (the
third). This fact can be extrapolated: the system gives candidates that are
really correlated with others but that do not have to follow one right after the
other. There may be gaps between the best possible continuations. Good
trajectories are formed, so a solution might be to give more flexibility to the
algorithm that seeks continuity among candidates, to make it more robust to
these gaps/continuity breaks.

In the case exposed before, the system previous to the modifications, after
'seeing' the gap, would give a match with only the first two points. The new
modifications to the matching method can jump over the gap and link the fourth
point, which is in good temporal sequence, with the first two (see fig. 4.1).

Fig. 4.1. Example of good continuity with a gap.

Another matching method has been designed and implemented to solve such
problems. It does not require continuity between consecutive candidates, but
continuity over time. The algorithm gives the system some time (several
match_time periods) to provide candidates, finds these trajectories and good
continuities, evaluates them and keeps the best. The time given to seek
candidates with good continuity is a parameter of the new algorithm. The larger
this amount of time, the more candidates it will have, so the greater the
probability of finding a good continuity between them.

The maximum time that we could let the algorithm take to find matches is the
duration of the input. Taking less time we can also have good results, with the
benefit of being able to obtain them in real time. Here real time means that we
can be identifying the input while it is sounding. This matters when the length
of the input is not known, for example when monitoring what is being played in
a radio broadcast. In this scenario Vericast is supposed to give matches in
'real time' for what is being played: the user wants to know what is being
heard as soon as possible. If we were monitoring a radio station where songs
sound one after another, we could take a time of about 10 seconds, allowing
the algorithm to give us matches at this rate and thus keep an updated playlist
of what is being played.

As we have explained in chapter 2, there is a solution implemented within the
system for the problematic cases we have described in that chapter. In general,
that solution was to give many probable temporal points for each given
candidate. The new algorithm implemented has been developed without taking
those 'extra' points for each candidate into account but, if they were taken, it
might be able to find more good trajectories and good continuations. The
complexity when searching for paths would increase a lot, as where we had
just one temporal point we would then have many of them. This implementation
has not been carried out, but it remains a potential improvement.

4.2. Implemented algorithm explanation


For a proper understanding of the implemented algorithm, a block diagram of it
is shown in fig. 4.2. An explanation of each block is given below.

Fig. 4.2. Block diagram of the algorithm.

Waiting step

This first step does not appear in the block diagram, as the only thing that
happens here is that the algorithm lets the system give candidates for some
time; we call this time the 'step time of the algorithm'. The 'step time of the
system', on the other hand, is the time at which the system gives each new
candidate (the match_time parameter). It must always be less than the step
time of the algorithm to be meaningful, as a number of candidates are needed
to search for good continuations. We will refer to these concepts as 'step of the
algorithm' and 'step of the system', respectively.

Store candidates

At each step of the system, it gives a number of candidates from different
songs (20 by default), from the least to the most probable. These candidates
and their temporal points are stored in a matrix arranged by candidate. We
call it the 'candidate matrix'. For every new candidate, a row is created in that
matrix. The temporal points for each candidate are placed in the columns of
the row of the corresponding candidate. If the candidate to be stored is
already in the matrix, only its temporal point is stored after the other temporal
points already stored for this candidate, because the ID is already in the matrix
and it is only stored once, the first time it appears. If the candidate is new to
the matrix, a new row is created for it with its temporal point.

This way we have a big matrix with the candidates (their IDs) by rows, situated
in the first column of the matrix. Every candidate has a row, with the ID in the
first position and its temporal points in the following columns. If, for one step of
the algorithm, a candidate that has appeared before does not appear in the
current step, the corresponding cell for that temporal point is left blank. The
form of the candidate matrix can be seen in table 4.1. We find the candidates
(all different) in the first column. Then the temporal points (t), where the first
index refers to the candidate of that temporal point and the second index
refers to the position of this temporal point.

Candidate1   t11   t12   t13
Candidate2   t21   t22   t23
Candidate3   t31   t32
...
CandidateN   tN1   tN2   tN3   tN4

Table 4.1. Candidate matrix (candidate_matrix in the code).



In the table we can see every different candidate in a single row. The temporal
points found for a candidate are located in the columns of its row. For every
step of the algorithm a new column is created. If a cell is blank it means that in
that step of the algorithm the candidate was not found.

Split candidates

After the step time of the algorithm (20 s by default), it takes all the candidates
and treats them separately. The candidates can be split because every row
has a single candidate. Every isolated candidate (every row), with its temporal
points, is then sent to the block where the trajectories and the good temporal
sequences are searched. This is the moment when the points in the matrix are
translated into coordinates, as explained in the next point.

Calculate trajectories

The mechanism for searching good continuations for each candidate is itself
another function inside the proposed solution. The input to this block is a set of
coordinates for each isolated candidate. These coordinates are formed, on one
side, by the temporal points detected (query time Tq) and, on the other, by the
corresponding temporal points in the audio input to the system (reference time
Tref). That is, imagine we have a recording of unknown length. At minute 5:00
a whole song lasting 3 minutes starts, which therefore ends at minute 8:00 of
the recording. In this case Tq is ideally formed by temporal points from 0
seconds to 3 minutes, and Tref would have temporal points from minute 5:00
to minute 8:00. This way, we would have a representation as in fig. 4.3.

Fig 4.3. Example of ideal input



So, ideally we should have a 1:1 relationship between Tq and Tref, with an
offset on Tref. Thus, if we represent the coordinates on an axis, good
continuities should form trajectories with a slope of approximately 1. This can
vary a bit since the input audio to the system may have suffered distortions
like time stretching or pitch changes (as in DJ sessions).

Therefore, to characterize what we call good continuation trajectories, two
coordinates must form a trajectory with a slope of approximately 1, with a
margin of error to take into account the distortions mentioned above (time
stretching and pitch change). To know whether other coordinates belong to the
same trajectory the slope is not enough; we need another reference parameter
to properly characterize it (see fig. 4.4). We decided that this reference could
be the 'offset point', where the line of the traced trajectory cuts the coordinate
axis. As we give a margin of error for the slope, that error is also reflected in
the offset point, so an error for it must also be defined. Both errors are
parameters of the algorithm. With these two values we have the trajectories
characterized. When a new one is calculated it can be compared with those
already mapped out and we can discern whether it is a good continuation of
any of them or whether a new trajectory must be traced.

Fig. 4.4. Characterizing the trajectories.

To calculate these trajectories we start from the first coordinate. The
trajectories between this first coordinate and the following ones are calculated.
To be considered a good trajectory, it must have a slope equal to 1 +/- the
admitted error. In this case the trajectory (characterized by the slope and the
offset point) is stored; if not, the trajectory is discarded. After having calculated
all the trajectories between the first coordinate and the others, the same is
done with the second coordinate: every trajectory between this second
coordinate and the coordinates after it is calculated. Then the third, and so on,
until the last one. Therefore, if we had N coordinates we would have to
calculate (N-1) + (N-2) + (N-3) + ... + 1 = N(N-1)/2 trajectories.
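
The basic computation for a single pair of coordinates can be sketched as
follows (using the tolerance values that appear in the implementation in section
4.3.2; the coordinate values are placeholders): the slope and the offset point are
derived from two (Tq, Tref) pairs and the slope is checked against 1 within the
admitted error.

m_error = 0.05; offset_error = 50;       % tolerances, as in trajectories_searching_mechanism
p1 = [10 210]; p2 = [13 213];            % two [Tq Tref] coordinates of the same candidate
m = (p2(2) - p1(2)) / (p2(1) - p1(1));   % slope between the two coordinates
n = p1(2) - m*p1(1);                     % intercept of the line Tref = m*Tq + n
offset = -n/m;                           % 'offset point' used to label the trajectory
isGoodTrajectory = abs(m - 1) <= m_error;   % slope ~= 1 within the admitted error
% offset_error is used when deciding whether the pair belongs to an existing trajectory.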

For every trajectory there is a counter that counts the number of points
belonging to that trajectory. If a trajectory is repeated (same slope and same
offset point) the counter for that trajectory is incremented (+1). To be
considered a good trajectory, a threshold on the number of points that have to
belong to it is given. If that threshold is reached, the candidate and the
temporal points creating the good temporal sequence are considered a partial
match at this moment. So for every step of the algorithm it returns partial
matches whenever any candidate has good continuity. We call it a partial
match because the trajectory created may not be complete: we are only taking
a few temporal points that may have a continuation later on. Remember that in
the waiting step the system waits, for example, 20 seconds, and a song may
last 180 seconds. We would then have 18 partial matches (9*2 because of the
half-step overlap) for only one song. As we will see in the next block, these
partial matches need to be linked.

Link partial matches

Two partial matches must be linked if they belong to the same candidate and
the temporal trajectories of both have good temporal continuity. We should
also mention that there is an overlap of half the step time of the algorithm to
avoid discontinuities between partial matches.

Therefore, when a partial match has the same candidate and a good temporal
continuity with another one from the next step of the algorithm, they are linked,
forming one bigger partial match. When, in the next step of the algorithm, the
candidate of the partial match does not repeat, or it does not have a good
continuity, it becomes a final match, which will be displayed in the output.
When we merge two partial matches belonging to the same candidate, we
keep this candidate; what has to be changed is the end time of the match. The
beginning time of the match is set by the beginning time of the first partial
match. The end time is set by the last partial match that can be linked
according to what we have explained before about when two partial matches
can be linked.

Let us take again the example where we have as input a 180-second song
immediately followed by another song, and a step time of the algorithm of 60
seconds. In principle we should have 6 partial matches (180/60 * 2 because of
the mentioned overlap) for the candidate of the first song, with good continuity
between them. Then we should have a partial match from a different
candidate, corresponding to the next song. This block should merge those six
partial matches forming a final match, to be shown in the output.

4.3. The Code


The code is shown in MATLAB. Here are the two main functions of the code:
on one side the matching algorithm (the new_matching_algorithm function)
and, on the other, the mechanism for searching good temporal sequences or
trajectories (trajectories_searching_mechanism).

4.3.1. Function: new_matching_algorithm


This is the main function of the implemented solution. It receives the
candidates given frame-by-frame by the system and returns the matches built
from that information. It also returns the partial matches for further information.
function [partialMatches,matches] = new_matching_algorithm(data)
% data is the whole information about candidates that the system
% retrieves for an audio frame
% remember: 20 candidates per step of the system
% step of the system set at 1 s
% 20 candidates per second

%%%%%%%%%%%SOME PREVIOUS DECLARATIONS%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


threshold = 100; %threshold for the number of points in a good temporal sequence to be a partial match
candidate_matrix = [-1]; %candidate_matrix declaration and initialization
matches = []; %matrix that will store the matches
detectedCandidates = 20; %20 candidates per step of the system
nframes = length(data)/detectedCandidates; %number of frames
step = 20; %every step/2 frames the algorithm looks back to get matches (duration of the waiting step)
nPartialMatches = 0; %number of partial matches
nMatches = 0; %number of matches
points_number = 5; %minimum number of points in a trajectory to give a partial match
offset_error = 100; %error for the offset point (when calculating the trajectories)
m_error = 0.05; %error for the slope (when calculating the trajectories)

%%%%%%%%%%%%STARTING THE ALGORITHM%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


frame = 1; %initialize frame
for frame = 1:nframes

%%%%%%%%STORING THE CANDIDATES IN THE CANDIDATE MATRIX%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


for i = 1:detectedCandidates
mid = data( detectedCandidates*(frame-1) +i ,1);
%id of the candidate (within the 'frame', adding 'i')
qtime = data( detectedCandidates*(frame-1) +i ,10);
% column 1 is where the id is and column 10 where the qtime is
reftime = data( detectedCandidates*(frame-1) +i ,11);
% column 11 is where the reftime is
pos = find ( candidate_matrix(:,1) == mid ,1);
% position of the candidate in the matrix
% if exists return the pos = position, if not pos will be empty
%is the candidate a 'new candidate'? OR is has already appeared?
if ( pos >= 1 ) %if pos >= 1 this will be the position of the already appeared
%candidate
candidate_matrix(pos,frame) = qtime;
candidate_matrix(pos+1,frame) = reftime;

else %else if pos is empty it is a 'new candidate'


if frame == 1 && i == 1 %if the first frame
candidate_matrix = [mid qtime]; % [mid qtime]
candidate_matrix = [candidate_matrix; mid reftime];
% the reftime will be down the qtime in the matrix with the same id
elseif frame == 1 && i ~= 1
candidate_matrix = [candidate_matrix; mid qtime]; % [mid qtime; mid1
% qtime1]
candidate_matrix = [candidate_matrix; mid reftime];
else %if is one of the next frames put as many 0's as frames has passed
candidate_matrix = [candidate_matrix; mid -ones(1,frame-1) qtime];
% [mid 0 0 0 ... 0 qtime]
% set the 'mid' in the beginning of the row
% then 'frame-1' zeros, frames from the
% past finally in 'frame' position set
% 'qtime'
candidate_matrix = [candidate_matrix; mid -ones(1,frame-1) reftime];
end
end
end

%%%%%%%%%%%%%%%%%%SPLIT CANDIDATES%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

if rem(frame,step/2) == 0 && frame ~= step/2 % different from step/2, avoiding
 % the first overlapping, starting
 % with the 'step' frame

%size of the candidate_matrix matrix


[cand_row_length, cand_column_length] = size(candidate_matrix);

i = 1; %initializing the index we will use


for j = 1:(cand_row_length/2) % the half because there are two lines,
% one for qtime and other for reftime
if (i+1 > cand_row_length || i > cand_row_length)
%do nothing
else
taudio = (candidate_matrix(i,:))'; %taking the rows, qtime
tref = (candidate_matrix(i+1,:))'; %taking the rows, tref
mid = taudio(1);

%The coordinates
points = [tref(frame-step+2:frame+1) taudio(frame-step+2:frame+1)];
%taking from the second, the first position is the id.

%%%%%%%%%%%%%%%%%%CALCULATING TRAJECTORIES%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

[to,tf,offsethist] = trajectories_searching_mechanism(points);

if ~isempty(offsethist)
pos = find(offsethist(:,3) >= points_number);
if ~isempty(pos)
nPartialMatches = nPartialMatches + 1;
partialMatches(nPartialMatches).mid = mid;
partialMatches(nPartialMatches).offsethist = offsethist(pos,:);
partialMatches(nPartialMatches).step = frame/step;

%%%%%%%%%%%%%%%%%%%LINKING THE PARTIAL MATCHES%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

if (length(matches) ~= 0)
pos_mid = -1; % default, mid not found
for t = 1:length(matches)
if (matches(t).mid == mid)
pos_mid = t;
end
end
if ( pos_mid >= 1 ) %this id exists
for k = 1:length(offsethist(:,1))
% Adding some error to unify some 'very similar'
% offsets and m's
pos_offsethist = find( ...
( round(matches(pos_mid).offsethist(:,1)*100000 + 0.00005) < round( (offsethist(k,1) + offset_error)*100000 + 0.00005) ) & ...
( round(matches(pos_mid).offsethist(:,1)*100000 + 0.00005) > round( (offsethist(k,1) - offset_error)*100000 + 0.00005) ) & ...
( round(matches(pos_mid).offsethist(:,2)*100000 + 0.00005) < round( (offsethist(k,2) + m_error)*100000 + 0.00005) ) & ...
( round(matches(pos_mid).offsethist(:,2)*100000 + 0.00005) > round( (offsethist(k,2) - m_error)*100000 + 0.00005) ) ); %same offset and m, with some controlled error
if (pos_offsethist >= 1) %found
matches(pos_mid).offsethist(pos_offsethist,5) = offsethist(k,5); %update 'te'
matches(pos_mid).offsethist(pos_offsethist,7) = offsethist(k,7); %update 'teRef'
matches(pos_mid).offsethist(pos_offsethist,3) = matches(pos_mid).offsethist(pos_offsethist,3) + offsethist(k,3); %add the number of counts
else
matches(pos_mid).offsethist = [ matches(pos_mid).offsethist; offsethist(k,:) ];
end
end
else
nMatches = nMatches + 1;
matches(nMatches).mid = mid;
matches(nMatches).offsethist = offsethist(pos,:);
matches(nMatches).step = frame/step;
end
else
matches.mid = mid;
matches.offsethist = offsethist(pos,:);
matches.step = frame/step;
nMatches = nMatches + 1;
end
end
end

i = i + 2;
end

end %end of the principal for


end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%TAKING INTO ACCOUNT THE LAST FRAMES%%%%%%%%%%%%%%%%%%%%%%%%

framesLeft = rem(frame,step/2) + step/2;


tref = (frame - framesLeft + 1 : 1 : frame)';

%size of the candidate_matrix matrix


[cand_row_length, cand_column_length] = size(candidate_matrix);

for j = 1:(cand_row_length/2)
if (i+1 > cand_row_length || i > cand_row_length)
%do nothing
else
taudio = (candidate_matrix(i,:))'; %taking the rows
tref = (candidate_matrix(i+1,:))'; %taking the rows, tref
mid = taudio(1);
points = [tref(frame-framesLeft+2 : frame+1) taudio(frame-framesLeft+2 : frame+1)];
%counting from the second position, the first position is the id.
[to,tf,offsethist] = trajectories_searching_mechanism(points);

if ~isempty(offsethist)
pos = find(offsethist(:,3) >= points_number);
if ~isempty(pos)
nPartialMatches = nPartialMatches + 1;
partialMatches(nPartialMatches).mid = mid;
partialMatches(nPartialMatches).offsethist = offsethist(pos,:);
partialMatches(nPartialMatches).step = frame/step;
end
end

i = i + 2;
end

end

4.3.2. Function: trajectories_searching_mechanism

This function receives a set of temporal coordinates (points), with the query
time on one side and the reference time on the other. It returns a matrix of
trajectories found, with the beginning time of the match (to) and the end time
of the match (tf). The parameters that characterize the trajectory, namely the
slope (m), the offset point (offset), the number of points belonging to the
trajectory (counts) and the beginning time (tb) and end time (te) of the
trajectory, are stored in the variable offsethist.
function [to,tf,offsethist] = trajectories_searching_mechanism(points)
%|-----------------------------------------------------------|
%| trajectories_searching_mechanism                           |
%|                                                            |
%| trajectories_matrix = matrix of the possible trajectories  |
%| to = initial time of the match                             |
%| tf = end/final time of the match                           |
%| offsethist = [offset m counts tb te tbRef teRef]           |
%|   --> histogram of the trajectories with:                  |
%|   ·offset = offset point                                   |
%|   ·m = slope                                               |
%|   ·counts = number of points belonging to that trajectory  |
%|   ·tb = beginning time of the trajectory                   |
%|   ·te = end time of the trajectory                         |
%|   ·tbRef, teRef = corresponding reference times            |
%|-----------------------------------------------------------|

tinput = points(:,2); %vector of input times


tref = points(:,1); %vector of ref times
sim = zeros(1,length(points));
merror = 0.1; % error for the slope
trajectories_matrix = [];
offsethist = []; %[offset, m, counts]
flag_1st_pos = 0;
m_error = 0.05;
offset_error = 50;
counted_matches = 5; % numbers of points in the diagonal

for j = 1:length(points) %go through all the points


for i = 1:(length(points)-j) %one point less to take into account every step

if tinput(j) == -1 || tinput(j+i) == -1 || tref(j) == -1 || tref(j+i) == -1


% Do nothing
i = i;
else
%tref = m*tinput + n; D is a structure with m, n and offset
if (tinput(j+i) == tinput(j)) %controlling division by zero
D.m = -1;
else
D.m = (tref(j+i) - tref(j)) / (tinput(j+i) - tinput(j));
%slope (from the point we are at, 'j', in relation to all the others 'j+i')
end
D.n = tref(j) - D.m*tinput(j); %n factor (substituting one point of the diagonal)
D.offset = -D.n/D.m;

%filling the trajectories matrix with the slopes
trajectories_matrix(j,i) = D.m;

if (D.m >= (1-1*merror) && D.m <= (1+1*merror))


if flag_1st_pos == 0 %if this is the first position of the histogram
offsethist = [D.offset D.m 1 tinput(j) tinput(j+i) tref(j) tref(j+i)]; %create the first row
flag_1st_pos = -1; %change the flag to not enter here anymore
else %there are already positions
%Add the offset histogram
%Adding some error
pos = find( offsethist(:,1) < D.offset + offset_error & ...
offsethist(:,1) > D.offset - offset_error & ...
offsethist(:,2) > D.m - m_error & offsethist(:,2) < D.m + m_error);

if (pos > 0) %this offset with this m exists in the histogram


offsethist(pos,3) = offsethist(pos,3) + 1; %one more counted
offsethist(pos,5) = tinput(j+i); % renew the end time
offsethist(pos,7) = tref(j+i); %renew the end ref time
else %this offset and m does not exist in the histogram
offsethist = [offsethist; D.offset D.m 1 tinput(j) tinput(j+i) tref(j) tref(j+i)];
%create it with one count [new row]. Set the
%beginning and input time
end
end
end

%count the points which are part of correct trajectories_matrix


if (D.m >= (1-1*merror) && D.m <= (1+1*merror))
sim(j) = sim(j) + 1;
end
end
end
end

%index of the row with the maximum number of similarities
max_sim = find( sim == max(sim) ,1, 'first');

if (max(sim) > counted_matches) %to be a match there have to be more than one
%point in the diagonal returning the beginning
%and end time of the match
po = max_sim;
%the position of the first point is given by the position of the row with
%maximum similitudes
to = points(po,2); %2 column, where the time_input is
pf = find( trajectories_matrix(max_sim,:) > 0.9 & trajectories_matrix(max_sim,:) < 1.1, 1, 'last'); %find the last valid match
pf = pf + max_sim;
%from relative position to absolute position (summing the position of the row
%with max. similitudes)
tf = points(pf,2); %2 column, where the time_input is
else
to = -1;
tf = -1;
end

4.3.3. Complexity overhead of the proposed solution

The complexity of this algorithm is due to the following parameters:

Trec: the time (in seconds) that the algorithm waits for the system to give
candidates.

DetectedCandidates: the number of candidates that the system gives every
frame. It is a parameter of the system, outside of the newly implemented
algorithm.

These two parameters are selected a priori and do not depend on the length
(n) of the fingerprint database. Usually, the complexity of fingerprinting
algorithms depends basically on the length of the fingerprint database where
the comparisons are done, and such databases can have more than one
million fingerprints. As the proposed solution does not depend on n, its
complexity does not represent a significant overhead with respect to the
complexity that the system already had.

The complexity of the new algorithm (O') can be expressed as:

O' = O(n) + K

where O(n) is the complexity of the existing algorithm, K is the constant added
by the new algorithm and K<<O(n).

Chapter 5
Methodology

5.1. Test set


To properly test the cases that we have been discussing in this work we have
used an in-house genre database. The audio files are labeled as follows:
classical, dance, jazz, hip hop, rhythm 'n blues, rock. Each genre is composed
of 55 instances. We have also added a home-made category labeled as
'techno'. This way we can test the behavior of the system with a repetitive
electronic style of music compared to other styles. We expect to have a lower
accuracy with techno than with the other styles, and then with dance, as it has
a structure similar to techno; we therefore expect the other genres to have a
significantly higher accuracy. We assume that the sample used is big enough
to be representative and to rule out chance effects that could bias the results.

To perform the testing we have divided the audio files into portions of 30
seconds to stress the system a little more, given the nature of the problem at
hand. That way the system does not have the whole song to try to find a
suitable match, and jumps between candidates should be more evident. The
shorter the audio stream to be detected, the more difficult it is for the system
to find a strong temporal sequence for it.

Each style is tested separately, to observe the isolated results by style. As we
have cut the songs into portions of 30 seconds, we have a number of portions
belonging to different songs. All these portions (from one isolated genre) are
concatenated ensuring that two portions belonging to the same song are not
concatenated together (are not consecutive). In this way we avoid having
longer portions of one song: if we had, for example, 2 consecutive portions
belonging to the same song, they would form a bigger portion of 60 seconds.
Note also that the number of portions per genre is kept the same so that the
genres can be compared on equal terms. This way we can test every genre in
the same way.
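
A minimal sketch of how such an ordering can be obtained is shown below (the
numbers of songs and portions are placeholders): the portions are shuffled and
re-shuffled until no two consecutive portions come from the same song.

songOf = repelem(1:10, 6);               % e.g. 10 songs with 6 portions of 30 s each
order = randperm(numel(songOf));         % random order of the portions
while any(diff(songOf(order)) == 0)      % repeat while two consecutive portions share a song
    order = randperm(numel(songOf));
end
% 'order' now gives a concatenation sequence with no consecutive same-song portions.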

5.2. Evaluation measures


For the evaluation of the results we will take into account three principal
factors:

(a) Accuracy (song detection): the song found for the piece at a certain point
and the song in the groundtruth at the same time are coincident.

(b) Portion accuracy (portion detection): in addition to the song detection, the
correct portion is also found.

(c) Portion error (error in the portion detection): the error in the detection of
the piece is calculated as:

e = |Tiq – Tig| + |Tfq – Tfg|

where Tiq and Tfq are the initial and final temporal points
detected for a certain query (q), and Tig and Tfg are the
initial and final temporal points in the groundtruth (g), that
is, the real content of the audio.
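
For clarity, a small numeric example of this measure (with made-up times, in
seconds) is:

Tiq = 61.0;  Tfq = 92.5;                 % detected start and end of the portion (query)
Tig = 60.0;  Tfg = 90.0;                 % groundtruth start and end
e = abs(Tiq - Tig) + abs(Tfq - Tfg);     % portion error = 1.0 + 2.5 = 3.5 seconds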

Chapter 6
Results

6.1. Results: system previous to modifications


As explained in the methodology chapter (see chapter 5), for every genre the
accuracy, portion accuracy and portion error were calculated and averaged.
The results for every genre with the system previous to the modifications,
applying the existing solution (segment_search = '1'), are shown in table 6.1.

Table 6.1. Results of the system previous to modifications

The first thing to remark is that, as expected, the lowest accuracy is found in
techno music. Then classical and dance come with the same level of accuracy.
The low accuracy on dance music was also expected, as it has a structure
similar to techno, with a high repetitiveness of audio loops. The hypotheses to
explain the lower accuracy obtained in classical music are not so clear.
Listening to the excerpts, many silences appear along them; maybe this could
be one of the aspects that confuse the matching system. Another hypothesis is
the great precision of classical musicians: we can hypothesize that if there are
two equal parts to be played, they can play them almost exactly the same and
confuse the system. As can be seen, the accuracies for the rest of the music
styles are quite similar, between 92% and 95%. Also, the portion error is
higher for those styles with lower accuracy and, in this case, techno and
dance are quite differentiated from the other genres.

6.2. Results: new matching method vs. existing matching method

The same testing was made for every genre with the new matching method
developed. As the same methodology was followed to evaluate the success of
the system, the results of the system previous to the modifications and of the
new matching method can be compared directly. The results for every genre
with the system previous to the modifications and the results with the new
matching method are shown in table 6.2. They are situated one above the
other, separated by genre, to compare them properly.

Table 6.2. Results of the system previous to modifications versus the new
matching method.

Given the way the algorithm was designed and works, we expected to raise the
accuracy for those music genres with a high level of repetitiveness. But, as can
be seen in the comparison table between the old system and the results with
the new algorithm, we raised the accuracy of every genre involved. So we can
deduce that the problem found in techno music also appears in the other
genres, although to a lesser extent. We must say that the largest improvement
has been in classical music, around 8%. As with the low accuracy of classical
music with the old system, it is difficult to hypothesize and explain why there is
such an improvement in that accuracy. The next largest improvement is in
techno music. That was expected, as the new algorithm was focused on
solving the problems around repetitive music. It is also remarkable that with
jazz music the improvement has been high as well, similar to that obtained in
techno, reaching an accuracy close to 100%.

The accuracy of the old system was 91.55% on average. With the new
algorithm implemented, the accuracy has risen to 95.83% on average. The
portion accuracy has, in general, been maintained. The portion error has not
improved, except in techno music. It may be a bug that affects the limits of the
detected portions (beginning and end times of the portion). It should be
revised to, at least, reach the level that the previous system had. In spite of
that, it must be said that the portion error is only interesting for certain
applications, and having a value between 2-4 seconds is not so relevant.

Chapter 7
Conclusions

This thesis reports on the process of improving an industrial-strength audio
fingerprinting system. More concretely, observations from customers,
developers and users indicated that the system was obtaining lower accuracies
in electronic music identification. The objectives were then (a) to formalize a
set of tests to conclude whether this was really happening, (b) to empirically
reproduce the aforementioned observations regarding electronic music, (c) if
that was really the case, to find the reasons why the system was giving this
worse response and, finally, (d) to develop a solution to improve the system so
that it achieves the accuracies obtained in the most popular musical genres.

The state of the art of audio fingerprinting, fingerprinting systems and the
audio identification product Vericast has been reviewed. The problem
statement has been explained in detail and a solution to this problem has been
developed and explained in depth.

A test set has been formalized and developed, and it confirmed that the system
was obtaining a lower accuracy in techno music identification. As explained in
this work, the problem was related to the repetitiveness of some electronic
genres; the problem was therefore focused not on electronic music but on
repetitive music. A solution has been proposed and implemented. That
solution consists of a matching algorithm situated at the end of the framework
of Vericast, the audio identification product we were trying to improve. This
new matching method has been tested for accuracy and results have been
obtained. The results show that the implemented solution has increased the
accuracy in all genres, between 2.58% (in dance music) and 8.37% (in
classical music). The average accuracy has grown from 91.55% to 95.83%. In
particular, for techno music the increase has been 5.24%.

