
UNIVERSITÀ DEGLI STUDI DI MILANO-BICOCCA

FACOLTÀ DI SCIENZE MATEMATICHE, FISICHE E NATURALI

Corso di Laurea Magistrale in Informatica (MSc in Computer Science)

FINGERPRINT AND QUALITY-BASED AUDIO TRACK RETRIEVAL


SUPERVISORS:
dott.ssa F. Gasparini (advisor) dott. S. Bianco (co-advisor)

Submitted by: Riccardo Vincenzo Vincelli (709588) r.vincelli@campus.unimib.it

Academic Year 2011-2012 - third session, 20/11/2012

Contents
1 MP3
  1.1 Encoding
    1.1.1 PCM
    1.1.2 Analysis polyphase filter bank
    1.1.3 FFT
    1.1.4 Psychoacoustic model
    1.1.5 MDCT with Windowing
    1.1.6 Quantization
    1.1.7 Huffman Coding
    1.1.8 Bit stream formatting and CRC word generation
  1.2 Decoding
    1.2.1 Synchronization and Error Checking
    1.2.2 Huffman decoding and Huffman info decoding
    1.2.3 Scale-factor decoding and Requantization
    1.2.4 Reordering
    1.2.5 Joint stereo decoding, Alias reduction and IMDCT
2 The fingerprinting technique
  2.1 Some audio fingerprint techniques
  2.2 Synopsis of the technique
  2.3 Forward algorithm
  2.4 Backward algorithm
  2.5 Pseudocode
    2.5.1 Blocks generation
    2.5.2 Frequency rearrangement and sub-band division
    2.5.3 SBE
    2.5.4 PMF
    2.5.5 Entropy
    2.5.6 Bit stream output
    2.5.7 BER
  2.6 Implementation
3 Testing
  3.1 Basic distortions
    3.1.1 White noise
    3.1.2 Echo
    3.1.3 Pitch shift
    3.1.4 Voice
  3.2 SNR
  3.3 Testing infrastructure
  3.4 Results
    3.4.1 White noise
    3.4.2 Echo
    3.4.3 Pitch shift
    3.4.4 Voice
4 Conclusions
References

Introduction
An audio fingerprinting technique is, in its most general form, a pair of algorithms, the fingerprinting algorithm and the matching algorithm. The fingerprinting algorithm examines and processes a set of salient features of a given input audio track, generating a small digest from them. The matching algorithm is used to identify an unknown audio track, by computing the fingerprint for a small sample of the track itself and comparing it with a set of known full-length fingerprints. The idea can be applied to any digital media content, but audio identification is of great interest in practical implementations, with much commercial and non-commercial software available for use. The literature on the topic is large and many robust audio fingerprinting algorithms have been implemented in both commercial and free software. A strong mathematical modeling drawing from psychoacoustics, Fourier theory, information and coding theory, statistics and probability is at the cornerstone of the most successful techniques, and great emphasis is also placed on the complexity and performance of the algorithms, since in the most common scenario the client side is operated on portable devices (e.g. smartphones). A successful technique exhibits robustness to common degradations, collectively known as noise, that can deteriorate the quality of the track to be identified. The key to robust performance lies in the ability to identify features that are to some extent invariant to noise, salient information the track retains even if its quality is noticeably degraded. In order to achieve this, a deep knowledge of how our auditory system works is imperative; for example, it is important to observe that not all frequency contents are equally important, and well-defined sensitivity peaks along the spectrum exist. The fingerprinting (forward) and matching (backward) algorithms are both deterministic, but since a good technique seeks a tradeoff between robustness and efficiency, false positives with respect to an estimated tolerance are accepted. Otherwise stated, a robust yet efficient algorithm returns the correct answer with a probability close to 1 and takes an acceptable time to compute it. The following is the typical use case scenario for audio fingerprinting techniques:
- a DMS (Digital Media Service) maintains a large database of popular audio tracks and implements, as a free-of-charge service, an intelligent query and retrieval system for the database
- every time a new song is added, it is processed with the forward algorithm; a second associated database table stores, for each track, the corresponding fingerprint
- the client side consists in an implementation of the forward algorithm, and the input audio sample to be identified is fed through the computer/device microphone or as an existing MP3/WAV file
- the communication medium is the Internet and, once the song has been identified by the server, additional information is returned (e.g. ID3 fields, lyrics, pictures...)

Curious analogies between the fingerprint digest and a DNA strand or a hash code help to better understand the whole picture. Just like DNA material, a

fingerprint can be said to belong to a precise person but, even in the case that all laboratory operations on tissue samples are carried out with great care (i.e. no contaminations), false positives are still possible; this is one of the main reasons why, even though DNA tests are a precious and widely used forensic resource, in law enforcement investigations they are rarely the only piece of evidence. Compared to the output of a hash function, fingerprints are easy to compute too, and can legitimately be used as references for the original input file (i.e. when searching through a database). In this work the MP3 audio fingerprinting technique proposed in [1] is analyzed and implemented with localized yet relevant improvements. The implementation is thoroughly tested, with test cases designed to evaluate the stability of the technique with respect to noise at different intensity levels. Basically, the fingerprinting algorithm divides the input file into blocks of MP3 granules and generates a bit stream by examining, through the entropy statistic, the variation of the information contributed by the blocks. Important audio features are not explicitly isolated, but their existence affects the statistics of the track and emerges in the computation of entropy differences. From this perspective, two tracks having an almost equal fingerprint share a certain trend, since a bit set to 0 in the i-th position of the bit stream means that the i-th block is somewhat less informative, for example less rich in sound timbre, than the one to follow. One of the most important improvements contributed by this implementation is the ability to process not only single-channel (mono) MP3 tracks but also stereo ones, maximizing the information taken from the stream. While this is not relevant for a large part of MP3 tracks found on the Internet, where the channels all have the same bits, more refined musical expressions or ad-hoc elaborated audio tracks might present effects requiring stereo channeling (e.g. ping pong echo). The implementation of the technique comes as a pair of C programs, one per algorithm. The matching algorithm is quite simple, being just one compilation unit, whereas the fingerprinting algorithm is implemented in two modules, one for the algorithm itself and one for the decoding library. C was the language of choice for different reasons: for example, it is standardized and performance-oriented. An early-stage implementation was completed in MathWorks MATLAB, but it did not fulfill basic time performance requirements. Testing aims at evaluating performance on a large and heterogeneous set of samples, fed to the algorithms both in a clean and in a noisy version. Noise effects taken into account are white noise addition, echo/delay, pitch shift and voice addition. Test results confirmed what was claimed by the authors in the research, with optimal results even in the presence of quite obtrusive noise disturbances.

1 MP3

MP3 (Moving Picture Experts Group-1/2 Audio Layer 3) is the name of the well-known lossy audio compression format standard, which has become over the years the most adopted format for persisting and exchanging music on the Internet. The standard carefully describes the structure and interpretation of

an MP3 bit stream and what the decoded result is expected to be, together with encoding and decoding block diagrams; this constitutes the main part of the non-normative corpus of the standard. No constraints are set on the encoding algorithm, which means that, for a given uncompressed bit stream (i.e., a WAV format audio track), as long as an encoder outputs a meaningful bit stream consistent with the standard, it can be labeled as compliant; as usual, reference implementations were provided with the standard. On the other hand, the decoding phase is carefully presented, and most decoders are said to be bit stream compliant in the sense that their output matches, within a certain tolerance, the one formally defined in the standard. A deep understanding of the encoding and decoding processes is not an easy matter, as average knowledge in many different fields, ranging from information theory to psychoacoustics, is required. For the sake of clarity we present here the minimal information needed to understand the two phases at a general level; the concepts presented form the basis for approaching the fingerprinting algorithm too, as it directly operates on the MP3 compressed bit stream. An exhaustive yet not overwhelmingly technical guide to the standard is [2]; in [3], among other things, Layer 3 is compared to its predecessors, Layers 1 and 2; [4] is the standard, published in 1993 and updated in [5] in 1995.

1.1 Encoding

Figure 1: Encoding block diagram.

1.1.1 PCM

PCM stands for Pulse code modulation, which is an easy way to digitally represent an analog signal; a PCM signal is obtained with just the following three steps:
- sampling of the analog signal
- quantization of the samples
- binary representation of the quantized values

Figure 2: The PCM process.

The fidelity of the acquired signal depends upon the sample rate and the bit depth: a minimum sampling rate is forced by the fundamental Nyquist-Shannon theorem, and a secondary result, known as Widrow's theorem, can be applied to ensure the quantization noise reduces to additive white Gaussian noise. If the quantization function is linear, the method is referred to as LPCM (linear PCM). A pure PCM bit stream typically requires high bit rates, even for non-audiophile quality contents: a rather simple acquisition scheme comes at the price of a great deal of information to be kept. The intuition behind MP3 compression is that a lot of this information is redundant and can be discarded without affecting the overall quality of the digitized audio track in a perceptually significant way. A common LPCM digital format is WAV (Waveform Audio File Format), by Microsoft and IBM; it stores LPCM data with the same parameters used for audio CDs, where the sample rate is 44.1 kHz and the bit depth 16 bits.
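As a minimal illustration of the three steps above, and not part of the thesis code, the following C fragment quantizes an already-sampled value, assumed to lie in [-1, 1], to a 16-bit LPCM word; the function name and the plain uniform quantizer are assumptions made for the example.

    #include <stdint.h>
    #include <math.h>

    /* Uniformly quantize a sample in [-1, 1] to a signed 16-bit LPCM value. */
    int16_t lpcm16_quantize(double sample)
    {
        if (sample > 1.0)  sample = 1.0;   /* clip out-of-range input */
        if (sample < -1.0) sample = -1.0;
        /* 32767 levels per polarity, i.e. 16-bit depth */
        return (int16_t)lround(sample * 32767.0);
    }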

1.1.2 Analysis polyphase filter bank

Our auditory perception is not uniform along the range of frequencies we can hear, approximately 20 Hz - 20 kHz; for example, there is a sensitivity peak around 1 - 5 kHz. MP3 compression is designed to take advantage of this fact, and the crucial phase in this sense is the application of a psychoacoustic model (see below). The analysis polyphase filter bank is the first step in this direction, because different frequency ranges are identified and saved separately; this filtering is named polyphase quadrature filter (PQF). For each channel, the encoding process starts with partitioning the input into frames of 1152 samples and proceeds by filtering the spectrum of each of them into 32 equally-spaced frequency sub-bands, each (f/2)/32 wide, where f/2 is the Nyquist frequency of the input PCM, i.e. half the sampling frequency f. For example, for f = 44.1 kHz we have f/2 = 22.05 kHz and each band will be about 689 Hz wide: [1, 689], [690, 1378], ..., [21362, 22050]. In this phase, the frequency contribution of each single sample is balanced across the 32 sub-bands, yielding a factor-32 information increase, as for each sample 32 values are computed. For this reason, in each sub-band the number of values is correspondingly decimated by a factor of 32; for each frame, these sub-band values are then grouped into 3 sets of 12 each. This process is lossy, the original signal cannot be recovered; it introduces some artifacts too, but they are inaudible. One of the reasons for this is the impossibility of constructing bandpass filters with a perfectly square response, and this determines sub-bands which overlap a little, where the overlap depends upon the machine precision of the encoder. All these effects are collectively referred to as aliasing. Finally, it is worth noticing that even if this phase is really useful to further processing oriented to advanced psychoacoustic models, the sub-bands are unrelated to any critical-band model of our auditory system.
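To make the figures above concrete, a small illustrative C program, not taken from the thesis implementation, that prints the 32 equally spaced sub-band boundaries for an assumed 44.1 kHz sampling frequency:

    #include <stdio.h>

    int main(void)
    {
        const double fs = 44100.0;              /* sampling frequency (Hz)         */
        const double width = (fs / 2.0) / 32;   /* each sub-band is (f/2)/32 wide  */
        for (int i = 0; i < 32; i++)
            printf("sub-band %2d: [%8.1f, %8.1f] Hz\n",
                   i, i * width, (i + 1) * width);
        return 0;
    }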

Figure 3: The filter bank process.

1.1.3 FFT

The discrete Fourier transform has become ubiquitous in digital signal processing as a tool for switching from the time domain to the frequency domain (and back, with the inverse). The DFT is a key ingredient in many compression and editing techniques. For a brief but self-contained introduction to the subject see [6]. Any algorithm computing the DFT with computational time complexity less than O(n^2), where n is the length of the sequence of complex numbers to be

transformed, is said to be an FFT (fast Fourier transform) algorithm. At the moment, algorithms faster than O(n lg n) are not known. Let x_0, ..., x_{N-1} be a sequence of N complex numbers. The DFT is defined by the formula:

    X_k = \sum_{n=0}^{N-1} x_n \, e^{-\frac{2\pi i}{N} k n},    k = 0, ..., N-1

and its inverse is:

    x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{\frac{2\pi i}{N} k n},    n = 0, ..., N-1

The result X_0, ..., X_{N-1} is a complex sequence too, each value encoding in its amplitude and phase those of a sinusoidal component at a frequency per sample corresponding to k/N, which are found as |X_k|/N and arg(X_k) respectively; these sinusoidal components decompose the function of n represented by the input sequence. DFT formulae can come in many equivalent flavors, depending on the application field; for the two above, following the most common convention, the normalization factors can be any pair of numbers whose product is 1/N and the signs of the exponents can be interchanged. If the normalization factor of the forward transform is 1/N, the zero-frequency value, known as the DC value, is the mean of the input sequence. In many applications the result sequence is reordered so that the DC score is right in the middle, and for a 1D DFT it is enough to swap the left and the right halves. For a given sequence of length N, an n-point DFT is commonly intended as the transformation of only the sub-sequence formed by the first n elements, discarding the others if n < N, or of the original sequence zero-padded in the remaining empty positions otherwise. The greater n, the finer the frequency decomposition represented by the sinusoidal components, as more frequency scores will be returned, actually as many as the elements of the possibly zero-padded input vector: a single spectrum can be examined at different degrees of detail. In this stage, in parallel with the polyphase filter bank processing, both a 256-point and a 1024-point FFT are performed on the input frames. As a frame is 1152 samples long, some samples are to be ignored, and the choice is to center the FFT window, discarding the first and last 64 samples for the 1024-point FFT and the first and last 448 samples for the 256-point FFT. Thanks to the fact that 256 and 1024 are powers of 2, particularly fast FFT algorithms can be employed, such as the DIT (decimation-in-time) version of the Cooley-Tukey FFT algorithm. The information conveyed by the 256-point FFT is useful to spot great spectrum differences between adjacent frames, and the 1024-point FFT bears the minimum spectral resolution information needed to carry out effective compression. The results are fed to the Psychoacoustic model block, where the major compression efforts take place.
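As an illustration of the definition only, a direct O(n^2) evaluation of the forward DFT in C follows; real encoders rely on much faster FFT routines, and this sketch is not part of the thesis code.

    #include <complex.h>
    #include <math.h>

    /* Direct evaluation of X_k = sum_n x_n e^{-2*pi*i*k*n/N}, k = 0..N-1. */
    void dft(const double complex *x, double complex *X, int N)
    {
        const double pi = acos(-1.0);
        for (int k = 0; k < N; k++) {
            X[k] = 0;
            for (int n = 0; n < N; n++)
                X[k] += x[n] * cexp(-2.0 * I * pi * k * n / N);
        }
    }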

Figure 4: 1024/256-point Fourier spectra of a 16-bit sample. A plot like this is obtained by just interpolating linearly between the result scores.

1.1.4 Psychoacoustic model

The concept of a psychoacoustic model is really a fundamental one, and a good strategy here does make the difference in terms of compression achievements and overall fidelity of the output bit stream. An encoder adopting a valid psychoacoustic model is referred to as a perceptual encoder. The information gathered at this stage is passed both to the MDCT block and to the Quantization block (see below). The tasks here are:

- choosing a particular type of window to alleviate artifacts deriving from processing with the MDCT a discontinuous stream of information, as each frame is transformed singly
- computing the information necessary to quantize MDCT scores, so as to save just an amount of frequency information proportional to the importance of the frequency content itself in the context of the spectrum, applying a theoretical model

The outputs of this block are, respectively:

- for each frame, the window type
- for each frequency, quantization thresholds

The window type is chosen by comparing the current FFT spectra pair to the previous one. Relevant differences trigger attacks: new sounds begin and produce audible differences (e.g. after 5 seconds of silence a strong guitar riff breaks in). If this is the case, short windowing is used, otherwise the counterpart is long windowing; the names come from their shapes. Long windows come in three forms, differing according to whether they are followed by, or follow, short windows (start or stop long windows, respectively) or not (standard long). Short windowing is the key to counter a common flaw of lossy encoders, pre-echo. This artifact denotes a spread of the attack and decay over time periods where they are not originally meant to be present, resulting in an artificial backward and forward echo effect. Pre-echo is due to the strict time-domain discontinuities imposed by the use of frames and the subsequent balancing of frequency contents through the filter bank. Pre-echo is generally not a problem except for percussion instruments, and forward/decay pre-echo is much attenuated by the masking discussed below.

Figure 5: Window types. Short windows are actually made up of three overlapping windows, allowing for a more precise time resolution. Start and stop long windows guarantee amplitude continuity with the short ones.


Figure 6: Finite state machine for choosing the appropriate window type.
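A schematic C rendering of the state machine of Figure 6; the type names and the attack flag are assumptions made for the example, and the decision logic of a real encoder is more involved.

    typedef enum { WIN_LONG, WIN_START, WIN_SHORT, WIN_STOP } window_t;

    /* Next window type, given the current one and whether the psychoacoustic
     * model detected an attack in the current frame.                         */
    window_t next_window(window_t current, int attack)
    {
        switch (current) {
        case WIN_LONG:  return attack ? WIN_START : WIN_LONG;
        case WIN_START: return WIN_SHORT;
        case WIN_SHORT: return attack ? WIN_SHORT : WIN_STOP;
        case WIN_STOP:  return attack ? WIN_START : WIN_LONG;
        }
        return WIN_LONG;
    }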

An advanced psychoacoustic model can exploit a great number of psychophysical phenomena, but most of them fall into one of the following categories:

- range-related
- masking

As previously observed, there exists a sensitivity peak within the audible range. As we are less sensitive to extremal frequency contents, very bass or treble sounds require higher volume levels to be perceived. Finally, because the audible range actually shrinks with age, especially around the high frequencies, particular sound information perceived by children ends up being of no use to the elderly. Masking phenomena are subdivided into:

- simultaneous (frequency domain)
- temporal (time domain)

The human auditory system is organized into a number of so-called critical bands, a fact that configures our sound perception as driven by a bank of band-pass filters. It is worth pointing out that even though this is the same idea as the one behind the very first step of the encoding process, the application of an analysis polyphase filter bank, the sub-bands we are talking about here are unrelated. Suppose that a stimulus resonating at a particular dominant frequency is heard. The phenomenon of simultaneous masking determines a particular air pressure threshold, and sounds with components in the same critical band need to be reproduced at volumes higher than the threshold to be heard too.


Figure 7: Simultaneous masking; direct proportionality can be observed.

In temporal masking, regardless of the frequencies involved, a particularly loud sound covers all the other sounds below the threshold that start after it has ceased, still following a proportional pattern (post-masking); in the same fashion, weaker sounds starting shortly before the masker are covered as well (pre-masking).

Figure 8: Temporal masking.

In both cases, the final output of the sub-block is a set of threshold values for the frequency sub-bands, of use in the MDCT to follow. Finally, the choice of the window and the masking model are strictly related. The approximation of the human critical bands in MP3 goes under the name of scale-factor bands. For short windows, sub-band division is less precise, as masking can be exploited to diminish pre-echo, and on transients there is no great need for high frequency resolution.

1.1.5 MDCT with Windowing

The modified discrete cosine transform is a particular discrete cosine transform very popular in digital signal processing for transforming overlapping blocks of consecutive data.

A DCT is basically the result of applying the DFT to a real function whose periodic extension is set to be even at the left border, i.e. symmetric about the origin. By doing this, each frequency content is carried by a cosine only, not by the sum of both a sine and a cosine as in the DFT. This fact, whose proof is not trivial, is somewhat familiar to the reader acquainted with Fourier series of real functions: when the function is even the sine coefficients cancel out, and vice versa. Eight variants of the DCT are needed to specify all the possible left-even extensions. The DCT turns out to have better compression properties than the DFT, in the sense that most of the information is concentrated in the first coefficients; roughly speaking, working on the same input data, to express the information present in the DCT sequence a DFT sequence of twice the length is needed. The MDCT is based on the DCT-IV, and for a sequence x_0, ..., x_{2N-1} of reals it returns a real sequence of half the length, X_0, ..., X_{N-1}; its formula is:

    X_k = \sum_{n=0}^{2N-1} w_n x_n \cos\left[ \frac{\pi}{N} \left( n + \frac{1}{2} + \frac{N}{2} \right) \left( k + \frac{1}{2} \right) \right],    k = 0, ..., N-1

where w_n is the window score. For each sub-band of a single frame, a group of 12x3 samples enters the block in the diagram of Figure 1. For long blocks (frames requiring long windows), the 36 samples are fed to the MDCT and 18 frequency lines are obtained; for short ones, each 12-tuple is processed separately, yielding 6 lines each. In both cases, the overlap is 50%, i.e. the first half of the MDCT-processed sequence comes from the previous frame and the other half from the current one. The output of this step is the granule, the minimum information unit: for long blocks, 576 frequency lines; for short blocks, 192x3 lines. The tradeoff between frequency and time resolution is clear: long blocks have more sub-bands, but short ones capture three, not just one, individually elaborated time slices. These sub-bands are called scale-factor bands in the MP3 jargon.
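For illustration, a direct and non-optimized C evaluation of the MDCT formula above; production encoders use fast factorizations instead, and this sketch is not part of the thesis code.

    #include <math.h>

    /* MDCT of 2N windowed samples x[0..2N-1] into N frequency lines X[0..N-1];
     * w is the analysis window of length 2N.                                   */
    void mdct(const double *x, const double *w, double *X, int N)
    {
        const double pi = acos(-1.0);
        for (int k = 0; k < N; k++) {
            double acc = 0.0;
            for (int n = 0; n < 2 * N; n++)
                acc += w[n] * x[n] *
                       cos(pi / N * (n + 0.5 + N / 2.0) * (k + 0.5));
            X[k] = acc;
        }
    }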

Figure 9: The center and right images are obtained by removing the second half of the output scores of the DFT and DCT respectively; the results make clear that the DCT is better at condensing spectrum information in the first coefficients.

Part of the functionality of this block is also an alias reduction algorithm, compensating for the effects deriving from a necessarily not-perfect polyphase filter bank. These are quite complex from a theoretical point of view, but computationally they just reduce to products. In some implementations alias reduction takes place before reordering.

1.1.6 Quantization

After the psychoacoustic model determines the windowing and the MDCT is performed, it is time for further compression, still driven by the psychoacoustic analysis, on the resulting transformed sequence. At the same time the encoder is constrained to output a bit stream to be reproduced at a given bit rate, and the bit rate is a function of the bit depth; this bit rate can be constant (CBR) or variable (VBR). As observed above, thanks to masking effects some frequency contents turn out to be of little or no importance, and this suggests strong quantization on them. Quantization always comes with noise though, so the introduced disturbance must be insignificant in terms of auditory perception; in other words, the scores are quantized in accordance with the output of the psychoacoustic model. Quantization follows a power law: this helps avoid regular/periodic quantization noise artifacts and allows larger values to be coded less accurately. Some treatment prior to quantization helps attenuate this noise. The sub-bands can be pretreated and quantized either as a whole or in an independent fashion, i.e. globally or non-uniformly. The correction and compression of each single sub-band is encoded in the respective scale-factor numbers, usually stored as differences with respect to the global quantization coefficient, the gain factor. This process is structured into two nested loops, the distortion control loop (outer) and the rate control loop (inner). In this iterative process, the outer loop takes care of adjusting the single sub-band scale-factors for the purpose of adapting the quantization noise to the requirements of the psychoacoustic model, whereas the inner loop works on the global gain with the aim of fitting the quantized values into the number of bits constrained by the bit rate.
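A schematic C sketch of the power-law quantization of one spectral value, assuming the x^(3/4) law commonly described for Layer 3; the step size, which the two loops would adjust iteratively, and the simple rounding are assumptions of the example.

    #include <math.h>

    /* Power-law quantization of one spectral value, given a step size that
     * the rate/distortion loops would adjust iteratively.                   */
    long quantize_line(double x, double step)
    {
        double q = pow(fabs(x) / step, 0.75);  /* larger values coded less finely */
        return (long)floor(q + 0.5);           /* simple rounding                  */
    }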

Figure 10: The two aspects of the quantization process.


1.1.7 Huffman Coding

MP3 also makes use of classic information theory by coding the quantized samples with the Huffman algorithm, a well-known variable-length code where the less frequent a symbol, the longer the codeword it is assigned. In order to fit a particular bit rate, for a fixed sample rate and number of channels, a proper bit depth is required. Once it is fixed, the algorithm can go on using codewords of corresponding length. A number of 15 tables is published in the standard, and the frequency lines are grouped and coded differently with these tables according to their importance. Each granule is subdivided into three variable-length groups: big values (the lines expected to have the greatest scores in absolute value), quad region (intermediate) and zero region (zero-clipped/rounded). In addition to this, table access is further parametrized, yielding a total of 29 different ways to code.
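As an illustration of variable-length decoding as a tree walk, a generic C sketch follows; it is not the actual MP3 table machinery, and next_bit is an assumed input callback.

    /* A node of a binary Huffman tree: leaves carry a decoded symbol. */
    struct hnode {
        int is_leaf;
        int symbol;                 /* valid when is_leaf != 0         */
        struct hnode *child[2];     /* children for input bits 0 and 1 */
    };

    /* Decode one codeword by walking from the root down to a leaf;
     * next_bit() is assumed to return the next bit of the input stream. */
    int huffman_decode_one(const struct hnode *root, int (*next_bit)(void))
    {
        const struct hnode *n = root;
        while (!n->is_leaf)
            n = n->child[next_bit()];
        return n->symbol;
    }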

Figure 11: Huffman binary tree for the source "In my mind I have these thoughts" on the binary coding alphabet: decoding a single codeword corresponds to going down a particular walk from the root to a leaf.

1.1.8 Bit stream formatting and CRC word generation

In this final block the construction of the compressed bit stream takes place. As one expects from the encoding process, the bit stream is an ordered collection of frames, each frame in turn made up of two (mono) or four (stereo) 576-value granules. A frame has additional fields other than the effective data:

- header: contains general information for the frame and the synchronization word telling the decoder that a new frame is starting
- Cyclic Redundancy Check: this field carries the checksum for the sensitive part of the frame, defined in the standard to be the portion of the header and side information fields that, when corrupted, forces the whole frame to be discarded; it is interesting to observe that commonly the loss of some frames does not affect the overall quality in a noticeable way, thanks to the high time-domain resolution (in the MPEG family the length of a frame is on the order of milliseconds); use of this field is optional, and it is common practice to skip CRC generation
- side information: everything necessary to properly decode the frame; examples of stored information are the block type, the scale-factor band lengths in bits, and the length and encoding of the Huffman regions
- main data: the effective data of a single granule, formed by Huffman coded bits and scale-factors; the Huffman coded bits are the actual frequency line values to be decoded, and the scale-factors used during quantization have to be undone when the process is reversed
- ancillary data: the ancillary data subfield is seldom used, and a couple of famous encoders use it for padding purposes for example; its length is undefined, and this is legitimate as the synchronization word allows the starting point of the following frame to be located
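For reference, the header fields just listed can be pictured with a simplified C struct; the field widths follow the usual description of the 32-bit frame header, but this is only a parsed view, not the bit-packed layout a real decoder reads.

    /* Parsed view of the 32-bit MP3 frame header (simplified). */
    struct mp3_header {
        unsigned sync;            /* 11 bits, all set: synchronization word     */
        unsigned version;         /*  2 bits: MPEG-1/2/2.5                      */
        unsigned layer;           /*  2 bits: Layer I/II/III                    */
        unsigned crc_protected;   /*  1 bit: CRC field present when 0           */
        unsigned bitrate_index;   /*  4 bits: index into the bit rate table     */
        unsigned sampling_index;  /*  2 bits: index into the sample rate table  */
        unsigned padding;         /*  1 bit                                     */
        unsigned channel_mode;    /*  2 bits: stereo/joint/dual/mono            */
    };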


Figure 12: The basic units of a frame are the header and the audio data; see the references for details.

1.2 Decoding

Since the investigated fingerprinting algorithm requires just partial rather than full decoding of the MP3 bit stream, the description will not go through the whole process. Partial decoding is commonly intended as halting at the IMDCT block, when ready to return to the time domain. The following descriptions will be short, as the reader has already gained confidence with the encoding process in the previous section.


Figure 13: Decoding block diagram.

1.2.1 Synchronization and Error Checking

The bit stream is parsed to recognize a correct structure, and each frame is identified by searching for the next synchronization word. If the CRC bit is on, checksum verification is performed.


1.2.2 Huffman decoding and Huffman info decoding

For a correct inversion of the Huffman coding the decoder has to know where the first codeword starts, because Huffman is a variable-length encoding, and which substitution tables are in use. This information is provided to the Huffman decoding block by the Huffman info decoding block. Additional elaboration is performed by the Huffman decoding block, for example zero padding to compensate for missing frequency lines, or run-length decoding, especially for the high frequencies, where many scores are zero-clipped.

1.2.3 Scale-factor decoding and Requantization

Three pieces of information are needed in order to invert the quantization process: the scale-factors, which are decoded separately, the global gain, and additional side information. The inverse quantization adopts two different techniques, for short and for long blocks.

1.2.4 Reordering

Output from the previous block is sets of dequantized frequency lines, representing a short or a long block. In the case of a short block, the batch of frequencies is ordered by sub-band, window and frequency, whereas long ones are ordered by sub-band and frequency. This ordering difference aims at maximizing the compression factor of the Huffman algorithm because, in short windows, scores in the same frequency band are much more likely to have equal values, thus yielding one single codeword, than scores taken at progressive time intervals within the windows.

1.2.5 Joint stereo decoding, Alias reduction and IMDCT

If the input stream is not a mono one, different channel information has to be produced. Alias imperfections have to be re-introduced to obtain a correct audio reconstruction. Finally, the IMDCT remaps the lines, yielding 32 sub-bands each carrying 18 time-domain samples.

2 The fingerprinting technique

2.1 Some audio fingerprint techniques

Beneficial to approaching the discussed technique is a brief synopsis of the theoretical tools employed in some of the most known and successful techniques. In [7] a rather straightforward approach to individuating salient audio features is presented:

1. the audio signal is segmented into overlapping frames
2. the Fourier transform is applied to each frame, but only the magnitude spectrum is kept, as our auditory system is poorly receptive to phase shifts
3. this frequency content is subdivided into a number of bands modeling the auditory critical bands (see 1.1.4)
4. a bit stream is computed by looking at energy differences of the scores both along the frequency and the time axes

In the technique of [8], of Shazam smartphone application fame, the data structure at the core of the process is called a constellation. A constellation is obtained by pruning most of the points of a spectrogram, a 3D plot visualizing frequency and amplitude of a signal in time, leaving only those showing a particularly high energy level with respect to their neighbors. This structure is then converted into a handy bit stream by applying hash functions. The paper [9] introduces a very interesting approach taking advantage of classic computer vision techniques: the idea is to treat spectrogram plots as effective images and to perform wavelet analysis on them. Wavelet theory can be seen as an evolution of Fourier theory which allows for a more robust function decomposition, as not only frequency information but also time information is encoded in the function basis.

2.2 Synopsis of the technique

We now examine the fingerprinting technique discussed in the research paper [1]. Information theory plays a primary role in the forward algorithm, as the bit stream is computed by comparing entropy differences between consecutive time units, and in the backward algorithm too, since the matching process just involves a number of Hamming distance computations. From an efficiency perspective, the technique avoids a complete decoding of the MP3 bit stream because the information statistics are extracted from the IMDCT coefficients; this is a noteworthy advantage over full decompression, as uncompressed data, such as WAV files, are space-consuming. The outline of the forward process is as follows:

- the source MP3 is partially decoded (see Figure 13)
- this partially decoded bit stream is partitioned into basic units called blocks, each one collecting 22 granules, with an overlap factor of .95
- frequency lines in each block are rearranged, as by granule grouping a block is formed by both short and long windows
- a new sub-band division is applied to each rearranged block
- a particular function called SBE (sub-band energy) is computed over each new sub-band
- a probability mass function is computed over the SBE scores of each single block
- the entropy of each block is calculated
- a bit stream is generated by comparing the successive entropy values of the blocks

For the backward (matching) process:


- for each database fingerprint, the query fingerprint is slid over it and the Hamming distance at each window is computed
- the minimum of these window distances, divided by the length of the query bit stream, is returned as the matching score
- this set of scores is sorted in ascending order
- if the right song is among the first ten elements of the sorted list whose score is under a threshold value, the query is deemed successful; otherwise the query has failed

2.3 Forward algorithm

The first step in the algorithm consists in scanning the input file and progressively assembling the basic processing units that we refer to as blocks; the reader is advised not to confuse this kind of block with the MP3 encoding blocks. A block is obtained by grouping together 22 granules, which equals 11 frames. In the paper, input MP3 files are implicitly assumed to be mono channel, and the most obvious generalization in order to meaningfully process stereo MP3s as well is to double the size of each block: a block is still built out of 11 frames but carries 44 granules. An overlap factor of 95% means that block B_i is equal to block B_{i-1} in all positions but the last, which holds a new granule. Differently stated, a new block is produced by shifting the previous one to the left by one position, with the new slot assigned to the granule just read. This heavy overlap assures stability in terms of time-domain localization because it minimizes boundary differences between the database and query fingerprints: simplistically stated, each query fingerprint has contents that cannot be too different from a particular set of blocks. The number of granules per block is fixed and the authors do not seem to view it as a point of optimization.

Figure 14: Block overlapping.

Blocks are groups of granules resulting from both long and short windowing, so the frequency content of a block is not homogeneously represented. A normalization is obtained by applying the following rearrangement formula:

    sn(i, j) = sn_l(i, j) = \frac{1}{3} \sum_{n=3j}^{3j+2} |sn(i, n)|    (long)

    sn(i, j) = sn_s(i, j) = \frac{1}{3} \sum_{m=0}^{2} |sn^m(i, j)|    (short)

    i = 0, 1, ..., 21    j = 0, 1, ..., 191

where sn(i, j) is the input coefficient at the j-th frequency line of the i-th granule of a time-domain block and sn^m(i, j) denotes the m-th window of a short MP3 block. This operation is repeated on each of the N blocks. For the long case, every three consecutive coefficients are grouped; for the short case we take three coefficients at the same frequency, one per window. Not all the newly computed frequency lines are retained. The saved lines are then organized into a number of sub-bands based on the scale-factor bands of the short windows. This choice aims at focusing only on relevant auditory information relating to transients (e.g. percussive stimuli).

Figure 15: This table summarizes the normalization and reorganization of the spectrum (image taken from [1]).

The sub-band energy formula returns, as a real number, the importance of the argument sub-band in the context of the argument block by summing up its scores:

    SBE(i, j) = \sum_{m=2i+1}^{2i+10} \sum_{n=MDCTB_j}^{MDCTT_j} |sn(m, n)|^2    i = 0, ..., N    j = 0, 1, ..., 8

where SBE(i, j) is the SBE for the i-th block and j-th sub-band, and MDCTB_j and MDCTT_j are the bounds of the sub-band, listed in the table above. The PMF computed over the SBE data indicates how much a certain band contributes to the overall information of a single block. The formula is:

    P(i, j) = \frac{SBE(i, j)}{\sum_{j=0}^{8} SBE(i, j)}    i = 0, ..., N    j = 0, 1, ..., 8

The classic entropy of the PMF denoted by P(i, j) is now computed; for block i it is:

    H(i) = - \sum_{j=0}^{8} P(i, j) \lg P(i, j)    i = 0, ..., N

In the paper, the robustness of entropy as an information indicator is discussed both from an experimental and a theoretical perspective. The reader is invited to read through this material. The final stage is the calculation of the bit stream:

    S(i) = \begin{cases} 0 & H(i) < H(i+1) \\ 1 & H(i) \ge H(i+1) \end{cases}    i = 0, 1, ..., N-1
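A compact C rendering of the last two steps, entropy of the PMF and bit generation, mirroring the formulas above; the array sizes and names are illustrative, not those of the thesis code.

    #include <math.h>

    /* Entropy (base 2) of the 9-bin PMF of one block. */
    double block_entropy(const double p[9])
    {
        double h = 0.0;
        for (int j = 0; j < 9; j++)
            if (p[j] > 0.0)
                h -= p[j] * log2(p[j]);
        return h;
    }

    /* S[i] = 0 if H(i) < H(i+1), 1 otherwise, for i = 0..N-2. */
    void bitstream(const double *H, int N, unsigned char *S)
    {
        for (int i = 0; i + 1 < N; i++)
            S[i] = (H[i] < H[i + 1]) ? 0 : 1;
    }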

2.4 Backward algorithm

Input to the backward algorithm is a single query bit stream to be matched against a database of known bit streams. Basically, the query is slid over the database bit stream and a Hamming distance is computed at each window; the minimum distance value, divided by the length of the query sample, is returned. This is formalized by the following formula:

    BER(i) = \frac{1}{n} \min_{j} d_H\left( (x_1, x_2, ..., x_n), (x^i_j, x^i_{j+1}, ..., x^i_{j+n-1}) \right)    i = 1, ..., N_{track}    j = 1, 2, ..., N^i - n + 1

where d_H denotes the Hamming distance, (x_1, ..., x_n) is the query bit stream and x^i is the i-th database bit stream, of length N^i; BER(i) is

the Bit Error Rate of the excerpt with respect to the i-th audio track in the database. After the BER is computed for each track in the database, the track yielding the minimum BER is the final result. It is suggested to return a more complete set of results though. The list of tracks is sorted in ascending order with respect to the matching score. Of this sorted list, only the first ten elements are returned. In case the right answer is not present in these ten songs, the query has failed. In all of the test groups, most of the matches are perfect matches: if the audio track is guessed right, then it is the best guess. Finally, the choice of returning not just one result but a richer list can be defended beyond the fact that it strengthens the technique's performance. Even if the tracks in the list almost always do not share relevant perceptual similarities, for example they are not of the same genre, they indeed have some traits in common, since their matching scores are of the same order of magnitude. Because of this, if the query succeeds the user is not only given the correct answer but also a number of other suggestions for tracks of an affine nature in terms of entropy.
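An illustrative C version of this matching step, for one query against one database bit stream, both stored one bit per byte for clarity; this is a sketch, not the thesis program.

    /* Minimum normalized Hamming distance of the n-bit query over all
     * alignments inside the m-bit database stream (m >= n).            */
    double ber(const unsigned char *db, int m, const unsigned char *q, int n)
    {
        int best = n + 1;
        for (int j = 0; j + n <= m; j++) {
            int d = 0;
            for (int i = 0; i < n; i++)
                d += (q[i] != db[j + i]);
            if (d < best)
                best = d;
        }
        return (double)best / n;
    }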

2.5 Pseudocode

The pseudocode programs presented here depict the essential steps illustrated above. The input files are assumed to be stereo (two channels). Some additional logic is required in order to write an implementation of the technique, but it is left out, since a working implementation is part of this thesis work. The only data structure in use is the static array, with indexes starting at 1.

2.5.1 Blocks generation

In the picture of the whole forward algorithm, the time-domain block subdivision procedure can be positioned either as the first step after partial decoding or as an extension to it. If we opt for the former, block generation takes place after decoding has completed, whereas with the latter, which is the choice in the implementation, the blocks get built as the partial decoding proceeds: the operations of frame decoding and block generation are interleaved. currFrameValues[channels][granules][subband][frequency] is the structure containing the data of a frame, with a total of 2 x 2 x 32 x 18 values; its type is in currFrameTypes[channels][granules]. For block values a new index ranging from 1 to 11 is added, so that its form is currBlockValues[number][channels][granules][subband][frequency], and there is a currBlockTypes too. The first time this procedure is called currBlock is empty; as soon as it is filled, setReady raises a flag, to be checked by the caller, telling that the current block is in a consistent state and can therefore be fed to the next step of the forward algorithm. These two pairs of multidimensional arrays are bundled into the two convenience arguments currFrame and currBlock. num, ch, gr are the current block position, channel and granule number.
procedure buildBlock(currFrame, currBlock, num, ch, gr)
    if num < 12 then
        if num = 11 then
            setReady
        end if
        currBlockTypes[num][ch][gr] ← currFrameTypes[ch][gr]
        for i ← 1 to 32 do
            for j ← 1 to 18 do
                value ← currFrameValues[ch][gr][i][j]
                currBlockValues[num][ch][gr][i][j] ← value
            end for
        end for
    else
        for k ← 1 to 10 do
            type ← currBlockTypes[k + 1][ch][gr]
            currBlockTypes[k][ch][gr] ← type
            for i ← 1 to 32 do
                for j ← 1 to 18 do
                    value ← currBlockValues[k + 1][ch][gr][i][j]
                    currBlockValues[k][ch][gr][i][j] ← value
                end for
            end for
        end for
        k ← 11
        currBlockTypes[k][ch][gr] ← currFrameTypes[ch][gr]
        for i ← 1 to 32 do
            for j ← 1 to 18 do
                value ← currFrameValues[ch][gr][i][j]
                currBlockValues[k][ch][gr][i][j] ← value
            end for
        end for
    end if
    return currBlock
end procedure

2.5.2 Frequency rearrangement and sub-band division

Input to this procedure is a completed block, currBlock. The values of the current block are transformed into a new structure, rearBlockValues[number][channel][granule][frequency], where frequency lines are indexed with only one value; 66 frequency lines are formed. Sub-band division is implicitly performed: just the needed number of frequencies is computed, but no sub-band index is considered. The two different cases are clearly formalized (2 corresponds to the short block type). sfreqs is the number of frequency lines per sub-band in each short window.
procedure rearranger(currBlock)
    sfreqs ← 6
    for i ← 1 to 11 do
        for j ← 1 to 2 do
            for k ← 1 to 2 do
                c ← 0
                for l ← 1 to 66 do
                    if currBlockTypes[i][j][k] = 2 then
                        if l ≠ 0 and l ≡ 0 (mod sfreqs) then
                            c ← c + 1
                        end if
                        base ← l mod sfreqs
                        val1 ← currBlockValues[i][j][k][c][base]
                        val2 ← currBlockValues[i][j][k][c][base + sfreqs]
                        val3 ← currBlockValues[i][j][k][c][base + 2·sfreqs]
                        rearBlockValues[i][j][k][l] ← (val1 + val2 + val3) / 3
                    else
                        if l ≠ 0 and l ≡ 0 (mod sfreqs) then
                            c ← c + 1
                        end if
                        val1 ← currBlockValues[i][j][k][c][3l mod 18]
                        val2 ← currBlockValues[i][j][k][c][(3l + 1) mod 18]
                        val3 ← currBlockValues[i][j][k][c][(3l + 2) mod 18]
                        rearBlockValues[i][j][k][l] ← (val1 + val2 + val3) / 3
                    end if
                end for
            end for
        end for
    end for
    return rearBlockValues
end procedure

2.5.3 SBE

The procedure returns a structure, SBEs[channel][sub-band]. getNewBandsBounds returns, for a given sub-band number, its starting position (see 2.3). First, the energy for each sub-band in each granule of the block is computed (first group of loops), then these values are collected channel by channel, sub-band by sub-band.

procedure SBE(rearBlock)
    for i ← 1 to 11 do
        for j ← 1 to 2 do
            for k ← 1 to 2 do
                for l ← 1 to 9 do
                    if l ≠ 9 then
                        bandStart ← getNewBandsBounds(l)
                        bandStop ← getNewBandsBounds(l + 1)
                        for m ← bandStart to bandStop do
                            a ← grSum[i][j][k][l]
                            b ← rearBlockValues[i][j][k][m]
                            grSum[i][j][k][l] ← a + b
                        end for
                    else
                        start ← getNewBandsBounds(l)
                        for m ← start to 66 do
                            a ← grSum[i][j][k][l]
                            b ← rearBlockValues[i][j][k][m]
                            grSum[i][j][k][l] ← a + b
                        end for
                    end if
                end for
            end for
        end for
    end for
    for i ← 1 to 2 do
        for j ← 1 to 9 do
            for l ← 1 to 11 do
                for k ← 1 to 2 do
                    a ← SBEs[i][j]
                    b ← grSum[l][i][k][j]
                    SBEs[i][j] ← a + b
                end for
            end for
        end for
    end for
    return SBEs
end procedure

2.5.4 PMF

The final result is computed in place on the input structure, SBEs, in two steps. In the first loop the denominators are calculated by reading, for a given channel, all the sub-band energies, and in the second loop the final results are written. In case a denominator is zero, the distribution is fixed to uniform.

procedure PMF(SBEs)
    for j ← 1 to 2 do
        for k ← 1 to 9 do
            sums[j] ← sums[j] + SBEs[j][k]
        end for
    end for
    for j ← 1 to 2 do
        for k ← 1 to 9 do
            val ← sums[j]
            if val ≠ 0 then
                SBEs[j][k] ← SBEs[j][k] / val
            else
                SBEs[j][k] ← 1/9
            end if
        end for
    end for
    return SBEs
end procedure

2.5.5 Entropy

PMF is the original SBEs data structure, rewritten in the previous step to contain the discrete distribution.

procedure entropy(PMF)
    for j ← 1 to 2 do
        for k ← 1 to 9 do
            PMF[j][k] ← PMF[j][k] · lg(PMF[j][k])
        end for
    end for
    for j ← 1 to 2 do
        for k ← 1 to 9 do
            H[j] ← H[j] + PMF[j][k]
        end for
        H[j] ← −H[j]
    end for
end procedure

2.5.6 Bit stream output

The bit stream is computed by looking at the sequence of entropy values of the blocks. To evaluate the bit for the i-th block, only the entropy values of the (i-1)-th block are additionally needed, just two numbers, one per channel. The new values, passed in H[channels], are inserted into an auxiliary structure, Hbuf[2][channels], holding the current and previous values. The rule is applied in two different but equivalent ways, for even and odd indexes.

procedure bufbit(H, blockNum)
    for j ← 1 to 2 do
        Hbuf[(blockNum mod 2) + 1][j] ← H[j]
        if blockNum ≠ 0 then
            if blockNum mod 2 = 0 then
                if Hbuf[2][j] < Hbuf[1][j] then
                    bits[j] ← 0
                else
                    bits[j] ← 1
                end if
            else
                if Hbuf[1][j] < Hbuf[2][j] then
                    bits[j] ← 0
                else
                    bits[j] ← 1
                end if
            end if
        end if
    end for
    return bits
end procedure

2.5.7 BER

wholeBits and sampleBits are the database and query bit streams. Sliding along the database stream is done with a base+offset approach. If the distance just computed is smaller than the current one, the current distance is updated.

procedure BER(wholeBits, sampleBits)
    while base ≤ |wholeBits| − |sampleBits| do
        berTemp ← 0
        for i ← 1 to |sampleBits| do
            if sampleBits[i] ≠ wholeBits[base + i] then
                berTemp ← berTemp + 1
            end if
        end for
        base ← base + 1
        if base = 1 then
            berCurr ← berTemp
        else
            if berTemp < berCurr then
                berCurr ← berTemp
            end if
        end if
    end while
    return berCurr / |sampleBits|
end procedure


2.6 Implementation

Initially MathWorks MATLAB was chosen as the implementation environment. Reasons for this choice are a really ergonomic IDE, a programming language easy to take up, a rich set of built-in data structures isolating the programmer from low-level issues, and finally the fact that the natural setting of this thesis work is digital signal processing, and MATLAB offers a dedicated module for it, the Signal Processing Toolbox. The MATLAB development process gave birth to a complete and working implementation of the forward algorithm. Unfortunately, testing of the program was really unsatisfactory, due to unacceptable running times: processing a standard MP3 song on a common desktop took up to one hour to complete. Consequently, I was compelled to look for other solutions. In order to push performance to significantly higher levels I opted for the C programming language and, for maximum portability, I conformed to the standard, avoiding external libraries. Even if the developed code is for the greatest part machine and architecture independent, some tasks, like file management, are unavoidably dependent on the particular operating system of execution. The target machine of preference is a standard UNIX box; nevertheless, the program can be successfully compiled and used on Windows machines too, by installing proper emulation environments like Cygwin. The code is divided into two distinct parts, the forward and backward algorithms. The matching module consists of just one compilation unit. The fingerprinting module breaks down into the effective algorithm and the code needed to decode an MP3 stream. The decoder is a reference implementation by the Fraunhofer Institute. Once the decoder produces the IMDCT coefficients, the decoding process is halted (partial decoding) and the current frame is fed to the forward algorithm. Time performance for fingerprint generation is quite satisfactory: stereo MP3 files of even 15 minutes are processed in no more than five minutes. Things are less rosy for fingerprint matching: matching a single 30-second query against a database of 1000 full-length bit streams is time-consuming, taking even about one hour on a common desktop. Anyway, one must bear in mind that the fingerprint matching process, in the typical scenario of use, is assumed to be performed on powerful computing infrastructures, as it is carried out by the service provider. On the contrary, it is fingerprint production for audio excerpts which requires quickness in order to grant a valuable user experience and, extrapolating from what emerges from the tests, it is likely to take a dozen seconds on smartphones and the like for samples lasting between five and ten seconds. Testing of the algorithm is discussed thoroughly in the next section.

3 Testing

The technique is expected to maintain stability even in the presence of noise: it is clear that if the excerpt is acquired, for example, in a crowded room, the technique has to deal not only with quantization noise but also with over-the-air disturbance, e.g. the voices of the people speaking in the room. On the other hand, it is very important that the technique makes the right guess even if the original track is somewhat distorted in a deliberate way: this is the case in many DJ set performances where echo effects are added or the tracks are played with


noticeable pitch or tempo changes. In [1] resistance to many types of distortion is claimed, and the presented results show very good performance for most of them. In this work a subset of the noise degradations is selected and, for each of them, the stability of the technique is tested at various degrees of intensity; these degradations are additive white noise, echo and pitch shift. For a more complete analysis, the effect of explicitly adding voices to the queries is studied.

3.1 Basic distortions

3.1.1 White noise

A white noise signal exhibits a flat power spectral density. The PSD of a signal gives, for each frequency, the contribution in terms of power (work over time units) per frequency unit, that is, how much energy is conveyed by the signal at a given frequency. The color of this noise comes from the fact that a light stimulus, to be perceived as white, has to carry, in the context of additive synthesis, maximum energy at every frequency. A white noise signal, given any pair of frequency ranges [f1, f2] and [f1 + I, f2 + I] of equal width, contains the same amount of energy in both. Formally, a (strong) white noise signal is a vector whose mean is zero, whose variance is finite and whose values are independent and identically distributed; for example, sampling a continuous uniform distribution over [-1, 1] yields a white vector.

Figure 16: Spectra of vectors obtained by generating pseudo-random numbers in the interval [-1, 1]; thanks to the law of large numbers, as the number of samples grows a better white noise-like signal is observed.

Clearly, as long as the theoretical requirements are matched, an arbitrary underlying distribution will fit. Along with the uniform distribution, the N(0, σ) Gaussian distribution is usually chosen, and if the noise is added to an input signal the disturbance is commonly termed AWGN (additive white Gaussian noise). In a white noise audio signal the samples are amplitudes, and the white noise vector is usually rescaled to run in the allowed range of intensities. The intensity of the disturbance can be calibrated by examining the SNR value (see 3.2).

3.1.2 Echo

An echo effect, also called delay, is usually defined by four parameters:

- time interval between sound emission and its return (delay)
- intensity of the repetition (decay)
- gain factor for the input signal (gain in)
- gain factor for the output signal (gain out)

The echo is commonly used in many electronic music genres; dub for example, originated in Jamaica, finds its peculiarity in ethereal and slow rhythm patterns with echoes as their backbone. Many guitarists make use of delay pedals in their live performances too, seeking polyphonic impressions. When added to an audio signal, echo can be seen as self-generated noise: a sample at time t is played again at time t + k, adding itself to the original sample in that position. The initial samples in the range [t, t + k] are untouched, and if the response of the fingerprinting technique is strong enough here, the audio track is recognized despite the echoes.
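A minimal C sketch of a feed-forward echo built from the four parameters listed above; it is an illustration only, and a real delay unit would usually also feed the output back into the line.

    /* y[t] = gain_out * (gain_in * x[t] + decay * x[t - delay]);
     * no delayed contribution is added before t = delay.          */
    void add_echo(const double *x, double *y, int n,
                  int delay, double decay, double gain_in, double gain_out)
    {
        for (int t = 0; t < n; t++) {
            double dry = gain_in * x[t];
            double wet = (t >= delay) ? decay * x[t - delay] : 0.0;
            y[t] = gain_out * (dry + wet);
        }
    }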

3.1.3 Pitch shift

In color theory a widely used triple of color perception correlates is hue, saturation and brightness, informally representing, for the stimulus, its pure color, how chromatically intense it is and how much light it appears to emit, respectively. In music theory the nature of a sound is discussed in terms of:

- duration: for a generic sound, how long it lasts with respect to a tempo; a modern tempo measure is BPM (beats per minute)
- loudness: informally the volume; correlate of amplitude
- pitch: correlate of frequency; common pitch categories are bass, mid and treble
- timbre: an aggregate attribute grouping a number of sub-attributes independent from the previous three; a notable component is ADSR (attack, decay, sustain, release), the time envelope of a sound

In digital equipment pitch is shifted by positive or negative small increments called cents. An octave is the distance between notes at different pitches: for example, for a 440 Hz note, the note one octave above is at 880 Hz and the note one octave below at 220 Hz; the ratio is constant. The space of an octave can be partitioned into 12 sub-intervals called semitones, of 100 cents each. In this arrangement, shifting the pitch up by 1200 cents means doubling the frequency; shifting up by just one cent means adjusting to the frequency obtained by scaling the starting frequency by the 1200-th root of 2. A cent is the frequency counterpart of the amplitude decibel. Pitch shifting is a useful tool when a recorded voice needs to be made unidentifiable. Pitch controls can also be found in turntable mixers.
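The cent-to-frequency relation above, written as a small illustrative C helper:

    #include <math.h>

    /* Frequency obtained by shifting f up (or down, for negative cents). */
    double shift_cents(double f, double cents)
    {
        return f * pow(2.0, cents / 1200.0);   /* 1200 cents = one octave */
    }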


3.1.4 Voice

Voice, arguably the rst ever instrument man has played, has frequency contents in the r60, 7000s Hz range about. Overlay voices are a really important noise resistance test for two reasons:
- our auditory system is particularly sensitive to voice
- fingerprint techniques often work on samples recorded in public places, where people talking are the main source of disturbance to the signal

White noise and voices are mixed into the input signal by actually adding the samples. Echo and pitch effects are modifications of the input signal itself.

3.2 SNR

The signal-to-noise ratio indicates the level of degradation of the signal, telling how far it stands out with respect to noise. Amplitude SNR in decibels is given by:

$$\mathrm{SNR_{dB}} = 20 \log_{10} \frac{\mathrm{RMSA}_{content}}{\mathrm{RMSA}_{noise}}$$

where $\mathrm{RMSA}_x$ is the root mean square amplitude of the signal $x = (x_1, x_2, \ldots, x_n)$:

$$\mathrm{RMSA}_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}$$

The higher the SNR, the better the communication is expected to be, since the content transmitted over the channel dominates background noise. If the SNR is negative, noise strength is stronger than signal strength and a clear communication is compromised. SNR can be applied to both analog and digital signals, even if for bit streams the $E_b/N_0$ (energy per bit to noise power spectral density ratio) indicator is more appropriate:

$$\mathrm{SNR} = \frac{E_b}{N_0} \cdot \mathrm{LSE}$$

where LSE is the link spectral efficiency of the channel, measured in (bit/s)/Hz. Anyway, since it is a more amenable parameter, we will work with SNR rather than $E_b/N_0$.
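A minimal sketch of these two quantities, plus a helper (not part of the original pipeline, name illustrative) that rescales a noise vector so that mixing it with a given content signal yields a desired SNR, as done in the tests of section 3.4:

import numpy as np

def rmsa(x):
    # Root mean square amplitude of a signal.
    return np.sqrt(np.mean(np.square(np.asarray(x, dtype=float))))

def snr_db(content, noise):
    # Amplitude SNR in decibels, as defined above.
    return 20.0 * np.log10(rmsa(content) / rmsa(noise))

def scale_noise_to_snr(content, noise, target_db):
    # Rescale the noise so that snr_db(content, scaled_noise) equals target_db.
    factor = rmsa(content) / (rmsa(noise) * 10.0 ** (target_db / 20.0))
    return np.asarray(noise, dtype=float) * factor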

3.3 Testing infrastructure

The test database is a collection of 1000 MP3 files. Many musical styles are covered, from rap music to post-techno. Random intervals are extracted from each song; the samples are 5 s long and make up the set of queries to be fingerprinted and matched. Notice that picking sub-songs at random is a stricter test condition than, for example, taking central portions, because the initial and final parts of a song are likely to be less informative. This is especially the case for four-on-the-floor club music, where the tails of the track are often nothing but the drum beat, to allow easier beat-matching by the DJ.


The second step is to generate fingerprints for database songs and queries. Finally, each query is matched against the database, and results are returned. The first type of test tries to match clean samples, the other four test noise-deteriorated samples; the noise is added directly to the already extracted samples. The testing process is highly automated through the use of scripts and external free programs: MP3info, MP3splt and SoX. Reports are text files showing, for each sample, the ten database songs yielding the lowest matching scores.
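The scripts themselves are not reproduced here; the following hypothetical Python sketch only outlines the matching and reporting loop, with fingerprint and match standing in for the forward and backward algorithms of chapter 2 (passed in as callables) and with the match counting described in section 3.4:

def run_tests(database, queries, fingerprint, match, top_k=10,
              report_path="report.txt"):
    # database: list of song identifiers; queries: (sample, true_song) pairs.
    # match returns a score where lower means closer (e.g. a bit error rate).
    db_fingerprints = {song: fingerprint(song) for song in database}
    matches = perfect = 0
    with open(report_path, "w") as report:
        for sample, true_song in queries:
            query_fp = fingerprint(sample)
            ranked = sorted(db_fingerprints,
                            key=lambda s: match(query_fp, db_fingerprints[s]))
            best = ranked[:top_k]                 # top_k lowest matching scores
            matches += true_song in best          # match counted
            perfect += best[0] == true_song       # perfect match counted
            report.write("{}: {}\n".format(sample, best))
    print("hit rate {:.1%}, perfect hit rate {:.1%}".format(
        matches / len(queries), perfect / len(queries)))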

3.4 Results

As said, the testing perspective of the technique is broadened by taking into consideration different noise intensities as well as a novel distortion, voice addition. The technique's response to clean samples is tested again in two initial test cases. The complete list follows:
- without noise addition / clean samples:
  - preliminary test: about 100 songs picked at random
  - differentiation test: about 125 songs by the same author
- with noise addition, each on the whole database of 1000 songs:
  - voice addition
  - white noise addition
  - echo addition
  - pitch shift

Once a query has completed, a list containing the results is returned. If the right song is in this list, a match is counted; if it is the first of the list, a perfect match is counted too. For the first two test cases, results are really positive:
- preliminary test: perfect match hit rate 98%, hit rate 100%
- differentiation test: perfect match hit rate 92%, hit rate 98%

We now proceed to present the noise distortion test cases.

3.4.1 White noise

In the paper, noise addition is tested with an SNR as low as 15 dB. The volume of a signal affects its RMSA, so the choice is to generate white noise at different volume levels. The reference volume is 0 dBFS, the maximum possible digital level on the computer running the tests, which in this test corresponds, at maximum speaker power, to a sound pressure level of about 100 dB. Subsequent volume values are different attenuations of the reference value, obtained by scaling it. In the following, adjusting the white noise volume to vol determines a certain RMSA; SNR is the average SNR between the database tracks and the noise.

1. vol = 1, RMSA = .45, SNR = −9.5 dB
2. vol = .5, RMSA = .23, SNR = −3.7 dB
3. vol = .25, RMSA = .11, SNR = 2.7 dB
4. vol = .125, RMSA = .06, SNR = 8.0 dB
5. vol = .0625, RMSA = .03, SNR = 14 dB
6. vol = .03125, RMSA = .01, SNR = 23 dB

Figure 17: Plot of volume versus hit rate.

3.4.2 Echo

The echo effect is tested with the following delay and decay values (gain is unchanged):
- del = 50 ms, dec = 25%
- del = 100 ms, dec = 50%
- del = 200 ms, dec = 60%
- del = 400 ms, dec = 70%
- del = 800 ms, dec = 80%
- del = 1600 ms, dec = 90%


Figure 18: Plot of the (delay, decay) pairs versus hit rate.

3.4.3 Pitch shift

Shift cents:

- −1000 cents
- −500 cents
- −100 cents
- +100 cents
- +500 cents
- +1000 cents

Figure 19: Plot of pitch shift versus hit rate.

3.4.4 Voice

Just like in the white noise case, the overlaid voice signal is progressively attenuated by scaling it.

Figure 20: Plot of volume versus hit rate.

Conclusions

This thesis work consisted in the study, implementation and testing of the fingerprinting technique illustrated in [1]. The technique employs entropy as the tool to model the fluctuation of information content in the MP3 audio file and generate a compact bit stream. Performance was evaluated with respect to the most relevant noise degradations cited in the research paper. Results were satisfactory and confirmed the robustness and stability of the algorithms claimed by the authors, since hit rate statistics turned out to be low only for degradations of high intensity.

The main application focus of this kind of technique is audio track retrieval, especially in the context of mobile phones. For very large databases, other interesting applications of the system worth citing in these Conclusions are duplicate track detection and track quality sorting. In the first case, since fingerprints can be seen as a way of hash-indexing the audio files, whenever two computed fingerprints are very similar a duplicate track is detected. In the second, given a reference audio track and a number of different versions derived from it, their relative quality can be obtained by looking at the matching scores of these versions.

As said, the concept of entropy from Shannon theory is at the very core of the whole forward algorithm, where the entropy of each 22-granule block is computed and bits are emitted accordingly. A first possible refinement is to approximate the actual entropy of the discrete signal with entropy estimation approaches; the easiest one would be to sub-sample the signal, seeking a compromise between sub-sampling factor and estimation error (a minimal sketch is given below). More refined estimators exist, for example ApEn (approximate entropy), a statistic measuring how regular a time series is. Finally, an analytical comparison study of this technique versus the other known and successful techniques would be of great interest, also because the technique of [1] is relatively simple and fast.
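A rough illustration of the sub-sampling idea (not part of this work; bin count and sub-sampling factor are arbitrary choices):

import numpy as np

def estimated_entropy(signal, n_bins=64, subsample=1):
    # Plug-in (histogram) entropy estimate in bits, optionally computed on a
    # sub-sampled version of the signal to trade accuracy for speed.
    x = np.asarray(signal, dtype=float)[::subsample]
    counts, _ = np.histogram(x, bins=n_bins)
    pmf = counts[counts > 0] / counts.sum()
    return -np.sum(pmf * np.log2(pmf))

# The estimate from one sample out of four is usually close to the full one.
rng = np.random.default_rng(0)
x = rng.normal(size=100000)
print(estimated_entropy(x), estimated_entropy(x, subsample=4))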


References
[1] Wei Li, Yaduo Liu, and Xiangyang Xue. Robust audio identification for MP3 popular music. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '10, pages 627-634, New York, NY, USA, 2010. ACM.

[2] Rassol Raissi. The theory behind MP3, 2002.

[3] Z.N. Li and M.S. Drew. Fundamentals of Multimedia. Pearson Prentice Hall, 2004.

[4] ISO/IEC. ISO/IEC 11172-3:1993 - Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s - Part 3: Audio, 1993.

[5] ISO/IEC. ISO/IEC 13818-3:1995 - Information technology - Generic coding of moving pictures and associated audio information - Part 3: Audio, 1995.

[6] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing. Addison-Wesley Longman Publishing Co., Inc., 2001.

[7] Jaap Haitsma and Ton Kalker. A highly robust audio fingerprinting system. In ISMIR, pages 107-115, 2002.

[8] Avery L. Wang. An Industrial-Strength Audio Search Algorithm. In ISMIR 2003, 4th Symposium Conference on Music Information Retrieval, pages 7-13, 2003.

[9] Shumeet Baluja and Michele Covell. Content fingerprinting using wavelets. In Proc. CVMP, 2006.

