
Time-Frequency Analysis for Music Signal Analysis [Tutorial]

r99943114

Abstract
I. Introduction
II. Time-Frequency Analysis and Classic Fourier Transform
III. Basic Concepts about Music
    i. Musical Pitch
    ii. Harmony
    iii. Tempo, Beat and Rhythm
IV. Time-Frequency Analysis and Musical Signal
    i. Short-Time Fourier Transform and Gabor Transform
    ii. Wigner Distribution Function
V. Time-Frequency Representation
    i. Log-Frequency Spectrogram
    ii. Time-Chroma Representation
VI. Other Applications on Musical Signals
    i. Onset Detection and Novelty Curve
    ii. Periodicity Analysis and Tempo Estimation
    iii. Harmonic Pitch Class Profiles
    iv. Modified HHT for Detecting Fundamental Frequency
VII. Conclusion
VIII. Reference

Abstract
Time-frequency analysis is an efficient tool for analyzing signals. It extends the classic Fourier approach. This tutorial introduces several kinds of time-frequency analysis and applies them to musical signals. Among the many time-frequency methods are the short-time Fourier transform (STFT), the Gabor transform, and the Wigner distribution function (WDF). They are employed here to analyze music played on instruments such as the piano, flute, and guitar. Musical sound is more complicated than the human voice: it occupies a wider frequency band and is produced in many different ways. Most importantly, music signals are typical examples of time-varying signals, so the classic Fourier transform is not sufficient to analyze them. Time-frequency analysis lets us see how the frequency content varies with time.

I.

Introduction
In this tutorial, I will first explain why time-frequency analysis suits music signals and how it differs from the classic Fourier transform (Section II). Section III introduces some basic music theory. Several kinds of time-frequency analysis are introduced and implemented in Section IV. Section V shows two kinds of time-frequency representation for musical signals. Section VI covers some further, more advanced analyses of musical signals; for example, the harmonic pitch class profile (HPCP, or chroma) is an advanced application of time-frequency analysis in which frequency is mapped onto 12 pitch classes, so we can follow the change of pitch class over time. Finally, the conclusion is in Section VII and the references are in Section VIII.

II. Time-Frequency Analysis and Classic Fourier Transform


In the past, to get the spectrum of a continuous signal s(t), we used the classic Fourier transform, computed by

S(f) = ∫ s(t) e^(−j2πft) dt

The spectrum shows the magnitude at each frequency, which helps a lot in research. For example, consider a sinusoidal signal whose frequency is 440 Hz. Applying the Fourier transform to the signal in Figure 1(a) gives the result in Figure 1(b): there is a peak at 440 Hz.

Figure 1. Fourier transform of a sinusoid signal. (a) The sinusoid signal at 440 Hz. (b) The Fourier spectrum of (a).

Similarly, the Fourier transform of a sum of sinusoidal components shows a peak at each component's frequency. However, this representation cannot give any information about the localization of the sinusoids in time: we don't know when each sinusoid appears in the signal. Time-frequency analysis solves this problem. Let's take a typical example from class: x(t) = cos(πt) for t < 10, x(t) = cos(3πt) for 10 ≤ t < 20, and x(t) = cos(2πt) for t ≥ 20.

Figure 2. (a) The signal and its classic Fourier transform. (b) The time-frequency analysis of the signal.

In Figure 2 we can see the most important difference: time-frequency analysis retains the time information of the signal. From Figure 2(b) we know that cos(πt) appears during 0-10 s, cos(3πt) during 10-20 s, and cos(2πt) during 20-30 s. That is why we need time-frequency analysis. Except for fast convolution, all the uses of the classic Fourier transform can be taken over by time-frequency analysis.
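As an illustration, here is a minimal Python sketch (not the author's Matlab code) of this example: the classic Fourier transform yields three peaks with no time localization, while the STFT reveals when each component is active. The sampling rate and window length are illustrative choices.

```python
import numpy as np
from scipy import signal

fs = 100                                   # sampling rate in Hz (assumed)
t = np.arange(0, 30, 1 / fs)
x = np.where(t < 10, np.cos(np.pi * t),
    np.where(t < 20, np.cos(3 * np.pi * t), np.cos(2 * np.pi * t)))

# Classic Fourier transform: peaks at 0.5, 1.5 and 1.0 Hz, but no timing.
spectrum = np.abs(np.fft.rfft(x))

# Short-time Fourier transform: a 5.12-s window localizes each component.
f, tau, Zxx = signal.stft(x, fs=fs, nperseg=512)
# |Zxx| is large near 0.5 Hz for tau in [0,10), near 1.5 Hz in [10,20),
# and near 1.0 Hz in [20,30).
```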

III. Basic Concepts about Music


This section introduces some background knowledge about music. Music is sound that holds stable frequencies over short time periods, and it can be produced in several ways: the sound of a piano is produced by striking strings, while the sound of a violin is produced by bowing (friction). All musical sounds have a fundamental frequency and overtones. The fundamental frequency is the lowest frequency in the harmonic series; for a periodic signal, it is the inverse of the period length. The overtones are integer multiples of the fundamental frequency. We start this section with an introduction to pitch.

A. Musical Pitch
Most musical instruments (including string-based instruments such as guitars, violins, and pianos, as well as instruments based on vibrating air columns such as flutes, clarinets, and trumpets) are explicitly constructed to allow performers to produce sounds with easily controlled, locally stable fundamental periods. Such a signal is well described as a harmonic series of sinusoids at multiples of a fundamental frequency, and results in the percept of a musical note at a clearly defined pitch in the mind of the listener. With the exception of unpitched instruments like drums, and a few inharmonic instruments such as bells, the periodicity of individual musical notes is rarely ambiguous, and thus equating the perceived pitch with fundamental frequency is common. Music exists for the pleasure of human listeners, and thus its features reflect specific aspects of human auditory perception. In particular, humans perceive two signals whose fundamental frequencies fall in a ratio 2:1 (which is called an octave) as highly similar (sometimes known as octave equivalence). A sequence of notes (a melody) performed at pitches exactly one octave displaced from the original will be perceived as largely musically equivalent. We note that the sinusoidal harmonics of a fundamental f0, at frequencies f0, 2f0, 3f0, 4f0, ..., are a proper superset of the harmonics of a note with fundamental 2f0 (i.e., 2f0, 4f0, 6f0, ...).

Fig. 3. Middle C (262 Hz) played on a piano and a violin. The top pane shows the waveform, with the spectrogram below. Zoomed-in regions shown above the waveform reveal the 3.8-ms fundamental period of both notes.

Fig. 3 shows the waveforms and spectrograms of middle C (fundamental frequency 262 Hz) played on a piano and a violin. Zoomed-in views above the waveforms show the relatively stationary waveform with a 3.8-ms period in both cases. The spectrograms (calculated with a 46-ms window) show the harmonic series at integer multiples of the fundamental. Obvious differences between the piano and violin sounds include the decaying energy within the piano note and the slight frequency modulation (vibrato) on the violin. Although different cultures have developed different musical conventions, a common feature is the musical scale, a set of discrete pitches that repeats every octave, from which melodies are constructed. For example, contemporary western music is based on the equal-tempered scale, which divides the octave into twelve equal steps on a logarithmic axis while still (almost) preserving the intervals corresponding to the most pleasant note combinations. The equal division makes each frequency 2^(1/12) ≈ 1.06 times larger than its predecessor, and this interval is known as a semitone. There are twelve semitones in an octave, as shown in Figure 4. For example, if the frequency of A in one octave is 440 Hz, the A one octave higher is 880 Hz.

Figure 4. The twelve pitch classes of an octave.

It is something of a coincidence that it is even possible to divide the octave uniformly into such a small number of steps and still have these steps give close, if not exact, matches to the simple integer ratios that result in consonant harmonies, e.g., 2^(7/12) ≈ 1.498 ≈ 3/2. The western major scale spans the octave using seven of the twelve steps (the "white notes" on a piano, denoted by C, D, E, F, G, A, B). The spacing between successive notes is two semitones, except for E/F and B/C, which are only one semitone apart. The black notes in between are named in reference to the note immediately below (e.g., C#) or above (e.g., Db), depending on musicological conventions. The octave degree denoted by these symbols is sometimes known as the pitch's chroma, and a particular pitch can be specified by the concatenation of a chroma and an octave number (where each numbered octave spans C to B). The lowest note on a piano is A0 (27.5 Hz), the highest note is C8 (4186 Hz), and middle C (262 Hz) is C4.

Fig. 5. Middle C, followed by the E and G above, then all three notes together (a C Major triad) played on a piano. The top pane shows the spectrogram; the bottom pane shows the chroma representation.

B. Harmony
While sequences of pitches create melodies (the only part reproducible by a monophonic instrument such as the voice), another essential aspect of much music is harmony, the simultaneous presentation of notes at different pitches. Different combinations of notes result in different chords, which remain recognizable regardless of the instrument used to play them. Consonant harmonies tend to involve pitches with simple frequency ratios, indicating many shared harmonics. Fig. 5 shows middle C (262 Hz), E (330 Hz), and G (392 Hz) played on a piano; these three notes together form a C Major triad, a common harmonic unit in western music. The ubiquity of simultaneous pitches, with coincident or near-coincident harmonics, is a major challenge in the automatic analysis of music audio.

C. Tempo, Beat and Rhythm


The musical aspects of tempo, beat, and rhythm play a fundamental role in the understanding of music. The beat is the steady pulse that drives music forward and provides the temporal framework of a piece. Intuitively, the beat can be described as a sequence of perceived pulses that are regularly spaced in time and correspond to the pulse a human taps along to when listening to the music. The term tempo then refers to the rate of this pulse. Musical pulses typically coincide with note onsets or percussive events. Locating such events within a given signal constitutes a fundamental task, often referred to as onset detection; this is introduced more comprehensively in Section VI.

IV. Time-Frequency Analysis and Musical Signal


Figure 6 shows several kinds of time-frequency analysis. This section introduces three of these methods and the results of implementing them on musical signals.

Fig. 6. Time-frequency analysis methods



A. Short-Time Fourier Transform and Gabor Transform


The short-time Fourier transform is a basic type of time-frequency analysis. For a continuous signal x(t), it is computed by

X(t, f) = ∫ x(τ) w(τ − t) e^(−j2πfτ) dτ

where w(t) is a mask (window) function. When w(t) is a rectangular function, the transform is called the rec-STFT; when w(t) is a Gaussian function, the transform is called the Gabor transform. However, a recorded music signal is not continuous; it is sampled at some sampling frequency, so we cannot use the continuous form to compute the rec-STFT directly. We therefore change the original form to the discrete form

X(nΔt, mΔf) = Σ_{p = n−Q}^{n+Q} x(pΔt) e^(−j2πpm/N) Δt

where t = nΔt, f = mΔf, τ = pΔt, and the window half-width is B = QΔt. The discrete form of the short-time Fourier transform imposes some constraints: first, Δt·Δf = 1/N, where N is an integer; second, N ≥ 2Q + 1; third, Δt < 1/(2·fmax), where fmax is the highest frequency of the signal.
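A direct implementation of this discrete rec-STFT might look as follows (a sketch; the brute-force loop favors clarity over speed, and in practice one would use FFT-based methods).

```python
import numpy as np

def rec_stft(x, dt, Q, N):
    """Discrete rec-STFT of samples x[p] = x(p*dt).

    Window half-width B = Q*dt; frequency step df = 1/(N*dt); requires
    N >= 2*Q + 1 and dt < 1/(2*fmax), as stated in the text.
    """
    L = len(x)
    xp = np.concatenate([np.zeros(Q), x, np.zeros(Q)])  # zero-pad the ends
    m = np.arange(N // 2)                               # frequency bins
    X = np.empty((L, N // 2), dtype=complex)
    for n in range(L):
        p = np.arange(n - Q, n + Q + 1)                 # window sample indices
        seg = xp[n : n + 2 * Q + 1]                     # x((n-Q)dt)..x((n+Q)dt)
        X[n] = (seg[:, None] *
                np.exp(-2j * np.pi * np.outer(p, m) / N)).sum(axis=0) * dt
    return X                                            # X[n, m] = X(n*dt, m*df)
```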

Fig. 7. The waveform of a drum.



Take a drum for example. Figure 7 shows the waveform of a drum. The signal is 0.05 seconds long and the sampling frequency is 44100 Hz. The analysis was implemented in Matlab with window widths of 0.005 s and 0.002 s, over the frequency band 0-5000 Hz. The results are shown in Figure 8.

Figure 8. (a) Rec-STFT of a drum with window width B = 0.005 s. (b) Rec-STFT of the same drum with B = 0.002 s. The vertical axis is frequency (Hz) and the horizontal axis is time (s).

As you can see, the fundamental frequency of the drum is about 2000 Hz, and there is an overtone at 4000 Hz. Notice also that when B = 0.005 s the white line is clearer, whereas when B = 0.002 s the white line is rough and the resolution is poor. The width of the window is therefore an important factor. Another example, on piano, is shown in Fig. 9.

Figure 9. Waveform and spectrum analysis of a piano recording.


Figure 9 shows the waveform of a piano recording and the spectrum of the piano notes. The fundamental frequency is about 440 Hz, and there are several harmonic overtones at higher frequencies. Figure 10 shows the spectrogram of the piano notes. The spectrogram is the squared magnitude of the STFT, so its interpretation is the same as that of the STFT. It is computed by

Spec(t, f) = |X(t, f)|^2

Figure 10. The spectrogram of piano notes.

Figure 11. (a) The waveform of the piano. (b) The STFT of the piano waveform.

Different window functions give different short-time Fourier transforms. Besides the rectangular and Gaussian functions, there are also the triangular function, the Hann function, the Hamming function, and any other window you can imagine. Compared with the other choices, the Gaussian function gives the best resolution, because a Gaussian is an eigenfunction of the Fourier transform; it can therefore achieve good resolution in both the time domain and the frequency domain.

B. Wigner Distribution Function


The Wigner distribution function is also a useful tool for analyzing signals. It is computed by

W_x(t, f) = ∫ x(t + τ/2) x*(t − τ/2) e^(−j2πfτ) dτ

where x(t) is the signal and x* is its complex conjugate. The advantage of the Wigner distribution function (WDF) is its high clarity; however, it has a high computational cost and suffers from the cross-term problem. Fig. 12 shows a comparison between the Gabor transform and the WDF.
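A minimal discrete sketch of the definition above: for each time index, Fourier-transform the instantaneous autocorrelation over the lag variable. The frequency grid stated in the comment follows from the lag substitution and is an assumption of this particular discretization.

```python
import numpy as np

def wdf(x, dt):
    """Discrete Wigner distribution of samples x[n] = x(n*dt)."""
    L = len(x)
    W = np.zeros((L, L))
    for n in range(L):
        pmax = min(n, L - 1 - n)                 # largest symmetric lag at n
        p = np.arange(-pmax, pmax + 1)
        r = np.zeros(L, dtype=complex)
        r[p % L] = x[n + p] * np.conj(x[n - p])  # x(t+tau/2) x*(t-tau/2)
        W[n] = 2 * dt * np.real(np.fft.fft(r))   # real-valued by symmetry
    return W    # W[n, m]: time n*dt, frequency m/(2*L*dt) (assumed grid)
```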

Figure 12. Comparison of the WDF and the Gabor transform.



V. Time-Frequency Representation
Although the spectrogram is profoundly useful, it has one drawback: it displays frequencies on a uniform scale, whereas musical scales are based on a logarithmic frequency scale. Below we describe how such a logarithmic scale relates to human hearing and how it leads to new types of time-frequency analysis. We introduce two kinds of representation.

A. Log-Frequency Spectrogram
As mentioned above, our perception of music defines a logarithmic frequency scale, with each doubling in frequency (an octave) corresponding to an equal musical interval. This motivates the use of time-frequency representations with a similar logarithmic frequency axis, which in fact correspond more closely to the representation in the ear. (Because the bandwidth of each bin varies in proportion to its center frequency, these representations are also known as constant-Q transforms, since each filter's effective ratio of center frequency to bandwidth, its Q, is the same.) The constant-Q transform is a type of time-frequency analysis that developed from the short-time Fourier transform and transforms a data series to the frequency domain. It is computed by

X_cq(k) = (1/N(k)) Σ_{n=0}^{N(k)−1} W(k, n) x(n) e^(−j2πQn/N(k))

where N(k) = Q·(fs/fk) is the window length for bin k, W(k, n) = α − (1 − α) cos(2πn/N(k)) is the window function, fs is the sampling rate, Q = fk/Δfk is the quality factor, fk is the center frequency of bin k, and α is a number between zero and one. With, for instance, 12 frequency bins per octave, the result is a representation with one bin per semitone of the equal-tempered scale.

A simple way to achieve this is as a mapping applied to an STFT representation. Each bin in the log-frequency spectrogram is formed as a linear weighting of corresponding frequency bins from the original spectrogram. For a log-frequency axis with KL bins, this calculation can be expressed in matrix notation as Y = MX, where Y is the log-frequency spectrogram with KL rows and T columns, X is the original STFT magnitude array |X(t, k)| (with t indexing columns and k indexing rows), and M is a weighting matrix with KL rows, each of K + 1 columns, that gives the weight with which STFT bin X(·, k) contributes to log-frequency bin Y(·, l). For instance, using a Gaussian window,

M(l, k) = exp( −(1/2) · ((log2(fk/fmin) − l/N0) / B)^2 )

where B defines the bandwidth of the filter bank as the frequency difference (in octaves) at which the bin has fallen to exp(−1/2) of its peak gain, fmin is the frequency of the lowest bin (l = 0), and N0 is the number of bins per octave on the log-frequency axis. The calculation is illustrated in Fig. 13, where the top-left image is the matrix M, the top right is the conventional spectrogram X, and the bottom right shows the resulting log-frequency spectrogram Y.
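The mapping can be sketched in a few lines (illustrative parameter choices; fmin, the number of octaves, N0, and B are free parameters here).

```python
import numpy as np

def logfreq_matrix(K, fs, fmin=110.0, n_octaves=5, N0=12, B=1.0 / 12):
    """Weighting matrix M (KL x K+1) with the Gaussian profile given above."""
    fk = np.arange(K + 1) * fs / (2 * K)        # STFT bin center frequencies
    l = np.arange(n_octaves * N0)               # log-frequency bin indices
    # distance in octaves between STFT bin k and log-frequency center l/N0
    d = np.log2(np.maximum(fk, 1e-12) / fmin)[None, :] - (l / N0)[:, None]
    M = np.exp(-0.5 * (d / B) ** 2)             # gain exp(-1/2) at B octaves
    M[:, 0] = 0.0                               # exclude the DC bin
    return M

# Usage: Y = logfreq_matrix(K, fs) @ np.abs(X)  # X: (K+1) x T STFT magnitudes
```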

Figure 13. Calculation of a log-frequency spectrogram as a column-wise linear mapping of bins from a conventional (linear-frequency) spectrogram.

Drawback of Log-Frequency Spectrogram


Although conceptually simple, such a mapping often gives unsatisfactory results: in the figure, the logarithmic frequency axis uses one bin per semitone, starting at fmin = 110 Hz (A2). At this point, the log-frequency bins have centers only 6.5 Hz apart; to have these centered on distinct STFT bins would require a window of 153 ms, or almost 7000 points at a 44.1-kHz sampling rate. Using a 64-ms window, as in the figure, causes blurring of the low-frequency bins. The long time window required to achieve semitone resolution at low frequencies has serious implications for the temporal resolution of the analysis. Since human perception of rhythm can often discriminate changes of 10 ms or less, an analysis window of 100 ms or more can lose important temporal structure. One popular alternative to a single STFT analysis is to construct a bank of individual band-pass filters, for instance one per semitone, each tuned to the appropriate bandwidth and with minimal temporal support. Although this loses the famed computational efficiency of the fast Fourier transform, some of this may be regained by processing the highest octave with an STFT-based method, downsampling by a factor of 2, then repeating for as many octaves as are desired. However, this results in different sampling rates for each octave of the analysis, raising further computational issues.

B. Time-Chroma Representation
Some applications are primarily concerned with the chroma of the notes present, but less with the octave. Foremost among these is chord transcription, the annotation of the current chord as it changes through a song. Chords are a joint property of all the notes sounding at or near a particular point in time, for instance the C Major chord of Fig. 5, which is the unambiguous label of the three notes C, E, and G. Chords are generally defined by three or four notes, but the precise octave in which those notes occur is of secondary importance. Thus, for chord recognition, a representation that describes the chroma present but folds the octaves together seems ideal. A typical chroma representation consists of a 12-bin vector for each time step, one bin for each chroma class from C to B. Given a log-frequency spectrogram with semitone resolution from the preceding section, one way to create chroma vectors is simply to add together all the bins corresponding to each distinct chroma, as sketched below. More involved approaches may include efforts to include energy only from strong sinusoidal components in the audio, and to exclude non-tonal energy such as percussion and other noise.
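A folding sketch, under the assumption that bin 0 of the log-frequency spectrogram Y is an A (as with fmin = 110 Hz above); the reference chroma is a parameter.

```python
import numpy as np

def chroma_fold(Y, first_bin_chroma=9):
    """Fold a semitone-resolution log-frequency spectrogram (KL x T) into a
    12 x T chroma-gram; chroma indices 0..11 = C..B, so A = 9."""
    C = np.zeros((12, Y.shape[1]))
    for l in range(Y.shape[0]):
        C[(first_bin_chroma + l) % 12] += Y[l]   # octave equivalence
    return C / (C.max() + 1e-12)                 # normalize for display
```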

Figure 14. Three representations of a chromatic scale comprising every note on the piano from lowest to highest. Top pane: conventional spectrogram (93-ms window). Middle pane: log-frequency spectrogram (186-ms window). Bottom pane: chroma-gram (based on the 186-ms window).

Fig. 14 shows a chromatic scale consisting of all 88 piano keys played one per second in an ascending sequence. The top pane shows the conventional, linear-frequency spectrogram, and the middle pane shows a log-frequency spectrogram calculated as in Fig. 13. Notice how the constant ratio between the fundamental frequencies of successive notes appears as an exponential growth on a linear axis, but becomes a straight line on a logarithmic axis. The bottom pane shows a 12-bin chroma representation (a chroma-gram) of the same data.

Drawback of the Time-Chroma Representation


Even though there is only one note sounding at each time, notice that very few notes result in a chroma vector with energy in only a single bin. This is because, although the fundamental is mapped neatly into the appropriate chroma bin, as are the harmonics at 2f0, 4f0, 8f0, etc. (all related to the fundamental by octaves), the other harmonics map onto other chroma bins. The harmonic at 3f0, for instance, corresponds to an octave plus seven semitones (2^((12+7)/12) ≈ 3); thus, for the C4 sounding at 40 s, the second most intense chroma bin after C is the G seven steps higher. Other harmonics fall in other bins, giving the more complex pattern. Many musical notes have the highest energy in the fundamental, and even with a weak fundamental, the root chroma is the bin into which the greatest number of low-order harmonics fall; but for a note with energy spread across a large number of harmonics (such as the lowest notes in the figure) the chroma vector can become quite cluttered. One might think that attenuating higher harmonics would give better chroma representations by reducing these alias terms. In fact, many applications are improved by whitening the spectrum, i.e., boosting weaker bands to make the energy approximately constant across the spectrum. This helps remove differences arising from the different spectral balance of different musical instruments, and hence better represents the tonal content. Chroma representations may use more than 12 bins per octave to reflect finer pitch variations, while still retaining the property of combining energy from frequencies separated by an octave. To obtain robustness against global mistunings, practical chroma analyses need to employ some kind of adaptive tuning, for instance by building a histogram of the differences between the frequencies of all strong harmonics and the nearest quantized semitone frequency, then shifting the semitone grid to match the peak of this histogram. It is, however, useful to limit the range of frequencies over which chroma is calculated. Human pitch perception is most strongly influenced by harmonics that occur in a dominance region between about 400 and 2000 Hz. Thus, after whitening, the harmonics can be shaped by a smooth, tapered frequency window to favor this range.

VI. Other Applications on Musical Signals

A. Onset Detection and Novelty Curve


The objective of onset detection is to determine the physical starting times of notes or other musical events as they occur in a music recording. The general idea is to capture sudden changes in the music signal, which are typically caused by the onset of novel events. The result is a so-called novelty curve, whose peaks indicate onset candidates. For example, playing a note on a percussive instrument typically results in a sudden increase of the signal's energy, see Fig. 15(a). With such a pronounced attack phase, note onset candidates may be determined by locating the time positions where the signal's amplitude envelope starts to increase. Much more challenging, however, is the detection of onsets in non-percussive music, where one often has to deal with soft onsets or blurred note transitions. This is often the case for vocal music or classical music dominated by string instruments.

Figure 15. Waveform of the beginning of "Another One Bites the Dust" by Queen. (a) Note onsets. (b) Beat positions.

Furthermore, in complex polyphonic mixtures, simultaneously occurring events may cause masking effects, which make it hard to detect individual onsets. As a consequence, more refined methods have to be used for computing the novelty curves, e.g., by analyzing the signal's spectral content, pitch, harmony, or phase. To handle the variety of signal types, a combination of novelty curves designed for particular classes of instruments can improve the detection accuracy. To illustrate some of these ideas, we now describe a typical spectral-based approach for computing novelty curves. Given a music recording, a short-time Fourier transform is used to obtain a spectrogram X = (X(t, k)) with k ∈ [0 : K] and t ∈ [0 : T − 1]. Note that the Fourier coefficients of X are linearly spaced on the frequency axis. Using suitable binning strategies, various approaches switch over to a logarithmically spaced frequency axis. Keeping the linear frequency axis puts greater emphasis on the high-frequency regions of the signal, thus accentuating the aforementioned noise bursts visible as high-frequency content. One simple yet important step, often applied in the processing of music signals, is logarithmic compression: a logarithm is applied to the magnitude spectrogram |X| of the signal, yielding Y = log(1 + C|X|) for a suitable constant C > 1. Such a compression step not only accounts for the logarithmic sensation of human sound intensity, but also balances out the dynamic range of the signal. In particular, by increasing C, low-intensity values in the high-frequency spectrum become more prominent. This effect is clearly visible in Fig. 16.

Figure 16. (a) Score representation. (b) Magnitude spectrogram. (c) Compressed spectrogram using C = 1000. (d) Novelty curve derived from (b). (e) Novelty curve derived from (c).

To obtain a novelty curve, one basically computes the discrete derivative of the compressed spectrum Y. More precisely, one sums up only the positive intensity changes, emphasizing onsets while discarding offsets, to obtain the novelty function

Δ(t) = Σ_{k=0}^{K} |Y(t + 1, k) − Y(t, k)|_{≥0}

where |a|_{≥0} = a if a ≥ 0 and |a|_{≥0} = 0 otherwise.
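The whole chain (STFT, logarithmic compression, half-wave-rectified derivative) fits in a few lines; the window and hop sizes below are illustrative choices.

```python
import numpy as np
from scipy import signal

def novelty_curve(x, fs, C=1000.0, nperseg=1024, hop=512):
    f, t, X = signal.stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    Y = np.log(1 + C * np.abs(X))        # logarithmic compression
    d = np.diff(Y, axis=1)               # discrete derivative over time
    d[d < 0] = 0                         # keep positive changes (onsets) only
    nov = d.sum(axis=0)                  # sum over frequency bins
    return t[1:], nov
```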

Fig. 16(e) shows a typical novelty curve for our Shostakovich example. As mentioned above, one often processes the spectrum in a band-wise fashion, obtaining a novelty curve for each band separately. These novelty curves are then weighted and summed to yield the final novelty function. The peaks of the novelty curve typically indicate the positions of note onsets. To determine these positions explicitly, one employs peak-picking strategies based on fixed or adaptive thresholds. In the case of noisy novelty curves with many spurious peaks, however, this is a fragile and error-prone step, and selecting the relevant peaks that correspond to true note onsets becomes a difficult or even infeasible problem. For example, in the Shostakovich waltz, the first beats (downbeats) of the 3/4 meter are played softly by non-percussive instruments, leading to relatively weak and blurred onsets, whereas the second and third beats are played staccato, supported by percussive instruments. As a result, the peaks of the novelty curve corresponding to downbeats are hardly visible or even missing, whereas peaks corresponding to the percussive beats are much more pronounced; see Fig. 16(e).

B. Periodicity Analysis and Tempo Estimation


Generally speaking, this analysis can be done with three different methods. The autocorrelation method detects periodic self-similarities by comparing a novelty curve with time-shifted (localized) copies of itself. Another widely used method is based on a bank of comb-filter resonators, where a novelty curve is compared with templates consisting of equally spaced spikes covering a range of periods and phases. Third, the short-time Fourier transform can be used to derive a time-frequency representation of the novelty curve; here, the novelty curve is compared with templates consisting of sinusoidal kernels, each representing a specific frequency. Each of these methods reveals periodicity properties of the underlying novelty curve, from which one can estimate the tempo or beat structure.

Figure 17. Excerpt of Shostakovich's Waltz No. 2. (a) Fourier tempo-gram. (b) Autocorrelation tempo-gram.

For example, suppose that a music signal has a dominant tempo of τ = 220 BPM (beats per minute) around position t; then the corresponding tempo-gram value T(t, τ) is large, as in Fig. 17. In practice, one often has to deal with tempo ambiguities, where a tempo τ is confused with its integer multiples 2τ, 3τ, ... (referred to as harmonics of τ) and integer fractions τ/2, τ/3, ... (referred to as sub-harmonics of τ). To avoid such ambiguities, a mid-level tempo representation referred to as the cyclic tempo-gram can be constructed, in which tempi differing by a power of two are identified. A tempo-gram can be obtained by analyzing a novelty curve with respect to local periodic patterns using a short-time Fourier transform. To this end, one fixes a window function W of finite length centered at t = 0. Then, for a frequency parameter w, the complex Fourier coefficient F(t, w) is defined by

F(t, w) = Σ_n Δ(n) W(n − t) e^(−j2πwn)

Note that the frequency parameter w (measured in Hertz) corresponds to the tempo parameter τ = 60w (measured in BPM). Therefore, one obtains a discrete Fourier tempo-gram by

T_F(t, τ) = |F(t, τ/60)|
As an example, Fig. 17(a) shows the tempo-gram of our Shostakovich example from Fig. 16. Note that T_F reveals a slightly increasing tempo over time, starting at roughly τ = 225 BPM. T_F also reveals the second tempo harmonic, starting at τ = 450 BPM. Actually, since the novelty curve locally behaves like a track of positive clicks, it is not hard to see that Fourier analysis responds to harmonics but tends to suppress sub-harmonics. Next we introduce autocorrelation-based methods. To obtain a discrete autocorrelation tempo-gram, one again fixes a window function W of finite length centered at t = 0. The local autocorrelation is then computed by comparing the windowed novelty curve with time-shifted copies of itself. Here, we use the unbiased local autocorrelation

A(t, l) = ( Σ_n Δ(n) W(n − t) Δ(n + l) W(n − t + l) ) / (M − l)

where l is the lag and M is the window length in samples. Now, to convert the lag parameter into a tempo parameter, one needs to know the sampling rate. Supposing that each time step corresponds to r seconds, the lag l corresponds to the tempo τ = 60/(r·l) BPM. From this, one obtains the autocorrelation tempo-gram T_A by

T_A(t, τ) = A(t, 60/(r·τ))
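Both tempo-grams can be sketched directly from these formulas (a brute-force version with an assumed tempo grid and Hann window; real implementations vectorize this).

```python
import numpy as np

def fourier_tempogram(nov, r, bpms=np.arange(30, 601), win=512):
    """T_F[i, t] = |F(t, w)| with w = bpms[i]/60; novelty step r seconds."""
    h = np.hanning(win)
    n = np.arange(win)
    T = np.empty((len(bpms), len(nov) - win))
    for i, bpm in enumerate(bpms):
        kernel = h * np.exp(-2j * np.pi * (bpm / 60.0) * n * r)
        for t in range(len(nov) - win):
            T[i, t] = np.abs(np.dot(nov[t:t + win], kernel))
    return T

def autocorr_tempogram(nov, r, win=512):
    """A[l-1, t]: unbiased local autocorrelation; lag l ~ tempo 60/(r*l)."""
    A = np.empty((win // 2, len(nov) - win))
    for t in range(len(nov) - win):
        seg = nov[t:t + win] * np.hanning(win)
        for l in range(1, win // 2 + 1):
            A[l - 1, t] = np.dot(seg[:-l], seg[l:]) / (win - l)
    return A   # resample the lag axis onto a BPM grid to compare with T_F
```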

Finally, using standard resampling and interpolation techniques applied to the tempo domain, one can derive an autocorrelation tempo-gram T_A that is defined on the same tempo set as the Fourier tempo-gram T_F. The tempo-gram T_A for our Shostakovich example is shown in Fig. 17(b). It clearly indicates the sub-harmonics: the parameter τ = 75 is the third sub-harmonic of τ = 225 and corresponds to the tempo on the measure level.

Figure 18. Excerpt of the Mazurka Op. 30 No. 2. (a) Score. (b) Fourier tempo-gram with reference tempo. (c) Beat positions.

Assuming a more or less steady tempo, most tempo estimation approaches determine only one global tempo value for the entire recording; such a value may be obtained, for example, by averaging the tempo values from a frame-wise periodicity analysis. For music with significant tempo changes, the task of local tempo estimation becomes much more difficult; see Fig. 18 for a complex example. Having computed a tempo-gram, the frame-wise maximum yields a good indicator of the locally dominating tempo. However, one often has to contend with confusions between tempo harmonics and sub-harmonics, which can be reduced by the combined use of Fourier and autocorrelation tempo-grams.

C. Harmonic Pitch Class Profiles (HPCP)


If we want to detect pitch correctly, we have to extract a feature that exposes the pitch clearly. The tool is the harmonic pitch class profile (HPCP), an enhanced pitch distribution feature also called chroma. We can process musical signals to obtain the HPCP feature and then use the feature to measure similarity. We focus here on how to obtain the HPCP feature, because the process is closely related to time-frequency analysis.

Figure 19. General HPCP feature extraction block diagram. Music signals are converted to a sequence of HPCP vectors that evolves with time.

After a musical signal is input, we first perform spectral analysis to find its frequency components, using the constant-Q transform to convert the signal into a spectrogram. After the constant-Q transform there is a frequency filtering step, so only the frequency band between 100 and 5000 Hz is used. Peak detection is applied, so only the local maxima of the spectrum are considered. In the reference frequency computation procedure, we estimate the tuning deviation with respect to 440 Hz. The frequency-to-pitch-class mapping determines pitch class values from frequency values; it uses a weighting scheme with a cosine function and considers the presence of harmonic frequencies, taking into account a total of 8 harmonics for each frequency. Values are mapped with a resolution of one-third of a semitone, so the size of the pitch class distribution vector is 36. Finally, in the post-processing, the feature is normalized frame by frame, dividing by the maximum value to eliminate the dependency on global loudness. A sketch of one frame of this mapping is given below; the result of the full extraction looks like Figure 20.
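The following is a simplified sketch of the per-frame mapping just described (36 bins, cosine-squared weighting over a 4/3-semitone window, 8 harmonics with an assumed decay factor; the exact weights in the cited HPCP papers may differ).

```python
import numpy as np

def hpcp_frame(peak_freqs, peak_mags, fref=440.0, size=36, n_harm=8, decay=0.6):
    """Map the spectral peaks of one frame to a 36-bin pitch class profile."""
    pcp = np.zeros(size)
    Lw = 4.0                                    # window: 4/3 semitone = 4 bins
    for f, a in zip(peak_freqs, peak_mags):
        if not (100.0 <= f <= 5000.0):          # frequency band limitation
            continue
        for h in range(1, n_harm + 1):          # peak may be harmonic h of f0
            bin_f = (size * np.log2((f / h) / fref)) % size
            d = (np.arange(size) - bin_f + size / 2) % size - size / 2
            m = np.abs(d) <= Lw / 2             # bins inside the cosine window
            pcp[m] += np.cos(np.pi * d[m] / Lw) ** 2 * a**2 * decay ** (h - 1)
    return pcp / (pcp.max() + 1e-12)            # frame-wise normalization
```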

Figure 20. Example of an HPCP sequence.

Once we have the HPCP feature, we know the pitch content in each time section. It has been used in many papers to compute the similarity between two songs. Figure 21 shows a system for measuring the similarity between two songs. First, time-frequency analysis is used to extract the HPCP features. Then both songs' HPCP sequences are transposed to a global HPCP reference, so there is a common standard for comparison. Next, the two features are used to construct a binary similarity matrix, and the Smith-Waterman algorithm builds a local alignment matrix H in the dynamic programming local alignment stage (sketched below). Finally, after some post-processing, the distance between the two songs is computed; a distance threshold can then be used to select the songs we want.
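A minimal sketch of the local alignment step, assuming a Smith-Waterman style recursion over a binary similarity matrix S (1 = similar frames); the match and gap scores here are illustrative, not those of the cited paper.

```python
import numpy as np

def local_alignment(S, match=1.0, mismatch=-0.9, gap=-0.7):
    """Fill the Smith-Waterman matrix H for a binary similarity matrix S."""
    n, m = S.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if S[i - 1, j - 1] else mismatch
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + s,    # align frames i and j
                          H[i - 1, j] + gap,      # skip a frame in song 1
                          H[i, j - 1] + gap)      # skip a frame in song 2
    return H      # H.max() scores the best locally aligned subsequence
```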

Figure 21. Example of a music similarity measurement system.

D. Modified HHT for Detecting Fundamental Frequency


The traditional Hilbert-Huang transform (HHT) first decomposes the signal by empirical mode decomposition (EMD) into intrinsic mode functions (IMFs), x(t) = Σ_i c_i(t) + r(t), and then applies the Hilbert transform to each IMF to form the analytic signal

z_i(t) = c_i(t) + j·H[c_i](t) = a_i(t) e^(jθ_i(t))

whose instantaneous frequency is f_i(t) = (1/2π) dθ_i(t)/dt. However, the traditional HHT has some drawbacks. First, when a signal has multiple primary frequency components, the same frequency component may not reside in the same IMF. Second, a perturbation in the neighborhood of an extremum may change its position, which changes the upper and lower envelope curves and the mean function; more iterations are then needed to sift out an IMF with a suitable scale, which complicates the stopping criterion and increases the computational complexity. Third, the HHT is very sensitive to non-stationary components, whose presence complicates the task of fundamental frequency estimation.

Therefore, a modified HHT for fundamental frequency estimation has been proposed. Its block diagram is shown in Figure 22.

Figure 22. Block diagram of the modified HHT for fundamental frequency estimation.

After the signal is segmented with a window, a filter bank decomposes it into several narrowband signals, and weak bands are discarded using an energy threshold. EMD is then used to obtain each remaining band's IMFs, and IMFs whose frequencies lie outside the band's pass-band are discarded. Among the remaining IMFs, the one containing the fundamental frequency is selected as the IMF with maximum correlation with the original signal. Finally, the traditional Hilbert transform is applied to the selected IMF, and the median of the instantaneous frequency inside the effective window is taken as the fundamental frequency. A sketch of this pipeline follows.
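A hedged sketch of the pipeline, assuming the third-party PyEMD package (pip install EMD-signal) for the sifting step; the band edges, energy threshold, and filter order are illustrative choices, not those of the cited paper.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert
from PyEMD import EMD                     # assumed dependency for EMD sifting

def estimate_f0(x, fs, band_edges=(50, 200, 800, 3200)):
    best_corr, best_f0 = -1.0, None
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        xb = sosfiltfilt(sos, x)          # narrowband component
        if np.sum(xb**2) < 0.01 * np.sum(x**2):
            continue                      # discard weak bands (threshold assumed)
        for imf in EMD().emd(xb):         # sift the band into IMFs
            phase = np.unwrap(np.angle(hilbert(imf)))
            inst_f = np.diff(phase) * fs / (2 * np.pi)
            med_f = np.median(inst_f)     # median instantaneous frequency
            if not (lo <= med_f <= hi):
                continue                  # IMF lies outside the pass-band
            c = abs(np.corrcoef(imf, x)[0, 1])
            if c > best_corr:             # keep IMF best correlated with x
                best_corr, best_f0 = c, med_f
    return best_f0
```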

Figure 23. IMFs of C4 (261 Hz) obtained by sifting (a) without and (b) with the filter-bank preprocessing.

The modified HHT algorithm has three distinctive features. First, it uses a mirror approach to estimate the extrema beyond the signal boundaries in the EMD process. Second, it uses Rilling's stopping criterion in the EMD process to handle the mode-mixing problem. Third, to solve the problem of sub-harmonics and partials, it discards the weak bands of the original signal. It therefore performs better at estimating the fundamental frequency. Experimental results comparing the modified HHT with other methods are shown in Figure 24. From this section we see that the Hilbert-Huang transform can also be used to estimate the fundamental frequency, so it too has its uses for musical signals.

Figure 24. Performance comparison of hit rates for the YIN method, the HHT method, and the modified HHT.


VII. Conclusion

In this tutorial, we have seen that time-frequency analysis is more powerful than the classic Fourier transform for analyzing music signals. There are many types of time-frequency analysis, such as the short-time Fourier transform and the Wigner distribution function; however, not all time-frequency methods are appropriate for processing music signals, so we need to choose according to the situation. Musical scales are based on a logarithmic frequency scale, which is why we introduced the log-frequency spectrogram and the time-chroma representation. There are many applications in which time-frequency analysis can process musical signals, for instance onset detection, tempo estimation, and similarity measurement. Moreover, the Hilbert-Huang transform has some drawbacks, and the modified Hilbert-Huang transform adds preprocessing before the HHT in order to adapt it to musical signals. Some applications of time-frequency analysis surely remain undiscussed; I will keep studying to improve my knowledge, and I hope this tutorial offers readers some support and the basic related knowledge.

VIII. Reference

[1] Joan Serrà, Emilia Gómez, Perfecto Herrera, and Xavier Serra, "Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification," August 2008.
[2] William J. Pielemeier, Gregory H. Wakefield, and Mary H. Simoni, "Time-frequency Analysis of Musical Signals," September 1996.
[3] Jeremy F. Alm and James S. Walker, "Time-Frequency Analysis of Musical Instruments," 2002.
[4] Monika Dörfler, "What Time-Frequency Analysis Can Do To Music Signals," April 2004.
[5] EnShuo Tsau, Namgook Cho, and C.-C. Jay Kuo, "Fundamental Frequency Estimation For Music Signals with Modified Hilbert-Huang Transform."
[6] Meinard Müller, Daniel P. W. Ellis, Anssi Klapuri, and Gaël Richard, "Signal Processing for Music Analysis," IEEE Journal of Selected Topics in Signal Processing, Vol. 5, No. 6, October 2011.
[7] Masataka Goto, "An Audio-based Real-time Beat Tracking System for Music With or Without Drum-sounds," Journal of New Music Research, Vol. 30, No. 2, pp. 159-171, 2001.
[8] Kuo-Cyuan Kuo, "Fractional Fourier Transform and Time-Frequency Analysis and Apply to Acoustic Signals," Master Thesis, June 2008.
[9] Chung-Han Huang, tutorial of Time-Frequency Analysis for Music Signal Analysis.
[10] J. J. Ding, slides of time-frequency analysis and wavelet transform.

