ABSTRACT
This paper presents singing synthesizers collaboratively designed by several developers. On the video-sharing Web site Nico Nico Douga, many creators jointly create songs with a singing synthesis system called Vocaloid. To synthesize various styles of singing, another singing synthesis system, UTAU, which is free software, is being developed and used by many creators. However, the sound quality of this system has not yet been as good as that of Vocaloid. The purpose of this study is to develop a singing synthesizer for UTAU through collaborative creation. Developers were encouraged to design a singing synthesizer by using a high-quality speech synthesis system named WORLD that can synthesize a singing voice that sounds as natural as a human voice. We released WORLD and a singing synthesizer for UTAU as free software with C language source code and attempted to encourage collaborative creation. As a result of our attempt, six singing synthesizers for UTAU and two original singing synthesis systems were developed and released. These were used to create many songs that were evaluated as high-quality singing by audiences on the video-sharing Web site Nico Nico Douga.
1. INTRODUCTION
Singing synthesis is a major research target in the field of
sound synthesis, and several commercial applications such
as Melodyne and Auto-Tune have already been used to
tune singing voices. Text-to-speech synthesis systems for singing have been released as computers have become sufficiently powerful. However, the sales of these applications have been poor.
After the release of the Vocaloid 2 Hatsune Miku [1], singing synthesis systems have played an important role in entertainment culture on the video-sharing Web site Nico Nico Douga, and many amateur creators have been uploading songs to the site. Several studies on Vocaloid have been carried out to synthesize natural singing voices [2, 3]. As a result, Vocaloid music is now a category of Japanese pop culture, in what has been termed the Hatsune Miku Phenomenon [1].
Social Creativity [4], which is collaborative creation [5] by multiple creators, has been gaining popularity as a
2.1 UTAU
UTAU is a Japanese singing synthesis system similar to Vocaloid. As shown in Fig. 1, the framework consists of an editor to manipulate parameters, a synthesizer, and a voice library associated with the singer. UTAU can switch voice libraries to synthesize various styles of singing and switch synthesizers to improve the sound quality. Although Vocaloid has only a few voice libraries (around 20), UTAU has far more (over 5,000) because it is easy for amateur singers to create a voice library.
The following sections describe the voice library for UTAU and the requirements for the singing synthesizer. To develop a synthesizer for UTAU, it is necessary to conform to the format of the voice and labeling data.
1 On Nico Nico Douga, audiences can effectively overlay text comments onto video content. This feature provides the audience with the sense of sharing the viewing experience [6].
2 http://en.wikipedia.org/wiki/Utau

Copyright: © 2013 First author et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Proceedings of the Stockholm Music Acoustics Conference 2013, SMAC 2013, Stockholm, Sweden
[Figure 1. Frameworks of Vocaloid (upper) and UTAU (lower). Each consists of an editor for manipulating parameters, a synthesizer, and a voice library (voice and label data). The Vocaloid synthesizer analyzes the voice into F0, spectral envelope, and excitation signal and provides time stretching, F0 modification, and timbre modification; in UTAU, the editor can switch among voice libraries created by singers and among original synthesizers with time-stretching and F0-modification functions.]

[Figure: three waveform panels (a)-(c); axes: amplitude versus time.]
3. DEVELOPMENT OF A SINGING
SYNTHESIZER BASED ON WORLD
Since voiced speech has an F0, the speech waveform includes not only the spectral envelope but also the F0 information. Many spectral envelope estimation methods based on linear predictive coding (LPC) [16] and the cepstrum [17] have been proposed. Among them, STRAIGHT [18] can accurately estimate the spectral envelope and can synthesize high-quality speech. TANDEM-STRAIGHT [19] produces the same results as STRAIGHT but at a lower computational cost, and STAR reduces the computational cost even further [20]. To calculate the spectral envelope, TANDEM-STRAIGHT uses two power spectra windowed by two window functions, whereas STAR produces the same result using only one power spectrum.
Figure 4. Four intervals used for determining the F0. The inverse of the average of the intervals is an F0 candidate, and the inverse of their standard deviation is used as the index to determine the best of the candidates.
F0 is one of the most important parameters for speech modification. Many F0 estimation methods (such as the cepstrum [8] and the autocorrelation-based method [9]) have therefore been proposed for accurate estimation. Although these methods can accurately estimate F0, they require extensive calculation such as the FFT.
DIO [10] is a rapid F0 estimation method for high-SNR speech that is based on fundamental component extraction. The fundamental component is extracted by low-pass filters, and the F0 is calculated as its frequency. Since the cut-off frequency that extracts only the fundamental component is unknown, DIO uses many low-pass filters with different cut-off frequencies and a periodicity score to determine the final F0 from all candidates.
DIO consists of three steps to calculate F0 candidates and periodicity scores: first, the input signal is low-pass filtered with many different cut-off frequencies; second, an F0 candidate and a periodicity score are calculated from each filtered signal on the basis of the four intervals shown in Fig. 4; third, the candidate with the best periodicity score is selected as the final F0.
H(ω) = (1/ω0) ∫_{-ω0/2}^{ω0/2} |S(ω + λ, τ)|² dλ,   (1)

where S(ω, τ) represents the spectrum of the windowed waveform and τ represents the temporal position for windowing. A Hanning window, which is used as the window function, has a length of 3T0 and is based on pitch-synchronous analysis [21]. ω0 represents the fundamental angular frequency (2πf0). By windowing this window function
[Figure 5. Spectral envelope estimated by STAR. The target spectrum consists of a pole and a dip. Linear predictive coding (LPC) could not estimate the spectral envelope, whereas TANDEM-STRAIGHT and STAR could. Axes: level (dB) versus frequency (Hz); legend: Original, LPC, TANDEM-STRAIGHT, STAR.]

[Figure 6. Voiced speech waveform with the origin used for windowing. Axes: amplitude versus time.]

[Figure 7. Waveforms of input speech (upper) and synthesized speech (bottom). Since PLATINUM can synthesize a windowed waveform, the output speech is almost all the same except for the temporal position of each excitation signal. Axes: amplitude versus time (sec).]

X(ω) = Y(ω) / H(ω)   (2)
As shown in Eq. (1), the spectral envelope H(ω) estimated by STAR is smoothed by a rectangular window. The inverse value of H(ω) can be calculated without an extremely high amplitude.
The pitch marking required for TD-PSOLA [23] is crucial because PLATINUM uses the windowed waveform as the glottal vibration for synthesis. To determine the temporal positions for calculating the spectrum Y(ω), PLATINUM uses an origin from the voiced speech and an F0 contour. The origin of each voiced segment is determined in the manner shown in Fig. 6: the center interval of the voiced speech is selected, and the time with the maximum amplitude is extracted as the origin for windowing. The other positions are automatically calculated from the F0 contour.
Figure 7 shows the waveforms of both the input and the synthesized speech. The waveform synthesized with WORLD is almost completely the same as the input waveform because PLATINUM can compensate for the windowed waveform by the minimum and maximum phase. The temporal positions of each glottal vibration are shifted because the F0 contour does not include the origin of the glottal vibrations.
In reference [22], a MUSHRA-based evaluation [24] was carried out. WORLD was compared with STRAIGHT [18] and TANDEM-STRAIGHT [19] as modern techniques and with the cepstrum [17] as a conventional one. Not only synthesized speech but also F0-scaled speech (F0 ±25%) and formant-shifted speech (±15%) were tested to determine the robustness of the modification. The speech used for the evaluation was of three males and three females. The sampling was 44,100 Hz/16 bit, and a 32-dB (A-weighted) room was used. Five subjects with normal hearing ability participated. This article shows only the results for WORLD, STRAIGHT, and TANDEM-STRAIGHT because the sound quality of the cepstrum is clearly low compared with these three. The results are shown in Table 1. Under almost all conditions, WORLD synthesizes the best speech.
4. EVALUATION
WORLD and a singing synthesizer that fulfills the requirements for UTAU were developed and released via a Web site4. Both the executable file and the C language source code were released to encourage collaborative creation by developers. Developers could use WORLD and release their synthesizer without any permission from us (they were released under the modified BSD license). An evaluation was performed to determine whether other singing synthesizers were developed and released. The number of contents uploaded to the video-sharing Web site was also examined.
4 http://ml.cs.yamanashi.ac.jp/world/
Table 1. Results of the MUSHRA-based evaluation.

                                STRAIGHT   TANDEM-STRAIGHT   WORLD
Synthesized speech              88.2       83.2              97.3
F0-scaled speech (+25%)         77.4       72.1              88.4
F0-scaled speech (-25%)         70.1       67.9              79.3
Formant-shifted speech (+15%)   71.4       71.4              73.2
Formant-shifted speech (-15%)   70.1       67.9              68.1
[Figure 8. Factors in evaluating a synthesizer as content generation software: the score and lyric as input, parameter tuning, the synthesis method, compatibility with the voice library, post-processing, and the mixdown of the singing into the final contents.]
4.2 Discussion
Six synthesizers based on WORLD were developed by four developers, and many contents were created and uploaded to the video-sharing Web site. In this section, we discuss our evaluation of the synthesizers.
4.2.1 Synthesizers as content generation software
Vocaloid and UTAU are singing synthesis systems used to support creation activities. Although the simplest evaluation of a singing synthesizer is a MOS evaluation of the synthesized singing voice, the contents consist of not only the singing but also the music. Post-processing such as adding reverb affects the quality of the music, and compatibility between the synthesizer and the library (including labeling data) also affects the quality. As shown in Fig. 8, there are various factors in evaluating the performance of a synthesizer as content generation software.
5. CONCLUSIONS
In this article, we described the development of singing
synthesizers for UTAU by collaborative creation among
many developers. The synthesizers were based on WORLD,
which is a high-quality speech synthesis system, and released via a Web site with C language source code. In
total, six synthesizers were developed, released, and used
to create music.
We also discussed our evaluation of the singing synthesizers. Although WORLD can synthesize speech that sounds as natural as the input speech, it is difficult to evaluate each synthesizer because there are so many factors in the music creation process.
We consider the proposed attempt to be a success because six synthesizers (half of all the synthesizers for UTAU) were developed, many creators used them, and their contents were evaluated as good. A discussion of how to evaluate the effectiveness of a singing synthesizer will be the key focus of our future work. We will also attempt to develop, through collaborative creation, another singing synthesis system that does not depend on UTAU.
Acknowledgments
This work was supported by JSPS KAKENHI Grant Numbers 23700221, 24300073, and 24650085.
6. REFERENCES
[1] H. Kenmochi, "Vocaloid and Hatsune Miku phenomenon in Japan," in Proc. INTERSINGING 2010, 2010, pp. 1-4.
[2] T. Nakano and M. Goto, "VocaListener: A singing-to-singing synthesis system based on iterative parameter estimation," in Proc. SMC2009, 2009, pp. 343-348.
[3] T. Nakano and M. Goto, "VocaListener2: A singing synthesis system able to mimic a user's singing in terms of voice timbre changes as well as pitch and dynamics," in Proc. ICASSP2011, 2011, pp. 453-456.
[4] G. Fischer, "Symmetry of ignorance, social creativity, and meta-design," Knowledge-Based Systems Journal, vol. 13, no. 7-8, pp. 527-537, 2000.
[5] M. Hamasaki, H. Takeda, and T. Nishimura, "Network analysis of massively collaborative creation of multimedia contents: case study of Hatsune Miku videos on Nico Nico Douga," in Proc. uxTV2008, 2008, pp. 165-168.
[6] K. Yoshii and M. Goto, "MusicCommentator: generating comments synchronized with musical audio signals by a joint probabilistic model of acoustic and textual features," Lecture Notes in Computer Science, LNCS 5709, pp. 85-97, 2009.
[7] H. Dudley, "Remaking speech," J. Acoust. Soc. Am., vol. 11, no. 2, pp. 169-177, 1939.
[8] A. M. Noll, "Cepstrum pitch determination," J. Acoust. Soc. Am., vol. 41, no. 2, pp. 293-309, 1967.
[9] L. R. Rabiner, "On the use of autocorrelation analysis for pitch detection," IEEE Trans. Acoust., Speech, and Signal Process., vol. 25, no. 1, pp. 24-33, 1977.
[10] M. Morise, H. Kawahara, and H. Katayose, "Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech," in Proc. AES 35th International Conference, 2009, CD-ROM.
[11] A. H. Nuttall, "Some windows with very good sidelobe behavior," IEEE Trans. Acoust., Speech, and Signal Process., vol. 29, no. 1, pp. 84-91, 1981.
[12] H. Kawahara, A. de Cheveigne, H. Banno, T. Takahashi, and T. Irino, "Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT," in Proc. ICSLP2005, 2005, pp. 537-540.
[14] A. Camacho and J. Harris, "A sawtooth waveform inspired pitch estimator for speech and music," J. Acoust. Soc. Am., vol. 124, no. 3, pp. 1638-1652, 2008.
[15] M. Morise, H. Kawahara, and T. Nishiura, "Rapid F0 estimation for high-SNR speech based on fundamental component extraction," IEICE Trans. on Information and Systems, vol. J93-D, no. 2, pp. 109-117, 2010 (in Japanese).
[16] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Am., vol. 50, no. 2B, pp. 637-655, 1971.
[17] A. M. Noll, "Short-time spectrum and cepstrum techniques for vocal pitch detection," J. Acoust. Soc. Am., vol. 36, no. 2, pp. 296-302, 1964.
[18] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction," Speech Communication, vol. 27, no. 3-4, pp. 187-207, 1999.
[19] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno, "TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation," in Proc. ICASSP2008, 2008, pp. 3933-3936.
[20] M. Morise, T. Matsubara, K. Nakano, and T. Nishiura, "A rapid spectrum envelope estimation technique of vowel for high-quality speech synthesis," IEICE Trans. on Information and Systems, vol. J94-D, no. 7, pp. 1079-1087, 2011 (in Japanese).
[21] M. V. Mathews, J. E. Miller, and E. E. David, "Pitch synchronous analysis of voiced sounds," J. Acoust. Soc. Am., vol. 33, no. 2, pp. 179-185, 1961.
[22] M. Morise, "PLATINUM: A method to extract excitation signals for voice synthesis system," Acoust. Sci. & Tech., vol. 33, no. 2, pp. 123-125, 2012.
[23] C. Hamon, E. Moulines, and F. Charpentier, "A diphone synthesis system based on time-domain prosodic modifications of speech," in Proc. ICASSP89, 1989, pp. 238-241.
[24] "Method for the subjective assessment of intermediate quality level of coding systems," ITU-R Recommendation BS.1534-1, 2003.
[25] M. Morise, M. Onishi, H. Kawahara, and H. Katayose, "v.morish'09: A morphing-based singing design interface for vocal melodies," Lecture Notes in Computer Science, LNCS 5709, pp. 185-190, 2009 (in Japanese).