
Proceedings of the Stockholm Music Acoustics Conference 2013, SMAC 2013, Stockholm, Sweden

AN ATTEMPT TO DEVELOP A SINGING SYNTHESIZER BY COLLABORATIVE CREATION

Masanori Morise
Faculty of Engineering, University of Yamanashi, Japan
mmorise@yamanashi.ac.jp

ABSTRACT


This paper presents singing synthesizers collaboratively designed by several developers. On the video-sharing Web site Nico Nico Douga, many creators jointly create songs with a singing synthesis system called Vocaloid. To synthesize various styles of singing, another singing synthesis system, UTAU, which is free software, is being developed and used by many creators. However, the sound quality of this system has not yet matched that of Vocaloid. The purpose of this study is to develop a singing synthesizer for UTAU by collaborative creation. Developers were encouraged to design a singing synthesizer by using a high-quality speech synthesis system named WORLD that can synthesize a singing voice that sounds as natural as a human voice. We released WORLD and a singing synthesizer for UTAU as free software with C language source code and attempted to encourage collaborative creation. As a result of our attempt, six singing synthesizers for UTAU and two original singing synthesis systems were developed and released. These were used to create many songs that audiences on Nico Nico Douga evaluated as high-quality singing.


1. INTRODUCTION
Singing synthesis is a major research target in the field of sound synthesis, and several commercial applications such as Melodyne and Auto-Tune have already been used to tune singing voices. Text-To-Speech synthesis systems for singing have been released as computers have become sufficiently powerful. However, the sales of these applications have been poor.
After the release of Vocaloid 2 Hatsune Miku [1], singing synthesis systems have played an important role in entertainment culture on the video-sharing Web site Nico Nico Douga, and many amateur creators have been uploading songs to the site. Several studies on Vocaloid have been carried out to synthesize natural singing voices [2, 3]. As a result, Vocaloid music is now a category of Japanese pop culture, in what has been termed the Hatsune Miku Phenomenon [1].
Social Creativity [4], a form of collaborative creation [5] by multiple creators, has been gaining popularity as a new style of creation to improve the quality of contents on video-sharing Web sites. Today, many creators jointly create many contents including songs, promotional videos, and comments¹.
The purpose of this study is to develop singing synthesizers by collaborative creation. We implemented a base technology named WORLD and a singing synthesizer for UTAU and released them on a Web site to encourage collaboration. This attempt functions as indirect support for creating songs with the developed synthesizers.
The rest of this article is organized as follows. Section 2 describes conventional singing synthesis systems and outlines the requirements for developing one. In Section 3, we explain the principle behind and effectiveness of the base technology, WORLD. Section 4 reveals whether synthesizers could be developed by other developers and discusses the result of our attempt. We conclude in Section 5 with a brief summary of our research and a mention of future work.

2. SINGING SYNTHESIS SYSTEMS: VOCALOID AND UTAU

Vocaloid is a Text-To-Speech synthesis system for singing with which creators can synthesize singing from lyrics and scores. The demand to synthesize various kinds of singing has been growing due to the rapid growth of Vocaloid, but it has been virtually impossible to meet this demand. To solve this problem, UTAU² was developed as a singing synthesis system.

2.1 UTAU
UTAU is a Japanese singing synthesis system similar to
Vocaloid. As shown in Fig. 1, the framework consists of
an editor to manipulate parameters, a synthesizer, and a
voice library associated with the singer. UTAU can switch
the voice library to synthesize various styles of singing
and synthesizers to improve the sound quality. Although
Vocaloid has a few voice libraries (around 20), UTAU has
far more (over 5,000) because creating a voice library of
amateur singers is easy.
The following sections describe the voice library for UTAU
and the requirements for the singing synthesizer. To develop a synthesizer for UTAU, it is necessary to adjust the
format for the voice and labeling data.

¹ On Nico Nico Douga, audiences can overlay text comments onto video content. This feature provides the audience with the sense of sharing the viewing experience [6].
² http://en.wikipedia.org/wiki/Utau

Copyright: © 2013 First author et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Figure 1. Overview of Vocaloid and UTAU. They consist of an editor, a voice library, and a synthesis method. Switching voice libraries and synthesizers is possible in UTAU.

2.1.1 Voice library

UTAU supports both CV and VCV synthesis for Japanese singing³, and recording all phonemes is required for the voice library. The recorded voices are then labeled to fulfill the requirements of UTAU. This is done automatically by using free software developed by another developer.

³ Singing in other languages is available, but it is difficult because the editor cannot support other languages.

2.1.2 Labeling data

Figure 2. Labeling data of a singing voice /shi/. UTAU requires determining these positions.

Figure 2 shows an example of the labeling data of a CV voice /shi/. Three intervals are used for synthesis: Interval (a) is the interval for smoothly mixing two voices, Interval (b) is the consonant interval, whose endpoint is used as the origin of a CV voice, and Interval (c) is the voiced speech and is used to stretch the duration of the voice. For VCV voices, a phoneme boundary between V and C is added. A hypothetical layout for one label entry is sketched below.
To develop a singing synthesizer for UTAU, developers must implement at least the following functions:

- Time-stretching function
- F0-modification function
- Timbre-modification (at least formant-shift) function

These functions are used to adjust the voices to the desired musical note, and developers can implement them with their own algorithms. However, the F0-modification function must be implemented as two processes: normalizing the F0 contour to the musical scale and then adding the detailed contour given by the editor. A minimal sketch of this two-step process is given below.
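As a minimal sketch of the required F0-modification function (our own illustration, not code from a released synthesizer), the two processes can be applied per frame as follows, with the note frequency taken from the score and the detail contour from the editor. The normalization by the mean of the voiced frames is an assumed, simple choice.

#include <math.h>

/* Two-step F0 modification required by UTAU (illustrative sketch):
 * 1) normalize the analyzed F0 contour to the target musical note,
 * 2) add back the detailed contour (vibrato, expression) given by
 *    the editor. Contours are per-frame F0 values in Hz; unvoiced
 *    frames are marked with F0 = 0. */
void modify_f0(const double *analyzed_f0, const double *detail_cents,
               double note_f0, int num_frames, double *output_f0) {
  /* The average of the voiced frames gives the recording's pitch. */
  double sum = 0.0;
  int voiced = 0;
  for (int i = 0; i < num_frames; ++i) {
    if (analyzed_f0[i] > 0.0) { sum += analyzed_f0[i]; ++voiced; }
  }
  double mean_f0 = (voiced > 0) ? sum / voiced : note_f0;

  for (int i = 0; i < num_frames; ++i) {
    if (analyzed_f0[i] <= 0.0) { output_f0[i] = 0.0; continue; }
    /* Step 1: shift the contour so its mean lies on the note. */
    double normalized = analyzed_f0[i] * (note_f0 / mean_f0);
    /* Step 2: apply the editor's detail contour, given in cents. */
    output_f0[i] = normalized * pow(2.0, detail_cents[i] / 1200.0);
  }
}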

2.2 Problem of UTAU

Three synthesizers have been officially released by the UTAU developer to manipulate the subtle atmosphere of the singing. Three other synthesizers have been released by other developers because differences between synthesizers affect the quality of the content. Since the sound quality often deteriorates depending on the compatibility between the voice library and the synthesizer, various types of synthesizers should be developed.
In this study, we developed a singing synthesizer for UTAU based on a high-quality speech synthesis system and attempted to induce collaborative creation by releasing its C language source code.
3. DEVELOPMENT OF A SINGING SYNTHESIZER BASED ON WORLD

Figure 3. Overview of WORLD.

The base technology WORLD is a vocoder-based system [7] that decomposes the input speech into an F0, a spectral envelope, and an excitation signal, and it can synthesize speech that sounds as natural as human speech. Three parameters can be modified to fulfill the requirements of UTAU: time stretching, F0 modification, and timbre modification (as shown in Fig. 3).
WORLD first estimates the F0 and then uses the F0 information to estimate the spectral envelope. The excitation signal is then extracted using the F0 and spectral envelope information.

3.1 DIO: F0 estimation method

The F0 of a voiced sound is defined as the inverse of the shortest period of glottal vibration. It is one of the most important parameters for speech modification. Many F0 estimation methods (such as Cepstrum [8] and the autocorrelation-based method [9]) have therefore been proposed for accurate estimation. Although these methods can estimate the F0 accurately, they require extensive calculation such as the FFT.


DIO [10] is a rapid F0 estimation method for high-SNR speech that is based on fundamental component extraction. The fundamental component is extracted by low-pass filters, and the F0 is calculated as its frequency. Since the cutoff frequency that extracts only the fundamental component is unknown, DIO uses many low-pass filters with different cutoff frequencies and a periodicity score to determine the final F0 from all the candidates.
DIO consists of three steps to calculate the F0 candidates and periodicity scores:

Step 1: Filtering by many low-pass filters with different cutoff frequencies, from low frequency to high frequency.
Step 2: Calculation of F0 candidates and periodicity scores.
Step 3: Determination of the final F0 based on the periodicity scores.

In the first step, the input waveform is filtered by many low-pass filters. DIO uses a Nuttall window [11] as a low-pass filter with sidelobes of around -90 dB. The filtered signal is a sine wave at F0 Hz, provided that the filter is designed so that only the fundamental component is extracted. One such filtering stage is sketched below.
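As an illustrative sketch of one filtering stage (the window length of two periods of the cutoff frequency and the unity-gain normalization are our own choices for the example, not the released DIO code), the waveform can be convolved with a 4-term Nuttall window:

#include <math.h>
#include <stdlib.h>

#define NUTTALL_PI 3.14159265358979323846

/* Low-pass filter one candidate band by direct convolution with a
 * 4-term Nuttall window [11] spanning two periods of the cutoff
 * frequency. Returns 0 on success. Illustrative sketch only;
 * assumes cutoff_hz is well below fs. */
int lowpass_nuttall(const double *x, int n, int fs,
                    double cutoff_hz, double *y) {
  int len = (int)(2.0 * fs / cutoff_hz);
  if (len < 2) return -1;
  double *w = (double *)malloc(sizeof(double) * (size_t)len);
  if (w == NULL) return -1;

  double gain = 0.0;
  for (int i = 0; i < len; ++i) {
    double t = (double)i / (double)(len - 1);
    w[i] = 0.355768 - 0.487396 * cos(2.0 * NUTTALL_PI * t)
         + 0.144232 * cos(4.0 * NUTTALL_PI * t)
         - 0.012604 * cos(6.0 * NUTTALL_PI * t);
    gain += w[i];
  }
  for (int i = 0; i < len; ++i) w[i] /= gain; /* unity DC gain */

  for (int i = 0; i < n; ++i) {               /* direct convolution */
    double acc = 0.0;
    for (int j = 0; j < len && j <= i; ++j) acc += w[j] * x[i - j];
    y[i] = acc;
  }
  free(w);
  return 0;
}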

Figure 4. Four intervals used for determining the F0. The inverse of their average is an F0 candidate, and the inverse of their standard deviation is used as the index to determine the best candidate.

In the second step, the four intervals shown in Fig. 4 are calculated at all temporal positions. The four intervals are defined as the negative- and positive-going zero-crossing intervals and the intervals between the peaks and dips. If the filtered signal is a sine wave, the four intervals have the same value, the inverse of their average indicates the F0, and the inverse of their standard deviation can be used as the periodicity score, as sketched below.
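The computation at one temporal position can be sketched as follows (a simplified illustration of the second step):

#include <math.h>

/* Given the four interval lengths (in seconds) measured around one
 * temporal position -- the negative- and positive-going zero-crossing
 * intervals and the peak/dip intervals of Fig. 4 -- return the F0
 * candidate and store its periodicity score. For a pure sine wave
 * all four intervals agree, so the standard deviation approaches
 * zero and the score becomes very large. */
double f0_candidate(const double intervals[4], double *score) {
  double mean = 0.0, var = 0.0;
  for (int i = 0; i < 4; ++i) mean += intervals[i] / 4.0;
  for (int i = 0; i < 4; ++i)
    var += (intervals[i] - mean) * (intervals[i] - mean) / 4.0;
  *score = 1.0 / (sqrt(var) + 1e-12); /* guard against division by 0 */
  return 1.0 / mean;
}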

In the final step, the F0 with the highest periodicity score is selected as the final F0. DIO calculates the F0 much faster than conventional methods because it does not use frame-by-frame FFT processing. The F0 estimation performance of DIO is the same as that of conventional techniques [12-14], while it is at least 28 times faster than the other methods [15].

3.2 STAR: Spectral envelope estimation method

Since voiced speech has an F0, the speech waveform includes not only the spectral envelope but also the F0 information. Many estimation methods based on linear predictive coding (LPC) [16] and Cepstrum [17] have been proposed. Among them, STRAIGHT [18] can accurately estimate the spectral envelope and synthesize high-quality speech. TANDEM-STRAIGHT [19] produces the same results as STRAIGHT at a lower computational cost, and STAR reduces the computational cost even further [20]. To calculate the spectral envelope, TANDEM-STRAIGHT uses two power spectra windowed by two window functions, whereas STAR produces the same result using only one power spectrum.

In STAR, the spectral envelope |H(ω, τ)| is given by

|H(\omega, \tau)|^2 = \exp\left( \frac{1}{\omega_0} \int_{-\omega_0/2}^{\omega_0/2} \log\left( |S(\omega + \lambda, \tau)|^2 \right) \, d\lambda \right), \quad (1)

where S(ω, τ) represents the spectrum of the windowed waveform and τ represents the temporal position for windowing. A Hanning window, which is used as the window function, has a length of 3T₀ and is based on pitch-synchronous analysis [21]. ω₀ represents the fundamental angular frequency (2πf₀). By windowing with this window function and smoothing with Eq. (1), |H(ω, τ)|² is temporally stable.

Figure 5. Spectral envelope estimated by STAR. The target spectrum consists of a pole and a dip. Linear predictive coding (LPC) could not estimate the spectral envelope, whereas TANDEM-STRAIGHT and STAR could.

Figure 5 shows an example of the estimation result. The target spectral envelope consists of a pole and a dip, and LPC could not accurately estimate the envelope from the spectrum that includes the dip. In contrast, TANDEM-STRAIGHT and STAR could, with STAR completing the estimation in half the time of TANDEM-STRAIGHT.
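Reading Eq. (1) as a rectangular smoothing of the log power spectrum over a band one f0 wide, a discrete sketch is as follows (the discretization and boundary handling are our own, not the released STAR code):

#include <math.h>

/* Discrete version of Eq. (1): smooth the log power spectrum with a
 * rectangular window one f0 wide (our reading of the equation).
 * power[k] holds |S|^2 on a linear grid of bin_hz Hz per bin and is
 * assumed strictly positive; envelope[k] receives |H|^2. */
void star_smooth(const double *power, int bins, double bin_hz,
                 double f0, double *envelope) {
  int half = (int)(0.5 * f0 / bin_hz); /* half-width = f0/2 */
  for (int k = 0; k < bins; ++k) {
    double acc = 0.0;
    int count = 0;
    for (int m = -half; m <= half; ++m) {
      int idx = k + m;
      if (idx < 0) idx = -idx;                   /* mirror at DC      */
      if (idx >= bins) idx = 2 * bins - 2 - idx; /* mirror at Nyquist */
      acc += log(power[idx]);
      ++count;
    }
    envelope[k] = exp(acc / count); /* geometric-mean smoothing */
  }
}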

3.3 PLATINUM: Excitation signal extraction method

PLATINUM is a method that extracts the excitation signal from the windowed waveform, the spectral envelope, and the F0 information [22]. In typical vocoder-based systems, a pulse is used as the excitation signal, the signal calculated from the spectral envelope with minimum phase is used as the impulse response of voiced speech, and white noise is used as the excitation signal to synthesize consonants. PLATINUM instead calculates the phase information of the windowed waveform and uses it when synthesizing.
The observed spectrum Y(ω) is defined as the product of the spectral envelope H(ω) and the target spectrum X(ω) for reconstructing the waveform. When the phase of the spectral envelope H(ω) is minimum, an inverse filter is given by simply calculating the inverse of H(ω). Since minimum phase is used as the phase information of H(ω) in vocoder-based systems, the target spectrum X(ω) is given by

X(\omega) = \frac{Y(\omega)}{H(\omega)}. \quad (2)

As shown in Eq. (1), the spectral envelope H(ω) estimated by STAR is smoothed by a rectangular window, so the inverse of H(ω) can be calculated without extremely high amplitudes.
The pitch marking required for TD-PSOLA [23] is crucial because PLATINUM uses the windowed waveform as the glottal vibration for synthesis. To calculate the temporal positions for calculating the spectrum Y(ω), PLATINUM uses an origin of the voiced speech and the F0 contour. The origin of each voiced speech segment is determined in the manner shown in Fig. 6: the center interval of the voiced speech is selected, and the time with the maximum amplitude is extracted as the origin for windowing. The other positions are automatically calculated from the F0 contour.

Figure 6. Determination of the origin in voiced speech. The index with the maximum amplitude around the center of the speech is selected. The other positions are automatically calculated from the origin and the F0 contour.



3.4 Sound quality of the synthesized speech

Figure 7. Waveforms of the input speech (upper) and the synthesized speech (bottom). Since PLATINUM can synthesize the windowed waveform, the output speech is almost identical except for the temporal position of each excitation signal.

Figure 7 shows the waveforms of both the input and the synthesized speech. The waveform synthesized with WORLD is almost completely the same as the input waveform because PLATINUM can compensate for the windowed waveform with the minimum and maximum phase. The temporal positions of each glottal vibration are shifted because the F0 contour does not include the origin of the glottal vibrations.
In reference [22], a MUSHRA-based evaluation [24] was carried out. WORLD was compared with STRAIGHT [18] and TANDEM-STRAIGHT [19] as modern techniques and with Cepstrum [17] as a conventional one. Not only synthesized speech but also F0-scaled speech (F0 ±25%) and formant-shifted speech (±15%) were tested to determine the robustness of the modification. The speech samples used for the evaluation were of three males and three females. The sampling was 44,100 Hz/16 bit, and a 32-dB (A-weighted) room was used. Five subjects with normal hearing ability participated. This article shows only the results for WORLD, STRAIGHT, and TANDEM-STRAIGHT because the sound quality of Cepstrum is clearly low compared with these three. The results are shown in Table 1. Under almost all conditions, WORLD synthesized the best speech.
4. EVALUATION

WORLD and a singing synthesizer that fulfills the requirements of UTAU were developed and released via a Web site⁴. Both the executable file and the C language source code were released to encourage collaborative creation by developers. Developers could use WORLD and release their own synthesizers without any permission from us (everything was released under the modified BSD license). An evaluation was performed to determine whether other singing synthesizers were developed and released. The number of contents uploaded to the video-sharing Web site was also counted in order to collect comments on the sound quality of the synthesizers.

⁴ http://ml.cs.yamanashi.ac.jp/world/


                                 STRAIGHT   TANDEM-STRAIGHT   WORLD
Synthesized speech                   88.2              83.2    97.3
F0-scaled speech (+25%)              77.4              72.1    88.4
F0-scaled speech (-25%)              70.1              67.9    79.3
Formant-shifted speech (+15%)        71.4              71.4    73.2
Formant-shifted speech (-15%)        70.1              67.9    68.1

Table 1. The sound quality of speech synthesized with each method.


4.1 Created singing synthesizers

As of April 2013, six synthesizers created by four developers have been released, and two original singing synthesis applications have been created by one developer. More than 70 contents were uploaded to Nico Nico Douga by several creators.
The comments collected from Nico Nico Douga were analyzed to determine the effectiveness of the synthesizers. Almost all comments on the sound quality of the developed synthesizers were positive. It was also suggested that the subtle atmosphere depended on the synthesizer even when the voices were synthesized by WORLD-based synthesizers. On the other hand, there were some remarks about the compatibility between the synthesizer and the voice library.


4.2 Discussion

Six synthesizers based on WORLD were developed by four developers, and many contents were created and uploaded to the video-sharing Web site. In this section, we discuss our evaluation of the synthesizers.
4.2.1 Synthesizers as content generation software

Figure 8. Music creation process. It is difficult to evaluate the synthesized singing voice because the quality of the music does not depend solely on the singing voice.

Vocaloid and UTAU are singing synthesis systems used to support creative activities. Although the simplest evaluation of a singing synthesizer is a MOS evaluation of the synthesized singing voice, the content consists of not only the singing but also the music. Post-processing such as adding reverb affects the quality of the music, and the compatibility between the synthesizer and the library (including the labeling data) also affects it. As shown in Fig. 8, various factors enter into evaluating the performance of a synthesizer as content generation software. Not only the synthesizer but also the post-processing and the mixdown can change the subtle atmosphere of the singing voice.


4.2.2 Effectiveness of the collaborative creation

The purpose of this study was to support collaborative creation by developers. We consider our attempt a success because six synthesizers were developed and subsequently used to create music. In the past, three synthesizers were released by the developer of UTAU, and three synthesizers that do not use WORLD were released by other developers. In our case, six synthesizers using WORLD were released. The performance of these synthesizers was verified by other people, which is the collaborative element of the verification.
The next step of this attempt is to develop other synthesizers that do not depend on UTAU. Although two such systems have already been developed, they rely on the labeling data and functions of UTAU. Since UTAU requires adjusting the format to synthesize the singing voice, WORLD does not achieve its full potential. Singing voice morphing [25] has potential for use in the field of singing synthesis. More flexible modification will be the primary focus of our future work.

5. CONCLUSIONS

In this article, we described the development of singing synthesizers for UTAU by collaborative creation among many developers. The synthesizers were based on WORLD, a high-quality speech synthesis system, and were released via a Web site with C language source code. In total, six synthesizers were developed, released, and used to create music.
We also discussed our evaluation of the singing synthesizers. Although WORLD can synthesize speech that sounds as natural as the input speech, it is difficult to evaluate each synthesizer because there are so many factors in the music creation process.
We consider the proposed attempt to be a success because six synthesizers (half of all the synthesizers for UTAU) were developed, many creators used them, and their contents were evaluated as good. A discussion of how to evaluate the effectiveness of a singing synthesizer will be the key focus of our future work. We will also attempt to develop another singing synthesis system that does not depend on UTAU by collaborative creation.


Acknowledgments
This work was supported by JSPS KAKENHI Grant Numbers 23700221, 24300073, and 24650085.
6. REFERENCES

[1] H. Kenmochi, "Vocaloid and Hatsune Miku phenomenon in Japan," in Proc. InterSinging 2010, 2010, pp. 1-4.
[2] T. Nakano and M. Goto, "VocaListener: A singing-to-singing synthesis system based on iterative parameter estimation," in Proc. SMC 2009, 2009, pp. 343-348.
[3] T. Nakano and M. Goto, "VocaListener2: A singing synthesis system able to mimic a user's singing in terms of voice timbre changes as well as pitch and dynamics," in Proc. ICASSP 2011, 2011, pp. 453-456.
[4] G. Fischer, "Symmetry of ignorance, social creativity, and meta-design," Knowledge-Based Systems Journal, vol. 13, no. 7-8, pp. 527-537, 2000.
[5] M. Hamasaki, H. Takeda, and T. Nishimura, "Network analysis of massively collaborative creation of multimedia contents: Case study of Hatsune Miku videos on Nico Nico Douga," in Proc. uxTV 2008, 2008, pp. 165-168.
[6] K. Yoshii and M. Goto, "MusicCommentator: Generating comments synchronized with musical audio signals by a joint probabilistic model of acoustic and textual features," Lecture Notes in Computer Science, LNCS 5709, pp. 85-97, 2009.
[7] H. Dudley, "Remaking speech," J. Acoust. Soc. Am., vol. 11, no. 2, pp. 169-177, 1939.
[8] A. M. Noll, "Cepstrum pitch determination," J. Acoust. Soc. Am., vol. 41, no. 2, pp. 293-309, 1967.
[9] L. R. Rabiner, "On the use of autocorrelation analysis for pitch detection," IEEE Trans. Acoust., Speech, and Signal Process., vol. 25, no. 1, pp. 24-33, 1977.
[10] M. Morise, H. Kawahara, and H. Katayose, "Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech," in Proc. AES 35th International Conference, 2009, CD-ROM.
[11] A. H. Nuttall, "Some windows with very good sidelobe behavior," IEEE Trans. Acoust., Speech, and Signal Process., vol. 29, no. 1, pp. 84-91, 1981.
[12] H. Kawahara, A. de Cheveigné, H. Banno, T. Takahashi, and T. Irino, "Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT," in Proc. ICSLP 2005, 2005, pp. 537-540.
[13] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Am., vol. 111, no. 4, pp. 1917-1930, 2002.
[14] A. Camacho and J. Harris, "A sawtooth waveform inspired pitch estimator for speech and music," J. Acoust. Soc. Am., vol. 124, no. 3, pp. 1638-1652, 2008.
[15] M. Morise, H. Kawahara, and T. Nishiura, "Rapid F0 estimation for high-SNR speech based on fundamental component extraction," IEICE Trans. on Information and Systems, vol. J93-D, no. 2, pp. 109-117, 2010 (in Japanese).
[16] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Am., vol. 50, no. 2B, pp. 637-655, 1971.
[17] A. M. Noll, "Short-time spectrum and cepstrum techniques for vocal pitch detection," J. Acoust. Soc. Am., vol. 36, no. 2, pp. 296-302, 1964.
[18] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction," Speech Communication, vol. 27, no. 3-4, pp. 187-207, 1999.
[19] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno, "TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation," in Proc. ICASSP 2008, 2008, pp. 3933-3936.
[20] M. Morise, T. Matsubara, K. Nakano, and T. Nishiura, "A rapid spectrum envelope estimation technique of vowel for high-quality speech synthesis," IEICE Trans. on Information and Systems, vol. J94-D, no. 7, pp. 1079-1087, 2011 (in Japanese).
[21] M. V. Mathews, J. E. Miller, and E. E. David, "Pitch synchronous analysis of voiced sounds," J. Acoust. Soc. Am., vol. 33, no. 2, pp. 179-185, 1961.
[22] M. Morise, "PLATINUM: A method to extract excitation signals for voice synthesis system," Acoust. Sci. & Tech., vol. 33, no. 2, pp. 123-125, 2012.
[23] C. Hamon, E. Moulines, and F. Charpentier, "A diphone synthesis system based on time-domain prosodic modifications of speech," in Proc. ICASSP '89, 1989, pp. 238-241.
[24] "Method for the subjective assessment of intermediate quality level of coding systems," ITU-R Recommendation BS.1534-1, 2003.
[25] M. Morise, M. Onishi, H. Kawahara, and H. Katayose, "v.morish'09: A morphing-based singing design interface for vocal melodies," Lecture Notes in Computer Science, LNCS 5709, pp. 185-190, 2009 (in Japanese).
