
Small Vocabulary Singing Voice Synthesis using Neural Network Characterization of Vocal Features

Angelina A. Aquino

Abstract—A project is proposed on the design and implementation of a system to characterize acoustic and phonemic features of utterances in sung speech over a limited vocabulary and range of pitches, to find optimal parameters from the given data by backpropagation in a neural network, and to subsequently synthesize song by generating specified word-pitch sequences within the trained vocabulary and vocal range.
I. INTRODUCTION

In digital signal processing, acoustic signals and speech signals are typically characterized independently, using distinct methods of analysis. However, singing is an audio signal which combines a tonal component (the musical notation) with a linguistic component (the lyrics). The synthesis of a singing voice therefore requires the capability to analyze and reconstruct (a) the fundamentals and harmonics which produce a given musical note, (b) the formant frequencies which differentiate the phonemes in a given word, and (c) the effect of speaker characteristics, such as gender, on the formant frequencies [1].

Singing is produced by organic variations in the vocal tract, creating a signal with continuously changing properties. The variation in these features over time is difficult to model with closed-form mathematical expressions. However, a neural network can analyze large amounts of data and use multiple computational layers to formulate accurate models without requiring an explicit mathematical relation between the input variables [2]. The modeling of a spoken or sung signal can therefore be made more natural by characterizing its continuous properties with a neural network.

II. METHODOLOGY


A. Data Collection and Feature Extraction

A set of five test words (‘happy’, ‘birth’, ‘day’, ‘to’, ‘you’) will be recorded at equal tempo by two groups of speakers at two specific vocal ranges: ten female speakers, recording each set of words from C4 to B4, and ten male speakers, recording from C3 to B3.

Recordings will be analyzed using 20-millisecond frames.
For each frame, the current word and phoneme will be
specified. The fundamental frequency will be extracted using
high-time liftering of the signal cepstrum, while the four most
significant formants and their corresponding magnitudes will
be extracted using low-time liftering [3].
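As a minimal sketch of the liftering step described above, the fundamental frequency can be read from a peak in the high-quefrency region of the real cepstrum, while keeping only the low-quefrency bins yields a smooth spectral envelope from which formant peaks can be picked. The function name, pitch search range, and envelope cutoff below are illustrative assumptions, not values fixed by the proposal.

```python
import numpy as np

def cepstral_pitch_and_envelope(frame, fs, f0_min=80.0, f0_max=500.0):
    """Estimate F0 via high-time liftering and a smooth spectral
    envelope (for formant picking) via low-time liftering."""
    n = len(frame)
    spectrum = np.fft.rfft(frame * np.hamming(n))
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)

    # High-time liftering: search the quefrency range corresponding
    # to plausible pitch periods and pick the peak -> F0.
    q_lo, q_hi = int(fs / f0_max), int(fs / f0_min)
    q_peak = q_lo + np.argmax(cepstrum[q_lo:q_hi])
    f0 = fs / q_peak

    # Low-time liftering: keep only the first few quefrency bins,
    # which encode the slowly varying vocal-tract (formant) envelope.
    n_keep = 30  # envelope smoothness; an assumed value
    lifter = np.zeros_like(cepstrum)
    lifter[:n_keep] = 1.0
    lifter[-n_keep + 1:] = 1.0  # mirror half of the symmetric cepstrum
    envelope = np.exp(np.fft.rfft(cepstrum * lifter).real)
    return f0, envelope
```

Formant frequencies and magnitudes would then be taken from the four most prominent peaks of `envelope`, one frame at a time.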

B. Parameter Training using Neural Networks

The neural network will have three nodes at the input layer: the word to be sung, the note at which it is to be sung, and the gender of the speaker. The output at the final layer will be the fundamental and formant frequencies and amplitudes at each short-time window.

The extracted feature sets will be normalized and regularized. These will then be used to train the weights and biases of the neural network through backpropagation, with a sigmoid activation function for all nodes [4]. The number of nodes and hidden layers to be used will be optimized as needed.

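The training scheme above can be sketched in plain numpy: a small network with sigmoid activations at every node, updated by backpropagation on squared error. The layer sizes, learning rate, input encoding, and nine-dimensional output (F0 plus four formant frequency/amplitude pairs, all normalized to [0, 1]) are assumptions for illustration, not the proposal's final configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Assumed sizes: 3 inputs (word, note, gender), one hidden layer,
# 9 outputs (F0 + four formant frequency/amplitude pairs).
n_in, n_hidden, n_out = 3, 16, 9
W1 = rng.normal(0, 0.5, (n_hidden, n_in)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.5, (n_out, n_hidden)); b2 = np.zeros(n_out)

def forward(x):
    h = sigmoid(W1 @ x + b1)   # hidden activations
    y = sigmoid(W2 @ h + b2)   # normalized acoustic features
    return h, y

def train_step(x, target, lr=0.5):
    """One backpropagation step for squared error with sigmoid units."""
    global W1, b1, W2, b2
    h, y = forward(x)
    delta2 = (y - target) * y * (1 - y)     # output-layer error signal
    delta1 = (W2.T @ delta2) * h * (1 - h)  # hidden-layer error signal
    W2 -= lr * np.outer(delta2, h); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x); b1 -= lr * delta1
    return float(np.mean((y - target) ** 2))
```

In the full system, each training pair would be one analysis frame: the normalized (word, note, gender) triple as input and the frame's extracted F0 and formant features as the target.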
C. Synthesis by Concatenation

The outputs of the neural network will be used to produce short-time segments with varying formant frequencies, which will be concatenated to produce single words sung at the desired pitch. Multiple words may then be concatenated to produce continuous phrases of song.

III. DELIVERABLES

A. Halfway Point

By the second week of the project, it is expected that recording will have been accomplished for all speakers, features will have been extracted for all recordings, and a functional (if not optimized) neural network will be presentable.

B. Final Output

By the end of the project duration, a fully functional system is expected, capable of creating audio samples with continuous phrases of song, with specific word and pitch sequences.

REFERENCES

[1] J. Kaur and V. Narang, “Variation of pitch and formants in different age group,” International Journal of Multidisciplinary Research and Modern Education, vol. 1, no. 1, pp. 517–521, 2015.
[2] V. Maini and S. Sabri, “Machine learning for humans,” 2017. https://medium.com/machine-learning-for-humans.
[3] Sakshat-Virtual-Labs, “Cepstral analysis of speech,” 2011. http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1.
[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.