Abstract—A project is proposed on the design and implementation of a system to characterize acoustic and phonemic features of utterances in sung speech over a limited vocabulary and range of pitches, to find optimal parameters from the given data by backpropagation in a neural network, and to subsequently synthesize song by generating specified word-pitch sequences within the trained vocabulary and vocal range.

I. INTRODUCTION

In digital signal processing, acoustic signals and speech signals are typically characterized independently, using distinct methods of analysis. However, singing is an audio signal which consists of both notation, a tonal component, and lyrics, a linguistic component. The synthesis of a singing voice therefore requires the capability to analyze and reconstruct (a) the fundamentals and harmonics which produce a given musical note, (b) the formant frequencies which differentiate the phonemes in a given word, and (c) the effect of speaker characteristics, such as gender, on the formant frequencies [1].

Singing is produced by organic variations in the vocal tract, creating a signal with continuously changing properties. The variation in these features over time may be difficult to model using rational expressions. However, a neural network is able to analyze large amounts of data and use various computational layers to formulate accurate models without the need for a mathematical expression relating input variables [2]. The modeling of a spoken or sung signal can therefore be made more natural by characterizing its continuous properties with a neural network.

the gender of the speaker. The output at the final layer will be the fundamental and formant frequencies and amplitudes at each short-time window.

The extracted feature sets will be normalized and regularized. These will then be used to train the weights and biases of the neural network through backpropagation, with a sigmoid activation function for all nodes [4]. The number of nodes and hidden layers to be used will be optimized as needed.

C. Synthesis by Concatenation

The outputs of the neural network will be used to produce short-time segments with varying formant frequencies, which will be concatenated to produce single words sung at the desired pitch. Multiple words may then be concatenated to produce continuous phrases of song.

III. DELIVERABLES

A. Halfway Point

By the second week of the project, it is expected that recording will have been accomplished for all speakers, features will have been extracted for all recordings, and a functional (if not optimized) neural network will be presentable.

B. Final Output

By the end of the project duration, a fully-functional system is expected, capable of creating audio samples with continuous phrases of song, with specific word and pitch sequences.