TTS Overview

Text-To-Speech Synthesis
An Overview
What is a TTS System

Goal
A system that can read any text
Automatic production of new sentences
Not just audio playback or simple voice response systems
Definition
The production of speech by machines, by way of the automatic phonetization of the sentences to utter
Text-To-Speech
Text Processing: Text Normalization, Pronunciation, Timing and Intonation
Speech Generation: Segmental Concatenation, Waveform Synthesis
Functional Diagram
TTS Synthesizer: Text -> Natural Language Processing (morphosyntactic analysis, letter-to-sound, prosody generation) -> narrow phonetic transcription (phones, prosody) -> Digital Signal Processing (mathematical models, algorithms, computations) -> Speech
The Natural Language Processing Module

Text -> NLP Module -> Phone Names, Prosody
NLP Module components:
Morphosyntactic Analyzer: Preprocessor, Morphological Analyzer, Contextual Analyzer, Syntactic and Prosodic Parser
Letter-to-Sound Module
Natural Prosody Generator
Text Preprocessing
Challenges
Text Segmentation, Tokenization, Sentence End Detection, Normalization
(i) () (know) ( ) (1) (,) (000) ( ) (words)

Jones lives at the end of St. James St.
Abbreviations
Acronyms
Numbers: 1.023,32; 12/1/2002; 13:23; 12.15
Text Preprocessing
Dealing with Non-Standard Words
Tokenizer
Breaks up single tokens that need splitting, e.g. 12:35AM -> 12 : 35 AM
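The splitting step can be illustrated with a short sketch. It assumes a hypothetical rule (not taken from the slides) that cuts a token into runs of digits, runs of letters, and single symbols; real tokenizers use richer, language-specific rules.

```python
import re

# Assumed splitting rule: a run of digits, a run of letters,
# or any single remaining non-space character.
_PIECE = re.compile(r"\d+|[^\W\d_]+|\S")

def split_token(token: str) -> list[str]:
    """Split one whitespace-delimited token into digit, letter and symbol pieces."""
    return _PIECE.findall(token)

print(split_token("12:35AM"))  # ['12', ':', '35', 'AM']
```

A rule this blunt also splits tokens that should stay whole (for example "St." becomes "St" and "."), which is one reason the classifier stage that follows is needed.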
Classifier
Determines the most likely class for a given token, e.g. 1956 is a year in "January 1956" but a quantity in "1956 potatoes"
Expansion Module
Methods for expanding numbers and classes that can be handled algorithmically
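As a minimal sketch of algorithmic expansion, the code below spells out an integer in English according to the class assigned by the classifier. The class names "year" and "cardinal", the function names, and the supported ranges are assumptions made for illustration only.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def expand_cardinal(n: int) -> str:
    """Spell out an integer in the range 0..999,999 as a cardinal number."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999,999 is handled in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + expand_cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return expand_cardinal(thousands) + " thousand" + (" " + expand_cardinal(rest) if rest else "")

def expand_year(n: int) -> str:
    """Spell out a year in the range 1100..1999 as two two-digit groups."""
    if not 1100 <= n <= 1999:
        raise ValueError("only 1100..1999 is handled in this sketch")
    century, rest = divmod(n, 100)
    if rest == 0:
        return expand_cardinal(century) + " hundred"
    if rest < 10:
        return expand_cardinal(century) + " oh " + ONES[rest]
    return expand_cardinal(century) + " " + expand_cardinal(rest)

def expand(token: str, token_class: str) -> str:
    """Expand a numeric token according to its (assumed) class label."""
    n = int(token)
    return expand_year(n) if token_class == "year" else expand_cardinal(n)

print(expand("1956", "year"))      # nineteen fifty-six
print(expand("1956", "cardinal"))  # one thousand nine hundred fifty-six
```

The same token yields two different readings, which is why expansion has to run after classification rather than before it.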
Text Preprocessing
Dealing with Non-Standard Words

Not all tokens can be handled with a deterministic set of rules
Methods for designing domain-dependent expansion and tagging modules
Supervised: work on tagged text corpus
Unsupervised: work on raw text
$$p(t \mid o) = \frac{p(o \mid t)\, p(t)}{p(o)}$$
Determines the probability of a tag t given the observed string o
p(o): the probability of the observed text
p(t): the prior probability of observing the tag t in the text
p(o|t): a trigram letter language model for predicting observations of a particular tag t
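The following is a minimal sketch of such a supervised tagger, not the method of any particular system. It assumes add-one smoothing, padding symbols "^" and "$", and a tiny made-up training set with hypothetical tag names; p(o) is obtained by summing p(o|t) p(t) over the tags.

```python
import math
from collections import Counter, defaultdict

def trigrams(text: str) -> list[str]:
    """Letter trigrams of a string, padded with start (^) and end ($) symbols."""
    padded = "^^" + text + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

class TagModel:
    def __init__(self, examples):
        """examples: list of (observed string, tag) pairs from a tagged corpus."""
        self.tag_counts = Counter(tag for _, tag in examples)
        self.tri = defaultdict(Counter)  # tag -> trigram counts
        self.ctx = defaultdict(Counter)  # tag -> two-letter context counts
        alphabet = {"$"}
        for text, tag in examples:
            alphabet.update(text)
            for g in trigrams(text):
                self.tri[tag][g] += 1
                self.ctx[tag][g[:2]] += 1
        self.vocab_size = len(alphabet) + 1  # one extra slot for unseen letters

    def log_likelihood(self, text: str, tag: str) -> float:
        """log p(o|t) under an add-one smoothed trigram letter model."""
        return sum(
            math.log((self.tri[tag][g] + 1) / (self.ctx[tag][g[:2]] + self.vocab_size))
            for g in trigrams(text)
        )

    def posterior(self, text: str) -> dict:
        """p(t|o) for every tag, via Bayes' rule."""
        total = sum(self.tag_counts.values())
        joint = {t: self.log_likelihood(text, t) + math.log(c / total)
                 for t, c in self.tag_counts.items()}
        peak = max(joint.values())
        log_p_o = peak + math.log(sum(math.exp(v - peak) for v in joint.values()))
        return {t: math.exp(v - log_p_o) for t, v in joint.items()}

# Made-up training data, for illustration only.
model = TagModel([
    ("NATO", "ACRONYM"), ("USA", "ACRONYM"), ("IBM", "ACRONYM"), ("UNESCO", "ACRONYM"),
    ("Dr", "ABBREV"), ("Mr", "ABBREV"), ("St", "ABBREV"), ("etc", "ABBREV"), ("Prof", "ABBREV"),
])
post = model.posterior("NASA")
print(max(post, key=post.get))  # ACRONYM
```

Because p(o) is the same for every tag, choosing the most likely tag only needs p(o|t) p(t); the normalization matters only when the posterior value itself is used, for instance as a confidence score.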
Morphological Analysis
Function Words
Determiners, Pronouns, Prepositions, Conjunctions
Skeleton of sentence
Stored in lexicon, along with pronunciation
Content Words
Inflection + Compounding
Used for pronunciation and stressing
Synthesis
Input
Sequence of phonemes
Prosodic Information
Output
Digital Speech
Synthesis Strategies
Synthesis by Rule
Cognitive approach of the phonation mechanism
Speech is produced by mathematical rules that formally describe the influence of phonemes on one another
Synthesis by Concatenation
Limited knowledge of the data to be handled
Elementary speech units are stored in a database and then concatenated and processed to produce the speech signal
Synthesis by Rule
Functional Diagram
Preparation (speech science): Speech Corpus -> Speech Analysis -> Parametric Speech Corpus -> Rule Finding -> Rule Database
DSP Module (signal processing): Phone Names, Prosody -> Rule Matching (against the Rule Database) -> Signal Synthesis -> Speech
Synthesis by Rule
Analysis and Synthesis
Preparation
Words are read by professional speaker
Data Parameterization through speech analyzer
Rule extraction (manual)
Trial and Error Optimization
Synthesis
Rules are matched to phonetic input
Production of parametric signal
Synthesis of speech signal by re-implementing analysis model
Synthesis by Rule
Segmental Quality
Rule efficiency
Corpus quality: choice of utterances and recording quality
Intrinsic errors: accuracy of the model describing high-quality speech
Even simple analysis-resynthesis may produce problems!
Extrinsic errors: parameter extraction algorithm
Improvements during trial-and-error tuning
Synthesis by Rule
Formant Synthesizers
+ Speech is a dynamic evolution of up to 60 parameters
Formant and antiformant frequencies and bandwidths
Glottal waveforms
+ Almost free of modeling errors
Difficult to estimate
Time consuming
Intensive trial-and-error testing to cope with extrinsic errors
Signal buzziness
Low signal quality
High-quality synthesis rules are yet to be discovered
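To make the idea of formant parameters concrete, here is a minimal sketch of a cascade formant synthesizer: an impulse train standing in for the glottal source is passed through one two-pole resonator per formant. The resonator form is the one commonly used in Klatt-style synthesizers; the formant frequencies and bandwidths, the function names, and the fixed (non-evolving) parameters are all assumptions for illustration, far simpler than the up to 60 time-varying parameters mentioned above.

```python
import math

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of a two-pole resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # gives unit gain at 0 Hz
    return a, b, c

def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a signal through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, f0_hz=100.0, duration_s=0.05, sample_rate=16000):
    """Impulse-train source filtered by a cascade of formant resonators."""
    n = round(duration_s * sample_rate)
    period = round(sample_rate / f0_hz)
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    out = source
    for freq_hz, bandwidth_hz in formants:
        out = resonate(out, freq_hz, bandwidth_hz, sample_rate)
    return out

# Rough, illustrative (frequency, bandwidth) pairs in Hz for an /a/-like vowel.
samples = synthesize_vowel([(700, 130), (1220, 70), (2600, 160)])
print(len(samples))  # 800
```

An impulse-train source is part of what makes such output sound buzzy; practical formant synthesizers use shaped glottal waveforms and parameters that change over time.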
Functional Diagram
Preparation (speech science): Speech Corpus -> Selective Segmentation -> Speech Segment DB -> Speech Analysis -> Parametric Segment DB -> Equalization -> Speech Coding -> Synthesis Segment DB
DSP Module (signal processing): Phone Names, Prosody -> Segment List Generation (using Segment Info) -> Prosody Matching -> Speech Decoding -> Concatenation -> Signal Synthesis -> Speech
Analysis Database Preparation

Choose the appropriate speech units
Diphones, Half-Syllables and Triphones
Compile and record utterances
Segment signal and extract speech units
Store segment waveforms (along with context) and extended information in database
Extract parameters and create parametric segment database
Useful for data compaction
Easier prosody matching and modification
Perform amplitude equalization to prevent mismatches
Unit Database Issues
Very large combinatorial space of combinations of phonemes and prosodic contexts

In English: 43 phones, 79,507 possible triphones, only 70,000 used
Which of them should we keep?
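The triphone figure is simply the number of ordered triples of phones. A quick check of the arithmetic, using only the numbers quoted above:

```python
n_phones = 43

possible_diphones = n_phones ** 2   # ordered pairs of phones
possible_triphones = n_phones ** 3  # ordered triples of phones
used_triphones = 70_000             # figure quoted in the slide

print(possible_diphones)                              # 1849
print(possible_triphones)                             # 79507
print(f"{used_triphones / possible_triphones:.0%}")   # 88%
```

These counts cover phone identity only; multiplying by the number of prosodic contexts is what makes the space very large.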
Unit Selection vs Concatenative Synthesis

We record a large speech corpus
In unit selection, the corpus is segmented into phonetic units, indexed, and used as-is
Unit selection is made on-line
In Concatenative synthesis, the selection is made offline and manually!
Concatenating Segments
The PSOLA Method
Pitch Synchronous Overlap and Add

A window (two pitch periods long) is multiplied with the signal
The signal is broken into a set of localized signals (non-zero only at the window intervals)
Pitch Modification
Relative shifting of localized signals
Spacing reflects pitch duration
Good results for modification factors in the range [0.6, 1.5]
Duration
Localized signals are added or deleted from output
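The steps above can be sketched in a few lines. This is a simplified illustration, not a production PSOLA implementation: it assumes the pitch period is constant and known, that pitch marks sit at exact multiples of the period, and that a Hann window is used. The function name and the parameters `pitch_factor` and `duration_factor` are hypothetical.

```python
import math

def psola(signal, period, pitch_factor=1.0, duration_factor=1.0):
    """Overlap-add resynthesis with modified pitch and/or duration.

    signal: list of samples; pitch marks are assumed at multiples of `period`.
    pitch_factor > 1 raises the pitch; duration_factor > 1 lengthens the output.
    """
    # Window two pitch periods long; shifted copies at spacing `period` sum to 1.
    win = [0.5 - 0.5 * math.cos(math.pi * i / period) for i in range(2 * period)]
    # Analysis pitch marks that have a full two-period segment around them.
    marks = list(range(period, len(signal) - period + 1, period))
    # Localized signals: the windowed segment centered on each mark.
    grains = [[signal[m - period + i] * win[i] for i in range(2 * period)]
              for m in marks]

    new_period = max(1, round(period / pitch_factor))  # spacing sets the new pitch
    out_len = round(len(signal) * duration_factor)
    out = [0.0] * out_len

    pos = new_period
    while pos + period <= out_len:
        # Map the output mark back to input time; reusing or skipping grains
        # here is what lengthens or shortens the signal.
        src_time = pos / duration_factor
        k = min(range(len(marks)), key=lambda j: abs(marks[j] - src_time))
        for i, value in enumerate(grains[k]):
            idx = pos - period + i
            if 0 <= idx < out_len:
                out[idx] += value
        pos += new_period
    return out

# Synthetic periodic test signal with a period of 80 samples.
period = 80
sig = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period)
       for n in range(10 * period)]

same = psola(sig, period)                       # factors of 1: input is rebuilt
higher = psola(sig, period, pitch_factor=1.25)  # new period of 64 samples
longer = psola(sig, period, duration_factor=2.0)
print(len(same), len(higher), len(longer))  # 800 800 1600
```

With factors other than 1 the shifted windows no longer sum exactly to one, so the output amplitude varies slightly; practical implementations compensate for this and also have to estimate the pitch marks from real speech.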
Concatenative and Rule Based Synthesis Comparison

Concatenative Synthesis is the state-of-the-art
Storage is of little concern now
Storing the segment database is no longer an issue
Advances in ensuring smoothness in concatenations

Rule-based synthesis output used to be smoother
Certain sounds are too hard to be produced by rule

Vowels are easy to create by rule
Bursts and voiceless stops are too difficult; we do not fully understand their production mechanisms
