TRAMES, 2007, 11(61/56), 2, 284–298
Modelling speech temporal structure for Estonian text-to-speech synthesis: feature selection
Meelis Mihkla
Institute of the Estonian Language, Tallinn
Abstract. The article discusses the principles of selecting features for modelling the temporal structure of Estonian speech for text-to-speech (TTS) synthesis, using different types of read-aloud texts. Feature selection depends on general principles governing the temporal structure of speech as well as on language-specific aspects. The durational model of Estonian is distinctive in that its input includes foot-bound features (foot quantity degree, number of feet in the word). In addition to the traditional descriptors of sound context and hierarchical position, the prediction of Estonian segmental durations requires information on morphological, syntactic and lexical features of the word, such as word form, part of sentence, and part of speech. For predicting pauses in the speech flow, the relevant features are the distance from the beginning of the sentence and from the previous pause, the length and quantity degree of the preceding foot, and the occurrence of a punctuation mark or conjunction. Although expert opinions were used in feature selection, statistical methods should be applied to test the optimal vector of argument features.
Keywords: feature selection, speech timing, segmental durations, pauses, text-to-speech synthesis, feature significance, statistical modelling
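The pause-prediction features listed in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the class and function names, the conjunction list, the choice of words as the unit of distance, and the foot length and quantity values in the sample sentence are all assumptions made for illustration. The sketch only shows how one feature row per word boundary might be assembled before any statistical model is fitted.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical, incomplete list of Estonian conjunctions (illustration only).
CONJUNCTIONS = {"ja", "ning", "et", "aga", "kuid", "või"}


@dataclass
class Word:
    text: str
    last_foot_syllables: int          # length of the word's final foot, in syllables
    last_foot_quantity: int           # quantity degree of that foot (1, 2 or 3)
    followed_by_punct: bool = False   # a punctuation mark follows the word
    followed_by_pause: bool = False   # pause observed in the recording (label)


@dataclass
class PauseFeatures:
    dist_from_sentence_start: int     # in words (assumed unit)
    dist_from_prev_pause: int         # in words (assumed unit)
    prev_foot_length: int
    prev_foot_quantity: int
    punctuation: bool
    next_is_conjunction: bool


def pause_features(sentence: List[Word]) -> List[PauseFeatures]:
    """Build one feature row for the boundary after each word."""
    rows = []
    since_pause = 0
    for i, word in enumerate(sentence):
        since_pause += 1
        # The conjunction feature looks at the word after the boundary.
        nxt = sentence[i + 1].text.lower() if i + 1 < len(sentence) else None
        rows.append(PauseFeatures(
            dist_from_sentence_start=i + 1,
            dist_from_prev_pause=since_pause,
            prev_foot_length=word.last_foot_syllables,
            prev_foot_quantity=word.last_foot_quantity,
            punctuation=word.followed_by_punct,
            next_is_conjunction=nxt in CONJUNCTIONS,
        ))
        if word.followed_by_pause:
            since_pause = 0  # distance restarts after an observed pause
    return rows


# Sample sentence; foot lengths and quantity degrees are placeholder values,
# not phonetic annotations.
sentence = [
    Word("Ta", 1, 1),
    Word("tuli", 2, 1),
    Word("koju", 2, 1, followed_by_punct=True, followed_by_pause=True),
    Word("aga", 2, 1),
    Word("kedagi", 3, 1),
    Word("polnud", 2, 3, followed_by_punct=True),
]
rows = pause_features(sentence)
```

Rows of this kind, paired with the observed pause labels, are the sort of input on which the significance of each feature could then be tested statistically.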