Tokenization

The first step in synthesizing text is tokenization. A token is, generally speaking, an atomic unit separated from the rest of the text by whitespace. Punctuation marks are stripped from the tokens and stored as features of the items in the Token relation. These features can be accessed with (item.feat TOKEN "punc") or (item.feat TOKEN "whitespace"). The punctuation and whitespace characters are defined in the file festival/lib/token.scm:

(defvar token.punctuation "\"'`.,:;!?(){}[]")
(defvar token.prepunctuation "\"'`({[")
(defvar token.whitespace "\t \n \r")
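
The effect of these three character sets can be illustrated with a small Python sketch. It is a simplified model and not Festival's actual tokenizer: the function name tokenize and the dictionary layout are hypothetical, the key "prepunctuation" is an assumed feature name (only "punc" and "whitespace" are named in the text above), and it assumes that the whitespace feature holds the whitespace preceding the token.

```python
# Character sets copied from the defvar lines above.
PUNCTUATION = "\"'`.,:;!?(){}[]"
PREPUNCTUATION = "\"'`({["
WHITESPACE = "\t \n \r"


def tokenize(text):
    """Split text on whitespace and peel punctuation off each token.

    Returns a list of dicts with the keys "name", "prepunctuation",
    "punc" and "whitespace" (a hypothetical layout for illustration).
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        # Collect the whitespace that precedes the next token.
        start = i
        while i < n and text[i] in WHITESPACE:
            i += 1
        whitespace = text[start:i]
        if i == n:
            break  # only trailing whitespace was left
        # Collect the raw token up to the next whitespace character.
        start = i
        while i < n and text[i] not in WHITESPACE:
            i += 1
        raw = text[start:i]
        # Strip leading prepunctuation, keeping at least one character.
        j = 0
        while j < len(raw) - 1 and raw[j] in PREPUNCTUATION:
            j += 1
        # Strip trailing punctuation, keeping at least one character.
        k = len(raw)
        while k > j + 1 and raw[k - 1] in PUNCTUATION:
            k -= 1
        tokens.append({
            "name": raw[j:k],
            "prepunctuation": raw[:j],
            "punc": raw[k:],
            "whitespace": whitespace,
        })
    return tokens


if __name__ == "__main__":
    for token in tokenize('Hello, "world"!'):
        print(token)
```

In this model, the input 'Hello, "world"!' yields two tokens: the first has the name Hello with punc set to a comma, and the second has the name world with a leading double quote as prepunctuation and a preceding space as whitespace.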

For the German version of Festival, this tokenization is extended only by adding the hyphen ("-") to token.prepunctuation (file festival/lib/german/ims_german_voices.scm). As a result, hyphenated compounds are split into words, with the hyphen as the delimiter.

Gregor Moehler
2001-07-17