
Punctuation and whitespace

The Text module detaches punctuation marks from the words and stores them as features in the token relation. Punctuation marks are used, among other things, to determine sentence breaks. In text preprocessing they serve mainly to identify ordinals and abbreviations. Whitespace is likewise stored as a feature in the token relation.
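
The detaching step described above can be illustrated with a small Python sketch. It is not the Text module's actual code: the `Token` class, the `tokenize` function, and the feature names `whitespace`, `prepunc` and `punc` are hypothetical names chosen for this example, and the set of punctuation characters is simply Python's `string.punctuation`.

```python
import string
from dataclasses import dataclass

# Assumption: ASCII punctuation only; the real module may use a different set.
PUNCT = set(string.punctuation)


@dataclass
class Token:
    """One entry of a (hypothetical) token relation."""
    name: str             # the bare word with punctuation detached
    whitespace: str = ""  # whitespace that preceded the token
    prepunc: str = ""     # punctuation detached from the front of the word
    punc: str = ""        # punctuation detached from the end of the word


def tokenize(text):
    """Split text on whitespace and detach leading/trailing punctuation."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        # Collect the whitespace in front of the next token.
        j = i
        while j < n and text[j].isspace():
            j += 1
        ws = text[i:j]
        # Collect the raw token up to the next whitespace character.
        k = j
        while k < n and not text[k].isspace():
            k += 1
        raw = text[j:k]
        i = k
        if not raw:
            break  # only trailing whitespace was left
        # Detach punctuation from the front, then from the end.
        a = 0
        while a < len(raw) and raw[a] in PUNCT:
            a += 1
        b = len(raw)
        while b > a and raw[b - 1] in PUNCT:
            b -= 1
        # A token made up only of punctuation ends up with an empty name
        # and all of its characters in prepunc.
        tokens.append(Token(raw[a:b], ws, raw[:a], raw[b:]))
    return tokens
```

For an input such as `Am 3. Mai kam "Dr. Meier".`, the token `3` keeps its trailing period in `punc`, so a later rule could inspect that feature when deciding whether the number is an ordinal or whether the period ends the sentence. The same applies to the period after `Dr` for abbreviation handling.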



Gregor Moehler
2001-07-17