<?xml version="1.0" standalone="yes"?>
<Paper uid="J95-2001">
<Title>Automatic Stochastic Tagging of Natural Language Texts</Title>
<Section position="2" start_page="0" end_page="138" type="abstr">
<SectionTitle> 1. Introduction </SectionTitle>
<Paragraph position="0"> In the natural language processing community, there has been a growing awareness of the key role that lexical and corpus resources, especially annotated corpora, have to play, both in the advancement of research in this area and in the development of relevant products. In order to reduce the huge cost of manually creating such corpora, the development of automatic taggers is of paramount importance. In this respect, the ability of a tagger to handle both known and unknown words, to improve its performance by training, and to achieve a high rate of correctly tagged words is the criterion for assessing its usability in practical cases.</Paragraph>
<Paragraph position="1"> Several taggers based on rules, stochastic models, neural networks, and hybrid systems have already been presented for part-of-speech (POS) tagging. Rule-based taggers (Brill 1992; Elenius 1990; Jacobs and Zernik 1988; Karlsson 1990; Karlsson et al. 1991; Voutilainen, Heikkila, and Anttila 1992; Voutilainen and Tapanainen 1993) use POS-dependent constraints defined by experienced linguists. A small error rate has been achieved by such systems when a restricted, application-dependent POS set is used; e.g., an error rate of 2-6 percent has been reported by Marcus, Santorini, and Marcinkiewicz (1993) using the Penn Treebank corpus. Nevertheless, if a large POS set is specified, the number of rules increases significantly and rule definition becomes highly costly and cumbersome.</Paragraph>
<Paragraph position="2"> Stochastic taggers use both contextual and morphological information, and the model parameters are usually defined or updated automatically from tagged texts (Cerf-Danon and El-Beze 1991; Church 1988; Cutting et al.
1992; Dermatas and Kokkinakis 1988, 1990, 1993, 1994; Garside, Leech, and Sampson 1987; Kupiec 1992; Maltese and Mancini 1991; Meteer, Schwartz, and Weischedel 1991; Merialdo 1991; Pelillo, Moro, and Refice 1992; Weischedel et al. 1993; Wothke et al. 1993). These taggers are preferred when tagged texts are available for training and when large tagsets and multilingual applications are involved. When raw untagged text is additionally available, Maximum Likelihood training can be used to reestimate the parameters of HMM taggers (Merialdo 1994). [Author affiliation: Department of Electrical Engineering, Wire Communications Laboratory (WCL), University of Patras, 265 00 Patras, Greece. E-mail: dermatas@wcl.ee.upatras.gr. © 1995 Association for Computational Linguistics. Computational Linguistics, Volume 21, Number 2.]</Paragraph>
<Paragraph position="3"> Connectionist models have been used successfully for lexical acquisition (Eineborg and Gamback 1993; Elenius 1990; Elenius and Carlson 1989; Nakamura et al. 1990). Correct classification rates of up to 96.4 percent have been achieved in the latter case by testing on the Teleman Swedish corpus; on the other hand, a time-consuming training process has been reported.</Paragraph>
<Paragraph position="4"> Recently, several solutions to the problem of tagging unknown words have been presented (Charniak et al. 1993; Meteer, Schwartz, and Weischedel 1991). Hypotheses for unknown words, both stochastic (Dermatas and Kokkinakis 1993, 1994; Maltese and Mancini 1991; Weischedel et al. 1993) and connectionist (Eineborg and Gamback 1993; Elenius 1990), have been applied to unlimited-vocabulary taggers.</Paragraph>
<Paragraph position="5"> In taggers that are based on hidden Markov models (HMMs), the parameters for unknown words are estimated by taking into account morphological information from the last part of the word (Dermatas and Kokkinakis 1994; Maltese and Mancini 1991).
Accurate tagging of seven European languages has been achieved in the first case (error rates of 3-13 percent for a detailed POS set), but an enormous amount of training text is required to estimate the parameters for unknown words. Similar results have been reported by Maltese and Mancini (1991) for the Italian language.</Paragraph>
<Paragraph position="6"> Weischedel et al. (1993) have used four categories of word morphology: inflectional endings, derivational endings, hyphenation, and capitalization. For the case in which only a restricted training text is available, a simple, language- and tagset-independent HMM tagger has been presented by Dermatas and Kokkinakis (1993), where the HMM parameters for unknown words are estimated by assuming that the POS probability distribution of the unknown words is identical to that of the least probable words in the small training text.</Paragraph>
<Paragraph position="7"> In this paper, five natural language stochastic taggers that are able to predict the POS of unknown words are presented and tested following the process of developing annotated corpora (the most recently fully tagged and corrected text is used to update the model parameters). Three stochastic optimization criteria, seven European languages (Dutch, English, French, German, Greek, Italian, and Spanish), and two POS sets are used in the tests. The set of main grammatical classes and an extended set of detailed grammatical categories are the same in all languages. The testing material consists of newspaper texts of 60,000-180,000 words for each language and an English EEC-law text of 110,000 words.
This material was assembled and annotated in the framework of the ESPRIT-291/860 project &quot;Linguistic Analysis of the European Languages.&quot; In addition, we present transformations of the taggers' calculations to a fixed-point arithmetic system, which are useful for machines without floating-point hardware.</Paragraph>
<Paragraph position="8"> The taggers handle both lexical and tag-transition information and, without performing morphological analysis, can be used to annotate corpora when only small training texts are available. Thus, they are preferred when a new language or a new tagset is used. When the training text is adequate to estimate the tagger parameters, more efficient stochastic taggers (Dermatas and Kokkinakis 1994; Maltese and Mancini 1991; Weischedel et al. 1993) and training methods (Merialdo 1994) can be implemented.</Paragraph>
<Paragraph position="9"> The structure of this paper is as follows. In Section 2 the stochastic tagging models are presented in detail. In Section 3 the influence of training-text errors and the sources of stochastic tagger errors are discussed, followed, in Section 4, by a short presentation of the implementation. In Section 5, statistical measurements on the corpora and a short description of the taggers' performance are given. Detailed experimental results are included in Appendices A and B.</Paragraph>
</Section>
</Paper>
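The unknown-word assumption attributed in the introduction to Dermatas and Kokkinakis (1993) — that unknown words follow the POS probability distribution of the least probable words in a small training text — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the rarity threshold `max_count` are assumptions introduced here.

```python
from collections import Counter

def unknown_word_tag_distribution(tagged_corpus, max_count=1):
    """Estimate P(tag | unknown word) from the rarest words in a small
    tagged corpus: words occurring at most max_count times stand in for
    the unseen words, on the assumption that both populations share the
    same POS distribution."""
    word_counts = Counter(word for word, _ in tagged_corpus)
    tag_counts = Counter(
        tag for word, tag in tagged_corpus if word_counts[word] <= max_count
    )
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()}
```

In a real tagger these probabilities would replace the missing lexical probabilities P(word | tag) for out-of-vocabulary words, while the tag-transition model is used unchanged.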
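The introduction also mentions transforming the taggers' calculations to a fixed-point arithmetic system for machines without floating-point hardware. The paper's own transformation is given later in the text; the sketch below shows one common realization, stated here only as an assumption: quantize negative log-probabilities to scaled integers, so that Viterbi decoding needs nothing but integer addition and comparison. The names `SCALE`, `cost`, and `viterbi_fixed` are illustrative.

```python
import math

SCALE = 10_000  # fixed-point resolution for quantized log-probabilities

def cost(p):
    # Map a probability to a non-negative integer cost; lower = more likely.
    return int(round(-math.log(p) * SCALE))

def viterbi_fixed(observations, states, start_p, trans_p, emit_p):
    # The model tables hold probabilities, but after quantization the
    # decoder performs only integer additions and comparisons.
    best = {s: cost(start_p[s]) + cost(emit_p[s][observations[0]])
            for s in states}
    back = []
    for obs in observations[1:]:
        ptr, nxt = {}, {}
        for s in states:
            prev = min(states, key=lambda r: best[r] + cost(trans_p[r][s]))
            nxt[s] = best[prev] + cost(trans_p[prev][s]) + cost(emit_p[s][obs])
            ptr[s] = prev
        back.append(ptr)
        best = nxt
    path = [min(states, key=lambda s: best[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Because the quantized costs are ordinary integers, the same routine runs unchanged on hardware without a floating-point unit once the `cost` table has been precomputed offline.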