<?xml version="1.0" standalone="yes"?>
<Paper uid="E95-1022">
  <Title>A syntax-based part-of-speech analyser</Title>
  <Section position="2" start_page="0" end_page="157" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Part-of-speech analysis usually consists of (i) introduction of ambiguity (lexical analysis) and (ii) disambiguation (elimination of illegitimate alternatives). While introducing ambiguity is regarded as relatively straightforward, disambiguation is known to be a difficult and controversial problem.</Paragraph>
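    <Paragraph> As a minimal illustration of this two-stage view - a hypothetical Python sketch, not any particular system; the lexicon, tag names and function names are invented for the example - lexical analysis assigns each word its set of candidate tags, and disambiguation then eliminates readings:
def lexical_analysis(words, lexicon):
    """Stage (i): introduce ambiguity by listing every candidate tag."""
    return [(w, set(lexicon.get(w, {"UNKNOWN"}))) for w in words]

def disambiguate(analysed, eliminate):
    """Stage (ii): eliminate contextually illegitimate alternatives."""
    result, prev_tags = [], set()
    for word, tags in analysed:
        kept = eliminate(prev_tags, tags)
        tags = kept if kept else tags  # never discard the last reading
        result.append((word, tags))
        prev_tags = tags
    return result
Here eliminate stands for whatever mechanism decides which readings are illegitimate in context; the sketches below plug concrete strategies into this slot.</Paragraph>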
    <Paragraph position="1"> There are two main methodologies: the linguistic and the data-driven.</Paragraph>
    <Paragraph position="2"> * In the linguistic approach, the generalisations are based on the linguist's (potentially corpus-based) abstractions about the paradigms and syntagms of the language.</Paragraph>
    <Paragraph position="3"> Distributional generalisations are manually coded as a grammar, a system of constraint rules used for discarding contextually illegitimate analyses. The linguistic approach is labour-intensive: skill and effort are needed for writing an exhaustive grammar.</Paragraph>
    <Paragraph position="4"> * In the data-driven approach, frequency-based information is automatically derived from corpora. The learning corpus can consist of plain text, but the best results seem achievable with annotated corpora (Merialdo 1994; Elworthy 1994). This corpus-based information typically concerns sequences of 1-3 tags or words (with some well-known exceptions, e.g. Cutting et al. 1992). Corpus-based information can be represented e.g. as neural networks (Eineborg and Gambäck 1994; Schmid 1994), local rules (Brill 1992), or collocational matrices (Garside 1987). In the data-driven approach, no human effort is needed for rule-writing. However, considerable effort may be needed for determining a workable tag set (cf. Cutting 1994) and annotating the training corpus. A toy contrast of the two approaches is sketched below.</Paragraph>
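    <Paragraph> To make the contrast concrete, here is a toy Python sketch (a hypothetical illustration; the rule, tag names and counts are invented, not taken from any cited system). The first function is a hand-written constraint of the linguistic kind; the second resolves ambiguity from tag-bigram frequencies of the data-driven kind:
def no_verb_after_determiner(prev_tags, curr_tags):
    """Linguistic approach: discard verb readings directly after an
    unambiguous determiner, keeping at least one reading."""
    if prev_tags == {"DET"} and len(curr_tags) != 1:
        return curr_tags - {"V"}
    return curr_tags

# Data-driven approach: invented tag-bigram counts standing in for
# frequencies derived from an annotated training corpus.
BIGRAM_COUNTS = {("DET", "N"): 970, ("DET", "V"): 3, ("N", "V"): 410}

def most_frequent_bigram(prev_tag, curr_tags):
    """Pick the reading whose bigram with the previous tag is most frequent."""
    return max(curr_tags, key=lambda t: BIGRAM_COUNTS.get((prev_tag, t), 0))
The constraint can be plugged directly into the eliminate slot of the earlier pipeline sketch; the bigram strategy presupposes that the preceding word is already disambiguated.</Paragraph>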
    <Paragraph position="6"> At first flush, the linguistic approach may seem an obvious choice. A part-of-speech tagger's task is often illustrated with a noun-verb ambiguous word directly preceded by an unambiguous determiner (e.g. table in the table). This ambiguity can reliably be resolved with a simple and obvious grammar rule that disallows verbs after determiners. Indeed, few contest the fact that reliable linguistic rules can be written for resolving some part-of-speech ambiguities. The main problem with this approach seems to be that resolving part-of-speech ambiguities on a large scale, without introducing a considerable error margin, is very difficult at best. At least, no rule-based system with a convincing accuracy has been reported so far.1 [1 There is one potential exception: the rule-based morphological disambiguator used in the English Constraint Grammar Parser ENGCG (Voutilainen, Heikkilä and Anttila 1992). Its recall is very high (99.7% of all words receive the correct morphological analysis), but this system leaves 3-7% of all words ambiguous, trading precision for recall.]</Paragraph>
    <Paragraph position="7"> As a rule, data-driven systems rely on statistical generalisations about short sequences of words or tags. Though these systems do not usually employ information about long-distance phenomena or the linguist's abstraction capabilities (e.g. knowledge about what is relevant in the context), they tend to reach a 95-97% accuracy in the analysis of several languages, in particular English (Marshall 1983; Black et al. 1992; Church 1988; Cutting et al. 1992; de Marcken 1990; DeRose 1988; Hindle 1989; Merialdo 1994; Weischedel et al. 1993; Brill 1992; Samuelsson 1994; Eineborg and Gambäck 1994, etc.). Interestingly, no significant improvement beyond the 97% "barrier" by means of purely data-driven systems has been reported so far.</Paragraph>
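    <Paragraph> The family of models behind these figures can be sketched as a bigram hidden Markov decoder. The following hypothetical Python fragment (the probability tables and names are assumptions, not any of the cited implementations) picks the tag sequence that maximises the product of tag-bigram and word-given-tag probabilities:
import math

def viterbi_bigram(words, lexicon, p_tag_bigram, p_word_given_tag):
    """Choose the tag sequence maximising P(tags) * P(words | tags),
    using only bigram (short-sequence) generalisations."""
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {"START": (0.0, [])}
    for w in words:
        new_best = {}
        for t in lexicon[w]:  # candidate tags from lexical analysis
            emit = math.log(p_word_given_tag.get((w, t), 1e-9))
            new_best[t] = max(
                ((lp + math.log(p_tag_bigram.get((prev, t), 1e-9)) + emit,
                  path + [t]) for prev, (lp, path) in best.items()),
                key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]
Because all contextual evidence enters through the (prev, t) bigram table, such a decoder cannot, by construction, consult long-distance context.</Paragraph>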
    <Paragraph position="8"> In terms of the accuracy of known systems, the data-driven approach seems then to provide the best model of part-of-speech distribution. This should appear a little curious, because very competitive results have been achieved using the linguistic approach at related levels of description. With respect to computational morphology, witness for instance the success of the Two-Level paradigm introduced by Koskenniemi (1983): extensive morphological descriptions have been made of more than 15 typologically different languages (Kimmo Koskenniemi, personal communication). With regard to computational syntax, see for instance (Güngördü and Oflazer 1994; Hindle 1983; Jensen, Heidorn and Richardson (eds.) 1993; McCord 1990; Sleator and Temperley 1991; Alshawi (ed.) 1992; Strzalkowski 1992). The present success of the statistical approach in part-of-speech analysis seems then to form an exception to the general feasibility of the rule-based linguistic approach. Is the level of parts of speech somehow different, perhaps less rule-governed, than related levels? 2 [2 [...] Church (1992).] We do not need to assume this idiosyncratic status entirely. The rest of this paper argues that parts of speech, too, can be viewed as a rule-governed phenomenon, possible to model using the linguistic approach. However, it will also be argued that though the distribution of parts of speech can to some extent be described with rules specific to this level of representation, a more natural account can be given using rules overtly about the form and function of essentially syntactic categories. A syntactic grammar appears to predict the distribution of parts of speech as a "side effect". In this sense parts of speech seem to differ from morphology and syntax: their status as an independent level of linguistic description appears doubtful.</Paragraph>
    <Paragraph position="9"> Before proceeding further with the main argument, consider three very recent hybrids - systems that employ linguistic rules for resolving some of the ambiguities before using automatically generated corpus-based information: collocation matrices (Leech, Garside and Bryant 1994), [...] Järvinen 1994). What is interesting in these hybrids is that they, unlike purely data-driven taggers, seem capable of exceeding the 97% barrier: all three report an accuracy of about 98.5%. 3 The success of these hybrids could be regarded as evidence for the syntactic aspects of parts of speech. However, the above hybrids still contain a data-driven component, i.e. it remains an open question whether a tagger entirely based on the linguistic approach can compare with a data-driven system.</Paragraph>
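    <Paragraph> Architecturally, such hybrids can be caricatured as a pipeline in which the hand-written rules prune first and the corpus-based component decides only among the readings the rules leave open. A hypothetical Python sketch (all names are illustrative; none of the cited systems is implemented this way in detail):
def hybrid_tag(analysed, rules, statistical_choice):
    """Hybrid scheme: linguistic rules discard readings first; a
    corpus-derived model settles whatever ambiguity remains."""
    for rule in rules:  # linguistic component, e.g. constraint rules
        analysed = rule(analysed)
    return [(w, next(iter(tags)) if len(tags) == 1 else statistical_choice(w, tags))
            for w, tags in analysed]
The division of labour matters: every ambiguity the rules resolve reliably is one the statistical component can no longer get wrong.</Paragraph>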
    <Paragraph position="11"> Next, a new system with the following properties is outlined and evaluated: * The tagger uses only linguistic distributional rules.</Paragraph>
    <Paragraph position="12"> * Tested against a 38,000-word corpus of previously unseen text, the tagger reaches a better accuracy than previous systems (over 99%).</Paragraph>
    <Paragraph position="13"> * At the level of linguistic abstraction, the grammar rules are essentially syntactic. Ideally, part-of-speech disambiguation should fall out as a "side effect" of syntactic analysis (sketched below). Section 2 outlines a rule-based system consisting of the ENGCG tagger followed by a finite-state syntactic parser (Voutilainen and Tapanainen 1993; Voutilainen 1994) that resolves remaining part-of-speech ambiguities as a side effect. In Section 3, this rule-based system is tested against a 38,000-word corpus of previously unseen text. Tagger evaluation is currently only becoming standardised; the evaluation method is accordingly reported in detail.</Paragraph>
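    <Paragraph> The idea that part-of-speech disambiguation falls out of syntactic analysis can be caricatured in a few lines of hypothetical Python (the acceptor is a stand-in; the actual system uses the ENGCG rules and a finite-state syntactic grammar): enumerate the remaining reading combinations of a sentence, keep those the syntactic grammar accepts, and a word's part of speech is resolved exactly when all surviving parses agree on it:
from itertools import product

def pos_as_side_effect(analysed, syntactically_acceptable):
    """Filter whole-sentence tag sequences with a syntactic acceptor;
    part-of-speech disambiguation emerges from the surviving parses."""
    words = [w for w, _ in analysed]
    survivors = [seq for seq in product(*(tags for _, tags in analysed))
                 if syntactically_acceptable(words, seq)]
    # Each word keeps exactly the tags that occur in some accepted parse.
    return [(w, {seq[i] for seq in survivors}) for i, w in enumerate(words)]
A real finite-state parser of course avoids this exponential enumeration; the sketch only shows the logical relationship between syntactic acceptance and part-of-speech resolution.</Paragraph>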
  </Section>
</Paper>