<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1032">
  <Title>Comparing a Linguistic and a Stochastic Tagger</Title>
  <Section position="3" start_page="0" end_page="247" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> There are currently two main methods for automatic part-of-speech tagging. The prevailing one uses essentially statistical language models automatically derived from usually hand-annotated corpora. These corpus-based models can be represented e.g.</Paragraph>
    <Paragraph position="1"> as collocational matrices (Garside et al. (eds.) 1987: Church 1988), Hidden Markov models (cf. Cutting et al. 1992), local rules (e.g. Hindle 1989) and neural networks (e.g. Schmid 1994). Taggers using these statistical language models are generally reported to assign the correct and unique tag to 95-97% of words in running text. using tag sets ranging from some dozens to about 130 tags.</Paragraph>
    <Paragraph position="2"> The less popular approach is based on hand-coded linguistic rules. Pioneering work was done in the 1960&amp;quot;s (e.g. Greene and Rubin 1971). Recently, new interest in the linguistic approach has been shown e.g. in the work of (Karlsson 1990: Voutilainen et al. 1992; Oflazer and Kuru6z 1994: Chanod and Tapanainen 1995: Karlsson et al. (eds.) 1995; Voutilainen 1995). The first serious linguistic competitor to data-driven statistical taggers is the English Constraint Grammar parser. EngCG (cf. Voutilainen et al. 1992; Karlsson et al. (eds.) 1995). The tagger consists of the following sequentially applied modules: null  1. Tokenisation 2. Morphological analysis (a) Lexical component (b) Rule-based guesser for unknown words 3. Resolution of morphological ambiguities  The tagger uses a two-level morphological analyser with a large lexicon and a morphological description that introduces about 180 different ambiguity-forming morphological analyses, as a result of which each word gets 1.7-2.2 different analyses on an average. Morphological analyses are assigned to unknown words with an accurate rule-based 'guesser'. The morphological disambiguator uses constraint rules that discard illegitimate morphological analyses on the basis of local or global context conditions. The rules can be grouped as ordered subgrammars: e.g. heuristic subgrammar 2 can be applied for resolving ambiguities left pending by the more &amp;quot;careful' subgrammar 1.</Paragraph>
    <Paragraph position="3"> Older versions of EngCG (using about 1,150 constraints) are reported (~butilainen et al. 1992; Voutilainen and HeikkiUi 1994; Tapanainen and Voutilainen 1994; Voutilainen 1995) to assign a correct analysis to about 99.7% of all words while each word in the output retains 1.04-1.09 alternative analyses on an average, i.e. some of the ambiguities remait~ unresolved.</Paragraph>
    <Paragraph position="4"> These results have been seriously questioned. One doubt concerns the notion 'correct analysis&amp;quot;. For example Church (1992) argues that linguists who manually perform the tagging task using the double-blind method disagree about the correct analysis in at least 3% of all words even after they have negotiated about the initial disagreements. If this were the case, reporting accuracies above this 97% &amp;quot;upper bound' would make no sense.</Paragraph>
    <Paragraph position="5"> However, Voutilainen and J~rvinen (1995) empirically show that an interjudge agreement virtually of 1()0% is possible, at least with the EngCG tag set if not with the original Brown Corpus tag set. This consistent applicability of the EngCG tag set is explained by characterising it as grammatically rather than semantically motivated.</Paragraph>
    <Paragraph position="6">  Another main reservation about the EngCG figures is the suspicion that, perhaps partly due to the somewhat underspecific nature of the EngCG tag set, it must be so easy to disambiguate that also a statistical tagger using the EngCG tags would reach at least as good results. This argument will be examined in this paper. It will be empirically shown (i) that the EngCG tag set is about as difficult for a probabilistic tagger as more generally used tag sets and (ii) that the EngCG disambiguator has a clearly smaller error rate than the probabilistic tagger when a similar (small) amount of ambiguity is permitted in the output.</Paragraph>
    <Paragraph position="7"> A state-of-the-art statistical tagger is trained on a corpus of over 350,000 words hand-annotated with EngCG tags. then both taggers (a new version known as En~CG-21 with 3,600 constraints as five subgrammars-, and a statistical tagger) are applied to the same held-out benchmark corpus of 55,000 words, and their performances are compared. The results disconfirm the suspected 'easiness' of the EngCG tag set: the statistical tagger's performance figures are no better than is the case with better known tag sets.</Paragraph>
    <Paragraph position="8"> Two caveats are in order. What we are not addressing in this paper is the work load required for making a rule-based or a data-driven tagger. The rules in EngCG certainly took a considerable effort to write, and though at the present state of knowledge rules could be written and tested with less effort, it may well be the case that a tagger with an accuracy of 95-97% can be produced with less effort by using data-driven techniques. 3 Another caveat is that EngCG alone does not resolve all ambiguities, so it cannot be compared to a typical statistical tagger if full disambiguation is required. However, &amp;quot;~butilainen (1995) has shown that EngCG combined with a syntactic parser produces morphologically unambiguous output with an accuracy of 99.3%, a figure clearly better than that of the statistical tagger in the experiments below (however. the test data was not the same).</Paragraph>
    <Paragraph position="9"> Before examining the statistical tagger, two practical points are addressed: the annotation of tile corpora used. and the modification of the EngCG tag set for use in a statistical tagger.</Paragraph>
    <Paragraph position="10"> 1An online version of EngCG-2 can be found at, ht tp://www.ling.helsinki.fi/&amp;quot;avoutila/engcg-2.ht ml. :The first three subgrammars are generally highly reliable and almost all of the total grammar development time was spent on them: the last two contain rather rough heuristic constraints.</Paragraph>
    <Paragraph position="11"> 3However, for an interesting experiment suggesting otherwise, see (Chanod and Tapanainen 1995).</Paragraph>
    <Section position="1" start_page="246" end_page="246" type="sub_section">
      <SectionTitle>
2 Preparation of Corpus Resources
2.1 Annotation of training corpus
</SectionTitle>
      <Paragraph position="0"> The stochastic tagger was trained on a sample of 357,000 words from the Brown University Corpus</Paragraph>
    </Section>
    <Section position="2" start_page="246" end_page="246" type="sub_section">
      <SectionTitle>
of Present-Day English (Francis and Ku6era 1982)
</SectionTitle>
      <Paragraph position="0"> that was annotated using the EngCG tags. The corpus was first analysed with the EngCG lexical analyser, and then it was fully disambiguated and, when necessary, corrected by a human expert. This annotation took place a few years ago. Since then, it has been used in the development of new EngCG constraints (the present version, EngCG-2, contains about 3,600 constraints): new constraints were applied to the training corpus, and whenever a reading marked as correct was discarded, either the analysis in the corpus, or the constraint itself, was corrected.</Paragraph>
      <Paragraph position="1"> In this way, the tagging quality of the corpus was continuously improved.</Paragraph>
    </Section>
    <Section position="3" start_page="246" end_page="247" type="sub_section">
      <SectionTitle>
2.2 Annotation of benchmark corpus
</SectionTitle>
      <Paragraph position="0"> Our comparisons use a held-out benchmark corpus of about 55,000 words of journalistic, scientific and manual texts, i.e., no ,training effects are expected for either system. The benchmark corpus was annotated by first applying the preprocessor and morphological aaalyser, but not the morphological disambiguator, to the text. This morphologically ambiguous text was then independently and fully disambiguated by two experts whose task was also to detect any errors potentially produced by the previously applied components. They worked independently, consulting written documentation of the tag set when necessary. Then these manually disambiguated versions were automatically compared with each other. At this stage, about 99.3% of all analyses were identical. When the differences were collectiyely examined, virtually all were agreed to be due to clerical mistakes. Only in the analysis of 21 words, different (meaning-level) interpretations persisted, and even here both judges agreed the ambiguity to be genuine. One of these two corpus versions was modified to represent the consensus, and this &amp;quot;consensus corpus' was used as a benchmark in the evaluations.</Paragraph>
      <Paragraph position="1"> As explained in Voutilainen and J/irvinen (1995).</Paragraph>
      <Paragraph position="2"> this high agreement rate is due to two main factors.</Paragraph>
      <Paragraph position="3"> Firstly, distinctions based on some kind of vague semantics are avoided, which is not always case with better known tag sets. Secondly. the adopted analysis of most of the constructions where humans tend to be uncertain is documented as a collection of tag application principles in the form of a grammarinn's manual (for further details, cf. Voutilainen and J/irvinen 1995).</Paragraph>
      <Paragraph position="4"> Tile corpus-annotation procedure allows us t.o perform a text-book statistical hypothesis test. Let tile null hypothesis be that any two human evaluators will necessarily disagree in at least 3% of  the cases. Under this assumption, the probability of an observed disagreement of less than 2.88% is less than 5%. This can be seen as follows: For the relative frequency of disagreement, fn, we have  t-.--..that f. is approximately --, N(p, ~/~), where p is the actual disagreement probability and n is the number of trials, i.e., the corpus size. This means fn-P v/- ff that P(( ~ &lt; z) ~ ~(x) where C/b is the standard normal distribution function. This in turn means that P ( f , &lt; p + z P~ - p-----~) ) ,~ ~ ( z ) Here n is 55,000 and ~(-1.645) = 0.05. Under the null hypothesis, p is at least 3% and thus: . /O.O3.0.97 P(f. &lt; o.o3- 1.64%/-g,o-g6 ) = P(A &lt;__ 0.0288) &lt; 0.05 We can thus discard the null hypothesis at significance level 5% if the observed disagreement is less than 2.88%. It was in fact 0.7% before error cor.21) rection, and virtually zero (~ after negotiation. This means that we can actually discard the hypotheses that the human evaluators in average disagree in at least 0.8% of the cases before error correction, and in at least 0.1% of the cases after negotiations, at significance level 5%.</Paragraph>
    </Section>
    <Section position="4" start_page="247" end_page="247" type="sub_section">
      <SectionTitle>
2.3 Tag set conversion
</SectionTitle>
      <Paragraph position="0"> The EugCG morphological analyser's output formally differs from most tagged corpora; consider the following 5-ways ambiguous analysis of &amp;quot;'walk&amp;quot;:  walk walk &lt;SV&gt; &lt;SVO&gt; V SUBJUNCTIVE VFIN walk &lt;SV&gt; &lt;SVO&gt; V IMP VFIN walk &lt;SV&gt; &lt;SVG&gt; V INF walk &lt;SV&gt; &lt;SVO&gt; V PRES -SG3 VFIN walk N NOM SG  Statistical taggers usually employ single tags to indicate analyses (e.g. &amp;quot;'NN&amp;quot; for &amp;quot;'N NOM SG&amp;quot;). Therefore a simple conversion program was made for producing the following kind of output, where each reading is represented as a single tag:</Paragraph>
      <Paragraph position="2"> The conversion program reduces the multipart EngCG tags into a set of 80 word tags and 17 punctuation tags (see Appendix) that retain the central linguistic characteristics of the original EngCG tag set.</Paragraph>
      <Paragraph position="3"> A reduced version of the benchmark corpus was prepared with this conversion program for the statistical tagger's use. Also EngCG's output was converted into this format to enable direct comparison with the statistical tagger.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>