<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1035">
  <Title>Serial Combination of Rules and Statistics: A Case Study in Czech Tagging</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Tagging of Inflective Languages
</SectionTitle>
    <Paragraph position="0"> Inflective languages pose a specific problem for tagging due to two phenomena: their highly inflective nature (causing a sparse-data problem in any statistically based system), and free word order (making fixed-context systems, such as n-gram Hidden Markov Models (HMMs), even less adequate than they are for English). A typical tagset contains about 1,000-2,000 distinct tags; the size of the set of possible and plausible tags can reach several thousand.</Paragraph>
    <Paragraph position="1"> Apart from agglutinative languages such as Turkish, Finnish and Hungarian (see e.g.</Paragraph>
    <Paragraph position="2"> (Hakkani-Tür et al., 2000)), and Basque (Ezeiza et al., 1998), which pose quite different and, in the end, less severe problems, there have been attempts at solving this problem for some of the highly inflectional European languages, such as (Daelemans et al., 1996) and (Erjavec et al., 1999) (Slovenian), (Hajič and Hladká, 1997) and (Hajič and Hladká, 1998) (Czech), and (Hajič, 2000) (five Central and Eastern European languages). So far, however, no system has reached, in absolute terms, a performance comparable to English tagging (such as (Ratnaparkhi, 1996)), which stands at or above 97%. For example, (Hajič and Hladká, 1998) report results on Czech only slightly above 93%. One has to realize that even though such performance might be adequate for some tasks (such as word sense disambiguation), for many others (such as parsing or translation) the implied sentence error rate of 50% or more is simply too much to deal with.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Statistical Tagging
</SectionTitle>
      <Paragraph position="0"> Statistical tagging of inflective languages has been based on many techniques, ranging from plain HMM taggers (Mírovský, 1998) and memory-based approaches (Erjavec et al., 1999) to maximum-entropy and feature-based systems (Hajič and Hladká, 1998), (Hajič, 2000). For Czech, the best result achieved so far on a training set of approximately 300 thousand words has been described in (Hajič and Hladká, 1998).</Paragraph>
      <Paragraph position="1"> We are using 1.8M manually annotated tokens from the Prague Dependency Treebank (PDT) project (Hajič, 1998). We have decided to work with an HMM tagger1 in the usual source-channel setting, with proper smoothing. The HMM tagger uses the Czech morphological processor from PDT to disambiguate only among those tags which are morphologically plausible for a given input word form.</Paragraph>
      <Paragraph position="2"> 1 Mainly because of the ease with which it is trained even on large data, and also because no other publicly available tagger was able to cope with the amount and ambiguity of the data in reasonable time.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 Manual Rule-based Systems
</SectionTitle>
      <Paragraph position="0"> The idea of tagging by means of hand-written disambiguation rules was put forward and implemented for the first time in the form of Constraint-Based Grammars (Karlsson et al., 1995). Among the languages we are acquainted with, the method has been applied on a larger scale only to English (Karlsson et al., 1995), (Samuelsson and Voutilainen, 1997) and French (Chanod and Tapanainen, 1995). Also, (Bick, 1996) and (Bick, 2000) use manually written rules for Brazilian Portuguese, and there are several publications by Oflazer for Turkish.</Paragraph>
      <Paragraph position="1"> Authors of such systems claim that hand-written systems can perform better than systems based on machine learning (Samuelsson and Voutilainen, 1997); however, except for the work cited, comparison is difficult or impossible, since they do not use the standard evaluation techniques (and not even the same data). The substantial disadvantage is that the development of manual rule-based systems is demanding and requires a good deal of very subtle linguistic expertise and skill if full disambiguation, also of "difficult" texts, is to be performed.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.3 System Combination
</SectionTitle>
      <Paragraph position="0"> Combination of (manual) rule-writing and statistical learning has been studied before. E.g., (Ngai and Yarowsky, 2000) and (Ngai, 2001) provide a thorough description of many experiments involving rule-based systems and statistical learners for NP bracketing. For tagging, combination of purely statistical classifiers has been described (Hladká, 2000), with about 3% relative improvement (error reduction from 18.6% to 18%, trained on small data) over the best original system. We regard such systems as working in parallel, since all the original classifiers run independently of each other.</Paragraph>
      <Paragraph position="1"> In the present study, we have chosen a different strategy (similar to the one described for other types of languages in (Tapanainen and Voutilainen, 1994), (Ezeiza et al., 1998) and (Hakkani-Tür et al., 2000)): the rule-based component is known to perform well at eliminating incorrect alternatives, rather than at picking the correct one under all circumstances.</Paragraph>
      <Paragraph position="2"> Moreover, the rule-based system used can examine the whole sentential context, again a difficult thing for a statistical system. That way, the ambiguity of the input text decreases. This is exactly what our statistical HMM tagger needs as its input, since it is already capable of using the lexical information from a dictionary.</Paragraph>
      <Paragraph position="3"> However, also in the rule-based approach, there is the usual tradeoff between precision and recall.</Paragraph>
      <Paragraph position="4"> We have decided to go for the "perfect" solution: to keep 100% recall, or very close to it, and gradually improve precision by writing rules which eliminate more and more incorrect tags. This way, we can be sure (or almost sure) that the performance of the HMM tagger will not be hurt by (recall) errors made by the rule component.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Rule-based Component
2.1 Formal Means
</SectionTitle>
    <Paragraph position="0"> Taken strictly formally, the rule-based component has the form of a restarting automaton with deletion (Plátek et al., 1995): each rule can be thought of as a finite-state automaton starting from the beginning of the sentence and passing to the right until it finds an input configuration on which it can operate by deleting some parts of the input. Having performed this, the whole system is restarted, which means that the next rule is applied to the changed input (and this input is again read from the left end). This means that a single rule has the power of a finite-state automaton, but the system as a whole has (even more than) context-free power.</Paragraph>
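As an illustration, the restart-with-deletion control loop described above can be sketched as follows. This is a minimal sketch under assumed data structures (not the paper's implementation): a sentence is a list of (word, tag-set) pairs, and a rule maps a sentence position to the set of tags it rules out there. The example rule encodes the bullet-proof constraint mentioned later in this section that no nominative form can follow an unambiguous preposition; the tag names "PREP", "NOM", "INS" are illustrative, not the PDT tagset.

```python
def apply_rules_with_restart(sentence, rules):
    """Apply deletion rules; after every deletion the whole system restarts
    and reads the (changed) input again from the left end."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for i, (word, tags) in enumerate(sentence):
                drop = rule(sentence, i) & tags
                if drop and drop < tags:           # never delete the last tag
                    sentence[i] = (word, tags - drop)
                    changed = True
                    break                          # restart from the left
            if changed:
                break
    return sentence

def no_nominative_after_preposition(sentence, i):
    # Bullet-proof rule: no nominative form can follow an unambiguous preposition.
    if i > 0 and sentence[i - 1][1] == {"PREP"}:
        return {"NOM"}
    return set()

# "s bratrem" (with [the] brother): "bratrem" loses its (here fictitious)
# nominative reading and keeps the instrumental one.
sent = [("s", {"PREP"}), ("bratrem", {"NOM", "INS"})]
apply_rules_with_restart(sent, [no_nominative_after_preposition])
```

Each deletion strictly shrinks some tag set and tags are never reinserted, so the loop is guaranteed to terminate.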
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Rules and Their Implementation
</SectionTitle>
      <Paragraph position="0"> The system of hand-written rules for Czech has a twofold objective:
- practical: error-free and at the same time maximally accurate tagging of Czech texts
- theoretical: a description of the syntactic system of Czech - its langue, rather than parole.
The rules reduce the ambiguity of the input text. During disambiguation the rule system combines two methods:
- an oblique one, consisting in the elimination of syntactically wrong tag(s), i.e. in the reduction of the input ambiguity by deleting those tags which are excluded by the context
- the direct choice of the correct tag(s).
The overall strategy of the rule system is to keep the highest recall possible (i.e. 100%) and gradually improve precision. Thus, the rules are (manually) assigned reliabilities which divide them into reliability classes, with the most reliable ("bullet-proof") group of rules applied first and less reliable groups (threatening to decrease the 100% recall) applied in subsequent steps. The bullet-proof rules reflect general syntactic regularities of Czech; for instance, no word form in the nominative case can follow an unambiguous preposition. The less reliable rules can be exemplified by those accounting for some special, intricate relations of grammatical agreement in Czech. Within each reliability group the rules are applied independently, i.e. in any order, in a cyclic way until no further ambiguity can be resolved.
Besides reliability, the rules can be divided according to the locality or nonlocality of their scope. Some phenomena (not many) in the structure of the Czech sentence are local in nature: for instance, the word "se" is two-way ambiguous between a preposition (with) and a reflexive particle/pronoun (himself, as a particle), and the prepositional reading is available only in local contexts requiring the vocalisation of the basic form of the preposition "s" (with), resulting in the form "se". However, for the majority of phenomena the correct disambiguation requires a much wider context. Thus, the rules use as wide a context as possible, with no context limitations imposed in advance. During the rule development performed so far, sentential context has been used, but nothing in principle limits the context to a single sentence. If it is generally appropriate for the disambiguation of the languages of the world to use unlimited context, it is especially fit for languages with free word order combined with rich inflection. There are many syntactic phenomena in Czech displaying the following property: a word form wf1 can be part-of-speech determined by means of another word form wf2 whose word-order distance from wf1 cannot be bounded by a fixed number of positions.</Paragraph>
      <Paragraph position="1"> This is exactly the kind of general phenomenon that is captured by the hand-written rules.</Paragraph>
      <Paragraph position="2"> Formally, each rule consists of
- the description of the context (descriptive component), and
- the action to be performed given the context (executive component), i.e. which tags are to be discarded, or which tag(s) are to be proclaimed correct (the rest being discarded as wrong).</Paragraph>
      <Paragraph position="3"> For example,
- Context: an unambiguous finite verb, followed/preceded by a sequence of tokens containing neither a comma nor a coordinating conjunction, on either side of a word x ambiguous between a finite verb and another reading
- Action: delete the finite verb reading(s) at the word x.</Paragraph>
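The descriptive/executive split can be made concrete with a small sketch. All names, the "VFIN" tag and the toy separator set are illustrative stand-ins, not the paper's formalism: the descriptive part recognises the context around position x, and the executive part returns the finite-verb reading(s) to delete there.

```python
CLAUSE_SEPARATORS = {",", "a", "ale"}  # comma plus a toy set of coordinating conjunctions

def delete_finite_verb_reading(sentence, x):
    """Descriptive component: an unambiguous finite verb on either side of
    position x, with no comma or coordinating conjunction in between, where
    the word at x is ambiguous between a finite verb and another reading.
    Executive component: the finite-verb reading to discard at x."""
    word, tags = sentence[x]
    if "VFIN" not in tags or tags == {"VFIN"}:
        return set()                       # x is not ambiguously a finite verb
    for step in (-1, 1):                   # scan left, then right
        j = x + step
        while 0 <= j < len(sentence):
            w, t = sentence[j]
            if w in CLAUSE_SEPARATORS:
                break                      # clause boundary: stop this direction
            if t == {"VFIN"}:
                return {"VFIN"}            # unambiguous finite verb found
            j += step
    return set()
```

On a toy input `[("je", {"VFIN"}), ("stav", {"NOUN", "VFIN"})]` the rule discards the finite-verb reading at the second, ambiguous token, because an unambiguous finite verb precedes it with no clause separator in between.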
      <Paragraph position="4"> There are two ways of rule development:
- rules developed by syntactic introspection: such rules are subsequently verified on the corpus material, then implemented, and the implemented rules are tested on a testing corpus
- rules derived from the corpus by introspection and subsequently implemented.
The rules are formulated as generally as possible and at the same time as error-free (recall-wise) as possible. This approach of combining the requirements of maximum recall and maximum precision demands sophisticated syntactic knowledge of Czech. This knowledge is primarily based on the study of the types of morphological ambiguity occurring in Czech. There are two main types of such ambiguity:</Paragraph>
      <Paragraph position="6"> The regular (paradigm-internal) ambiguities occur within a paradigm, i.e. they are common to all lexemes belonging to a particular inflection class. For example, in Czech (as in many other inflective languages), the nominative, the accusative and the vocative case have the same form (in singular on the one hand, and in plural on the other).</Paragraph>
      <Paragraph position="7"> The casual (lexical, paradigm-external) morphological ambiguity is lexically specific and hence cannot be investigated via paradigmatics.</Paragraph>
      <Paragraph position="8"> In addition to the general rules, the rule approach includes a module which accounts for collocations and idioms. The problem is that the majority of collocations can have, besides their most probable collocational interpretation, also their literal meaning.</Paragraph>
      <Paragraph position="9"> Currently, the system (as evaluated in Sect. 2.3) consists of 80 rules.</Paragraph>
      <Paragraph position="10"> The rules had been implemented procedurally in the initial phase; a special feature-oriented, interpreted "programming language" is now under development.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Evaluation of the Rule System Alone
</SectionTitle>
      <Paragraph position="0"> The results are presented in Table 1. We use the usual equal-weight formula for F-measure:</Paragraph>
      <Paragraph position="2"> F = 2PR / (P + R), where P is precision and R is recall.</Paragraph>
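For concreteness, the equal-weight F-measure (the harmonic mean of precision and recall) can be written as a one-liner; the guard for the degenerate all-zero case is our addition:

```python
def f_measure(precision, recall):
    """Equal-weight (beta = 1) F-measure: F = 2PR / (P + R)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

For example, a rule component with illustrative values of precision 0.70 at recall 0.99 (these are not the numbers of Table 1) would score F = 2 * 0.70 * 0.99 / 1.69, i.e. about 0.82.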
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Statistical Component
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The HMM Tagger
</SectionTitle>
      <Paragraph position="0"> We have used an HMM tagger in the usual source-channel setting, fine-tuned to perfection using</Paragraph>
      <Paragraph position="2"> linear interpolation smoothing for both models.</Paragraph>
      <Paragraph position="3"> Thus the HMM tagger outputs a sequence of tags T according to the usual equation T* = argmax_T P(W|T) P(T), where W is the input word sequence. The tagger has been trained in the usual way, using part of the training data as heldout data for smoothing of the two models employed. There is no threshold applied for low counts. Smoothing has been done first without using buckets, and then with them, to show the difference. Table 2 shows the resulting interpolation coefficients for the tag language model, using the usual linear interpolation smoothing formula p'(t_i | t_{i-2}, t_{i-1}) = lambda_3 p(t_i | t_{i-2}, t_{i-1}) + lambda_2 p(t_i | t_{i-1}) + lambda_1 p(t_i) + lambda_0 / |T|, where p(...) is the "raw" Maximum Likelihood estimate of the probability distributions, i.e. the relative frequency in the training data.</Paragraph>
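The linear interpolation smoothing of the tag model can be sketched as follows. This is a minimal sketch: the lambda weights are arbitrary illustrative values (not the coefficients of Table 2), the two-symbol tagset is a toy, and counts come from a tiny made-up tag sequence.

```python
from collections import Counter

def smoothed_tag_prob(t, t1, t2, tri, bi, uni, lambdas):
    """p'(t | t1, t2) = l3*p(t|t1,t2) + l2*p(t|t2) + l1*p(t) + l0/|T|,
    where each p(.) is the raw relative frequency (ML estimate) in the
    training counts, taken as 0 for an unseen history."""
    l0, l1, l2, l3 = lambdas

    def ml(count, hist_count):
        return count / hist_count if hist_count else 0.0

    p3 = ml(tri[(t1, t2, t)], sum(c for k, c in tri.items() if k[:2] == (t1, t2)))
    p2 = ml(bi[(t2, t)], sum(c for k, c in bi.items() if k[0] == t2))
    p1 = uni[t] / sum(uni.values())
    return l3 * p3 + l2 * p2 + l1 * p1 + l0 / len(uni)

# Toy training tag sequence and illustrative weights (they must sum to 1)
tags = ["N", "V", "N", "V", "N"]
uni = Counter(tags)
bi = Counter(zip(tags, tags[1:]))
tri = Counter(zip(tags, tags[1:], tags[2:]))
lambdas = (0.1, 0.2, 0.3, 0.4)  # l0 (uniform), l1, l2, l3
```

Because the weights sum to one and each component distribution sums to one over the tagset (for histories seen in training), the smoothed values again form a proper probability distribution.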
      <Paragraph position="4"> The bucketing scheme for smoothing (a necessity when keeping all tag trigrams and tag-to-word bigrams) uses "bucket bounds" computed by a bucketing formula (for more on bucketing, see (Jelinek, 1997)). It should be noted that when using this bucketing scheme, the weights of the detailed distributions (those with the longest history) grow quickly as the history reliability increases. However, the growth is not monotonic; at several of the most reliable histories, the weight coefficients "jump" up and down. We have found that a sudden drop in lambda_3 happens, e.g., for the bucket containing a history consisting of two consecutive punctuation symbols, which is not so surprising after all.</Paragraph>
      <Paragraph position="5"> A similar formula has been used for the lexical model (Table 3), and the strengthening of the weights of the most detailed distributions has been observed there, too.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Evaluation of the HMM Tagger alone
</SectionTitle>
      <Paragraph position="0"> The HMM tagger described in the previous section has achieved the results shown in Table 4. It produces only the best tag sequence for every sentence; therefore, only accuracy is reported. Five-fold cross-validation has been performed (Exp 1-5) on a total data size of 1,489,983 tokens (excluding heldout data), divided into five datasets of roughly the same size.</Paragraph>
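The evaluation setup can be sketched as a plain k-fold partition. This is a minimal sketch under one assumption not stated in the paper: that fold boundaries are simple contiguous blocks of the data.

```python
def k_fold_splits(tokens, k=5):
    """Partition the data into k roughly equal contiguous folds; experiment i
    tests on fold i and trains on the remaining k-1 folds."""
    fold_size = len(tokens) // k
    folds = [tokens[i * fold_size:(i + 1) * fold_size] for i in range(k - 1)]
    folds.append(tokens[(k - 1) * fold_size:])   # last fold absorbs the remainder
    splits = []
    for i in range(k):
        train = [tok for j, fold in enumerate(folds) if j != i for tok in fold]
        splits.append((train, folds[i]))
    return splits
```

Each token appears in exactly one test fold across the five experiments, so the five accuracies average over the whole annotated corpus.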
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The Serial Combination
</SectionTitle>
    <Paragraph position="0"> When the two systems are coupled together, the manual rules are run first, and then the HMM tagger runs as usual, except that it selects only from those tags retained at individual tokens by the manual rule component, instead of from all tags produced by the morphological analyzer:
- The morphological analyzer is run on the test data set. Every input token receives a list of possible tags based on an extensive Czech morphological dictionary.</Paragraph>
    <Paragraph position="1"> - The manual rule component is run on the output of the morphology. The rules eliminate some tags which cannot form grammatical sentences in Czech.</Paragraph>
    <Paragraph position="2"> - The HMM tagger is run on the output of the rule component, using only the remaining tags at every input token. The output is best-only; i.e., the tagger outputs exactly one tag per input token.</Paragraph>
    <Paragraph position="3"> If there is no tag left at a given input token after the manual rules run, we reinsert all the tags from morphology and let the statistical tagger decide as if no rules had been used.</Paragraph>
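The three steps plus the fallback can be sketched end to end. All component names are illustrative stand-ins, not real interfaces: `morphology` for the PDT morphological analyzer, `rules` for the manual rule component, and `hmm_best` for best-only HMM decoding.

```python
def serial_tag(words, morphology, rules, hmm_best):
    """Serial combination: morphology -> manual rules -> HMM tagger.
    If the rules leave no tag at some token, all tags from morphology
    are reinserted there and the statistical tagger decides alone."""
    sent = [(w, set(morphology(w))) for w in words]
    sent = rules(sent)
    sent = [(w, tags if tags else set(morphology(w))) for w, tags in sent]
    return hmm_best(sent)

# Toy components exercising the fallback path
morphology = lambda w: {"A", "B"}
rules = lambda sent: [(w, set()) for w, _ in sent]        # over-eager: deletes everything
hmm_best = lambda sent: [min(tags) for _, tags in sent]   # stand-in for Viterbi decoding
```

Here the deliberately over-eager rule component empties every tag set, so the fallback reinserts the morphological tags and the (stand-in) tagger still emits one tag per token.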
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Evaluation of the Combined Tagger
</SectionTitle>
      <Paragraph position="0"> Table 5 contains the final evaluation of the main contribution of this paper. Since the rule-based component does not attempt full disambiguation, we can only use the F-measure for comparison and improvement evaluation (see footnote 6).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Error Analysis
</SectionTitle>
      <Paragraph position="0"> The not-so-perfect recall of the rule component has been caused either by some deficiency in the rules, or by an error in the input morphology (due to a deficiency in the morphological dictionary), or by an error in the 'truth' (caused by an imperfect manual annotation).</Paragraph>
      <Paragraph position="1"> As Czech syntax is extremely complex, some of the rules are either not yet absolutely perfect, or they are too strict (see footnote 7). An example of a rule which decreases the 100% recall on the test data is the following one: in Czech, if an unambiguous preposition is detected in a clause, it "must" be followed - not necessarily immediately - by a nominal element (a noun, adjective, pronoun or numeral) or, in very special cases, such a nominal element may be missing as it is elided. This fact about the syntax of prepositions in Czech is accounted for by a rule associating an unambiguous preposition with such a nominal element headed by the preposition. The rule, however, erroneously ignores the fact that some prepositions can function as heads of plain adverbs only (e.g., adverbs of time). As an example occurring in the test data we can take the simple structure "do kdy" (lit. till when), where "do" is a preposition (lit. till), "kdy" (when) is an adverb of time, and no nominal element follows. This results in the deletion of the prepositional interpretation of the preposition "do", thus causing an error. However, in cases like this, it is more appropriate to add another condition to the context of such a rule (gaining back the lost recall) rather than discard the rule as a whole (which would harm the precision too much).</Paragraph>
      <Paragraph position="2"> 6 For the HMM tagger, which works in best-only mode, accuracy = precision = recall = F-measure, of course. 7 "Too strict" is in fact good, given the overall scheme with the statistical tagger coming next, except in cases when it severely limits the possibility of increasing the precision. Nothing unexpected is happening here.</Paragraph>
      <Paragraph position="3"> As examples of erroneous tagging results which have been eliminated for good thanks to the architecture described, we might put forward:
- a preposition requiring case C not followed by any form in case C: any preposition has to be followed by at least one form (of a noun, adjective, pronoun or numeral) in the case it requires. Turning this around, if a word which is ambiguous between a preposition and another part of speech is not followed by the respective form before the end of the sentence, it is safe to discard the prepositional reading in almost all non-idiomatic, non-coordinated cases.</Paragraph>
      <Paragraph position="4"> - two finite verbs within a clause: similarly to most languages, a Czech clause must not contain more than one finite verb. This means that if two words, one a genuine finite verb and the other ambiguous between a finite verb and another reading, stand in such a configuration that the material between them contains no clause separator (comma, conjunction), it is safe to discard the finite verb reading of the ambiguous word.</Paragraph>
      <Paragraph position="5"> - two nominative cases within a clause: the subject in Czech is usually case-marked by the nominative and, even though the position of the subject is free (it can stand either to the left or to the right of the main verb), no clause can have two non-coordinated subjects.</Paragraph>
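The first example above (a preposition whose required case never appears later in the sentence) can be turned into a small sketch. The `"PREP-GEN"`-style tag strings are a toy encoding of "part of speech plus required (or actual) case", not PDT tags, and the function name is illustrative.

```python
def drop_unsupported_preposition_reading(sentence, i):
    """If the word at i is ambiguous between a preposition requiring case C
    and another reading, and no non-prepositional form in case C follows it
    before the end of the sentence, return the prepositional reading(s) to
    discard (safe in almost all non-idiomatic, non-coordinated cases)."""
    word, tags = sentence[i]
    prep_tags = {t for t in tags if t.startswith("PREP-")}
    if not prep_tags or prep_tags == tags:
        return set()                          # not ambiguously a preposition
    drop = set()
    for p in prep_tags:
        case = p.split("-", 1)[1]             # the case this preposition requires
        supported = any(
            t.endswith("-" + case) and not t.startswith("PREP-")
            for _, later_tags in sentence[i + 1:]
            for t in later_tags
        )
        if not supported:
            drop.add(p)
    return drop
```

For instance, the Czech word "bez" is ambiguous between a preposition ("without", requiring the genitive) and a noun ("elderberry"): followed by a genitive form, the prepositional reading survives; followed only by a finite verb, it is discarded.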
    </Section>
  </Section>
</Paper>