<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1042">
  <Title>Learning Morphological Disambiguation Rules for Turkish</Title>
  <Section position="4" start_page="329" end_page="331" type="metho">
    <SectionTitle>
3 Decision Lists
</SectionTitle>
    <Paragraph position="0"> We introduce a new method for morphological disambiguation based on decision lists. A decision list is an ordered list of rules where each rule consists of a pattern and a classification (Rivest, 1987). In our application the pattern specifies the surface attributes of the words surrounding the target, such as suffixes and character types (e.g. upper vs. lower case, use of punctuation, digits). The classification indicates the presence or absence of a morphological feature for the center word.</Paragraph>
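The pattern-plus-classification structure described above can be sketched in a few lines (a hypothetical encoding, not the authors' code: here an instance is a set of attribute strings, and the Rule/DecisionList names are our own):

```python
# Minimal sketch of a decision list: an ordered list of (pattern, class) rules.
# A pattern is a set of attribute strings that must all be present in the
# instance; the first matching rule determines the classification.
# (Names and attribute encoding are illustrative, not from the paper.)

class Rule:
    def __init__(self, pattern, cls):
        self.pattern = frozenset(pattern)  # required attributes, e.g. {"W==çok"}
        self.cls = cls                     # 1: feature present, 0: absent

    def matches(self, instance):
        return self.pattern <= instance    # every required attribute present

class DecisionList:
    def __init__(self, rules):
        self.rules = rules  # ordered: most specific first, default last

    def classify(self, instance):
        for rule in self.rules:
            if rule.matches(instance):
                return rule.cls
        raise ValueError("no default rule")  # a well-formed list ends with one

# A toy two-rule list for +Det: "çok" is not a determiner by default,
# every other word in this toy setup is.
dl = DecisionList([Rule({"W==çok"}, 0), Rule(set(), 1)])
```

Because the default (empty-pattern) rule matches everything, `classify` always terminates on a well-formed list.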
    <Section position="1" start_page="329" end_page="330" type="sub_section">
      <SectionTitle>
3.1 A Sample Decision List
</SectionTitle>
      <Paragraph position="0"> We will explain the rules and their patterns using the sample decision list in Table 2, trained to identify the feature +Det (determiner).</Paragraph>
      <Paragraph position="1">  The value in the class column is 1 if word W should have a +Det feature and 0 otherwise. The pattern column describes the required attributes of the words surrounding the target word for the rule to match. The last (default) rule has no pattern, matches every instance, and assigns them +Det.</Paragraph>
      <Paragraph position="2"> This default rule captures the behavior of the majority of the training instances, which had +Det in their correct parse. Rule 4 indicates a common exception: the frequently used word çok (meaning very) should not be assigned +Det by default, since çok can also be used as an adjective, an adverb, or a postposition. Rule 1 introduces an exception to rule 4: if the right neighbor R1 ends with the suffix +DA (the locative suffix), then çok should receive +Det. The meanings of the various symbols in the patterns are described below.</Paragraph>
      <Paragraph position="3"> When the decision list is applied to a window of words, the rules are tried in order from the most specific (rule 1) to the most general (rule 5). The first rule that matches is used to predict the classification of the center word. The last rule acts as a catch-all; if none of the other rules have matched, this rule assigns the instance a default classification. For example, the five-rule decision list given above classifies the middle word in pek çok alanda (matches rule 1) and pek çok insan (matches rule 2) as +Det, but insan çok daha (matches rule 4) as not +Det.</Paragraph>
      <Paragraph position="4"> The capital letters on the right of the patterns represent character groups useful in identifying phonetic variations of certain suffixes; e.g. the locative suffix +DA can surface as +de, +da, +te, or +ta depending on the root word ending.</Paragraph>
      <Paragraph position="5"> One way to interpret a decision list is as a sequence of if-then-else constructs familiar from programming languages. Another way is to see the last rule as the default classification, the previous rule as specifying a set of exceptions to the default, the rule before that as specifying exceptions to these exceptions, and so on.</Paragraph>
    </Section>
    <Section position="2" start_page="330" end_page="331" type="sub_section">
      <SectionTitle>
3.2 The Greedy Prepend Algorithm (GPA)
</SectionTitle>
      <Paragraph position="0"> To learn a decision list from a given set of training examples, the general approach is to start with a default rule or an empty decision list and keep adding the best rule to cover the unclassified or misclassified examples. The new rules can be added to the end of the list (Clark and Niblett, 1989), the front of the list (Webb and Brkic, 1993), or other positions (Newlands and Webb, 2004). Other design decisions include the criteria used to select the best rule and how to search for it.</Paragraph>
      <Paragraph position="1"> The Greedy Prepend Algorithm (GPA) is a variant of the PREPEND algorithm (Webb and Brkic, 1993).</Paragraph>
      <Paragraph position="2"> It starts with a default rule that matches all instances and classifies them using the most common class in the training data. Then it keeps prepending the rule with the maximum gain to the front of the growing decision list until no further improvement can be made. The gain of a candidate rule in GPA is defined as the increase in the number of correctly classified instances in the training set as a result of prepending the rule to the existing decision list. This is in contrast with the original PREPEND algorithm, which uses the less direct Laplace preference function (Webb and Brkic, 1993; Clark and Boswell, 1991).</Paragraph>
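The gain criterion just described can be sketched as follows (a minimal illustration under our own attribute-set encoding of instances; the function names are not from the paper):

```python
# Sketch of GPA's gain criterion (our own minimal encoding, not the paper's
# code): gain(rule) = number of correctly classified training instances after
# prepending the rule, minus the number before prepending.

def classify(rules, instance):
    """Return the class of the first matching (pattern, cls) rule."""
    for pattern, cls in rules:
        if pattern <= instance:          # all required attributes present
            return cls
    return None                          # no rule matched (no default yet)

def accuracy_count(rules, data):
    """Number of (instance, label) pairs the current list classifies correctly."""
    return sum(classify(rules, x) == y for x, y in data)

def gain(rule, rules, data):
    """Increase in correct classifications from prepending `rule`."""
    return accuracy_count([rule] + rules, data) - accuracy_count(rules, data)

# Toy data for the +Det example: plain "çok" should be 0, other words 1.
data = [(frozenset({"W=çok"}), 0), (frozenset({"W=pek"}), 1),
        (frozenset({"W=bu"}), 1)]
default = (frozenset(), 1)               # default rule: assign +Det
# Prepending the çok exception fixes exactly one misclassified instance:
print(gain((frozenset({"W=çok"}), 0), [default], data))  # -> 1
```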
      <Paragraph position="3"> To find the next rule with the maximum gain, GPA uses a heuristic search algorithm. Candidate rules are generated by adding a single new attribute to the pattern of each rule already in the decision list. The candidate with the maximum gain is prepended to the decision list and the process is repeated until no more positive gain rules can be found. Note that if the best possible rule has more than one extra attribute compared to the existing rules in the decision list, a suboptimal rule will be selected. The original PREPEND uses an admissible search algorithm, OPUS, which is guaranteed to find the best possible candidate (Webb, 1995), but we found OPUS to be too slow to be practical for a problem of this scale.</Paragraph>
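The whole greedy loop can be sketched end-to-end (an illustrative toy implementation, not the authors' system: instances are attribute sets, and the tiny dataset mirrors the çok/+Det example from Section 3.1):

```python
# Sketch of the Greedy Prepend Algorithm (illustrative, not the paper's code).
# Candidate rules are formed by adding one new attribute to the pattern of a
# rule already in the list; the best positive-gain candidate is prepended
# until no candidate improves training accuracy.

def classify(rules, x):
    for pattern, cls in rules:
        if pattern <= x:
            return cls

def n_correct(rules, data):
    return sum(classify(rules, x) == y for x, y in data)

def gpa(data, classes=(0, 1)):
    attrs = set().union(*(x for x, _ in data))   # all attributes seen
    # default rule: empty pattern, most common class in the training data
    majority = max(classes, key=lambda c: sum(y == c for _, y in data))
    rules = [(frozenset(), majority)]
    while True:
        base = n_correct(rules, data)
        best, best_gain = None, 0
        # grow each existing pattern by one attribute, try both classes
        for pattern, _ in rules:
            for a in attrs - pattern:
                for c in classes:
                    cand = (pattern | {a}, c)
                    g = n_correct([cand] + rules, data) - base
                    if g > best_gain:
                        best, best_gain = cand, g
        if best is None:
            return rules        # no positive-gain candidate left
        rules.insert(0, best)

# Toy data: "çok" is not +Det, unless followed by a locative (+DA) word.
data = [(frozenset({"W=çok", "R1=+DA"}), 1),
        (frozenset({"W=çok"}), 0),
        (frozenset({"W=çok"}), 0),
        (frozenset({"W=pek"}), 1),
        (frozenset({"W=bu"}), 1)]
rules = gpa(data)
print(n_correct(rules, data) == len(data))  # -> True
```

On this toy data the loop first learns the çok exception and then the exception-to-the-exception, mirroring rules 4 and 1 of the sample list.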
      <Paragraph position="4"> We picked GPA for the morphological disambiguation problem because we find it to be fast and fairly robust to the existence of irrelevant or redundant attributes. The average training instance has 40 attributes describing the suffixes of all possible lengths and character type information in a five-word window. Most of this information is redundant or irrelevant to the problem at hand. The number of distinct attributes is on the order of the number of distinct word-forms in the training set. Nevertheless, GPA is able to process a million training instances for each of the 126 unique morphological features and produce a model with state-of-the-art accuracy in about two hours on a regular desktop PC.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="331" end_page="332" type="metho">
    <SectionTitle>
4 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> In this section we present the details of the data, the training and testing procedures, the surface attributes used, and the accuracy results.</Paragraph>
    <Section position="1" start_page="331" end_page="331" type="sub_section">
      <SectionTitle>
4.1 Training Data
</SectionTitle>
      <Paragraph position="0"> Our training data consists of about 1 million words of semi-automatically disambiguated Turkish news text. For each of the 126 unique morphological features, we used the subset of the training data in which instances have the given feature in at least one of their generated parses. We then split this subset into positive and negative examples depending on whether the correct parse contains the given feature. A decision list specific to that feature is created using GPA based on these examples.</Paragraph>
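The per-feature example construction described above might look like this (a hedged sketch: the token layout, with candidate parses as feature sets plus the index of the correct parse, is our own assumption):

```python
# Sketch of building training examples for one morphological feature.
# Token layout (assumed, not from the paper): (candidate_parses, correct_idx,
# attributes), where each parse is a set of feature strings.

def make_examples(tokens, feature):
    """Return (instance, label) pairs for tokens whose candidate parses
    mention `feature`; label is 1 iff the correct parse contains it."""
    examples = []
    for parses, correct_idx, attributes in tokens:
        if any(feature in p for p in parses):      # feature is at stake here
            label = int(feature in parses[correct_idx])
            examples.append((attributes, label))
    return examples

# Toy token: "çok" with two candidate parses; the second (index 1) is correct.
tokens = [([{"+Det"}, {"+Adverb"}], 1, frozenset({"W=çok"}))]
ex = make_examples(tokens, "+Det")
print(len(ex), ex[0][1])              # -> 1 0  (one example, labeled negative)
print(make_examples(tokens, "+Fut"))  # -> []   (+Fut never appears, so skipped)
```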
      <Paragraph position="1"> Some relevant statistics for the training data are given in Table 4.</Paragraph>
    </Section>
    <Section position="2" start_page="331" end_page="331" type="sub_section">
      <SectionTitle>
4.2 Input Attributes
</SectionTitle>
      <Paragraph position="0"> Once the training data is selected for a particular morphological feature, each instance is represented by surface attributes of five words centered around the target word. We have tried larger window sizes, but no significant improvement was observed. The attributes computed for each word in the window consist of the following: 1. The exact word string (e.g. W==Ali'nin). 2. The lowercase version (e.g. W=ali'nin). Note: all digits are replaced by 0's at this stage.</Paragraph>
      <Paragraph position="1"> 3. All suffixes of the lowercase version (e.g. W=+n, W=+In, W=+nIn, W=+'nIn, etc.). Note: certain characters are replaced with capital letters representing the character groups mentioned in Table 3. These groups help the algorithm recognize different forms of a suffix created by the phonetic rules of Turkish: for example, the locative suffix +DA can surface as +de, +da, +te, or +ta depending on the ending of the root word.</Paragraph>
      <Paragraph position="3"> 4. Attributes indicating the types of characters at various positions of the word (e.g. Ali'nin would be described with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST). Each training instance is represented by 40 attributes on average. The GPA procedure is responsible for picking the attributes that are relevant to the decision. No dictionary information is required or used; therefore, the models are fairly robust to unknown words. One potentially useful source of attributes is the tags assigned to previous words, which we plan to experiment with in future work.</Paragraph>
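The four attribute families above can be sketched as one extractor. This is our reconstruction, with caveats: the exact attribute naming is guessed from the examples in the text, and since Table 3's full character-group inventory is not reproduced here, only the D={d,t} and A={a,e} groups implied by the +DA example are mapped:

```python
import re

# Sketch of the surface-attribute extractor (our reconstruction; attribute
# names follow the examples in the text, and the character-group table is
# truncated to the D and A groups implied by the +DA discussion).

GROUPS = str.maketrans({"d": "D", "t": "D", "a": "A", "e": "A"})  # assumed subset

def attributes(word, prefix="W"):
    attrs = set()
    attrs.add(f"{prefix}=={word}")                   # 1. exact word string
    lower = re.sub(r"\d", "0", word.lower())         # 2. lowercase, digits -> 0
    attrs.add(f"{prefix}={lower}")
    grouped = lower.translate(GROUPS)                # 3. all suffixes, with groups
    for i in range(1, len(grouped)):
        attrs.add(f"{prefix}=+{grouped[i:]}")
    if word and word[0].isupper():                   # 4. character-type attributes
        attrs.add(f"{prefix}=UPPER-FIRST")
    if "'" in word[1:-1]:
        attrs.add(f"{prefix}=APOS-MID")
    if any(c.islower() for c in word[1:-1]):
        attrs.add(f"{prefix}=LOWER-MID")
    if word and word[-1].islower():
        attrs.add(f"{prefix}=LOWER-LAST")
    return attrs

a = attributes("Ali'nin")
print("W==Ali'nin" in a, "W=UPPER-FIRST" in a, "W=APOS-MID" in a)  # -> True True True
```

One such set would be computed for each of the five window positions (with prefixes like R1 for the right neighbor), yielding the roughly 40 attributes per instance mentioned above.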
    </Section>
    <Section position="3" start_page="331" end_page="331" type="sub_section">
      <SectionTitle>
4.3 The Decision Lists
</SectionTitle>
      <Paragraph position="0"> At the conclusion of the training, 126 decision lists are produced of the form given in Table 2. The number of rules in each decision list ranges from 1 to 6145. The longer decision lists are typically for part-of-speech features, e.g. distinguishing nouns from adjectives, and contain rules specific to lexical items.</Paragraph>
      <Paragraph position="1"> The average number of rules is 266. To get an estimate on the accuracy of each decision list, we split the one million word data into training, validation, and test portions using the ratio 4:1:1. The training set accuracy of the decision lists is consistently above 98%. The test set accuracies of the 126 decision lists range from 80% to 100%, with the average at 95%. Table 5 gives the six worst features with test set accuracy below 89%; these are the most difficult to disambiguate.</Paragraph>
    </Section>
    <Section position="4" start_page="331" end_page="332" type="sub_section">
      <SectionTitle>
4.4 Correct Tag Selection
</SectionTitle>
      <Paragraph position="0"> To evaluate the candidate tags, we need to combine the results of the decision lists. We assume that the presence or absence of each feature is an independent event with a probability determined by the test set accuracy of the corresponding decision list. For example, if the +P3pl decision list predicts YES, we assume that the +P3pl feature is present with probability 0.8408 (see Table 5).</Paragraph>
      <Paragraph position="1"> If the +Fut decision list predicts NO, we assume the +Fut feature is present with probability 1 - 0.8511 = 0.1489. To avoid zero probabilities we cap the test set accuracies at 99%.</Paragraph>
      <Paragraph position="2"> Each candidate tag indicates the presence of certain features and the absence of others. The probability of the tag being correct under our independence assumption is the product of the probabilities for the presence and absence of each of the 126 features as determined by our decision lists. For efficiency, one can neglect the features that are absent from all the candidate tags because their contribution will not affect the comparison.</Paragraph>
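The product under the independence assumption can be sketched as follows (an illustration using the +P3pl and +Fut numbers quoted in the text; the function name and data layout are our own):

```python
# Sketch of combining per-feature decision-list outputs into a tag probability
# (independence assumption from the text; accuracies capped at 0.99 to avoid
# zero probabilities). Layout and names are illustrative, not from the paper.

def tag_probability(tag_features, predictions, accuracies, all_features):
    """tag_features: features present in the candidate tag.
    predictions: feature -> decision-list output (True = YES, False = NO).
    accuracies: feature -> test-set accuracy of that feature's list."""
    p = 1.0
    for f in all_features:
        acc = min(accuracies[f], 0.99)
        # probability that feature f is present, given the list's prediction
        p_present = acc if predictions[f] else 1 - acc
        p *= p_present if f in tag_features else 1 - p_present
    return p

# Numbers from the text: +P3pl list says YES (acc 0.8408), +Fut says NO (0.8511).
accs = {"+P3pl": 0.8408, "+Fut": 0.8511}
preds = {"+P3pl": True, "+Fut": False}
p = tag_probability({"+P3pl"}, preds, accs, ["+P3pl", "+Fut"])
print(round(p, 4))  # 0.8408 * (1 - 0.1489) -> 0.7156
```

The candidate tag with the largest product is selected; features absent from every candidate contribute the same factor to all products and can be skipped, as noted above.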
    </Section>
  </Section>
</Paper>