<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1033">
  <Title>Adaptive Transformation-based Learning for Improving Dictionary Tagging</Title>
  <Section position="3" start_page="257" end_page="258" type="metho">
    <SectionTitle>
2 A Rule-based Dictionary Entry Tagger
</SectionTitle>
    <Paragraph position="0"> The rule-based entry tagger (Karagol-Ayan et al., 2003) utilizes the repeating structure of the dictionaries to identify and tag the linguistic role of tokens or sets of tokens. Rule-based tagging  usesthreedifferenttypesofclues--fontstyle,keywords and separators--to tag the entries in a systematic way. The method accommodates noise introduced by the document analyzer by allowing for a relaxed matching of OCRed output to tags.</Paragraph>
    <Paragraph position="1"> For each dictionary, a human operator must specify the lexicographic information used in that particular dictionary, along with the clues for each tag. This process can be performed in a few hours.</Paragraph>
    <Paragraph position="2">  Therule-basedmethodaloneachievedtokenaccuracy between 73%-87% and phrase accuracy between 75%-89% in experiments conducted using three different dictionaries5.</Paragraph>
    <Paragraph position="3"> The rule-based method has demonstrated promising results, but has two shortcomings. First, the method does not consider the relations between different tags in the entries. While not a problem for some dictionaries, for others ordering the relations between tags may be the only information that will tag a token correctly. Consider the dictionary entries in Figure 1. In this dictionary, the word &amp;quot;a&amp;quot; represents POS when in italic font, and part of a translation if in normal font. However if the font is incorrect (font errors are more likely to happen with short tokens), the only way to mark correctly the tag involves checking the neighboring tokens and tags to determine its relative position within the entry. When the token has an incorrect font or OCR errors exist, and the other clues are ambiguous or inconclusive, the rule-based method may yield incorrect results.</Paragraph>
    <Paragraph position="4"> Second, the rule-based method can produce incorrect splitting and/or merging of phrases. An erroneous merge of two tokens as a phrase may take place either because of a font error in one of the tokens or the lack of a separator, such as a punctuation mark. A phrase may split erroneously either 5Using HMMs for entry tagging on the same set of dictionaries produced slightly lower performance, resulting in token accuracy between 73%-88% and phrase accuracy between 57%-85%.</Paragraph>
    <Paragraph position="5">  as a result of a font error or an ambiguous separator. For instance, a comma may be used after an example of usage to separate it from its translation or within it as a normal punctuation mark.</Paragraph>
  </Section>
  <Section position="4" start_page="258" end_page="258" type="metho">
    <SectionTitle>
3 TBL
</SectionTitle>
    <Paragraph position="0"> TBL (Brill, 1995), a rule-based machine learning algorithm, has been applied to various NLP tasks.</Paragraph>
    <Paragraph position="1"> TBL starts with an initial state, and it requires a correctly annotated training corpus, or truth, for the learning (or training) process. The iterative learning process acquires an ordered list of rules or transformations that correct the errors in this initial state. At each iteration, the transformation which achieved the largest benefit during application is selected. During the learning process, the templates of allowable transformations limit the search space for possible transformation rules.</Paragraph>
    <Paragraph position="2"> The proposed transformations are formed by instantiation of the transformation templates in the context of erroneous tags. The learning algorithm stops when no improvement can be made to the current state of the training data or when a pre-specified threshold is reached.</Paragraph>
    <Paragraph position="3"> A transformation modifies a tag when its context (such as neighboring tags or tokens) matches the context described by the transformation. Two parts comprise a transformation: a rewrite rule-what to replace-- and a triggering environment-when to replace. A typical rewrite rule is: Change the annotation from aa to ab, and a typical triggering environment is: The preceding word is wa.</Paragraph>
    <Paragraph position="4"> The system's output is the final state of this data after applying all transformations in the order they are produced.</Paragraph>
    <Paragraph position="5"> To overcome the lengthy training time associated with this approach, we used fnTBL, a fast version of TBL that preserves the performance of the algorithm (Ngai and Florian, 2001). Our research contribution shows this method is effective when applied to a miniscule set of training data.</Paragraph>
  </Section>
  <Section position="5" start_page="258" end_page="259" type="metho">
    <SectionTitle>
4 Application of TBL to Entry Tagging
</SectionTitle>
    <Paragraph position="0"> In this section, we describe how we used TBL in the context of tagging dictionary entries.</Paragraph>
    <Paragraph position="1"> We apply TBL at two points: to render correctly the font style of the tokens and to label correctly  boundary flags. In this paper, whenever we say &amp;quot;application of TBL to tagging&amp;quot;, we mean tags and phrase boundary flags  cial role in identifying tags. The rule-based entry taggerreliesonfontstyle, whichcanbealsoincorrect. Therefore we also investigate whether improving font style accuracy will further improve tagging results. We apply TBL in three configurations: (1) to improve font style, (2) to improve tagging and (3) to improve both, one after another. Figure 2 shows the phases of TBL application.</Paragraph>
    <Paragraph position="2"> First we have the rule-based entry tagging results with the font style assigned by document image analysis (Result1), then we apply TBL to tagging using this result (Result2). We also apply TBL to improve the font style accuracy, and we feed these changed font styles to the rule-based method (Result3). We then apply TBL to tagging using this result (Result4). Finally, in order to find the upper bound when we use the manually corrected font stylesinthegroundtruthdata, wefeedcorrectfont stylestotherule-basedmethod(Result5), andthen apply TBL to tagging using this result (Result6).</Paragraph>
    <Paragraph position="3"> In the transformation templates, we use the tokens themselves as features, i.e. the items in the triggering environment, because the token's content is useful in indicating the role. For instance a comma and a period may have different functionalities when tagging the dictionary. However, when transformations are allowed to make reference to tokens, i.e., when lexicalized transformations are allowed, some relevant information may be lost because of sparsity. To overcome the data sparseness problem, we also assign a type to each token that classifies the token's content. We use eight types: punctuation, symbol, numeric, uppercase, capitalized, lowercase, non-Latin, and other. For TBL on font style, the transformation templatescontainthreefeatures: thetoken, thetoken's type, andthetoken'sfont. ForTBLontagging, we together.</Paragraph>
    <Paragraph position="4">  use four features: the token, the token's type, the token's font style, and the token's tag.</Paragraph>
    <Paragraph position="5"> The initial state annotations for font style are assigned by document image analysis. The rulebasedentrytaggingmethodassignstheinitialstate null ofthetokens'tags. Thetemplatesforfontstyleaccuracy improvement consist of those from studying the data and all templates using all features within a window of five tokens (i.e., two preceding tokens, the current token, and two following tokens). For tagging accuracy improvement, we prepared the transformation templates by studying dictionaries and errors in the entry tagging results. The objective function for evaluating transformations in both cases is the classification accuracy, and the objective is to minimize the number of errors. null</Paragraph>
  </Section>
  <Section position="6" start_page="259" end_page="261" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> We performed our experiments on a Cebuano-English dictionary (Wolff, 1972) consisting of 1163 pages, 4 font styles, and 18 tags, and on anIraqiArabic-Englishdictionary(Woodheadand Beene, 2003) consisting of 507 pages, 3 font styles, and 26 tags. For our experiments, we used a publicly available implementation of TBL's fast version, fnTBL7, described in Section 3.</Paragraph>
    <Paragraph position="1"> Weusedeightrandomlyselectedpagesfromthe dictionaries to train TBL, and six additional randomly selected pages for testing. The font style and tag of each token on these pages are manually corrected from an initial run. Our goal is to measure the effect of TBL on font style and tagging that have the same noisy input. For the Cebuano dictionary, the training data contains 156 entries, 8370 tokens, and 6691 non-punctuation tokens, and the test data contains 137 entries, 6251 tokens, and 4940 non-punctuation tokens. For the Iraqi Arabic dictionary, the training data contains 232 entries, 6130 tokens, and 4621 non-punctuation tokens, and the testdata contains 175 entries, 4708 tokens, 3467 non-punctuation tokens.</Paragraph>
    <Paragraph position="2"> For evaluation, we used the percentage of accuracy for non-punctuation tokens, i.e., the number of correctly identified tags divided by total number of tokens/phrases. The learning phase of TBL took less than one minute for each run, and application of learned transformations to the whole dictionary less than two minutes.</Paragraph>
    <Paragraph position="3"> We report how TBL affects accuracy of tagging  when applied to font styles, tags, and font styles and tags together. To find the upper bound tagging results with correct font styles, we also ran rule-based entry tagger using manually corrected font styles, and applied TBL for tagging accuracy improvement to these results. We should note that  feedingthecorrectfonttotherule-basedentrytagger does not necessarily mean the data is totally correct, it may still contain noise from document image analysis or ambiguity in the entry.</Paragraph>
    <Paragraph position="4"> We conducted three sets of experiments to observe the effects of TBL (Section 5.1), the effects of different training data (Section 5.2), and the effects of training data size (Section 5.3).</Paragraph>
    <Section position="1" start_page="259" end_page="260" type="sub_section">
      <SectionTitle>
5.1 TBL on Font Styles and Tags
</SectionTitle>
      <Paragraph position="0"> punctuation tokens We report the accuracy of font styles on the test data before and after applying TBL to the font styleofthenon-punctuationtokensinTable1. The initial font style accuracy of Cebuano dictionary was much less than the Iraqi Arabic dictionary, but applying TBL resulted in similar font style accuracy for both dictionaries (97% and 98%).</Paragraph>
      <Paragraph position="1">  ation tokens and phrases for two dictionaries The results of tagging accuracy experiments are presented in Table 2. In the tables, RB is rule-based method, TBL(tag) is the TBL run on tags, TBL(font) is the TBL run on font style, and GT(font) is the ground truth font style. In each case, we begin with font style information provided by document image analysis. We tabulate  percentagesoftaggingaccuracyofindividualnonpunctuation tokens and phrases8. The results for 8In phrase accuracy, if a group of consequent tokens is assigned one tag as a phrase in the ground truth, the tagging of the phrase is considered correct only if the same group of  token and phrase accuracy are presented for three different sets: The entry tagger using the font style (1) provided by document image analysis, (2) after TBL is applied to font style, and (3) cor null rected manually, i.e. the ground truth. All results reported, except the token accuracies for two cases for the Iraqi Arabic dictionary, namely using TBL(font) vs. GT(font) and using TBL(font) and TBL(tag) together vs. using GT(font) and TBL(tag), are statistically significant within the 95% confidence interval with two-tailed paired ttests9. null Using TBL(font) instead of initial font styles improved initial accuracy as much as 4.74% for tokens, and 8.36% for phrases in the Cebuano dictionary which has a much lower initial font style accuracy than the Iraqi Arabic dictionary. Using the GT(font) further increased the tagging accuracy by 2.77% for tokens and 2.27% for phrases for the Cebuano dictionary. As for the Iraqi Arabic dictionary, using TBL(font) and GT(font) resulted in an improvement of 0.57% and 0.85% for tokens and 0.74% and 1.18% for phrases respectively. The improvements in these two dictionaries differ because the initial font style accuracy for the Iraqi Arabic dictionary is very high while for the Cebuano dictionary potentially very useful font style information (namely, the font style for POS tokens) is often incorrect in the initial run. Using TBL(tag) alone improved rule-based method results by 8.19% and 3.16% for tokens and by 23.25% and 9.61% for phrases in Cebuano and Iraqi Arabic dictionaries respectively. The last two rows in Table 2 show the upper bound. For the two dictionaries, our results using TBL(font) and TBL(tag) together is 2.68% and 0.24% for token accuracy and 2.10% and 0.53% for phrase accuracy less than the upper bound of using the GT(font) and TBL(tag) together.</Paragraph>
      <Paragraph position="2"> Applying TBL to font styles resulted in a higher accuracy than applying TBL to tagging. Since the number of tag types (18 and 26) is much larger than that of font style types (4 and 3), TBL application on tags requires more training data than the font style to perform as well as TBL application on font style.</Paragraph>
      <Paragraph position="3"> In summary, applying TBL using the same templates to two different dictionaries using very limitedtrainingdataresultedinperformanceincrease, null tokens was assigned the same tag as a phrase in the result. 9We did the t-tests on the results of individual entries. and the greatest increases we observed are in phrase accuracy. Applying TBL to font style first increased the accuracy even further.</Paragraph>
    </Section>
    <Section position="2" start_page="260" end_page="260" type="sub_section">
      <SectionTitle>
5.2 Effect of Training Data
</SectionTitle>
      <Paragraph position="0"> We conducted experiments to measure the robustness of our method with different training data.</Paragraph>
      <Paragraph position="1"> For this purpose, we trained TBL on eight pages randomly selected from the 14 pages for which we have ground truth, for each dictionary. We used theremainingsixpagesfortesting. Wedidthisten times, and calculated the average accuracy and the standard deviation. Table 3 presents the average accuracy and standard deviation. The accuracy results are consistent with the results we presented in Table 2, and the standard deviation is between 0.56-2.28. These results suggest that using different training data does not affect the performance dramatically.</Paragraph>
    </Section>
    <Section position="3" start_page="260" end_page="261" type="sub_section">
      <SectionTitle>
5.3 Effect of Training Data Size
</SectionTitle>
      <Paragraph position="0"> The problem to which we apply TBL has one important challenge and differs from other tasks in which TBL has been applied. Each dictionary has a different structure and different noise patterns, hence, TBL must be trained for each dictionary.</Paragraph>
      <Paragraph position="1"> This requires preparing ground truth manually for each dictionary before applying TBL. Moreover, although each dictionary has hundreds of pages, it is not feasible to use a significant portion of the dictionary for training. Therefore the training data should be small enough for someone to annotate ground truth in a short amount of time. One of our goals is to calculate the quantity of training data necessaryforareasonableimprovementintagging accuracy. For this purpose, we investigated the effect of the training data size by increasing the training data size for TBL one entry at a time. The entries are added in the order of the number of errorstheycontain, startingwiththeentrywithmaximum errors. We then tested the system trained with these entries on two test pages10.</Paragraph>
      <Paragraph position="2">  Figure3showsthenumberoffontstyleandtagging errors for non-punctuation tokens on two test pages as a function of the number of entries in the training data. The tagging results are presented when using font style from document image analysis and font style after TBL. In these graphs, the 10We used two test data pages because if such a method will determine the minimum training data required to obtain a reasonable performance, the test data should be extremely limited to reduce human provided data.</Paragraph>
      <Paragraph position="3">  data for two dictionaries number of errors declines dramatically with the addition of the first entries. For the tags, the declineisnotassteepasthedeclineinfontstyle. The main reason involves the number of tags (18 and 26), which are more than the number of font styles (4 and 3). The method of adding entries to training data one by one, and finding the point when the number of errors on selected entries stabilizes, can determine minimum training data size to get a reasonable performance increase. lexicalized</Paragraph>
    </Section>
    <Section position="4" start_page="261" end_page="261" type="sub_section">
      <SectionTitle>
5.4 Example Results
</SectionTitle>
      <Paragraph position="0"> Table 4 presents some learned transformations for Cebuano dictionary. Table 5 shows how these transformations change the font style and tags of tokens from Figure 4. The first column gives the tagging results before applying TBL. The consecutive columns shows how different TBL runs changes these results. The tags with * indicate incorrect tags, the tags with + indicate corrected tags, and the tags with - indicate introduced errors. The font style of tokens is also represented.</Paragraph>
      <Paragraph position="1"> The No column in Tables 4 and 5 gives the applied transformation number.</Paragraph>
      <Paragraph position="2"> For these entries, using TBL on font styles and tagging together gives correct results in all cases. Using TBL only on tagging gives the correct tagging only for the last entry.</Paragraph>
      <Paragraph position="3"> TBL introduces new errors in some cases. One error we observed occurs when an example of usage translation is assigned a tag before any example of usage tag in an entry. This case is illustrated when applying transformation 9 to the token Abaa because of a misrecognized comma before the token. null</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="261" end_page="262" type="metho">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> In this paper, we introduced a new dictionary entry tagging system in which TBL improves tagging accuracy. TBL is applied at two points, on font style and tagging- and yields high performance even with limited user provided training data. For two different dictionaries, we achieved an increase from 84% and 94% to 97% and 98% in font style accuracy, from 83% and 91% to 93% and 94% in tagging accuracy of tokens, and from 64% and 83% to 90% and 93% in tagging accuracy of phrases. If the initial font style is not accurate, first improving font style with TBL further assisted the tagging accuracy as much as 2.62% for tokens and 2.82% for phrases compared to using TBL only for tagging. This result cannot be  tion of a phrase indicates this token merges with the previous one to form a phrase. attributed to a low rule-based baseline as a similar, even a slightly lower baseline is obtained from an HMM trained system. Results came from a method used to compensate for extremely limited training data. The similarity of performance acrosstwodifferentdictionariesshowsthemethod as adaptive and able to be applied genericly.</Paragraph>
    <Paragraph position="1"> In the future, we plan to investigate the sources of errors introduced by TBL and whether these can be avoided by post-processing TBL results using heuristics. We will also examine the effects of using TBL to increase the training data size in a bootstrapped manner. We will apply TBL to a few pages, then correct these and use them as new training data in another run. Since TBL improves accuracy, manually preparing training data will take less time.</Paragraph>
  </Section>
  <Section position="8" start_page="262" end_page="262" type="metho">
    <SectionTitle>
Acknowledgements
</SectionTitle>
    <Paragraph position="0"> The partial support of this research under contract MDA-9040-2C-0406 is gratefully acknowledged.</Paragraph>
  </Section>
class="xml-element"></Paper>