<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1033">
  <Title>Adaptive Transformation-based Learning for Improving Dictionary Tagging</Title>
  <Section position="2" start_page="0" end_page="257" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The availability and use of electronic resources such as electronic dictionaries have increased tremendously in recent years, and their use in Natural Language Processing (NLP) systems is widespread. For languages with limited electronic resources, i.e., low-density languages, however, we cannot use automated techniques based on parallel corpora (Gale and Church, 1991; Melamed, 2000; Resnik, 1999; Utsuro et al., 2002), comparable corpora (Fung and Yee, 1998), or multilingual thesauri (Vossen, 1998). Yet for these low-density languages, printed bilingual dictionaries often offer an effective mapping from the low-density language to a high-density language, such as English. Dictionaries can have different formats and can provide a variety of information. However, they typically have a consistent layout of entries and a consistent structure within entries. Publishers of dictionaries often use a combination of features to impose this structure, including (1) changes in font style, font size, etc. that implicitly signal the lexicographic information, such as headwords, pronunciations, parts of speech (POS), and translations, (2) keywords that provide an explicit interpretation of the lexicographic information, and (3) various separators that impose an overall structure on the entry. For example, a boldface font may indicate a headword, italics may indicate an example of usage, keywords may designate the POS, commas may separate different translations, and a numbering system may identify different senses of a word.</Paragraph>
    <Paragraph position="1"> We developed an entry tagging system that recognizes, parses, and tags the entries of a printed dictionary to reproduce the representation electronically (Karagol-Ayan et al., 2003). The system aims to use the features described above, together with the consistent layout and structure of the dictionaries, to capture and recover the lexicographic information in the entries. Each token or group of tokens (phrase) in an entry is associated with a tag indicating its lexicographic information in the entry. Figure 1 shows sample tagged entries in which eight different types of lexicographic information are identified and marked. The system gets format and style information from a document image analyzer module (Ma and Doermann, 2003) and is retargeted at many levels with minimal human assistance.</Paragraph>
    <Paragraph position="2"> A major requirement for a human-aided dictionary tagging application is the need to minimize human-generated training data (footnote 4). This requirement limits the effectiveness of data-driven methods for initial training. We chose rule-based tagging, which uses the structure to analyze and tag tokens, as our baseline because it outperformed the baseline results of an HMM tagger. The approach has demonstrated promising results, but we will show that its shortcomings can be mitigated by applying a transformation-based learning (TBL) post-processing technique.</Paragraph>
    <Paragraph position="3"> TBL (Brill, 1995) is a rule-based machine learning method with some attractive qualities that make it suitable for language-related tasks. First, the resulting rules are easily reviewed and understood. Second, it is error-driven and thus directly minimizes the error rate (Florian and Ngai, 2001). Furthermore, TBL can be applied to other annotation systems' output to improve performance. Finally, it makes use of the features of the token and those in the neighborhood surrounding it.</Paragraph>
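The error-driven loop described above can be illustrated with a minimal Python sketch. It is not the system described in this paper: it assumes a single hypothetical rule template ("change tag A to tag B when the previous tag is C"), and the function names, tag names, and toy data are all illustrative assumptions. It only shows how TBL can post-process another tagger's output by greedily choosing the rule with the best net error reduction.

```python
from collections import Counter


def score_rules(current, gold):
    """Score candidate rules (from_tag, to_tag, prev_tag) by net error reduction."""
    # Hypothetical template: every existing error proposes one candidate rule.
    candidates = set()
    for i in range(1, len(current)):
        if current[i] != gold[i]:
            candidates.add((current[i], gold[i], current[i - 1]))

    scores = Counter()
    for rule in candidates:
        from_tag, to_tag, prev_tag = rule
        for i in range(1, len(current)):
            if current[i] == from_tag and current[i - 1] == prev_tag:
                if gold[i] == to_tag:
                    scores[rule] += 1  # the rule would fix this token
                elif gold[i] == from_tag:
                    scores[rule] -= 1  # the rule would break a correct tag
    return scores


def apply_rule(tags, rule):
    """Apply one rule, reading context from the input tags (not the output)."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def apply_rules(tags, rules):
    """Apply learned rules in the order they were learned."""
    for rule in rules:
        tags = apply_rule(tags, rule)
    return list(tags)


def learn_rules(initial, gold, min_gain=1):
    """Greedy TBL: keep the best rule each round until no rule gains enough.

    min_gain is assumed to be at least 1 so that every accepted rule
    strictly reduces the number of errors and the loop terminates.
    """
    current = list(initial)
    learned = []
    while True:
        scores = score_rules(current, gold)
        if not scores:
            break
        # Ties are broken deterministically by the rule tuple itself.
        best_rule, best_gain = max(
            scores.items(), key=lambda item: (item[1], item[0])
        )
        if min_gain > best_gain:
            break
        learned.append(best_rule)
        current = apply_rule(current, best_rule)
    return learned


# Toy data: tags from a hypothetical baseline tagger versus hand-corrected tags.
gold = ["headword", "pos", "translation", "translation",
        "headword", "pos", "translation", "example"]
baseline = ["headword", "pos", "example", "translation",
            "headword", "pos", "example", "example"]

learned = learn_rules(baseline, gold)
corrected = apply_rules(baseline, learned)
```

In this toy run the baseline mislabels the token after each POS tag, so a single learned rule repairs both errors. A realistic setup would use several templates over font style, separators, and neighboring tags.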
    <Paragraph position="4"> In this paper, we describe an adaptive TBL-based technique to improve the performance of the rule-based entry tagger, especially targeting certain shortcomings. We first investigate how using TBL to improve the accurate rendering of tokens' font style affects the rule-based tagging accuracy.</Paragraph>
    <Paragraph position="5"> We then apply TBL to the tags of the tokens. In our experiments with two dictionaries, the range of font style accuracies is increased from 84%-94% to 97%-98%, and the range of tagging accuracies is increased from 83%-90% to 93%-94% for tokens, and from 64%-83% to 90%-93% for phrases. [Footnote 4, truncated in the source: ... more than eight pages that took around three hours of human effort.]</Paragraph>
    <Paragraph position="6"> Section 2 discusses the rule-based entry tagging method. In Section 3, we briefly describe TBL, and Section 4 recounts how we apply TBL to improve the performance of the rule-based method.</Paragraph>
    <Paragraph position="7"> Section 5 explains the experiments and results, and we conclude with future work.</Paragraph>
  </Section>
</Paper>