File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-3237_intro.xml
Size: 6,153 bytes
Last Modified: 2025-10-06 14:02:50
<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3237">
<Title>Adaptation of Maximum Entropy Capitalizer: Little Data Can Help a Lot</Title>
<Section position="3" start_page="0" end_page="3" type="intro">
<SectionTitle> 2 Capitalization as Sequence Tagging </SectionTitle>
<Paragraph position="0"> Automatic capitalization can be seen as a sequence tagging problem: each lowercase word receives a tag that describes its capitalization form. Similar to the work in (Lita et al., 2003), we tag each word in a sentence with one of the following tags:
* LOC lowercase
* CAP capitalized
* MXC mixed case; no further guess is made as to the capitalization of such words. One possibility is to use the most frequent form encountered in the training data.
* AUC all upper case
* PNC punctuation; we decided to have a separate tag for punctuation since it is quite frequent and models the syntactic context well in a parsimonious way.</Paragraph>
<Paragraph position="1"> For training a given capitalizer, one needs to convert running text into uniform case text accompanied by the above capitalization tags. For example, PrimeTime continues on ABC .PERIOD Now ,COMMA from Los Angeles ,COMMA</Paragraph>
<Paragraph position="2"> becomes primetime_MXC continues_LOC on_LOC abc_AUC .period_PNC now_CAP ,comma_PNC from_LOC los_CAP angeles_CAP ,comma_PNC</Paragraph>
<Paragraph position="3"> The text is assumed to be already segmented into sentences. Any sequence labeling algorithm can then be trained for tagging lowercase word sequences with capitalization tags.</Paragraph>
<Paragraph position="4"> At test time, the uniform case text to be capitalized is first segmented into sentences, after which each sentence is tagged.</Paragraph>
<Section position="1" start_page="1" end_page="2" type="sub_section">
<SectionTitle> 2.1 1-gram capitalizer </SectionTitle>
<Paragraph position="0"> A widespread algorithm used for capitalization is the 1-gram tagger: for every word in a given vocabulary (usually large, 100k words or more), use the most frequent tag encountered in a large amount of training data. As a special case for automatic capitalization, the most frequent tag for the first word in a sentence is overridden by CAP, thus capitalizing on the fact that the first word in a sentence is most likely to be capitalized. A small illustrative sketch of this baseline is given at the end of this subsection.</Paragraph>
<Paragraph position="2"> Unlike the training phase, the sentence segmenter at test time is assumed to operate on uniform case text.</Paragraph>
<Paragraph position="3"> As with everything in natural language, it is not hard to find exceptions to this "rule".</Paragraph>
<Paragraph position="4"> Due to its popularity, both our work and that of (Lita et al., 2003) use the 1-gram capitalizer as a baseline. The work in (Kim and Woodland, 2004) indicates that the same 1-gram algorithm is used in Microsoft Word 2000, and it is consequently used as a baseline for evaluating the performance of their algorithm as well.</Paragraph>
</Section>
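As a concrete illustration of the data preparation in Section 2 and of the 1-gram baseline above, here is a minimal Python sketch. It is not the authors' implementation: the tag_of heuristic, the class name UnigramCapitalizer, and the handling of unseen words (defaulting to LOC) are our own illustrative choices.

```python
# Minimal sketch of the capitalization-tag conversion (Section 2) and the
# 1-gram capitalizer baseline (Section 2.1). Illustrative only.
from collections import Counter, defaultdict

def tag_of(token: str) -> str:
    """Map a cased token to one of the tags LOC, CAP, MXC, AUC, PNC."""
    if not token[0].isalnum():
        return "PNC"                     # punctuation tokens such as .PERIOD or ,COMMA
    if token.islower():
        return "LOC"
    if token.isupper():
        return "AUC"
    if token[0].isupper() and token[1:].islower():
        return "CAP"
    return "MXC"                         # mixed case, e.g. PrimeTime

def to_training_pairs(sentence):
    """Convert a cased sentence into (lowercased word, capitalization tag) pairs."""
    return [(tok.lower(), tag_of(tok)) for tok in sentence]

class UnigramCapitalizer:
    """1-gram baseline: most frequent tag per word, with CAP forced on the first word."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, sentences):
        for sent in sentences:
            for word, tag in to_training_pairs(sent):
                self.counts[word][tag] += 1

    def tag(self, lowercased_sentence):
        tags = []
        for i, word in enumerate(lowercased_sentence):
            # most frequent tag seen in training; unseen words default to LOC here
            tag = self.counts[word].most_common(1)[0][0] if word in self.counts else "LOC"
            if i == 0:
                tag = "CAP"              # first word of a sentence is overridden to CAP
            tags.append(tag)
        return tags

# The example sentence from Section 2:
train = [["PrimeTime", "continues", "on", "ABC", ".PERIOD",
          "Now", ",COMMA", "from", "Los", "Angeles", ",COMMA"]]
capitalizer = UnigramCapitalizer()
capitalizer.train(train)
print(capitalizer.tag(["now", ",comma", "from", "los", "angeles"]))
# -> ['CAP', 'PNC', 'LOC', 'CAP', 'CAP']
```

A real system would also need a policy for restoring the surface form of MXC words (e.g., the most frequent form seen in training), which the sketch omits.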
<Section position="2" start_page="2" end_page="3" type="sub_section">
<SectionTitle> 2.2 Previous Work </SectionTitle>
<Paragraph position="0"> We share the view of capitalization as sequence tagging with (Lita et al., 2003). In their approach, a language model is built on (word, tag) pairs and then used to disambiguate over all possible tag assignments to a sentence using dynamic programming techniques.</Paragraph>
<Paragraph position="1"> The same idea is explored in (Kim and Woodland, 2004) in the larger context of automatic punctuation generation and capitalization from speech recognition output. A second approach they consider for capitalization is the use of a rule-based tagger as described by (Brill, 1994), which they show to outperform the case-sensitive language modeling approach and to be quite robust to speech recognition errors and punctuation generation errors.</Paragraph>
<Paragraph position="2"> Departing from their work, our approach builds on a standard technique for sequence tagging, namely MEMMs, which has been successfully applied to part-of-speech tagging (Ratnaparkhi, 1996). The MEMM approach models the tag sequence T conditionally on the word sequence W, which has a few substantial advantages over the 1-gram tagging approach:
* discriminative training of the probability model P(T|W) using conditional maximum likelihood is well correlated with tagging accuracy
* ability to use a rich set of word-level features in a parsimonious way: sub-word features such as prefixes and suffixes, as well as future words (relative to the current word, whose tag is assigned a probability value by the MEMM), are easily incorporated in the probability model
* no concept of "out-of-vocabulary" word: sub-word features are very useful in dealing with words not seen in the training data
* ability to integrate rich contextual features into the model
More recently, certain drawbacks of MEMM models have been addressed by the conditional random field (CRF) approach (Lafferty et al., 2001), which slightly outperforms MEMMs on a standard part-of-speech tagging task. In a similar vein, the work of (Collins, 2002) explores the use of discriminatively trained HMMs for sequence labeling problems, a fair baseline for such cases that is often overlooked in favor of the inadequate maximum likelihood HMMs.</Paragraph>
<Paragraph position="4"> Our work on adapting the MEMM model parameters using MAP smoothing builds on the Gaussian prior model used for smoothing MaxEnt models, as presented in (Chen and Rosenfeld, 2000). We are not aware of any previous work on MAP adaptation of MaxEnt models using a prior, be it Gaussian or a different one, such as the exponential prior of (Goodman, 2004). Although we do not have a formal derivation, the adaptation technique should extend easily to the CRF scenario.</Paragraph>
<Paragraph position="5"> A final remark contrasts rule-based approaches to sequence tagging such as (Brill, 1994) with the probabilistic approach taken in (Ratnaparkhi, 1996): having a weight on each feature in the MaxEnt model and a sound probabilistic model allows for a principled way of adapting the model to a new domain; how to perform such adaptation in a rule-based model is unclear, if it is possible at all.</Paragraph>
</Section>
</Section>
</Paper>
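As a compact reference for the MEMM and MAP-smoothing discussion in Section 2.2, the following equations sketch the MEMM factorization and a MaxEnt training criterion with a Gaussian prior in the spirit of (Chen and Rosenfeld, 2000). The notation is ours, not quoted from the paper: Phi(W, i) stands for the feature context extracted around position i, lambda_j^0 for the parameters of a background model (setting lambda_j^0 = 0 recovers ordinary smoothing), and sigma_j^2 for the prior variances; the exact form of the adaptation objective used in this paper may differ.

```latex
% Illustrative sketch; the notation is ours.
\begin{align}
  % MEMM: the tag sequence is modeled conditionally on the word sequence,
  % one tag at a time, given the previous tag and a feature context
  P(T \mid W) &= \prod_{i=1}^{n} P\bigl(t_i \mid t_{i-1}, \Phi(W, i)\bigr) \\
  % each local model is a MaxEnt (log-linear) model over features f_j
  P(t \mid x) &= \frac{\exp\bigl(\sum_j \lambda_j f_j(x, t)\bigr)}
                      {\sum_{t'} \exp\bigl(\sum_j \lambda_j f_j(x, t')\bigr)} \\
  % conditional maximum likelihood with a Gaussian prior on the weights;
  % centering the prior on background parameters \lambda_j^0 gives a MAP
  % adaptation criterion, while \lambda_j^0 = 0 is ordinary smoothing
  \mathcal{L}(\Lambda) &= \sum_{i} \log P_\Lambda(t_i \mid x_i)
    \;-\; \sum_j \frac{(\lambda_j - \lambda_j^0)^2}{2 \sigma_j^2}
\end{align}
```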