<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1033">
  <Title>Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Almost all approaches to sequence problems such as part-of-speech tagging take a unidirectional approach to conditioning inference along the sequence. Regardless of whether one is using HMMs, maximum entropy conditional sequence models, or other techniques like decision trees, most systems work in one direction through the sequence (normally left to right, but occasionally right to left, e.g., Church (1988)). There are a few exceptions, such as Brill's transformation-based learning (Brill, 1995), but most of the best known and most successful approaches of recent years have been unidirectional.</Paragraph>
    <Paragraph position="1"> Most sequence models can be seen as chaining together the scores or decisions from successive local models to form a global model for an entire sequence. Clearly the identity of a tag is correlated with both past and future tags' identities. However, in the unidirectional (causal) case, only one direction of influence is explicitly considered at each local point. For example, in a left-to-right first-order HMM, the current tag t0 is predicted based on the previous tag t-1 (and the current word).[1] The backward interaction between t0 and the next tag t+1 shows up implicitly later, when t+1 is generated in turn. While unidirectional models are therefore able to capture both directions of influence, there are good reasons for suspecting that it would be advantageous to make information from both directions explicitly available for conditioning at each local point in the model: (i) because of smoothing and interactions with other modeled features, terms like P(t0|t+1,...) might give a sharp estimate of t0 even when terms like P(t+1|t0,...) do not, and (ii) jointly considering the left and right context together might be especially revealing. In this paper we exploit this idea, using dependency networks, with a series of local conditional loglinear (aka maximum entropy or multiclass logistic regression) models as one way of providing efficient bidirectional inference.</Paragraph>
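To make the local bidirectional conditioning concrete, the following is a minimal Python sketch of one such local conditional loglinear model, in which a candidate tag t0 is scored from features of the current word and of both the left tag t-1 and the right tag t+1, so that evidence from both directions is explicitly available at a single position. The tag set, feature templates, and weights are invented for illustration and are not the model described in this paper.

```python
import math
from collections import defaultdict

# Toy local loglinear (multiclass logistic regression) model: score each
# candidate tag t0 from features of its context. Tags, templates, and
# weights are illustrative only.

TAGS = ["DT", "NN", "VB"]

def features(word, left_tag, right_tag, t0):
    """Feature templates conditioning on BOTH the left and right tags."""
    return [
        f"word={word}+t0={t0}",
        f"left={left_tag}+t0={t0}",                    # t-1 interaction
        f"right={right_tag}+t0={t0}",                  # t+1 interaction
        f"left={left_tag}+right={right_tag}+t0={t0}",  # joint left/right context
    ]

def local_distribution(weights, word, left_tag, right_tag):
    """P(t0 | t-1, t+1, w0) under the local loglinear model."""
    scores = {t: sum(weights[f] for f in features(word, left_tag, right_tag, t))
              for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}

# Hypothetical learned weights: a following noun pulls an ambiguous word such
# as "that" toward a determiner-like tag, evidence a purely left-to-right
# local model would not condition on explicitly.
weights = defaultdict(float, {
    "right=NN+t0=DT": 2.0,
    "word=that+t0=DT": 0.3,
    "word=that+t0=NN": 0.2,
})

print(local_distribution(weights, word="that", left_tag="VB", right_tag="NN"))
```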
    <Paragraph position="2"> Secondly, while all taggers use lexical information, and, indeed, it is well-known that lexical probabilities are much more revealing than tag sequence probabilities (Charniak et al., 1993), most taggers make quite limited use of lexical probabilities (compared with, for example, the bilexical probabilities commonly used in current statistical parsers). While modern taggers may be more principled than the classic CLAWS tagger (Marshall, 1987), they are in some respects inferior in their use of lexical information: CLAWS, through its IDIOMTAG module, categorically captured many important, correct taggings of frequent idiomatic word sequences. In this work, we incorporate appropriate multiword feature templates so that such facts can be learned and used automatically by the model. [1] Rather than subscripting all variables with a position index, we use a hopefully clearer relative notation, where t0 denotes the current position and t-n and t+n are left and right context tags, and similarly for words.</Paragraph>
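As an illustration of what a multiword feature template can capture, the sketch below (the template names and window size are invented for the example, not the paper's actual template set) conjoins neighboring words into single features, so that a frequent idiomatic sequence such as "in spite of" can be memorized by the learner rather than hand-coded as in CLAWS's IDIOMTAG module.

```python
# Minimal multiword feature templates (hypothetical names, not the paper's
# exact set): conjoined word n-grams around the tagging position.

def multiword_features(words, i):
    """Feature templates for tagging position i, including multiword context."""
    w0 = words[i]
    w_prev = words[i - 1] if i > 0 else "<S>"
    w_next = words[i + 1] if i + 1 < len(words) else "</S>"
    return [
        f"w0={w0}",
        f"w-1={w_prev}",
        f"w+1={w_next}",
        f"w-1,w0={w_prev}_{w0}",                # bigram to the left
        f"w0,w+1={w0}_{w_next}",                # bigram to the right
        f"w-1,w0,w+1={w_prev}_{w0}_{w_next}",   # three-word window
    ]

# "spite" almost always occurs inside "in spite of"; the three-word template
# lets a learner attach the correct tagging to that whole context.
print(multiword_features("ran in spite of the rain".split(), 2))
```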
    <Paragraph position="4"> [Figure caption fragment: "... first-order CMM, (b) the (reversed) right-to-left CMM, and (c) the bidirectional dependency network."]</Paragraph>
    <Paragraph position="6"> Having expressive templates leads to a large number of features, but we show that by suitable use of a prior (i.e., regularization) in the conditional loglinear model (something not used by previous maximum entropy taggers), many such features can be added with an overall positive effect on the model. Indeed, as for the voted perceptron of Collins (2002), we can get performance gains by reducing the support threshold for features to be included in the model. Combining all these ideas, together with a few additional handcrafted unknown word features, gives us a part-of-speech tagger with a per-position tag accuracy of 97.24%, and a whole-sentence correct rate of 56.34% on Penn Treebank WSJ data. This is the best automatically learned part-of-speech tagging result known to us, representing an error reduction of 4.4% on the model presented in Collins (2002), using the same data splits, and a larger error reduction of 12.1% from the more similar best previous loglinear model in Toutanova and Manning (2000).</Paragraph>
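As a rough sketch of the role the prior plays, the following Python fragment writes down a penalized training objective for a conditional loglinear model: the negative log-likelihood of the gold tags plus a quadratic (Gaussian-prior) penalty on the weights. The array shapes, the sigma value, and the random data are illustrative only and do not reflect the actual feature set or training regime reported here; the point is that the penalty term is what makes very large feature sets usable without overfitting.

```python
import numpy as np

def objective(W, X, y, sigma=1.0):
    """Penalized negative log-likelihood for a multiclass loglinear model.

    W: (num_tags, num_features) weight matrix
    X: (num_examples, num_features) feature vectors for local decisions
    y: (num_examples,) gold tag indices
    sigma: Gaussian prior scale (illustrative value)
    """
    scores = X @ W.T                                    # (num_examples, num_tags)
    log_z = np.logaddexp.reduce(scores, axis=1)         # log partition per example
    log_probs = scores[np.arange(len(y)), y] - log_z    # log P(gold tag | context)
    gaussian_prior = np.sum(W ** 2) / (2 * sigma ** 2)  # L2 penalty from the prior
    return -np.sum(log_probs) + gaussian_prior

# Tiny random example: 5 local decisions, 4 features, 3 candidate tags.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
y = np.array([0, 2, 1, 0, 2])
W = np.zeros((3, 4))
print(objective(W, X, y))
```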
  </Section>
</Paper>