<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1088">
  <Title>Multi-Tagging for Lexicalized-Grammar Parsing</Title>
  <Section position="4" start_page="697" end_page="698" type="intro">
    <SectionTitle>
2 Maximum Entropy Tagging
</SectionTitle>
    <Paragraph position="0"> The tagger uses conditional probabilities of the form P(y|x) where y is a tag and x is a local context containing y. The conditional probabilities have the following log-linear form:</Paragraph>
    <Paragraph position="2"> where Z(x) is a normalisation constant which ensures a proper probability distribution for each context x.</Paragraph>
    <Paragraph position="3"> The feature functions fi(x,y) are binaryvalued, returning either 0 or 1 depending on the tag y and the value of a particular contextual predicate given the context x. Contextual predicates identify elements of the context which might be useful for predicting the tag. For example, the following feature returns 1 if the current word isthe and the tag is DT; otherwise it returns 0:</Paragraph>
    <Paragraph position="5"> (2) word(x) = the is an example of a contextual predicate. The POS tagger uses the same contextual predicates as Ratnaparkhi (1996); the supertagger adds contextual predicates corresponding to POS tags and bigram combinations of POS tags (Curran and Clark, 2003).</Paragraph>
    <Paragraph position="6"> Each feature fi has an associated weight li which is determined during training. The training processaimstomaximisetheentropyofthemodel subject to the constraints that the expectation of each feature according to the model matches the empirical expectation from the training data. This can be also thought of in terms of maximum likelihood estimation (MLE) for a log-linear model (Della Pietra et al., 1997). We use the L-BFGS optimisation algorithm (Nocedal and Wright, 1999; Malouf, 2002) to perform the estimation.</Paragraph>
    <Paragraph position="7"> MLE has a tendency to overfit the training data.</Paragraph>
    <Paragraph position="8"> We adopt the standard approach of Chen and Rosenfeld (1999) by introducing a Gaussian prior term to the objective function which penalises feature weights with large absolute values. A parameter defined in terms of the standard deviation of the Gaussian determines the degree of smoothing.</Paragraph>
    <Paragraph position="9"> The conditional probability of a sequence of tags, y1,...,yn, given a sentence, w1,...,wn, is defined as the product of the individual probabilities for each tag:</Paragraph>
    <Paragraph position="11"> where xi is the context for word wi. We use the standard approach of Viterbi decoding to find the highest probability sequence.</Paragraph>
    <Section position="1" start_page="697" end_page="698" type="sub_section">
      <SectionTitle>
2.1 Multi-tagging
</SectionTitle>
      <Paragraph position="0"> Multi-tagging -- assigning one or more tags to a word -- is used here in two ways: first, to retain ambiguity in the CCG lexical category sequence for the purpose of building parse structure; and second, to retain ambiguity in the POS tag sequence. We retain ambiguity in the lexical category sequence since a single-tagger is not accurate enoughtoserveasafront-endtoa CCG parser, and we retain some POS ambiguity since POS tags are used as features in the statistical models of the supertagger and parser.</Paragraph>
      <Paragraph position="1"> Charniak et al. (1996) investigated multi-POS tagging in the context of PCFG parsing. It was found that multi-tagging provides only a minor improvement in accuracy, with a significant loss in efficiency; hence it was concluded that, given the particular parser and tagger used, a single-tag POS tagger is preferable to a multi-tagger. More recently, Watson (2006) has revisited this question inthecontextofthe RASP parser(BriscoeandCarroll, 2002) and found that, similar to Charniak et al. (1996), multi-tagging at the POS level results in a small increase in parsing accuracy but at some cost in efficiency.</Paragraph>
      <Paragraph position="2"> For lexicalized grammars, such as CCG and TAG, the motivation for using a multi-tagger to assign the elementary structures (supertags) is more compelling. Since the set of supertags is typically much larger than a standard POS tag set, the tagging problem becomes much harder. In  fact, when using a state-of-the-art single-tagger, the per-word accuracy for CCG supertagging is so low (around 92%) that wide coverage, high accuracy parsing becomes infeasible (Clark, 2002; Clark and Curran, 2004a). Similar results have beenfoundforahighlylexicalized HPSG grammar (Prins and van Noord, 2003), and also for TAG.</Paragraph>
      <Paragraph position="3"> As far as we are aware, the only approach to successfully integrate a TAG supertagger and parser is the Lightweight Dependency Analyser of Bangalore (2000). Hence, in order to perform effective full parsing with these lexicalized grammars, the tagger front-end must be a multi-tagger (given the current state-of-the-art).</Paragraph>
      <Paragraph position="4"> The simplest approach to CCG supertagging is to assign all categories to a word which the word was seen with in the data. This leaves the parser the task of managing the very large parse space resulting from the high degree of lexical category ambiguity (Hockenmaier and Steedman, 2002; Hockenmaier, 2003). However, one of the original motivations for supertagging was to significantly reduce the syntactic ambiguity before full parsing begins (Bangalore and Joshi, 1999). Clark and Curran (2004a) found that performing CCG supertagging prior to parsing can significantly increase parsing efficiency with no loss in accuracy. Our multi-tagging approach follows that of Clark and Curran (2004a) and Charniak et al.</Paragraph>
      <Paragraph position="5"> (1996): assign all categories to a word whose probabilities are within a factor, b, of the probability of the most probable category for that word:</Paragraph>
      <Paragraph position="7"> Ci is the set of categories assigned to the ith word; Ci istherandomvariablecorrespondingtothecategory of the ith word; cmax is the category with the highest probability of being the category of the ith word; andS is the sentence. One advantage of this adaptive approach is that, when the probability of the highest scoring category is much greater than the rest, no extra categories will be added.</Paragraph>
      <Paragraph position="8"> Clark and Curran (2004a) propose a simple method for calculating P(Ci = c|S): use the word and POS features in the local context to calculate the probability and ignore the previously assigned categories (the history). However, it is possible to incorporate the history in the calculation of the tag probabilities. A greedy approach is to use the locally highest probability history as a feature, which avoids any summing over alternative histories. Alternatively, there is a well-known dynamic programming algorithm -- the forward backward algorithm -- which efficiently calculates P(Ci = c|S) (Charniak et al., 1996).</Paragraph>
      <Paragraph position="9"> The multitagger uses the following conditional probabilities:</Paragraph>
      <Paragraph position="11"> as a fixed category, whereas yj (j negationslash= i) varies over the possible categories for word j. In words, the probability of category yi, given the sentence, is the sum of the probabilities of all sequences containing yi. This sum is calculated efficiently using the forward-backward algorithm:</Paragraph>
      <Paragraph position="13"> where ai(c) is the total probability of all the category sub-sequences that end at position i with category c; and bi(c) is the total probability of all the category sub-sequences through to the end which start at position i with category c.</Paragraph>
      <Paragraph position="14"> The standard description of the forward-backward algorithm, for example Manning and Schutze (1999), is usually given for an HMM-style tagger. However, it is straightforward to adapt the algorithm to the Maximum Entropy models used here. The forward-backward algorithm we use is similar to that for a Maximum Entropy Markov Model (Lafferty et al., 2001).</Paragraph>
      <Paragraph position="15"> POS tags are very informative features for the supertagger, which suggests that using a multi-POS tagger may benefit the supertagger (and ultimately the parser). However, it is unclear whether multi-POS tagging will be useful in this context, since our single-tagger POS tagger is highly accurate: over 97% for WSJ text (Curran and Clark, 2003). In fact, in Clark and Curran (2004b) we report that using automatically assigned, as opposed to gold-standard, POS tags as features results in a 2% loss in parsing accuracy. This suggests that retaining some ambiguity in the POS sequence may be beneficial for supertagging and parsing accuracy. In Section 4 we show this is the case for supertagging.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>