<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1046">
  <Title>MEASURES AND MODELS FOR PHRASE RECOGNITION</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Entropy measures of parser performance have focussed on the parser's contribution to word prediction. This is appropriate for evaluating a parser as a language model for speech recognition, but it is less appropriate for evaluating how well a parser does at parsing. I would like to present an entropy measure for phrase recognition, along with closely-related measures of precision and recall. I consider a seres of models, in order to establish a baseline for performance, and to give some sense of what parts of the problem are hardest, and what kinds of information contribute most to a solution.</Paragraph>
    <Paragraph position="1"> Specifically, I consider the problem of recognizing chunks (Abney 1991)--non-recursive pieces of major-category phrases, omitting post-head complements and modifiers.</Paragraph>
    <Paragraph position="2"> Chunks correspond to prosodic phrases (Abney 1992) and can be assembled into complete parse trees by adding head-head dependencies.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="233" type="metho">
    <SectionTitle>
2. THE PARSING PROBLEM
</SectionTitle>
    <Paragraph position="0"> Parsing is usually characterized as the problem of recovering parse trees for sentences, given a grammar that defines the mapping of sentences to parse-trees. However, I wish to characterize the problem without assuming a grammar, for two reasons. First, we cannot assume a grammar for unrestricted English. For unrestricted English, failure of coverage will be a significant problem for any grammar, and we would like a measure of performance that treats failure of coverage and failures within the grammar uniformly.</Paragraph>
    <Paragraph position="1"> Second, I am particularly interested in parsers like Fidditch (Hindle 1983) and Cass (Abney 1990) that avoid search by relying on highly reliable patterns for recognizing individual phrases. Such parsers may need to consider competing patterns when scoring a given pattern--for example, Cass relies heavily on a preference for the pattern that matches the longest prefix of the input. Such crosspattern dependencies cannot be expressed within, for example, a stochastic context-free grammar (SCFG).</Paragraph>
    <Paragraph position="2"> Hence I am interested in a more general evaluation framework, one that subsumes both Fidditch/Cass-style parsers and SCFG parsing.</Paragraph>
    <Paragraph position="3"> Instead of assuming a grammar, I take the Penn Treebank (Marcus &amp; Santorini 1991) to provide a representative sample of English, viewed as a function from sentences to parse trees. A parser's task is to statistically approximate that function. We can measure the (in)accuracy of the parser by the amount of additional information we must provide in order to specify the correct (Treebank) parse for a sentence, given the output of the parser. This is the entropy of the corpus given the parser, and approaches zero as the parser approaches perfect emulation of Treebank annotation.</Paragraph>
    <Paragraph position="4"> We can characterize the parser's task at two levels of granularity. At the level of the sentence, the task is to assign a probability distribution over the set of possible parse-trees for the sentence. At the phrase level, the problem is to give, for each candidate phrase c, the probability that c belongs to the correct parse. I will focus on the latter characterization, for several reasons: (1) as mentioned, I am interested in developing reliable patterns for recognizing individual phrases, in order to reduce the necessity for search and to increase parsing speed, (2) evaluating at the phrase level allows us to assign blame for error at a finer grain, (3) there are applications such as data extraction where we may have good models for certain phrase types, but not for entire sentences, and (4) a phrase model can easily be embedded in a sentence model, so evaluating at the finer grain does not exclude evaluation at the coarser grain.</Paragraph>
  </Section>
  <Section position="5" start_page="233" end_page="233" type="metho">
    <SectionTitle>
3. MEASURES
</SectionTitle>
    <Paragraph position="0"> Given a sentence, the chunk candidates are all tuples c = (x,id), for x a syntactic category, and i andj the start and end positions of the chunk. For each candidate, there are two possible events in the Treebank: the candidate is indeed a phrase in the Treebank parse (T), or it is not a true phrase (~T). For each candidate, the parsing model provides P(TIc), the probability of the candidate being a true phrase, and P(~TIc) = 1 - P(TIc).</Paragraph>
    <Paragraph position="1"> Given the probabilities provided by the parsing model, the information that must be provided to specify that T occurs (that the candidate is a true phrase) is -lg P(TIc); and to specify that ~T occurs, -lg P(~TIc). The entropy of the corpus given the model is the average -lg P(Eclc), for Ec being T or ~T according as candidate c does or does not appear in the Treebank parse. That is,</Paragraph>
    <Paragraph position="3"> A perfect model would have P(Eclc) = 1 for all c, hence H = 0. At the other extreme, a 'random-guess' model would have P(Eclc) = 1/2 for all c, hence H = 1 bit/candidate (b/c).</Paragraph>
    <Paragraph position="4"> This provides an upper bound on H, in the sense that any model that has H &gt; 1 b/c can be changed into a model with H &lt; 1 by systematically interchanging P(TIc) and P(~TIc).</Paragraph>
    <Paragraph position="5"> Hence, for all models, 0 _&lt; H _&lt; 1 b/c.</Paragraph>
    <Paragraph position="6"> There are some related measures of interest. We can translate entropy into an equivalent number of equallylikely parses (perplexity) by the relation:</Paragraph>
    <Paragraph position="8"> for H in bits/candidate and a the number of candidates per sentence. In the test corpus I used, a = 8880, so PP ranges from 1 to 28880 = 102670.</Paragraph>
    <Paragraph position="9"> We can also measure expected precision and recall, by considering P(TIc) as a probabilistic 'Yes' to candidate c. For example, if the model says P(TIc) = 3/4, that counts as 3/4 of a 'Yes'. Then the expected number of Yes's is the sum of P(TIc) over all candidates, and the expected number of correct Yes's is the sum of P(TIc) over candidates that are true chunks. From that and the number of true chunks, which can simply be counted, we can compute precision and recall:</Paragraph>
    <Paragraph position="11"/>
  </Section>
  <Section position="6" start_page="233" end_page="235" type="metho">
    <SectionTitle>
4. MODELS
</SectionTitle>
    <Paragraph position="0"> To establish a baseline for performance, and to determine how much can be accomplished with 'obvious', easilyacquired information, I consider a series of models. Model 0 is a zero-parameter, random-guess model; it establishes a lower bound on performance. Model 1 estimates one parameter, the proportion of true chunks among candidates.</Paragraph>
    <Paragraph position="1"> Model XK takes the category and length of candidates into account. Model G induces a simple grammar from the training corpus. Model C considers a small amount of context. And model S is a sentence-level model based on G.</Paragraph>
    <Paragraph position="2"> 4.1. Models 0 and 1 Models 0 and 1 take P(TIc) to be constant. Model 0 (the random-guess model) takes P(T) = 1/2, and provides a lower bound on performance. Model 1 (the one-parameter model) estimates P(T) as the proportion of true chunks among candidates in a training corpus. The training corpus I used consists of 1706 sentences, containing 19,025 true chunks (11.2 per sentence), and 14,442,484 candidates (8470 per sentence). The test corpus consisted of 1549 sentences, 17,676 true chunks (11.4 per sentence), and  For these two models (in fact, for any model with P(TIc) constan0, precision is at a minimum, and equals the proportion of true chunks in the test corpus. Recall is uninformative, being equal to P(TIc).</Paragraph>
    <Paragraph position="3"> 4.2. Model XK Model XK is motivated by the observation that very long chunks are highly unlikely. It takes P(TIc) = P(TIx,k), for x the category of c and k its length. It estimates P(TIx, k) as the proportion of true chunks among candidates of category x and length k in the training corpus. As expected, this model does better than the previous ones:</Paragraph>
    <Section position="1" start_page="233" end_page="234" type="sub_section">
      <SectionTitle>
4.3. Models G and C
</SectionTitle>
      <Paragraph position="0"> For model G, I induced a simple grammar from the training corpus. I used Ken Church's tagger (Church 1988) to  assign part-of-speech probabilities to words. The grammar contains a rule x ---&gt; T for every Treebank chunk \[x &amp;quot;t\] in the training corpus. (x is the syntactic category of the chunk, and y is the part-of-speech sequence assigned to the words of the chunk.) Ix V\] is counted as being observed P(y) times, for P('t) the probability of assigning the part-of-speech sequence y to the words of the chunk. I used a second corpus to estimate P(TIx,Y=) for each rule in the grammar, by counting the proportion of true phrases among candidates of form Ix Y\]. For candidates that matched no rule, I estimated the probabilities P(TIx, k) as in the XK model.</Paragraph>
      <Paragraph position="1"> Model C is a variant of model G, in which a small amount of context, namely, the following part of speech, is also taken into account.</Paragraph>
      <Paragraph position="2"> The results on the test corpus are as follows:</Paragraph>
    </Section>
    <Section position="2" start_page="234" end_page="234" type="sub_section">
      <SectionTitle>
4.4. Assigning Blame
</SectionTitle>
      <Paragraph position="0"> We can make some observations about the sources of For example, we can break out entropy by entropy.</Paragraph>
      <Paragraph position="1">  accounted for by candidates of the given category. In the second column, I have subtracted the amount we would have expected if entropy were divided among candidates without regard to category. The results clearly confirm our intuitions that, for example, noun phrases are more difficult to recognize than verb clusters, and that the Null category, consisting mostly of punctuation and connectives, is easy to recognize.</Paragraph>
      <Paragraph position="2"> We can also break out entropy among candidates covered by the grammar, and those not covered by the grammar. The usual measure of grammar coverage is simply the proportion of true chunks covered, but we can more accurately determine how much of a problem coverage is by measuring how much we stand to gain by improving coverage, versus how much we stand to gain by improving our model of covered candidates. On our test corpus, only 4% of the candidates are uncovered by the grammar, but 19% of the information cost (entropy) is due to uncovered candidates.</Paragraph>
    </Section>
    <Section position="3" start_page="234" end_page="235" type="sub_section">
      <SectionTitle>
4.5. Model S
</SectionTitle>
      <Paragraph position="0"> None of the models discussed so far take into account the constraint that the set of true chunks must partition the sentence. Now, if a perfect sentence model exists--if an algorithm exists that assigns to each sentence its Treebank parse---then a perfect phrase model also exists. And to the extent that a model uses highly reliably local patterns (as I would like), little information is lost by not evaluating at the sentence level. But for other phrase-level models, such as those considered here, embedding them in a sentence-level model can significantly improve performance.</Paragraph>
      <Paragraph position="1"> Model S is designed to gauge how much information is lost in model G by not evaluating parses as a whole. It uses model O's assignments of probabilities P(TIc) for individual candidates as the basis for assigning probabilities P(s) to entire parses, that is, to chunk-sequences s that cover the entire sentence.</Paragraph>
      <Paragraph position="2"> To choose a sequence of chunks stochastically, we begin with s = the null sequence at position i = 0. We choose from among the candidates at position i, taking the probability P(c) of choosing candidate c to be proportional to P(TIc). The chosen chunk c is appended to s, and the current position i is advanced to the end position of c. We iterate to the end of the sentence. In brief:</Paragraph>
      <Paragraph position="4"> The entropy of a sentence given the model is -lg P(s), for s the true sequence of chunks. We can also compute actual (not expected) precision and recall by counting the true chunks in the most-likely parse according to the model.</Paragraph>
      <Paragraph position="5"> The results on the test corpus are: M b/s present P~dsion Recall S 14.1 104 74.1% 75.6% (By way of comparison, the bits/sentence numbers for the other models are as follows:) 0 1 XK G C S 8880 126 70.6 33.8 29.8 14.1 For model S, the number of parses per sentence is still rather high, but the precision and recall are surprisingly  good, given the rudimentary information that the model takes into account. I think there is cause for optimism that the chunk recognition problem can be solved in the near term, using models that take better account of context and word-level information.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>