<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2018"> <Title>A Maximum-Entropy-Inspired Parser *</Title> <Section position="6" start_page="135" end_page="137" type="evalu"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> In the previous sections we have concentrated on the relation of the parser to a maximum-entropy approach, the aspect of the parser that is most novel. However, we do not think this aspect is the sole or even the most important reason for its comparative success. Here we list what we believe to be the most significant contributions and give some experimental results on how well the program behaves without them.</Paragraph> <Paragraph position="1"> We take as our starting point the parser labeled Char97 in Figure 1 [5], as that is the program from which our current parser derives.</Paragraph> <Paragraph position="2"> That parser, as stated in Figure 1, achieves an average precision/recall of 87.5%. As noted in [5], that system is based upon a "tree-bank grammar" -- a grammar read directly off the training corpus. This is in contrast to the "Markov-grammar" approach used in the current parser. Also, the earlier parser uses two techniques not employed in the current parser. First, it uses a clustering scheme on words to give the system a "soft" clustering of heads and sub-heads. (It is "soft" clustering in that a word can belong to more than one cluster with different weights -- the weights express the probability of producing the word given that one is going to produce a word from that cluster.) Second, Char97 uses unsupervised learning in that the original system was run on about thirty million words of unparsed text, the output was taken as "correct", and statistics were collected on the resulting parses. Without these enhancements Char97 performs at the 86.6% level for sentences of length ≤ 40.</Paragraph> <Paragraph position="3"> In this section we evaluate the effects of the various changes we have made by running various versions of our current program. To avoid repeated evaluations based upon the testing corpus, our evaluation here is based upon sentences of length ≤ 40 from the development corpus. We note that this corpus is somewhat more difficult than the "official" test corpus. For example, the final version of our system achieves an average precision/recall of 90.1% on the test corpus but an average precision/recall of only 89.7% on the development corpus. This is indicated in Figure 2, where the model labeled "Best" has precision of 89.8% and recall of 89.6% for an average of 89.7%, 0.4% lower than the results on the official test corpus. This is in accord with our experience that development-corpus results are from 0.3% to 0.5% lower than those obtained on the test corpus.</Paragraph> <Paragraph position="4"> The model labeled "Old" attempts to recreate the Char97 system using the current program.</Paragraph> <Paragraph position="5"> It makes no use of special maximum-entropy-inspired features (though their presence made it much easier to perform these experiments), it does not guess the pre-terminal before guessing the lexical head, and it uses a tree-bank grammar rather than a Markov grammar. This parser achieves an average precision/recall of 86.2%. This is consistent with the average precision/recall of 86.6% for [5] mentioned above, as the latter was measured on the test corpus and the former on the development corpus.</Paragraph>
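For concreteness, the precision/recall figures quoted throughout this section (and in Figure 2) are labeled-constituent measures, and "average precision/recall" is simply the mean of the two, as the Best-model numbers above show. The following sketch is our own illustration of how such scores are computed from sets of labeled spans; it is not the evaluation code used for the results reported here, and the span representation is an assumption made for the example.

```python
# Illustrative sketch (not the paper's evaluation code): labeled-constituent
# precision/recall over (label, start, end) spans, plus their average as
# reported in this section.

def prf(gold_trees, test_trees):
    """gold_trees/test_trees: lists of sets of (label, start, end) tuples."""
    matched = gold_total = test_total = 0
    for gold, test in zip(gold_trees, test_trees):
        matched += len(gold & test)   # constituents with the same label and span
        gold_total += len(gold)
        test_total += len(test)
    precision = matched / test_total
    recall = matched / gold_total
    return precision, recall, (precision + recall) / 2.0

# Toy example: one sentence whose proposed parse recovers 2 of 3 gold constituents.
gold = [{("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)}]
test = [{("S", 0, 5), ("NP", 0, 2), ("NP", 3, 5)}]
print(prf(gold, test))   # (0.667, 0.667, 0.667)
```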
<Paragraph position="6"> Between the Old model and the Best model, Figure 2 gives precision/recall measurements for several different versions of our parser. One of the first, and without doubt the most significant, changes we made in the current parser was to move from two stages of probabilistic decisions at each node to three. As already noted, Char97 first guesses the lexical head of a constituent and then, given the head, guesses the PCFG rule used to expand the constituent in question.</Paragraph> <Paragraph position="7"> In contrast, the current parser first guesses the head's pre-terminal, then the head, and then the expansion. It turns out that the usefulness of this process had already been discovered by Collins [10], who in turn notes (personal communication) that it was previously used by Eisner [12]. However, Collins in [10] does not stress the decision to guess the head's pre-terminal first, and it might be lost on the casual reader. Indeed, it was lost on the present author until he went back after the fact and found it there. In Figure 2 we show that this one factor improves performance by nearly 2%.</Paragraph> <Paragraph position="8"> It may not be obvious why this should make so great a difference, since most words are effectively unambiguous. (For example, part-of-speech tagging using the most probable pre-terminal for each word is 90% accurate [8].) We believe that two factors contribute to this performance gain. The first is simply that if we first guess the pre-terminal, then when we go to guess the head the first thing we can condition upon is the pre-terminal, i.e., we compute p(h | t). This quantity is a relatively intuitive one (it is, for example, the quantity used in a PCFG to relate words to their pre-terminals), and it seems particularly good to condition upon here since we use it, in effect, as the unsmoothed probability upon which all smoothing of p(h) is based. This one "fix" makes slightly over a percent difference in the results.</Paragraph> <Paragraph position="9"> The second major reason why first guessing the pre-terminal makes so much difference is that it can be used when backing off from the lexical head in computing the probability of the rule expansion. For example, because the pre-terminal is guessed before the lexical head, we can move from computing p(r | l, lp, h) to p(r | l, t, lp, h). So, e.g., even if the word "conflating" does not appear in the training corpus (and it does not), the "ng" ending allows our program to guess with relative security that the word has the vbg pre-terminal, and thus the probability of various rule expansions can be considerably sharpened. For example, the tree-bank PCFG probability of the rule "vp → vbg np" is 0.0145, whereas once we condition on the fact that the lexical head is a vbg we get a probability of 0.214.</Paragraph>
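To make the three-stage decision and the role of the guessed pre-terminal in backing off from the lexical head concrete, here is a minimal sketch. It is illustrative only: the count keys, helper names, fixed interpolation weights, and toy numbers are our assumptions, and the real parser uses the maximum-entropy-inspired smoothing described earlier rather than this simple linear interpolation.

```python
# Minimal sketch of the three probabilistic stages at a node:
#   (1) p(t | l, lp), (2) p(h | t, ...), (3) p(r | l, t, lp, h),
# with the pre-terminal t available as a backoff event when h is unseen.
from collections import Counter

count = Counter()   # counts would be collected from the training treebank

def rel_freq(num_key, den_key):
    den = count[den_key]
    return count[num_key] / den if den else 0.0

def p_tag(t, l, lp):
    # Stage 1 -- guess the head's pre-terminal first.
    return rel_freq(("tag", t, l, lp), ("node", l, lp))

def p_head(h, t, l, lp, lam=0.8):
    # Stage 2 -- the simple p(h | t) serves as the base distribution
    # toward which the more specific estimate is smoothed.
    specific = rel_freq(("head", h, t, l, lp), ("hctx", t, l, lp))
    base = rel_freq(("head0", h, t), ("tag0", t))
    return lam * specific + (1.0 - lam) * base

def p_expansion(r, l, t, lp, h, lam=0.7):
    # Stage 3 -- p(r | l, t, lp, h), backed off toward p(r | l, t, lp)
    # when the lexical head h is rare or unseen.
    with_head = rel_freq(("rule", r, l, t, lp, h), ("rctx", l, t, lp, h))
    no_head = rel_freq(("rule0", r, l, t, lp), ("rctx0", l, t, lp))
    return lam * with_head + (1.0 - lam) * no_head

# Toy demonstration of stage 3: "conflating" never occurs in training, so the
# head-specific estimate is 0, but once its pre-terminal is guessed to be vbg
# the backed-off term still strongly prefers "vp -> vbg np".
count[("rule0", "vp -> vbg np", "vp", "vbg", "s")] = 60
count[("rctx0", "vp", "vbg", "s")] = 100
print(p_expansion("vp -> vbg np", "vp", "vbg", "s", "conflating"))  # ≈ 0.18
```

With a context-dependent weight in place of the fixed lam, the second term is what allows an unseen head such as "conflating" to inherit the sharp statistics associated with the vbg pre-terminal, as in the 0.0145 versus 0.214 contrast above.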
<Paragraph position="10"> The second modification is the explicit marking of noun- and verb-phrase coordination. We have already noted the importance of conditioning on the parent label lp. So, for example, information about an np is conditioned on the parent -- e.g., an s, vp, pp, etc. Note that when an np is part of an np coordinate structure the parent will itself be an np, and similarly for a vp. But nps and vps can occur with np and vp parents in non-coordinate structures as well.</Paragraph> <Paragraph position="11"> For example, in the Penn Treebank a vp with both main and auxiliary verbs has the structure shown in Figure 3. Note that the subordinate vp has a vp parent.</Paragraph> <Paragraph position="12"> Thus np and vp parents of constituents are marked to indicate whether the parents are coordinate structures. A vp coordinate structure is defined here as a constituent with two or more vp children, one or more of the constituents comma, cc, and conjp (conjunctive phrase), and nothing else; coordinate np phrases are defined similarly. Something very much like this is done in [15]. As shown in Figure 2, conditioning on this information gives a 0.6% improvement. We believe that this is mostly due to improvements in guessing the sub-constituent's pre-terminal and head. Given that we are already at the 88% level of accuracy, we judge a 0.6% improvement to be very much worthwhile.</Paragraph> <Paragraph position="13"> Next we add the less obvious conditioning events noted in our previous discussion of the final model -- the grandparent label lg and the left-sibling label lb. When we do so using our maximum-entropy-inspired conditioning, we get another 0.45% improvement in average precision/recall, as indicated in Figure 2 on the line labeled "MaxEnt-Inspired". Note that we also tried including this information using a standard deleted-interpolation model. The results are shown in the line "Standard Interpolation". Including this information within a standard deleted-interpolation model causes a 0.6% decrease from the results using the less conventional model. Indeed, the resulting performance is worse than not using this information at all.</Paragraph> <Paragraph position="14"> Up to this point all the models considered in this section are tree-bank-grammar models.</Paragraph> <Paragraph position="15"> That is, the PCFG grammar rules are read directly off the training corpus. As already noted, our best model uses a Markov-grammar approach. As one can see in Figure 2, a first-order Markov grammar (with all the aforementioned improvements) performs slightly worse than the equivalent tree-bank-grammar parser.</Paragraph> <Paragraph position="16"> However, a second-order grammar does slightly better and a third-order grammar does significantly better than the tree-bank parser.</Paragraph> </Section> </Paper>