<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0301"> <Title>A. Linear Observed Time Statistical Parser Based on Maximum Entropy Models</Title> <Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 2 Procedures for Building Trees </SectionTitle> <Paragraph position="0"> The parser uses four procedures, TAG, CHUNK, BUILD, and CHECK, that incrementally build parse trees with their actions. The procedures are applied in three left-to-right passes over the input sentence; the first pass applies TAG, the second pass applies CHUNK, and the third pass applies BUILD and CHECK. The passes, the procedures they apply, and the actions of the procedures are summarized in table 1 and described below.</Paragraph> <Paragraph position="1"> The actions of the procedures are designed so that any possible complete parse tree T for the input sentence corresponds to exactly one sequence of actions; call this sequence the derivation of T. Each procedure, when given a derivation d = {al... an}, predicts some action a,+l to create a new derivation d' = {al ... a,+l }. Typically, the procedures postulate many different values for a,+l, which cause the parser to explore many different derivations when parsing an input sentence. But for demonstration purposes, figures 1-7 trace one possible derivation for the sentence &quot;I saw the man with the telescope&quot;, using the part-of-speech (POS) tag set and constituent label set of the Penn treebank.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 First Pass </SectionTitle> <Paragraph position="0"> The first pass takes an input sentence, shown in figure 1, and uses TAG to assign each word a POS tag.</Paragraph> <Paragraph position="1"> The result of applying TAG to each word is shown in figure 2.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Second Pass </SectionTitle> <Paragraph position="0"> The second pass takes the output of the first pass and uses CHUNK to determine the &quot;flat&quot; phrase chunks of the sentence, where a phrase is &quot;flat&quot; if and only if it is a constituent whose children consist solely of POS tags. Starting from the left, CHUNK assigns each (word,POS tag) pair a &quot;chunk&quot; tag, either Start X, Join X, or Other. Figure 3 shows the result after the second pass. The chunk tags are then used for chunk detection, in which any consecutive sequence of words win.., w, (m _< n) are grouped into a &quot;flat&quot; chunk X if wm has been assigned Start X and Wm+l...w, have all been assigned Join X.</Paragraph> <Paragraph position="1"> The result of chunk detection, shown in figure 4, is a forest of trees and serves as the input to the third pass.</Paragraph> </Section> <Section position="3" start_page="0" end_page="2" type="sub_section"> <SectionTitle> ations of a shift-reduce parser 2.3 Third Pass </SectionTitle> <Paragraph position="0"> The third pass always alternates between the use of BUILD and CHECK, and completes any remaining constituent structure. BUILD decides whether a tree will start a new constituent or join the incomplete constituent immediately to its left. Accordingly, it annotates the tree with either Start X, where X is any constituent label, or with Join X, where X matches the label of the incomplete constituent to the left. BUILD always processes the leftmost tree without any Start X or Join X annotation. 
<Section position="3" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 2.3 Third Pass </SectionTitle> <Paragraph position="0"> The third pass always alternates between the use of BUILD and CHECK, and completes any remaining constituent structure. BUILD decides whether a tree will start a new constituent or join the incomplete constituent immediately to its left. Accordingly, it annotates the tree with either Start X, where X is any constituent label, or with Join X, where X matches the label of the incomplete constituent to the left. BUILD always processes the leftmost tree without any Start X or Join X annotation. Figure 5 shows an application of BUILD in which the action is Join VP. After BUILD, control passes to CHECK, which finds the most recently proposed constituent and decides if it is complete.</Paragraph> <Paragraph position="1"> The most recently proposed constituent, shown in figure 6, is the rightmost sequence of trees t_m ... t_n (m ≤ n) such that t_m is annotated with Start X and t_{m+1} ... t_n are annotated with Join X. If CHECK decides yes, then the proposed constituent takes its place in the forest as an actual constituent, on which BUILD does its work. Otherwise, the constituent is not finished and BUILD processes the next tree in the forest, t_{n+1}. CHECK always answers no if the proposed constituent is a &quot;flat&quot; chunk, since such constituents must be formed in the second pass. Figure 7 shows the result when CHECK looks at the proposed constituent in figure 6 and decides No. The third pass terminates when CHECK is presented with a constituent that spans the entire sentence.</Paragraph> <Paragraph position="10"> Table 2 compares the actions of BUILD and CHECK to the operations of a standard shift-reduce parser.</Paragraph> <Paragraph position="11"> The No and Yes actions of CHECK correspond to the shift and reduce actions, respectively. The important difference is that while a shift-reduce parser creates a constituent in one step (a single reduce), the procedures BUILD and CHECK create it over several steps in smaller increments.</Paragraph> </Section> </Section> <Section position="5" start_page="2" end_page="5" type="metho"> <SectionTitle> 3 Probability Model </SectionTitle> <Paragraph position="0"> This paper takes a &quot;history-based&quot; approach (Black et al., 1993) where each tree-building procedure uses a probability model p(a|b), derived from p(a, b), to weight any action a based on the available context, or history, b. First, we present a few simple categories of contextual predicates that capture any information in b that is useful for predicting a. Next, the predicates are used to extract a set of features from a corpus of manually parsed sentences. Finally, those features are combined under the maximum entropy framework, yielding p(a, b).</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Contextual Predicates </SectionTitle> <Paragraph position="0"> Contextual predicates are functions that check for the presence or absence of useful information in a context b and return true or false accordingly. The comprehensive guidelines, or templates, for the contextual predicates of each tree-building procedure are given in table 3. The templates use indices relative to the tree that is currently being modified. For example, if the current tree is the 5th tree, cons(-2) looks at the constituent label, head word, and start/join annotation of the 3rd tree in the forest. The actual contextual predicates are generated automatically by scanning the derivations of the trees in the manually parsed corpus with the templates. For example, an actual contextual predicate based on the template cons(0) might be &quot;Does cons(0) = { NP, he } ?&quot; Constituent head words are found, when necessary, with the algorithm in (Magerman, 1995).</Paragraph>
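<Paragraph> As a sketch of this template-scanning step, the loop below instantiates cons(n) predicates from (action, context) training events. The event format, the context.cons(n) accessor, and the predicate strings are illustrative assumptions, and the real template set of table 3 is considerably richer.

from collections import Counter

def instantiate_predicates(events, offsets=(-2, -1, 0, 1, 2)):
    """events: (action, context) pairs read off the derivations of the training corpus."""
    counts = Counter()
    for action, context in events:
        for n in offsets:
            label, head = context.cons(n)   # template cons(n): label and head word of the nth tree (simplified)
            predicate = "cons(%d) = { %s, %s }" % (n, label, head)   # e.g. "cons(0) = { NP, he }"
            counts[(predicate, action)] += 1   # how often this predicate co-occurs with this action
    return counts

Counting predicate-action co-occurrences in this way also supports the frequency cutoff used for feature selection in section 3.2. </Paragraph>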
<Paragraph position="1"> Contextual predicates which look at head words, or especially pairs of head words, may not be reliable predictors for the procedure actions due to their sparseness in the training sample. Therefore, for each lexically based contextual predicate, there also exist one or more corresponding less specific, or &quot;backed-off&quot;, contextual predicates which look at the same context, but omit one or more words.</Paragraph> <Paragraph position="2"> For example, the contexts cons(0, 1*), cons(0*, 1), and cons(0*, 1*) are the same as cons(0, 1) but omit references to the head word of the 1st tree, the 0th tree, and both the 0th and 1st trees, respectively.</Paragraph> <Paragraph position="3"> The backed-off contextual predicates should allow the model to provide reliable probability estimates when the words in the history are rare. Backed-off predicates are not enumerated in table 3, but their existence is indicated with a * and a †.</Paragraph> <Paragraph> [Table 3 (templates for the contextual predicates of each tree-building procedure); its entries include: the word, POS tag, and chunk tag of the nth leaf (chunk tag omitted if n > 0), alone and in combinations such as chunkandpostag(m) &amp; chunkandpostag(n); the head word, constituent (or POS) label, and start/join annotation of the nth tree (start/join annotation omitted for trees not yet annotated), alone and in combinations such as cons(m), cons(n), &amp; cons(p); punctuation predicates that fire when the constituent we could join contains a &quot;[&quot; and the current tree is a &quot;]&quot;, when it contains a &quot;,&quot; and the current tree is a &quot;,&quot;, or when it spans the entire sentence and the current tree is &quot;.&quot;; the head word and constituent (or POS) label of the nth tree together with the label of the proposed constituent, where begin and last are the first and last children (resp.) of the proposed constituent; the constituent label of the parent (X) and the constituent or POS labels of the children (X_1 ... X_n) of the proposed constituent; and the POS tag and word of the nth leaf to the left of the proposed constituent (n < 0) or to its right (n > 0).] </Paragraph> </Section> <Section position="2" start_page="2" end_page="5" type="sub_section"> <SectionTitle> 3.2 Maximum Entropy Framework </SectionTitle> <Paragraph position="0"> The contextual predicates derived from the templates of table 3 are used to create the features necessary for the maximum entropy models. The predicates for TAG, CHUNK, BUILD, and CHECK are used to scan the derivations of the trees in the corpus to form the training samples T_TAG, T_CHUNK, T_BUILD, and T_CHECK, respectively. Each training sample has the form T = {(a_1, b_1), (a_2, b_2), ..., (a_N, b_N)}, where a_i is an action of the corresponding procedure and b_i is the list of contextual predicates that were true in the context in which a_i was decided.</Paragraph> <Paragraph position="1"> The training samples are respectively used to create the models p_TAG, p_CHUNK, p_BUILD, and p_CHECK, all of which have the form:

p(a, b) = π ∏_{j=1}^{k} α_j^{f_j(a, b)}    (1)

where a is some action, b is some context, π is a normalization constant, α_j are the model parameters, 0 < α_j < ∞, and f_j(a, b) ∈ {0, 1} are called features, j = 1 ... k. Features encode an action a' as well as some contextual predicate cp that a tree-building procedure would find useful for predicting the action a'. Any contextual predicate cp derived from table 3 which occurs 5 or more times in a training sample with a particular action a' is used to construct a feature f_j:

f_j(a, b) = 1 if cp(b) = true and a = a', and f_j(a, b) = 0 otherwise,

for use in the corresponding model. Each feature f_j corresponds to a parameter α_j, which can be viewed as a &quot;weight&quot; that reflects the importance of the feature.</Paragraph>
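<Paragraph> To make the form of equation (1) concrete, the sketch below evaluates such a model for one context. Representing features as (predicate, action) pairs mapped to weights is an illustrative assumption, and the weights themselves would come from the training procedure described next.

def p_joint(action, true_predicates, weights):
    """Unnormalized form of equation (1): multiply the weights alpha_j of every
    feature (cp, a') that fires, i.e. whose predicate cp is true in the context
    and whose action a' equals the candidate action."""
    score = 1.0
    for cp in true_predicates:
        score *= weights.get((cp, action), 1.0)   # alpha_j ** f_j(a, b); non-firing features contribute 1
    return score

def p_conditional(action, true_predicates, weights, actions):
    """p(a | b): renormalize the joint scores over all candidate actions; the
    constant pi of equation (1) cancels in this ratio."""
    z = sum(p_joint(a, true_predicates, weights) for a in actions)
    return p_joint(action, true_predicates, weights) / z

# e.g. weights = {("cons(0) = { NP, he }", "Join VP"): 1.7, ...}
#      p_conditional("Join VP", ["cons(0) = { NP, he }"], weights, ["Join VP", "Start S"])
</Paragraph>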
<Paragraph position="14"> The parameters {α_1 ... α_k} are found automatically with Generalized Iterative Scaling (Darroch and Ratcliff, 1972), or GIS. The GIS procedure, as well as the maximum entropy and maximum likelihood properties of the distribution of form (1), are described in detail in (Ratnaparkhi, 1997). In general, the maximum entropy framework puts no limitations on the kinds of features in the model; no special estimation technique is required to combine features that encode different kinds of contextual predicates, like punctuation and cons(0, 1, 2). As a result, experimenters need only worry about what features to use, and not how to use them.</Paragraph> <Paragraph position="15"> We then use the models p_TAG, p_CHUNK, p_BUILD, and p_CHECK to define a function score, which the search procedure uses to rank derivations of incomplete and complete parse trees. For each model, the corresponding conditional probability is defined as usual:

p(a | b) = p(a, b) / Σ_{a'} p(a', b)

For notational convenience, define q as follows: q(a | b) = p_TAG(a | b) if a is an action from TAG, p_CHUNK(a | b) if a is an action from CHUNK, p_BUILD(a | b) if a is an action from BUILD, and p_CHECK(a | b) if a is an action from CHECK. Let deriv(T) = {a_1, ..., a_n} be the derivation of a parse T, where T is not necessarily complete, and where each a_i is an action of some tree-building procedure. By design, the tree-building procedures guarantee that {a_1, ..., a_n} is the only derivation for the parse T. Then the score of T is merely the product of the conditional probabilities of the individual actions in its derivation:

score(T) = ∏_{a_i ∈ deriv(T)} q(a_i | b_i)

where b_i is the context in which a_i was decided.</Paragraph> </Section> </Section> <Section position="6" start_page="5" end_page="6" type="metho"> <SectionTitle> 4 Search </SectionTitle> <Paragraph position="0"> The search heuristic attempts to find the best parse T*, defined as:

T* = argmax_{T ∈ trees(S)} score(T)

where trees(S) are all the complete parses for an input sentence S.</Paragraph> <Paragraph position="3"> The heuristic employs a breadth-first search (BFS) which does not explore the entire frontier, but rather explores at most the top K scoring incomplete parses in the frontier, and terminates when it has found M complete parses, or when all the hypotheses have been exhausted. Furthermore, if {a_1 ... a_n} are the possible actions for a given procedure on a derivation with context b, and they are sorted in decreasing order according to q(a_i | b), we only consider exploring those actions {a_1 ... a_m} that hold most of the probability mass, where m is defined as the smallest number of actions for which

Σ_{i=1}^{m} q(a_i | b) ≥ Q

and where Q is a threshold less than 1. The search also uses a Tag Dictionary constructed from training data, described in (Ratnaparkhi, 1996), that reduces the number of actions explored by the tagging model. Thus there are three parameters for the search heuristic, namely K, M, and Q; all experiments reported in this paper use K = 20, M = 20, and Q = .95. Table 4 describes the top K BFS and the semantics of the supporting functions.</Paragraph> <Paragraph> [Table 4: the top K BFS. Its supporting functions are advance, which applies the relevant tree-building procedure to a derivation d and returns the new derivations whose action probabilities pass the threshold Q; insert, which inserts d into a heap h; extract, which removes and returns the derivation in h with the highest score; and completed, which returns true if and only if d is a complete derivation. C is a heap of completed parses, and h_i contains the derivations of length i.] </Paragraph>
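<Paragraph> The following sketch is one way to realize the top K BFS of table 4, using the supporting functions and heaps described above. The choice of always advancing the deepest non-empty heap, and the data layout, are assumptions of the sketch rather than details taken from the paper.

import heapq
import itertools

def top_k_bfs(initial, advance, completed, K=20, M=20):
    """initial: (score, state) start derivation; advance(d) returns the successor
    derivations whose action probabilities pass the threshold Q; completed(d)
    tests whether d is a complete parse."""
    tiebreak = itertools.count()                           # keeps heap entries comparable when scores tie
    heaps = {0: [(-initial[0], next(tiebreak), initial)]}  # h_i: derivations of length i (max-heap via negation)
    complete = []                                          # C: heap of completed parses
    while len(complete) < M:
        nonempty = [i for i, h in heaps.items() if h]
        if not nonempty:                                   # all hypotheses exhausted
            break
        i = max(nonempty)                                  # advance the deepest non-empty heap (an assumption)
        for _ in range(min(K, len(heaps[i]))):             # at most the top K derivations of that length
            _, _, d = heapq.heappop(heaps[i])              # extract: highest-scoring derivation in h_i
            for d_new in advance(d):                       # advance: apply the relevant tree-building procedure
                entry = (-d_new[0], next(tiebreak), d_new)
                if completed(d_new):
                    heapq.heappush(complete, entry)
                else:
                    heapq.heappush(heaps.setdefault(i + 1, []), entry)
    return [d for _, _, d in sorted(complete, key=lambda e: e[0])]   # best complete parses first

Because the scores of section 3.2 are products of probabilities, an implementation would typically store log probabilities and add them, which leaves the ranking unchanged. </Paragraph>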
<Paragraph position="6"> It should be emphasized that if K > 1, the parser does not commit to a single POS or chunk assignment for the input sentence before building constituent structure. All three of the passes described in section 2 are integrated in the search, i.e., when parsing a test sentence, the input to the second pass consists of K of the best distinct POS tag assignments for the input sentence. Likewise, the input to the third pass consists of K of the best distinct chunk and POS tag assignments for the input sentence. The top K BFS described above exploits the observed property that the individual steps of correct derivations tend to have high probabilities, and thus avoids searching a large fraction of the search space.</Paragraph> <Paragraph position="7"> Since, in practice, it only does a constant amount of work to advance each step in a derivation, and since derivation lengths are roughly proportional to the sentence length, we would expect it to run in linear observed time with respect to sentence length. Figure 8 confirms our assumptions about the linear observed running time. [Figure 8 plots observed running time against sentence length, measured on one UltraSPARC processor and 256MB RAM of a Sun Ultra Enterprise 4000.]</Paragraph> </Section> <Section position="7" start_page="6" end_page="6" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> The maximum entropy parser was trained on sections 2 through 21 (roughly 40000 sentences) of the Penn Treebank Wall St. Journal corpus, release 2 (Marcus, Santorini, and Marcinkiewicz, 1994), and tested on section 23 (2416 sentences) for comparison with other work. All trees were stripped of their semantic tags (e.g., -LOC, -BNF, etc.), coreference information (e.g., *-1), and quotation marks (&quot; and '') for both training and testing. The PARSEVAL (Black and others, 1991) measures compare a proposed parse P with the corresponding correct treebank parse T as follows:

Recall = (# correct constituents in P) / (# constituents in T)
Precision = (# correct constituents in P) / (# constituents in P)

A constituent in P is &quot;correct&quot; if there exists a constituent in T of the same label that spans the same words. Table 5 shows results using the PARSEVAL measures, as well as results using the slightly more forgiving measures of (Collins, 1996) and (Magerman, 1995), which do not distinguish between ADVP and PRT and ignore all punctuation. Table 5 shows that the maximum entropy parser performs better than the parsers presented in (Collins, 1996) and (Magerman, 1995), which have the best previously published parsing accuracies on the Wall St. Journal domain.</Paragraph>
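<Paragraph> As a small illustration of these measures, the function below computes labeled recall and precision for one sentence. Representing constituents as (label, start, end) spans is an assumption about the data format, not something specified in the paper.

def parseval(proposed, gold):
    """Return (recall, precision) for one sentence, given lists of (label, start, end) spans."""
    # A proposed constituent is correct if a gold constituent with the same
    # label spans the same words; duplicates are matched at most once.
    gold_left = list(gold)
    correct = 0
    for c in proposed:
        if c in gold_left:
            gold_left.remove(c)
            correct += 1
    recall = correct / len(gold) if gold else 1.0
    precision = correct / len(proposed) if proposed else 1.0
    return recall, precision

# e.g. parseval([("NP", 0, 1), ("VP", 1, 7), ("S", 0, 7)],
#               [("NP", 0, 1), ("VP", 1, 7), ("S", 0, 7)])  ->  (1.0, 1.0)
</Paragraph>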
<Paragraph position="1"> It is often advantageous to produce the top N parses instead of just the top 1, since additional information can be used in a secondary model that reorders the top N and hopefully improves the quality of the top ranked parse. Suppose there exists a &quot;perfect&quot; reranking scheme that, for each sentence, magically picks the best parse from the top N parses produced by the maximum entropy parser, where the best parse has the highest average precision and recall when compared to the treebank parse. The performance of this &quot;perfect&quot; scheme is then an upper bound on the performance of any reranking scheme that might be used to reorder the top N parses. Figure 9 shows that the &quot;perfect&quot; scheme would achieve roughly 93% precision and recall, which is a dramatic increase over the top 1 accuracy of 87% precision and 86% recall. Figure 10 shows that the &quot;Exact Match&quot; rate, the percentage of sentences for which the proposed parse P is identical (excluding POS tags) to the treebank parse T, rises substantially to about 53% from 30% when the &quot;perfect&quot; scheme is applied. (Figures 9 and 10 plot these measures against the treebank parses as a function of N; the evaluation ignores quotation marks.) For this reason, research into reranking schemes appears to be a promising step towards the goal of improving parsing accuracy.</Paragraph> </Section> <Section position="8" start_page="6" end_page="43" type="metho"> <SectionTitle> 6 Comparison With Previous Work </SectionTitle> <Paragraph position="0"> The two parsers which have previously reported the best accuracies on the Penn Treebank Wall St. Journal are the bigram parser described in (Collins, 1996) and the SPATTER parser described in (Jelinek et al., 1994; Magerman, 1995). The parser presented here outperforms both the bigram parser and the SPATTER parser, and uses different modelling technology and different information to drive its decisions. The bigram parser is a statistical CKY-style chart parser, which uses cooccurrence statistics of head-modifier pairs to find the best parse. The maximum entropy parser is a statistical shift-reduce style parser that cannot always access head-modifier pairs. For example, the checkcons(m,n) predicate of the maximum entropy parser may use two words such that neither is the intended head of the proposed constituent that the CHECK procedure must judge. And unlike the bigram parser, the maximum entropy parser cannot use head word information besides &quot;flat&quot; chunks in the right context.</Paragraph> <Paragraph position="1"> The bigram parser uses a backed-off estimation scheme that is customized for a particular task, whereas the maximum entropy parser uses a general purpose modelling technique. This allows the maximum entropy parser to easily integrate varying kinds of features, such as those for punctuation, whereas the bigram parser uses hand-crafted punctuation rules. Furthermore, the customized estimation framework of the bigram parser must use information that has been carefully selected for its value, whereas the maximum entropy framework robustly integrates any kind of information, obviating the need to screen it first.</Paragraph> <Paragraph position="2"> The SPATTER parser is a history-based parser that uses decision tree models to guide the operations of a few tree building procedures. It differs from the maximum entropy parser in how it builds trees and, more critically, in how its decision trees use information.
The SPATTER decision trees use predicates on word classes created with a statistical clustering technique, whereas the maximum entropy parser uses predicates that contain merely the words themselves, and thus does not require a (typically expensive) word clustering procedure. Furthermore, the top K BFS search heuristic appears to be much simpler than the stack decoder algorithm outlined in (Magerman, 1995).</Paragraph> </Section> </Paper>