<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1026"> <Title>Towards History-based Grammars: Using Richer Models for Probabilistic Parsing*</Title> <Section position="5" start_page="134" end_page="135" type="metho"> <SectionTitle> 3. The History-based Grammar Model </SectionTitle> <Paragraph position="0"> The history-based grammar model defines the context of a parse tree in terms of the leftmost derivation of the tree.</Paragraph> <Paragraph position="1"> Following [7], we show in Figure 1 a context-free grammar (CFG) for a^n b^n and the parse tree for the sentence aabb. The leftmost derivation of the tree T in Figure 1 is: S ->r1 ASB ->r2 aSB ->r3 aABB ->r4 aaBB ->r5 aabB ->r6 aabb (2) where the rule used to expand the i-th node of the tree is denoted by ri. Note that we have indexed the non-terminal (NT) nodes of the tree with this leftmost order. We denote by ti- the sentential form obtained just before we expand node i. Hence, t3- corresponds to the sentential form aSB, or equivalently to the string r1r2. In a leftmost derivation we produce the words in left-to-right order.</Paragraph> <Paragraph position="2"> Using the one-to-one correspondence between leftmost derivations and parse trees, we can rewrite the joint probability of a parse tree T and a sentence w as a product over the derivation: p(T, w) = prod_i p(ri | ti-)</Paragraph> <Paragraph position="4"> In a probabilistic context-free grammar (P-CFG), the probability of an expansion at node i depends only on the identity of the non-terminal Ni, i.e., p(ri | ti-) = p(ri), and hence p(T, w) = prod_i p(ri).</Paragraph> <Paragraph position="6"> So in P-CFG the derivation order does not affect the probabilistic model.1</Paragraph> <Paragraph position="7"> A less crude approximation than the usual P-CFG is to use a decision tree to determine which aspects of the leftmost derivation have a bearing on the probability of how node i will be expanded. In other words, the probability distribution p(ri | ti-) will be modeled by p(ri | E[ti-]) where E[t] is the equivalence class of the history t as determined by the decision tree. 
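The P-CFG independence assumption above can be made concrete in a short sketch (ours, not from the paper): under a P-CFG, the probability of a parse tree is simply the product of the probabilities of the rules in its leftmost derivation. The rule probabilities below are hypothetical, chosen only to illustrate the a^n b^n grammar of Figure 1.

```python
# Hypothetical rule probabilities for the a^n b^n grammar of Figure 1.
# Under the P-CFG assumption, p(r_i | t_i-) collapses to p(r_i): the
# probability of a rule depends only on the non-terminal it rewrites.
rule_probs = {
    ("S", ("A", "S", "B")): 0.4,
    ("S", ("A", "B")): 0.6,
    ("A", ("a",)): 1.0,
    ("B", ("b",)): 1.0,
}

def leftmost_rules(tree):
    """Yield the rules of a tree in leftmost-derivation order
    (pre-order traversal of the non-terminal nodes)."""
    label, children = tree
    if not children:          # terminal leaf: no rule to emit
        return
    yield (label, tuple(c[0] for c in children))
    for child in children:
        yield from leftmost_rules(child)

def pcfg_prob(tree):
    """p(T, w) = prod_i p(r_i) under the P-CFG independence assumption."""
    p = 1.0
    for rule in leftmost_rules(tree):
        p *= rule_probs[rule]
    return p

# Parse tree for "aabb": S -> A S B, then S -> A B, A -> a, B -> b.
leaf = lambda t: (t, [])
tree = ("S", [("A", [leaf("a")]),
              ("S", [("A", [leaf("a")]), ("B", [leaf("b")])]),
              ("B", [leaf("b")])])
```

Because every rule probability here depends only on the non-terminal being rewritten, the derivation order is immaterial, which is exactly the limitation the decision-tree equivalence classes are meant to lift.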
This allows our probabilistic model to use any information anywhere in the partial derivation tree to determine the probability of different expansions of the i-th non-terminal. The use of decision trees and a large bracketed corpus may shift some of the burden of identifying the intended parse from the grammarian to the statistical estimation methods. We refer to probabilistic methods based on the derivation as History-based Grammars (HBG).</Paragraph> <Paragraph position="8"> 1Note the abuse of notation, since we denote by p(ri) the conditional probability of rewriting the non-terminal Ni.</Paragraph> <Paragraph position="9"> In this paper, we explored a restricted implementation of this model in which only the path from the current node to the root of the derivation, along with the index of a branch (the index of the child within its parent), is examined in the decision tree model to build equivalence classes of histories. Other parts of the subtree are not examined in this implementation of HBG.</Paragraph> </Section> <Section position="6" start_page="135" end_page="135" type="metho"> <SectionTitle> 4. Task Domain </SectionTitle> <Paragraph position="0"> We have chosen computer manuals as a task domain.</Paragraph> <Paragraph position="1"> We picked the most frequent 3000 words in a corpus of 600,000 words from 10 manuals as our vocabulary. We then extracted a few million words of sentences that are completely covered by this vocabulary from 40,000,000 words of computer manuals. A randomly chosen sentence from a sample of 5000 sentences from this corpus is: 396. It indicates whether a call completed successfully or if some error was detected that caused the call to fail.</Paragraph> <Paragraph position="2"> To define what we mean by a correct parse, we use a corpus of manually bracketed sentences at the University of Lancaster called the Treebank. The Treebank uses 17 non-terminal labels and 240 tags. 
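The restricted history described above, the chain of node labels and branch indices from a node up to the root, can be sketched as follows. This is our own illustrative construction, not the paper's implementation; the class and function names are hypothetical.

```python
# Sketch (ours, not the paper's code): the restricted HBG history of a
# node is the chain of (parent non-terminal, child index) pairs from
# the node up to the root of the derivation tree.

class Node:
    def __init__(self, label, parent=None, index=0):
        self.label = label    # non-terminal label
        self.parent = parent  # immediate parent Node, or None at the root
        self.index = index    # which child of the parent this node is

def history_key(node):
    """Equivalence-class key: the path to the root with branch indices.
    Other parts of the subtree are deliberately ignored."""
    path = []
    while node.parent is not None:
        path.append((node.parent.label, node.index))
        node = node.parent
    return tuple(path)

# Tiny example: S -> NP VP, VP -> V NP.  The object NP is distinguished
# from a subject NP purely by its path-to-root key.
s = Node("S")
vp = Node("VP", parent=s, index=1)
obj = Node("NP", parent=vp, index=1)
```

A decision tree would then ask questions of keys like `history_key(obj)` to group histories into equivalence classes with similar rewrite distributions.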
The bracketing of the above sentence is shown in Figure 2.</Paragraph> <Paragraph position="4"> A parse produced by the grammar is judged to be correct if it agrees with the Treebank parse structurally and the NT labels agree. The grammar has a significantly richer NT label set (more than 10000) than the Treebank, but we have defined an equivalence mapping between the grammar NT labels and the Treebank NT labels. In this paper, we do not include the tags in the measure of a correct parse.</Paragraph> <Paragraph position="5"> We have used about 25,000 sentences to help the grammarian develop the grammar, with the goal that the correct (as defined above) parse is among the proposed (by the grammar) parses for a sentence. Our most common test set consists of 1600 sentences that are never seen by the grammarian.</Paragraph> </Section> <Section position="7" start_page="135" end_page="136" type="metho"> <SectionTitle> 5. The Grammar </SectionTitle> <Paragraph position="0"> The grammar used in this experiment is a broad-coverage, feature-based unification grammar. The grammar is context-free but uses unification to express rule templates for the context-free productions. For example, the rule template:</Paragraph> <Paragraph position="2"> corresponds to three CFG productions where the second feature :n is either s, p, or n. This rule template may elicit up to 7 non-terminals. The grammar has 21 features whose range of values may be from 2 to about 100, with a median of 8. There are 672 rule templates, of which 400 are actually exercised when we parse a corpus of 15,000 sentences. The number of productions that are realized in this training corpus is several hundred thousand.</Paragraph> <Section position="1" start_page="135" end_page="136" type="sub_section"> <SectionTitle> 5.1. 
P-CFG </SectionTitle> <Paragraph position="0"> While an NT in the above grammar is a feature vector, we group several NTs into one class we call a mnemonic, represented by the one NT that is the least specified in that class. For example, the mnemonic VBOPASTSG* corresponds to all NTs that unify with:</Paragraph> <Paragraph position="2"> We use these mnemonics to label a parse tree and we also use them to estimate a P-CFG, where the probability of rewriting an NT is given by the probability of rewriting the mnemonic. So from a training set we induce a CFG from the actual mnemonic productions that are elicited in parsing the training corpus. Using the Inside-Outside algorithm, we can estimate the P-CFG from a large corpus of text. But since we also have a large corpus of bracketed sentences, we can adapt the Inside-Outside algorithm to reestimate the probability parameters subject to the constraint that only parses consistent with the Treebank (where consistency is as defined earlier) contribute to the reestimation. From a training run of 15,000 sentences we observed 87,704 mnemonic productions, with 23,341 NT mnemonics of which 10,302 were lexical. Running on a test set of 760 sentences, 32% of the rule templates were used, and 7% of the lexical mnemonics, 10% of the constituent mnemonics, and 5% of the mnemonic productions actually contributed to parses of test sentences.</Paragraph> </Section> <Section position="2" start_page="136" end_page="136" type="sub_section"> <SectionTitle> 5.2. Grammar and Model Performance Metrics </SectionTitle> <Paragraph position="0"> To evaluate the performance of a grammar and an accompanying model, we use two types of measurements: * the any-consistent rate, defined as the percentage of sentences for which the correct parse is proposed among the many parses that the grammar provides for a sentence. 
We also measure the parse base, which is defined as the geometric mean of the number of proposed parses on a per-word basis, to quantify the ambiguity of the grammar.</Paragraph> <Paragraph position="1"> * the Viterbi rate, defined as the percentage of sentences for which the most likely parse is consistent. The any-consistent rate is a measure of the grammar's coverage of linguistic phenomena. The Viterbi rate evaluates the grammar's coverage with the statistical model imposed on the grammar. The goal of probabilistic modelling is to produce a Viterbi rate close to the any-consistent rate.</Paragraph> <Paragraph position="2"> The any-consistent rate is 90% when we require the structure and the labels to agree, and 96% when unlabeled bracketing is required. These results are obtained on 760 sentences from 7 to 17 words long from test material that has never been seen by the grammarian. The parse base is 1.35 parses/word. This translates to about 23 parses for a 12-word sentence. The unlabeled Viterbi rate stands at 64% and the labeled Viterbi rate is 60%.</Paragraph> <Paragraph position="3"> While we believe that the above Viterbi rate is close to, if not at, the state-of-the-art performance, there is room for improvement by using a more refined statistical model to approach the labeled any-consistent rate of 90% with this grammar. There is a significant gap between the labeled Viterbi and any-consistent rates: 30 percentage points.</Paragraph> <Paragraph position="4"> Instead of the usual approach, where a grammarian tries to fine-tune the grammar in the hope of improving the Viterbi rate, we use the combination of a large Treebank and the resulting derivation histories with a decision-tree building algorithm to extract statistical parameters that would improve the Viterbi rate. 
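The parse base defined above is a geometric mean, so it is most conveniently computed in log space. A minimal sketch, with our own function name; the paper does not give an implementation:

```python
import math

def parse_base(sentences):
    """Parse base: the geometric mean of the number of proposed parses
    on a per-word basis, i.e. (prod_j parses_j) ** (1 / total_words).
    `sentences` is a list of (n_words, n_proposed_parses) pairs."""
    total_words = sum(n_words for n_words, _ in sentences)
    log_sum = sum(math.log(n_parses) for _, n_parses in sentences)
    return math.exp(log_sum / total_words)

# A single 12-word sentence with 23 proposed parses gives a base of
# 23 ** (1/12), i.e. roughly 1.3 parses/word.
```

Because the base is per-word, the number of parses it implies grows exponentially with sentence length, which is why even a modest base corresponds to dozens of parses for a typical sentence.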
The grammarian's task remains that of improving the any-consistent rate.</Paragraph> <Paragraph position="5"> The history-based grammar model is distinguished from the context-free grammar model in that each constituent structure depends not only on the input string, but also on the entire history up to that point in the sentence. In HBGs, history is interpreted as any element of the output structure, or the parse tree, which has already been determined, including previous words, non-terminal categories, constituent structure, and any other linguistic information which is generated as part of the parse structure.</Paragraph> </Section> </Section> <Section position="8" start_page="136" end_page="138" type="metho"> <SectionTitle> 6. The HBG Model </SectionTitle> <Paragraph position="0"> Unlike P-CFG, which assigns a probability to a mnemonic production, the HBG model assigns a probability to a rule template. Because of this, the HBG formulation allows one to handle any grammar formalism that has a derivation process.</Paragraph> <Paragraph position="1"> For the HBG model, we have defined about 50 syntactic categories, referred to as Syn, and about 50 semantic categories, referred to as Sem. Each NT (and therefore mnemonic) of the grammar has been assigned a syntactic (Syn) and a semantic (Sem) category. We also associate with a non-terminal a primary lexical head, denoted by H1, and a secondary lexical head, denoted by H2. 
When a rule is applied to a non-terminal, it indicates which child will generate the primary lexical head and which child will generate the secondary lexical head.</Paragraph> <Paragraph position="2"> The proposed generative model associates with each constituent in the parse tree the probability: p(Syn, Sem, R, H1, H2 | Synp, Semp, Rp, Ipc, H1p, H2p) In HBG, we predict the syntactic and semantic labels of a constituent, its rewrite rule, and its two lexical heads using the labels of the parent constituent, the parent's lexical heads, the parent's rule Rp that led to the constituent, and the constituent's index Ipc as a child of Rp. As we discuss in a later section, we have also used with success more information about the derivation tree than the immediate parent in conditioning the probability of expanding a constituent.</Paragraph> <Paragraph position="3"> We have approximated the above probability by the following five factors: 1. p(Syn | Rp, Ipc, H1p, Synp, Semp) 2The primary lexical head H1 corresponds (roughly) to the linguistic notion of a lexical head. The secondary lexical head H2 has no linguistic parallel. It merely represents a word in the constituent, besides the head, which contains predictive information about the constituent.</Paragraph> <Paragraph position="4"> 2. p(Sem | Syn, Rp, Ipc, H1p, H2p, Synp, Semp) 3. p(R | Syn, Sem, Rp, Ipc, H1p, H2p, Synp, Semp) 4. p(H1 | R, Syn, Sem, Rp, Ipc, H1p, H2p) 5. p(H2 | H1, R, Syn, Sem, Rp, Ipc, Synp) While a different order for these predictions is possible, we experimented only with this one.</Paragraph> <Section position="1" start_page="137" end_page="138" type="sub_section"> <SectionTitle> 6.1. Parameter Estimation </SectionTitle> <Paragraph position="0"> We have built a decision tree only for the rule probability component (3) of the model. 
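The chain-rule decomposition into the five factors above can be sketched as follows. The plumbing is ours: the component models p1 through p5 stand in for the paper's decision-tree and smoothed n-gram estimators, and all names are illustrative.

```python
# Sketch of the five-factor HBG decomposition (our own scaffolding).
# `c` holds the constituent's Syn, Sem, R, H1, H2; `parent` holds the
# conditioning context Synp, Semp, Rp, Ipc, H1p, H2p.  Each p_k is a
# callable returning the corresponding conditional probability.
def hbg_prob(c, parent, p1, p2, p3, p4, p5):
    """p(Syn, Sem, R, H1, H2 | parent context), chained as factors 1-5."""
    return (p1(c["Syn"], parent)
            * p2(c["Sem"], c["Syn"], parent)
            * p3(c["R"], c["Syn"], c["Sem"], parent)
            * p4(c["H1"], c["R"], c["Syn"], c["Sem"], parent)
            * p5(c["H2"], c["H1"], c["R"], c["Syn"], c["Sem"], parent))
```

Each factor conditions on everything predicted before it within the constituent plus the parent context, so the product is an exact chain-rule expansion once each p_k is restricted to the conditioning variables listed in factors (1) through (5).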
For the moment, we are using n-gram models, smoothed with the usual deleted interpolation, for the other four components of the model.</Paragraph> <Paragraph position="1"> We have assigned bit strings to the syntactic and semantic categories and to the rules manually. Our intention is that bit strings differing in the least significant bit positions correspond to categories of non-terminals or rules that are similar. We have also assigned bit strings to the words in the vocabulary (the lexical heads) automatically, using the bigram mutual information clustering algorithm (see [4]). Given the bit string of a history, we then designed a decision tree for modeling the probability that a rule will be used for rewriting a node in the parse tree.</Paragraph> <Paragraph position="2"> Since the grammar produces parses which may be more detailed than the Treebank, the decision tree was built using a training set constructed in the following manner. Using the grammar with the P-CFG model, we determined the most likely parse that is consistent with the Treebank and considered the resulting sentence-tree pair as an event. Note that the grammar parse will also provide the lexical head structure of the parse. Then, we extracted, in leftmost derivation order, tuples of a history (truncated to the definition of a history in the HBG model) and the corresponding rule used in expanding a node. Using the resulting data set, we built a decision tree by classifying histories to locally minimize the entropy of the rule template.</Paragraph> <Paragraph position="3"> With a training set of about 9000 sentence-tree pairs, we had about 240,000 tuples and we grew a tree with about 40,000 nodes. This required 18 hours on a 25 MIPS RISC-based machine, and the resulting decision tree was nearly 100 megabytes.</Paragraph> <Paragraph position="4"> 6.2. Immediate vs. 
Functional Parents The HBG model employs two types of parents: the immediate parent and the functional parent. The immediate parent is the constituent that immediately dominates</Paragraph> <Paragraph position="5"> the constituent being predicted. If the immediate parent of a constituent has a different syntactic type from that of the constituent, then the immediate parent is also the functional parent; otherwise, the functional parent is the functional parent of the immediate parent. The distinction between functional parents and immediate parents arises primarily to cope with unit productions. When unit productions of the form XP2 -> XP1 occur, the immediate parent of XP1 is XP2. But, in general, the constituent XP2 does not contain enough useful information for ambiguity resolution. In particular, when considering only immediate parents, unit rules such as NP2 -> NP1 prevent the probabilistic model from allowing the NP1 constituent to interact with the VP rule, which is the functional parent of NP1.</Paragraph> <Paragraph position="6"> When the two parents are identical, as often happens, the duplicate information will be ignored. However, when they differ, the decision tree will select the parental context which best resolves ambiguities.</Paragraph> <Paragraph position="7"> Figure 3 shows an example of the representation of a history in HBG for the prepositional phrase &quot;with a list.&quot; In this example, the immediate parent of the N1 node is the NBAR4 node and the functional parent of N1 is the PP1 node.</Paragraph> </Section> </Section> </Paper>