<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1013">
  <Title>Discriminative Training of a Neural Network Statistical Parser</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Two History-Based Probability Models
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <Paragraph position="0"> As with many previous statistical parsers (Ratnaparkhi, 1999; Collins, 1999; Charniak, 2000), we use a history-based model of parsing. Designing a history-based model of parsing involves two steps, rst choosing a mapping from the set of phrase structure trees to the set of parses, and then choosing a probability model in which the probability of each parser decision is conditioned on the history of previous decisions in the parse. We use the same mapping for both our probability models, but we use two di erent ways of conditioning the probabilities, one generative and one discriminative. As we will show in section 6, these two di erent ways of parameterizing the probability model have a big impact on the ease with which the parameters can be estimated.</Paragraph>
      <Paragraph position="1"> To de ne the mapping from phrase structure trees to parses, we use a form of left-corner parsing strategy (Rosenkrantz and Lewis, 1970). In a left-corner parse, each node is introduced after the subtree rooted at the node's rst child has been fully parsed. Then the subtrees for the node's remaining children are parsed in their left-to-right order. Parsing a constituent starts by pushing the leftmost word w of the constituent onto the stack with a shift(w) action.</Paragraph>
      <Paragraph position="2"> Parsing a constituent ends by either introducing the constituent's parent nonterminal (labeled Y ) with a project(Y) action, or attaching to the parent with an attach action.1 A complete parse consists of a sequence of these actions, d1;:::; dm, such that performing d1;:::; dm results in a complete phrase structure tree.</Paragraph>
      <Paragraph position="3"> Because this mapping from phrase structure trees to sequences of decisions about parser actions is one-to-one, nding the most probable phrase structure tree is equivalent to nding the parse d1;:::; dm which maximizes P(d1;:::; dmjw1;:::; wn). This probability is only nonzero if yield(d1;:::; dm) = w1;:::; wn, so we can restrict attention to only those parses which actually yield the given sentence. With this restriction, it is equivalent to maximize P(d1;:::; dm), as is done with our rst probability model.</Paragraph>
      <Paragraph position="4"> The rst probability model is generative, because it speci es the joint probability of the input sentence and the output tree. This joint probability is simply P(d1;:::; dm), since the 1More details on the mapping to parses can be found in (Henderson, 2003b).</Paragraph>
      <Paragraph position="5"> probability of the input sentence is included in the probabilities for the shift(wi) decisions included in d1;:::; dm. The probability model is then de ned by using the chain rule for conditional probabilities to derive the probability of a parse as the multiplication of the probabilities of each decision di conditioned on that decision's prior parse history d1;:::; di 1.</Paragraph>
      <Paragraph position="7"> The parameters of this probability model are the P(dijd1;:::; di 1). Generative models are the standard way to transform a parsing strategy into a probability model, but note that we are not assuming any bound on the amount of information from the parse history which might be relevant to each parameter.</Paragraph>
      <Paragraph position="8"> The second probability model is discriminative, because it speci es the conditional probability of the output tree given the input sentence. More generally, discriminative models try to maximize this conditional probability, but often do not actually calculate the probability, as with Support Vector Machines (Vapnik, 1995). We take the approach of actually calculating an estimate of the conditional probability because it di ers minimally from the generative probability model. In this form, the distinction between our two models is sometimes referred to as \joint versus conditional&amp;quot; (Johnson, 2001; Klein and Manning, 2002) rather than \generative versus discriminative&amp;quot; (Ng and Jordan, 2002). As with the generative model, we use the chain rule to decompose the entire conditional probability into a sequence of probabilities for individual parser decisions, where yield(dj;:::; dk) is the sequence of words wi from the shift(wi) actions in dj;:::; dk.</Paragraph>
      <Paragraph position="10"> Note that d1;:::; di 1 speci es yield(d1;:::; di 1), so it is su cient to only add yield(di;:::; dm) to the conditional in order for the entire input sentence to be included in the conditional. We will refer to the string yield(di;:::; dm) as the lookahead string, because it represents all those words which have not yet been reached by the parse at the time when decision di is chosen.</Paragraph>
      <Paragraph position="11"> The parameters of this model di er from those of the generative model only in that they include the lookahead string in the conditional.</Paragraph>
      <Paragraph position="12"> Although maximizing the joint probability is the same as maximizing the conditional probability, the fact that they have di erent parameters means that estimating one can be much harder than estimating the other. In general we would expect that estimating the joint probability would be harder than estimating the conditional probability, because the joint probability contains more information than the conditional probability. In particular, the probability distribution over sentences can be derived from the joint probability distribution, but not from the conditional one. However, the unbounded nature of the parsing problem means that the individual parameters of the discriminative model are much harder to estimate than those of the generative model.</Paragraph>
      <Paragraph position="13"> The parameters of the discriminative model include an unbounded lookahead string in the conditional. Because these words have not yet been reached by the parse, we cannot assign them any structure, and thus the estimation process has no way of knowing what words in this string will end up being relevant to the next decision it needs to make. The estimation process has to guess about the future role of an unbounded number of words, which makes the estimate quite di cult. In contrast, the parameters of the generative model only include words which are either already incorporated into the structure, or are the immediate next word to be incorporated. Thus it is relatively easy to determine the signi cance of each word.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Estimating the Parameters with a Neural Network
</SectionTitle>
    <Paragraph position="0"> The most challenging problem in estimating P(dijd1;:::; di 1;yield(di;:::; dm)) and P(dijd1;:::; di 1) is that the conditionals include an unbounded amount of information.</Paragraph>
    <Paragraph position="1"> Both the parse history d1;:::; di 1 and the lookahead string yield(di;:::; dm) grow with the length of the sentence. In order to apply standard probability estimation methods, we use neural networks to induce nite representations of both these sequences, which we will denote h(d1;:::; di 1) and l(yield(di;:::; dm)), respectively. The neural network training methods we use try to nd representations which preserve all the information about the sequences which are relevant to estimating the desired probabilities.</Paragraph>
    <Paragraph position="3"> Of the previous work on using neural networks for parsing natural language, by far the most empirically successful has been the work using Simple Synchrony Networks. Like other recurrent network architectures, SSNs compute a representation of an unbounded sequence by incrementally computing a representation of each pre x of the sequence. At each position i, representations from earlier in the sequence are combined with features of the new position i to produce a vector of real valued features which represent the pre x ending at i. This representation is called a hidden representation. It is analogous to the hidden state of a Hidden Markov Model. As long as the hidden representation for position i 1 is always used to compute the hidden representation for position i, any information about the entire sequence could be passed from hidden representation to hidden representation and be included in the hidden representation of that sequence. When these representations are then used to estimate probabilities, this property means that we are not making any a priori hard independence assumptions (although some independence may be learned from the data).</Paragraph>
    <Paragraph position="4"> The di erence between SSNs and most other recurrent neural network architectures is that SSNs are speci cally designed for processing structures. When computing the history representation h(d1;:::; di 1), the SSN uses not only the previous history representation h(d1;:::; di 2), but also uses history representations for earlier positions which are particularly relevant to choosing the next parser decision di.</Paragraph>
    <Paragraph position="5"> This relevance is determined by rst assigning each position to a node in the parse tree, namely the node which is on the top of the parser's stack when that decision is made. Then the relevant earlier positions are chosen based on the structural locality of the current decision's node to the earlier decisions' nodes. In this way, the number of representations which information needs to pass through in order to ow from history representation i to history representation j is determined by the structural distance between i's node and j's node, and not just the distance between i and j in the parse sequence.</Paragraph>
    <Paragraph position="6"> This provides the neural network with a linguistically appropriate inductive bias when it learns the history representations, as explained in more detail in (Henderson, 2003b).</Paragraph>
    <Paragraph position="7"> When computing the lookahead representation l(yield(di;:::; dm)), there is no structural information available to tell us which positions are most relevant to choosing the decision di. Proximity in the string is our only indication of relevance. Therefore we compute l(yield(di;:::; dm)) by running a recurrent neural network backward over the string, so that the most recent input is the rst word in the lookahead string, as discussed in more detail in (Henderson, 2003a).</Paragraph>
    <Paragraph position="8"> Once it has computed h(d1;:::; di 1) and (for the discriminative model) l(yield(di;:::; dm)), the SSN uses standard methods (Bishop, 1995) to estimate a probability distribution over the set of possible next decisions di given these representations. This involves further decomposing the distribution over all possible next parser actions into a small hierarchy of conditional probabilities, and then using log-linear models to estimate each of these conditional probability distributions. The input features for these log-linear models are the real-valued vectors computed by h(d1;:::; di 1) and l(yield(di;:::; dm)), as explained in more detail in (Henderson, 2003b).</Paragraph>
    <Paragraph position="9"> Thus the full neural network consists of a recurrent hidden layer for h(d1;:::; di 1), (for the discriminative model) a recurrent hidden layer for l(yield(di;:::; dm)), and an output layer for the log-linear model. Training is applied to this full neural network, as described in the next section.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Three Optimization Criteria and their Training Methods
</SectionTitle>
    <Paragraph position="0"> their Training Methods As with many other machine learning methods, training a Simple Synchrony Network involves rst de ning an appropriate learning criteria and then performing some form of gradient descent learning to search for the optimum values of the network's parameters according to this criteria. In all the parsing models investigated here, we use the on-line version of Backpropagation to perform the gradient descent. This learning simultaneously tries to optimize the parameters of the output computation and the parameters of the mappings h(d1;:::; di 1) and l(yield(di;:::; dm)). With multi-layered networks such as SSNs, this training is not guaranteed to converge to a global optimum, but in practice a network whose criteria value is close to the optimum can be found.</Paragraph>
    <Paragraph position="1"> The three parsing models di er in the criteria the neural networks are trained to optimize. Two of the neural networks are trained using the standard maximum likelihood approach of optimizing the same probability which they are estimating, one generative and one discriminative. For the generative model, this means maximizing the total joint probability of the parses and the sentences in the training corpus. For the discriminative model, this means maximizing the conditional probability of the parses in the training corpus given the sentences in the training corpus. To make the computations easier, we actually minimize the negative log of these probabilities, which is called cross-entropy error. Minimizing this error ensures that training will converge to a neural network whose outputs are estimates of the desired probabilities.2 For each parse in the training corpus, Backpropagation training involves rst computing the probability which the current network assigns to that parse, then computing the rst derivative of (the negative log of) this probability with respect to each of the network's parameters, and then updating the parameters proportionately to this derivative.3 The third neural network combines the advantages of the generative probability model with the advantages of the discriminative optimization criteria. The structure of the network and the set of outputs which it computes are exactly the same as the above network for the generative model. But the training procedure is designed to maximize the conditional probability of the parses in the training corpus given the sentences in the training corpus. The conditional probability for a sentence can be computed from the joint probability of the generative model by normalizing over the set of all parses d01;:::; d0m0 for the sentence.</Paragraph>
    <Paragraph position="3"> So, with this approach, we need to maximize this normalized probability, and not the probability computed by the network.</Paragraph>
    <Paragraph position="4"> The di culty with this approach is that there are exponentially many parses for the sentence, so it is not computationally feasible to compute them all. We address this problem by only computing a small set of the most probable parses. The remainder of the sum is estimated using a combination of the probabilities from the best parses and the probabilities 2Cross-entropy error ensures that the minimum of the error function converges to the desired probabilities as the amount of training data increases (Bishop, 1995), so the minimum for any given dataset is considered an estimate of the true probabilities.</Paragraph>
    <Paragraph position="5"> 3A number of additional training techniques, such as regularization, are added to this basic procedure, as will be speci ed in section 6.</Paragraph>
    <Paragraph position="6"> from the partial parses which were pruned when searching for the best parses. The probabilities of pruned parses are estimated in such a way as to minimize their e ect on the training process. For each decision which is part of some un-pruned parses, we calculate the average probability of generating the remainder of the sentence by these un-pruned parses, and use this as the estimate for generating the remainder of the sentence by the pruned parses. With this estimate we can calculate the sum of the probabilities for all the pruned parses which originate from that decision. This approach gives us a slight overestimate of the total sum, but because this total sum acts simply as a weighting factor, it has little e ect on learning. What is important is that this estimate minimizes the e ect of the pruned parses' probabilities on the part of the training process which occurs after the probabilities of the best parses have been calculated.</Paragraph>
    <Paragraph position="7"> After estimating P(d1;:::; dmjw1;:::; wn), training requires that we estimate the rst derivative of (the negative log of) this probability with respect to each of the network's parameters. The contribution to this derivative of the numerator in the above equation is the same as in the generative case, just scaled by the denominator.</Paragraph>
    <Paragraph position="8"> The di erence between the two learning methods is that we also need to account for the contribution to this derivative of the denominator.</Paragraph>
    <Paragraph position="9"> Here again we are faced with the problem that there are an exponential number of derivations in the denominator, so here again we approximate this calculation using the most probable parses.</Paragraph>
    <Paragraph position="10"> To increase the conditional probability of the correct parse, we want to decrease the total joint probabilities of the incorrect parses. Probability mass is only lost from the sum over all parses because shift(wi) actions are only allowed for the correct wi. Thus we can decrease the total joint probability of the incorrect parses by making these parses be worse predictors of the words in the sentence.4 The combination of training the correct parses to be good predictors of the words and training the incorrect parses to be bad predictors of the words results in prediction prob4Non-prediction probability estimates for incorrect parses can make a small contribution to the derivative, but because pruning makes the calculation of this contribution inaccurate, we treat this contribution as zero when training. This means that non-prediction outputs are trained to maximize the same criteria as in the generative case.</Paragraph>
    <Paragraph position="11"> abilities which are not accurate estimates, but which are good at discriminating correct parses from incorrect parses. It is this feature which gives discriminative training an advantage over generative training. The network does not need to learn an accurate model of the distribution of words. The network only needs to learn an accurate model of how words disambiguate previous parsing decisions.</Paragraph>
    <Paragraph position="12"> When we apply discriminative training only to the most probable incorrect parses, we train the network to discriminate between the correct parse and those incorrect parses which are the most likely to be mistaken for the correct parse.</Paragraph>
    <Paragraph position="13"> In this sense our approximate training method results in optimizing the decision boundary between correct and incorrect parses, rather than optimizing the match to the conditional probability. Modifying the training method to systematically optimize the decision boundary (as in large margin methods such as Support Vector Machines) is an area of future research.</Paragraph>
    <Paragraph position="14"> 5 Searching for the most probable parse The complete parsing system uses the probability estimates computed by the SSN to search for the most probable parse. The search incrementally constructs partial parses d1;:::; di by taking a parse it has already constructed d1;:::; di 1 and using the SSN to estimate a probability distribution P(dijd1;:::; di 1; :::) over possible next decisions di. These probabilities are then used to compute the probabilities for d1;:::; di. In general, the partial parse with the highest probability is chosen as the next one to be extended, but to perform the search e ciently it is necessary to prune the search space. The main pruning is that only a xed number of the most probable derivations are allowed to continue past the shifting of each word. Setting this post-word beam width to 5 achieves fast parsing with reasonable performance in all models. For the parsers with generative probability models, maximum accuracy is achieved with a post-word beam width of 100.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 The Experiments
</SectionTitle>
    <Paragraph position="0"> We used the Penn Treebank (Marcus et al., 1993) to perform empirical experiments on the proposed parsing models. In each case the input to the network is a sequence of tag-word pairs.5 5We used a publicly available tagger (Ratnaparkhi, 1996) to provide the tags. For each tag, there is an We report results for three di erent vocabulary sizes, varying in the frequency with which tag-word pairs must occur in the training set in order to be included explicitly in the vocabulary.</Paragraph>
    <Paragraph position="1"> A frequency threshold of 200 resulted in a vocabulary of 508 tag-word pairs, a threshold of 20 resulted in 4215 tag-word pairs, and a threshold of 5 resulted in 11,993 tag-word pairs For the generative model we trained networks for the 508 (\GSSN-Freq 200&amp;quot;) and 4215 (\GSSN-Freq 20&amp;quot;) word vocabularies. The need to calculate word predictions makes training times for the 11,993 word vocabulary very long, and as of this writing no such network training has been completed. The discriminative model does not need to calculate word predictions, so it was feasible to train networks for the 11,993 word vocabulary (\DSSN-Freq 5&amp;quot;).</Paragraph>
    <Paragraph position="2"> Previous results (Henderson, 2003a) indicate that this vocabulary size performs better than the smaller ones, as would be expected.</Paragraph>
    <Paragraph position="3"> For the networks trained with the discriminative optimization criteria and the generative probability model, we trained networks for the 508 (\DGSSN-Freq 200&amp;quot;) and 4215 (\DGSSN-Freq 20&amp;quot;) word vocabularies. For this training, we need to select a small set of the most probable incorrect parses. When we tried using only the network being trained to choose these top parses, training times were very long and the resulting networks did not outperform their generative counterparts. In the experiments reported here, we provided the training with a list of the top 20 parses found by a network of the same type which had been trained with the generative criteria. The network being trained was then used to choose its top 10 parses from this list, and training was performed on these 10 parses and the correct parse.6 This reduced the time necessary to choose the top parses during training, and helped focus the early stages of training on learning relevant discriminations.</Paragraph>
    <Paragraph position="4"> Once the training of these networks was complete, we tested both their ability to parse on their own and their ability to re-rank the top unknown-word vocabulary item which is used for all those words which are not su ciently frequent with that tag to be included individually in the vocabulary (as well as other words if the unknown-word case itself does not have at least 5 instances). We did no morphological analysis of unknown words.</Paragraph>
    <Paragraph position="5">  were found with post-word beam widths of 20 and 10, respectively, so these are only approximations to the top parses.</Paragraph>
    <Paragraph position="6"> 20 parses of their associated generative model (\DGSSN-: : :, rerank&amp;quot;).</Paragraph>
    <Paragraph position="7"> We determined appropriate training parameters and network size based on intermediate validation results and our previous experience.7 We trained several networks for each of the GSSN models and chose the best ones based on their validation performance. We then trained one network for each of the DGSSN models and for the DSSN model. The best post-word beam width was determined on the validation set, which was 5 for the DSSN model and 100 for the other models.</Paragraph>
    <Paragraph position="8"> To avoid repeated testing on the standard testing set, we rst compare the di erent models with their performance on the validation set. Standard measures of accuracy are shown in table 1.8 The largest accuracy di erence is between the parser with the discriminative probability model (DSSN-Freq 5) and those with the generative probability model, despite the larger vocabulary of the former. This demonstrates the di culty of estimating the parameters of a discriminative probability model. There is also a clear e ect of vocabulary size, but there is a slightly larger e ect of training method. When tested in the same way as they were trained (for reranking), the parsers which were trained with a discriminative criteria achieve a 7% and 8% reduction in error rate over their respective parsers with the same generative probability model. When tested alone, these DGSSN parsers perform only slightly better than their respective GSSN parsers. Initial experiments on giving these networks exposure to parses outside the top 20 parses of the GSSN parsers at the very end of training did not result in any improvement on this task. This suggests that at least some of the advantage of the DSSN models is due to the fact that re-ranking is a simpler task than parsing from scratch. But additional experimental work would be necessary to make any de nite conclusions about this issue.</Paragraph>
    <Paragraph position="9"> 7All the best networks had 80 hidden units for the history representation (and 80 hidden units in the lookahead representation). Weight decay regularization was applied at the beginning of training but reduced to near 0 by the end of training. Training was stopped when maximum performance was reached on the validation set, using a post-word beam width of 5.</Paragraph>
    <Paragraph position="10"> 8All our results are computed with the evalb program following the standard criteria in (Collins, 1999), and using the standard training (sections 2{22, 39,832 sentences, 910,196 words), validation (section 24, 1346 sentence, 31507 words), and testing (section 23, 2416 sentences, 54268 words) sets (Collins, 1999).</Paragraph>
    <Paragraph position="11">  (LR), precision (LP), and a combination of both (F =1) on the entire testing set.</Paragraph>
    <Paragraph position="12"> For comparison to previous results, table 2 lists the results for our best model (DGSSN-Freq 20, rerank)9 and several other statistical parsers (Ratnaparkhi, 1999; Collins, 1999; Collins and Du y, 2002; Charniak, 2000; Collins, 2000; Bod, 2003) on the entire testing set. Our best performing model is more accurate than all these previous models except (Bod, 2003). This DGSSN parser achieves this result using much less lexical knowledge than other approaches, which mostly use at least the words which occur at least 5 times, plus morphological features of the remaining words. However, the fact that the DGSSN uses a large-vocabulary tagger (Ratnaparkhi, 1996) as a preprocessing stage may compensate for its smaller vocabulary. Also, the main reason for using a smaller vocabulary is the computational complexity of computing probabilities for the shift(wi) actions on-line, which other models do not require.</Paragraph>
    <Paragraph position="13"> 9On sentences of length at most 40, the DGSSN-Freq 20-rerank model gets 90.1% recall and 90.7% precision. null</Paragraph>
  </Section>
class="xml-element"></Paper>