<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1064">
  <Title>Hidden-Variable Models for Discriminative Reranking</Title>
  <Section position="4" start_page="507" end_page="509" type="metho">
    <SectionTitle>
3 The Hidden-Variable Model
</SectionTitle>
    <Paragraph position="0"> In this section we describe a hidden-variable model based on conditional log-linear models. Each sentence si for i = 1...n in our training data has a set of ni candidate parse trees ti,1,...,ti,ni, which are the output of an N-best baseline parser. Each candidate parse has an associated F-measure score,  indicating its similarity to the gold-standard parse. Without loss of generality, we define ti,1 to be the parse with the highest F-measure for sentence si.</Paragraph>
    <Paragraph position="1"> Given a candidate parse tree ti,j, the hidden-variable model assigns a domain of hidden values to each word in the tree. For example, the hidden-value domain for the word bank could be {bank1,bank2,bank3} or {NN1,NN2,NN3}. Detailed descriptions of the domains we used are given in Section 4.1. Formally, if ti,j spans m words then the hidden-value domains for each word are the sets H1(ti,j),...,Hm(ti,j). A global hidden-value assignment, which attaches a hidden value to every word in ti,j, is written h = (h1,...,hm) [?] H(ti,j), where H(ti,j) = H1(ti,j)x...xHm(ti,j) is the set of all possible global assignments for ti,j.</Paragraph>
    <Paragraph position="2"> We define a feature-based representation Ph such that Ph(ti,j,h) [?] Rd is a vector of feature occurrence counts that describes candidate parse ti,j with global assignment h [?] H(ti,j). We write Phk for k = 1...d to denote the kth component of the vector Ph. Each component of the feature vector is the count of some substructure within (ti,j,h). For example, Ph12 and Ph101 could be defined as follows:</Paragraph>
    <Paragraph position="4"> Number of times the word the occurs with hidden value the3 and part of speech tag DT in (ti,j,h).</Paragraph>
    <Paragraph position="6"> Number of times CEO1 appears as the subject of owns2</Paragraph>
    <Paragraph position="8"> We use a parameter vector Th [?] Rd to define a log-linear distribution over candidate trees together with global hidden-value assignments:</Paragraph>
    <Paragraph position="10"> By marginalizing out the global assignments, we obtain a distribution over the candidate parses alone:</Paragraph>
    <Paragraph position="12"> Later in this paper we will describe how to train the parameters of the model by minimizing the following loss function--which is the negative log-likelihood of the training data--with respect to Th:</Paragraph>
    <Paragraph position="14"/>
    <Section position="1" start_page="508" end_page="509" type="sub_section">
      <SectionTitle>
3.1 Local Feature Vectors
</SectionTitle>
      <Paragraph position="0"> Note that the number of possible global assignments (i.e., |H(ti,j)|) grows exponentially fast with respect to the number of words spanned by ti,j. This poses a problem when training the model, or when calculating the probability of a parse tree through Eq. 2. This section describes how to address this difficulty by restricting features to sufficiently local scope. In Section 3.2 we show that this restriction allows efficient training and decoding of the model.</Paragraph>
      <Paragraph position="1"> The restriction to local feature-vectors makes use of the dependency structure underlying the parse tree ti,j. Formally, for tree ti,j, we define the corresponding dependency tree D(ti,j) to be a set of edges between words in ti,j, where (u,v) [?] D(ti,j) if and only if there is a head-modifier dependency between words u and v. See Figure 1 for an example dependency tree. We restrict the definition of Ph in the following way1. If w, u and v are word indices, we introduce single-variable local feature vectors ph(ti,j,w,hw) [?] Rd and pairwise local feature vectors ph(ti,j,u,v,hu,hv) [?] Rd. The global feature vector Ph(ti,j,h) is then decomposed into a sum over the local feature vectors:</Paragraph>
      <Paragraph position="3"> Notice that the second sum, over pairwise local feature vectors, respects the dependency structure D(ti,j). Section 3.2 describes how this decomposition of Ph leads to an efficient and exact dynamic-programming approach that, during training, allows us to calculate the gradient [?]L[?]Th and, during testing, allows us to find the most probable candidate parse.</Paragraph>
      <Paragraph position="4"> In our implementation, each dimension of the local feature vectors is an indicator function signaling the presence of a feature, so that a sum over local feature vectors in a tree gives the occurrence count 1Note that the restriction on local feature vectors only concerns the inclusion of hidden values; features may still observe arbitrary structure within the underlying parse tree ti,j.</Paragraph>
      <Paragraph position="5">  of features in that tree. For instance, define</Paragraph>
      <Paragraph position="7"> and tree ti,j places (u,v) in a subject-verb relationship LARGErrbracket where the notation llbracketPrrbracket signifies a 0/1 indicator of predicate P. When summed over the tree, these definitions of ph12 and ph101 yield global features Ph12 and Ph101 as given in the previous example (see Eq. 1).</Paragraph>
    </Section>
    <Section position="2" start_page="509" end_page="509" type="sub_section">
      <SectionTitle>
3.2 Training the Model
</SectionTitle>
      <Paragraph position="0"> We now describe how the loss function in Eq. 3 can be optimized using gradient descent. The gradient of the loss function is given by:</Paragraph>
      <Paragraph position="2"> is the expected value of the feature vector produced by parse tree ti,j. As we remarked earlier, |H(ti,j)| is exponential in size so direct calculation of either p(ti,j |si,Th) or F(ti,j,Th) is impractical. However, using the feature-vector decomposition in Eq. 4, we can rewrite the key functions of Th as follows:</Paragraph>
      <Paragraph position="4"> where p(ti,j,w,hw) and p(ti,j,u,v,hu,hv) are marginalized probabilities and Zi,j is the associated normalization constant:</Paragraph>
      <Paragraph position="6"> The three quantities above can be computed with belief propagation (Yedidia et al., 2003), a dynamic-programming technique that is efficient2 and exact 2The running time of belief propagation varies linearly with the number of nodes in D(ti,j) and quadratically with the cardinality of the largest hidden-value domain.</Paragraph>
      <Paragraph position="7"> when the graph D(ti,j) is a tree, which is the case in our parse reranking model. Having calculated the gradient in this way, we minimize the loss using stochastic gradient descent3 (LeCun et al., 1998).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>