File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/p98-1035_metho.xml

Size: 18,947 bytes

Last Modified: 2025-10-06 14:14:55

<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1035">
  <Title>Exploiting Syntactic Structure for Language Modeling</Title>
  <Section position="3" start_page="0" end_page="226" type="metho">
    <SectionTitle>
2 The Basic Idea and Terminology
</SectionTitle>
    <Paragraph position="0"> Consider predicting the word after in the sentence: the contract ended with a loss of 7 cents after trading as low as 89 cents.</Paragraph>
    <Paragraph position="1"> A 3-gram approach would predict after from (7, cents) whereas it is intuitively clear that the strongest predictor would be ended which is outside the reach of even 7-grams. Our assumption is that what enables humans to make a good prediction of after is the syntactic structure in the past. The linguistically correct partial parse of the word history when predicting after is shown in Figure 1.</Paragraph>
    <Paragraph position="2"> The word ended is called the headword of the constituent (ended (with (...) )) and ended is an exposed headword when predicting after -- topmost headword in the largest constituent that contains it.</Paragraph>
    <Paragraph position="3"> The syntactic structure in the past filters out irrelevant words and points to the important ones, thus enabling the use of long distance information when predicting the next word.</Paragraph>
    <Paragraph position="4"> Our model will attempt to build the syntactic structure incrementally while traversing the sentence left-to-right. The model will assign a probability P(W, T) to every sentence W with every possible POStag assignment, binary branching parse, non-terminal label and headword annotation for every constituent of T.</Paragraph>
    <Paragraph position="5"> Let W be a sentence of length n words to which we have prepended &lt;s&gt; and appended &lt;/s&gt; so that Wo =&lt;s&gt; and w,+l =&lt;/s&gt;. Let Wk be the word k-prefix Wo...wk of the sentence and WkTk  &amp;quot;i .... &amp;quot; deg'' (C/:s&gt;. SB) ....... (wp. t p) (w {p/l }. L_( I~-I }) ........ (wk. t_k) w_( k*l }.... &lt;/s.~  the word-parse k-prefix. To stress this point, a word-parse k-prefix contains -- for a given parse -- only those binary subtrees whose span is completely included in the word k-prefix, excluding w0 =&lt;s&gt;. Single words along with their POStag can be regarded as root-only trees. Figure 2 shows a word-parse k-prefix; h_0 .. h_{-m} are the exposed heads, each head being a pair(headword, non-terminal label), or (word, POStag) in the case of a root-only tree. A complete parse -- Figure 3 -- is any binary parse of the (wl,tl)...(wn,t,) (&lt;/s&gt;, SE) sequence with the restriction that (&lt;/s&gt;, TOP') is the only allowed head. Note that ((wl,tl)...(w,,t,)) needn't be a constituent, but for the parses where it is, there is no restriction on which of its words is the headword or what is the non-terminal label that accompanies the headword.</Paragraph>
    <Paragraph position="6"> The model will operate by means of three modules: null * WORD-PREDICTOR predicts the next word wk+l given the word-parse k-prefix and then passes control to the TAGGER; * TAGGER predicts the POStag of the next word tk+l given the word-parse k-prefix and the newly predicted word and then passes control to the PARSER; * PARSER grows the already existing binary branching structure by repeatedly generating the transitions: (unary, NTlabel), (adjoin-left, NTlabel) or (adjoin-right, NTlabel) until it passes control to the PREDICTOR by taking a null transition.</Paragraph>
    <Paragraph position="7"> NTlabel is the non-terminal label assigned to the newly built constituent and {left ,right} specifies where the new headword is inherited from.</Paragraph>
    <Paragraph position="8"> The operations performed by the PARSER are illustrated in Figures 4-6 and they ensure that all possible binary branching parses with all possible T'_{.m/l l &lt;-&lt;s.~.</Paragraph>
    <Paragraph position="10"> headword and non-terminal label assignments for the wl ... wk word sequence can be generated. The following algorithm formalizes the above description of the sequential generation of a sentence with a complete parse.</Paragraph>
    <Paragraph position="12"> The unary transition is allowed only when the most recent exposed head is a leaf of the tree -a regular word along with its POStag -- hence it can be taken at most once at a given position in the  T'_l.m+l } &lt;-&lt;s&gt; h' {- I }=h {-2} h'_0 = (h_0.word, NTlabC/l)  input word string. The second subtree in Figure 2 provides an example of a unary transition followed by a null transition.</Paragraph>
    <Paragraph position="13"> It is easy to see that any given word sequence with a possible parse and headword annotation is generated by a unique sequence of model actions. This will prove very useful in initializing our model parameters from a treebank -- see section 3.5.</Paragraph>
  </Section>
  <Section position="4" start_page="226" end_page="229" type="metho">
    <SectionTitle>
3 Probabilistic Model
</SectionTitle>
    <Paragraph position="0"> The probability P(W, T) of a word sequence W and a complete parse T can be broken into:</Paragraph>
    <Paragraph position="2"> * Nk -- 1 is the number of operations the PARSER executes before passing control to the WORD-PREDICTOR (the Nk-th operation at position k is the null transition); Nk is a function of T * pi k denotes the i-th PARSER operation carried out at position k in the word string;</Paragraph>
    <Paragraph position="4"> As can be seen, (wk, tk, Wk-xTk-x,p~...pki_x) is one of the Nk word-parse k-prefixes WkTk at position k in the sentence, i = 1, Nk.</Paragraph>
    <Paragraph position="5"> To ensure a proper probabilistic model (1) we have to make sure that (2), (3) and (4) are well defined conditional probabilities and that the model halts with probability one. Consequently, certain PARSER and WORD-PREDICTOR probabilities must be given specific values:</Paragraph>
    <Paragraph position="7"> &lt;/s&gt; -- ensures that (&lt;s&gt;, SB) is adjoined in the last step of the parsing process;</Paragraph>
    <Paragraph position="9"> ensure that the parse generated by our model is consistent with the definition of a complete parse;</Paragraph>
    <Paragraph position="11"> ensures that the model halts with probability one.</Paragraph>
    <Paragraph position="12"> The word-predictor model (2) predicts the next word based on the preceding 2 exposed heads, thus making the following equivalence classification:</Paragraph>
    <Paragraph position="14"> After experimenting with several equivalence classifications of the word-parse prefix for the tagger model, the conditioning part of model (3) was reduced to using the word to be tagged and the tags of the two most recent exposed heads:</Paragraph>
    <Paragraph position="16"> Model (4) assigns probability to different parses of the word k-prefix by chaining the elementary operations described above. The workings of the parser module are similar to those of Spatter (Jelinek et al., 1994). The equivalence classification of the WkTk word-parse we used for the parser model (4) was the same as the one used in (Collins, 1996): p (pk / Wk Tk ) = p (pk / ho , h-x) It is worth noting that if the binary branching structure developed by the parser were always right-branching and we mapped the POStag and non-terminal label vocabularies to a single type then our model would be equivalent to a trigram language model.</Paragraph>
    <Section position="1" start_page="226" end_page="227" type="sub_section">
      <SectionTitle>
3.1 Modeling Tools
</SectionTitle>
      <Paragraph position="0"> All model components -- WORD-PREDICTOR, TAGGER, PARSER -- are conditional probabilistic models of the type P(y/xl,x2,...,xn) where y, Xx,X2,...,Xn belong to a mixed bag of words, POStags, non-terminal labels and parser operations (y only). For simplicity, the modeling method we chose was deleted interpolation among relative frequency estimates of different orders fn(') using a  recursive mixing scheme:</Paragraph>
      <Paragraph position="2"> As can be seen, the context mixing scheme discards items in the context in right-to-left order. The A coefficients are tied based on the range of the count C(xx,...,Xn). The approach is a standard one which doesn't require an extensive description given the literature available on it (Jelinek and Mercer, 1980).</Paragraph>
    </Section>
    <Section position="2" start_page="227" end_page="227" type="sub_section">
      <SectionTitle>
3.2 Search Strategy
</SectionTitle>
      <Paragraph position="0"> Since the number of parses for a given word prefix Wt grows exponentially with k, I{Tk}l ,,. O(2k), the state space of our model is huge even for relatively short sentences so we had to use a search strategy that prunes it. Our choice was a synchronous multi-stack search algorithm which is very similar to a beam search.</Paragraph>
      <Paragraph position="1"> Each stack contains hypotheses -- partial parses -- that have been constructed by the same number of predictor and the same number of parser operations.</Paragraph>
      <Paragraph position="2"> The hypotheses in each stack are ranked according to the ln(P(W, T)) score, highest on top. The width of the search is controlled by two parameters: * the maximum stack depth -- the maximum number of hypotheses the stack can contain at any given state; * log-probability threshold -- the difference between the log-probability score of the top-most hypothesis and the bottom-most hypothesis at any given state of the stack cannot be larger than a given threshold.  above pruning strategy proved to be insufficient so we chose to also discard all hypotheses whose score is more than the log-probability threshold below the score of the topmost hypothesis. This additional pruning step is performed after all hypotheses in stage k' have been extended with the null parser transition and thus prepared for scanning a new word.</Paragraph>
    </Section>
    <Section position="3" start_page="227" end_page="228" type="sub_section">
      <SectionTitle>
3.3 Word Level Perplexity
</SectionTitle>
      <Paragraph position="0"> The conditional perplexity calculated by assigning to a whole sentence the probability:</Paragraph>
      <Paragraph position="2"> where T* = argrnaxTP(W, T), is not valid because it is not causal: when predicting wk+l we use T* which was determined by looking at the entire sentence. To be able to compare the perplexity of our</Paragraph>
      <Paragraph position="4"> model with that resulting from the standard tri-gram approach, we need to factor in the entropy of guessing the correct parse T~ before predicting wk+l, based solely on the word prefix Wk.</Paragraph>
      <Paragraph position="5"> The probability assignment for the word at posi-</Paragraph>
      <Paragraph position="7"> which ensures a proper probability over strings W*, where Sk is the set of all parses present in our stacks at the current stage k.</Paragraph>
      <Paragraph position="8"> Another possibility for evaluating the word level perplexity of our model is to approximate the probability of a whole sentence:</Paragraph>
      <Paragraph position="10"> where T (k) is one of the &amp;quot;N-best&amp;quot; -- in the sense defined by our search -- parses for W. This is a deficient probability assignment, however useful for justifying the model parameter re-estimation.</Paragraph>
      <Paragraph position="11"> The two estimates (8) and (10) are both consistent in the sense that if the sums are carried over all  possible parses we get the correct value for the word level perplexity of our model.</Paragraph>
    </Section>
    <Section position="4" start_page="228" end_page="229" type="sub_section">
      <SectionTitle>
3.4 Parameter Re-estimation
</SectionTitle>
      <Paragraph position="0"> The major problem we face when trying to reestimate the model parameters is the huge state space of the model and the fact that dynamic programming techniques similar to those used in HMM parameter re-estimation cannot be used with our model.</Paragraph>
      <Paragraph position="1"> Our solution is inspired by an HMM re-estimation technique that works on pruned -- N-best -- trellises(Byrne et al., 1998).</Paragraph>
      <Paragraph position="2"> Let (W, T(k)), k = 1... N be the set of hypotheses that survived our pruning strategy until the end of the parsing process for sentence W. Each of them was produced by a sequence of model actions, chained together as described in section 2; let us call the sequence of model actions that produced a given (W, T) the derivation(W, T).</Paragraph>
      <Paragraph position="3"> Let an elementary event in the derivation(W, T)</Paragraph>
      <Paragraph position="5"> action number l in the derivation(W, T); , y~mt) is the action taken at position I in the derivation: null if mt = WORD-PREDICTOR, then y~m,) is a word; if mt -- TAGGER, then y~m~) is a POStag; if mt = PARSER, then y~m~) is a parser-action; * ~m~) is the context in which the above action was taken: if rat = WORD-PREDICTOR or PARSER, then</Paragraph>
      <Paragraph position="7"> The probability associated with each model action is determined as described in section 3.1, based on counts C (m) (y(m), x_(&amp;quot;0), one set for each model component.</Paragraph>
      <Paragraph position="8"> Assuming that the deleted interpolation coefficients and the count ranges used for tying them stay fixed, these counts are the only parameters to be re-estimated in an eventual re-estimation procedure; indeed, once a set of counts C (m) (y(m), x_(m)) is specified for a given model ra, we can easily calculate: * the relative frequency estimates fn(m)/,,(m) Ix(m) ~ for all context orders kY I_n /</Paragraph>
      <Paragraph position="10"> * the count c(m)(x_ (m)) used for determining the A(x_ (m)) value to be used with the order-n context x(m)&amp;quot; This is all we need for calculating the probability of an elementary event and then the probability of an entire derivation.</Paragraph>
      <Paragraph position="11"> One training iteration of the re-estimation procedure we propose is described by the following algorithm: null N-best parse development data; // counts.El // prepare counts.E(i+l) for each model component c{ gather_counts development model_c; } In the parsing stage we retain for each &amp;quot;N-best&amp;quot; hypothesis (W, T(k)), k = 1... N, only the quantity C/(W, T(k)) p(W,T(k))/ N = ~-~k=l P(W, T(k)) and its derivation(W,T(k)). We then scan all the derivations in the &amp;quot;development set&amp;quot; and, for each occurrence of the elementary event (y(m), x_(m)) in derivation(W,T(k)) we accumulate the value C/(W,T (k)) in the C(m)(y(m),x__ (m)) counter to be used in the next iteration.</Paragraph>
      <Paragraph position="12"> The intuition behind this procedure is that C/(W,T (k)) is an approximation to the P(T(k)/w) probability which places all its mass on the parses that survived the parsing process; the above procedure simply accumulates the expected values of the counts c(m)(y(m),x (m)) under the C/(W,T (k)) conditional distribution. As explained previously, the C(m) (y(m), X_(m)) counts are the parameters defining our model, making our procedure similar to a rigorous EM approach (Dempster et al., 1977).</Paragraph>
      <Paragraph position="13"> A particular -- and very interesting -- case is that of events which had count zero but get a non-zero count in the next iteration, caused by the &amp;quot;N-best&amp;quot; nature of the re-estimation process. Consider a given sentence in our &amp;quot;development&amp;quot; set. The &amp;quot;N-best&amp;quot; derivations for this sentence are trajectories through the state space of our model. They will change from one iteration to the other due to the smoothing involved in the probability estimation and the change of the parameters -- event counts -- defining our model, thus allowing new events to appear and discarding others through purging low probability events from the stacks. The higher the number of trajectories per sentence, the more dynamic this change is expected to be.</Paragraph>
      <Paragraph position="14"> The results we obtained are presented in the experiments section. All the perplexity evaluations were done using the left-to-right formula (8) (L2R-PPL) for which the perplexity on the &amp;quot;development set&amp;quot; is not guaranteed to decrease from one iteration to another. However, we believe that our re-estimation method should not increase the approximation to perplexity based on (10) (SUM-PPL) -again, on the &amp;quot;development set&amp;quot;; we rely on the consistency property outlined at the end of section 3.3 to correlate the desired decrease in L2R-PPL with that in SUM-PPL. No claim can be made about the change in either L2R-PPL or SUM-PPL on test data.</Paragraph>
    </Section>
    <Section position="5" start_page="229" end_page="229" type="sub_section">
      <SectionTitle>
3.5 Initial Parameters
</SectionTitle>
      <Paragraph position="0"> Each model component -- WORD-PREDICTOR, TAGGER, PARSER -- is trained initially from a set of parsed sentences, after each parse tree (W, T) undergoes: * headword percolation and binarization -- see section 4; * decomposition into its derivation(W, T). Then, separately for each m model component, we: * gather joint counts cCm)(y(m),x (m)) from the derivations that make up the &amp;quot;development data&amp;quot; using C/(W,T) = 1; * estimate the deleted interpolation coefficients on joint counts gathered from &amp;quot;check data&amp;quot; using the EM algorithm.</Paragraph>
      <Paragraph position="1"> These are the initial parameters used with the re-estimation procedure described in the previous section. null</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="229" end_page="229" type="metho">
    <SectionTitle>
4 Headword Percolation and
Binarization
</SectionTitle>
    <Paragraph position="0"> In order to get initial statistics for our model components we needed to binarize the UPenn Tree-bank (Marcus et al., 1995) parse trees and percolate headwords. The procedure we used was to first percolate headwords using a context-free (CF) rule-based approach and then binarize the parses by using a rule-based approach again.</Paragraph>
    <Paragraph position="1"> The headword of a phrase is the word that best represents the phrase, all the other words in the phrase being modifiers of the headword. Statistically speaking, we were satisfied with the output of an enhanced version of the procedure described in (Collins, 1996) -- also known under the name &amp;quot;Magerman &amp; Black Headword Percolation Rules&amp;quot;. Once the position of the headword within a constituent -- equivalent with a CF production of the type Z --~ Y1.--Yn , where Z, Y1,...Yn are non-terminal labels or POStags (only for Y/) -- is identified to be k, we binarize the constituent as follows: depending on the Z identity, a fixed rule is used to decide which of the two binarization schemes in Figure 8 to apply. The intermediate nodes created by the above binarization schemes receive the non-terminal label Z ~.</Paragraph>
  </Section>
class="xml-element"></Paper>