<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1027">
  <Title>Stochastic Lexicalized Tree-Adjoining Grammars</Title>
  <Section position="4" start_page="140" end_page="141" type="metho">
    <SectionTitle>
2. SLTAG
</SectionTitle>
    <Paragraph position="0"> Informally speaking, SLTAGs are defined by assigning a probability to the event that an elementary tree is combined (by adjunction or substitution) on a specific node of another elementary tree. These events of combination are the stochastic processes considered.</Paragraph>
    <Paragraph position="1"> For sake of mathematical precision and elegance, we use a stochastic linear rewriting system, stochastic linear indexed grammars (SLIG), as a notation for SLTAGs. A linear indexed grammar is constructed following the method given in \[13\]. However, in addition, each rule is associated with a probability.</Paragraph>
    <Paragraph position="2"> Linear Indexed grammar (LIG) \[14, 15\] is a rewriting system in which the non-terminal symbols are augmented with a stack. In addition to rewriting non-terminals, the rules of the grammar can have the effect of pushing or popping symbols on top of the stacks that are associated with each non-terminal symbol. A specific rule is triggered by the non-terminal on the left hand side of the rule and the top element of its associated stack. LIGs \[15\] restrict The productions of a LIG are restricted to copy the stack corresponding to the non-terminal being rewritten to at most one stack associated with a non-terminal symbol on the right hand side of the production. 2 In the following, \[..p\] refers to a possibly unbounded stack whose top element is p and whose remaining part is schematically written as '..'. \[$\] represents a stack whose only element is the bottom of the stack. While it is possible to define SLIGs in general, we define them for the particular case where the rules are binary branching and where the left hand sides are always incomparable.</Paragraph>
    <Paragraph position="3"> A stochastic linear indexed grammar, G, is denoted by (VN, VT, Vt, S, Prod), where VN is a finite set of non-terminal symbols; VT is a finite set of terminal symbols; Vi is a finite set of stack symbols; S E VN is the start symbol; Prod is a finite set of productions of the form:</Paragraph>
    <Paragraph position="5"> where Xk E VN, a G VT and Po E Vi, Pl,P2 G V~; 2 LIGs have been shown to be weakly equivalent to Tree-Adjoining Grammars \[16\].</Paragraph>
    <Paragraph position="6"> P, a probability distribution which assigns a probability, 0 &lt; P(X\[..z\] ~ A) &lt; 1, to a rule, X\[..x\] -+ A E Prod such that the sum of the probabilities of all the rules that can be applied to any non-terminal annotated with a stack is equal to one. More precisely if, VX G VN, Vp E Vi:</Paragraph>
    <Paragraph position="8"> that X\[..p\] is rewritten as A.</Paragraph>
    <Paragraph position="9"> A derivation starts from S associated with the empty stack (S\[$\]) and each level of the derivation must be validated by a production rule. The language of a SLIG is defined as follows: n = {w E V~ I S\[$\]:~w}.</Paragraph>
    <Paragraph position="10"> The probability of a derivation is defined as the product of the probabilities of all individual rules involved (counting repetition) in the derivation, the derivation being validated by a correct configuration of the stack at each level. The probability of a sentence is then computed as the sum of the probabilities of all derivations of the sentence.</Paragraph>
    <Paragraph position="11"> Following the construction described in \[13\], given a LTAG, Gtag, we construct an equivalent 3 LIG, G, ug. In addition, a probability is assigned to each production of the LIG. For simplicity of explanation and without loss of generality we assume that each node in an elementary tree found in a tree-adjoining grammar is either a leaf node (i.e. either a foot node or a non-empty terminal node) or binary branching. 4 The construction of the equivalent SLIG follows.</Paragraph>
    <Paragraph position="12"> The non-terminal symbols of Gstia are the two symbols 'top' (t) and 'bottom' (b), the set of terminal symbols is the same as the one of Gtag, the set of stack symbols is the set of nodes (not node labels) found in the elementary trees of Gta9 augmented with the bottom of the stack ($), and the start symbol is 'top' (t).</Paragraph>
    <Paragraph position="13"> For all root nodes N0 of an initial tree whose root is labeled by S, the following starting rules are added:</Paragraph>
    <Paragraph position="15"> These rules state that a derivation must start from the top of the root node of some initial tree. P is the probability that a derivation starts from the initial tree associated with a lexical item and rooted by No.</Paragraph>
    <Paragraph position="16"> Then, for all node ~1 in an elementary tree, the following rules are generated.</Paragraph>
    <Paragraph position="17">  tree-adjoining grammar. 4The algorithms explained in this paper can be generalized to lexicalized tree-adjoining grammars that need not be in Chomsky Normal Form using techniques similar the one found in \[17\].</Paragraph>
    <Paragraph position="18">  the spine (i.e. subsumes the foot node), include:</Paragraph>
    <Paragraph position="20"> Since (2) encodes an immediate domination link defined  by t\]he tree-adjoining grammar, its associated probability is one.</Paragraph>
    <Paragraph position="21"> * Similarly, if ~7102 are the 2 children of a node r/such that r/1 is on the spine (i.e. subsumes the foot node), include: bill (3) Since (3) encodes an immediate domination link defined by the tree-adjoining grammar, its associated probability is one.</Paragraph>
    <Paragraph position="22"> * If ~71~72 are the 2 children of a node r/such that none of them is on the spine, include: b\[$q\] p~l t\[$rh\]t\[$r/2 \] (4) Since (4) also encodes an immediate domination link defined by the tree-adjoining grammar, its associated probability is one.</Paragraph>
    <Paragraph position="23"> * If 77 is a node labeled by a non-terminal symbol and if  it does not have an obligatory adjoining constraint, then we need to consider the case that adjunction might not take place. In this case, include:</Paragraph>
    <Paragraph position="25"> The probability of rule (5) corresponds to the probability that no adjunction takes place at node 77.</Paragraph>
    <Paragraph position="26"> * If 77 is an node on which the auxiliary tree fl can be adjoined, the adjunction of fl can be predicted, therefore (assuming that Yr is the root node of fl) include: (6) The probability of rule (6) corresponds to the probability of adjoining the auxiliary tree whose root node is ~Tr, say ~, on the node ~7 belonging to some elementary tree, say Or. 5 * If r7! is the foot node of an auxiliary tree/9 that has been adjoined, then the derivation of the node below O! must resume. In this case, include:</Paragraph>
    <Paragraph position="28"> The above stochastic production is included with probability one since the decision of adjunction has already been made in rules of the form (6).</Paragraph>
    <Paragraph position="29"> * Finally, if 7/1 is the root node of an initial tree that can be substituted on a node marked for substitution ~/, include: (8) Here, p is the probability that the initial tree rooted by r/1 is substituted at node r I. It corresponds to the probability of substituting the lexicalized initial tree whose root node 5 Since the grammar is lexicalized, both trees c~ and/~ are associated with lexical items, and the site node for adjunction rl corresponds to some syntactic modification. Such rule encapsulates S modifiers (e.g. sentential adverbs as in &amp;quot;apparently John left&amp;quot;), VP modifiers (e.g. verb phrase adverbs as in &amp;quot;John left abruptly)&amp;quot;, NP modifiers (e.g. relative clauses as in &amp;quot;The man who left was happy&amp;quot;), N modifiers (e.g. adjectives as in &amp;quot;?relty woman&amp;quot;), or even sentential complements (e.g. John ~hlnks ~hat Harry is sick).</Paragraph>
    <Paragraph position="30"> is 01, say 6, at the node r/ofa lexicalized elementary tree, say o~. 6 The SLIG constructed as above is well defined if the following equalities hold for all nodes r/:</Paragraph>
    <Paragraph position="32"> Beside the distributional phenomena that we mentioned earlier, SLTAG also captures the effect of adjoining constraints (selective, obligatory or null adjoining) which are required for tree-adjoining grammar, s</Paragraph>
  </Section>
  <Section position="5" start_page="141" end_page="142" type="metho">
    <SectionTitle>
3. PROBABILITY OF A SENTENCE
</SectionTitle>
    <Paragraph position="0"> We now define an bottom-up algorithm for SLTAG which computes the probability of an input string. The algorithm is an extension of the CKY-type parser for tree-adjoining grammar \[18\]. The extended algorithm parses all spans of the input string and also computes their probability in a bottom-up fashion.</Paragraph>
    <Paragraph position="1"> Since the string on the frontier of an auxiliary is broken up into two substrings by the foot node, for the purpose of computing the probability of the sentence, we will consider the probability that a node derives two substrings of the input string. This entity will be called the inside probability. Its exact definition is given below.</Paragraph>
    <Paragraph position="2"> We will refer to the subsequence of the input string w = al... aN from position i to j, w~. It is defined as follows: &amp;quot;del f ai+l * &amp;quot;&amp;quot; aj , if i &lt; j w~= ~ ~ ,ifi&gt;j Given a string w = al.. &amp;quot;aN and a SLTAG rewritten as in (1-8) the inside probability, IW(pos, O, i, j, k, i), is defined for all nodes 77 contained in an elementary tree o~ and for pose {t,b},andforallindices0&lt;i&lt;j&lt; k&lt;l&lt;Nas follows: 6 Among other cases, the probability of this rule corresponds to the probability of filling some argument position by a lexicalized tree. It will encapsulate the distribution for selectional restriction since the position of substitution is taken into account.</Paragraph>
    <Paragraph position="3"> rWe will not investigate the conditions under which (12) holds. We conjecture that some of the techniques used for checking the consistency of stochastic context-free grammars can be adapted to SLTAG. SFor exaxnple, for a given node r/setting to zero the probability of all rules of the form (6) has the effect of blocking adjunction.  (i) If the node t/does not subsume the foot node of oL (if there is one), then j and k are unbound and: I tdeg (pos, 71, i,-,-, l)ae=lP(pos\[*71\]~ w~) (ii) If the node T/ subsumes the foot node ~/! of a,  then: Itdeg(pos, r h i,j, k, l)a=eY P( pos\[$ol~ w~b\[$ojlw~) In (ii), only the top element of the stack matters since as a consequence of the construction of the SLIG, we have that if pos\[$@=~ w~b\[$r//\]w~ then for all string 7 E V~ we also have pos\[*~\]~ ~b\[*~lw~.~ Initially, all inside probabilities are set to zero* Then, the computation goes bottom-up starting from the productions introducing lexical items: if r/is a node such that b\[$r/\] ~ a, then: ( 1 ifl=i+lAa=w~+l (13) IW(b'rh i'-'-'l) = 0 otherwise.</Paragraph>
    <Paragraph position="4"> Then, the inside probabilities of larger substrings are computed bottom-up relying on the recurrence equations stated in Appendix A. This computation takes in the worst case O(IG\]2g6)-time and O(\[GIN4)-space for a sentence of length N.</Paragraph>
    <Paragraph position="5"> Once the inside probabilities computed, we obtain the probability of the sentence as follows:</Paragraph>
    <Paragraph position="7"> We now consider the problem of re-estimating a SLTAG.</Paragraph>
  </Section>
  <Section position="6" start_page="142" end_page="143" type="metho">
    <SectionTitle>
4. RE-ESTIMATION OF SLTAG
</SectionTitle>
    <Paragraph position="0"> Given a set of positive example sentences, W = {wl...WK}, assumed to have been generated by an unknown SLTAG, we would like to compute the probability of each rule of a given SLTAG in order to maximize the probability that the corpus were generated by this SLTAG.</Paragraph>
    <Paragraph position="1"> An algorithm solving this problem can be used in two different ways.</Paragraph>
    <Paragraph position="2"> The first use is as a re-estimation algorithm. In this approach, the input SLTAG derives structures that are reasonable according to some criteria (such as a linguistic theory and some a priori knowledge of the corpus) and the intended use of the algorithm is to refine the probability of each rule* The second use is as a learning algorithm. At the first iteration, a SLTAG which generates all possible structures over a given set of nodes and terminal symbols is used*</Paragraph>
    <Section position="1" start_page="142" end_page="143" type="sub_section">
      <SectionTitle>
9This can be seen by observing that for any node on the path from
</SectionTitle>
      <Paragraph position="0"> the root node to the foot node of an auxiliary tree, the stack remains unchanged.</Paragraph>
      <Paragraph position="1"> Initially the probability of each rule is randomly assigned and then the algorithm will re-estimate these probabilities* Informally speaking, given a first estimate of the parameters of a SLTAG, the algorithm re-estimates these parameters on the basis of the parses of each sentence in a training corpus obtained by a CKY-type parser. The algorithm derives a new estimate such that the probability that the corpus were generated by the grarnlnar is increased. By analogy to the inside-outside algorithm for stochastic context-free grammars \[19, 7\], we believe that the following quantity decreases after each iteration: 1deg</Paragraph>
      <Paragraph position="3"> In order to derive a new estimate, the algorithm needs to compute for all sentences in W the inside probabilities and the outside probabilities. Given a string w = al...aN, the outside probability, Otdeg(pos,~l,i,j,k,l), is defined for all nodes r/contained in an elementary tree o~ and for pos E it,b}, and for all indices 0 _&lt; i _&lt; j _&lt; k &lt; i &lt; N as follows:  (i) If the node 7/does not subsume the foot node of o~ (if there is one), then j and k are unbound and: o~ (poe, ,7, i, -, -, t)~*- -I P(B7 e V~ s.t. t\[$\]:~ w~ pos\[*Trl\] wz N) (ii) If the node 77 does subsume the foot node rll of a then:</Paragraph>
      <Paragraph position="5"> Once the inside probabilities computed, the outside probabilities can be computed top-down by considering smaller spans of the input string starting with OW(t,$, 0,-,-, N) = 1 (by definition). This is done by computing the recurrence equations stated in Appendix B.</Paragraph>
      <Paragraph position="6"> Due to the lack of space, we only illustrate the re-estimation of the rules corresponding to adjunction, rules of the form: t\[..r/\] ~ t\[..r/rF\]. The other re-estimation formulae can be derived in a similar manner.</Paragraph>
      <Paragraph position="7"> In the following, we assume that r 7 subsumes the foot node 7/! within a same elementary tree, and also that r/I subsumes the foot node r/tt (within a same elementary tree).</Paragraph>
      <Paragraph position="8"> 1degHe is an estimate of the entropy H of the unknown language being estimated and it converges to the entropy of the language as the size of the corpus grows.</Paragraph>
      <Paragraph position="9">  Let: Nto (t\[..~}\] --~ t\[..zpl/\], i, r, j, k, s, l)</Paragraph>
      <Paragraph position="11"> It can be shown that the rule t\[..r}\] --+ t\[..yyl\] is optimally reestimated at each iteration as follows:</Paragraph>
      <Paragraph position="13"> The denominator of the above reestimation formula estimates the probability that a derivation will involve at least one expansion of t\[..r}\]. The numerator estimates the probability that a derivation will involve the rule t\[--r}\] ---~ t\[..r}r}t\]. The probability of no adjunction on the node r/, P(t\[..r/\] ~ b\[-.z}\] is reestimated using the equality (9).</Paragraph>
      <Paragraph position="14"> The algorithm reiterates until He(W) is unchanged (within some epsilon) between two iterations. Each iteration of the algorithm requires at most O(\[GI2N6)-time for each sentence of length N.</Paragraph>
  </Section>
  <Section position="7" start_page="143" end_page="144" type="metho">
    <SectionTitle>
5. CONCLUSION
</SectionTitle>
    <Paragraph position="0"> A novel statistical language model and fundamental algorithms for this model have been presented.</Paragraph>
    <Paragraph position="1"> SLTAGs provide a stochastic model both hierarchical and sensitive to lexical information. They combine the advantages of purely lexical models such as N-gram distributions or Hidden Markov Models and the one of hierarchical modes as stochastic context-free grammars without their inherent limitations. The parameters of a SLTAG correspond to the probability of combining two structures each one associated with a word and therefore capture linguistically relevant distributions over words.</Paragraph>
    <Paragraph position="2"> An algorithm for computing the probability of a sentence generated by a SLTAG was presented as well as an iterative algorithm for estimating the parameters of a SLTAG given a training corpus of raw text. Similarly to its context-free counterpart, the reestimation algorithm can be extended to handle partially parsed corpora \[20\]. The worst case complexity of the algorithm with respect to the length of the input string (O(N6)) makes it impractical with a large corpus on a single processor computer for grammars requiring the worst case complexity. However, this complexity reduces to O(N 3) or to O(N 2) for interesting subsets of SLTAGs. If time permits, experiments in this direction will be reported  at the time of the meeting.</Paragraph>
    <Paragraph position="3"> Furthermore, the techniques explained in this paper apply to other grammatical formalisms such as combinatory categorial grammars and modified head grammars since they have been proven to be equivalent to tree-adjoining grammars and linear indexed grammars \[21\].</Paragraph>
    <Paragraph position="4"> In collaboration with Aravind Joshi, Fernando Pereira and Stuart Shieber, we are currently investigating additional algorithms and applications for SLTAG, methods for lexical clustering and automatic construction of a SLTAG from a large training corpus.</Paragraph>
  </Section>
class="xml-element"></Paper>