<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0803">
  <Title>Parsing Word-Aligned Parallel Corpora in a Grammar Induction Context</Title>
  <Section position="3" start_page="17" end_page="18" type="metho">
    <SectionTitle>
2 Synchronous grammars
</SectionTitle>
    <Paragraph position="0"> For the purpose of grammar induction from parallel corpora, we assume a fairly straightforward extension of context-free grammars to the synchronous grammar case (compare the transduction grammars of (Lewis II and Stearns, 1968)): Firstly, the terminal and non-terminal categories are pairs of symbols, one for each language; as a special case, one of the two symbols can be NIL for material realized in only one of the languages. Secondly, the linear sequence of daughter categories that is specified in the rules can differ for the two languages; therefore, an explicit numerical ranking is used for the linear precedence in each language. We use a compact rule notation with a numerical ranking for the linear precedence in each language. The general form of a grammar rule for the case of two parallel languages is N0/M0 - N1:i1/M1:j1 . . . Nk:ik/Mk:jk, where Nl, Ml are NIL or a terminal or nonterminal symbol for language L1 and L2, respectively, and il, jl are natural numbers for the rank of the phrase in the sequence for L1 and L2 respectively (for NIL categories a special rank 0 is assumed).5 Since linear ordering of daughters in both languages is explicitly encoded by the rank indices, the specification sequence in the rule is irrelevant from a declarative point of view. To facilitate parsing we assume a normal form in which the right-hand side is ordered by the rank in L1, with the exception that the categories that are NIL in L1 come last. If there are several such 5Note that in the probabilistic variants of these grammars, we will typically expect that any ordering of the right-hand side symbols is possible (but that the probability will of course vary - in a maximum entropy or log-linear model, the probability will be estimated based on a variety of learning features). This means that in parsing, the right-hand side categories will be accepted as they come in, and the relevant probability parameters are looked up accordingly.</Paragraph>
    <Paragraph position="1"> NIL categories in the same rule, they are viewed as unordered with respect to each other.6 Fig. 2 illustrates our simple synchronous grammar formalism with some rules of a sample grammar and their application on a German/English sentence pair. Derivation with a synchronous grammar gives rise to a multitree, which combines classical phrase structure trees for the languages involved and also encodes the phrase level correspondence across the languages. Note that the two monolingual trees in fig. 2 for German and English are just two ways of unfolding the common underlying multitree.</Paragraph>
    <Paragraph position="2"> Note that the simple formalism goes along with the continuity assumption that every complete constituent is continuous in both languages. Various recent studies in the field of syntax-based Statistical MT have shown that such an assumption is problematic when based on typical treebank-style analyses. As (Melamed, 2003) discusses for instance, in the context of binary branching structures even simple examples like the English/French pair a gift for you from France - un cadeau de France pour vouz [a gift from France for you] lead to discontinuity of a &amp;quot;synchronous phrase&amp;quot; in one of the two languages. (Gildea, 2003) and (Galley et al., 2004) discuss different ways of generalizing the tree-level crosslinguistic correspondence relation, so it is not confined to single tree nodes, thereby avoiding a continuity assumption. We believe that in order to obtain full coverage on real parallel corpora, some mechanism along these lines will be required.</Paragraph>
    <Paragraph position="3"> However, if the typical rich phrase structure analyses (with fairly detailed fine structure) are replaced by flat, multiply branching analyses, most of the highly frequent problematic cases are resolved.7 In  ments by (Fox, 2002), who compared (i) treebank-parser style analyses, (ii) a variant with flattened VPs, and (iii) dependency structures. The degree of cross-linguistic phrasal cohesion increases from (i) to (iii). With flat clausal trees, we will come close to dependency structures with respect to cohesion.</Paragraph>
    <Paragraph position="4">  the flat representation that we assume, a clause is represented in a single subtree of depth 1, with all verbal elements and the argument/adjunct phrases (NPs or PPs) as immediate daughters of the clause node. Similarly, argument/adjunct phrases are flat internally. Such a flat representation is justified both from the point of view of linguistic learning and from the point of view of grammar application: (i) Language-specific principles of syntactic structure (e.g., the strong configurationality of English), which are normally captured linguistically by the richer phrase structure, are available to be induced in learning as systematic patterns in the relative ordering of the elements of a clause. (ii) The predicate-argument-modifier structure relevant for application of the grammars, e.g., in information extraction can be directly read off the flat clausal representation.</Paragraph>
    <Paragraph position="5"> It is a hypothesis of our longer-term project that a word alignment-based consensus structure which works with flat representations and under the continuity assumption is a very effective starting point for learning the basic language-specific constraints required for a syntactic grammar. Linguistic phenomena that fall outside what can be captured in this confined framework (in particular unbounded dependencies spanning more than one clause and discontinuous argument phrases) will then be learned in a later bootstrapping step that provides a richer set of operations. We are aware of a number of open practical questions, e.g.: Will the fact that real parallel corpora often contain rather free translations undermine our idea of using the consensus structure for learning basic syntactic constraints? Statistical alignments are imperfect - can the constraints imposed by the word alignment be relaxed accordingly without sacrificing tractability and the effect of indirect supervision?8</Paragraph>
  </Section>
  <Section position="4" start_page="18" end_page="20" type="metho">
    <SectionTitle>
3 Alignment-guided synchronous parsing
</SectionTitle>
    <Paragraph position="0"> Our dynamic programming algorithm can be described as a variant of standard Earley-style chart parsing (Earley, 1970) and generation (Shieber, 1988; Kay, 1996). The chart is a data structure which stores all sub-analyses that cover part of the input string (in parsing) or meaning representation (in generation). Memoizing such partial results has the standard advantage of dynamic programming techniques - it helps one to avoid unnecessary re-computation of partial results. The chart structure for context-free parsing is also exploited directly in dynamic programming algorithms for probabilistic context-free grammars (PCFGs): (i) the inside (or outside) algorithm for summing over the probabilities for every possible analysis of a given string, (ii) the Viterbi algorithm for determining the most likely analysis of a given string, and (iii) the in8Ultimately, bootstrapping of not only the grammars, but also of the word alignment should be applied.</Paragraph>
    <Paragraph position="1">  side/outside algorithm for re-estimating the parameters of the PCFG in an Expectation-Maximization approach (i.e., for iterative training of a PCFG on unlabeled data). This aspect is important for the intended later application of our parsing algorithm in a grammar induction context.</Paragraph>
    <Paragraph position="2"> A convenient way of describing Earley-style parsing is by inference rules. For instance, the central completion step in Earley parsing can be described</Paragraph>
    <Paragraph position="4"> Synchronous parsing. The input in synchronous parsing is not a one-dimensional string, but a pair of sentences, i.e., a two-dimensional array of possible word pairs (or a multidimensional array if we are looking at a multilingual corpus), as illustrated in fig. 3.</Paragraph>
    <Paragraph position="5">  put (with word alignment marked) The natural way of generalizing context-free parsing to synchronous grammars is thus to control the inference rules by string indices in both dimensions. Graphically speaking, parsing amounts to identifying rectangular crosslinguistic constituents - by assembling smaller rectangles that will together cover the full string spans in both dimensions (compare (Wu, 1997; Melamed, 2003)). For instance in fig. 4, the NP/NP rectangle [i1, j1, j2, k2] can be combined with the Vinf/Vinf rectangle [j1, k1, i2, j2] (assuming there is an appropriate rule in the grammar). 9A chart item is specified through a position (*) in a production and a string span ([l1, l2]). &lt;X - a * Y b, [i, j]&gt; means that between string position i and j, the beginning of an X phrase has been found, covering a, but still missing Y b. Chart items for which the dot is at the end of a production (like</Paragraph>
    <Paragraph position="7"> parsing part of Can I interview her?/Kann ich sie interviewen? More generally, we get the inference rules (2) and  (3) (one for the case of parallel sequencing, one for crossed order across languages).</Paragraph>
    <Paragraph position="8"> (2) &lt;X1/X2 - a * Y1:r1/Y2:r2 b, [i1, j1, i2, j2]&gt; ,</Paragraph>
    <Paragraph position="10"> Since each inference rule contains six free variables over string positions (i1, j1, k1, i2, j2, k2), we get a parsing complexity of order O(n6) for unlexicalized grammars (where n is the number of words in the longer of the two strings from language L1 and L2) (Wu, 1997; Melamed, 2003). For large-scale learning experiments this may be problematic, especially when one moves to lexicalized grammars, which involve an additional factor of n4.10 As a further issue, we observe that the inference rules are insufficient for multiply branching rules, in which partial constituents may be discontinuous in one dimension (only complete constituents need to be continuous in both dimensions). For instance, by parsing the first two words of the German string in fig. 1 (Heute stellt), we should get a partial chart item for a sentence, but the English correspondents for the two words (now and is) are discontinuous, so we couldn't apply rule (2) or (3).</Paragraph>
    <Paragraph position="11"> Correspondence-guided parsing. As an alternative to the standard &amp;quot;rectangular indexing&amp;quot; approach 10The assumption here (following (Melamed, 2003)) is that lexicalization is not considered as just affecting the grammar constant, but that in parsing, every terminal symbol has to be considered as the potential head of every phrase of which it is a part. Melamed demonstrates: If the number of different category symbols is taken into consideration as l, we get O(l2n6) for unlexicalized grammars, and O(l6n10) for lexicalized grammars; however there are some possible optimizations.</Paragraph>
    <Paragraph position="12">  to synchronous parsing we propose a conceptually very simple asymmetric approach. As we will show in sec. 4 and 5, this algorithm is both theoretically and practically efficient when applied to sentence pairs for which a word alignment has previously been determined. The approach is asymmetric in that one of the languages is viewed as the &amp;quot;master language&amp;quot;, i.e., indexing in parsing is mainly based on this language (the &amp;quot;primary index&amp;quot; is the string span in L1 as in monolingual parsing). The other language contributes a secondary index, which is mainly used to guide parsing in the master language - i.e., certain options are eliminated. The choice of the master language is in principle arbitrary, but for efficiency considerations it is better to pick the one that has more words without a correspondent.</Paragraph>
    <Paragraph position="13"> A way of visualizing correspondence-guided parsing is that standard Earley parsing is applied to L1, with primary indexing by string position; as the chart items are assembled, the synchronous grammar and the information from the word alignment is used to check whether the string in L2 could be generated (essentially using chart-based generation techniques; cf. (Shieber, 1988; Neumann, 1998)).</Paragraph>
    <Paragraph position="14"> The index for chart items consists of two components: the string span in L1 and a bit vector for the words in L2 which are covered. For instance, based on fig. 3, the noun compound Agrarpolitik corresponding to agricultural policy in English will have the index &lt;[4, 5], [0, 0, 0, 0, 0, 0, 1, 1]&gt; (assuming for illustrative purposes that German is the master language in this case).</Paragraph>
    <Paragraph position="15"> The completion step in correspondence-guided parsing can be formulated as the following single in- null one subsequence of 1's).</Paragraph>
    <Paragraph position="16"> Condition (iii) excludes discontinuity in passive chart items, i.e., complete constituents; active items 11We use the bold-faced variables v,w,u for bit vectors; the function OR performs bitwise disjunction on the vectors (e.g., OR([0, 1, 1, 0, 0], [0, 0, 1, 0, 1]) = [0, 1, 1, 0, 1]). (i.e., partial constituents) may well contain discontinuities. The success condition for parsing a string with N words in L1 is that a chart item with index &lt;[0, N],1&gt; has been found for the start category pair of the grammar.</Paragraph>
    <Paragraph position="17"> Words in L2 with no correspondent in L1 (let's call them &amp;quot;L1-NIL&amp;quot;s for short), for example the words at and agricultural in fig. 3,12 can in principle appear between any two words of L1. Therefore they are represented with a &amp;quot;variable&amp;quot; empty L1-string span like for instance in &lt;[i, i], [0, 0, 1, 0, 0]&gt; . At first blush, such L1-NILs seem to introduce an extreme amount of non-determinism into the algorithm. Note however that due to the continuity assumption for complete constituents, the distribution of the L1-NILs is constrained by the other words in L2. This is exploited by the following inference rule, which is the only way of integrating L1-NILs into the chart:</Paragraph>
    <Paragraph position="19"> and v does not lead to more 0-separated 1sequences than v contains already); (ii) OR(v,w) = u.</Paragraph>
    <Paragraph position="20"> The rule has the effect of finalizing a cross-linguistic constituent (i.e., rectangle in the two-dimensional array) after all the parts that have correspondents in both languages have been found. 13</Paragraph>
  </Section>
  <Section position="5" start_page="20" end_page="21" type="metho">
    <SectionTitle>
4 Complexity
</SectionTitle>
    <Paragraph position="0"> We assume that the two-dimensional chart is initialized with the correspondences following from a word alignment. Hence, for each terminal that is non-empty in L1, both components of the index are known. When two items with known secondary indices are combined with rule (4), the new secondary 12It is conceivable that a word alignment would list agricultural as an additional correspondent for Agrarpolitik; but we use the given alignment for illustrative purposes.</Paragraph>
    <Paragraph position="1"> 13For instance, the L1-NILs in fig. 3 - NIL/at and NIL/agricultural - have to be added to incomplete NP/PP constituent in the L1-string span from 3 to 5, consisting of the Det/Det die/the and the N/N Agrarpolitik/policy. With two applications of rule (5), the two L1-NILs can be added. Note that the conditions are met, and that as a result, we will have a continuous NP/PP constituent with index &lt;[3, 5], [0, 0, 0, 0, 1, 1, 1, 1]&gt; , which can be used as a passive item Y1/Y2 in rule (4).</Paragraph>
    <Paragraph position="2">  index can be determined by bitwise disjunction of the bit vectors. This operation is linear in the length of the L2-string (which is of the same order as the length of the L1-string) and has a very small constant factor.14 Since parsing with a simple, non-lexicalized context-free grammar has a time complexity of O(n3) (due to the three free variables for string positions in the completion rule), we get O(n4) for synchronous parsing of sentence pairs without any L1-NILs. Note that words from L1 without a correspondent in L2 (which we would have to call L2-NILs) do not add to the complexity, so the language with more correspondent-less words can be selected as L1.</Paragraph>
    <Paragraph position="3"> For the average complexity of correspondence-guided parsing of sentence pairs without L1-NILs we note an advantage over monolingual parsing: certain hypotheses for complete constituents that would have to be considered when parsing only L1, are excluded because the secondary index reveals a discontinuity. An example from fig. 3 would be the sequence m&amp;quot;ussen deshalb, which is adjacent in L1, but doesn't go through as a continuous rectangle when L2 is taken into consideration (hence it cannot be used as a passive item in rule (4)).</Paragraph>
    <Paragraph position="4"> The complexity of correspondence-guided parsing is certainly increased by the presence of L1-NILs, since with them the secondary index can no longer be uniquely determined. However, with the adjacency condition ((i) in rule (5)), the number of possible variants in the secondary index is a function of the number of L1-NILs. Let us say there are m L1-NILs, i.e., the bit vectors contain m elements that we have to flip from 0 to 1 to obtain the final bit vector. In each application of rule (5) we pick a vector v, with a variable for the leftmost and rightmost L1-NIL element (since this is not fully determined by the primary index). By the adjacency condition, 14Note that the operation does not have to be repeated when the completion rule is applied on additional pairs of items with identical indices. This means that the extra time complexity factor of n doesn't go along with an additional factor of the grammar constant (which we are otherwise ignoring in the present considerations). In practical terms this means that changes in the size of the grammar are much more noticable than moving from monolingual parsing to alignment-guided parsing.</Paragraph>
    <Paragraph position="5"> An additional advantage is that in an Expectation Maximization approach to grammar induction (with a fixed word alignment), the bit vectors have to be computed only in the first iteration of parsing the training corpus, later iterations are cubic. either the leftmost or rightmost marks the boundary for adding the additional L1-NIL element NIL/Y2 hence we need only one new variable for the newly shifted boundary among the L1-NILs. So, in addition to the n4 expense of parsing non-nil words, we get an expense of m3 for parsing the L1-NILs, and we conclude that for unlexicalized synchronous parsing, guided by an initial word alignment the complexity class is O(n4m3) (where n is the total number of words appearing in L1, and m is the number of words appearing in L2, without a correspondent in L1). Recall that the complexity for standard synchronous parsing is O(n6).</Paragraph>
    <Paragraph position="6"> Since typically the number of correspondent-less words is significantly lower than the total number of words (at least for one of the two languages), these results are encouraging for medium-to-large-scale grammar learning experiments using a synchronous parsing algorithm.</Paragraph>
  </Section>
  <Section position="6" start_page="21" end_page="23" type="metho">
    <SectionTitle>
5 Empirical Evaluation
</SectionTitle>
    <Paragraph position="0"> In order to validate the theoretical complexity results empirically, we implemented the algorithm and ran it on sentence pairs from the Europarl parallel corpus. At the present stage, we are interested in quantitative results on parsing time, rather than qualitative results of parsing accuracy (for which a more extensive training of the rule parameters would be required).</Paragraph>
    <Paragraph position="1"> Implementation. We did a prototype implementation of the correspondence-guided parsing algorithm in SWI Prolog.15 Chart items are asserted to the knowledge base and efficiently retrieved using indexing by a hash function. Besides chart construction, the Viterbi algorithm for selecting the most probable analysis has been implemented, but for the current quantitative results only chart construction was relevant.</Paragraph>
    <Paragraph position="2"> Sample grammar extraction. The initial probablistic grammar for our experiments was extracted from a small &amp;quot;multitree bank&amp;quot; of 140 German/English sentence pairs (short examples from the Europarl corpus). The multitree bank was annotated using the MMAX2 tool16 and a specially  tailored annotation scheme for flat correspondence structures as described in sec. 2. A German and English part-of-speech tagger was used to determine word categories; they were mapped to a reduced category set and projected to the syntactic constituents. To obtain parameters for a probabilistic grammar, we used maximum likelihood estimation from the small corpus, based on a rather simplistic generative model,17 which for each local subtree decides (i) what categories will be the two heads, (ii) how many daughters there will be, and for each non-head sister (iii) whether it will be a nonterminal or a terminal (and in that case, what category pair), and (iv) in which position relative to the head to place it in both languages. In order to obtain a realistically-sized grammar, we applied smoothing to all parameters; so effectively, every sequence of terminals/nonterminals of arbitrary length was possible in parsing.</Paragraph>
    <Paragraph position="3"> Parsing sentences without NIL words  and without exploiting constraints from L2 Results. To validate empirically that the proposed correspondence-guided synchronous parsing approach (CGSP) can effectively exploit L2 as a guide, thereby reducing the search space of L1 parses that have to be considered, we first ran a comparison on sentences without L1-NILs. The results (average parsing time for Viterbi parsing with the sample grammar) are shown in fig. 5.18 The parser we call &amp;quot;monolingual&amp;quot; cannot exploit any 17For our learning experiments we intend to use a Maximum Entropy/log-linear model with more features.</Paragraph>
    <Paragraph position="4"> 18The experiments were run on a 1.4GHz Pentium M processor. null alignment-induced restrictions from L2.19 Note that CGSP takes clearly less time.</Paragraph>
    <Paragraph position="5"> Comparison wrt. # NIL words  Fig. 6 shows our comparative results for parsing performance on sentences that do contain L1-NILs.</Paragraph>
    <Paragraph position="6"> Here too, the theoretical results are corroborated that with a limited number of L1-NILs, the CGSP is still efficient.</Paragraph>
    <Paragraph position="7"> The average chart size (in terms of the number of entries) for sentences of length 8 (in L1) was 212 for CGSP (and 80 for &amp;quot;monolingual&amp;quot; parsing). The following comparison shows the effect of L1-NILs (note that the values for 4 and more L1-NILs are based on only one or two cases):  We also simulated a synchronous parser which does not take advantage of a given word alignment (by providing an alignment link between any pair of words, plus the option that any word could be a NULL word). For sentences of length 5, this parser took an average time of 22.3 seconds (largely independent of the presence/absence of L1-NILs).20 19The &amp;quot;monolingual&amp;quot; parser used in this comparison parses two identical copies of the same string synchronously, with a strictly linear alignment.</Paragraph>
    <Paragraph position="8"> 20While our simulation may be significantly slower than a direct implementation of the algorithm (especially when some of the optimizations discussed in (Melamed, 2003) are taken into account), the fact that it is orders of magnitude slower does in- null Finally, we also ran an experiment in which the continuity condition (condition (iii) in rule (4)) was deactivated, i.e., complete constituents were allowed to be discontinuous in one of the languages. The results in (7) underscore the importance of this condition - leaving it out leads to a tremendous increase in parsing time.</Paragraph>
  </Section>
class="xml-element"></Paper>