<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1060"> <Title>Experiments in Parallel-Text Based Grammar Induction</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Cross-language order divergences </SectionTitle> <Paragraph position="0"> The English-French example in figure 1 gives a simple illustration of the partial information about constituency that a word-aligned parallel corpus may provide. The en bloc reversal of subsequences of words provides strong evidence that, for instance, [ moment the voting ] or [ aura lieu a ce ] do not form constituents.</Paragraph> <Paragraph position="1"> At first sight it appears as if there is also clear evidence for [ at that moment ] forming a constituent, since it fully covers a substring that appears in a different position in French. Similarly for [ Le vote aura lieu ]. However, from the distribution of contiguous substrings alone we cannot distinguish between the two types of situations sketched in (1) and (2):</Paragraph> <Paragraph position="3"> A string that is contiguous under projection, like w1 w2 in (1) and (2), may be a constituent, as in (1),</Paragraph> <Paragraph position="5"> but it may also be a non-constituent part of a larger constituent, as w1 w2 in (2).</Paragraph> <Paragraph position="6"> Word blocks. Let us define the notion of a word block (as opposed to a phrase or constituent) induced by a word alignment to capture the relevant property of contiguousness under translation.2 The alignments induced by GIZA++ (following the IBM models) are asymmetrical in that several words from L2 may be aligned with one word in L1, but not vice versa. So we can view a word alignment as a function f that maps each word in an L1-sentence to a (possibly empty) subset of words from its translation in L2. For example, in figure 1, f(voting) = {vote}, and f(that) = {ce, -la}. 
Note that f(wi) ∩ f(wj) = ∅ for wi ≠ wj. The f-images of a sentence need not exhaust the words of the translation in L2; however, it is common to assume a special empty word NULL in each L1-sentence, for which by definition f(NULL) is the set of L2-words not contained in any f-image of the overt words.</Paragraph> <Paragraph position="7"> We now define an f-induced block (or f-block for short) as a substring wi ... wj of a sentence in L1, such that the union over all f-images,</Paragraph> <Paragraph position="9"> f(wi) ∪ ... ∪ f(wj), forms a contiguous substring in L2, modulo the words from f(NULL).</Paragraph> <Paragraph position="10"> For example, w1 w2 w3 in (1) (or (2)) is not an f-block, since the union over its f-images</Paragraph> <Paragraph position="12"> does not form a contiguous string in L2. The sequences w3 w4 or w3 w4 w5 are f-induced blocks.</Paragraph> <Paragraph position="13"> Let us define a maximal f-block as an f-block</Paragraph> <Paragraph position="15"> wi ... wj such that extending it with the neighboring word wi-1 or wj+1 would lead to a non-block (or wi-1 or wj+1 do not exist, as we are at the beginning or end of the string). 2The notion of a word block is directly related to the concept of a &quot;phrase&quot; in recent work in Statistical Machine Translation. (Koehn et al., 2003) show that exploiting all contiguous word blocks in phrase-based alignment is better than focusing on syntactic constituents only. In our context, we are interested in inducing syntactic constituents based on alignment information; given the observations from Statistical MT, it does not come as a surprise that there is no direct link from blocks to constituents. 
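The f-block definition can be restated operationally. The following is a small sketch (the function name, data layout, and toy alignment are ours, not the paper's): an alignment is a map from L1 word positions to sets of L2 positions, and a span is an f-block iff the union of its images is contiguous modulo f(NULL).

```python
def is_f_block(f, span, null_image=frozenset()):
    """Test whether the consecutive L1 positions in `span` form an f-block.

    f: dict mapping each L1 position to a set of L2 positions (its f-image;
       the empty set for zero-fertility words).
    null_image: L2 positions belonging to f(NULL); gaps covered by these
       positions are tolerated ("modulo the words from f(NULL)").
    """
    image = set()
    for i in span:
        image |= f[i]
    if not image:
        return False  # nothing is projected at all
    # contiguity check: every position between min and max of the image
    # must itself be in the image, except for f(NULL) positions
    gaps = set(range(min(image), max(image) + 1)) - image
    return gaps.issubset(null_image)

# Toy alignment in the spirit of (1)/(2): L1 words w1..w5 are positions 0..4
f = {0: {3}, 1: {4}, 2: {0}, 3: {1}, 4: {2}}
print(is_f_block(f, [2, 3, 4]))  # True: w3 w4 w5 projects to {0, 1, 2}
print(is_f_block(f, [0, 1, 2]))  # False: w1 w2 w3 projects to {0, 3, 4}
```

Under this toy alignment, w1 w2 and w3 w4 w5 come out as f-blocks while w1 w2 w3 does not, mirroring the discussion of examples (1) and (2).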
Our work can be seen as an attempt to zero in on the distinction between the concepts; we find that it is most useful to keep track of the boundaries between blocks.</Paragraph> <Paragraph position="16"> (Wu, 1997) also includes a brief discussion of crossing constraints that can be derived from phrase structure correspondences. In other words, a maximal f-block is one such that no word can be added to the block.3 String w3 w4 in (1) is not a maximal f-block, because w3 w4 w5 is an f-block; but w3 w4 w5 is maximal, since w5 is the final word of the sentence and extending to the left with w2 would lead to a non-block.</Paragraph> <Paragraph position="18"> We can now make precise the initial observation that (1) and (2) have the same block structure, but that the constituent structures are different (and this is not due to an incorrect alignment): w1 w2 is a maximal block in both cases, but while it is a constituent in (1), it isn't in (2).</Paragraph> <Paragraph position="19"> We may call maximal blocks that contain only non-maximal blocks as substrings first-order maximal f-blocks. A maximal block that contains other maximal blocks as substrings is a higher-order maximal f-block. In (1) and (2), the complete string w1 w2 w3 w4 w5 is a higher-order maximal block. Note that a higher-order maximal block may contain substrings which are non-blocks.</Paragraph> <Paragraph position="20"> Higher-order maximal blocks may still be non-constituents, as the following simple English-French example shows: (3) He gave Mary a book / Il a donne un livre a Mary. The three first-order maximal blocks in English are [He gave], [Mary], and [a book]. [Mary a book] is a higher-order maximal block, since its &quot;projection&quot; to French is contiguous, but it is not a constituent. (Note that the VP constituent gave Mary a book, on the other hand, is not a maximal block here.) Block boundaries. 
Let us call the string position between two maximal blocks an f-block boundary.4 In (1)/(2), the position between w2 and w3 is a block boundary.</Paragraph> <Paragraph position="21"> We can now formulate the (4) Distituent hypothesis: If a substring of a sentence in language L1 crosses a first-order f-block boundary (zone5), then it can only be a constituent of L1 if it contains at least one of the two maximal f-blocks separated by that boundary in full.</Paragraph> <Paragraph position="22"> This hypothesis makes precise the conditions under which we assume to have reliable negative evidence against a constituent. Even examples of complicated structural divergence from the classical MT</Paragraph> <Paragraph position="24"> 3I.e., an element of the f-image of the added word would end up separated from the block's projection, at the other end of the string.</Paragraph> <Paragraph position="25"> literature tend not to pose counterexamples to the hypothesis, since it is so conservative. Projecting phrasal constituents from one language to another is problematic in cases of divergence, but projecting information about distituents is generally safe. Mild divergences are best. As should be clear, the f-block-based approach relies on the occurrence of reorderings of constituents in translation. If two languages have the exact same structure (and no paraphrases whatsoever are used in translation), the approach does not gain any information from a parallel text. However, this situation does not occur realistically. If, on the other hand, massive reordering occurs without preserving any contiguous subblocks, the approach cannot gain information either. The ideal situation is in the middle ground, with a number of mid-sized blocks in most sentences. The table in figure 2 shows the distribution of sentences with n f-block boundaries based on the alignment of English and 7 other languages, for a sample of c.</Paragraph> <Paragraph position="26"> 3,000 sentences from the Europarl corpus. 
We can see that the occurrence of boundaries is in a range that should indeed make the approach useful.6 (Figure 2: number of f-block boundaries for L1 = English.) Zero fertility words. So far we have not addressed the effect of finding zero fertility words, i.e., words wi from L1 with f(wi) = ∅.</Paragraph> <Paragraph position="28"> The word alignment makes frequent use of this mechanism. An actual example from our alignment is shown in figure 3. The English word has is treated as a zero fertility word. While we can tell from the block structure that there is a maximal block boundary somewhere between Baringdorf and the, it is unclear on which side has should be located.7 6The average sentence length for the English sentences is 26.5 words. (Not too surprisingly, Swedish gives rise to the fewest divergences against English. Note also that the Romance languages shown here behave very similarly.) (Figure 3: Mr. Graefe zu Baringdorf has the floor to explain this request. / La parole est a M. Graefe zu Baringdorf pour motiver la demande.) The definitions of the various types of word blocks cover zero fertility words in principle, but they are somewhat awkward in that the same word may belong to two maximal f-blocks, on its left and on its right. It is not clear where the exact block boundary is located. So we redefine the notion of f-block boundaries: we call the (possibly empty) substring between the rightmost non-zero-fertility word of one maximal f-block and the leftmost non-zero-fertility word of its right neighbor block the f-block boundary zone.</Paragraph> <Paragraph position="29"> The distituent hypothesis is made sensitive to crossing an entire boundary zone, i.e., if a constituent candidate ends somewhere in the middle of a non-empty boundary zone, this does not count as a crossing. 
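The notions of first-order maximal f-blocks and boundary zones can be sketched as follows (a simplified reconstruction under our own data layout: it greedily assumes that first-order maximal blocks tile the sentence from left to right, and the alignment below is our own guess at the figure 3 example, not taken from the paper):

```python
def maximal_f_blocks(f, n, null_image=frozenset()):
    """Greedy left-to-right sketch of first-order maximal f-blocks for an
    n-word L1 sentence; f maps L1 positions to sets of L2 positions."""
    def is_block(i, j):  # L1 positions i..j inclusive
        image = set()
        for k in range(i, j + 1):
            image |= f[k]
        if not image:
            return False
        gaps = set(range(min(image), max(image) + 1)) - image
        return gaps.issubset(null_image)

    blocks, i = [], 0
    while i != n:
        j = i
        while j + 1 in range(n) and is_block(i, j + 1):
            j += 1
        if is_block(i, j):
            blocks.append((i, j))
        i = j + 1
    return blocks

def boundary_zones(blocks, f):
    """f-block boundary zone between neighboring maximal blocks: the span
    from just after the rightmost non-zero-fertility word of the left block
    to just before the leftmost non-zero-fertility word of the right block.
    A zone (x, y) where x comes after y is empty."""
    zones = []
    for (a, b), (c, d) in zip(blocks, blocks[1:]):
        right = max(k for k in range(a, b + 1) if f[k])
        left = min(k for k in range(c, d + 1) if f[k])
        zones.append((right + 1, left - 1))
    return zones

# Hypothetical reconstruction of figure 3: "Mr. Graefe zu Baringdorf has the
# floor to explain this request" aligned to "La parole est a M. Graefe zu
# Baringdorf pour motiver la demande"; "has" (position 4) has fertility zero.
f = {0: {4}, 1: {5}, 2: {6}, 3: {7}, 4: set(), 5: {0}, 6: {1},
     7: {8}, 8: {9}, 9: {10}, 10: {11}}
blocks = maximal_f_blocks(f, 11)
print(blocks)                     # [(0, 4), (5, 6), (7, 10)]
print(boundary_zones(blocks, f))  # [(4, 4), (7, 6)]
```

The first zone, (4, 4), is non-empty and contains exactly the zero-fertility word has, matching the observation that the boundary lies somewhere between Baringdorf and the; the second zone, (7, 6), is empty.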
This reflects the intuition of uncertainty and keeps the exclusion of clear distituents intact.</Paragraph> <Paragraph position="30"> 3 EM grammar induction with weighting factors The distituent identification scheme introduced in the previous section can be used to hypothesize a fairly reliable exclusion of constituency for many string spans from a parallel corpus. Besides a statistical word alignment, no further resources are required.</Paragraph> <Paragraph position="31"> In order to make use of this scattered (non-)constituency information, a semi-supervised approach is needed that can fill in the (potentially large) areas for which no prior information is available. For the present experiments we decided to choose a conceptually simple such approach, with which we can build on substantial existing work in grammar induction: we construe the learning problem as PCFG induction, using the inside-outside algorithm, with the addition of weighting factors based on the (non-)constituency information. This use of weighting factors in EM learning follows the approach discussed in (Nigam et al., 2000).</Paragraph> <Paragraph position="32"> Since we are mainly interested in comparative experiments at this stage, the conceptual simplicity and the availability of efficiently implemented open-source 7Since zero-fertility words are often function words, there is probably a rightward tendency that one might be able to exploit; however, in the present study we did not want to build such high-level linguistic assumptions into the system.</Paragraph> <Paragraph position="33"> systems of a PCFG induction approach outweigh the disadvantage of potentially poorer overall performance than one might expect from some other approaches.</Paragraph> <Paragraph position="34"> The PCFG topology we use is a binary, entirely unrestricted X-bar-style grammar based on the Penn Treebank POS-tagset (expanded as in the TreeTagger by (Schmid, 1994)). 
All possible combinations of projections of POS-categories X and Y are included, following the schemata in (5). This gives rise to 13,110 rules.</Paragraph> <Paragraph position="36"> We tagged the English version of our training section of the Europarl corpus with the TreeTagger and used the strings of POS-tags as the training corpus for the inside-outside algorithm; however, it is straightforward to apply our approach to a language for which no taggers are available if an unsupervised word clustering technique is applied first.</Paragraph> <Paragraph position="37"> We based our EM training algorithm on Mark Johnson's implementation of the inside-outside algorithm.8 The initial parameters on the PCFG rules are set to be uniform. In the iterative induction process of parameter reestimation, the current rule parameters are used to compute the expectations of how often each rule occurred in the parses of the training corpus, and these expectations are used to adjust the rule parameters, so that the likelihood of the training data is increased. When the probability of a given rule drops below a certain threshold, the rule is excluded from the grammar. The iteration is continued until the increase in the likelihood of the training corpus becomes very small.</Paragraph> <Paragraph position="38"> Weight factors. The inside-outside algorithm is a dynamic programming algorithm that uses a chart in order to compute the rule expectations for each sentence. We use the information obtained from the parallel corpus as discussed in section 2 as prior information (in a Bayesian framework) to adjust the expectations that the inside-outside algorithm determines based on its current rule parameters. Note that this prior information concerns string spans of (non-)constituents; it does not tell us anything about the categories of the potential constituents affected. It is combined with the PCFG expectations as the chart is constructed. 
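Since the schemata in (5) are not reproduced in this copy, the following is only a hypothetical illustration of how an unrestricted binary X-bar rule space over a POS tagset can be enumerated; the category names and the schema set are our own, and with the paper's actual schemata and tagset the count would be the reported 13,110.

```python
def xbar_rules(pos_tags):
    """Enumerate a binary X-bar-style rule space: for every POS tag X we
    posit XP and X1 (for X-bar) and combine them with every YP, in the
    spirit of the schemata in (5). An illustrative guess, not the paper's
    actual grammar."""
    rules = []
    for x in pos_tags:
        rules.append((x + "P", (x + "1",)))              # XP to X1
        rules.append((x + "1", (x,)))                    # X1 to X
        for y in pos_tags:
            rules.append((x + "P", (y + "P", x + "1")))  # XP to YP X1
            rules.append((x + "P", (x + "1", y + "P")))  # XP to X1 YP
            rules.append((x + "1", (y + "P", x)))        # X1 to YP X
            rules.append((x + "1", (x, y + "P")))        # X1 to X YP
    return rules

# With a 3-tag toy set this yields 3 * 2 + 3 * 3 * 4 = 42 rules.
print(len(xbar_rules(["NN", "VB", "JJ"])))  # 42
```

The point of the enumeration is that the rule space grows quadratically in the size of the tagset, which is why an unrestricted X-bar grammar over a Penn-Treebank-sized tagset runs into five figures.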
For each span in the chart, we get a weight factor that is multiplied with the parameter-based expectations.9</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We applied GIZA++ (Al-Onaizan et al., 1999; Och and Ney, 2003) to word-align parts of the Europarl corpus (Koehn, 2002) for English and all other 10 languages. For the experiments we report in this paper, we only used the 1999 debates, with the language pairs of English combined with Finnish, French, German, Greek, Italian, Spanish, and Swedish.</Paragraph> <Paragraph position="1"> For computing the weight factors we used a two-step process implemented in Perl, which first determines the maximal f-block boundaries (by detecting discontinuities in the sequence of the f-projected words). Words with fertility greater than 1 whose f-correspondents were non-adjacent (modulo NULL projections) were treated like zero fertility words, i.e., we viewed them as unreliable indicators of block status (compare figure 4). (7) shows the internal representation of the block structure for (6) (compare figure 3). L and R are used for the beginning and end of blocks, when the adjacent boundary zone is empty; l and r are used next to non-empty boundary zones. 9In the simplest model, we use the factor 0 for spans satisfying the distituent condition underlying hypothesis (4), and factor 1 for all other spans; in other words, parses involving a distituent are cancelled out. We also experimented with various levels of weight factors: for instance, distituents were assigned factor 0.01, likely distituents factor 0.1, neutral spans factor 1, and likely constituents factor 2. 
Likely constituents are defined as spans for which one end is adjacent to an empty block boundary zone (i.e., there is no zero fertility word in the block boundary zone which could be the actual boundary of constituents in which the block is involved).</Paragraph> <Paragraph position="2"> Most variations in the weighting scheme did not have a significant effect, but they caused differences in coverage, because rules with a probability below a certain threshold were dropped in training. Below, we report the results of the 0.01-0.1-1-2 scheme, which had a reasonably high coverage on the test data. Words that have correspondents in the normal sequence are encoded as *, zero fertility words as -; A and B are used for the first block in a sentence instead of L and R, unless it arises from &quot;relocation&quot;, which increases the likelihood of constituent status (likewise for the last block: Y and Z). Since we are interested only in first-order blocks here, the compact string-based representation is sufficient. (6) la parole est a m. graefe zu baringdorf pour motiver la demande</Paragraph> <Paragraph position="4"> The second step for computing the weight factors creates a chart of all string spans over the given sentence and marks for each span whether it is a distituent, possible constituent or likely distituent, based on the location of boundary symbols. (For instance, zu Baringdorf has the is marked as a distituent; the floor and has the floor are marked as likely constituents.) The tests are implemented as simple regular expressions. The chart of weight factors is represented as an array which is stored in the training corpus file along with the sentences. We combine the weight factors from various languages, since each of them may contribute distinct (non-)constituent information. 
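The regular-expression tests of the second step might look as follows. This is a heavily simplified sketch: the marker alphabet follows the text, but the concrete patterns and the 0.01-0.1-1-2 factor assignment are our own reconstruction, not the paper's Perl code, and the A/B and Y/Z markers are ignored.

```python
import re

def weight_factor(span):
    """Weight factor for a chart span, given the corresponding substring of
    the block-structure encoding: L/R mark block edges at empty boundary
    zones, l/r edges at non-empty zones, * normally aligned words, and
    - zero-fertility words."""
    # span contains at least one block in full, with both of its edges
    full_block = re.search(r"[Ll][*\-]*[Rr]", span)
    # span fully crosses a boundary: a block ends and the next begins inside
    crossed = re.search(r"[Rr]-*[Ll]", span)
    # span merely ends inside a non-empty boundary zone
    half_crossed = re.search(r"[Rr]-+$|^-+[Ll]", span)
    if crossed and not full_block:
        return 0.01  # distituent
    if half_crossed and not full_block and not crossed:
        return 0.1   # likely distituent
    if re.fullmatch(r"[Ll][*\-]*[Rr]", span):
        return 2.0   # likely constituent: clean block edges on both sides
    return 1.0       # neutral
```

For example, under this sketch a span covering exactly one clean-edged block ("L**R") gets factor 2.0, a span crossing a whole boundary without containing a full block ("*R-L*") gets 0.01, and a span ending inside a non-empty boundary zone ("*R-") gets 0.1.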
The inside-outside algorithm reads in the weight factor array and uses it in the computation of expected rule counts.</Paragraph> <Paragraph position="5"> We used the probability of the statistical word alignment as a confidence measure to filter out unreliable training sentences. Due to the conservative nature of the information we extract from the alignment, however, the results indicate that filtering is not necessary.</Paragraph> </Section> </Paper>