<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0803">
  <Title>Parsing Word-Aligned Parallel Corpora in a Grammar Induction Context</Title>
  <Section position="2" start_page="0" end_page="17" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The technical results presented in this paper¹ are motivated by the following considerations: It is conceivable to use sentence pairs from a parallel corpus (along with the tentative word correspondences from a statistical word alignment) as training data for a grammar induction approach. The goal is to induce monolingual grammars for the languages under consideration; but the implicit information about syntactic structure gathered from typical patterns in the alignment goes beyond what can be obtained from unlabeled monolingual data. Consider for instance the sentence pair from the Europarl corpus (Koehn, 2002) in fig. 1 (shown with a hand-labeled word alignment): distributional patterns over this and similar sentences may show that in English, the subject (the word block "the situation") is in a fixed structural position, whereas in German, it can appear in various positions; similarly, the finite verb in German (here: stellt) systematically appears in second position in main clauses. In a way, the translation of sentences into other natural languages serves as an approximation of a (much more costly) manual structural or semantic annotation; one might speak of automatic indirect supervision in learning. The technique will be most useful for low-resource languages and languages for which there is no funding for treebanking activities. The only requirement will be that a parallel corpus exist for the language under consideration and one or more other languages.²</Paragraph>
    <Paragraph position="1"> Induction of grammars from parallel corpora is rarely viewed as a promising task in its own right; in work that has addressed the issue directly (Wu, 1997; Melamed, 2003; Melamed, 2004), the synchronous grammar is mainly viewed as instrumental in the process of improving the translation model in a noisy channel approach to statistical MT.³ In the present paper, we provide an important prerequisite for parallel corpus-based grammar induction work: an efficient algorithm for synchronous parsing of sentence pairs, given a word alignment. This work represents a second pilot study (after Kuhn, 2004) for the longer-term PTOLEMAIOS project at Saarland University⁴ with the goal of learning linguistic grammars from parallel corpora (compare Kuhn, 2005). The grammars should be robust and assign a predicate-argument-modifier (or dependency) structure to sentences, such that they can be applied in the context of multilingual information extraction or question answering.</Paragraph>
    <Paragraph position="2"> ¹ This work was supported in part by the German Research Foundation DFG in the context of the author's Emmy Noether research group at Saarland University. ² In the present paper we use examples from English/German for illustration, but the approach is of course independent of the language pair under consideration. ³ Of course, there is related work (e.g., Hwa et al., 2002; Lü et al., 2002) using aligned parallel corpora in order to "project" bracketings or dependency structures from English to another language and exploit them for training a parser for the other language. But note the conceptual difference: the "parse projection" approach starts from a given monolingual parser, with a particular style of analysis, whereas our project will explore to what extent it may help to design the grammar topology specifically for the parallel corpus case. This means that the emerging English parser may be different from all existing ones.</Paragraph>
    <Paragraph position="3"> Figure 1: German: Heute stellt sich die Lage jedoch völlig anders dar. English: The situation now however is radically different.</Paragraph>
  </Section>
</Paper>