File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/p04-1060_intro.xml
Size: 4,068 bytes
Last Modified: 2025-10-06 14:02:22
<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1060"> <Title>Experiments in Parallel-Text Based Grammar Induction</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> There have been a number of recent studies exploiting parallel corpora in bootstrapping of monolingual analysis tools. In the &quot;information projection&quot; approach (e.g., (Yarowsky and Ngai, 2001)), statistical word alignment is applied to a parallel corpus of English and some other language a0 for which no tagger/morphological analyzer/chunker etc. (henceforth simply: analysis tool) exists. A high-quality analysis tool is applied to the English text, and the statistical word alignment is used to project a (noisy) target annotation to the a0 version of the text.</Paragraph> <Paragraph position="1"> Robust learning techniques are then applied to bootstrap an analysis tool for a0 , using the annotations projected with high confidence as the initial training data. (Confidence of both the English analysis tool and the statistical word alignment is taken into account.) The results that have been achieved by this method are very encouraging.</Paragraph> <Paragraph position="2"> Will the information projection approach also work for less shallow analysis tools, in particular full syntactic parsers? An obvious issue is that one does not expect the phrase structure representation of English (as produced by state-of-the-art tree-bank parsers) to carry over to less configurational languages. Therefore, (Hwa et al., 2002) extract a more language-independent dependency structure from the English parse as the basis for projection to Chinese. From the resulting (noisy) dependency treebank, a dependency parser is trained using the techniques of (Collins, 1999). (Hwa et al., 2002) report that the noise in the projected treebank is still a major challenge, suggesting that a future research focus should be on the filtering of (parts of) unreliable trees and statistical word alignment models sensitive to the syntactic projection framework.</Paragraph> <Paragraph position="3"> Our hypothesis is that the quality of the resulting parser/grammar for language a0 can be significantly improved if the training method for the parser is changed to accomodate for training data which are in part unreliable. The experiments we report in this paper focus on a specific part of the problem: we replace standard treebank training with an Expectation-Maximization (EM) algorithm for PCFGs, augmented by weighting factors for the reliability of training data, following the approach of (Nigam et al., 2000), who apply it for EM training of a text classifier. The factors are only sensitive to the constituent/distituent (C/D) status of each span of the string in a0 (cp. (Klein and Manning, 2002)). The C/D status is derived from an aligned parallel corpus in a way discussed in section 2. We use the Europarl corpus (Koehn, 2002), and the statistical word alignment was performed with the GIZA++ toolkit (Al-Onaizan et al., 1999; Och and Ney, 2003).1 For the current experiments we assume no pre-existing parser for any of the languages, contrary to the information projection scenario. While better absolute results could be expected using one or more parsers for the languages involved, we think that it is important to isolate the usefulness of exploiting just crosslinguistic word order divergences in order to obtain partial prior knowledge about the constituent structure of a language, which is then exploited in an EM learning approach (section 3).</Paragraph> <Paragraph position="4"> Not using a parser for some languages also makes it possible to compare various language pairs at the same level, and specifically, we can experiment with grammar induction for English exploiting various other languages. Indeed the focus of our initial experiments has been on English (section 4), which facilitates evaluation against a treebank (section 5).</Paragraph> </Section> class="xml-element"></Paper>