<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1508"> <Title>Vancouver, October 2005. ©2005 Association for Computational Linguistics Treebank Transfer</Title> <Section position="3" start_page="0" end_page="74" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Annotated corpora are valuable resources for Natural Language Processing (NLP), but they often require significant effort to create. Syntactically annotated corpora - treebanks, for short - currently exist for a small number of languages; for the vast majority of the world's languages, treebanks are unavailable and unlikely to be created any time soon.</Paragraph> <Paragraph position="1"> The situation is especially difficult for dialectal variants of many languages. A prominent example is Arabic: syntactically annotated corpora exist for the common written variety (Modern Standard Arabic or MSA), but the spoken regional dialects have a lower status in written communication and lack annotated resources. This lack of dialect treebanks hampers the development of syntax-based NLP tools, such as parsers, for Arabic dialects.</Paragraph> <Paragraph position="2"> On the bright side, very large annotated corpora exist for Modern Standard Arabic (Maamouri et al., 2003, 2004a,b). Furthermore, unannotated text corpora for the various Arabic dialects can be assembled from sources on the Internet.</Paragraph> <Paragraph position="3"> Finally, the syntactic differences between the Arabic dialects and Modern Standard Arabic are relatively minor (compared with the lexical, phonological, and morphological differences). 
The overall research question is then how to combine and exploit these resources and properties to facilitate, and perhaps even automate, the creation of syntactically annotated corpora for the Arabic dialects.</Paragraph> <Paragraph position="4"> We describe a general approach to this problem, which we call treebank transfer: the goal is to project an existing treebank in a source language onto a target language that lacks annotated resources. The approach we describe is not tied in any way to Arabic, though for the sake of concreteness one may equate the source language with Modern Standard Arabic and the target language with a dialect such as Egyptian Colloquial Arabic.</Paragraph> <Paragraph position="5"> We link the two kinds of resources that are available - a treebank for the source language and an unannotated text corpus for the target language - in a generative probability model. Specifically, we construct a joint distribution over source-language trees, target-language trees, and parameters, and draw inferences by iterative simulation. This allows us to impute target-language trees, which can then be used to train target-language parsers and other NLP components.</Paragraph> <Paragraph position="6"> Our approach does not require aligned data, unlike related proposals for transferring annotations from one language to another. For example, Yarowsky and Ngai (2001) consider the transfer of word-level annotation (part-of-speech labels and bracketed NPs). Their approach is based on aligned corpora and only transfers annotation, as opposed to generating the raw data plus annotation as in our approach. We describe the underlying probability model of our approach in Section 2 and discuss issues pertaining to simulation and inference in Section 3.</Paragraph> <Paragraph position="7"> Sampling from the posterior distribution of target-language trees is one of the key problems in iterative simulation for this model. 
We present a novel sampling algorithm in Section 4. Finally, in Section 5 we summarize our approach in its full generality.</Paragraph> </Section></Paper>