File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/06/w06-1501_abstr.xml
Size: 4,891 bytes
Last Modified: 2025-10-06 13:45:21
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1501"> <Title>References</Title> <Section position="2" start_page="0" end_page="1" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper discusses a novel probabilistic synchronous TAG formalism, synchronous Tree Substitution Grammar with sister adjunction (TSG+SA). We use it to parse a language for which there is no training data, by leveraging off a second, related language for which there is abundant training data. The grammar for the resource-rich side is automatically extracted from a treebank; the grammar on the resource-poor side and the synchronization are created by handwritten rules. Our approach thus represents a combination of grammar-based and empirical natural language processing. We discuss the approach using the example of Levantine</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> Adjoining Grammar </SectionTitle> <Paragraph position="0"> The Arabic language is a collection of spoken dialects and a standard written language. The standard written language is the same throughout the Arab world, Modern Standard Arabic (MSA), which is also used in some scripted spoken communication (news casts, parliamentary debates).</Paragraph> <Paragraph position="1"> It is based on Classical Arabic and is not a native language of any Arabic speaking people, i.e., children do not learn it from their parents but in school. Thus most native speakers of Arabic are unable to produce sustained spontaneous MSA.</Paragraph> <Paragraph position="2"> The dialects show phonological, morphological, lexical, and syntactic differences comparable to [?]This work was primarily carried out while the rst author was at the University of Maryland Institute for Advanced Computer Studies.</Paragraph> <Paragraph position="3"> those among the Romance languages. They vary not only along a geographical continuum but also with other sociolinguistic variables such as the urban/rural/Bedouin dimension.</Paragraph> <Paragraph position="4"> The multidialectal situation has important negative consequences for Arabic natural language processing (NLP): since the spoken dialects are not of cially written and do not have standard orthography, it is very costly to obtain adequate corpora, even unannotated corpora, to use for training NLP tools such as parsers. Furthermore, there are almost no parallel corpora involving one dialect and MSA.</Paragraph> <Paragraph position="5"> The question thus arises how to create a statistical parser for an Arabic dialect, when statistical parsers are typically trained on large corpora of parse trees. We present one solution to this problem, based on the assumption that it is easier to manually create new resources that relate a dialect to MSA (lexicon and grammar) than it is to manually create syntactically annotated corpora in the dialect. In this paper, we deal with Levantine Arabic (LA). Our approach does not assume the existence of any annotated LA corpus (except for development and testing), nor of a parallel LA-MSA corpus.</Paragraph> <Paragraph position="6"> The approach described in this paper uses a special parameterization of stochastic synchronous TAG (Shieber, 1994) which we call a hidden TAG model. This model couples a model of MSA trees, learned from the Arabic Treebank, with a model of MSA-LA translation, which is initialized by hand and then trained in an unsupervised fashion. Parsing new LA sentences then entails simultaneously building a forest of MSA trees and the corresponding forest of LA trees. Our implementation uses an extension of our monolingual parser (Chiang, 2000) based on tree-substitution grammar with sister adjunction (TSG+SA).</Paragraph> <Paragraph position="7"> The main contributions of this paper are as follows: null 1. We introduce the novel concept of a hidden TAG model.</Paragraph> <Paragraph position="8"> 2. We use this model to combine statistical ap null proaches with grammar engineering (specifically motivated from the linguistic facts). Our approach thus exempli es the speci c strength of a grammar-based approach.</Paragraph> <Paragraph position="9"> 3. We present an implementation of stochastic synchronous TAG that incorporates various facilities useful for training on real-world data: sister-adjunction (needed for generating the at structures found in most treebanks), smoothing, and Inside-Outside reestimation.</Paragraph> <Paragraph position="10"> This paper is structured as follows. We rst brie y discuss related work (Section 2) and some of the linguistic facts that motivate this work (Section 3). We then present the formalism, probabilistic model, and parsing algorithm (Section 4). Finally, we discuss the manual grammar engineering (Section 5) and evaluation (Section 6).</Paragraph> </Section> </Section> class="xml-element"></Paper>