Parsing Arabic Dialects

3 Linguistic Resources

We use the MSA treebanks 1, 2 and 3 (ATB) from the LDC (Maamouri et al., 2004). We split the corpus into 10% development data, 80% training data and 10% test data, all respecting document boundaries. The training data (ATB-Train) comprises 17,617 sentences and 588,244 tokens.

The Levantine treebank LATB (Maamouri et al., 2006) comprises 33,000 words of treebanked conversational telephone transcripts collected as part of the LDC CALLHOME project. The treebanked section is primarily in the Jordanian subdialect of LA. The data is annotated by the LDC for speech effects such as disfluencies and repairs. We removed the speech effects, rendering the data more text-like. The orthography and syntactic analysis chosen by the LDC for LA closely follow previous choices for MSA; see Figure 1 for two examples. The LATB is used exclusively for development and testing, not for training. We split the data in half, respecting document boundaries. The resulting development data (DEV) comprises 1,928 sentences and 11,151 tokens; the test data (TEST) comprises 2,051 sentences and 10,644 tokens. For all the experiments, we use the non-vocalized (undiacritized) version of both treebanks, as well as the collapsed POS tag set provided by the LDC for MSA and LA.

Two lexicons were created: a small lexicon comprising 321 LA/MSA word form pairs covering LA closed-class words and a few frequent open-class words; and a big lexicon which contains the small lexicon plus an additional 1,560 LA/MSA word form pairs. We assign the mappings in the two lexicons both uniform probabilities and biased probabilities estimated using Expectation Maximization (EM; see (Rambow et al., 2005) for details of the use of EM). We thus have four different lexicons: the small lexicon with uniform probabilities (SLXUN); the small lexicon with EM-based probabilities (SLXEM); the big lexicon with uniform probabilities (BLXUN); and the big lexicon with EM-based probabilities (BLXEM).
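As a concrete illustration of the uniform-probability lexicons, the sketch below builds an LA-to-MSA translation table in Python. This is not the authors' code, and the entries shown are hypothetical examples in Buckwalter transliteration; only the uniform-assignment scheme itself is taken from the description above.

    # Minimal sketch of building a uniform-probability LA->MSA lexicon
    # (as in SLXUN/BLXUN); the real lexicons contain 321 and 1,881 pairs.
    from collections import defaultdict

    def build_uniform_lexicon(pairs):
        """pairs: iterable of (LA word form, MSA word form) tuples."""
        candidates = defaultdict(set)
        for la, msa in pairs:
            candidates[la].add(msa)
        return {la: {msa: 1.0 / len(msas) for msa in msas}
                for la, msas in candidates.items()}

    pairs = [("Al$gl", "AlEml"),   # 'the work' (cf. the Section 4 example)
             ("$", "lA"),          # LA negation marker -> MSA particle lA
             ("$", "lys")]         # hypothetical second candidate
    print(build_uniform_lexicon(pairs))
    # {'Al$gl': {'AlEml': 1.0}, '$': {'lA': 0.5, 'lys': 0.5}}

The EM-based variants (SLXEM, BLXEM) keep the same candidate sets but replace the uniform values with probabilities estimated as in (Rambow et al., 2005).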
4 Linguistic Facts

We illustrate the differences between LA and MSA with a pair of example sentences meaning 'the men do not like this work'; their phrase structure trees are shown in Figure 1. Lexically, we observe that the word for 'work' is Al$gl in LA but AlEml in MSA. In contrast, the word for 'men' is the same in both LA and MSA: AlrjAl. There are typically also differences in function words, in our example $ (LA) and lA (MSA) for 'not'. Morphologically, we see that LA byHbw has the same stem as MSA yHb, but with two additional morphemes: the present aspect marker b-, which does not exist in MSA, and the agreement marker -w, which is used in MSA only in subject-initial sentences, while in LA it is always used.

Syntactically, we observe three differences. First, the subject precedes the verb in LA (SVO order), but follows it in MSA (VSO order). This is in fact not a strict requirement, but a strong preference: both varieties allow both orders. Second, we see that the demonstrative determiner follows the noun in LA, but precedes it in MSA. Finally, we see that the negation marker follows the verb in LA, while it precedes the verb in MSA. (Negation in LA can also be expressed with a particle $ suffixed onto the verb, as well as with the circumfix m- -$; see Section 6.1.) The two phrase structure trees are shown in Figure 1 in the LDC convention. Unlike the phrase structure trees, the (unordered) dependency trees for the MSA and LA sentences (not shown here for space considerations) are isomorphic. They differ only in the node labels.

5 Sentence Transduction

In this approach, we parse an MSA translation of the LA sentence and then link the LA sentence to the MSA parse. Machine translation (MT) is not easy, especially when there are no MT resources available such as naturally occurring parallel text or transfer lexicons. However, for this task we have three encouraging insights. First, for closely related languages it is possible to obtain better translation quality by means of simpler methods (Hajic et al., 2000). Second, suboptimal MSA output can still be helpful for the parsing task without necessarily being fluent or accurate (since our goal is parsing LA, not translating it to MSA). And finally, translation from LA to MSA is easier than from MSA to LA. This is a result of the availability of abundant resources for MSA as compared to LA: for example, text corpora and treebanks for language modeling, and a morphological generation system (Habash, 2004).

One disadvantage of this approach is the lack of structural information on the LA side for translation from LA to MSA, which means that we are limited in the techniques we can use. Another disadvantage is that the translation can add more ambiguity to the parsing problem: some unambiguous dialect words become syntactically ambiguous in MSA. For example, the LA words mn 'from' and myn 'who' are both translated into the orthographically ambiguous MSA form mn, 'from' or 'who'.

5.1 Implementation

Each word in the LA sentence is translated into a bag of MSA words, producing a sausage lattice. The lattice is scored and decoded using the SRILM toolkit with a trigram language model trained on 54 million MSA words from the Arabic Gigaword (Graff, 2003). The text used for language modeling was tokenized to match the tokenization of the Arabic used in the ATB and LATB; the tokenization was done using the ASVM Toolkit (Diab et al., 2004). The 1-best path in the lattice is passed on to the Bikel parser (Bikel, 2002), which was trained on the MSA training data (ATB-Train). Finally, the terminal nodes in the resulting parse structure are replaced with the original LA words.
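To make the decoding step concrete, the toy sketch below finds the 1-best path through a sausage lattice. This is not the actual pipeline (which uses SRILM with a trigram model); a constant bigram scoring function and made-up lexicon probabilities stand in here, purely to illustrate the search.

    import math

    def viterbi_sausage(lattice, bigram_logprob):
        """1-best path through a sausage lattice.
        lattice: one dict per LA token, mapping each candidate MSA
        word to its lexicon probability (e.g. from SLXUN).
        bigram_logprob(prev, word): log P(word | prev) from the LM."""
        best = {"<s>": (0.0, [])}          # state -> (log score, best path)
        for slot in lattice:
            new_best = {}
            for word, p_lex in slot.items():
                score, path = max(
                    (sc + bigram_logprob(prev, word) + math.log(p_lex), pth)
                    for prev, (sc, pth) in best.items())
                new_best[word] = (score, path + [word])
            best = new_best
        return max(best.values())[1]

    # Toy usage with a constant bigram model and hypothetical candidates:
    lattice = [{"AlrjAl": 1.0}, {"lA": 0.5, "lys": 0.5}]
    print(viterbi_sausage(lattice, lambda prev, w: math.log(0.1)))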
5.2 Experimental Results

Table 1 describes the results of the sentence transduction path on the development corpus (DEV) in different settings: using no POS tags in the input versus using gold POS tags in the input, and using SLXUN versus BLXUN. The baseline results are obtained by parsing the LA sentence directly with the MSA parser (with and without gold POS tags). The results are reported in terms of PARSEVAL's labeled precision, labeled recall and F-measure (F1).

Using SLXUN improves the F1 score for no tags and for gold tags. A further improvement is gained when using the BLXUN lexicon with no POS tags in the input, but this improvement disappears when we use BLXUN with gold POS tags. We suspect that the added translation ambiguity from BLXUN is responsible for the drop. We also experimented with the SLXEM and BLXEM lexicons; there was no consistent improvement.

In Table 2, we report the F-measure score on the test set (TEST) for the baseline and for SLXUN (with and without gold POS tags). We see a general drop in performance between DEV and TEST for all combinations, suggesting that TEST is a harder set to parse than DEV.

5.3 Discussion

The current implementation does not handle cases where the word order changes between MSA and LA. Since we start from an LA string, identifying constituents to permute is clearly a hard task. We experimented with identifying strings with the postverbal LA negative particle $ and then permuting them to obtain the MSA preverbal order. The original word positions are "bread-crumbed" through the system's language modeling and parsing steps and then used to construct an unordered dependency parse tree labeled with the input LA words. (A constituency representation is meaningless since word order changes from LA to MSA.) The results were not encouraging, since the effect of the positive changes was undermined by newly introduced errors.

6 Treebank Transduction

In this approach, the idea is to convert the MSA treebank (ATB-Train) into an LA-like treebank using linguistic knowledge of the systematic variations on the syntactic, lexical and morphological levels across the two varieties of Arabic. We then train a statistical parser on the newly transduced treebank and test the parsing performance against the gold test set of the LA treebank sentences.

6.1 MSA Transformations

We now list the transformations we applied to ATB-Train.

Consistency checks (CON): These are conversions that make the ATB annotation more consistent. For example, there are many cases where SBAR and S nodes are used interchangeably in the MSA treebank. Therefore, an S clause headed by a complementizer is converted to an SBAR.

Sentence Splitting (TOPS): A fair number of sentences in the ATB have a root node S with several embedded direct-descendant S nodes, sometimes conjoined using the conjunction w. We split such sentences into several shorter sentences.

There are several possible systematic syntactic transformations. We focus on three major ones due to their significant distributional variation between MSA and LA. They are illustrated in Figure 1.

Negation (NEG): In MSA, negation is marked with preverbal negative particles.
In LA, a negative construction is expressed in one of three possible ways: m$/mA preceding the verb; a particle $ suffixed onto the verb; or a circumfix of a prefix mA and a suffix $. We converted all negation instances in ATB-Train in all three ways, reflecting the LA constructions for negation.

VSO-SVO Ordering (SVO): Both Verb-Subject-Object (VSO) and Subject-Verb-Object (SVO) constructions occur in the MSA and LA treebanks. But pure VSO constructions - where there is no pro-drop - occur in only 10% of the LA corpus, while VSO is the most frequent ordering in MSA. Hence, the goal is to skew the distribution of SVO constructions in the MSA data. Therefore, VSO constructions are both replicated and converted to SVO constructions.

Demonstrative Switching (DEM): In LA, demonstrative pronouns precede or, more commonly, follow the nouns they modify, while in MSA demonstrative pronouns only precede the nouns they modify. Accordingly, we replicate the LA constructions in ATB-Train by moving the demonstrative pronouns to follow their modified nouns, while simultaneously retaining the source MSA ordering.

We use the four lexicons described in Section 3. These resources were created with a coverage bias from LA to MSA. As an approximation, we reversed the directionality to yield MSA-to-LA lexicons, retaining the assigned probability scores. Manipulations involving lexical substitution are applied only to the lexical items, without altering the POS tag or syntactic structure.

We also applied some morphological rules to handle specific LA constructions; both the POS tier and the lexical items are affected by these manipulations. (Two of them, ASP and LYS, are illustrated in the sketch following this list.)

bd Construction (BD): bd is an LA noun that means 'want'. It acts like a verb in verbal constructions, yielding VP constructions headed by NN. It is typically followed by a possessive pronoun. Accordingly, we translated all MSA verbs meaning want/need into the noun bd and changed their POS tag to the nominal tag NN. In cases where the subject of the MSA verb is pro-dropped, we add a clitic possessive pronoun in the first or second person singular. This was intended to bridge the genre and domain disparity between the MSA and LA data.

Aspectual Marker b (ASP): In dialectal Arabic, present tense verbs are marked with an initial b. Therefore we add a b prefix to all verbs with the POS tag VBP. The aspectual marker is present on the verb byHbw in the LA example in Figure 1.

lys Construction (LYS): In the MSA data, lys is interchangeably marked as a verb and as a particle. However, in the LA data, lys occurs only as a particle. Therefore, we convert all occurrences of lys to the POS tag RP.
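The sketch below illustrates the ASP and LYS rules on a toy bracketed tree using NLTK. It is not the authors' implementation, and the example tree is a hypothetical LA-like fragment in Buckwalter transliteration; only the two rules themselves come from the description above.

    # ASP: prefix b to present-tense (VBP) verbs; LYS: retag lys as RP.
    from nltk import Tree

    def asp_lys(tree):
        for pos in tree.treepositions('leaves'):
            preterminal = tree[pos[:-1]]
            if tree[pos] == 'lys':
                preterminal.set_label('RP')    # lys is only a particle in LA
            elif preterminal.label() == 'VBP':
                tree[pos] = 'b' + tree[pos]    # aspectual marker b-
        return tree

    t = Tree.fromstring('(S (NP (NN AlrjAl)) (VP (VBP yHbw) (NP (NN AlEml))))')
    print(asp_lys(t))   # the VBP leaf yHbw becomes byHbw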
6.2 Experimental Results

We transform ATB-Train into an LA-like treebank using different strategies, and then train the Bikel parser on the resulting LA-like treebank. We parse the LA test set with the Bikel parser trained in this manner. As before, we report results on the DEV and TEST sets, without POS tags and with gold POS tags, using the Parseval metrics of labeled precision, labeled recall and F-measure (a simplified scoring sketch appears at the end of this section). Table 3 summarizes the results on the LA development set.

In Table 3, STRUCT refers to the structural transformations, combining TOPS with CON. Of the syntactic transformations applied, NEG is the only one that helps performance. Both SVO and DEM decrease the performance from the baseline, with F-measures of 59.4 and 59.5, respectively. Of the lexical substitutions (i.e., lexicons), SLXEM helps performance the most. MORPH refers to a combination of all the morphological transformations. MORPH does not help performance: we see a decrease from the baseline of 0.3% when it is applied on its own, and a consistent decrease when combining MORPH with other conditions. For instance, STRUCT+NEG+SLXEM+MORPH yields an F-measure of 62.9, compared to 63.3 for STRUCT+NEG+SLXEM. The best results are obtained by combining STRUCT with NEG and SLXEM, for both the No Tag and Gold Tag conditions.

Table 4 shows the results obtained on TEST. As in the sentence transduction case, we see an overall reduction in performance, indicating that the test data is very different from the training data.

6.3 Discussion

The best performing condition always includes CON, TOPS and NEG. SLXEM helps as well; however, due to the inherent directionality of the resource, its impact is limited. We experimented with the other lexicons, but none of them improved performance. We believe that the EM probabilities helped by biasing the lexical choices, playing the role of an LA language model (which we do not have). We do not observe any significant improvement from applying MORPH.
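For reference, the Parseval scores reported throughout can be computed from labeled bracket counts as in the simplified sketch below (an evalb-style calculation; details of the real evaluation, such as punctuation handling, are omitted).

    from collections import Counter

    def parseval(gold, test):
        """gold, test: lists of labeled spans (label, start, end)
        collected from the gold and hypothesis parses."""
        g, t = Counter(gold), Counter(test)
        matched = sum((g & t).values())      # multiset intersection
        precision = matched / sum(t.values())
        recall = matched / sum(g.values())
        return precision, recall, 2 * precision * recall / (precision + recall)

    # One of the two hypothesis brackets matches the gold parse:
    print(parseval([('NP', 0, 2), ('VP', 2, 5)], [('NP', 0, 2), ('VP', 2, 4)]))
    # (0.5, 0.5, 0.5)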
7 Grammar Transduction

The grammar-transduction approach uses the machinery of synchronous grammars to relate MSA and LA. A synchronous grammar composes paired elementary trees, or fragments of phrase-structure trees, to generate pairs of phrase-structure trees. In the present application, we start with MSA elementary trees (plus probabilities) induced from the ATB and transform them using handwritten rules into dialect elementary trees to yield an MSA-dialect synchronous grammar. This synchronous grammar can be used to parse new dialect sentences using statistics gathered from the MSA data. Thus this approach can be thought of as a variant of the treebank-transduction approach in which the syntactic transformations are localized to elementary trees. Moreover, because a parsed MSA translation is produced as a byproduct, we can also think of this approach as being related to the sentence-transduction approach.

7.1 Preliminaries

The parsing model used is essentially that of Chiang (2000), which is based on a highly restricted version of tree-adjoining grammar. In its present form, the formalism is tree-substitution grammar (Schabes, 1990) with an additional operation called sister-adjunction (Rambow et al., 2001). Because of space constraints, we omit discussion of the sister-adjunction operation in this paper.

A tree-substitution grammar is a set of elementary trees. A frontier node labeled with a nonterminal label is called a substitution site. If an elementary tree has exactly one terminal symbol, that symbol is called its lexical anchor.

A derivation starts with an elementary tree and proceeds by a series of composition operations. In the substitution operation, a substitution site is rewritten with an elementary tree with a matching root label. The final product is a tree with no more substitution sites.

A synchronous TSG is a set of pairs of elementary trees. In each pair, there is a one-to-one correspondence between the substitution sites of the two trees, which we represent using boxed indices (Figure 2). The substitution operation then rewrites a pair of coindexed substitution sites with an elementary tree pair. A stochastic synchronous TSG adds probabilities to the substitution operation: the probability of substituting an elementary tree pair <a,a'> at a substitution site pair <e,e'> is P(a,a' | e,e').

When we parse a monolingual sentence S using one side of a stochastic synchronous TSG, using a straightforward generalization of the CKY and Viterbi algorithms, we obtain the highest-probability paired derivation, which includes a parse for S on one side and a parsed translation of S on the other side. It is also straightforward to calculate inside and outside probabilities for re-estimation by Expectation Maximization (EM).

7.2 An MSA-dialect synchronous grammar

We now describe how we build our MSA-dialect synchronous grammar. As mentioned above, the MSA side of the grammar is extracted from the ATB in a process described by Chiang and others (Chiang, 2000; Xia et al., 2000; Chen, 2001). This process also gives us MSA-only substitution probabilities P(a | e).

We then apply various transformation rules (described below) to the MSA elementary trees to produce a dialect grammar, at the same time assigning probabilities P(a' | a). The synchronous substitution probabilities can then be estimated as

P(a,a' | e,e') = P(a | e) P(a' | a), with P(a' | a) = P(w',t' | w,t) P(ā' | ā, w',t', w,t),

where w and t are the lexical anchor of a and its POS tag, and ā is the equivalence class of a modulo lexical anchors and their POS tags. P(w',t' | w,t) is assigned as described in Section 3; P(ā' | ā, w',t', w,t) is initially assigned by hand. Because the full probability table for the latter would be quite large, we smooth it using a backoff model so that the number of parameters to be chosen is manageable. Finally, we re-estimate these parameters using EM.
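The toy sketch below shows how the two factors combine into a synchronous substitution probability. All table entries are made-up numbers and the tree identifiers are hypothetical; only the factorization itself is the one given above.

    # P(a,a' | e,e') = P(a | e) * P(w',t' | w,t) * P(abar' | abar, w',t', w,t)
    p_msa  = {('a1', 'e1'): 0.01}                   # P(a | e), from the ATB
    p_lex  = {(('m$', 'NEG'), ('lA', 'NEG')): 0.5}  # P(w',t' | w,t), Section 3
    p_tree = {('abar2', ('abar1', 'm$', 'NEG', 'lA', 'NEG')): 0.8}  # hand-set

    def sync_subst_prob(a, abar, w, t, abar2, w2, t2, e):
        return (p_msa[(a, e)]
                * p_lex[((w2, t2), (w, t))]
                * p_tree[(abar2, (abar, w2, t2, w, t))])

    print(sync_subst_prob('a1', 'abar1', 'lA', 'NEG', 'abar2', 'm$', 'NEG', 'e1'))
    # ~0.004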
Because of the underlying syntactic similarity between the two varieties of Arabic, we assume that every tree in the MSA grammar extracted from the MSA treebank is also an LA tree. In addition, we perform certain tree transformations on all elementary trees which match the relevant patterns: NEG and SVO, and BD (see Section 6.1). NEG is modified so that we simply insert a $ negation marker postverbally, as the preverbal markers are handled by MSA trees.

7.3 Experimental Results

We first use DEV to determine which of the transformations are useful. The results are shown in Table 5. The baseline is the same as in the previous two approaches. We see that important improvements are obtained using the lexicon SLXUN. Adding the SVO transformation does not improve the results, but the NEG and BD transformations help slightly, and their effect is (partly) cumulative. (We did not perform these tuning experiments on input with no POS tags.) We also experimented with the SLXEM and BLXEM lexicons; there was no consistent improvement.

7.4 Discussion

We observe that the lexicon can be used effectively in our synchronous grammar framework. In addition, some syntactic transformations are useful. The SVO transformation, we assume, turned out not to be useful because the SVO word order is also possible in MSA, so that the new trees were not needed and needlessly introduced new derivations. The BD transformation shows the importance not of general syntactic transformations, but rather of lexically specific syntactic transformations: varieties within one language family may differ more in terms of the lexico-syntactic constructions used for a specific (semantic or pragmatic) purpose than in their basic syntactic inventory. Note that our tree-based synchronous formalism is ideally suited for expressing such transformations, since it is lexicalized and has an extended domain of locality.