<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4015"> <Title>Morphological Analysis for Statistical Machine Translation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Word Segmentation </SectionTitle> <Paragraph position="0"> We presuppose segmentation of a word into prefix(es)-stem-suffix(es), as described in (Lee et al. 2003). The categories prefix and suffix encompass function words such as conjunction markers, prepositions, pronouns and determiners, and all inflectional morphemes of the language. If a word token contains more than one prefix and/or suffix, we posit multiple prefixes/suffixes per stem. A sample of word-segmented Arabic text is given below, where prefixes are marked with # and suffixes with +.</Paragraph> <Paragraph position="1"> w# s# y# Hl sA}q Al# tjArb fy jAgwAr Al# brAzyly lwsyAnw bwrty mkAn AyrfAyn fy Al# sbAq gdA Al# AHd Al*y s# y# kwn Awly xTw +At +h fy EAlm sbAq +At AlfwrmwlA</Paragraph> </Section> <Section position="4" start_page="0" end_page="3" type="metho"> <SectionTitle> 3. Morphological Analysis </SectionTitle> <Paragraph position="0"> Morphological analysis identifies functional morphemes to be merged into meaning-bearing stems or to be deleted. In Arabic, functional morphemes typically belong to prefixes or suffixes.</Paragraph> <Paragraph position="1"> Sample Arabic texts before and after morphological analysis are shown below. 
Mwskw 51-7 ( Af b ) - Elm An Al# qSf Al# mdfEy Al*y Ady Aly ASAb +p jndy +yn rwsy +yn Avn +yn b# jrwH Tfyf +p q*A}f Al# jmE +p fy mTAr xAn qlE +p ...</Paragraph> <Paragraph position="2"> Mwskw 51-7 ( Af b ) - Elm An Al# qSf Al# mdfEy Al*y Ady Aly ASAbp jndyyn rwsyyn Avnyn b# jrwH Tfyfp msA' Al# jmEp fy mTAr xAn qlEp ...</Paragraph> <Paragraph position="3"> In the morphologically analyzed Arabic (bottom), the feminine singular suffix +p and the masculine plural suffix +yn are merged into the preceding stems, analogous to the singular/plural noun distinction in English, e.g. girl vs. girls.</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 3.1 Method </SectionTitle> <Paragraph position="0"> We apply part-of-speech tagging to a symbol-tokenized and word-segmented Arabic and symbol-tokenized English parallel corpus. We then Viterbi-align the part-of-speech tagged parallel corpus, using translation parameters obtained via Model 1 training of word-segmented Arabic and symbol-tokenized English, to derive the conditional probability of an English part-of-speech tag given the combination of an Arabic prefix and its part-of-speech or an Arabic suffix and its part-of-speech. We have used an Arabic part-of-speech tagger with around 120 tags, and an English part-of-speech tagger with around 55 tags.</Paragraph> </Section> <Section position="2" start_page="1" end_page="3" type="sub_section"> <SectionTitle> 3.2 Algorithm </SectionTitle> <Paragraph position="0"> The algorithm utilizes two sets of translation probabilities to determine the merge/deletion analysis of a morpheme. We obtain tag-to-tag translation probabilities according to (1), which identifies the most probable part-of-speech is one of the major stem parts-of-speech with which the specified prefix or suffix co-occurs, i.e. 
ADV, ADJ, NOUN, NOUN_PROP, VERB_IMPERFECT, VERB_PERFECT.</Paragraph> <Paragraph position="2"> may be interpreted in a manner analogous to suffix</Paragraph> <Paragraph position="4"> in (2).</Paragraph> <Paragraph position="5"> The algorithm for word-based translation models, e.g. IBM Model 1, implements the idea that if a morpheme in one language is robustly translated into a distinct part-of-speech in the other language, the morpheme is very likely to have an independent counterpart in the other language. Therefore, a robust overlap of tag ) for a prefix is a positive indicator that the Arabic prefix/suffix has an independent counterpart in English. If the overlap is weak or does not exist, the prefix/suffix is unlikely to have an independent counterpart and is subject to merge/deletion analysis. We assume that only one tag is assigned to one morpheme or word, i.e. no combination tag of the form DET+NOUN, etc.</Paragraph> <Paragraph position="6"> probability into NULL tag is not the highest, merge the prefix occurring in the appropriate stem tag contexts in the training corpus (for translation model training) and a new input text (for decoding).</Paragraph> <Paragraph position="7"> For phrase translation models (Och and Ney 2002), we induce additional merge/deletion analysis on the basis of base noun phrase parsing of Arabic. One major asymmetry between Arabic and English is caused by the more frequent use of the determiner Al# in Arabic compared with its counterpart 'the' in English. We apply Al#-deletion to Arabic noun phrases so that only the first occurrence of Al# in a noun phrase is retained. All instances of Al# occurring before a proper noun, as in Al# qds, whose literal translation is 'the Jerusalem', are also deleted. Unlike the automatic induction of morphological analysis described in 3.2.1, Al#-deletion analysis is manually induced.</Paragraph> </Section> </Section> <Section position="5" start_page="3" end_page="5" type="metho"> <SectionTitle> 4. 
Performance Evaluations </SectionTitle> <Paragraph position="0"> System performance is evaluated on the LDC-distributed Multiple Translation Arabic Part I corpus, consisting of 1,043 segments derived from AFP and Xinhua newswires. Translation quality is measured by uncased BLEU (Papineni et al.</Paragraph> <Paragraph position="1"> 2002) with 4 reference translations, sysids: ahb, ahc, ahd, ahe.</Paragraph> <Paragraph position="2"> Systems are developed from 4 different sizes of training corpora, 3.5K, 35K, 350K and 3.3M sentence pairs, as in Table 1. The number in each cell indicates the number of sentence pairs in each genre (newswires, ummah, UN corpus).</Paragraph> <Paragraph position="3"> We have used the same language model for all evaluations.</Paragraph> <Paragraph position="4"> corpus size baseline morph analysis Baseline performances are obtained by Model 1 training and decoding without any segmentation or morphological analysis on Arabic. BLEU scores under 'morph analysis' are obtained by Model 1 training on the morphologically analyzed Arabic and symbol-tokenized English parallel corpus and Model 1 decoding on the morphologically analyzed Arabic input text.</Paragraph> <Section position="1" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 4.2 Phrase Translation Model </SectionTitle> <Paragraph position="0"> The impact of Arabic morphological analysis on a phrase translation model with monotone decoding (Tillmann 2003) is shown in Table 3.</Paragraph> <Paragraph position="1"> corpus size baseline morph analysis BLEU scores under baseline and morph analysis are obtained in a manner analogous to Model 1, except that the morphological analysis for the phrase translation model is a combination of the automatically induced analysis for Model 1 plus the manually induced Al#-deletion in 3.2.2. 
The scores with only automatically induced morphological analysis are 0.21, 0.25, 0.33 and 0.36 for 3.5K, 35K, 350K and 3.3M sentence pair training corpora, respectively.</Paragraph> </Section> </Section> </Paper>