<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1603"> <Title>Preliminary Lexical Framework for English-Arabic Semantic Resource Construction</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 http://trec.nist.gov </SectionTitle> <Paragraph position="0"> though not always correct, outperform bilingual term lists with only one translation alternative.</Paragraph> <Paragraph position="1"> Combining dictionaries is especially important when working with ambiguous languages such as Arabic.</Paragraph> <Paragraph position="2"> Many TREC teams used translation probabilities to deal with translation ambiguity and term weighting issues, especially since a translation lexicon with probabilities was provided as a standard resource. However, most teams combined translation probabilities from different sources and achieved better retrieval results that way (Xu, Fraser, and Weischedel, 2002), (Chowdhury et al., 2002), (Darwish and Oard, 2002). Darwish and Oard (2002) posit that since there is no such thing as a complete translation resource one should always use a combination of resources and that translation probabilities will be more accurate if one uses more resources.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Resource combination methodologies </SectionTitle> <Paragraph position="0"> Ruiz (2000) uses the term lexical triangulation to describe the process of mapping a bilingual English-Chinese lexicon into an existing WordNet-based Conceptual Interlingua by using translation evidence from multiple sources. Recall that WordNet synsets are formed by groups of terms with similar meaning (Miller, 1990). By translating each of the synonyms into Chinese, Ruiz created a frequency-ranked list of translations, and assumed that the most frequent translations were most likely to be correct. 
By establishing certain translation evidence thresholds, mappings of varying reliability were created. This method was later augmented with additional translation evidence from a Chinese-English parallel corpus.</Paragraph> <Paragraph position="1"> Chen (2003) describes a methodology to improve query translation. The methodology is intended to improve translation through the use of NLP techniques and by combining the document collection, available translation resources, and transliteration techniques. A basic mapping was created between the Chinese terms from the collection and the English terms in WordNet by using a simple Chinese-English lexicon. Missing terms such as Named Entities were added through transliteration.</Paragraph> <Paragraph position="2"> By customizing the translation resources to the document collection, Chen showed an improvement in retrieval performance.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="7" type="metho"> <SectionTitle> 3 Establishing a Preliminary Framework </SectionTitle> <Paragraph position="0"> The preliminary Framework provides a methodology for the automatic combination of various lexical semantic resources such as machine-readable dictionaries, ontologies, encyclopedias, and machine translation lexicons. While each of these resources is valuable on its own, automatically and intelligently combining them into a single lexical knowledge base will yield a resource that is greater than the sum of its parts. The resulting resource will provide better coverage, more reliable translation probability information, and additional information leveraged through the process of lexical triangulation. 
In an initial evaluation, the preliminary Framework was applied to the combination of English and Arabic lexical resources, as described in Section 4.</Paragraph> <Paragraph position="1"> The preliminary Framework consists of nine stages: 1) establish goals; 2) collect resources; 3) create resource feature matrix; 4) develop evidence combination strategies and thresholds; 5) construct combinatory lexical resource; 6) manage problems that arise during creation; 7) evaluate combinatory lexical resource; 8) implement possible improvements; and 9) create final version of combinatory lexical resource.</Paragraph> <Paragraph position="2"> Stage 1: The first stage of the Framework is intended to establish the possible usage of the combinatory lexical resource (resulting from the combination of multiple resources). The requirements of this resource will drive the second stage: resource collection.</Paragraph> <Paragraph position="3"> Stage 2: Two types of resources should be collected: language processing resources such as stemmers and tokenizers, and lexical semantic resources such as dictionaries and lexicons. While not every resource may seem particularly useful at first, different resources can aid in mapping other resources together. During the second stage, conversion into a single encoding (such as UTF-8) will also take place.</Paragraph> <Paragraph position="4"> Stage 3: Once a set of resources has been collected, the resource feature matrix can be created. This matrix provides an overview of the types of information found in the collected resources and of certain resource characteristics. For example, it is important to note what base form the dictionary entries have. Some dictionaries use the singular form (for nouns) or infinitive form (for verbs), some use roots, others use stems, and free resources from the web often use a combination of all of the above. 
By studying the feature matrix, the evidence combination strategies for stage four can be developed.</Paragraph> <Paragraph position="5"> Stage 4: The combination strategy should be informed by the features of the different resources. It may be, for example, that one resource uses vocalized Arabic only and that another resource uses both vocalized and unvocalized Arabic. This fact should be taken into account by the combination strategy since the second resource can serve as an intermediary to map the first resource. Thresholding decisions are also part of stage four because the certainty of some combinations will be higher than that of others.</Paragraph> <Paragraph position="6"> Stage 5: Stage five involves writing programs, based on the findings in stage four, that will automatically create the combinatory lexical resource. The combination programs should report problematic instances that occur during creation, e.g. words that occur in only a single resource, so that these problems can be handled by alternative strategies in stage six.</Paragraph> <Paragraph position="7"> Stage 6: Most of the problems in stage six are likely to involve uncommon words, such as named entities or transliterations. A transliteration step, in which English letters such as r are mapped to the closest-sounding Arabic letters, may be applied for languages that do not share the same orthography.</Paragraph> <Paragraph position="8"> Stage 7: After the initial combinatory lexical resource has been created, it needs to be evaluated. First, the accuracy (quality) of the combination mappings of the various resources needs to be assessed in an intrinsic evaluation. After it has been established that the combination has been successful, an extrinsic evaluation can be carried out. In this evaluation the combinatory lexical resource is tested as part of the actual application the resource was intended for, e.g. CLIR. (For a more detailed description of evaluation see Section 5 below.) 
Stage 8: These two evaluations will inform stage eight, where possible improvements are added to the combination process.</Paragraph> <Paragraph position="9"> Stage 9: The final version of the combinatory lexical resource can be created in stage nine.</Paragraph> <Paragraph position="10"> 4 Application of the Framework to English-Arabic The preliminary Framework as described in Section 3 was applied to five English and Arabic language resources as a feasibility test. Following the Framework, we first established the goals of the combinatory lexical resource. It was determined that the resource would be used as a translation resource for CLIR that would aid query translation as well as manual translation disambiguation by the user. This meant that the combinatory lexical resource would need translation probabilities as well as English definitions for Arabic translations to enable an English language user to select the correct Arabic translation. We collected five different resources: WordNet 2.0, the lexicon included with the Buckwalter Stemmer, translations mined from Ajeeb, the wordlist from the Arabeyes project, and the LDC Arabic Gigaword corpus. After the resources were collected, the feature matrix was developed (see Table 1).</Paragraph> <Paragraph position="12"> The established combinatory lexical resource goals and resource feature matrix were used to determine the combination strategy. Since the resource should provide the user with definitions of Arabic words and WordNet is the most comprehensive in this regard, it was selected as our base resource. The AFP newswire collection from the Gigaword corpus was used to mine Ajeeb. As is evident in the matrix, all resources contain English terms as a common denominator. The information used for evidence combination was as follows. 
Part-of-speech information was used as evidence for mapping the Ajeeb and Buckwalter lexicons.</Paragraph> <Paragraph position="13"> Additionally, these two resources also provide vocalized Arabic terms/stems that can be used for a more reliable (less ambiguous) match. The Arabeyes lexicon is not particularly rich but was used as additional evidence for a certain translation through frequency weighting. The combinatory lexical resource was constructed by mapping the three lexical resources into WordNet using the evidence as discussed above (see Table 2).</Paragraph> <Paragraph position="14"> [Table 2, example resulting from Step 5: world, human race, humanity, humankind, human beings, humans, mankind, man (all of the inhabitants of the earth).] After examining the combinatory lexical resource, we found that the Arabeyes Arabic terms could not be compared directly to the Arabic terms in the other lexical resources since the determiner prefixes are still attached to the terms (as in $ for example). More problematic were the translations mined from Ajeeb, since the part-of-speech information of the Arabic term did not necessarily match the part-of-speech of the translations: #VB#2.1.2# #do_sentry_duty,keep_watch_over, guard,watchdog,oversee,sentinel, shield,watch,ward. The first problem is easily fixed by applying a light stemmer to the dictionary. At this point, however, it is not clear how to fix the second problem. It was also decided that the translation reliability weighting by frequency is too limited to be useful. A back-translation lookup needs to determine how many other terms can result in a certain translation. This data can then update the reliability score.</Paragraph> </Section> <Section position="6" start_page="7" end_page="7" type="metho"> <SectionTitle> 5 Comprehensive Evaluation </SectionTitle> <Paragraph position="0"> While we have carried out only a preliminary evaluation, we envision a comprehensive evaluation in the near future. 
As part of this evaluation, three different types of evaluation can be carried out: 1) evaluate the process of applying the Framework; 2) evaluate the combinatory lexical resource itself; and 3) evaluate the contribution of the combinatory lexical resource to the application the resource was created for. Evaluation of the process of applying the Framework will provide evidence as to the advantages and disadvantages of our Framework, and where it may have to be adjusted.</Paragraph> <Paragraph position="1"> The construction of a combinatory lexical resource by applying the Framework is the first step toward an effective evaluation of the full Framework. The construction process detailed in Section 3 should be carefully documented. The evaluation will focus on the time and effort spent on the process, the difficulty or ease with which resources are acquired, managed, and processed, as well as problems or issues that arise during the process. The intrinsic evaluation of the combinatory lexical resource indicates the quality of the newly created resource. For this evaluation a large random sample of entries will need to be evaluated for correctness. The evaluation will provide accuracy and coverage measures for the resource. Also, descriptive statistics will be generated to provide a general understanding of the lexical resource that has been produced.</Paragraph> <Paragraph position="2"> The extrinsic evaluation of the combinatory lexical resource is intended to measure the contribution of the resource to an application (e.g. CLIR, Information Extraction). The application of choice should be run both with and without the combinatory lexical resource. Performance metrics appropriate for the type of application can be collected for both experiments and then compared.</Paragraph> </Section> </Paper>