<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2248"> <Title>Target Word Selection as Proximity in Semantic Space</Title> <Section position="5" start_page="1496" end_page="1497" type="metho"> <SectionTitle> 2 Experiment </SectionTitle> <Paragraph position="0"> To assess the proposed semantic distance (SD) method for target word selection, I used an English-Spanish parallel corpus for testing and evaluation. Several features of a real MT system were incorporated so that the experiment would mimic the type of information available to the lexical selection component. Investigation was restricted to the translation of content words: common nouns, verbs, adjectives and adverbs.</Paragraph> <Section position="1" start_page="1496" end_page="1496" type="sub_section"> <SectionTitle> 2.1 Materials and Procedure </SectionTitle> <Paragraph position="0"> The test corpus was an English-language movie script that had been translated into Spanish on a line-by-line basis. A random sample of 170 lines was extracted from the Spanish half of the corpus, and each content word in this SL subcorpus was looked up in the online version of Bilingual Dictionary. Experimental items were chosen and a bilingual lexicon (see Figure 1) formed from the information in the dictionary, subject to the following constraints: The translations given in the parallel corpus for 13 SL items were not listed in Langenscheidt's. This was due to the directionality of bilingual dictionaries - entries are created from the TL point of view - and the fact that the direction of original translation was opposite to that used for building the testing lexicon. These translations were incorporated into the bilingual lexicon. A total of 99 experimental items were compiled.</Paragraph> <Paragraph position="1"> For each SL item, the corresponding TL translation was located in the parallel corpus and all TL content words within a ±25 word window were extracted to form the local discourse context. 
Co-occurrence vectors for each lemmatised context word meeting the frequency threshold were created from a lemmatised version of the spoken part of the BNC. Vectors were constructed by advancing a window of ±3 words through the corpus and recording, for each word, the number of times each of 446 index words occurred within the window. This procedure produced a 446-dimensional semantic space. Finally, co-occurrence counts were replaced with their log-likelihood values, which effectively normalizes the vectors. Parameter settings were taken from McDonald (1997).</Paragraph> <Paragraph position="2"> Vectors for the translation candidates were created using exactly the same method.</Paragraph> <Paragraph position="3"> Compared to a practical MT system, the lexical selection simulation makes several simplifying assumptions. First, when two or more items occur in the same SL sentence, each is treated as if all the other items had already been translated correctly. Second, the use of forward context means that a word is left untranslated until a prespecified number of following words have been translated. Finally, the bilingual lexicon listed 4.2 translation candidates per entry on average. Many of the alternatives could be described as stylistic variants, and might not be present in an actual MT lexicon.</Paragraph> </Section> <Section position="2" start_page="1496" end_page="1497" type="sub_section"> <SectionTitle> 2.2 Calculating Semantic Distance </SectionTitle> <Paragraph position="0"> The proximity of each translation candidate to the bag of words forming the local TL context was measured as described below, and the &quot;closest&quot; target was chosen. The method for scaling each dimension of the space was adapted from Kozima and Ito (1995) in order to de-emphasize dimensions irrelevant to the local context. 
If the variability of vector component i is high, then this dimension is considered to be less relevant than a component with lower variability, and the semantic distance measure should take this into account.</Paragraph> <Paragraph position="1"> The relevance $r_i$ for each dimension is defined as the ratio of the standard deviation $s_i$ of the distribution formed by dimension i, for all local context words LC, over the maximum standard deviation $s_{max}$ for LC: $r_i = s_i / s_{max}$. For each candidate translation t, the vector representing each word c in LC is moved to a new position in the space according to a function of r and its current distance from t: $c'_i = c_i + r_i (t_i - c_i)$. If $r_i$ is large, then any difference in the value of component i between t and LC is made less prominent than if $r_i$ is small. Finally, semantic distance is calculated as the mean cosine of the angle between target t and each word c in LC:</Paragraph> <Paragraph position="3"/> </Section> <Section position="3" start_page="1497" end_page="1497" type="sub_section"> <SectionTitle> 2.3 Results and Discussion </SectionTitle> <Paragraph position="0"> Performance was evaluated against the actual English translation aligned with each Spanish item. Two baseline measures were used for comparison: accuracy expected by random selection, and word frequency (WF; selection of the translation candidate with the highest corpus frequency). The semantic distance method made 57/99 correct choices (57.6%) whereas the frequency method bettered it slightly, choosing the aligned translation 59 times (59.6%).</Paragraph> <Paragraph position="1"> Expected chance performance was 22.9%. Of the errors made by WF, SD corrected 15%, and WF corrected 19% of the SD method's errors.</Paragraph> <Paragraph position="2"> In about one-quarter of the errors made by the SD method, the selected candidate and the &quot;correct&quot; translation seemed equally acceptable in the context. 
This can be seen more clearly in an example TL context for trabajo (Figure 2).</Paragraph> <Paragraph position="3"> There appears to be little information in the context on which to prefer work over the closely related job.</Paragraph> <Paragraph position="4"> Performance was assessed at the level of 100% applicability - the SD method was used for every item. Future work will investigate the use of a confidence estimate: if the evidence for preferring one candidate over another is weak, an alternative selection method should be used.</Paragraph> <Paragraph position="5"> SL: Ud. es muy dedicado a su trabajo. TL: ... to go back to the office.</Paragraph> <Paragraph position="6"> what's your name? i'm john wilkenson.</Paragraph> <Paragraph position="7"> why were you on the plane? on business.</Paragraph> <Paragraph position="8"> you're very committed to your &lt;X&gt;.</Paragraph> <Paragraph position="9"> you go ahead and finish your story, please.</Paragraph> <Paragraph position="10"> we were taking a vacation-my sister, me, and our kids.</Paragraph> <Paragraph position="11"> you know-no husbands.</Paragraph> <Paragraph position="12"> we saw ...</Paragraph> <Paragraph position="13"> trabajo → work. X indicates the target word position.</Paragraph> </Section> </Section> </Paper>