<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1006">
  <Title>NLP and IR Approaches to Monolingual and Multilingual Link Detection</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Multilingual Link Detection Algorithm
</SectionTitle>
    <Paragraph position="0"> Multilingual link detection should determine whether two stories in different languages discuss the same topic. In this paper, the stories are in English and Chinese. Unlike English text, Chinese text has no explicit word boundaries, so we have to segment Chinese sentences into meaningful lexical units.</Paragraph>
    <Paragraph position="1"> We employed our own Chinese segmentation and tagging system to pre-process Chinese sentences. As in monolingual link detection, each story in a pair is represented as a vector, and cosine similarity is used to decide whether the two stories discuss the same topic.</Paragraph>
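    <Paragraph> The vector comparison described above can be sketched as follows. This is a minimal illustration, not the system used in the paper: the sparse dictionary representation, the function names, the toy weights, and the use of "greater than or equal to" for the threshold test are all assumptions.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse term-weight vectors stored as dicts."""
    # Only terms present in both stories contribute to the dot product.
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty story is similar to nothing
    return dot / (norm_a * norm_b)


def same_topic(vec_a, vec_b, threshold):
    """Assumed decision rule: link the pair when similarity reaches the threshold."""
    return cosine_similarity(vec_a, vec_b) >= threshold


# Hypothetical toy stories (the weights are made up for illustration).
story_1 = {"earthquake": 2.0, "rescue": 1.0}
story_2 = {"earthquake": 1.0, "damage": 1.0}
print(cosine_similarity(story_1, story_2))
```

    Because stories in different languages share no terms until one of them is translated, the dot product of an untranslated multilingual pair would be zero, which is why a translation step is needed.</Paragraph>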
    <Paragraph position="2"> In multilingual link detection, we have to deal with terms used in different languages.</Paragraph>
    <Paragraph position="3"> Consider the following three cases. E and C denote an English story and a Chinese story, respectively. (E, E) denotes an English pair; (C, C) denotes a Chinese pair; and (C, E) or (E, C) denotes a multilingual pair.</Paragraph>
    <Paragraph position="4">  (a) (E, E): no translation is required.</Paragraph>
    <Paragraph position="5"> (b) (C, E) or (E, C): C is translated to E'.</Paragraph>
    <Paragraph position="6">  The new E' could be a purely English vector, or a vector that mixes the two languages if the original Chinese terms are retained in the new English vector.</Paragraph>
    <Paragraph position="7"> (c) (C, C): no translation is required; or both stories are translated into English and represented as English vectors; or the new English terms are added to the original Chinese vectors.</Paragraph>
    <Paragraph position="8"> We included the original Chinese terms in the new English vector because we could not find English translation candidates for some Chinese words. Retaining the Chinese terms ensures that this information is not lost.</Paragraph>
    <Paragraph position="9"> We employed a simple approach to translate a Chinese story into an English one: a Chinese-English dictionary is consulted. The dictionary contains 374,595 Chinese-English pairs. On average, each English term has 2.49 Chinese translations and each Chinese term has 1.87 English translations. In this dictionary, English translations are therefore less ambiguous, so we translated Chinese stories into English ones. If a Chinese word corresponds to more than one English word, all of these English words are selected; that is, we did not disambiguate the meaning of a Chinese word. To limit the noise introduced by multiple English translations, each translation term is assigned a lower weight: we divide the weight of a Chinese term by the total number of its translation equivalents.</Paragraph>
    <Paragraph position="11"> $$w(d, t_e) = \frac{w(d, t_c)}{N}$$ where $w(d, t_c)$ is the weight of a Chinese term in story $d$, $w(d, t_e)$ is the weight of its English translation in story $d$, and $N$ is the number of English translation candidates for the Chinese term.</Paragraph>
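    <Paragraph> The weighting rule, in which the weight of a Chinese term is divided by its number of translation candidates, can be sketched as follows. The toy dictionary, the romanized keys, the function name, and the keep_chinese flag are hypothetical. Summing the shares when several Chinese terms map to the same English word, and dropping untranslatable terms from the English-only vector, are also assumptions rather than details stated in the text.

```python
from collections import defaultdict

# Hypothetical toy dictionary: romanized Chinese term to English candidates.
TOY_DICTIONARY = {
    "tu sha": ["kill", "slaughter"],
    "sha hai": ["kill"],
}


def translate_vector(chinese_vector, dictionary, keep_chinese=False):
    """Map a Chinese term-weight vector onto English translation candidates.

    Each of the N candidates of a Chinese term receives w(d, t_c) / N.
    """
    translated = defaultdict(float)
    for zh_term, weight in chinese_vector.items():
        candidates = dictionary.get(zh_term, [])
        for en_term in candidates:
            # Assumption: shares coming from different Chinese terms add up.
            translated[en_term] += weight / len(candidates)
        if keep_chinese:
            # Mixed vector: the original Chinese term is retained as well.
            translated[zh_term] += weight
    return dict(translated)


# "wei zhi" has no dictionary entry, so it survives only in the mixed vector.
story = {"tu sha": 1.0, "sha hai": 0.5, "wei zhi": 0.3}
english_only = translate_vector(story, TOY_DICTIONARY)
mixed = translate_vector(story, TOY_DICTIONARY, keep_chinese=True)
print(english_only)
print(mixed)
```

    In this toy run the two Chinese terms both contribute to "kill", which illustrates how different surface forms can meet on a shared English translation.</Paragraph>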
    <Paragraph position="12"> Table 5 shows the performance of multilingual link detection. We conducted three experiments using different story representation schemes for Chinese stories. "E" denotes that Chinese stories are translated into English. "C" denotes that Chinese stories are compared directly without translation in Chinese pairs, but are translated into English in multilingual pairs. "EC" denotes that Chinese stories are represented by Chinese terms together with their corresponding English translation candidates. The threshold for English story pairs is set to 0.12. The threshold for the other pairs varies from 0.1 to 0.5. The results reveal that "E" is better than "C" and "EC". Translating Chinese stories into English could bring some advantages. Some Chinese terms that denote the same concept in different forms could be matched through their English translations, for example, "Tu Sha" and "Sha Hai" (kill), as well as "Xing Wei" and "Xing Jing" (behaviour).</Paragraph>
    <Paragraph position="13"> The effect of English translations for Chinese stories is similar to the effect of a thesaurus. We employed the CILIN thesaurus (Mei et al., 1982) in multilingual link detection. We used its small-category information and synonyms to expand the features selected to represent a news story. The experimental results are shown in Table 6. We found that the performance of the "E" translation scheme and that of the synonym expansion scheme are very close. In our view, a good bilingual dictionary can be regarded as a thesaurus.</Paragraph>
    <Paragraph position="14"> The results of multilingual link detection are clearly worse than those of monolingual link detection. When the threshold is 0.2, the best performance is 0.6260 and the miss rate is 0.4547. This miss rate is very high, so improving performance requires reducing it. We found that the similarity of two stories in different languages is very low compared with the similarity of two stories in the same language. It is therefore unfair to apply the same threshold to different languages, and we introduced a two-threshold method to resolve this problem.</Paragraph>
    <Paragraph position="15"> The performance of the two-threshold method for synonym expansion (denoted as "Syn") is shown in Table 7. "Chinese" means the [...] The results reveal a great improvement when the two-threshold method is applied. The threshold for Chinese story pairs is 0.2, the threshold for English story pairs is 0.12, and the threshold for multilingual story pairs is 0.05. The similarity distributions for story pairs in different languages vary. As in monolingual link detection, we experimented with combinations of different lexical terms. The results of these combinations are shown in Table 8. They show that the best-performing representation in the multilingual task differs from that in the monolingual task. Compound nouns (CNs) have a positive influence. However, in multilingual link detection, using nouns, verbs and adjectives to represent a story is better than using nouns and adjectives only. Words in Chinese are seldom tagged as adjectives: words that are tagged as adjectives in English are tagged as verbs in Chinese ("An Quan" vs. "safe").</Paragraph>
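    <Paragraph> The language-dependent thresholds reported above can be sketched as a lookup keyed by pair type. The threshold values are those given in the text. The language codes "zh" and "en", the key names, the function names, and the "greater than or equal to" comparison are assumptions made for illustration.

```python
# Thresholds reported in the text, keyed by the type of story pair.
THRESHOLDS = {
    "chinese": 0.2,        # (C, C) pairs
    "english": 0.12,       # (E, E) pairs
    "multilingual": 0.05,  # (C, E) or (E, C) pairs
}


def pair_type(lang_a, lang_b):
    """Classify a story pair by the languages of its two stories."""
    if lang_a != lang_b:
        return "multilingual"
    return "chinese" if lang_a == "zh" else "english"


def is_linked(similarity, lang_a, lang_b):
    """Assumed decision rule: link the pair when similarity reaches its threshold."""
    return similarity >= THRESHOLDS[pair_type(lang_a, lang_b)]


# The same similarity score leads to different decisions per pair type.
for langs in [("zh", "en"), ("en", "en"), ("zh", "zh")]:
    print(langs, is_linked(0.08, *langs))
```

    A score of 0.08 is accepted only for the multilingual pair, which reflects the observation that cross-language similarities are systematically lower than same-language ones.</Paragraph>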
    <Paragraph position="16"> We also applied the story expansion described in Section 2.3 before computing the similarity. Note that only stories in the same language are used to expand each other. In Table 9, "One" denotes that the weights of the expanded terms are the same as the original ones, and "Half" denotes that the weights of the expanded terms are only half of the original ones. The results reveal that expanded terms with half weights perform better than those with the original weights. Giving expanded terms half weights could reduce the effect of noise. Nouns, verbs, adjectives and compound nouns are used to represent stories in Table 9, and the thresholds are set to the best ones from the previous experiments. The expansion threshold for Chinese pairs varies from 0.2 to 0.3.</Paragraph>
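    <Paragraph> The difference between the "One" and "Half" settings can be illustrated with a small sketch. The expansion procedure itself is defined in Section 2.3, which is not reproduced here, so the merge rule below (adding every term of a same-language partner story at a scaled weight) is an assumption, as are the function name and the toy weights.

```python
def expand_story(story, partner, factor=0.5):
    """Return a copy of story with the partner's terms added at scaled weight.

    Assumed merge rule: each partner term contributes factor * weight,
    whether or not the term already occurs in the story.
    """
    expanded = dict(story)
    for term, weight in partner.items():
        expanded[term] = expanded.get(term, 0.0) + factor * weight
    return expanded


# Hypothetical same-language stories with made-up weights.
story = {"flood": 1.0, "rescue": 0.5}
partner = {"flood": 2.0, "dam": 1.0}

half = expand_story(story, partner, factor=0.5)  # "Half" setting
one = expand_story(story, partner, factor=1.0)   # "One" setting
print(half)
print(one)
```

    With the smaller factor, a term such as "dam" that occurs only in the partner story enters the vector with less influence, which is one way to read the claim that half weights reduce the effect of noise.</Paragraph>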
  </Section>
</Paper>