File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/02/c02-1006_concl.xml
Size: 3,472 bytes
Last Modified: 2025-10-06 13:53:11
<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1006">
<Title>NLP and IR Approaches to Monolingual and Multilingual Link Detection</Title>
<Section position="5" start_page="0" end_page="0" type="concl">
<SectionTitle>5 Results of the Evaluation on the TDT3 Corpus</SectionTitle>
<Paragraph position="0">We applied the best strategies and the thresholds trained in the above experiments to the TDT3 corpus for both the monolingual and the multilingual link detection tasks. The results of our methods and of the other sites participating in the TDT 2001 evaluation are shown in Table 10. In this evaluation, both published and unpublished topics are considered.</Paragraph>
<Paragraph position="1">For the monolingual task, nouns, adjectives, and compound nouns (CNs) are used to represent story vectors, and the decision and expansion thresholds are 0.06 and 0.07, respectively. For the multilingual task, nouns, verbs, adjectives, and CNs are used to represent story vectors. The thresholds for English pairs are the same as in the monolingual task; for Chinese pairs they are 0.2 and 0.25, respectively. The decision threshold for multilingual pairs is 0.05.</Paragraph>
<Paragraph position="2">In the multilingual task, our result (NTU) is better than that of The Chinese University of Hong Kong (CUHK), and the multilingual result is close to the monolingual result. This is a significant improvement.</Paragraph>
<Paragraph position="3">Conclusion and Future Work
Several issues in link detection are addressed in this paper. For both the monolingual and the multilingual tasks, the best features for representing stories are nouns, verbs, adjectives, and compound nouns. Story expansion using historical information is helpful. Story pairs in different languages have different similarity distributions, and using separate thresholds to model these differences is shown to be effective.</Paragraph>
<Paragraph position="4">Topic segmentation is an interesting issue.</Paragraph>
<Paragraph position="5">We expected it to bring some benefit, but the experiments in the TDT testing environment showed that this factor did not gain as much as we expected. The small number of multi-topic story pairs and the segmentation accuracy account for this result. We built an index file containing the multi-topic story pairs and ran experiments to investigate; the results support this explanation.</Paragraph>
<Paragraph position="6">We examined the similarities of story pairs and tried to determine why the miss rate was not reduced. Of the 4,908 pairs, 919 are missed. The mean similarity of the missed pairs is much smaller than the decision threshold, which means that the two stories share almost no common words even though they discuss the same topic. With no or few matching words, the similarity does not exceed the threshold. This is a problem we have to overcome.</Paragraph>
<Paragraph position="7">We also find that person names may be spelled differently by different news agencies. For example, the name of a balloonist is spelled &quot;Faucett&quot; in VOA news stories but &quot;Fossett&quot; in the other news sources. Moreover, in machine-translated news stories, person names are not rendered as their corresponding English names, so the same person name cannot be found in the two stories. In essence, person names are important features for discriminating between topics. This is another challenging issue to overcome.</Paragraph>
</Section>
</Paper>
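To make the decision procedure described above concrete, the following is a minimal sketch (outside the XML document) of threshold-based link detection. It assumes cosine similarity over term-frequency vectors and applies the per-language-pair decision thresholds reported in the paragraphs above (0.06 for English pairs, 0.2 for Chinese pairs, 0.05 for multilingual pairs). The function names, language codes, and the choice of cosine similarity are illustrative assumptions; feature extraction (nouns, verbs, adjectives, compound nouns) and the expansion step are not modeled here, since this section does not spell out those details.

```python
# Hypothetical sketch of threshold-based link detection, not the paper's exact method.
# Each story is assumed to be a bag of selected terms (nouns/verbs/adjectives/CNs),
# extracted elsewhere.
from collections import Counter
from math import sqrt

# Decision thresholds reported in the section; cosine similarity is an assumption.
DECISION_THRESHOLDS = {
    ("en", "en"): 0.06,   # English-English pairs
    ("zh", "zh"): 0.20,   # Chinese-Chinese pairs
    ("en", "zh"): 0.05,   # multilingual (cross-language) pairs
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_linked(story1_terms, lang1, story2_terms, lang2) -> bool:
    """Declare two stories linked if their similarity reaches the
    decision threshold for the corresponding language pair."""
    key = tuple(sorted((lang1, lang2)))
    threshold = DECISION_THRESHOLDS.get(key, 0.05)
    return cosine(Counter(story1_terms), Counter(story2_terms)) >= threshold

# Example: two short English stories sharing topical terms.
print(is_linked(["balloonist", "flight", "record"], "en",
                ["balloonist", "crash", "record"], "en"))
```

This also illustrates the miss-rate problem discussed above: if two stories on the same topic share no terms (for example, because a person's name is spelled differently in each source), the cosine similarity is zero and never reaches the threshold, regardless of how low the threshold is set.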