<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2205">
  <Title>Redefining similarity in a thesaurus by using corpora</Title>
  <Section position="5" start_page="1132" end_page="1133" type="metho">
    <SectionTitle>
4 Remarks
</SectionTitle>
    <Paragraph position="0"> It is difficult to extract all knowledge from a corpus alone because of incomplete analysis and data sparseness. To avoid these difficulties, a promising approach is to use resources other than the corpus. Constructing a thesaurus from a dictionary (Turumaru et al., 1991) and making example data from usable knowledge (Kaneda et al., 1995) are instances of this approach. The proposed method uses a handmade thesaurus as the resource other than the corpus. In addition, the statistical data from the corpus are weighted. However, it will be important in future research to investigate how much weight should be given to each piece of data.</Paragraph>
    <Paragraph position="1"> It is difficult to build knowledge for each domain from scratch. It is therefore important to extend and modify existing knowledge according to the purpose of use. In this method, relatively few of the cooccurrence data are used, because the nouns in the remaining data are not listed in Bunrui-goi-hyou. If we extend Bunrui-goi-hyou, these unused cooccurrence data may become useful.</Paragraph>
    <Paragraph position="2"> Moreover, by using the obtained similarities, we can modify Bunrui-goi-hyou. Since our method constructs a thesaurus from the handmade thesaurus by using the corpus, it can be considered a method for refining the handmade thesaurus so that it suits the domain of the corpus used.</Paragraph>
  </Section>
  <Section position="6" start_page="1133" end_page="1133" type="metho">
    <SectionTitle>
5 Conclusions
</SectionTitle>
    <Paragraph position="0"> In this paper, we proposed a method for defining similarities between general nouns used in various domains. The proposed method redefines the similarities in a handmade thesaurus by using corpora. The method avoids data sparseness by estimating undefined similarities from the similarities in the thesaurus and the similarities defined by corpora. The obtained similarities are the same in number as the original similarities and are more appropriate than the original similarities in the thesaurus.</Paragraph>
    <Paragraph position="1"> By using Bunrui-goi-hyou as the handmade thesaurus and newspaper articles comprising about 7.85 M sentences as the corpus, we confirmed the appropriateness of this method.</Paragraph>
    <Paragraph position="2"> In the future, we will extend and modify Bunrui-goi-hyou using the cooccurrence data and the similarities obtained in this study, and will try to classify the multiple senses of verbs.</Paragraph>
  </Section>
</Paper>