File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/w02-1407_intro.xml

Size: 3,429 bytes

Last Modified: 2025-10-06 14:01:35

<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1407">
  <Title>A Simple but Powerful Automatic Term Extraction Method</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Background
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Candidates Extraction
</SectionTitle>
      <Paragraph position="0"> The first step in ATR is to extract term candidates from the given text corpus. Here we focus only on nouns, more precisely single nouns and compound nouns, which are exactly the targets of the NTCIR1 TMREC task (Kageura et al. 1999). To extract compound nouns that are promising term candidates while excluding undesirable strings such as &quot;is a&quot; or &quot;of the&quot;, a frequently used method is to filter out words that are members of a stop-word list. More complex structures, such as noun phrases and collocations, have also received attention (Frantzi and Ananiadou 1996). All of these are good term candidates in a corpus of a specific domain because all of them have strong unithood (Kageura &amp; Umino 1996), which refers to the degree of strength or stability of syntagmatic combinations or collocations. We assume the following about compound nouns or collocations: Assumption: Terms having complex structure are to be made of existing simple terms. The structure of complex terms is another important factor in automatic term candidate extraction. It is expressed syntactically or semantically. As for syntactic structure, many works focus on dependency structures obtained by parsing. Since we focus on these complex structures, the first task in extracting term candidates is morphological analysis, including part-of-speech (POS) tagging. For Japanese, which is an agglutinative language, a morphological analysis was carried out that segmented each sentence into words and assigned POS tags (Matsumoto et al.</Paragraph>
      <Paragraph position="1"> 1996).</Paragraph>
      <Paragraph position="2"> After POS tagging, the complex structures mentioned above are extracted as term candidates. Previous studies have proposed many promising methods for this purpose. Hisamitsu (2000) and Nakagawa (1998) concentrated on compound nouns, while Frantzi and Ananiadou (1996) tried to handle more general structures such as collocations.</Paragraph>
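      <Paragraph position="3"> The stop-word filtering and compound-noun extraction described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the tag "N" for nouns, the tiny STOP_WORDS set, and the function name extract_candidates are all assumptions made for illustration, and the input is assumed to be already segmented and POS-tagged.

```python
from typing import Iterable, List, Tuple

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"is", "a", "of", "the"}


def extract_candidates(tagged: Iterable[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Collect maximal runs of nouns as single-noun or compound-noun candidates.

    `tagged` is a sequence of (word, tag) pairs. The tag "N" marks a noun
    (an assumed, simplified tag set). Stop words and non-nouns end a run.
    """
    candidates: List[Tuple[str, ...]] = []
    run: List[str] = []
    for word, tag in tagged:
        if tag == "N" and word.lower() not in STOP_WORDS:
            run.append(word)
        else:
            # A non-noun or stop word closes the current candidate, if any.
            if run:
                candidates.append(tuple(run))
            run = []
    if run:
        candidates.append(tuple(run))
    return candidates


# Toy input standing in for the output of a morphological analyzer.
sentence = [("term", "N"), ("extraction", "N"), ("is", "V"), ("a", "DET"),
            ("task", "N"), ("of", "P"), ("the", "DET"),
            ("language", "N"), ("processing", "N"), ("field", "N")]
candidates = extract_candidates(sentence)
```

      On this toy sentence the sketch yields three candidates: the compound noun "term extraction", the single noun "task", and the compound noun "language processing field". Strings such as "is a" and "of the" never become candidates because their words are either stop words or not tagged as nouns.</Paragraph>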
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 Scoring
</SectionTitle>
      <Paragraph position="0"> The next thing to do is to assign a score to each term candidate in order to rank them in descending order of termhood.</Paragraph>
      <Paragraph position="1"> Many researchers have sought a definition of the term candidate's score that approximates termhood. Many of these proposals make use of surface statistics such as tf·idf. Ananiadou et al. proposed C-value (Frantzi and Ananiadou 1996) and NC-value (Frantzi and Ananiadou 1999), which measure how independently the given compound noun is used in the given corpus. Hisamitsu (2000) proposes a way to measure termhood based on how far the given term deviates from the distribution of non-domain-specific terms. All of these measures try to capture how important and how independent the writer considers each term to be in a corpus.</Paragraph>
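      <Paragraph position="2"> As a rough illustration of scoring with a surface statistic, the sketch below ranks candidates by tf·idf in descending order. It is only an assumed, textbook-style formulation (term frequency times the logarithm of the number of documents divided by the document frequency), not the C-value or NC-value measures and not the scoring method of this paper. The function names tf_idf_scores and rank and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_scores(docs: List[List[str]], target: int) -> Dict[str, float]:
    """Score each candidate of docs[target] by tf * log(N / df).

    Each document is a list of already extracted term candidates.
    """
    n_docs = len(docs)
    df: Counter = Counter()
    for doc in docs:
        # Count each candidate once per document it appears in.
        df.update(set(doc))
    tf = Counter(docs[target])
    return {cand: tf[cand] * math.log(n_docs / df[cand]) for cand in tf}


def rank(scores: Dict[str, float]) -> List[str]:
    """Order candidates by descending score, breaking ties alphabetically."""
    return sorted(scores, key=lambda cand: (-scores[cand], cand))


# Toy corpus: three documents, each given as its list of candidates.
docs = [
    ["term extraction", "corpus", "term extraction", "noun"],
    ["corpus", "noun"],
    ["corpus", "parser"],
]
scores = tf_idf_scores(docs, target=0)
ranked = rank(scores)
```

      In this toy corpus "term extraction" is ranked first because it is frequent in the target document and absent elsewhere, while "corpus" receives a score of zero because it occurs in every document.</Paragraph>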
    </Section>
  </Section>
</Paper>