<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1112">
  <Title>Chinese Term Extraction from Web Pages Based on Compound word Productivity</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Typical Procedures of Automatic Term
Recognition
</SectionTitle>
      <Paragraph position="0"> An automatic term recognition (ATR) procedure generally consists of two steps. The first step extracts term candidates from a corpus. The second step assigns each term candidate a score that indicates how likely the candidate is to be a term. All candidates are then ranked according to their scores. In the remainder of this section, we describe the background of the candidate extraction step and the scoring step in turn.</Paragraph>
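      <Paragraph position="1"> The two-step procedure described above can be sketched as follows. This is only a minimal illustration, not the implementation used in this paper: the function name recognize_terms and its extract and score parameters are hypothetical, and the toy stand-ins (whitespace tokens as candidates, raw frequency as score) are assumptions chosen to keep the example short.

```python
from typing import Callable, Iterable, List, Tuple

def recognize_terms(
    corpus: Iterable[str],
    extract: Callable[[str], List[str]],
    score: Callable[[str, List[str]], float],
) -> List[Tuple[str, float]]:
    # Step 1: gather term candidates from every document in the corpus.
    occurrences = [c for document in corpus for c in extract(document)]
    # Step 2: assign each distinct candidate a score.
    scored = {c: score(c, occurrences) for c in set(occurrences)}
    # Rank in descending order of score; ties are broken alphabetically.
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))

# Toy stand-ins (assumptions): whitespace tokens as candidates,
# raw frequency over the whole corpus as the score.
ranking = recognize_terms(
    ["term extraction", "term recognition"],
    extract=str.split,
    score=lambda candidate, occurrences: float(occurrences.count(candidate)),
)
```

Keeping extraction and scoring as separate arguments mirrors the separation between the two steps, so either one can be replaced without touching the other.</Paragraph>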
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Candidates Extraction
</SectionTitle>
      <Paragraph position="0"> In extracting term candidates from a given text corpus, we mainly focus on compound words as well as simple words. To extract compound words that are promising term candidates while excluding undesirable strings such as &quot;is a&quot; or &quot;of the&quot;, a frequently used method is to filter out words that appear in a stopword list. The structure of complex terms is another important factor in automatic term candidate extraction. Many works focus on the syntactic structure obtained by parsing. Since we focus on these complex structures, the first task in extracting term candidates is morphological analysis, including part-of-speech (POS) tagging.</Paragraph>
      <Paragraph position="1"> Since there are no explicit word boundary markers in Chinese, we first have to perform morphological analysis, which segments a sentence into words and assigns POS tags at the same time.</Paragraph>
      <Paragraph position="2"> After POS tagging, the complex structures mentioned above are extracted as term candidates.</Paragraph>
      <Paragraph position="3"> Previous studies have proposed many promising ways to do this. For instance, Smadja and McKeown (1990) and Frantzi and Ananiadou (1996) tried to treat more general structures such as collocations.</Paragraph>
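      <Paragraph position="4"> As an illustration of candidate extraction after POS tagging, the sketch below collects maximal runs of noun tokens and drops stopwords. It assumes the input has already been segmented and tagged by an external morphological analyzer; the tag names, the stopword list, and the function name extract_candidates are all hypothetical, and the noun-run pattern is just one simple choice of structure rather than the pattern used in this paper.

```python
from typing import List, Tuple

STOPWORDS = {"is", "a", "of", "the"}  # illustrative stopword list (assumption)
NOUN_TAGS = {"N"}                     # hypothetical POS tag set

def extract_candidates(tagged: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Return maximal runs of noun tokens that are not stopwords."""
    candidates, run = [], []
    for word, tag in tagged:
        if tag in NOUN_TAGS and word not in STOPWORDS:
            run.append(word)
        else:
            # A non-noun or stopword ends the current run, if any.
            if run:
                candidates.append(tuple(run))
            run = []
    if run:
        candidates.append(tuple(run))
    return candidates

# Input is assumed to be already segmented and POS-tagged.
tagged_sentence = [
    ("term", "N"), ("extraction", "N"), ("is", "V"), ("a", "DET"),
    ("task", "N"), ("of", "P"), ("the", "DET"), ("web", "N"), ("corpus", "N"),
]
candidates = extract_candidates(tagged_sentence)
```

Each candidate is a tuple of words, so single-element tuples correspond to simple words and longer tuples to compound words.</Paragraph>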
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Scoring
</SectionTitle>
      <Paragraph position="0"> The next step of ATR is to assign each term candidate a score so that the candidates can be ranked in descending order of termhood. Many researchers have sought a definition of a candidate's score that approximates termhood. Many of these proposals use statistics of actual use in a corpus, such as term frequency, which is so simple and powerful that many researchers have used it directly or indirectly. The combination of term frequency and inverse document frequency is also well studied, e.g. (Uchimoto et al. 2000), (Fukushige and Noguchi 2000). On the other hand, several scoring methods that are neither directly nor heavily based on the frequency of term candidates have been proposed. Among those, Ananiadou et al. proposed C-value (Frantzi and Ananiadou 1996), which measures how independently a given compound word is used in the given corpus.</Paragraph>
      <Paragraph position="1"> Hisamitsu (2000) proposes a measure of termhood that estimates how far the documents containing a given term differ from the distribution of documents not containing that term. However, the method proposed by Nakagawa and Mori (2003) outperforms these methods on the NTCIR1 TMREC task (Kageura et al. 1999).</Paragraph>
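      <Paragraph position="2"> To make the frequency-based scoring concrete, the sketch below combines term frequency with inverse document frequency. It uses one common textbook weighting, total frequency times log(N / df), which is an assumption here and not necessarily the variant used in the works cited above; the function name tf_idf_scores and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf_scores(documents: List[List[str]]) -> Dict[str, float]:
    """Score each candidate by total term frequency times log(N / df)."""
    n_docs = len(documents)
    # Term frequency: occurrences of each candidate over the whole corpus.
    term_freq = Counter(c for doc in documents for c in doc)
    # Document frequency: number of documents containing each candidate.
    doc_freq = Counter(c for doc in documents for c in set(doc))
    return {
        c: term_freq[c] * math.log(n_docs / doc_freq[c])
        for c in term_freq
    }

# Each document is given as its list of extracted candidates (toy data).
documents = [
    ["term extraction", "corpus", "term extraction"],
    ["corpus", "word segmentation"],
    ["corpus", "term extraction"],
]
scores = tf_idf_scores(documents)
# Rank by descending score; ties are broken alphabetically.
ranking = sorted(scores, key=lambda c: (-scores[c], c))
```

A candidate that occurs in every document, such as "corpus" in the toy data, receives a score of zero under this weighting, however frequent it is.</Paragraph>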
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Chinese Term Extraction
</SectionTitle>
      <Paragraph position="0"> As for Chinese-language NLP, a great many works on word segmentation have been published, e.g. (Ma and Xia 2003). Although the term &quot;term extraction&quot; has not yet been used in Chinese NLP, keyword extraction has been a target for a long time. For instance, keyword extraction from news articles (Li et al. 2003) is a recent result that uses the frequency and length of character strings for scoring. The max-duplicated string based method (Yang and Li 2002) is also promising. In spite of previous research efforts, there has been no attempt so far to apply the relation between simple and compound words to Chinese term extraction, and that is exactly what we propose in this paper.</Paragraph>
    </Section>
  </Section>
</Paper>