<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-2020">
  <Title>Automatic Collection of Related Terms from the Web</Title>
  <Section position="5" start_page="2" end_page="2" type="concl">
    <SectionTitle>
3 Experiments and Discussion
</SectionTitle>
    <Paragraph position="0"> First, we examined the precision of the system. We prepared fifty seed terms in total, ten for each of five genres: natural language processing, Japanese language, information technology, current topics, and persons in Japanese history. From these fifty terms, the system collected 610 terms in total, an average of 12.2 output terms per seed term. We checked by hand whether each of the 610 terms is a correct related term of its seed term. The result is shown in the left half (Evaluation I) of Table 2. In this evaluation, 519 of the 610 terms were correct, a precision of 85%. From this high value, we conclude that the system can be used as a tool that helps us compile a glossary.</Paragraph>
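The figures reported for Evaluation I can be rechecked with a few lines of arithmetic. The sketch below only recomputes the numbers stated above; the variable names are illustrative and do not come from the paper.

```python
# Counts as reported for Evaluation I.
seeds = 5 * 10        # ten seed terms for each of five genres
collected = 610       # terms output by the system
correct = 519         # terms judged correct by hand

average_per_seed = collected / seeds
precision = correct / collected

print(f"{average_per_seed:.1f} terms per seed, precision {precision:.0%}")
```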
    <Paragraph position="1"> Second, we tried to estimate the recall of the system. The actual recall cannot be calculated, because the ideal output is not clear and cannot be defined. To estimate it, we first prepared three to five target terms that should be collected for each seed term, and then checked whether each target term was included in the system output. We classified each target term into one of the following five cases and counted the terms in each case. The right half (Evaluation II) of Table 2 shows the result.</Paragraph>
    <Paragraph position="2"> S: the target term was collected by the system.</Paragraph>
    <Paragraph position="3"> F: the target term was removed in the filtering step.</Paragraph>
    <Paragraph position="4"> A: the target term existed in the compiled corpus, but was not extracted by automatic term extraction. C: the target term existed in the collected web pages, but did not exist in the compiled corpus.</Paragraph>
    <Paragraph position="5"> R: the target term did not exist on the collected web pages.</Paragraph>
    <Paragraph position="6"> Only 43 (20%) of the 210 target terms were collected by the system. This low recall comes primarily from failures of automatic term recognition (case A in the above classification), so this step needs improvement.</Paragraph>
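The five cases above amount to asking how far along the pipeline each target term survived. The following is a minimal sketch of that bookkeeping, assuming each stage can be represented as a set of terms; the function name, the set names, and the toy data are all hypothetical and are not taken from the system described here.

```python
from collections import Counter

def classify_target(term, pages, corpus, candidates, output):
    """Label one target term with the case S, F, A, C, or R.

    All four collections are hypothetical sets of terms, one per
    pipeline stage, checked from the last stage back to the first.
    """
    if term in output:
        return "S"  # collected by the system
    if term in candidates:
        return "F"  # extracted, then removed in the filtering step
    if term in corpus:
        return "A"  # in the compiled corpus, missed by term extraction
    if term in pages:
        return "C"  # on the collected web pages, not in the corpus
    return "R"      # not on the collected web pages

# Toy data for illustration only; not taken from the experiment.
pages = {"t1", "t2", "t3", "t4"}
corpus = {"t1", "t2", "t3"}
candidates = {"t1", "t2"}
output = {"t1"}
targets = ["t1", "t2", "t3", "t4", "t5"]

counts = Counter(classify_target(t, pages, corpus, candidates, output) for t in targets)
estimated_recall = counts["S"] / len(targets)
print(dict(counts), estimated_recall)
```

With the reported counts, the same ratio is 43 / 210, which is about 20%.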
    <Paragraph position="7"> We also examined whether each of the 210 target terms passes the filtering step. Of these, 133 terms (63%) passed; 44 terms did not satisfy the condition H(x) ≥ 100; 15 terms did not satisfy the condition H(x) ≤ 100,000; and 18 terms did not pass the relation test. Because far more target terms pass the filter (63%) than are collected by the system (20%), these experimental results suggest that the ATR step may be replaced with a simple and exhaustive term collector from a corpus.</Paragraph>
    <Paragraph position="8"> We plan to examine this possibility next.</Paragraph>
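The filtering breakdown above can be sketched as a small decision function. This is only an illustration under stated assumptions: it reads H(x) as a hit count for term x, takes the two hit-count conditions to be a lower bound of 100 and an upper bound of 100,000, and checks them before the relation test. The order of the checks, the function name, and the outcome labels are hypothetical.

```python
MIN_HITS = 100        # assumed lower bound on H(x)
MAX_HITS = 100_000    # assumed upper bound on H(x)

def filter_outcome(hits, passes_relation_test):
    """Return which filtering condition a term fails, or 'passed'."""
    if not hits >= MIN_HITS:
        return "too_few_hits"
    if hits > MAX_HITS:
        return "too_many_hits"
    if not passes_relation_test:
        return "relation_test_failed"
    return "passed"

# Counts as reported for the 210 target terms.
reported = {
    "passed": 133,
    "too_few_hits": 44,
    "too_many_hits": 15,
    "relation_test_failed": 18,
}
total = sum(reported.values())
pass_rate = reported["passed"] / total
print(total, f"{pass_rate:.0%}")
```

The four reported counts sum to the 210 target terms, and 133 / 210 rounds to the 63% given above.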
  </Section>
</Paper>