
<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1005">
  <Title>Identification of relevant terms to support the construction of Domain Ontologies</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> An obvious problem of any automatic method for concept extraction is to provide objective performance evaluation.</Paragraph>
    <Paragraph position="1"> * Firstly, a &amp;quot;golden standard&amp;quot; tourism terminology would be necessary to formally measure the accuracy of the method. One such standard is not available, and determining this standard is one of the objectives of FETISH.</Paragraph>
    <Paragraph position="2"> Moreover, the notion of &amp;quot;term&amp;quot; is too vague to consider available terminological databases as &amp;quot;closed&amp;quot; sets, unless the domain is extremely specific.</Paragraph>
    <Paragraph position="3"> * Secondly, no formal methods to evaluate a terminology are available in literature. The best way to evaluate a &amp;quot;basic&amp;quot; linguistic component (i.e. a module that performs some basic task, such as POS tagging, terminology extraction, etc.) within a larger NLP application (information extraction, document classification, etc.) is to compute the difference in performance with and without the basic component. In our case, since Ontology does not perform any measurable task, adopting a similar approach is not straightforward. As a matter of facts, an Ontology is a basic component itself, therefore it can be formally evaluated only in the context of some specific usage of the Ontology itself.</Paragraph>
    <Paragraph position="4"> Having in mind all these inherent difficulties, we performed two sets of experiments. In the first, we extracted the terminology from a collection of texts in the Tourism domain, and we manually evaluated the results, with the help of other participants in the FETISH project (see the FETISH web site). In the second, we attempted to assess the generality of our approach. We hence extracted the terminology from a financial corpus (the Wall Street journal) and then we both manually evaluated the result, and compared the extracted terminology with an available thesaurus in a (approximately) similar domain. As a reference set of terms we used the Washington Post5 (WP) dictionary of economic and financial terms.</Paragraph>
    <Paragraph position="5"> To compute the Domain Relevance, we first collected corpora in several domains: tourism announcements and hotel descriptions, economic prose (Wall Street Journal), medical news (Reuters), sport news (Reuters), a balanced corpus (Brown Corpus) and four novels by Wells. Overall, about 3,2 million words were collected.</Paragraph>
    <Paragraph position="6"> In the first experiment, we used the Tourism corpus as a &amp;quot;target&amp;quot; domain for term extraction.</Paragraph>
    <Paragraph position="7"> The Tourism corpus was manually built using the WWW and currently has only about 200,000 words, but it is rapidly growing.</Paragraph>
    <Paragraph position="8"> Table 1 is a summary of the experiment. It is seen that only 2% terms are extracted from the initial list of candidates. This extremely high filtering rate is due to the small corpus: many candidates are found just one time in the corpus. However, candidates are extracted with high precision (over 85%).</Paragraph>
    <Paragraph position="9"> N. of candidate multiword terms (after parsing)  N. of extracted terms (with a=0.35 and  extraction task in the Tourism domain Table 2 shows the 15 most highly rated multiword terms, ordered by Consensus (Relevance is 1 for all the terms in the list). Table 3 illustrates the effectiveness of Domain Consensus at pruning irrelevant terms: all the  and low Domain Consensus In the second experiment, we used the onemillion-word Wall Street journal (WSJ) and the Washington Post (WP) reference terminology.</Paragraph>
    <Paragraph position="10"> The WP includes 1270 terms, but only 214 occur at least once in the WSJ. We used these 214 as the &amp;quot;golden standard&amp;quot; (Test1), but we performed different experiments eliminating terms with a frequency lower than 2 (Test2), 5 (Test5) and 10 (Test10). This latter set includes only 73 terms.</Paragraph>
    <Paragraph position="11"> During syntactic processing, 41,609 chunk prototypes have been extracted as eligible terminology.</Paragraph>
    <Paragraph position="12"> The Tables 4 and 5 compare our method with t with Mutual Information, Dice factor, and pure frequency. Clearly, these measures are applied on the same set of eligible candidates extracted by the CHAOS chunker. The results reported in each line are those obtained using the best threshold for each adopted measure6. For our method (DR+DC), the threshold is given by the values a and b. As remarked in the introduction, a comparison against a golden standard may be unfair, since, on one side, many terms may be present in the observed documents, and not present in the terminology. On the other side, low frequency terms in the reference terminology are difficult to capture using statistical filters. Due to these problems, the F-measure is in general quite low, though our method outperforms Mutual Information and Dice factor. As remarked by Daille (1994), the frequency emerges as a reasonable indicator, especially as for the Recall value, which is a rather obvious result.</Paragraph>
    <Paragraph position="13"> However pure frequency implies the problems outlined in the previous section. Upon manual inspection, we found that, as obvious, undesired terms increase rapidly in the frequency ranked term list, as the frequency decreases. Manually inspecting the first 100 highly ranked terms produced a score of 87,5 precision for our method, and 77,5 for the frequency measure. For the subsequent 100 terms, the discrepancy gets much higher (18%).</Paragraph>
    <Paragraph position="14"> Note that the precision score is in line with that obtained for the Tourism corpus. Notice also 6 as a matter of fact, for our method we are not quite using the best value for b, as remarked later. that the values of a and b are the same in the two experiments. In practice, we found that the threshold a=0,35 for the Domain Relevance is a generally &amp;quot;good&amp;quot; value, while a little tuning may be necessary for the Domain Consensus. In the Tourism domain, where statistical evidence is lower, a lower value for</Paragraph>
  </Section>
class="xml-element"></Paper>