<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0904">
  <Title>Building a hyponymy lexicon with hierarchical structure</Title>
  <Section position="8" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
7 Results &amp; Evaluation
</SectionTitle>
    <Paragraph position="0"> The number of hh-constructions extracted from the original corpus is 14,828. Statistics describing how the data change throughout the implementation are presented in Table 3. The table shows the change in the number of top nodes as well as in the number of d-pairs. A d-pair is defined as an ordered pair of terms ⟨t1, t2⟩ in the hierarchy where t1 dominates t2 and where t1 ≠ t2. For example, from the hierarchy in Figure 1 we get eight d-pairs.</Paragraph>
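    <Paragraph> The d-pair definition above can be illustrated with a minimal sketch. The function names, the dictionary representation of a hierarchy and the toy hierarchy are assumptions made for illustration only; the toy data is not the hierarchy of Figure 1, and this is not the implementation used in the study.
<![CDATA[
```python
from typing import Dict, List, Set, Tuple


def d_pairs(children: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Return every ordered pair (t1, t2) where t1 dominates t2 and t1 != t2."""
    pairs: Set[Tuple[str, str]] = set()
    for ancestor in children:
        stack = list(children[ancestor])
        seen = {ancestor}  # guards against t1 == t2 and against revisiting nodes
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            pairs.add((ancestor, node))
            stack.extend(children.get(node, []))
    return pairs


def d_pairs_per_top_node(children: Dict[str, List[str]]) -> float:
    """Number of d-pairs divided by the number of nodes that have no parent."""
    all_children = {c for kids in children.values() for c in kids}
    tops = [t for t in children if t not in all_children]
    return len(d_pairs(children)) / len(tops)


# Hypothetical toy hierarchy: animal > dog > poodle, animal > bird.
toy = {"animal": ["dog", "bird"], "dog": ["poodle"], "bird": []}
pairs = d_pairs(toy)
print(sorted(pairs))
print(d_pairs_per_top_node(toy))
```
]]>
    In this toy hierarchy 'animal' dominates three terms and 'dog' dominates one, which gives four d-pairs for a single top node.</Paragraph>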
    <Paragraph position="1"> The last column in Table 3 gives the number of d-pairs per top node in the data. This is suggested as a measure of how complex the partial hierarchies are on average.</Paragraph>
    <Paragraph position="2"> The three values in Table 3 (number of top nodes, number of d-pairs and d-pairs per top node) are given for the original data, the original data together with the extended data, the classified data, and the data after hierarchical structure is introduced. Values are also given for two possible extracted lexicons (see below). As can be seen in Table 3, the number of d-pairs increased through the introduction of hierarchies by 2,071 d-pairs (from 22,832 to 24,903 pairs). The relatively low number of new hyponymy relations (that is, d-pairs) is disappointing, but with the improvements discussed later the number could hopefully be increased. Evaluation of semantic hierarchies or lexicons always presents a challenge. Usually, human judges are used to evaluate the result, or the result is compared against a gold-standard resource. Lacking a suitable Swedish gold standard, we evaluate our method with human judges.</Paragraph>
    <Paragraph position="3"> In building a usable lexicon from the data, we try to exclude hierarchies with few terms in them. Several options were tested and two of them are presented in Table 3: one lexicon where all top nodes had at least seven descendants (lexicon-7) and one where all top nodes had at least ten descendants (lexicon-10).</Paragraph>
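    <Paragraph> The descendant threshold behind lexicon-7 and lexicon-10 can be sketched as follows. The helper names, the dictionary representation and the toy data are illustrative assumptions, not the code used to build the lexicons.
<![CDATA[
```python
from typing import Dict, List, Set


def descendants(children: Dict[str, List[str]], root: str) -> Set[str]:
    """Collect every term dominated by root, excluding root itself."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child != root and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def top_nodes(children: Dict[str, List[str]]) -> List[str]:
    """Terms that have children but never occur as a child themselves."""
    all_children = {c for kids in children.values() for c in kids}
    return [t for t in children if t not in all_children]


def filter_lexicon(children: Dict[str, List[str]], min_descendants: int) -> List[str]:
    """Keep only top nodes with at least min_descendants descendants."""
    return [
        t for t in top_nodes(children)
        if len(descendants(children, t)) >= min_descendants
    ]


# Hypothetical toy data: 'animal' has three descendants, 'tool' has one.
toy = {"animal": ["dog", "bird"], "dog": ["poodle"], "tool": ["hammer"]}
print(filter_lexicon(toy, 2))
```
]]>
    Under this sketch, lexicon-7 and lexicon-10 would correspond to calling filter_lexicon with thresholds of 7 and 10.</Paragraph>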
    <Paragraph position="4"> [Table 3. Columns: Data; No. of top nodes; No. of d-pairs; d-pairs per top node. The table body is not available here. Surviving caption text: "... pairs through the data."] * No. of top nodes is equal to the number of hh-constructions.</Paragraph>
    <Paragraph position="5"> [Table 4. Surviving caption text: "... hierarchies judged as correct by each judge. Total no. of judged d-pairs is 1,000."]</Paragraph>
    <Paragraph position="6"> The latter, lexicon-10, was used in evaluation.</Paragraph>
    <Paragraph position="7"> That is, 1,000 of the d-pairs from lexicon-10 were randomly picked in order to evaluate the partial hierarchies and the new hyponymy relations. Four human judges were asked to decide, for each pair, whether they thought it was a correct pair or not. The result, presented in Table 4, is in the range of 52.2% to 76.6% correct. Table 5 presents five ways to look at the result. The first gives the average result over the four judges. The second, at-least-one, gives the percentage of d-pairs where at least one of the judges deemed the pair correct. The majority is the percentage of d-pairs where at least two judges deemed the pair correct, and the consensus option refers to the percentage of d-pairs where all judges agreed. The at-least-one option, the least strict of the measures, gives us 82.2% correct, while the strictest (the consensus) gives us 41.6% correct.</Paragraph>
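    <Paragraph> The summary measures described above can be sketched as follows. The function name, the boolean input layout and the toy votes are illustrative assumptions and not the evaluation script used in this study. The sketch follows the text in counting a majority as at least two judges, and it reads consensus as all judges deeming the pair correct, which is also an assumption.
<![CDATA[
```python
from typing import Dict, Sequence


def summarize(judgments: Sequence[Sequence[bool]]) -> Dict[str, float]:
    """Percentages over d-pairs; each row holds one vote per judge (True = correct)."""
    n_pairs = len(judgments)
    n_judges = len(judgments[0])
    votes = [sum(row) for row in judgments]  # number of 'correct' votes per d-pair

    def pct(count: int) -> float:
        return 100.0 * count / n_pairs

    return {
        # mean share of 'correct' votes over all judges and pairs
        "average": 100.0 * sum(votes) / (n_pairs * n_judges),
        "at_least_one": pct(sum(v >= 1 for v in votes)),
        # 'at least two' follows the wording of the text above
        "majority": pct(sum(v >= 2 for v in votes)),
        # assumption: consensus means every judge marked the pair as correct
        "consensus": pct(sum(v == n_judges for v in votes)),
    }


# Hypothetical votes from four judges on four d-pairs.
toy_votes = [
    (True, True, True, True),
    (True, True, False, False),
    (True, False, False, False),
    (False, False, False, False),
]
result = summarize(toy_votes)
print(result)
```
]]>
    By construction the at-least-one figure can never be lower than the majority figure, and the majority figure can never be lower than the consensus figure, which matches the ordering of the reported values.</Paragraph>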
    <Paragraph position="8"> The kappa value (Carletta, 1996) was used to evaluate the agreement among the judges and to estimate how difficult the evaluation task was. Not surprisingly, since semantic information is in general hard to evaluate on purely objective grounds, the kappa value is rather low: for four annotators on the 1,000 d-pairs, K=0.51. The low kappa value for the evaluation task reflects the many problems that arise when semantic resources are evaluated by humans. Some of these problems are discussed below. [Table caption fragment: "... 1,000 d-pairs, by four judges."]</Paragraph>
    <Paragraph position="9"> While lemmatization or stemming is necessary for performing this kind of task, it may also cause problems in cases where morphology is important for correct classification. For example, while the plural form of the word 'boy' (i.e. 'boys') is a valid hyponym of the hypernym 'group', the singular form would not be.</Paragraph>
    <Paragraph position="10"> As was also reported by Caraballo (1999), the judges sometimes found proper nouns (as hyponyms) hard to evaluate. For example, it might be hard to tell whether 'Simon Le Bon' is a valid hyponym of the hypernym 'rock star' if his identity is unknown to the judge. One way to overcome this problem might be to give judges information about a sequence of higher ancestors, in order to make the judgement easier.</Paragraph>
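    <Paragraph> For readers who want to see how an agreement value such as the K reported above can be computed for several judges, the following is a minimal sketch. It assumes a Fleiss-style multi-rater kappa; the text only cites Carletta (1996) and does not spell out the exact variant, so the formula choice, the function name and the toy counts are all assumptions.
<![CDATA[
```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of judges who put item i in category j.

    Every item must be rated by the same number of judges.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Overall proportion of all ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: share of judge pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts for four judges and two categories (correct, incorrect).
toy_counts = [[4, 0], [0, 4], [2, 2], [4, 0]]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 3))
```
]]>
    A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so a figure around 0.5 indicates only moderate agreement.</Paragraph>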
    <Paragraph position="11"> It is difficult to compare these results with results from other studies such as that of Caraballo (1999), as the data used is not the same. However, it seems that our figures are in the same range as those reported in previous studies.</Paragraph>
    <Paragraph position="12"> Charniak &amp; Roark (1998), evaluating their semantic lexicon against gold-standard resources (the MUC-4 and the WSJ corpus), report that the ratio of valid to total entries for their system lies between 20% and 40%.</Paragraph>
    <Paragraph position="13"> Caraballo (1999) let three judges evaluate ten internal nodes in the hyponymy hierarchy that had at least twenty descendants. Cases where judges had problems with proper nouns as hyponyms, corresponding to those mentioned above, were corrected. When the best hypernym was evaluated, the result reported for a majority of the judges was 33%.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
8 Discussion and future work
</SectionTitle>
    <Paragraph position="0"> In this paper, we have mainly concentrated on developing the algorithm for building the partial hierarchies and on evaluating the quality of the hyponymy relations in the hierarchies. In future work we will continue our efforts to include more of the extracted data in the hierarchies.</Paragraph>
    <Paragraph position="1"> In the classification of hh-construction data (section 6.1), for example, there are a great many classes that are never collapsed where there should have been a collapse. That is, correct sense distinctions are captured (through correct collapses), but incorrect sense distinctions are also introduced due to a lack of overlap in hyponyms. For example, if two classes with the hypernym 'animal' are found whose hyponym sets do not overlap, 'animal' will incorrectly be treated as having two senses. This is a side effect of the method we use in order to get disambiguated data from which to build hierarchies.</Paragraph>
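    <Paragraph> The side effect described above can be illustrated with a minimal sketch that collapses classes sharing a hypernym whenever their hyponym sets overlap. This is only an assumed reading of the overlap criterion mentioned in the text, not the algorithm of section 6.1; the function name and the toy classes are hypothetical.
<![CDATA[
```python
from typing import List, Set, Tuple

Class = Tuple[str, Set[str]]  # (hypernym, hyponyms)


def collapse_classes(classes: List[Class]) -> List[Class]:
    """Merge classes with the same hypernym whose hyponym sets intersect."""
    merged: List[Class] = []
    for hypernym, hyponyms in classes:
        current = set(hyponyms)
        remaining: List[Class] = []
        for other_hypernym, other_hyponyms in merged:
            if other_hypernym == hypernym and current & other_hyponyms:
                current |= other_hyponyms  # overlap: treat as the same sense
            else:
                remaining.append((other_hypernym, other_hyponyms))
        remaining.append((hypernym, current))
        merged = remaining
    return merged


# Hypothetical classes: the third 'animal' class shares no hyponym with the others.
toy_classes = [
    ("animal", {"dog", "cat"}),
    ("animal", {"cat", "horse"}),
    ("animal", {"eagle", "owl"}),
    ("artist", {"Simon Le Bon"}),
]
collapsed = collapse_classes(toy_classes)
animal_senses = [hypos for hyper, hypos in collapsed if hyper == "animal"]
print(len(animal_senses))
```
]]>
    The first two 'animal' classes are collapsed because they share 'cat', while the third stays separate, so 'animal' ends up with two senses even though only one is intended.</Paragraph>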
    <Paragraph position="2"> In most cases, the introduction of an incorrect sense distinction is due to one of two situations. First, when the hypernym only has proper noun hyponyms (e.g. 'person' or 'artist'), the overlap in hyponyms tends to be small. Secondly, when the hypernym is a very general concept, for example 'part', 'question' or 'alternative', the hyponyms will rarely overlap. No assessment of the scope of these problems has been performed in this study. A more thorough investigation ought to be performed in order to know how to overcome the problem of incorrect sense distinctions.</Paragraph>
    <Paragraph position="3"> Also, general, underspecified hypernyms such as 'question', mentioned above, are rarely meaningful as concepts on their own. As discussed by Hearst (1992), more information is needed to resolve the underspecification, and the missing information is probably found in previous sentences. An improved algorithm has to deal with this problem either by excluding this type of hypernym or by improving the concepts through information that resolves the underspecification.</Paragraph>
    <Paragraph position="4"> The algorithm for imposing hierarchical structure should be modified in the future, so that more compositions are performed for each class (as discussed in section 6.2). This, together with a more elaborate extension algorithm (section 6.3), should give us further hierarchical links in the lexicon. Compound analysis and improvements in term extraction for Swedish will also be helpful in future work. Such improvements would possibly lead to more collapses by the algorithm presented in section 6.1, which in turn would reduce the number of incorrect sense distinctions.</Paragraph>
    <Paragraph position="5"> The resulting hierarchies are not fully strict; for example, descendants of the same lemma type can occasionally be found in different branches of the same tree.</Paragraph>
    <Paragraph position="6"> This has to be dealt with in future implementations as well.</Paragraph>
  </Section>
</Paper>