3 Experiments

3.1 Setup

We ran LEILA on different corpora with increasing heterogeneity:

* Wikicomposers: The set of all Wikipedia articles about composers (872 HTML documents). We use it to see how LEILA performs on a document collection with strong structural and thematic homogeneity.

* Wikigeography: The set of all Wikipedia pages about the geography of countries (313 HTML documents).

* Wikigeneral: A set of random Wikipedia articles (78141 HTML documents). We chose it to assess LEILA's performance on structurally homogeneous, but thematically random documents.

* Googlecomposers: This set contains one document for each baroque, classical, and romantic composer in Wikipedia's list of composers, as delivered by a Google "I'm feeling lucky" search for the composer's name (492 HTML documents). We use it to see how LEILA performs on a corpus with high structural heterogeneity. Since the querying was done automatically, the downloaded pages include spurious advertisements as well as pages with no proper sentences at all.

We tested LEILA on different target relations with increasing complexity:

* birthdate: This relation holds between a person and his birth date (e.g. "Chopin" / "1810"). It is easy to learn, because it is bound to strong surface clues (the first element is always a name, the second is always a date).

* synonymy: This relation holds between two names that refer to the same entity (e.g. "UN" / "United Nations"). The relation is more sophisticated, since there are no surface clues.

* instanceOf: This relation is even more sophisticated, because the sentences often express it only implicitly.

We compared LEILA to different competitors. We only considered competitors that, like LEILA, extract the information from a corpus without using other Internet sources. We wanted to avoid running the competitors on our own corpora or on our own target relations, because we could not be sure to achieve a fair tuning of the competitors. Hence we ran LEILA on the corpora and target relations on which our competitors had been tested by their authors, and we compare the results of LEILA with the results reported by the authors. Our competitors, together with their respective corpora and relations, are:

* TextToOnto: A state-of-the-art representative of non-deep pattern matching. The system provides a component for the instanceOf relation and takes arbitrary HTML documents as input. For completeness, we also consider its successor Text2Onto (Cimiano and Völker, 2005a), although it contains only default methods in its current state of development.
* Snowball (Agichtein and Gravano, 2000): A recent representative of the slot-extraction paradigm. In the original paper, Snowball was tested on the headquarters relation, which holds between a company and the city of its headquarters. Snowball was trained on a collection of some thousand documents and then applied to a test collection. For copyright reasons, we only had access to the test collection (150 text documents).

* (Cimiano and Völker, 2005b) present a system that uses context to assign a concept to an entity; we will refer to it as the CV-system. The approach is restricted to the instanceOf relation, but it can classify instances even if the corpus does not contain explicit definitions. In the original paper, the system was tested on a collection of 1880 files from the Lonely Planet Internet site.

For the evaluation, the output pairs of the system have to be compared to a table of ideal pairs. One option would be to take the ideal pairs from a pre-compiled database. The problem is that such ideal pairs may differ from the facts expressed in the documents. Furthermore, they do not allow us to measure how much of the document content the system actually extracted. This is why we chose to extract the ideal pairs manually from the documents. In our methodology, the ideal pairs comprise all pairs that a human would understand to be elements of the target relation. This involves full anaphora resolution, the resolution of reference ambiguities, and the choice of truly defining concepts. For example, we accept Chopin as an instance of composer, but not as an instance of member, even if the text says that he was a member of some club. Of course, we expect neither the competitors nor LEILA to achieve the results in the ideal table. However, this methodology is the only fair way of manual extraction, as it is guaranteed to be system-independent. If O denotes the multi-set of output pairs and I denotes the multi-set of ideal pairs, then precision, recall, and their harmonic mean F1 are defined as

  precision = |O ∩ I| / |O|,   recall = |O ∩ I| / |I|,   F1 = 2 · precision · recall / (precision + recall).

To ensure a fair comparison of LEILA to Snowball, we use the same evaluation as employed in the original Snowball paper (Agichtein and Gravano, 2000), the Ideal Metric. The Ideal Metric assumes the target relation to be right-unique (i.e. a many-to-one relation). Hence the set of ideal pairs is right-unique. The set of output pairs can be made right-unique by selecting, for each first component, the pair with the highest confidence. Duplicates are removed from the ideal pairs and also from the output pairs. All output pairs whose first component does not appear in the ideal set are removed.

There is one special case for the CV-system, which uses the Ideal Metric for the non-right-unique instanceOf relation. To allow for a fair comparison, we used the Relaxed Ideal Metric, which does not make the ideal pairs right-unique. The calculation of recall is relaxed as follows: a first component counts as recalled if the output contains a correct pair for it, i.e.

  recall = |{x | ∃y: (x, y) ∈ O ∩ I}| / |{x | ∃y: (x, y) ∈ I}|.

Due to the effort, we could extract the ideal pairs only for a sub-corpus. To ensure significance in spite of this, we compute confidence intervals for our estimates: we interpret the sequence of output pairs as a repetition of a Bernoulli experiment, in which an output pair can be either correct (i.e. contained in the ideal pairs) or not. The parameter of this Bernoulli distribution is the precision. We estimate the precision by drawing a sample (i.e. by extracting all ideal pairs in the sub-corpus). By assuming that the output pairs are independently and identically distributed, we can calculate a confidence interval for our estimate. We report confidence intervals for precision and recall at a confidence level of 95%. We measure precision at different levels of recall and report the values at the best F1 value.
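As an illustration of the evaluation just described, the following sketch computes the Ideal Metric, the Relaxed Ideal Metric recall, and a confidence interval for the precision. It is a minimal sketch rather than the implementation used in our experiments; the pair representation, the function names, and the use of a normal-approximation (Wald) interval are assumptions made for the example.

    # Illustrative sketch of the evaluation metrics described above (assumed
    # pair representation and function names; not the original implementation).
    from math import sqrt

    def ideal_metric(output, ideal):
        # output: list of (x, y, confidence) triples produced by the system
        # ideal:  set of (x, y) pairs extracted manually from the sub-corpus
        # Keep only the highest-confidence pair for each first component.
        best = {}
        for x, y, conf in output:
            if x not in best or conf > best[x][1]:
                best[x] = (y, conf)
        # Discard output pairs whose first component is not in the ideal set.
        ideal_firsts = {x for x, _ in ideal}
        pred = {(x, y) for x, (y, _) in best.items() if x in ideal_firsts}
        correct = pred & ideal
        precision = len(correct) / len(pred) if pred else 0.0
        recall = len(correct) / len(ideal) if ideal else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    def relaxed_recall(pred, ideal):
        # Relaxed Ideal Metric: the ideal pairs stay non-right-unique; a first
        # component counts as recalled if some correct pair is output for it.
        covered = {x for (x, y) in pred if (x, y) in ideal}
        firsts = {x for x, _ in ideal}
        return len(covered) / len(firsts) if firsts else 0.0

    def confidence_interval(p_hat, n, z=1.96):
        # Normal-approximation (Wald) 95% interval for a Bernoulli parameter,
        # treating each output pair as an independent correct/incorrect trial.
        half = z * sqrt(p_hat * (1.0 - p_hat) / n)
        return max(0.0, p_hat - half), min(1.0, p_hat + half)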
We used approximate string matching techniques to account for different spellings of the same entity. For example, we count the output pair "Chopin" / "composer" as correct even if the ideal pairs contain "Frederic Chopin" / "composer". To ensure that LEILA does not just reproduce the example pairs, we list the percentage of examples among the output pairs. During our evaluation, we found that the Link Grammar parser does not finish parsing on roughly 1% of the files, for unknown reasons.

3.2 Results

Table 1 summarizes our experimental results with LEILA on the different relations.

For the birthdate relation, we used Edward Morykwas' list of famous birthdays as examples. As counterexamples, we chose all pairs of a person from the examples and an incorrect birthdate. All pairs of a proper name and a date are candidates. We ran LEILA on the Wikicomposers corpus. LEILA performed quite well on this task. The patterns found were of the form "X was born in Y" and "X (Y)".

For the synonymy relation, we used all pairs of proper names that share the same synset in WordNet as examples (e.g. "UN" / "United Nations"). As counterexamples, we chose all pairs of nouns that are not synonymous in WordNet (e.g. "rabbit" / "composer"). All pairs of proper names are candidates. We ran LEILA on the Wikigeography corpus, because this set is particularly rich in synonyms. LEILA performed reasonably well. The patterns found include "X was known as Y" as well as several non-grammatical constructions such as "X (formerly Y)".
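For illustration, example and counterexample pairs of this kind can be drawn from WordNet, for instance with NLTK. The sketch below is hypothetical; it omits the restriction of examples to proper names and any further filtering.

    # Hypothetical sketch: draw synonymy examples (lemmas sharing a noun synset)
    # and test candidate counterexamples against WordNet with NLTK.
    from itertools import combinations
    from nltk.corpus import wordnet as wn

    def synonymy_examples():
        # Yield pairs of distinct lemmas that share a noun synset,
        # e.g. ('UN', 'United_Nations').
        for synset in wn.all_synsets(pos=wn.NOUN):
            names = sorted({lemma.name() for lemma in synset.lemmas()})
            for a, b in combinations(names, 2):
                yield a, b

    def are_synonyms(a, b):
        # True if a and b share at least one noun synset; pairs for which this
        # is False can serve as counterexamples (e.g. 'rabbit' / 'composer').
        return bool(set(wn.synsets(a, pos=wn.NOUN)) & set(wn.synsets(b, pos=wn.NOUN)))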
For the instanceOf relation, it is difficult to select example pairs, because if an entity belongs to a concept, it also belongs to all of its super-concepts. However, admitting each pair of an entity and one of its super-concepts as an example would result in far too many false positives. The problem is to determine, for each entity, the (super-)concept that is most likely to be used in a natural-language definition of that entity. Psychological evidence (Rosch et al., 1976) suggests that humans prefer a certain layer of concepts in the taxonomy to classify entities; the set of these concepts is called the Basic Level. Heuristically, we found that the lowest super-concept in WordNet that is not a compound word is a good approximation of the basic-level concept for a given entity. We used all pairs of a proper name and the corresponding basic-level concept of WordNet as examples. We could not use pairs of proper names and incorrect super-concepts as counterexamples, because our corpus Wikipedia knows more meanings of proper names than WordNet. Therefore, we used all pairs of a common noun and an incorrect super-concept from WordNet as counterexamples. All pairs of a proper name and a WordNet concept are candidates. We ran LEILA on the Wikicomposers corpus.
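A minimal sketch of this heuristic, assuming NLTK's WordNet interface; the choice of the first sense and the handling of instance hypernyms are simplifications made for the example.

    # Hypothetical sketch of the basic-level heuristic: walk upward from the
    # entity's first noun sense and return the lowest super-concept whose name
    # is not a compound word (e.g. 'Chopin' -> 'composer').
    from nltk.corpus import wordnet as wn

    def basic_level_concept(entity):
        synsets = wn.synsets(entity, pos=wn.NOUN)
        if not synsets:
            return None
        first = synsets[0]
        frontier = first.hypernyms() + first.instance_hypernyms()
        while frontier:
            for s in frontier:
                for name in s.lemma_names():
                    if "_" not in name and "-" not in name:   # not a compound word
                        return name
            frontier = [h for s in frontier for h in s.hypernyms()]   # one level up
        return None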
The performance on this task was acceptable, but not impressive. However, the chances of obtaining high recall and high precision were significantly decreased by our tough evaluation policy: the ideal pairs include tuples deduced by resolving syntactic and semantic ambiguities and anaphora. Furthermore, our evaluation policy demands that non-defining concepts like member not be chosen as instance concepts. In fact, a high proportion of the incorrect assignments were friend, member, successor and predecessor, decreasing the precision of LEILA. Thus, compared to the gold standard of humans, the performance of LEILA can be considered reasonably good. The patterns found include the Hearst patterns (Hearst, 1992) "Y such as X", but also more complex patterns like "X was known as a Y", "X [...] as Y", "X [...] can be regarded as Y" and "X is unusual among Y". Some of these patterns could not have been found by primitive regular expression matching.

To test whether thematic heterogeneity influences LEILA, we ran it on the Wikigeneral corpus. Finally, to probe the limits of our system, we ran it on the Googlecomposers corpus. As shown in Table 1, the performance of LEILA dropped on these increasingly challenging tasks, but LEILA could still produce useful results. We can improve the results on the Googlecomposers corpus by adding the Wikicomposers corpus for training.

The different learning methods (kNN and SVM) performed similarly for all relations. Of course, in each case it is possible to achieve higher precision at the price of lower recall. The run-time of the system splits into parsing (≈ 40 s per document, e.g. 3:45 h for Wikigeography) and the core algorithm (2-15 min per corpus, 5 h for the huge Wikigeneral).

Table 2 shows the results of comparing LEILA against various competitors (with LEILA in boldface). We compared LEILA to TextToOnto and Text2Onto for the instanceOf relation on the Wikicomposers corpus. TextToOnto requires an ontology as a source of possible concepts. We gave it the WordNet ontology, so that it had the same preconditions as LEILA. Text2Onto does not require any input. Text2Onto seems to have a precision comparable to ours, although the small number of found pairs does not allow a significant conclusion. Both systems have drastically lower recall than LEILA.

For Snowball, we only had access to the test corpus. Hence we trained LEILA on a small portion (3%) of the test documents and tested on the remaining ones. Since the original 5 seed pairs that Snowball used did not appear in the collection at our disposal, we chose 5 other pairs as examples. We used no counterexamples and hence omitted the Training Phase of our algorithm. LEILA quickly finds the pattern "Y-based X". This led to very high precision and good recall compared to Snowball, even though Snowball was trained on a much larger training collection.

The CV-system differs from LEILA in that its ideal pairs form a table in which each entity is assigned its most likely concept according to a human understanding of the text, independently of whether there are explicit definitions for the entity in the text or not. We conducted two experiments. First, we used the document set from Cimiano and Völker's original paper (Cimiano and Völker, 2005a), the Lonely Planet corpus. To ensure a fair comparison, we trained LEILA separately on the Wikicomposers corpus, so that LEILA cannot have example pairs in its output. For the evaluation, we calculated precision and recall with respect to an ideal table provided by the authors. Since the CV-system uses a different ontology, we allowed a distance of up to 4 edges in the WordNet hierarchy to count as a match (for both systems). Since the explicit definitions that our system relies on were sparse in this corpus, LEILA performed worse than the competitor. In a second experiment, we ran the CV-system on the Wikicomposers corpus. As the CV-system requires a set of target concepts, we gave it the set of all concepts in our ideal pairs. Furthermore, the system requires an ontology on these concepts; we gave it the WordNet ontology, pruned to the target concepts and their super-concepts. We evaluated by the Relaxed Ideal Metric, again allowing a distance of up to 4 edges in the WordNet hierarchy to count as a match (for both systems). This time, our competitor performed worse. This is because our ideal table is constructed from the definitions in the text, which our competitor is not designed to follow. These experiments serve only to show the different philosophies in the definition of the ideal pairs for the CV-system and LEILA: the CV-system does not depend on explicit definitions, but it is restricted to the instanceOf relation.
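For concreteness, the relaxed concept matching used in both CV-system comparisons (two concepts count as equal if they are at most 4 edges apart in the WordNet hierarchy) can be sketched as follows; this is a hypothetical illustration with NLTK, not the evaluation code used above.

    # Hypothetical sketch of the relaxed concept matching: two concepts count
    # as a match if some of their noun senses are at most `max_edges` apart
    # in WordNet's hypernym hierarchy.
    from nltk.corpus import wordnet as wn

    def concepts_match(predicted, ideal, max_edges=4):
        for p in wn.synsets(predicted, pos=wn.NOUN):
            for i in wn.synsets(ideal, pos=wn.NOUN):
                d = p.shortest_path_distance(i)
                if d is not None and d <= max_edges:
                    return True
        return False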