<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0507"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics Towards Large-scale Non-taxonomic Relation Extraction: Estimating the Precision of Rote Extractors[?]</Title> <Section position="6" start_page="51" end_page="54" type="evalu"> <SectionTitle> 4 Experiment and results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="51" end_page="53" type="sub_section"> <SectionTitle> 4.1 Rote extractor settings </SectionTitle> <Paragraph position="0"> The initial steps of the rote extractor follows the general approach: downloading a training corpus using the seed list and extracting patterns.</Paragraph> <Paragraph position="1"> The training corpora are processed with a part-of-speech tagger and a module for Named Entity Recognition and Classification (NERC) that annotates people, organisations, locations, dates, relative temporal expressions and numbers (Alfonseca et al., 2006b), so this information can be included in the patterns. Furthermore, for each of the terms in a pair in the training corpora, the system also In the case that a pair from the seed list is found in a sentence, a context around the two words in the pair is extracted, including (a) at most five words to the left of the first word; (b) all the words in between the pair words; (c) at most five words to the right of the second word. The context never jumps over sentence boundaries, which are marked with the symbols BOS (Beginning of sentence)andEOS(Endof sentence). Thetworelated concepts are marked as <hook> and <target>.</Paragraph> <Paragraph position="2"> Figure 1 shows several example contexts extracted for the relations birth year, birth place, writer-book and country-capital city.</Paragraph> <Paragraph position="3"> The approach followed for the generalisation is the one described by (Alfonseca et al., 2006a; Ruiz-Casado et al., in press), which has a few modifications with respect to Ravichandran and Hovy (2002)'s, such as the use of the wildcard * to represent any sequence of words, and the addition of part-of-speech and Named Entity labels to the patterns.</Paragraph> <Paragraph position="4"> The input table has been built with the following nineteen relations: birth year, death year, birth place, death place, author-book, actorfilm, director-film, painter-painting, Employeeorganisation, chief of state, soccer player-team, and number of unique patterns in each step.</Paragraph> <Paragraph position="5"> soccer team-city, soccer team-manager, country or region-capital city, country or region-area, country or region-population, country-bordering country, country-name of inhabitant (e.g. Spain-Spaniard), and country-continent. The time required to build the table and the seed lists was less than one person-day, as some of the seed lists were directly collected from web pages.</Paragraph> <Paragraph position="6"> For each step, the following settings have been set: * The size of the training corpus has been set to 50 documents for each pair in the original seed lists. Given that the typical sizes of the lists collected are between 50 and 300 pairs, this means that several thousand documents are downloaded for each relation.</Paragraph> <Paragraph position="7"> * Before the generalisation step, the rote extractor discards those patterns in which the hook and the target are too far away to each other, because they are usually difficult to generalise. 
For each step, the following settings have been set:

* The size of the training corpus has been set to 50 documents for each pair in the original seed lists. Given that the typical sizes of the lists collected are between 50 and 300 pairs, this means that several thousand documents are downloaded for each relation.

* Before the generalisation step, the rote extractor discards those patterns in which the hook and the target are too far away from each other, because they are usually difficult to generalise. The maximum allowed distance between them has been set to 8 words.

* At each step, the two most similar patterns are generalised, and their generalisation is added to the set of patterns (see the sketch after this list). No pattern is discarded at this step. This process stops when all the patterns resulting from the generalisation of existing ones contain wildcards adjacent to either the hook or the target.

* For the precision estimation, for each pair in the seed lists, 50 documents are collected for the hook and another 50 for the target. Because of time constraints, and given that the total size of the hook and the target corpora exceeds 100,000 documents, for each pattern a sample of 250 documents is randomly chosen and the patterns are applied to it. This sample is built randomly but with the following constraints: there should be an equal number of documents selected from the corpora of each relationship, and there should be an equal number of documents from hook corpora and from target corpora.
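The generalisation loop in the third setting above can be sketched as follows, under some simplifying assumptions: patterns are plain token tuples, similarity is difflib's alignment ratio rather than the edit-distance measure of Ruiz-Casado et al., and only the wildcard * is produced (the authors' operator also builds disjunctions). All function names are ours.

```python
import difflib
from itertools import combinations

def similarity(p, q):
    """Similarity of two token-level patterns, in [0, 1]."""
    return difflib.SequenceMatcher(a=p, b=q, autojunk=False).ratio()

def generalise(p, q):
    """Keep the token runs the two patterns share and replace every
    differing stretch with the wildcard '*'."""
    merged, matcher = [], difflib.SequenceMatcher(a=p, b=q, autojunk=False)
    for op, i1, i2, _, _ in matcher.get_opcodes():
        if op == "equal":
            merged.extend(p[i1:i2])
        elif not merged or merged[-1] != "*":  # collapse adjacent wildcards
            merged.append("*")
    return tuple(merged)

def too_general(pattern):
    """The paper's stopping condition: a wildcard right next to the hook
    or the target makes a pattern match almost anything."""
    for slot in ("<hook>", "<target>"):
        if slot not in pattern:
            return True                        # slot swallowed by a wildcard
        i = pattern.index(slot)
        if "*" in pattern[max(i - 1, 0):i + 2]:
            return True
    return False

def grow_pattern_set(initial_patterns):
    """Repeatedly generalise the two most similar not-yet-merged patterns,
    adding (never discarding) each useful result."""
    patterns = {tuple(p) for p in initial_patterns}
    tried = set()
    while True:
        candidates = [(p, q) for p, q in combinations(sorted(patterns), 2)
                      if (p, q) not in tried]
        if not candidates:
            return patterns
        p, q = max(candidates, key=lambda pair: similarity(*pair))
        tried.add((p, q))
        new = generalise(p, q)
        if new not in patterns and not too_general(new):
            patterns.add(new)

p1 = "BOS <hook> was born in <target> EOS".split()
p2 = "BOS <hook> , who was born in <target> EOS".split()
print(generalise(p1, p2))
# ('BOS', '<hook>', '*', 'was', 'born', 'in', '<target>', 'EOS')
```

Note that the merged pattern in the example ends up with * adjacent to <hook>, so too_general() rejects it: exactly the situation the stopping criterion is designed to catch.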
4.2 Output obtained

Table 2 shows the number of patterns obtained for each relation. Note that the generalisation procedure adds new (generalised) patterns to the set of original patterns, but no original pattern is removed, so all of them are evaluated; this is why the set of patterns increases after the generalisation. The filtering criterion was to keep the patterns that applied at least twice on the test corpus.

It is interesting to see that for most relations the pruning is very drastic. This is because of two reasons. Firstly, most patterns are far too specific, as they include up to 5 words at each side of the hook and the target, and all the words in between. Only those patterns that have been generalised very much, substituting large portions with wildcards or disjunctions, are likely to apply to the sentences in the hook and target corpora. Secondly, the samples of the hook and target corpora used are too small for some of the relations, so few patterns apply more than twice.

Note that, for some relations, the output of the generalisation step contains fewer patterns than the output of the initial extraction step: that is because the patterns in which the hook and the target are not nearby were removed in between these two steps.

Concerning the precision estimates, a full evaluation is provided for the birth-year relation. Table 3 shows in detail the thirty patterns obtained. It can also be seen that some of the patterns with good precision contain the wildcard *. For instance, the first pattern indicates that the presence of any of the words biography, poetry, etc. anywhere in a sentence before a person name and a date or number between parentheses is a strong indication that the target is a birth year.

The last columns in the table indicate the number of times that each rule applied in the hook and target corpora, and the precision of the rule in each of the following cases:

* As estimated by the complete program (Prec1).

* As estimated by the traditional hook corpus approach (Prec2). Here, cardinality is not taken into account, patterns are evaluated only on the hook corpora from the same relation, and those pairs whose hook is not in the seed list are ignored.

* The real precision of the rule (real). In order to obtain this metric, two different annotators evaluated the extracted pairs independently, and the precision was estimated from the pairs on which they agreed (there was 96.29% agreement, Kappa=0.926).

As can be seen, in most of the cases our procedure produces lower precision estimates.

If we calculate the total precision of all the rules together, shown in the last row of the table, we can see that, without the modifications, the whole set of rules would be considered to have a total precision of 0.84, while that estimate decreases sharply to 0.46 when the modifications are used. This value is nearer to the precision of 0.54 evaluated by hand. Although it may seem surprising that the precision estimated by the new procedure is even lower than the real precision of the patterns, as measured by hand, that is because the web queries consider unknown pairs as incorrect unless they appear on the web exactly in the format of the query in the input table. Especially for less well-known people, we cannot expect that all of them will appear on the web following the pattern "X was born in date", so the web estimates tend to be over-conservative.

Table 4 shows the precision estimates for every pair extracted with all the rules using both procedures, with 0.95 confidence intervals. The real precision has been estimated by randomly sampling 200 pairs and evaluating them by hand, as explained above for the birth-year relation. As can be observed, out of the 19 relations, the precision estimate of the whole set of rules for 11 of them is not statistically distinguishable from the real precision, while that only holds for two relations using the previous approach.

Please note as well that the precisions indicated in the table refer to all the pairs extracted by all the rules, some of which are very precise but some of which are very imprecise. If the rules are to be applied in an annotation system, only those with a high precision estimate would be used, and much better overall results would be expected.
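As a reference for reading Table 4, the 0.95 confidence interval of a precision estimated from a sample of evaluated pairs can be computed as below. The paper does not say which interval it used, so the normal (Wald) approximation here is an assumption.

```python
import math

def precision_interval(correct, total, z=1.96):
    """Normal-approximation 0.95 confidence interval for a precision
    estimated from `total` hand-evaluated pairs."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return max(p - half, 0.0), min(p + half, 1.0)

# 200 sampled pairs at the paper's hand-evaluated precision of 0.54:
low, high = precision_interval(108, 200)   # 108/200 = 0.54
print(f"0.540 [{low:.3f}, {high:.3f}]")    # 0.540 [0.471, 0.609]
```

With 200-pair samples the interval half-width is around 0.07 at these precision levels, which helps explain why many of the per-relation estimates in Table 4 cannot be statistically distinguished from the real precision.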