File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-0507_metho.xml
Size: 9,147 bytes
Last Modified: 2025-10-06 14:10:36
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0507"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics Towards Large-scale Non-taxonomic Relation Extraction: Estimating the Precision of Rote Extractors</Title> <Section position="5" start_page="49" end_page="51" type="metho"> <SectionTitle> 3 Our proposal </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 3.1 Motivation </SectionTitle> <Paragraph position="0"> In a rote extractor as described above, we believe that the procedure for calculating the precision of the patterns may be unreliable in some cases. For example, the following patterns are reported by In the particular application in which they are used (relation extraction for Question Answering), they are useful because there is initially a question to be answered that indicates whether we are looking for an invention, a discovery or a location. However, if we want to apply them to unrestricted relation extraction, the problem is that the same pattern, the genitive construction, represents all of these relations, in addition to its most common use, indicating possession.</Paragraph> <Paragraph position="1"> If patterns like these are so ambiguous, then why do they receive such a high precision estimate? One reason is that the patterns are only evaluated for the same hook for which they were extracted. To illustrate this with an example, let us suppose that we obtain a pattern for the relation located-at using the pair (New York, Chrysler Building). The genitive construction can be extracted from the context New York's Chrysler Building. Afterwards, when estimating the precision of this pattern, only sentences containing <target>'s Chrysler Building are taken into account. Because of this, most of the pairs extracted by this pattern are likely to contain the target New York, apart from a few that contain the name of the architect who built it, Van Alen. 
Thus we can expect that the genitive pattern will receive a high precision estimate as a located-at pattern.</Paragraph> <Paragraph position="2"> For our purposes, however, we want to collect patterns for several relations, such as writer-book, painter-picture, director-film and actor-film, and we want to make sure that the obtained patterns are only applicable to the desired relation. Patterns like <target> 's <hook> are very likely to be applicable to all of these relations at the same time, so we would like to be able to discard them automatically by assigning them a low precision.</Paragraph> </Section> <Section position="2" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 3.2 Suggested improvements </SectionTitle> <Paragraph position="0"> Therefore, we propose the following three improvements to this procedure: 1. Collecting not only a hook corpus but also a target corpus should help in calculating the precision. In the example of the Chrysler Building, we have seen that in most cases in which we look for the pattern 's Chrysler Building the preceding words are New York, and so the pattern is considered accurate. However, if we look for the pattern New York's, we shall surely find it followed by many different terms representing different relations, and the precision estimate will decrease.</Paragraph> <Paragraph position="1"> 2. Testing the patterns obtained for one relation using the hook and target corpora collected for other relations. For instance, if the genitive construction has been extracted as a possible pattern for the writer-book relation, and we apply it to a corpus about painters, the rote extractor can detect that it also extracts pairs with painters and paintings, so that particular pattern will not be very precise for that relation. 3. Many of the pairs extracted by the patterns in the hook corpora were not evaluated at all when the hook in the extracted pair was not present in the seed lists. 
To overcome this, we propose to use the web to check whether the extracted pair might be correct, as shown below.</Paragraph> </Section> <Section position="3" start_page="50" end_page="51" type="sub_section"> <SectionTitle> 3.3 Algorithm </SectionTitle> <Paragraph position="0"> In our implementation, the rote extractor starts with a table containing some information about the relations for which we want to learn patterns. This procedure needs a little more information than just the seed list, which is provided as a table in the format displayed in Table 1. The data provided for each relation is the following: (a) the name of the relation, used for naming the output files containing the patterns; (b) the name of the file containing the seed list; (c) the cardinality of the relation. For instance, given that many people can be born in the same year, but for every person there is just one birth year, the cardinality of the relation birth year is n:1; (d) the restrictions on the hook and the target. These can be of the following three categories: unrestricted, if the pattern can extract any sequence of words as hook or target of the relation; Entity, if the pattern can extract as hook or target only things of the same entity type as the words in the seed list (as annotated by the NERC module); or PoS, if the pattern can extract as hook or target any sequence of words whose sequence of PoS labels was seen in the training corpus; and (e) a sequence of queries that could be used to check, using the web, whether an extracted pair is correct or not.</Paragraph> <Paragraph position="1"> We assume that the system has used the seed list to extract and generalise a set of patterns for each of the relations using training corpora (Ravichandran and Hovy, 2002; Alfonseca et al., 2006a). Our procedure for calculating the patterns' precisions is as follows: 1. 
For every relation, (a) For every hook, collect a hook corpus from the web. (b) For every target, collect a target corpus from the web.</Paragraph> <Paragraph position="2"> Table 1:
Relation name | Seed-list | Cardinality | Hook-type | Target-type | Web queries
birth year | birth-date.txt | n:1 | entity | entity | $1 was born in $2
death year | death-date.txt | n:1 | entity | entity | $1 died in $2
birth place | birth-place.txt | n:1 | entity | entity | $1 was born in $2
country-capital | country-capital.txt | 1:1 | entity | entity | $2 is the capital of $1
author-book | author-book.txt | n:n | entity | unrestricted | $1 is the author of $2
director-film | director-film.txt | 1:n | entity | unrestricted | $1 directed $2, $2 directed by $1
</Paragraph> <Paragraph position="3"> 2. For every relation r, (a) For every pattern P collected during training, apply it to every hook and target corpus to extract a set of pairs.</Paragraph> <Paragraph position="4"> For every pair p = (p_h, p_t), * If it appears in the seed list of r, consider it correct.</Paragraph> <Paragraph position="5"> * If it appears in the seed list of another relation, consider it incorrect.</Paragraph> <Paragraph position="6"> * If the hook p_h appears in the seed list of r with a different target, and the cardinality is 1:1 or n:1, consider it incorrect.</Paragraph> <Paragraph position="7"> * If the target p_t appears in r's seed list with a different hook, and the cardinality is 1:1 or 1:n, consider it incorrect.</Paragraph> <Paragraph position="8"> * Otherwise, the seed list does not provide enough information to evaluate p, so we perform a test on the web. For every query provided for r, the system replaces $1 with p_h and $2 with p_t, and sends the query to Google. The pair is deemed correct if and only if there is at least one answer. 
The precision of P is estimated as the number of extracted pairs that are judged correct divided by the total number of pairs extracted.</Paragraph> <Paragraph position="9"> In this step, every pattern that did not apply at least twice in the hook and target corpora is also discarded.</Paragraph> </Section> <Section position="4" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 3.4 Example </SectionTitle> <Paragraph position="0"> After collecting and generalising patterns for the relation director-film, we apply each pattern to the hook and target corpora collected for every relation. Let us suppose that we want to estimate the precision of the pattern <target> 's <hook> and we apply it to the hook and the target corpora for this relation and for author-book. Possible pairs extracted are (Woody Allen, Bananas), (Woody Allen, Without Fears), (Charles Dickens, A Christmas Carol). Only the first one is correct. The rote extractor proceeds as follows: to reach a decision about the second pair, it queries Google with the sequences "Woody Allen directed Without Fears" and "Without Fears directed by Woody Allen". Because neither of those queries provides any answer, the pair is considered incorrect.</Paragraph> <Paragraph position="1"> In this way, it can be expected that the patterns that are equally applicable to several relations, such as writer-book, director-film or painter-picture, will attain a low precision because they will extract many incorrect pairs from the corpora corresponding to the other relations.</Paragraph> </Section> </Section> </Paper>
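The pair-evaluation rules of Section 3.3 and the example of Section 3.4 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the in-memory seed lists and the `web_hits` callable (a stand-in for the Google query) are hypothetical, and the sketch assumes that the first pair of the example is in the director-film seed list and the third in the author-book seed list. It also reads "did not apply at least twice" as "extracted fewer than two pairs", which is one possible interpretation.

```python
from fractions import Fraction


def judge_pair(pair, relation, seed_lists, cardinality, queries, web_hits):
    """Judge one extracted (hook, target) pair for `relation`.

    `web_hits` is a hypothetical callable that returns the number of
    answers a search engine gives for a query string.
    """
    hook, target = pair
    seeds = seed_lists[relation]
    # The pair is in the seed list of this relation: correct.
    if pair in seeds:
        return True
    # The pair is in the seed list of another relation: incorrect.
    if any(pair in other for name, other in seed_lists.items() if name != relation):
        return False
    # Known hook with a different target, and every hook has a single target.
    if cardinality in ("1:1", "n:1") and any(
        h == hook and t != target for h, t in seeds
    ):
        return False
    # Known target with a different hook, and every target has a single hook.
    if cardinality in ("1:1", "1:n") and any(
        t == target and h != hook for h, t in seeds
    ):
        return False
    # Otherwise fall back to the web test: correct if any query has an answer.
    for query in queries:
        text = query.replace("$1", hook).replace("$2", target)
        if web_hits(text) > 0:
            return True
    return False


def estimate_precision(pairs, relation, seed_lists, cardinality, queries, web_hits):
    """Return correct/total as a Fraction, or None if the pattern is discarded."""
    if len(pairs) < 2:  # assumed reading of "did not apply at least twice"
        return None
    correct = sum(
        judge_pair(p, relation, seed_lists, cardinality, queries, web_hits)
        for p in pairs
    )
    return Fraction(correct, len(pairs))


# Assumed seed lists for the example of Section 3.4.
seed_lists = {
    "director-film": {("Woody Allen", "Bananas")},
    "author-book": {("Charles Dickens", "A Christmas Carol")},
}
extracted = [
    ("Woody Allen", "Bananas"),
    ("Woody Allen", "Without Fears"),
    ("Charles Dickens", "A Christmas Carol"),
]


def no_hits(query):
    """Stub web test in which no query returns an answer."""
    return 0


precision = estimate_precision(
    extracted,
    "director-film",
    seed_lists,
    "1:n",
    ["$1 directed $2", "$2 directed by $1"],
    no_hits,
)
print(precision)  # 1/3
```

Under these assumptions only the first pair counts as correct, so the genitive pattern obtains a precision of 1/3 for director-film, which matches the expectation that patterns shared by several relations are penalised.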