Learning Effective Surface Text Patterns for Information Extraction

2 Problem description

We consider two classes cq and ca and the corresponding non-empty sets of instances Iq and Ia. Elements in the sets Iq and Ia are instances of cq and ca respectively, and are known to us beforehand. However, the sets I do not have to be complete, i.e. not all possible instances of the corresponding class have to be in the set I. Moreover, we consider some relation R between these classes and assume that a non-empty training set of instance-pairs TR = {(x,y) | x ∈ Iq ∧ y ∈ Ia} is given, i.e. instance-pairs that are known to be R-related.

Problem: Given the classes cq and ca, the sets of instances Iq and Ia, a relation R and a set of R-related instance-pairs TR, learn effective surface text patterns that express the relation R.

Say, for example, we consider the classes 'author' and 'book title' and the relation 'has written'. We assume that we know some related instance-pairs, e.g. ('Leo Tolstoy', 'War and Peace') and ('Günter Grass', 'Die Blechtrommel'). We then want to find natural language phrases that relate authors to the titles of the books they wrote. Thus, if we query a pattern in combination with the name of an author (e.g. 'Umberto Eco wrote'), we want the search results of this query to contain the books by this author.

The population of an ontology can be seen as a generalization of a question-answering setting. Unlike question answering, we are interested in finding all possible instance-pairs, not only the pairs with one fixed instance (e.g. all 'author'-'book' pairs instead of only the pairs containing a fixed author). Functional relations in an ontology correspond to factoid questions, e.g. the population of the classes 'person' and 'country' and the 'was born in' relation. Non-functional relations can be used to identify answers to list questions, for example "name all books written by Louis-Ferdinand Céline" or "which countries border Germany?".
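To make the setting concrete, the sketch below renders the classes, instance sets and training pairs of the running example as plain Python data structures; the variable names are ours and serve illustration only.

```python
# Illustrative representation of the problem setting described above.
# The class names, instance sets and training pairs mirror the running
# example ('author', 'book title', relation 'has written').

c_q = "author"        # the class we combine with patterns in queries
c_a = "book title"    # the class whose instances we expect in the results

I_q = {"Leo Tolstoy", "Günter Grass", "Umberto Eco"}   # known, possibly incomplete
I_a = {"War and Peace", "Die Blechtrommel"}            # known, possibly incomplete

# Training set T_R: instance-pairs known to be related by R = 'has written'.
T_R = {
    ("Leo Tolstoy", "War and Peace"),
    ("Günter Grass", "Die Blechtrommel"),
}

# Goal: learn surface text patterns (e.g. "<author> wrote <book title>") such
# that querying a pattern together with an author retrieves that author's books.
```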
3 Related work

Brin identifies the use of patterns in the discovery of relations on the web (Brin, 1998). He describes a website-dependent approach to identify hypertext patterns that express some relation. For each web site, such patterns are learned and explored to identify instances that are similarly related. In (Agichtein and Gravano, 2000), such a system is combined with a named-entity recognizer.

In (Craven et al., 2000) an ontology is populated by crawling a website. Based on tagged web pages from other sites, rules are learned to extract information from the website.

Named-entity recognition was addressed in the nineties at the Message Understanding Conferences (Chinchor, 1998) and is continued, for example, in (Zhou and Su, 2002). Automated part-of-speech tagging (Brill, 1992) is a useful technique in term extraction (Frantzi et al., 2000), a domain closely related to named-entity recognition. Here, terms are extracted with a predefined part-of-speech structure, e.g. an adjective-noun combination. In (Nenadić et al., 2002), methods are discussed to extract information from natural language texts with the use of both part-of-speech tags and hyponym patterns.

As referred to in the introduction, Ravichandran and Hovy (2002) present a method to identify surface text patterns using a web search engine. They extract patterns expressing functional relations in a factoid question answering setting. Selection of the extracted patterns is based on the precision of the patterns. For example, if the pattern 'was born in' is identified as a pattern for the pair ('Mozart', 'Salzburg'), they compute precision as the number of excerpts containing 'Mozart was born in Salzburg' divided by the number of excerpts containing 'Mozart was born in'.

Information extraction and ontology creation are two closely related fields. For reliable information extraction, we need background information, e.g. an ontology. On the other hand, we need information extraction to generate broad and highly usable ontologies. An overview of ontology learning from text can be found in (Buitelaar et al., 2005).

Early work (Hearst, 1998) describes the extraction of text patterns expressing WordNet relations (such as hyponym relations) from some corpus. This work focuses merely on the identification of such text patterns (i.e. phrases containing both instances of some related pair). Patterns found by multiple pairs are suggested to be usable patterns.

KnowItAll is a hybrid named-entity extraction system (Etzioni et al., 2005) that finds lists of instances of some class on the web using a search engine. It combines Hearst patterns and learned patterns for instances of some class to identify and extract named entities. Moreover, it uses adaptive wrapper algorithms (Crescenzi and Mecca, 2004) to extract information from HTML markup such as tables.

Cimiano and Staab (2004) describe a method to use a search engine to verify a hypothesized relation. For example, if we are interested in the 'is a' or hyponym relation and we have a candidate instance-pair ('river', 'Nile') for this relation, we can use a search engine to query phrases expressing this relation (e.g. 'rivers such as the Nile'). The number of hits for such queries can then be used as a measure to determine the validity of the hypothesis.

In (Geleijnse and Korst, 2005), a method is described to populate an ontology with the use of queried text patterns. The algorithm presented extracts instances from search results after having submitted a combination of an instance and a pattern as a query to a search engine. The extracted instances from the retrieved excerpts can thereafter be used to formulate new queries, and thus to identify and extract further instances.
4 The algorithm

We present an algorithm to learn surface text patterns for relations. We use Google to retrieve such patterns.

The algorithm makes use of a training set TR of instance-pairs that are R-related. This training set should be chosen such that the instance-pairs are typical for relation R.

We first discover how relation R is expressed in natural language texts on the web (Section 4.1). In Section 4.2 we address the problem of selecting effective patterns from the total set of patterns found.

4.1 Identifying relation patterns

We first generate a list of surface text patterns with the use of the following algorithm. For evaluation purposes, we also compute the frequency of each pattern found.

- Step 1: Formulate queries using an instance-pair (x,y) ∈ TR. Since we are interested in phrases within sentences rather than in keywords or expressions in telegram style that often appear in titles of web pages, we use the allintext: option. This gives us only search results with the queried expression in the bodies of the documents rather than in the titles. We query both allintext:"x * y" and allintext:"y * x". The * is a regular expression operator accepted by Google; it is a placeholder for zero or more words.
- Step 2: Send the queries to Google and collect the excerpts of the at most 1,000 pages it returns for each query.
- Step 3: Extract all phrases matching the queried expressions and replace both x and y by the names of their classes.
- Step 4: Remove all phrases that are not within one sentence.
- Step 5: Normalize all phrases by removing all mark-up that is ignored by Google. Since Google is case-insensitive and ignores punctuation, double spaces and the like, we translate all phrases found to a normal form: the simplest expression that we can query that leads to the document retrieved.
- Step 6: Update the frequencies of all normalized phrases found.
- Step 7: Repeat the procedure for any unqueried pair (x′,y′) ∈ TR.

We have now generated a list with relation patterns and their frequencies within the retrieved Google excerpts.
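A minimal sketch of this harvesting procedure is given below, assuming a generic search wrapper; search_excerpts and the normalization details are placeholders for whatever search API and clean-up rules are actually used, and the phrase extraction is deliberately simplified.

```python
import re
from collections import Counter

def learn_patterns(training_pairs, class_q, class_a, search_excerpts):
    """Collect candidate surface text patterns and their frequencies.

    search_excerpts(query) is assumed to return up to 1,000 excerpt strings
    for a web query; the normalization below is a simplified stand-in for
    'the simplest expression the search engine would still match'.
    """
    freq = Counter()
    for x, y in training_pairs:                              # Steps 1, 2 and 7
        for left, right, left_cls, right_cls in ((x, y, class_q, class_a),
                                                 (y, x, class_a, class_q)):
            query = f'allintext:"{left} * {right}"'
            for excerpt in search_excerpts(query):
                # Step 3: extract the phrase between the two instances and
                # replace them by their class names.
                m = re.search(re.escape(left) + r"(.{0,60}?)" + re.escape(right),
                              excerpt)
                if not m:
                    continue
                # Step 4: discard phrases spanning a sentence boundary.
                if re.search(r"[.!?]", m.group(1)):
                    continue
                phrase = f"{left_cls}{m.group(1)}{right_cls}"
                # Step 5: normalize case, punctuation and whitespace.
                phrase = re.sub(r"[^\w\s]", " ", phrase.lower())
                phrase = re.sub(r"\s+", " ", phrase).strip()
                freq[phrase] += 1                            # Step 6
    return freq
```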
4.2 Selecting relation patterns

From the list of relation patterns found, we are interested in the most effective ones. We are not only interested in the most precise ones. For example, the retrieved pattern "född 30 mars 1853 i" proved to be a 100% precise pattern expressing the relation between a person ('Vincent van Gogh') and his place of birth ('Zundert'). Clearly, this rare phrase is unsuited to mine instance-pairs of this relation in general. On the other hand, high frequency of some pattern is no guarantee for effectiveness either. The frequently occurring pattern "was born in London" (found when querying for 'Thomas Bayes * England') is well suited to find London-born persons, but in general the pattern is unsuited - since too narrow - to express the relation between a person and his or her country of origin.

Taking these observations into account, we formulate three criteria for selecting effective relation patterns.

1. The pattern should frequently occur on the web, to increase the probability of getting any results when querying the pattern in combination with an instance.
2. The pattern should be precise. When we query a pattern in combination with an instance in Iq, we want many search results containing instances from ca.
3. If relation R is not functional, the pattern should be widespread, i.e. among the search results when querying a combination of the pattern and an instance in Iq there must be as many distinct R-related instances from ca as possible.

To measure these criteria, we use the following scoring functions for a relation pattern s, evaluated on a test set I′q ⊆ Iq.

1. ffreq(s) = the number of occurrences of s in the excerpts as found by the algorithm described in the previous subsection.
2. fprec(s) = Σ_{x∈I′q} C(s,x) / Σ_{x∈I′q} FO(s,x), where C(s,x) is the number of excerpts containing instances of ca after querying s in combination with x, and FO(s,x) is the total number of excerpts found (at most 1,000).
3. fspr(s) = Σ_{x∈I′q} B(s,x), where B(s,x) is the number of distinct instances of class ca found after querying pattern s in combination with x.

The larger we choose the test set I′q of Iq, the more reliable the measures for precision and spreading become. However, the number of Google queries increases with the number of patterns found for each instance we add to I′q. We finally calculate the score of a pattern by multiplying the individual scores:

score(s) = ffreq(s) · fprec(s) · fspr(s).

For efficiency reasons, we only compute the scores of the patterns with the highest frequencies.
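The following sketch shows how the three measures could be combined into one effectiveness score; query_with_pattern, instances_of_ca and freq are assumed interfaces to the components above, and the precision computed here (excerpts containing a ca instance over all excerpts across the test set) is one plausible reading of the definition rather than a definitive implementation.

```python
def pattern_score(s, test_instances, query_with_pattern, instances_of_ca, freq):
    """Combine frequency, precision and spread: f_freq * f_prec * f_spr.

    query_with_pattern(s, x) is assumed to return the excerpts (at most
    1,000) obtained by querying pattern s together with instance x, and
    instances_of_ca(excerpt) the instances of class c_a recognized in an
    excerpt; freq[s] is the frequency collected while learning patterns.
    """
    good, total, spread = 0, 0, 0
    for x in test_instances:                    # x ranges over I'_q
        excerpts = query_with_pattern(s, x)
        total += len(excerpts)                  # FO(s, x), summed
        distinct_x = set()
        for e in excerpts:
            found = instances_of_ca(e)
            if found:
                good += 1                       # contributes to C(s, x)
            distinct_x.update(found)
        spread += len(distinct_x)               # B(s, x), summed
    f_freq = freq[s]
    f_prec = good / total if total else 0.0
    f_spr = spread
    return f_freq * f_prec * f_spr
```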
The problem remains how to recognize a (possibly multi-word) instance in the Google excerpts. For an ontology alignment setting - where the sets Ia and Iq are not to be expanded - this problem is trivial: we determine whether some t ∈ Ia accompanies the queried expression. For a setting where the instances of ca are not all known (e.g. it is not likely that we have a complete list of all books written in the world), we solve this problem in two stages. First we identify rules per class to extract candidate instances. Thereafter we use an additional Google query to verify whether a candidate is indeed an instance of class ca.

Identifying a candidate instance

The identification of multi-word terms is an issue of research on its own. However, in this setting we can allow ourselves to use less elaborate techniques to identify candidate instances. We can do so since we additionally perform a check on each extracted term. So, per class we create rules to identify candidate instances with a focus on high recall. In our current experiments we thus use very simple term recognition rules, based on regular expressions. For example, we identify a candidate instance of class 'person' if the queried expression is accompanied by two or three capitalized words.

Identifying an instance-class relation

We are interested in the question whether some extracted term t is an instance of class ca. For example, given the term 'The Godfather', does this term belong to the class 'movie'? The instance-class relation can be viewed as a hyponym relation. We therefore verify the hypothesis of t being an instance of ca by Googling hyponym relation patterns. We use a fixed set H of common patterns expressing the hyponym relation (Hearst, 1992; Cimiano and Staab, 2004), see Table 1. For the class names, we use plurals.

"cq including t and"
"cq for example t and"
"cq like t and"
"cq such as t and"

Table 1: Hearst patterns for the instance-class relation.

We use these patterns in the following acceptance function: we accept t as an instance if

Σ_{p∈H} h(p, cq, t) ≥ n,

where h(p, cq, t) is the number of Google hits for the query with pattern p combined with term t and the plural form of the class name cq. The threshold n has to be chosen beforehand. We can do so by calculating the sum of Google hits for queries with known instances of the class. Based on these figures, a threshold can be chosen, e.g. the minimum of these sums.

Note that term t is both preceded and followed by a fixed phrase in the queries. We do so to guarantee that t is indeed the full term we are interested in. For example, if we had extracted the term 'Los' instead of 'Los Angeles' as a Californian city, we would falsely identify 'Los' as a Californian city if we did not let 'Los' be followed by the fixed expression 'and'. The number of Google hits for some expression x is at least the number of Google hits when querying the same expression followed by some expression y.

If we identify a term t as being an instance of class ca, we can add this term to the set Ia. However, we cannot relate t to an instance in Iq, since the pattern used to find t has not proven to be effective yet (e.g. the pattern could express a different relation between one of the instance-pairs in the training set).

We reduce the number of Google queries by keeping a list of terms found that do not belong to ca. Terms that occur multiple times in the excerpts then need to be checked only once. Moreover, we use the OR-clause to combine the individual queries into one. We then check if the number of hits for this query exceeds the threshold. The number of Google queries in this phase thus equals the number of distinct terms extracted.
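The two stages, proposing candidate terms with a shallow rule and then checking the instance-class hypothesis against summed hit counts for the Hearst patterns, could look roughly as follows; the regular expression, the hit_count function and the query formatting are illustrative assumptions rather than the exact rules used in the experiments.

```python
import re

# Candidate extraction: a deliberately high-recall rule for class 'person',
# e.g. two or three capitalized words next to the queried expression.
PERSON_CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]+ ){1,2}[A-Z][a-z]+\b")

def extract_person_candidates(excerpt):
    """Return high-recall candidate instances of class 'person'."""
    return PERSON_CANDIDATE.findall(excerpt)

# Fixed set H of hyponym (Hearst) patterns, with the class name in plural.
HEARST_PATTERNS = [
    '"{cls} including {t} and"',
    '"{cls} for example {t} and"',
    '"{cls} like {t} and"',
    '"{cls} such as {t} and"',
]

def accept_instance(term, class_plural, hit_count, threshold):
    """Accept term t as an instance of the class if the summed number of
    hits for the Hearst-pattern queries reaches the threshold n.

    hit_count(query) is assumed to return the number of search-engine hits
    for a quoted query string.
    """
    total = sum(hit_count(p.format(cls=class_plural, t=term))
                for p in HEARST_PATTERNS)
    return total >= threshold
```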
5 The use of surface text patterns in information extraction

Having a method to identify relation patterns, we now focus on utilizing these patterns in information extraction from texts found by a search engine. We use an ontology to represent the information extracted.

Suppose we have an ontology O with classes (c1, c2, ...) and corresponding instance sets (I1, I2, ...). On these classes, relations R(i,j) are defined, with i and j the index numbers of the classes. The non-empty sets T(i,j) contain the training sets of instance-pairs of the relations R(i,j).

Per instance, we maintain a list of expressions that have already been used as a query. Initially, these lists are empty.

The first step of the algorithm is to learn surface text patterns for each relation in O. The following steps of the algorithm are performed until either some stop criterion is reached, or no more new instances and instance-pairs can be found.

- Step 1: Select a relation R(i,j), and an instance v from either Ii or Ij such that there exists at least one pattern expressing R(i,j) that we have not yet queried in combination with v.
- Step 2: Construct queries using the patterns with v and send these queries to Google.
- Step 3: Extract instances from the excerpts.
- Step 4: Add the newly found instances to the corresponding instance set and add the instance-pairs found (thus with v) to T(i,j).
- Step 5: If there exists an instance that we can use to formulate new queries, then repeat the procedure. Else, learn new patterns using the extracted instance-pairs and then repeat the procedure.

Note that instances of class cx learned using the algorithm applied to relation R(x,y) can be used as input for the algorithm applied to some relation R(x,z), to populate the sets Iz and T(x,z).
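A compact sketch of this population loop is given below; the ontology object, the helper functions and the stop criterion are kept abstract and stand in for the components of the previous sections, so this is an illustration of the control flow rather than the exact implementation.

```python
def populate_ontology(ontology, learn_patterns_for, query_and_extract,
                      stop_criterion):
    """Iteratively populate instance sets I_i and training sets T(i,j).

    ontology.relations yields relation identifiers R(i,j); ontology.instances(R)
    gives the instances usable as query seeds. The helper functions are assumed
    wrappers around the pattern learning, querying and extraction steps above.
    """
    patterns = {R: learn_patterns_for(R) for R in ontology.relations}
    queried = set()                      # (relation, pattern, instance) triples

    def unqueried():
        return [(R, p, v)
                for R in ontology.relations
                for v in ontology.instances(R)
                for p in patterns[R]
                if (R, p, v) not in queried]

    while not stop_criterion(ontology):
        todo = unqueried()               # Step 1: pick an unqueried combination
        if not todo:
            # No combination left: learn new patterns from the pairs found so far.
            patterns = {R: learn_patterns_for(R) for R in ontology.relations}
            todo = unqueried()
            if not todo:
                break                    # nothing new can be found
        R, p, v = todo[0]
        queried.add((R, p, v))
        # Steps 2-4: query, extract, and store new instances and instance-pairs.
        new_instances, new_pairs = query_and_extract(p, v)
        ontology.add_instances(R, new_instances)
        ontology.add_pairs(R, new_pairs)
    return ontology
```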
6 Experiments

6.1 Learning effective hyponym patterns

We are interested in whether the effective surface text patterns are indeed intuitive formulations of some relation R. As a test case, we compute the most effective patterns for the hyponym relation using a test set with the names of all countries.

Our experiment was set up as follows. We collected the complete list of countries in the world from the CIA World Factbook. Let Iq be this set of countries, and let Ia be the set { 'countries', 'country' }. The set TR consists of all pairs (a, 'countries') and (a, 'country'), for a ∈ Iq. We apply the surface text pattern learning algorithm to this set TR.

The algorithm identified almost 40,000 patterns. We computed fspr and fprec for the 1,000 most frequently found patterns. In Table 2, we give the 25 most effective patterns found by the algorithm. We consider the patterns in boldface true hyponym patterns. Focusing on these patterns, we observe two groups: 'is a' patterns and Hearst-like patterns.

pattern | freq | prec | spr
(countries) like | 645 | 0.66 | 134
(countries) such as | 537 | 0.54 | 126
is a small (country) | 142 | 0.69 | 110
(country) code for | 342 | 0.36 | 84
(country) map of | 345 | 0.34 | 78
(countries) including | 430 | 0.21 | 93
is the only (country) | 138 | 0.55 | 102
is a (country) | 339 | 0.22 | 99
(country) flag of | 251 | 0.63 | 46
and other (countries) | 279 | 0.34 | 72
and neighboring (countries) | 164 | 0.43 | 92
(country) name republic of | 83 | 0.93 | 76
(country) book of | 59 | 0.77 | 118
is a poor (country) | 63 | 0.73 | 106
is the first (country) | 53 | 0.70 | 112
(countries) except | 146 | 0.37 | 76
(country) code for calling | 157 | 0.95 | 26
is an independent (country) | 62 | 0.55 | 114
and surrounding (countries) | 84 | 0.40 | 107
is one of the poorest (countries) | 61 | 0.75 | 78
and several other (countries) | 65 | 0.59 | 90
among other (countries) | 84 | 0.38 | 97
is a sovereign (country) | 48 | 0.69 | 89
or any other (countries) | 87 | 0.58 | 58
(countries) namely | 58 | 0.44 | 109

Table 2: The 25 most effective patterns and their scores.

The Hearst patterns 'like' and 'such as' show to be the most effective. This observation is useful when we want to minimize the number of queries for hyponym patterns.

Expressions of properties that hold for many countries (e.g. 'is a small country') are also found among the effective patterns. The combination of 'is a', 'is an' or 'is the' with an adjective is a common pattern, occurring 2,400 times in the list. In future work, we plan to identify such adjectives in Google excerpts using a part-of-speech tagger (Brill, 1992).

6.2 Applying learned patterns in information extraction

The Text Retrieval Conference (TREC) question answering track in 2004 contains list questions, for example 'Who are Nirvana's band members?' (Voorhees, 2004). We illustrate the use of our ontology population algorithm in the context of such list-question answering with a small case study. Note that we do not consider the processing of the question itself in this research.

Inspired by one of the questions ('What countries is Burger King located in?'), we are interested in populating an ontology with restaurants and the countries in which they operate. We identify the classes 'country' and 'restaurant' and the relation 'located in' between the classes.

We hand the algorithm the instances of 'country', as well as two instances of 'restaurant': 'McDonald's' and 'KFC'. Moreover, we add three instance-pairs of the relation to the algorithm. We use these pairs and a subset I′country of size eight to compute a ranked list of the patterns. We extract terms consisting of one up to four capitalized words. In this test we set the threshold for the number of Google results for the queries with the extracted terms to 50. After a small test with names of international restaurant branches, this seemed an appropriate threshold.
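The case study can be read as a configuration of the population algorithm; the sketch below collects the settings mentioned above, where load_country_list and load_seed_pairs are assumed helpers and all names are chosen for illustration.

```python
import re

# Seed knowledge for the 'located in' case study described above.
countries = load_country_list()       # assumed helper: all country names
restaurants = {"McDonald's", "KFC"}
seed_pairs = load_seed_pairs()        # assumed helper: the three R-related seed pairs

config = {
    "classes": ("restaurant", "country"),
    "relation": "located in",
    "test_subset_size": 8,            # |I'_country| used to score patterns
    "candidate_term": re.compile(     # one up to four capitalized words
        r"\b(?:[A-Z][\w'&-]* ){0,3}[A-Z][\w'&-]*\b"),
    "hit_threshold": 50,              # min. Google hits to accept a candidate term
}
```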
The algorithm learned, besides a ranked list of 170 surface text patterns (Table 3), a list of 54 instances of 'restaurant' (Table 4). Among these instances are indeed the names of large international chains, Burger King being one of them. Less expected are the names of geographic locations and the names of famous cuisines such as 'Chinese' and 'French'. The last category of false instances found that have not been filtered out is a number of very common words (e.g. 'It' and 'There').

We populate the ontology with relations found between Burger King and instances of 'country' using the 20 most effective patterns.

pattern | prec | spr | freq
ca restaurants of cq | 0.24 | 15 | 21
ca restaurants in cq | 0.07 | 19 | 9
ca hamburger chain that occupies villages throughout modern day cq | 1.0 | 1 | 7
ca restaurant in cq | 0.06 | 16 | 6
ca restaurants in the cq | 0.13 | 16 | 2
ca hamburger restaurant in southern cq | 1.0 | 1 | 4

Table 3: A selection of the learned patterns with their scores.

The algorithm returned 69 instance-pairs relating countries to 'Burger King'. On the Burger King website, a list can be found of the 65 countries in which the hamburger chain operates. Of these 65 countries, we identified 55. This implies that our results have a precision of 55/69 = 80% and a recall of 55/65 = 85%. Many of the falsely related countries - mostly in eastern Europe - are locations where Burger King is said to have plans to expand its empire.

7 Conclusions

We have presented a novel approach to identify useful surface text patterns for information extraction using an internet search engine. We argued that the selection of patterns has to be based on effectiveness: a pattern has to occur frequently, it has to be precise, and it has to be widespread if it represents a non-functional relation. These criteria are combined in a scoring function which we use to select the most effective patterns.

The method presented can be used for arbitrary relations, thus also relations that link an instance to multiple other instances. These patterns can be used in information extraction. We combine patterns with an instance and offer such an expression as a query to a search engine. From the excerpts retrieved, we extract instances and simultaneously instance-pairs.

Learning surface text patterns is efficient with respect to the number of queries if we know all instances of the classes concerned. The first part of the algorithm is linear in the size of the training set. Furthermore, we select the n most frequent patterns and perform |I′q| · n queries to compute the scores of these n patterns.

However, for a setting where I′a is incomplete, we have to perform a check for each unique term identified as a candidate instance in the excerpts found by the |I′q| · n queries. The number of queries, one for each extracted unique candidate instance, thus fully depends on the rules that are used to identify a candidate instance.

We apply the learned patterns in an ontology population algorithm. We combine the learned high-quality relation patterns with an instance in a query. In this way we can perform a range of effective queries to find instances of some class and simultaneously find instance-pairs of the relation. A first experiment, the identification of hyponym patterns, showed that the patterns identified indeed intuitively reflect the relation considered. Moreover, we have generated a ranked list of hyponym patterns. The experiment with the restaurant ontology illustrated that a small training set suffices to learn effective patterns and populate an ontology with good precision and recall.
The algorithm performs well with respect to the recall of the instances found: many big international restaurant branches were found. The identification of the instances, however, is open to improvement, since the additional check does not filter out all falsely identified candidate instances.