<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1323">
  <Title>Combining Lexical and Formatting Cues for Named Entity Acquisition from the Web</Title>
  <Section position="5" start_page="181" end_page="182" type="metho">
    <SectionTitle>
3 Architecture and Principles
</SectionTitle>
    <Paragraph position="0"> To acquire NEs from the Web, we have developed a system that consists of three sequential modules (see Figure 1):  1. A harvester that downloads the pages retrieved by a search engine from the four following query strings (1.a) following (NE) (1.c) (NE) such as (1.b) list of (NE) (1.d) such (NE) as  in which (NE) stands for a typifying hypernym of NEs such as Universities, politicians, or car makers (see list in 4). . Three parallel shallow parsers Pc, P1 and Pa which extract candidate NEs respectively from enumerations, lists and tables, and anchors.</Paragraph>
    <Paragraph position="1"> . A post-filtering module that cleans up the candidate NEs from leading determiners or trailing unrelated words and splits co-ordinated NEs into unitary items.</Paragraph>
    <Section position="1" start_page="181" end_page="182" type="sub_section">
      <SectionTitle>
Corpus Harvesting
</SectionTitle>
      <Paragraph position="0"> The four strings (1.a-d) given above are used to query a search engine. They consist of an hypernym and a discourse marker. They are expected to be followed by a collection of NEs.</Paragraph>
      <Paragraph position="1"> Figure 2 shows five prototypical examples of collections encountered in HTML pages re- null trieved through one of the strings (1.a-d)3 The first collection is an enumeration and consists of a coordination of three NEs. The second collection is a list organized into two sublists. Each sublist is introduced by a hypernym. The third structure is a list marked by bullets. Such lists can be constructed through an HTML table (this example), or by using enumeration marks (&lt;ul&gt; or &lt;ol&gt;).</Paragraph>
      <Paragraph position="2"> The fourth example is also a list built by using a table structure but displays a more complex spatial organization and does not employ graphical bullets. The fifth example is an anchor to a collection not provided to the reader within the document, but which can be reached by following an hyperlink instead.</Paragraph>
      <Paragraph position="3"> The corpus of HTML pages is collected through two search engines with different capabilities: AltaVista (AV) and Northern Light (NL).2 AV offers the facility of double-quoting the query strings in order to search for exact strings (also called phrases in IR). NL does not support phrase search. However, in AV, the number of retrievable documents is limited to the 200 highest ranked documents while it is potentially unlimited in NL. For NL, the  search engine is a combination of wget available from ftp://sunsite, auc. dk/pub/infosystems/wget/ and Perl scripts.</Paragraph>
      <Paragraph position="4">  It's development is due to the support gwen by the Ministry of Pubhc Health, aided by international organizations such as the Pan American Health Organization (PAHO), the United Nations Development program, and the Caribbean and Latin American Medical Science Information Center.</Paragraph>
      <Paragraph position="5"> 7. The session was also attended by observers from the following international organizations:  Books, documentation, periodicals on European legislation, economy, agriculture, industry, educatmn, norms, social pohtics, law. For more information on publicauons, COM documents and to subscribe to the Officml Journal please contact Dunya Infotel.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="182" end_page="183" type="metho">
    <SectionTitle>
UN (United Nations)
</SectionTitle>
    <Paragraph position="0"> Peace and security, economics, statistics, energy, natural resources, environment, international law, human rights, polmcal affairs and disarmament, social questions. 1997  number of retrieved documents was however restricted to 2000 in order to limit processing times. The choice of these two search engines is intended to evaluate whether a poorer query mode (bags of words in NL instead of strings in AV) can be palliated by accessing more documents (2000 max. for NL instead of 200 max. for AV).</Paragraph>
    <Paragraph position="1"> The corpus collected by the two search engines and the four f~.milies of queries is 2,958Mb large (details are given in Section 4). Acquisition of Candidate NEs Three parallel shallow parsers Pc, P\]. and Pa are used to extract NEs from the corpora collected by the harvester. The parsers rely on the query string to detect the sentence introducing the collection of NEs (the initializer in (P~ry-Woodley, 1998)). The text and HTML marks after the initializer are parsed jointly in order to retrieve one of the following three  spatio-syntactic structures: 1. a textual enumeration (parser Pc, top null most example in Figure 2), 2. a list or a table (parser Pl, the next three examples in Figure 2), 3. an anchor toward a page containing a list (parser Pa, bottom example in Figure 2).</Paragraph>
    <Paragraph position="2">  In brief, these parsers combine string matching (the initial lexical cue), syntactic analysis (enumerations in Pe), analysis of formatting instructions (lists and tables in Pl), and access to linked documents through anchors detected by Pa. The results presented in this paper only concern the first two parsers. Since anchors raise specific problems in linguistic analysis (Amitay, 1999), they will be analyzed in another pubhcation. The resulting candidate NEs are cleaned up and filtered by a post-filtering module that splits associations of NEs, suppresses initial determiners or trailing modifiers and punctuations, and rejects incorrect NEs.</Paragraph>
    <Paragraph position="3"> The Enumeration Parser Pe The enumerations are expected to occur inside the sentences containing the query string. Pe uses a traditional approach to parsing through conjunction splitting in which a NE pattern NE is given by (3) and an enumeration by (4). 3</Paragraph>
    <Paragraph position="5"> The lists are expected to occur no further than four lines after the sentence containing the query string. The lists are extracted through one of the following three patterns. They correspond to three alternatives commonly used by HTML authors in order to build a spatial construction of aligned items (lists, line breaks, or tables). They are expressed by case-insensitive regular expressions in which the selected string is the shortest acceptable underlined pattern:</Paragraph>
    <Paragraph position="7"> der to accept diacriticized letters, and possible abbreviations composed of a single letter followed by a dot.</Paragraph>
    <Paragraph position="8"> In addition, after the removal of the HTML mark-up tags, only the longest subpart of the string accepted by (3) is produced as output to the final filter. These patterns do not cover all the situations in which a formatted text denotes a list. Some specific cases of lists such as pre-formatted text in a verbatim environment (&lt;we&gt;), or items marked by a paragraph tag (&lt;p&gt;) are not considered here. They would produce too inaccurate results because they are not typical enough of lists.</Paragraph>
    <Paragraph position="9"> Postfilterlng The pre-candidate NEs produced by the shallow parsers are processed by filters before being proposed as candidate NEs. The roles of the filters are (in this order): * removal of trailing lower-case words, * deletion of the determiner the and the co-ordinating conjunctions and and or and the words which follow them,  candidate NEs because only organization names are expected to contain acronyms, * rejection of NEs containing words in a stop list such as Next, Top, Web, or Click.</Paragraph>
    <Paragraph position="10"> Postfiltering is completed by discarding single-word candidates, that are described as common words in the CELEX 4 database, and multi-word candidates that contain more than</Paragraph>
  </Section>
  <Section position="7" start_page="183" end_page="184" type="metho">
    <SectionTitle>
4 Experiments and Evaluations
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="183" end_page="184" type="sub_section">
      <SectionTitle>
Data Collection
</SectionTitle>
      <Paragraph position="0"> The acquisition of NEs is performed on 34 types of NEs chosen arbitrarily among three subtypes of the MUC typology:</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="184" end_page="188" type="metho">
    <Paragraph position="0"> international organizations, universities, political organizations, international agencies, car makers, terrorist groups, financial institutions, museums, international companies, holdings, sects, and realtors), PERSON (politicians, VIPs, actors, managers, celebrities, actresses, athletes, authors, film directors, top models, musicians, singers, and journalists), and LOCATION (countries, regions, states, lakes, cities, rivers, mountains, and islands). Each of these 34 types (a (NE) string) is combined with the four discourse markers given in (1.a-d), yielding 136 queries for the two search engines. Each of the 272 corpora collected through the harvester is made of the 200 documents downloadable through AV for the phrase search (or less if less are retrieved) and 2,000 documents though NL.</Paragraph>
    <Paragraph position="1"> Each of these corpora is parsed by the enumeration and the list parsers.</Paragraph>
    <Paragraph position="2"> Two aspects of the data are evaluated.</Paragraph>
    <Paragraph position="3"> First, the size of the yield is measured in order to compare the productivity of the 272 queries according to the type of query (type of NE and type of discourse marker) and the type .of search engine (rich versus plain queries and low versus high number of downloaded documents). Second, the quality of the candidate NEs is measured through human inspection of accessible Web pages containing each NE.</Paragraph>
    <Section position="1" start_page="184" end_page="184" type="sub_section">
      <SectionTitle>
Corpus Size
</SectionTitle>
      <Paragraph position="0"> The 272 corpora are 2,958 Mb large: 368 Mb for the corpora collected through AV and 2,590 Mb for those obtained through NL. Detailed sizes of corpora are shown in Table 1.</Paragraph>
      <Paragraph position="1"> The corpora collected through NL for the pattern list o/ (NE / represent more than a half of the NL collection (1,307 Mb). The most productive pattern for AV is (NE) such as through which 41% of the AV collection is downloaded (150 Mb).</Paragraph>
      <Paragraph position="2"> The sizes of the corpora also depends on the type of NEs. For each search engine, the total sizes are reported for each pattern (1.ad). In addition, the largest corpus for each of the three types of NEs is indicated in the last three lines. The variety of sizes and distribution among the types of NEs shows that using search engines with different capabilities yields different figures for the collections of pages. Therefore, the subsequent process of NE acquisition heavily depends on the means used to collect the basic textual data from which knowledge is acquired.</Paragraph>
    </Section>
    <Section position="2" start_page="184" end_page="188" type="sub_section">
      <SectionTitle>
Quantitative Evaluation of Acquisition
</SectionTitle>
      <Paragraph position="0"> Table 2 presents, for each pattern and each search engine, the number of candidates, the productivity, the ratios of the number of enumerations to lists, and the rate of redundancy. In all, 17,176 candidates are produced through AV and 34,978 through NL. The lowest accuracy of the NL query mode is well palliated by a larger collection of pages.</Paragraph>
      <Paragraph position="1"> Productivity. The productivity is the ratio of the number of candidates to the size of the collection. Using a unit of number of candidates per Mb, the productivity of AV is 46.7 while it is 3.5 times lower for NL (13.5).</Paragraph>
      <Paragraph position="2"> Thus, collecting NEs from a coarser search engine, such as NL, requires downloading 3.5 times larger corpora for the same yield. A finer search engine with phrase query facilities, such as AV, is more economical with respect to knowledge acquisition based on discourse markers.</Paragraph>
      <Paragraph position="3"> As was the case for the size of the collection, the productivity of the corpora also depends on the types of NEs. Universities (28.1), celebrities (53.0) and countries (36.5) are the most productive NEs in their categories while international agencies (4.0), film directors (4.4) and states (8.7) are the less productive ones. These discrepancies certainly depend on the number of existing names in these categories. For instance, there are many more names of celebrities than .film directors. In fact, the productivity of NL is significantly lower than the productivity of AV only for the pattern list of NE. Since this pattern corresponds to the largest corpus (see Table 1), its poor performance in acquisition has a strong impact on the overall productivity of NL. Avoiding this pattern would make NL more suitable for acquisition with a productivity of 23.2 (only 2 times lower than AV). Ratios enumerations/lists. The ratios in the third lines of the tables correspond to the quotient of the number of candidates acquired by analyzing enumerations (Pe parser) to the number of candidates obtained from the analysis of lists (P1 parser). Following NE mainly yields NEs through the analysis of lists, probably because enumerations using coordinations are better introduced by such as. The outcome is more balanced for list of NE. It could be expected that this pat- null tern tends to introduce only lists, but there are only 1.66 times more NEs obtained from lists than from enumerations through list off NE. The large number of NEs produced from enumerations after this pattern certainly relies on the combination of linguistics and formatting cues in the construction of meaning.</Paragraph>
      <Paragraph position="4"> The writer avoids using (the word) list when the text is followed by a (physical) list. Lastly, in all, 11 times more NEs are obtained from enumerations than from lists after the pattern NE such as, and 18 times more after such NE as. This shows that the linguistic pattern such as preferably introduces textual enumerations through coordinations (Hearst, 1998).</Paragraph>
      <Paragraph position="5"> Redundancy. There are two main causes of redundancy in acquisition. A first cause is that the same NE can be acquired from several collections in the same corpus. Redundancy in the fourth lines of the tables is the ratio of duplicates among the yield of candidate NEs for each search engine and each query. This value is relatively stable whatever the search engine or the query pattern. On average, redundancy is 2.09: each candidate is acquired slightly more than two times. Acquisition through NL is slightly more re.~:dundant (2.18) than through AV (1.92). This difference is not significant since the number of NEs acquired through NL is twice as large as the number of NEs acquired through AV.</Paragraph>
      <Paragraph position="6"> Overlap. Another cause of multiple acquisition is due to the concurrent exploitation of two search engines. If these engines were using similar techniques to retrieve documents, the overlap would be large. Since we have chosen two radically different modes of query (phrase vs. bag-of-word technique), the overlap---the ratio of the number common candidates to the number of total candidates--is low (15%).</Paragraph>
      <Paragraph position="7"> The two search engines seem to be complementary rather than competitive because they retrieve different sets of documents.</Paragraph>
      <Paragraph position="8">  In all, 31,759 candidates are produced by postfiltering the acquisition from the corpora retrieved by the two search engines. A set of 504 candidates is randomly chosen for the purpose of evaluation. For each candidate, AV is queried with a phrase containing the string of the NE. The topmost 20 pages retrieved by AV are downloaded and then used for manual inspection in case of doubt about the actual status of the candidate. We assume that if a candidate is correct, an unambiguous reference with the expected type should be found at least in one of the topmost 20 pages.</Paragraph>
      <Paragraph position="9"> Two levels of precision are measured: 1. A NE is correct if its full name is retrieved and if its fine-grained type (the 34 types given at the beginning of this section) is correct. The manual inspection of the 504 candidates indicates a precision of 62.8%.</Paragraph>
      <Paragraph position="10"> 2. A NE is correct if its full name is retrieved and if its MUC type (ORGANIZATION, PERSON, or LOCATION) is correct. In this case, the precision is 73.6%.</Paragraph>
      <Paragraph position="11"> The errors can be classified into the following categories: Wrong type Many errors in NE typing are due to an incorrect connection between a query pattern and a collection in a document. For instance, Ashley Judd is incorrectly reported as an athlete (she is an actress) from the occurrence His clientele includes stars and athletes such as Ashley Judd (below) and Mats Sundin.</Paragraph>
      <Paragraph position="12"> The error is due to a partial analysis of the initializer (underlined above). Only athletes is seen as the hypernym while stars is also part of it. A correct analysis of the occurrence would have led to a type ambiguity. In this context, there is no clue for deciding whether Ashley Judd is a star or an athlete.</Paragraph>
      <Paragraph position="13"> Other wrong types are due to polysemy. For instance, HorseFlySwarm is extracted from a list of actors in a page describing the commands and procedures for programming a video game. Here actors has the meaning of a virtual actor, a procedure in a programming environment, and not a movie star.</Paragraph>
      <Paragraph position="14"> Incomplete Partial extraction of candidates is mainly due to parsing errors or to collections containing partial names of entities. null As an illustration of the second case, the author's name Goffman is drawn from the occurrence Readings are drawnfrom the work o\] such authors as Laing,  Szasz, Goffman, Sartre, Bateson, and Freud.</Paragraph>
      <Paragraph position="15"> Since this enumeration ,does not contain the first names of the authors, it is not appropriate for an acquisition of unambiguous author's names.</Paragraph>
      <Paragraph position="16"> Other names such as Lucero are ambiguous even though they are completely extracted because they correspond to a first name or to a name that is part of several other ones. They are also counted as errors since they will be responsible of spurious identifications in a name tagging task.</Paragraph>
      <Paragraph position="17"> Over-complete Excessive extractions are due to parsing errors or to collections that contain words accompanying names that are incorrectly collected together with the name. For instance, Director Lewis Burke FFrumkes is extracted as an author's name from a list in which the actual name Lewis Burke Frumkes is preceded by the title Director.</Paragraph>
      <Paragraph position="18"> Miscellaneous Other types of errors do not show clear connection between the extracted sequence and a NE. They are mainly due to errors in the analysis of the web page.</Paragraph>
      <Paragraph position="19"> These types of errors are distributed as follows: wrong type 25%, incomplete 24%, overcomplete 8% and miscellaneous 43%.</Paragraph>
      <Paragraph position="20"> 5 Refinement of the Types of NEs So far, the type of the candidate NEs is provided by the NE hypernym given in (1.a-d). However, the initializer preceding the collection of NEs to be extracted can contain more information on the type of the following NEs. In fact the initializer fulfills four distinct functions: null 1. introduces the presence and the proximity of the collection, e.g. Here is 2. describes the structure of the collection, e.g. a list of 3. gives the type of each item of the collection, e.g. universities 4. specifies the particular characteristics of each item. e.g. universities in Vietnam The cues used by the harvester are elements which either introduce the collection (e.g. the .following) or describe the structure (e.g. a list of). In initializers in general, these first 2 functions need not be expressed explicitly by lexical means, as the layout itself indicates the presence and type of the collection. Readers exploit the visual properties of written text to aid the construction of meaning (P6ry-Woodley, 1998).</Paragraph>
      <Paragraph position="21"> However it is necessary to be explicit when defining the items of the collection as this information is not available to the reader via structural properties. Initializers generally contain additional characteristics of the items which provide the differentia (underlined here): This is a list off American companies with business interests in Latvia.</Paragraph>
      <Paragraph position="22"> This example is the most explicit form an initializer can take as it contains a lexical element which corresponds to each of the four functions outlined above. It is fairly simple to extract the details of the items from initializers with this basic form, as the modification of the hypernym takes the form of a relative clause, a prepositional phrase or an adjectival phrase. A detailed grammar of this form of initializer is as shown in Figure 3. 5  We tag the collection by part of speech using the TreeTagger (Schmid, 1999). The elements which express the differentia are extracted by means of pattern matching: they are always the modifiers of the plural noun in the string, which is the hypernym of the items of the collection.</Paragraph>
      <Paragraph position="23"> 5pp = prepositional phrase, Ns = noun (singular), Npl = noun (plural), Vp = verb in present tense, rel.cl. = relative clause.</Paragraph>
      <Paragraph position="24">  Initializers containing the search string such as behave somewhat differently. They are syntactically incomplete, and the missing constituent is provided by each item of the collection (Virbel, 1985). These phrases vary considerably in structure and can require relatively complex syntactic rearrangement to extract the properties of the hypernym. We will not discuss these in more detail here.</Paragraph>
      <Paragraph position="25"> One type of error in this system occurs when a paragraph containing the search string is followed by an unrelated list. For example the harvester recognizes Ask the long list of American companies who have unsuccessfully marketed products in Japan.</Paragraph>
      <Paragraph position="26"> as an initializer when in fact it is not related to any collection. If it happened to be followed on the page by an collection of any kind the system would mistakenly collect the items as NEs of the type specified by the search string. The cue list of is commonly used in discursive texts, so some filtering is required to identify collections which are not employed as initializers and to reduce the collection of erroneous items. Analyzing the syntactic forms h:has allowed us to construct a set of regular expressions which are used to eliminate noninitializers and disregard any items collected following them.</Paragraph>
      <Paragraph position="27"> We have extracted 1813 potential initializers from the corpus of HTML pages collected via AV &amp; NL for the query string list of NE. Using lexico-syntactic patterns in order to identify correct initializers, we have designed a shallow parser for filtering and analyzing the strings. This parser consists of 14 modules, 4 of which carry out pre-filtering to prepare and tag the corpus, and 10 of which carry out a fine-grained syntactic analysis, removing collections that do not function as initializers. After filtering, the corpus contains 520 collections. The process has a precision of 78% and a recall of 90%.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>