<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0105">
  <Title>Grounding spatial named entities for information extraction and question answering</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Spatial Grounding
</SectionTitle>
    <Paragraph position="0"> Gazetteers are large lists of names of geographic entities, usually enriched with further information, such as their class (e.g., town, river, dam), their size, and their location (i.e., with respect to some relative or absolute coordinate system such as longitude and latitude).</Paragraph>
    <Paragraph position="1"> Appendix A identifies some publicly available sources.</Paragraph>
    <Paragraph position="2"> UN-LOCODE is the official gazetteer of the United Nations; it is freely available from the UNECE Web site and contains more than 36,000 locations in 234 countries (UNECE, 1998). The Alexandria Gazetteer (Smith et al., 1996; Frew et al., 1998) is another database of geographical entities, including both their coordinates and relationships such as: in-state-of, in-province-of, in-county-of, in-country-of, in-region-of, part-of and formerly-known-as.</Paragraph>
    <Paragraph position="3"> To date, Named Entity Recognition (NER) has only used gazetteers as evidence that a text span could be some kind of place name (LOCATION), even though their finite nature makes lists of names of limited use for classification (Mikheev et al., 1999). Here we use them for spatial grounding, i.e. relating linguistic entities of subtype LOCATION (Grishman and Sundheim, 1998) to their real-world counterparts.</Paragraph>
    <Paragraph position="4"> World atlases and the gazetteers that index them are not the only resources that can be used for grounding spatial terms. In biomedicine, there are several brain atlases of different species, using various techniques and focussing on both normal and disease states, as well as a digital atlas of the human body based on data from the Visible Human project. [Figure 1 caption fragment: ... Mouse Atlas (Baldock et al., 1999).]</Paragraph>
    <Paragraph position="5"> Such atlases, and the nomenclatures that label their parts, provide an important resource for biomedical research and clinical diagnosis. For example, the Mouse Atlas (Ringwald et al., 1994) comprises a sequence of 3D (volumetric) reconstructions of the mouse embryo in each of its 26 Theiler Stages of development. Indexing it is a part-of hierarchy of anatomical terms (such as embryo.organsystem.cardiovascularsystem.heart.atrium), called the Mouse Anatomical Nomenclature (MAN). Each term is mapped to one or more sets of adjacent voxels that constitute the term's denotation in the embryo. Figure 1 illustrates this linkage (using 2D cross-sections) in the EMAGE database. Just as one might find it useful for information extraction or question answering to ground geographic terms found in previously unseen text, one may also find it useful to ground anatomical terms in previously unseen text. One example of this would be in providing support for the curation of the Gene Expression Database (GXD). This support could come in the form of a named entity recognizer for anatomical parts in text, with grounding against the Mouse Atlas, using the gazetteer-like information in the MAN.</Paragraph>
    <Paragraph position="6"> So what is the relationship between a place name gazetteer like UN-LOCODE and the Mouse Atlas? The MAN is structured in a part-of hierarchy similar to that of geographical locations. Because both gazetteers like UN-LOCODE and biomedical atlases like the Mouse Atlas provide spatial grounding for linguistic terms (Figure 2), both can be used to reason about the spatio-temporal setting of a discourse, for instance, to resolve referential ambiguity.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Extraction
</SectionTitle>
      <Paragraph position="0"> There are many places that share the same name (Berlin, Germany vs. Berlin, WI, USA) or similar names (York, UK vs. New York, USA), usually because the founders of a new town gave it the same name as, or a name similar to, the place they had emigrated from.</Paragraph>
      <Paragraph position="1"> When ambiguous place names are used in conversation or in text, it is usually clear to the hearer what specific referent is intended. First, speaker and hearer usually share some extra-linguistic context and implicitly adhere to Grice's Cooperative Principle and the maxims that follow, which require a speaker to provide more identifying information about a location that the recipient is believed to be unfamiliar with. Secondly, linguistic context can provide clues: an accident report on the road between Perth and Dundee promotes an interpretation of Perth in Scotland, while an accident on the road between Perth and Fremantle promotes an interpretation of Perth in Western Australia. Computers, which are bound to select referents algorithmically, can exploit linguistic context more easily than extra-linguistic context, but even the use of linguistic context requires (as always) some subtle heuristic reasoning.</Paragraph>
      <Paragraph position="2"> Grounding place names mentioned in a text can support effective visualization, for instance in a multimedia document surrogate that contains textual, video and map elements (e.g. in a question answering scenario), where we want to ensure that the video shows the region and the map is centered on the places mentioned.</Paragraph>
      <Paragraph position="3"> To make use of linguistic context in resolving ambiguous place names, we apply two different minimality heuristics (Gardent and Webber, 2001). The first we borrow (slightly modified) from work in automatic word sense disambiguation (Gale et al., 1992), calling it "one referent per discourse". It assumes that a place name mentioned in a discourse refers to the same location throughout the discourse, just as a word is assumed to be used in one and the same sense throughout the discourse.</Paragraph>
      <Paragraph position="4"> Neither is logically necessary, and hence both are simply interpretational biases.</Paragraph>
      <Paragraph position="5"> The second minimality heuristic assumes that, in cases where more than one place name is mentioned in some span of text, the smallest region that is able to ground the whole set is the one that gives them their interpretation. This can be used to resolve referential ambiguity by proximity: not only is the place name Berlin taken to denote the same Berlin throughout a discourse unless stated otherwise, but a Potsdam mentioned together with a Berlin also uniquely selects the capital of Germany as the likely referent from the set of all candidate Berlins [footnote 7: despite the fact that most places named Berlin are in the ...]. To illustrate this spatial minimality heuristic, consider Figure 3: assume that a mention of place A in a text could refer either to A' or to A''. If the text also contains terms that ground unambiguously to I, J, and K, we assume the referent of A is A' rather than A'' because the former leads to a smaller spatial context.</Paragraph>
      <Paragraph position="6"> To use this spatial minimality heuristic, we start by extracting all place names using a named entity recognizer. We then look up the confusion set of potential referents for each place name, e.g. for Berlin: {Berlin, FRG (German capital); Berlin, WI, USA; Berlin, NJ, USA; Berlin, CT, USA; Berlin, NH, USA; Berlin, GA, USA; Berlin, IL, USA; Berlin, NY, USA; Berlin, ND, USA; Berlin, NJ, USA}. Each member of the set of potential referents is associated with its spatial coordinates (longitude/latitude), using a gazetteer. We then compute the cross-product of all the confusion sets. (Each member of the cross-product contains one potential referent for each place name, along with its spatial coordinates.) For each member of the cross-product, we compute the area of the minimal polygon bounding all the potential referents, and select as the intended interpretation the one with the smallest area [8].</Paragraph>
      <Paragraph position="7"> The resulting behaviour is shown in Figure 4: depending on the other places mentioned in context, a different Berlin is selected.</Paragraph>
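      <Paragraph> The procedure described above (look up confusion sets, form their cross-product, keep the tightest reading) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the mini-gazetteer, its approximate coordinates, and the names GAZETTEER, bounding_box_area and resolve are all hypothetical, and the area of an axis-aligned latitude/longitude box (in square degrees) is swapped in as a crude stand-in for the area of the minimal bounding polygon.

```python
from itertools import product

# Hypothetical mini-gazetteer: place name -> candidate referents with
# approximate (latitude, longitude) in degrees. A real system would query
# a full gazetteer instead of this hand-made table.
GAZETTEER = {
    "Berlin": {
        "Berlin, Germany": (52.52, 13.40),
        "Berlin, WI, USA": (43.97, -88.94),
        "Berlin, NH, USA": (44.47, -71.18),
    },
    "Potsdam": {
        "Potsdam, Germany": (52.40, 13.07),
        "Potsdam, NY, USA": (44.67, -74.98),
    },
    "Concord": {"Concord, NH, USA": (43.21, -71.54)},
}


def bounding_box_area(points):
    """Area of the lat/lon box around the points, in square degrees.

    Assumption: this replaces the minimal bounding polygon of the text and
    ignores map distortion and the date line.
    """
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (max(lats) - min(lats)) * (max(lons) - min(lons))


def resolve(place_names):
    """Pick one referent per place name so that the bounding area is minimal."""
    # One referent per discourse: repeated mentions collapse to a single name.
    names = sorted(set(place_names))
    confusion_sets = [GAZETTEER[name].items() for name in names]
    # Each reading in the cross-product holds one (referent, coords) per name.
    best = min(
        product(*confusion_sets),
        key=lambda reading: bounding_box_area([coords for _, coords in reading]),
    )
    return {name: referent for name, (referent, _) in zip(names, best)}


print(resolve(["Berlin", "Potsdam"]))
print(resolve(["Berlin", "Concord", "Berlin"]))
```

With these toy coordinates, a Berlin mentioned alongside Potsdam resolves to the German capital, while a Berlin mentioned alongside Concord resolves to the one in New Hampshire. The cross-product grows exponentially with the number of ambiguous names, so a practical version would need pruning, and a box with zero extent in one direction has zero area, which a more careful area measure would avoid.</Paragraph>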
      <Paragraph position="8"> The value of this heuristic needs to be assessed quantitatively against various types of text.</Paragraph>
      <Paragraph position="9"> In resolving anatomical designators in text, we may employ a variation of the spatial minimality heuristic, based on the fact that no listing will ever be complete with respect to all the existing or newly minted synonyms for anatomical terms.</Paragraph>
      <Paragraph position="10"> When grounding the anatomical terms in the text "In subsequent stages until birth, cytokeratin 8 continues to be expressed in embryonic taste buds distributed in punctuate patterns at regular intervals along rows that are symmetrically located on both sides of the median sulcus in the dorsal anterior developing tongue."</Paragraph>
      <Paragraph position="11"> we find no median sulcus within the MAN, only alveolar sulcus, optic sulcus, pre-otic sulcus, sulcus limitans and sulcus terminalis. We simply assume that all anatomical terms refer to previously recognized anatomical entities, just as we assume that all geographic terms refer to existing geographic entities and not, for example, some new town called Berlin or London that is not yet in the gazetteer. Hence median sulcus is assumed to be a synonym for one of the five sulci given in the MAN. At this point, we can invoke the spatial minimality heuristic, looking for the minimal bounding space that includes tongue and one of the five sulci, here yielding sulcus terminalis. [Footnote 8, fragment: ... pairwise point-point distances, or symbolically, using a hierarchical gazetteer's relations, such as in-region-of.]</Paragraph>
      <Paragraph position="12"> Thus the spatial minimality heuristic is here used with other assumptions to resolve missing or previously unseen terms.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Visualization of Geo-Spatial Aspects in Narrative
</SectionTitle>
    <Paragraph position="0"> The usefulness of visual representations to convey information is widely recognized (cf. Larkin and Simon, 1987). Here, we use the grounding of named entities in news stories to create a visual surrogate that represents their spatial "aboutness". Two news stories were selected from online newspapers on the same day (2003-02-21): one story (Appendix B) reports the tragic death of a baby from London in a Glasgow hospital, despite its being flown to a Glasgow specialist hospital in the Royal aircraft (BBC News, 2003), and the other (Appendix C) describes the search by Californian police for a pregnant woman from Modesto, CA, who has disappeared (The Mercury News, 2003).</Paragraph>
    <Paragraph position="1"> We use the term surrogate to refer to a partial view of a text (e.g. Leidner, 2002). Figure 5 shows a textual surrogate in the form of all place names found in a text: an analyst who wants to get a quick overview of the locations involved in some item of news reportage, to decide its local interest or relevance, might find such a surrogate helpful, although the source would still have to be skim-read.</Paragraph>
    <Paragraph position="2"> Story A ... Scotland ... Tooting ... London ... Glasgow ... London ... Glasgow ... Northolt ... Glasgow ... Britain ... Prestwick ... Tooting ... Glasgow ... UK ... (Glasgow) ...</Paragraph>
    <Paragraph position="3"> Story B Modesto ... (Southern California) ... (Modesto) ... Los Angeles ... Sacramento ... Berkeley (Marina) ... Fresno ... Oakland ... Modesto ... Los Angeles ... Southern California ... Modesto ... Southern California ... New York ... Long Island ...</Paragraph>
    <Paragraph position="4"> Figure 5: A Textual Geo-Spatial Document Surrogate for the Stories in Appendices B and C.</Paragraph>
    <Paragraph position="5">  We now compare this baseline textual surrogate to a graphical map representation that draws on the algorithm introduced above. Our simple visualisation method comprises the following components (Figure 6): an (open-domain) news item is fed into locotagger, our simple named entity tagger for place names based on UN- [...] Scott, more than a dozen news crews</Paragraph>
    <Paragraph position="7"> &lt;/ENAMEX&gt; camped out front.</Paragraph>
    <Paragraph position="8"> From the text we obtain a vector of types of all spatial named entities with their frequency of occurrence in the text:</Paragraph>
    <Paragraph position="10"/>
    <Paragraph position="12"/>
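    <Paragraph> A frequency vector of this kind can be sketched with a simple counter. The snippet below is an illustration only: it takes the place names of Story A as listed in the textual surrogate (Figure 5) and, as an assumption, treats the parenthesised mention like any other, so the counts need not match the vector in the original paper. The names STORY_A_MENTIONS and frequencies are hypothetical.

```python
from collections import Counter

# Place-name mentions of Story A in order of occurrence, copied from the
# textual surrogate. Assumption: the parenthesised "(Glasgow)" is counted
# like an ordinary mention.
STORY_A_MENTIONS = [
    "Scotland", "Tooting", "London", "Glasgow", "London", "Glasgow",
    "Northolt", "Glasgow", "Britain", "Prestwick", "Tooting", "Glasgow",
    "UK", "Glasgow",
]

# Map each place-name type to its number of occurrences in the story.
frequencies = Counter(STORY_A_MENTIONS)

for place, count in frequencies.most_common():
    print(place, count)
```

Such counts could, for instance, be used to weight or label the points on the generated map, although the text does not say whether the frequencies are used in that way.</Paragraph>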
    <Paragraph position="14"> For simplicity, we drop those that correspond to regions (which are represented by sets of points) and feed the remaining list of point coordinates (corresponding to villages and cities) into a map generator to generate a Mercator projection of the geographical area that includes all the points plus 10% of the surrounding area. For this, The [...] pendix B12 [...] Figure 9 shows the map for the story in Appendix C. Clearly, such a visual surrogate is superior to the textual surrogate presented before with respect to comprehension time. It is interesting to see what happens if we leave out the final paragraph for the map creation (Figure 8): we obtain a zoomed-in version of the map.</Paragraph>
    <Paragraph position="15"> This turns out to be the case for many stories and is due to the convention in news reportage of closing a report by linking the narrative to similar events in order to present the event in a wider context.</Paragraph>
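    <Paragraph> The map extent described above can be sketched as a padded bounding box. This is one possible reading, given as an assumption: "plus 10% of the surrounding area" is interpreted here as a margin of 10% of the latitude and longitude extent on each side, and the function name map_extent and the sample coordinates are hypothetical. The projection itself is left to the map generator.

```python
def map_extent(points, margin=0.10):
    """Return (south, west, north, east) covering all points plus a margin.

    points: iterable of (latitude, longitude) pairs in degrees.
    margin: fraction of the extent added on every side (assumed reading
            of the 10% figure in the text).
    """
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    pad_lat = (max(lats) - min(lats)) * margin
    pad_lon = (max(lons) - min(lons)) * margin
    return (
        min(lats) - pad_lat,
        min(lons) - pad_lon,
        max(lats) + pad_lat,
        max(lons) + pad_lon,
    )


# Approximate coordinates for two cities from Story B (assumed values).
modesto = (37.64, -121.00)
los_angeles = (34.05, -118.24)
print(map_extent([modesto, los_angeles]))
```

Adding a far-away point, such as a place mentioned only in a closing paragraph, enlarges the box and so zooms the map out, which is the effect discussed above.</Paragraph>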
    <Paragraph position="16">  in Context (Global View; Complete Story).</Paragraph>
    <Paragraph position="17"> Shanon (1979) discusses how the granularity of the answers to where-questions depends on the reference points of speaker and listener (Where is the Empire State Building? (a) In New York, (b) In the U.S.A, (c) On 34th Street and 3rd Avenue); the map generation task depends on such levels of granularity in the sense that, to create a useful map, entities that belong to the same level of granularity or scale should be marked (e.g. city-city rather than village-continent).</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Question Answering
</SectionTitle>
    <Paragraph position="0"> Using grounding knowledge in gazetteers also enables us to answer questions in natural language more effectively:  A: [should yield a surrogate based on textual descriptions generated from the gazetteer relations: X is-type Y, X part-of Z and the coordinates, plus a map as generated above, with additional images, e.g. from satellites or picture search engines as available.] 3. What X is Y part of? Q: What is 'Bad Bergzabern' part of? A: Bad Bergzabern is part of the Federal Republic of Germany. Q: Is Andorra la Vella part of Spain? A: No, Andorra la Vella belongs to Andorra.</Paragraph>
    <Paragraph position="1"> 4. How far is X from Y? Q: How far is Cambridge from London? A: The distance between London, England, United Kingdom and Cambridge, England, United Kingdom is 79 km (49 miles or 43 nautical miles).</Paragraph>
    <Paragraph position="2"> Note here that the spatial minimality heuristic resolves Cambridge and London to places in the UK rather than, say, London, Ontario, Canada and Cambridge, Mass., USA. However, the answer makes clear the precise question being addressed, so the user can follow up with a different question if this was not what he or she intended. Since sophisticated gazetteers are available, answering such questions should not be based on textual extraction from Internet sources, but on the gazetteers directly, which reduces noise.</Paragraph>
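    <Paragraph> Once both place names are grounded, a distance question of this kind reduces to a computation on gazetteer coordinates. The sketch below uses the standard haversine great-circle formula; it is an illustration rather than the method used to produce the answer above, and the coordinates, the mean Earth radius of 6371 km, and the names haversine_km and answer_distance are assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed spherical Earth)
KM_PER_MILE = 1.609344
KM_PER_NAUTICAL_MILE = 1.852

# Approximate (latitude, longitude) in degrees; assumed gazetteer values.
LONDON = (51.5074, -0.1278)
CAMBRIDGE = (52.2053, 0.1218)


def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def answer_distance(name_a, p, name_b, q):
    """Phrase the distance in km, statute miles and nautical miles."""
    km = haversine_km(p, q)
    return (f"The distance between {name_a} and {name_b} is {km:.0f} km "
            f"({km / KM_PER_MILE:.0f} miles or "
            f"{km / KM_PER_NAUTICAL_MILE:.0f} nautical miles).")


print(answer_distance("London, England, United Kingdom", LONDON,
                      "Cambridge, England, United Kingdom", CAMBRIDGE))
```

With these assumed coordinates the result should come out close to the 79 km quoted in the answer above; the exact figure depends on which point the gazetteer stores for each city.</Paragraph>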
  </Section>
</Paper>