<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0106"> <Title>InfoXtract location normalization: a hybrid approach to geographic references in information extraction</Title> <Section position="5" start_page="0" end_page="7" type="metho"> <SectionTitle> 3 Previous Work and Issues </SectionTitle> <Paragraph position="0"> This paper is follow-up research to our previous work [Li et al. 2002]. It identifies several efficiency and performance issues in that work and addresses them with a modified approach.</Paragraph> <Paragraph position="1"> The previous algorithm [Li et al. 2002] for location normalization consisted of five steps.</Paragraph> <Paragraph position="2"> Step 1. Look up location names in the gazetteer to associate candidate senses with each location NE; Step 2. Call the pattern matching sub-module to resolve the ambiguity of the NEs involved in local patterns like &quot;Williamsville, New York, USA&quot;, retaining only one sense for each NE as early as possible; Step 3. Apply the 'one sense per discourse' principle [Gale et al. 1992] to each disambiguated location name, propagating the selected sense to its other mentions within a document; Step 4. Call the discourse sub-module, which is a graph search algorithm (Kruskal's algorithm), to resolve the remaining ambiguities; Step 5. If the decision score for a location name is lower than a threshold, choose the default sense of that name as the result.</Paragraph> <Paragraph position="3"> In this algorithm, Step 2, Step 4, and Step 5 complement each other and help produce better overall performance.</Paragraph> <Paragraph position="4"> Step 2 uses local context, that is, the words co-occurring around a location name. Local context can be a reliable source for deciding the sense of a location. The following are the most commonly used patterns for this purpose.</Paragraph> <Paragraph position="5"> (1) LOC + ',' + NP (headed by 'city') e.g. Chicago, an old city (2) 'city of' + LOC1 + ',' + LOC2 e.g. 
city of Albany, New York (3) 'city of' + LOC (4) 'state of' + LOC (5) LOC1 + ',' + LOC2 + ',' + LOC3 e.g. (i) Williamsville, New York, USA (ii) New York, Buffalo, USA (6) 'on'/'in' + LOC e.g. on Strawberry → ISLAND, in Key West → CITY</Paragraph> <Paragraph position="7"> Patterns (1), (3), (4), and (6) can be used to decide whether the location is a city, a state or an island, while patterns (2) and (5) can be used to determine both the sub-tag and its sense.</Paragraph> <Paragraph position="8"> Step 4 constructs a weighted graph in which each node represents a location sense and each edge represents a similarity weight between location names. The graph is only partially complete, since there are no links among the different senses of the same location name. The maximum weight spanning tree (MST) is calculated using Kruskal's MinST algorithm [Cormen et al. 1990]. The nodes on the resulting MST are the most promising senses of the location names.</Paragraph> <Paragraph position="9"> Figure 3 and Figure 4 show the graphs for calculating the MST. The dots in a circle indicate the number of senses of a location name.</Paragraph> <Paragraph position="10"> Through experiments, we found an efficiency problem in Step 4, which adopted Kruskal's algorithm for MST search to capture the impact of location co-occurrence in a discourse. While this algorithm works fairly well for short documents (e.g. most news articles), there is a serious time complexity issue when long documents contain numerous location names. The weighted graph is constructed by linking the sense nodes of each location with the sense nodes of the other locations. In addition, there is an associated performance issue: the edge weights computed by the previous method are not distinctive enough. 
We observe that the number of location mentions and the distance between location names affect the selection of location senses, but the previous method could not reflect these factors when weighting candidate senses.</Paragraph> <Paragraph position="11"> Finally, our research shows that default senses play a significant role in location normalization. For example, people refer to &quot;Los Angeles&quot; as the city in California more often than as the city of that name in the Philippines, Chile, Puerto Rico, or Texas in the USA.</Paragraph> <Paragraph position="12"> Unfortunately, the available Tipster Gazetteer (http://crl.nmsu.edu/cgi-bin/Tools/CLR/clrcat) does not mark default senses for most entries. It has 171,039 location entries with 237,916 senses, among which 30,711 location names are ambiguous.</Paragraph> <Paragraph position="13"> Manually tagging the default senses for over 30,000 location names is difficult; moreover, it is subject to inconsistency due to the different knowledge backgrounds of the human taggers. This problem was solved by developing a procedure to automatically extract default senses from web pages using the Yahoo! search engine [Li et al. 2002].</Paragraph> <Paragraph position="14"> Such a procedure has the advantage of enabling 're-training' of default senses when necessary. If the web pages obtained through Yahoo! represent a typical North American 'view' of what default sense should be assigned to location names, it may be desirable to re-train the default senses of location names using other views (e.g. an Asian or African view) when the system needs to handle overseas documents that contain many foreign location names.</Paragraph> <Paragraph position="15"> In addition to the above automatic default sense extraction, we later found that a few simple default sense heuristics, when used at proper levels, can further enhance performance. 
This finding is incorporated in our modified approach described in Section 4 below.</Paragraph> </Section> <Section position="6" start_page="7" end_page="7" type="metho"> <SectionTitle> 4 Modified Hybrid Approach </SectionTitle> <Paragraph position="0"> To address the issues identified in Section 3, we adopt Prim's algorithm, which traverses each node of a graph to choose the most promising senses.</Paragraph> <Paragraph position="1"> This algorithm has a much smaller search space and has the advantage of being able to reflect the number of location mentions and their distances in a document.</Paragraph> <Paragraph position="2"> The following is a description of our adapted Prim's algorithm for the weight calculation.</Paragraph> <Paragraph position="3"> The weight of each sense of a node is calculated by considering the effect of linked senses of other location nodes, based on a predefined weight table (Table 1) for the sense categories of co-occurring location names. For example, when a location name with a potential city sense co-occurs with a location name with a potential state/province sense and the city is in the state/province, the impact weight of the state/province name on the city name is fairly high, with the weight set to 3 as shown in the third row of Table 1.</Paragraph> <Paragraph position="5"> ) is the measure of distance between two locations. The final sense of a location is the one that has the maximum weight. A location name may be mentioned a number of times in a document. For each location name, we only count the location mention that has the maximum sense weight summation in equation (1) and eventually propagate the selected sense of this location mention to all its other mentions, based on the 'one sense per discourse' principle. 
Equation (2) refers to the sense with the maximum weight for Loc</Paragraph> <Paragraph position="7"> Through experiments, we also found that it is beneficial to select default senses when candidate location senses in the discourse analysis turn out to have the same weight. We included two kinds of default senses: heuristics-based default senses and the default senses extracted semi-automatically from the web using Yahoo!. For the first category of default senses, we observe that if a name has a country sense as well as other senses, such as &quot;China&quot; and &quot;Canada&quot;, the country sense is dominant in most cases. The situation is the same for a name with a province sense and for a name with a country capital sense (e.g. London, Beijing). The updated algorithm for location normalization is as follows.</Paragraph> <Paragraph position="8"> Step 1. Look up the location gazetteer to associate candidate senses with each location NE; Step 2. If a location has a country sense, select that sense as the default sense of that location (heuristics); Step 3. Call the pattern matching sub-module for local patterns like &quot;Williamsville, New York, USA&quot;; Step 4. Apply the 'one sense per discourse' principle to each disambiguated location name, propagating the selected sense to its other mentions within a document; Step 5. Apply default sense heuristics for a location with province or capital senses; Step 6. Call Prim's algorithm in the discourse sub-module to resolve the remaining ambiguities (Figure 5); Step 7. If the difference between the maximum sense weight and the next largest sense weight is equal to or lower than a threshold, choose the default sense of that name from the lexicon. Otherwise, choose the sense with the maximum weight as</Paragraph> </Section> </Paper>
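
The threshold test in Step 7 of the updated algorithm can be illustrated with a small sketch. This is not the InfoXtract implementation: the function name `choose_sense`, the dictionary of sense weights, the example weights, and the threshold value are all hypothetical, and the handling of a name with a single candidate sense is an assumption made only so that the example is complete.

```python
def choose_sense(weights, default_sense, threshold):
    """Sketch of the Step 7 decision for one location name.

    weights: non-empty dict mapping each candidate sense to the weight it
        received in the discourse analysis (hypothetical structure).
    default_sense: the default sense recorded for the name in the lexicon.
    threshold: margin at or below which the top two senses are treated as
        too close to call (hypothetical value, not taken from the paper).
    """
    # Rank candidate senses from the largest weight to the smallest.
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        # Assumption: a name with one candidate sense needs no tie-breaking.
        return ranked[0][0]
    best_sense, best_weight = ranked[0]
    second_weight = ranked[1][1]
    # A small margin means the discourse evidence is inconclusive,
    # so fall back to the default sense.
    if best_weight - second_weight <= threshold:
        return default_sense
    return best_sense


close_call = {"Los Angeles (California)": 5.0, "Los Angeles (Texas)": 4.5}
clear_win = {"Albany (New York)": 6.0, "Albany (Georgia)": 2.0}
print(choose_sense(close_call, "Los Angeles (California)", threshold=1.0))
print(choose_sense(clear_win, "Albany (Georgia)", threshold=1.0))
```

In the first call the margin of 0.5 does not exceed the threshold, so the default sense is returned; in the second the margin of 4.0 does, so the heaviest sense wins even though a different default was supplied.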