<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0103"> <Title>Semi-supervised learning of geographical gazetteers from the internet</Title> <Section position="3" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 The initial system </SectionTitle> <Paragraph position="0"> Below we describe the system we developed for our previous study. We use it as a reference point in our current work. However, we do not expect our new approach to perform better than the initial one -- the old system makes use of intelligently collected knowledge, whereas the new one must do all the work by itself.</Paragraph> <Paragraph position="1"> The initial algorithm works as follows. For each class we constructed a set of patterns. All the patterns have the form &quot;KEYWORD+of+X&quot; or &quot;X+KEYWORD&quot;. Each class has from 3 (ISLAND) up to 10 (MOUNTAIN) different keywords. For example, for the class ISLAND we have 3 keywords (&quot;island&quot;, &quot;islands&quot;, &quot;archipelago&quot;) and 5 corresponding patterns (&quot;X island&quot;, &quot;island of X&quot;, &quot;X islands&quot;, &quot;islands of X&quot;, &quot;X archipelago&quot;). Keywords and patterns were selected manually: we tested many different keyword candidates, collected counts (cf. below) for the patterns associated with a given candidate, then filtered most of them out using the t-test. The remaining patterns were checked by hand.</Paragraph> <Paragraph position="2"> For each location name to be classified, we construct queries, substituting this name for the X in our patterns.</Paragraph> <Paragraph position="3"> We do not use morphological variants here, because the morphology of proper names is quite irregular (compare, for example, the noun phrases Fijian government and Mali government -- in the first case the proper name is used with the suffix -an, and in the second case without it).</Paragraph> <Paragraph position="4"> The queries are sent to the AltaVista search engine. 
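The query-construction step just described can be sketched as follows. This is a minimal sketch: the ISLAND patterns are the five listed above, and any hit-count lookup would have to stand in for the (now defunct) AltaVista API, which is therefore not called here.

```python
# Sketch of pattern-based query construction for one class.
# The patterns for ISLAND are taken from the text; a search-engine
# hit-count lookup is assumed to exist elsewhere and is not modeled.

ISLAND_PATTERNS = ["X island", "island of X", "X islands",
                   "islands of X", "X archipelago"]

def build_queries(name, patterns):
    """Substitute a location name for the placeholder X in every pattern."""
    return [p.replace("X", name) for p in patterns]

queries = build_queries("Fiji", ISLAND_PATTERNS)
# e.g. ['Fiji island', 'island of Fiji', 'Fiji islands', ...]
```

Each resulting string is sent to the search engine as one query, and the page count it returns becomes the raw score for that (name, pattern) pair.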
The number of pages found by AltaVista for each query is then normalized by the number of pages for the item to be classified alone (the pattern &quot;X&quot;, without keywords). The obtained counts (normalized and raw) are then provided to a machine learner as features. In our previous work we compared two machine learners (C4.5 and TiMBL) for this task.</Paragraph> <Paragraph position="5"> In our present study we use the Ripper machine learner (Cohen, 1995). The main reason for this decision is the following: Ripper selects the most important features automatically, and the resulting classifier usually contains fewer features than, for example, the one produced by C4.5. This is very important when we want to classify many items (which is exactly what happens at the end of each bootstrapping loop in our approach), because obtaining values for the features is time-consuming.</Paragraph> <Paragraph position="6"> We use our training set (520 items, cf. above) to train Ripper. The testing results (on the remaining 740 items) are summarized in table 2. Compared to our original system as described in (Ourioupina, 2002), Ripper performed better than C4.5 and TiMBL on a smaller (320-word) training set, but slightly worse than the same learners in leave-one-out evaluation (i.e. on 1259-word training sets). Although the comparison was not performed on exactly the same data, it is nevertheless clear that Ripper's performance on this task is not worse than that of C4.5.</Paragraph> </Section> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 4 The bootstrapping approach </SectionTitle> <Paragraph position="0"> We start the processing from our 100-word lists. For each name on each list we go to AltaVista, ask for this name, and download pages containing it. Currently, we only download 100 pages for each word; however, this seems to be enough to obtain reliable patterns. In future we plan to download many more pages. 
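The feature construction described above (raw counts plus counts normalized by the count for the bare name) can be sketched as follows; the particular counts are illustrative, not taken from the paper.

```python
# Sketch of the feature vector handed to the machine learner:
# each pattern count is kept raw and also normalized by the page
# count for the name alone (the pattern "X" without keywords).

def feature_vector(pattern_counts, name_count):
    """Return raw counts followed by their normalized versions."""
    normalized = [c / name_count if name_count else 0.0
                  for c in pattern_counts]
    return pattern_counts + normalized

# two pattern counts and a count of 1000 pages for the bare name
feats = feature_vector([120, 45], 1000)
# -> [120, 45, 0.12, 0.045]
```

Guarding against a zero count for the bare name avoids division errors for names the search engine has never seen.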
We match the pages with a simple regular expression, extracting all the contexts up to 2 words to the left and 2 words to the right of the given name. We substitute &quot;X&quot; for the name in the contexts to produce patterns. [Table 3. Best patterns for ISLAND. Before rescoring: &quot;of X&quot; 70, &quot;the X&quot; 60, &quot;X and&quot; 58, &quot;X the&quot; 55, &quot;to X&quot; 53, &quot;in X&quot; 52, &quot;and X&quot; 47, &quot;X is&quot; 45, &quot;X in&quot; 45, &quot;on X&quot; 45. After rescoring: &quot;X island&quot; 17, &quot;island of X&quot; 9, &quot;X islands&quot; 8, &quot;island X&quot; 7, &quot;islands X&quot; 7, &quot;insel X&quot; 7, &quot;the island X&quot; 6, &quot;X elects&quot; 5, &quot;of X islands&quot; 5, &quot;zealand X&quot; 4. Final extraction patterns: &quot;X island&quot;, &quot;and X islands&quot;, &quot;insel X&quot;.] Afterwards, for each class we compile a list of the patterns used with the names of this class. We score them by the number of names they were extracted by. The left column of table 3 shows the best patterns for the class ISLAND after this procedure. Overall we had 27190 patterns for ISLAND.</Paragraph> <Paragraph position="1"> Obviously, patterns such as &quot;of X&quot; cannot really help in classifying something as ISLAND, because they are too general. Usually the most general patterns are discarded with the help of stop-word lists. However, this approach is not feasible when dealing with such a huge noisy dataset as the Internet. Therefore we have chosen another solution: we rescore the patterns, exploiting the idea that general patterns should originally have high scores for several classes. Thus, we can compare the results for all the lists and penalize the patterns appearing in more than one of them. 
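The context-extraction step (up to two words on either side of the name, with the name replaced by &quot;X&quot;) can be sketched with a simple regular expression like the one below; the exact expression used in the paper is not given, so this is only an illustrative reconstruction.

```python
import re

# Sketch of context extraction: capture up to two words to the left
# and right of a name, then substitute "X" for the name itself.

WORD = r"[\w'-]+"

def extract_patterns(text, name):
    ctx = re.compile(
        r"((?:%s )?(?:%s )?)%s((?: %s)?(?: %s)?)"
        % (WORD, WORD, re.escape(name), WORD, WORD))
    return [(left + "X" + right).strip()
            for left, right in ctx.findall(text)]

pats = extract_patterns("We sailed to the island of Corfu last summer.",
                        "Corfu")
# -> ['island of X last summer']
```

Each extracted context string then counts once per name toward the score of the corresponding pattern.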
Currently we use a very simple formula for calculating new scores: the penalties for all the classes, except the one we are interested in, are summed up and then subtracted from the original score: score_new(P, C) = score(P, C) - SUM over C' != C of score(P, C'). This rescoring is repeated in every bootstrapping loop.</Paragraph> <Paragraph position="2"> The second column of table 3 shows the best patterns for ISLAND after rescoring. Of the 27190 patterns collected, only 250 have new scores above 1. As can be seen, our simple rescoring strategy allows us to focus on more specific patterns.</Paragraph> <Paragraph position="3"> In future we plan to investigate the patterns' distributions over classes in more detail, searching for patterns that are common to two or three classes but appear rather rarely with items of the other classes; for example, CITIES, REGIONS, COUNTRIES, and some ISLANDS (but not RIVERS or MOUNTAINS) often appear in constructions such as &quot;population of X&quot;. This would allow us to organize the classes in a hierarchical way, possibly leading to useful generalizations.</Paragraph> <Paragraph position="4"> As the third step, we take the best patterns (currently the 20 best patterns are considered) and use them in the same way we used the manually preselected patterns in the initial system: for each name in the training set, we substitute this name for X in all our patterns, go to the AltaVista search engine, and collect the corresponding counts. We normalize them by the count for the name alone. Normalized and raw counts are provided to the Ripper machine learner.</Paragraph> <Paragraph position="5"> We use Ripper to produce three classifiers, varying the parameter &quot;Loss Ratio&quot; (the ratio of the cost of a false negative to the cost of a false positive). In future we plan to do a better optimization, including more parameters.</Paragraph> <Paragraph position="6"> Changing the loss ratio parameter, we get three classifiers. From these we can choose the ones with the best recall, precision, and overall accuracy. 
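The rescoring strategy described above can be sketched as follows; the class names and scores in the example are illustrative (the &quot;of X&quot; score matches the spirit of table 3, where overly general patterns collapse after rescoring).

```python
# Sketch of pattern rescoring: a pattern's score for the target class
# is penalized by the sum of its scores for all competing classes, so
# general patterns such as "of X" that fire for every class drop out.

def rescore(scores_by_class, target):
    """scores_by_class: {class: {pattern: score}}. Returns rescored dict."""
    rescored = {}
    for pattern, score in scores_by_class[target].items():
        penalty = sum(scores_by_class[c].get(pattern, 0)
                      for c in scores_by_class if c != target)
        rescored[pattern] = score - penalty
    return rescored

scores = {"ISLAND": {"of X": 70, "X island": 17},
          "RIVER":  {"of X": 65, "X river": 20},
          "CITY":   {"of X": 60}}
print(rescore(scores, "ISLAND"))   # {'of X': -55, 'X island': 17}
```

After rescoring, only patterns with scores above a small threshold (above 1 in the paper) are kept as candidates.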
Recall, precision and accuracy are measured in the common way:</Paragraph> <Paragraph position="7"> recall = tp / (tp + fn), precision = tp / (tp + fp), accuracy = (tp + tn) / (tp + tn + fp + fn), where tp, fp, tn, and fn are the numbers of true positives, false positives, true negatives, and false negatives, respectively.</Paragraph> <Paragraph position="8"> Table 4 shows the classifiers learned for the class ISLAND (# stands for the AltaVista count of the corresponding query).</Paragraph> <Paragraph position="9"> The classifier with the best precision usually contains fewer rules than the one with the best recall. So, we take all the patterns from the best-recall classifier. We are, of course, only interested in patterns providing positive information, leaving aside such patterns as &quot;X geography&quot; in our high-accuracy ISLAND classifier. The right column of table 3 shows the final set of extraction patterns for the class ISLAND.</Paragraph> <Paragraph position="10"> At this stage we swap the roles of patterns and names.</Paragraph> <Paragraph position="11"> We go to the Internet and download web pages containing our extraction patterns. Currently we use only 2000 pages per pattern, because we want to be able to check the results (at least for some classes) to evaluate the approach. Technically, this step goes as follows: each pattern has the form &quot;LEFT X RIGHT&quot;, where LEFT and RIGHT contain from 0 to 2 words. We ask AltaVista for all the pages containing LEFT and RIGHT simultaneously. Then we check whether our pattern occurs in the returned files and, if so, how exactly X is realized. As we are looking for place names, only words beginning with capital letters are included.</Paragraph> <Paragraph position="12"> After this step we have a big list of candidate names for each class. We have a small list of stop words (&quot;A(n)&quot;, &quot;The&quot;, &quot;Every&quot;,...). These items are discarded. It must be noted that the stop list is not strictly necessary -- at the next step all those candidates would be discarded anyway, but, as they appear very often, the stop list saves some processing time. 
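The name-extraction step, in which the roles of patterns and names are swapped, can be sketched as follows. The stop list entries are those given above; turning a &quot;LEFT X RIGHT&quot; pattern into a regular expression whose X slot accepts only capitalized words is this sketch's reconstruction, not the paper's exact implementation.

```python
import re

# Sketch of extracting candidate place names: a pattern "LEFT X RIGHT"
# becomes a regex whose X slot matches only capitalized words, and a
# small stop list filters out determiners that slip through.

STOP_WORDS = {"A", "An", "The", "Every"}

def extract_names(pattern, page_text):
    left, right = pattern.split("X")
    rx = re.compile(re.escape(left) + r"([A-Z][\w'-]*)" + re.escape(right))
    names = rx.findall(page_text)
    return [n for n in names if n not in STOP_WORDS]

text = "the island of Corfu and the island of The"
print(extract_names("island of X", text))   # ['Corfu']
```

The surviving candidates are then passed to the high-precision classifier, which makes the final accept/reject decision.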
For the class ISLAND we obtained 573 items (recall that we download only the first 2000 pages per pattern).</Paragraph> <Paragraph position="13"> Afterwards we take the high-precision classifier and run it on the collected items. The names that the classifier rejects are discarded. After this procedure we have 134 new names for the class ISLAND.</Paragraph> <Paragraph position="14"> The remaining items are added to the temporary lexicon. They are used for the next iteration of the bootstrapping loop. All the following iterations resemble the first one (described above). There are only minor differences to be mentioned. After the first loop, the word lists for different classes have different sizes (at the beginning they all contained 100 items). Therefore we must adjust the scores in our rescoring formula:</Paragraph> <Paragraph position="15"> score_new(P, C) = score(P, C) / n_C - SUM over C' != C of score(P, C') / n_C', where n_C is the current size of the word list for class C.</Paragraph> <Paragraph position="16"> It must also be mentioned that we use the new items only for extraction, but not for machine learning. This helps us to control the system's performance. We do not have any stopping criteria: even when the classifiers do not improve anymore, the system can still extract new place names.</Paragraph> <Paragraph position="17"> The whole approach is depicted in figure 1.</Paragraph> </Section> </Paper>