File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/03/w03-0103_concl.xml
Size: 3,023 bytes
Last Modified: 2025-10-06 13:53:34
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0103"> <Title>Semi-supervised learning of geographical gazetteers from the internet</Title> <Section position="6" start_page="1" end_page="1" type="concl"> <SectionTitle> 6 Conclusion and future work </SectionTitle> <Paragraph position="0"> We described an approach to the automatic acquisition of geographical gazetteers from the Internet. By applying bootstrapping techniques, we are able to learn new gazetteers starting from a small set of preclassified examples. This approach can be particularly helpful for the Named Entity Recognition task in languages for which no manually collected geographical resources are available.</Paragraph> <Paragraph position="1"> Apart from gazetteers, our system produces classifiers.</Paragraph> <Paragraph position="2"> They use Internet counts (acquired from the AltaVista search engine) to classify any entity online. Unlike gazetteers, classifiers also provide negative information: the fact that Washington is not a RIVER can be obtained from a classifier, whereas a gazetteer can only tell us that it does not contain any Washington river, which leaves open the chance that such a river exists.</Paragraph> <Paragraph position="3"> The bootstrapping approach performed reasonably well on this task, reaching 86.5% accuracy on average after the second iteration. Moreover, a high degree of control over the noise allows the system to improve exactly on the classes with originally poor performance (CITY and REGION).</Paragraph> <Paragraph position="4"> There is still a lot of work to be done. First, we plan to include new classes, such as SEA, and organize them in a hierarchy. In this case we will have to investigate the distribution of patterns over classes more carefully and elaborate our rescoring strategy.</Paragraph> <Paragraph position="5"> Second, we plan to extend our approach to cover multi-word expressions. 
Half of this problem is already solved: our classifiers can deal with such names as Sri Lanka. What remains is to adjust our item extraction step to this task.</Paragraph> <Paragraph position="6"> We also plan to investigate more sophisticated sampling techniques that would remove the need for fully classified initial data. Although our first experiments with learning from positive examples only were not very successful, we still hope to solve this problem. It would allow us to simply download seed datasets from the Internet and start processing with these partially classified data, instead of compiling a high-quality seed gazetteer manually.</Paragraph> <Paragraph position="7"> Finally, we plan two related experiments. The same approach can be used for classifying names into locations instead of types (for example, Edmonton is in Alberta/Canada). We also want to try the same algorithm in another language, preferably one with a non-Latin alphabet. The output may be quite useful, as few geographical knowledge bases are available for languages other than English.</Paragraph> </Section> </Paper>