File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/p03-2031_intro.xml
Size: 1,875 bytes
Last Modified: 2025-10-06 14:01:50
<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2031"> <Title>Automatic Acquisition of Named Entity Tagged Corpus from World Wide Web</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The current trend in Named Entity Recognition (NER) is to apply machine learning approaches, which are attractive because they are trainable and adaptable; consequently, porting a machine learning system to another domain is much easier than porting a rule-based one. Various supervised learning methods have been applied successfully to Named Entity (NE) tasks and have shown reasonably satisfactory performance (Zhou and Su, 2002; Borthwick et al., 1998; Sassano and Utsuro, 2000). However, most of these systems rely heavily on a tagged corpus for training. A machine learning approach requires a large corpus to circumvent the data sparseness problem, but the dilemma is that the cost of annotating a large training corpus is non-trivial.</Paragraph> <Paragraph position="1"> In this paper, we suggest a method that automatically constructs an NE tagged corpus from the web to be used for training NER systems. We use an NE list and a web search engine to collect web documents that contain the NE instances. The documents are refined through sentence separation and text refinement procedures, and the NE instances are finally annotated with the appropriate NE categories.</Paragraph> <Paragraph position="2"> This automatically tagged corpus may be of lower quality than manually tagged ones, but its size can be increased almost without limit and without any human effort. To verify the usefulness of the constructed NE tagged corpus, we use it to train an NER system and compare the results with those obtained using a manually tagged corpus.</Paragraph> </Section> </Paper>