File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/p03-1043_intro.xml
Size: 2,002 bytes
Last Modified: 2025-10-06 14:01:50
<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1043">
<Title>A Bootstrapping Approach to Named Entity Classification Using Successive Learners</Title>
<Section position="3" start_page="1" end_page="1" type="intro">
<SectionTitle> 2 System Design </SectionTitle>
<Paragraph position="0"> Figure 1 shows the overall system architecture.</Paragraph>
<Paragraph position="1"> Before bootstrapping starts, a large raw training corpus is parsed by the English parser from our InfoXtract system (Srihari et al. 2003).</Paragraph>
<Paragraph position="2"> The bootstrapping experiment reported in this paper is based on a corpus containing ~100,000 news articles and a total of ~88,000,000 words.</Paragraph>
<Paragraph position="3"> The parsed corpus is saved into a repository, which supports fast retrieval through a keyword-based indexing scheme.</Paragraph>
<Paragraph position="4"> Although the parsing-based NE learner suffers from limited recall, its learned rules can be applied to a huge parsed corpus. In other words, the availability of an almost unlimited raw corpus compensates for the modest recall. As a result, large quantities of NE instances are acquired automatically. An automatically annotated NE corpus can then be constructed by extracting the tagged instances, together with their neighboring words, from the repository.</Paragraph>
<Paragraph position="5"> The bootstrapping is performed as follows: 1. Concept-based seeds are provided by the user.</Paragraph>
<Paragraph position="6"> 2. Parsing structures involving concept-based seeds are retrieved from the repository to train a decision list for NE classification. 3. The learned rules are applied to the NE candidates stored in the repository.</Paragraph>
<Paragraph position="7"> 4. The proper names tagged in Step 3 and their neighboring words are assembled into an NE-annotated corpus.</Paragraph>
<Paragraph position="8"> 5. An HMM is trained on the annotated corpus.</Paragraph>
</Section>
</Paper>
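<!--
The five bootstrapping steps listed in the section above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy REPOSITORY, the SEEDS values, the relation names, the smoothed log-ratio scoring, and every function name are hypothetical, and Step 5 only gathers the transition and emission counts from which an HMM could be estimated.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy repository standing in for the parsed corpus: each entry
# pairs a head word with the parsing relations it takes part in.
REPOSITORY = [
    {"head": "he", "is_name": False, "relations": [("subj_of", "say"), ("poss_of", "wife")]},
    {"head": "woman", "is_name": False, "relations": [("subj_of", "say")]},
    {"head": "company", "is_name": False, "relations": [("subj_of", "acquire")]},
    {"head": "firm", "is_name": False, "relations": [("subj_of", "acquire"), ("obj_of", "sue")]},
    {"head": "Smith", "is_name": True, "relations": [("subj_of", "say")],
     "context": ["Mr.", "Smith", "said"]},
    {"head": "Acme", "is_name": True, "relations": [("subj_of", "acquire")],
     "context": ["Acme", "acquired", "it"]},
    {"head": "Zorg", "is_name": True, "relations": [("obj_of", "visit")],
     "context": ["visited", "Zorg", "."]},
]

# Step 1: concept-based seeds supplied by the user (illustrative values only).
SEEDS = {"he": "PERSON", "woman": "PERSON", "company": "ORG", "firm": "ORG"}


def train_decision_list(repository, seeds, smoothing=0.1):
    """Step 2: turn parsing structures that involve seeds into ranked rules."""
    counts = defaultdict(Counter)
    for entry in repository:
        tag = seeds.get(entry["head"])
        if tag is None:
            continue  # only seed occurrences contribute training evidence
        for relation in entry["relations"]:
            counts[relation][tag] += 1
    rules = []
    for relation, tag_counts in counts.items():
        total = sum(tag_counts.values())
        for tag in sorted(set(seeds.values())):
            hit = tag_counts[tag] + smoothing
            miss = total - tag_counts[tag] + smoothing
            score = math.log(hit / miss)  # assumed scoring: smoothed log ratio
            if score > 0:
                rules.append((score, relation, tag))
    # Strongest evidence first; ties are broken deterministically.
    rules.sort(key=lambda rule: (-rule[0], rule[1], rule[2]))
    return rules


def classify(entry, rules):
    """Step 3: the first (highest-scoring) matching rule decides the tag."""
    relations = set(entry["relations"])
    for _score, relation, tag in rules:
        if relation in relations:
            return tag
    return None  # no rule fires, so the candidate stays untagged


def build_annotated_corpus(repository, rules):
    """Step 4: keep tagged proper names together with their neighboring words."""
    corpus = []
    for entry in repository:
        if not entry["is_name"]:
            continue
        tag = classify(entry, rules)
        if tag is not None:
            corpus.append([(token, tag if token == entry["head"] else "O")
                           for token in entry["context"]])
    return corpus


def collect_hmm_counts(corpus):
    """Step 5 (simplified): count tag transitions and tag/word emissions."""
    transitions, emissions = Counter(), Counter()
    for sentence in corpus:
        previous = "<s>"  # sentence-start state
        for token, tag in sentence:
            transitions[(previous, tag)] += 1
            emissions[(tag, token)] += 1
            previous = tag
    return transitions, emissions


rules = train_decision_list(REPOSITORY, SEEDS)
corpus = build_annotated_corpus(REPOSITORY, rules)
transitions, emissions = collect_hmm_counts(corpus)
```

In this toy run, "Zorg" matches no learned rule and is left out of the annotated corpus, which mirrors the limited recall described in the section; with a much larger repository, the remaining tagged instances would still supply ample training data for the second learner.
-->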