<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2020"> <Title>Automatic Collection of Related Terms from the Web</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 System </SectionTitle> <Paragraph position="0"> Figure 1 shows the configuration of the system. The system consists of three steps: compiling a corpus, automatic term recognition (ATR), and filtering. The system is implemented for Japanese.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Compiling corpus </SectionTitle> <Paragraph position="0"> The first step, compiling a corpus, produces a corpus</Paragraph> <Paragraph position="2"> for a seed term s. In general, compiling a corpus means selecting appropriate passages from a document set. We use the Web as the document set and select the passages that describe s for the corpus. The actual procedure for compiling the corpus is: 1. Web page collection: For a given seed term s, the system first makes four queries: &quot;s toha&quot;, &quot;s toiu&quot;, &quot;s ha&quot;, and &quot;s&quot;, where toha, ha, and toiu are Japanese function words that are often used for defining or explaining a term. Then, using a search engine, the system collects up to K (= 100) top-ranked pages for each query. If a collected page has a link whose anchor string is s, the system also collects the linked page.</Paragraph> </Section> </Section> </Paper>