File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/p03-2031_metho.xml
Size: 10,353 bytes
Last Modified: 2025-10-06 14:08:21
<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2031"> <Title>Automatic Acquisition of Named Entity Tagged Corpus from World Wide Web</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Automatic Acquisition of an NE Tagged Corpus </SectionTitle> <Paragraph position="0"> We focus only on the three major NE categories (person, organization, and location) because the other categories are relatively easy to recognize and these three are the ones that actually suffer from a shortage of NE tagged corpora.</Paragraph> <Paragraph position="1"> A great deal of linguistic information is already publicly available in written form on the web, and its quantity is growing almost without limit. The web can be regarded as an effectively unlimited language resource that contains many NE instances in diverse contexts. Our key idea is to automatically mark such NE instances with appropriate category labels using pre-compiled NE lists. However, this marking process requires some general and language-specific considerations because of the word ambiguity and boundary ambiguity of NE instances. To overcome these ambiguities, the automatic generation of the NE tagged corpus proceeds in four steps. The process first collects web documents using a web search engine fed with the NE entries, and then segments them into sentences. Next, the sentences are refined and filtered by several heuristics. Finally, an NE instance in each remaining sentence is tagged with the appropriate NE category label. Figure 1 illustrates the entire procedure for automatically generating the NE tagged corpus.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Collecting Web Documents </SectionTitle> <Paragraph position="0"> Collecting documents from the web at random is not appropriate for our purpose. 
This is because not all web documents contain NE instances, and we do not have a list of all the NE instances that occur in web documents. We need to collect web documents that are guaranteed to contain at least one NE instance, and we must also know its category in order to annotate it automatically. This can be accomplished by querying a web search engine with a pre-compiled NE list.</Paragraph> <Paragraph position="1"> As queries to the search engine, we used a list of Korean named entities composed of 937 person names, 1,000 locations and 1,050 organizations. Using a part-of-speech dictionary, we removed ambiguous entries, that is, entries that are not proper nouns in other contexts, to reduce errors in the automatic annotation.</Paragraph> <Paragraph position="2"> For example, '경기 (kyunggi, Kyunggi/business conditions/a game)' is filtered out because it denotes a location (proper noun) in one context but means business conditions or a game (common noun) in other contexts. By submitting the NE entries as queries to a search engine, we obtained at most 500 URLs for each entry. A web robot then visits the web sites in the URL list and fetches the corresponding web documents.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Splitting into Sentences </SectionTitle> <Paragraph position="0"> The features used in most NER systems can be classified into two groups according to their distance from a target NE instance. The first group includes internal features of the NE itself and context features within a small word window or a sentence boundary; the second includes name alias and co-reference information beyond a sentence boundary. In fact, it is not easy to extract name alias and co-reference information directly, even from a manually tagged NE corpus, and doing so requires additional knowledge or resources. This leads us to focus on automatic annotation at the sentence level, not the document level. 
Therefore, in this step, we split the texts of the collected documents into sentences using the method of (Shim et al., 2002) and remove sentences without target NE instances.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Refining the Web Texts </SectionTitle> <Paragraph position="0"> The collected web documents may include texts that were matched by mistake, because most web search engines for Korean use n-gram matching, especially bi-gram matching. We therefore refine the sentences to exclude these erroneous matches. Sentence refinement is accomplished by three different processes: separation of functional words, segmentation of compound nouns, and verification of the usefulness of the extracted sentences.</Paragraph> <Paragraph position="1"> An NE is often concatenated with more than one josa, a Korean functional word, to compose a Korean word. Therefore, we need to separate the functional words from an NE instance to detect the boundary of the NE instance. This is achieved by a part-of-speech tagger, POSTAG, which can detect unknown words (Lee et al., 2002). The separation of functional words has a further benefit: we can resolve the ambiguity between an NE and a common noun plus functional words, and filter out erroneous matches. For example, '경기도 (kyunggi-do)' can be interpreted as either '경기도 (Kyunggi Province)' or '경기+도 (a game also)' according to its context. We can remove sentences containing the latter case.</Paragraph> <Paragraph position="2"> A josa-separated Korean word can be a compound noun that merely contains a target NE as a substring. We therefore need to segment the compound noun into its correct single nouns and match them against the target NE. If none of the segmented single nouns matches the target NE, the sentence can be filtered out. 
For example, when we search for the NE entry '핑클 (Fin.K.L, a Korean singer group)', we may actually retrieve sentences including '서핑클럽 (surfing club)'. The compound noun '서핑클럽' can be divided into '서핑 (surfing)' and '클럽 (club)' by a compound-noun segmenting method (Yun et al., 1997). Since neither '서핑' nor '클럽' matches our target NE '핑클', we can delete such sentences. Even if a sentence contains a correct target NE, it is not useful for an NE tagged corpus if it has no context information. We also removed such sentences.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Generating an NE tagged corpus </SectionTitle> <Paragraph position="0"> The sentences selected by the refining process explained in the previous section are finally annotated with the NE label. Through this automatic annotation process, we acquired an NE tagged corpus containing 68,793 NE instances. We can annotate only one NE instance per sentence, but we can increase the size of the corpus almost without limit because the web provides practically unlimited data and our process is fully automatic.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experimental Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Usefulness of the Automatically Tagged Corpus </SectionTitle> <Paragraph position="0"> For learning to be effective, both the size and the accuracy of the training corpus are important.</Paragraph> <Paragraph position="1"> Generally, the accuracy of an automatically created NE tagged corpus is worse than that of a hand-made corpus. Therefore, it is important to examine the usefulness of our automatically tagged corpus compared to the manual corpus. We ran decision list learning separately on the automatically annotated corpus and on the hand-made one, and compared the performance. 
Table 1 shows the details of the corpus used in our experiments. The results in Table 2 show that the performance with the automatic corpus is superior to that with only the seeds and comparable to that with the manual corpus. Moreover, the domain of the manual training corpus is the same as that of the test corpus, i.e., news and novels, while the domain of the automatic corpus is as unrestricted as the web itself. This indicates that the performance with the automatic corpus should be regarded as much higher than that with the manual corpus, because performance generally degrades when a learned system is applied to domains different from the one it was trained on. Also, the automatic corpus is largely self-contained, since performance does not improve much when we use both the manual corpus and the automatic corpus for training.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Size of the Automatically Tagged Corpus </SectionTitle> <Paragraph position="0"> As another experiment, we investigated how large an automatic corpus we need to generate to obtain satisfactory performance. We measured the performance as a function of the size of the automatic corpus. We carried out the experiment with the decision list learning method, and the result is shown in Table 3. Here, 5% corresponds to the size of the manual corpus. When we trained with that amount of the automatic corpus, the performance was very low compared to the performance with the manual corpus. The reason is that the automatic corpus is composed of sentences retrieved with fewer named entities and therefore has less lexical and contextual information than a manual corpus of the same size. However, automatic generation has the great merit that the size of the corpus can be increased almost without limit at little cost. 
From Table 3, we can see that the performance improves as the size of the automatic corpus increases. As a result, the NER system trained with the whole automatic corpus outperforms the NER system trained with the manual corpus.</Paragraph> <Paragraph position="1"> We also conducted an experiment to examine the saturation point of the performance with respect to the size of the automatic corpus. This experiment focused only on the 'person' category, and the result is shown in Table 4. For the 'person' category, we can see that the performance no longer increases once the corpus size exceeds 1.2 million words.</Paragraph> </Section> </Section> </Paper>
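<!--
The filtering and tagging rules of Sections 2.1, 2.3 and 2.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names NEEntry, remove_ambiguous and tag_sentence, the "O" label for non-NE tokens, and the choice to tag only the first matching token are all hypothetical. The input tokens are assumed to have been josa-separated and compound-segmented already (the paper uses POSTAG and the method of Yun et al., 1997 for those steps, which are not reproduced here).

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class NEEntry:
    """One entry of the pre-compiled NE list (hypothetical structure)."""
    surface: str   # the name as written, e.g. "핑클"
    category: str  # "person", "organization" or "location"


def remove_ambiguous(entries: List[NEEntry], common_nouns: Set[str]) -> List[NEEntry]:
    """Drop entries that the POS dictionary also lists as common nouns."""
    return [e for e in entries if e.surface not in common_nouns]


def tag_sentence(tokens: List[str], entry: NEEntry) -> Optional[List[Tuple[str, str]]]:
    """Label one NE instance in a sentence, or return None to discard it.

    A sentence is discarded when no whole token equals the target NE
    (an erroneous n-gram match) or when the NE is the only token
    (no context information).
    """
    if entry.surface not in tokens:
        return None
    if len(tokens) == 1:
        return None
    first = tokens.index(entry.surface)  # assumption: tag only the first match
    return [
        (tok, entry.category if i == first else "O")
        for i, tok in enumerate(tokens)
    ]
```

With this sketch, a query for '핑클' that retrieves a sentence whose segmented tokens are '서핑' and '클럽' yields None, because the target is only a substring of the unsegmented compound and never a token in its own right. Exact token matching after segmentation is what keeps such sentences out of the corpus.
-->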