<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1026"> <Title>Organizing Encyclopedic Knowledge based on the Web and its Application to Question Answering</Title> <Section position="3" start_page="0" end_page="2" type="metho"> <SectionTitle> 2 System Design 2.1 Overview </SectionTitle> <Paragraph position="0"> Figure 1 depicts the overall design of our system, which generates an encyclopedia for input terms.</Paragraph> <Paragraph position="1"> Our system, which is currently implemented for Japanese, consists of three modules: &quot;retrieval,&quot; &quot;extraction&quot; and &quot;organization,&quot; among which the organization module is newly introduced in this paper. In principle, the remaining two modules (&quot;retrieval&quot; and &quot;extraction&quot;) are the same as proposed by Fujii and Ishikawa (2000).</Paragraph> <Paragraph position="2"> In Figure 1, terms can be submitted either on-line or off-line. A reasonable method is that while the system periodically updates the encyclopedia off-line, terms unindexed in the encyclopedia are dynamically processed in real-time usage. In either case, our system processes input terms one by one.</Paragraph> <Paragraph position="3"> We briefly explain each module in the following three sections, respectively.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Retrieval </SectionTitle> <Paragraph position="0"> The retrieval module searches the Web for pages containing an input term, for which existing Web search engines can be used, and those with broad coverage are desirable.</Paragraph> <Paragraph position="1"> However, search engines performing query expansion are not always desirable, because they usually retrieve a number of pages which do not contain an input keyword. Since the extraction module (see Section 2.3) analyzes the usage of the input term in retrieved pages, pages not containing the term are of no use for our purpose.</Paragraph> <Paragraph position="2"> Thus, we use as the retrieval module &quot;Google,&quot; which is one of the major search engines and does not conduct query expansion</Paragraph> <Paragraph position="4"/> </Section> <Section position="2" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 2.3 Extraction </SectionTitle> <Paragraph position="0"> In the extraction module, given Web pages containing an input term, newline codes, redundant white spaces and HTML tags that are not used in the following processes are discarded to standardize the page format.</Paragraph> <Paragraph position="1"> Second, we approximately identify a region describing the term in the page, for which two rules are used.</Paragraph> <Paragraph position="3"> The first rule is based on Japanese linguistic patterns typically used for term descriptions, such as &quot;X toha Y dearu (X is Y).&quot; Following the method proposed by Fujii and Ishikawa (2000), we semi-automatically produced 20 patterns based on the Japanese CD-ROM World Encyclopedia (Heibonsha, 1998), which includes approximately 80,000 entries related to various fields. It is expected that a region including the sentence that matched with one of those patterns can be a term description.</Paragraph> <Paragraph position="4"> The second rule is based on HTML layout. In a typical case, a term in question is highlighted as a heading with tags such as <DT>, <B> and <Hx> (&quot;x&quot; denotes a digit), followed by its description. 
<Paragraph position="4"> Finally, based on the region roughly identified by the above method, we extract a page fragment as a term description. Since term descriptions usually consist of a logical segment (such as a paragraph) rather than a single sentence, we extract a fragment that matches one of the following patterns, sorted in descending order of preference: 1. a description tagged with <DD> in the case where the term is tagged with <DT>, 2. a paragraph tagged with <P>, 3. an itemization tagged with <UL>, 4. N sentences, where we empirically set N = 3.</Paragraph> </Section>
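A minimal sketch of this preference-ordered extraction, using regular expressions over the standardized page text (a real implementation would likely use a proper HTML parser; the function name and regexes are our own):

```python
import re
from typing import Optional

def extract_description(region: str, n_sentences: int = 3) -> Optional[str]:
    """Extract a page fragment serving as a term description, trying the
    extraction patterns from most to least preferred."""
    # 1. A <DD> element, used when the term itself was tagged with <DT>.
    m = re.search(r"<DD>(.*?)(?:</DD>|<DT>|$)", region, re.S | re.I)
    if m:
        return m.group(1).strip()
    # 2. A paragraph tagged with <P>.
    m = re.search(r"<P>(.*?)(?:</P>|<P>|$)", region, re.S | re.I)
    if m:
        return m.group(1).strip()
    # 3. An itemization tagged with <UL>.
    m = re.search(r"<UL>(.*?)</UL>", region, re.S | re.I)
    if m:
        return m.group(0).strip()
    # 4. Fall back to the first N sentences (N = 3 in the paper).
    plain = re.sub(r"<[^>]+>", "", region)
    sentences = re.split(r"(?<=[。.])", plain)
    return "".join(sentences[:n_sentences]).strip() or None
```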
<Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.4 Organization </SectionTitle> <Paragraph position="0"> As discussed in Section 1, organizing information extracted from the Web is crucial in our framework. For this purpose, we classify extracted term descriptions based on word senses and domains.</Paragraph> <Paragraph position="1"> Although a number of methods have been proposed to induce word senses (for example, one based on the vector space model (Schütze, 1998)), it is still difficult to accurately identify word senses without explicit dictionaries that define sense candidates.</Paragraph> <Paragraph position="2"> Instead, since word senses are often associated with domains (Yarowsky, 1995), word senses can consequently be distinguished by determining the domain of each description. For example, the different senses of &quot;pipeline (processing method/transportation pipe)&quot; are associated with the computer and construction domains (fields), respectively.</Paragraph> <Paragraph position="3"> To sum up, the organization module classifies term descriptions based on domains, for which we use domain and description models. In Section 3, we elaborate on our organization model.</Paragraph> <Paragraph position="4"> Note that <DT> and <DD> are inherently provided to describe terms in HTML.</Paragraph> </Section> </Section> <Section position="4" start_page="2" end_page="3" type="metho"> <SectionTitle> 3 Statistical Organization Model </SectionTitle> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Overview </SectionTitle> <Paragraph position="0"> Given one or more (in most cases more than one) descriptions for a single input term, the organization module selects the appropriate description(s) for each domain related to the term.</Paragraph> <Paragraph position="1"> We do not need all the extracted descriptions as final outputs, because they are usually similar to one another, and thus redundant.</Paragraph> <Paragraph position="2"> For the moment, we assume that we know a priori which domains are related to the input term.</Paragraph> <Paragraph position="3"> From the viewpoint of probability theory, our task here is to select descriptions with greater probability for given domains. The probability of description d given domain c, P(d|c), is commonly transformed as in Equation (1) through use of Bayes' theorem: P(d|c) = P(c|d) · P(d) / P(c). (1) Here, P(c) is a constant, and thus does not affect the relative probability for different descriptions.</Paragraph> <Paragraph position="4"> In Equation (1), P(c|d) models the probability that d corresponds to domain c. P(d) models the probability that d can be a description for the term in question, disregarding the domain. We shall call these the domain and description models, respectively.</Paragraph> <Paragraph position="5"> To sum up, in principle we select d's that are strongly associated with a specific domain and are likely to be descriptions in themselves.</Paragraph> <Paragraph position="6"> Extracted descriptions are not linguistically understandable in cases where the extraction process is unsuccessful or retrieved pages inherently contain non-linguistic information (such as special characters and e-mail addresses).</Paragraph> <Paragraph position="7"> To resolve this problem, Fujii and Ishikawa (2000) used a language model to filter out descriptions with high perplexity. In this paper, however, we integrate a description model, which is in practice equivalent to a language model, into the organization model. The new framework is more understandable with respect to probability theory.</Paragraph> <Paragraph position="8"> In practice, we first use Equation (1) to compute P(d|c) for all the c's predefined in the domain model. Then we discard those c's whose P(d|c) is below a specific threshold. As a result, for the input term, related domains and descriptions are selected simultaneously. Thus, we do not in fact have to know a priori which domains are related to each term.</Paragraph> <Paragraph position="9"> In the following two sections, we explain methods to realize the domain and description models, respectively.</Paragraph> </Section> <Section position="2" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 3.2 Domain Model </SectionTitle> <Paragraph position="0"> The domain model quantifies the extent to which description d is associated with domain c; this is fundamentally a categorization task. Among a number of existing categorization methods, we experimentally used the one proposed by Iwayama and Tokunaga (1994), which formulates P(c|d) as in Equation (2): P(c|d) = P(c) · Σ_t P(t|d) · P(t|c) / P(t). (2) Here, P(t|d), P(t|c) and P(t) denote the probabilities that word t appears in d, in c, and in all the domains, respectively. We regard P(c) as a constant. While P(t|d) is simply the relative frequency of t in d, we need predefined domains to compute P(t|c) and P(t). For this purpose, the use of large-scale corpora annotated with domains would be desirable.</Paragraph> <Paragraph position="1"> However, since those resources are prohibitively expensive, we used the &quot;Nova&quot; dictionary for Japanese/English machine translation systems, which includes approximately one million entries related to 19 technical fields, as listed below: aeronautics, biotechnology, business, chemistry, computers, construction, defense, ecology, electricity, energy, finance, law, mathematics, mechanics, medicine, metals, oceanography, plants, trade.</Paragraph> <Paragraph position="2"> We extracted words from dictionary entries to estimate P(t|c) and P(t) as the relative frequencies of t in c and in all the domains, respectively. We used the ChaSen morphological analyzer (Matsumoto et al., 1997) to extract words from Japanese entries. We also used English entries, because Japanese descriptions often contain English words.</Paragraph> <Paragraph position="3"> It may be argued that statistics extracted from dictionaries are unreliable, because word frequencies in real-world usage are missing. However, words that are representative of a domain tend to be used frequently in compound word entries associated with that domain, and thus our method is a practical approximation.</Paragraph> </Section>
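A minimal sketch of Equation (2) under our reading, where `domain_freq` counts word occurrences in the dictionary entries of one domain and `all_freq` across all 19 domains; P(c) is dropped as a constant. The function and variable names are our own:

```python
from collections import Counter

def domain_score(description_words, domain_freq: Counter, all_freq: Counter) -> float:
    """Approximate P(c|d) up to the constant P(c):
    score = sum over t of P(t|d) * P(t|c) / P(t)   (Equation 2)."""
    d_freq = Counter(description_words)
    d_total = sum(d_freq.values())
    c_total = sum(domain_freq.values())
    a_total = sum(all_freq.values())
    score = 0.0
    for t, f in d_freq.items():
        if all_freq[t] == 0:
            continue  # word unseen in the dictionary contributes nothing
        p_t_d = f / d_total                 # relative frequency of t in d
        p_t_c = domain_freq[t] / c_total    # relative frequency of t in c
        p_t = all_freq[t] / a_total         # relative frequency of t overall
        score += p_t_d * p_t_c / p_t
    return score
```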
<Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.3 Description Model </SectionTitle> <Paragraph position="0"> The description model quantifies the extent to which a given page fragment is feasible as a description of the input term. In principle, we decompose the description model into language and quality properties, as shown in Equation (3): P(d) = P_L(d) · P_Q(d), (3) where P_L(d) and P_Q(d) denote the language and quality models, respectively.</Paragraph> <Paragraph position="1"> It is expected that the quality model discards incorrect or misleading information contained in Web pages. For this purpose, a number of quality rating methods for Web pages (Amento et al., 2000; Zhu and Gauch, 2000) can be used.</Paragraph> <Paragraph position="2"> However, since Google (i.e., the search engine used in our system) rates the quality of pages based on hyperlink information and selectively retrieves those with higher quality (Brin and Page, 1998), we tentatively regard P_Q(d) as a constant. Thus, in practice the description model is approximated solely with the language model, as in Equation (4): P(d) ≈ P_L(d). (4)</Paragraph> <Paragraph position="3"> Statistical approaches to language modeling have been used in much NLP research, such as machine translation (Brown et al., 1993) and speech recognition (Bahl et al., 1983). Our model is almost the same as existing models, but differs in two respects.</Paragraph> <Paragraph position="4"> First, while general language models quantify the extent to which a given word sequence is linguistically acceptable, our model also quantifies the extent to which the input is acceptable as a term description. Thus, we trained the model on an existing machine-readable encyclopedia. We used the ChaSen morphological analyzer to segment the Japanese CD-ROM World Encyclopedia (Heibonsha, 1998) into words (we replaced headwords with a common symbol), and then used the CMU-Cambridge toolkit (Clarkson and Rosenfeld, 1997) to build a word-based trigram model. Consequently, descriptions whose word sequences are more similar to those in the World Encyclopedia are assigned greater probability scores by our language model.</Paragraph> <Paragraph position="5"> Second, P(d), which is a product of the probabilities of the N-grams in d, is quite sensitive to the length of d. In machine translation and speech recognition this problem is less crucial, because the multiple candidates compared with the language model are almost equivalent in length. However, since in our case the lengths of descriptions differ significantly, shorter descriptions are more likely to be selected, regardless of their quality. To avoid this problem, we normalize P(d) by the number of words contained in d.</Paragraph> </Section> </Section>
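A minimal sketch of this normalization, assuming the per-word (trigram) log probabilities from the trained model are already available. Working in log space, the normalized log P(d) is simply the mean per-word log probability:

```python
def normalized_log_prob(word_logprobs: list) -> float:
    """Mean per-word log probability, so that descriptions of different
    lengths are comparable; the raw product P(d) would favor short ones."""
    if not word_logprobs:
        return float("-inf")
    return sum(word_logprobs) / len(word_logprobs)
```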
<Section position="5" start_page="3" end_page="4" type="metho"> <SectionTitle> 4 Application </SectionTitle> <Section position="1" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 4.1 Overview </SectionTitle> <Paragraph position="0"> Encyclopedias generated through our Web-based method can be used in a number of applications, including human usage, thesaurus production (Hearst, 1992; Nakamura and Nagao, 1988) and natural language understanding in general.</Paragraph> <Paragraph position="1"> Among these applications, natural language understanding (NLU) is the most challenging from a scientific point of view. Current practical NLU research includes dialogue, information extraction and question answering, among which we focus solely on question answering (QA) in this paper.</Paragraph> <Paragraph position="2"> A straightforward application is to answer interrogative questions like &quot;What is X?&quot;, in which a QA system searches the encyclopedia database for one or more descriptions related to X (this application is also effective for dialogue systems).</Paragraph> <Paragraph position="3"> In general, the performance of QA systems is evaluated based on coverage and accuracy. Coverage is the ratio between the number of questions answered (disregarding their correctness) and the total number of questions. Accuracy is the ratio between the number of correct answers and the total number of answers made by the system.</Paragraph> <Paragraph position="4"> While coverage can be estimated objectively and systematically, estimating accuracy relies on human subjects (because there is no single absolute description for term X), and thus is expensive.</Paragraph> <Paragraph position="5"> In view of this problem, we targeted the Information-Technology Engineers Examinations (administered by the Japan Information-Technology Engineers Examination Center, http://www.jitec.jipdec.or.jp/), which are biannual (spring and autumn) examinations that candidates must pass to qualify as IT engineers in Japan. Among a number of classes, we focused on the &quot;Class II&quot; examination, which requires fundamental and general knowledge related to information technology. Approximately half of the questions are associated with technical IT terms.</Paragraph> <Paragraph position="6"> Since past examinations and answers are open to the public, we can evaluate the performance of our QA system with minimal cost.</Paragraph> </Section>
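To make the two measures defined above concrete, here is a trivial sketch with hypothetical numbers (ours, not results from the paper):

```python
def coverage(n_answered: int, n_total: int) -> float:
    """Fraction of questions answered, disregarding correctness."""
    return n_answered / n_total

def accuracy(n_correct: int, n_answered: int) -> float:
    """Fraction of the answers given that are correct."""
    return n_correct / n_answered

# Hypothetical example: answering 32 of 40 questions, 20 of them correctly,
# gives coverage 32/40 = 0.80 and accuracy 20/32 = 0.625.
```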
<Section position="2" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.2 Analyzing IT Engineers Examinations </SectionTitle> <Paragraph position="0"> The Class II examination consists of quadruple-choice questions, among which technical term questions can be subdivided into two types.</Paragraph> <Paragraph position="1"> In the first type of question, examinees choose the most appropriate description for a given technical term, such as &quot;memory interleave&quot; or &quot;router.&quot; In the second type of question, examinees choose the most appropriate term for a given description. We show examples of the second type collected from the examination in the autumn of 1999 (translated into English by one of the authors): 1. Which data structure is most appropriate for FIFO (First-In First-Out)? a) binary trees, b) queues, c) stacks, d) heaps 2. Choose the LAN access method in which multiple terminals transmit data simultaneously and thus potentially collide. a) ATM, b) CSMA/CD, c) FDDI, d) token ring</Paragraph> <Paragraph position="2"> In the autumn of 1999, out of 80 questions, the numbers of the first and second types were 22 and 18, respectively.</Paragraph> </Section> <Section position="3" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.3 Implementing a QA System </SectionTitle> <Paragraph position="0"> For the first type of question, human examinees would search their knowledge base (i.e., memory) for the description of a given term, and compare that description with the four candidates. They would then choose the candidate that is most similar to the description.</Paragraph> <Paragraph position="1"> For the second type of question, human examinees would search their knowledge base for the description of each of the four candidate terms. They would then choose the candidate term whose description is most similar to the question description.</Paragraph> <Paragraph position="2"> The mechanism of our QA system is analogous to the above human strategies. However, unlike human examinees, our system uses an encyclopedia generated from the Web as its knowledge base.</Paragraph> <Paragraph position="3"> In addition, our system selectively uses term descriptions categorized into domains related to information technology. In other words, the description of &quot;pipeline (transportation pipe)&quot; is irrelevant or misleading for answering questions associated with &quot;pipeline (processing method).&quot;</Paragraph> <Paragraph position="4"> To compute the similarity between two descriptions, we used techniques developed in IR research, in which the similarity between a user query and each document in a collection is usually quantified based on word frequencies. In our case, the question and the four possible answers correspond to the query and the document collection, respectively. We used a probabilistic method (Robertson and Walker, 1994), which is one of the major IR methods.</Paragraph> <Paragraph position="5"> To sum up, given a question, its type and four choices, our QA system chooses one of the four candidates as the answer, where the resolution algorithm varies depending on the question type.</Paragraph> </Section>
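A sketch of this answer selection, assuming pre-tokenized word lists. The scoring function is a BM25-style weighting in the spirit of the probabilistic model of Robertson and Walker (1994); the parameter values and all names are conventional defaults of our own, not values reported in the paper:

```python
import math
from collections import Counter

def similarity(query_words, doc_words, doc_freqs, n_docs, avg_len, k1=1.2, b=0.75):
    """BM25-style similarity between a query and one candidate document."""
    tf = Counter(doc_words)
    score = 0.0
    for w in set(query_words):
        df = doc_freqs.get(w, 0)
        if w not in tf or df == 0:
            continue
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        tf_norm = tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(doc_words) / avg_len))
        score += idf * tf_norm
    return score

def choose_answer(reference_words, candidate_word_lists):
    """Return the index of the candidate most similar to the reference.
    For type-1 questions, the reference is the stored description of the
    question term and the candidates are the four choice descriptions; for
    type-2 questions, the reference is the question text and the candidates
    are the stored descriptions of the four choice terms."""
    n = len(candidate_word_lists)
    doc_freqs = Counter(w for words in candidate_word_lists for w in set(words))
    avg_len = sum(len(words) for words in candidate_word_lists) / n
    scores = [similarity(reference_words, words, doc_freqs, n, avg_len)
              for words in candidate_word_lists]
    return max(range(n), key=scores.__getitem__)
```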
<Section position="5" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.4 Related Work </SectionTitle> <Paragraph position="0"> Motivated partially by the TREC-8 QA collection (Voorhees and Tice, 2000), question answering has of late become one of the major topics within the NLP/IR communities.</Paragraph> <Paragraph position="1"> In fact, a number of QA systems targeting the TREC QA collection have recently been proposed (Harabagiu et al., 2000; Moldovan and Harabagiu, 2000; Prager et al., 2000). Those systems are commonly termed &quot;open-domain&quot; systems, because the questions, expressed in natural language, are not necessarily limited to explicit axes such as who, what, when, where, how and why.</Paragraph> <Paragraph position="2"> However, Moldovan and Harabagiu (2000) found that each of the TREC questions can be recast as either a single axis or a combination of axes. They also found that out of the 200 TREC questions, 64 (approximately one third) were associated with the what axis, for which a Web-based encyclopedia is expected to improve the quality of answers.</Paragraph> <Paragraph position="3"> Although Harabagiu et al. (2000) proposed a knowledge-based QA system, most existing systems rely on conventional IR and shallow NLP methods. The use of encyclopedic knowledge for QA systems, as we have demonstrated, needs to be explored further.</Paragraph> </Section> </Section> <Section position="6" start_page="4" end_page="4" type="metho"> <SectionTitle> 5 Experimentation </SectionTitle> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.1 Methodology </SectionTitle> <Paragraph position="0"> We conducted a number of experiments to investigate the effectiveness of our methods.</Paragraph> <Paragraph position="1"> First, we generated an encyclopedia by way of our Web-based method (see Sections 2 and 3), and evaluated the quality of the encyclopedia itself.</Paragraph> <Paragraph position="2"> Second, we applied the generated encyclopedia to our QA system (see Section 4), and evaluated its performance. The second experiment can be seen as a task-oriented evaluation of our encyclopedia generation method.</Paragraph> <Paragraph position="3"> For the first experiment, we collected 96 terms from technical term questions in the Class II examination (autumn of 1999). We used those 96 terms as test inputs and generated an encyclopedia, which was then used in the second experiment.</Paragraph> <Paragraph position="4"> For all 96 test terms, Google (see Section 2.2) retrieved a positive number of pages, and the average number of pages per term was 196,503. Since Google practically outputs the contents of only the top 1,000 pages, the remaining pages were not used in our experiments.</Paragraph> <Paragraph position="5"> In the following two sections, we report on the first and second experiments, respectively.</Paragraph> </Section> <Section position="2" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.2 Evaluating Encyclopedia Generation </SectionTitle> <Paragraph position="0"> For each test term, our method first computed P(d|c) using Equation (1) and discarded domains whose P(d|c) was below 0.05. Then, for each remaining domain, the descriptions with the highest P(d|c) were selected as the final outputs. We selected the top three (not one) descriptions for each domain, because reading a few descriptions, which are short paragraphs, is not laborious for human users in real-world usage. As a result, at least one description was generated for 85 test terms, disregarding correctness. The number of resultant descriptions was 326 (3.8 per term). We analyzed those descriptions from several perspectives, as reported below.</Paragraph>
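A sketch of this selection procedure, where `p_d_given_c(d, c)` is assumed to combine the domain and description models as in Equation (1); the threshold of 0.05 and top-3 cutoff are the values stated above, while the function signature is our own:

```python
def organize(descriptions, domains, p_d_given_c, threshold=0.05, top_k=3):
    """For each predefined domain c, keep the top-k descriptions d whose
    P(d|c) clears the threshold. Domains with no surviving description are
    dropped, so related domains and descriptions are selected simultaneously."""
    organized = {}
    for c in domains:
        scored = [(p_d_given_c(d, c), d) for d in descriptions]
        kept = sorted((x for x in scored if x[0] >= threshold),
                      key=lambda x: x[0], reverse=True)[:top_k]
        if kept:
            organized[c] = [d for _, d in kept]  # best first
    return organized
```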
<Paragraph position="1"> First, we analyzed the distribution of the Google ranks of the Web pages from which the top three descriptions were eventually retained. Figure 2 shows the result, where we have combined the pages in groups of 50, so that the leftmost bar, for example, denotes the number of used pages whose original Google ranks ranged from 1 to 50. Although the first group includes the largest number of pages, the other groups also account for a relatively large number of pages. In other words, our method exploited a number of low-ranking pages, which are not browsed or utilized by most Web users.</Paragraph> [Figure 2: distribution of the original Google ranks of the pages used.] <Paragraph position="2"> Second, we analyzed the distribution of domains assigned to the 326 resultant descriptions. Figure 3 shows the result, in which, as expected, most descriptions were associated with the computer domain. However, the law domain was unexpectedly associated with a relatively large number of descriptions. We manually analyzed the resultant descriptions and found that descriptions for which appropriate domains are not defined in our domain model, such as sports, tended to be categorized into the law domain.</Paragraph> [Figure 3: distribution of domains assigned to the 326 resultant descriptions: computers (200), law (41), electricity (28), plants (15), medicine (10), finance (8), mathematics (8), mechanics (5), biotechnology (4), construction (2), ecology (2), chemistry (1).] <Paragraph position="3"> Third, we evaluated the accuracy of our method, that is, the quality of the encyclopedia it generated. For this purpose, each of the resultant descriptions was judged as to whether or not it is a correct description of the term in question. Each domain assigned to a description was also judged correct or incorrect. We first analyzed the results on a description-by-description basis, that is, all the generated descriptions were considered independent of one another. The ratio of correct descriptions, disregarding domain correctness, was 58.0% (189/326), and the ratio of correct descriptions categorized into the correct domain was 47.9% (156/326).</Paragraph> <Paragraph position="4"> However, since all the test terms are inherently related to the IT field, we also focused solely on descriptions categorized into the computer domain. In this case, the ratio of correct descriptions, disregarding domain correctness, was 62.0% (124/200), and the ratio of correct descriptions categorized into the correct domain was 61.5% (123/200).</Paragraph> <Paragraph position="5"> In addition, we analyzed the results on a term-by-term basis, because reading a few descriptions per term is not a burden in practice. In other words, we evaluated each term (not each description), and judged a term correct if at least one correct description categorized into the correct domain was generated for it. The ratio of correct terms was 89.4% (76/85), and when we focused solely on the computer domain, the ratio was 84.8% (67/79). In other words, by reading a few descriptions (3.8 per term on average), human users can obtain knowledge of approximately 90% of the input terms.</Paragraph> <Paragraph position="6"> Finally, we compared the resultant descriptions with an existing dictionary. For this purpose, we used the &quot;Nichigai&quot; computer dictionary (Nichigai Associates, 1996), which lists approximately 30,000 Japanese technical terms related to the computer field and contains descriptions for 13,588 of them. Of our 96 test terms, 42 were described in the Nichigai dictionary. Our method, which generated correct computer-domain descriptions for 67 input terms, thus enhanced the Nichigai dictionary in terms of quantity.
These results indicate that our method for generating encyclopedias is of operational quality.</Paragraph> </Section> <Section position="3" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.3 Evaluating Question Answering </SectionTitle> <Paragraph position="0"> We used as test inputs 40 questions related to technical terms, collected from the Class II examination in the autumn of 1999.</Paragraph> <Paragraph position="1"> The objective here is not only to evaluate the performance of our QA system itself, but also to evaluate the quality of the encyclopedia generated by our method.</Paragraph> <Paragraph position="2"> Thus, as in the first experiment (Section 5.2), we used the Nichigai computer dictionary as a baseline encyclopedia. We compared the following three resources as knowledge bases: * the Nichigai dictionary (&quot;Nichigai&quot;), * the descriptions generated in the first experiment (&quot;Web&quot;), * the combination of both resources (&quot;Nichigai + Web&quot;).</Paragraph> <Paragraph position="3"> Table 1 shows the result of our comparative experiment, in which &quot;C&quot; and &quot;A&quot; denote coverage and accuracy, respectively, for variations of our QA system. Since all the questions we used are quadruple-choice, in case the system cannot answer a question, a random choice can be made to raise the coverage to 100%. Thus, for each knowledge resource we compared cases without and with random choice, denoted &quot;w/o Random&quot; and &quot;w/ Random&quot; in Table 1, respectively.</Paragraph> <Paragraph position="4"> In the case where random choice was not performed, the Web-based encyclopedia noticeably improved the coverage over the Nichigai dictionary, but decreased the accuracy. However, by combining both resources, the accuracy was noticeably improved, and the coverage was comparable with that of the Nichigai dictionary alone.</Paragraph> <Paragraph position="5"> On the other hand, in the case where random choice was performed, the Nichigai dictionary and the Web-based encyclopedia were comparable in terms of both coverage and accuracy, and combining both resources improved the accuracy further. We also investigated the performance of our QA system when only descriptions related to the computer domain were used. However, coverage and accuracy did not change significantly, because, as shown in Figure 3, most of the descriptions were inherently related to the computer domain.</Paragraph> </Section> </Section>
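To make the effect of random choice explicit (our own derivation, not a formula from the paper): if the system answers a fraction C of the questions with accuracy A and guesses the rest uniformly among the four choices, the expected overall accuracy becomes C·A + (1 − C)/4, while coverage becomes 100%. A minimal sketch:

```python
def with_random_fallback(c: float, a: float, n_choices: int = 4):
    """Expected (coverage, accuracy) when unanswered questions are guessed
    uniformly at random: a guess among n choices is right 1/n of the time."""
    return 1.0, c * a + (1.0 - c) / n_choices
```

</Paper>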