<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1208"> <Title>Question Answering Using Encyclopedic Knowledge Generated from the Web</Title> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Overview of our QA system </SectionTitle> <Paragraph position="0"> For the first type of question (see Section 2), human examinees would search their knowledge base (i.e., memory) for the description of a given term, and compare that description with four candidates. Then they would choose the candidate that is most similar to the description.</Paragraph> <Paragraph position="1"> For the second type of question, human examinees would search their knowledge base for the description of each of four candidate terms. Then they would choose the candidate term whose description is most similar to the question.</Paragraph> <Paragraph position="2"> The mechanism of our QA system is analogous to the above human methods. However, our system uses as a knowledge base an encyclopedia generated from the Web.</Paragraph> <Paragraph position="3"> To compute the similarity between two descriptions, we use techniques developed in IR research, in which the similarity between a user query and each document in a collection is usually quantified based on word frequencies. In our case, a question and the four possible answers correspond to the query and the document collection, respectively. 
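To make the candidate-selection step concrete, the following is a minimal, illustrative sketch; it is not the authors' implementation. The BM25-style weighting, the function names, and the parameter values (k1, b) are assumptions standing in for a probabilistic IR method: each candidate description is scored against the question, and the index of the best-scoring candidate is returned.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """BM25-style similarity between a query and one candidate document."""
    score = 0.0
    dl = len(doc_terms)
    for t in set(query_terms):
        tf = doc_terms.count(t)
        if tf == 0:
            continue
        df = doc_freq.get(t, 0)
        # idf shifted by +1 inside the log so it stays positive
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_len))
    return score

def choose_answer(question, candidates):
    """Return the index of the candidate description most similar to the question."""
    docs = [c.split() for c in candidates]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = {}
    for d in docs:
        for t in set(d):
            doc_freq[t] = doc_freq.get(t, 0) + 1
    q = question.split()
    scores = [bm25_score(q, d, doc_freq, n, avg_len) for d in docs]
    return max(range(n), key=lambda i: scores[i])
```

The same scorer serves both question types: for the first type the question is the term's description, and for the second type each candidate term's description is scored against the question text.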
We use one of the major probabilistic IR methods (Robertson and Walker, 1994).</Paragraph> <Paragraph position="4"> To sum up, given a question, its type and four choices, our QA system chooses as the answer one of the four candidates, where the resolution algorithm varies depending on the question type.</Paragraph> </Section> <Section position="5" start_page="1" end_page="3" type="metho"> <SectionTitle> 4 Encyclopedia Generation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 Overview </SectionTitle> <Paragraph position="0"> Figure 1 depicts the overall design of our method to generate an encyclopedia for input terms. The method consists of three modules: &quot;retrieval,&quot; &quot;extraction&quot; and &quot;organization,&quot; among which the organization module is newly introduced in this paper. In principle, the remaining two modules (&quot;retrieval&quot; and &quot;extraction&quot;) are the same as those proposed by Fujii and Ishikawa (2000).</Paragraph> <Paragraph position="1"> In Figure 1, terms can be submitted either on-line or off-line. A reasonable arrangement is for the system to update the encyclopedia periodically off-line, while terms not yet indexed in the encyclopedia are processed dynamically in real-time usage. In either case, our system processes input terms one by one. We briefly explain each module in the following three sections.</Paragraph> </Section> <Section position="2" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 4.2 Retrieval </SectionTitle> <Paragraph position="0"> The retrieval module searches the Web for pages containing an input term, for which existing Web search engines can be used; those with broad coverage are desirable.</Paragraph> <Paragraph position="1"> However, search engines that perform query expansion are not always desirable, because they usually retrieve a number of pages that do not contain the query keyword. 
Since the extraction module (see Section 4.3) analyzes the usage of the input term in retrieved pages, pages not containing the term are of no use for our purpose. Thus, we use as the retrieval module &quot;Google&quot; (http://www.google.com/), which is one of the major search engines and does not conduct query expansion.</Paragraph> </Section> <Section position="3" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 4.3 Extraction </SectionTitle> <Paragraph position="0"> In the extraction module, given Web pages containing an input term, we first discard newline codes, redundant white spaces and HTML tags that are not used in the following process, so as to standardize the page format.</Paragraph> <Paragraph position="1"> Second, we (approximately) identify a region describing the term in the page, for which two rules are used.</Paragraph> <Paragraph position="2"> The first rule is based on Japanese linguistic patterns typically used for term descriptions, such as &quot;X toha Y dearu (X is Y).&quot; Following the method proposed by Fujii and Ishikawa (2000), we semi-automatically produced 20 patterns based on the Japanese CD-ROM World Encyclopedia (Heibonsha, 1998), which includes approximately 80,000 entries related to various fields.</Paragraph> <Paragraph position="3"> A region including a sentence that matches one of those patterns is expected to be a term description.</Paragraph> <Paragraph position="4"> The second rule is based on HTML layout. In a typical case, a term in question is highlighted as a heading with tags such as <DT>, <B> and <Hx> (&quot;x&quot; denotes a digit), followed by its description. In some cases, terms are marked with the anchor <A> tag, providing hyperlinks to pages where they are described.</Paragraph> <Paragraph position="5"> Finally, based on the region approximately identified by the above method, we extract a page fragment as a term description. 
Since term descriptions usually consist of a logical segment (such as a paragraph) rather than a single sentence, we extract a fragment that matches one of the following patterns, which are sorted according to preference in descending order: 1. a description tagged with <DD>, in the case where the term is tagged with <DT>, 2. a paragraph tagged with <P>, 3. an itemization tagged with <UL>, 4. N sentences, where we empirically set N = 3.</Paragraph> </Section> <Section position="4" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.4 Organization </SectionTitle> <Paragraph position="0"> For the purpose of organization, we classify extracted term descriptions based on word senses and domains.</Paragraph> <Paragraph position="1"> Although a number of methods have been proposed to generate word senses (for example, one based on the vector space model (Schütze, 1998)), it is still difficult to accurately identify word senses without explicit dictionaries that predefine sense candidates.</Paragraph> <Paragraph position="2"> <DT> and <DD> are inherently provided to describe terms in HTML.</Paragraph> <Paragraph position="3"> Since word senses are often associated with domains (Yarowsky, 1995), word senses can consequently be distinguished by determining the domain of each description. For example, the different senses of &quot;pipeline (processing method/transportation pipe)&quot; are associated with the computer and construction domains (fields), respectively. To sum up, the organization module classifies term descriptions based on domains, for which we use domain and description models. 
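The preference-ordered extraction rules of Section 4.3 can be sketched as follows. This is an illustrative Python approximation, not the authors' code: it assumes simplified HTML and English-style sentence segmentation, whereas the paper's patterns target Japanese text.

```python
import re

def extract_description(html):
    """Extract a page fragment as a term description, trying the
    patterns of Section 4.3 in descending order of preference."""
    # 1. <DD> description, in the case where the term is tagged with <DT>
    m = re.search(r'<dt[^>]*>.*?</dt>\s*<dd[^>]*>(.*?)</dd>', html, re.I | re.S)
    if m:
        return m.group(1).strip()
    # 2. paragraph tagged with <P>
    m = re.search(r'<p[^>]*>(.*?)</p>', html, re.I | re.S)
    if m:
        return m.group(1).strip()
    # 3. itemization tagged with <UL>
    m = re.search(r'<ul[^>]*>(.*?)</ul>', html, re.I | re.S)
    if m:
        return m.group(1).strip()
    # 4. fall back to the first N sentences (N = 3 in the paper)
    text = re.sub(r'<[^>]+>', ' ', html)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return ' '.join(sentences[:3]).strip()
```

Each rule fires only if all higher-preference rules fail, mirroring the descending preference order described above.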
In Section 5, we elaborate on the organization model.</Paragraph> </Section> </Section> <Section position="6" start_page="3" end_page="4" type="metho"> <SectionTitle> 5 Statistical Organization Model </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 5.1 Overview </SectionTitle> <Paragraph position="0"> Given one or more (in most cases more than one) descriptions for a single term, the organization module selects appropriate description(s) for each domain related to the term.</Paragraph> <Paragraph position="1"> We do not need all the extracted descriptions as final outputs, because they are usually similar to one another, and thus are redundant. For the moment, we assume that we know a priori which domains are related to the input term.</Paragraph> <Paragraph position="2"> From the viewpoint of probability theory, our task here is to select descriptions with greater probability for given domains. The probability for description d given domain c, P(d|c), is commonly transformed as in Equation (1), through use of Bayes' theorem.</Paragraph> <Paragraph position="3"> P(d|c) = P(c|d) · P(d) / P(c) (1) </Paragraph> <Paragraph position="4"> In practice, P(c) can be omitted because this factor is a constant, and thus does not affect the relative probability for different descriptions.</Paragraph> <Paragraph position="5"> In Equation (1), P(c|d) models the probability that d corresponds to domain c. P(d) models the probability that d can be a description for the term in question, disregarding the domain. We shall call them the domain and description models, respectively. 
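As an illustration of how Equation (1) is applied, candidate descriptions for a domain can be ranked by P(c|d)·P(d) with the constant P(c) dropped. This is a sketch with toy probability functions; the names are assumptions, not the authors' code.

```python
def rank_descriptions(descriptions, domain, p_c_given_d, p_d):
    """Rank candidate descriptions d for a domain c by P(c|d) * P(d),
    i.e. Equation (1) with the constant factor P(c) dropped."""
    scored = [(d, p_c_given_d(domain, d) * p_d(d)) for d in descriptions]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Here p_c_given_d plays the role of the domain model and p_d the role of the description model; any implementation of the two can be plugged in.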
To sum up, in principle we select d's that are strongly associated with a certain domain, and are likely to be descriptions themselves.</Paragraph> <Paragraph position="6"> Extracted descriptions are not linguistically understandable in cases where the extraction process is unsuccessful or retrieved pages inherently contain non-linguistic information (such as special characters and e-mail addresses).</Paragraph> <Paragraph position="7"> To resolve this problem, we previously used a language model to filter out descriptions with high perplexity (Fujii and Ishikawa, 2000). However, in this paper we integrated a description model, which is practically the same as a language model, with an organization model. The new framework is more understandable with respect to probability theory.</Paragraph> <Paragraph position="8"> In practice, we first use Equation (1) to compute P(d|c) for all the c's predefined in the domain model. Then we discard those c's whose P(d|c) is below a specified threshold. As a result, for the input term, related domains and descriptions are simultaneously selected. Thus, we do not have to know a priori which domains are related to each term.</Paragraph> <Paragraph position="9"> In the following two sections, we explain methods to realize the domain and description models, respectively.</Paragraph> </Section> <Section position="2" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 5.2 Domain Model </SectionTitle> <Paragraph position="0"> The domain model quantifies the extent to which description d is associated with domain c, which is fundamentally a categorization task.</Paragraph> <Paragraph position="1"> Among a number of existing categorization methods, we experimentally used one proposed by Iwayama and Tokunaga (1994), which formulates P(c|d) as in Equation (2).</Paragraph> <Paragraph position="2"> P(c|d) = P(c) · Σ_t P(t|d) · P(t|c) / P(t) (2) </Paragraph> <Paragraph position="3"> Here, P(t|d), P(t|c) and P(t) denote the probabilities that word t appears in d, in c, and in all the domains, respectively. 
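Equation (2) can be sketched in code as follows, treating P(c) as a constant (dropped). This is an illustrative implementation, not the authors'; the per-domain word lists are stand-ins for words extracted from dictionary entries, and all probabilities are estimated as relative frequencies.

```python
from collections import Counter

def domain_posterior(doc_words, domain_words):
    """Score each domain c by Equation (2) with P(c) dropped:
    score(c) = sum_t P(t|d) * P(t|c) / P(t)."""
    n_d = len(doc_words)
    tf_d = Counter(doc_words)
    # P(t): relative frequency of t over all domains pooled together
    pooled = [t for words in domain_words.values() for t in words]
    pooled_tf = Counter(pooled)
    total = len(pooled)
    scores = {}
    for c, words in domain_words.items():
        tf_c = Counter(words)
        n_c = len(words)
        s = 0.0
        for t, f in tf_d.items():
            if tf_c[t] == 0 or pooled_tf[t] == 0:
                continue  # unseen words contribute nothing
            s += (f / n_d) * (tf_c[t] / n_c) / (pooled_tf[t] / total)
        scores[c] = s
    return scores
```

Words that are frequent in one domain's entries but rare overall (high P(t|c)/P(t)) dominate the sum, which is what lets domain-specific vocabulary separate senses such as the two readings of "pipeline".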
We regard P(c) as a constant. While P(t|d) is simply the relative frequency of t in d, we need predefined domains to compute P(t|c) and P(t). For this purpose, the use of large-scale corpora annotated with domains is desirable.</Paragraph> <Paragraph position="4"> However, since those resources are prohibitively expensive, we used the &quot;Nova&quot; dictionary (produced by NOVA, Inc.) for Japanese/English machine translation systems, which includes approximately one million entries related to 19 technical fields as listed below:</Paragraph> <Paragraph position="5"> aeronautics, biotechnology, business, chemistry, computers, construction, defense, ecology, electricity, energy, finance, law, mathematics, mechanics, medicine, metals, oceanography, plants, trade.</Paragraph> <Paragraph position="6"> We extracted words from dictionary entries to estimate P(t|c) and P(t). For Japanese entries, we used the ChaSen morphological analyzer (Matsumoto et al., 1997) to extract words. We also used English entries because Japanese descriptions often contain English words. It may be argued that statistics extracted from dictionaries are unreliable, because word frequencies in real-world usage are missing. However, words that are representative of a domain tend to be used frequently in compound word entries associated with that domain, and thus our method is a practical approximation.</Paragraph> </Section> <Section position="3" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.3 Description Model </SectionTitle> <Paragraph position="0"> The description model quantifies the extent to which a given page fragment is feasible as a description for the input term. 
In principle, we decompose the description model into language and quality properties, as shown in Equation (3).</Paragraph> <Paragraph position="1"> P(d) = P_L(d) · P_Q(d) (3) Here, P_L(d) and P_Q(d) denote the language and quality models, respectively.</Paragraph> <Paragraph position="2"> It is expected that the quality model discards incorrect or misleading information contained in Web pages. For this purpose, a number of quality rating methods for Web pages (Amento et al., 2000; Zhu and Gauch, 2000) can be used.</Paragraph> <Paragraph position="3"> However, since Google (i.e., the search engine we used in the retrieval module) rates the quality of pages based on hyperlink information, and selectively retrieves those with higher quality (Brin and Page, 1998), we tentatively regarded P_Q(d) as a constant. Thus, in practice the description model is approximated solely with the language model as in Equation (4).</Paragraph> <Paragraph position="4"> P(d) ≈ P_L(d) (4) </Paragraph> <Paragraph position="5"> Statistical approaches to language modeling have been used in much NLP research, such as machine translation (Brown et al., 1993) and speech recognition (Bahl et al., 1983). Our language model is almost the same as existing models, but is different in two respects.</Paragraph> <Paragraph position="6"> First, while general language models quantify the extent to which a given word sequence is linguistically acceptable, our model also quantifies the extent to which the input is acceptable as a term description. Thus, we trained the model based on an existing machine-readable encyclopedia. We used the ChaSen morphological analyzer to segment the Japanese CD-ROM World Encyclopedia (Heibonsha, 1998) into words (we replaced headwords with a common symbol), and then used the CMU-Cambridge toolkit (Clarkson and Rosenfeld, 1997) to model a word-based trigram. 
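A minimal sketch of such a word-trigram model follows. It is illustrative only: add-alpha smoothing stands in for the CMU-Cambridge toolkit's smoothing, the toy corpus stands in for the segmented encyclopedia, and the score is normalized per word so that short fragments gain no unfair advantage.

```python
import math
from collections import Counter

def train_trigram(corpus_sentences):
    """Count word trigrams (and their bigram contexts) over a corpus,
    padding each sentence with boundary symbols."""
    tri, bi = Counter(), Counter()
    for sent in corpus_sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            bi[tuple(words[i - 2:i])] += 1
    return tri, bi

def normalized_logprob(words, tri, bi, vocab_size, alpha=1.0):
    """Length-normalized log P(d) under an add-alpha smoothed trigram model:
    dividing by the word count keeps short fragments from being favored."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    logp = 0.0
    for i in range(2, len(padded)):
        t = tuple(padded[i - 2:i + 1])
        p = (tri[t] + alpha) / (bi[t[:2]] + alpha * vocab_size)
        logp += math.log(p)
    return logp / len(words)
```

Fragments whose word sequences resemble the training encyclopedia receive higher (less negative) normalized log-probabilities than arbitrary word sequences of the same length.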
Consequently, descriptions in which word sequences are more similar to those in the World Encyclopedia are assigned greater probability scores through our language model.</Paragraph> <Paragraph position="7"> Second, P(d), which is generally a product of probabilities for N-grams in d, is quite sensitive to the length of d. In the cases of machine translation and speech recognition, this problem is less crucial because the multiple candidates compared based on the language model are almost equivalent in terms of length. For example, in the case of machine translation, candidates are translations for a single input, which are usually comparable with respect to length.</Paragraph> <Paragraph position="8"> However, since in our case the lengths of descriptions differ significantly, shorter descriptions are more likely to be selected, regardless of their quality. To avoid this problem, we normalize P(d) by the number of words contained in d.</Paragraph> </Section> </Section> <Section position="7" start_page="4" end_page="4" type="metho"> <SectionTitle> 6 Experimentation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 6.1 Methodology </SectionTitle> <Paragraph position="0"> We evaluated the performance of our question answering system, for which we used as test inputs 40 technical term questions collected from the Class II examination (the autumn of 1999).</Paragraph> <Paragraph position="1"> First, we generated an encyclopedia including 96 terms that are associated with those 40 questions. For all 96 test terms, Google retrieved at least one page, and the average number of pages for one term was 196,503. Since Google outputs the contents of only the top 1,000 pages, the remaining pages were not used in our experiments.</Paragraph> <Paragraph position="2"> For each test term, we computed P(d|c) using Equation (1) and discarded domains whose P(d|c) was below 0.05. 
Then, for each remaining domain, the three descriptions with the highest P(d|c) values were selected as the final outputs, because a preliminary experiment showed that a correct description was generally found in the top three candidates.</Paragraph> <Paragraph position="3"> In addition, to estimate a baseline performance, we used the &quot;Nichigai&quot; computer dictionary (Nichigai Associates, 1996). This dictionary lists approximately 30,000 Japanese technical terms related to the computer field, and contains descriptions for 13,588 terms. In this dictionary, 42 of the 96 test terms were described.</Paragraph> <Paragraph position="4"> We compared the following three different resources as a knowledge base:</Paragraph> </Section> </Section> </Paper>