<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1199">
  <Title>Learning to Identify Single-Snippet Answers to Definition Questions</Title>
  <Section position="3" start_page="0" end_page="2" type="metho">
    <SectionTitle>
2 Previous techniques
</SectionTitle>
    <Paragraph position="0"> Prager et al. (2001, 2002) observe that definition questions can often be answered by hypernyms; for example, &amp;quot;schizophrenia&amp;quot; is a &amp;quot;mental illness&amp;quot;, where the latter is a hypernym of the former in WordNet. Deciding which hypernym to report, however, is not trivial. To use an example of Prager et al., in &amp;quot;What is a meerkat?&amp;quot; WordNet provides hypernym synsets such as {&amp;quot;viverrine&amp;quot;, &amp;quot;viverrine mammal&amp;quot;} (level 1), {&amp;quot;carnivore&amp;quot;} (level 2), {&amp;quot;placental&amp;quot;, ...} (level 3), {&amp;quot;mammal&amp;quot;} (level 4), up to {&amp;quot;entity&amp;quot;, &amp;quot;something&amp;quot;} (level 9). In a neutral context, the most natural response is arguably &amp;quot;mammal&amp;quot; or &amp;quot;animal&amp;quot;. A hypernym on its own may also not be a satisfactory answer.</Paragraph>
    <Paragraph position="1"> Responding that an amphibian is an animal is less satisfactory than saying it is an animal that lives both on land and in water.</Paragraph>
    <Paragraph position="2"> Prager et al. identify the best hypernyms by counting how often they co-occur with the definition term in two-sentence passages of the document collection. They then short-list the passages that contain both the term and any of the best hypernyms, and rank them using measures similar to those of Radev et al. (2000). More precisely, given a term to define, they compute the level-adapted count (LAC) of each of its hypernyms, defined as the number of two-sentence passages where the hypernym co-occurs with the term divided by the distance between the term and the hypernym in WordNet's hierarchy. They then retain the hypernym with the largest LAC, and all the hypernyms whose LAC is within a 20% margin from the largest one. To avoid very general hypernyms (e.g., {&amp;quot;entity&amp;quot;, &amp;quot;something&amp;quot;}), Prager et al. discard the hypernyms of the highest 1 or 2 levels in WordNet's trees, if the distance from the top of the tree to the definition term is up to 3 or 5, respectively; if the distance is longer, they discard the top three levels. This ceiling is raised gradually if no co-occurring hypernym is found.</Paragraph>
    <Paragraph position="3"> Prager et al.'s method performed well with the definition questions and documents of TREC-9 (Prager et al., 2001). In 20 out of the 24 definition questions they considered (83.3%), it managed to return at least one correct response in the five most highly ranked two-sentence passages; in all 20 questions, the correct response was actually the highest ranked. In TREC-2001, however, where there were more definition questions mirroring more directly real user queries, this percentage dropped to 46% (Prager et al., 2002). In 44 of the 130 definition questions (33.8%) they considered, WordNet did not help at all, because it either did not contain the term whose definition was sought, or none of its hypernyms were useful even as partial definitions. Even when WordNet contained a hypernym that constituted a self-contained definition, it was not always selected.</Paragraph>
    <Paragraph position="4"> In work that led to an alternative method, Hearst (1998) sketched a method to identify patterns that signal particular lexical semantic relations, in effect a bootstrapping approach. Applying this process by hand, Hearst was able to identify four high precision hyponym-hypernym patterns that are common across text genres. The patterns are shown below in the slightly modified form of Joho and Sanderson (2000). Here, qn (query noun) and dp (descriptive phrase) are phrases containing hyponyms and hypernyms, respectively.</Paragraph>
    <Paragraph position="5">  (1) (dp such  |such dp) as qn e.g., &amp;quot;injuries such as broken bones&amp;quot; (2) qn (and  |or) other dp e.g., &amp;quot;broken bones and other injuries&amp;quot; (3) dp especially qn e.g., &amp;quot;injuries especially broken bones&amp;quot; (4) dp including qn  e.g., &amp;quot;European countries including England, France, and Spain&amp;quot; Compared to Prager et al.'s method, Hearst's patterns have the advantage that they may identify hyponym-hypernym relations that are not present in WordNet, which is often the case with domain-specific terminology and proper names.</Paragraph>
    <Paragraph position="6"> Joho and Sanderson (2000) observe that co-occurrences of hyponyms and hypernyms are often indicative of contexts that define the hyponyms.</Paragraph>
    <Paragraph position="7"> Instead of using WordNet, they identify hyponym-hypernym contexts using Hearst's patterns, to which they add the following ones.</Paragraph>
    <Paragraph position="8">  Here, qn is the term to define, and dp is a phrase that contains a definition of qn; dp no longer necessarily contains a hypernym of qn. Each pattern is assigned a weight, which is its precision on a training set (the number of sentences it correctly identifies as  containing a definition, over the number of sentences it matches).</Paragraph>
    <Paragraph position="9"> (5) qn &amp;quot;(&amp;quot; dp &amp;quot;)&amp;quot;  |&amp;quot;(&amp;quot; dp &amp;quot;)&amp;quot; qn e.g., &amp;quot;MP (Member of Parliament)&amp;quot; (6) qn (is  |was  |are  |were) (a  |an  |the) dp e.g., &amp;quot;Tony Blair is a politician&amp;quot; (7) qn , (a  |an  |the) dp e.g., &amp;quot;Tony Blair, the politician&amp;quot; (8) qn , which (is  |was  |are  |were) dp e.g., &amp;quot;bronchitis, which is a disease of...&amp;quot; (9) qn , dp , (is  |was  |are  |were)  e.g., &amp;quot;Blair, Prime Minister of Britain, is...&amp;quot; Joho and Sanderson first locate all the sentences of the document collection that contain the term to define, and then rank them using three attributes. The first one (KPW) is the weight of the pattern the sentence matched, if any. The second attribute (SN) is the ordinal number of the sentence in its document, ignoring sentences that do not contain the term; sentences that mention first the term in a document are more likely to define it. The third attribute (WC) shows how many of the words that  We ignore a variant of (7) that ends in &amp;quot;.&amp;quot;, &amp;quot;?&amp;quot; or &amp;quot;!&amp;quot;. are common across candidate answers are present in the sentence. More precisely, to compute WC, Joho and Sanderson retrieve the first sentence of each document that contains the definition term, and retain the 20 most frequent words of these sentences after applying a stop-list and a lemmatizer. WC is the percentage of these words that are present in the sentence being ranked. The sentences are ranked using the weighted sum of the three attributes, after hand-tuning the weights of the sum on a training set. In Joho et al. (2001), this method was evaluated with 50 definition questions and the top 600 documents that Google returned for each definition term. It returned a correct definition in the five top-ranked sentences in 66% of the questions.</Paragraph>
    <Paragraph position="10">  As with Prager et al.'s method, then, there is scope for improvement.</Paragraph>
  </Section>
  <Section position="4" start_page="2" end_page="4" type="metho">
    <SectionTitle>
3 Our method
</SectionTitle>
    <Paragraph position="0"> Prager et al.'s approach is capable of answering a large number of definition questions, though, as discussed above, it does not always manage to locate the most appropriate hypernym, and the appropriate hyponym-hypernym relation may not be present in WordNet. Joho et al.'s technique does not rely on a predetermined ontology, which makes it less vulnerable to domain-specific terminology and proper names. Its limited pattern set, however, cannot capture definitions that are phrased in less typical ways, and the fact that the weights of the three attributes are hand-tuned raises doubts as to whether they are optimal. Our method overcomes these limitations by combining the two approaches in a common machine learning framework, and by using a larger attribute set.</Paragraph>
    <Paragraph position="1"> We assume that a question processing module that separates definition from other types of questions is available, and that in each definition question it also identifies the term to be defined.</Paragraph>
    <Paragraph position="2"> When such a module is not available, the user can be asked to specify explicitly the question type and the term to be defined, via a form-based interface as in Buchholtz and Daelemans (2001). Hence, the input to our method is a (possibly multi-word) term, along with the most highly ranked documents that an IR engine returned for that term (in the experiments of section 4, we used the top 50 documents). The goal is to identify five 250-character snippets in these documents, such that at least one of the snippets contains an acceptable definition of the input term. As an example, we show below the two snippets that configuration 4 of our method (to be discussed) considered most  Joho et al. report better results when using a less stringent form of evaluation that admits partial answers, but their acceptance criteria in that case appear to be over-permissive. appropriate in the case of &amp;quot;What are pathogens?&amp;quot;. The first snippet defines pathogens as hazardous microorganisms. The second one provides instances of pathogens, but does not actually state what a pathogen is.</Paragraph>
    <Paragraph position="3"> ...considerations as nutrient availability. In particular, the panel concluded that the fear of creating hazardous microorganisms, or pathogens, is overstated. &amp;quot;It is highly unlikely that moving one or a few genes from a pathogen to...</Paragraph>
    <Paragraph position="4"> ...definite intraspecial physiological and morphological diversity. Ph. helianthi thrives at higher temperatures than other sunflower pathogens (Sclerotinia sclerotiorum and Botrytis cinerea) do. In various nutrient media, Ph. helianthi ...</Paragraph>
    <Paragraph position="5"> We select the five snippets to report using alternatively a baseline and four different configurations of our learning-based method.</Paragraph>
    <Section position="1" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
3.1 Baseline
</SectionTitle>
      <Paragraph position="0"> As a baseline, we use a reimplementation of Prager et al.'s method that operates with 250-character snippets.</Paragraph>
      <Paragraph position="1">  Unlike Prager et al., our reimplementation does not use a ranking function (Radev et al., 2000). When more than five snippets contain both the input term and one of its best hypernyms, we rank the snippets according to the ranking (RK) of the documents they derive from, i.e., the ranking of the IR engine. Our evaluation results (section 4) indicate that the performance of our implementation is still very similar to that of Prager et al.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
3.2 Learning-based method
</SectionTitle>
      <Paragraph position="0"> In all the configurations of our learning-based method, we use a Support Vector Machine (SVM) with a simple inner product (polynomial of first degree) kernel (Scholkopf and Smola, 2002), which in effect learns an optimal linear classifier without moving to a higher-dimension space. (We have experimented with higher degree polynomial kernels, but there was no sign of improvement.) The SVM is trained as follows. Given a training set of terms to be defined and the corresponding documents that the IR engine returned, we collect from the documents all the 250-character snippets  Following Prager et al. (2002), we consider synonyms of the input term as level-0 hypernyms, and include them in the search for best hypernyms; their LAC is the number of times they co-occur with the input term. We did not implement the tests for orthographic variations and count ratios of Prager et al. When an input term occurs in multiple synsets, which produces multiple paths towards hypernyms, we select the hypernyms with the best overall LAC scores, instead of the best scores per path, unlike Prager et al. (2001). that have the term at their center. Each snippet is then converted to a training vector, the attributes of which differ across the configurations presented below. The training vectors are manually classified in two categories, depending on whether or not the corresponding snippets contain acceptable definitions. The SVM is trained on these vectors to predict the category of new, unseen vectors from their attributes.</Paragraph>
      <Paragraph position="1"> The SVM implementation we used actually returns confidence scores, showing how probable it is that a particular vector belongs to each category.</Paragraph>
      <Paragraph position="2">  Using a classifier that returns confidence scores instead of binary decisions is crucial, because the training vectors that correspond to definitions are much fewer than the vectors for non-definitions (3004 vs. 15469 in our dataset of section 4). As a result, the induced classifier is biased towards non-definitions, and, hence, most unseen vectors receive higher confidence scores for the category of non-definitions than for the category of definitions. We do not compare the two scores. We pick the five vectors whose confidence score for the category of definitions is highest, and report the corresponding snippets; in effect, we use the SVM as a ranker, rather than a classifier; see also Ravichandran et al. (2003). The imbalance between the two categories can be reduced by considering (during both training and classification) only the first three snippets of each document, which discards mostly non-definitions.</Paragraph>
      <Paragraph position="3"> 3.2.1 Configuration 1: attributes of Joho at al.</Paragraph>
      <Paragraph position="4"> In the first configuration of our learning-based approach, the attributes of the vectors are roughly those of Joho et al.: two numeric attributes for SN and WC (section 2), and a binary attribute for each one of patterns (1)-(9) showing if the pattern is satisfied. We have also added binary attributes for the following manually crafted patterns, and a numeric attribute for RK (section 3.1).</Paragraph>
      <Paragraph position="5">  (10) dp like qn e.g., &amp;quot;antibiotics like amoxicillin&amp;quot; (11) qn or dp e.g., &amp;quot;autism or some other type of disorder&amp;quot; (12) qn (can  |refer  |have) dp e.g., &amp;quot;amphibians can live on land and...&amp;quot; (13) dp (called  |known as  |defined) qn  e.g., &amp;quot;the giant wave known as tsunami&amp;quot; This configuration can be seen as an approximation of Joho et al.'s method, although it is actually an improvement, because of the extra attributes and the fact that it uses an SVM learner  We used Weka's SVM implementation. Weka is available from http://www.cs.waikato.ac.nz/ml/weka/ . instead of a weighted sum with hand-tuned weights to combine the attributes.</Paragraph>
      <Paragraph position="6"> 3.2.2 Configuration 2: adding WordNet The second configuration is identical to the first one, except that it adds an extra binary attribute showing if the snippet contains one of its best hypernyms, as returned by the baseline. This is essentially a combination of the approaches of Prager et al. and Joho et al. Unlike simple voting (Chu-Carroll et al., 2003), the two methods contribute attributes to the instance representations (vectors) of an overall learner (the SVM). This allows the overall learner to assess their reliability and adjust their weights accordingly.</Paragraph>
      <Paragraph position="7"> 3.2.3 Configuration 3: n-gram attributes The third configuration adds m extra binary attributes, each corresponding to an automatically acquired pattern. (In the experiments of section 4, m ranges from 100 to 300.) Observing that most of the previously proposed patterns are anchored at the term to define, we collect from the documents that the IR engine returns for the training questions all the n-grams (n [?] {1, 2, 3}) of tokens that occur immediately before or after the definition term.</Paragraph>
      <Paragraph position="8"> The n-grams that are encountered at least 10 times are considered candidate patterns. From these, we select the m patterns with the highest precision scores, where precision is defined as in section 2, but for snippets instead of sentences.</Paragraph>
      <Paragraph position="9"> In our experiments, this pattern acquisition process re-discovered several of the patterns (1)(13) or sub-expressions of them, namely [dp such as qn], [qn and other dp], [qn &amp;quot;(&amp;quot; dp], [dp &amp;quot;)&amp;quot; qn], [qn is (a  |an  |the) dp], [qn (are  |were) dp], [qn , (a  |an  |the) dp), [qn , which is dp], [qn , dp], [qn or dp], [qn can dp], [dp called qn], [dp known as qn]. It also discovered some reasonable variations of patterns (1)-(13); e.g, [qn is one dp], [dp , (a  |an) qn] (as in &amp;quot;A sudden health problem, a heart attack or ...&amp;quot;), [dp , qn], [qn , or dp]. We include dp, the phrase that defines qn, in the acquired patterns to make them easier to compare to patterns (1)-(13).</Paragraph>
      <Paragraph position="10"> The acquired patterns, however, do not predict the position of dp in a snippet.</Paragraph>
      <Paragraph position="11"> Many of the acquired patterns at first look odd, but under a closer examination turn out to be reasonable. For example, definition sentences often start with the term they define, as in &amp;quot;Amphibians are...&amp;quot;, &amp;quot;An alligator is...&amp;quot;, sometimes with the term quoted, which is how patterns like [. qn], [. An qn], [. ' ' qn] arise. Many of the questions in our data set were about diseases, and, hence, several of the acquired patterns were expressions (e.g,, &amp;quot;people with&amp;quot;, &amp;quot;symptoms of&amp;quot;) that co-occur frequently with definitions of diseases in snippets. This suggests that automatic pattern acquisition would allow domain-specific question answering systems (e.g., systems for medical documents) to exploit domain-specific indicators of definitions. The pattern acquisition process also produced many patterns (e.g., &amp;quot;the hallucinogenic drug&amp;quot;, &amp;quot;President Rafael&amp;quot;) that do not seem to provide any useful information in general, although they occurred frequently in definition snippets of our training data. Since we did not filter manually the automatically acquired patterns, patterns of the latter kind gave rise to irrelevant attributes that carried noise. This is a common problem in machine learning, which most learning algorithms are designed to cope with. Our experimental results indicate that the SVM learner benefits from the n-gram attributes, despite the noise they introduce. We also experimented with keeping both high and low precision patterns, hoping to acquire both n-grams that are indicative of definitions and n-grams that are indicative of contexts that do not constitute definitions. The experimental results, however, were inferior, and manual inspection of the low precision n-grams indicated that they carried mostly noise, suggesting that it is difficult to identify frequent n-grams whose presence rules out definitions reliably.</Paragraph>
      <Paragraph position="12"> The reader may wonder why we rely only on precision, rather than selecting also attributes with high recall. (Here, recall is the number of snippets the pattern correctly identifies as containing a definition, over the total number of snippets that contain a definition.) In multi-source document collections, like the Web or the TREC documents, one can expect to find several definitions of the same term, phrased in different ways. Unlike traditional document retrieval tasks, we are not interested in obtaining all the definitions; a single correct one suffices. Furthermore, as the number of snippets that can be reported is limited, we need to be confident that the snippets we return are indeed definitions. Hence, we need to rely on high-precision indicators. Preliminary experiments we performed using the F-measure (with b =1), a combination of recall and precision, instead of precision led to inferior results, confirming that attribute recall is not helpful in this task.</Paragraph>
      <Paragraph position="13"> We have also experimented with information gain. In that case, one selects the n-grams with the m highest IG(C,X) scores, defined below, where C and X are random variables denoting the category of a snippet and the value of the n-gram's binary attribute, respectively, and H(C) and H(C|X) are the entropy and conditional entropy of C. By selecting the attributes with the highest information gain scores, one selects the attributes that carry most information about the value of C.</Paragraph>
      <Paragraph position="14">  Although information gain is one of the best attribute selection measures in text categorization (Yang and Pedersen, 1997), in our case it led to very few attributes with non-zero IG(C,X) scores (around 90 attributes from the entire dataset of section 4). This is because most of the n-grams are very rare (i.e., P(X=0) is very large), and their absence (X=0) provides very little information about C (i.e., H(C) [?] H(C  |X = 0)). For example, not encountering [dp such as qn] provides very little information on whether or not the snippet is a definition. The experiments we performed with the resulting attributes led to inferior results, compared to those that we got via precision.</Paragraph>
      <Paragraph position="15"> 3.2.4 Configuration 4: discarding WordNet The fourth configuration is identical to the third one, except that it does not use the attribute that shows if the snippet contains one of the best hypernyms of Prager et al. (The attribute is present in configurations 2 and 3). The intention is to explore if the performance of configuration 3 can be sustained without the use of WordNet.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="4" end_page="4" type="metho">
    <SectionTitle>
4 Experimental results
</SectionTitle>
    <Paragraph position="0"> We evaluated the baseline and the machine learning configurations of section 3 on the definition questions of TREC-9 (2000) and TREC2001, the same data used by Prager et al. For each question, the TREC organizers provide the 50 most highly ranked documents that an IR engine returned from the TREC documents. The task is to return for each question five 250-character snippets from the corresponding 50 documents, such that at least one of the snippets contains an acceptable definition. Following Prager et al., we count a snippet as containing an acceptable definition, if it satisfies the Perl answer patterns that the TREC organizers provide for the corresponding question (Voorhees, 2001). The answer patterns incorporate the correct responses of all the participants of the corresponding TREC competition. In TREC-9, the correlation between system rankings produced by answer patterns and rankings produced by humans was at the same level as the average correlation between rankings of different human assessors (Voorhees and Tice 2000). In TREC-2001, the correlation between patterns and judges was lower, but still similar for 250-character responses (Voorhees, 2001).</Paragraph>
    <Paragraph position="1"> All the experiments with machine learning configurations were performed with 10-fold crossvalidation. That is, the question set was divided into 10 parts, and each experiment was repeated 10 times. At each iteration, the questions of a different part and the corresponding documents were reserved for testing, while the questions and documents of the remaining nine parts were used for training. (In configurations 3 and 4, pattern acquisition was repeated at each iteration.) Table 1 reports the total number of questions that each method managed to handle successfully over the ten iterations; i.e., questions with at least one acceptable definition in the five returned snippets.</Paragraph>
    <Paragraph position="2"> When questions for which there was no answer in the corresponding 50 documents and/or there was no answer pattern are excluded, the results are those shown in italics. The second row contains the results reported by Prager et al. (2001, 2002), while the third one shows the results of our reimplementation. We include six questions that Prager et al. appear to have excluded, which is why the total number of questions is different; there are also some implementation differences, as discussed in section 3.</Paragraph>
    <Paragraph position="3"> Method % questions handled correctly Prager et al. 51.95 (80/154), 60.15 (80/133) baseline 50.00 (80/160), 58.39 (80/137) config. 1 61.88 (99/160), 72.26 (99/137) config. 2 63.13 (101/160), 73.72 (101/137) config. 3 72.50 (116/160), 84.67 (116/137) config. 4 71.88 (115/160), 83.94 (115/137)  The SVM learner with roughly Joho et al.'s attributes (config. 1) clearly outperforms Prager et al.'s WordNet-based method. Adding Prager et al.'s method as an extra attribute to the SVM (config. 2) leads to only a marginal improvement.</Paragraph>
    <Paragraph position="4"> Automatic pattern acquisition (config. 3) is much more beneficial. Removing the attribute of the WordNet-based method (config. 4) caused the system to fail in only one of the questions that configuration 3 handled successfully, which again suggests that the WordNet-based method does not contribute much to the performance of configuration 3. This is particularly interesting for languages with no WordNet-like resources.</Paragraph>
    <Paragraph position="5"> The results of configurations 3 and 4 in table 1 were obtained using the 200 n-gram attributes with the highest precision scores. When using the 100 n-gram attributes with the highest and the 100 n-gram attributes with the lowest precision scores, the results of configuration 3 were 70.63% and 82.48%. When using all the n-gram attributes with non-zero information gain scores, the results of configuration 3 were 66.25% and 77.42%.</Paragraph>
    <Paragraph position="6"> Configurations 2 and 3 achieved inferior results with 300 highest-precision n-gram attributes (table 2), which may be a sign that low reliability n-grams are beginning to dominate the attribute set.</Paragraph>
  </Section>
  <Section position="6" start_page="4" end_page="4" type="metho">
    <SectionTitle>
5 Related work
</SectionTitle>
    <Paragraph position="0"> Ng et al. (2000) use machine learning (C5 with boosting) to classify and rank candidate answers, but do not treat definition questions in any special way, and use only four generic attributes across all question categories. Some of their worst results are for &amp;quot;What ...?&amp;quot; questions, that presumably include a large number of definition questions.</Paragraph>
    <Paragraph position="1"> Ittycheriah and Roukos (2002) employ a maximum entropy model to rank candidate answers, which uses a very rich set of attributes that includes 8,500 patterns. The latter are n-grams of words that occur frequently in answers of the training data, each associated with a two-word question prefix (e.g., &amp;quot;What is&amp;quot;) that also has to be matched for the pattern to be satisfied. Unlike our work, the n-grams have to be five or more words long, and, in the case of definition questions, they do not need to be anchored at the term to define.</Paragraph>
    <Paragraph position="2"> Ittycheriah and Roukos (2002) do not provide separate figures on the performance of their system on definition questions.</Paragraph>
    <Paragraph position="3"> Blair-Goldensohn et al. (2003) focus on definition questions, but aim at producing coherent multi-sentence definitions, rather than identifying single defining snippets. At the heart of their approach is a component that uses machine learning (Riper) to identify sentences that are candidates for inclusion in the multi-sentence definition. This component plays a role similar to that of our SVM learner, but it is intended to admit a larger range of sentences, and appears to employ only attributes conveying the position of the sentence in its document and the frequency of the definition term in the context of the sentence.</Paragraph>
    <Paragraph position="4"> Automatically acquired n-gram patterns can also be used for query expansion in information retrieval, as in Agichtein et al. (2001).</Paragraph>
  </Section>
class="xml-element"></Paper>