<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1039">
  <Title>Combining Deep Linguistics Analysis and Surface Pattern Learning: A Hybrid Approach to Chinese Definitional Question Answering</Title>
  <Section position="3" start_page="307" end_page="310" type="metho">
    <SectionTitle>
2 A Hybrid Approach to Definitional Question Answering
</SectionTitle>
    <Paragraph position="0"> tion Answering The architecture of our QA system is shown in Figure 1. Given a question, we first use simple rules to classify it as a &amp;quot;Who-is&amp;quot; or &amp;quot;What-is&amp;quot; question and detect key words. Then we use a HMM-based IR system (Miller et al., 1999) for document retrieval by treating the question keywords as a query. To speed up processing, we only use the top 1000 relevant documents. We then select relevant sentences among the returned relevant documents. A sentence is considered relevant if it contains the query key-word or contains a word that is co-referent to the query term. Coreference is determined using an information extraction engine, SERIF (Ramshaw et al., 2001). We then conduct deep linguistic analysis and pattern matching to extract candidate answers. We rank all candidate answers by predetermined feature ordering. At the same time, we perform redundancy detection based on a1 -gram overlap. null</Paragraph>
    <Section position="1" start_page="307" end_page="308" type="sub_section">
      <SectionTitle>
2.1 Deep Linguistic Analysis
</SectionTitle>
      <Paragraph position="0"> We use SERIF (Ramshaw et al., 2001), a linguistic analysis engine, to perform full parsing, name entity detection, relation detection, and co-reference resolution. We extract the following linguistic features:  1. Copula: a copula is a linking verb such as &amp;quot;is&amp;quot; or &amp;quot;become&amp;quot;. An example of a copula feature is &amp;quot;Bill Gates is the CEO of Microsoft&amp;quot;. In this case, &amp;quot;CEO of Microsoft&amp;quot; will be extracted as  an answer to &amp;quot;Who is Bill Gates?&amp;quot;. To extract copulas, SERIF traverses the parse trees of the sentences and extracts copulas based on rules.</Paragraph>
      <Paragraph position="1"> In Chinese, the rule for identifying a copula is the POS tag &amp;quot;VC&amp;quot;, standing for &amp;quot;Verb Copula&amp;quot;. The only copula verb in Chinese is &amp;quot; a2 &amp;quot;. 2. Apposition: appositions are a pair of noun phrases in which one modifies the other. For example, In &amp;quot;Tony Blair, the British Prime Minister, ...&amp;quot;, the phrase &amp;quot;the British Prime Minister&amp;quot; is in apposition to &amp;quot;Blair&amp;quot;. Extraction of appositive features is similar to that of copula. SERIF traverses the parse tree and identifies appositives based on rules. A detailed description of the algorithm is documented  in (Ramshaw et al., 2001).</Paragraph>
      <Paragraph position="2"> 3. Proposition: propositions represent predicate-argument structures and take the form: predicate(a0a2a1 a3a5a4a7a6 : a8a9a0a11a10 a6 , ..., a0a2a1 a3a5a4a13a12 : a8a14a0a15a10 a12 ). The most common roles include logical subject, logical object, and object of a prepositional phrase that modifies the predicate. For example, &amp;quot;Smith went to Spain&amp;quot; is represented as a proposition, went(logical subject: Smith, PP-to: Spain).</Paragraph>
      <Paragraph position="3"> 4. Relations: The SERIF linguistic analysis en null gine also extracts relations between two objects. SERIF can extract 24 binary relations defined in the ACE guidelines (Linguistic Data Consortium, 2002), such as spouse-of, staff-of, parent-of, management-of and so forth. Based on question types, we use different relations, as listed in Table 1.</Paragraph>
      <Paragraph position="4">  Many relevant sentences do not contain the query key words. Instead, they contain words that are co-referent to the query. For example, in &amp;quot;Yesterday UN Secretary General Anan Requested Every Side..., He said ... &amp;quot;. The pronoun &amp;quot;He&amp;quot; in the second sentence refers to &amp;quot;Anan&amp;quot; in the first sentence. To select such sentences, we conduct co-reference resolution using SERIF.</Paragraph>
      <Paragraph position="5"> In addition, SERIF also provides name tagging, identifying 29 types of entity names or descriptions, such as locations, persons, organizations, and diseases. null We also select complete sentences mentioning the term being defined as backup answers if no other features are identified.</Paragraph>
      <Paragraph position="6"> The component performance of our linguistic analysis is shown in Table 2.</Paragraph>
    </Section>
    <Section position="2" start_page="308" end_page="310" type="sub_section">
      <SectionTitle>
2.2 Surface Pattern Learning
</SectionTitle>
      <Paragraph position="0"> We use two kinds of patterns: manually constructed patterns and automatically derived patterns. A manual pattern is a commonly used linguistic expression that specifies aliases, super/subclass and membership relations of a term (Xu et al., 2004). For example, the expression &amp;quot;tsunamis, also known as tidal waves&amp;quot; gives an alternative term for tsunamis. We  use 23 manual patterns for Who-is questions and 14 manual patterns for What-is questions.</Paragraph>
      <Paragraph position="1"> We also classify some special propositions as manual patterns since they are specified by computational linguists. After a proposition is extracted, it is matched against a list of predefined predicates. If it is on the list, it is considered special and will be ranked higher. In total, we designed 22 special propositions for Who-is questions, such as a0</Paragraph>
      <Paragraph position="3"> However, it is hard to manually construct such patterns since it largely depends on the knowledge of the pattern designer. Thus, we prefer patterns that can be automatically derived from training data.</Paragraph>
      <Paragraph position="4"> Some annotators labeled question-answer snippets.</Paragraph>
      <Paragraph position="5"> Given a query question, the annotators were asked to highlight the strings that can answer the question.</Paragraph>
      <Paragraph position="6"> Though such a process still requires annotators to have knowledge of what can be answers, it does not require a computational linguist. Our pattern learning procedure is illustrated in Figure 2.</Paragraph>
      <Paragraph position="7">  Here we give an example to illustrate how pattern learning works. The first step is annotation. An example of Chinese answer annotation with English translation is shown in Figure 3. Question words are assigned the tag QTERM, answer words are tagged ANSWER, and all other words are assigned BKGD, standing for background words (not shown in the example to make the annotation more readable).</Paragraph>
      <Paragraph position="8"> To obtain patterns, we conduct full parsing to obtain the full parse tree for a sentence. In our current  patterns, we only use POS tagging information, but other higher level information could also be used. The segmented and POS tagged sentence is shown in Figure 4. Each word is assigned a POS tag as defined by the Penn Chinese Treebank guidelines.</Paragraph>
      <Paragraph position="9">  Next we combine the POS tags and the answer tags by appending these two tags to create a new tag,  We can then obtain an answer snippet from this training sample. Here we obtain the snippet (a71a73a72 a72a75a74a77a76 NR/ANSWER)(TERM).</Paragraph>
      <Paragraph position="10"> We generalize a pattern using three heuristics (this particular example does not generalize). First, we replace all Chinese sequences longer than 3 characters with their POS tags, under the theory that long sequences are too specific. Second, we also replace NT (time noun, such as a78a4a79 ), DT (determiner, such as a80 ,a81 ), cardinals (CD, such as a82 ,a83 ,a84 ) and M  (measurement word such as a0a2a1a4a3 ) with their POS tags. Third, we ignore adjectives.</Paragraph>
      <Paragraph position="11"> After obtaining all patterns, we run them on the training data to calculate their precision and recall. We select patterns whose precision is above 0.6 and which fire at least 5 times in training data (parameters are determined with a held out dataset).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="310" end_page="311" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="310" end_page="310" type="sub_section">
      <SectionTitle>
3.1 Data Sets
</SectionTitle>
      <Paragraph position="0"> We produced a list of questions and asked annotators to identify answer snippets from TDT4 data. To produce as many training answer snippets as possible, annotators were asked to label answers exhaustively; that is, the same answer can be labeled multiple times in different places. However, we remove duplicate answers for test questions since we are only interested in unique answers in evaluation.</Paragraph>
      <Paragraph position="1"> We separate questions into two types, biographical (Who-is) questions, and other definitional questions (What-is). For &amp;quot;Who-is&amp;quot; questions, we used 204 questions for pattern learning, 10 for parameter tuning and another 42 questions for testing. For &amp;quot;What-is&amp;quot; questions, we used 44 for training and another 44 for testing.</Paragraph>
    </Section>
    <Section position="2" start_page="310" end_page="310" type="sub_section">
      <SectionTitle>
3.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> The TREC question answering evaluation is based on human judgments (Voorhees, 2004). However, such a manual procedure is costly and time consuming. Recently, researchers have started automatic question answering evaluation (Xu et al., 2004; Lin and Demner-Fushman, 2005; Soricut and Brill, 2004). We use Rouge, an automatic evaluation metric that was originally used for summarization evaluation (Lin and Hovy, 2003) and was recently found useful for evaluating definitional question answering (Xu et al., 2004). Rouge is based on a1 -gram co-occurrence. An a1 -gram is a sequence of a1 consecutive Chinese characters.</Paragraph>
      <Paragraph position="1"> Given a reference answer a5 and a system answer a6 , the Rouge score is defined as follows:</Paragraph>
      <Paragraph position="3"> grams of a5 and a6 , and a60 a1a13a61 a1a63a62a17a70a22a5a73a72 a1a75a74 is the number of a1 -grams in a5 . If a59 is too small, stop words and bi-grams of such words will dominate the score; If a59 is too large, there will be many questions without answers. We select a59 to be 3, 4, 5 and 6.</Paragraph>
      <Paragraph position="4"> To make scores of different systems comparable, we truncate system output for the same question by the same cutoff length. We score answers truncated at length a76 times that of the reference answers, where a76 is set to be 1, 2, and 3. The rationale is that people would like to read at least the same length of the reference answer. On the other hand, since the state of the art system answer is still far from human performance, it is reasonable to produce answers somewhat longer than the references (Xu et al., 2004).</Paragraph>
      <Paragraph position="5"> In summary, we run experiments with parameters a59a78a77a80a79a81a72a51a82a23a72a11a83a81a72a11a84 and a76a85a77a4a86a87a72a11a88a81a72a11a79 , and take the average over all of the 12 runs.</Paragraph>
    </Section>
    <Section position="3" start_page="310" end_page="311" type="sub_section">
      <SectionTitle>
3.3 Overall Results
</SectionTitle>
      <Paragraph position="0"> We set the pure linguistic analysis based system as the baseline and compare it to other configurations.</Paragraph>
      <Paragraph position="1"> Table 3 and Table 4 show the results on &amp;quot;Who-is&amp;quot; and &amp;quot;What-is&amp;quot; questions respectively. The baseline (Run 1) is the result of using pure linguistic features; Run 2 is the result of adding manual patterns to the baseline system; Run 3 is the result of using learned patterns only. Run 4 is the result of adding learned patterns to the baseline system. Run 5 is the result of adding both manual patterns and learned patterns to the system.</Paragraph>
      <Paragraph position="2"> The first question we want to answer is how helpful the linguistic analysis and pattern learning are for definitional QA. Comparing Run 1 and 3, we can see that both pure linguistic analysis and pure pattern based systems achieve comparable performance; Combining them together improves performance (Run 4) for &amp;quot;who is&amp;quot; questions, but only slightly for &amp;quot;what is&amp;quot; questions. This indicates that linguistic analysis and pattern learning are complementary to each other, and both are helpful for biographical QA.</Paragraph>
      <Paragraph position="3"> The second question we want to answer is what kind of questions can be answered with pattern matching. From these two tables, we can see that patterns are very effective in &amp;quot;Who-is&amp;quot; questions while less effective in &amp;quot;What-is&amp;quot; questions. Learned patterns improve the baseline from 0.3399  to 0.3860; manual patterns improve the baseline to 0.3657; combining both manual and learned patterns improve it to 0.4026, an improvement of 18.4% compared to the baseline. However, the effect of patterns on &amp;quot;What-is&amp;quot; is smaller, with an improvement of only 3.5%. However, the baseline performance on &amp;quot;What-is&amp;quot; is also much worse than that of &amp;quot;Who-is&amp;quot; questions. We will analyze the reasons in Section 4.3. This indicates that answering general definitional questions is much more challenging than answering biographical questions and deserves more research.</Paragraph>
      <Paragraph position="4"> Run Run description Rouge  (1) Baseline 0.3399 (2) (1)+ manual patterns 0.3657 (3) Learned patterns 0.3549 (4) (1)+ learned patterns 0.3860 (5) (2)+ learned patterns 0.4026  Run Run description Rouge (1) Baseline 0.2126 (2) (1)+ manual patterns 0.2153 (3) Learned patterns 0.2117 (4) (1)+ learned patterns 0.2167 (5) (2)+ learned patterns 0.2201</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="311" end_page="312" type="metho">
    <SectionTitle>
4 Analysis
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="311" end_page="311" type="sub_section">
      <SectionTitle>
4.1 How much annotation is needed
</SectionTitle>
      <Paragraph position="0"> The third question is how much annotation is needed for a pattern based system to achieve good performance. We run experiments with portions of training data on biographical questions, which produce different number of patterns. Table 5 shows the details of the number of training snippets used and the number of patterns produced and selected. The performance of different system is illustrated in Figure 6. With only 10% of the training data (549 snippets, about two person days of annotation), learned patterns achieve good performance of 0.3285, considering the performance of 0.3399 of a well tuned system with deep linguistic features. Performance saturates with 2742 training snippets (50% training, 10 person days annotation) at a Rouge score of 0.3590, comparable to the performance of a well tuned system with full linguistic features and manual patterns (Run 2 in Table 3). There could even be a slight, insignificant performance decrease with more training data because our sampling is sequential instead of random. Some portions of training data might be more useful than others.</Paragraph>
      <Paragraph position="1">  sured on biographical questions)</Paragraph>
    </Section>
    <Section position="2" start_page="311" end_page="312" type="sub_section">
      <SectionTitle>
4.2 Contributions of different features
</SectionTitle>
      <Paragraph position="0"> The fourth question we want to answer is: what features are most useful in definitional question answering? To evaluate the contribution of each individual feature, we turn off all other features and test the system on a held out data (10 questions). We calculate the coverage of each feature, measured by Rouge. We also calculate the precision of each feature with the following formula, which is very similar to Rouge except that the denominator here is based on system output a60 a1a13a61 a1a63a62a17a70 a6 a72 a1a75a74 instead of reference a60 a1a13a61 a1a63a62a17a70a22a5a73a72 a1a75a74 . The notations are the same as  ingly, the learned patterns have the highest coverage and precision. The copula feature has the second highest precision; however, it has the lowest coverage. This is because there are not many copulas in the dataset. Appositive and manual pattern features have the same level of contribution. Surprisingly, the relation feature has a high coverage. This suggests that relations could be more useful if relation detection were more accurate; general propositions are not more useful than whole sentences since almost every sentence has a proposition, and since the high value propositions are identified by the lexical head of the proposition and grouped with the manual  sured on the biographical questions)</Paragraph>
    </Section>
    <Section position="3" start_page="312" end_page="312" type="sub_section">
      <SectionTitle>
4.3 Who-is versus What-is questions
</SectionTitle>
      <Paragraph position="0"> We have seen that &amp;quot;What-is&amp;quot; questions are more challenging than &amp;quot;Who-is&amp;quot; questions. We compare the precision and coverage of each feature for &amp;quot;Who-is&amp;quot; and &amp;quot;What-is&amp;quot; in Table 6 and Table 7. We see that although the precisions of the features are higher for &amp;quot;What-is&amp;quot;, their coverage is too low. The most useful features for &amp;quot;What-is&amp;quot; questions are propositions and raw sentences, which are the worst two features for &amp;quot;Who-is&amp;quot;. Basically, this means that most of the answers for &amp;quot;What-is&amp;quot; are from whole sentences. Neither linguistic analysis nor pattern matching works as efficiently as in biographical questions.</Paragraph>
      <Paragraph position="1">  To identify the challenges of &amp;quot;What-is&amp;quot; questions, we conducted an error analysis. The answers for &amp;quot;What-is&amp;quot; are much more diverse and are hard to capture. For example, the reference answers for the question of &amp;quot;a7a9a8 a2a11a10a13a12a15a14a17a16a19a18a21a20 / What is the international space station?&amp;quot; include the weight of the space station, the distance from the space station to the earth, the inner structure of the space station, and the cost of its construction. Such attributes are hard to capture with patterns, and they do not contain any of the useful linguistic features we currently have (copula, appositive, proposition, relation). Identifying more useful features for such answers remains for future work.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>