<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2027">
  <Title>Automatic Creation of Domain Templates</Title>
  <Section position="5" start_page="208" end_page="208" type="metho">
    <SectionTitle>
3 Our Approach to Template Creation
</SectionTitle>
    <Paragraph position="0"> After reading about presidential elections in different countries and in different years, a reader has a general picture of this process. Later, when reading about a new presidential election, the reader already has in her mind a set of questions for which she expects answers. This process can be called domain modeling. The more instances of a particular domain a person has seen, the better understanding she has of what type of information to expect in an unseen collection of documents discussing a new instance of this domain.</Paragraph>
    <Paragraph position="1"> Thus, we propose to use a set of document collections describing different instances within one domain to learn the general characteristics of this domain. These characteristics can then be used to create a domain template. We test our system on four domains: airplane crashes, earthquakes, presidential elections, and terrorist attacks.</Paragraph>
  </Section>
  <Section position="6" start_page="208" end_page="208" type="metho">
    <SectionTitle>
4 Data Description
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="208" end_page="208" type="sub_section">
      <SectionTitle>
4.1 Training Data
</SectionTitle>
      <Paragraph position="0"> To create training document collections we used BBC Advanced Search and submitted queries of the type &lt;domain title + country&gt;, for example &lt;&quot;presidential election&quot; USA&gt;.</Paragraph>
      <Paragraph position="1"> In addition, we used BBC's Advanced Search date filter to constrain the results to the date periods of interest. For example, for presidential elections we used the known dates of the elections and searched for articles published up to five days before or after each such date. For the terrorist attack and earthquake domains, in contrast, we constrained the search to the day of the event plus the following ten days.</Paragraph>
      <Paragraph position="2"> Thus, we identified several instances for each of our four domains, obtaining a document collection for each instance. E.g., for the earthquake domain we collected documents on the earthquakes in Afghanistan (March 25, 2002), India (January 26, 2001), Iran (December 26, 2003), Japan (October 26, 2004), and Peru (June 23, 2001). Using this procedure we retrieved training document collections for 9 instances of airplane crashes, 5 instances of earthquakes, 13 instances of presidential elections, and 6 instances of terrorist attacks.</Paragraph>
    </Section>
    <Section position="2" start_page="208" end_page="208" type="sub_section">
      <SectionTitle>
4.2 Test Data
</SectionTitle>
      <Paragraph position="0"> To test our system, we used document clusters from the Topic Detection and Tracking (TDT) corpus (Fiscus et al., 1999). Each TDT topic has a topic label, such as Accidents or Natural Disasters. These categories are broader than our domains. Thus, we manually filtered the TDT topics relevant to our four training domains (e.g., Accidents matching Airplane Crashes). In this way, we obtained TDT document clusters for 2 instances of airplane crashes, 3 instances of earthquakes, 6 instances of presidential elections, and 3 instances of terrorist attacks. The number of documents per instance varies greatly (from two documents for one of the earthquakes up to 156 documents for one of the terrorist attacks).</Paragraph>
      <Paragraph position="3"> This variation in the number of documents per topic is typical for the TDT corpus. Many current approaches to domain modeling collapse different instances together and decide what information is important for a domain based on this generalized corpus (Collier, 1998; Barzilay and Lee, 2003; Sudo et al., 2003). We, on the other hand, propose to cross-examine these instances while keeping them separate. Our goal is to eliminate the dependence on how well the corpus is balanced and to prevent instances with more documents from having a disproportionate impact on the domain template.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="208" end_page="209" type="metho">
    <SectionTitle>
5 Creating Templates
</SectionTitle>
    <Paragraph position="0"> In this work we build domain templates around verbs that are estimated to be important for the domain. Using these verbs as the starting point, we identify semantic dependencies within sentences.</Paragraph>
    <Paragraph position="1"> In contrast to deep semantic analysis (Fillmore and Baker, 2001; Gildea and Jurafsky, 2002; Pradhan et al., 2004; Harabagiu and Lacatusu, 2005; Palmer et al., 2005), we rely only on corpus statistics. We extract the most frequent syntactic subtrees that connect verbs to the lexemes used within the same subtrees. These subtrees are used to create domain templates.</Paragraph>
    <Paragraph position="2"> For each of the four domains described in Section 4, we automatically create domain templates using the following algorithm.</Paragraph>
    <Paragraph position="3"> Step 1: Estimate which verbs are important for the domain under investigation. We initiate our algorithm by calculating the probabilities of all the verbs in the document collection for one domain -- e.g., the collection containing all the instances in the domain of airplane crashes. We discard those verbs that are stop words (Salton, 1971). To take into consideration the distribution of a verb among the different instances of the domain, we normalize this probability by the verb's VIF value (verb instance frequency), which specifies in how many domain instances the verb appears.</Paragraph>
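The scoring step above can be sketched as follows. The exact normalization formula is not reproduced in this excerpt, so combining relative frequency with the fraction of instances containing the verb (VIF) is an assumption, as are all names in the sketch:

```python
from collections import Counter

def rank_domain_verbs(instances, stop_verbs):
    """Rank verbs for one domain.

    instances: one list of verb tokens per domain instance.
    Assumed scoring: relative frequency in the combined collection,
    weighted by the fraction of instances in which the verb occurs (VIF).
    """
    counts = Counter(v for inst in instances for v in inst)
    total = sum(counts.values())
    scores = {}
    for verb, count in counts.items():
        if verb in stop_verbs:
            continue  # discard stop-word verbs
        vif = sum(1 for inst in instances if verb in inst)
        scores[verb] = (count / total) * (vif / len(instances))
    return sorted(scores, key=scores.get, reverse=True)
```

A verb that is frequent overall but concentrated in a single instance is thus pushed down the ranking relative to a verb spread evenly across instances.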
    <Paragraph position="5"> These verbs are estimated to be the most important for the combined document collection for all the domain instances. Thus, we build the domain template around these verbs. Here are the top ten verbs for the terrorist attack domain: killed, told, found, injured, reported, happened, blamed, arrested, died, linked.</Paragraph>
    <Paragraph position="6"> Step 2: Parse those sentences which contain the top 50 verbs. After we identify the 50 most important verbs for the domain under analysis, we parse all the sentences in the domain document collection containing these verbs with the Stanford syntactic parser (Klein and Manning, 2002).</Paragraph>
    <Paragraph position="7"> Step 3: Identify the most frequent subtrees containing the top 50 verbs. A domain template should contain not only the most important actions for the domain, but also the entities that are linked to these actions, or to each other, through these actions. The lexemes referring to such entities can potentially be used within the domain template slots. Thus, we analyze those portions of the syntactic trees which contain the verbs themselves plus the other lexemes used in the same subtrees as the verbs. To do this we use the FREQuent Tree miner, an implementation of the algorithms presented in (Abe et al., 2002; Zaki, 2002), which extract frequent ordered subtrees from a set of ordered trees. Following (Sudo et al., 2003), we are interested only in the lexemes which are near neighbors of the most frequent verbs. Thus, we look only for those subtrees which contain the verb itself and have from four to ten tree nodes, where a node is either a syntactic tag or a lexeme with its tag. We analyze not only the NPs corresponding to the subject or object of the verb, but other syntactic constituents as well. For example, PPs can potentially link the verb to locations or dates, and we want to include this information in the template. Table 1 contains a sample of subtrees for the terrorist attack domain mined from the sentences containing</Paragraph>
  </Section>
  <Section position="8" start_page="209" end_page="209" type="metho">
    <SectionTitle>
8 (SBAR(S(VP(VBD killed)(NP(QP(IN at))(NNS people)))))
8 (SBAR(S(VP(VBD killed)(NP(QP(JJS least))(NNS people)))))
5 (VP(ADVP)(VBD killed)(NP(NNS people)))
6 (VP(VBD killed)(NP(ADJP(JJ many))(NNS people)))
5 (VP(VP(VBD killed)(NP(NNS people))))
7 (VP(ADVP(NP))(VBD killed)(NP(CD 34)(NNS people)))
6 (VP(ADVP)(VBD killed)(NP(CD 34)(NNS people)))
</SectionTitle>
    <Paragraph position="0"> the verb killed. The first column of Table 1 shows how many nodes are in the subtree.</Paragraph>
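The node-count convention of Table 1 can be checked mechanically: each opening bracket introduces one node (a syntactic tag, optionally paired with a lexeme). A toy filter over already-extracted subtree strings, standing in for the actual FREQT mining (which the sketch does not reimplement), might look like:

```python
from collections import Counter

def node_count(subtree):
    # Each "(" opens one node: a syntactic tag, optionally with a lexeme.
    return subtree.count("(")

def frequent_subtrees(subtrees, min_count, min_nodes=4, max_nodes=10):
    # Toy stand-in for frequent-tree mining: keep subtree strings that
    # occur at least min_count times and fall in the required size range.
    counts = Counter(subtrees)
    return {s: c for s, c in counts.items()
            if c >= min_count and min_nodes <= node_count(s) <= max_nodes}
```

Applied to the first and third rows of Table 1, `node_count` reproduces the listed sizes of 8 and 5.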
    <Paragraph position="1"> Step 4: Substitute named entities with their respective tags. We are interested in analyzing a whole domain, not just one instance of it. Thus, we substitute all the named entities with their respective tags, and all the exact numbers with the tag NUMBER. We speculate that subtrees similar to those presented in Table 1 can be extracted from a document collection representing any instance of a terrorist attack, with the only difference being the exact number of casualties. Later, however, we analyze the domain instances separately to identify information typical for the domain. The procedure of substituting named entities with their respective tags has previously proved useful for various tasks (Barzilay and Lee, 2003; Sudo et al., 2003; Filatova and Prager, 2005). To get named entity tags we used BBN's IdentiFinder (Bikel et al., 1999).</Paragraph>
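The generalization step can be sketched as below. Real named-entity spans would come from a tagger such as IdentiFinder; here a hypothetical dictionary stands in for its output, and only the number rewriting is shown concretely:

```python
import re

def generalize(subtree, entity_tags):
    # entity_tags: {surface string: NE tag}, a stand-in for tagger output.
    for surface, tag in entity_tags.items():
        subtree = subtree.replace(surface, tag)
    # Replace exact numbers with the NUMBER tag.
    return re.sub(r"\b\d+\b", "NUMBER", subtree)
```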
    <Paragraph position="2"> Step 5: Merge together the frequent subtrees. Finally, we merge together those subtrees which are identical according to the information encoded within them. This is a key step in our algorithm which allows us to bring together subtrees from different instances of the same domain. For example, the information rendered by all the subtrees from the bottom part of Table 1 is identical. Thus, these subtrees can be merged into one which contains the longest common pattern:</Paragraph>
  </Section>
  <Section position="9" start_page="209" end_page="210" type="metho">
    <SectionTitle>
(VBD killed)(NP(NUMBER)(NNS people))
</SectionTitle>
    <Paragraph position="0"> After this merging procedure we keep only those merged subtrees for which every domain instance contains at least one of the subtrees from the initial set, and that subtree is used in the instance at least twice. At this step, we make sure that the template keeps only the information which is important for the domain in general, rather than for just a fraction of its instances.</Paragraph>
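The cross-instance filter can be sketched as follows. The rule is stated somewhat ambiguously in the text; this sketch adopts one reading, namely that every instance must use some subtree of a merged group at least twice:

```python
from collections import Counter

def keep_cross_instance_groups(merged_groups, instance_subtrees):
    # merged_groups: sets of generalized subtrees judged identical in content.
    # instance_subtrees: one list of generalized subtrees per domain instance.
    kept = []
    for group in merged_groups:
        counts = [Counter(inst) for inst in instance_subtrees]
        # every instance must contain some subtree of the group >= 2 times
        if all(any(c[s] >= 2 for s in group) for c in counts):
            kept.append(group)
    return kept
```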
    <Paragraph position="1"> We also remove all the syntactic tags, as we want to make this pattern as general for the domain as possible. A pattern without syntactic dependencies contains a verb together with a prospective template slot corresponding to this verb: killed: (NUMBER) (NNS people) In the above example, the prospective template slots appear after the verb killed. In other cases the domain slots appear in front of the verb. Two examples of such slots, for the presidential election and earthquake domains, are shown below: (PERSON) won (NN earthquake) struck The above examples show that it is not enough to analyze only named entities; general nouns contain important information as well. We term the structure consisting of a verb together with its associated slots a slot structure. Here is a part of the slot structure we get for the verb killed after cross-examination of the terrorist attack instances:</Paragraph>
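The tag-stripping step can be approximated on merged patterns such as (VBD killed)(NP(NUMBER)(NNS people)) by collecting the innermost bracketed units; every leaf except the one holding the verb becomes a prospective slot. This is a sketch under that assumption, not the paper's implementation:

```python
import re

def slot_structure(verb, pattern):
    # Innermost bracketed units, e.g. "(NUMBER)" or "(NNS people)".
    leaves = re.findall(r"\([A-Z]+(?: [^()]+)?\)", pattern)
    # Drop the leaf containing the verb itself; keep the rest as slots.
    slots = [leaf for leaf in leaves if verb not in leaf]
    return verb + ": " + " ".join(slots)
```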
    <Paragraph position="3"> Slot structures are similar to verb frames, which are manually created for the PropBank annotation (Palmer et al., 2005). An example of the PropBank frame for the verb to kill is: Roleset kill.01 &quot;cause to die&quot;:</Paragraph>
    <Section position="1" start_page="210" end_page="210" type="sub_section">
      <SectionTitle>
Arg0: killer
Arg1: corpse
Arg2: instrument
</SectionTitle>
      <Paragraph position="0"> The difference between the slot structure extracted by our algorithm and the PropBank frame slots is that the frame slots assign a semantic role to each slot, while our algorithm gives either the type of the named entity that should fill the slot or puts a particular noun into the slot (e.g., ORGANIZATION, earthquake, people). An ideal domain template should include semantic information, but this problem is outside the scope of this paper.</Paragraph>
      <Paragraph position="1"> Step 6: Creating domain templates. After we get all the frequent subtrees containing the top 50 domain verbs, we merge all the subtrees corresponding to the same verb and create a slot structure for every verb as described in Step 5. The union of such slot structures created for all the important verbs in the domain is called the domain template.</Paragraph>
      <Paragraph position="2"> From the created templates we remove the slots which are used in all the domains, for example, (PERSON) told. The presented algorithm can be used to create a template for any domain. It does not require predefined domain or world knowledge. We learn domain templates by cross-examining document collections describing different instances of the domain of interest.</Paragraph>
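The final assembly of Step 6 can be sketched with set operations: take the union of per-verb slot structures for the target domain, then drop any slot that occurs with the same verb in every domain. All names here are illustrative assumptions:

```python
def build_domain_template(domain_slots, all_domain_slots):
    # domain_slots: {verb: set of slots} for the target domain.
    # all_domain_slots: the same mapping for every domain (target included).
    template = {}
    for verb, slots in domain_slots.items():
        # Slots present with this verb in every domain are treated as
        # domain-independent (e.g. the (PERSON) subject of "told").
        shared = set(slots)
        for other in all_domain_slots:
            shared &= other.get(verb, set())
        kept = slots - shared
        if kept:
            template[verb] = kept
    return template
```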
    </Section>
  </Section>
</Paper>