<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2052">
  <Title>Efficient sentence retrieval based on syntactic structure</Title>
  <Section position="4" start_page="399" end_page="399" type="metho">
    <SectionTitle>
2 Tree Kernel
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="399" end_page="399" type="sub_section">
      <SectionTitle>
2.1 Definition of similarity
</SectionTitle>
      <Paragraph position="0"> Tree Kernel was proposed by Collins et al. (Collins, 2001a; Collins, 2001b) as a method for calculating the similarity between tree structures. Tree Kernel defines the similarity between two trees as the number of subtrees they share. A subtree S of a tree T is defined as any tree subsumed by T that consists of more than one node, with the constraint that if any child of a node is included, all of that node's children are included.</Paragraph>
      <Paragraph position="1"> Tree Kernel is not always suitable, because the desired properties of a similarity measure differ depending on the application. Takahashi et al. proposed three types of similarity based on Tree Kernel (Takahashi, 2002). We use one of the similarity measures (equation (1)) proposed by Takahashi et al.</Paragraph>
      <Paragraph position="3"> KC(T1,T2) = max_{n1 ∈ N(T1), n2 ∈ N(T2)} C(n1,n2) (1) where C(n1,n2) is the number of subtrees shared by the two trees rooted at nodes n1 and n2, and N(T) is the set of nodes of tree T.</Paragraph>
    </Section>
    <Section position="2" start_page="399" end_page="399" type="sub_section">
      <SectionTitle>
2.2 Algorithm to calculate similarity
</SectionTitle>
      <Paragraph position="0"> Collins et al. (Collins, 2001a; Collins, 2001b) proposed an efficient method to calculate Tree Kernel by using C(n1,n2) as follows.</Paragraph>
      <Paragraph position="1"> * If the productions at n1 and n2 are different, then C(n1,n2) = 0.</Paragraph>
      <Paragraph position="3"> * If the productions at n1 and n2 are the same, and n1 and n2 are pre-terminals, then C(n1,n2) = 1.</Paragraph>
      <Paragraph position="5"> * Else if the productions at n1 and n2 are the same and n1 and n2 are not pre-terminals, then</Paragraph>
      <Paragraph position="7"> C(n1,n2) = prod_{i=1..nc(n1)} (1 + C(ch(n1,i),ch(n2,i))) (2) where nc(n) is the number of children of node n and ch(n,i) is the i'th child node of n. Equation (2) recursively calculates C on the child nodes, and calculating the values of C in postorder avoids recalculation. Thus, the time complexity of KC(T1,T2) is O(mn), where m and n are the numbers of nodes in T1 and T2 respectively.</Paragraph>
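The recursion above can be illustrated with a short sketch in Ruby, the language used for the implementation in section 5.2. The tree encoding and helper names (label, kids, production, k_c) are ours, not the paper's, and since equation (1) is not reproduced in this extraction, the whole-tree similarity below takes the maximum of C over node pairs by analogy with STO in section 3.1 — an assumption.

```ruby
# Sketch of the C(n1,n2) recursion (not the authors' code).  A tree is a
# nested array [label, child, ...]; a bare symbol is a terminal word.
def label(n); n.is_a?(Array) ? n[0] : n; end
def kids(n);  n.is_a?(Array) ? n[1..-1] : []; end
def production(n); [label(n)] + kids(n).map { |c| label(c) }; end
def preterminal?(n)
  kids(n).any? and kids(n).all? { |c| !c.is_a?(Array) }
end

# C(n1,n2): number of subtrees shared by the trees rooted at n1 and n2.
def c(n1, n2)
  return 0 if kids(n1).empty? or kids(n2).empty?   # terminals root no subtree
  return 0 unless production(n1) == production(n2) # different productions
  return 1 if preterminal?(n1)                     # same pre-terminal rule
  # Equation (2): product over the children of (1 + C(child_i, child_i)).
  kids(n1).zip(kids(n2)).inject(1) { |p, (a, b)| p * (1 + c(a, b)) }
end

def nonterms(n)
  return [] unless n.is_a?(Array)
  [n] + kids(n).flat_map { |m| nonterms(m) }
end

# Whole-tree similarity: maximum of C over all node pairs (our stand-in
# for the missing equation (1)).
def k_c(t1, t2)
  nonterms(t1).product(nonterms(t2)).map { |a, b| c(a, b) }.max
end
```

For example, with t1 = [:S, [:NP, [:N, :dogs]], [:VP, [:V, :bark]]] and a t2 that differs only in the word under :N, C at the roots is (1 + 1) x (1 + 2) = 6. Memoizing c on node pairs gives the O(mn) behaviour noted above.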
    </Section>
    <Section position="3" start_page="399" end_page="399" type="sub_section">
      <SectionTitle>
2.3 Algorithm to retrieve sentences
</SectionTitle>
      <Paragraph position="0"> Neither Collins nor Takahashi discussed retrieval algorithms using Tree Kernel. We use the following simple algorithm: we calculate the similarity KC(T1,T2) between the query tree and every tree in the corpus, and rank the corpus trees in descending order of KC.</Paragraph>
      <Paragraph position="1"> Tree Kernel exploits all subtrees shared by two trees. Retrieval therefore requires a considerable amount of time, because the similarity calculation must be performed between the query and every tree in the corpus. In general, retrieval time can be improved with an index table. However, indexing by all subtrees is impractical, because a single tree often includes an enormous number of subtrees.</Paragraph>
      <Paragraph position="2"> For example, one sentence in the Titech Corpus (Noro et al., 2005) with 22 words and 87 nodes includes 8,213,574,246 subtrees. The number of subtrees in a tree with N nodes is bounded above by 2^N.</Paragraph>
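The explosion in the number of subtrees can be reproduced for small trees directly from the subtree definition in section 2.1. The following sketch (ours, not the authors' code) counts subtrees rooted at each node: a node must expand to all of its children, and each child is either cut (left as a leaf of the subtree) or expanded further.

```ruby
# Counting subtrees per the definition in section 2.1 (a sketch).
def kids(n); n.is_a?(Array) ? n[1..-1] : []; end

# Subtrees rooted at n (each has more than one node).
def rooted(n)
  return 0 if kids(n).empty?
  kids(n).inject(1) { |p, c| p * (1 + rooted(c)) }
end

# All subtrees of t: sum the rooted counts over every internal node.
def total(t)
  return 0 unless t.is_a?(Array)
  rooted(t) + kids(t).map { |c| total(c) }.inject(0, :+)
end

# A perfect binary tree with d internal levels over dummy words :w.
def perfect(d)
  return :w if d == 0
  [:X, perfect(d - 1), perfect(d - 1)]
end
```

Here total(perfect(2)) is 6 for a 7-node tree and total(perfect(4)) is already 750 for a 31-node tree: the count grows roughly doubly exponentially with depth while staying under the 2^N bound.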
    </Section>
  </Section>
  <Section position="5" start_page="399" end_page="402" type="metho">
    <SectionTitle>
3 Tree Overlapping
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="399" end_page="399" type="sub_section">
      <SectionTitle>
3.1 Definition of similarity
</SectionTitle>
      <Paragraph position="0"> When an arbitrary node n1 of tree T1 is put on a node n2 of tree T2, some production rules in T1 may overlap identical production rules in T2. We define CTO(n1,n2) as the number of such overlapping production rules when n1 is put on n2 (Figure 2).</Paragraph>
      <Paragraph position="1"> We will define CTO(n1,n2) more precisely.</Paragraph>
      <Paragraph position="2"> First we define L(n1,n2) of node n1 of T1 and node n2 of T2. L(n1,n2) represents a set of pairs of nodes which overlap each other when putting n1 on n2. For example in Figure 2, L(b11,b21) = {(b11,b21),(d11,d21),(e11,e21),(g11,g21),(i11,j21)}.</Paragraph>
      <Paragraph position="3"> L(n1,n2) is defined as follows. Here ni and mi are nodes of tree Ti, and ch(n,i) is the i'th child of n. 1. (n1,n2) ∈ L(n1,n2). 2. If (m1,m2) ∈ L(n1,n2), then (ch(m1,i),ch(m2,i)) ∈ L(n1,n2). 3. If (ch(m1,i),ch(m2,i)) ∈ L(n1,n2), then (m1,m2) ∈ L(n1,n2). 4. L(n1,n2) includes only the pairs generated by applying rules 2 and 3 recursively.</Paragraph>
      <Paragraph position="4"> CTO(n1,n2) is defined by using L(n1,n2) as follows.</Paragraph>
      <Paragraph position="5"> CTO(n1,n2) = |{(m1,m2) ∈ L(n1,n2) | m1 ∈ NT(T1), m2 ∈ NT(T2), PR(m1) = PR(m2)}| (3) where NT(T) is the set of nonterminal nodes in tree T and PR(n) is the production rule rooted at node n. The Tree Overlapping similarity STO(T1,T2) is defined using CTO(n1,n2) as follows.</Paragraph>
      <Paragraph position="7"> STO(T1,T2) = max_{n1 ∈ NT(T1), n2 ∈ NT(T2)} CTO(n1,n2). This formula corresponds to equation (1) of Tree Kernel.</Paragraph>
      <Paragraph position="8"> As an example, we calculate STO(T1,T2) in Figure 2 (1). Putting b11 on b21 gives Figure 2 (2), in which the two production rules b -> d e and e -> g overlap. Thus, CTO(b11,b21) becomes 2. Overlapping g11 and g21 gives Figure 2 (3), in which only the production rule g -> i overlaps. Thus, CTO(g11,g21) becomes 1. Since no other node pair gives a CTO larger than 2, STO(T1,T2) becomes 2.</Paragraph>
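A brute-force reading of L, CTO and STO can be sketched in Ruby. The toy trees and all names below are ours (Figure 2 is not reproduced in this extraction): nodes carry parent pointers so that L(n1,n2) can be closed upward (rule 3) as well as downward (rule 2).

```ruby
# Brute-force S_TO (a sketch, not the authors' implementation).
Node = Struct.new(:label, :children, :parent, :order)

def build(spec, parent = nil, order = 0)
  node = Node.new(spec.is_a?(Array) ? spec[0] : spec, [], parent, order)
  if spec.is_a?(Array)
    spec[1..-1].each_with_index { |c, i| node.children.push(build(c, node, i)) }
  end
  node
end

def rule(n); [n.label] + n.children.map { |c| c.label }; end

# L(n1,n2): ascend while both nodes are the i'th child for the same i
# (rule 3), then pair descendants position by position (rule 2).
def pairs(n1, n2)
  while n1.parent and n2.parent and n1.order == n2.order
    n1, n2 = n1.parent, n2.parent
  end
  down(n1, n2)
end

def down(a, b)
  result = [[a, b]]
  a.children.zip(b.children).each do |c, d|
    result.concat(down(c, d)) if c and d
  end
  result
end

# C_TO: overlapping pairs of nonterminals carrying identical rules.
def cto(n1, n2)
  pairs(n1, n2).count { |a, b| a.children.any? and b.children.any? and rule(a) == rule(b) }
end

def nonterms(n)
  return [] if n.children.empty?
  [n] + n.children.flat_map { |c| nonterms(c) }
end

# S_TO: the maximum of C_TO over all node pairs.
def sto(t1, t2)
  nonterms(t1).product(nonterms(t2)).map { |a, b| cto(a, b) }.max
end
```

With t1 = build([:b, [:d, :x1], [:e, [:g, [:i, :y]]]]) and t2 = build([:b, [:d, :x2], [:e, [:g, [:j, :y]]]]), only b -> d e and e -> g overlap in the best superposition, so sto(t1, t2) is 2, mirroring the count in the example above.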
    </Section>
    <Section position="2" start_page="399" end_page="401" type="sub_section">
      <SectionTitle>
3.2 Algorithm
</SectionTitle>
      <Paragraph position="0"> Let us take an example in Figure 3 to explain the algorithm. Suppose that T0 is a query tree and the corpus has only two trees, T1 and T2.</Paragraph>
      <Paragraph position="1"> The method to find the most similar tree to a given query tree is basically the same as Tree Kernel's (section 2.2). However, unlike Tree Kernel, Tree Overlapping-based retrieval can be accelerated by indexing the corpus in advance. Thus, given a tree corpus, we build an index table I[p] which maps a production rule p to its occurrences.</Paragraph>
      <Paragraph position="2"> Occurrences of production rules are represented by their left-hand side symbols, and are distinguished with respect to the tree including the rule and the position within that tree: I[p] = {n | T ∈ F, n ∈ NT(T), PR(n) = p} where F is the corpus (here {T1,T2}) and the meaning of the other symbols is the same as in the definition of CTO (equation (3)).</Paragraph>
      <Paragraph position="3"> Table 1 shows an example of the index table generated from T1 and T2 in Figure 3 (1). In Table 1, a superscript of a nonterminal symbol identifies a tree, and a subscript identifies a position in the tree.</Paragraph>
      <Paragraph position="4"> By using the index table, we calculate C[n,m] with the following algorithm.</Paragraph>
      <Paragraph position="5"> for all (n,m) do C[n,m] := 0 end for all n ∈ NT(T0) do for all m ∈ I[PR(n)] do C[top(n,m)] := C[top(n,m)] + 1 end end</Paragraph>
      <Paragraph position="7"> where top(n,m) returns the upper-most pair of overlapped nodes when nodes n and m overlap.</Paragraph>
      <Paragraph position="8"> The value of top uniquely identifies a situation of overlapping two trees. Function top(n,m) is calculated by the following algorithm.</Paragraph>
      <Paragraph position="10"> top(n,m): while parent(n) and parent(m) exist and order(n) = order(m) do (n,m) := (parent(n),parent(m)) end; return (n,m) where parent(n) is the parent node of n, and order(n) is the order of node n among its siblings.</Paragraph>
      <Paragraph position="11"> Table 2 shows example values of top(n,m) generated by overlapping T0 and T1 in Figure 3. Note that top maps every pair of corresponding nodes in a certain overlapping situation to a pair of the upper-most nodes of that situation. This enables us to use the value of top as an identifier of a situation of overlap.</Paragraph>
      <Paragraph position="13"> Since C[top(n,m)] accumulates one count per overlapping production rule in the situation identified by top(n,m), the tree similarity between a query tree T0 and each tree T in the corpus, STO(T0,T), can be calculated as the maximum of C[n,m] over the node pairs (n,m) with m belonging to T.</Paragraph>
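The indexed retrieval just described can be sketched compactly in Ruby (our code under our own tree encoding, not the authors' implementation): the index maps each production rule to its occurrences, and every match between a query rule and a corpus rule votes for the overlapping situation identified by top.

```ruby
# Index-based Tree Overlapping retrieval (a sketch).
Node = Struct.new(:label, :children, :parent, :order)

def build(spec, parent = nil, order = 0)
  node = Node.new(spec.is_a?(Array) ? spec[0] : spec, [], parent, order)
  if spec.is_a?(Array)
    spec[1..-1].each_with_index { |c, i| node.children.push(build(c, node, i)) }
  end
  node
end

def rule(n); [n.label] + n.children.map { |c| c.label }; end

def nonterms(n)
  return [] if n.children.empty?
  [n] + n.children.flat_map { |c| nonterms(c) }
end

# I[p]: production rule p -> list of [tree_id, node] occurrences.
def build_index(corpus)
  index = Hash.new { |h, k| h[k] = [] }
  corpus.each_with_index do |t, ti|
    nonterms(t).each { |n| index[rule(n)].push([ti, n]) }
  end
  index
end

# top(n,m): upper-most pair of the overlapping situation.
def top(n, m)
  while n.parent and m.parent and n.order == m.order
    n, m = n.parent, m.parent
  end
  [n, m]
end

# One vote per matching rule pair, keyed by tree id and situation;
# S_TO for each corpus tree is the maximum count over its situations.
def retrieve(query, corpus)
  index = build_index(corpus)
  counts = Hash.new(0)
  nonterms(query).each do |n|
    index.fetch(rule(n), []).each do |ti, m|
      tn, tm = top(n, m)
      counts[[ti, tn.object_id, tm.object_id]] += 1
    end
  end
  sto = Hash.new(0)
  counts.each { |(ti, _, _), v| sto[ti] = [sto[ti], v].max }
  sto
end
```

For a query build([:b, [:d, :w1], [:e, [:g, :w2]]]) against a two-tree corpus, retrieve returns a hash of STO scores keyed by tree id; only rules present in the query are ever looked up, which is the source of the speed-up over the pairwise Tree Kernel computation.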
    </Section>
    <Section position="3" start_page="401" end_page="402" type="sub_section">
      <SectionTitle>
3.3 Comparison with Tree Kernel
</SectionTitle>
      <Paragraph position="0"> The value of STO(T1,T2) roughly corresponds to the number of production rules included in the largest subtree shared by T1 and T2. Therefore, this value represents the size of the subtree shared by both trees, like Tree Kernel's KC, though the definition of the subtree size is different.</Paragraph>
      <Paragraph position="1"> One difference is that Tree Overlapping counts shared subtrees even when they are separated by a non-shared node, as shown in Figure 4. In Figure 4, T1 and T2 share two subtrees rooted at b and c, but their parent nodes are not identical. While Tree Kernel does not consider the superposition that puts node a on h, Tree Overlapping does, and assigns a count of 2 to this superposition.</Paragraph>
      <Paragraph position="2"> Another, more important, difference is that Tree Overlapping retrieval can be accelerated by indexing the corpus in advance. The number of index entries is bounded above by the number of production rules, which keeps the index at a practical size.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="402" end_page="402" type="metho">
    <SectionTitle>
4 Subpath Set
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="402" end_page="402" type="sub_section">
      <SectionTitle>
4.1 Definition of similarity
</SectionTitle>
      <Paragraph position="0"> Subpath Set similarity between two trees is defined as the number of subpaths shared by the trees. Given a tree, its subpath set is defined as the set of every path from the root node to a leaf, together with all of their partial paths.</Paragraph>
      <Paragraph position="1"> Figure 5 (2) shows all subpaths of T1 and T2 in Figure 5 (1). Here we denote a path as a sequence of node names such as (a, b, d). Therefore, the Subpath Set similarity of T1 and T2 becomes 15.</Paragraph>
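The definition can be made concrete with a short sketch (ours; since Figure 5 is not reproduced here, whether single-node paths count is our assumption): a subpath is any downward path, i.e., a contiguous segment of some root-to-leaf path.

```ruby
# Enumerating subpaths (a sketch; the single-node convention is ours).
def label(n); n.is_a?(Array) ? n[0] : n; end
def kids(n);  n.is_a?(Array) ? n[1..-1] : []; end

# All downward paths starting at node n, as label sequences.
def paths_from(n)
  below = kids(n).flat_map { |c| paths_from(c) }
  [[label(n)]] + below.map { |p| [label(n)] + p }
end

# Subpaths of t: downward paths starting at any node.
def subpaths(t)
  result, stack = [], [t]
  until stack.empty?
    n = stack.pop
    result.concat(paths_from(n))
    stack.concat(kids(n))
  end
  result.uniq
end

# Subpath Set similarity: number of shared subpaths.
def subpath_sim(t1, t2)
  s2 = subpaths(t2)
  subpaths(t1).count { |p| s2.include?(p) }
end
```

For t1 = [:a, [:b, :d], [:c, :e]] and t2 = [:a, [:b, :d], [:c, :f]], the shared subpaths are (a), (b), (c), (d), (a, b), (a, c), (b, d) and (a, b, d), so the similarity is 8.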
    </Section>
    <Section position="2" start_page="402" end_page="402" type="sub_section">
      <SectionTitle>
4.2 Algorithm
</SectionTitle>
      <Paragraph position="0"> Suppose T0 is a query tree, TS is the set of trees in the corpus and P(T) is the set of subpaths of T. We can build an index table I[p] for each subpath p as follows: I[p] = {T ∈ TS | p ∈ P(T)}.</Paragraph>
      <Paragraph position="2"> Using the index table, we can calculate S[T], the number of subpaths shared by T0 and T, with the following algorithm: for all T do S[T] := 0 end for all p ∈ P(T0) do for all T ∈ I[p] do S[T] := S[T] + 1 end end</Paragraph>
      <Paragraph position="4"/>
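The index table and accumulation loop above can be sketched in Ruby (names and tree encoding are ours, not the authors'):

```ruby
# Index-based Subpath Set retrieval (a sketch).
def label(n); n.is_a?(Array) ? n[0] : n; end
def kids(n);  n.is_a?(Array) ? n[1..-1] : []; end

def paths_from(n)
  below = kids(n).flat_map { |c| paths_from(c) }
  [[label(n)]] + below.map { |p| [label(n)] + p }
end

def subpaths(t)
  result, stack = [], [t]
  until stack.empty?
    n = stack.pop
    result.concat(paths_from(n))
    stack.concat(kids(n))
  end
  result.uniq
end

# I[p]: subpath p -> ids of corpus trees containing p.
def build_index(corpus)
  index = Hash.new { |h, k| h[k] = [] }
  corpus.each_with_index { |t, ti| subpaths(t).each { |p| index[p].push(ti) } }
  index
end

# S[T]: one increment per subpath of the query found in tree T.
def retrieve(query, corpus)
  index = build_index(corpus)
  scores = Hash.new(0)
  subpaths(query).each do |p|
    index.fetch(p, []).each { |ti| scores[ti] += 1 }
  end
  scores
end
```

Because a subpath is just a sequence of labels, it can equally be serialized to a string and fed to an off-the-shelf inverted-index tool, as noted in section 4.3.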
    </Section>
    <Section position="3" start_page="402" end_page="402" type="sub_section">
      <SectionTitle>
4.3 Comparison with Tree Kernel
</SectionTitle>
      <Paragraph position="0"> Like Tree Overlapping retrieval, Subpath Set retrieval can be accelerated by indexing the corpus. The number of index entries is bounded above by L x D^2, where L is the maximum number of leaves of a tree (the number of words in a sentence) and D is the maximum depth of the syntactic trees. Moreover, by treating a subpath as an index term, we can use existing retrieval tools.</Paragraph>
      <Paragraph position="1"> Subpath Set uses less structural information than Tree Kernel and Tree Overlapping: it distinguishes neither the order nor the number of child nodes. Therefore, the retrieval results tend to be noisy. However, Subpath Set retrieval is faster than Tree Overlapping because its algorithm is simpler.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="402" end_page="403" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> This section describes the experiments which were conducted to compare the performance of structure retrieval based on Tree Kernel, Tree Overlapping and Subpath Set.</Paragraph>
    <Section position="1" start_page="402" end_page="402" type="sub_section">
      <SectionTitle>
5.1 Data
</SectionTitle>
      <Paragraph position="0"> We conducted two experiments using different annotated corpora. The Titech Corpus (Noro et al., 2005) consists of about 20,000 sentences from Japanese newspaper articles (Mainiti Shimbun). Each sentence has been syntactically annotated by hand.</Paragraph>
      <Paragraph position="1"> Due to the limitation of computational resources, we used 2,483 randomly selected sentences as the data collection.</Paragraph>
      <Paragraph position="2"> The Iwanami dictionary (Nishio et al., 1994) is a Japanese dictionary. We extracted 57,982 sentences from glosses in the dictionary. Each sentence was analyzed with the morphological analyzer ChaSen (Asahara et al., 1996) and the MSLR parser (Shirai et al., 2000) to obtain syntactic structure candidates, and the most probable structure with respect to the PGLR model (Inui et al., 1996) was selected from the output of the parser. Since the analyses were not checked manually, some sentences may have been assigned incorrect structures.</Paragraph>
    </Section>
    <Section position="2" start_page="402" end_page="403" type="sub_section">
      <SectionTitle>
5.2 Method
</SectionTitle>
      <Paragraph position="0"> We conducted two experiments, Experiment I and Experiment II, with different corpora. The queries were extracted from these corpora. The algorithms described in the preceding sections were implemented in Ruby 1.8.2. Table 3 outlines the experiments.</Paragraph>
    </Section>
  </Section>
</Paper>