<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1016">
  <Title>Convolution Kernels with Feature Selection for Natural Language Processing Tasks</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Problem of Applying Convolution Kernels to NLP Tasks
</SectionTitle>
    <Paragraph position="0"> This section discusses an issue that arises when applying convolution kernels to NLP tasks.</Paragraph>
    <Paragraph position="1"> According to the original definition of convolution kernels, all the sub-structures are enumerated and calculated for the kernels. The number of sub-structures in the input object usually becomes exponential against input object size. As a result, all kernel values ^K(X; Y ) are nearly 0 except the kernel value of the object itself, ^K(X; X), which is 1. In this situation, the machine learning process becomes almost the same as memory-based learning.</Paragraph>
    <Paragraph position="2"> This means that we obtain a result that is very precise but with very low recall.</Paragraph>
    <Paragraph position="3"> To avoid this, most conventional methods use an approach that involves smoothing the kernel values or eliminating features based on the sub-structure size.</Paragraph>
    <Paragraph position="4"> For sequence kernels, (Cancedda et al., 2003) use a feature elimination method based on the size of sub-sequence n. This means that the kernel calculation deals only with those sub-sequences whose size is n or less. For tree kernels, (Collins and Duffy, 2001) proposed a method that restricts the features based on sub-trees depth. These methods seem to work well on the surface, however, good results are achieved only when n is very small, i.e. n = 2.</Paragraph>
    <Paragraph position="5"> The main reason for using convolution kernels is that they allow us to employ structural features simply and efficiently. When only small sized sub-structures are used (i.e. n = 2), the full benefits of convolution kernels are missed.</Paragraph>
    <Paragraph position="6"> Moreover, these results do not mean that larger sized sub-structures are not useful. In some cases we already know that larger sub-structures are significant features as regards solving the target problem. That is, these significant larger sub-structures,  squared value c c P row u Ouc = y Ou c Ou = x u O uc O u c O uP column Oc = M O c N  which the conventional methods cannot deal with efficiently, should have a possibility of improving the performance furthermore.</Paragraph>
    <Paragraph position="7"> The aim of the work described in this paper is to be able to use any significant sub-structure efficiently, regardless of its size, to solve NLP tasks.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Proposed Feature Selection Method
</SectionTitle>
    <Paragraph position="0"> Our approach is based on statistical feature selection in contrast to the conventional methods, which use sub-structure size.</Paragraph>
    <Paragraph position="1"> For a better understanding, consider the two-class (positive and negative) supervised classification problem. In our approach we test the statistical deviation of all the sub-structures in the training samples between the appearance of positive samples and negative samples. This allows us to select only the statistically significant sub-structures when calculating the kernel value.</Paragraph>
    <Paragraph position="2"> Our approach, which uses a statistical metric to select features, is quite natural. We note, however, that kernels are calculated using the DP algorithm.</Paragraph>
    <Paragraph position="3"> Therefore, it is not clear how to calculate kernels efficiently with a statistical feature selection method. First, we briefly explain a statistical metric, the chi-squared ( 2) value, and provide an idea of how to select significant features. We then describe a method for embedding statistical feature selection into kernel calculation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Statistical Metric: Chi-squared Value
</SectionTitle>
      <Paragraph position="0"> There are many kinds of statistical metrics, such as chi-squared value, correlation coefficient and mutual information. (Rogati and Yang, 2002) reported that chi-squared feature selection is the most effective method for text classification. Following this information, we use 2 values as statistical feature selection criteria. Although we selected 2 values, any other statistical metric can be used as long as it is based on the contingency table shown in Table 1.</Paragraph>
      <Paragraph position="1"> We briefly explain how to calculate the 2 value by referring to Table 1. In the table, c and c represent the names of classes, c for the positive class  and c for the negative class. Ouc, Ou c, O uc and O u c represent the number of u that appeared in the positive sample c, the number of u that appeared in the negative sample c, the number of u that did not appear in c, and the number of u that did not appear in c, respectively. Let y be the number of samples of positive class c that contain sub-sequence u, and x be the number of samples that contain u. Let N be the total number of (training) samples, and M be the number of positive samples.</Paragraph>
      <Paragraph position="2"> Since N and M are constant for (fixed) data, 2 can be written as a function of x and y,</Paragraph>
      <Paragraph position="4"> 2 expresses the normalized deviation of the observation from the expectation.</Paragraph>
      <Paragraph position="5"> We simply represent 2(x; y) as 2(u).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Feature Selection Criterion
</SectionTitle>
      <Paragraph position="0"> The basic idea of feature selection is quite natural.</Paragraph>
      <Paragraph position="1"> First, we decide the threshold of the 2 value. If 2(u) &lt; holds, that is, u is not statistically significant, then u is eliminated from the features and the value of u is presumed to be 0 for the kernel value.</Paragraph>
      <Paragraph position="2"> The sequence kernel with feature selection (FSSK) can be defined as follows:</Paragraph>
      <Paragraph position="4"> The difference between Equations (3) and (9) is simply the condition of the first summation. FSSK selects significant sub-sequence u by using the condition of the statistical metric 2(u).</Paragraph>
      <Paragraph position="5"> Figure 2 shows a simple example of what FSSK calculates for the kernel value.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Efficient χ²(u) Calculation Method
</SectionTitle>
      <Paragraph position="0"> It is computationally infeasible to calculate 2(u) for all possible u with a naive exhaustive method.</Paragraph>
      <Paragraph position="1"> In our approach, we use a sub-structure mining algorithm to calculate 2(u). The basic idea comes from a sequential pattern mining technique, PrefixSpan (Pei et al., 2001), and a statistical metric pruning (SMP) method, Apriori SMP (Morishita and Sese, 2000). By using these techniques, all the significant sub-sequences u that satisfy 2(u) can be found efficiently by depth-first search and pruning. Below, we briefly explain the concept involved in finding the significant features.</Paragraph>
      <Paragraph position="2"> First, we denote uv, which is the concatenation of sequences u and v. Then, u is a specific sequence and uv is any sequence that is constructed by u with any suffix v. The upper bound of the 2 value of uv can be defined by the value of u (Morishita and Sese, 2000).</Paragraph>
      <Paragraph position="4"> where xu and yu represent the value of x and y of u. This inequation indicates that if b 2(u) is less than a certain threshold , all sub-sequences uv can be eliminated from the features, because no sub-sequence uv can be a feature.</Paragraph>
      <Paragraph position="5"> The PrefixSpan algorithm enumerates all the significant sub-sequences by using a depth-first search and constructing a TRIE structure to store the significant sequences of internal results efficiently. Specifically, PrefixSpan algorithm evaluates uw, where uw represents a concatenation of a sequence  u and a symbol w, using the following three conditions. null 1. 2(uw) 2. &gt; 2(uw), &gt; b 2(uw) 3. &gt; 2(uw), b 2(uw)  With 1, sub-sequence uw is selected as a significant feature. With 2, sub-sequence uw and arbitrary sub-sequences uwv, are less than the threshold . Then w is pruned from the TRIE, that is, all uwv where v represents any suffix pruned from the search space. With 3, uw is not selected as a significant feature because the 2 value of uw is less than , however, uwv can be a significant feature because the upper-bound 2 value of uwv is greater than , thus the search is continued to uwv.</Paragraph>
      <Paragraph position="6"> Figure 3 shows a simple example of PrefixSpan with SMP that searches for the significant features</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SMP
</SectionTitle>
    <Paragraph position="0"> by using a depth-first search with a TRIE representation of the significant sequences. The values of each symbol represent 2(u) and b 2(u) that can be calculated from the number of xu and yu. The TRIE structure in the figure represents the statistically significant sub-sequences that can be shown in a path from ? to the symbol.</Paragraph>
    <Paragraph position="1"> We exploit this TRIE structure and PrefixSpan pruning method in our kernel calculation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Embedding Feature Selection in Kernel
Calculation
</SectionTitle>
      <Paragraph position="0"> This section shows how to integrate statistical feature selection in the kernel calculation. Our proposed method is defined in the following equations.</Paragraph>
      <Paragraph position="2"> Let Km(Si; Tj) be a function that returns the sum value of all statistically significant common sub-sequences u if si = tj.</Paragraph>
      <Paragraph position="4"> whose size juj is m and that satisfy the above condition 1. The m(Si; Tj) is defined in detail in Equation (15).</Paragraph>
      <Paragraph position="5"> Then, let Ju(Si; Tj), J0u(Si; Tj) and J00u (Si; Tj) be functions that calculate the value of the common sub-sequences between Si and Tj recursively, as well as equations (5) to (7) for sequence kernels. We introduce a special symbol to represent an &amp;quot;empty sequence&amp;quot;, and define w = w and j wj = 1.</Paragraph>
      <Paragraph position="7"> where I(w) is a function that returns a matching value of w. In this paper, we define I(w) is 1.</Paragraph>
      <Paragraph position="8"> b m(Si; Tj) has realized conditions 2 and 3; the details are defined in Equation (16).</Paragraph>
      <Paragraph position="10"> The following five equations are introduced to select a set of significant sub-sequences. m(Si; Tj) and b m(Si; Tj) are sets of sub-sequences (features) that satisfy condition 1 and 3, respectively, when calculating the value between Si and Tj in Equations (11) and (12).</Paragraph>
      <Paragraph position="12"> where F represents a set of sub-sequences. Notice that m(Si; Tj) and b m(Si; Tj) have only sub-sequences u that satisfy 2(uw) or b 2(uw), respectively, if si = tj(= w); otherwise they become empty sets.</Paragraph>
      <Paragraph position="13"> The following two equations are introduced for recursive set operations to calculate m(Si; Tj) and</Paragraph>
      <Paragraph position="15"> In the implementation, Equations (11) to (14) can be performed in the same way as those used to calculate the original sequence kernels, if the feature selection condition of Equations (15) to (19) has been removed. Then, Equations (15) to (19), which select significant features, are performed by the PrefixSpan algorithm described above and the TRIE representation of statistically significant features.</Paragraph>
      <Paragraph position="16"> The recursive calculation of Equations (12) to (14) and Equations (16) to (19) can be executed in the same way and at the same time in parallel. As a result, statistical feature selection can be embedded in oroginal sequence kernel calculation based on a dynamic programming technique.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Properties
</SectionTitle>
      <Paragraph position="0"> The proposed method has several important advantages over the conventional methods.</Paragraph>
      <Paragraph position="1"> First, the feature selection criterion is based on a statistical measure, so statistically significant features are automatically selected.</Paragraph>
      <Paragraph position="2"> Second, according to Equations (10) to (18), the proposed method can be embedded in an original kernel calculation process, which allows us to use the same calculation procedure as the conventional methods. The only difference between the original sequence kernels and the proposed method is that the latter calculates a statistical metric 2(u) by using a sub-structure mining algorithm in the kernel calculation.</Paragraph>
      <Paragraph position="3"> Third, although the kernel calculation, which unifies our proposed method, requires a longer training time because of the feature selection, the selected sub-sequences have a TRIE data structure.</Paragraph>
      <Paragraph position="4"> This means a fast calculation technique proposed in (Kudo and Matsumoto, 2003) can be simply applied to our method, which yields classification very quickly. In the classification part, the features (subsequences) selected in the learning part must be known. Therefore, we store the TRIE of selected sub-sequences and use them during classification.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Proposed Method Applied to Other
Convolution Kernels
</SectionTitle>
    <Paragraph position="0"> We have insufficient space to discuss this subject in detail in relation to other convolution kernels. However, our proposals can be easily applied to tree kernels (Collins and Duffy, 2001) by using string encoding for trees. We enumerate nodes (labels) of tree in postorder traversal. After that, we can employ a sequential pattern mining technique to select statistically significant sub-trees. This is because we can convert to the original sub-tree form from the string encoding representation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Support Vector Machines
</SectionTitle>
      <Paragraph position="0"> parameter value soft margin for SVM (C) 1000 decay factor of gap ( ) 0.5 threshold of 2 ( ) 2.70553.8415 As a result, we can calculate tree kernels with statistical feature selection by using the original tree kernel calculation with the sequential pattern mining technique introduced in this paper. Moreover, we can expand our proposals to hierarchically structured graph kernels (Suzuki et al., 2003a) by using a simple extension to cover hierarchical structures.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"> We evaluated the performance of the proposed method in actual NLP tasks, namely English question classification (EQC), Japanese question classification (JQC) and sentence modality identification (MI) tasks.</Paragraph>
    <Paragraph position="1"> We compared the proposed method (FSSK) with a conventional method (SK), as discussed in Section 3, and with bag-of-words (BOW) Kernel (BOW-K)(Joachims, 1998) as baseline methods.</Paragraph>
    <Paragraph position="2"> Support Vector Machine (SVM) was selected as the kernel-based classifier for training and classification. Table 2 shows some of the parameter values that we used in the comparison. We set thresholds of = 2:7055 (FSSK1) and = 3:8415 (FSSK2) for the proposed methods; these values represent the 10% and 5% level of significance in the 2 distribution with one degree of freedom, which used the 2 significant test.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Question Classification
</SectionTitle>
      <Paragraph position="0"> Question classification is defined as a task similar to text categorization; it maps a given question into a question type.</Paragraph>
      <Paragraph position="1"> We evaluated the performance by using data provided by (Li and Roth, 2002) for English and (Suzuki et al., 2003b) for Japanese question classification and followed the experimental setting used in these papers; namely we use four typical question types, LOCATION, NUMEX, ORGANI-ZATION, and TIME TOP for JQA, and &amp;quot;coarse&amp;quot; and &amp;quot;fine&amp;quot; classes for EQC. We used the one-vs-rest classifier of SVM as the multi-class classification method for EQC.</Paragraph>
      <Paragraph position="2"> Figure 4 shows examples of the question classification data used here.</Paragraph>
      <Paragraph position="3"> question types input object : word sequences ([ ]: information of chunk and h i: named entity)</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Sentence Modality Identification
</SectionTitle>
      <Paragraph position="0"> For example, sentence modality identification techniques are used in automatic text analysis systems that identify the modality of a sentence, such as &amp;quot;opinion&amp;quot; or &amp;quot;description&amp;quot;.</Paragraph>
      <Paragraph position="1"> The data set was created from Mainichi news articles and one of three modality tags, &amp;quot;opinion&amp;quot;, &amp;quot;decision&amp;quot; and &amp;quot;description&amp;quot; was applied to each sentence. The data size was 1135 sentences consisting of 123 sentences of &amp;quot;opinion&amp;quot;, 326 of &amp;quot;decision&amp;quot; and 686 of &amp;quot;description&amp;quot;. We evaluated the results by using 5-fold cross validation.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>