<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1303">
  <Title>Japanese Dependency Structure Analysis Based on Support Vector Machines</Title>
  <Section position="3" start_page="18" end_page="19" type="metho">
    <SectionTitle>
2 Support Vector Machines
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
2.1 Optimal Hyperplane
</SectionTitle>
      <Paragraph position="0"> Let us define the training data which belong either to positive or negative class as follows.</Paragraph>
      <Paragraph position="1"> (xl, vl),..., (xi, v~),..., (x~, v~) xiEa n , y/E{+i,-1} xi is a feature vector of i-th sample, which is represented by an n dimensional vector (xi = (fl,...,ff,) E Rn). Yi is a scalar value that specifies the class (positive(+l) or negative(l) class) of i-th data. Formally, we can define the pattern recognition problem as a learning and building process of the decision function f: lq. n ~ {=i::l}.</Paragraph>
      <Paragraph position="2"> In basic SVMs framework, we try to separate the positive and negative examples in the training data by a linear hyperplane written as:</Paragraph>
      <Paragraph position="4"> It is supposed that the farther the positive and negative examples are separated by the discrimination function, the more accurately we could separate unseen test examples with high generalization performance. Let us consider two hyperplanes called separating hyperplanes: null</Paragraph>
      <Paragraph position="6"> (2) (3) can be written in one formula as:</Paragraph>
      <Paragraph position="8"> Distance from the separating hyperplane to the point xi can be written as:</Paragraph>
      <Paragraph position="10"> Thus, the margin between two separating hyperplanes can be written as: rain d(w,b;xi)+ rain d(w,b;xi)</Paragraph>
      <Paragraph position="12"> To maximize this margin, we should minimize Hwll. In other words, this problem becomes equivalent to solving the following optimization problem: Minimize: L(w) = 1/2l\[wl\[ 2 Subject to: yi\[(w-xi)+b\]&gt; 1 (i=l,...,l). Furthermore, this optimization problem can be rewritten into the dual form problem: Find the Lagrange multipliers c~i &gt;_ O(i = 1,..., l) so that:</Paragraph>
      <Paragraph position="14"> In this dual form problem, xi with non-zero ai is called a Support Vector. For the Support Vectors, w and b can thus be expressed as follows</Paragraph>
      <Paragraph position="16"> The elements of the set SVs are the Support Vectors that lie on the separating hyperplanes.</Paragraph>
      <Paragraph position="17"> Finally, the decision function ff : R n ---r {::El} can be written as:</Paragraph>
      <Paragraph position="19"/>
    </Section>
    <Section position="2" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
2.2 Soft Margin
</SectionTitle>
      <Paragraph position="0"> In the case where we cannot separate training examples linearly, &amp;quot;Soft Margin&amp;quot; method forgives some classification errors that may be caused by some noise in the training examples.</Paragraph>
      <Paragraph position="1"> First, we introduce non-negative slack variables, and (2),(3) are rewritten as:</Paragraph>
      <Paragraph position="3"> In this case, we minimize the following value instead of 1 2 llwll</Paragraph>
      <Paragraph position="5"> The first term in (7) specifies the size of margin and the second term evaluates how far the training data are away from the optimal separating hyperplane. C is the parameter that defines the balance of two quantities. If we make C larger, the more classification errors are neglected.</Paragraph>
      <Paragraph position="6"> Though we omit the details here, minimization of (7) is reduced to the problem to minimize the objective function (5) under the following constraints.</Paragraph>
      <Paragraph position="7"> 0 &lt; ai _&lt; c, a y/= 0 (i = 1,..., z) Usually, the value of C is estimated experimentally. null</Paragraph>
    </Section>
    <Section position="3" start_page="19" end_page="19" type="sub_section">
      <SectionTitle>
2.3 Kernel Function
</SectionTitle>
      <Paragraph position="0"> In general classification problems, there are cases in which it is unable to separate the training data linearly. In such cases, the training data could be separated linearly by expanding all combinations of features as new ones, and projecting them onto a higher-dimensional space. However, such a naive approach requires enormous computational overhead. null Let us consider the case where we project the training data x onto a higher-dimensional space by using projection function * 1 As we pay attention to the objective function (5) and the decision function (6), these functions depend only on the dot products of the input training vectors. If we could calculate the dot products from xz and x2 directly without considering the vectors ~(xz) and C/(x2) projected onto the higher-dimensional space, we can reduce the computational complexity considerably. Namely, we can reduce the computational overhead if we could find the function</Paragraph>
      <Paragraph position="2"> On the other hand, since we do not need itself for actual learning and classification, 1In general, ,It(x) is a mapping into Hilbert space. all we have to do is to prove the existence of that satisfies (8) provided the function K is selected properly. It is known that (8) holds if and only if the function K satisfies the Mercer condition (Vapnik, 1998).</Paragraph>
      <Paragraph position="3"> In this way, instead of projecting the training data onto the high-dimensional space, we can decrease the computational overhead by replacing the dot products, which is calculated in optimization and classification steps, with the function K.</Paragraph>
      <Paragraph position="4"> Such a function K is called a Kernel function. Among the many kinds of Kernel functions available, we will focus on the d-th polynomial kernel:</Paragraph>
      <Paragraph position="6"> Use of d-th polynomial kernel function allows us to build an optimal separating hyperplane which takes into account all combination of features up to d.</Paragraph>
      <Paragraph position="7"> Using a Kernel function, we can rewrite the decision function as:</Paragraph>
      <Paragraph position="9"/>
    </Section>
  </Section>
  <Section position="4" start_page="19" end_page="20" type="metho">
    <SectionTitle>
3 Dependency Analysis using SVMs
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="19" end_page="20" type="sub_section">
      <SectionTitle>
3.1 The Probability Model
</SectionTitle>
      <Paragraph position="0"> This section describes a general formulation of the probability model and parsing techniques for Japanese statistical dependency analysis.</Paragraph>
      <Paragraph position="1"> First of all, we let a sequence of chunks be {bz,b2...,bm} by B, and the sequence dependency pattern be {Dep(1),Dep(2),...,Dep(m 1)} by D, where Dep(i) = j means that the chunk b~ depends on (modifies) the chunk bj.</Paragraph>
      <Paragraph position="2"> In this framework, we suppose that the dependency sequence D satisfies the following  constraints.</Paragraph>
      <Paragraph position="3"> 1. Except for the rightmost one, each chunk depends on (modifies) exactly one of the chunks appearing to the right.</Paragraph>
      <Paragraph position="4"> 2. Dependencies do not cross each other.  Statistical dependency structure analysis is defined as a searching problem for the dependency pattern D that maximizes the conditional probability P(DIB ) of the in- null put sequence under the above-mentioned constraints. null Dbest = argmax P(D\[B) D If we assume that the dependency probabilities are mutually independent, P(DIB ) could be rewritten as:</Paragraph>
      <Paragraph position="6"> that bi depends on (modifies) b t. fit is an n dimensional feature vector that represents various kinds of linguistic features related with the chunks bi and b t.</Paragraph>
      <Paragraph position="7"> We obtain Dbest taking into all the combination of these probabilities. Generally, the optimal solution Dbest Can be identified by using bottom-up algorithm such as CYK algorithm.</Paragraph>
      <Paragraph position="8"> Sekine suggests an efficient parsing technique for Japanese sentences that parses from the end of a sentence(Sekine et al., 2000). We apply Sekine's technique in our experiments.</Paragraph>
      <Paragraph position="9"> ..?</Paragraph>
    </Section>
    <Section position="2" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
3.2 Training with SVMs
</SectionTitle>
      <Paragraph position="0"> In order to use SVMs for dependency analysis, we need to prepare positive and negative examples since SVMs is a binary classifier. We adopt a simple and effective method for our purpose: Out of all combination of two chunks in the training data, we take a pair of chunks that axe in a dependency relation as a positive example, and two chunks that appear in a sentence but are not in a dependency relation as a negative example.</Paragraph>
      <Paragraph position="2"> Then, we define the dependency probability</Paragraph>
      <Paragraph position="4"> (11) shows that the distance between test data fr O and the separating hyperplane is put into the sigmoid function, assuming it represents the probability value of the dependency relation. null We adopt this method in our experiment to transform the distance measure obtained in SVMs into a probability function and analyze dependency structure with a fframework of conventional probability model 2</Paragraph>
    </Section>
    <Section position="3" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
3.3 Static and Dynamic Features
</SectionTitle>
      <Paragraph position="0"> Features that are supposed to be effective in Japanese dependency analysis are: head words and their parts-of-speech, particles and inflection forms of the words that appear at the end of chunks, distance between two chunks, existence of punctuation marks. As those are solely defined by the pair of chunks, we refer to them as static features.</Paragraph>
      <Paragraph position="1"> Japanese dependency relations are heavily constrained by such static features since the inflection forms and postpositional particles constrain the dependency relation. However, when a sentence is long and there are more than one possible dependents, static features, by themselves cannot determine the correct dependency. Let us look at the following example. null watashi-ha kono-hon-wo motteim josei-wo sagasiteiru I-top, this book-acc, have, lady-acc, be looking for In this example, &amp;quot;kono-hon-wo(this bookacc)&amp;quot; may modify either of &amp;quot;motteiru(have)&amp;quot; or &amp;quot;sagasiteiru(be looking for)&amp;quot; and cannot be determined only with the static features.</Paragraph>
      <Paragraph position="2"> However, &amp;quot;josei-wo (lady-acc)&amp;quot; can modify the only the verb &amp;quot;sagasiteiru,&amp;quot;. Knowing such information is quite useful for resolving syntactic ambiguity, since two accusative noun phrses hardly modify the same verb. It is possible to use such information if we add new features related to other modifiers. In the above case, the chunk &amp;quot;sagasiteiru&amp;quot; can receive a new feature of accusative modification (by &amp;quot;josei-wo&amp;quot;) during the parsing process, which precludes the chunk &amp;quot;kono-honwo&amp;quot; from modifying &amp;quot;sagasiteiru&amp;quot; since there is a strict constraint about double-accusative 2Experimentally, it is shown that tlie sigmoid function gives a good approximation of probability function from the decision function of SVMs(Platt, 1999). 21 , modification that will be learned from training examples. We decided to take into consideration all such modification information by using functional words or inflection forms of modifiers.</Paragraph>
      <Paragraph position="3"> Using such information about modifiers in the training phase has no difficulty since they are clearly available in a tree-bank. On the other hand, they are not known in the parsing phase of the test data. This problem can be easily solved if we adopt a bottom-up parsing algorithm and attach the modification information dynamically to the newly constructed phrases (the chlmks that become the head of the phrases). As we describe later we apply a beam search for parsing, and it is possible to keep several intermediate solutions while suppressing the combinatorial explosion.</Paragraph>
      <Paragraph position="4"> We refer to the features that are added incrementally during the parsing process as dynamic features.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>