<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1308"> <Title>Bio-Medical Entity Extraction using Support Vector Machines</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Basic model </SectionTitle> <Paragraph position="0"> The named entity task can be formulated as a type of classification task. In the supervised machine learning approach which we adopt here we aim to estimate a classification function f,</Paragraph> <Paragraph position="1"> $f : \mathbb{R}^N \rightarrow \{\pm 1\}$ (1) </Paragraph> <Paragraph position="2"> so that the error on unseen examples is minimized, using training examples that are N-dimensional vectors $x_i$ with class labels $y_i$. The sample set S with m training examples is</Paragraph> <Paragraph position="3"> $S = ((x_1, y_1), \ldots, (x_m, y_m)) \subseteq (\mathbb{R}^N \times \{\pm 1\})^m$ (2) </Paragraph> <Paragraph position="4"> The classification function returns either $+1$ if the test data is a member of the class, or $-1$ if it is not. SVMs use linear models to discriminate between two classes. This raises the question of how they can be used to capture non-linear classification functions. The answer is to use a non-linear mapping function called a kernel,</Paragraph> <Paragraph position="5"> $\Phi : \mathbb{R}^N \rightarrow G$ (3) </Paragraph> <Paragraph position="6"> which maps the input space $\mathbb{R}^N$ into a feature space G. The kernel function k requires the evaluation of a dot product</Paragraph> <Paragraph position="7"> $k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ (4) </Paragraph> <Paragraph position="8"> Clearly the complexity of the data being classified determines which particular kernel should be used, and of course more complex kernels require longer training times.</Paragraph> <Paragraph position="9"> By substituting $\Phi(x_i)$ for each training example in S we derive the final form of the optimal decision function f,</Paragraph> <Paragraph position="10"> $f(x) = \mathrm{sgn}\left(\sum_{i=1}^{m} \alpha_i y_i \, k(x, x_i) + b\right)$ (5) </Paragraph> <Paragraph position="11"> where $b \in \mathbb{R}$ is the bias and the Lagrange parameters $\alpha_i$ ($\alpha_i \geq 0$) are estimated using quadratic optimization to maximize the following function</Paragraph> <Paragraph position="12"> $W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$, subject to $\sum_{i=1}^{m} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$ (6) </Paragraph> <Paragraph position="13"> for $i = 1, \ldots, m$. C is a constant that controls the trade-off between the complexity of the function and the number of misclassified training examples.</Paragraph> <Paragraph position="14"> The number of parameters to be estimated in $\alpha$ therefore never exceeds the number of examples.</Paragraph> <Paragraph position="15"> In effect, training examples with $\alpha_i > 0$ define the decision function (the support vectors), while examples with $\alpha_i = 0$ have no influence, making the final model very compact and testing (but not training) very fast.</Paragraph> <Paragraph position="16"> The point x is classified as positive (or negative) if $f(x) > 0$ (or $f(x) < 0$).</Paragraph> <Paragraph position="17"> The kernel function we explored in our experiments was the polynomial function $k(x_i, x_j) = (x_i \cdot x_j + 1)^d$ for $d = 2$, which was found to perform best by Takeuchi and Collier (2002). Once input vectors have been mapped to the feature space, the linear discriminant function that is found is the one that maximizes the geometric margin between the two classes in the feature space.</Paragraph> <Paragraph position="18"> Besides efficiency of representation, SVMs are known to maximize their generalizability, making them an ideal model for the NE+ task. Generalizability in SVMs is based on statistical learning theory and the observation that it is sometimes useful to misclassify some of the training data so that the margin between the remaining training points is maximized.</Paragraph> <Paragraph position="19"> This is particularly useful for real-world data sets that often contain inseparable data points.</Paragraph> <Paragraph position="20"> We implemented our method using the TinySVM package from NAIST, which is an implementation of Vladimir Vapnik's SVM combined with an optimization algorithm (Joachims, 1999). The multi-class model is built up by combining binary classifiers and then applying majority voting.</Paragraph>
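<Paragraph position="21"> As a minimal illustration of the basic model above, the following sketch trains a binary SVM with the degree-2 polynomial kernel $k(x_i, x_j) = (x_i \cdot x_j + 1)^2$ on toy feature vectors. It uses the scikit-learn SVC implementation and made-up data rather than the TinySVM setup described above, so it should be read as a schematic example only.

import numpy as np
from sklearn.svm import SVC

# Toy training data: N-dimensional feature vectors x_i with labels y_i in {+1, -1}.
X_train = np.array([[1, 0, 1, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 1, 0, 1]])
y_train = np.array([+1, +1, -1, -1])

# Degree-2 polynomial kernel: with gamma=1 and coef0=1, SVC's poly kernel
# computes (x_i . x_j + 1)^2. C trades off function complexity against the
# number of misclassified training examples.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0)
clf.fit(X_train, y_train)

# Only training examples with alpha_i greater than 0 (the support vectors)
# define the decision function f.
print(clf.support_)                      # indices of the support vectors
print(clf.decision_function(X_train))    # f(x); the sign gives the class
print(clf.predict(np.array([[1, 0, 0, 1]])))
</Paragraph>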
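<Paragraph position="22"> The multi-class combination can be sketched in the same way. One plausible reading of "combining binary classifiers and then applying majority voting" is a pairwise (one-vs-one) decomposition; the decomposition, the class OTHER and the feature values below are assumptions made for illustration only, while PROTEIN and DNA are classes used later in this paper.

from itertools import combinations
from collections import Counter
import numpy as np
from sklearn.svm import SVC

def train_pairwise(X, y):
    """Train one binary SVM for every pair of named-entity classes."""
    models = {}
    for a, b in combinations(sorted(set(y)), 2):
        mask = np.isin(y, [a, b])
        clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0)
        clf.fit(X[mask], y[mask])
        models[(a, b)] = clf
    return models

def classify(models, x):
    """Each pairwise classifier casts one vote; the majority class wins (ties broken arbitrarily)."""
    votes = Counter(clf.predict(x.reshape(1, -1))[0] for clf in models.values())
    return votes.most_common(1)[0][0]

# Toy feature vectors for three classes (feature values are made up).
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [2, 2], [2, 1]])
y = np.array(["PROTEIN", "PROTEIN", "DNA", "DNA", "OTHER", "OTHER"])
models = train_pairwise(X, y)
print(classify(models, np.array([1, 1])))
</Paragraph>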
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Generalising with features </SectionTitle> <Paragraph position="0"> In order for the model to be successful it must recognize regularities in the training data that relate pre-classified examples of terms to unseen terms that will be encountered in testing.</Paragraph> <Paragraph position="1"> Following on from previous studies in named entity recognition, we chose a set of linguistically motivated word-level features that include surface word forms, part-of-speech tags assigned by the Brill tagger (Brill, 1992) and orthographic features. Additionally we used head-noun features that were obtained from pre-analysis of the training data set using the FDG shallow parser from Conexor (Tapanainen and Järvinen, 1997). A significant proportion of the terms in our corpus undergo local syntactic transformations such as coordination, which introduces ambiguity that needs to be resolved by shallow parsing.</Paragraph> <Paragraph position="2"> Examples include the c- and v-rel (proto) oncogenes and the NF-kappaB and I kappa B protein families. In these cases the head-noun features oncogene and family would be added to each word in the constituent phrase. Head information is also needed when deciding the semantic category of a long term such as tumor necrosis factor-alpha, which should be a PROTEIN, whereas tumor necrosis factor (TNF) gene and tumor necrosis factor promoter region should both be types of DNA.</Paragraph> <Paragraph position="3"> Table 2 shows the orthographic features that we used. We hypothesize that such features will help the model to find similarities between known words that were found in the training set and unknown words (of zero frequency in the training set) and so overcome the unknown word problem.</Paragraph> <Paragraph position="4"> In the experiments we report below we use feature vectors consisting of differing amounts of 'context', obtained by varying the window around the focus word which is to be classified into one of the semantic classes.</Paragraph> <Paragraph position="5"> The full window of context considered in these experiments is $\pm 3$ about the focus word.</Paragraph> </Section> </Section> </Paper>