<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1055">
  <Title>Classifying Semantic Relations in Bioscience Texts</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Models and Results
</SectionTitle>
    <Paragraph position="0"> This section describes the models and their performance on both entity extraction and relation classification. Generative models learn the prior probability of the class and the probability of the features given the class; they are the natural choice in cases with hidden variables (partially observed or missing data). Since labeled data is expensive to collect, these models may be useful when no labels are available. However, in this paper we test the generative models on fully observed data and show that, although not as accurate as the discriminative model, their performance is promising enough to encourage their use for the case of partially observed data.</Paragraph>
    <Paragraph position="1"> Discriminative models learn the probability of the class given the features. When we have fully observed data and we just need to learn the mapping from features to classes (classification), a discriminative approach may be more appropriate, as shown in Ng and Jordan (2002), but has other shortcomings as discussed below.</Paragraph>
    <Paragraph position="2"> For the evaluation of the role extraction task, we calculate the usual metrics of precision, recall and F-measure. Precision is a measure of how many of the roles extracted by the system are correct and recall is the measure of how many of the true roles were extracted by the system. The F-measure is a weighted combination of precision and recall4.</Paragraph>
    <Paragraph position="3"> Our role evaluation is very strict: every token is assessed and we do not assign partial credit for constituents for which only some of the words are correctly labeled. We report results for two cases: (i) considering only the relevant sentences and (ii) including also irrelevant sentences. For the relation classification task, we report results in terms of classification accuracy, choosing one out of seven choices for (i) and one out of eight choices for (ii).</Paragraph>
    <Paragraph position="4"> (Most papers report the results for only the relevant sentences, while some papers assign credit to their algorithms if their system extracts only one instance of a given relation from the collection. By contrast, in our experiments we expect the system to extract all instances of every relation type.) For both tasks, 75% of the data were used for training and the rest for testing.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Generative Models
</SectionTitle>
      <Paragraph position="0"> In Figure 1 we show two static and three dynamic models. The nodes labeled &amp;quot;Role&amp;quot; represent the entities (in this case the choices are DISEASE, TREATMENT and NULL) and the node labeled &amp;quot;Relation&amp;quot; represents the relationship present in the sentence. We assume here that there is a single relation for each sentence between the entities5.</Paragraph>
      <Paragraph position="1"> The children of the role nodes are the words and their features, thus there are as many role states as there are words in the sentence; for the static models, this is depicted by the box (or &amp;quot;plate&amp;quot;) which is the standard graphical model notation for replication. For each state, the features a0a2a1 are those mentioned in Section 3.</Paragraph>
      <Paragraph position="2"> The simpler static models S1 and S2 do not assume an ordering in the role sequence. The dynamic models were inspired by prior work on HMM-like graphical models for role extraction (Bikel et al., 1999; Freitag and McCallum, 2000; Ray and Craven, 2001). These models consist of a  lationship, often with multiple entities or the same entities taking part in several interconnected relationships; we did not include these in the study.</Paragraph>
      <Paragraph position="3">  Markov sequence of states (usually corresponding to semantic roles) where each state generates one or multiple observations. Model D1 in Figure 1 is typical of these models, but we have augmented it with the Relation node.</Paragraph>
      <Paragraph position="4"> The task is to recover the sequence of Role states, given the observed features. These models assume that there is an ordering in the semantic roles that can be captured with the Markov assumption and that the role generates the observations (the words, for example). All our models make the additional assumption that there is a relation that generates the role sequence; thus, these  role extraction.</Paragraph>
      <Paragraph position="5"> models have the appealing property that they can simultaneously perform role extraction and relationship recognition, given the sequence of observations. In S1 and D1 the observations are independent from the relation (given the roles). In S2 and D2, the observations are dependent on both the relation and the role (or in other words, the relation generates not only the sequence of roles but also the observations). D2 encodes the fact that even when the roles are given, the observations depend on the relation. For example, sentences containing the word prevent are more likely to represent a &amp;quot;prevent&amp;quot; kind of relationship. Finally, in D3 only one observation per state is dependent on both the relation and the role, the motivation being that some observations (such as the words) depend on the relation while others might not (like for example, the parts of speech). In the experiments reported here, the observations which have edges from both the role and the relation nodes are the words. (We ran an experiment in which this observation node was the MeSH term, obtaining similar results.) Model D1 defines the following joint probability distribution over relations, roles, words and word features, assuming the leftmost Role node is  Model D1 is similar to the model in Thompson et al. (2003) for the extraction of roles, using a different domain. Structurally, the differences are (i) Thompson et al. (2003) has only one observation node per role and (ii) it has an additional node &amp;quot;on top&amp;quot;, with an edge to the relation node, to represent a predicator &amp;quot;trigger word&amp;quot; which is always observed; the predicator words are taken from a fixed list and one must be present in order for a sentence to be analyzed.</Paragraph>
      <Paragraph position="6"> The joint probability distributions for D2 and D3 are similar to Equation (1) where we substitute the term a73a74a60a75a77a76a79a78a25a80 a0 a74a72a81a83a82a28a84a29a32a31a85a33 a81a57a86 with a73a74a60a75a77a76 a78a25a80 a0 a74a72a81a83a82a28a84a29a87a31a34a33 a81a57a88 a28a84a33a89a31 a86 for D2 and</Paragraph>
      <Paragraph position="8"> The parameters a78a25a80 a0 a74a72a81 a82a28a30a29a32a31a34a33 a81 a86 and a78a25a80 a0 a74 a35 a82a28a30a29a32a31a34a33 a35 a86 of Equation (1) are constrained to be equal.</Paragraph>
      <Paragraph position="9"> The parameters were estimated using maximum likelihood on the training set; we also implemented a simple absolute discounting smoothing method (Zhai and Lafferty, 2001) that improves the results for both tasks.</Paragraph>
      <Paragraph position="10"> Table 2 shows the results (F-measures) for the problem of finding the most likely sequence of roles given the features observed. In this case, the relation is hidden and we marginalize over it6. We experimented with different values for the smoothing factor ranging from a minimum of 0.0000005 to a maximum of 10; the results shown fix the smoothing factor at its minimum value. We found that for the dynamic models, for a wide range of smoothing factors, we achieved almost identical results; nevertheless, in future work, we plan to implement cross-validation to find the optimal smoothing factor. By contrast, the static models were more sensitive to the value of the smoothing factor.</Paragraph>
      <Paragraph position="11"> Using maximum likelihood with no smoothing, model D1 performs better than D2 and D3. This was expected, since the parameters for models D2 and D3 are more sparse than D1. However, when smoothing is applied, the three dynamic models achieve similar results. Although the additional edges in models D2 and D3 did not help much for the task of role extraction, they did help for relation classification, discussed next. Model D2 6To perform inference for the dynamic model, we used the junction tree algorithm. We used Kevin Murphy's BNT package, found at http://www.ai.mit.edu/ murphyk/Bayes/bnintro.html. null achieves the best F-measures: 0.73 for &amp;quot;only relevant&amp;quot; and 0.71 for &amp;quot;rel. + irrel.&amp;quot;. It is difficult to compare results with the related work since the data, the semantic roles and the evaluation are different; in Ray and Craven (2001) however, the role extraction task is quite similar to ours and the text is also from MEDLINE. They report approximately an F-measure of 32% for the extraction of the entities PROTEINS and LOCA-TIONS, and an F-measure of 50% for GENE and DISORDER.</Paragraph>
      <Paragraph position="12"> The second target task is to find the most likely relation, i.e., to classify a sentence into one of the possible relations. Two types of experiments were conducted. In the first, the true roles are hidden and we classify the relations given only the observable features, marginalizing over the hidden roles. In the second, the roles are given and only the relations need to be inferred. Table 3 reports the results for both conditions, both with absolute discounting smoothing and without.</Paragraph>
      <Paragraph position="13"> Again model D1 outperforms the other dynamic models when no smoothing is applied; with smoothing and when the true roles are hidden, D2 achieves the best classification accuracies. When the roles are given D1 is the best model; D1 does well in the cases when both roles are not present.</Paragraph>
      <Paragraph position="14"> By contrast, D2 does better than D1 when the presence of specific words strongly determines the outcome (e.g., the presence &amp;quot;prevention&amp;quot; or &amp;quot;prevent&amp;quot; helps identify the Prevent relation).</Paragraph>
      <Paragraph position="15"> The percentage improvements of D2 and D3 versus D1 are, respectively, 10% and 6.5% for relation classification and 1.4% for role extraction (in the &amp;quot;only relevant&amp;quot;, &amp;quot;only features&amp;quot; case). This suggests that there is a dependency between the observations and the relation that is captured by the additional edges in D2 and D3, but that this dependency is more helpful in relation classification than in role extraction.</Paragraph>
      <Paragraph position="16"> For relation classification the static models perform worse than for role extraction; the decreases in performance from D1 to S1 and from D2 to S2 are, respectively (in the &amp;quot;only relevant&amp;quot;, &amp;quot;only features&amp;quot; case), 7.4% and 7.3% for role extraction and 27.1% and 44% for relation classification. This suggests the importance of modeling the sequence of roles for relation classification.</Paragraph>
      <Paragraph position="17"> To provide an idea of where the errors occur, Table 4 shows the confusion matrix for model D2 for the most realistic and difficult case of &amp;quot;rel + irrel.&amp;quot;, &amp;quot;only features&amp;quot;. This indicates that the algorithm performs poorly primarily for the cases for which there is little training data, with the exception of the ONLY DISEASE case, which is often mistaken for CURE.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Neural Network
</SectionTitle>
      <Paragraph position="0"> To compare the results of the generative models of the previous section with a discriminative method, we use a neural network, using the Matlab package to train a feed-forward network with conjugate gradient descent.</Paragraph>
      <Paragraph position="1"> The features are the same as those used for the models in Section 4.1, but are represented with indicator variables. That is, for each feature we calculated the number of possible values a91 and then represented an observation of the feature as a sequence of a91 binary values in which one value is set to a92 and the remaining a91a94a93a95a92 values are set to a96 . The input layer of the NN is the concatenation of this representation for all features. The network has one hidden layer, with a hyperbolic tangent function. The output layer uses a logistic sigmoid function. The number of units of the output layer is fixed to be the number of relations (seven or eight) for the relation classification task and the number of roles (three) for the role extraction task. The network was trained for several choices of numbers of hidden units; we chose the best-performing networks based on training set error.</Paragraph>
      <Paragraph position="2"> We then tested these networks on held-out testing data.</Paragraph>
      <Paragraph position="3"> The results for the neural network are reported in Table 3 in the column labeled NN. These results are quite strong, achieving 79.6% accuracy in the relation classification task when the entities are hidden and 96.9% when the entities are given, outperforming the graphical models. Two possible reasons for this are: as already mentioned, the discriminative approach may be the most appropriate for fully labeled data; or the graphical models we proposed may not be the right ones, i.e., the independence assumptions they make may misrepresent underlying dependencies.</Paragraph>
      <Paragraph position="4"> It must be pointed out that the neural network  (NN). For absolute discounting, the smoothing factor was fixed at the minimum value. B is the baseline of always choosing the most frequent relation. The best results are indicated in boldface. is much slower than the graphical models, and requires a great deal of memory; we were not able to run the neural network package on our machines for the role extraction task, when the feature vectors are very large. The graphical models can perform both tasks simultaneously; the percentage decrease in relation classification of model D2 with respect to the NN is of 8.9% for &amp;quot;only relevant&amp;quot; and 5.8% for &amp;quot;relevant + irrelevant&amp;quot;.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Features
</SectionTitle>
      <Paragraph position="0"> In order to analyze the relative importance of the different features, we performed both tasks using the dynamic model D1 of Figure 1, leaving out single features and sets of features (grouping all of the features related to the MeSH hierarchy, meaning both the classification of words into MeSH IDs and the domain knowledge as defined in Section 3). The results reported here were found with maximum likelihood (no smoothing) and are for the &amp;quot;relevant only&amp;quot; case; results for &amp;quot;relevant + irrelevant&amp;quot; were similar.</Paragraph>
      <Paragraph position="1"> For the role extraction task, the most important feature was the word: not using it, the GM achieved only 0.65 F-measure (a decrease of 9.7% from 0.72 F-measure using all the features).</Paragraph>
      <Paragraph position="2"> Leaving out the features related to MeSH the F-measure obtained was 0.69% (a 4.1% decrease) and the next most important feature was the part-of-speech (0.70 F-measure not using this feature). For all the other features, the F-measure ranged between 0.71 and 0.73.</Paragraph>
      <Paragraph position="3"> For the task of relation classification, the MeSH-based features seem to be the most important. Leaving out the word again lead to the biggest decrease in the classification accuracy for a single feature but not so dramatically as in the role extraction task (62.2% accuracy, for a decrease of 4% from the original value), but leaving out all the MeSH features caused the accuracy to decrease the most (a decrease of 13.2% for 56.2% accuracy). For both tasks, the impact of the domain knowledge alone was negligible.</Paragraph>
      <Paragraph position="4"> As described in Section 3, words can be mapped to different levels of the MeSH hierarchy. Currently, we use the &amp;quot;second&amp;quot; level, so that, for example, surgery is mapped to G02.403 (when the whole MeSH ID is G02.403.810.762). This is somewhat arbitrary (and mainly chosen with the sparsity issue in mind), but in light of the importance of the MeSH features it may be worthwhile investigating the issue of finding the optimal level of description. (This can be seen as another form of smoothing.)</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>