<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0511"> <Title>Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy</Title> <Section position="6" start_page="3" end_page="4" type="evalu"> <SectionTitle> 5 Method and Results </SectionTitle>
<Paragraph position="0"> Because we have defined noun compound relation determination as a classification problem, we can make use of standard classification algorithms. In particular, we used neural networks to classify across all relations simultaneously.</Paragraph>
<Paragraph position="1"> In some cases a word maps to more than one CUI; for the work reported here we arbitrarily chose the first mapping in all cases. In future work we will explore how to make use of all of the mapped terms.</Paragraph>
<Paragraph position="2"> [Table 1 caption:] Relations shown in boldface are those used in the experiments reported on here. Relation ID numbers are shown in parentheses by the relation names. The second column shows the number of labeled examples for each class; the last row shows a class consisting of compounds that exhibit more than one relation. The notation (1-2) and (2-1) indicates the directionality of the relations. For example, Cause (1-2) indicates that the first noun causes the second, and Cause (2-1) indicates the converse.</Paragraph>
<Paragraph position="4"> We ran the experiments creating models that used different levels of the MeSH hierarchy. For example, for the NC flu vaccination, flu maps to the MeSH term D4.808.54.79.429.154.349 and vaccination to G3.770.670.310.890. Flu vaccination for Model 4 would be represented by a vector consisting of the concatenation of the two descriptors showing only the first four levels: D4.808.54.79 G3.770.670.310 (see Table 2). When a word maps to a general MeSH term (like treatment, Y11), zeros are appended to the end of the descriptor to stand in place of the missing values (so, for example, treatment in Model 3 is Y 11 0, and in Model 4 is Y 11 0 0, etc.).</Paragraph>
<Paragraph position="5"> The numbers in the MeSH descriptors are categorical values; we represented them with indicator variables. That is, for each variable we calculated the number of possible categories c and then represented an observation of the variable as a sequence of c binary variables in which one binary variable was one and the remaining c - 1 binary variables were zero.</Paragraph>
<Paragraph position="6"> We also used a representation in which the words themselves were used as categorical input variables (we call this representation &quot;lexical&quot;). For this collection of NCs there were 1184 unique nouns and therefore the feature vector for each noun had 1184 components. In Table 3 we report the length of the feature vectors for one noun for each model. The entire NC was described by concatenating the feature vectors for the two nouns in sequence.</Paragraph>
<Paragraph position="7"> The NCs represented in this fashion were used as input to a neural network. We used a feed-forward network trained with conjugate gradient descent.</Paragraph>
<Paragraph position="8"> [Table 4 caption:] The model number corresponds to the level of the MeSH hierarchy used for classification. Lexical: NN is Neural Network on Lexical, and Lexical: Log Reg is Logistic Regression on Lexical. Acc1 refers to how often the correct relation is the top-scoring relation, Acc2 refers to how often the correct relation is one of the top two according to the neural net, and so on. Guessing would yield a result of 0.077.</Paragraph>
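To make the feature construction above concrete, here is a minimal Python sketch of the descriptor truncation, zero-padding, and indicator (one-hot) encoding. The flu, vaccination, and treatment descriptors are quoted from the text; the helper names and the way the category inventory is collected are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Model-k representation described above.
# Descriptors for flu, vaccination, and treatment come from the text;
# everything else (names, category collection) is an illustrative assumption.

def truncate(descriptor: str, level: int) -> list[str]:
    """Keep the first `level` components of a MeSH descriptor, appending
    zeros when the term is more general than `level` (cf. treatment, Y11)."""
    parts = descriptor.split(".")[:level]
    return parts + ["0"] * (level - len(parts))

def one_hot(value: str, categories: list[str]) -> list[int]:
    """Indicator encoding: c binary variables, exactly one set to 1,
    the remaining c - 1 set to 0."""
    return [1 if value == cat else 0 for cat in categories]

def encode_nc(desc1: str, desc2: str, level: int,
              cats_per_pos: list[list[str]]) -> list[int]:
    """Concatenate the indicator vectors of both nouns of a noun compound."""
    vec: list[int] = []
    for desc in (desc1, desc2):
        for pos, value in enumerate(truncate(desc, level)):
            vec += one_hot(value, cats_per_pos[pos])
    return vec

# Example from the text: "flu vaccination" under Model 4.
flu = "D4.808.54.79.429.154.349"
vaccination = "G3.770.670.310.890"
print(truncate(flu, 4))          # ['D4', '808', '54', '79']
print(truncate(vaccination, 4))  # ['G3', '770', '670', '310']
print(truncate("Y11", 4))        # ['Y11', '0', '0', '0'] (the paper writes Y 11 0 0)

# Category inventories would normally be collected over the whole corpus;
# here they are built from just the two example descriptors.
level = 4
cats_per_pos = [sorted({truncate(d, level)[p] for d in (flu, vaccination)} | {"0"})
                for p in range(level)]
print(len(encode_nc(flu, vaccination, level, cats_per_pos)))  # 24
```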
<Paragraph position="9"> The network had one hidden layer, in which a hyperbolic tangent function was used, and an output layer representing the 18 relations. A logistic sigmoid function was used in the output layer to map the outputs into the interval (0,1).</Paragraph>
<Paragraph position="10"> The number of units of the output layer was the number of relations (18) and therefore fixed. The network was trained for several choices of numbers of hidden units; we chose the best-performing networks based on training set error for each of the models.</Paragraph>
<Paragraph position="11"> We subsequently tested these networks on held-out testing data.</Paragraph>
<Paragraph position="12"> We compared the results with a baseline in which logistic regression was used on the lexical features. Given the indicator variable representation of these features, this logistic regression essentially forms a table of log-odds for each lexical item. We also compared to a method in which the lexical indicator variables were used as input to a neural network. This approach is of interest to see to what extent, if any, the MeSH-based features affect performance. Note also that this lexical neural-network approach is feasible in this setting because the number of unique words is limited (1184); such an approach would not scale to larger problems.</Paragraph>
<Paragraph position="13"> In Table 4 and in Figure 1 we report the results from these experiments. The neural network using lexical features only yields 62% accuracy on average across all 18 relations. A neural net trained on Model 6 using the MeSH terms to represent the nouns yields an accuracy of 61% on average across all 18 relations. Note that reasonable performance is also obtained for Model 2, which is a much more general representation. Table 4 shows that both methods achieve up to 78% accuracy at including the correct relation among the top three hypothesized.</Paragraph>
<Paragraph position="14"> Multi-class classification is a difficult problem (Vapnik, 1998). In this problem, a baseline in which the algorithm guesses yields about 5% accuracy. [Figure 1 caption:] The dotted line at the bottom is the accuracy of guessing (the inverse of the number of classes). The dash-dot line above this is the accuracy of logistic regression on the lexical data. The solid line with asterisks is the accuracy of our representation, when only the maximum output value from the network is considered. The solid line with circles is the accuracy of getting the right answer within the two largest output values from the neural network, and the last solid line with diamonds is the accuracy of getting the right answer within the first three outputs from the network. The three flat dashed lines are the corresponding performances of the neural network on lexical inputs.</Paragraph>
<Paragraph position="15"> We see that our method is a significant improvement over the tabular logistic-regression-based approach, which yields an accuracy of only 31 percent. Additionally, despite the significant reduction in raw information content as compared to the lexical representation, the MeSH-based neural network performs as well as the lexical-based neural network. (And we again stress that the lexical-based neural network is not a viable option for larger domains.) Figure 2 shows the results for each relation. MeSH-based generalization does better on some relations (for example 14 and 15) and Lexical on others (7, 22).</Paragraph>
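As a rough illustration of the classifier just described, the sketch below implements a one-hidden-layer feed-forward network with a hyperbolic tangent hidden layer and 18 logistic sigmoid outputs, trained by conjugate gradient (here scipy's generic CG optimizer standing in for the authors' training procedure), together with an Acc1/Acc2/Acc3-style top-k evaluation helper. The toy data, hidden-layer size, and cross-entropy loss are assumptions made for the sake of a runnable example.

```python
# Sketch of the network described above: one tanh hidden layer, 18 sigmoid
# outputs, conjugate-gradient training. Data and sizes are illustrative.
import numpy as np
from scipy.optimize import minimize

N_RELATIONS = 18  # fixed by the number of relations

def unpack(theta, n_in, n_hid):
    """Recover the two weight matrices from the flat parameter vector."""
    split = n_in * n_hid
    W1 = theta[:split].reshape(n_in, n_hid)
    W2 = theta[split:split + n_hid * N_RELATIONS].reshape(n_hid, N_RELATIONS)
    return W1, W2

def forward(theta, X, n_hid):
    W1, W2 = unpack(theta, X.shape[1], n_hid)
    H = np.tanh(X @ W1)                     # hidden layer: hyperbolic tangent
    return 1.0 / (1.0 + np.exp(-(H @ W2)))  # sigmoid maps outputs into (0, 1)

def loss(theta, X, Y, n_hid):
    """Per-output cross-entropy (an assumption; the paper does not state it)."""
    P = forward(theta, X, n_hid)
    eps = 1e-9
    return -np.mean(Y * np.log(P + eps) + (1 - Y) * np.log(1 - P + eps))

def top_k_accuracy(P, labels, k):
    """Acc-k: how often the correct relation is among the k largest outputs."""
    topk = np.argsort(P, axis=1)[:, -k:]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

# Toy usage with random vectors standing in for the encoded noun compounds.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
labels = rng.integers(0, N_RELATIONS, size=60)
Y = np.eye(N_RELATIONS)[labels]
n_hid = 10
theta0 = rng.normal(scale=0.1, size=20 * n_hid + n_hid * N_RELATIONS)
res = minimize(loss, theta0, args=(X, Y, n_hid), method="CG",
               options={"maxiter": 30})
P = forward(res.x, X, n_hid)
print(top_k_accuracy(P, labels, 1), top_k_accuracy(P, labels, 3))
```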
<Paragraph position="16"> It turns out that the test set for relationship 7 (&quot;Produces on a genetic level&quot;) is dominated by NCs containing the words alleles and mrna, and that all the NCs in the training set containing these words are assigned relation label 7. A similar situation is seen for relation 22, &quot;Time (2-1)&quot;. In the test set examples the second noun is either recurrence, season or time. In the training set, these nouns appear only in NCs that have been labeled as belonging to relation 22.</Paragraph>
<Paragraph position="17"> [Figure 2 caption:] The numbers at the bottom refer to the class numbers in Table 1. Note the very high accuracy for the &quot;mixed&quot; relationship 20-27 (last bar on the right).</Paragraph>
<Paragraph position="18"> On the other hand, if we look at relations 14 and 15, we find a wider range of words, and in some cases the words in the test set are not present in the training set. In relationship 14 (&quot;Purpose&quot;), for example, vaccine appears 6 times in the test set (e.g., varicella vaccine). In the training set, NCs containing vaccine have also been classified as &quot;Instrument&quot; (antigen vaccine, polysaccharide vaccine), as &quot;Object&quot; (vaccine development), as &quot;Subtype of&quot; (opv vaccine) and as &quot;Wrong&quot; (vaccines using). Other words in the test set for 14 are varicella, which is present in the training set only in varicella serology, labeled as &quot;Attribute of clinical study&quot;, and drainage, which is in the training set only as &quot;Location&quot; (gallbladder drainage and tract drainage) and &quot;Activity&quot; (bile drainage). Other test set words, such as immunisation and carcinogen, do not appear in the training set at all.</Paragraph>
<Paragraph position="19"> In other words, it seems that the MeSH-based categorization does better when generalization is required. Additionally, this data set is &quot;dense&quot; in the sense that very few testing words are not present in the training data. This is of course an unrealistic situation, and we wanted to test the robustness of the method in a more realistic setting. The results reported in Table 4 and in Figure 1 were obtained by splitting the data into 50% training and 50% testing for each relation, giving a total of 855 training points and 805 test points. Of these, only 75 examples in the testing set consisted of NCs in which neither word was present in the training set.</Paragraph>
<Paragraph position="20"> We decided to test the robustness of the MeSH-based model versus the lexical model in the case of unseen words; we are also interested in seeing the relative importance of the first versus the second noun. Therefore, we split the data into 5% training (73 data points) and 95% testing (1587 data points) and partitioned the test set into 4 subsets as follows (the numbers in parentheses are the numbers of points for each case; a small sketch of this partition appears below): * Case 1: NCs for which the first noun was not present in the training set (424) * Case 2: NCs for which the second noun was not present in the training set (252) * Case 3: NCs for which both nouns were present in the training set (101) * Case 4: NCs for which neither noun was present in the training set (810).</Paragraph>
<Paragraph position="22"> Table 5 and Figures 3 and 4 present the accuracies for these test set partitions. Figure 3 shows that the MeSH-based models are more robust than the lexical one when the number of unseen words is high and when the size of the training set is (very) small. In this more realistic situation, the MeSH models are able to generalize over previously unseen words. For unseen words, the lexical representation reduces to guessing.</Paragraph>
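The four-way split above is easy to state precisely. Below is a small Python sketch of the partition, under the assumption that &quot;present in the training set&quot; means the noun occurs anywhere in a training NC (the text does not say whether matching is position-specific); the data and function names are illustrative.

```python
# Sketch of the four test-set cases described above. NCs are (noun1, noun2)
# pairs; "seen" means the noun occurs anywhere in the training NCs (an
# assumption; the text does not specify position-specific matching).

def partition(train_ncs, test_ncs):
    vocab = {noun for nc in train_ncs for noun in nc}
    cases = {1: [], 2: [], 3: [], 4: []}
    for n1, n2 in test_ncs:
        seen1, seen2 = n1 in vocab, n2 in vocab
        if seen1 and seen2:
            cases[3].append((n1, n2))   # Case 3: both nouns seen
        elif not seen1 and not seen2:
            cases[4].append((n1, n2))   # Case 4: neither noun seen
        elif not seen1:
            cases[1].append((n1, n2))   # Case 1: first noun unseen
        else:
            cases[2].append((n1, n2))   # Case 2: second noun unseen
    return cases

# Toy example (hypothetical NCs):
train = [("flu", "vaccination"), ("vaccine", "development")]
test = [("flu", "season"), ("carcinogen", "exposure"), ("flu", "vaccination")]
print({case: len(ncs) for case, ncs in partition(train, test).items()})
# {1: 0, 2: 1, 3: 1, 4: 1}
```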
<Paragraph position="23"> Figure 4 shows the accuracy of the MeSH-based model for the four cases of Table 5. It is interesting to note that the accuracy for Case 1 (first noun not present in the training set) is much higher than the accuracy for Case 2 (second noun not present in the training set). This seems to indicate that the second noun is more important for the classification than the first one.</Paragraph> </Section> </Paper>