<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1030"> <Title>Shallow Semantic Parsing using Support Vector Machines</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Problem Description </SectionTitle> <Paragraph position="0"> The problem of shallow semantic parsing can be viewed as three different tasks.</Paragraph> <Paragraph position="1"> Argument Identification - This is the process of identifying parsed constituents in the sentence that represent semantic arguments of a given predicate.</Paragraph> <Paragraph position="2"> Argument Classification - Given constituents known to represent arguments of a predicate, assign the appropriate argument labels to them.</Paragraph> <Paragraph position="3"> Argument Identification and Classification - A combination of the above two tasks.</Paragraph> <Paragraph position="4"> Each node in the parse tree can be classified as either one that represents a semantic argument (i.e., a NON-NULL node) or one that does not represent any semantic argument (i.e., a NULL node). The NON-NULL nodes can then be further classified into the set of argument labels. For example, in the tree of Figure 1, the node IN that encompasses &quot;for&quot; is a NULL node because it does not correspond to a semantic argument. The node NP that encompasses &quot;about 20 minutes&quot; is a NON-NULL node, since it does correspond to a semantic argument - ARGM-TMP.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Baseline Features </SectionTitle> <Paragraph position="0"> Our baseline system uses the same set of features introduced by G&J. Some of the features, viz., predicate, voice and verb sub-categorization, are shared by all the nodes in the tree. All the others change with the constituent under consideration.</Paragraph> <Paragraph position="1"> Predicate - The predicate itself is used as a feature.</Paragraph> <Paragraph position="2"> Path - The syntactic path through the parse tree from the parse constituent to the predicate being classified. For example, in Figure 1, the path from ARG0 - &quot;He&quot; - to the predicate talked is represented with the string NP↑S↓VP↓VBD, where ↑ and ↓ represent upward and downward movement in the tree respectively. Phrase Type - This is the syntactic category (NP, PP, S, etc.) of the phrase/constituent corresponding to the semantic argument.</Paragraph> <Paragraph position="3"> Position - This is a binary feature identifying whether the phrase is before or after the predicate.</Paragraph> <Paragraph position="4"> Voice - Whether the predicate is realized as an active or passive construction.</Paragraph> <Paragraph position="5"> Head Word - The syntactic head of the phrase. This is calculated using a head word table described by Magerman (1994) and modified by Collins (1999, Appendix A).</Paragraph> <Paragraph position="6"> Sub-categorization - This is the phrase structure rule expanding the predicate's parent node in the parse tree. For example, in Figure 1, the sub-categorization for the predicate talked is VP→VBD-PP.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Classifier and Implementation </SectionTitle> <Paragraph position="0"> We formulate the parsing problem as a multi-class classification problem and use a Support Vector Machine (SVM) classifier (Hacioglu et al., 2003; Pradhan et al., 2003).
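As a rough illustration of the path feature described above, the following sketch computes a G&J-style path string from a phrase-structure parse using NLTK's Tree class. The use of NLTK, the helper name, and the '^'/'!' separators (standing in for the up/down arrows) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: compute the syntactic path between a constituent and
# the predicate, e.g. NP^S!VP!VBD for "He" -> "talked" ('^' = up, '!' = down).
from nltk.tree import Tree

def path_feature(tree, constituent_pos, predicate_pos):
    """Tree positions are tuples as returned by Tree.treepositions()."""
    # Lowest common ancestor: longest shared prefix of the two positions.
    i = 0
    while (i < min(len(constituent_pos), len(predicate_pos))
           and constituent_pos[i] == predicate_pos[i]):
        i += 1
    lca = constituent_pos[:i]
    # Labels going up from the constituent to (and including) the LCA ...
    up = [tree[constituent_pos[:k]].label()
          for k in range(len(constituent_pos), len(lca), -1)]
    up.append(tree[lca].label())
    # ... and going down from the LCA to the predicate's POS node.
    down = [tree[predicate_pos[:k]].label()
            for k in range(len(lca) + 1, len(predicate_pos) + 1)]
    return '^'.join(up) + '!' + '!'.join(down)

if __name__ == '__main__':
    t = Tree.fromstring(
        '(S (NP (PRP He)) (VP (VBD talked) (PP (IN for) '
        '(NP (QP (RB about) (CD 20)) (NNS minutes)))))')
    print(path_feature(t, (0,), (1, 0)))   # -> NP^S!VP!VBD
```

The PartialPath feature introduced later (Section 7.2) corresponds to keeping only the upward portion of this string, up to the lowest common ancestor.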
Since SVMs are binary classifiers, we have to convert the multi-class problem into a number of binary-class problems. We use the ONE vs ALL (OVA) formalism, which involves training n binary classifiers for an n-class problem.</Paragraph> <Paragraph position="1"> Since the training time taken by SVMs scales exponentially with the number of examples, and about 80% of the nodes in a syntactic tree have NULL argument labels, we found it efficient to divide the training process into two stages, while maintaining the same accuracy: 1. Filter out the nodes that have a very high probability of being NULL. A binary NULL vs NON-NULL classifier is trained on the entire dataset. A sigmoid function is fitted to the raw scores to convert the scores to probabilities as described by Platt (2000).</Paragraph> <Paragraph position="2"> 2. The remaining training data is used to train OVA classifiers, one of which is the NULL-NON-NULL classifier.</Paragraph> <Paragraph position="3"> With this strategy only one classifier (NULL vs NON-NULL) has to be trained on all of the data. The remaining OVA classifiers are trained on the nodes passed by the filter (approximately 20% of the total), resulting in a considerable savings in training time.</Paragraph> <Paragraph position="4"> In the testing stage, we do not perform any filtering of NULL nodes. All the nodes are classified directly as NULL or one of the arguments using the classifiers trained in step 2 above. We observe no significant performance improvement even if we filter the most likely NULL nodes in a first pass.</Paragraph> <Paragraph position="5"> For our experiments, we used TinySVM along with YamCha (Kudo and Matsumoto, 2000; Kudo and Matsumoto, 2001) as the SVM training and test software. The system uses a polynomial kernel with degree 2; the cost per unit violation of the margin, C=1; and a tolerance of the termination criterion, e=0.001.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Baseline System Performance </SectionTitle> <Paragraph position="0"> Table 1 shows the baseline performance numbers on the three tasks mentioned earlier; these results are based on syntactic features computed from hand-corrected TreeBank (hence LDC hand-corrected) parses.</Paragraph> <Paragraph position="1"> For the argument identification and the combined identification and classification tasks, we report the precision (P), recall (R) and F1 scores, and for the argument classification task we report the classification accuracy (A).</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 System Improvements </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.1 Disallowing Overlaps </SectionTitle> <Paragraph position="0"> The system as described above might label two constituents NON-NULL even if they overlap in words. This is a problem since overlapping arguments are not allowed in PropBank. Among the overlapping constituents we retain the one for which the SVM has the highest confidence, and label the others NULL. The probabilities obtained by applying the sigmoid function to the raw SVM scores are used as the measure of confidence. Table 2 shows the performance of the parser on the task of identifying and labeling semantic arguments using the hand-corrected parses.
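A minimal sketch of the overlap-removal step just described, assuming constituents are represented as (start, end, label, probability) tuples with probabilities taken from the fitted sigmoid. The data structures and the greedy highest-confidence-first sweep are assumptions for illustration, not the authors' code (the paper notes its decisions are taken independently).

```python
# Sketch: among word-overlapping constituents labeled NON-NULL, keep the one
# the classifier is most confident about and treat the rest as NULL.
def remove_overlaps(candidates):
    """candidates: list of (start, end, label, prob); spans are word indices,
    end exclusive. Returns the retained (non-overlapping) constituents."""
    kept = []
    for start, end, label, prob in sorted(candidates, key=lambda c: -c[3]):
        overlaps = any(start < k_end and k_start < end
                       for k_start, k_end, _, _ in kept)
        if not overlaps:
            kept.append((start, end, label, prob))
    return kept

print(remove_overlaps([(0, 3, 'ARG1', 0.90), (2, 5, 'ARGM-TMP', 0.60),
                       (6, 8, 'ARGM-LOC', 0.75)]))
# -> keeps ARG1 (0-3) and ARGM-LOC (6-8); drops the overlapping ARGM-TMP
```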
On all the system improvements, we perform a χ2 test of significance at p = 0.05.</Paragraph> <Paragraph position="2"> All statistically significant improvements are marked as such. In this system, the overlap-removal decisions are taken independently of each other.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.2 New Features </SectionTitle> <Paragraph position="0"> We tested several new features. Two were obtained from the literature - named entities in constituents and head word part of speech. The others are novel features. 1. Named Entities in Constituents - Following Surdeanu et al. (2003), we tagged 7 named entities (PERSON, ORGANIZATION, LOCATION, PERCENT, MONEY, TIME, DATE) using IdentiFinder (Bikel et al., 1999) and added them as 7 binary features.</Paragraph> <Paragraph position="1"> 2. Head Word POS - Surdeanu et al. (2003) showed that using the part of speech (POS) of the head word gave a significant performance boost to their system. Following that, we experimented with the addition of this feature to our system.</Paragraph> <Paragraph position="2"> 3. Verb Clustering - Since our training data is relatively limited, any real-world test set will contain predicates that have not been seen in training. In these cases, we can benefit from some information about the predicate by using the predicate cluster as a feature. The verbs were clustered into 64 classes using the probabilistic co-occurrence model of Hofmann and Puzicha (1998). The clustering algorithm uses a database of verb-direct-object relations extracted by Lin (1998). We then use the verb class of the current predicate as a feature.</Paragraph> <Paragraph position="3"> 4. Partial Path - For the argument identification task, path is the most salient feature. However, it is also the most data-sparse feature. To overcome this problem, we tried generalizing the path by adding a new feature that contains only the part of the path from the constituent to the lowest common ancestor of the predicate and the constituent, which we call &quot;PartialPath&quot;. 5. Verb Sense Information - The arguments that a predicate can take depend on the word sense of the predicate. Each predicate tagged in the PropBank corpus is assigned a separate set of arguments depending on the sense in which it is used. Table 3 illustrates the argument sets for the predicate talk in the PropBank corpus. Depending on the sense of the predicate talk, either ARG1 or ARG2 can identify the hearer. Absence of this information can be potentially confusing to the learning mechanism.</Paragraph> <Paragraph position="5"> We added the oracle sense information extracted from PropBank to our features by treating each sense of a predicate as a distinct predicate.</Paragraph> <Paragraph position="6"> 6. Head Word of Prepositional Phrases - Many adjunctive arguments, such as temporals and locatives, occur as prepositional phrases in a sentence, and it is often the case that the head words of those phrases, which are always prepositions, are not very discriminative; e.g., &quot;in the city&quot; and &quot;in a few minutes&quot; share the same head word &quot;in&quot; and neither contains a named entity, but the former is ARGM-LOC, whereas the latter is ARGM-TMP. Therefore, we tried replacing the head word of a prepositional phrase with that of the first noun phrase inside the prepositional phrase.
We retained the preposition information by appending it to the phrase type, e.g., &quot;PP-in&quot; instead of &quot;PP&quot; (a brief illustrative sketch of this substitution appears below).</Paragraph> <Paragraph position="7"> 7. First and Last Word/POS in Constituent - Some arguments tend to contain discriminative first and last words, so we tried using them, along with their parts of speech, as four new features.</Paragraph> <Paragraph position="8"> 8. Ordinal constituent position - In order to avoid false positives of the type where constituents far away from the predicate are spuriously identified as arguments, we added this feature, which is a concatenation of the constituent type and its ordinal position from the predicate.</Paragraph> <Paragraph position="9"> 9. Constituent tree distance - This is a finer-grained way of specifying the already present position feature.</Paragraph> <Paragraph position="10"> 10. Constituent relative features - These are nine features representing the phrase type, head word and head word part of speech of the parent, and of the left and right siblings, of the constituent in focus. These were added on the intuition that encoding the tree context this way might add robustness and improve generalization. 11. Temporal cue words - There are several temporal cue words that are not captured by the named entity tagger; these were considered for addition as a binary feature indicating their presence.</Paragraph> <Paragraph position="11"> 12. Dynamic class context - In the task of argument classification, these are dynamic features that represent the hypotheses of at most the previous two nodes belonging to the same tree as the node being classified.</Paragraph> </Section> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Feature Performance </SectionTitle> <Paragraph position="0"> Table 4 shows the effect each feature has on the argument classification and argument identification tasks, when added individually to the baseline. Addition of named entities improves the F1 score for the adjunctive arguments ARGM-LOC from 59% to 68% and ARGM-TMP from 78.8% to 83.4%. However, since these arguments are few in number compared to the core arguments, the overall accuracy does not show a significant improvement. We found that adding this feature to the NULL vs NON-NULL classifier degraded its performance. Table 4 also shows the separate contributions of replacing the head word and the head word POS in the feature where the head of a prepositional phrase is replaced by the head word of the noun phrase inside it. The combination of constituent relative features appears to yield a significant improvement on one or both of the classification and identification tasks, as do the first and last words in the constituent.</Paragraph> <Paragraph position="2"> We tried two other ways of generalizing the head word: i) adding the head word cluster as a feature, and ii) replacing the head word with a named entity if it belonged to any of the seven named entities mentioned earlier. Neither method showed any improvement.
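As referenced in feature 6 above, the following is a hedged sketch of the prepositional-phrase head-word substitution, again using NLTK's Tree. The crude "rightmost word of the first NP" head rule stands in for the Magerman/Collins head table and is an assumption made purely for illustration.

```python
# Sketch of feature 6: for a PP, append the preposition to the phrase type
# ('PP-in' instead of 'PP') and use the head of the first NP inside the PP
# as the head word instead of the preposition itself.
from nltk.tree import Tree

def pp_phrase_type_and_head(pp):
    assert pp.label() == 'PP'
    preposition = pp[0].leaves()[0].lower()        # e.g. 'in'
    phrase_type = 'PP-' + preposition
    head_word = preposition                        # fallback: the preposition
    for child in pp:
        if isinstance(child, Tree) and child.label() == 'NP':
            head_word = child.leaves()[-1].lower() # crude NP head choice
            break
    return phrase_type, head_word

print(pp_phrase_type_and_head(
    Tree.fromstring('(PP (IN in) (NP (DT the) (NN city)))')))
# -> ('PP-in', 'city')
```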
We also tried generalizing the path feature by i) compressing sequences of identical labels, and ii) removing the direction in the path, but neither showed any improvement over the baseline.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 8.1 Argument Sequence Information </SectionTitle> <Paragraph position="0"> In order to improve the performance of their statistical argument tagger, G&J used the fact that a predicate is likely to instantiate a certain set of arguments. We use a similar strategy, with some additional constraints: i) argument ordering information is retained, and ii) the predicate is considered as an argument and is part of the sequence.</Paragraph> <Paragraph position="1"> We achieve this by training a trigram language model on the argument sequences, so unlike G&J, we can also estimate the probability of argument sets not seen in the training data. We first convert the raw SVM scores to probabilities using a sigmoid function. Then, for each sentence being parsed, we generate an argument lattice using the n-best hypotheses for each node in the syntax tree. We then perform a Viterbi search through the lattice, using the probabilities assigned by the sigmoid as the observation probabilities along with the language model probabilities, to find the maximum likelihood path through the lattice, such that each node is either assigned a value belonging to the PropBank arguments, or NULL.</Paragraph> <Paragraph position="3"> The search is constrained in such a way that no two NON-NULL nodes overlap with each other. To simplify the search, we allowed only NULL assignments to nodes having a NULL likelihood above a threshold. While training the language model, we can either use the actual predicate to estimate the transition probabilities in and out of the predicate, or we can perform a joint estimation over all the predicates. We implemented both cases, considering the two best hypotheses per node, which always include NULL (we add NULL to the list if it is not among the top two). On performing the search, we found that the overall performance improvement was not much different from that obtained by resolving overlaps as mentioned earlier. However, we found that there was an improvement in the CORE ARGUMENT accuracy on the combined task of identifying and assigning semantic arguments, given hand-corrected parses, whereas the accuracy of the ADJUNCTIVE ARGUMENTS slightly deteriorated. This seems logical considering that the ADJUNCTIVE ARGUMENTS are not linguistically constrained in any way as to their position in the sequence of arguments, or even their number. We therefore decided to use this strategy only for the CORE ARGUMENTS. Although there was an increase in F1 score when the language model probabilities were jointly estimated over all the predicates, this improvement is not statistically significant. However, estimating them using specific predicate lemmas showed a significant improvement in accuracy.
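To make the lattice search described above more concrete, here is a much-simplified sketch: each candidate node contributes its n-best (label, probability) hypotheses from the sigmoid-calibrated SVM, and a language model over argument-label sequences rescores whole paths. For brevity it enumerates paths exhaustively with a bigram-style scoring function and a toy LM, whereas the paper describes a Viterbi search with a trigram model and non-overlap constraints; all names and numbers here are hypothetical.

```python
# Simplified sketch of rescoring argument-label sequences with a label LM.
import math
from itertools import product

def best_sequence(node_hypotheses, lm_logprob):
    """node_hypotheses: one list of (label, prob) pairs per candidate node.
    lm_logprob(prev_label, label): log P(label | prev_label)."""
    best_labels, best_score = None, float('-inf')
    for path in product(*node_hypotheses):          # exhaustive, for clarity
        score, prev = 0.0, '<s>'
        for label, prob in path:
            score += math.log(prob) + lm_logprob(prev, label)
            prev = label
        if score > best_score:
            best_labels, best_score = [label for label, _ in path], score
    return best_labels

# Toy LM that penalizes repeating the same (non-NULL) argument label.
def toy_lm(prev, label):
    return math.log(0.05 if prev == label != 'NULL' else 0.5)

hyps = [[('ARG0', 0.70), ('NULL', 0.30)],
        [('ARG0', 0.55), ('ARG1', 0.45)]]
print(best_sequence(hyps, toy_lm))   # -> ['ARG0', 'ARG1']
```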
The performance improvement is shown in Table 5.</Paragraph> </Section> </Section> <Section position="10" start_page="0" end_page="0" type="metho"> <SectionTitle> 9 Best System Performance </SectionTitle> <Paragraph position="0"> The best system is trained by first filtering out the most likely NULL nodes using the best NULL vs NON-NULL classifier, trained with all the features whose argument identification F1 scores are marked in bold in Table 4, and then training a ONE vs ALL classifier on the data remaining after the filtering, using the features that contribute positively to the classification task - those whose accuracies are marked in bold in Table 4. Table 6 shows the performance of this system.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 10 Using Automatic Parses </SectionTitle> <Paragraph position="0"> Thus far, we have reported results using hand-corrected parses. In real-world applications, the system will have to extract features from an automatically generated parse. To evaluate this scenario, we used the Charniak parser (Charniak, 2001) to generate parses for the PropBank training and test data. We lemmatized the predicate using the XTAG morphology database (Daniel et al., 1992).</Paragraph> <Paragraph position="1"> Table 7 shows the performance degradation when automatically generated parses are used.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 11 Using Latest PropBank Data </SectionTitle> <Paragraph position="0"> Owing to the February 2004 release of much more, and completely adjudicated, PropBank data, we have a chance to report our performance numbers on this data set.</Paragraph> <Paragraph position="1"> Table 8 shows the same information as in the previous Tables 6 and 7, but generated using the new data. Owing to time limitations, we could not get results on the argument identification task and the combined argument identification and classification task using automatic parses.</Paragraph> </Section> </Section> <Section position="11" start_page="0" end_page="0" type="metho"> <SectionTitle> 12 Feature Analysis </SectionTitle> <Paragraph position="0"> In analyzing the performance of the system, it is useful to estimate the relative contribution of the various feature sets used. Table 9 shows the argument classification accuracies for combinations of features on the training and test data, using hand-corrected parses, for all PropBank arguments.</Paragraph> <Paragraph position="2"> In the upper part of Table 9 we see the degradation in performance from leaving out one feature or feature family at a time. After the addition of all the new features, removing any individual feature other than the predicate does not degrade classification performance significantly, as other features provide complementary information. However, removing the predicate information hurts performance significantly, as does removing a whole family of features, e.g., all phrase types, or the head word (HW), first word (FW) and last word (LW) information. The lower part of the table shows the performance of some feature combinations by themselves.</Paragraph> <Paragraph position="3"> Table 10 shows the feature salience on the task of argument identification.
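The leave-one-out analysis summarized in Tables 9 and 10 can be sketched as below; train_and_evaluate is a hypothetical stand-in for retraining the classifier and scoring it on the test set, not part of the authors' system.

```python
# Hedged sketch of leave-one-feature-(family)-out ablation.
def ablation(feature_families, train_and_evaluate):
    """feature_families: dict mapping a family name to its feature names.
    train_and_evaluate(features) -> accuracy (hypothetical helper)."""
    all_features = [f for family in feature_families.values() for f in family]
    baseline = train_and_evaluate(all_features)
    drops = {}
    for name, family in feature_families.items():
        kept = [f for f in all_features if f not in family]
        drops[name] = baseline - train_and_evaluate(kept)  # accuracy drop
    return baseline, drops
```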
One important observation we can make here is that the path feature is the most salient feature in the task of argument identification, whereas it is the least salient in the task of argument classification. We could not provide numbers for argument identification performance upon removal of the path feature, since doing so made the SVM training prohibitively slow, indicating that the SVM had a very hard time separating the NULL class from the NON-NULL class.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 13 Comparison with Other Systems </SectionTitle> <Paragraph position="0"> We compare our system against four other shallow semantic parsers in the literature. In comparing systems, results are reported for all three types of tasks mentioned earlier.</Paragraph> <Paragraph position="1"> 13.1 Description of the Systems The Gildea and Palmer (G&P) System.</Paragraph> <Paragraph position="2"> The Gildea and Palmer (2002) system uses the same features and the same classification mechanism used by G&J. These results are reported on the December 2001 release of PropBank.</Paragraph> <Paragraph position="3"> The Surdeanu et al. System.</Paragraph> <Paragraph position="4"> Surdeanu et al. (2003) report results on two systems using a decision tree classifier. The first uses exactly the same features as the G&J system; we call this &quot;Surdeanu System I.&quot; They then show improved performance with another system, &quot;Surdeanu System II,&quot; which uses some additional features. These results are reported on the July 2002 release of PropBank.</Paragraph> <Paragraph position="5"> The Gildea and Hockenmaier (G&H) System The Gildea and Hockenmaier (2003) system uses features extracted from a Combinatory Categorial Grammar (CCG), corresponding to the features used by the G&J and G&P systems. CCG is a form of dependency grammar that is expected to capture long-distance relationships better than a phrase structure grammar. The features are combined using the same algorithm as in G&J and G&P. They use a slightly newer, November 2002, release of PropBank. We will refer to this as &quot;G&H System I&quot;.</Paragraph> <Paragraph position="6"> The Chen and Rambow (C&R) System Chen and Rambow report on two different systems, also using a decision tree classifier. The first, &quot;C&R System I,&quot; uses surface syntactic features much like the G&P system. The second, &quot;C&R System II,&quot; uses additional syntactic and semantic representations extracted from a Tree Adjoining Grammar (TAG), another grammar formalism that better captures the syntactic properties of natural languages.</Paragraph> <Paragraph position="7"> Since two systems, in addition to ours, report results using the same set of features on the same data, we can directly assess the influence of the classifiers. The G&P system estimates the posterior probabilities using several different feature sets and interpolates the estimates, while Surdeanu et al. (2003) use a decision tree classifier. Table 11 shows a comparison between the three systems for the task of argument classification.</Paragraph> </Section> </Section> <Section position="12" start_page="0" end_page="0" type="metho"> <SectionTitle> 13.3 Argument Identification (NULL vs NON-NULL) </SectionTitle> <Paragraph position="0"> Table 12 compares the results of the task of identifying the parse constituents that represent semantic arguments.
As expected, the performance degrades considerably when we extract features from an automatic parse as opposed to a hand-corrected parse. This indicates that the syntactic parser performance directly influences the argument boundary identification performance. This could be attributed to the fact that the two features, viz., Path and Head Word, that have been shown to be good discriminators of the semantically salient nodes in the syntax tree, are derived from the syntax tree.</Paragraph> <Paragraph position="1"> A comparison of various systems, at various levels of classification granularity and parse accuracy, shows that the SVM system performs significantly better than all the other systems on all PropBank arguments.</Paragraph> <Paragraph position="2"> On the combined task, the system first identifies candidate argument boundaries and then labels them with the most likely argument. This is the hardest of the three tasks outlined earlier. The SVM does a very good job of generalizing in both stages of processing. 14 Generalization to a New Text Source: Thus far, in all experiments, our unseen test data was selected from the same source as the training data.</Paragraph> <Paragraph position="3"> In order to see how well the features generalize to texts drawn from a similar source, we applied the classifier trained on the PropBank training data to test data drawn from the AQUAINT corpus (LDC, 2002). We annotated 400 sentences from the AQUAINT corpus with PropBank arguments. This is a collection of text from the New York Times News Service and other newswire sources. There is a significant drop in the precision and recall numbers for the AQUAINT test set (compared to the precision and recall numbers for the PropBank test set, which were 84% and 75% respectively). One possible reason for the drop in performance is the relative coverage of the features on the two test sets. The head word, path and predicate features all have a large number of possible values and could contribute to lower coverage when moving from one domain to another. Also, being more specific, they might not transfer well across domains.</Paragraph> <Paragraph position="4"> Feature coverage was measured on the hand-corrected PropBank test set. The tables show feature coverage for constituents that were arguments and constituents that were NULL. About 99% of the predicates in the AQUAINT test set were seen in the PropBank training set. Table 17 shows coverage for the same features on the AQUAINT test set. We believe that the drop in coverage of the more predictive feature combinations explains part of the drop in performance.</Paragraph> </Section> </Paper>