<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4036"> <Title>Parsing Arguments of Nominalizations in English and Chinese</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Semantic Annotation and Corpora </SectionTitle> <Paragraph position="0"> For our experiments, we use the FrameNet database (Baker et al., 1998) which contains frame-specific se-This research was partially supported by the ARDA AQUAINT program via contract OCG4423B and by the NSF via grant IS-9978025 mantic annotation of a number of predicates in English. Predicates are grouped by the semantic frame that they instantiate, depending on the sense of their usage, and their arguments assume one of the frame elements or roles specific to that frame. The predicate can be a verb, noun, adjective, prepositional phrase, etc. FrameNet contains about 500 different frame types and about 700 distinct frame elements. The following example illustrates the general idea. Here, the predicate &quot;complain&quot; instantiates a &quot;Statement&quot; frame once as a nominal predicate and once as a verbal predicate.</Paragraph> <Paragraph position="1"> Did [Speaker she] make an official [Predicate:nominal complaint] [Addressee to you] [Topic about the attack.] [Message\Justice has not been done&quot;] [Speaker he] [Predicate:verbal complained.] Nominal predicates in FrameNet include ultra-nominals (Barker and Dowty, 1992), nominals and nominalizations. For the purposes of this study, a human analyst went through the nominal predicates in FrameNet and selected those that were identified as nominalizations in NOMLEX (Macleod et al., 1998). Out of those, the analyst then selected ones that were eventive nominalizations.</Paragraph> <Paragraph position="2"> These data comprise 7,333 annotated sentences, with 11,284 roles. There are 105 frames with about 190 distinct frame role1 types. A stratified sampling over predicates was performed to select 80% of this data for training, 10% for development and another 10% for testing. For the Chinese semantic parsing experiments, we selected 22 nominalizations from the Penn Chinese Tree-bank and tagged all the sentences containing these predicates with PropBank (Kingsbury and Palmer, 2002) style arguments - ARG0, ARG1, etc. These consisted of 630 sentences. These are then split into two parts: 503 (80%) for training and 127 (20%) for testing.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Baseline System </SectionTitle> <Paragraph position="0"> The primary assumption in our system is that a semantic argument aligns with some syntactic constituent. The goal is to identify and label constituents in a syntactic tree that represent valid semantic arguments of a given predicate. Unlike PropBank, there are no hand-corrected parses available for the sentences in FrameNet, so we cannot quantify the possible mis-alignment of the nominal arguments with syntactic constituents. The arguments that do not align with any constituent are simply missed by the current system.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Features We created a baseline system using </SectionTitle> <Paragraph position="0"> all and only those features introduced by Gildea and Jurafsky that are directly applicable to nominal predicates. Most of the features are extracted from the syntactic parse of a sentence. 
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Baseline System </SectionTitle> <Paragraph position="0"> The primary assumption in our system is that a semantic argument aligns with some syntactic constituent. The goal is to identify and label constituents in a syntactic tree that represent valid semantic arguments of a given predicate. Unlike PropBank, there are no hand-corrected parses available for the sentences in FrameNet, so we cannot quantify the possible misalignment of the nominal arguments with syntactic constituents. The arguments that do not align with any constituent are simply missed by the current system.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Features </SectionTitle> <Paragraph position="0"> We created a baseline system using all and only those features introduced by Gildea and Jurafsky that are directly applicable to nominal predicates. Most of the features are extracted from the syntactic parse of a sentence. We used the Charniak parser (Charniak, 2001) to parse the sentences in order to perform feature extraction. The features are listed below: Predicate - The predicate lemma is used as a feature.</Paragraph> <Paragraph position="1"> Path - The syntactic path through the parse tree from the parse constituent being classified to the predicate (a sketch of this feature follows Section 3).</Paragraph> <Paragraph position="2"> Constituent type - This is the syntactic category (NP, PP, S, etc.) of the constituent corresponding to the semantic argument.</Paragraph> <Paragraph position="3"> Position - This is a binary feature identifying whether the constituent is before or after the predicate.</Paragraph> <Paragraph position="4"> Head word - The syntactic head of the constituent.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Classifier and Implementation </SectionTitle> <Paragraph position="0"> We formulate the parsing problem as a multi-class classification problem and use a Support Vector Machine (SVM) classifier in the ONE vs ALL (OVA) formalism, which involves training n classifiers for an n-class problem - including the NULL class. We use TinySVM along with YamCha (Kudo and Matsumoto, 2000, 2001) as the SVM training and test software.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Performance </SectionTitle> <Paragraph position="0"> We evaluate our system on three tasks: i) Argument Identification: identifying parse constituents that represent arguments of a given predicate, ii) Argument Classification: labeling the constituents that are known to represent arguments with the most likely roles, and iii) Argument Identification and Classification: finding constituents that represent arguments of a predicate and labeling them with the most likely roles. The baseline performance on the three tasks is shown in Table 1.</Paragraph> </Section> </Section>
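Of the baseline features, the path is the only one that requires any real computation. The following is an illustrative sketch, not the authors' code: it assumes an nltk Tree built from the parser output and tree positions (tuples of child indices) for the candidate constituent and for the predicate's part-of-speech node; the helper name and the up/down separators are arbitrary choices.

    from nltk import Tree

    def path_feature(tree, const_pos, pred_pos):
        """Syntactic path from the candidate constituent to the predicate:
        phrase labels going up to the lowest common ancestor ('^' steps)
        and back down to the predicate's POS node ('!' steps)."""
        shared = 0
        for a, b in zip(const_pos, pred_pos):
            if a != b:
                break
            shared += 1
        # labels from the constituent up to, but excluding, the common ancestor
        up = [tree[const_pos[:i]].label() for i in range(len(const_pos), shared, -1)]
        # the common ancestor itself, then labels down to the predicate node
        lca = tree[pred_pos[:shared]].label()
        down = [tree[pred_pos[:i]].label() for i in range(shared + 1, len(pred_pos) + 1)]
        return "^".join(up + [lca]) + "!" + "!".join(down)

    # Example: candidate constituent "she", nominal predicate "complaint".
    t = Tree.fromstring("(S (NP (PRP she)) (VP (VBD made) (NP (DT a) (NN complaint))))")
    print(path_feature(t, (0,), (1, 1, 1)))   # NP^S!VP!NP!NN

The constituent type, position and head word features fall out of the same tree positions, so they are not repeated here.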
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 New Features </SectionTitle> <Paragraph position="0"> To improve the baseline performance, we investigated additional features that would provide useful information in identifying arguments of nominalizations. Following is a description of each feature along with an intuitive justification. Some of these features are not instantiated for a particular constituent. In those cases, the respective feature values are set to &quot;UNK&quot;.</Paragraph> <Paragraph position="1"> 1. Frame - The frame instantiated by the particular sense of the predicate in a sentence. This is an oracle feature. 2. Selected words/POS in constituent - Nominal predicates tend to assign arguments most commonly through postnominal of-complements, possessive prenominal modifiers, etc. We added the values of the first and last word in the constituent as two separate features. Another two features represent the part of speech of these words. 3. Ordinal constituent position - Arguments of nouns tend to be located closer to the predicate than those of verbs. This feature captures the ordinal position of a particular constituent to the left or right of the predicate on a left or right tree traversal, e.g., first PP from the predicate, second NP from the predicate, etc. This feature, along with the position, will encode the before/after information for the constituent.</Paragraph> <Paragraph position="2"> 4. Constituent tree distance - Another way of quantifying the position of the constituent is to identify its index in the list of constituents that are encountered during a linear traversal of the tree from the predicate to the constituent.</Paragraph> <Paragraph position="3"> 5. Intervening verb features - Support verbs play an important role in realizing the arguments of nominal predicates. We use three classes of intervening verbs: i) auxiliary verbs - ones with part of speech AUX, ii) light verbs - a small set of known light verbs: took, take, make, made, give, gave, went and go, and iii) other verbs - those with part of speech VBx. We added three features for each: i) a binary feature indicating the presence of the verb between the predicate and the constituent, ii) the actual word as a feature, and iii) the path through the tree from the constituent to the verb, as the subjects of intervening verbs sometimes tend to be arguments of nominalizations. The following example illustrates the intuition behind this feature: [Speaker Leapor] makes general [Predicate assertions] [Topic about marriage] 6. Predicate NP expansion rule - This is the noun equivalent of the verb sub-categorization feature used by Gildea and Jurafsky (2002). This is the expansion rule instantiated by the parser for the lowermost NP in the tree encompassing the predicate. This would tend to cluster NPs with a similar internal structure and would thus help in finding argumentive modifiers (see the sketch after this list).</Paragraph> <Paragraph position="4"> 7. Noun head of prepositional phrase constituents - Instead of using the standard head word rule for prepositional phrases, we use the head word of the first NP inside the PP as the head of the PP and replace the constituent type PP with PP-<preposition>.</Paragraph> <Paragraph position="5"> 8. Constituent sibling features - These are six features representing the constituent type, head word and part of speech of the head word of the left and right siblings of the constituent in consideration. These are used to capture arguments represented by the modifiers of nominalizations.</Paragraph> <Paragraph position="6"> 9. Partial-path from constituent to predicate - This is the path from the constituent to the lowest common parent of the constituent and the predicate. This is used to generalize the path statistics.</Paragraph> <Paragraph position="7"> 10. Is predicate plural - A binary feature indicating whether the predicate is singular or plural, as the two tend to have different argument selection properties.</Paragraph> <Paragraph position="8"> 11. Genitives in constituent - This is a binary feature which is true if there is a genitive word (one with the part of speech POS, PRP, PRP$ or WP$) in the constituent, as these tend to be markers for nominal arguments, as in [Speaker Burma 's] [Phenomenon oil] [Predicate search] hits virgin forests 12. Constituent parent features - Same as the sibling features, except that these are extracted from the constituent's parent.</Paragraph> <Paragraph position="9"> 13. Verb dominating predicate - The head word of the first VP ancestor of the predicate.</Paragraph> <Paragraph position="10"> 14. Named Entities in Constituent - As in Surdeanu et al. (2003), this is represented as seven binary features extracted after tagging the sentence with BBN's IdentiFinder (Bikel et al., 1999) named entity tagger.</Paragraph> </Section>
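To make a few of these concrete, the sketch below gives illustrative implementations of features 6, 10 and 11 over an nltk-style constituency tree, as in the sketch following Section 3. The function names are hypothetical and the tag inventories are simply the ones listed above; this is not the authors' code.

    from nltk import Tree

    GENITIVE_TAGS = {"POS", "PRP", "PRP$", "WP$"}

    def genitives_in_constituent(constituent):
        """Feature 11: true if any word in the constituent carries one of the
        genitive-marking part-of-speech tags listed above."""
        preterminals = [sub for sub in constituent.subtrees() if sub.height() == 2]
        return any(p.label() in GENITIVE_TAGS for p in preterminals)

    def is_predicate_plural(pred_pos_tag):
        """Feature 10: plural nominal predicates are tagged NNS or NNPS."""
        return pred_pos_tag in ("NNS", "NNPS")

    def predicate_np_expansion_rule(tree, pred_pos):
        """Feature 6: the production of the lowermost NP dominating the
        predicate, e.g. 'NP -> PRP$ JJ NN'. `pred_pos` is the tree position
        (tuple of child indices) of the predicate's part-of-speech node."""
        for i in range(len(pred_pos), -1, -1):
            node = tree[pred_pos[:i]]
            if isinstance(node, Tree) and node.label() == "NP":
                return "NP -> " + " ".join(child.label() for child in node)
        return "UNK"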
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Feature Analysis and Best System Performance </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 English </SectionTitle> <Paragraph position="0"> For the task of argument identification, features 2, 3, 4, 5 (the verb itself, the path to the light verb and the presence of a light verb), 6, 7, 9, 10 and 13 contributed positively to the performance. The Frame feature degrades performance significantly. This could be just an artifact of the data sparsity. We trained a new classifier using all the features that contributed positively to the performance, and the Fβ=1 score increased from the baseline of 72.8% to 76.3% (χ2; p < 0.05).</Paragraph> <Paragraph position="1"> For the task of argument classification, adding the Frame feature to the baseline features provided the most significant improvement, increasing the classification accuracy from 70.9% to 79.0% (χ2; p < 0.05). All other features, added one by one to the baseline, did not bring any significant improvement, which might again be owing to the comparatively small training and test data sizes. All the features together produced a classification accuracy of 80.9%. Since the Frame feature is an oracle, we were interested in finding out what all the other features combined contributed.</Paragraph> <Paragraph position="2"> We ran an experiment with all features except Frame added to the baseline, and this produced an accuracy of 73.1%, which, however, is not a statistically significant improvement over the baseline of 70.9%.</Paragraph> <Paragraph position="3"> For the task of argument identification and classification, features 8 and 11 (right sibling head word part of speech) hurt performance. We trained a classifier using all the features that contributed positively to the performance, and the resulting system had an improved Fβ=1 score of 56.5% compared to the baseline of 51.4%.</Paragraph> <Paragraph position="5"> We found that a significant subset of the features that contribute marginally to the classification performance hurt the identification task. Therefore, we decided to perform a two-step process in which we use the set of features that gave optimum performance for the argument identification task and identify all likely argument nodes. Then, for those nodes, we use all the available features and classify them into one of the possible classes. This &quot;two-pass&quot; system performs slightly better than the &quot;one-pass&quot; system mentioned earlier. Again, we performed the second pass of classification with and without the Frame feature.</Paragraph> <Paragraph position="6"> Table 2 shows the improved performance numbers.</Paragraph> </Section>
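The two-pass procedure amounts to a short wrapper around the two classifiers. The sketch below is schematic rather than a description of the authors' implementation: id_clf and cls_clf stand for the identification and classification SVMs (TinySVM/YamCha in the paper; any classifier exposing a scikit-learn-style predict() would do here), and id_features/all_features are hypothetical callbacks returning the reduced and the full feature vectors for a parse constituent.

    def two_pass_label(constituents, id_features, all_features, id_clf, cls_clf):
        """Pass 1: keep only constituents that the identification classifier
        marks as arguments (non-NULL), using the feature subset that is best
        for identification.  Pass 2: assign each surviving constituent a role
        using all available features."""
        labeled = []
        for node in constituents:
            if id_clf.predict([id_features(node)])[0] != "NULL":
                role = cls_clf.predict([all_features(node)])[0]
                labeled.append((node, role))
        return labeled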
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Chinese </SectionTitle> <Paragraph position="0"> For the Chinese task, we use the one-pass algorithm as used for English. A baseline system was created using the same features as used for English (Section 3). We evaluate this system on just the combined task of argument identification and classification. The baseline performance is shown in Table 3.</Paragraph> <Paragraph position="1"> To improve the system's performance over the baseline, we added all the features discussed in Section 4, except Frame - as the data was labeled in a PropBank fashion, there are no frames involved as in FrameNet; Plurals and Genitives - as they are not realized the same way morphologically in Chinese; and Named Entities - owing to the unavailability of a Chinese named entity tagger. We found that of these features, 2, 3, 4, 6, 7 and 13 hurt the performance when added to the baseline, but the other features helped to some degree, although not significantly. The improved performance on the combined task of identifying and classifying semantic arguments is shown in Table 3. An interesting linguistic phenomenon was observed which explains part of the reason why recall for Chinese argument parsing is so low. In Chinese, arguments which are internal to the NP that encompasses the nominalized predicate tend to be multi-word and are not associated with any node in the parse tree. These violate our basic assumption that arguments align with parse tree constituents, and they are guaranteed to be missed. In the case of English, however, these tend to be single-word arguments, which are represented by a leaf in the parse tree and stand a chance of getting classified correctly.</Paragraph> </Section> </Section> </Paper>