<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3214">
<Title>The Influence of Argument Structure on Semantic Role Assignment</Title>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Experiment 1: Frame-Wise Evaluation of Semantic Role Assignment </SectionTitle>
<Paragraph position="0"> In our first experiment, we perform a detailed (frame-wise) evaluation of semantic role assignment to discover general patterns in the data. Our aim is not to outperform existing models, but to replicate the workings of existing models so that our findings are representative for the task as it is currently addressed. To this end, we (a) use a standard dataset, the FrameNet data, (b) model the task with two different statistical frameworks, and (c) keep our models as generic as possible.</Paragraph>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 Data and experimental setup </SectionTitle>
<Paragraph position="0"> For this experiment, we use 57,758 manually annotated sentences from FrameNet (release 2), corresponding to all the sentences with verbal predicates (2,228 lemmata from 196 frames). Gildea and Jurafsky (2000) and Fleischman et al. (2003) used a previous release of the dataset with fewer annotated instances, but covered all predicates (verbs, nouns and adjectives).</Paragraph>
<Paragraph position="1"> Data preparation. After tagging the data with TnT (Brants, 2000), we parse them using the Collins parsing model 3 (Collins, 1997). We consider only the most probable parse for each sentence and simplify the resulting parse tree by removing all unary nodes. We lemmatise the head of each constituent with TreeTagger (Schmid, 1994).</Paragraph>
<Paragraph position="2"> Gold standard. We transform the FrameNet character-offset annotations for semantic roles into our constituent format by determining the maximal projection for each semantic role, i.e. the set of constituents that exactly covers the extent of the role. A constituent is assigned a role iff it is in the maximal projection of a role.</Paragraph>
<Paragraph position="3"> Classification procedure. The instances to be classified are all parse tree constituents. Since direct assignment of role labels to instances fails due to the preponderance of unlabelled instances, which make up 86.7% of all instances, we follow Gildea and Jurafsky (2000) in splitting the task into two sequential subtasks: first, argument recognition decides for each instance whether it bears a semantic role or not; then, argument labelling assigns a label to the instances recognised as role-bearers. For the second step, we train frame-specific classifiers, since the frame-specificity of roles does not allow us to easily combine training data from different frames.</Paragraph>
<Paragraph position="4"> Statistical modelling. We perform the classification twice, with two learners from different statistical frameworks, in order to make our results more representative of the different statistical models employed so far for the task. The first learner uses the maximum entropy (Maxent) framework, which has been applied e.g. by Fleischman et al. (2003). The model is trained with the estimate software, which implements the LMVM algorithm (Malouf, 2002). The second learner is an instance of a memory-based learning (MBL) algorithm, the k nearest neighbour algorithm. We use the implementation provided by TiMBL (Daelemans et al., 2003) with the recommended parameter settings, adopting the modified value difference metric with gain ratio feature weighting as similarity metric.</Paragraph>
</Section>
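To make the two-step procedure concrete, the following is a minimal Python sketch; the classifier objects, feature extraction and data structures are hypothetical placeholders, not the implementation used in the experiments.

    # Hypothetical sketch of the two-step classification described above.
    # `recogniser`, `labellers` and `extract_features` are assumed placeholders.
    def assign_roles(constituents, frame, recogniser, labellers, extract_features):
        """Step 1: argument recognition; Step 2: frame-specific argument labelling."""
        # Argument recognition: keep only constituents predicted to bear a role.
        role_bearers = [c for c in constituents
                        if recogniser.predict(extract_features(c)) == "ROLE"]
        # Argument labelling: one classifier per frame, since role labels are
        # frame-specific and training data cannot easily be combined across frames.
        labeller = labellers[frame]
        return {c: labeller.predict(extract_features(c)) for c in role_bearers}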
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 Features </SectionTitle>
<Paragraph position="0"> In accordance with our goal of keeping our models generic, we use a set of varied (syntactic and lexical) features which more than one study in the literature has found helpful, without optimising the features for the individual learners.</Paragraph>
<Paragraph position="1"> Constituent features: The first type of feature represents properties of the constituent in question. We use the phrase type and head lemma of each constituent; its preposition (if available); its position relative to the predicate (left, right or overlapping); the phrase type of its mother constituent; whether it is an argument of the target, according to the parser; and the path between target and constituent as well as its length.</Paragraph>
<Paragraph position="2"> Sentence level features: The second type of feature describes the context of the current instance. The predicate is represented by its lemma, its part of speech, its (heuristic) subcategorisation frame, and its governing verb. We also compile a list of all the prepositions in the sentence.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.3 Results </SectionTitle>
<Paragraph position="0"> All results in this section are averages over F scores obtained using 10-fold cross validation. For each frame, we perform two evaluations, one in exact match mode and one in overlap mode. In exact match mode, an assignment only counts as a true positive if it coincides exactly with the gold standard, while in overlap mode it suffices that the two are not disjoint. F scores are then computed in the usual manner.</Paragraph>
<Paragraph position="1"> Table 1 shows the performance of the different configurations over the complete dataset, and the standard deviation of these results over all frames. To illustrate the results for individual frames, Table 2 lists frame-specific performances for five randomly selected frames and how they varied over the cross-validation runs.</Paragraph>
</Section>
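As an illustration of the two evaluation modes, the following Python sketch scores predicted role spans against the gold standard; the span representation and the simplistic treatment of multiple matches are assumptions, not the evaluation code actually used.

    # Illustrative sketch of exact match vs. overlap scoring (assumption:
    # roles are given as {(start, end): label} dicts with half-open offsets).
    def f_score(gold, predicted, mode="exact"):
        def match(p_span, g_span):
            if mode == "exact":
                return p_span == g_span                  # spans coincide exactly
            return p_span[0] < g_span[1] and g_span[0] < p_span[1]  # not disjoint

        tp = sum(1 for p_span, label in predicted.items()
                 if any(match(p_span, g_span) and label == g_label
                        for g_span, g_label in gold.items()))
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        return (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)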
<Section position="5" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.4 Analysis and Discussion </SectionTitle>
<Paragraph position="0"> In terms of overall results, the MBL model outperforms the Maxent model by 3 to 4 points F score. However, all our results lie broadly in the range of existing systems with a similar architecture (i.e. sequential argument identification and labelling), such as the exact match results reported by Gildea and Jurafsky (2002) and Fleischman et al. (2003). We assume that our feature formulation is more suitable for the MBL model. Also, we do not smooth the Maxent model, while we use the recommended optimised parameters for TiMBL.</Paragraph>
<Paragraph position="1"> Our most remarkable finding is the large amount of variance exhibited by the numbers in Table 1. Computed across frames, the standard deviation amounts to 10% to 11%, consistently across evaluation measures and statistical frameworks. Since these figures are the results of a 10-fold cross validation run, it is improbable that the effect is solely due to chance splits into training and test data. This assessment is supported by Table 2, which shows that, while the performance on individual frames can vary considerably (especially for small frames like ROBBERY), the average performance on all frames varies by less than 0.5% over the cross validation runs. The reasons which lead to the across-frames variance warrant investigation, since they may lead to new insights about the nature of the task in question, answering Fleischman et al.'s (2003) call for better models. Some of the plausible variables which might explain the variance are the number of semantic roles per frame, the amount of training data, and the number of verbs per frame.</Paragraph>
<Paragraph position="2"> However, we suggest that a fourth variable might have a more decisive influence. Seen from a linguistic perspective, semantic role assignment is just an application of linking, i.e. learning the regularities of the relationship between semantic roles and their possible syntactic realisations and applying this knowledge. Therefore, our main hypothesis is: the more varied the realisation possibilities of the verbs in a frame, the more difficult it is for the learner to learn the correct linking patterns, and therefore the more error-prone semantic role assignment becomes. Even though this claim appears intuitively true, it has neither been explicitly made nor empirically tested, and its consequences might be relevant for the design of future models of semantic role assignment.</Paragraph>
<Paragraph position="3"> As an example, compare the frame IMPACT, as exemplified by the instances in (1), with the frame INGESTION, which contains predicates such as drink, consume or nibble. While every sentence in (1) shows a different linking pattern, linking for INGESTION is rather straightforward: the subject is usually the Ingestor, and the direct object an Ingestible. This is reflected in the scores: the exact match F score of the MBL model is markedly lower for IMPACT than for INGESTION.</Paragraph>
<Paragraph position="4"> The most straightforward strategy to test for the different variables would be to perform multiple correlation analyses. However, this approach has a serious drawback: the results are hard to interpret when more than one variable is significantly correlated with the data, and this becomes increasingly probable with larger numbers of data points. Instead, we adopt a second strategy, namely to design a new data set in which all variables but one are controlled for, so that correlation can be tested unequivocally. The new experiment is explained in Section 5.
Section 4 describes the quantitative model of argument structure required for the experiment.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Argument Structure and Frame Uniformity </SectionTitle>
<Paragraph position="0"> In this section, we define the concepts we require to test our hypothesis quantitatively. First, we define argument structure for our data in a corpus-driven way. Then, we define the uniformity of a frame according to its variance in argument structure.</Paragraph>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.1 An Empirical Model of Argument Structure </SectionTitle>
<Paragraph position="0"> Work in theoretical linguistics since at least Gruber (1965) and Jackendoff (1972) has attempted to account for the regularities in the syntactic realisation of semantic arguments. Models for role assignment also rely on these regularities, as can be seen from the kind of features used for this task (see Section 3.2), which are either syntactic or lexical. Thus, current models for automatic role labelling rely on the regularities at the syntax-semantics interface. Unlike theoretical work, however, they do not explicitly represent these regularities, but extract statistical properties about them from data.</Paragraph>
<Paragraph position="1"> The model of argument structure we develop in this section retains the central idea of linking theory, namely to model argument structure symbolically, but deviates in two ways from traditional work in order to bridge the gap to statistical approaches: (1) in order to emulate the situation of the learners, we use only the data available from the FrameNet corpus; this excludes e.g. the use of more detailed lexical information about the predicates. (2) To be able to characterise not only the possibility, but also the probability of linking patterns, we take frequency information into account.</Paragraph>
<Paragraph position="2"> Our definition proceeds in three steps. First, we define the concept of a pattern, then we define the argument structure of a predicate, and finally the argument structure of a frame.</Paragraph>
<Paragraph position="3"> Patterns. A pattern encodes the argument structure information present in one annotated corpus sentence. It is an unordered set of pairs of semantic role and syntactic function, corresponding to all roles occurring in the sentence and their realisations. The syntactic functions used in the FrameNet corpus are as follows (see Johnson et al. (2002) for details): COMP (complement), EXT (subject in a broad sense, which includes controlling subjects), OBJ (object), MOD (modifier), GEN (genitive modifier, as 'John' in John's hat). For example, Sentence (1-a) gives rise to the pattern {(Impactee, EXT), (Impactor, COMP)}, which states that the Impactee is realised as subject and the Impactor as complement.</Paragraph>
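A pattern can be represented directly as an unordered set of (role, function) pairs. The following minimal Python sketch assumes a hypothetical input format in which the role annotations of a sentence are already available as such pairs.

    # Minimal sketch: a pattern is an unordered set of (semantic role,
    # syntactic function) pairs for one annotated sentence (input format assumed).
    def extract_pattern(role_annotations):
        # e.g. [("Impactee", "EXT"), ("Impactor", "COMP")] for sentence (1-a)
        return frozenset(role_annotations)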
<Paragraph position="4"> Argument Structure for Predicates and Frames. For each verb, we collect the set of all patterns in the annotated sentences. The argument structure of a verb is then a vector v whose dimensionality is the number of patterns found for the frame. Each cell v_i is filled with the frequency with which pattern i occurs for the predicate, so that the vector mirrors the distribution of the occurrences of the verb over the possible patterns. Finally, the set of all vectors for the predicates in a frame is a model for the argument structure of the frame.</Paragraph>
<Paragraph position="5"> The intuition behind this formalisation is that two verbs which realise their arguments alike will show a similar distribution of patterns; conversely, if they differ in their linking, these differences will be mirrored in different pattern distributions.</Paragraph>
<Paragraph position="6"> Example. If we only had the three sentences in (1) for the IMPACT corpus, the three occurring patterns would be {(Impactee, EXT), (Impactor, COMP)}, {(Impactor, EXT), (Result, COMP)}, and {(Impactors, EXT), (Place, MOD)}. The argument structure of the frame would then be the set of vectors {(1, 0, 0), (0, 1, 0), (0, 0, 1)}, containing the information for the predicates strike, slam and collide, respectively. The variation arises from differences in syntactic construction (e.g. passive vs. active), but also, more significantly, from lexical differences: collide accepts a reciprocal plural subject, i.e. an Impactors role, while strike does not. This model is very simple, but achieves the goal of highlighting the differences and similarities in the mapping between semantics and syntax for different verbs in a frame.</Paragraph>
</Section>
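A small Python sketch of the frame-level model, under the assumption that the annotated sentences of each verb have already been reduced to patterns as above; the function name and data format are hypothetical.

    from collections import Counter

    # Sketch: build one pattern-frequency vector per verb of a frame.
    def frame_argument_structure(patterns_by_verb):
        """patterns_by_verb: dict mapping each verb of the frame to the list of
        patterns (frozensets of (role, function) pairs) found in its sentences."""
        # Vector dimensionality = number of distinct patterns found for the frame.
        all_patterns = sorted({p for pats in patterns_by_verb.values() for p in pats},
                              key=str)
        vectors = {}
        for verb, pats in patterns_by_verb.items():
            counts = Counter(pats)
            vectors[verb] = [counts[p] for p in all_patterns]
        return vectors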
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.2 Uniformity of Argument Structure </SectionTitle>
<Paragraph position="0"> At this point, we can define a measure to compute the uniformity of a frame from the frame's argument structure, which is defined as a set of integer-valued vectors.</Paragraph>
<Paragraph position="1"> Similarity metrics developed for vector space models are obvious candidates, but work in this area has concentrated on metrics for comparing two vectors, whereas we may have an arbitrary number of predicates per frame. Therefore, we borrow the concept of a cost function from clustering, as exemplified by the well-known sum-of-squares function used in the k-means algorithm (see e.g. Kaufman and Rousseeuw (1990)), which estimates the &quot;cost&quot; of a cluster as the sum of squared distances d between each vector v_i and the cluster centroid c, the point &quot;found by averaging the measurement values along each dimension&quot; (Kaufman and Rousseeuw, 1990, p. 112), i.e. the point situated at the &quot;center&quot; of the cluster: cost = Σ_i d(v_i, c)^2. Under this view, a good cluster is one with a low cost, and the goal of the clustering algorithm is to minimise the average distance to the centroid.</Paragraph>
<Paragraph position="2"> However, for our purposes it is more convenient for a good cluster to have a high rating. Therefore, we turn the cost function into a &quot;quality&quot; function. By replacing the distance function d with a similarity function sim, we say that a good cluster is one with a high overall similarity to the centroid: Q = Σ_i sim(v_i, c). If we consider each frame to be a cluster and each predicate to be an object in the cluster, represented by its argument structure vector, the values of Q can be interpreted as a measure of frame uniformity: verbs with a similar argument structure will have similar vectors, resulting in high values of Q for the frame, and vice versa.</Paragraph>
<Paragraph position="3"> What intuitively validates this formalisation is that frames are clusters of predicates grouped together on semantic grounds, i.e. predicates in a frame share a common set of arguments. What Q checks is whether the mapping from semantics to syntax is also similar.</Paragraph>
<Paragraph position="4"> In order to obtain an actual measure for frame uniformity, we take two further steps. First, we instantiate sim with the cosine similarity, which has been found to be appropriate for a wide range of linguistic tasks (see e.g. Lee (1999)) and ranges between 0 (least similar) and 1 (identity): cos(x, y) = x · y / (|x| |y|). Second, we normalise the values of Q, which grow in [0, n], where n is the number of vectors, to [0, 1], to make them interpretable analogously to values of the cosine similarity. Since this is possible in two different ways, we obtain two different measures for frame uniformity. The first one, which we call normalised quality-based uniformity (nQU), divides Q by the number of predicates in the frame, i.e. it is the plain average of the similarities to the centroid: nQU = (1/n) Σ_i cos(v_i, c). The second measure, weighted quality-based uniformity (wQU), is a weighted average of the similarities, where the weights are given by the vector sizes, in our case the frequencies of the predicates: wQU = Σ_i f_i · cos(v_i, c) / Σ_i f_i, where f_i is the frequency of predicate i. The weighting lends more importance to well-attested predicates, limiting the amount of noise introduced by infrequent predicates. Therefore, our intuition is that wQU should be a better measure than nQU for argument structure uniformity.</Paragraph>
</Section>
</Section>
</Paper>
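The two uniformity measures can be sketched in a few lines of Python; the centroid, the cosine similarity and the two normalisations follow the reconstruction given above, so the exact formulas remain an assumption rather than the authors' verified definitions.

    import math

    def cosine(x, y):
        # Cosine similarity, ranging from 0 (least similar) to 1 (identity)
        # for the non-negative frequency vectors used here.
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        return dot / norm if norm else 0.0

    def uniformity(vectors):
        """vectors: list of pattern-frequency vectors, one per predicate of a frame.
        Returns (nQU, wQU) as reconstructed above (an assumption)."""
        n = len(vectors)
        dims = len(vectors[0])
        centroid = [sum(v[i] for v in vectors) / n for i in range(dims)]
        sims = [cosine(v, centroid) for v in vectors]
        weights = [sum(v) for v in vectors]   # vector size = predicate frequency
        nqu = sum(sims) / n                   # plain average of similarities
        wqu = sum(w * s for w, s in zip(weights, sims)) / sum(weights)  # weighted
        return nqu, wqu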