<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2104"> <Title>A Comparison of Alternative Parse Tree Paths for Labeling Semantic Roles</Title> <Section position="4" start_page="811" end_page="813" type="metho"> <SectionTitle> 2 Alternative Parse Tree Paths </SectionTitle> <Paragraph position="0"> Parse tree paths were introduced by Gildea & Jurafsky (2002) as descriptive features of the syntactic relationship between predicates and arguments in the parse tree of a sentence. Predicates are typically assumed to be specific target words (usually verbs), and arguments are assumed to be a span of words in the sentence that is governed by a single node in the parse tree. A parse tree path can be described as a sequence of transitions up and down a parse tree from the target word to the governing node, as exemplified in Figure 1.</Paragraph> <Paragraph position="1"> The encoding of the parse tree path feature depends on the syntactic representation produced by the parser. This, in turn, depends on the training corpus used to build the parser and on the conditioning factors in its probability model. As a result, encodings of parse tree paths can vary greatly depending on the parser that is used, yielding parse tree paths that vary in their ability to generalize across sentences.</Paragraph> <Paragraph position="2"> In this paper we explore the characteristics of parse tree paths with respect to different approaches to automated parsing. We were particularly interested in comparing traditional constituency parsing (as exemplified in Figure 1) with dependency parsing, specifically the Minipar system built by Lin (1998). Minipar is increasingly being used in semantics-based NLP applications (e.g. Pantel & Lin, 2002). Dependency parse trees differ from constituency parses in that they represent sentence structures as a set of dependency relationships between words: typed, asymmetric binary relationships between head words and modifying words. Figure 2 depicts the output of Minipar on an example sentence, where each node is a word or an empty node along with the word lemma, its part of speech, and the relationship type to its governing node.</Paragraph> <Paragraph position="3"> Our motivation for exploring the use of Minipar for the creation of parse tree paths can be seen by comparing Figure 1 (a constituency parse tree with a parse tree path from the predicate ate to the argument He) and Figure 2.</Paragraph> <Paragraph position="4"> The Minipar path is both shorter and simpler for the same predicate-argument relationship, and could be encoded in various ways that take advantage of the additional semantic and lexical information that is provided.</Paragraph> <Paragraph position="5"> To compare traditional constituency parsing with dependency parsing, we evaluated the accuracy of argument labeling using parse tree paths generated by two leading constituency parsers and three variations of parse tree paths generated by Minipar, as follows: Charniak: We used the Charniak parser (2000) to extract parse tree paths similar to those found in Palmer et al. (2005), with some slight modifications. In cases where the last node in the path was a non-branching pre-terminal, we added the lexical information to the path node. In addition, our paths led to the lowest governing node, rather than the highest.
For example, the parse tree path for the argument in Figure 1 would be encoded as: VB|VP|S|NP|PRP:he Stanford: We also used the Stanford parser developed by Klein & Manning (2003), with the same path encoding as the Charniak parser.</Paragraph> <Paragraph position="6"> Minipar A: We used three variations of parse tree path encodings based on Lin's dependency parser, Minipar (1998). Minipar A is the first and most restrictive path encoding, where each path is annotated with the entire information output by Minipar at each node. A typical path might be: ate:eat,V,i|He:he,N,s Minipar B: A second parse tree path encoding was generated from Minipar parses that relaxes some of the constraints used in Minipar A. Instead of using all the information contained at a node, in Minipar B we encode a path only with its part of speech and relational information. For example: V,i|N,s Minipar C: As the converse to Minipar A, we also tried one other Minipar encoding. As in Minipar A, we annotated the path with all the information output, but instead of doing a direct string comparison during our search, we considered two paths to match when there was a match between either the word, the stem, the part of speech, or the relation. For example, the following two parse tree paths would be considered a match, as both include the relation i.</Paragraph> <Paragraph position="7"> ate:eat,V,i|He:he,N,s was:be,VBE,i|He:he,N,s We explored other combinations of dependency relation information for Minipar-derived parse tree paths, including the use of the deep relations. However, results obtained using these other combinations were not notably different from those of the three base cases listed above, and are not included in the evaluation results reported in this paper.</Paragraph> 3 Aligning arguments to parse tree nodes in a training / testing corpus <Paragraph position="8"> We began our investigation by creating a training and testing corpus of 400 sentences, each containing an inflection of one of four target verbs (100 each), namely believe, think, give, and receive. These sentences were selected at random from the 1994-07 section of the New York Times Gigaword corpus from the Linguistic Data Consortium. These four verbs were chosen because of the synonymy between the first two, the reflexivity of the second two, and because all four have straightforward argument structures when viewed as predicates, as follows: predicate: believe arg0: the believer arg1: the thing that is believed predicate: think arg0: the thinker arg1: the thing that is thought predicate: give arg0: the giver arg1: the thing that is given arg2: the receiver predicate: receive arg0: the receiver arg1: the thing that is received arg2: the giver This corpus of sentences was then annotated with semantic role information by the authors of this paper. All annotations were made by assigning start and stop locations for each argument in the unparsed text of the sentence. After an initial pilot annotation study, the following annotation policy was adopted to overcome common disagreements: (1) When the argument is a noun and it is part of a definite description, include the entire definite description. (2) Do not include complementizers, such as 'that' in 'believe that', in an argument. (3) Do include prepositions, such as 'in' in 'believe in'. (4) When in doubt, assume phrases attach locally. Using this policy, an agreement of 92.8% was achieved among annotators for the set of start and stop locations for arguments.
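
The exact agreement measure is not spelled out here; as a minimal illustration, agreement over start and stop locations could be computed as the proportion of argument instances for which two annotators assign identical character offsets, as in the following Python sketch (the keying of arguments by sentence, predicate, and role is an assumption of the sketch, not a detail given in the paper):

# Illustrative only: assumes agreement is the fraction of annotated arguments
# for which two annotators chose identical (start, stop) character offsets.
def span_agreement(annotator_a, annotator_b):
    """annotator_a, annotator_b: dicts mapping a hypothetical argument key
    such as (sentence_id, predicate, role) to a (start, stop) pair."""
    shared = set(annotator_a) & set(annotator_b)
    if not shared:
        return 0.0
    matches = sum(1 for key in shared if annotator_a[key] == annotator_b[key])
    return matches / len(shared)

# Example: the annotators agree on one of two argument spans -> 0.5
a = {("s1", "give", "arg0"): (0, 8), ("s1", "give", "arg1"): (14, 27)}
b = {("s1", "give", "arg0"): (0, 8), ("s1", "give", "arg1"): (14, 30)}
print(span_agreement(a, b))
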
Examples of semantic role annotations in our corpus for each of the four predicates are as follows. The next step was to parse the corpus of 400 sentences using each of three automated parsing systems (Charniak, Stanford, and Minipar), and align each of the annotated arguments with its closest matching branch in the resulting parse trees. Given the differences in the parsing models used by these three systems, each yields parse tree nodes that govern different spans of text in the sentence. Often there exists no parse tree node that governs a span of text that exactly matches the span of an argument in the annotated corpus. Accordingly, it was necessary to identify the closest match possible for each of the three parsing systems in order to encode parse tree paths for each. We developed a uniform policy that would facilitate a fair comparison between parsing techniques. Our approach was to identify a single node in a given parse tree that governed a string of text with the most overlap with the text of the annotated argument. Each of the parsing methods tokenizes the input string differently, so in order to simplify the selection of the governing node with the most overlap, we made this selection based on lowest minimum edit distance (Levenshtein distance).</Paragraph> <Paragraph position="9"> All three of these different parsing algorithms produced single governing nodes that overlapped well with the human-annotated corpus. However, it appeared that the two constituency parsers produced governing nodes that were more closely aligned, based on minimum edit distance. The Charniak parser aligned best with the annotated text, with an average of 2.40 characters for the lowest minimum edit distance (standard deviation = 8.64). The Stanford parser performed slightly worse (average = 2.67, standard deviation = 8.86), while distances were nearly two times larger for Minipar (average = 4.73, standard deviation = 10.4).</Paragraph> <Paragraph position="10"> In each case, the most overlapping parse tree node was treated as correct for training and testing purposes.</Paragraph> </Section> <Section position="5" start_page="813" end_page="815" type="metho"> <SectionTitle> 4 Comparative Performance Evaluation </SectionTitle> <Paragraph position="0"> In order to evaluate the comparative performance of the parse tree paths for each of the five encodings, we divided the corpus into equal-sized training and test sets (50 training and 50 test examples for each of the four predicates). We then constructed a system that identified the parse tree paths for each of the 10 arguments in the training sets, and applied them to the sentences in each corresponding test set. When applying the 50 training parse tree paths to any one of the 50 test sentences for a given predicate-argument pair, a set of zero or more candidate answer nodes was returned. For the purpose of calculating precision and recall scores, credit was given when the correct answer appeared in this set. Precision scores were calculated as the number of correct answers found divided by the number of all candidate answer nodes returned. Recall scores were calculated as the number of correct answers found divided by the total number of correct answers possible. F-scores were calculated as the equally-weighted harmonic mean of precision and recall.</Paragraph> <Paragraph position="1"> Our calculation of recall scores represents the best-possible performance of systems using only these types of parse tree paths.
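
For concreteness, the precision, recall, and F-score calculation just described might be sketched as follows (illustrative Python; representing each test item by its correct node and a precomputed set of candidate nodes returned by the matched training paths is an assumption of the sketch, and the node names are hypothetical):

# Sketch of the precision / recall / F-score calculation described above.
# Each test item carries the correct argument node and the set of candidate
# nodes returned by applying the training parse tree paths to its parse
# (exact string match for Charniak, Stanford, Minipar A and B; per-node
# word/stem/POS/relation match for Minipar C).
def score(test_items):
    """test_items: list of (correct_node, candidate_nodes) pairs."""
    found = sum(1 for correct, cands in test_items if correct in cands)
    returned = sum(len(cands) for _, cands in test_items)
    possible = len(test_items)  # one correct answer per test item
    precision = found / returned if returned else 0.0
    recall = found / possible if possible else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy example: two test items; only the first candidate set holds the answer.
items = [("NP-3", {"NP-3", "NP-7"}), ("NP-1", {"NP-4"})]
print(score(items))  # -> (0.333..., 0.5, 0.4)
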
This best-case level of performance could be obtained if a system could always select the correct answer from the set of candidates returned. However, it is also informative to estimate the performance that could be achieved by randomly selecting among the candidate answers, representing a lower bound on performance. Accordingly, we computed an adjusted recall score that awarded only fractional credit in cases where more than one candidate answer was returned (one divided by the set size). Adjusted recall is the sum of all of these adjusted credits divided by the total number of correct answers possible.</Paragraph> <Paragraph position="2"> Figure 3 summarizes the comparative recall, precision, f-score, and adjusted recall performance for each of the five parse tree path formulations. The Charniak parser achieved the highest overall scores (precision=.49, recall=.68, f-score=.57, adjusted recall=.48), followed closely by the Stanford parser (precision=.47, recall=.67, f-score=.55, adjusted recall=.48).</Paragraph> <Paragraph position="3"> Our expectation was that the short, semantically descriptive parse tree paths produced by Minipar would yield the highest performance.</Paragraph> <Paragraph position="4"> However, these results indicate the opposite; the constituency parsers produce the most accurate parse tree paths. Only Minipar C offers better recall (0.71) than the constituency parsers, but at the expense of extremely low precision. Minipar A offers excellent precision (0.62), but with extremely low recall. Minipar B provides a balance between recall and precision performance, but falls short of being competitive with the parse tree paths generated by the two constituency parsers, with an f-score of .44.</Paragraph> <Paragraph position="5"> We utilized the Sign Test in order to determine the statistical significance of these differences. Rank orderings between pairs of systems were determined based on the adjusted credit that each system achieved for each test sentence. Significant differences were found between the performance of every system (p<0.05), with the exception of the Charniak and Stanford parsers.</Paragraph> <Paragraph position="6"> Interestingly, by comparing weighted values for each test example, Minipar C more frequently scores higher than Minipar A, even though the sum of these scores favors Minipar A.</Paragraph> <Paragraph position="7"> In addition to overall performance, we were interested in determining whether performance varied depending on the type of the argument being labeled. In assigning labels to arguments in the corpus, we followed the general principles set out by Palmer et al. (2005) for labeling arguments arg0, arg1, and arg2. Across each of our four predicates, arg0 is the agent of the predication (e.g. the person that has the belief or is doing the giving), and arg1 is the thing that is acted upon by the agent (e.g. the thing that is believed or the thing that is given). Arg2 is used only for the predications based on the verbs give and receive, where it is used to indicate the other party of the action.</Paragraph> <Paragraph position="8"> Our interest was in determining whether these five approaches yielded different results depending on the semantic type of the argument. Figure 4 presents the f-scores for each of these encodings across each argument type.</Paragraph> <Paragraph position="9"> Results indicate that the Charniak and Stanford parsers continue to produce parse tree paths that outperform each of the Minipar-based approaches.
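
As a side note on the significance testing described above, the pairwise comparison could be realized roughly as follows (an illustrative sketch; the exact sign test implementation is not described here, and the per-sentence credit values and node names below are hypothetical):

# Sketch of a pairwise sign test over per-sentence adjusted credits, one way
# to realize the significance testing described above. Ties are dropped, and
# a two-sided binomial p-value is computed with success probability 0.5.
from math import comb

def adjusted_credit(correct_node, candidate_nodes):
    # Fractional credit: 1/|candidates| if the correct node is among them.
    return 1.0 / len(candidate_nodes) if correct_node in candidate_nodes else 0.0

def sign_test(credits_a, credits_b):
    """credits_a, credits_b: per-test-sentence adjusted credits for two systems."""
    wins_a = sum(1 for a, b in zip(credits_a, credits_b) if a > b)
    wins_b = sum(1 for a, b in zip(credits_a, credits_b) if b > a)
    n = wins_a + wins_b  # ties are excluded
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-sided p-value

# Toy example: system A earns higher adjusted credit than system B on 8 of
# 10 test sentences (no ties); two-sided p is about 0.109.
print(sign_test([1.0] * 8 + [0.0] * 2, [0.0] * 8 + [1.0] * 2))
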
In each approach, argument 0 is the easiest to identify. Minipar A retains the general trend, with argument 1 easier to identify than argument 2, while Minipar B and C show the reverse. The highest f-scores for argument 0 were achieved by Stanford (f=.65), while Charniak achieved the highest scores for argument 1 (f=.55) and argument 2 (f=.49).</Paragraph> </Section> <Section position="6" start_page="815" end_page="815" type="metho"> <SectionTitle> 5 Learning Curve Comparisons </SectionTitle> <Paragraph position="0"> The creation of large-scale text corpora with syntactic and/or semantic annotations is difficult, expensive, and time-consuming. The PropBank effort has shown that producing this type of corpus is considerably easier once syntactic analysis has been done, but substantial effort and resources are still required. Better estimates of total costs could be made if it were known exactly how many annotations are necessary to achieve acceptable levels of performance. Accordingly, we investigated the learning curves of precision, recall, f-score, and adjusted recall achieved using the five different parse tree path encodings.</Paragraph> <Paragraph position="1"> For each encoding approach, learning curves were created by applying successively larger subsets of the training parse tree paths to each of the items in the corresponding test set. Precision, recall, f-scores, and adjusted recall were computed as described in the previous section, and identical subsets of sentences were used across parsers, in one-sentence increments. Individual learning curves for each of the five approaches are given in Figures 5, 6, 7, 8, and 9. Figure 10 presents a comparison of the f-score learning curves for all five of the approaches.</Paragraph> <Paragraph position="2"> In each approach, the precision scores slowly degrade as more training examples are provided, due to the addition of new parse tree paths that yield additional candidate answers. Conversely, the recall scores of each system show their greatest gains early, and then slowly improve with the addition of more parse tree paths. In each approach, the recall scores (estimating best-case performance) have the same general shape as the adjusted recall scores (estimating the lower-bound performance). The divergence between these two scores increases with the addition of more training examples, and is more pronounced in systems employing parse tree paths with less specific node information. The comparative f-score curves presented in Figure 10 indicate that Minipar B is competitive with Charniak and Stanford when only a small number of training examples is available. There is some evidence here that the performance of Minipar A would continue to improve with the addition of more training data, suggesting that this approach might be well-suited for applications where large amounts of training data are available.</Paragraph> </Section> </Paper>