<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2018"> <Title>Using Machine-Learning to Assign Function Labels to Parser Output for Spanish</Title> <Section position="5" start_page="136" end_page="137" type="metho"> <SectionTitle> 3 Previous Work </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="136" end_page="137" type="sub_section"> <SectionTitle> 3.1 LFG Annotation </SectionTitle> <Paragraph position="0"> A methodology for automatically obtaining LFG f-structures from trees output by probabilistic parsers trained on the Penn-II treebank has been described by Cahill et al. (2004). It has been shown that the methods can be ported to other languages and treebanks (Burke et al., 2004; Cahill et al., 2003), including Cast3LB (O'Donovan et al., 2005).</Paragraph> <Paragraph position="1"> Some properties of Spanish and the encoding of syntactic information in the Cast3LB treebank make it non-trivial to apply the method of automatically mapping c-structures to f-structures used by Cahill et al. (2004), which assigns grammatical functions to tree nodes based on their phrasal category, the category of the mother node and their position relative to the local head.</Paragraph> <Paragraph position="2"> In Spanish, the order of sentence constituents is flexible and their position relative to the head is an imperfect predictor of grammatical function. Also, much of the information that the Penn-II Treebank encodes in terms of tree configurations is encoded in Cast3LB in the form of function tags. As Cast3LB trees lack a VP node, the configurational information normally used in English to distinguish Subjects (NP which is left sister to VP) from Direct Objects (NP which is right sister to V) is not available in Cast3LB-style trees. This means that assigning correct LFG functional annotations to nodes in Cast3LB trees is rather difficult without use of Cast3LB function tags, and those tags are typically absent in output generated by probabilistic parsers.</Paragraph> <Paragraph position="3"> In order to solve this difficulty, O'Donovan et al. (2005) train Bikel's parser to output complex category-function labels. A complex label such as sn-SUJ (an NP node tagged with the Subject grammatical function) is treated as an atomic category in the training data, and is output in the trees produced by the parser. This baseline process is represented in Figure 2.</Paragraph> <Paragraph position="4"> This approach can be problematic for two main reasons. Firstly, by treating complex labels as atomic categories the number of unique labels increases and parse quality can deteriorate due to sparse data problems. Secondly, this approach, by relying on the parser to assign function tags, offers limited control over, or room for improvement in, this task.</Paragraph> </Section> <Section position="2" start_page="137" end_page="137" type="sub_section"> <SectionTitle> 3.2 Adding Function Tags to Parser Output </SectionTitle> <Paragraph position="0"> The solution we adopt instead is to add Cast3LB functional tags to simple constituent trees output by the parser, as a postprocessing step. For English, such approaches have been shown to give good results for the output of parsers trained on the Penn-II Treebank.</Paragraph> <Paragraph position="1"> Blaheta and Charniak (2000) use a probabilistic model with feature dependencies encoded by means of feature trees to add Penn-II Treebank function tags to Charniak's parser output. 
They report an f-score of 88.472% on original treebank trees and 87.277% on the correctly parsed subset of tree nodes.</Paragraph> <Paragraph position="2"> Jijkoun and de Rijke (2004) describe a method of enriching the output of a parser with information that is included in the original Penn-II trees, such as function tags, empty nodes and coindexations.</Paragraph> <Paragraph position="3"> They first transform Penn trees to a dependency format and then use memory-based learning to perform various graph transformations. One of the transformations is node relabelling, which adds function tags to parser output. They report an f-score of 88.5% for the task of function tagging on correctly parsed constituents.</Paragraph> </Section> </Section> <Section position="6" start_page="137" end_page="139" type="metho"> <SectionTitle> 4 Assigning Cast3LB Function Tags to Parser Output </SectionTitle> <Paragraph position="0"> The complete processing architecture of our approach is depicted in Figure 3. We describe it in detail in this and the following sections.</Paragraph> <Paragraph position="1"> We divided the Spanish treebank into a training set of 80%, a development set of 10%, and a test set of 10% of all trees. We randomly assigned treebank files to these sets to ensure that the different textual genres are roughly equally represented among the training, development and test trees.</Paragraph> <Section position="1" start_page="137" end_page="138" type="sub_section"> <SectionTitle> 4.1 Constituency Parsing </SectionTitle> <Paragraph position="0"> For constituency parsing we use Bikel's (2002) parser, for which we developed a Spanish language package adapted to the Cast3LB data. Prior to parsing, we perform one of the tree transformations described by Cowan and Collins (2005), i.e. we add CP and SBAR nodes to subordinate and relative clauses. This transformation is undone in the parser output.</Paragraph> <Paragraph position="1"> The category labels in the Spanish treebank are rather fine-grained and often contain redundant information.1 We preprocess the treebank and reduce the number of category labels, only retaining distinctions that we deem useful for our purposes.2</Paragraph> <Paragraph position="2"> For constituency parsing we also reduce the number of POS tags by including only selected morphological features. Table 2 provides the list of features included for the different parts of speech.3 In our experiments we use gold-standard POS-tagged development and test-set sentences as input rather than tagging the text automatically.</Paragraph> <Paragraph position="3"> The results of the evaluation of parsing performance on the test set are shown in Table 3. The labelled bracketing f-score is just below 84% for all sentences, and 84.58% for sentences of length ≤ 70. In comparison, Cowan and Collins (2005) report an f-score of 85.1% (≤ 70) using a version of Collins' parser adapted for Cast3LB, and using reranking to boost performance. They use a different, more reduced category label set as well as a different training-test split. Both Cowan and Collins and the present paper report scores which ignore punctuation.</Paragraph> <Paragraph position="4"> 1For example, grup.nom occurs in variants such as grup.nom.ms (masculine singular), grup.nom.fs (feminine singular), grup.nom.mp (masculine plural) etc. This number and gender information is already encoded in the POS tags of the nouns heading these constituents.</Paragraph> <Paragraph position="5"> 2The labels we retain are the following: INC, S, S.NF, S.NF.R, S.NF, S.R, conj.subord, coord, data, espec, gerundi, grup.nom, gv, infinitiu, interjeccio, morf, neg, numero, prep, relatiu, s.a, sa, sadv, sn, sp, and versions of those suffixed with .co to indicate coordination.</Paragraph> <Paragraph position="6"> 3Subcategory refers to subcategories of parts of speech, e.g. common and proper for nouns, or main, auxiliary and semi-auxiliary for verbs. For details see Civit (2000).</Paragraph>
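To make the label reduction concrete, the following is a minimal sketch that maps a fine-grained Cast3LB constituent label onto the retained label set listed in footnote 2 above. The longest-prefix matching rule and the handling of the .co coordination suffix are assumptions for illustration; they are not the authors' actual preprocessing script.

```python
# Illustrative sketch of the category-label reduction step
# (assumed longest-prefix matching; the real preprocessing may differ).
RETAINED = {
    "INC", "S", "S.NF", "S.NF.R", "S.R", "conj.subord", "coord", "data",
    "espec", "gerundi", "grup.nom", "gv", "infinitiu", "interjeccio",
    "morf", "neg", "numero", "prep", "relatiu", "s.a", "sa", "sadv",
    "sn", "sp",
}

def reduce_label(label: str) -> str:
    """Map a fine-grained Cast3LB label to a retained coarse label."""
    # Coordinated versions (suffix .co) of retained labels are also kept.
    coord = label.endswith(".co")
    base = label[:-3] if coord else label
    # Strip trailing morphological suffixes (e.g. grup.nom.ms -> grup.nom)
    # until the label matches one of the retained categories.
    while base not in RETAINED and "." in base:
        base = base.rsplit(".", 1)[0]
    reduced = base if base in RETAINED else label
    return reduced + ".co" if coord else reduced

assert reduce_label("grup.nom.ms") == "grup.nom"
assert reduce_label("sn.co") == "sn.co"
```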
</Section> <Section position="2" start_page="138" end_page="139" type="sub_section"> <SectionTitle> 4.2 Cast3LB Function Tagging </SectionTitle> <Paragraph position="0"> For the task of Cast3LB function tag assignment we experimented with three generic machine-learning algorithms: a memory-based learner (Daelemans and van den Bosch, 2005), a maximum entropy classifier (Berger et al., 1996) and a Support Vector Machine classifier (Vapnik, 1998).</Paragraph> <Paragraph position="1"> For each algorithm we use the same set of features to represent the nodes that are to be assigned one of the Cast3LB function tags. We use a special null tag for nodes where no Cast3LB tag is present.</Paragraph> <Paragraph position="2"> In Cast3LB only nodes in certain contexts are eligible for function tags. For this reason we only consider a subset of all nodes as candidates for function tag assignment, namely those which are sisters of nodes with the category labels gv (Verb Group), infinitiu (Infinitive) and gerundi (Gerund).</Paragraph> <Paragraph position="3"> For these candidates we extract the following three types of features, encoding configurational, morphological and lexical information for the target node and neighboring context nodes: * Node features: position relative to head, head lemma, alternative head lemma (i.e. the head of the NP in a PP), head POS, category, definiteness, agreement with the head verb, yield, human/non-human. * Local features: head verb, verb person, verb number, parent category. * Context features: node features (except position) of the two previous and two following sister nodes (if present).</Paragraph> <Paragraph position="4"> We used cross-validation for refining the set of features and for tuning the parameters of the machine-learning algorithms. We did not use any additional automated feature-selection procedure. We made use of the following implementations: TiMBL (Daelemans et al., 2004) for Memory-Based Learning, the MaxEnt Toolkit (Le, 2004) for Maximum Entropy and LIBSVM (Chang and Lin, 2001) for Support Vector Machines. For TiMBL we used k nearest neighbors = 7 and the gain ratio metric for feature weighting. For MaxEnt, we used L-BFGS parameter estimation and 110 iterations, and we regularized the model using a Gaussian prior with σ² = 1. For SVM we used the RBF kernel with γ = 2⁻⁷ and the cost parameter C = 32.</Paragraph>
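A minimal sketch of this classification setup is given below: categorical node features are one-hot encoded and passed to an RBF-kernel SVM with the hyperparameters quoted above. It uses scikit-learn (whose SVC is backed by LIBSVM) rather than the LIBSVM command-line tools the authors used, and the feature names and values are invented examples, so it illustrates the setup rather than reproducing the actual pipeline.

```python
# Sketch: one-hot encode categorical node features and train an RBF SVM
# with the hyperparameters reported above (gamma = 2**-7, C = 32).
# Feature names and values below are invented for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

train_feats = [
    {"position": "pre", "head_lemma": "sistema", "head_pos": "ncms",
     "category": "sn", "definite": "yes", "parent_cat": "S"},
    {"position": "post", "head_lemma": "ordenador", "head_pos": "ncmp",
     "category": "sn", "definite": "yes", "parent_cat": "S"},
]
train_tags = ["SUJ", "CD"]  # Cast3LB function tags (or the special null tag)

vec = DictVectorizer(sparse=True)
X_train = vec.fit_transform(train_feats)

clf = SVC(kernel="rbf", gamma=2 ** -7, C=32)
clf.fit(X_train, train_tags)

test_feats = [{"position": "post", "head_lemma": "sistema",
               "head_pos": "ncms", "category": "sn",
               "definite": "yes", "parent_cat": "S"}]
print(clf.predict(vec.transform(test_feats)))
```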
</Section> </Section> <Section position="7" start_page="139" end_page="140" type="metho"> <SectionTitle> 5 Cast3LB Tag Assignment Evaluation </SectionTitle> <Paragraph position="0"> We present evaluation results on the original gold-standard trees of the test set as well as on the test-set sentences parsed by Bikel's parser. For the evaluation of Cast3LB function tagging performance on gold trees the most straightforward metric is accuracy, i.e. the proportion of all candidate nodes that were assigned the correct label.</Paragraph> <Paragraph position="1"> However, we cannot use this metric for evaluating results on parser output. The trees output by the parser are not identical to the gold-standard trees due to parsing errors, and the set of candidate nodes extracted from parsed trees will not be the same as for gold trees. For this reason we use an alternative metric which is independent of tree configuration and uses only the Cast3LB function labels and the positional indices of tokens in a sentence. For each function-tagged tree we first remove the punctuation tokens. Then we extract a set of tuples of the form <GF,i,j>, where GF is the Cast3LB function tag and i and j delimit the range of tokens spanned by the node annotated with this function. We use the standard measures of precision, recall and f-score to evaluate the results.</Paragraph> <Paragraph position="2"> Results for the three algorithms are shown in Table 4, with SVM performing best on gold-standard trees, scoring 89.34% accuracy and 86.87% f-score. The learning curves for the three algorithms, shown in Figure 4, are also informative, with SVM outperforming the other two methods for all training set sizes. In particular, the last section of the plot shows SVM performing almost as well as MBL with half as much learning material.</Paragraph> <Paragraph position="5"> None of the three curves shows signs of having reached a maximum, which indicates that increasing the size of the training data should result in further improvements in performance.</Paragraph> <Paragraph position="6"> Table 5 shows the performance of the three methods on parser output. The baseline contains the results achieved by treating compound category-function labels as atomic during parser training so that they are included in parser output. For this task we present two sets of results: (i) for all constituents, and (ii) for correctly parsed constituents only. Again the best algorithm turns out to be SVM. It outperforms the baseline by a large margin (6.74% for all constituents).</Paragraph> <Paragraph position="7"> The difference in performance between gold-standard trees and the correctly parsed constituents in parser output is rather larger than what Blaheta and Charniak report. Further analysis is needed to identify the source of this difference, but we suspect that one contributing factor is our use of a greater number of context features combined with a higher parse error rate in comparison to their experiments on the Penn-II Treebank. Since any misanalysis of constituency structure in the vicinity of the target node can have a negative impact, greater reliance on context means greater susceptibility to parse errors. Another factor to consider is the fact that we trained and adjusted parameters on gold-standard trees, and the model learned may rely on features of those trees that the parser is unable to reproduce.</Paragraph>
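The configuration-independent metric described above can be sketched as follows, assuming the gold and predicted annotations for a sentence are already available as (tag, start, end) tuples with punctuation removed; the tuple representation and function name are illustrative rather than the authors' evaluation code.

```python
# Sketch: precision / recall / f-score over sets of <GF, i, j> tuples,
# where GF is a Cast3LB function tag and i, j delimit the token span.
def prf(gold_tuples, pred_tuples):
    gold, pred = set(gold_tuples), set(pred_tuples)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

gold = [("SUJ", 0, 2), ("CD", 4, 6), ("CC", 7, 9)]
pred = [("SUJ", 0, 2), ("CC", 4, 6), ("CC", 7, 9)]
print(prf(gold, pred))  # (0.666..., 0.666..., 0.666...)
```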
<Paragraph position="8"> For the experiments on parser output (all constituents) we performed a series of sign tests in order to determine to what extent the differences in performance between the different methods are statistically significant. For each pair of methods we calculate the f-score for each sentence in the test set. For those sentences on which the scores differ (i.e. the number of trials) we calculate in how many cases the second method is better than the first (i.e. the number of successes). We then perform the test with the null hypothesis that the probability of success is chance (= 0.5) and the alternative hypothesis that the probability of success is greater than chance (> 0.5). The results are summarized in Table 6. Given that we perform 4 pairwise comparisons, we apply the Bonferroni correction and adjust our target significance level to αB = α/4. For the confidence level of 95% (αB = 0.0125) all pairs give statistically significant results, except for MBL vs MaxEnt.</Paragraph> </Section> <Section position="8" start_page="140" end_page="141" type="metho"> <SectionTitle> 6 Task-Based LFG Annotation Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="140" end_page="141" type="sub_section"> <Paragraph position="0"> Finally, we also evaluated the actual f-structures obtained by running the LFG-annotation algorithm on trees produced by the parser and enriched with Cast3LB function tags assigned using SVM.</Paragraph> <Paragraph position="1"> For this task-based evaluation we produced a gold standard consisting of f-structures corresponding to all sentences in the test set. The LFG-annotation algorithm was run on the test-set trees (which contained the original Cast3LB treebank function tags), and the resulting f-structures were manually corrected.</Paragraph> <Paragraph position="2"> Following Crouch et al. (2002), we convert the f-structures to triples of the form <GF,Pi,Pj>, where Pi is the value of the PRED attribute of the f-structure, GF is an LFG grammatical function attribute, and Pj is the value of the PRED attribute of the f-structure which is the value of the GF attribute. This is done recursively for each level of embedding in the f-structure. Attributes with atomic values are ignored for the purposes of this evaluation. The results obtained are shown in Table 7. We also performed a statistical significance test for these results, using the same method as for the Cast3LB tag assignment task. The p-value given by the sign test was 2.118 × 10⁻⁵, comfortably below α = 1%.</Paragraph>
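The triple conversion can be pictured with a small sketch. It assumes f-structures are represented as nested dictionaries keyed by attribute name (an assumed representation; the attribute names other than PRED and the sample values are invented), recursively emits <GF, Pi, Pj> triples, and skips atomic-valued attributes as described above.

```python
# Sketch: recursively flatten an f-structure (nested dicts, assumed
# representation) into <GF, PRED_of_outer, PRED_of_embedded> triples.
# Atomic-valued attributes are ignored, as in the evaluation described above.
def fstructure_triples(fstr, triples=None):
    if triples is None:
        triples = set()
    outer_pred = fstr.get("PRED")
    for attr, value in fstr.items():
        if attr == "PRED" or not isinstance(value, dict):
            continue  # skip PRED itself and atomic-valued attributes
        triples.add((attr, outer_pred, value.get("PRED")))
        fstructure_triples(value, triples)  # recurse into embedded f-structure
    return triples

example = {
    "PRED": "usar",
    "SUBJ": {"PRED": "sistema", "NUM": "pl"},
    "OBJ": {"PRED": "ordenador", "NUM": "pl"},
}
print(sorted(fstructure_triples(example)))
# [('OBJ', 'usar', 'ordenador'), ('SUBJ', 'usar', 'sistema')]
```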
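A one-sided sign test of the kind used for the significance testing above can be computed as a binomial test. The sketch below uses scipy (an assumption about tooling, and the trial and success counts are invented) together with the Bonferroni-corrected threshold mentioned earlier.

```python
# Sketch: one-sided sign test for paired per-sentence f-scores, with a
# Bonferroni-corrected threshold for 4 pairwise comparisons.
# The trial/success counts below are invented for illustration.
from scipy.stats import binomtest

n_trials = 212     # sentences on which the two methods' f-scores differ
n_successes = 131  # sentences on which the second method scores higher

result = binomtest(n_successes, n_trials, p=0.5, alternative="greater")
alpha_b = 0.05 / 4  # Bonferroni correction: 0.0125
print(f"p = {result.pvalue:.4g}, significant: {result.pvalue < alpha_b}")
```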
<Paragraph position="3"> The higher scores achieved in the LFG f-structure evaluation in comparison with the preceding Cast3LB tag assignment evaluation (Table 5) can be attributed to two main factors. Firstly, the mapping from Cast3LB tags to LFG grammatical functions is not one-to-one. For example, three Cast3LB tags (CC, MOD and ET) are all mapped to LFG ADJUNCT. Thus mistagging a MOD as CC does not affect the f-structure score. On the other hand, the Cast3LB CD tag can be mapped to OBJ, COMP or XCOMP, and it can easily be decided which one is appropriate depending on the category label of the target node. Additionally, many nodes which receive no function tag in Cast3LB, such as noun modifiers, are straightforwardly mapped to LFG ADJUNCT. Similarly, objects of prepositions receive the LFG OBJ function.</Paragraph> <Paragraph position="4"> Secondly, the f-structure evaluation metric is less sensitive to small constituency misconfigurations: it is not necessary to correctly identify the token range spanned by a target node as long as the head (which provides the PRED attribute) is correct.</Paragraph> </Section> </Section> <Section position="9" start_page="141" end_page="141" type="metho"> <SectionTitle> 7 Error Analysis </SectionTitle> <Paragraph position="0"> In order to understand the sources of error and to determine how much room for further improvement there is, we examined the most common cases of Cast3LB function mistagging. A simplified confusion matrix with the most common Cast3LB tags is shown in Table 8; in the table, the gold-standard Cast3LB function tags are shown in the first row and the predicted tags in the first column, so e.g. SUJ was mistagged as CD in 26 cases, and low-frequency function tags as well as those rarely mispredicted have been omitted for clarity. The most common mistakes occur between SUJ and CD, in both directions; in addition, many CREGs are erroneously tagged as CC.</Paragraph> <Section position="1" start_page="141" end_page="141" type="sub_section"> <SectionTitle> 7.1 Subject vs Direct Object </SectionTitle> <Paragraph position="0"> We noticed that in over 50% of the cases in which a Direct Object (CD) was misidentified as a Subject (SUJ), the target node's mother was a relative clause. It turns out that in Spanish relative clauses genuine syntactic ambiguity is not uncommon.</Paragraph> <Paragraph position="1"> Consider a Spanish relative clause whose English translation is either Systems that use 95% of computers or, alternatively, Systems that 95% of computers use. In Spanish, unlike in English, the preverbal or postverbal position of a constituent is not a good guide to its grammatical function in this and similar contexts. Human annotators can use their world knowledge to decide on the correct semantic role of a target constituent and use it in assigning the correct grammatical function, but such information is obviously not available to our machine-learning methods. Thus such mistakes seem likely to remain unresolvable in our current approach.</Paragraph> </Section> <Section position="2" start_page="141" end_page="141" type="sub_section"> <SectionTitle> 7.2 Prepositional Object vs Adjunct </SectionTitle> <Paragraph position="0"> The frequent misidentification of Prepositional Objects (CREG) as Adjuncts (CC) seen in Table 8 can be accounted for by several factors. Firstly, Prepositional Objects are strongly dependent on specific verbs, and the comparatively small size of our training data means that there is limited opportunity for a machine-learning algorithm to learn low-frequency lexical dependencies. Here the obvious solution is to use a larger amount of training material when it becomes available.</Paragraph> <Paragraph position="1"> A further problem with the Prepositional Object vs Adjunct distinction is its inherent fuzziness. Because of this, treebank designers may fail to provide easy-to-follow, clear-cut guidelines, and human annotators necessarily exercise a certain degree of arbitrariness in assigning one or the other function.</Paragraph> </Section> </Section> </Paper>