<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2935"> <Title>Language Independent Probabilistic Context-Free Parsing Bolstered by Machine Learning</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Experimental Setup </SectionTitle> <Paragraph position="0"> For development, we chose the initial a0 sentences of every treebank, where a0 is the number of the sentences in the test set. In this way, the sizes were realistic for the task. For parsing the test data, we added the development set to the training set.</Paragraph> <Paragraph position="1"> All the evaluations on the test sets were performed with the evaluation script supplied by the conference organizers. For development, we used labelled F-score computed from all tokens except the ones employed for punctuation (cf. section 3.2).</Paragraph> </Section> <Section position="5" start_page="0" end_page="234" type="metho"> <SectionTitle> 3 Context Free Parsing </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="231" type="sub_section"> <SectionTitle> 3.1 The Parser </SectionTitle> <Paragraph position="0"> Basically, we investigated the performance of a straightforward unlexicalized statistical parser, viz.</Paragraph> <Paragraph position="1"> BitPar (Schmid, 2004). BitPar is a CKY parser that uses bit vectors for efficient representation of the chart and its items. If frequencies for the grammatical and lexical rules in a training set are available, BitPar uses the Viterbi algorithm to extract the most probable parse tree (according to PCFG) from the chart.</Paragraph> </Section> <Section position="2" start_page="231" end_page="231" type="sub_section"> <SectionTitle> 3.2 Converting Dependency Structure to Constituency Structure </SectionTitle> <Paragraph position="0"> In order to determine the grammar rules required by the context-free parser, the dependency trees in the CONLL format have to be converted to constituency trees. Gaifman (1965) proved that projective dependency grammars can be mapped to context-free grammars. The main information that needs to be added in going from dependency to constituency structure is the category of non-terminals. The usage of special knowledge bases to determine projections of categories (Xia and Palmer, 2001) would have presupposed language-dependent knowledge, so we investigated two other options: Flat rules (Collins et al., 1999) and binary rules. In the flat rules approach, each lexical category projects to exactly one phrasal category, and every projection chain has a length of at most one. The binary rules approach makes use of the X-bar-scheme and thus introduces along with the phrasal category an intermediate category. The phrasal category must not occur more than once in a projection chain, and a projection chain must not end in an intermediate category. In both approaches, projection is only triggered if dependents are present; in case a category occurs as a dependent itself, no projection is required. In coordination structures, the parent category is copied from that of the last conjunct.</Paragraph> <Paragraph position="1"> Non-projective relations can be treated as unbounded dependencies so that their surface position (antecedent position) is related to the position of their head (trace position) with an explicit co-indexed trace (like in the Penn treebank). To find the position of trace and antecedent we assume three constraints: The antecedent should c-command its trace. 
<Paragraph position="1"> Non-projective relations can be treated as unbounded dependencies, so that their surface position (antecedent position) is related to the position of their head (trace position) with an explicit co-indexed trace (as in the Penn treebank). To find the positions of trace and antecedent, we assume three constraints: The antecedent should c-command its trace. The antecedent is maximally near to the trace in depth of embedding. The trace is maximally near to the antecedent in surface order.</Paragraph> <Paragraph position="2"> Finally, the placement of punctuation signs has a major impact on the performance of a parser (Collins et al., 1999). In most of the treebanks, not much effort is invested into the treatment of punctuation. Sometimes, punctuation signs play a role in predicate-argument structure (commas acting as coordinators), but more often they do not, in which case they are marked by special roles (e.g. &quot;pnct&quot;, &quot;punct&quot;, &quot;PUNC&quot;, or &quot;PUNCT&quot;). We used a general mechanism to re-insert such signs for all languages but CH (no punctuation signs) and AR, CZ, SL (reliable annotation). Correct placement of punctuation presupposes knowledge of the punctuation rules valid in a language. In the interest of generality, we opted for a suboptimal solution: punctuation signs are inserted at the highest possible position in the tree.</Paragraph> </Section> <Section position="3" start_page="231" end_page="231" type="sub_section"> <SectionTitle> 3.3 Subcategorization and Coordination </SectionTitle> <Paragraph position="0"> The most important language-specific information that we made use of was a classification of dependency relations into complements, coordinators/conjuncts, and other relations (adjuncts). Given knowledge about complement relations, it is fairly easy to construct subcategorization frames for word occurrences: a subcategorization frame is simply the set of complement relations by which dependents are attached to the word. To give the parser access to these frames, we annotated the category of a subcategorizing word with its subcategorization frame. In this way, the parser can learn to associate the subcategorization requirements of a word with its local syntactic context (Schiehlen, 2004).</Paragraph>
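As a concrete illustration of the frame annotation, the following sketch attaches the set of complement relations to a category label. The relation inventory and the bracketed label syntax are assumptions made for the example:

```python
# Sketch of subcategorization annotation: a word's category is enriched
# with the set of complement relations of its dependents, so the parser
# can associate frames with local context. The inventory is assumed.

COMPLEMENT_RELS = {"subj", "obj", "iobj", "comp"}

def annotate_subcat(category, dependent_rels):
    """Append the sorted set of complement relations to the category."""
    frame = sorted(set(dependent_rels) & COMPLEMENT_RELS)
    return category + ("<" + ",".join(frame) + ">" if frame else "")

# a transitive verb: the label now encodes the frame {subj, obj};
# the adjunct relation "adv" is ignored
print(annotate_subcat("V", ["subj", "obj", "adv"]))   # V<obj,subj>
```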
<Paragraph position="1"> Coordination constructions are marked either in the conjuncts (CH, CZ, DA, DU, GE, PO, SW) or in the coordinator (AR, SL). If coordination is marked on the conjuncts, a common representation of asyndetic coordination has one conjunct point to another conjunct.</Paragraph> <Paragraph position="2"> It is therefore important to distinguish coordinators from conjuncts. Coordinators are singled out either by special dependency relations (DA, PO, SW) or by their POS tags (CH, DU). In German, the first conjunct phrase is merged with the whole coordinated phrase (due to a conversion error?), so that determining the coordinator as a head is not possible.</Paragraph> <Paragraph position="3"> We also experimented with attaching the POS tags of heads to the categories of their adjunct dependents. In this way, the parser could differentiate between, e.g., verbal and nominal adjuncts. In our experiments, the performance gains achieved by this strategy were low, so we did not incorporate it into the system. Possibly, better results could be achieved by restricting the annotation to special classes of adjuncts or by generalizing the heads' POS tags.</Paragraph> </Section> <Section position="4" start_page="231" end_page="232" type="sub_section"> <SectionTitle> 3.4 Categories </SectionTitle> <Paragraph position="0"> As the treebanks provide a lot of information with every word token, it is a delicate question to decide on the type and granularity of the information to use in the categories of the grammar. The treebanks specify for every word a (fine-grained) POS tag, a coarse-grained POS tag, a collection of morphosyntactic features, and a dependency relation (dep-rel). Only the dependency relation is really orthogonal; the other slots contain various generalizations of the same morphological information. We tested several options: the coarse-grained POS tag (if available), the fine-grained POS tag, the fine-grained POS tag with morphosyntactic features (if available), the name of the dependency relation, and the combinations of coarse-grained or fine-grained POS tags with the dependency relation.</Paragraph> <Paragraph position="1"> Figure 1 shows F-score results on the development set for several languages and different combinations. The best overall performer is dep-rel; this somewhat astonishing fact may be due to the superior quality of the annotations in this slot (dependency relations were annotated by hand, POS tags automatically). Furthermore, since dependency relations are exactly what the evaluation checks, they directly affect performance. Since we wanted a general language-independent strategy, we always used the dep-rel tags, except for Japanese: the Japanese treebank features only 8 different dependency relations, so we added coarse-grained POS tag information. In the categories for Czech, we deleted the suffixes marking coordination, apposition and parenthesis (Co, Ap, Pa), reducing the number of categories roughly by a factor of four. In coordination, conjuncts inherit the dep-rel category from the parent.</Paragraph> <Paragraph position="2"> Whereas the dep-rel information is submitted to the parser directly in terms of the categories, the information in the lemma, POS tag, and morphosyntactic features slots was used only for back-off smoothing when associating lexical items with categories. A grammar with this configuration was used to produce the submitted results (cf. the line labelled CF in Figures 2 and 3).</Paragraph> <Paragraph position="3"> Instead of using the category generalizations supplied with the treebanks directly, manual labour can be put into discovering classifications that behave better for the purposes of statistical parsing. For example, Collins et al. (1999) proposed a tag classification for parsing the Czech treebank. We also investigated a classification for German, as well as one for Swedish and one for Spanish, which were modelled after the German classification. The results in Figure 4 show that new classifications may have a dramatic effect on performance if the treebank is sufficiently large. In the interest of generality, we did not make use of the language-dependent tag classifications for the submitted results, but we will nevertheless report results that could have been achieved with these classifications.</Paragraph> </Section> <Section position="5" start_page="232" end_page="233" type="sub_section"> <SectionTitle> 3.5 Markovization </SectionTitle> <Paragraph position="0"> Another strategy that is often used in statistical parsing is Markovization (Collins, 1999): treebanks usually contain very many long rules of low frequency (presumably because inserting nodes costs annotators time). Such rules cannot have an impact in a statistical system (the line new-rules in Figure 2 shows the percentage of rules in the test set that are not in the training set); it is better to view them as products of a Markov process that chooses first the head, then the symbols left of the head, and finally the symbols right of the head. In a bigram model, the choice of left and right siblings is made dependent not only on the parent and head category, but also on the last sibling on the left or right, respectively. Formally, the probability of a rule with left-hand side C and right-hand side L_m ... L_1 H R_1 ... R_n is broken down into the product of the probability p(H | C) of the head, the probabilities p(L_i | C, H, L_{i-1}) of the left siblings, and the probabilities p(R_j | C, H, R_{j-1}) of the right siblings. If bigram contexts (C, H, L_{i-1}) or (C, H, R_{j-1}) occur in fewer than a certain number of rules (50 in our case), we smooth to unigram symbols instead (p(L_i | C, H) and p(R_j | C, H)). We used a script of Schmid (2006) to Markovize infrequent rules in this manner (i.e. all rules with fewer than 50 occurrences that are not coordination rules).</Paragraph>
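The decomposition and its back-off can be summarized in a short sketch. This is a hedged illustration of the bigram model described above, not Schmid's Markovization script; the probability tables, the context counts, and the threshold handling are assumed inputs:

```python
# Sketch of the bigram rule probability: p(H|C) times one factor per
# sibling, conditioned on parent, head, and previous sibling, backing
# off to unigram symbols when the bigram context is rare. All tables
# (plain dicts here) are assumed to come from training-set frequencies.

def rule_prob(parent, head, lefts, rights, p_head, p_big, p_uni,
              counts, k=50):
    prob = p_head[(head, parent)]
    for sibs in (lefts, rights):
        prev = None                      # boundary marker
        for sym in sibs:
            ctx = (parent, head, prev)
            if counts.get(ctx, 0) >= k:  # bigram context frequent enough
                prob *= p_big[(sym, ctx)]
            else:                        # smooth to unigram symbols
                prob *= p_uni[(sym, (parent, head))]
            prev = sym
    return prob

# toy tables for a rule S -> NP V NP with head V (values illustrative)
p_head = {("V", "S"): 0.5}
p_uni = {("NP", ("S", "V")): 0.4}
print(rule_prob("S", "V", ["NP"], ["NP"], p_head, {}, p_uni, {}))  # ~0.08
```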
<Paragraph position="1"> For time reasons, Markovization was not taken into account in the submitted results. We refer to Figures 2 and 3 (line labelled CF+Markov) for a listing of the results attainable by Markovization on the individual treebanks. Performance gains are even more dramatic if, in addition, dependency relations plus manual POS tag classes are used as categories (line labelled CFM+newcl in Figures 2 and 3).</Paragraph> </Section> <Section position="6" start_page="233" end_page="234" type="sub_section"> <SectionTitle> 3.6 From Constituency Structure Back to Dependency Structure </SectionTitle> <Paragraph position="0"> In a last step, we converted the constituent trees back to dependency trees, using the algorithm of Gaifman (1965). Special provisos were necessary for the root node, for which no head is given in certain treebanks (Dzeroski et al., 2006). To interpret the context-free rules, we associated their children with dependency relations. This information was kept in a separate file that was invisible to the parser. In cases where there were several possible interpretations for a context-free rule, we chose the most frequent one in the training data (Schiehlen, 2004).</Paragraph> </Section> </Section> <Section position="6" start_page="234" end_page="234" type="metho"> <SectionTitle> 4 Machine Learning </SectionTitle> <Paragraph position="0"> While the results coming from the statistical parser are not really competitive, we believe that they nevertheless present valuable information for a machine learner. To give some substance to this claim, we undertook experiments with Zhang Le's MaxEnt Toolkit. For this work, we recast the dependency parsing problem as a classification problem: given some feature information on a word token, in which dependency relation does it stand, and to which head? While the representation of dependency relations is straightforward, the representation of heads is more difficult. Building on past experiments (Schiehlen, 2003), we chose the &quot;nth-tag&quot; representation, which consists of three pieces of information: the POS tag of the head, the direction in which the head lies (left or right), and the number of words with the same POS tag between head and dependent. We used the following features to describe a word token: the fine-grained POS tag, the lemma (or full form) if it occurs at least 10 times, the morphosyntactic features, and the POS tags of the four preceding and the four following word tokens. The learner was trained in the standard configuration (30 iterations). The results for this method on the test data are shown in Figure 2 (line MaxEnt).</Paragraph>
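A small sketch may help to visualize the nth-tag head encoding; the token representation is an assumption made for the example:

```python
# Sketch of the "nth-tag" head representation: the head of a token is
# encoded as its POS tag, the direction in which it lies, and how many
# tokens with the same tag intervene. Token dicts are assumed.

def nth_tag(tokens, dep_idx):
    """Encode the head of tokens[dep_idx] as (tag, direction, n)."""
    head_idx = tokens[dep_idx]["head"]
    tag = tokens[head_idx]["pos"]
    direction = "L" if head_idx < dep_idx else "R"
    lo, hi = sorted((dep_idx, head_idx))
    n = sum(1 for t in tokens[lo + 1:hi] if t["pos"] == tag)
    return tag, direction, n

#            0: the(DET)            1: dog(N)          2: barked(V, root)
tokens = [{"pos": "DET", "head": 1}, {"pos": "N", "head": 2},
          {"pos": "V", "head": -1}]      # the root itself is not encoded
print(nth_tag(tokens, 0))   # ('N', 'R', 0): the first N to the right
```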
<Paragraph position="1"> In a second experiment, we added the parsing results (obtained by 10-fold cross-validation on the training set) as two further features: the proposed dependency relation and the proposed head. Results of the extended learning approach are shown in Figure 2 (line combined).</Paragraph> </Section> </Paper>