<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1110">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. Towards Case-Based Parsing: Are Chunks Reliable Indicators for Syntax Trees?</Title>
  <Section position="5" start_page="74" end_page="75" type="metho">
    <SectionTitle>
3 The German Data
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="74" end_page="74" type="sub_section">
      <SectionTitle>
3.1 The Treebank TüBa-D/Z
</SectionTitle>
      <Paragraph position="0"> The TüBa-D/Z treebank is based on text from the German newspaper 'die tageszeitung'; the present release comprises approx. 22 000 sentences. The treebank uses an annotation framework that is based on phrase structure grammar, enhanced by a level of predicate-argument structure. The annotation scheme uses purely projective tree structures. In order to treat long-distance relationships, TüBa-D/Z utilizes a combination of topological fields (Höhle, 1986) and specific functional labels (cf. the tree in Figure 5, where the extraposed relative clause modifies the subject, which is annotated via the label ON-MOD). Topological fields describe the main ordering principles in a German sentence: in a declarative sentence, the position of the finite verb as the second constituent and of the remaining verbal elements at the end of the clause is fixed. The finite verb constitutes the left sentence bracket (LK), and the remaining verbal elements the right sentence bracket (VC). The left bracket is preceded by the initial field (VF); between the two brackets lies the unstructured middle field (MF). Extraposed constituents appear in the final field (NF).</Paragraph>
      <Paragraph position="1"> The tree for sentence (1a) is shown in Figure 1. The syntactic categories are shown in circular nodes, the function-argument structure as edge labels in square boxes. Inside a phrase, the function-argument annotation describes head/non-head relations; on the clause level, directly below the topological fields, grammatical functions are annotated. The prepositional phrase (PX) is marked as a verbal modifier (V-MOD), the noun phrase der international angesehene Künstler as subject (ON), and the complex noun phrase den Ursprung aller Kreativität as accusative object (OA). The topological fields are annotated directly below the clause node (SIMPX): the finite verb is placed in the left bracket, the prepositional phrase constitutes the initial field, and the two noun phrases the middle field.</Paragraph>
    </Section>
    <Section position="2" start_page="74" end_page="75" type="sub_section">
      <SectionTitle>
3.2 Partially Parsed Data
</SectionTitle>
      <Paragraph position="0"> KaRoPars (Müller and Ule, 2002) is a partial parser for German, based on the finite-state technology of the TTT suite of tools (Grover et al., 1999). It employs a mixed bottom-up top-down routine to parse German. Its actual performance is difficult to determine exactly because it employs manually written rules. The figures presented in Table 1 result from an evaluation (Müller, 2005) in which the parser output was compared with treebank structures. The figures in the table are based on an evaluation of chunks only, i.e. the annotation of topological fields and clause boundaries was not taken into account.</Paragraph>
      <Paragraph position="1"> The output of KaRoPars is a complex XML representation with more detailed information than is needed for the present investigation. For this reason, we show a condensed version of the parser output for sentence (1a) in Figure 2. The figure shows only the relevant chunks and POS tags; the complete output contains more embedded chunks, the n-best POS tags from different taggers, morphological information, and lemmas. As can be seen from this example, chunk boundaries often do not coincide with phrase boundaries. In the present case, it is clear from the word ordering constraints in German that the noun phrase des Lebens needs to be attached to the previous phrase. In the treebank, it is grouped into a complex noun phrase, while in the KaRoPars output, this noun phrase is the sister of the prepositional chunk In der bewussten Wahrnehmung. Such boundary mismatches also occur on the clause level.</Paragraph>
      <Paragraph position="2"> The condensed KaRoPars output for sentence (1a) (Figure 2; the relevant chunks are displayed in bold):
&lt;s broken="no"&gt;
 &lt;cl c="V2"&gt;
  &lt;ch fd="VF" c="PC" prep="in"&gt;
   &lt;ch c="PC" prep="in"&gt;
    &lt;t f="In"&gt;&lt;P t="APPR"&gt;&lt;/P&gt;&lt;/t&gt;
    &lt;ch nccat="noun" hdnoun="Wahrnehmung" c="NC"&gt;
     &lt;t f="der"&gt;&lt;P t="ART"&gt;&lt;/P&gt;&lt;/t&gt;
     &lt;t f="bewussten"&gt;&lt;P t="ADJA"&gt;&lt;/P&gt;&lt;/t&gt;
     &lt;t f="Wahrnehmung"&gt;&lt;P t="NN"&gt;&lt;/P&gt;&lt;/t&gt;&lt;/ch&gt;&lt;/ch&gt;
   &lt;ch nccat="noun" hdnoun="Leben" c="NC"&gt;
    &lt;t f="des"&gt;&lt;P t="ART"&gt;&lt;/P&gt;&lt;/t&gt;
    &lt;t f="Lebens"&gt;&lt;P t="NN"&gt;&lt;/P&gt;&lt;/t&gt;&lt;/ch&gt;&lt;/ch&gt;
  &lt;ch finit="fin" c="VCLVF" mode="akt"&gt;
   &lt;t f="sieht"&gt;&lt;P t="VVFIN"&gt;&lt;/P&gt;&lt;/t&gt;&lt;/ch&gt;
  &lt;ch nccat="noun" hdnoun="Künstler" c="NC"&gt;
   &lt;t f="der"&gt;&lt;P t="ART"&gt;&lt;/P&gt;&lt;/t&gt;
   &lt;t f="international"&gt;&lt;P t="ADJD"&gt;&lt;/P&gt;&lt;/t&gt;
   &lt;t f="angesehene"&gt;&lt;P t="ADJA"&gt;&lt;/P&gt;&lt;/t&gt;
   &lt;t f="Künstler"&gt;&lt;P t="NN"&gt;&lt;/P&gt;&lt;/t&gt;&lt;/ch&gt;
  &lt;ch nccat="noun" hdnoun="Ur=Sprung" c="NC"&gt;
   &lt;t f="den"&gt;&lt;P t="ART"&gt;&lt;/P&gt;&lt;/t&gt;
   &lt;t f="Ursprung"&gt;&lt;P t="NN"&gt;&lt;/P&gt;&lt;/t&gt;&lt;/ch&gt;
  &lt;ch nccat="noun" hdnoun="Kreativität" c="NC"&gt;
   &lt;t f="aller"&gt;&lt;P t="PIDAT"&gt;&lt;/P&gt;&lt;/t&gt;
   &lt;t f="Kreativität"&gt;&lt;P t="NN"&gt;&lt;/P&gt;&lt;/t&gt;&lt;/ch&gt;&lt;/cl&gt;&lt;/s&gt;</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="75" end_page="77" type="metho">
    <SectionTitle>
4 Chunk Sequences as Indicators for Syntax Trees
</SectionTitle>
    <Paragraph position="0"> The complexity of the proposed parser depends on the proportion of chunk sequences to syntax trees, as explained in section 2. A first indication of this proportion is given by the ratio of chunk sequence types to tree types. Out of the 22 091 sentences in the treebank, there are 20 340 different trees (types) and 14 894 different chunk sequences. This gives an average of 1.37 trees per chunk sequence. At first glance, this result indicates that chunk sequences are very good indicators for selecting the correct syntax tree. The negative aspect of this ratio is that many of these chunk sequences will not be part of the training data. This is corroborated by an experiment in which one tenth of the complete data set of chunk sequences (test set) was tested against the remainder of the data set (training set) to see how many of the test sequences could be found in the training data. In order to reach a slightly more accurate picture, a ten-fold setting was used, i.e. the experiment was repeated ten times, each time using a different segment as test set. The results show that on average only 43.61% of the chunk sequences could be found in the training data.</Paragraph>
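    <Paragraph> The ten-fold coverage experiment can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the interleaved fold split, the function name, and the toy corpus are assumptions.

```python
# Ten-fold type coverage of chunk sequences, as in the experiment above:
# each fold in turn serves as the test set, and we measure how many of its
# sequences also occur in the remaining nine folds. Illustrative sketch;
# assumes at least ten sequences in the corpus.

def tenfold_coverage(sequences):
    """sequences: list of chunk sequences, each a tuple of chunk labels."""
    folds = [sequences[i::10] for i in range(10)]  # simple interleaved split
    ratios = []
    for i in range(10):
        test = folds[i]
        train = set()
        for j in range(10):
            if j != i:
                train.update(folds[j])
        found = sum(1 for seq in test if seq in train)
        ratios.append(found / len(test))
    return sum(ratios) / len(ratios)

corpus = [("NC", "VCL", "NC", "NC"),
          ("PC", "VCL", "NC"),
          ("NC", "VCL", "NC", "NC"),
          ("NC", "VCR")] * 5
print(tenfold_coverage(corpus))  # 1.0
```

On a real treebank, `sequences` would hold one chunk sequence per sentence; the toy corpus repeats every sequence type across all folds, so coverage is 1.0 here, whereas the paper reports 43.61% on TüBa-D/Z.</Paragraph>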
    <Paragraph position="1">  In a second experiment, we added more information about the chunk types to the chunk categories, namely the information from the fields nccat and finit in the XML representation. Field nccat contains information about the head of the noun chunk: whether it is a noun, a reflexive pronoun, a relative pronoun, etc. Field finit contains information about the finiteness of a verb chunk.</Paragraph>
    <Paragraph position="2"> For this experiment, sentence (2) is represented by the chunk sequence &amp;quot;NC:noun VCL NC:refl PC NC:noun PC AVC NC:noun VCR:fin&amp;quot;. When using such chunk sequences, the ratio of sequences found in the training set decreases to 36.59%.</Paragraph>
    <Paragraph position="3"> In a third experiment, the chunk sequences were constructed without adverbial phrases, i.e. without the one category that functions as an adjunct in the majority of cases. Thus sentence (3) is represented by the chunk sequence "NC VCL NC NC" instead of by the complete sequence "NC VCL NC AVC AVC AVC NC". In this case, 54.72% of the chunk sequences can be found. Reducing the information in the chunk sequence even further seems counterproductive because every type of information that is left out makes the final decision on the correct syntax tree even more difficult. All the experiments reported above are based on data in which complete sentences were used. One possibility of gaining more generality in the chunk sequences without losing more information consists of splitting the sentences on the clause level. Thus, the complex sentence in (4) ('... can slander the twerp after the break-up because they already know what a loser he is.') translates into 5 different clauses, i.e. into 5 different chunk sequences:
1. SubC NC:noun AVC AVC AVC NC:noun NC:noun VCR:fin
2. PC NC:noun PC PC VCR:fin
3. SubC NC:noun AVC AJVC VCR:fin
4. SubC AJVC NC:noun AVC VCR:fin
5. AVC VCR:fin PC
The last sequence covers the elliptical matrix clause ganz abgesehen davon; the first four sequences describe the subordinated clauses: the first sequence describes the subordinate clause dass man dann schon mal alle die Geschlechtsgenossinnen kennt, and the second sequence covers the relative clause mit denen man nach der Trennung über den Kerl ablästern kann. The third sequence describes the subordinate clause introduced by the conjunction weil, and the fourth sequence covers the subordinate clause introduced by the interrogative pronoun wie.</Paragraph>
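    <Paragraph> The second and third experiments amount to varying how much information from the partial parse is kept in a chunk sequence. A minimal sketch, with illustrative function names and an assumed (category, subtype) representation of KaRoPars chunks:

```python
# Two ways of turning a partial parse into a chunk sequence, mirroring the
# experiments above: a rich sequence with chunk subtypes (nccat / finit),
# and a reduced sequence with adverbial chunks (AVC) dropped entirely.

def full_sequence(chunks):
    """Main category plus subtype, e.g. 'NC:noun' or 'VCR:fin'."""
    return tuple(f"{cat}:{sub}" if sub else cat for cat, sub in chunks)

def reduce_sequence(chunks, drop=("AVC",)):
    """Main categories only, with the categories in `drop` removed."""
    return tuple(cat for cat, sub in chunks if cat not in drop)

# Assumed partial parse of a sentence like (3): NC VCL NC AVC AVC AVC NC.
parse = [("NC", "noun"), ("VCL", ""), ("NC", "noun"),
         ("AVC", ""), ("AVC", ""), ("AVC", ""), ("NC", "noun")]

print(" ".join(reduce_sequence(parse)))  # NC VCL NC NC
print(" ".join(full_sequence(parse)))
```

The richer the sequence, the fewer training matches it finds (36.59% with subtypes vs. 54.72% without adverbials above), which is the coverage/informativeness trade-off the text describes.</Paragraph>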
    <Paragraph position="4"> On the one hand, splitting the chunk sequences into clause sequences makes the parsing task more difficult because the clause boundaries annotated during the partial parsing step do not always coincide with the clause boundaries in the syntax trees. In those cases where the clause boundaries do not coincide, a deterministic solution must be found which allows a split that does not violate the parallelism constraints between the two structures. On the other hand, the split into clauses allows a higher coverage of new sentences without extending the size of the training set. In an experiment in which the chunk sequences were represented by the main chunk types plus subtypes (cf. the second experiment) and were split into clauses, the percentage of unseen sequences in a ten-fold split was reduced from 66.41% to 44.16%. If only the main chunk type is taken into account, the percentage of unseen sequences decreases from 56.39% to 36.34%.</Paragraph>
    <Paragraph position="5"> The experiments presented in this section show that, with varying degrees of information and with different ways of extracting chunk sequences, a range of levels of generality can be represented. If the maximum amount of information considered here is used, only 36.59% of the sequences can be found.</Paragraph>
    <Paragraph position="6"> If, in contrast, the sentences are split into clauses and only the main chunk type is used, the ratio of found sequences reaches 63.66%. A final decision on which representation of chunks is optimal, however, also depends on the sets of trees that are represented by the chunk sequences and thus needs to be postponed.</Paragraph>
  </Section>
  <Section position="7" start_page="77" end_page="77" type="metho">
    <SectionTitle>
5 Tree Sets
</SectionTitle>
    <Paragraph position="0"> In the previous section, we showed that if we extract chunk sequences based on complete sentences and on main chunk types, there are on average 1.37 trees assigned to one chunk sequence. At first glance, this result means that for the majority of chunk sequences, there is exactly one sentence which corresponds to the sequence, which makes the final selection of the correct tree trivial. However, 1261 chunk sequences have more than one corresponding sentence, and there is one chunk sequence which has 802 sentences assigned to it. We will call these collections tree sets. In these cases, the selection of the correct tree from a tree set may be far from trivial, depending on the differences between the trees. A minimal difference is a difference in the words only: if all corresponding words belong to the same POS class, there is no difference in the syntax trees. Another type of difference which does not overly harm the selection process is a difference in the internal structure of phrases.</Paragraph>
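    <Paragraph> The tree sets described above arise from a simple grouping of training trees by their chunk sequence. A hypothetical sketch (the data and names are illustrative, not from the treebank):

```python
# Group training trees into tree sets keyed by chunk sequence: most
# sequences map to a single tree, but some collect many candidates that a
# later selection module must choose between.
from collections import defaultdict

def build_tree_sets(pairs):
    """pairs: iterable of (chunk_sequence, tree_id) from the training data."""
    sets_ = defaultdict(list)
    for seq, tree in pairs:
        sets_[seq].append(tree)
    return sets_

data = [(("NC", "VCL", "NC", "NC"), "t1"),
        (("NC", "VCL", "NC", "NC"), "t2"),
        (("PC", "VCL", "NC"), "t3")]
tree_sets = build_tree_sets(data)
ambiguous = {seq: trees for seq, trees in tree_sets.items() if len(trees) > 1}
print(len(tree_sets), len(ambiguous))  # 2 1
```

At parse time, the chunk sequence of a new sentence indexes into `tree_sets`; whenever the retrieved set has more than one member, a tree selection step of the kind discussed below is needed.</Paragraph>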
    <Paragraph position="1"> In (Kübler, 2004a), we showed that the tree can be cut at the phrase level, and new phrase-internal structures can be inserted into the tree. Thus, the most difficult case occurs when the differences between the trees are located in the higher regions of the trees, where attachment information between phrases and grammatical functions is encoded. If such cases are frequent, the parser needs to employ a detailed search procedure.</Paragraph>
    <Paragraph position="2"> How to determine the similarity of trees in a tree set is an open research question. It is clear that the similarity measure should abstract away from unimportant differences in words and phrase-internal structure. It should rather concentrate on differences in the attachment of phrases and in grammatical functions. As a first approximation of such a similarity measure, we chose a measure based on precision and recall of these parts of the tree. In order to ignore the lower levels of the tree, the comparison is restricted to nodes in the tree which have grammatical functions.</Paragraph>
    <Paragraph position="3"> For example, Figure 5 shows the tree for sentence (5) ('... down a street that is still called Lagerstrasse.'). The matrix clause consists of a complex subject noun phrase (GF: ON), a finite verb phrase, which is the head of the sentence, an accusative noun phrase (GF: OA), a verb particle (GF: VPT), and an extraposed relative clause (GF: ON-MOD). Here the grammatical function indicates a long-distance relationship: the relative clause modifies the subject. The relative clause, in turn, consists of a subject (the relative pronoun), an adverbial phrase modifying the verb (GF: V-MOD), a named entity predicate (EN-ADD, GF: PRED), and the finite verb phrase. The comparison of this tree to other trees in its tree set will then be based on the following nodes: NX:ON VXFIN:HD NX:OA PTKVC:VPT R-SIMPX:ON-MOD NX:ON ADVX:V-MOD EN-ADD:PRED VXFIN:HD.</Paragraph>
    <Paragraph position="4"> Precision and recall are generally calculated based on the number of identical constituents between two trees. Two constituents are considered identical if they have the same node label and grammatical function and if they cover the same range of words (i.e. have the same yield). For our comparison, the concrete length of constituents is irrelevant, as long as the sequential order of the constituents is identical.</Paragraph>
    <Paragraph position="1"> Thus, in order to abstract away from the length of constituents, their yield is normalized: all phrases are set to length 1, and the yield of a clause is determined by the yields of its daughters. After this step, precision and recall are calculated on all pairs of trees in a tree set. Thus, if a set contains 3 trees, tree 1 is compared to trees 2 and 3, and tree 2 is compared to tree 3. Since all pairs of trees are compared, there is no clear separation of precision and recall: precision is the result of comparing tree A of a pair to tree B, and recall the result of comparing B to A. As a consequence, only the F1-measure, a combination of precision and recall, is used.</Paragraph>
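    <Paragraph> The pairwise comparison can be sketched as follows. This is a minimal illustration, assuming each tree has already been reduced to a set of function-bearing constituents with normalized yields; the tuple representation and function names are our own assumptions.

```python
# Pairwise F-measure over trees in a tree set, as described above. Each
# tree is a set of (label, grammatical function, start, end) tuples whose
# yields are normalized (every phrase has length 1).

def f_measure(tree_a, tree_b):
    """Harmonic mean of precision and recall between two constituent sets."""
    common = len(tree_a.intersection(tree_b))
    if common == 0:
        return 0.0
    p = common / len(tree_a)
    r = common / len(tree_b)
    return 2 * p * r / (p + r)

def tree_set_f(trees):
    """Average F over all unordered pairs of trees in the set."""
    scores = []
    for i in range(len(trees)):
        for j in range(i + 1, len(trees)):
            scores.append(f_measure(trees[i], trees[j]))
    return sum(scores) / len(scores) if scores else 1.0

# Toy tree set: two trees for "NC VCL NC" that swap subject and object,
# illustrating the free-word-order effect discussed below.
t1 = {("NX", "ON", 0, 1), ("VXFIN", "HD", 1, 2), ("NX", "OA", 2, 3)}
t2 = {("NX", "OA", 0, 1), ("VXFIN", "HD", 1, 2), ("NX", "ON", 2, 3)}
print(round(tree_set_f([t1, t2]), 2))  # 0.33
```

The two toy trees share only the verb node, so their F-measure is 1/3, much like the low set-internal similarities reported for the real data.</Paragraph>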
    <Paragraph position="2"> As mentioned above, the experiment is conducted with chunk sequences based on complete sentences and the main chunk types. The average F-measure for the 1261 tree sets is 46.49%, a clear indication that randomly selecting a tree from a tree set is not sufficient. Only a very small number of sets, 62, consists of completely identical trees, and most of these sets contain only two trees.</Paragraph>
    <Paragraph position="3">  The low F-measure can in part be explained by the relatively free word order of German: in contrast to English, the grammatical function of a noun phrase in German cannot be determined by its position in the sentence. Thus, if the partial parser returns the chunk sequence "NC VCL NC NC", it is impossible to tell which of the noun phrases is the subject, the accusative object, or the dative object. As a consequence, all trees with these three arguments will appear in the same tree set. Since German additionally displays case syncretism between nominative and accusative, a morphological analysis can provide only partial disambiguation. It is thus clear that the selection of the correct syntax tree for an input sentence needs to be based on a selection module that utilizes lexical information.</Paragraph>
    <Paragraph position="4"> Another source of differences between the trees is errors in the partial analysis. In the tree set for the chunk sequence "NC VCL AVC PC PC VCR", there are sentences with a rather similar structure, one of them shown in (6). Most of them differ only in the grammatical functions assigned to the prepositional phrases, which can serve as either complements or adjuncts. However, the tree set also contains sentence (7).</Paragraph>
    <Paragraph position="7"> 'This is also true for the extent to which Montenegro is being attacked.' In sentence (7), the relative pronoun was erroneously POS tagged as a definite determiner, thus allowing an analysis in which the two phrases in dem and Montenegro are grouped as a prepositional chunk. As a consequence, no relative clause was found. The corresponding trees, however, are annotated correctly, and the similarity between those two sentences is consequently low.</Paragraph>
    <Paragraph position="8"> The low F-measure should not be taken as a completely negative result. Admittedly, it necessitates a rather complex tree selection module. The positive aspect of this one-to-many relation between chunk sequences and trees is its generality. If only very similar trees shared a tree set, then we would need many chunk sequences. In this case, the problem would be moved towards the question of how to extract a maximal number of different partial parses from a limited number of training sentences.</Paragraph>
  </Section>
  <Section position="9" start_page="78" end_page="79" type="metho">
    <SectionTitle>
6 Consequences for a Case-Based Parser
</SectionTitle>
    <Paragraph position="0"> The experiments in the previous two sections show that the chunk sequences extracted from a partial parse can serve as indicators for syntax trees.</Paragraph>
    <Paragraph position="1"> While the best definition of chunk sequences can only be determined empirically, the results presented in the previous section allow some conclusions on how the parser must be designed.</Paragraph>
    <Section position="1" start_page="78" end_page="79" type="sub_section">
      <SectionTitle>
6.1 Consequences for Matching Chunk
Sequences and Trees
</SectionTitle>
      <Paragraph position="0"> From the experiments in section 4, it is clear that a good measure of information needs to be found for an optimal selection process. There needs to be a good equilibrium between a high coverage of different chunk sequences and a low number of trees per chunk sequence. One possibility for reaching the first goal would be to ignore certain types of phrases in the extraction of chunk sequences from the partial parse. However, the experiments show that it is impossible to reduce the informativeness of the chunk sequence to a level where all possible chunk sequences are present in the training data. This means that the procedure which matches the chunk sequence of the input sentence to the chunk sequences in the training data must be more flexible than a strict left-to-right comparison. In (Kübler, 2004a; Kübler, 2004b), we allowed the deletion of chunks in either the input sentence or the training sentence. The latter operation is uncritical because it results in a deletion of some part of the syntax tree. The former operation, however, is more critical: it either leads to a partial syntactic analysis in which the deleted chunk is not attached to the tree, or to the necessity of guessing the node to which the additional constituent needs to be attached, and possibly guessing the grammatical function of the new constituent. Instead of this deletion, which can be applied anywhere in the sentence, we suggest the use of Levenshtein distance (Levenshtein, 1966). This distance measure is used, for example, in spelling correction: there, the most similar word in the lexicon is the one that can be reached via the smallest number of deletion, substitution, and insertion operations on characters. Instead of operating on characters, we suggest applying Levenshtein distance to chunk sequences. In this case, deletions from the input sequence could be given a much higher weight (i.e. cost) than insertions.</Paragraph>
      <Paragraph position="1"> We also suggest a modification of the distance to allow an exchange of chunks. This modification would allow a principled treatment of the relatively free word order of German. If such an operation is not restricted to adjacent chunks, the algorithm will gain in complexity; but since the resulting parser is still deterministic, it is rather unlikely that this modification will lead to complexity problems.</Paragraph>
    </Section>
    <Section position="2" start_page="79" end_page="79" type="sub_section">
      <SectionTitle>
6.2 Consequences for the Tree Selection
</SectionTitle>
      <Paragraph position="0"> As explained in section 5, there are chunk sequences that correspond to more than one syntax tree. Since differences between the trees also pertain to grammatical functions, the module that selects the best tree from the tree set needs to use more information than the chunk sequences used for selecting the tree set. Since the holistic approach to parsing proposed in this paper does not lend itself easily to selecting grammatical functions separately for single constituents, we suggest using lexical co-occurrence information instead to select the best tree from the tree set for a given sentence. Such an approach generalizes Streiter's (2001) approach of selecting from a set of possible trees based on word similarity. However, an approach based on lexical information will suffer severely from data sparseness. For this reason, we suggest a soft clustering approach based on a partial parse, similar to the approach by Wagner (2005) for clustering verb arguments when learning selectional preferences for verbs.</Paragraph>
    </Section>
  </Section>
</Paper>