<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0629">
  <Title>Cascaded Grammatical Relation Assignment</Title>
  <Section position="4" start_page="239" end_page="239" type="metho">
    <SectionTitle>
2 Memory-Based Learning
</SectionTitle>
    <Paragraph position="0"> Memory-Based Learning (MBL) keeps all training data in memory and only abstracts at classification time by extrapolating a class from the most similar item(s) in memory. In recent work Daelemans et al. (1999b) have shown that for typical natural language processing tasks, this approach is at an advantage because it also &amp;quot;remembers&amp;quot; exceptional, low-frequency cases which are useful to extrapolate from. Moreover, automatic feature weighting in the similarity metric of an MB learner makes the approach well-suited for domains with large numbers of features from heterogeneous sources, as it embodies a smoothing-by-similarity method when data is sparse (Zavrel and Daelemans, 1997).</Paragraph>
    <Paragraph position="1"> We have used the following MBL algorithms1: IB1 : A variant of the k-nearest neighbor (k-NN) algorithm. The distance between a test item and each memory item is defined as the number of features for which they have a different value (overlap metric).</Paragraph>
    <Paragraph position="2"> IBi-IG : IB1 with information gain (an information-theoretic notion measuring the reduction of uncertainty about the class to be predicted when knowing the value of a feature) to weight the cost of a feature value mismatch during comparison.</Paragraph>
    <Paragraph position="3"> IGTree : In.this variant, a decision tree is created with features as tests, and ordered according to the information gain of the features, as a heuristic approximation of the computationally more expensive IB1 variants. null For more references and information about these algorithms we refer to (Daelemans et al., 1998; Daelemans et al., 1999b). For other 1For the experiments described in this paper we have used TiMBL, an MBL software package developed in the ILK-group (Daelemans et al., 1998), TiMBL is available from: http://ilk.kub.nl/.</Paragraph>
    <Paragraph position="4"> memory-based approaches to parsing, see (Bod, 1992) and (Sekine, 1998).</Paragraph>
  </Section>
  <Section position="5" start_page="239" end_page="243" type="metho">
    <SectionTitle>
3 Methods and Results
</SectionTitle>
    <Paragraph position="0"> In this section we describe the stages of the cascade. The very first stage consists of a Memory-Based Part-of-Speech Tagger (MBT) for which we refer to (Daelemans et al., 1996). The next three stages involve determining boundaries and labels of chunks. Chunks are nonrecursive, non-overlapping constituent parts of sentences (see (Abney, 1991)). First, we simultaneously chunk sentences into: NP-, VP, Prep-, ADJP- and APVP-chunks. As these chunks are non-overlapping, no words can belong to more than one chunk, and thus no conflicts can arise. Prep-chunks are the prepositional part of PPs, thus excluding the nominal part. Then we join a Prep-chunk and one -or more coordinated -- NP-chunks into a PPchunk. Finally, we assign adverbial function (ADVFUNC) labels (e.g. locative or temporal) to all chunks.</Paragraph>
    <Paragraph position="1"> In the last stage of the cascade, we label several types of grammatical relations between pairs of words in the sentence.</Paragraph>
    <Paragraph position="2"> The data for all our experiments was extracted from the Penn Treebank II Wall Street Journal (WSJ) corpus (Marcus et al., 1993).</Paragraph>
    <Paragraph position="3"> For all experiments, we used sections 00-19 as training material and 20-24 as test material.</Paragraph>
    <Paragraph position="4"> See Section 4 for results on other train/test set splittings.</Paragraph>
    <Paragraph position="5"> For evaluation of our results we use the precision and recall measures. Precision is the percentage of predicted chunks/relations that are actually correct, recall is the percentage of correct chunks/relations that are actually found. For convenient comparisons of only one value, we also list the FZ=i value (C.J.van Rijsbergen, 1979): (Z2+l)'prec'rec /~2.prec+rec , with/~ = 1</Paragraph>
    <Section position="1" start_page="239" end_page="240" type="sub_section">
      <SectionTitle>
3.1 Chunking
</SectionTitle>
      <Paragraph position="0"> In the first experiment described in this section, the task is to segment the sentence into chunks and to assign labels to these chunks. This process of chunking and labeling is carried out by assigning a tag to each word in a sentence leftto-right. Ramshaw and Marcus (1995) first assigned a chunk tag to each word in the sentence: I for inside a chunk, O for outside a chunk, and  type precision B tbr inside a chunk, but tile preceding word is in another chunk. As we want to find more than one kind of chunk, we have to \[hrther differentiate tile IOB tags as to which kind of chunk (NP, VP, Prep, ADJP or ADVP) the word is ill. With the extended IOB tag set at hand we can tag the sentence:</Paragraph>
      <Paragraph position="2"> After having found Prep-, NP- and other chlmks, we collapse Preps and NPs to PPs in a second step. While the GR assigner finds relations between VPs and other chunks (cf. Section 3.2), the PP chunker finds relations between prepositions and NPs 2 in a way similar to GR assignment (see Section 3.2). In the last chunking/labeling step, we assign adverbial functions to chunks. The classes are the adverbial function labels from the treebank: LOC (locative), TMP (temporal), DIR (directional), PRP (purpose and reason), MNR (manner), EXT (extension) or &amp;quot;-&amp;quot; for none of the fornmr. Table 1 gives an overview of the results of the chunking-labeling experiments, using the following algorithms, determined by validation on the train set: IBi-IG for XP-chunking and IGTree for PP-chunking and ADVFUNCs assignment. null</Paragraph>
    </Section>
    <Section position="2" start_page="240" end_page="242" type="sub_section">
      <SectionTitle>
3.2 Grammatical Relation Assignment
</SectionTitle>
      <Paragraph position="0"> In grammatical relation assignment we assign a GR to pairs of words in a sentence. In our 2pPs containing anything else than NPs (e.g. without bringing his wife) are not searched for.</Paragraph>
      <Paragraph position="1">  Table h Results of chunking-labeling experiments. NP-,VP-, ADJP-, ADVP- and Prep-chunks are found sinmltaneously, but for convenience, precision and recall values are given separately for each type of chunk.</Paragraph>
      <Paragraph position="2"> experiments, one of these words is always a verb, since this yields the most important GRs. The other word is the head of the phrase which is annotated with this grammatical relation in the treebank. A preposition is the head of a PP, a noun of an NP and so on. Defining relations to hold between heads means that the algorithm can, for example, find a subject relation between a noun and a verb without necessarily having to make decisions about the precise boundaries of the subject NP.</Paragraph>
      <Paragraph position="3"> Suppose we had the POS-tagged sentence shown in Figure 1 and we wanted the algorithm to decide whether, and if so how, Miller (henceforth: the focus) is related to the first verb organized. We then construct an instance for this pair of words by extracting a set of feature values from the sentence. The instance contains information about the verb and the focus: a feature for the word form and a feature for the POS of both. It also has similar features for the local context of the focus. Experiments on the training data suggest an optimal context width of two elements to the left and one to the right. In the present case, elements are words or punctuation signs. In addition to the lexical and the local context information, we include superficial information about clause structure: The first feature indicates the distance from the verb to the focus, counted in elements. A negative distance means that the focus is to the left of the verb. The second feature contains the number of other verbs between the verb and the focus.</Paragraph>
      <Paragraph position="4"> The third feature is the number of intervening commas. The features were chosen by manual  6-7, 8-9 and 12-13 describe the context words, Features 10-11 the focus word. Empty contexts are indicated by the value &amp;quot;-&amp;quot; for all features. &amp;quot;feature engineering&amp;quot;. Table 2 shows the complete instance for Miller-organized in row 5, together with the other first four instances for the sentence. The class is mostly &amp;quot;-&amp;quot;, to indicate that the word does not have a direct grammatical relation to organized. Other possible classes are those from a list of more than 100 different labels found in the treebank. These are combinations of a syntactic category and zero, one or more functions, e.g. NP-SBJ for subject, NP-PRD for predicative object, NP for (in)direct object 3, PP-LOC for locative PP adjunct, PP-LOC-CLR for subcategorised locative PP, etcetera. According to their information gain values, features are ordered with decreasing importance as follows: 11, 13, 10, 1, 2, 8, 12, 9, 6 , 4, 7, 3 , 5. Intuitively,, this ordering makes sense. The most important feature is the POS of the focus, because this determines whether it can have a GR to a verb at all (15unctuation cannot) and what kind of relation is possible. The POS of the following word is important, because e.g. a noun followed by a noun is probably not the head of an NP and will therefore not have a direct GR to the verb. The word itself may be important if it is e.g. a preposition, a pronoun or a clearly temporal/local adverb. Features 1 and 2 give some indication of the complexity of the structure intervening between the focus and the verb. aDirect and indirect object NPs have the same label in the treebank annotation. They can be differentiated by their position.</Paragraph>
      <Paragraph position="5"> The more complex this structure, the lower the probability that the focus and the verb are related. Context further away is less important than near context.</Paragraph>
      <Paragraph position="6"> To test the effects of the chunking steps from Section 3.1 on this task, we will now construct instances based on more structured input text, like that in Figure 2. This time, the focus is described by five features instead of two, for the additional information: which type of chunk it is in, what the preposition is if it is in a PP chunk, and what the adverbial function is, if any. We still have a context of two elements left, one right, but elements are now defined to be either chunks, or words outside any chunk, or punctuation. Each chunk in the context is represented by its last word (which is the semantically most important word in most cases), by the POS of the last word, and by the type of chunk. The distance feature is adapted to the new definition of element, too, and instead of counting intervening verbs, we now count intervening VP chunks. Figure 3 shows the first five instances for the sentence in Figure 2. Class value&amp;quot;-&amp;quot; again means &amp;quot;the focus is not directly related to the verb&amp;quot; (but to some other verb or a non-verbal element). According to their information gain values, features are ordered in decreasing importance as follows: 16, 15, 12, 14, 11, 2, 1, 19, 10, 9, 13, 18, 6, 17, 8, 4, 7, 3, 5. Comparing this to the earlier feature ordering, we see that most of the new features are  - conf. nn np in York nnp pp loc org. vbd vp i York nnp pp org. vbd org. vbd org. vbd org. vbd org. vbd  6-8, 9-11 and 17 19 describe the context words/chunks, Features 12-16 the focus chunk. Empty contexts are indicated by the &amp;quot;-&amp;quot; for all features. very important, thereby justifying their introduction. Relative to the other &amp;quot;old&amp;quot; features, the structural features 1 and 2 have gained importance, probably because more structure is available in the input to represent.</Paragraph>
      <Paragraph position="7"> In principle, we would have to construct one instance tbr each possible pair of a verb and a focus word in the sentence. However, we restrict instances to those where there is at most one other verb/VP chunk between the verb and the focus, in case the focus precedes the verb, and no other verb in case the verb precedes the focus. This restriction allows, for example, for a relative clause on the subject (as in our example sentence). In the training data, 97.9% of the related pairs fulfill this condition (when counting VP chunks). Experiments on the training data showed that increasing the admitted number of intervening VP chunks slightly increases recall, at the cost of precision. Having constructed all instances from the test data and from a training set with the same level of partial structure, we first train the IG'IYee algorithm, and then let it classify the test instances. Then, for each test instance that was classified with a grammatical relation, we check whether the same verb-focuspair appears with the same relation in the GR list extracted directly from the treebank. This gives us the precision of the classifier. Checking the treebank list versus the classified list yields recall.</Paragraph>
    </Section>
    <Section position="3" start_page="242" end_page="243" type="sub_section">
      <SectionTitle>
3.3 Cascaded Experiments
</SectionTitle>
      <Paragraph position="0"> We have already seen from the example that the level of structure in the input text can influence the composition of the instances. We are interested in the effects of different sorts of partial structure in the input data on the classification performance of the final classifier.</Paragraph>
      <Paragraph position="1"> Therefore, we ran a series of experiments.</Paragraph>
      <Paragraph position="2"> The classification task was always that of finding grammatical relations to verbs and performance was always measured by precision and recall on those relations (the test set contained 45825 relations). The amount of structure in the input data varied. Table 4 shows the results of the experiments. In the first experiment, only POS tagged input is used. Then, NP chunks are added. Other sorts of chunks are inserted at each subsequent step. Finally, the adverbial function labels are added. We can see that the more structure we add, the better precision and recall of the grammatical relations get: precision increases from 60.7% to 74.8%, recall from 41.3% to 67.9%. This in spite of the fact that the added information is not always correct, because it was predicted for the test material on the basis of the training material by the classitiers described in Section 3.1. As we have seen in Table 1, especially ADJP and ADVP chunks  and adverbial function labels did not have very high precision and recall.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="243" end_page="244" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> There are three ways how two cascaded modules can interact.</Paragraph>
    <Paragraph position="1"> The first module can add information on which the later module can (partially) base its decisions. This is the case between the adverbial functions finder and the relations finder. The former adds an extra informative feature to the instances of the latter (Feature 16 in Table 3). Cf. column two of  The first module can restrict the number of decisions to be made by the second one. This is the case in the combination of the chunking steps and the relations finder. Without the chunker, the relations finder would have to decide for every word, whether it is the head of a constituent that bears a relation to the verb. With the chunker., the relations finder has to make this decision for fewer words, namely only for those which are the last word in a chunk resp. the preposition of a PP chunk. Practically, this reduction of the number of decisions (which translates into a reduction of instances) as can be seen in the third column of Table 4.</Paragraph>
    <Paragraph position="2"> * The first module can reduce the number of elements used for the instances by counting one chunk as just one context element. We can see the effect in the feature that indicates the distance in elements between the focus and the verb. The more chunks are used, the smaller the average absolute distance (see column four Table 4).</Paragraph>
    <Paragraph position="3"> All three effects interact in the cascade we describe. The PP chunker reduces the number of decisions for the relations finder (instead of one instance for the preposition and one for the NP chunk, we get only one instance for the PP chunk), introduces an extra feature (Feature 12 in Table 3), and changes the context (instead of a preposition and an NP, context may now be one PP).</Paragraph>
    <Paragraph position="4"> As we already noted above, precision and recall are monotonically increasing when adding more structure. However, we note large differences, such as NP chunks which increase FZ=i by more than 10%, and VP chunks which add another 6.8%, whereas ADVPs and ADJPs yield hardly any improvement. This may partially be explained by the fact that these chunks are less frequent than the former two. Preps, on the other hand, while hardly reducing the average distance or the number of instances, improve F~=i by nearly 1%. PPs yield another 1.1%. What may come as a surprise is that adverbial functions again increase FZ=i by nearly 2%, despite the fact that FZ=i for this ADVFUNC assignment step was not very high. This result shows that cascaded modules need not be perfect to be useful.</Paragraph>
    <Paragraph position="5"> Up to now, we only looked at the overall results. Table 4 also shows individual FZ=i values for four selected common grammatical relations: subject NP, (in)direct object NP, locative PP adjunct and temporal PP adjunct. Note that the steps have different effects on the different relations: Adding NPs increases FZ=i by 11.3% for subjects resp. 16.2% for objects, but only 3.9% resp. 3.7% for locatives and temporals. Adverbial functions are more important for the two adjuncts (+6.3% resp. +15%) than for the two complements (+0.2% resp. +0.7%).</Paragraph>
    <Paragraph position="6"> Argamon et al. (1998) report FZ=i for sub-ject and object identification of respectively 86.5% and 83.0%, compared to 81.8% and 81.0% in this paper. Note however that Argamon et al. (1998) do not identify the head of subjects, subjects in embedded clauses, or subjects and objects related to the verb only through a trace, which makes their task easier. For a detailed comparison of the two methods on the same task see (Daelemans et al., 1999a). That paper also shows that the chunking method proposed here performs about as well as other methods, and that the influence of tagging errors on (NP) chunking is less than 1%.</Paragraph>
    <Paragraph position="7"> To study the effect of the errors in the lower modules other than the tagger, we used &amp;quot;perfect&amp;quot; test data in a last experiment, i.e. data annotated with partial information taken directly from the treebank. The results are shown in Table 5. We see that later modules suffer from errors of earlier modules (as could be expected): Ff~=l of PP chunking is 92% but could have  added by earlier modules in the cascade. Columns show the number of features in the instances, the mlmber of instances constructed front the test input, the average distance between the verb and the tbcus element, precision, recall and FZ=i over all relations, and FC/~=i over some selected relations.</Paragraph>
    <Paragraph position="8">  previous modules in the cascade) vs. on &amp;quot;perfect&amp;quot; input (enriched with partial treebank annotation). For PPs, this means perfect POS tags and chunk labels/boundaries, for ADVFUNC additionally perfect PP chunks, for GR assignment also perfect ADVFUNC labels. been 97.9% if all previous chunks would have been correct (+5.9%). For adverbial functions, the difference is 3.5%. For grammatical relation assignment, the last module in the cascade, the difference is, not surprisingly, the largest: 7.9% for chunks only, 12.3% for chunks and AD-VFUNCs. The latter percentage shows what could maximally be gained by further improving the chunker and ADVFUNCs finder. 'On realistic data, a realistic ADVFUNCs finder improves GR assigment by 1.9%. On perfect data, a perfect ADVFUNCs finder increases performance by 6.3%.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML