<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0707">
  <Title>Memory-Based Shallow Parsing</Title>
  <Section position="2" start_page="0" end_page="58" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We present a memory-based learning (MBL) approach to shallow parsing in which POS tagging, chunking, and identification of syntactic relations are formulated as memory-based modules. The experiments reported in this paper show competitive results, the F~=l for the Wall Street Journal (WSJ) treebank is: 93.8% for NP chunking, 94.7% for VP chunking, 77.1% for subject detection and 79.0% for object detection.</Paragraph>
    <Paragraph position="1"> Introduction Recently, there has been an increased interest in approaches to automatically learning to recognize shallow linguistic patterns in text \[Ramshaw and Marcus, 1995, Vilain and Day, 1996, Argamon et al., 1998, Buchholz, 1998, Cardie and Pierce, 1998, Veenstra, 1998, Daelemans et aI., 1999a\]. Shallow parsing is an important component of most text analysis systems in applications such as information extraction and summary generation. It includes discovering the main constituents of sentences (NPs, VPs, PPs) and their heads, and determining syntactic relationships like subject, object, adjunct relations between verbs and heads of other constituents.</Paragraph>
    <Paragraph position="2"> Memory-Based Learning (MBL) shares with other statistical and learning techniques the advantages of avoiding the need for manual definition of patterns (common practice is to use hand-crafted regular expressions), and of being reusable for different corpora and sublanguages. The unique property of memory-based approaches which sets them apart from other learning methods is the fact that they are lazy learners: they keep all training data available for extrapolation. All other statistical and machine learning methods are eager (or greedy) learners: They abstract knowledge structures or probability distributions from the training data, forget the individual training instances, and extrapolate from the induced structures. Lazy learning techniques have been shown to achieve higher accuracy than eager methods for many language processing tasks. A reason for this is tile intricate interaction between regularities, subregularities and exceptions in most language data. and the related problem for learners of distinguishing noise from exceptions. Eager learning techniques abstract from what they consider noise (hapaxes, low-frequency events, non-typical events) whereas lazy learning techniques keep all data available, including exceptions which may sometimes be productive. For a detailed analysis of this issue, see \[Daelemans et al., 1999a\]. Moreover, the automatic feature weighting in the similarity metric of a memory-based learner makes the approach well-suited for domains with large numbers of features from heterogeneous sources, as it embodies a smoothing-by-similarity method when data is sparse \[Zavrel and Daelemans, 1997\].</Paragraph>
    <Paragraph position="3"> In this paper, we will provide a empirical evaluation of tile MBL approach to syntactic analysis on a number of shallow pattern learning tasks: NP chunking, \'P clmnking, and the assignment of subject-verb and object-verb relations. The approach is evaluated by cross-validation on the WSJ treebank corpus \[Marcus et al., 1993\]. We compare the approach qualitatively and as far as possible quantitatively with other approaches.</Paragraph>
    <Section position="1" start_page="0" end_page="53" type="sub_section">
      <SectionTitle>
Memory-Based Shallow Syntactic
Analysis
</SectionTitle>
      <Paragraph position="0"> Memory-Based Learning (MBL) is a classificationbased, supervised learning approach: a nmmory-based learning algorithm constructs a classifier for a task by storing a set of examples. Each example associates a feature vector (the problem description) with one of a finite number of classes (the solution). Given a new feature vector, the classifier extrapolates its class from those of the most similar feature vectors in memory.</Paragraph>
      <Paragraph position="1"> The metric defining similarity can be automatically adapted to the task at hand.</Paragraph>
      <Paragraph position="2"> In our approach to memory-based syntactic pattern recognition, we carve up the syntactic anal- null ! ysis process into a number of such classification tasks with input vectors representing 'a focus item and a dynamically selected surrounding context. As in Natural Language Processing problems in general \[Daelemans, 1995\], these classification tasks can be segmentation tasks (e.g. decide whether a focus word or tag is the start or end of an NP) or disambiguation tasks (e.g. decide whether a chunk is the subject NP, the object NP or neither). Output of some memory-based modules (e.g. a tagger or a chunker) is used as input by other memory-based modules (e.g. syntactic relation assignment).</Paragraph>
      <Paragraph position="3"> Similar cascading ideas have been explored in other approaches to text analysis: e.g. finite state partial parsing \[Abney, 1996, Grefenstette, 1996\], statistical decision tree parsing \[Magerman, 1994\], maximum entropy parsing \[Ratnaparkhi, 1997\], and memory-based learning \[Cardie, 1994, Daelemans et al., 1996\].</Paragraph>
    </Section>
    <Section position="2" start_page="53" end_page="53" type="sub_section">
      <SectionTitle>
Algorithms and Implementation
</SectionTitle>
      <Paragraph position="0"> For our experiments we have used TiMBL 1, an MBL software package developed in our group \[Daelemans et al., 1999b\]. We used the following variants of MBL: * IBI-IG: The distance between a test item and each memory item is defined as the number of features for which they have a different value (overlap metrid). Since in most cases not all features are equally relevant for solving the task, the algorithm uses information gain (an information-theoretic notion measuring the reduction of uncertainty about the class to be predicted when knowing the value of a feature) to weight the cost of a feature value mismatch during comparison. Then the class of the most similar training item is predicted to be the class of the test item. Classification speed is linear to the number of training instances times the number of features.</Paragraph>
      <Paragraph position="1"> * IGTREE: IBI-IG is expensive in basic memory aztd processing requirements. With IGTREE. an oblivious decision tree is created with features as tests, and ordered according to information gain of features, as a heuristic approximation of the computationally more expensive pure MBL variants. Classification speed is linear to the number of features times the average branching factor in the tree, which is less than or equal to the average number of values per feature.</Paragraph>
      <Paragraph position="2"> For more references and information about these algorithms we refer to \[Daelemans et al., 1999b, Daelemans et al., 1999a\]. In \[Daelemans et al., 1996\] both algorithms are explained in detail in the context ITiMBL is available from: http:\[/ilk.kub.nl/ of MBT, a memory-based POS tagger, which we presuppose as an available module in this paper. In the remainder of this paper, we discuss results on the different tasks in section Experiments, and compare our approach to alternative learning methods in section</Paragraph>
    </Section>
    <Section position="3" start_page="53" end_page="54" type="sub_section">
      <SectionTitle>
Experiments
</SectionTitle>
      <Paragraph position="0"> We carried out two series of experiments. In the first we evaluated a memory-based NP and VP chunker, in the second we used this chunker for memory-based subject/object detection.</Paragraph>
      <Paragraph position="1"> To evaluate the performance of our trained memory-based classifiers, we will use four measures: accuracy (the percentage of correctly predicted output classes), precision (the percentage of predicted chunks or subject- or object-verb pairs that is correct), recall (the percentage of chunks or subjector object-verb pairs to be predicted that is found), and F,~ \[C.J.van Rijsbergen. 1979\], which is given by (~2+1) v,.ec rec with ;3 = 1. See below for an example. 3 ~- precq-rec ' For the chunking tasks, we evaluated the algorithms by cross-validation on all 25 partitions of the WSJ treebank. Each partition in turn was selected as a test set, and the algorithms trained on the remaining partitions.</Paragraph>
      <Paragraph position="2"> Average precision and recall on the 25 partitions will be reported for both the IBI-IG and IGTREE variants of MBL. For the subject/object detection task, we used 10-fold cross-validation on treebank partitions 00-09.</Paragraph>
      <Paragraph position="3"> In section Related Research we will further evaluate our chunkers and subject/object detectors.</Paragraph>
      <Paragraph position="4"> Chunking Following \[Ramshaw and Marcus. 1995\] we defined chunking as a tagging task, each word in a sentence is assigned a tag which indicates whether this word is inside or outside a chunk. We used as tagset: I_NP inside a baseNP.</Paragraph>
      <Paragraph position="5"> 0 outside a baseNP or a baseVP.</Paragraph>
      <Paragraph position="6"> B_NP inside a baseNP, but the preceding word is in another baseNP.</Paragraph>
      <Paragraph position="7"> I_VP and B_VP are used in a similar fashion.</Paragraph>
      <Paragraph position="8"> Since baseNPs and baseVPs are non-overlapping and non-recursive these five tags suffice to unambiguously chunk a sentence. For example, the sentence: \[NP Pierre Vinken NP\] , \[NP 61 years NP\] old , \[vP will join vP\] \[NP the board NP\] as \[NP a nonexecutive director .~.p\] \[NP Nov. 29 .~'e\] * should be tagged as:  of two words and POS right and one left, and of using IGTREE with the same computed with IGTrtEE using only the focus POS tag or the focus word WSJ using IBI-IG with a context context. The baseline scores are ature I 11 5 0 7 ~Veight 39 40 4 3 2 10 12  sisters PRP$ seen VBN S seen VBN man NN lately RB O man NN lately RB  the features (truncated and multiplied by 100; from one of the 10 cross-validation experiments). Thus the order of importance of the features is: 2, 1, 11, 9, 13, 10, 8, 12, 7, 6, 3, 4, 5. Pierret_Np Vinkent_NP ,o 61t_Np yearsLNp oldo ,o willt.vp joinz_vp the~_Ne boardl_NV aso a~_,ve nonexecutivet_Np directort_Np Nov.a_Np 29t.~, p -o Suppose that our classifier erroneously tagged director as B_NP instead of I_NP, but classified the rest correctly. Accuracy would then be 17 y~ = 0.94. The resulting chunks would be \[NP a nonexecutive NP\] \[NP director NP\] instead of \[NP a nonexecutive director Nf'\] (the other chunks being the same as above). Then out of the seven predicted chunks, five are correct (precision= ~ = 71.4%) and from the six chunks that were to be found, five were indeed found (recall= ~ = 83.3%). F3=~ is 76.9%.</Paragraph>
      <Paragraph position="9"> The features for the experiments are the word form and the POS tag (as provided by the WSJ treebank) of the two words to the left, the focus word, and one word to the right. For the results see Table 1.</Paragraph>
      <Paragraph position="10"> The baseline for these experiments is computed with IBI-IG, with as only feature: i) the focus word, and ii) the focus POS tag.</Paragraph>
      <Paragraph position="11"> The results of the chunking experiments show that accurate chunking is possible, with Fz=t values around 94~c.</Paragraph>
    </Section>
    <Section position="4" start_page="54" end_page="58" type="sub_section">
      <SectionTitle>
Subject/Object Detection
</SectionTitle>
      <Paragraph position="0"> Finding a subject .or object (or any other relation of a constituent to a verb) is defined in our classification-based approach as a mapping from a pair of words (the verb and the head of the constituent) and a representation of its context to a class describing the type of relation (e.g. subject, object, or neither). A verb can have a subject or object relation to more than one word in case of NP coordination, and a word can be the sub-ject of more than one verb in case of VP coordination.</Paragraph>
      <Paragraph position="1"> Data Format In our representation, the tagged and chunked sentence</Paragraph>
      <Paragraph position="3"> will result in the instances in Table 2.</Paragraph>
      <Paragraph position="4"> Classes are S(ubject), O(bject) or &amp;quot;-&amp;quot; (for anything else). Features are: 1 the distance from the verb to the head (a chunk just counts for one word; a negative distance means that the head is to the left of the verb), 2 the number of other baseVPs between the verb and the head (in the current setting, this can maximally  For expository reasons, we also mention how well this classifier performs when computing precision and recall for subjects and objects separately.</Paragraph>
      <Paragraph position="5"> 3 the number of commas between the verb and the head, 4 the verb, and 5 its POS tag, 6-9 the two left context words/chunks of the head, represented by the word and its POS 10-11 the head itself, and 12-13 its right context word/chunk.</Paragraph>
      <Paragraph position="6">  Features one to three are numeric features. This prop-erty can only be exploited by IBI-IG. IGTREE treats them as symbolic. We also tried four additional features that indicate the sort of chunk (NP, VP or none) of the head and the three context elements respectively These features did not improve performance, presumably because this information is mostly inferrable from the POS tag.</Paragraph>
      <Paragraph position="7"> To find subjects and objects in a test sentence, the sentence is first POS tagged (with the Memory-Based Tagger MBT) and chunked (see section Experiments: Chunking). Subsequently all chunks are reduced to their heads. 2 Then an instance is constructed for every pair of a baseVP and another word/chunk head provided they are not too distant from each other in the sentence. A crucial point here is the definition of &amp;quot;not too distant&amp;quot;. If our definition is too strict, we might exclude too many actual subject-verb or object-verb pairs, which will result in low recall. If the definition is too broad, we will get very large training and test sets. This slows down learning and might even have a negative effect on precision because the learner is confronted with too much &amp;quot;noise&amp;quot;. Note further that defining distance purely 2By definition, the head is the rightmost word of a baseNP or baseVP.</Paragraph>
      <Paragraph position="8"> as the number of intervening words or chunks is not fully satisfactory as this does not take clause structure into account* As one clause normally contains one baseVP, we developped the idea of counting intervening baseVPs. Counts on the treebank showed that less than 1% of the subjects and objects are separated from their verbs by more than one other baseVP. We therefore constru! ! ct an instance for every pair of a baseVP and another word/chunk head if they have not more than one other baseVP in between them. s These instances are classified by the memory-based learner. For the training material, the POS.tags &amp;quot;and chunks from the treebank are used directly. Also, subject-verb and object-verb relations are extracted to yield the class values.</Paragraph>
      <Paragraph position="9"> Results and discussion Tile results in Table 3 show that finding (unrestricted) subjects and objects is a hard task. The baseline of classifying instances at random (using only the probability distribution of the classes) is about 4%. Using the simple heuristic of classifying each (pro)noun directly in front of resp. after the verb as S resp. 0 yields a much higher baseline of about 66%. Obviously, these are the easy cases. IGTREE, which is the better overall MBL algorithm on this task, scores 10% above this baseline, i.e. 76.2degA. The difference ill accuracy between IGTrtEE and IBI-IG is only 3The following sentence shows a subject-verb pair (in bold) with one intervening baseVP (in italics): \[,vP The plant .~p\], \[A'P which .~,-p\] \[l'P zs o~med l'P\] by \[:vP Hollingsworth &amp; Vose Co. NP\] , \[vP was vP\] under \[^,p contract ,vp\] with \[.~p Lorillard .x-p\] \[vp to nmke vP\] \[,~p the cigarette filters .we\] .</Paragraph>
      <Paragraph position="10"> The next example illustrates the same for all object-verb pair: Along \[,vp the way/vp\] , \[.~p he .~p\]/re meets vP\] \[,vp a solicitous Christian chauffeur .vP\] \[.vp who .vp\] \[vP of_ fers re\] \[Ne the hero ~ve\] \[.~,-p God .re\] \[^-,~ &amp;quot;s phone number .re/; and//re the Sheep Man .x'e\], \[.vP a sweet, roughhewn figure /v/,\] \[~vt, who .vP\] \[l'P wears t'P\] - /.re what  same dataset of several classifiers, the experiments with IBI-IG are all POS left and three right 0.3%. In terms of F-values, IBI-IG is better for finding subjects, whereas IGTREE is better for objects. We also note that IGTRv.E always yields a higher precision than recall, whereas IBI-IG does the opposite.</Paragraph>
      <Paragraph position="11"> IGTrtEv. is thus more &amp;quot;cautious&amp;quot; than IBI-IG. Presumably, this is due to the word-valued features. Many test instances contain a word not occurring in the training instances (in that feature). In that case, search in the IGTREV. is stopped and the default class for that node is used. As the &amp;quot;-&amp;quot; class is more than ten times more frequent than the other two classes, there is a high chance that this default is indeed the &amp;quot;-&amp;quot; class, which is always the &amp;quot;cautious&amp;quot; choice. IBI-IG, on the other hand, will not stop on encountering an unseen word, but will go on comparing the rest of the features, which might still opt for a non-&amp;quot;-&amp;quot; class. The differences in precision and recall surely are a topic for further research. So far, this observation led us to combine both algorithms by classifying an instance as S resp. O only if both algorithms agreed and as &amp;quot;-&amp;quot; otherwise. The combination yields higher precision at the cost of recall, but the overall effect is certainly positive (Fj=~ = 77.8%).</Paragraph>
      <Paragraph position="12">  In \[Argamon et al., 1998\], an alternative approach to memox3'-based learning of shallow patterns, memory-based sequence learning (MBSL), is proposed. In this approach, tasks such as base NP chunking and subject detection are formulated as separate bracketing tasks, with as input the POS tags of a sentence. For every input sentence, all possible bracketings in context (situated contexts) are hypothesised and the highest scoring ones m'e used for generating a bracketed output sentence. The score of a situated hypothesis depends on the scores of the tiles which are part of it and the degree to which they cover the hypothesis. A tile is defined as a substring of the situated hypothesis containing a bracket, and the score of a tile depends on the number of times it is found in the training material divided by the total number of times the string of tags occurs (i.e. including occurrences with another or no bracket). The approach is memory-based because all training data is kept available. Similar algorithms have been proposed for grapheme~tophoneme conversion by \[Dedina and Nusbaum, 1991\], and \[Yvon, 1996\], and the approach could be seen as a linear algorithmic simplification of the DOP memory-based approach for full parsing \[Bod, 1995\]. In the remainder of this section, we show that an empirical comparison of our computationally simpler MBL approach to MBSL on their data for NP chunking, subject, and object detection reveals comparable accuracies.</Paragraph>
      <Paragraph position="13"> Chunking For NP chunking, \[Argamon et al., 1998\] used data extracted from section 15-18 of the WS.J as a fixed train set and section 20 as a fixed test set, the same data as \[Ramshaw and Marcus, 1995\]. To find the optimal setting of learning algorithms and feature construction we used 10-fold cross validation on section 15; we found IBI-IG with a context of five words and POS-tags to the left and three to the right as a good parameter setting for the chunking task; we used this setting as the default setting for our experiments. For an overview of the results see Table 4. Since part of the chunking errors could be caused by POS errors, we also compared the same baseNP chunker on the santo corpus tagged with i) the Brill tagger as used in \[Ramshaw and Marcus, 1995\], ii) the Memory-Based Tagger (MBT) as described in \[Daelemans et al., 1996\]. We also present the results of \[Argamon et al., 1998\], \[Ramshaw and Marcus: 1995\] and \[Cardie and Pierce, 1998\] in Table 4. The latter two use a transformation-based error-driven learning method \[Brill, 1992\]. In \[Ramshaw and Marcus, 1995\], the method is used for NP chunking, and in \[Cardie and Pierce, 1998\] the approach is indirectly used to evaluate corpus-extracted NP chunking rules.</Paragraph>
      <Paragraph position="14"> As \[Argamon et al., 1998\] used only POS informa- null tion for their MBSL chunker, we also experimented with that option (POSonly in the Table). Results show that adding words as information provides useful information for MBL (see Table 4).</Paragraph>
      <Paragraph position="15"> Subject/object detection For subject/object detection, we trained our algorithm on section 01-09 of the WSJ and tested on Argamon et al.'s test data (section 00). We also used the treebank POS tags instead of MBT. For comparability, we performed two separate learning experiments. The verb windows are defined as reaching only to the left (up to one intervening baseVP) in the subject experiment and only to the right (with no intervening baseVP) in the object experiment. The relational output of MBL is converted to the sequence format used by MBSL. The conversion program first selects one relation in case of coordinated or nested relations. For objects, the actual conversion is trivial: The V-O sequence extends from the verb up to the head (seen the old man for the example sentence on page 55). In the case of subjects, the S-V sequence extends from the beginning of the baseNP of the head up to the first non-modal verb in the baseVP (My sisters have). The program also uses filters to model some restrictions of the patterns that Argamonet al. used for data extraction. They extracted e.g. only objects that immediately follow the verb.</Paragraph>
      <Paragraph position="16"> The results in Table 5 show that highly comparable results can be obtained with MBL on the (impoverished) definition of the subject-object task. IBI-IG as well as IGTREE are better than MBSL on the object data. They are however worse on the subject data.</Paragraph>
      <Paragraph position="17"> Two factors may have influenced this result. Firstly, more than 17% of the precision errors of IBI-IG concern cases in which the word proposed by the algorithm is indeed the subject according to the treebank, but the corresponding sequence is not included in Argamon et al.'s test data due to their restricted extraction pat* terns. Secondly. there are cases for which MBL correctly found the head of the subject, but the conversion results in an incorrect sequence. These are sentences like &amp;quot;All \[NP the man NP\] \[NP 's friends NP\] came.&amp;quot; in which all is part of the subject while not being part of an:)&amp;quot; baseNP.</Paragraph>
      <Paragraph position="18"> Apart from using a different algorithm, the MBL experiments also exploit more information ill the training data than MBSL does. Ignoring lexical information in chunking and subject/object detection decreased the Fa=I value by 2.5% for subjects and 6.9% for objects.</Paragraph>
      <Paragraph position="19"> The bigger influence for objects may be due to verbs that take a predicative object instead of a direct one.</Paragraph>
      <Paragraph position="20"> Knowing the lexical form of the verb helps to make this distinction. In addition, time expressions like &amp;quot;(it rained) last week&amp;quot; can be distinguished from direct objects on the basis of the head noun. Not chunking&amp;quot; the text before trying to find subjects and objects decreases F-values by more than 50%. Using the &amp;quot;perfect&amp;quot; chunks of the treebank, on the other hand. increases F by 5.9% for subjects and 5.1% for objects. These figures slmw how crucial the chunking step is for the succes of our method.</Paragraph>
      <Paragraph position="21"> General Clear advantages of MBL are its efficiency (especially when using IGTREE), the ease with which information apart from POS tags can be added to the input (e.g.</Paragraph>
      <Paragraph position="22"> word information, morphological information, wor(lnet tags. chunk information for subject aIld object detection), and the fact that NP and VP chunking and different types of relation tagging can be achieved in one classification pass. It is uncleax how MBSL could be extended to incorporate other sources of information apart from POS tags, and what the effect would be on performance. More limitations of MBSL are that it cannot find nested sequences, which nevertheless occur frequently in tasks such as subject identification 4, and that it does not mark heads.</Paragraph>
      <Paragraph position="23"> %.g. \[SV John, who \[SV I like SV\]. Is SV\] angry.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>