<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1055"> <Title>Shallow parsing on the basis of words only: A case study</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Task representation, data preparation, and experimental setup </SectionTitle> <Paragraph position="0"> We chose a shallow parsing task as our benchmark task. If, to support an application such as information extraction, summarization, or question answering, we are only interested in parts of the parse tree, then a shallow parser forms a viable alternative to a full parser. Li and Roth (2001) show that for the chunking task it is specialized in, their shallow parser is more accurate and more robust than a general-purpose, i.e. full, parser.</Paragraph> <Paragraph position="1"> Our shallow parsing task is a combination of chunking (finding and labelling non-overlapping syntactically functional sequences) and what we will call function tagging. Our chunks and functions are based on the annotations in the third release of the Penn Treebank (Marcus et al., 1993). Below is an example of a tree and the corresponding chunk (subscripts on brackets) and function (superscripts) annotation.</Paragraph> <Paragraph position="2"> [Example tree and bracketed chunk/function representation not preserved in this version.] </Paragraph> <Paragraph position="3"> Nodes in the tree are labeled with a syntactic category and up to four function tags that specify grammatical relations (e.g. SBJ for subject), subtypes of adverbials (e.g. TMP for temporal), discrepancies between syntactic form and syntactic function (e.g. NOM for non-nominal constituents functioning nominally) and notions like topicalization. Our chunks are based on the syntactic part of the constituent label. The conversion program is the same as used for the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000). Head words of chunks are assigned a function code that is based on the full constituent label of the parent and of ancestors with a different category, as in the case of VP/S-NOM in the example.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Task representation and evaluation method </SectionTitle> <Paragraph position="0"> To formulate the task as a machine-learnable classification task, we use a representation that encodes the joint task of chunking and function-tagging a sentence in per-word classification instances. As illustrated in Table 2.1, an instance (which corresponds to a row in the table) consists of the values for all features (the columns) and the function-chunk code for the focus word. The features describe the focus word and its local context. For the chunk part of the code, we adopt the &quot;Inside&quot;, &quot;Outside&quot;, and &quot;Between&quot; (IOB) encoding originating from Ramshaw and Marcus (1995). For the function part of the code, the value is either the function for the head of a chunk, or the dummy value NOFUNC for all non-heads. For creating the POS-based task, all words are replaced by the gold-standard POS tags associated with them in the Penn Treebank. For the combined task, both types of features are used simultaneously.</Paragraph>
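To make the representation concrete, below is a minimal sketch in Python of the per-word instance encoding; it is not the authors' code. The window of three words of left and right context, the padding symbol, and the data layout are illustrative assumptions, and the chunk and function codes for the first three example words are guesses (the remaining codes follow the instance table).

# Minimal sketch (not the authors' code) of the per-word instance encoding for the
# word-based task; window width and data layout are illustrative assumptions.

NOFUNC = "NOFUNC"  # dummy function code for non-head words

def encode_instances(words, chunk_codes, func_codes, window=3):
    """words: tokens of one sentence;
    chunk_codes: IOB chunk code per word (e.g. 'I-NP', 'B-NP', 'O');
    func_codes: function code for chunk heads, NOFUNC for all other words.
    Returns one (feature list, function-chunk code) pair per word."""
    pad = ["_"] * window                                   # padding beyond the sentence
    padded = pad + list(words) + pad
    instances = []
    for i in range(len(words)):
        left = padded[i:i + window]                        # left-context words
        focus = padded[i + window]                         # focus word
        right = padded[i + window + 1:i + 2 * window + 1]  # right-context words
        label = f"{chunk_codes[i]} {func_codes[i]}"        # combined function-chunk code
        instances.append((left + [focus] + right, label))
    return instances

# Fragment of the running example; the codes for 'he', 'was' and 'held' are guessed
# for illustration, the remaining codes follow the instance table.
words  = "he was held for three months without being charged .".split()
chunks = ["I-NP", "I-VP", "I-VP", "I-PP", "I-NP", "I-NP", "I-PP", "I-VP", "I-VP", "O"]
funcs  = ["NP-SBJ", NOFUNC, "VP/S", "PP-TMP", NOFUNC, "NP", "PP", NOFUNC, "VP/S-NOM", NOFUNC]
for features, label in encode_instances(words, chunks, funcs):
    print(" ".join(features), "->", label)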
<Paragraph position="1"> When the learner is presented with new instances from heldout material, its task is thus to assign the combined function-chunk codes to either words or POS in context. From the sequence of predicted function-chunk codes, the complete chunking and function assignment can be reconstructed. However, predictions can be inconsistent, blocking a straightforward reconstruction of the complete shallow parse. We employed the following four rules to resolve such problems: (1) When an O chunk code is followed by a B chunk code, or when an I chunk code is followed by a B chunk code with a different chunk type, the B is converted to an I.</Paragraph> <Paragraph position="2"> (2) When more than one word in a chunk is given a function code, the function code of the rightmost word is taken as the chunk's function code. (3) If all words of the chunk receive NOFUNC tags, a prior function code is assigned to the rightmost word of the chunk. This prior, estimated on the training set, represents the most frequent function code for that type of chunk.</Paragraph> <Paragraph position="3"> To measure the success of our learner, we compute the precision, recall and their harmonic mean, the F-score with β=1 (Van Rijsbergen, 1979). In the combined function-chunking evaluation, a chunk is only counted as correct when its boundaries, its type and its function are identified correctly.</Paragraph> </Section>
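To make the reconstruction step concrete, the following is a minimal sketch in Python of the repair rules listed above; it is not the authors' implementation. The input format (one 'chunk-code function-code' string per word) and the table of chunk-type priors are illustrative assumptions, and rule (2) is read as keeping the code of the rightmost function-tagged word in the chunk.

# Minimal sketch (not the authors' implementation) of the rules that repair an
# inconsistent sequence of predicted function-chunk codes.

NOFUNC = "NOFUNC"

def repair(codes, chunk_priors):
    """codes: one 'chunkcode functioncode' string per word, e.g. 'I-NP NOFUNC';
    chunk_priors: most frequent function code per chunk type, estimated on training data."""
    pairs = [c.split() for c in codes]
    # Rule 1: an O followed by a B, or an I followed by a B of a different chunk type,
    # is inconsistent; the B is converted to an I.
    for i in range(1, len(pairs)):
        prev, cur = pairs[i - 1][0], pairs[i][0]
        if cur.startswith("B-") and (prev == "O" or (prev.startswith("I-") and prev[2:] != cur[2:])):
            pairs[i][0] = "I-" + cur[2:]
    # Walk over the chunks and fix their function codes.
    i = 0
    while i < len(pairs):
        if pairs[i][0] == "O":
            i += 1
            continue
        ctype = pairs[i][0][2:]
        j = i + 1
        while j < len(pairs) and pairs[j][0] == "I-" + ctype:   # chunk continues
            j += 1
        tagged = [k for k in range(i, j) if pairs[k][1] != NOFUNC]
        if len(tagged) > 1:
            # Rule 2: keep only the function code of the rightmost function-tagged word.
            for k in tagged[:-1]:
                pairs[k][1] = NOFUNC
        elif not tagged:
            # Rule 3: assign the chunk type's prior function code to the rightmost word.
            pairs[j - 1][1] = chunk_priors.get(ctype, NOFUNC)
        i = j
    return [" ".join(p) for p in pairs]

# Illustrative priors and example: the B-PP after an O is first turned into I-PP,
# and the two function-less chunks then receive their chunk type's prior code.
priors = {"NP": "NP", "PP": "PP", "VP": "VP/S"}
print(repair(["I-NP NP-SBJ", "O NOFUNC", "B-PP NOFUNC", "I-NP NOFUNC"], priors))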
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Data preparation </SectionTitle> <Paragraph position="0"> Our total data set consists of all 74,024 sentences in the Wall Street Journal, Brown and ATIS Corpus subparts of the Penn Treebank III. We randomized the order of the sentences in this dataset, and then split it into ten 90%/10% partitionings with disjoint 10% portions, in order to run 10-fold cross-validation experiments (Weiss and Kulikowski, 1991). To provide differently-sized training sets for learning curve experiments, each training set (of 66,627 sentences) was also clipped at the following sizes: 100 sentences, 500, 1000, 2000, 5000, 10,000, 20,000 and 50,000. All data was converted to instances as illustrated in Table 2.1. For the total data set, this yields 1,637,268 instances, one for each word or punctuation mark. 62,472 word types and 874 different function-chunk codes occur in the total data set.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Classifier: Memory-based learning </SectionTitle> <Paragraph position="0"> Arguably, the choice of algorithm is not crucial in learning curve experiments. First, we aim at measuring relative differences arising from the selection of types of input. Second, there are indications that increasing the training set of language processing tasks produces much larger performance gains than varying among algorithms at fixed training set sizes; moreover, these differences also tend to get smaller with larger data sets (Banko and Brill, 2001).</Paragraph> <Paragraph position="1"> Memory-based learning (Stanfill and Waltz, 1986; Aha et al., 1991; Daelemans et al., 1999b) is a supervised inductive learning algorithm for learning classification tasks. Memory-based learning treats a set of labeled (pre-classified) training instances as points in a multi-dimensional feature space, and stores them as such in an instance base in memory (rather than performing some abstraction over them). Classification in memory-based learning is performed by the k-NN algorithm (Cover and Hart, 1967) that searches for the k 'nearest neighbors' according to the distance function between two instances X and Y, Δ(X, Y) = Σ_{i=1..n} w_i δ(x_i, y_i), where n is the number of features, w_i is a weight for feature i, and δ estimates the difference between the two instances' values at the i-th feature. The classes of the k nearest neighbors then determine the class of the new case.</Paragraph> <Paragraph position="3"> [Fragment of the per-word instance table (cf. Section 2.1): each row lists the context-window word features and the focus word's function-chunk code for the example "... he was held for three months without being charged .", e.g. for → I-PP PP-TMP, three → I-NP NOFUNC, months → I-NP NP, without → I-PP PP, being → I-VP NOFUNC, charged → I-VP VP/S-NOM, . → O NOFUNC.] </Paragraph> <Paragraph position="4"> In our experiments, we used a variant of the IB1 memory-based learner and classifier as implemented in TiMBL (Daelemans et al., 2001). On top of the k-NN kernel of IB1 we used the following metrics that fine-tune the distance function and the class voting automatically: (1) The weight (importance) of a feature i, w_i, is estimated in our experiments by computing its gain ratio GR_i (Quinlan, 1993). This is the algorithm's default choice. (2) Differences between feature values (i.e. words or POS tags) are estimated by the real-valued outcome of the modified value difference metric (Stanfill and Waltz, 1986; Cost and Salzberg, 1993). (3) k was set to seven.</Paragraph> <Paragraph position="5"> This and the previous parameter setting turned out best for a chunking task using the same algorithm as reported by Veenstra and van den Bosch (2000). (4) Class voting among the k nearest neighbours is done by weighting each neighbour's vote by the inverse of its distance to the test example (Dudani, 1976). In Zavrel (1997), this distance-weighted voting was shown to improve over standard k-NN on a PP-attachment task. (5) For efficiency, search for the k nearest neighbours is approximated by employing TRIBL (Daelemans et al., 1997), a hybrid between pure k-NN search and decision-tree traversal. The switch point of TRIBL was set to 1 for the words-only and POS-only experiments, i.e. a decision-tree split was made on the most important feature, the focus word or the focus POS, respectively. For the experiments with both words and POS, the switch point was set to 2 and the algorithm was forced to split on the focus word and focus POS.</Paragraph> <Paragraph position="6"> The metrics under (1) to (4) then apply to the remaining features.</Paragraph> </Section> </Section>
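As a concrete illustration of this classifier, here is a minimal sketch in Python; it is not TiMBL. It applies the weighted distance Δ(X, Y) = Σ_i w_i δ(x_i, y_i) with a simple overlap difference (0/1 per feature) instead of the modified value difference metric, assumes the gain-ratio feature weights have already been computed, weights neighbour votes by inverse distance, and omits the TRIBL approximation.

# Minimal sketch (not TiMBL) of k-NN classification with the weighted distance above.
# Feature weights are assumed to be precomputed (e.g. gain ratios); delta is a plain
# overlap (0/1) difference rather than the modified value difference metric, and the
# TRIBL decision-tree approximation is omitted.
from collections import defaultdict

def distance(x, y, weights):
    # Delta(X, Y) = sum_i w_i * delta(x_i, y_i), with delta = 0 on a match, 1 otherwise.
    return sum(w for w, xi, yi in zip(weights, x, y) if xi != yi)

def knn_classify(instance, training, weights, k=7):
    """training: list of (feature_vector, class_label) pairs; k=7 as in the experiments."""
    scored = sorted(((distance(instance, feats, weights), label)
                     for feats, label in training), key=lambda t: t[0])[:k]
    votes = defaultdict(float)
    for d, label in scored:
        votes[label] += 1.0 / (d + 1e-6)      # inverse-distance vote weighting (Dudani-style)
    return max(votes, key=votes.get)

In the actual experiments, the per-feature difference for word and POS values is the real-valued modified value difference metric rather than the 0/1 overlap used in this sketch.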
<Section position="5" start_page="0" end_page="35" type="metho"> <SectionTitle> 3 Learning Curve Experiments </SectionTitle> <Paragraph position="0"> We report the learning curve results in three paragraphs. In the first, we compare the performance of a plain words input representation with that of a gold-standard POS one. In the second we introduce a variant of the word-based task that deals with low-frequency words. The last paragraph describes results with input consisting of words and POS tags.</Paragraph> <Paragraph position="1"> Words only versus POS tags only. As illustrated in Figure 1, the learning curves of both the word-based and the POS-based representation are upward with more training data. The word-based curve starts much lower but flattens less; in the tested range it has an approximately log-linear growth.</Paragraph> [Figure 1 caption (fragment): ... combination of words and POS. The y-axis represents Fβ=1 on combined chunking and function assignment. The x-axis represents the number of training sentences; its scale is logarithmic.] <Paragraph position="2"> Given the measured results, the word-based curve surpasses the POS-based curve at a training set size between 20,000 and 50,000 sentences. This proves two points: First, experiments with a fixed training set size might present a misleading snapshot. Second, the amount of training material available today is already enough to make words more valuable input than (gold-standard!) POS.</Paragraph> <Paragraph position="3"> Low-frequency word encoding variant. If TRIBL encounters an unknown word in the test material, it already stops at the decision-tree stage and returns the default class without even using the information provided by the context. This is clearly disadvantageous and specific to this choice of algorithm. A more general shortcoming is that the word form of an unknown word often contains useful information that is not available in the present setup. To overcome these two problems, we applied what Eisner (1997) calls &quot;attenuation&quot; to all words occurring ten times or less in the training material. If such a word ends in a digit, it is converted to the string &quot;MORPH-NUM&quot;; if the word is six characters or longer it becomes &quot;MORPH-XX&quot; where XX are the final two letters, else it becomes &quot;MORPH-SHORT&quot;. If the first letter is capitalised, the attenuated form is &quot;MORPH-CAP&quot;. This produces sequences such as A number of MORPH-ts were MORPH-ly MORPH-ed by traders . (A number of developments were negatively interpreted by traders ). We applied this attenuation method to all training sets. All words in test material that did not occur as words in the attenuated training material were also attenuated following the same procedure.</Paragraph> <Paragraph position="4"> The curve resulting from the attenuated word-based experiment is also displayed in Figure 1. The curve illustrates that the attenuated representation performs better than the pure word-based one at all reasonable training set sizes. However, the effect clearly diminishes with more training data, so we cannot exclude that the two curves will meet with yet more training data.</Paragraph>
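To make the attenuation procedure concrete, here is a minimal sketch in Python; it is not the authors' code. The frequency threshold (ten occurrences or less) and the conversion rules follow the text above; the precedence of the capitalisation rule over the other rules is an assumption, since the order of application is not stated.

# Minimal sketch (not the authors' code) of the attenuation of low-frequency words.
# The precedence of the capitalisation rule over the other rules is an assumption.
from collections import Counter

def attenuate(word):
    if word[0].isupper():
        return "MORPH-CAP"
    if word[-1].isdigit():
        return "MORPH-NUM"
    if len(word) >= 6:
        return "MORPH-" + word[-2:]   # e.g. developments -> MORPH-ts
    return "MORPH-SHORT"

def attenuate_corpus(train_sentences, test_sentences, threshold=10):
    """Attenuate all words occurring `threshold` times or less in the training data;
    test words that do not occur as kept words in the attenuated training data are
    attenuated by the same procedure."""
    counts = Counter(w for sent in train_sentences for w in sent)
    keep = {w for w, c in counts.items() if c > threshold}
    att_train = [[w if w in keep else attenuate(w) for w in sent] for sent in train_sentences]
    att_test = [[w if w in keep else attenuate(w) for w in sent] for sent in test_sentences]
    return att_train, att_test

print(attenuate("developments"), attenuate("negatively"), attenuate("interpreted"))
# -> MORPH-ts MORPH-ly MORPH-ed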
<Paragraph position="5"> Combining words with POS tags. Although the word-based curve, and especially its attenuated variant, ends higher than the POS-based curve, POS might still be useful in addition to words. We therefore also tested a representation with both types of features. As shown in Figure 1, the &quot;attenuated word + gold-standard POS&quot; curve starts close to the gold-standard POS curve, attains break-even with this curve at about 500 sentences, and ends close to but higher than all other curves, including the &quot;attenuated word&quot; curve.</Paragraph> <Paragraph position="6"> Although the performance increase through the addition of POS becomes smaller with more training data, it is still highly significant with maximal training set size. As the tags are the gold-standard tags taken directly from the Penn Treebank, this result provides an upper bound for the contribution of POS tags to the shallow parsing task under investigation. Automatic POS tagging is a well-studied task (Church, 1988; Brill, 1993; Ratnaparkhi, 1996; Daelemans et al., 1996), and reported errors in the range of 2-6% are common.</Paragraph> [Table 2 caption (fragment): ... evaluation, using the input features words, attenuated words, gold-standard POS, and MBT POS, and combinations, on the maximal training set size.] <Paragraph position="7"> To investigate the effect of using automatically assigned tags, we trained MBT, a memory-based tagger (Daelemans et al., 1996), on the training portions of our 10-fold cross-validation experiment for the maximal data set and let it predict tags for the test material. The memory-based tagger attained an accuracy of 96.7% (±0.1; 97.0% on known words, and 80.9% on unknown words).</Paragraph> <Paragraph position="8"> We then used these MBT POS instead of the gold-standard ones.</Paragraph> <Paragraph position="9"> The results of these experiments, along with the equivalent results using gold-standard POS, are displayed in Table 2. As they show, the scores with automatically assigned tags are always lower than with the gold-standard ones. When taken individually, the difference in F-scores of the gold-standard versus the MBT POS tags is 1.6 points. Combined with words, the MBT POS contribute 0.5 points (compared against words taken individually); combined with attenuated words, they contribute 0.3 points.</Paragraph> <Paragraph position="10"> This is much less than the improvement by the gold-standard tags (1.7 points) but still significant. However, as the learning curve experiments showed, this is only a snapshot and the improvement may well diminish with more training data.</Paragraph> <Paragraph position="11"> A breakdown of accuracy results shows that the highest improvement in accuracy is achieved for focus words in the MORPH-SHORT encoding. In these cases, the POS tagger has access to more information about the low-frequency word (e.g. its suffix) than the attenuated form provides. This suggests that this encoding is not optimal.</Paragraph> </Section> <Section position="6" start_page="35" end_page="35" type="metho"> <SectionTitle> 5 Related Research </SectionTitle> <Paragraph position="0"> Ramshaw and Marcus (1995), Muñoz et al. (1999), Argamon et al. (1998), Daelemans et al. (1999a) find NP chunks, using Wall Street Journal training material of about 9000 sentences. F-scores range between 91.4 and 92.8. The first two articles mention that words and (automatically assigned) POS together perform better than POS alone.</Paragraph> <Paragraph position="1"> Chunking is one part of the task studied here, so we also computed performance on chunks alone, ignoring function codes. Indeed the learning curve of words combined with gold-standard POS crosses the POS-based curve before 10,000 sentences on the chunking subtask.</Paragraph> <Paragraph position="2"> Tjong Kim Sang and Buchholz (2000) give an overview of the CoNLL shared task of chunking.</Paragraph> <Paragraph position="3"> The types and definitions of chunks are identical to the ones used here. Training material again consists of the 9000 Wall Street Journal sentences with automatically assigned POS tags. The best F-score (93.5) is higher than the 91.5 F-score attained on chunking in our study using attenuated words only, but using the maximally-sized training sets. With gold-standard POS and attenuated words we attain an F-score of 94.2; with MBT POS tags and attenuated words, 92.8. In the CoNLL competition, all three best systems used combinations of classifiers instead of one single classifier. In addition, the effect of our mix of sentences from different corpora on top of WSJ is not clear.</Paragraph> <Paragraph position="4"> Ferro et al. (1999) describe a system for finding grammatical relations in automatically tagged and manually chunked text.
They report an F-score of 69.8 for a training size of 3299 words of elementary school reading comprehension tests.</Paragraph> <Paragraph position="5"> Buchholz et al. (1999) achieve 71.2 F-score for grammatical relation assignment on automatically tagged and chunked text after training on about 40,000 Wall Street Journal sentences. In contrast to these studies, we do not chunk before finding grammatical relations; rather, chunking is performed simultaneously with headword function tagging. Measuring F-scores on the correct assignment of functions to headwords in our study, we attain 78.2 F-score using words, 80.1 using attenuated words, 80.9 using attenuated words combined with gold-standard POS, and 79.7 using attenuated words combined with MBT POS (which is slightly worse than with attenuated words only). Our function tagging task is easier than finding grammatical relations, as we tag a headword of a chunk as, e.g., a subject in isolation, whereas grammatical relation assignment also includes deciding which verb this chunk is the subject of. Aït-Mokhtar and Chanod (1997) describe a sequence of finite-state transducers in which function tagging is a separate step, after POS tagging and chunking. The last transducer then uses the function tags to extract subject/verb and object/verb relations (from French text).</Paragraph> </Section> </Paper>