<?xml version="1.0" standalone="yes"?>
<Paper uid="N01-1023"> <Title>Applying Co-Training methods to Statistical Parsing</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Unsupervised techniques in language processing </SectionTitle>
<Paragraph position="0"> While machine learning techniques that exploit annotated data have been very successful in attacking problems in NLP, there are still some aspects which are considered to be open issues, among them how to generalize the problem from lexical (sparse) features so as to improve performance on unseen data.</Paragraph>
<Paragraph position="1"> In the particular domain of statistical parsing there has been limited success in moving towards unsupervised machine learning techniques (see Section 7 for more discussion). A more promising approach is that of combining small amounts of seed labeled data with unlimited amounts of unlabeled data to bootstrap statistical parsers. In this paper, we use one such machine learning technique: Co-Training, which has been used successfully in several classification tasks such as web page classification, word sense disambiguation and named-entity recognition. Early work in combining labeled and unlabeled data for NLP tasks was done in the area of unsupervised part-of-speech (POS) tagging. (Cutting et al., 1992) reported very high results (96% on the Brown corpus) for unsupervised POS tagging using Hidden Markov Models (HMMs) by exploiting hand-built tag dictionaries and equivalence classes. Tag dictionaries are predefined assignments of all possible POS tags to words in the test data. This impressive result triggered several follow-up studies in which the effect of hand tuning the tag dictionary was quantified as a combination of labeled and unlabeled data. (Elworthy, 1994) showed that only in very specific cases were HMMs effective in combining labeled and unlabeled data. However, (Brill, 1997) showed that tag dictionaries extracted from labeled data could be used aggressively to bootstrap an unsupervised POS tagger with high accuracy (approximately 95% on WSJ data). We exploit this approach of using tag dictionaries in our method as well (see Section 3.2 for more details). It is important to point out that, before attacking the problem of parsing using similar machine learning techniques, we face a representational problem which makes it difficult to define the notion of a tag dictionary for a statistical parser.</Paragraph>
<Paragraph position="2"> The problem we face in parsing is more complex than assigning a small fixed set of labels to examples. If the parser is to be generally applicable, it has to produce a fairly complex "label" given an input sentence. For example, given the sentence Pierre Vinken will join the board as a non-executive director, the parser is expected to produce an output as shown in Figure 1.</Paragraph>
<Paragraph position="3"> Since the entire parse cannot be reasonably considered as a monolithic label, the usual method in parsing is to decompose the assigned structure into the context-free rules that rewrite each node of the tree.</Paragraph>
<Paragraph position="5"> However, such a recursive decomposition of structure does not allow a simple notion of a tag dictionary.</Paragraph>
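As an illustrative aside, the following minimal Python sketch (not from the paper) walks a bracketed parse such as the one in Figure 1 and emits one context-free rule per node; the nested-tuple tree encoding and the function name are assumptions made purely for exposition.

```python
# Hypothetical sketch: decompose a bracketed parse into context-free rules,
# one rule per node. The tree encoding is an assumption for this example.

def cfg_rules(tree):
    """Yield (parent_label, child_labels) for every node of the tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for c in children:
        if not isinstance(c, str):
            yield from cfg_rules(c)

# "Pierre Vinken will join the board as a non-executive director",
# with the noun phrases collapsed to single tokens for readability.
parse = ("S",
         ("NP", "Pierre_Vinken"),
         ("VP", ("V", "will"),
                ("VP", ("V", "join"),
                       ("NP", "the_board"),
                       ("PP", ("P", "as"),
                              ("NP", "a_non-executive_director")))))

for lhs, rhs in cfg_rules(parse):
    print(lhs, "->", " ".join(rhs))
```

Because each rule mentions only local node labels, no single rule ties a word to the full structure it selects, which is why a per-word tag dictionary cannot be read off this kind of decomposition.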
<Paragraph position="6"> We solve this problem by decomposing the structure in an approach that is different from the one shown above, which uses context-free rules. The approach uses the notion of tree rewriting as defined in the Lexicalized Tree Adjoining Grammar (LTAG) formalism (Joshi and Schabes, 1992).</Paragraph>
<Paragraph position="8"> This is a lexicalized version of Tree Adjoining Grammar (Joshi et al., 1975; Joshi, 1985).</Paragraph>
<Paragraph position="9"> The LTAG representation retains the notion of lexicalization that is crucial to the success of a statistical parser while permitting a simple definition of a tag dictionary. For example, the parse in Figure 1 can be generated by assigning the structured labels shown in Figure 2 to each word in the sentence (for simplicity, we assume that the noun phrases are generated here as a single word). We use a tool described in (Xia et al., 2000) to convert the Penn Treebank into this representation. Combining the trees together by rewriting nodes as trees (explained in Section 2.1) gives us the parse tree in Figure 1. A history of the bi-lexical dependencies that define the probability model used to construct the parse is shown in Figure 3. This history is called the derivation tree.</Paragraph>
<Paragraph position="10"> In addition, as a byproduct of this kind of representation we obtain more than the phrase structure of each sentence. We also produce a more embellished parse in which phenomena such as predicate-argument structure, subcategorization and movement are given a probabilistic treatment.</Paragraph>
<Paragraph position="11"> (Figure 3 caption: the attachments between trees that have occurred during the parse of the sentence.)</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 The Generative Model </SectionTitle>
<Paragraph position="0"> A stochastic LTAG derivation proceeds as follows (Schabes, 1992; Resnik, 1992). An initial tree is selected with probability P_init, and other trees selected by words in the sentence are combined using the operations of substitution and adjoining. These operations are explained below with examples. Each of these operations is performed with probability P_attach.</Paragraph>
<Paragraph position="2"> Substitution is defined as rewriting a node on the frontier of a tree with probability P_attach, written as P_attach(τ, η → τ'),</Paragraph>
<Paragraph position="4"> where τ, η → τ' indicates that tree τ' is substituting into node η in tree τ. An example of the operation of substitution is shown in Figure 4. Adjoining is defined as rewriting any internal node of a tree by another tree. This is a recursive rule, and each adjoining operation is performed with probability P_attach(τ, η → τ'),</Paragraph>
<Paragraph position="6"> which here is the probability that τ' rewrites an internal node η in tree τ or that no adjoining (NA) occurs at node η in τ. The additional factor that accounts for no adjoining at a node is required for the probability to be well-formed. An example of the operation of adjoining is shown in Figure 5.</Paragraph>
<Paragraph position="7"> Each LTAG derivation D which was built starting from an initial tree α with n subsequent attachments has the probability P(D) = P_init(α) × ∏_{i=1..n} P_attach(τ_i, η_i → τ'_i). (Figure 5 caption: adjoining into the tree for join: τ(join), VP → τ(will).)</Paragraph>
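To make the generative model of Section 2.1 concrete, the following Python sketch (not the paper's implementation; the toy probability tables and the event encoding are assumptions) scores a derivation as P_init of the starting tree times one P_attach factor per substitution, adjoining, or no-adjoining event.

```python
import math

# Illustrative sketch of the stochastic LTAG derivation probability:
#   P(D) = P_init(alpha) * product over attachments of P_attach(event).
# The probability tables below are toy numbers, not estimates from the paper.

p_init = {"tau(join)": 0.6}

# Each event is (attaching tree or "NA", host tree, host node); the "NA"
# outcome records that no adjoining happened at that node, the extra
# factor needed for the distribution to be well-formed.
p_attach = {
    ("tau(Vinken)", "tau(join)", "NP_subj"): 0.7,
    ("tau(board)",  "tau(join)", "NP_obj"):  0.5,
    ("tau(will)",   "tau(join)", "VP"):      0.4,
    ("NA",          "tau(join)", "S"):       0.9,
}

def derivation_log_prob(start_tree, events):
    """Log probability of a derivation: log P_init plus one log P_attach per event."""
    logp = math.log(p_init[start_tree])
    for event in events:
        logp += math.log(p_attach[event])
    return logp

events = [("tau(Vinken)", "tau(join)", "NP_subj"),
          ("tau(will)",   "tau(join)", "VP"),
          ("tau(board)",  "tau(join)", "NP_obj"),
          ("NA",          "tau(join)", "S")]
print(round(math.exp(derivation_log_prob("tau(join)", events)), 4))  # 0.0756
```

Working in log space, as here, is the usual way to keep a long product of small attachment probabilities numerically stable.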
<Paragraph position="8"> Note that, assuming each tree is lexicalized by one word, the derivation D corresponds to a sentence of n+1 words.</Paragraph>
<Paragraph position="9"> In the next section we show how to apply this notion of a tag dictionary to the problem of statistical parsing.</Paragraph>
</Section> </Section>
<Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Co-Training methods for parsing </SectionTitle>
<Paragraph position="0"> Many supervised methods of learning from a Treebank have been studied. The question we want to pursue in this paper is whether unlabeled data can be used to improve the performance of a statistical parser and at the same time reduce the amount of labeled training data necessary for good performance. We will assume that the data input to our method has the following characteristics: 1. A small set of sentences labeled with corrected parse trees and a large set of unlabeled data.</Paragraph>
<Paragraph position="1"> 2. A pair of probabilistic models that form parts of a statistical parser. This pair of models must be able to mutually constrain each other.</Paragraph>
<Paragraph position="2"> 3. A tag dictionary (used within a backoff smoothing strategy) for labels that are not covered in the labeled set.</Paragraph>
<Paragraph position="3"> The pair of probabilistic models can be exploited to bootstrap new information from unlabeled data. Since the outputs of these two models ultimately have to agree with each other, we can utilize an iterative method called Co-Training that attempts to increase the agreement between a pair of statistical models by exploiting mutual constraints between their outputs.</Paragraph>
<Paragraph position="4"> Co-Training has been used before in applications like word-sense disambiguation (Yarowsky, 1995), web-page classification (Blum and Mitchell, 1998) and named-entity identification (Collins and Singer, 1999). In all of these cases, using unlabeled data has resulted in performance that rivals training solely from labeled data. However, these previous approaches dealt with tasks that involved identifying the right label from a small set of labels (typically 2-3), and in a relatively small parameter space. Compared to these earlier models, a statistical parser has a very large parameter space, and the labels that are expected as output are parse trees which have to be built up recursively. We discuss previous work in combining labeled and unlabeled data in more detail in Section 7.</Paragraph>
<Paragraph position="5"> Co-training (Blum and Mitchell, 1998; Yarowsky, 1995) can be informally described in the following manner: • Pick two (or more) "views" of a classification problem. • Build separate models for each of these "views" and train each model on a small set of labeled data.</Paragraph>
<Paragraph position="6"> • Sample an unlabeled data set to find examples that each model independently labels with high confidence (Nigam and Ghani, 2000). • Confidently labeled examples can be picked in various ways
(Collins and Singer, 1999; Goldman and Zhou, 2000). • Take these examples as additional training examples and iterate this procedure until the unlabeled data is exhausted.</Paragraph>
<Paragraph position="7"> Effectively, by picking confidently labeled data from each model to add to the training data, one model is labeling data for the other model.</Paragraph>
<Section position="1" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 3.1 Lexicalized Grammars and Mutual Constraints </SectionTitle>
<Paragraph position="0"> In the representation we use, parsing with a lexicalized grammar is done in two steps: 1. Assigning a set of lexicalized structures to each word in the input sentence (as shown in Figure 2). 2. Finding the correct attachments between these structures to get the best parse (as shown in Figure 1).</Paragraph>
<Paragraph position="1"> Each of these two steps involves ambiguity which can be resolved using a statistical model. By explicitly representing these two steps independently, we can pursue independent statistical models for each step: 1. Each word in the sentence can take many different lexicalized structures. We can introduce a statistical model that disambiguates the lexicalized structure assigned to a word depending on the local context. 2. After each word is assigned a certain set of lexicalized structures, finding the right parse tree involves computing the correct attachments between these lexicalized structures. Disambiguating attachments correctly using an appropriate statistical model is essential to finding the right parse tree.</Paragraph>
<Paragraph position="2"> These two models have to agree with each other on the trees assigned to each word in the sentence. Not only do the right trees have to be assigned as predicted by the first model, but they also have to fit together to cover the entire sentence as predicted by the second model. This represents the mutual constraint that each model places on the other.</Paragraph>
</Section>
<Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Tag Dictionaries </SectionTitle>
<Paragraph position="0"> For the words that appear in the (unlabeled) training data, we collect a list of part-of-speech labels and trees that each word is known to select in the training data. This information is stored in a POS tag dictionary and a tree dictionary. It is important to note that no frequency or any other distributional information is stored. The only information stored in the dictionary is which tags or trees can be selected by each word in the training data.</Paragraph>
<Paragraph position="1"> We use a count cutoff for trees in the labeled data and combine the counts of trees below the cutoff into an unobserved tree count.</Paragraph>
<Paragraph position="2"> This is similar to the usual technique of assigning the token unknown to infrequent word tokens. In this way, trees that are unseen in the labeled data but present in the tag dictionary are assigned a probability in the parser.</Paragraph>
<Paragraph position="3"> The problem of lexical coverage is a severe one for unsupervised approaches. The use of tag dictionaries is a way around this problem.</Paragraph>
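As a concrete illustration of the dictionaries just described, the sketch below (a hypothetical rendering, not the paper's code) records only set membership per word and pools trees below a count cutoff into an unknown-tree token; the cutoff value, the token and tree names, and the corpus format are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Illustrative sketch of building the POS-tag and tree dictionaries of
# Section 3.2. Only membership is stored in the dictionaries; counts are
# used solely to decide which trees fall below the cutoff.

UNKNOWN_TREE = "t_unknown"   # assumed name for the pooled infrequent-tree token
TREE_CUTOFF = 2              # assumed cutoff value for this example

def build_dictionaries(tagged_corpus):
    """tagged_corpus: iterable of (word, pos_tag, elementary_tree) triples."""
    triples = list(tagged_corpus)
    tree_counts = Counter(tree for _, _, tree in triples)
    pos_dict, tree_dict = defaultdict(set), defaultdict(set)
    for word, pos, tree in triples:
        pos_dict[word].add(pos)
        if tree_counts[tree] < TREE_CUTOFF:
            tree = UNKNOWN_TREE   # pool rare trees, like the 'unknown' word token
        tree_dict[word].add(tree)
    return pos_dict, tree_dict

corpus = [("join", "VB", "t_nx0Vnx1"), ("joins", "VBZ", "t_nx0Vnx1"),
          ("board", "NN", "t_NXN"), ("join", "VB", "t_rare_tree")]
pos_dict, tree_dict = build_dictionaries(corpus)
print(sorted(tree_dict["join"]))   # ['t_nx0Vnx1', 't_unknown']
print(sorted(pos_dict["join"]))    # ['VB']
```

Keeping only set membership in the dictionaries matches the point made above that no frequency or other distributional information is stored.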
<Paragraph position="4"> Such an approach has already been used for unsupervised part-of-speech tagging in (Brill, 1997), where seed data specifying which POS tags can be selected by each word is given as input to the unsupervised tagger. See Section 7 for a discussion of the relation of this approach to that of SuperTagging (Srinivas, 1997). In future work, it would be interesting to apply models of unknown-word handling, or other machine learning techniques such as clustering or the learning of subcategorization frames, to the creation of such tag dictionaries.</Paragraph>
</Section> </Section>
<Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Models </SectionTitle>
<Paragraph position="0"> As described before, we treat parsing as a two-step process. The two models that we use are: 1. H1: selects trees based on previous context (tagging probability model) 2. H2: computes attachments between trees and returns the best parse (parsing probability model)</Paragraph>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 H1: Tagging probability model </SectionTitle>
<Paragraph position="0"> We select the most likely trees for each word by examining the local context. The statistical model we use to decide this is the trigram model that was used by B. Srinivas in his SuperTagging model (Srinivas, 1997). The model assigns an n-best lattice of tree assignments to the input sentence, with each path corresponding to an assignment of an elementary tree to each word in the sentence (for further details, see (Srinivas, 1997)).</Paragraph>
<Paragraph position="2"> T^ = argmax_T P(T | W) (1) = argmax_T P(W | T) P(T) / P(W) (2) ≈ argmax_T ∏_i P(w_i | t_i) P(t_i | t_{i-1}, t_{i-2}) (3), where T = t_1 ... t_n is a sequence of elementary trees assigned to the sentence W = w_1 ... w_n.</Paragraph>
<Paragraph position="4"> We get (2) by using Bayes theorem, and we obtain (3) from (2) by ignoring the denominator and applying the usual Markov assumptions.</Paragraph>
<Paragraph position="5"> The output of this model is a probabilistic ranking of trees for the input sentence which is sensitive to a small local context window.</Paragraph>
</Section>
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 H2: Parsing probability model </SectionTitle>
<Paragraph position="0"> Once the words in a sentence have selected a set of elementary trees, parsing is the process of attaching these trees together to give us a consistent bracketing of the sentence. Notation: let τ stand for an elementary tree which is lexicalized by a word w and a part-of-speech tag p.</Paragraph>
<Paragraph position="1"> Let P_init (introduced earlier in Section 2.1) stand for the probability of a tree being the root of a derivation tree, defined as follows:</Paragraph>
<Paragraph position="3"> including lexical information, this is written as:</Paragraph>
<Paragraph position="5"> where the variable top indicates that τ is the tree that begins the current derivation. There is a useful approximation:</Paragraph>
<Paragraph position="7"> where N is the number of bracketing labels and α is a constant used to smooth zero counts.</Paragraph>
<Paragraph position="8"> including lexical information, this is written as:</Paragraph>
<Paragraph position="10"> We decompose (8) into the following components:</Paragraph>
<Paragraph position="12"> We do a similar decomposition for (9).</Paragraph>
<Paragraph position="13"> For each of the equations above, we use a backoff model to handle sparse data problems. We compute the backoff model as follows: we further smooth probabilities (10), (11) and (12).
We use (10) as an example; the other two are handled in the same way.</Paragraph>
<Paragraph position="14"> where k is the diversity of adjunction, that is, the number of different trees that can attach at that node, and T is the set of all trees τ' that can possibly attach at node η in tree τ.</Paragraph>
<Paragraph position="15"> For our experiments, the value of α is set to</Paragraph>
</Section> </Section>
<Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Co-Training algorithm </SectionTitle>
<Paragraph position="0"> We are now in a position to describe the Co-Training algorithm, which combines the models described in Section 4.1 and in Section 4.2 in order to iteratively label a large pool of unlabeled data.</Paragraph>
<Paragraph position="1"> We use the following datasets in the algorithm: labeled: a set of sentences bracketed with the correct parse trees.</Paragraph>
<Paragraph position="2"> cache: a small pool of sentences which is the focus of each iteration of the Co-Training algorithm.</Paragraph>
<Paragraph position="3"> unlabeled: a large set of unlabeled sentences. The only information we collect from this set of sentences is a tree dictionary (tree-dict) and a part-of-speech dictionary (pos-dict). Construction of these dictionaries is covered in Section 3.2.</Paragraph>
<Paragraph position="4"> In addition to the above datasets, we also use the usual development test set (termed dev in this paper), and a test set (called test) which is used to evaluate the bracketing accuracy of the parser.</Paragraph>
<Paragraph position="5"> The Co-Training algorithm consists of the following steps, which are repeated iteratively until all the sentences in the set unlabeled are exhausted.</Paragraph>
<Paragraph position="6"> 1. Input: labeled and unlabeled 2. Update cache: • Randomly select sentences from unlabeled and refill cache • If cache is empty, exit 3. Train models H1 and H2 using labeled 4. Apply H1 and H2 to cache.</Paragraph>
<Paragraph position="7"> 5. Pick the most probable n from H1 (run through H2) and add to labeled.</Paragraph>
<Paragraph position="8"> 6. Pick the most probable n from H2 and add to labeled. 7. n = n + k; go to Step 2. For the experiment reported here, n = 10, and k was set to be n in each iteration. We ran the algorithm for 12 iterations (covering 20480 of the sentences in unlabeled) and then added the best parses for all the remaining sentences.</Paragraph>
</Section>
<Section position="7" start_page="2" end_page="2" type="metho"> <SectionTitle> 6 Experiment </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 6.1 Setup </SectionTitle>
<Paragraph position="0"> The experiments we report were done on the Penn Treebank WSJ Corpus (Marcus et al., 1993). The various settings for the Co-Training algorithm (from Section 5) are as follows: While it might seem expensive to run the parser over the cache multiple times, we put the pruning capabilities of the parser to good use here. During the iterations we set the beam size to a value which is likely to prune out all derivations for a large portion of the cache except the most likely ones. This allows the parser to run faster, avoiding the usual problem of running an iterative algorithm over thousands of sentences. In the initial runs we also limit the length of the sentences entered into the cache, because shorter sentences are more likely to beat out the longer sentences in any case.
The beam size is reset when running the parser on the test data to allow the parser a better chance at finding the most likely parse.</Paragraph>
</Section>
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 6.2 Results </SectionTitle>
<Paragraph position="0"> We scored the output of the parser on Section 23 of the Wall Street Journal Penn Treebank. The following are some aspects of the scoring that might be useful for comparison with other results: No punctuation is scored, including sentence-final punctuation. Empty elements are not scored. We used EVALB (written by Satoshi Sekine and Michael Collins), which scores based on PARSEVAL (Black et al., 1991), with the standard parameter file (as per standard practice, part-of-speech brackets were not part of the evaluation). Also, we used Adwait Ratnaparkhi's part-of-speech tagger (Ratnaparkhi, 1996) to tag unknown words in the test data.</Paragraph>
<Paragraph position="1"> We obtained 80.02% and 79.64% labeled bracketing precision and recall respectively (as defined in (Black et al., 1991)). The baseline model, which was trained only on the 9695 sentences of labeled data, performed at 72.23% and 69.12% precision and recall. These results show that training a statistical parser using our Co-Training method to combine labeled and unlabeled data strongly outperforms training only on the labeled data.</Paragraph>
<Paragraph position="2"> It is important to note that, unlike previous studies, our method of moving towards unsupervised parsing is directly compared to the output of supervised parsers.</Paragraph>
<Paragraph position="3"> Certain differences in the applicability of the usual methods of smoothing to our parser cause the lower accuracy compared to other state-of-the-art statistical parsers. However, we have consistently seen an increase in performance when using the Co-Training method over the baseline across several trials. It should be emphasised that this result is based on less than 20% of the data that is usually used by other parsers. We are experimenting with the use of an even smaller set of labeled data to investigate the learning curve.</Paragraph>
</Section> </Section>
<Section position="8" start_page="2" end_page="2" type="metho"> <SectionTitle> 7 Previous Work: Combining Labeled and </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Unlabeled Data </SectionTitle>
<Paragraph position="0"> The two-step procedure used in our Co-Training method for statistical parsing was incipient in the SuperTagger (Srinivas, 1997), which is a statistical model for tagging sentences with elementary lexicalized structures.</Paragraph>
<Paragraph position="1"> This was particularly so in the Lightweight Dependency Analyzer (LDA), which used shortest-attachment heuristics after an initial SuperTagging stage to find syntactic dependencies between words in a sentence. However, there was no statistical model for attachments, and the notion of mutual constraints between these two steps was not exploited in this work.</Paragraph>
<Paragraph position="2"> Previous studies in unsupervised methods for parsing have concentrated on the use of the inside-outside algorithm (Lari and Young, 1990; Carroll and Rooth, 1998).
However, there are several limitations of the inside-outside algorithm for unsupervised parsing; see (Marcken, 1995) for some experiments that draw out the mismatch between minimizing error rate and iteratively increasing the likelihood of the corpus. Other approaches have tried to move away from phrase-structure representations towards dependency-style parsing (Lafferty et al., 1992; Fong and Wu, 1996). However, there are still inherent computational limitations due to the vast search space (see (Pietra et al., 1994) for discussion). None of these approaches can even be realistically compared to supervised parsers that are trained and tested on the kind of representations and the complexity of sentences that are found in the Penn Treebank.</Paragraph>
<Paragraph position="3"> (Chelba and Jelinek, 1998) combine unlabeled and labeled data for parsing with a view towards language modeling applications. The goal in their work is not to get the right bracketing or dependencies but to reduce the word error rate in a speech recognizer.</Paragraph>
<Paragraph position="4"> Our approach is closely related to previous Co-Training methods (Yarowsky, 1995; Blum and Mitchell, 1998; Goldman and Zhou, 2000; Collins and Singer, 1999). (Yarowsky, 1995) first introduced an iterative method for increasing a small set of seed data used to disambiguate dual word senses by exploiting the constraint that in a segment of discourse only one sense of a word is used. This use of unlabeled data improved performance of the disambiguator above that of purely supervised methods. (Blum and Mitchell, 1998) further embellished this approach and gave it the name Co-Training. Their definition of Co-Training includes the notion (exploited in this paper) that different models can constrain each other by exploiting different 'views' of the data. They also prove some PAC results on learnability.</Paragraph>
<Paragraph position="5"> In addition, they discuss an application of their method of mutually constrained models to the classification of web pages.</Paragraph>
<Paragraph position="6"> (Collins and Singer, 1999) further extend the use of classifiers that have mutual constraints by adding terms to AdaBoost which force the classifiers to agree (called Co-Boosting). (Goldman and Zhou, 2000) provide a variant of Co-Training which is suited to the learning of decision trees, where the data is split up into different equivalence classes for each of the models and hypothesis testing is used to determine the agreement between the models. In future work we would like to investigate whether some of these ideas could be incorporated into our model.</Paragraph>
<Paragraph position="7"> In future work we would like to explore the use of the entire 1M words of the WSJ Penn Treebank as our labeled data and to use a larger set of unbracketed WSJ data as input to the Co-Training algorithm. In addition, we plan to explore the following points that bear on understanding the nature of the Co-Training learning algorithm: • The contribution of the dictionary of trees extracted from the unlabeled set is an issue that we would like to explore in future experiments. Ideally, we wish to design a co-training method where no such information is used from the unlabeled set.</Paragraph>
<Paragraph position="8"> • The relationship between co-training and EM bears investigation. (Nigam and Ghani, 2000) is a study which tries to separate two factors: (1) the gradient descent aspect of EM vs. the iterative nature of co-training, and (2) the generative model used in EM vs.
the conditional independence between the features used by the two models that is exploited in co-training. Also, EM has been used successfully in text classification combining labeled and unlabeled data (see (Nigam et al., 1999)).</Paragraph>
<Paragraph position="9"> • In our experiments, unlike (Blum and Mitchell, 1998), we do not balance the label priors when picking new labeled examples for addition to the training data. One way to address this in our algorithm would be to incorporate some form of sample selection (or active learning) into the selection of examples that are considered to be labeled with high confidence (Hwa, 2000).</Paragraph>
</Section> </Section> </Paper>