<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1005"> <Title>A Statistical Decision Making Method: A Case Study on Prepositional Phrase Attachment*</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> Some of the earlier work on PPA exhibits aspects of the model switching framework. For example, (Brill and Resnik, 1994) ordered rules to minimize the error rate in PPA classification. Each of these inference rules may be considered a decision function in a decision list. Whenever a higher-order rule fails, control switches to the next rule to classify that test instance. (Collins and Brooks, 1995) ordered heuristic decision functions by complexity (arity) and classified test instances with the most complex applicable function.</Paragraph> <Paragraph position="1"> Non-recursive Model Switching consists of two phases: (1) ordering the available models (e.g., via leave-one-out cross-validation), and (2) applying the model at the top of the list to the test data; whenever that model does not yield an estimate, the system switches to the next model on the list.</Paragraph> <Paragraph position="2"> The first phase corresponds to the learning phase of a learning system, whereas the second can be conceptualized as a decision list (Rivest, 1987; Kohavi and Benson, 1993), where control is conditioned on the availability of a direct estimate for the test instance under a given model. (This relevance of decision lists was indicated by Mike Collins in our personal discussions.) In the recursive version of Model Switching, however, the model list changes dynamically, since the two phases above are placed within a loop: in each iteration, all instances of the available data are considered for classification, and those that are classified are excluded from the data for the next iteration. The base case of the recursion is reached when all instances are classified.</Paragraph> <Paragraph position="3"> Although in this work we suggest a precision-driven model ordering scheme, the Model Switching method allows any other utility function, such as accuracy or F-measure. Other utility functions need not be acquired through cross-validation at all, but can instead be computed over the entire training set, as in statistical significance analysis (e.g., G², Pearson's χ²) or information criteria (e.g., the Akaike or Bayesian Information Criterion).</Paragraph> <Paragraph position="4"> An advantage of this method is that it makes use of a complex and powerful set of models. Much of the earlier PPA research was confined to single-clique models, such as ABCD or AB, which are a small subset of decomposable models.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Quantitative Analysis </SectionTitle> <Paragraph position="0"> Statistical (decomposable) model selection techniques were first applied to NLP problems by (Bruce and Wiebe, 1994). Those model selection techniques aim to find a single best model, but on their own they do not perform as well as Model Switching: even the most accurate decomposable model, AB.AD, had a classification accuracy of only 77%.</Paragraph> <Paragraph position="1"> Unlike Model Switching, the methods suggested in earlier PPA work are usually tailored to the PPA problem, so they are hard to transfer to other domains (a sketch of the switching procedure itself is given below).</Paragraph>
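<Paragraph> To make the two-phase procedure above concrete, the following is a minimal sketch of non-recursive Model Switching, an illustration under simplifying assumptions rather than the authors' implementation: each model is reduced to a tuple of feature indices whose joint counts supply direct estimates (the paper's decomposable models involve richer clique structure), and the ordering utility is leave-one-out precision. All names (train_counts, loo_precision, model_switch_classify) are ours.

    from collections import Counter, defaultdict

    def train_counts(models, data):
        """Collect, for each model, class counts per observed feature tuple."""
        counts = {m: defaultdict(Counter) for m in models}
        for x, y in data:
            for m in models:
                counts[m][tuple(x[i] for i in m)][y] += 1
        return counts

    def loo_precision(model, data):
        """Phase 1 utility: leave-one-out precision of a single model, i.e.,
        among held-out instances the model can classify, the fraction correct."""
        counts = train_counts([model], data)[model]
        right = tried = 0
        for x, y in data:
            dist = counts[tuple(x[i] for i in model)] - Counter({y: 1})  # hold out
            if dist:                       # model yields a direct estimate
                tried += 1
                right += dist.most_common(1)[0][0] == y
        return right / tried if tried else 0.0

    def model_switch_classify(ordered_models, counts, x, default):
        """Phase 2: apply the top model; on failure, switch to the next one."""
        for m in ordered_models:
            dist = counts[m].get(tuple(x[i] for i in m))
            if dist:
                return dist.most_common(1)[0][0]
        return default                     # no model yields an estimate

    # Phase 1: order the candidate models by decreasing leave-one-out precision.
    # ordered = sorted(models, key=lambda m: loo_precision(m, data), reverse=True)
</Paragraph>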
<Paragraph> Returning to the comparison: neither Naive Bayes nor conventional machine learning tools such as CN2, C4.5, and PEBLS perform as well as Model Switching. These four classifiers are well known and fairly diverse: Naive Bayes is a simple Bayesian approach, CN2 is based on rule induction, C4.5 on decision trees, and PEBLS on the nearest-neighbor method. A performance comparison of various classifiers with MS1 is given in Table 4. Such comparisons between systems proposed for resolving PPA ambiguity and general machine learning systems were neglected in earlier articles on the PPA problem. ((Ratnaparkhi et al., 1994) reported a decision-tree experiment using mutual information, with 77.7% accuracy.) The results of the first five classifiers presented in Table 4 and the performance of the B&R classifier on the IBM data were determined as part of this study, while the other four results are benchmarks quoted from the authors cited above. Those benchmarks were produced via single trials; hence, we performed single-trial tests as well. The CN2, C4.5, and PEBLS results are based on their default settings.</Paragraph> <Paragraph position="2"> The only exception involved CN2, where an ordered induced-rule list was used instead of an unordered one, since ordered rules yield 99.7% accuracy versus 90.8% for unordered rules on the IBM training data. After the test, we checked the accuracy rates of the unordered induced rules, which were unexpectedly better than the ordered ones: 78% on B&R data and 76.2% on IBM data. Naive Bayes' recall values are very low: 74% for IBM data and 78% for B&R data; the remaining test instances are therefore classified as the most frequent class.</Paragraph> <Paragraph position="3"> Notice that this is also a type of model switching, where the forms of the models and the model list M = (AB.AC.AD.AE, A) are predetermined, as done by (Collins and Brooks, 1995).</Paragraph> [Table 4 caption fragment: B&R: (data/classifier) by (Brill and Resnik, 1994); IBM: (data/classifier) by (Ratnaparkhi et al., 1994); Bayes: Naive Bayes with defaults, i.e., M = (AB.AC.AD.AE, A).] <Paragraph position="4"> The performance differences between MS1 and C&B, the Back-off Model of (Collins and Brooks, 1995), are 0.4% on the IBM data and 0.7% on the B&R data. With only two test trials and no deviation measure, these differences cannot be considered significant, especially since classifier performance fluctuates by 2-3% (e.g., C&B accuracy deviates by 2.2%) across two very similar data sets, the B&R and IBM data. As one anonymous reviewer indicated, the 0.7% accuracy difference on the B&R data must be evaluated cautiously because of the size of the B&R test set, which contains only 500 test instances, whereas the IBM data contains 3,097 test instances.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Qualitative Analysis </SectionTitle> <Paragraph position="0"> The approach of (Collins and Brooks, 1995) is simpler than MS1, since it has no learning phase: the models were selected and grouped by its designers and ordered heuristically, which means that classification requires prior knowledge specific to the domain (an illustrative sketch of a predetermined model list is given below).</Paragraph>
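<Paragraph> As an illustration of a predetermined model list, here is a minimal sketch of the Naive Bayes fallback noted in Section 5.1, i.e., M = (AB.AC.AD.AE, A): when the unsmoothed Naive Bayes product yields no estimate, the classifier switches to the class prior (model A, the most frequent class). The class and method names are ours, and the unsmoothed estimates are a simplification.

    from collections import Counter, defaultdict

    class NBWithPrior:
        """M = (AB.AC.AD.AE, A): Naive Bayes over features B..E with a
        switch to the marginal model A when no estimate is available."""

        def fit(self, X, y):
            self.prior = Counter(y)                     # model A: class counts
            self.cond = [defaultdict(Counter) for _ in X[0]]
            for xs, c in zip(X, y):
                for i, v in enumerate(xs):
                    self.cond[i][v][c] += 1
            return self

        def predict(self, xs):
            n = sum(self.prior.values())
            scores = {}
            for c, nc in self.prior.items():
                p = nc / n                              # P(class)
                for i, v in enumerate(xs):
                    p *= self.cond[i].get(v, Counter())[c] / nc   # unsmoothed
                scores[c] = p
            best = max(scores, key=scores.get)
            if scores[best] > 0:             # AB.AC.AD.AE yields an estimate
                return best
            return self.prior.most_common(1)[0][0]      # switch to model A
</Paragraph>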
<Paragraph> With human expertise involved, that list of models is simpler and shorter than the list found by MS1, and it is heuristically grouped and weighted (forming a kind of mixture model), which is not the case for MS1 at this time; nevertheless, MS1 reached a performance level competitive with a system supported by human expertise.</Paragraph> <Paragraph position="1"> MS1 uses neither lexical information nor heuristics specific to the PPA problem; hence, it can be adopted and applied to any other classification problem involving categorical data. MS1 is a machine learning alternative to the system developed by (Collins and Brooks, 1995), and the ordering of the models that it produces may provide insight into the data that could aid in developing a custom mixture model.</Paragraph> <Paragraph position="2"> Unlike the other techniques, MS1 generates an ordered list of models, where each model provides a graphical representation of the interdependencies among variables. The user can identify relevant relations and see which features play the most significant roles; thus, one can not only predict the outcome of a classification problem with high accuracy but also gain insight into the nature of the domain and the data under investigation. For example, MS1 identified that the preposition feature (variable D) is so important that all test instances (except the last four) were predicted by models containing this variable. This was one of the most important heuristic steps in formulating the approach used by (Collins and Brooks, 1995). Further analysis of the model list by linguists may yield other observations; for example, in the first 75% of the predictions, 97% of the test instances were classified by models containing the interaction ABD, with a precision of 86%, whereas in the remaining predictions this interaction was not useful. Similar model lists can be generated on various corpora, and their comparison may reveal differences among those corpora.</Paragraph> <Paragraph position="3"> MS1 and the systems by (Ratnaparkhi et al., 1994) and (Brill and Resnik, 1994) include a training phase in which they form structures (such as rules or models) that are used together with the available statistics to classify test instances; these systems can therefore be considered true learning systems. On the other hand, in the systems designed by (Hindle and Rooth, 1993), (Collins and Brooks, 1995), and (Franz, 1996), the forms of the models were predetermined by their designers, as in the Naive Bayes approach.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Scalability </SectionTitle> <Paragraph position="0"> The structure of the underlying PPA data (4) poses a difficult problem for a learning system. As the number of observations grows, the numbers of levels of the features (except that of the preposition, which is limited by the grammar) grow proportionally. This effect was first identified by (Zipf, 1935). Because of it, the number of cells in contingency table representations explodes, which corresponds to exponential growth of the search space.</Paragraph> <Paragraph position="1"> The three general machine learning systems cited above require very large main memory to run on the PPA data, which calls their scalability into question (a back-of-the-envelope illustration of the cell explosion follows).</Paragraph>
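<Paragraph> To make the cell explosion concrete, the following arithmetic sketch is illustrative only: the vocabulary size V and the preposition count P are assumptions, not figures from the paper.

    # With three lexical features (B, C, E) each taking roughly V observed
    # values, the preposition D taking roughly P values, and the binary
    # attachment class A, the full contingency table has V**3 * P * 2 cells.
    def table_cells(V, P=70, classes=2):
        return V ** 3 * P * classes

    for V in (1_000, 10_000, 100_000):
        print(f"V = {V:>7,} -> {table_cells(V):.1e} cells")

Even at V = 10,000 the table has on the order of 10^14 cells, which explains why explicit contingency-table representations quickly exhaust main memory.
</Paragraph>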
<Paragraph> MS1's implementation is based on the assumptions of large data and limited main memory; hence, computation time has been traded for memory requirements. The Model Switching approach is scalable in both computation time and memory: as the data size grows, the leave-one-out cross-validation technique may be replaced by a simpler v-fold cross-validation technique, which is "stable" and preferable for larger data sizes (Breiman et al., 1984).</Paragraph> <Paragraph position="1"> There is always a much simpler choice: ranking the models through statistical significance analysis or through information criteria, whose cost is O(|M|) (a v-fold variant of the ranking step is sketched below).</Paragraph> <Paragraph position="2"> One problem encountered in applying Model Switching to other domains is that the number of decomposable models grows exponentially with the number of variables. The methods of (Edwards and Havránek, 1987) or (Madigan and Raftery, 1994) for selecting a good subset of models for the data resolve this last concern regarding scalability. Using these techniques, the Model Switching method may be applied to other NLP problems with much larger numbers of feature variables. The Model Switching method is currently being applied to word sense disambiguation, which is cast with eight features. The preliminary results are very encouraging and provide evidence for the robustness of the methodology.</Paragraph>
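<Paragraph> A minimal sketch of the v-fold ranking step, complementing the leave-one-out sketch in Section 5.1; the fold construction and function names are our own illustration, not the authors' implementation.

    import random
    from collections import Counter, defaultdict

    def vfold_precision(model, data, v=10, seed=0):
        """Precision of one model (a tuple of feature indices) estimated by
        v-fold cross-validation: among test instances the model can
        classify, the fraction classified correctly."""
        data = data[:]
        random.Random(seed).shuffle(data)
        folds = [data[i::v] for i in range(v)]
        right = tried = 0
        for k in range(v):
            counts = defaultdict(Counter)
            for i, fold in enumerate(folds):
                if i != k:
                    for x, y in fold:
                        counts[tuple(x[j] for j in model)][y] += 1
            for x, y in folds[k]:
                dist = counts.get(tuple(x[j] for j in model))
                if dist:                          # model yields an estimate
                    tried += 1
                    right += dist.most_common(1)[0][0] == y
        return right / tried if tried else 0.0

    # Rank all candidate models once; classification then switches as before:
    # ordered = sorted(models, key=lambda m: vfold_precision(m, data), reverse=True)
</Paragraph> </Section> </Section> </Paper>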