<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1070"> <Title>Using Machine Learning Techniques to Interpret WH-questions</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> Our report on the predictive performance of the decision trees considers the effect of various training and testing factors on predictive performance, and examines the relationships among the target variables.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Training Factors </SectionTitle> <Paragraph position="0"> We examine how the quality of the training data and the size of the training set affect predictive performance.</Paragraph> <Paragraph position="1"> Quality of the data. In our context, the quality of the training data is determined by the wording of the queries and the output of the parser. For each query, the tagger could indicate whether it was a BAD QUERY or whether a WRONG PARSE had been produced. A BAD QUERY is incoherent or articulated in such a way that the parser generates a WRONG PARSE, e.g., &quot;When its hot it expand?&quot;. Figure 1 shows the predictive performance of the decision trees built for two training sets: All5145 and Good4617. The first set contains 5145 queries, while the second set contains a subset of the first set comprised of &quot;good&quot; queries only (i.e., bad queries and queries with wrong parses were excluded). In both cases, the same 1291 queries were used for testing. As a baseline measure, we also show the predictive ac- null curacy of using the maximum prior probability to predict each target variable. These prior probabilities were obtained from the training set All5145. The Information Need with the highest prior probability is IDentification, the highest Coverage Asked is Precise, while the highest Coverage Would Give is Additional; NOUN contains the most common Topic; the most common Focus and Restriction are NONE; and LIST is almost always False. As seen in Figure 1, the prior probabilities yield a high predictive accuracy for Restriction and LIST. However, for the other target variables, the performance obtained using decision trees is substantially better than that obtained using prior probabilities. Further, the predictive performance obtained for the set Good4617 is only slightly better than that obtained for the set All5145. However, since the set of good queries is 10% smaller, it is considered a better option.</Paragraph> <Paragraph position="2"> Size of the training set. The effect of the size of the training set on predictive performance was assessed by considering four sizes of training/test sets: Small, Medium, Large, and X-Large. Table 3 shows the number of training and test queries for each set size for the &quot;all queries&quot; and sets (1679, 2389, 3381 and 4617) - Good queries The predictive performance for the all-queries and good-queries sets is shown in Figures 2 and 3 respectively. Figure 2 depicts the average of the results obtained over five runs, while Figure 3 shows the results of a single run (similar results were obtained from other runs performed with the good-queries sets). 
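<Paragraph> To make the baseline and learning-curve computation concrete, the following is a minimal sketch, not the authors' implementation: scikit-learn decision trees stand in for the learner used in the paper, and random placeholder features stand in for the attributes extracted from the parsed queries (all variable names are hypothetical).

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_feats = 20                                   # hypothetical feature count
sizes = [1679, 2389, 3381, 4617]               # good-queries training-set sizes
X_all = rng.integers(0, 2, (4617, n_feats))    # placeholder query features
y_all = rng.choice(["ID", "Attribute", "Process"], 4617)   # e.g., Information Need
X_test = rng.integers(0, 2, (1291, n_feats))   # the 1291 test queries
y_test = rng.choice(["ID", "Attribute", "Process"], 1291)

# Baseline: always predict the class with the maximum prior probability,
# estimated from the training set.
majority = Counter(y_all).most_common(1)[0][0]
print("baseline accuracy:", np.mean(y_test == majority))

# Learning curve: retrain on increasingly large training sets and retest.
for n in sizes:
    tree = DecisionTreeClassifier(random_state=0).fit(X_all[:n], y_all[:n])
    print(n, "tree accuracy:", tree.score(X_test, y_test))

In the real experiment this loop would be run once per target variable (and, for the all-queries sets, averaged over five runs).</Paragraph>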
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.2 Testing Factors </SectionTitle>
<Paragraph position="0"> We examine the effect of two factors on the predictive performance of our models: (1) query length (measured in number of words), and (2) information need (as recorded by the tagger). These effects were studied with respect to the predictions generated by the decision trees obtained from the set Good4617, which had the best performance. Query length. The queries in the test set were divided into four length categories (measured in number of words): up to 4 words, 5-7 words, 8-10 words, and 11 or more words. Figure 4 displays the distribution of queries in the test set according to these length categories. According to this distribution, over 90% of the queries have fewer than 11 words.</Paragraph>
<Paragraph position="1"> The predictive performance of our decision trees broken down by query length is shown in Figure 5. As shown in this chart, for all target variables there is a downward trend in predictive accuracy as query length increases. Still, for queries of fewer than 11 words and all target variables except Topic, the predictive accuracy remains over 74%. In contrast, the Topic predictions drop from 88% (for queries of fewer than 5 words) to 57% (for queries of 8, 9 or 10 words). Further, the predictive accuracy for Information Need, Topic, Focus and Restriction drops substantially for queries that have 11 words or more. This drop in predictive performance may be explained by two factors. For one, the majority of the training data consists of shorter questions; hence, the applicability of the inferred models to longer questions may be limited. Also, longer questions may exacerbate errors associated with some of the independence assumptions implicit in our current model. Information need. Figure 6 displays the distribution of the queries in the test set according to Information Need. The five most common Information Need categories are IDentification, Attribute, Topic Itself, Intersection and Process, jointly accounting for over 94% of the queries. Figure 7 displays the predictive performance of our models for these five categories. The best performance is exhibited for the IDentification and Topic Itself queries. In contrast, the lowest predictive accuracy was obtained for the Information Need, Topic and Restriction of Intersection queries.</Paragraph>
<Paragraph position="2"> This can be explained by the observation that Intersection queries tend to be the longest queries (as seen above, predictive accuracy drops for long queries). The relatively low predictive accuracy obtained for both types of Coverage for Process queries remains to be explained.</Paragraph>
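<Paragraph> The per-length breakdown in Figure 5 amounts to bucketing the test queries by word count and scoring each bucket separately. The following is a minimal sketch of that computation (hypothetical function names; the bucket boundaries are the four categories given above):

def length_bucket(query):
    # Assign a query to one of the four length categories used in Figure 5.
    n = len(query.split())
    if n <= 4:
        return "up to 4"
    if n <= 7:
        return "5-7"
    if n <= 10:
        return "8-10"
    return "11 or more"

def accuracy_by_length(queries, y_true, y_pred):
    # Maps each bucket to [number correct, number of queries].
    tallies = {}
    for query, actual, predicted in zip(queries, y_true, y_pred):
        t = tallies.setdefault(length_bucket(query), [0, 0])
        t[0] += int(actual == predicted)
        t[1] += 1
    return {bucket: correct / total for bucket, (correct, total) in tallies.items()}

For example, accuracy_by_length([&quot;When its hot it expand?&quot;], [&quot;NONE&quot;], [&quot;NONE&quot;]) returns {&quot;5-7&quot;: 1.0}, since that query has five words.</Paragraph>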
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.3 Relations between target variables </SectionTitle>
<Paragraph position="0"> To determine whether the states of our target variables affect each other, we built three prediction models, each of which incorporates six target variables as inputs for predicting the remaining variable. For instance, Information Need, Coverage Asked, Coverage Would Give, Topic, Restriction and LIST are incorporated as data (in addition to the observable variables) when training a model that predicts Focus. Our three models are: PredictionOnly, which uses the predicted values of the six target variables for both the training set and the test set; Mixed, which uses the actual values of the six target variables for the training set and their predicted values for the test set; and PerfectInformation, which uses the actual values of the six target variables for both training and testing. The PerfectInformation model enables us to determine the performance boundaries of our methodology in light of the currently observed attributes.</Paragraph>
<Paragraph position="1"> Figure 8 shows the predictive accuracy of five models: the above three models, our best model so far (obtained from the training set Good4617), denoted BestResult, and prior probabilities.</Paragraph>
<Paragraph position="2"> As expected, the PerfectInformation model has the best performance. However, its predictive accuracy is relatively low for Topic and Focus, suggesting some inherent limitations of our methodology. The performance of the PredictionOnly model is comparable to that of BestResult, but the performance of the Mixed model seems slightly worse. This difference in performance may be attributed to the fact that the PredictionOnly model is a &quot;smoothed&quot; version of the Mixed model. That is, the PredictionOnly model uses a consistent version of the target variables (i.e., predicted values) both for training and testing. This is not the case for the Mixed model, where actual values are used for training (the trained Mixed model is thus identical to the PerfectInformation model), but predicted values (which are not always accurate) are used for testing.</Paragraph>
<Paragraph position="3"> Finally, Information Need features prominently in both the PerfectInformation/Mixed model and the PredictionOnly model, being used in the first or second split of most of the decision trees for the other target variables. Also, as expected, Coverage Asked is used to predict Coverage Would Give and vice versa. These results suggest using modeling techniques that can take advantage of dependencies among target variables. Such techniques would enable the construction of models that take into account the distribution of the predicted values of one or more target variables when predicting another target variable.</Paragraph>
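<Paragraph> The three models differ only in which version of the six other target variables is appended to the observable attributes. A minimal sketch of that feature construction (hypothetical names, not the authors' code):

import numpy as np

def augment(X_obs, actual, predicted, model, training):
    # Append the six other target variables to the observable features.
    # PerfectInformation: actual values for both training and testing.
    # PredictionOnly: predicted values for both training and testing.
    # Mixed: actual values at training time, predicted values at test time.
    if model == "PerfectInformation":
        extra = actual
    elif model == "PredictionOnly":
        extra = predicted
    else:  # "Mixed"
        extra = actual if training else predicted
    return np.hstack([X_obs, extra])

# Example for one query: two observable features, one extra target variable.
X_obs = np.array([[1, 0]])
actual = np.array([[2]])      # tagged value of, e.g., Information Need
predicted = np.array([[3]])   # decision-tree prediction for the same variable
print(augment(X_obs, actual, predicted, "Mixed", training=False))  # [[1 0 3]]

Under this construction the Mixed and PerfectInformation models share the same trained trees and differ only in their test-time inputs, which is why the discussion above refers to the PerfectInformation/Mixed model jointly.</Paragraph>
</Section> </Section> </Paper>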