<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1048"> <Title>Sylvain_Delisle@uqtr.uquebec.ca</Title> <Section position="3" start_page="307" end_page="307" type="metho"> <SectionTitle> 2 The existing hand-crafted heuristics </SectionTitle> <Paragraph position="0"> The rule-based parser we used was DIPETT [Delisle 1994]: it is a top-down, depth-first parser, augmented with a few look-ahead mechanisms, which returns the first analysis (parse tree). The fact that our parser produces only a single analysis, the "best" one according to its hand-crafted heuristics, is part of the motivation for this work. When DIPETT is given an input string, it first selects the top-level rules it is to attempt, as well as their ordering.</Paragraph> <Paragraph position="1"> Ideally, the parser would find an optimal order that minimises parsing time and maximises parsing accuracy by first selecting the most promising rules. For example, there is no need to treat a sentence as multiply coordinated or compound when the data contains only one verb. DIPETT has three top-level rules for declarative statements: i) MULT_COOR for multiple (normally, three or more) coordinated sentences; ii) COMPOUND for compound sentences, that is, correlative and simple coordination (of, normally, two sentences); iii) NON_COMPOUND for simple and complex sentences, that is, a single main clause with zero or more subordinate clauses ([Quirk et al. 1985]). To illustrate the data that we worked with and the classes for which we needed the rules, here are two sentences (from the Brown corpus) used in our experiments: "And know, while all this went on, that there was no real reason to suppose that the murderer had been a guest in either hotel." is a non-compound sentence, and "Even I can remember nothing but ruined cellars and tumbled pillars, and nobody has lived there in the memory of any living man." is a compound sentence.</Paragraph> <Paragraph position="2"> The current hand-crafted heuristic ([Delisle 1994]) is based on three parameters, obtained after (non-disambiguating) lexical analysis and before parsing: 1) the number of potential verbs in the data (a "potential" verb may actually turn out to be, say, a noun; only parsing can tell us how such a lexical ambiguity has been resolved, although the ambiguity might disappear if the input were pre-processed by a tagger), 2) the presence of potential coordinators in the data, and 3) verb density, which, roughly speaking, indicates how the potential verbs are distributed. Low density means that the verbs are scattered throughout the input string; high density means that they appear close to each other, as in a conjunction of verbs such as "Verb1 and Verb2 and Verb3". Given these features of the input string, DIPETT's algorithm for top-level rule selection returns an ordered list of up to three of the rules COMPOUND, NON_COMPOUND and MULT_COOR to be attempted when parsing this string. For the purposes of our experiment, we simplified the situation by neglecting the MULT_COOR rule, since it was rarely needed when parsing real-life text. Thus, the original problem went from a 3-class to a 2-class classification problem: COMPOUND or NON_COMPOUND.
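To make the role of these three parameters concrete, here is a minimal sketch of such a rule-ordering heuristic in Python. It is purely illustrative: the function name, the thresholds and the exact decision logic are our own assumptions, not DIPETT's actual (and more elaborate) heuristic.

    # Illustrative sketch only; DIPETT's real heuristic is hand-crafted
    # and more elaborate than this.
    def order_top_level_rules(num_potential_verbs,
                              has_potential_coordinator,
                              high_verb_density):
        """Return the top-level rules to attempt, most promising first."""
        if num_potential_verbs <= 1:
            # A single potential verb cannot yield a compound sentence.
            return ["NON_COMPOUND"]
        if has_potential_coordinator and not high_verb_density:
            # Coordinators plus scattered verbs suggest coordinated clauses.
            return ["COMPOUND", "NON_COMPOUND"]
        # High density ("Verb1 and Verb2 and Verb3") points to coordinated
        # verbs within a single clause rather than coordinated sentences.
        return ["NON_COMPOUND", "COMPOUND"]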
</Paragraph> </Section> <Section position="4" start_page="307" end_page="311" type="metho"> <SectionTitle> 3 Learning rules from sentences </SectionTitle> <Paragraph position="0"> Like any heuristic, the top-level rule selection mechanism just described is not perfect. The principal difficulties are: i) the accuracy of the heuristic is limited, and ii) its internal choices are relatively complex and somewhat obscure from a linguist's viewpoint. The aim of this research was to use classification systems as a tool to help develop new knowledge for improving the parsing process. To preserve the broad applicability of DIPETT, we emphasised the generality of the results and did not use any kind of domain knowledge. The sentences used to build the classifiers and evaluate their performance were randomly selected from five unrelated real corpora.</Paragraph> <Paragraph position="1"> Typical classification systems (e.g. decision trees, neural networks, instance-based learning) require the data to be represented by feature vectors. Developing such a representation for the task considered here is difficult. Since top-level rule selection is one of the first steps in the parsing process, very little information is available at that early stage; all of it is provided by the (non-disambiguating) lexical analysis that is performed before parsing. This preliminary analysis provides four features: 1) the number of potential verbs in the sentence, 2) the presence of potential coordinators, 3) verb density, and 4) the number of potential auxiliaries. As mentioned above, only the first three features are actually used by the current hand-crafted heuristic.</Paragraph> <Paragraph position="2"> However, preliminary experiments showed that no interesting knowledge can be inferred from these four features alone. We therefore decided to improve our representation by using DIPETT's fragmentary parser: an optional parsing mode in which DIPETT does not attempt to produce a single structure for the current input string but, rather, analyses it as a sequence of major constituents (i.e. noun, verb, prepositional and adverbial phrases). The new features obtained from fragmentary parsing are: the number of fragments, the number of "verbal" fragments (fragments that contain at least one verb), the number of tokens skipped, and the total percentage of the input recognised by the fragmentary parser. The fragmentary parser is a cost-effective way to obtain a better representation of sentences because it is very fast (on average, less than one second of CPU time per sentence) in comparison to full parsing. Moreover, the information it provides is adequate for the task at hand because it reflects well the complexity of the sentence to be parsed. In addition to the features obtained from the lexical analysis and from the fragmentary parser, we use the string length (number of tokens in the sentence) to describe each sentence. The attribute used to classify the sentences, provided by a human expert, is called rule-to-attempt and can take two values, COMPOUND or NON_COMPOUND, according to the type of the sentence.
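Before the features are listed formally below, the whole representation can be pictured as a simple record, one per sentence. This sketch uses hypothetical Python names; the feature semantics are taken from the description above.

    from dataclasses import dataclass

    @dataclass
    class SentenceFeatures:
        # From lexical analysis (non-disambiguating), plus string length.
        string_length: int               # number of tokens in the sentence
        num_potential_verbs: int
        num_potential_auxiliaries: int
        verb_density: bool               # all potential verbs separated by coordinators?
        num_potential_coordinators: int
        # From the fragmentary parser.
        num_fragments: int
        num_verbal_fragments: int        # fragments with at least one potential verb
        num_tokens_skipped: int
        pct_input_recognized: float      # share of tokens not skipped, as a percentage
        # Class attribute, assigned by a human expert.
        rule_to_attempt: str             # "COMPOUND" or "NON_COMPOUND"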
To summarise, we used the following ten features to represent each sentence: 1) string-length: number of tokens (integer); 2) num-potential-verbs: number of potential verbs (integer); 3) num-potential-auxiliary: number of potential auxiliaries (integer); 4) verb-density: a flag that indicates whether all potential verbs are separated by coordinators (boolean); 5) num-potential-coordinators: number of potential coordinators (integer); 6) num-fragments: number of fragments used by the fragmentary parser (integer); 7) num-verbal-fragments: number of fragments that contain at least one potential verb (integer); 8) num-tokens-skip: number of tokens not considered by the fragmentary parser (integer); 9) %-input-recognized: percentage of the sentence recognised, i.e. not skipped (real); 10) rule-to-attempt: type of the sentence (COMPOUND or NON_COMPOUND).</Paragraph> <Paragraph position="4"> We built the first data set by randomly selecting 300 sentences from four real texts: a software user manual, a tax guide, a junior science textbook on weather phenomena, and the Brown corpus. Each sentence was described in terms of the above features, all of which are acquired automatically by the lexical analyser and the fragmentary parser, except for rule-to-attempt, as mentioned above. After a preliminary analysis of these 300 sentences, we realised that we had unbalanced numbers of examples of compound and non-compound sentences: non-compounds are approximately five times more frequent than compounds. It is a well-known fact in machine learning that such unbalanced training sets are not suitable for inductive learning. For this reason, we re-sampled our texts to obtain roughly equal numbers of non-compound and compound sentences (55 compounds and 56 non-compounds).</Paragraph> <Paragraph position="5"> Our experiment consisted of running a variety of attribute classification systems: IMAFO ([Famili & Turney 1991]), C4.5 ([Quinlan 1993]), and different learning algorithms from MLC++ ([Kohavi et al. 1994]). IMAFO includes an enhanced version of ID3 and an interface to C4.5 (we used both engines in our experimentation). MLC++ is a machine learning library developed in C++; we experimented with many of the algorithms it includes.</Paragraph> <Paragraph position="6"> We concentrated mainly on learning algorithms that generate results in the form of rules. For this project, rules are more interesting than other forms of results because they are relatively easy to integrate into a rule-based parser and because they can be evaluated by experts in the domain.</Paragraph> <Paragraph position="7"> However, for accuracy comparison, we also used learning systems that do not generate rules in terms of the initial representation: neural networks and instance-based systems. We randomly divided our data set into a training set (2/3 of the examples, or 74 instances) and a testing set (1/3 of the examples, or 37 instances). Table 1 summarises the results obtained from the different systems in terms of their error rates on the testing set. All systems gave results with an error rate below 20%. The error rates presented in Table 1 for the first four systems (decision-rule systems) are the average rates over all rules generated by these systems. However, not all rules were particularly interesting; we kept only some of them for further evaluation and integration in the parser.
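This protocol can be re-created along the following lines. The sketch below is ours: it uses scikit-learn's decision-tree learner as a modern stand-in for C4.5/ID3 (the systems actually used are named above), and the variable names are assumptions.

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    def run_experiment(X, y, seed=0):
        """X: one feature vector per sentence; y: rule-to-attempt labels."""
        # 2/3 training, 1/3 testing, mirroring the 74/37 split above.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=1/3, random_state=seed)
        clf = DecisionTreeClassifier().fit(X_train, y_train)
        error_rate = 1.0 - clf.score(X_test, y_test)  # error on the testing set
        return clf, error_rate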
Our selection criteria were: 1) the estimated error rate; 2) "reasonability" (only rules that made sense to a computational linguist were kept); 3) readability (simple rules are preferred); and 4) novelty (we discarded rules that were already in the parser). Tables 2 and 3 present the rules that satisfy all of the above criteria: Table 2 focuses on rules that identify compound sentences, while Table 3 presents rules that identify non-compound sentences. The error rate for each rule is also given. The error rates that we obtained are quite respectable for a two-class learning problem, given the volume of available examples. Moreover, the rules are justified and make sense. They are also very compact in comparison with the original hand-crafted heuristics. We will see in section 4 how these rules behave on unseen data from a totally different text.</Paragraph> <Paragraph position="9"> Attribute classification systems such as those used in the experiment reported here are highly sensitive to the adequacy of the features used to represent the instances. For our task (parsing), these features were difficult to find and we had only a rough idea of their appropriateness. For this reason, we felt that better results could be obtained by transforming the original instance space into a more adequate one by creating new attributes. In machine learning research, this process is referred to as constructive learning, or constructive induction ([Wnek & Michalski 1994]). We attempted to use principal component analysis (PCA) ([Johnson & Wichern 1992]) as a simple constructive learning technique, but we did not get very impressive results. We see two reasons for this. The primary reason is that the ratio between the number of examples and the number of attributes is not high enough for PCA to derive high-quality new attributes. The second reason is that the original attributes are already highly non-redundant. It is also important to note that the resulting rules do not satisfy the reasonability criterion applied to the original representation. In fact, losing the understandability of the attributes is the usual consequence of almost all approaches that change the representation of instances.
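For concreteness, here is how such a PCA-based transformation might be set up. This is a sketch under modern tooling (scikit-learn), not the software used in the paper, and it merely illustrates the technique which, as explained above, did not pay off here.

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    def construct_pca_attributes(X):
        """Map the original numeric features to principal components."""
        Xs = StandardScaler().fit_transform(X)  # PCA is scale-sensitive
        pca = PCA().fit(Xs)
        # When the original attributes are non-redundant and examples are
        # few, the variance stays spread over many components, which is
        # one way to see why PCA derived no high-quality new attributes.
        print(pca.explained_variance_ratio_)
        return pca.transform(Xs)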
</Paragraph> <Paragraph position="10"> 4 Evaluation of the new rules We explained in section 3 how we derived new parsing heuristics with the help of machine learning techniques. The next step was to evaluate how well the new rules would perform if we replaced the parser's current hand-crafted heuristics with them. In particular, we wanted to evaluate the accuracy of the heuristics in correctly identifying the appropriate rule, COMPOUND or NON_COMPOUND, that the parser should attempt first. This goal was prompted by an earlier evaluation of DIPETT in which it was noted that a good proportion of questionable parses (i.e. either bad parses or correct but too time-consuming parses) were caused by a bad first attempt, such as attempting COMPOUND instead of NON_COMPOUND.</Paragraph> <Section position="1" start_page="310" end_page="310" type="sub_section"> <SectionTitle> 4.1 From new rules to new parsers </SectionTitle> <Paragraph position="0"> Our machine learning experiments led us to two classes of rules, obtained from a variety of classifiers and concerned only with the notion of compoundness: 1) those predicting a COMPOUND sentence, and 2) those predicting a NON_COMPOUND one. The problem was then to decide what should be done with this set of new rules. More precisely, before actually implementing the new rules and including them in the parser, we first had to decide on an appropriate strategy for exploiting them. We now describe the three implementations that we realised and evaluated.</Paragraph> <Paragraph position="1"> The first implements only the rules for the COMPOUND class: one big rule which is a disjunction of all the learned rules for that class. Since there are only two alternatives, COMPOUND or NON_COMPOUND, if none of the COMPOUND rules applies, the NON_COMPOUND class is predicted. This first implementation is referred to as C-Imp. The second implementation, referred to as NC-Imp, does exactly the opposite: it implements only the rules predicting the NON_COMPOUND class.</Paragraph> <Paragraph position="2"> The third implementation, referred to as NC_C-Imp, combines the first two. The class of a new sentence is determined by combining the outputs of C-Imp and NC-Imp according to the decision table in Table 4. The first two lines of this decision table are obvious, since the outputs of the two implementations are consistent. When the two implementations disagree, NC_C-Imp predicts non-compound. This prediction is justified by a Bayesian argument: in the absence of any additional knowledge, we are forced to assign an equal probability of success to each of the two sets of rules, and the most probable class then becomes the one with the highest frequency; in general, non-compound sentences are more frequent than compound ones. One obvious way to improve this third implementation would be to evaluate precisely the accuracies of the two sets of rules and then incorporate these accuracies in the decision process.
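As a minimal sketch of this combination scheme (the boolean encoding of the two outputs and all names are ours, not the parser's actual rule formalism):

    def nc_c_imp(c_imp_predicts_compound, nc_imp_predicts_non_compound):
        """Combine the outputs of C-Imp and NC-Imp as in the decision table."""
        if c_imp_predicts_compound and not nc_imp_predicts_non_compound:
            return "COMPOUND"        # the two implementations agree
        if nc_imp_predicts_non_compound and not c_imp_predicts_compound:
            return "NON_COMPOUND"    # the two implementations agree
        # Disagreement: fall back on the a priori more frequent class,
        # as justified by the Bayesian argument above.
        return "NON_COMPOUND"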
</Paragraph> </Section> <Section position="2" start_page="310" end_page="311" type="sub_section"> <SectionTitle> 4.2 The results </SectionTitle> <Paragraph position="0"> To perform the evaluation, we randomly sampled 200 sentences from a new corpus on mechanics ([Atkinson 1990]); note that this text had not been used to sample the sentences used for learning. Of these 200 sentences, 10 were discarded because they were not representative (e.g. one-word "sentences"). We ran the original implementation of DIPETT, plus the three new implementations described in the previous section, on the remaining 190 test sentences. Table 5 presents the results of the new implementations versus DIPETT's original heuristics: the error rate, the standard deviation of the error rate and the p-value are listed for each implementation.</Paragraph> <Paragraph position="1"> The p-value gives the probability that DIPETT's original hand-crafted heuristics are better than the new heuristics. In other words, a small p-value means an increase in performance with high probability.</Paragraph> <Paragraph position="2"> We observe that all of the new automatically-derived heuristics beat DIPETT's hand-crafted heuristics, and quite clearly. The results of the third implementation (NC_C-Imp) are especially remarkable: with a confidence of over 99%, we can affirm that the NC_C-Imp implementation will outperform DIPETT's original heuristic. We also note that the error rate drops by 35% of its value under the original heuristic. Similarly, with a confidence of 87.4%, we can affirm that the implementation that uses only the C-rules (C-Imp) will perform better than DIPETT's current heuristics.</Paragraph> <Paragraph position="3"> These very good results are amplified by the fact that the testing described in this evaluation was done on sentences totally independent of those used for training. Usually, in machine learning research, the training and testing sets are sampled from the same original data set, and the kind of "out-of-sample" testing that we perform here has only recently come to the attention of the learning community ([Ezawa et al. 1996]). Our experiments have shown that it is possible to infer rules that perform very well and are highly meaningful in the eyes of an expert even when the training set is relatively small. This indicates that the representation of sentences that we chose for the problem was adequate. Finally, another important output of our research is the identification of the most significant attributes for distinguishing non-compound sentences from compound ones; this alone is valuable information for a computational linguist. Only five of the ten original attributes are used by the learned rules, and all of them are cheap to compute: two are derived by fragmentary parsing (number of verbal fragments and number of fragments), and three are lexical (number of potential verbs, length of the input string, and presence of potential coordinators).</Paragraph> </Section> </Section> </Paper>