<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1126">
  <Title>Mapping Collocational Properties into Machine Learning Features</Title>
  <Section position="4" start_page="225" end_page="227" type="metho">
    <SectionTitle>
3 Collocational Properties
</SectionTitle>
    <Paragraph position="0"> Collocations have been used extensively in word-sense disambiguation research. In that context, collocations are words that co-occur with senses of the target word more often than expected by chance. Collocations also usually involve some constraint(s). For example, the constraint might be that the word must appear immediately to the right of the target word (see, for example, Ng &amp; Lee 1996 and Bruce &amp; Wiebe 1994); the actual collocations would be words that occur there.</Paragraph>
    <Paragraph position="1">  We need to untie the notion of collocation from word-sense disambiguation, and consider collocations to be words that co-occur (more often than chance) with whatever classes are being targeted (such as the event categories presented above). Viewed in this way, collocations are also important for many event categorization and discourse processing tasks. Examples are open-class words that suggest dialog acts; words that help disambiguate cue words (e.g., is &amp;quot;now&amp;quot; being used temporally, or as a discourse marker? (Hirschberg &amp; Litman 1993)); and words that suggest states versus events (Siegel 1997).</Paragraph>
    <Paragraph position="2"> The work reported here is relevant when there are many potential collocations to choose from, and we are automatically sifting through the various possibilities for good ones. For word-sense disambiguation, many different words co-occur in the corpus with the target word; we want to choose a subset that are good indicators of the sense of the target word. For dialog act recognition, we could search through the adjectives in the corpus, for example, for some that suggest a rejection dialog act (e.g., busy, occupied, committed, tied up, ...) in the scheduling domain (Wiebe et al. 1997b). For disambiguating the cue phrase now, we could search for words that prefer the temporal versus the discourse interpretation (perhaps temporal adverbs and verbs with temporal aspects of their meaning). For event categorization, we could sift through the main verbs to find those that are good indicators of speech, for example (say, demand, attack, concede, ...).</Paragraph>
    <Paragraph position="3"> To aid discussion, we use the following formal definitions. A collocational property is a set of constraints, P1 ... Pp. In word-sense disambiguation, for example, we might have an adjacent collocational property, defined by four constraints: P1, the word one position to the left of the target word; P2, the word two positions to the left; P3, the word one position to the right; and P4, the word two positions to the right.</Paragraph>
    <Paragraph position="5"> A potential collocation word is a word that satisfies one of the constraints. Continuing the example, all of the words that appear in the corpus one or two words to the left or right of the target word are potential adjacent collocation words. Finally, a collocation word is a potential collocation word that is judged to be correlated with the classification, according to a metric such as conditional probability, an information-theoretic criterion, or a goodness-of-fit test.</Paragraph>
    <Paragraph position="6"> We allow properties to be divided into subproperties. That is, the set of constraints defining a property are divided into subsets, S1 ... Ss. In our example, if s = 1, there is just one undivided property, defined by the set {P1, P2, P3, P4}. If s = p = 4, then there are four subproperties, each defined by one of the constraints. Or, there might be two subproperties, S1 = {P1, P2}, corresponding to adjacent words on the left, and S2 = {P3, P4}, corresponding to adjacent words on the right. Because these definitions cover many variations in a uniform framework, they facilitate comparative evaluation of systems implementing different schemes.</Paragraph>
    <Paragraph position="7"> The experiments performed here use collocational properties defined in Wiebe et al. 1997a to perform the event categorization task described in section 2. For this and other applications in which event type is important, such as many information extraction, text categorization, and discourse processing tasks, highly definitive properties, i.e., properties that pinpoint only the more relevant parts of the sentence, can lead to better performance. We define such a highly definitive collocational property. Specifically, it is defined by a set of syntactic patterns that are regular expressions composed of parts of speech and root forms of words. The property is referred to as the SP collocational property; it yields the best overall results on our event categorization task, as shown later in table 1. A partial description of the SP property is the following (where NPapprox approximates a noun phrase): baseAdjPat = {a | a is in the pattern (main_verb adv* a), where the main verb is copular}. E.g., &amp;quot;She is/seems happy&amp;quot;. complexAdjPat = {a | a is in the pattern (main_verb adv* [NPapprox] [&amp;quot;to&amp;quot;] adv* v adv* a), where v is copular}. E.g., &amp;quot;It surprised him to actually be so happy.&amp;quot; Our SP property is organized into two subproperties (i.e., s is 2). Recall that a subproperty is defined by a set of constraints. Our first SP subproperty is defined by baseAdjPat and complexAdjPat. The potential collocation words corresponding to this subproperty are all adjectives that are used in either pattern in the corpus, and the actual collocation words are words chosen from this set. Our second SP subproperty is defined by two verb patterns not shown above. 
Given a clause, our system can apply the syntactic patterns fully automatically, using regular expression matching techniques.</Paragraph>
    <Paragraph position="8"> The other collocational property, CO, was defined to contrast with the SP property because it is not highly definitive. That is, it is defined by very loose constraints that do not involve syntactic patterns. The two CO constraints we use are simply adjective and verb, so that the potential collocation words are all the adjectives and verbs appearing in the corpus (ignoring where they appear in the sentence). In our experiments, each of these constraints is treated as a subproperty (so, again, s is 2).</Paragraph>
  </Section>
  <Section position="5" start_page="227" end_page="228" type="metho">
    <SectionTitle>
4 Selecting Collocations and Representing them as Features
</SectionTitle>
    <Paragraph position="0"> The context of this work is automatic classification. Suppose there is a training sample, where each tagged sentence is represented by a vector (F1, ..., Fn-1, C). The Fi's are input features and C is the targeted classification. Our task is to induce a classifier that will predict the value of C given an untagged sentence represented by the Fi's. This section addresses selecting collocations and representing them as such features.</Paragraph>
    <Section position="1" start_page="227" end_page="227" type="sub_section">
      <SectionTitle>
4.1 Selecting Collocations
</SectionTitle>
      <Paragraph position="0"> Following are two methods for selecting collocation words of a given collocational property (Wiebe et al. 1997a). Assume there are c classes, C1 ... Cc, and s subproperties, S1 ... Ss.</Paragraph>
      <Paragraph position="1">  In the per-class method (also used by Ng and Lee 1996), a set of words, WordsCiSj, is selected for each combination of class Ci and subproperty Sj. They are selected to be words that, when they satisfy a constraint in Sj, are correlated with class Ci. Specifically: WordsCiSj = {w | P(Ci | w satisfies a constraint in Sj) &gt; k}.</Paragraph>
      <Paragraph position="2"> We use k = 0.5. We experimented with some other values of k and other criteria, but did not find any that consistently yield better results.</Paragraph>
      <Paragraph position="3"> A more thorough investigation is planned.</Paragraph>
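A minimal sketch of the per-class selection rule with k = 0.5, assuming the training data has already been reduced to (word, class) occurrence pairs for one subproperty; the names and toy data are hypothetical:

```python
from collections import Counter, defaultdict

def select_per_class(occurrences, k=0.5):
    """Per-class selection: occurrences is a list of (word, class)
    pairs, one per time a word satisfies a constraint in subproperty
    Sj in the training set. Returns {Ci: WordsCiSj}, the words w with
    P(Ci | w satisfies a constraint in Sj) > k."""
    word_class = Counter(occurrences)            # counts of (w, c)
    word_total = Counter(w for w, _ in occurrences)
    selected = defaultdict(set)
    for (w, c), n in word_class.items():
        if n / word_total[w] > k:                # conditional probability test
            selected[c].add(w)
    return dict(selected)

occs = [("say", "speech"), ("say", "speech"), ("say", "other"),
        ("busy", "state"), ("busy", "state")]
print(select_per_class(occs))  # {'speech': {'say'}, 'state': {'busy'}}
```

Here "say" is kept for the speech class because P(speech | say) = 2/3 &gt; 0.5, while its single co-occurrence with the other class falls below the threshold.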
      <Paragraph position="4">  In the over-range method, a set of words, WordsSj, is selected for each subproperty Sj, such that, when they satisfy a constraint in Sj, they are correlated with the classification variable across the range of its values.</Paragraph>
      <Paragraph position="5"> Specifically, the model of independence between each word w (when satisfying a constraint in Sj) and the classification variable is assessed, using the likelihood ratio statistic, G2 (Bishop et al. 1975). Those with the top N G2 values, i.e., for which independence is a poor fit, are chosen. For the purposes of comparison, we limit the number of words to the maximum number of features permitted by one of the ML packages: 20 for ORe and 19 for ORb (ORe and ORb are defined below).</Paragraph>
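The over-range selection can be sketched as follows: for each word, G2 is computed on the 2 x c contingency table of word occurrence versus class, and the top-N words are kept. This is a hedged illustration of the likelihood ratio statistic, not the authors' code; the (word, class) occurrence-pair representation is an assumption:

```python
import math
from collections import Counter

def g2(word, occurrences):
    """Likelihood-ratio statistic G2 = 2 * sum(O * ln(O/E)) for
    independence between occurrence of `word` (satisfying the
    subproperty) and the class, over (word, class) pairs."""
    classes = sorted({c for _, c in occurrences})
    n = len(occurrences)
    obs = Counter((w == word, c) for w, c in occurrences)
    row = Counter(w == word for w, _ in occurrences)
    col = Counter(c for _, c in occurrences)
    stat = 0.0
    for present in (True, False):
        for c in classes:
            o = obs[(present, c)]
            e = row[present] * col[c] / n    # expected count under independence
            if o > 0:                        # 0 * ln(0) contributes 0
                stat += 2.0 * o * math.log(o / e)
    return stat

def select_over_range(occurrences, n_top):
    """Keep the n_top words for which independence fits worst."""
    words = {w for w, _ in occurrences}
    return sorted(words, key=lambda w: g2(w, occurrences), reverse=True)[:n_top]
```

A word that occurs only with one class gets a high G2, while a word spread evenly across classes scores near zero and is not selected.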
    </Section>
    <Section position="2" start_page="227" end_page="228" type="sub_section">
      <SectionTitle>
4.2 Organizations
</SectionTitle>
      <Paragraph position="0"> Finally, the collocation words must be organized into features. Following are two organizations for each selection method (Wiebe et al. 1997a).</Paragraph>
      <Paragraph position="1">  This organization is commonly used in NLP, for example by Gale et al. 1992. A binary feature is defined for each word in each set WordsSj, 1 &lt;= j &lt;= s.</Paragraph>
      <Paragraph position="2">  This organization is used by, for example, Ng &amp; Lee 1996. One feature is defined per subproperty Sj. It has |WordsSj| + 1 values: one value for each word in WordsSj, corresponding to the presence of that word, and one value for the absence of any word in WordsSj. E.g., for both CO and SP collocations, there is one feature for adjectives and one for verbs. The adjective feature has a value for each selected adjective, and a value for none of them occurring. (The verb feature is analogous.)  There is one binary feature for each class Ci, whose value is 1 if any member of any of the sets WordsCiSj appears in the sentence, 1 &lt;= j &lt;= s.  For each subproperty Sj, a feature is defined with c + 1 values as follows. There is one value for each class Ci, corresponding to the presence of a word in WordsCiSj. Each feature also has a value for the absence of any of those words. E.g., for both CO and SP collocations, there is one feature for adjectives and one for verbs. The adjective feature has one value for each class, corresponding to the presence of any of the adjectives chosen for that class; there is also a value for the absence of any of them. (The verb feature is analogous.) Note that, in the over-range organizations, increasing the number of words increases the complexity of the event space, in ORe by increasing the number of feature values and in ORb by increasing the number of features. These increases in complexity can worsen accuracy and computation time (Goldberg 1995, Bruce et al.</Paragraph>
      <Paragraph position="3"> 1996, Cohen 1996). The per-class organizations allow the number of collocation words to be increased without a corresponding increase in complexity.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="228" end_page="228" type="metho">
    <SectionTitle>
5 The Machine Learning Algorithms
</SectionTitle>
    <Paragraph position="0"> The algorithms included in this study are representative of the major types identified by Michie et al. (1994) in the StatLog project comparing machine learning algorithms: (1) PEBLS, a K-Nearest Neighbor algorithm (Cost and Salzberg 1993); (2) C4.5, a decision tree algorithm (Quinlan 1994); (3) Ripper, an inductive rule-based classifier (Cohen 1996); (4) the Naive Bayes classifier; and (5) a probabilistic model search procedure (Bruce &amp; Wiebe 1994) using the public domain software CoCo (Badsberg 1995). Linear discriminant classifiers are omitted because they are not appropriate for categorical data. Neural network classifiers are omitted as well.</Paragraph>
  </Section>
  <Section position="7" start_page="228" end_page="229" type="metho">
    <SectionTitle>
6 Results
</SectionTitle>
    <Paragraph position="0"> Figure 1 presents the accuracy of each of the machine learning algorithms on each combination of collocational property and feature organization. Table 1 shows the mean accuracy across algorithms. In addition to collocational features, all experiments included seven other (automatically determined) features, such as position in the paragraph. Two main modifications of Wiebe et al. (1997a) were made to facilitate the comparisons at issue here. First, nouns were originally included in the CO but not the SP collocational property. Here, they are not included in either. Second, a weakness in the method for selecting the collocation sets is changed so that, for each collocational property, the words in the sets WordsCiSj are identical for both per-class experiments.</Paragraph>
    <Paragraph position="1"> The data consists of 2,544 main clauses from the Wall Street Journal Treebank corpus (Marcus et al., 1993). There are six classes, and the lower bound for the classification problem--the frequency in the data set of the most frequent class--is 52%.</Paragraph>
    <Paragraph position="2"> 10-fold cross-validation was performed. All experiments were independent, so that, for each fold, the collocations were determined and rule induction or model search, etc., was performed anew on the training set.</Paragraph>
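The experimental protocol can be sketched as follows; the point is that collocation selection is redone inside each fold, on the training portion only, so no information from the held-out fold leaks into feature construction. The selection and evaluation functions here are stand-ins:

```python
def ten_fold_indices(n, k=10):
    """Assign the n example indices to k folds round-robin (sketch)."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(examples, select_collocations, train_and_eval, k=10):
    """Per-fold protocol: re-select collocations from the training
    portion only, then induce and test the classifier anew."""
    accuracies = []
    for fold in ten_fold_indices(len(examples), k):
        held_out = set(fold)
        test_set = [examples[i] for i in fold]
        train_set = [ex for i, ex in enumerate(examples) if i not in held_out]
        collocations = select_collocations(train_set)  # no test-set leakage
        accuracies.append(train_and_eval(train_set, test_set, collocations))
    return sum(accuracies) / len(accuracies)
```

Because each fold repeats selection, rule induction, or model search from scratch, the reported accuracy reflects the whole pipeline, not just the final classifier.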
    <Paragraph position="3"> We performed an analysis of variance to detect significant differences in accuracy considering algorithm, collocational property, and feature organization. Where significant differences were detected, we performed post-hoc analyses (using Tukey's HSD, to control for multiple comparison error rates (SAS Institute 1989)) to identify them. The algorithms differ in accuracy, i.e., the analysis shows there is a significant main effect of algorithm on accuracy (p &lt; 0.0001). Post-hoc analysis shows that there is only one significant difference: the lower performance of PEBLS relative to the others.</Paragraph>
    <Paragraph position="4"> However, the pattern of interaction between algorithm and features is extremely consistent across algorithms. The analysis shows that there is no higher-level interaction between algorithm, on the one hand, and collocational property and feature organization, on the other. (The Treebank syntax trees are used only to identify the main clause; this is necessary only because the problem is defined as classifying the main clause.) That is, the relative effects of property and organization on accuracy do not significantly change from one algorithm to another.</Paragraph>
    <Paragraph position="5"> No attempt was made to tune the algorithms for performance (e.g., varying the number of neighbors in the PEBLS experiments). Thus, we do not take the results to be indicative of the quality of the algorithms. Rather, the consistent pattern of results indicates that per-class organization is beneficial or not depending mainly on the collocational property.</Paragraph>
    <Paragraph position="6"> Further analysis, controlling for differences across algorithms, reveals a highly significant interaction (p &lt; 0.0001) between collocational property and feature organization. Post-hoc comparisons show that the best per-class experiment, SP-PCe, is significantly better than any over-range experiment, but is not significantly better than the other syntactic pattern/per-class experiment, SP-PCb. In fact, we experimented (using the CoCo search algorithm) with per-class variations not presented in Wiebe et al. (1997a), specifically with different sets of subproperties (e.g., PCe with s = 1). There is no statistically significant difference among any of the syntactic pattern/per-class experiments.</Paragraph>
    <Paragraph position="7"> In contrast, the co-occurrence/per-class experiments (CO-PCe and CO-PCb) are significantly worse than all the other experiments. Among the four over-range experiments, the only significant difference is between CO-ORb and CO-ORe. As seen in table 2, a large number of per-class collocation words appear only once (a consequence of the basic conditional probability test we use). We reran the per-class experiments (10-fold cross-validation using CoCo search), excluding collocation words that appear only once in the training set. There were minuscule increases in the SP results (less than 0.3%). For the CO collocations, the PCb experiment increased by 3.15% and the PCe by less than 1%. With these new results, the per-class/co-occurrence results are still much worse than all the other experiments.</Paragraph>
  </Section>
  <Section position="8" start_page="229" end_page="231" type="metho">
    <SectionTitle>
7 Analysis
</SectionTitle>
    <Paragraph position="0"> In the previous section, we established that there is a highly significant interaction in the experiments between collocational property and feature organization, and that the pattern of this interaction is extremely consistent across the algorithms. In this section, the properties and organizations are analyzed in order to gain insight into the pattern of results and develop some diagnostics for recognizing when the per-class organizations may be beneficial. We consider a number of factors, including conflicting class indicators, entropy, conditional probability, and event space complexity.</Paragraph>
    <Paragraph position="1"> As table 2 illustrates, the SP collocations are of much lower frequency, since they are more constrained. Specifically, table 2 shows the number of occurrences in one training set of the collocation words selected per-class.</Paragraph>
    <Section position="1" start_page="229" end_page="230" type="sub_section">
      <SectionTitle>
7.1 Conflicts in Per-Class Experiments
</SectionTitle>
      <Paragraph position="0"> The main differences between CO and SP collocations occur under the per-class organizations. These organizations appear to be vulnerable to collocations that indicate conflicting classes, since the collocation words are selected to be those highly indicative of a particular class. Two words in the same sentence indicate conflicting classes if one is in a set WordsCjSl and the other is in a set WordsCkSm, and j ≠ k.</Paragraph>
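A conflict check of this kind can be sketched as follows, under the hypothetical representation that each class maps to its selected word set for one subproperty:

```python
def conflicting_pairs(sentence_words, words_per_class):
    """Find word pairs in a sentence that indicate conflicting
    classes: one word selected for class Cj, another for class Ck,
    with j != k. words_per_class maps class -> selected word set."""
    hits = [(w, c) for c, words in words_per_class.items()
            for w in words if w in sentence_words]
    return [(w1, w2) for i, (w1, c1) in enumerate(hits)
            for (w2, c2) in hits[i + 1:] if c1 != c2]
```

For example, a sentence containing both a speech-selected word and a state-selected word yields one conflicting pair, whereas two words selected for the same class do not conflict.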
      <Paragraph position="1">  As table 3 shows, the CO collocations frequently conflict, while the SP collocations rarely do.</Paragraph>
      <Paragraph position="2"> This is true whether or not the collocations appearing only once are included (shown on the left versus the right side of the table).</Paragraph>
    </Section>
    <Section position="2" start_page="230" end_page="230" type="sub_section">
      <SectionTitle>
7.2 Measures of Feature Quality
</SectionTitle>
      <Paragraph position="0"> We argue that, for the per-class organizations to be beneficial, the individual collocation words must strongly select a single majority class.</Paragraph>
      <Paragraph position="1"> Suppose that two words w1 and w2 in a set WordsCiSj select different classes as the second most probable class, with, say, conditional probabilities of .24 and .22, respectively. Information concerning the second most probable class is lost under the per-class grouping, even though the words are associated with another class over 20% of the time. If the conditional probability of the most strongly associated class were higher for both words, the frequency of the secondary association would be reduced, resulting in fewer erroneous classifications.</Paragraph>
      <Paragraph position="2"> Two measures that can be used to assess how strongly collocation words select single majority classes are entropy and conditional probability of class given feature.</Paragraph>
      <Paragraph position="3"> Quality of low frequency collocations is difficult to measure. For example, entropy tends to be unreliable for low frequency features. Therefore, table 4 shows statistics calculated for the more frequent words selected in common under the SP and CO constraints in the training set of one fold of a per-class experiment. The 17 selected words all occur at least 10 times under each constraint in the training set used.</Paragraph>
      <Paragraph position="4"> Since an identical set of words is measured under both kinds of collocational property, the results strongly reflect the quality of the properties. The entropy of the conditional distribution of the class C given value f of feature F is: H = - Σ_{c ∈ {C1, ..., Cc}} p(c | F = f) log(p(c | F = f)). The first line of table 4 shows that, on average, the SP collocation words are more strongly indicative of a single class. The second line shows that, on average, SP collocations have much lower entropy than the others.</Paragraph>
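The entropy measure can be computed directly from the class counts observed with a given feature value; a minimal sketch (base-2 logarithm assumed):

```python
import math

def class_entropy(class_counts):
    """Entropy of the conditional class distribution given one
    feature value f: H = - sum_c p(c | F=f) * log2 p(c | F=f).
    class_counts maps class -> co-occurrence count with f."""
    total = sum(class_counts.values())
    h = 0.0
    for n in class_counts.values():
        if n:  # 0 * log(0) contributes nothing
            p = n / total
            h -= p * math.log2(p)
    return h

# A word that nearly always marks one class has low entropy...
print(class_entropy({"speech": 9, "other": 1}))   # ~0.469
# ...while a word split evenly between two classes has the maximum.
print(class_entropy({"speech": 5, "other": 5}))   # 1.0
```

Low entropy is exactly the "strongly selects a single majority class" condition argued for above: the distribution mass concentrates on one class.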
    </Section>
    <Section position="3" start_page="230" end_page="231" type="sub_section">
      <SectionTitle>
7.3 The Potential of Per-Class Organizations: More Information without Added Complexity
</SectionTitle>
      <Paragraph position="0"> As shown above in tables 2, 3, and 4, collocation words of the more constrained SP property are of lower frequency and higher quality than the CO collocations. Because the SP collocations are of low frequency, using them requires including a larger number of collocation words.</Paragraph>
      <Paragraph position="1"> To assess the influence of the per-class organizations when the number of collocation words is not increased, the following exercise was performed. We took the collocation words that were included in the original ORe experiment and organized them as PCe, and similarly for ORb and PCb, and reran the experiments (10-fold cross-validation using CoCo search). When the features are so transformed, the accuracy is virtually unchanged, as shown in table 5.</Paragraph>
      <Paragraph position="2"> The results suggest that simply applying the per-class organizations to existing collocations will not result in significant improvement. The improvement we see when moving from the over-range to the per-class organizations of the SP collocations is largely due to the inclusion of additional high-quality collocations; the PC organizations allow them to be included without adding complexity.</Paragraph>
      <Paragraph position="3"> Various methods have been proposed for reducing the complex feature space associated with large numbers of low frequency properties. For example, one can ignore infrequent collocations entirely (e.g., Ng &amp; Lee), consider only the single best property (e.g., Yarowsky 1993), or ignore negative evidence, i.e., the absence of a property (e.g., Hearst 1992). Another is to retain the high-quality collocations, grouping them per-class. Cohen (1996) and Goldberg (1995) propose similar methods for text categorization tasks, although they do not address the comparative issues investigated here.</Paragraph>
    </Section>
  </Section>
</Paper>