<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0903">
  <Title>Boosting automatic lexical acquisition with morphological information</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 MiniWordnet
</SectionTitle>
    <Paragraph position="0"> Ideally the lexicon we would like to extend is a broad coverage machine readable dictionary like Wordnet (Miller et al., 1990; Fellbaum, 1998). The problem with trying to directly use Wordnet is that it contains too many classes (synsets), around 70 thousand. Learning in such a huge class space can be extremely problematic, and intuitively it is not the best way to start on a task that hasn't been much explored3. Instead, we manually developed a smaller lexicon dubbed MiniWordnet, which is derived from Wordnet version 1.6. The reduced lexicon has the same coverage (about 95 thousand noun types) but only a fraction of the classes. In this paper we considered only nouns and the noun database. The goal was to reduce the number of classes to about one hundred4 of roughly comparable taxonomical generality and consistency, while maintaining a little bit of hierarchical structure.</Paragraph>
    <Paragraph position="1">  ied text categorization data sets like the Reuters-21578 (Yang, 1999).</Paragraph>
    <Paragraph position="2"> The output of the manual coding is a set of 106 classes that are the result of merging hundreds of synsets. A few random examples of these classes are PERSON, PLANT, FLUID, LOCATION, AC-TION, and BUSINESS. One way to look at this set of classes is from the perspective of named-entity recognition tasks, where there are a few classes of a similar level of generality, e.g, PERSON, LOCA-TION, ORGANIZATION, OTHER. The difference here is that the classes are intended to capture all possible taxonomic distinctions collapsed into the OTHER class above. In addition to the 106 leaves we also kept a set of superordinate levels. We maintained the 9 root classes in Wordnet plus 18 intermediate ones. Examples of these intermediate classes are ANIMAL, NATURAL OBJECT, AR-TIFACT, PROCESS, and ORGANIZATION. The reason for keeping some of the superordinate structure is that hierarchical information might be important in word classification; this is something we will investigate in the future. For example, there might not be enough information to classify the noun ostrich in the BIRD class but enough to label it as ANIMAL.</Paragraph>
    <Paragraph position="3"> The superordinates are the original Wordnet synsets.</Paragraph>
    <Paragraph position="4"> The database has a maximum depth of 5.</Paragraph>
    <Paragraph position="5"> We acknowledge that the methodology and results of reducing Wordnet in this way are highly subjective and noisy. However, we also think that going through an intermediary step with the reduced database has been useful for our purposes and it might also be so for other researchers5. Figure 1 depicts the hierarchy below the root class ABSTRAC-TION. The classes that are lined up at the bottom of the figure are leaves. As in Wordnet, some sub5More information about MiniWordnet and the database itself are available at www.cog.brown.edu/a18 massi/research.</Paragraph>
    <Paragraph position="6"> hierarchies are more densely populated than others.</Paragraph>
    <Paragraph position="7"> For example, the ABSTRACTION sub-hierarchy is more populated (11 leaves) than that of EVENT (3 leaves). The most populated and structured class is ENTITY, with almost half of the leaves (45) and several superordinate classes (10).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Automatic lexical acquisition
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Word classification
</SectionTitle>
      <Paragraph position="0"> We frame the task of inserting new words into the dictionary as a classification problem: a19 is the set of classes defined by the dictionary. Given a vector of features a20a21a16a22a24a23a26a25a28a27 we want to find functions of the form a23 a7 a19 . In particular we are interested in learning functions from data, i.e., a training set of</Paragraph>
      <Paragraph position="2"> be a small probability of error when we apply the classifier to unknown pairs (new nouns).</Paragraph>
      <Paragraph position="3"> Each class is described by a vector of features. A class of features that intuitively carry semantic information are collocations, i.e., words that co-occur with the nouns of interest in a corpus. Collocations have been widely used for tasks such as word sense disambiguation (WSD) (Yarowsky, 1995), information extraction (IE) (Riloff, 1996), and named-entity recognition (Collins and Singer, 1999). The choice of collocations can be conditioned in many ways: according to syntactic relations with the target word, syntactic category, distance from the target, and so on.</Paragraph>
      <Paragraph position="4"> We use a very simple set of collocations: each word a39 that appears within a40a42a41 positions from a noun a43 is a feature. Each occurrence, or token, a44 of a43 , a43a46a45 , is then characterized by a vector of feature counts a20a43a47a45 . The vector representation of the noun type a43 is the sum of all the vectors representing the contexts in which it occurs. Overall the vector representation for each class in the dictionary is the sum of the vectors of all nouns that are members of the</Paragraph>
      <Paragraph position="6"> while the vector representation of an unknown noun is the sum of the feature vectors of the contexts in which it occurred</Paragraph>
      <Paragraph position="8"> The corpus that we used to collect the statistics about collocations is the set of articles from the 1989 Wall Street Journal (about 4 million words) in the BLLIP'99 corpus.</Paragraph>
      <Paragraph position="9"> We performed the following tokenization steps.</Paragraph>
      <Paragraph position="10"> We used the Wordnet &amp;quot;morph&amp;quot; functions to morphologically simplify nouns, verbs and adjectives. We excluded only punctuation; we did no filtering for part of speech (POS). Each word was actually a word-POS pair; i.e., we distinguished between plant:NN and plant:VB. We collapsed sequences of NNs that appeared in Wordnet as one noun; so we have one entry for the noun car company:NN. We also collapsed sequences of NNPs, possibly interleaved by the symbol &amp;quot;&amp;&amp;quot;, e.g., George Bush:NNP and Procter &amp; Gamble:NNP. To reduce the number of features a little we changed all NNPs beginning with Mr. or Ms. to MISS X:NNP, all NNPs ending in CORP. or CO. to COMPANY X:NNP, and all words with POS CD, i.e., numbers, starting with a digit to NUMBER X:CD. For training and testing we considered only nouns that are not ambiguous according to the dictionary, and we used only features that occurred at least 10 times in the corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Simple models
</SectionTitle>
      <Paragraph position="0"> We developed several simple classifiers. In particular we focused on nearest neighbor (a10a59a10 ) and naive Bayes (a10a13a60 ) methods. Both are very simple and powerful classification techniques. For NN we used cosine as a measure of distance between two vectors, and the classifier is thus</Paragraph>
      <Paragraph position="2"> Since we used aggregate vectors for classes and noun types, we only used the best class; i.e., we always used 1-nearest-neighbor classifiers. Thus a41 in this paper refers only to the size of the window around the target noun and never to number of neighbors consulted in a41 -nearest-neighbor classification. We found that using TFIDF weights instead of simple counts greatly improved performance of the NN classifiers, and we mainly report results relative to the TFIDF NN classifiers (a10a59a10a82a81a84a83a58a85a36a86a73a83 ). A document in this context is the context, delimited by the window size a41 , in which each each noun occurs.</Paragraph>
      <Paragraph position="3">  models for a41 a48a96a95a66a97a98a97a99a95a101a100 .at level 1 words and re-weights features by their informativeness, thus making a stop list or other feature manipulations unnecessary. The naive Bayes classifiers is also very simple</Paragraph>
      <Paragraph position="5"> The parameters of the prior and class-conditional distributions are easily estimated using maximum likelihood. We smoothed all counts by a factor of .5.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Testing procedure
</SectionTitle>
      <Paragraph position="0"> We tested each model on an increasing numbers of classes or level. At level 1 the dictionary maps nouns only to the nine Wordnet roots; i.e., there is a very coarse distinction among noun categories at the level of ENTITY, STATE, ACT,.... At level 2 the dictionary maps nouns to all the classes that have a level-1 parent; thus each class can be either a leaf or an intermediate (level 2) class. In general, at level a44 nouns are only mapped to classes that have a level (a44a73a109 a95 ), or smaller, parent. There are 34 level-2 classes, 69 level-3 classes and 95 level-4 ones. Finally, at level 5, nouns are mapped to all 106 leaves. We compared the boosting models and the NN and NB classifiers over a fixed size for a41 of 4.</Paragraph>
      <Paragraph position="1"> For each level we extracted all unambiguous instances from the BLLIP'99 data. The data ranged from 200 thousand instances at level 5, to almost 400 thousand at level 1. As the number of classes grows there are less unambiguous words. We randomly selected a fixed number of noun types for each level: 200 types at levels 4 and 5, 300 at level 3, 350 at level 2 and 400 at level 1. Test was limited to common nouns with frequency between 10 and 300 on the total data. No instance of the noun types present in the test set ever appeared in the training data. The test data was between 5 and 10% of the training data; 10 thousand instances at level 5, 16 thousand at level 1, with intermediate figures for the other levels. We used exactly the same partition of the data for all experiments, across all models.</Paragraph>
      <Paragraph position="2"> Figure 2 shows the error rate of several simple models at level 1 for increasing values of a41 . The error keeps dropping until a41 reaches a value around 4 and then starts rising. Testing for all values of a41a111a110a26a112 a100 confirmed this pattern. This result suggests that the most useful contextual information is that close to the noun, which should be syntactic-semantic in nature, e.g., predicate-argument preferences. As the window widens, the bag of features becomes more noisy. This fact is not too surprising.</Paragraph>
      <Paragraph position="3"> If we made the window as wide as the whole document, every noun token in the document would have the same set of features. As expected, as the number of classes increases, the task becomes harder and the error of the classifiers increases. Nonetheless the same general pattern of performance with respect to a41 holds. As the figure shows a10a13a10a82a81a84a83a58a85a94a86a73a83 greatly improves over the simpler a10a13a10 classifier that only uses counts. a10a59a60 outperforms both.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Boosting for word classification
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 AdaBoost.MH with abstaining
</SectionTitle>
      <Paragraph position="0"> Boosting is an iterative method for combining the output of many weak classifiers or learners6 to produce an accurate ensemble of classifiers. The method starts with a training set a113 and trains the first classifier. At each successive iteration a114 a new classifier is trained on a new training set a113a70a115 , which is obtained by re-weighting the training data used at a114a82a109 a95 so that the examples that were misclassified at a114a116a109 a95 are given more weight while less weight is given to the correctly classified examples. At each  sify examples better than at random only by an arbitrarily small quantity.</Paragraph>
      <Paragraph position="1"> iteration a weak learner a117a56a115a94a29a92a118a34 is trained and added to the ensemble with weight a119a35a115 . The final ensemble has the form</Paragraph>
      <Paragraph position="3"> In the most popular version of a boosting algorithm, AdaBoost (Schapire and Singer, 1998), at each iteration a classifier is trained to minimize the exponential loss on the weighted training set. The exponential loss is an upper bound on the zero-one loss. AdaBoost minimizes the exponential loss on the training set so that incorrect classification and disagreement between members of the ensemble are penalized.</Paragraph>
      <Paragraph position="4"> Boosting has been successfully applied to several problems. Among these is text categorization (Schapire and Singer, 2000), which bears similarities with word classification. For our experiments we used AdaBoost.MH with real-valued predictions and abstaining, a version of boosting for multiclass classification described in Schapire and Singer (2000). This version of AdaBoost minimizes a loss function that is an upper bound on the Hamming distance between the weak learners' predictions and the real labels, i.e., the number of label mismatches (Schapire and Singer, 1998). This upper bound is the product a126</Paragraph>
      <Paragraph position="6"> function a32 a45a33a128a129a131a130 is 1 if a129 is the correct label for the training example a21 a45 and is -1 otherwise; a2 a48 a108a19a132a108 is the total number of classes; and a133 a48 a108a113a88a108 is the number of training examples. We explain what the term for the weak learner a117a70a134</Paragraph>
      <Paragraph position="8"> since we are interested in classifying noun types the final score for each unknown noun is</Paragraph>
      <Paragraph position="10"> where with a44a88a185a47a44 a22 a43 instance a21 a45 is a token of noun type a43 .</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Weak learners
</SectionTitle>
      <Paragraph position="0"> In this version of AdaBoost weak learners are extremely simple. Each feature, e.g., one particular collocation, is a weak classifier. At each round one feature a39 is selected. Each feature makes a real-valued prediction a186a125a115a94a29a140a39 a30 a129 a34 with respect to each class</Paragraph>
      <Paragraph position="2"> a34 is positive then feature a39 makes a positive prediction about class a129 ; if negative, it makes a negative prediction about class a129 . The magnitude of the prediction a108a186a188a115a94a29a140a39 a30 a129 a34 a108 is interpreted as a measure of the confidence in the prediction. Then for each training instance a simple check for the presence or absence of this feature is performed. For example, a possible collocation feature is eat:VB, and the corresponding prediction is &amp;quot;if eat:VB appears in the context of a noun, predict that the noun belongs to the class FOOD and doesn't belong to classes PLANT, BUSINESS,...&amp;quot;. A weak learner is defined as follows:  a175 ) is the sum of the weights of noun-label pairs, from the distribution a139 a115 , where the feature appears and the label is correct (wrong); a200 a48 a123 a135 a137 is a smoothing factor. In Schapire and Singer (1998) it</Paragraph>
      <Paragraph position="4"> a115 is minimized for a particular feature a39 by choosing its predictions as described in equation (8). The weight a119a106a115 usually associated with the weak classifier (see equation (2)) here is simply set to 1.</Paragraph>
      <Paragraph position="5"> If the value in (8) is plugged into (4),  a115 at each round we choose the feature a39 for which this value is the smallest. Updating these scores is what takes most of the computation, Collins (2000) describes an efficient version of this algorithm.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Morphological features
</SectionTitle>
      <Paragraph position="0"> We investigated two boosting models: a60a132a3a56a3a154a203 a114a205a211 , which uses only collocations as features, and a60a132a3a66a3a66a203 a114 a204 , which uses also a very simple set of morphological features. In a60a132a3a66a3a154a203 a114 a211 we used the collocations within a window of a40a42a41 a48a153a212 , which seemed to be a good value for both the nearest neighbor and the naive Bayes model. However, we didn't focus on any method for choosing a41 , since we believe that the collocational features we used only approximate more complex ones that need specific investigation. Our main goal was to compare models with and without morphological information. To specify the morphological properties of the nouns being classified, we used the following set of features: a213 plural (PL): if the token occurs in the plural form, PL=1; otherwise PL=0 a213 upper case (MU): if the token's first character is upper-cased MU=1; otherwise MU=0 a213 suffixes (MS): each token can have 0, 1, or more of a given set of suffixes, e.g., -er, ishment, -ity, -ism, -esse, ...</Paragraph>
      <Paragraph position="1"> a213 prefixes (MP): each token can have 0, 1 or more prefixes, e.g., pro-, re-, di-, tri-, ...</Paragraph>
      <Paragraph position="2"> a213 Words that have complex morphology share the morphological head word if this is a noun in Wordnet. There are two cases, depending on whether the word is hyphenated (MSHH) or the head word is a suffix (MSSH) - hyphenated (MSHH): drinking age and age share the same head-word age - non-hyphenated (MSSH): chairman and man share the same suffix head word, man. We limited the use of this feature to the case in which the remaining prefix (chair) also is a noun in Wordnet.</Paragraph>
      <Paragraph position="3"> We manually encoded two lists of 61 suffixes and 26 prefixes7. Figure 3 shows a few examples of the input to the models. Each line is a training instance; the attribute W refers to the lexical form of the noun and was ignored by the classifier.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Stopping criterion
</SectionTitle>
      <Paragraph position="0"> One issue when using iterative procedures is deciding when to stop. We used the simplest procedure of fixing in advance the number of iterations. We noticed that the test error drops until it reaches a point at which it seems not to improve anymore. Then the error oscillates around the same value even for thousands of iterations, without apparent overtraining. A similar behavior is observable in some of the results on text categorization presented in (Schapire and Singer, 2000). We cannot say that overtraining is not a potential danger in multiclass boosting models. However, for our experiments, in which the main goal is to investigate the impact of a particular class of features, we could limit the number of  this maximum number of iterations to be 3500; this allowed us to perform the experiments in a reasonable time. Figure 4 and Figure 5 plot training and test error for a60a132a3a66a3a154a203 a114a145a211 and a60a132a3a66a3a154a203 a114 a204 at level 4 (per instance). As the figures show, the error rate, on both training and testing, is still dropping after the fixed number of iterations. For the simplest model, a60a132a3a66a3a66a203 a114 a211 at level 1, the situation is slightly different: the model converges on its final test error rate after roughly 200 iterations and then remains stable. In general, as the number of classes grows, the model takes more iterations to converge and then the test error remains stable while the training error keeps slowly decreasing.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Results and discussion
</SectionTitle>
    <Paragraph position="0"> The following table summarizes the different models we tested:</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
MODEL                                     FEATURES
NN                                        collocation counts
NN (TFIDF)                                TFIDF-weighted collocations
NB                                        collocation counts
Boosting (collocations only)              collocations
Boosting (collocations + morphology)      collocations and morphological features
</SectionTitle>
    <Paragraph position="0"> Figure 6 plots the results across the five different subsets of the reduced lexicon. The error rate is the error on types. We also plot the results of a baseline (BASE), which always chooses the most frequent class and the error rate for random choice (RAND). The baseline strategy is quite successful on the first sets of classes, because the hierarchy under the root a220 a10 a152 a12 a152a11a221 is by far the most populated. At level 1 it performs worse only than a60a132a3a66a3a154a203 a114 a204 . As the size of the model increases, the distribution of classes becomes more uniform and the task becomes harder for the baseline. As the figure shows the impact of morphological features is quite impressive.</Paragraph>
    <Paragraph position="1"> The average decrease in type error of a60a210a3a66a3a154a203 a114a204 over a60a132a3a66a3a66a203 a114a33a211 is more than 17%, notice also the difference in test and training error, per instance, in Figures 4 and 5.</Paragraph>
    <Paragraph position="2"> In general, we observed that it is harder for all classifiers to classify nouns that don't belong to the ENTITY class, i.e., maybe not surprisingly, it is harder to classify nouns that refer to abstract concepts such as groups, acts, or psychological features. Usually most of the correct guesses regard members of the ENTITY class or its descendants, which are also typically the classes for which there is more training data. a60a132a3a66a3a154a203 a114 a204 really improves on a60a132a3a66a3a66a203 a114 a211 in this respect.</Paragraph>
    <Paragraph position="3"> a60a210a3a66a3a154a203 a114a173a204 guesses correctly several nouns to which morphological features apply like spending, enforcement, participation, competitiveness, credibility or consulting firm. It makes also many mistakes, for example on conversation, controversy and insurance company. One problem that we noticed is that there are several cases of nouns that have intuitively meaningful suffixes or prefixes that are not present in our hand-coded lists. A possible solution to his problem might be the use of more general morphological rules like those used in part-of-speech tagging models (e.g.,  tain length are included. We observed also cases of recurrent confusion between classes. For example between ACT and ABSTRACTION (or their subordinates), e.g., for the noun modernization, possibly because the suffix is common in both cases.</Paragraph>
    <Paragraph position="4"> Another measure of the importance of morphological features is the ratio of their use with respect to that of collocations. In the first 100 rounds of a60a132a3a66a3a66a203 a114 a204 , at level 5, 77% of the features selected are morphological, 69% in the first 200 rounds. As Figures 4 and 5 show these early rounds are usually the ones in which most of the error is reduced. The first ten features selected at level 5 by a60a132a3a56a3a154a203 a114a182a204 were the following: PL=0, MU=0, PL=1, MU=0, PL=1, MU=1, MS=ing, PL=0, MS=tion, and finally CO=NUMBER X:CD. One final characteristic of morphology that is worth mentioning is that it is independent from frequency. Morphological features are properties of the type and not just of the token. A model that includes morphological information should therefore suffer less from sparse data problems.</Paragraph>
    <Paragraph position="5"> From a more general perspective, Figure 6 shows that even if the simpler boosting model's performance degrades more than the competitors after level 3, a60a132a3a66a3a154a203 a114 a204 performs better than all the other classifiers until level 5 when the TFIDF nearest neighbor and the naive Bayes classifiers catch up.</Paragraph>
    <Paragraph position="6"> It should be noted though that, as Figures 4 and 5 showed, boosting was still improving at the end of the fixed number of iterations at level 4 (but also 5). It might quite well improve significantly after more iterations. However, determining absolute performance was beyond the scope of this paper. It is also fair to say that both a10a13a10 and a10a59a60 are very competitive methods, and much simpler to implement efficiently than boosting. The main advantage with boosting algorithms is the flexibility in managing features of very different nature. Feature combination can be performed naturally with probabilistic models too but it is more complicated. However, this is something worth investigating.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Related work
</SectionTitle>
    <Paragraph position="0"> Automatic lexical acquisition is a classic problem in AI. It was originally approached in the context of story understanding with the aim of enabling systems to deal with unknown words while processing text or spoken input. These systems would typically rely heavily on script-based knowledge resources. FOUL-UP (Granger, 1977) is one of these early models that tries to deterministically maximize the expectations built into its knowledge base. Jacobs and Zernik (1988) introduced the idea of using morphological information, together with other sources, to guess the meaning of unknown words. Hastings and Lytinen (1994) investigated attacking the lexical acquisition problem with a system that relies mainly on taxonomic information.</Paragraph>
    <Paragraph position="1"> In the last decade or so research on lexical semantics has focused more on sub-problems like word sense disambiguation (Yarowsky, 1995; Stevenson and Wilks, 2001), named entity recognition (Collins and Singer, 1999), and vocabulary construction for information extraction (Riloff, 1996). All of these can be seen as sub-tasks, because the space of possible classes for each word is restricted. In WSD the possible classes for a word are its possible senses; in named entity recognition or IE the number of classes is limited to the fixed (usually small) number the task focuses on. Other kinds of models that have been studied in the context of lexical acquisition are those based on lexico-syntactic patterns of the kind &amp;quot;X, Y and other Zs&amp;quot;, as in the phrase &amp;quot;bluejays, robins and other birds&amp;quot;. These types of models have been used for hyponym discovery (Hearst, 1992; Roark and Charniak, 1998), meronym discovery (Berland and Charniak, 1999), and hierarchy building (Caraballo, 1999). These methods are very interesting but of limited applicability, because nouns that do not appear in known lexico-syntactic patterns cannot be learned.</Paragraph>
  </Section>
class="xml-element"></Paper>