<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2103">
  <Title>Linguistic Preprocessing for Distributional Classification of Words</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Types of Linguistic Preprocessing
</SectionTitle>
    <Paragraph position="0"> In order to prepare a machine-processable representation of a word from particular instances of its occurrence, one needs to decide on, firstly, what is to be understood by the context of a word's use, and, secondly, which elements of that context will constitute distributional features. A straightforward decision is to take a certain number of words or characters around the target word to be its occurrence context, and all uninterrupted letter sequences within this delineation to be its features.</Paragraph>
    <Paragraph position="1"> However, one may ask the question if elements of the text most indicative of the target word's meaning can be better identified by looking at the linguistic analysis of the text.</Paragraph>
    <Paragraph position="2"> In this paper we empirically study the following types of linguistic preprocessing.</Paragraph>
    <Paragraph position="3"> 1. The use of original word forms vs. their stems vs. their lemmas as distributional features. It is not evident what kind of morphological preprocessing of context words should be performed, if at all.</Paragraph>
    <Paragraph position="4"> Stemming of context words can be expected to help better abstract from their particular occurrences and to emphasize their invariable meaning. It also relaxes the stochastic dependence between features and reduces the dimensionality of the representations. In addition to these advantages, lemmatization also avoids confusing words with similar stems (e.g., car vs. care, ski vs. sky, aide vs. aid ). On the other hand, morphological preprocessing cannot be error-free and it may seem safer to simply use the original word forms and preserve their intended meaning as much as possible. In text categorization, stemming has not been conclusively shown to improve effectiveness in comparison to using original word forms, but it is usually adopted for the sake of shrinking the dimensionality of the feature space (Sebastiani, 2002). Here we will examine both the effectiveness and the dimensionality reduction that stemming and lemmatization of context words bring about.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Morphological decomposition of context words
</SectionTitle>
    <Paragraph position="0"> words. A morpheme is the smallest meaningful unit of the language. Therefore decomposing context words into morphemes and using them as features may eventually provide more fine-grained evidence about the target word. Particularly, we hypothesize that using roots of context words rather than their stems or lemmas will highlight lexical similarities between context words belonging to different parts of speech (e.g., different, difference, differentiate) or differing only in affixes (e.g., build and rebuild ).</Paragraph>
    <Paragraph position="1"> 3. Different syntactically motivated methods of delimiting the context of the word's use. The lexical context permitting occurrence of the target word consists of words and phrases whose meanings have something to do with the meaning of the target word. Therefore, given that syntactic dependencies between words presuppose certain semantic relations between them, one can expect syntactic parsing to point to most useful context words. The questions we seek answers to are: Are syntactically related words indeed more revealing about the meaning of the target word than spatially adjacent ones? Which types of syntactic dependencies should be preferred for delimiting the context of a target word's occurrence? 4. Filtering out rare context words. The typical practice of preprocessing distributional data is to remove rare word co-occurrences, thus aiming to reduce noise from idiosyncratic word uses and linguistic processing errors and at the same time form more compact word representations (e.g., Grefenstette, 1993; Ciaramita, 2002). On the other hand, even single occurrence word pairs make up a very large portion of the data and many of them are clearly meaningful. We compare the quality of the distributional representations with and without context words that occurred only once with the target word.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Experimental Task
</SectionTitle>
      <Paragraph position="0"> The preprocessing techniques were evaluated on the task of automatic classification of nouns into semantic classes. The evaluation of each preprocessing method consisted in the following.</Paragraph>
      <Paragraph position="1"> A set of nouns N each belonging to one semantic class c,C was randomly split into ten equal parts.</Paragraph>
      <Paragraph position="2"> Co-occurrence data on the nouns was collected and preprocessed using a particular method under analysis. Then each noun n,N was represented as a vector of distributional features: nr= (vn,1, vn,2, ... vn,i), where the values of the features are the frequencies of n occurring in the lexical context corresponding to v. At each experimental run, one of the ten subsets of the nouns was used as the test data and the remaining ones as the train data. The reported effectiveness measures are microaveraged precision scores averaged over the ten runs. The statistical significance of differences between performance of particular preprocessing methods reported below was estimated by means of the one-tailed paired t-test.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Data
</SectionTitle>
      <Paragraph position="0"> The set of nouns each provided with a class label to be used in the experiments was obtained as follows. We first extracted verb-noun dependencies from the British National Corpus, where nouns are either direct or prepositional objects to verbs. Each noun that occurred with more than 20 different verbs was placed into a semantic class corresponding to the WordNet synset of its most frequent sense. The resulting classes with less than 2 nouns were discarded.</Paragraph>
      <Paragraph position="1"> Thus we were left with 101 classes, each containing 2 or 3 nouns.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Classification Methods
</SectionTitle>
      <Paragraph position="0"> Two classification algorithms were used in the study: Naive Bayes and Rocchio, which were previously shown to be quite robust on highly dimensional representations on tasks including word classification (e.g., Tokunaga et al., 1997, Ciaramita, 2002).</Paragraph>
      <Paragraph position="1"> The Naive Bayes algorithm classifies a test instance n by finding a class c that maximizes p(c|nr). Assuming independence between features, the goal of the algorithm can be stated as:</Paragraph>
      <Paragraph position="3"> where p(ci) and p(v|ci) are estimated during the training process from the corpus data.</Paragraph>
      <Paragraph position="4"> The Naive Bayes classifier was the binary independence model, which estimates p(v|ci) assuming the binomial distribution of features across classes. In order to introduce the information inherent in the frequencies of features into the model all input probabilities were calculated from the real values of features, as suggested in (Lewis, 1998).</Paragraph>
      <Paragraph position="5"> The Rocchio classifier builds a vector for each class c,C from the vectors of training instances. The value of jth feature in this vector is computed as:</Paragraph>
      <Paragraph position="7"> where the first part of the equation is the average value of the feature in the positive training examples of the class, and the second part is its average value in the negative examples. The parameters b and g control the influence of the positive and negative examples on the computed value, usually set to 16 and 4, correspondingly.</Paragraph>
      <Paragraph position="8"> Once vectors for all classes are built, a test instance is classified by measuring the similarity between its vector and the vector of each class and assigning it to the class with the greatest similarity. In this study, all features of the nouns were modified by the TFIDF weight before the training.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Syntactic Contexts
</SectionTitle>
      <Paragraph position="0"> The context of the target word's occurrence can be delimited syntactically. In this view, each context word is a word that enters in a syntactic dependency relation with the target word, being either the head or the modifier in the dependency.</Paragraph>
      <Paragraph position="1"> For example, in the sentence She bought a nice hat context words for hat are bought (the head of the predicate-object relation) and nice (the attributive modifier).</Paragraph>
      <Paragraph position="2"> We group typical syntactic relations of a noun together based on general semantic relations they indicate. We define five semantic types of distributional features of nouns that can be extracted by looking at the dependencies they participate in.</Paragraph>
      <Paragraph position="3"> A. verbs in the active form, to which the target nouns are subjects (e.g., the committee discussed (the issue), the passengers got on (a bus), etc); B. active verbs, to which the target nouns are direct or prepositional objects (e.g., hold a meeting; depend on a friend); passive verbs to which the nouns are subjects (e.g., the meeting is held); C. adjectives and nouns used as attributes or predicatives to the target nouns (e.g., a tall building, the building is tall; amateur actor, the actor is an amateur); D. prepositional phrases, where the target nouns are heads (e.g., the box in the room); we consider three possibilities to construct distributional features from such a dependency: with the preposition (in_room, D1), without it (room, D2), and creating to separate features for the preposition and the noun (in and room, D3).</Paragraph>
      <Paragraph position="4"> E. prepositional phrases, where the target nouns are modifiers (the ball in the box); as with type D, three subtypes are identified: E1 (ball_in ), E2 (ball), and E3 (ball and in); We compare these feature types to each other and to features extracted by means of the window-based context delineation. The latter were collected by going over occurrences of each noun with a window of three words around it. This particular size of the context window was chosen following findings of a number of studies indicating that small context windows, i.e. 2-3 words, best capture the semantic similarity between words (e.g., Levy et al., 1998; Ciaramita, 2002). Thereby, a common stoplist was used to remove too general context words. All the context words experimented with at this stage were lemmatized; those, which co-occurred with the target noun only once, were removed.</Paragraph>
      <Paragraph position="5"> We first present the results of evaluation of different types of features formed from prepositional phrases involving target nouns (see  prepositional phrases involving target nouns.</Paragraph>
      <Paragraph position="6"> On both classifiers and for both types D and E, the performance is noticeably higher when the collocation of the noun with the preposition is used as one single feature (D1 and E1). Using only the nouns as separate features decreases classification accuracy. Adding the prepositions to them as individual features improves the performance very slightly on Naive Bayes, but has no influence on the performance of Rocchio. Comparing types D1 and E1, we see that D1 is clearly more effective, particularly on Naive Bayes, and uses around 30% less features than E1.</Paragraph>
      <Paragraph position="7">  all the five feature types described above. On Naive Bayes, each of the syntactically-defined types yields performance inferior to that of the window-based features. On Rocchio, window-based is much worse than B and C, but is comparable to A, D1 and E1. Looking at the dimensionality of the feature space each method produces, we see that the window-based features are much more numerous than any of the syntactically-defined ones, although collected from the same corpus. The much larger feature however space does not yield a proportional increase in classification accuracy. For example, there are around seven times less type C features than window-based ones, but they are only 1.9% less effective on Naive Bayes and significantly more effective on Rocchio.</Paragraph>
      <Paragraph position="8"> Among the syntactically-defined features, types B and C perform equally well, no statistical significance between their performances was found on either NB or Rocchio. In fact, the ranking of the feature types wrt their performance is the same for both classifiers: types B and C trail E1 by a large margin, which is followed by D1, type A being the worst performer. The results so far suggest that adjectives and verbs near which target nouns are used as objects provide the best evidence about the target nouns' meaning.</Paragraph>
      <Paragraph position="9"> We further tried collapsing different types of features together. In doing so, we appended a tag to each feature describing its type so as to avoid confusing context words linked by different syntactic relations to the target noun (see Table 3). The best result was achieved by combining all the five syntactic feature types, clearly outperforming the window-based context delineation on both Naive Bayes (26% improvement, p&lt;0.05) and Rocchio (88% improvement, p&lt;0.001) and still using 20% smaller feature space. The combination of B and C produced only slightly worse results (the differences not significant for either classifiers), but</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Original word forms vs. stems vs. lemmas
</SectionTitle>
      <Paragraph position="0"> We next looked at the performance resulting from stemming and lemmatization of context words. Since morphological preprocessing is likely to differently affect nouns, verbs, and adjectives, we study them on data of types B (verbs), C (adjectives), and the combination of D1 and E1 (nouns) from the previous experiment. Stemming was carried out using the Porter stemmer.</Paragraph>
      <Paragraph position="1"> Lemmatization was performed using a pattern-matching algorithm which operates on PoS-tagged text and consults the WordNet database for exceptions. As before, context words that occurred only once with a target noun were discarded. Table 4 describes the results of these experiments.</Paragraph>
      <Paragraph position="2">  There is very little difference in effectiveness between these three methods (except for lemmatized nouns on NB). As a rule, the difference between them is never greater than 1%. In terms of the size of feature space, lemmatization is most advisable for verbs (32% reduction of feature space compared with the original verb forms), which is not surprising since the verb is the most infected part of speech in English. The feature space reduction for nouns was around 25%. Least reduction of feature space occurs when applying lemmatization to adjectives, which inflect only for degrees of comparison.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Morphological decomposition
</SectionTitle>
      <Paragraph position="0"> We further tried constructing features for a target noun on the basis of morphological analysis of words occurring in its context. As in the experiments with stemming and lemmatization, in order to take into account morphological differences between parts of speech, the effects of morphological decomposition of context words was studied on the distributional data of types B (verbs), C (adjectives), and D1+E1 (nouns).</Paragraph>
      <Paragraph position="1"> The decomposition of words into morphemes was carried out as follows. From &amp;quot;Merriam-Webster's Dictionary of Prefixes, Suffixes, and Combining Forms&amp;quot;1 we extracted a list of 12 verbal, 59 adjectival and 138 nounal suffixes, as well as 80 prefixes, ignoring affixes consisting of only one character. All suffixes for a particular part-of-speech and all prefixes were sorted according to their character length. First, all context words were lemmatized. Then, examining the part-of-speech of the context word, presence of each affix with it was checked by simple string matching, starting from the top of the corresponding array of affixes. For each word, only one prefix and only one suffix was matched. In this way, every word was broken down into maximum three morphemes: the root, a prefix and a suffix.</Paragraph>
      <Paragraph position="2"> Two kinds of features were experimented with: one where features corresponded to the roots of the context words and one where all morphemes of the context word (i.e., the root, prefix and suffix) formed separate features. When combining features created from context words belonging to different parts-of-speech, no tags were used in order to map roots of cognate words to the same feature. The results of these experiments are shown in Table 5.</Paragraph>
      <Paragraph position="3">  morphological analysis of context words.</Paragraph>
      <Paragraph position="4"> On Naive Bayes, using only roots increases the classification accuracy for B, C, and B+C compared to the use of lemmas. The improvement, however, is not significant. Inclusion of affixes does not produce any perceptible effect on the performance. In all other cases and when the Rocchio classifier is used, decomposition of words into morphemes consistently decreases performance compared to the use of their lemmas.</Paragraph>
      <Paragraph position="5"> These results seem to suggest that the union of the root with the affixes constitutes the most  optimal &amp;quot;container&amp;quot; for distributional information. Decomposition of words into morphemes often causes loss of a part of this information. It seems there are few affixes with the meaning so abstract that they can be safely discarded.</Paragraph>
      <Paragraph position="6"> 4.4 Filtering out rare context words To study the effect of removing singleton context words, we compared the quality of classifications with and without them. The results are shown in Table 6.</Paragraph>
      <Paragraph position="7">  The results do not permit making any conclusions as to the enhanced effectiveness resulting from discarding rare co-occurrences.</Paragraph>
      <Paragraph position="8"> Discarding singletons, however, does considerably reduce the feature space. The dimensionality reduction is especially large for the datasets involving types B, D1 and E1, where each feature is a free collocation of a noun or a verb with a preposition, whose multiple occurrences are much less likely than multiple occurrences of an individual context word.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Related work
</SectionTitle>
    <Paragraph position="0"> A number of previous studies compared different kinds of morphological and syntactic preprocessing performed before inducing a co-occurrence model of word meaning.</Paragraph>
    <Paragraph position="1"> Grefenstette (1993) studied two context delineation methods of English nouns: the window-based and the syntactic, whereby all the different types of syntactic dependencies of the nouns were used in the same feature space. He found that the syntactic technique produced better results for frequent nouns, while less frequent nouns were more effectively modeled by the windowing technique. He explained these results by the fact that the syntactic technique extracts much fewer albeit more useful features and the small number of features extracted for rare nouns is not sufficient for representing their distributional behavior.</Paragraph>
    <Paragraph position="2"> Alfonseca and Manandhar (2002) compared different types of syntactic dependencies of a noun as well as its &amp;quot;topic signature&amp;quot;, i.e. the features collected by taking the entire sentence as the context of its occurrence, in terms of their usefulness for the construction of its distributional representation. They found that the best effectiveness is achieved when using a combination of the topic signature with the &amp;quot;object signature&amp;quot; (a list of verbs and prepositions to which the target noun is used as an argument) and the &amp;quot;subject signature&amp;quot; (a list of verbs to which the noun is used as a subject). The &amp;quot;modifier signature&amp;quot; containing co-occurring adjectives and determiners produced the worst results.</Paragraph>
    <Paragraph position="3"> Pado and Lapata (2003) investigated different possibilities to delimit the context of a target word by considering the syntactic parse of the sentence.</Paragraph>
    <Paragraph position="4"> They examined the informativeness of features arising from using the window-based context delineation, considering the sum of dependencies the target word is involved in, and considering the entire argument structure of a verb as the context of the target word, so that, e.g. an object can be a feature for a subject of that verb. Their study discovered that indirect syntactic relations within an argument structure of a verb generally yield better results than using only direct syntactic dependencies or the windowing technique.</Paragraph>
    <Paragraph position="5"> Ciaramita (2002) looked at how the performance of automatic classifiers on the word classification task is affected by the decomposition of target words into morphologically relevant features. He found that the use of suffixes and prefixes of target nouns is indeed more advantageous, but this was true only when classifying words into large word classes. These classes are formed on the basis of quite general semantic distinctions, which are often reflected in the meanings of their affixes. In addition to that, the classification method used involved feature selection, which ensured that useless features resulting from semantically empty affixes and errors of the morphological decomposition did not harm the classification accuracy.</Paragraph>
  </Section>
class="xml-element"></Paper>