<?xml version="1.0" standalone="yes"?>
<Paper uid="E95-1020">
  <Title>Distributional Part-of-Speech Tagging Hinrich Schütze</Title>
  <Section position="4" start_page="141" end_page="143" type="metho">
    <SectionTitle>
3 Tag induction
</SectionTitle>
    <Paragraph position="0"> We start by constructing representations of the syntactic behavior of a word with respect to its left and right context. Our working hypothesis is that syntactic behavior is reflected in co-occurrence patterns. Therefore, we will measure the similarity between two words with respect to their syntactic behavior to, say, their left side by the degree to which they share the same neighbors on the left. If the counts of neighbors are assembled into a vector (with one dimension for each neighbor), the cosine can be employed to measure similarity. It will assign a value close to 1.0 if two words share many neighbors, and 0.0 if they share none. We refer to the vector of left neighbors of a word as its left context vector, and to the vector of right neighbors as its right context vector.</Paragraph>
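The following is a minimal sketch, not part of the original paper, of how such neighbor-count vectors and the cosine measure can be computed. The toy corpus, the helper names, and the use of Python with numpy are illustrative assumptions; the paper builds the vectors over the 250 most frequent words of the Brown corpus.

```python
# Illustrative sketch only: build a left context-count vector for a word and
# compare two words with the cosine measure. Corpus and names are made up.
from collections import Counter
import numpy as np

tokens = "the old man saw the young man and the old woman".split()

# Context dimensions = the most frequent word types (top 250 in the paper).
context_words = [w for w, _ in Counter(tokens).most_common(250)]
idx = {w: i for i, w in enumerate(context_words)}

def left_context_vector(word):
    """Count how often each context word occurs immediately to the left of `word`."""
    v = np.zeros(len(context_words))
    for i in range(1, len(tokens)):
        if tokens[i] == word and tokens[i - 1] in idx:
            v[idx[tokens[i - 1]]] += 1
    return v

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Words sharing many left neighbors score close to 1.0; none shared gives 0.0.
print(cosine(left_context_vector("man"), left_context_vector("woman")))
```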
    <Paragraph position="1"> The unreduced context vectors in the experiment described here have 250 entries, corresponding to the 250 most frequent words in the Brown corpus.</Paragraph>
    <Paragraph position="2"> This basic idea of measuring distributional similarity in terms of shared neighbors must be modified because of the sparseness of the data. Consider two infrequent adjectives that happen to modify different nouns in the corpus. Their right similarity according to the cosine measure would be zero. This is clearly undesirable. But even with high-frequency words, the simple vector model can yield misleading similarity measurements. A case in point is &amp;quot;a&amp;quot; vs. &amp;quot;an&amp;quot;. These two articles do not share any right neighbors since the former is only used before consonants and the latter only before vowels. Yet intuitively, they are similar with respect to their right syntactic context despite the lack of common right neighbors.</Paragraph>
    <Paragraph position="3"> Our solution to these problems is the application of a singular value decomposition. We can represent the left vectors of all words in the corpus as a matrix C with n rows, one for each word whose left neighbors are to be represented, and k columns, one for each of the possible neighbors.</Paragraph>
    <Paragraph position="4"> SVD can be used to approximate the row and column vectors of C in a low-dimensional space. In more detail, SVD decomposes a matrix C, the matrix of left vectors in our case, into three matrices T0, S0, and D0 such that:</Paragraph>
    <Paragraph position="5"> C = T0 S0 D0'</Paragraph>
    <Paragraph position="6"> S0 is a diagonal k-by-k matrix that contains the singular values of C in descending order. The ith singular value can be interpreted as indicating the strength of the ith principal component of C. T0 and D0 are orthonormal matrices that approximate the rows and columns of C, respectively. By restricting the matrices T0, S0, and D0 to their first m &lt; k columns (= principal components) one obtains the matrices T, S, and D. Their product Ĉ = TSD' is the best least-squares approximation of C by a matrix of rank m. We chose m = 50 (reduction to a 50-dimensional space) for the SVDs described in this paper.</Paragraph>
    <Paragraph position="7"> SVD addresses the problems of generalization and sparseness because broad and stable generalizations are represented on dimensions with large values which will be retained in the dimensionality reduction. In contrast, dimensions corresponding to small singular values represent idiosyncrasies, like the phonological constraint on the usage of &amp;quot;an&amp;quot; vs. &amp;quot;a&amp;quot;, and will be dropped. We also gain efficiency since we can manipulate smaller vectors, reduced to 50 dimensions. We used SVDPACK to compute the singular value decompositions described in this paper (Berry, 1992).</Paragraph>
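As a rough illustration of this reduction step, here is a numpy-based sketch; it is not the paper's SVDPACK code, and the matrix C is a random placeholder for the real count matrix.

```python
# Sketch of the dimensionality reduction: keep the first m = 50 principal
# components of the matrix of left context vectors. All data is synthetic.
import numpy as np

n, k, m = 1000, 250, 50              # words, context words, reduced dimensions
C = np.random.rand(n, k)             # stand-in for the n-by-k left-vector matrix

T0, s0, D0t = np.linalg.svd(C, full_matrices=False)   # C = T0 @ diag(s0) @ D0t
T, S, Dt = T0[:, :m], np.diag(s0[:m]), D0t[:m, :]     # restrict to m components

C_hat = T @ S @ Dt                   # best rank-m least-squares approximation of C
reduced_rows = T @ S                 # 50-dimensional representations of the rows
print(C_hat.shape, reduced_rows.shape)
```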
    <Paragraph position="8"> Table 1 shows the nearest neighbors of two words (ordered according to closeness to the head word) after the dimensionality reduction. Neighbors with highest similarity according to both left and right context are listed. One can see clear differences between the nearest neighbors in the two spaces. The right-context neighbors of &amp;quot;onto&amp;quot; contain verbs because both prepositions and verbs govern noun phrases to their right.</Paragraph>
    <Paragraph position="9"> The left-context neighborhood of &amp;quot;onto&amp;quot; reflects the fact that prepositional phrases are used in the same position as adverbs like &amp;quot;away&amp;quot; and &amp;quot;together&amp;quot;, thus making their left context similar. For &amp;quot;seemed&amp;quot;, left-context neighbors are words that have similar types of noun phrases in subject position (mainly auxiliaries). The right-context neighbors all take &amp;quot;to&amp;quot;-infinitives as complements. An adjective like &amp;quot;likely&amp;quot; is very similar to &amp;quot;seemed&amp;quot; in this respect although its left context is quite different from that of &amp;quot;seemed&amp;quot;. Similarly, the generalization that prepositions and transitive verbs are very similar if not identical in the way they govern noun phrases would be lost if &amp;quot;left&amp;quot; and &amp;quot;right&amp;quot; properties of words were lumped together in one representation. These examples demonstrate the importance of representing generalizations about left and right context separately.</Paragraph>
    <Paragraph position="10"> The left and right context vectors are the basis for four different tag induction experiments, which are described in detail below:</Paragraph>
    <Section position="1" start_page="142" end_page="142" type="sub_section">
      <SectionTitle>
3.1 Induction based on word type only
</SectionTitle>
      <Paragraph position="0"> The two context vectors of a word characterize the distribution of neighboring words to its left and right. The concatenation of left and right context vector can therefore serve as a representation of a word's distributional behavior (Finch and Chater, 1992; Schütze, 1993). We formed such concatenated vectors for all 47,025 words (surface forms) in the Brown corpus. Here, we use the raw 250-dimensional context vectors and apply the SVD to the 47,025-by-500 matrix (47,025 words with two 250-dimensional context vectors each). We obtained 47,025 50-dimensional reduced vectors from the SVD and clustered them into 200 classes using the fast clustering algorithm Buckshot (Cutting et al., 1992) (group average agglomeration applied to a sample). This classification constitutes the baseline performance for distributional part-of-speech tagging. All occurrences of a word are assigned to one class. As pointed out above, such a procedure is problematic for ambiguous words.</Paragraph>
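A condensed sketch of this type-based experiment might look as follows; KMeans stands in for the Buckshot clustering used in the paper, and the random matrices replace the actual Brown-corpus counts.

```python
# Sketch: concatenate left and right context vectors per word type, reduce with
# SVD, and cluster the reduced vectors into 200 word classes. Data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

n_words, dim = 5000, 250                       # 47,025 and 250 in the paper
left = np.random.rand(n_words, dim)
right = np.random.rand(n_words, dim)

X = np.hstack([left, right])                   # n_words-by-500 matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
reduced = U[:, :50] * s[:50]                   # 50-dimensional word vectors

word_class = KMeans(n_clusters=200, n_init=10).fit_predict(reduced)
print(word_class[:10])                         # one induced tag per word type
```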
    </Section>
    <Section position="2" start_page="142" end_page="143" type="sub_section">
      <SectionTitle>
3.2 Induction based on word type and context
</SectionTitle>
      <Paragraph position="0"> In order to exploit contextual information in the classification of a token, we simply use context vectors of the two words occurring next to the token. An occurrence of word w is represented by a concatenation of four context vectors: the right context vector of the preceding word, the left context vector of w, the right context vector of w, and the left context vector of the following word. The motivation is that a word's syntactic role depends both on the syntactic properties of its neighbors and on its own potential for entering into syntactic relationships with these neighbors. The only properties of context that we consider are the right-context vector of the preceding word and the left-context vector of the following word because they seem to represent the contextual information most important for the categorization of w. For example, for the disambiguation of &amp;quot;work&amp;quot; in &amp;quot;her work seemed to be important&amp;quot;, only the fact that &amp;quot;seemed&amp;quot; expects noun phrases to its left is important; the right context vector of &amp;quot;seemed&amp;quot; does not contribute to disambiguation. That only the immediate neighbors are crucial for categorization is clearly a simplification, but, as the results presented below show, it seems to work surprisingly well.</Paragraph>
      <Paragraph position="1"> Again, an SVD is applied to address the problems of sparseness and generalization. We randomly selected 20,000 word triplets from the corpus and formed concatenations of four context vectors as described above. The singular value decomposition of the resulting 20,000-by-1,000 matrix defines a mapping from the 1,000-dimensional space of concatenated context vectors to a 50-dimensional reduced space. Our tag set was then induced by clustering the reduced vectors of the 20,000 selected occurrences into 200 classes. Each of the 200 tags is defined by the centroid of the corresponding class (the sum of its members). Distributional tagging of an occurrence of a word w then proceeds by retrieving the four relevant context vectors (right context vector of previous word, left context vector of following word, both context vectors of w), concatenating them into one 1,000-component vector, mapping this vector to 50 dimensions, computing the correlations with the 200 cluster centroids and, finally, assigning the occurrence to the closest cluster. This procedure was applied to all tokens of the Brown corpus.</Paragraph>
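A sketch of this tagging step is given below; the dictionaries of context vectors, the SVD projection matrix, and the cluster centroids are assumed to be given, and the function and parameter names are hypothetical.

```python
# Sketch of tagging one occurrence: concatenate the four relevant context
# vectors, map them into the reduced space, and pick the closest centroid.
import numpy as np

def tag_token(prev_w, w, next_w, left_vec, right_vec, projection, centroids):
    """left_vec/right_vec: word -> 250-dim context vector;
    projection: 1,000-by-50 SVD mapping; centroids: 200-by-50 class centroids."""
    occurrence = np.concatenate([
        right_vec[prev_w],   # right context vector of the preceding word
        left_vec[w],         # both context vectors of w itself
        right_vec[w],
        left_vec[next_w],    # left context vector of the following word
    ])                       # 1,000-component occurrence vector
    reduced = occurrence @ projection                     # map to 50 dimensions
    # correlation (cosine) with each centroid; assign to the closest cluster
    sims = centroids @ reduced / (
        np.linalg.norm(centroids, axis=1) * np.linalg.norm(reduced) + 1e-12)
    return int(np.argmax(sims))
```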
      <Paragraph position="2"> We will see below that this method of distributional tagging, although partially successful, fails for many tokens whose neighbors are punctuation marks. The context vectors of punctuation marks contribute little information about syntactic categorization since there are no grammatical dependencies between words and punctuation marks, in contrast to strong dependencies between neighboring words.</Paragraph>
      <Paragraph position="3"> For this reason, a second induction on the basis of word type and context was performed, but only for those tokens with informative contexts.</Paragraph>
      <Paragraph position="4"> Tokens next to punctuation marks and tokens with rare words as neighbors were not included.</Paragraph>
      <Paragraph position="5"> Contexts with rare words (fewer than ten occurrences) were also excluded for similar reasons: if a word occurs only nine or fewer times, its left and right context vectors capture little information for syntactic categorization. In the experiment, 20,000 natural contexts were randomly selected, processed by the SVD and clustered into 200 classes. The classification was then applied to all natural contexts of the Brown corpus.</Paragraph>
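A small sketch of this filter under the stated criteria (neighbors must not be punctuation marks and must occur at least ten times); the toy corpus and the lowered threshold in the example call are illustrative assumptions.

```python
# Sketch of the "natural context" filter: keep a token position only if neither
# neighbor is a punctuation mark and both neighbors are frequent enough.
from collections import Counter
import string

tokens = "the old man , however , saw the young man .".split()
counts = Counter(tokens)
PUNCT = set(string.punctuation)

def is_natural(i, min_freq=10):
    prev_w, next_w = tokens[i - 1], tokens[i + 1]
    if prev_w in PUNCT or next_w in PUNCT:
        return False
    return counts[prev_w] >= min_freq and counts[next_w] >= min_freq

# min_freq lowered to 2 only so the toy example produces output
natural_positions = [i for i in range(1, len(tokens) - 1) if is_natural(i, 2)]
print(natural_positions)
```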
    </Section>
    <Section position="3" start_page="143" end_page="143" type="sub_section">
      <SectionTitle>
3.3 Generalized context vectors
</SectionTitle>
      <Paragraph position="0"> The context vectors used so far only capture information about distributional interactions with the 250 most frequent words. Intuitively, it should be possible to gain accuracy in tag induction by using information from more words. One way to do this is to let the right context vector record which classes of left context vectors occur to the right of a word. The rationale is that words with similar left context characterize words to their right in a similar way. For example, &amp;quot;seemed&amp;quot; and &amp;quot;would&amp;quot; have similar left contexts, and they characterize the right contexts of &amp;quot;he&amp;quot; and &amp;quot;the firefighter&amp;quot; as potentially containing an inflected verb form.</Paragraph>
      <Paragraph position="1"> Rather than having separate entries in its right context vector for &amp;quot;seemed&amp;quot;, &amp;quot;would&amp;quot;, and &amp;quot;likes&amp;quot;, a word like &amp;quot;he&amp;quot; can now be characterized by a generalized entry for &amp;quot;inflected verb form occurs frequently to my right&amp;quot;.</Paragraph>
      <Paragraph position="2"> This proposal was implemented by applying a singular value decomposition to the 47,025-by-250 matrix of left context vectors and clustering the resulting context vectors into 250 classes. A generalized right context vector v for word w was then formed by counting how often words from these 250 classes occurred to the right of w. Entry v_i counts the number of times that a word from class i occurs to the right of w in the corpus (as opposed to the number of times that the word with frequency rank i occurs to the right of w). Generalized left context vectors were derived by an analogous procedure using word-based right context vectors. Note that the information about left and right is kept separate in this computation.</Paragraph>
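The two-step construction can be sketched as follows; KMeans again stands in for the clustering algorithm, and the vocabulary, vectors, and class count are toy placeholders (the paper uses 250 classes).

```python
# Sketch of generalized right context vectors: cluster word-based left context
# vectors into classes, then count class occurrences to the right of each word.
import numpy as np
from sklearn.cluster import KMeans

vocab = ["he", "she", "seemed", "would", "likes", "the", "firefighter"]
left_vectors = np.random.rand(len(vocab), 250)   # stand-in word-based vectors

n_classes = 3                                    # 250 in the paper
word_class = dict(zip(vocab, KMeans(n_clusters=n_classes, n_init=10)
                      .fit_predict(left_vectors)))

def generalized_right_vector(word, tokens):
    v = np.zeros(n_classes)
    for i in range(len(tokens) - 1):
        if tokens[i] == word and tokens[i + 1] in word_class:
            v[word_class[tokens[i + 1]]] += 1    # entry i counts class i to the right
    return v

print(generalized_right_vector("he", "he seemed he would she likes".split()))
```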
      <Paragraph position="3"> This differs from previous approaches (Finch and Chater, 1992; Schütze, 1993) in which left and right context vectors of a word are always used in one concatenated vector. There are arguably fewer different types of right syntactic contexts than types of syntactic categories. For example, transitive verbs and prepositions belong to different syntactic categories, but their right contexts are virtually identical in that they require a noun phrase. This generalization could not be exploited if left and right context were not treated separately. Another argument for the two-step derivation is that many words don't have any of the 250 most frequent words as their left or right neighbor.</Paragraph>
      <Paragraph position="4"> Hence, their vector would be zero in the word-based scheme. The class-based scheme makes it more likely that meaningful representations are formed for all words in the vocabulary.</Paragraph>
      <Paragraph position="5"> The generalized context vectors were input to the tag induction procedure described above for word-based context vectors: 20,000 word triplets were selected from the corpus, encoded as 1,000-dimensional vectors (consisting of four generalized context vectors), decomposed by a singular value decomposition and clustered into 200 classes. The resulting classification was applied to all tokens in the Brown corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="143" end_page="146" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> The results of the four experiments were evaluated by forming 16 classes of tags from the Penn Treebank as shown in Table 2. Preliminary experiments showed that distributional methods distinguish adnominal and predicative uses of adjectives (e.g. &amp;quot;the black cat&amp;quot; vs. &amp;quot;the cat is black&amp;quot;). Therefore the tag &amp;quot;ADN&amp;quot; was introduced for uses of adjectives, nouns, and participles as adnominal modifiers. The tag &amp;quot;PRD&amp;quot; stands for predicative uses of adjectives. The Penn Treebank parses of the Brown corpus were used to determine whether a token functions as an adnominal modifier. Punctuation marks, special symbols, interjections, foreign words and tags with fewer than 100 instances were excluded from the evaluation.</Paragraph>
    <Paragraph position="1"> Tables 3 and 4 present results for word type-based induction and induction based on word type and context. For each tag t, the table lists the frequency of t in the corpus (&amp;quot;frequency&amp;quot;) [2]; the number of induced tags i0, i1, ..., il that were assigned to it (&amp;quot;# classes&amp;quot;); the number of times an occurrence of t was correctly labeled as belonging to one of i0, i1, ..., il (&amp;quot;correct&amp;quot;); the number of times that a token of a different tag t' was miscategorized as being an instance of i0, i1, ..., il (&amp;quot;incorrect&amp;quot;); and precision and recall of the categorization of t. Precision is the number of correct tokens divided by the sum of correct and incorrect tokens. Recall is the number of correct tokens divided by the total number of tokens of t (in the first column). The last column gives van Rijsbergen's F measure, which computes an aggregate score from precision and recall (van Rijsbergen, 1979): F = 1 / (α/P + (1 - α)/R), where P is precision and R is recall. We chose α = 0.5, which gives equal weight to precision and recall.</Paragraph>
    <Paragraph position="2"> [Footnote 2] The small difference in overall frequency in the tables is due to the fact that some word-based context vectors consist entirely of zeros. There were about a hundred word triplets whose four context vectors did not have non-zero entries and could not be assigned a cluster.</Paragraph>
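A tiny worked example of these measures, with made-up counts rather than numbers from the paper:

```python
# Hypothetical counts for one tag t; the formulas follow the definitions above.
correct, incorrect, total = 900, 100, 1200

precision = correct / (correct + incorrect)   # 0.90
recall = correct / total                      # 0.75
f = 1.0 / (0.5 / precision + 0.5 / recall)    # alpha = 0.5: harmonic mean, ~0.82
print(round(precision, 2), round(recall, 2), round(f, 2))
```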
    <Paragraph position="5"> It is clear from the tables that incorporating context improves performance considerably. The F score increases for all tags except CD, with an average improvement of more than 0.20. The tag CD is probably better thought of as describing a word class. There is a wide range of heterogeneous syntactic functions of cardinals in particular contexts: quantificational and adnominal uses, bare NP's (&amp;quot;is one of&amp;quot;), dates and ages (&amp;quot;Jan 1&amp;quot;, &amp;quot;gave his age as 25&amp;quot;), and enumerations. In this light, it is not surprising that the word-type method does better on cardinals.</Paragraph>
    <Paragraph position="6"> Table 5 shows that performance for generalized context vectors is better than for word-based context vectors (0.74 vs. 0.72). However, since the number of tags with better and worse performance is about the same (7 and 5), one cannot conclude with certainty that generalized context vectors induce tags of higher quality. Apparently, the 250 most frequent words capture most of the relevant distributional information so that the additional information from less frequent words available from generalized vectors only has a small effect. Table 6 looks at results for &amp;quot;natural&amp;quot; contexts, i.e. those not containing punctuation marks and rare words. Performance is consistently better than for the evaluation on all contexts, indicating that the low quality of the distributional information about punctuation marks and rare words is a difficulty for successful tag induction.</Paragraph>
    <Paragraph position="7"> Even for &amp;quot;natural&amp;quot; contexts, performance varies considerably. It is fairly good for prepositions, determiners, pronouns, conjunctions, the infinitive marker, modals, and the possessive marker. Tag induction fails for cardinals (for the reasons mentioned above) and for &amp;quot;-ing&amp;quot; forms. Present participles and gerunds are difficult because they exhibit both verbal and nominal properties and occur in a wide variety of different contexts whereas other parts of speech have a few typical and frequent contexts.</Paragraph>
    <Paragraph position="8"> It may seem worrying that some of the tags are assigned a high number of clusters (e.g., 49 for N, 36 for ADN). A closer look reveals that many clusters embody finer distinctions. Some examples: Nouns in cluster 0 are heads of larger noun phrases, whereas the nouns in cluster 1 are full-fledged NPs. The members of classes 29 and 111 function as subjects. Class 49 consists of proper nouns. However, there are many pairs or triples of clusters that should be collapsed into one on linguistic grounds. They were separated on distributional criteria that don't have linguistic correlates. An analysis of the divergence between our classification and the manually assigned tags revealed three main sources of errors: rare words and rare syntactic phenomena, indistinguishable distribution, and non-local dependencies.</Paragraph>
    <Paragraph position="9"> Rare words are difficult because of lack of distributional evidence. For example, &amp;quot;ties&amp;quot; is used as a verb only 2 times (out of 15 occurrences in the corpus). Both occurrences are miscategorized, since its context vectors do not provide enough evidence for the verbal use. Rare syntactic constructions pose a related problem: There are not enough instances to justify the creation of a separate cluster. For example, verbs taking bare infinitives were classified as adverbs since this is too rare a phenomenon to provide strong distributional evidence (&amp;quot;we do not DARE speak of&amp;quot;, &amp;quot;legislation could HELP remove&amp;quot;).</Paragraph>
    <Paragraph position="10"> The case of the tags &amp;quot;VBN&amp;quot; and &amp;quot;PRD&amp;quot; (past participles and predicative adjectives) demonstrates the difficulties of word classes with indistinguishable distributions. There are hardly any distributional clues for distinguishing &amp;quot;VBN&amp;quot; and &amp;quot;PRD&amp;quot; since both are mainly used as complements of &amp;quot;to be&amp;quot;. [3] A common tag class was created for &amp;quot;VBN&amp;quot; and &amp;quot;PRD&amp;quot; to show that they are reasonably well distinguished from other parts of speech, even if not from each other. Semantic understanding is necessary to distinguish between the states described by phrases of the form &amp;quot;to be adjective&amp;quot; and the processes described by phrases of the form &amp;quot;to be past participle&amp;quot;.</Paragraph>
    <Paragraph position="11"> Finally, the method fails if there are no local dependencies that could be used for categorization and only non-local dependencies are informative. For example, the adverb in &amp;quot;McN. Hester, CURRENTLY Dean of...&amp;quot; and the conjunction in &amp;quot;to add that, IF United States policies ...&amp;quot; have similar immediate neighbors (comma, NP).</Paragraph>
    <Paragraph position="12"> The decision to consider only immediate neighbors is responsible for this type of error since taking a wider context into account would disambiguate the parts of speech in question.</Paragraph>
  </Section>
  <Section position="6" start_page="146" end_page="146" type="metho">
    <SectionTitle>
5 Future Work
</SectionTitle>
    <Paragraph position="0"> There are three avenues of future research we are interested in pursuing. First, we are planning to apply the algorithm to an as yet untagged language. Languages with a rich morphology may be more difficult than English since with fewer tokens per type, there is less data on which to base a categorization decision.</Paragraph>
    <Paragraph position="1"> Secondly, the error analysis suggests that considering non-local dependencies would improve results. Categories that can be induced well (those characterized by local dependencies) could be input into procedures that learn phrase structure (e.g., Brill and Marcus, 1992; Finch, 1993).</Paragraph>
    <Paragraph position="2"> These phrase constraints could then be incorporated into the distributional tagger to characterize non-local dependencies.</Paragraph>
    <Paragraph position="3"> Finally, our procedure induces a &amp;quot;hard&amp;quot; part-of-speech classification of occurrences in context, i.e., each occurrence is assigned to only one category.</Paragraph>
    <Paragraph position="4"> It is by no means generally accepted that such a classification is linguistically adequate. There is both synchronic (Ross, 1972) and diachronic (Tabor, 1994) evidence suggesting that words and their uses can inherit properties from several prototypical syntactic categories. For example, &amp;quot;fun&amp;quot; in &amp;quot;It's a fun thing to do.&amp;quot; has properties of both a noun and an adjective (superlative &amp;quot;funnest&amp;quot; possible). We are planning to explore &amp;quot;soft&amp;quot; classification algorithms that can account for these phenomena.</Paragraph>
    <Paragraph position="5"> [Footnote 3] Because of phrases like &amp;quot;I had sweet potatoes&amp;quot;, forms of &amp;quot;have&amp;quot; cannot serve as a reliable discriminator either.</Paragraph>
  </Section>
class="xml-element"></Paper>