<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-3020">
  <Title>Automatic Part-of-Speech Induction from Text</Title>
  <Section position="4" start_page="0" end_page="77" type="intro">
    <SectionTitle>
2 Approach
</SectionTitle>
    <Paragraph position="0"> In principle, word classification can be based on a number of different linguistic principles, e.g. on phonology, morphology, syntax or semantics.</Paragraph>
    <Paragraph position="1"> However, in this paper we are only interested in syntactically motivated word classes. With syntactic classes the aim is that words belonging to the same class can substitute for one another in a sentence without affecting its grammaticality.</Paragraph>
    <Paragraph position="2"> As a consequence of the substitutability, when looking at a corpus words of the same class typically have a high agreement concerning their left and right neighbors. For example, nouns are frequently preceded by words like a, the, or this, and succeeded by words like is, has or in. In statistical  terms, words of the same class have a similar frequency distribution concerning their left and right neighbors. To some extend this can also be observed with indirect neighbors, but with them the effect is less salient and therefore we do not consider them here.</Paragraph>
    <Paragraph position="3"> The co-occurrence information concerning the words in a vocabulary and their neighbors can be stored in a matrix as shown in table 1. If we now want to discover word classes, we simply compute the similarities between all pairs of rows using a vector similarity measure such as the cosine coefficient and then cluster the words according to these similarities. The expectation is that unambiguous nouns like breath and meal form one cluster, and that unambiguous verbs like discuss and protect form another cluster.</Paragraph>
    <Paragraph position="4"> Ambiguous words like link or suit should not form a tight cluster but are placed somewhere in between the noun and the verb clusters, with the exact position depending on the ratios of the occurrence frequencies of their readings as either a noun or a verb. As this ratio can be arbitrary, according to our experience ambiguous words do not severely affect the clustering but only form some uniform background noise which more or less cancels out in a large vocabulary.1 Note that the correct assignment of the ambiguous words to clusters is not required at this stage, as this is taken care of in the next step.</Paragraph>
    <Paragraph position="5"> This step involves computing the differential vector of each word from the centroid of its closest cluster, and to assign the differential vector to the most appropriate other cluster. This process can be repeated until the length of the differential vector falls below a threshold or, alternatively, the agreement with any of the centroids becomes too low.</Paragraph>
    <Paragraph position="6"> This way an ambiguous word is assigned to several parts of speech, starting from the most common and proceeding to the least common. Figure 1 illustrates this process.</Paragraph>
    <Paragraph position="7"> 1 An alternative to relying on this fortunate but somewhat unsatisfactory effect would be not to use global co-occurrence vectors but local ones, as successfully proposed in word sense induction (Rapp, 2004). This means that every occurrence of a word obtains a separate row vector in table 1. The problem with the resulting extremely sparse matrix is that most vectors are either orthogonal to each other or duplicates of some other vector, with the consequence that the dimensionality reduction that is indispensable for such matrices does not lead to sensible results. This problem is not as severe in word sense induction where larger context windows are considered.</Paragraph>
    <Paragraph position="8"> The procedure that we described so far works in theory but not well in practice. The problem with it is that the matrix is so sparse that sampling errors have a strong negative effect on the results of the vector comparisons. Fortunately, the problem of data sparseness can be minimized by reducing the dimensionality of the matrix. An appropriate algebraic method that has the capability to reduce the dimensionality of a rectangular matrix is Singular Value Decomposition (SVD). It has the property that when reducing the number of columns the similarities between the rows are preserved in the best possible way. Whereas in other studies the reduction has typically been from several ten thousand to a few hundred, our reduction is from several ten thousand to only three. This leads to a very strong generalization effect that proves useful for our particular task.</Paragraph>
    <Paragraph position="9"> left neighbors right neighbors a we the you a can is well</Paragraph>
  </Section>
class="xml-element"></Paper>