<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0712">
<Title>Knowledge-Free Induction of Morphology Using Latent Semantic Analysis</Title>
<Section position="3" start_page="67" end_page="67" type="metho">
<SectionTitle> 2 Previous work </SectionTitle>
<Paragraph position="0"> Existing induction algorithms all focus on identifying prefixes, suffixes, and word stems in inflectional languages (avoiding infixes and other language types like concatenative or agglutinative languages (Sproat, 1992)). They also observe high-frequency occurrences of some word endings or beginnings, perform statistics thereon, and propose that some of these appendages are valid morphemes.</Paragraph>
<Paragraph position="1"> However, these algorithms differ in their specifics.</Paragraph>
<Paragraph position="2"> Déjean (1998) uses an approach derived from Harris (1951) in which a word is split wherever the number of distinct letters that can follow a given sequence of characters surpasses a threshold.</Paragraph>
<Paragraph position="3"> He uses these hypothesized affixes to resegment words and thereby identify additional affixes that were initially overlooked. His overall goal is different from ours: he primarily seeks an affix inventory.</Paragraph>
<Paragraph position="4"> Goldsmith (1997) tries cutting each word in exactly one place, based on the probability and lengths of the hypothesized stems and affixes, and applies the EM algorithm to eliminate inappropriate parses. He collects the possible suffixes for each stem, calling these a signature, which aids in determining word classes. Goldsmith (2000) later incorporates minimum description length to identify the stemming characteristics that most compress the data, but his algorithm otherwise remains similar in nature. Goldsmith's algorithm is practically knowledge-free, though he incorporates capitalization removal and some word segmentation.</Paragraph>
<Paragraph position="5"> Gaussier (1999) begins with an inflectional lexicon and seeks to find derivational morphology. The words and parts of speech from his inflectional lexicon serve to build relational families of words and to identify sets of word pairs and suffixes therefrom. Gaussier splits words based on p-similarity: words that agree in exactly the first p characters. He also builds a probabilistic model which indicates that the probability of two words being morphological variants is based upon the probability of their respective changes in orthography and morphosyntactics.</Paragraph>
</Section>
<Section position="4" start_page="67" end_page="70" type="metho">
<SectionTitle> 3 Current approach </SectionTitle>
<Paragraph position="0"> Our algorithm also focuses on inflectional languages. However, with the exception of word segmentation, we provide it no human information, and we consider only the impact of semantics. Our approach (see Figure 1) can be decomposed into four components: (1) initially selecting candidate affixes, (2) identifying affixes which are potential morphological variants of each other, (3) computing semantic vectors for words possessing these candidate affixes, and (4) selecting as valid morphological variants those words with similar semantic vectors.</Paragraph>
<Paragraph position="1"> Figure 1: The four stages of the approach. Stage 1: identify potential affixes. Stage 2: identify pairs of affixes that are possible morphological variants. Stage 3: compute semantic vectors for the words. Stage 4: select variants that have similar semantic vectors.</Paragraph>
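To make the four-stage decomposition concrete, the following Python sketch wires the stages together. It is not code from the paper: the helper names (hypothesize_affixes, build_rulesets, compute_semantic_vectors, select_valid_variants) are our own placeholders, and minimal versions of each are sketched after Sections 3.1/3.2, 3.3, and 3.4 below.

```python
# Minimal sketch of the four-stage pipeline described above.
# The four helpers are hypothetical; see the sketches following
# Sections 3.1-3.4 for one possible version of each.

def induce_morphology(vocabulary, corpus_tokens, K=200, k=300):
    # Stage 1: hypothesize the K most frequent candidate affixes
    # from the branching points of a trie built over the vocabulary.
    affixes = hypothesize_affixes(vocabulary, K)

    # Stage 2: pair affixes that share a trie ancestor into rules and
    # collect the word pairs (PPMVs) belonging to each rule.
    rulesets = build_rulesets(vocabulary, affixes)

    # Stage 3: compute a k-dimensional latent semantic vector for
    # every word that participates in some PPMV.
    words = {w for pairs in rulesets.values() for pair in pairs for w in pair}
    vectors = compute_semantic_vectors(corpus_tokens, words, k=k)

    # Stage 4: keep only the PPMVs whose members have sufficiently
    # similar semantic vectors (high normalized cosine scores).
    return select_valid_variants(rulesets, vectors)
```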
<Section position="1" start_page="67" end_page="68" type="sub_section">
<SectionTitle> 3.1 Hypothesizing affixes </SectionTitle>
<Paragraph position="0"> To select candidate affixes, we, like Gaussier, identify p-similar words. We insert words into a trie (Figure 2) and extract potential affixes by observing those places in the trie where branching occurs. Figure 2's hypothesized suffixes are NULL, "s," "ed," "es," "ing," "e," and "eful." We retain only the K most frequent candidate affixes for subsequent processing. The value of K needs to be large enough to account for the number of expected regular affixes in any given language, as well as some of the more frequent irregular affixes. We arbitrarily chose K to be 200 in our system. (It should also be mentioned that we can identify potential prefixes by inserting words into the trie in reversed order. This prefix mode can additionally serve to identify capitalization.)</Paragraph>
<Paragraph position="2"/>
</Section>
<Section position="2" start_page="68" end_page="68" type="sub_section">
<SectionTitle> 3.2 Morphological variants </SectionTitle>
<Paragraph position="0"> We next identify pairs of candidate affixes that descend from a common ancestor node in the trie. For example, ("s", NULL) constitutes such a pair from Figure 2. We call these pairs rules.</Paragraph>
<Paragraph position="1"> Two words sharing the same root and the same affix rule, such as "cars" and "car," form what we call a pair of potential morphological variants (PPMVs). We define the ruleset of a given rule to be the set of all PPMVs that have that rule in common. For instance, from Figure 2, the ruleset for ("s", NULL) would be the pairs "cars/car" and "cares/care." Our algorithm establishes a list identifying the ruleset of every hypothesized rule extracted from the data; it must then determine which rulesets or PPMVs describe true morphological relationships.</Paragraph>
</Section>
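As an illustration of Sections 3.1 and 3.2, the sketch below proposes candidate suffixes from the branching points of a character trie (represented here implicitly through prefix continuations), keeps the K most frequent ones, and then groups word pairs into rulesets of PPMVs. It is a minimal interpretation of the description above, assuming suffixes only (the reversed-word pass for prefixes is omitted); all names are ours, not the paper's.

```python
from collections import Counter, defaultdict
from itertools import combinations

NULL = ""  # the empty affix, written NULL in the paper

def hypothesize_affixes(vocabulary, K=200):
    """Propose the K most frequent candidate suffixes from trie branching points."""
    # Record which continuations (next characters, or a word end) follow each
    # prefix; a prefix with more than one continuation is a branching node.
    continuations = defaultdict(set)
    for w in vocabulary:
        for i in range(1, len(w) + 1):
            continuations[w[:i]].add(w[i] if i < len(w) else NULL)
    affix_counts = Counter()
    for w in vocabulary:
        for i in range(1, len(w)):
            if len(continuations[w[:i]]) > 1:      # branching node in the trie
                affix_counts[w[i:]] += 1
        if len(continuations[w]) > 1:              # the word ends at a branching node
            affix_counts[NULL] += 1
    return [a for a, _ in affix_counts.most_common(K)]

def build_rulesets(vocabulary, affixes):
    """Group pairs of potential morphological variants (PPMVs) by rule.

    A rule is a pair of candidate affixes attaching to a common root; its
    ruleset is the set of word pairs formed that way, e.g. the ruleset of
    ("s", NULL) contains ("cars", "car") and ("cares", "care").
    """
    vocab, affix_set = set(vocabulary), set(affixes)
    roots = defaultdict(set)                        # root -> affixes observed with it
    for w in vocab:
        if NULL in affix_set:
            roots[w].add(NULL)                      # the unsuffixed word itself
        for a in affix_set - {NULL}:
            if len(w) > len(a) and w.endswith(a):
                roots[w[: -len(a)]].add(a)          # strip the affix to expose a root
    rulesets = defaultdict(set)
    for root, found in roots.items():
        for a1, a2 in combinations(sorted(found, key=len, reverse=True), 2):
            rulesets[(a1, a2)].add((root + a1, root + a2))
    return rulesets
```

For a toy vocabulary such as {"car", "cars", "care", "cared", "cares", "caring", "careful"}, this yields rules like ("s", NULL) and ("es", NULL) with the PPMVs described in the text.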
<Section position="3" start_page="68" end_page="69" type="sub_section">
<SectionTitle> 3.3 Computing Semantic Vectors </SectionTitle>
<Paragraph position="0"> Deerwester et al. (1990) showed that it is possible to find significant semantic relationships between words and documents in a corpus with virtually no human intervention (with the possible exception of a human-built stop word list). This is typically done by applying singular value decomposition (SVD) to a matrix M in which each entry M(i,j) contains the frequency of word i in document j of the corpus.</Paragraph>
<Paragraph position="1"> This methodology is referred to as Latent Semantic Analysis (LSA) and is well described in the literature (Landauer et al., 1998; Manning and Schütze, 1999).</Paragraph>
<Paragraph position="2"> An SVD decomposes a matrix A into the product of three matrices U, D, and V^T, where U and V are orthogonal matrices and D is a diagonal matrix containing the singular values of A (the square roots of the eigenvalues of A^T A). Since SVDs can be computed so that the singular values appear in descending order of size (Berry et al., 1993), LSA truncates the decomposition after the k largest singular values. This corresponds to projecting the vector representation of each word into a k-dimensional subspace whose axes form k (latent) semantic directions. These projections are precisely the rows of the matrix product U_k D_k.</Paragraph>
<Paragraph position="3"> A typical k is 300, which is the value we used.</Paragraph>
<Paragraph position="4"> However, we have altered the algorithm somewhat to fit our needs. First, to stay as close to the knowledge-free scenario as possible, we neither apply a stopword list nor remove capitalization. Second, since SVDs are better suited to normally distributed data (Manning and Schütze, 1999, p. 565), we operate on Z-scores rather than raw counts. Lastly, instead of generating a term-document matrix, we build a term-term matrix.</Paragraph>
<Paragraph position="5"> Schütze (1993) achieved excellent performance at classifying words into quasi-part-of-speech classes by building and performing an SVD on an N x 4N term-term matrix M(i, Np+j). The indices i and j represent the top N highest-frequency words. The p values range from 0 to 3 and indicate whether the word indexed by j is positionally offset from the word indexed by i by -2, -1, +1, or +2, respectively.</Paragraph>
<Paragraph position="6"> For example, if "the" and "people" were respectively the 1st and 100th highest-frequency words, then upon seeing the phrase "the people," Schütze's approach would increment the counts of M(1, 2N+100) and M(100, N+1).</Paragraph>
<Paragraph position="7"> We used Schütze's general framework but tailored it to capture local semantic information.</Paragraph>
<Paragraph position="8"> We built an N x 2N matrix, and our p values correspond to those words whose offsets from word i lie in the intervals [-50,-1] and [1,50], respectively. We also reserve the Nth position as a catch-all to account for all words that are not among the top (N-1). An important issue to resolve is how large N should be. We would like to incorporate semantics for an arbitrarily large number of words, but LSA quickly becomes impractical on large sets. Fortunately, it is possible to build a matrix with a smaller value of N (say, 2500), perform an SVD on it, and then fold in the remaining terms (Manning and Schütze, 1999, p. 563). Since the U and V matrices of an SVD satisfy U^T U = V^T V = I, it follows that AV = UD.</Paragraph>
<Paragraph position="9"> This means that for a new word w, one can build a row vector a_w^T which identifies how w relates to the top N words according to the p different conditions described above. For example, if w were one of the top N words, then a_w^T would simply be w's row of the A matrix. The product n_w = a_w^T V_k is the projection of a_w^T into the k-dimensional latent semantic space. By storing an index to the words of the corpus as well as a sorted list of these words, one can efficiently build a set of semantic vectors that includes each word of interest.</Paragraph>
</Section>
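The construction just described can be sketched as follows, a minimal interpretation using NumPy: it builds the N x 2N co-occurrence matrix over the top-N words with a window of 50 tokens on each side, converts it to Z-scores, truncates the SVD at k dimensions, and folds additional words in via n_w = a_w^T V_k. The function and variable names are ours, and details such as applying the top-N column statistics to the fold-in vectors are simplifying assumptions rather than the paper's exact procedure.

```python
import numpy as np
from collections import Counter

def compute_semantic_vectors(corpus_tokens, words_of_interest,
                             k=300, N=2500, window=50):
    """Return a dict mapping each word of interest to a k-dimensional
    latent semantic vector: an N x 2N term-term matrix over the top-N
    words (left and right context within +/-window tokens, with the last
    column of each half as the catch-all), Z-scored, reduced by a
    truncated SVD, and extended to other words by folding in a_w^T V_k."""
    freq = Counter(corpus_tokens)
    top = [w for w, _ in freq.most_common(N)]
    rank = {w: i for i, w in enumerate(top)}
    catch_all = N - 1

    def col(word, right):
        idx = rank.get(word, catch_all)
        idx = idx if idx < catch_all else catch_all   # top N-1 words keep their own column
        return idx + (N if right else 0)

    # Raw co-occurrence counts: rows for the top-N words, plus separate
    # fold-in rows for the remaining words of interest.
    A = np.zeros((len(top), 2 * N))
    extra = {w: np.zeros(2 * N) for w in words_of_interest if w not in rank}
    for t, w in enumerate(corpus_tokens):
        row = A[rank[w]] if w in rank else extra.get(w)
        if row is None:
            continue
        lo, hi = max(0, t - window), min(len(corpus_tokens), t + window + 1)
        for c in range(lo, hi):
            if c != t:
                row[col(corpus_tokens[c], right=(c > t))] += 1

    # Z-score each column of A, then apply the same transformation to the
    # fold-in vectors (our simplification of the paper's Z-score step).
    mu, sigma = A.mean(axis=0), A.std(axis=0) + 1e-9
    A = (A - mu) / sigma
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k].T                                     # 2N x k
    vectors = {w: A[rank[w]] @ Vk for w in words_of_interest if w in rank}
    vectors.update({w: ((v - mu) / sigma) @ Vk for w, v in extra.items()})
    return vectors
```

Capping N (here at 2500) is what keeps the dense SVD tractable; this is exactly why the remaining words are folded in through V_k rather than added as extra rows of A.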
<Section position="4" start_page="69" end_page="70" type="sub_section">
<SectionTitle> 3.4 Statistical Computations </SectionTitle>
<Paragraph position="0"> Morphologically related words frequently share similar semantics, so we want to see how well the semantic vectors of PPMVs correlate. If we know how PPMVs correlate in comparison with other word pairs from the same rulesets, we can determine the semantic-based probability that the variants are legitimate. In this section, we identify a measure for correlating PPMVs and illustrate how ruleset-based statistics help identify legitimate PPMVs.</Paragraph>
<Paragraph position="1"> The cosine of the angle between two vectors v1 and v2 is given by cos(v1, v2) = (v1 · v2) / (||v1|| ||v2||). We want to determine the correlation between the words of every PPMV. We use what we call a normalized cosine score (NCS) as this correlation. To obtain an NCS, we first calculate the cosine between each semantic vector n_w and the semantic vectors of 200 randomly chosen words. By this means we obtain w's correlation mean (μ_w) and standard deviation (σ_w). If v is one of w's variants, we define the NCS between n_w and n_v to be NCS(w, v) = min over y in {w, v} of (cos(n_w, n_v) - μ_y) / σ_y. Table 1 provides normalized cosine scores for several PPMVs from Figure 2 and from among words originally listed as errors in other systems. (NCSs are effectively Z-scores.) By considering the NCSs of all word pairs coupled under a particular rule, we can determine semantic-based probabilities that indicate which PPMVs are legitimate. We expect random NCSs to be normally distributed according to N(0, 1). Given that a particular ruleset contains n_R PPMVs, we can therefore approximate the number (n_T), mean (μ_T), and standard deviation (σ_T) of the true correlations. If we define Φ_z(μ, σ) to be the probability mass above a score z under a normal distribution with mean μ and standard deviation σ, i.e., (1/(σ√(2π))) ∫_z^∞ exp(-(x - μ)²/(2σ²)) dx, then we can compute the probability that a particular correlation is legitimate: Pr(true) = n_T Φ_z(μ_T, σ_T) / ((n_R - n_T) Φ_z(0, 1) + n_T Φ_z(μ_T, σ_T)). It is possible for a rule to be hypothesized at the trie stage that is true only under certain conditions. A prime example of such a rule is ("es", NULL). Observe from Table 1 that the word "cares" correlates poorly with "car." Yet "-es" is a valid suffix for the words "flashes," "catches," "kisses," and many other words in which the "-es" is preceded by a voiceless sibilant.</Paragraph>
<Paragraph position="2"> Hence, there is merit in considering subrules that arise while analyzing a particular rule. For instance, while evaluating the ("es", NULL) rule, it is desirable to also consider potential subrules such as ("ches", "ch") and ("tes", "t"). One might expect the average NCS for the ("ches", "ch") subrule to be higher than that of the overall rule ("es", NULL), whereas the opposite will likely be true for ("tes", "t"). Table 2 confirms this.</Paragraph>
</Section>
</Section>
</Paper>
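As a companion to Section 3.4, the sketch below computes normalized cosine scores against a random-word baseline and turns per-rule NCS statistics into a Pr(true) estimate, completing the select_valid_variants stage used in the pipeline sketch after Section 3. The way n_T, μ_T, and σ_T are estimated here (from the NCSs above a small cutoff) and the final threshold are our assumptions; the excerpt above does not spell out those steps.

```python
import math
import random
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def normalized_cosine_score(w, v, vectors, n_random=200, rng=random):
    """NCS of a PPMV (w, v): their cosine, Z-scored against each word's
    correlations with 200 randomly chosen words, taking the minimum."""
    others = rng.sample(list(vectors), k=min(n_random, len(vectors)))
    def stats(word):
        sims = [cosine(vectors[word], vectors[o]) for o in others if o != word]
        return float(np.mean(sims)), float(np.std(sims)) + 1e-12
    c = cosine(vectors[w], vectors[v])
    return min((c - mu) / sigma for mu, sigma in (stats(w), stats(v)))

def upper_tail(z, mu, sigma):
    """Phi_z(mu, sigma): normal probability mass above z."""
    return 0.5 * math.erfc((z - mu) / (sigma * math.sqrt(2)))

def rule_probabilities(ncs_list):
    """Pr(true) for each NCS in a ruleset, mixing N(0,1) noise with an
    estimated true-correlation component. Estimating (n_T, mu_T, sigma_T)
    from the NCSs above a fixed cutoff is our assumption, not the paper's."""
    scores = np.asarray(ncs_list, dtype=float)
    true_like = scores[scores > 0.5]                 # assumed cutoff
    n_R, n_T = len(scores), len(true_like)
    if n_T == 0:
        return [0.0] * n_R
    mu_T, sigma_T = float(true_like.mean()), float(true_like.std()) + 1e-12
    probs = []
    for z in scores:
        t = n_T * upper_tail(z, mu_T, sigma_T)
        f = (n_R - n_T) * upper_tail(z, 0.0, 1.0)
        probs.append(t / (f + t + 1e-12))
    return probs

def select_valid_variants(rulesets, vectors, threshold=0.5):
    """Stage 4: keep PPMVs whose estimated Pr(true) exceeds a threshold."""
    valid = set()
    for rule, pairs in rulesets.items():
        pairs = [p for p in pairs if p[0] in vectors and p[1] in vectors]
        if not pairs:
            continue
        ncs = [normalized_cosine_score(w, v, vectors) for w, v in pairs]
        for pair, p in zip(pairs, rule_probabilities(ncs)):
            if p >= threshold:
                valid.add(pair)
    return valid
```

Evaluating subrules such as ("ches", "ch") would amount to calling rule_probabilities on the NCSs of the corresponding subset of a ruleset's pairs.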