<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1040"> <Title>Expressing Implicit Semantic Relations without Supervision</Title> <Section position="6" start_page="314" end_page="316" type="metho"> <SectionTitle> 4 The Algorithm </SectionTitle> <Paragraph position="0"> The algorithm takes as input a set of word pairs W = {X_1:Y_1, ..., X_n:Y_n} and produces as output ranked lists of patterns P_1, ..., P_m for each input pair. The following steps are similar to the algorithm of Turney (2005), with several changes to support the calculation of pertinence.</Paragraph>
<Paragraph position="1"> 1. Find phrases: For each pair X_i:Y_i, make a list of phrases in the corpus that contain the pair.</Paragraph>
<Paragraph position="2"> We use the Waterloo MultiText System (Clarke et al., 1998) to search in a corpus of about 5 x 10^10 English words (Terra and Clarke, 2003).</Paragraph>
<Paragraph position="3"> Make one list of phrases that begin with X_i and end with Y_i and a second list for the opposite order. Each phrase must have one to three intervening words between X_i and Y_i. The first and last words in the phrase do not need to exactly match X_i and Y_i; the MultiText query language allows different suffixes. Veale (2004) has observed that it is easier to identify semantic relations between nouns than between other parts of speech, so we use WordNet 2.0 (Miller, 1995) to guess whether X_i and Y_i are likely to be nouns. When they are nouns, we are relatively strict about suffixes; we only allow variation in pluralization. For all other parts of speech, we are liberal about suffixes. For example, we allow an adjective such as "inflated" to match a noun such as "inflation". With MultiText, the query "inflat*" matches both "inflated" and "inflation". 2. Generate patterns: For each list of phrases, generate a list of patterns, based on the phrases.</Paragraph>
<Paragraph position="4"> Replace the first word in each phrase with the generic marker "X" and replace the last word with "Y". The intervening words in each phrase may be either left as they are or replaced with the wildcard "*" (see the sketch after Step 4 below). For example, the phrase "carpenter nails the wood" yields the patterns "X nails the Y", "X nails * Y", "X * the Y", and "X * * Y". Do not allow duplicate patterns in a list, but note the number of times a pattern is generated for each word pair X_i:Y_i in each order (X_i first and Y_i last, or vice versa). We call this the pattern frequency. It is a local frequency count, analogous to term frequency in information retrieval.</Paragraph>
<Paragraph position="5"> 3. Count pair frequency: The pair frequency for a pattern is the number of lists from the preceding step that contain the given pattern. It is a global frequency count, analogous to document frequency in information retrieval. Note that a pair X_i:Y_i yields two lists of phrases and hence two lists of patterns. A given pattern might appear in zero, one, or two of the lists for X_i:Y_i. 4. Map pairs to rows: In preparation for building a matrix X, create a mapping of word pairs to row numbers. For each pair X_i:Y_i, create a row for X_i:Y_i and another row for Y_i:X_i. If W does not already contain {Y_1:X_1, ..., Y_n:X_n}, then we have effectively doubled the number of word pairs, which increases the sample size for calculating pertinence.</Paragraph>
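To make the pattern generation in Step 2 concrete, here is a minimal sketch. It is our illustration, not the authors' implementation; the function name and the plain-string phrase representation are assumptions.

    from itertools import product

    def generate_patterns(phrase):
        """Replace the first word with "X", the last word with "Y", and each
        of the one to three intervening words with either itself or "*"."""
        words = phrase.split()
        inner = words[1:-1]
        patterns = set()
        for mask in product([False, True], repeat=len(inner)):
            middle = ["*" if drop else w for w, drop in zip(inner, mask)]
            patterns.add(" ".join(["X"] + middle + ["Y"]))
        return patterns

    print(sorted(generate_patterns("carpenter nails the wood")))
    # ['X * * Y', 'X * the Y', 'X nails * Y', 'X nails the Y']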
<Paragraph position="6"> 5. Map patterns to columns: Create a mapping of patterns to column numbers. For each unique pattern of the form "X ... Y" from Step 2, create a column for the original pattern "X ... Y" and another column for the same pattern with X and Y swapped, "Y ... X". Step 2 can generate millions of distinct patterns. The experiment in Section 5 results in 1,706,845 distinct patterns, yielding 3,413,690 columns. This is too many columns for matrix operations with today's standard desktop computer. Most of the patterns have a very low pair frequency. For the experiment in Section 5, 1,371,702 of the patterns have a pair frequency of one. To keep the matrix X manageable, we drop all patterns with a pair frequency less than ten. For Section 5, this leaves 42,032 patterns, yielding 84,064 columns. Turney (2005) limited the matrix to 8,000 columns, but a larger pool of patterns is better for our purposes, since it increases the likelihood of finding good patterns for expressing the semantic relations of a given word pair.</Paragraph>
<Paragraph position="7"> 6. Build a sparse matrix: Build a matrix X in sparse matrix format. The value for the cell in row i and column j is the pattern frequency of the j-th pattern for the i-th word pair.</Paragraph>
<Paragraph position="8"> 7. Calculate entropy: Apply log and entropy transformations to the sparse matrix X (Landauer and Dumais, 1997). Each cell is replaced with its logarithm, multiplied by a weight based on the negative entropy of the corresponding column vector in the matrix. This gives more weight to patterns that vary substantially in frequency for each pair.</Paragraph>
<Paragraph position="9"> 8. Apply SVD: After the log and entropy transforms, apply the Singular Value Decomposition (SVD) to X (Golub and Van Loan, 1996). SVD decomposes X into a product of three matrices, U S V^T, where U and V are in column orthonormal form (i.e., the columns are orthogonal and have unit length) and S is a diagonal matrix of singular values (hence SVD). If X is of rank r, then S is also of rank r. Let S_k, where k < r, be the diagonal matrix formed from the top k singular values, and let U_k and V_k be the matrices produced by selecting the corresponding columns from U and V. The matrix U_k S_k V_k^T is the matrix of rank k that best approximates the original matrix X, in the sense that it minimizes the approximation errors (Golub and Van Loan, 1996). Following Landauer and Dumais (1997), we use k = 300. We may think of this matrix U_k S_k V_k^T as a smoothed version of the original matrix. SVD is used to reduce noise and compensate for sparseness (Landauer and Dumais, 1997).</Paragraph>
<Paragraph position="10"> 9. Calculate cosines: The relational similarity between two pairs, sim_r(X_1:Y_1, X_2:Y_2), is given by the cosine of the angle between their corresponding row vectors in the matrix U_k S_k. We need to calculate the cosines between all possible pairs of pairs. All of the cosines can be efficiently derived from the matrix (U_k S_k)(U_k S_k)^T (Landauer and Dumais, 1997). A small sketch of Steps 7 to 9 appears after Step 10 below.</Paragraph>
<Paragraph position="13"> 10. Calculate conditional probabilities: Using Bayes' Theorem (see Section 2) and the raw frequency data in the matrix X from Step 6, before the log and entropy transforms, calculate the conditional probability p(X_i:Y_i | P_j) for every row (word pair) and every column (pattern).</Paragraph>
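The linear algebra in Steps 7 to 9 can be made concrete with a small dense-matrix sketch. This is our illustration, assuming NumPy; the real system works with a large sparse matrix and would use a sparse truncated SVD rather than a full decomposition.

    import numpy as np

    def log_entropy(X):
        """Log-entropy weighting (Landauer and Dumais, 1997): each cell
        becomes log(f + 1), scaled by a column weight derived from the
        column's negative entropy (1 for a concentrated column, 0 for a
        uniform one)."""
        n_rows = X.shape[0]
        col_sums = X.sum(axis=0)
        col_sums[col_sums == 0] = 1.0
        P = X / col_sums
        with np.errstate(divide="ignore", invalid="ignore"):
            neg_entropy = np.where(P > 0, P * np.log(P), 0.0).sum(axis=0)
        weights = 1.0 + neg_entropy / np.log(n_rows)
        return np.log(X + 1.0) * weights

    def row_cosines(X, k=300):
        """Steps 8 and 9: project rows into the top-k singular subspace and
        return all pairwise cosines, derived from the rows of U_k S_k.
        The paper uses k = 300; here k must not exceed the smaller
        dimension of X."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        Z = U[:, :k] * s[:k]                  # rows of U_k S_k
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        Z = Z / np.where(norms == 0.0, 1.0, norms)
        return Z @ Z.T                        # cosine for every pair of rows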
<Paragraph position="14"> 11. Calculate pertinence: With the cosines from Step 9 and the conditional probabilities from Step 10, calculate pertinence(X_i:Y_i, P_j) for every row X_i:Y_i and every column P_j for which p(X_i:Y_i | P_j) > 0. When p(X_i:Y_i | P_j) = 0, it is possible that pertinence(X_i:Y_i, P_j) > 0, but we avoid calculating pertinence in these cases for two reasons. First, it speeds computation, because X is sparse, so p(X_i:Y_i | P_j) = 0 for most rows and columns. Second, p(X_i:Y_i | P_j) = 0 implies that the pattern P_j does not actually appear with the word pair X_i:Y_i in the corpus; we are only guessing that the pattern is appropriate for the word pair, and we could be wrong. Therefore we prefer to limit ourselves to patterns and word pairs that have actually been observed in the corpus. For each pair X_i:Y_i in W, output two separate ranked lists, one for patterns of the form "X ... Y" and another for patterns of the form "Y ... X", where the patterns in both lists are sorted in order of decreasing pertinence to X_i:Y_i. Ranking serves as a kind of normalization. We have found that the relative rank of a pattern is a more reliable indicator of its importance than the absolute pertinence. This is analogous to information retrieval, where documents are ranked in order of their relevance to a query. The relative rank of a document is more important than its actual numerical score (which is usually hidden from the user of a search engine). Having two separate ranked lists helps to avoid bias. For example, ostrich:bird generates 516 patterns of the form "X ... Y" and 452 patterns of the form "Y ... X". Since there are more patterns of the form "X ... Y", there is a slight bias towards these patterns. If the two lists were merged, the "Y ... X" patterns would be at a disadvantage.</Paragraph> </Section> <Section position="7" start_page="316" end_page="317" type="metho"> <SectionTitle> 5 Experiments with Word Analogies </SectionTitle> <Paragraph position="0"> In these experiments, we evaluate pertinence using 374 college-level multiple-choice word analogies, taken from the SAT test. For each question, there is a target word pair, called the stem pair, and five choice pairs. The task is to find the choice that is most analogous (i.e., has the highest relational similarity) to the stem. This choice pair is called the solution and the other choices are distractors. Since there are six word pairs per question (the stem and the five choices), there are 374 x 6 = 2244 pairs in the input set W.</Paragraph>
<Paragraph position="1"> In Step 4 of the algorithm, we double the pairs, but we also drop some pairs because they do not co-occur in the corpus. This leaves us with 4194 rows in the matrix. As mentioned in Step 5, the matrix has 84,064 columns (patterns). The sparse matrix density is 0.91%.</Paragraph>
<Paragraph position="2"> To answer a SAT question, we generate ranked lists of patterns for each of the six word pairs. Each choice is evaluated by taking the intersection of its patterns with the stem's patterns. The shared patterns are scored by the average of their rank in the stem's lists and the choice's lists. Since the lists are sorted in order of decreasing pertinence, a low score means a high pertinence.</Paragraph>
<Paragraph position="3"> Our guess is the choice with the lowest scoring shared pattern (see the sketch below).</Paragraph>
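The scoring just described can be sketched as follows. This is our illustration, not the authors' code; representing a ranked list as a dict mapping each pattern to its rank (1 = most pertinent) is an assumption, and the two lists per pair ("X ... Y" and "Y ... X") are collapsed into one dict for brevity.

    def best_shared_score(stem_ranks, choice_ranks):
        """Return the lowest average rank over the patterns shared by the
        stem pair and a choice pair (infinity if nothing is shared)."""
        shared = stem_ranks.keys() & choice_ranks.keys()
        if not shared:
            return float("inf")
        return min((stem_ranks[p] + choice_ranks[p]) / 2.0 for p in shared)

    def answer(stem_ranks, choice_rank_lists):
        """Guess the choice whose best shared pattern has the lowest score."""
        scores = [best_shared_score(stem_ranks, c) for c in choice_rank_lists]
        return scores.index(min(scores))

    # Toy usage: the stem shares "Y such as the X" (rank 1 in both lists)
    # with choice 0, so choice 0 is guessed.
    stem = {"Y such as the X": 1, "X is the largest Y": 2}
    choices = [{"Y such as the X": 1}, {"X are the only Y": 13}]
    print(answer(stem, choices))  # 0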
<Paragraph position="4"> Table 1 shows three examples, two questions that are answered correctly followed by one that is answered incorrectly. The correct answers are marked with an asterisk. For the first question, the stem is ostrich:bird and the best choice is (a) lion:cat.</Paragraph>
<Paragraph position="5"> The highest ranking pattern that is shared by both of these pairs is "Y such as the X". The third question illustrates that, even when the answer is incorrect, the best shared pattern ("Y powered * * X") may be plausible.</Paragraph>
<Paragraph position="6">
Table 1. Three example SAT questions, with the best shared pattern and score for each choice (* marks the correct answer).

  Word pair                   Best shared pattern    Score
  1. ostrich:bird (stem)
  *  (a) lion:cat             "Y such as the X"        1.0
     (b) goose:flock          "X * * breeding Y"      43.5
     (c) ewe:sheep            "X are the only Y"      13.5
     (d) cub:bear             "Y are called X"        29.0
     (e) primate:monkey       "Y is the * X"          80.0
  2. traffic:street (stem)
     (a) ship:gangplank       "X * down the Y"        53.0
     (b) crop:harvest         "X * adjacent * Y"     248.0
     (c) car:garage           "X * a residential Y"   63.0
     (d) pedestrians:feet     "Y * accommodate X"     23.0
  *  (e) water:riverbed       "Y that carry X"        17.0
  3. locomotive:train (stem)
     (a) horse:saddle         "X carrying * Y"        82.0
  *  (b) tractor:plow         "X pulled * Y"           7.0
     (c) rudder:rowboat       "Y * X"                319.0
     (d) camel:desert         "Y with two X"          43.0
     (e) gasoline:automobile  "Y powered * * X"        5.0

Table 2 shows some of the top ranked patterns for the stem and solution of the first example. The pattern "X lion Y" is anomalous, but the other patterns seem reasonable. The shared pattern "Y such as the X" is ranked 1 for both pairs, hence the average score for this pattern is 1.0, as shown in Table 1. Note that "ostrich is the largest bird" and "lions are large cats", but the largest cat is the Siberian tiger.</Paragraph>
<Paragraph position="7">
Table 2. Top ranked patterns for ostrich:bird and lion:cat.

  Word pair     "X ... Y"              "Y ... X"
  ostrich:bird  "X is the largest Y"   "Y such as the X"
                "X is * largest Y"     "Y such * the X"
  lion:cat      "X lion Y"             "Y such as the X"
                "X are large Y"        "Y and mountain X"

Table 3 shows the word pairs in W that generate the pattern "Y such as the X". The pairs are sorted by p(X:Y | P). The pattern "Y such as the X" is one of 146 patterns that are shared by ostrich:bird and lion:cat. Most of these shared patterns are not very informative.</Paragraph>
<Paragraph position="8"> In Table 4, we compare ranking patterns by pertinence to ranking by various other measures, mostly based on varieties of tf-idf (term frequency times inverse document frequency, a common way to rank documents in information retrieval). The tf-idf measures are taken from Salton and Buckley (1988). For comparison, we also include three algorithms that do not rank patterns (the bottom three rows in the table).</Paragraph>
<Paragraph position="9"> These three algorithms can answer the SAT questions, but they do not provide any kind of explanation for their answers.</Paragraph>
<Paragraph position="10"> All of the pattern ranking algorithms are given exactly the same sets of patterns to rank. Any differences in performance are due to the ranking method alone. The algorithms may skip questions when the word pairs do not co-occur in the corpus. All of the ranking algorithms skip the same set of 15 of the 374 SAT questions. Precision is defined as the percentage of correct answers out of the questions that were answered (not skipped). Recall is the percentage of correct answers out of the maximum possible number correct (374). The F measure is the harmonic mean of precision and recall.</Paragraph>
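As a quick check of these definitions, here is a small worked computation; the count of correct answers below is hypothetical, chosen only to illustrate how skipping affects precision and recall differently.

    def precision_recall_f(correct, answered, total=374):
        """Precision over answered questions, recall over all questions,
        and their harmonic mean (the F measure)."""
        precision = correct / answered
        recall = correct / total
        f = 2 * precision * recall / (precision + recall)
        return precision, recall, f

    # Hypothetical: 200 correct out of the 359 answered (15 of 374 skipped).
    p, r, f = precision_recall_f(200, 374 - 15)
    print(f"P = {p:.1%}, R = {r:.1%}, F = {f:.1%}")
    # P = 55.7%, R = 53.5%, F = 54.6%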
<Paragraph position="11"> For the tf-idf methods in Table 4, f is the pattern frequency, n is the pair frequency, F is the maximum f for all patterns for the given word pair, and N is the total number of word pairs. By "TF = f, IDF = 1/n", for example (row 8), we mean that f plays a role that is analogous to term frequency and 1/n plays a role that is analogous to inverse document frequency. That is, in row 8, the patterns are ranked in decreasing order of pattern frequency divided by pair frequency (a small sketch of this style of ranking appears at the end of this section).</Paragraph>
<Paragraph position="12"> Table 4 also shows some ranking methods based on intermediate calculations in the algorithm in Section 4. For example, row 2 in Table 4 gives the results when patterns are ranked in order of decreasing values in the corresponding cells of the matrix X from Step 7.</Paragraph>
<Paragraph position="13"> Row 12 in Table 4 shows the results we would get using Latent Relational Analysis (Turney, 2005) to rank patterns. The results in row 12 support the claim made in Section 3, that LRA is not suitable for ranking patterns, although it works well for answering the SAT questions (as we see in row 16). The vectors in LRA yield a good measure of relational similarity, but the magnitude of the value of a specific element in a vector is not a good indicator of the quality of the corresponding pattern.</Paragraph>
<Paragraph position="14"> The best method for ranking patterns is pertinence (row 1 in Table 4). As a point of comparison, the performance of the average senior high school student on the SAT analogies is about 57% (Turney and Littman, 2005). The second best method is to use the values in the matrix X after the log and entropy transformations in Step 7 (row 2). The difference between these two methods is statistically significant with 95% confidence. Pertinence (row 1) performs slightly below Latent Relational Analysis (row 16; Turney, 2005), but the difference is not significant. Randomly guessing answers should yield an F of 20% (1 out of 5 choices), but ranking patterns randomly (row 13) results in an F of 26.4%. This is because the stem pair tends to share more patterns with the solution pair than with the distractors: the minimum of a large set of random numbers is likely to be lower than the minimum of a small set of random numbers.</Paragraph>
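To make the tf-idf analogy concrete, here is a small sketch of the row 8 ranking (TF = f, IDF = 1/n); the dict-based pattern statistics are hypothetical.

    def rank_patterns_row8(patterns):
        """Rank patterns by pattern frequency (f, the 'term frequency')
        divided by pair frequency (n, so 1/n acts as the 'IDF')."""
        return sorted(patterns, key=lambda p: p["f"] / p["n"], reverse=True)

    patterns = [
        {"pattern": "Y such as the X", "f": 12, "n": 84},
        {"pattern": "X * * Y", "f": 30, "n": 4000},
        {"pattern": "X is the largest Y", "f": 5, "n": 20},
    ]
    for p in rank_patterns_row8(patterns):
        print(p["pattern"])
    # X is the largest Y, Y such as the X, X * * Y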
</Section> <Section position="8" start_page="317" end_page="318" type="metho"> <SectionTitle> 6 Experiments with Noun-Modifiers </SectionTitle> <Paragraph position="0"> In these experiments, we evaluate pertinence on the task of classifying noun-modifier pairs. The problem is to classify a noun-modifier pair, such as "flu virus", according to the semantic relation between the head noun (virus) and the modifier (flu). For example, "flu virus" is classified as a causality relation (the flu is caused by a virus).</Paragraph>
<Paragraph position="1"> For these experiments, we use a set of 600 manually labeled noun-modifier pairs (Nastase and Szpakowicz, 2003). There are five general classes of labels with thirty subclasses. We present here the results with five classes; the results with thirty subclasses follow the same trends (that is, pertinence performs significantly better than the other ranking methods). The five classes are causality (storm cloud), temporality (daily exercise), spatial (desert storm), participant (student protest), and quality (expensive book).</Paragraph>
<Paragraph position="2"> The input set W consists of the 600 noun-modifier pairs. This set is doubled in Step 4, but we drop some pairs because they do not co-occur in the corpus, leaving us with 1184 rows in the matrix. There are 16,849 distinct patterns with a pair frequency of ten or more, resulting in 33,698 columns. The matrix density is 2.57%.</Paragraph>
<Paragraph position="3"> To classify a noun-modifier pair, we use a single nearest neighbour algorithm with leave-one-out cross-validation (see the sketch at the end of this section). We split the set 600 times: each pair gets a turn as the single testing example, while the other 599 pairs serve as training examples. The testing example is classified according to the label of its nearest neighbour in the training set. The distance between two noun-modifier pairs is measured by the average rank of their best shared pattern. Table 5 shows the resulting precision, recall, and F for each class, along with the class sizes, when ranking patterns by pertinence.</Paragraph>
<Paragraph position="4"> To gain some insight into the algorithm, we examined the best shared pattern for each of the 600 pairs and its single nearest neighbour. For each of the five classes, Table 6 lists the most frequent pattern among the best shared patterns for the given class. All of these patterns seem appropriate for their respective classes.</Paragraph>
<Paragraph position="5">
Table 6 (excerpt). The most frequent of the best shared patterns for a given class.

  Class        Most frequent pattern   Example pair
  causality    "Y * causes X"          "cold virus"
  participant  "Y of his X"            "dream analysis"
  quality      "Y made of X"           "copper coin"

Table 7 shows the performance of pertinence on the noun-modifier problem, compared to various other pattern ranking methods. The bottom two rows are included for comparison; they are not pattern ranking algorithms. The best method for ranking patterns is pertinence (row 1 in Table 7). The difference between pertinence and the second best ranking method (row 2) is statistically significant with 95% confidence.</Paragraph>
<Paragraph position="6"> Latent Relational Analysis (row 16) performs slightly better than pertinence (row 1), but the difference is not statistically significant.</Paragraph>
<Paragraph position="7"> Row 6 in Table 7 shows the results we would get using Latent Relational Analysis (Turney, 2005) to rank patterns. Again, the results support the claim in Section 3, that LRA is not suitable for ranking patterns. LRA can classify the noun-modifiers (as we see in row 16), but it cannot express the implicit semantic relations that make an unlabeled noun-modifier in the testing set similar to its nearest neighbour in the training set.</Paragraph>
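The classification procedure can be sketched as follows. This is our illustration, reusing the hypothetical dict-of-ranks representation and the best_shared_score function from the sketch in Section 5 as the distance between two pairs.

    def nn_leave_one_out(pair_ranks, labels):
        """Leave-one-out single nearest neighbour classification: each pair
        is labeled with the label of its closest other pair, where distance
        is the best (lowest) average rank of a shared pattern."""
        correct = 0
        for i, test in enumerate(pair_ranks):
            others = (j for j in range(len(pair_ranks)) if j != i)
            nearest = min(others,
                          key=lambda j: best_shared_score(test, pair_ranks[j]))
            correct += labels[nearest] == labels[i]
        return correct / len(pair_ranks)

    # Usage: accuracy = nn_leave_one_out(pair_ranks, pair_labels), where
    # pair_ranks is a list of 600 pattern-to-rank dicts and pair_labels the
    # corresponding class labels.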
</Section> <Section position="9" start_page="318" end_page="318" type="metho"> <SectionTitle> 7 Discussion </SectionTitle> <Paragraph position="0"> Computing pertinence took about 18 hours for the experiments in Section 5 and 9 hours for Section 6. In both cases, the majority of the time was spent in Step 1, using MultiText (Clarke et al., 1998) to search through the corpus of 5 x 10^10 words. MultiText was running on a Beowulf cluster with sixteen 2.4 GHz Intel Xeon CPUs.</Paragraph>
<Paragraph position="1"> The corpus and the search index require about one terabyte of disk space. This may seem computationally demanding by today's standards, but progress in hardware will soon allow an average desktop computer to handle corpora of this size.</Paragraph>
<Paragraph position="2"> Although the performance on the SAT analogy questions (54.6%) is near the level of the average senior high school student (57%), there is room for improvement. For applications such as building a thesaurus, lexicon, or ontology, this level of performance suggests that our algorithm could assist, but not replace, a human expert.</Paragraph>
<Paragraph position="3"> One possible improvement would be to add part-of-speech tagging or parsing. We have done some preliminary experiments with parsing and plan to explore tagging as well. A difficulty is that much of the text in our corpus does not consist of properly formed sentences, since the text comes from web pages. This poses problems for most part-of-speech taggers and parsers.</Paragraph> </Section> </Paper>