<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1041"> <Title>Automatic Identification of Non-compositional Phrases</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Input Data </SectionTitle> <Paragraph position="0"> The input to our algorithm is a collocation database and a thesaurus. We briefly describe the process of obtaining this input. More details about the construction of the collocation database and the thesaurus can be found in (Lin, 1998).</Paragraph> <Paragraph position="1"> We parsed a 125-million-word newspaper corpus with Minipar, a descendant of Principar (Lin, 1993; Lin, 1994), and extracted dependency relationships from the parsed corpus. A dependency relationship is a triple: (head type modifier), where head and modifier are words in the input sentence and type is the type of the dependency relation. For example, (1a) is an example dependency tree and the set of dependency triples extracted from (1a) is shown in (1b).</Paragraph> <Paragraph position="2"> (1) a. John married Peter's sister (dependency tree with subj, compl and gen links) b. (marry V:subj:N John), (marry V:compl:N sister), (sister N:gen:N Peter) There are about 80 million dependency relationships in the parsed corpus. The frequency counts of the dependency relationships are filtered with the log-likelihood ratio (Dunning, 1993). We call a dependency relationship a collocation if its log-likelihood ratio is greater than a threshold (0.5). 
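This filtering step can be sketched as follows; a minimal sketch of Dunning's log-likelihood ratio over a 2x2 contingency table of head-modifier co-occurrence counts within a dependency type (the function names and threshold handling are ours, not Minipar's):

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 table of observed
    counts: k11 = head with modifier, k12 = head without modifier,
    k21 = modifier without head, k22 = neither."""
    total = k11 + k12 + k21 + k22
    r1, r2 = k11 + k12, k21 + k22
    c1, c2 = k11 + k21, k12 + k22

    def term(obs, row, col):
        # each observed cell contributes obs * log(obs / expected)
        expected = row * col / total
        return obs * math.log(obs / expected) if obs > 0 else 0.0

    return 2.0 * (term(k11, r1, c1) + term(k12, r1, c2) +
                  term(k21, r2, c1) + term(k22, r2, c2))

def is_collocation(k11, k12, k21, k22, threshold=0.5):
    # keep a dependency triple only if its ratio exceeds the threshold
    return llr(k11, k12, k21, k22) > threshold
```

Under independence the ratio is near zero, so uninformative co-occurrences are filtered out.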
The number of unique collocations in the resulting database is about 11 million.</Paragraph> <Paragraph position="3"> Using the similarity measure proposed in (Lin, 1998), we constructed a corpus-based thesaurus consisting of 11839 nouns, 3639 verbs and 5658 adjectives/adverbs that occurred in the corpus at least 100 times.</Paragraph> </Section> <Section position="5" start_page="0" end_page="317" type="metho"> <SectionTitle> 3 Mutual Information of a Collocation </SectionTitle> <Paragraph position="0"> We define the probability space to consist of all possible collocation triples. We use |H R M| to denote the frequency count of all the collocations that match the pattern (H R M), where H and M are either words or the wild card (*) and R is either a dependency type or the wild card. For example, * |marry V:compl:N sister| is the frequency count of (marry V:compl:N sister).</Paragraph> <Paragraph position="1"> * |marry V:compl:N *| is the total frequency count of collocations in which the head is marry and the type is V:compl:N (the verb-object relation).</Paragraph> <Paragraph position="2"> * |* * *| is the total frequency count of all collocations extracted from the corpus.</Paragraph> <Paragraph position="3"> To compute the mutual information in a collocation, we treat a collocation (head type modifier) as the conjunction of three events: A: (* type *) B: (head * *) C: (* * modifier) The mutual information of a collocation is the logarithm of the ratio between the probability of the collocation and the probability that events A, B, and C co-occur, assuming that B and C are conditionally independent given A: (2) mi(head, type, modifier) = log [ P(head type modifier) / (P(A) P(B|A) P(C|A)) ] = log [ (|head type modifier| × |* type *|) / (|head type *| × |* type modifier|) ]</Paragraph> <Paragraph position="5"/> </Section> <Section position="6" start_page="317" end_page="318" type="metho"> <SectionTitle> 4 Mutual Information and Similar Collocations </SectionTitle> <Paragraph position="0"> In this section, we use several examples to demonstrate the basic idea behind our algorithm.</Paragraph> <Paragraph position="1"> 
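Before turning to the examples, the count-based mutual information of equation (2) can be sketched as follows; an illustrative toy, with wildcard counts summed from a small hand-made triple table (the table and names are ours, not the paper's database):

```python
import math

# toy collocation counts; the real database holds about 11 million entries
TRIPLES = {
    ("spill", "V:compl:N", "gut"): 13,
    ("spill", "V:compl:N", "bean"): 20,
    ("drop", "V:compl:N", "gut"): 2,
    ("drop", "V:compl:N", "bean"): 50,
}

def freq(h, r, m):
    """|h r m| with '*' as the wild card, as in Section 3."""
    return sum(c for (h2, r2, m2), c in TRIPLES.items()
               if h in ("*", h2) and r in ("*", r2) and m in ("*", m2))

def mutual_info(head, rel, mod):
    # log of P(head rel mod) / (P(A) P(B|A) P(C|A)), which reduces to
    # log( |h r m| * |* r *| / (|h r *| * |* r m|) )
    return math.log(freq(head, rel, mod) * freq("*", rel, "*") /
                    (freq(head, rel, "*") * freq("*", rel, mod)))
```

Here mutual_info("spill", "V:compl:N", "gut") is positive, because the pair co-occurs more often than the independence assumption predicts.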
Consider the expression &quot;spill gut&quot;. Using the automatically constructed thesaurus, we find the following top-10 most similar words to the verb &quot;spill&quot; and the noun &quot;gut&quot;: spill: leak 0.153, pour 0.127, spew 0.125, dump 0.118, pump 0.098, seep 0.096, burn 0.095, explode 0.094, burst 0.092, spray 0.091; gut: intestine 0.091, instinct 0.089, foresight 0.085, creativity 0.082, heart 0.079, imagination 0.076, stamina 0.074, soul 0.073, liking 0.073, charisma 0.071; The collocation &quot;spill gut&quot; occurred 13 times in the 125-million-word corpus. The mutual information of this collocation is 6.24. Searching the collocation database, we find that it contains no collocation of the form (sim_spill V:compl:N gut) or (spill V:compl:N sim_gut), where sim_spill is a verb similar to &quot;spill&quot; and sim_gut is a noun similar to &quot;gut&quot;. This means that phrases such as &quot;leak gut&quot;, &quot;pour gut&quot;, ... or &quot;spill intestine&quot;, &quot;spill instinct&quot; either did not appear in the corpus at all or did not occur frequently enough to pass the log-likelihood ratio test.</Paragraph> <Paragraph position="2"> The second example is &quot;red tape&quot;. 
The top-10 most similar words to &quot;red&quot; and &quot;tape&quot; in our thesaurus are: red: yellow 0.164, purple 0.149, pink 0.146, green 0.136, blue 0.125, white 0.122, color 0.118, orange 0.111, brown 0.101, shade 0.094; tape: videotape 0.196, cassette 0.177, videocassette 0.168, video 0.151, disk 0.129, recording 0.117, disc 0.113, footage 0.111, recorder 0.106, audio 0.106; The following table shows the frequency count and mutual information of &quot;red tape&quot; and of word combinations in which one of &quot;red&quot; or &quot;tape&quot; is substituted by a similar word (verb-object, freq. count, mutual info): red tape, 259, 5.87; yellow tape, 12, 3.75; orange tape, 2, 2.64; black tape, 9, 1.07. Even though many other similar combinations exist in the collocation database, their frequency counts and mutual information values are very different from those of &quot;red tape&quot;.</Paragraph> <Paragraph position="3"> Finally, consider a compositional phrase: &quot;economic impact&quot;. The top-10 most similar words are: economic: financial 0.305, political 0.243, social 0.219, fiscal 0.209, cultural 0.202, budgetary 0.2, technological 0.196, organizational 0.19, ecological 0.189, monetary 0.189; impact: effect 0.227, implication 0.163, consequence 0.156, significance 0.146, repercussion 0.141, fallout 0.141, potential 0.137, ramification 0.129, risk 0.126, influence 0.125; The frequency counts and mutual information values of &quot;economic impact&quot; and of phrases obtained by replacing one of &quot;economic&quot; and &quot;impact&quot; with a similar word are shown in Table 4. Not only are many such combinations found in the corpus, but many of them also have mutual information values very similar to that of &quot;economic impact&quot;. In fact, the difference in mutual information values appears to be more important to phrasal similarity than the similarity of the individual words. For example, the phrases &quot;economic fallout&quot; and &quot;economic repercussion&quot; are intuitively more similar to &quot;economic impact&quot; than &quot;economic implication&quot; or &quot;economic significance&quot;, even though &quot;implication&quot; and &quot;significance&quot; have higher similarity values to &quot;impact&quot; than &quot;fallout&quot; and &quot;repercussion&quot; do.</Paragraph> <Paragraph position="7"> These examples suggest that one possible way to separate compositional phrases from non-compositional ones is to check the existence and mutual information values of phrases obtained by substituting one of the words with a similar word. A phrase is probably non-compositional if such substitutions are not found in the collocation database or if their mutual information values are significantly different from that of the phrase.</Paragraph> </Section> <Section position="7" start_page="318" end_page="319" type="metho"> <SectionTitle> 5 Algorithm </SectionTitle> <Paragraph position="0"> In order to implement the idea of separating non-compositional phrases from compositional ones with mutual information, we must use a criterion to determine whether or not the mutual information values of two collocations are significantly different. Although one could simply use a predetermined threshold for this purpose, the threshold value would be totally arbitrary. Furthermore, such a threshold does not take into account the fact that with different frequency counts we have different levels of confidence in the mutual information values.</Paragraph> <Paragraph position="1"> We propose a more principled approach. The frequency count of a collocation is a random variable with a binomial distribution. When the frequency count is reasonably large (e.g., greater than 5), a binomial distribution can be accurately approximated by a normal distribution (Dunning, 1993). Since all the potential non-compositional expressions that we are considering have reasonably large frequency counts, we assume their distributions are normal. Let |head type modifier| = k and |* * *| = n. The maximum likelihood estimate of the true probability p of the collocation (head type modifier) is p^ = k/n. Even though we do not know what p is, since p^ is (assumed to be) normally distributed, there is an N% chance that p falls within the interval p^ ± z_N*sqrt(p^(1-p^)/n) ≈ (k ± z_N*sqrt(k))/n, where z_N is a constant related to the confidence level N and the last step in the derivation is due to the fact that k/n is very small. Table 3 shows the z_N values for a sample set of confidence levels: N%: 50% 80% 90% 95% 98% 99%; z_N: 0.67 1.28 1.64 1.96 2.33 2.58. We further assume that the estimates of P(A), P(B|A) and P(C|A) in (2) are accurate. The confidence interval for the true probability gives rise to a confidence interval for the true mutual information (the mutual information computed using the true probabilities instead of the estimates). The upper and lower bounds of this interval are obtained by substituting k with k+z_N*sqrt(k) and k-z_N*sqrt(k) in (2). 
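These bounds can be sketched as follows; a small sketch with hypothetical counts, taking z_N from Table 3 (the helper names are ours):

```python
import math

Z95 = 1.96  # z_N for a 95% confidence level (Table 3)

def mi_bounds(k, n_rel, k_head, k_mod, z=Z95):
    """Lower and upper bounds of the mutual information of a collocation,
    obtained by substituting k with k - z*sqrt(k) and k + z*sqrt(k);
    the other counts (|* type *|, |head type *|, |* type modifier|)
    are assumed to be accurate."""
    def mi(x):
        return math.log(x * n_rel / (k_head * k_mod))
    delta = z * math.sqrt(k)
    return mi(k - delta), mi(k + delta)

def significantly_different(bounds_a, bounds_b):
    # two collocations differ significantly if their intervals are disjoint
    lo_a, hi_a = bounds_a
    lo_b, hi_b = bounds_b
    return max(lo_a, lo_b) > min(hi_a, hi_b)
```

With the paper's intervals for &quot;make difference&quot; (2.876, 2.978) and &quot;make change&quot; (2.146, 2.239), significantly_different returns True.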
Since our confidence that p falls within (k ± z_N*sqrt(k))/n is N%, we have N% confidence that the true mutual information lies within the upper and lower bounds.</Paragraph> <Paragraph position="2"> We use the following condition to determine whether or not a collocation is compositional: (3) A collocation α is non-compositional if there does not exist another collocation β such that (a) β is obtained by substituting the head or the modifier in α with a similar word and (b) there is an overlap between the 95% confidence intervals of the mutual information values of α and β.</Paragraph> <Paragraph position="3"> For example, the following table shows the frequency count, the mutual information (computed with the maximum likelihood estimate) and the lower and upper bounds of the 95% confidence interval of the true mutual information (verb-object, freq. count, mutual info, lower bound, upper bound): make difference, 1489, 2.928, 2.876, 2.978; make change, 1779, 2.194, 2.146, 2.239. Since the intervals are disjoint, the two collocations are considered to have significantly different mutual information values.</Paragraph> </Section> </Paper>