<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1411"> <Title>Acquisition of Lexical Paraphrases from Texts</Title> <Section position="4" start_page="2" end_page="4" type="metho"> <SectionTitle> JUMAN </SectionTitle> <Paragraph position="0"> and then parsed by the KNP parser.</Paragraph> <Paragraph position="1"> We then obtained a relation triplet (c1, r, c2), where c1 and c2 are content words and r is the relation between them,</Paragraph> </Section> <Section position="5" start_page="4" end_page="5" type="metho"> <SectionTitle> , Noun) </SectionTitle> <Paragraph position="0"> e.g., '(bombing)' - '(U.S. army)'. In this list, r can be a particle, such as a case particle or the associative particle &quot;no&quot;.</Paragraph> <Paragraph position="2"> Another relation r in the list is a syntactic relation expressed without any particle or other functional marker. For instance, in Japanese a verb or an adjective directly modifies a noun without using any functional words. In this case, we introduce the notion of a constituent boundary proposed by Furuse and Iida (1994), which is a virtual functional marker inserted between two consecutive content words in order to more easily analyze a sentence. For instance, if there are two consecutive nouns, we assume that &lt;nn&gt; is inserted between the two nouns, and consequently the relation r is &lt;nn&gt;.</Paragraph> <Paragraph position="4"/> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 3.2 Bigraph construction </SectionTitle> <Paragraph position="0"> We then transform the collection of triplets into a bigraph (bipartite graph). In the first step, each triplet in the collection is converted into two couplets, each consisting of a content word and an operator, by the following definition: an operator consists of a content word c and a relation with directionality r. It is defined as either r→c (something depends on c by r) or r←c (c depends on something by r). 
For instance, suppose that a triplet is (c1, r, c2); then the two couplets (c1, r→c2) and (c2, r←c1) are extracted in this operation.</Paragraph> <Paragraph position="1"> We perform this conversion for all of the triplets, and a list of couplets is obtained. From the viewpoint of graph theory, this couplet list is a bigraph, such as figure 1, which consists of two sets (a content word set and an operator set) and a list of edges, where each edge connects an element in one set to an element in the other. This bigraph is a weighted graph, and each weight expresses the frequency with which the couplet appears in the corpus.</Paragraph> </Section> <Section position="2" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 3.3 Paraphrasability computation </SectionTitle> <Paragraph position="0"> In the next step, we compute paraphrasability.</Paragraph> <Paragraph position="1"> In this work, the paraphrasability P(c1→c2) for any two content words c1 and c2 is computed from the operators linked to each word</Paragraph> <Paragraph position="3"> in the bigraph.</Paragraph> <Paragraph position="4"> This formulation can be explained as follows. Paraphrasability between two content words c1 and c2</Paragraph> <Paragraph position="6"> increases if these words behave similarly in terms of their dependency relations. That is, this metric compares the similarity of the contextual situations of the two input words. The definition states that paraphrasability counts the operators that c1 shares with c2.</Paragraph> <Paragraph position="8"> However, we believe that the importance of each operator is not equivalent to that of the others. For example, in figure 1, one operator is linked to by almost all of the words. In this situation, it is not reasonable to handle all operators equally, since one operator may confirm that the two words are similar or paraphrasable, whereas another may be a general operator widely used in various situations. In other words, when we compute paraphrasability, an operator shared by fewer words should contribute more.</Paragraph> <Paragraph position="9"> Consequently, each operator is weighted by the definition of formula (3). 
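The triplet-to-couplet conversion of subsection 3.2 and a directional score in the spirit of subsection 3.3 can be sketched as follows. Since the concrete formulas (1)-(3) are not given in this text, the score below is an assumed stand-in (a weighted operator overlap normalized by the left word's operator weights), and all names are illustrative.

```python
from collections import Counter

def build_bigraph(triplets):
    """Convert (c1, r, c2) triplets into a weighted content-word/operator
    bigraph. Each triplet yields two couplets:
      (c1, ('->', r, c2))   # c1 depends on c2 by r
      (c2, ('<-', r, c1))   # the same dependency, seen from c2's side
    """
    edges = Counter()
    for c1, r, c2 in triplets:
        edges[(c1, ('->', r, c2))] += 1
        edges[(c2, ('<-', r, c1))] += 1
    return edges

def operators(edges, word):
    """All operators linked to `word`, with their corpus frequencies."""
    return {op: w for (c, op), w in edges.items() if c == word}

def paraphrasability(edges, c1, c2):
    """Directional score in [0, 1]: P(c1 -> c2) != P(c2 -> c1) in general."""
    ops1, ops2 = operators(edges, c1), operators(edges, c2)
    total = sum(ops1.values())
    if total == 0:
        return 0.0
    shared = sum(w for op, w in ops1.items() if op in ops2)
    return shared / total
```

On a toy corpus where one word occurs only in contexts that a more general word also occurs in, the score in one direction is high while the reverse is lower, matching the directionality discussed in this section.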
Moreover, instances of low frequency are regarded as accidental and insignificant, so we filter out links where an instance appears only once.</Paragraph> <Paragraph position="10"> It is obvious from the definition of (1) that</Paragraph> <Paragraph position="12"> 0 ≤ P(c1→c2) ≤ 1, and a higher score expresses a higher possibility of paraphrasing. More importantly, the definition indicates the relation</Paragraph> <Paragraph position="14"> P(c1→c2) ≠ P(c2→c1), i.e., the measure has a directionality that no symmetric similarity metric provides. Even if an expression E1 can be paraphrased into an expression E2, the reverse does not always hold.</Paragraph> </Section> <Section position="3" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 3.4 Paraphrase knowledge filtering </SectionTitle> <Paragraph position="0"> By taking only the discussion of the last subsection into account, we can compute paraphrasability between any two content words.</Paragraph> <Paragraph position="1"> However, this measure is not the final judgment of paraphrasability: some pairs score very high even though they are not paraphrasable. For example, the pair three and four may have a very high score but is of course not paraphrasable.</Paragraph> <Paragraph position="2"> In our observation, the following kinds are found to be misjudged as paraphrasable by our definitions.</Paragraph> <Paragraph position="3"> 1. numbers, e.g., '(three)' - '(four)' 2. proper nouns, e.g., '(Beijing)' - '(Taipei)' 3. antonyms, e.g., '(right)' - '(left)' Obviously, these errors occurred due to a limitation of our approach: the formula considers only the contexts of the words found in the corpus, not the senses of the words found in a dictionary.</Paragraph> <Paragraph position="4"> However, we can filter out these kinds of word pairs by introducing language resources external to the corpus. First, we can judge whether a word is a number by applying some simple rules. Second, we can now easily obtain extensive lists of both major proper nouns and antonyms. 
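Such resource-based filtering might be sketched as follows. The concrete rules and resource formats are assumptions for illustration; the text above specifies only simple number rules plus proper-noun and antonym lists.

```python
import re

# A deliberately simple "number" rule; real rules would also cover
# spelled-out numerals and numeral characters.
NUMBER_RE = re.compile(r"^\d+$")

def is_filtered(w1, w2, proper_nouns, antonym_pairs):
    """Return True if a candidate paraphrase pair should be discarded."""
    if NUMBER_RE.match(w1) or NUMBER_RE.match(w2):
        return True                      # numbers, e.g. three - four
    if w1 in proper_nouns or w2 in proper_nouns:
        return True                      # proper nouns, e.g. Beijing - Taipei
    if (w1, w2) in antonym_pairs or (w2, w1) in antonym_pairs:
        return True                      # antonyms, e.g. right - left
    return False
```

The antonym check is deliberately order-insensitive, since an antonym list records unordered word pairs.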
We obtain the proper noun list from</Paragraph> <Paragraph position="6"> , one of the largest Japanese electronic thesauri, from which 169,682 proper noun entries are extracted. We obtain the antonym list from both Gakken Kokugo Daijiten (a Japanese word dictionary) and Kadokawa Ruigo. In fact, further filtering is necessary in order to reduce errors. For example, in English, guitar, piano, and flute have very similar contexts, such as &quot;to play the __,&quot; &quot;an electric __,&quot; &quot;a violin and a __,&quot; and so on, although they are naturally not paraphrasable. We predict that, in order to use lexical paraphrase collection for filtering, future research will need to concentrate on how to collect word pairs that have the same context but are not paraphrasable.</Paragraph> </Section> <Section position="4" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 3.5 Further filtering by heuristic method </SectionTitle> <Paragraph position="0"> In the final process, we filter the pairs further by using our proposed heuristic to improve the acquisition accuracy.</Paragraph> <Paragraph position="1"> From our observations of the results obtained by the above operations, we found a clear tendency in words that have a very high frequency or a very broad sense: these words tend to be judged as having a high paraphrasability from many words or to many words, even if they are not actually paraphrasable. For example, in figure 2 (a), a content word connects to many words; a word that connects to only one word would more likely have its paraphrasing judged as proper. We also hypothesize that case (c), where two words are exchangeable, is more accurate than the other two cases; these cases are evaluated in the next section.</Paragraph> <Paragraph position="2"> We assume these errors occurred because such words can have dependency relations with many words, i.e., such words are general and frequently appearing. 
Consequently, such cases are unexpectedly judged as being highly paraphrasable from or to many words. As these words are used many times in many contexts, the possibility of noise being introduced also increases. Therefore, distinguishing noise from real paraphrases becomes difficult.</Paragraph> <Paragraph position="3"> These spurious paraphrases should not remain in the final results, so we conduct another filtering according to the above analysis. The actual process is conducted as follows. For each word, pairs are filtered against a threshold P_const; in the experiment below, we set P_const = 0.1.</Paragraph> <Paragraph position="4"> In this heuristic filtering, some word pairs that are actually paraphrasable may, unfortunately, also be lost. The problem of saving them remains for our future work.</Paragraph> </Section> </Section> <Section position="6" start_page="5" end_page="5" type="metho"> <SectionTitle> 4 Knowledge Acquisition Experiment </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 4.1 Experiment on content word paraphrasing </SectionTitle> <Paragraph position="0"> We have conducted an experiment of paraphrasing knowledge acquisition under the following conditions. (These two words have the same string but a different part of speech, so our tagger judges them as different.) The corpus we used was all articles of The Mainichi Shimbun, one of the national daily newspapers of Japan, published in the year 1995. The size of the corpus is 87.3 MB, consisting of 1.33 million sentences.</Paragraph> <Paragraph position="1"> Table 1 illustrates the evaluation results of knowledge acquisition. The results show that our proposed process can choose approximately 1,700 paraphrase pairs with 66% accuracy. 
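The heuristic filtering of subsection 3.5 might be sketched as follows. The exact per-word rule is not fully specified in the text, so the hub-degree criterion below (a pair is kept only when it scores at least P_const and neither word participates in too many high-scoring pairs) is an assumption, as is every name in the sketch.

```python
from collections import Counter

P_CONST = 0.1  # threshold from subsection 3.5

def heuristic_filter(scored_pairs, max_degree=3):
    """scored_pairs: {(src, dst): paraphrasability}. Returns kept pairs.

    Words linked to many high-scoring partners are treated as general,
    noisy "hubs" and all of their pairs are discarded (assumed rule).
    """
    degree = Counter()
    for (src, dst), score in scored_pairs.items():
        if score >= P_CONST:
            degree[src] += 1
            degree[dst] += 1
    return {(src, dst): score
            for (src, dst), score in scored_pairs.items()
            if score >= P_CONST
            and degree[src] <= max_degree
            and degree[dst] <= max_degree}
```

As the surrounding text notes, such a filter inevitably also discards some genuinely paraphrasable pairs that involve frequent words.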
Although this accuracy is not satisfactory for an automatic process, it is already helpful from an engineering point of view: we can obtain a large number of high-quality paraphrase pairs with a minimal human check in significantly less than one day.</Paragraph> <Paragraph position="2"> We also show the acquired paraphrase pairs with the highest paraphrasabilities in Table 2. Note that P(←) in the table denotes the paraphrasability of the inverted paraphrases, in the right-to-left direction, and the symbol ⇔ indicates that this direction is also judged as paraphrasable, i.e., the two words are determined to be paraphrasable with each other. We found that most of the entries in the list are correctly judged to be paraphrasable, even though some of them are not paraphrasable.</Paragraph> <Paragraph position="3"> We can also confirm that the directionality of the proposed measure works quite well. For example, we can paraphrase the term meaning &quot;anecdote&quot; with the more general term meaning &quot;story,&quot; but it is impossible to replace the latter with the former except in some restricted contexts. The outputs seen in this table illustrate such an intuition.</Paragraph> <Paragraph position="4"> If the process judges that two words can paraphrase each other, these words are considered to be a paraphrase in a narrow sense. In this experiment, we extracted 114 pairs that satisfy this relation, and 75 of these pairs are evaluated as being correct, for an accuracy of 65.8%.</Paragraph> </Section> <Section position="2" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 4.2 Experiment for acquisition of operator paraphrases </SectionTitle> <Paragraph position="0"> So far in this paper, we have been using the operator set to compute the paraphrasability of any two words in the content word set in the bigraph. 
We found that we can also do this in the reverse way: computing the paraphrasability of any two operators by using the content word set. This is possible because even if we turn a bigraph upside down, it is still a bigraph. In this subsection, we report an experiment on computing the paraphrasability of operators by the same procedure as above.</Paragraph> <Paragraph position="1"> After the multiple filtering steps, 432 pairs were judged as paraphrasable. Of these, we found that the number of correct pairs was 312 (72.2% accuracy). Table 3 illustrates the final paraphrasable pairs with the highest paraphrasability. Unfortunately, these pairs include errors, so their performance in an automatic process should be improved. However, this performance is still promising for a human-assisted tool.</Paragraph> <Paragraph position="2"> We investigated the pairs and found that various kinds of paraphrasing knowledge were obtained in this process. Not only paraphrases of content words but also paraphrase knowledge of the following types was obtained in this experiment.</Paragraph> <Paragraph position="3"> * insertion and deletion of the associative particle &quot;no&quot; in noun-noun sequences * paraphrasing of case particles; in Japanese, it may be possible to change a particle in a certain context.</Paragraph> <Paragraph position="4"> * voice conversion * different descriptions of the same word, e.g., from a Chinese-origin word to a native Japanese word</Paragraph> </Section> </Section> <Section position="7" start_page="5" end_page="5" type="metho"> <SectionTitle> 5 Related Works </SectionTitle> <Paragraph position="0"> Lexical paraphrasing is very useful in information retrieval, since it is necessary to expand query terms to improve retrieval coverage.</Paragraph> <Paragraph position="1"> Jacquemin et al. (1997) have proposed acquiring syntactic and morpho-syntactic variations of multi-word terms using a corpus-based approach. 
They have searched for variations, i.e., similar expressions that use (a part of) the input words, such as technique for measurement against measurement technique, while our target is the paraphrase of a single content word.</Paragraph> <Paragraph position="2"> The goal of our work is to obtain lexical knowledge for paraphrasing. For this purpose we use contextual similarity, which is also used in the sense similarity computation task in the fields of natural language processing, artificial intelligence, and cognitive science. Moreover, the idea of corpus-based context extraction is basically the same as that used in the automatic construction of thesauri or the sense determination of unknown words.</Paragraph> <Paragraph position="3"> Although this is the first work to use context for paraphrase knowledge extraction, many previously reported works have used context for similarity calculation. Paraphrasability and word sense similarity may seem like similar metrics, but there are critical differences between the two tasks. First, similarity satisfies the symmetry property while paraphrasability does not (as explained in 3.3). Second, similarity is a relative measure while paraphrasability is an absolute measure; in many cases, we can answer the question of whether one word can be paraphrased as another. In other words, it is important to collect paraphrases while it may be pointless to collect similar words, since the border for the former is clearer than that for the latter.</Paragraph> <Paragraph position="4"> The kind of information used for defining context is important. For this question, Nagamatsu and Tanaka (1996) used a deep case (seen in a semantically tagged corpus), and Kanzaki et al.</Paragraph> <Paragraph position="5"> (2000) only extracted relations of nominal modification. The most closely related work in terms of the similarity source is that of Grefenstette (1994), who obtained subject-verb, verb-object, adjective-noun, and noun-noun relations from a corpus. 
In contrast, as discussed in subsection 3.1, we propose extracting all of the dependency relations around content words, i.e., nouns, verbs, and adjectives. This is the first attempt to introduce these features into a context definition, and using various features obviously widens the coverage of the extracted pairs. However, we have not conducted enough experiments to prove that these factors are effective. This remains for our future work.</Paragraph> </Section> </Paper>