<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1031">
  <Title>Paraphrasing Predicates from Written Language to Spoken Language Using the Web</Title>
  <Section position="4" start_page="2" end_page="4" type="metho">
    <SectionTitle>
3 Learning Predicate Paraphrase Pairs
</SectionTitle>
    <Paragraph position="0"> Kaji et al. proposed a method of paraphrasing predicates using a dictionary (Kaji et al., 2002). For example, when the definition sentence of 'chiratsuku (to shimmer)' is 'yowaku hikaru (to shine faintly)', their method paraphrases (1a) into (1b).</Paragraph>
    <Paragraph position="1"> (1) a. ranpu-ga chiratsuku (a lamp shimmers) b. ranpu-ga yowaku hikaru (a lamp shines faintly) As Kaji et al. discussed, this dictionary-based paraphrasing involves three difficulties: word sense ambiguity, extraction of the appropriate paraphrase from a definition sentence, and transformation of postpositions. To solve these difficulties, they proposed a method based on case frame alignment.</Paragraph>
    <Paragraph position="2"> If paraphrases can be extracted from the definition sentences appropriately, paraphrase pairs can be learned. We extracted paraphrases from definition sentences using the method of Kaji et al. (Note that a Japanese noun is attached with a postposition.) However, it is beyond the scope of this paper to describe their method as a whole. Instead, we present an overview and show examples.</Paragraph>
    <Paragraph position="3"> (2) a. chiratsuku (to shimmer) [ kasukani hikaru ] (to shine faintly) b. chokinsuru (to save money) [ okane-wo tameru ] (to save money) c. kansensuru (to be infected) byouki-ga [ utsuru ] (to be infected with a disease) In almost all cases, the headword of a definition sentence of a predicate is also a predicate, and the definition sentence sometimes contains adverbs and nouns which modify the headword. In the examples, the headwords are 'hikaru (to shine)', 'tameru (to save)', and 'utsuru (to be infected)'; the adverb is underlined, the nouns are doubly underlined, and the paraphrases of the predicates are in brackets. The headword and the adverbs can be considered to be always included in the paraphrase. On the other hand, the nouns are not always included: for example, 'money' in (2b) is included but 'disease' in (2c) is not. The method of Kaji et al. decides whether they are included or not.</Paragraph>
    <Paragraph position="4"> The paraphrase includes one noun at most, and is in the form of 'adverb* noun+ predicate', where * means zero or more and + means one or more. Hereafter, it is assumed that a learned paraphrase pair is in the form of 'predicate → adverb* noun+ predicate'. The predicate is called the source, and the 'adverb* noun+ predicate' is called the target.</Paragraph>
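    <Paragraph> The 'predicate → adverb* noun+ predicate' pair structure can be sketched as a small data type (a hypothetical illustration in Python; the class and field names are ours, not from the paper):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ParaphrasePair:
    """One learned pair: a source predicate and its target paraphrase."""
    source: str                                        # predicate (the source)
    adverbs: List[str] = field(default_factory=list)   # zero or more adverbs
    noun: str = ""                                     # at most one noun ("" if none)
    predicate: str = ""                                # headword of the definition

    def target(self) -> str:
        """Render the target in 'adverb* noun predicate' order."""
        parts = self.adverbs + ([self.noun] if self.noun else []) + [self.predicate]
        return " ".join(parts)

# Example (2a): chiratsuku -> kasukani hikaru
pair = ParaphrasePair(source="chiratsuku", adverbs=["kasukani"], predicate="hikaru")
```

For instance, example (2b) would be represented with noun="okane" and predicate="tameru".</Paragraph>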
    <Paragraph position="5"> We used the reikai-shougaku dictionary (Tadika, 1997), and 5,836 paraphrase pairs were learned. The main problem dealt with in this paper is to select paraphrase pairs of the form 'UES → SES' from these 5,836 pairs.</Paragraph>
  </Section>
  <Section position="5" start_page="4" end_page="5" type="metho">
    <SectionTitle>
4 Collecting Written and Spoken Language Corpora from the Web
</SectionTitle>
    <Paragraph position="0"> We distinguish UES and SES (see Figure 1) using the occurrence probability in written and spoken language corpora. Therefore, large written and spoken corpora are necessary. We cannot use existing Japanese spoken language corpora, such as (Maekawa et al., 2000; Takezawa et al., 2002), because they are too small.</Paragraph>
    <Paragraph position="1"> Our solution is to automatically collect written and spoken language corpora from the Web. The Web contains various texts in different styles. Texts such as news articles can be regarded as written language corpora, and texts such as chat logs can be regarded as spoken language corpora. Since we do not need information such as accents or intonations, speech data of real conversations is not always required.</Paragraph>
    <Paragraph position="3"> This paper proposes a method of collecting written and spoken language corpora from the Web using interpersonal expressions (Figure 2). Our method is as follows. First, a corpus is created by removing useless parts such as HTML tags from the Web; it is called the Web corpus. Note that the Web corpus consists of Web pages (hereafter, pages). Secondly, the pages are classified into three types (written language corpus, spoken language corpus, and ambiguous corpus) based on interpersonal expressions.</Paragraph>
    <Paragraph position="4"> Then, only the written and spoken language corpora are used, and the ambiguous corpus is abandoned. This is because: (1) texts in the same page tend to be described in the same style; (2) the boundary between written and spoken language is not clear even for humans, and it is almost impossible to precisely classify all pages into written language or spoken language.</Paragraph>
    <Section position="1" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
4.1 Interpersonal expressions
</SectionTitle>
      <Paragraph position="0"> Each page in the Web corpus is classified based on interpersonal expressions.</Paragraph>
      <Paragraph position="1"> Spoken language is often used as a medium of information which is directed to a specific listener. For example, face-to-face communication is one of the typical situations in which spoken language is used. Due to this fact, spoken language tends to contain expressions which imply a certain attitude of the speaker toward the listener, such as familiarity, politeness, honor, or contempt. Such an expression is called an interpersonal expression. On the other hand, written language is mostly directed to unspecified readers. For example, written language is often used in news articles, books, and papers. Therefore, interpersonal expressions are not used as frequently in written language as in spoken language.</Paragraph>
      <Paragraph position="2"> Among interpersonal expressions, we utilized familiarity and politeness expressions. The familiarity expression is one kind of interpersonal expression, which implies the speaker's familiarity toward the listener. It is represented by a postpositional particle such as 'ne' or 'yo'. The following is an example: (3) watashi-wa ureshikatta yo (I was happy + familiarity particle). (3) implies familiarity using the postpositional particle 'yo'.</Paragraph>
      <Paragraph position="3"> The politeness expression is also one kind of interpersonal expression, which implies politeness to the listener. It is likewise represented by a postpositional particle. For example: (4) watashi-wa eiga-wo mi masu (I watch a movie + politeness particle). (4) implies politeness using the postpositional particle 'masu'.</Paragraph>
      <Paragraph position="4"> Those two interpersonal expressions often appear in spoken language, and are easily recognized as such by a morphological analyzer and simple rules. Therefore, a page in the Web corpus can be classified into the three types based on the following two ratios.</Paragraph>
      <Paragraph position="5"> After each page is analyzed with the morphological analyzer JUMAN (http://www.kc.t.u-tokyo.ac.jp/nl-resource/juman-e.html), sentences which include familiarity or politeness expressions are recognized in the following manner in order to calculate F-ratio (the ratio of sentences containing a familiarity expression) and P-ratio (the ratio of sentences containing a politeness expression). If a sentence has one of the following six postpositional particles, it is considered to include the familiarity expression: ne, yo, wa, sa, ze, na. A sentence is considered to include the politeness expression if it has one of the following four postpositional particles: desu, masu, kudasai, gozaimasu. If the F-ratio and P-ratio of a page are very low, the page is in written language, and vice versa. We observed a part of the Web corpus and empirically decided the rules illustrated in Table 1. If F-ratio and P-ratio are equal to 0, the page is classified as written language. If F-ratio is more than 0.2, or if F-ratio is more than 0.1 and P-ratio is more than 0.2, the page is classified as spoken language. The other pages are regarded as ambiguous.</Paragraph>
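      <Paragraph> The threshold rules of Table 1 can be sketched as follows (a minimal illustration; it assumes sentence splitting and particle extraction, which the paper performs with the JUMAN morphological analyzer, have already been done):

```python
# Six familiarity particles and four politeness particles from Section 4.
FAMILIARITY = {"ne", "yo", "wa", "sa", "ze", "na"}
POLITENESS = {"desu", "masu", "kudasai", "gozaimasu"}

def classify_page(sentences):
    """Classify a page, given each sentence as a list of its particles.

    F-ratio / P-ratio: fraction of sentences containing a familiarity /
    politeness particle, then the threshold rules of Table 1.
    """
    n = len(sentences)
    f_ratio = sum(1 for s in sentences if FAMILIARITY & set(s)) / n
    p_ratio = sum(1 for s in sentences if POLITENESS & set(s)) / n
    if f_ratio == 0 and p_ratio == 0:
        return "written"
    if f_ratio > 0.2 or (f_ratio > 0.1 and p_ratio > 0.2):
        return "spoken"
    return "ambiguous"
```

For instance, a page none of whose sentences contains such particles is classified as written language.</Paragraph>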
    </Section>
    <Section position="2" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
4.3 Evaluation
</SectionTitle>
      <Paragraph position="0"> The Web corpus we prepared consists of 660,062 pages and contains 733M words. Table 2 shows the size of the written and spoken language corpora which were collected from the Web corpus.</Paragraph>
      <Paragraph position="1"> Size comparison: The reason why written and spoken language corpora were collected from the Web is that the available Japanese spoken language corpora are too small. As far as we know, the biggest one is the Spontaneous Speech Corpus of Japanese, which contains 7M words (Maekawa et al., 2000). Our corpus is about ten times as big as the Spontaneous Speech Corpus of Japanese.</Paragraph>
      <Paragraph position="2"> Precision of our method: What is important for our method is not recall but precision. Even if the recall is not high, we can still collect large corpora, because the Web corpus is huge. However, if the precision is low, it is impossible to collect corpora of high quality.</Paragraph>
      <Paragraph position="3"> 240 pages of the written and spoken language corpora were extracted at random, and the precision of our method was evaluated. The 240 pages consist of 125 pages collected as written language corpus and 115 pages collected as spoken language corpus. Two judges (hereafter judge 1 and 2) respectively assessed how many of the 240 pages were classified properly.</Paragraph>
      <Paragraph position="4"> The result is shown in Table 3. Judge 1 identified 228 pages as properly classified; judge 2 identified 221 pages as properly classified. The average precision of the total was 94% (= (228+221)/(240+240)), and we can say that our corpora have sufficient quality.</Paragraph>
      <Paragraph position="5"> The misclassified pages were examined, and it was found that lexical information is useful in order to properly classify them. (5) is an example which means 'A new software is exciting'.</Paragraph>
      <Paragraph position="6"> Sentence (5) includes neither familiarity nor politeness expressions, yet it is spoken-language-like. This is because of the word 'wakuwakusuru', which is informal and means 'exciting'.</Paragraph>
      <Paragraph position="7"> One way to deal with such pages is to use words characteristic of written or spoken language. Such words will be able to be gathered from our written and spoken language corpora. It is our future work to improve the quality of our corpora in such an iterative way.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="5" end_page="7" type="metho">
    <SectionTitle>
5 Paraphrase Pair Selection
</SectionTitle>
    <Paragraph position="0"> A paraphrase pair we want is one in which the source is UES and the target is SES. From the paraphrase pairs learned in Section 3, such paraphrase pairs are selected using the written and spoken language corpora.</Paragraph>
    <Paragraph position="1"> Occurrence probabilities (OPs) of expressions in the written and spoken language corpora can be used to distinguish UES and SES. This is because: (1) an expression is likely to be UES if its OP in the spoken language corpus is very low; (2) an expression is likely to be UES if its OP in the written language corpus is much higher than that in the spoken language corpus.</Paragraph>
    <Paragraph position="3"> For example, Table 4 shows the OP of 'jikaisuru'. It is a difficult verb which means 'to admonish oneself', and is rarely used in conversation. The verb 'jikaisuru' appeared 14 times in the written language corpus, which contains 6.1M predicates, and 7 times in the spoken language corpus, which contains 11.7M predicates. The OP of 'jikaisuru' in the spoken language corpus is low compared with that in the written language corpus. Therefore, we can say that 'jikaisuru' is UES.</Paragraph>
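    <Paragraph> The comparison for 'jikaisuru' can be reproduced from the counts quoted above (a back-of-the-envelope check, not code from the paper):

```python
# Occurrence probability: raw frequency divided by corpus size.
def op(freq, corpus_size):
    return freq / corpus_size

written_op = op(14, 6.1e6)   # 14 occurrences per 6.1M predicates
spoken_op = op(7, 11.7e6)    # 7 occurrences per 11.7M predicates

# The written OP is roughly 3.8 times the spoken OP,
# which supports labeling 'jikaisuru' as UES.
ratio = written_op / spoken_op
```
</Paragraph>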
    <Paragraph position="4"> The paraphrase pair we want can be selected based on the following four OPs.</Paragraph>
    <Paragraph position="5"> (1) OP of the source in the written language corpus; (2) OP of the source in the spoken language corpus; (3) OP of the target in the written language corpus; (4) OP of the target in the spoken language corpus. The selection can be considered as a binary classification task: paraphrase pairs in which the source is UES and the target is SES are treated as positive, and the others as negative. We propose a method based on Support Vector Machines (Vapnik, 1995). The four OPs above are used as features.</Paragraph>
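    <Paragraph> Building the four-dimensional feature vector can be sketched as below (a hypothetical illustration: the counts and the target 'hanseisuru' are invented for the example, and the actual system feeds such vectors to an SVM):

```python
def features(source, target, written_counts, spoken_counts,
             written_size, spoken_size):
    """Return the four OP features of a paraphrase pair as a tuple:
    (src written OP, src spoken OP, tgt written OP, tgt spoken OP)."""
    return (
        written_counts.get(source, 0) / written_size,
        spoken_counts.get(source, 0) / spoken_size,
        written_counts.get(target, 0) / written_size,
        spoken_counts.get(target, 0) / spoken_size,
    )

# Hypothetical frequency tables; the corpus sizes follow the predicate
# counts quoted in the text (6.1M written, 11.7M spoken).
written = {"jikaisuru": 14, "hanseisuru": 40}
spoken = {"jikaisuru": 7, "hanseisuru": 90}
x = features("jikaisuru", "hanseisuru", written, spoken, 6.1e6, 11.7e6)
```
</Paragraph>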
    <Section position="1" start_page="5" end_page="6" type="sub_section">
      <SectionTitle>
5.1 Feature calculation
</SectionTitle>
      <Paragraph position="0"> The method of calculating the occurrence probability OP(e) of an expression e in a corpus is described here. According to the method, the four features above can be calculated. The method is broken down into two steps: counting the frequency of e, and calculating OP(e) using the frequency. Frequency: After a corpus is processed by the Japanese morphological analyzer (JUMAN) and the parser (KNP, http://www.kc.t.u-tokyo.ac.jp/nl-resource/knp-e.html), the frequency of e (f(e)) is counted. Although the frequency is often obvious from the analysis result, there are several issues to be discussed.</Paragraph>
      <Paragraph position="1"> The frequency of a predicate is sometimes quite different from that of the same predicate in a different voice. Therefore, the same predicate in different voices should be treated as different predicates.</Paragraph>
      <Paragraph position="2"> As already mentioned in Section 3, the form of the source is 'predicate' and that of the target is 'adverb* noun+ predicate'. If e is a target and contains adverbs and nouns, it is difficult to count its frequency because of the data sparseness problem. To avoid the problem, an approximation that ignores the adverbs is used. For example, the frequency of 'run fast' is approximated by that of 'run'. We did not ignore the noun for the following reason: since a noun and a predicate form an idiomatic phrase more often than an adverb and a predicate do, the meaning of such an idiomatic phrase completely changes without the noun.</Paragraph>
      <Paragraph position="3"> If the form of the target is 'adverb* noun+ predicate', the frequency is approximated by that of 'noun predicate', which is counted based on the parse result. However, generally speaking, the accuracy of a Japanese parser is low compared with that of a Japanese morphological analyzer; the former is about 90% while the latter is about 99%. Therefore, only the reliable part of the parse result is used, in the same way as Kawahara et al. did. See (Kawahara and Kurohashi, 2001) for the details. Kawahara et al. reported that 97% accuracy is achieved in the reliable part.</Paragraph>
      <Paragraph position="4"> Occurrence probability: In general, OP(e) is defined as: OP(e) = f(e) / (# of expressions in the corpus).</Paragraph>
      <Paragraph position="5"> f(e) tends to be small when e contains a noun, because only the reliable part of the parsed corpus is used to count f(e). Therefore, the value of the denominator '# of expressions in the corpus' should be changed depending on whether e contains a noun or not. The occurrence probability is defined as follows: if e does not contain any nouns, OP(e) = f(e) / (# of predicates); otherwise, OP(e) = f(e) / (# of noun-predicates). The corpus sizes are: the written language corpus contains 6.1M predicates and 1.5M noun-predicates; the spoken language corpus contains 11.7M predicates and 1.9M noun-predicates.</Paragraph>
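      <Paragraph> The two-denominator definition can be written out as follows (a small sketch using the corpus sizes quoted above):

```python
# Corpus sizes from Section 5.1: predicates vs. noun-predicate pairs.
SIZES = {
    "written": {"predicates": 6.1e6, "noun_predicates": 1.5e6},
    "spoken": {"predicates": 11.7e6, "noun_predicates": 1.9e6},
}

def occurrence_probability(freq, corpus, contains_noun):
    """OP(e) = f(e) / (# of predicates) if e has no noun,
    else f(e) / (# of noun-predicates)."""
    key = "noun_predicates" if contains_noun else "predicates"
    return freq / SIZES[corpus][key]

# 'jikaisuru' (no noun) in the written language corpus:
p = occurrence_probability(14, "written", contains_noun=False)
```
</Paragraph>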
    </Section>
    <Section position="2" start_page="6" end_page="7" type="sub_section">
      <SectionTitle>
5.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> The two judges built a data set, and 20-fold cross-validation was used.</Paragraph>
      <Paragraph position="1"> Data set: 267 paraphrase pairs were extracted at random from the 5,836 paraphrase pairs learned in Section 3. The two judges independently tagged each of the 267 paraphrase pairs as positive or negative. Then, only the paraphrase pairs that were agreed upon by both of them were used as the data set. The data set consists of 200 paraphrase pairs (70 positive pairs and 130 negative pairs).</Paragraph>
      <Paragraph position="2"> Experimental result: We implemented the system using the TinySVM package. The kernel function explored was the polynomial function of degree 2.</Paragraph>
      <Paragraph position="3"> Using 20-fold cross-validation, two types of feature sets (F-set1 and F-set2) were evaluated. F-set1 is the feature set of all four features, and F-set2 contains only two features: the OP of the source in the spoken language corpus and the OP of the target in the spoken language corpus. The results were evaluated through three measures: accuracy of the classification (positive or negative), precision of positive paraphrase pairs, and recall of positive paraphrase pairs. Table 6 shows the result. The accuracy, precision, and recall of F-set1 were 76%, 70%, and 73% respectively. Those of F-set2 were 75%, 67%, and 69%.</Paragraph>
      <Paragraph position="4"> The paraphrase pair (1) is a positive example and the paraphrase pair (2) is negative; both of them were successfully classified. The source of (1) appears only 10 times in the spoken language corpus; on the other hand, the source of (2) appears 67 times.</Paragraph>
      <Paragraph position="5"> Discussion: It is challenging to detect the connotational difference between lexical paraphrases, and none of the features were explicitly given; they were estimated using corpora which were prepared in an unsupervised manner. Therefore, we think that the accuracy of 76% is very high.</Paragraph>
      <Paragraph position="6"> The result of F-set1 exceeds that of F-set2. This indicates that comparing OP(e) in the written and spoken language corpora is effective.</Paragraph>
      <Paragraph position="7"> The calculated OP(e) was occasionally quite far from our intuition. One example is that of 'kangekisuru', which is a very difficult verb that means 'to watch a drama'. Although the verb is rarely used in real spoken language, its occurrence probability in the spoken language corpus was very high: the verb appeared 9 times in the written language corpus and 69 times in the spoken language corpus. We examined those corpora and found that the spoken language corpus happens to contain a lot of texts about dramas. Such problems caused by biased topics will be resolved by collecting corpora from a larger Web corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="7" end_page="7" type="metho">
    <SectionTitle>
6 Future Work
</SectionTitle>
    <Paragraph position="0"> In order to estimate more reliable features, we are going to increase the size of our corpora by preparing a larger Web corpus.</Paragraph>
    <Paragraph position="1"> Although this paper has discussed paraphrasing from the point of view of whether an expression is UES or SES, there are a variety of SESs, such as slang or male/female speech. One piece of our future work is to examine what kind of spoken language is suitable for the kind of application illustrated in the introduction.</Paragraph>
    <Paragraph position="2"> This paper has focused only on paraphrasing predicates. However, other kinds of paraphrasing are necessary in order to paraphrase written language text into spoken language. For example, paraphrasing compound nouns or complex syntactic structures are tasks to be tackled.</Paragraph>
  </Section>
</Paper>