<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1056">
  <Title>Paraphrasing of Chinese Utterances</Title>
  <Section position="4" start_page="0" end_page="3" type="metho">
    <SectionTitle>
3 Paraphrasing Pattern
</SectionTitle>
    <Paragraph position="0"> The paraphrase corpus of spoken Chinese consists of 20,000 original sentences and 44,480 paraphrases, with each original sentence having at least two paraphrases (Zhang et al., 2001). The paraphrases were obtained by manually rephrasing the original sentences: words may be reordered, some words may be replaced with synonyms, or the syntactic structure may be changed. Such a paraphrase corpus therefore contains knowledge of how to generate paraphrases for a sentence, and we intend to extract paraphrasing patterns from it. By pairing each paraphrase with its corresponding original sentence, 44,480 pairs were obtained.</Paragraph>
    <Paragraph position="1"> Hereafter, we call such pairs paraphrase pairs.</Paragraph>
    <Paragraph position="2"> Word segmentation and part-of-speech tagging were carried out on the paraphrase pairs. The part-of-speech tagger used the Penn Chinese Treebank tag set, which comprises 33 parts of speech (Xia, 2000). Part of the Penn Chinese Treebank tag set is shown in Table 1.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Extraction of Instances
</SectionTitle>
      <Paragraph position="0"> For a given paraphrase pair, the paraphrase may differ from its original sentence through one of the following paraphrasing phenomena: (1) word order, (2) substitution of synonyms, and (3) change of syntactic structure. In most paraphrase pairs, however, the paraphrase exhibits a mixture of these phenomena. We therefore need to classify the paraphrasing phenomena and learn the corresponding paraphrasing patterns. In this way, we can restrict the paraphrasing process to specific language phenomena and summarize the changes in the information of the resultant paraphrases.</Paragraph>
      <Paragraph position="1"> The following paraphrasing phenomena were considered and related paraphrase pairs were extracted.</Paragraph>
      <Paragraph position="2"> Word order in spoken Chinese is comparatively free. In the paraphrase corpus, quite a large proportion of the paraphrases are created by word reordering. We extracted the paraphrase pairs in which the original sentence and the paraphrase have the same number of morphemes, and each morpheme of the original sentence appears in the paraphrase and vice versa. One example is shown in 3-1.</Paragraph>
      <Paragraph position="3"> [3-1] An extracted paraphrase pair.</Paragraph>
      <Paragraph position="4"> Original: 42/VV 38/AD 555448/VV</Paragraph>
      <Paragraph position="6"> Guided by the extracted paraphrase pair, we can in fact paraphrase the original sentence by reordering its words according to the word order of the paraphrase. The extracted paraphrase pairs of this kind provided instances for learning word order paraphrasing patterns.</Paragraph>
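      <Paragraph> As an illustration only, and not the implementation used for this work, the extraction criterion for word-order pairs can be sketched in Python. The sketch assumes that each sentence is a list of (orthographic expression, part-of-speech) tuples; the function name is_reordering_pair is a hypothetical choice.</Paragraph>
      <Paragraph><![CDATA[
```python
def is_reordering_pair(original, paraphrase):
    """Check the word-order extraction criterion for one paraphrase pair.

    Each sentence is a list of (orthographic expression, part-of-speech)
    tuples. The pair qualifies when both sentences have the same number of
    morphemes and every morpheme of one sentence also appears in the other.
    """
    if len(original) != len(paraphrase):
        return False
    # Set equality covers "appears in the paraphrase and vice versa".
    return set(original) == set(paraphrase)
```
]]></Paragraph>
      <Paragraph> The check follows the stated criterion literally. A stricter variant could compare morpheme counts as well (for example with collections.Counter) so that repeated morphemes must occur equally often on both sides.</Paragraph>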
      <Paragraph position="7"> In some paraphrase pairs, we observed paraphrasing phenomena related to negative expressions. For example, the original sentences include the negative words &quot;75 (do not)&quot; or &quot;52 (did not)&quot;, but their corresponding paraphrases are affirmative forms without these negative words. This suggested that such sentences could be simplified by deleting the negative expressions. For this purpose, we extracted the paraphrase pairs in which the original sentences included the words &quot;75&quot; or &quot;52&quot; and the corresponding paraphrases did not. One example is shown in 3-2.</Paragraph>
      <Paragraph position="8"> The Chinese language has a few grammatical markers, and the particle &quot;76&quot; is one such marker. Sentences of the form &quot;S (subject) V (verb) O (object) C (complement)&quot; may be changed into the form &quot;S 76 OVC&quot; by inserting the particle &quot;76&quot; (Zhang and Sato, 1999). The use of &quot;76&quot; emphasizes the object by moving it before the verb. When the particle &quot;76&quot; is present in a sentence, the object is easier to identify, so inserting &quot;76&quot; supplies more information about syntactic structure and reduces syntactic ambiguity.</Paragraph>
      <Paragraph position="9"> Moreover, paraphrasing sentences that contain the particle &quot;76&quot; may be more accurate because the object is identified more reliably.</Paragraph>
      <Paragraph position="10"> We extracted the paraphrase pairs in which the original sentences included the particle &quot;76&quot; and the corresponding paraphrases did not.</Paragraph>
      <Paragraph position="11"> See example 3-3 below.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Automatic Generalization of Instances
</SectionTitle>
      <Paragraph position="0"> We then attempted to generalize the extracted instances in order to obtain paraphrasing patterns. For each extracted paraphrase pair, the original sentence is generalized to form the matching part of the pattern, and the paraphrase is generalized to form the generation part of the pattern. The matching part specifies the components that will be paraphrased as well as the context conditions. The generation part defines how to construct a paraphrase. When the resulting pattern is applied to an input sentence and the input matches the matching part, a new sentence is generated according to the generation part.</Paragraph>
      <Paragraph position="1"> In fact, the purpose of generalization is to obtain a regular expression from the original sentence and an operation expression containing substitutions from the paraphrase. As shown in 3-3, both the original sentence and the paraphrase are series of morphemes, and each morpheme consists of a part-of-speech and an orthographic expression. The important thing in paraphrasing is to maintain meaning. The extent to which the series of morphemes is generalized depends on each paraphrase pair.</Paragraph>
      <Paragraph position="2"> First, parts of speech carry the syntactic information and should therefore be kept. Second, the orthographic expressions of verbs, auxiliary verbs, adverbs, etc., are important in determining the main meaning of the sentence and should also be kept. The orthographic expressions of other categories, such as nouns, pronouns and numerals, can be generalized to an abstract level by replacing each orthographic expression with a wild card.</Paragraph>
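      <Paragraph> The generalization rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the actual generalization program: the tag list GENERALIZED_TAGS, the variable names X1, X2, ... and the function name generalize are all hypothetical, and the sketch assumes that a wild card bound in the matching part is reused in the generation part to express the substitution.</Paragraph>
      <Paragraph><![CDATA[
```python
# Penn Chinese Treebank tags whose orthographic expressions are replaced
# with wild cards: nouns (NN, NR), pronouns (PN) and numerals (CD).
# The exact tag list is an assumption made for this sketch.
GENERALIZED_TAGS = {"NN", "NR", "PN", "CD"}


def generalize(original, paraphrase):
    """Turn a paraphrase pair into a (matching part, generation part) pair.

    Both sentences are lists of (orthographic expression, part-of-speech)
    tuples. Parts of speech are always kept; only the orthographic
    expressions of the generalized categories become variables.
    """
    variables = {}  # (form, tag) -> variable name such as "X1"

    matching = []
    for form, tag in original:
        if tag in GENERALIZED_TAGS:
            # Bind a new variable the first time this morpheme is seen.
            name = variables.setdefault((form, tag), f"X{len(variables) + 1}")
            matching.append((name, tag))
        else:
            matching.append((form, tag))

    generation = []
    for form, tag in paraphrase:
        # Reuse a variable only if the matching part bound it; otherwise
        # keep the literal form, since there is nothing to substitute.
        name = variables.get((form, tag))
        generation.append((name if name is not None else form, tag))

    return matching, generation
```
]]></Paragraph>
      <Paragraph> In this sketch, verbs and adverbs keep their orthographic expressions in both parts, while a noun or pronoun that occurs on both sides is represented by the same variable.</Paragraph>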
      <Paragraph position="3"> The pattern generalized from 3-3 is illustrated in 3-4. The left part is the matching part and the right part is the generation part. The lexical information may be an orthographic expression or a variable, represented by the symbol X with a subscript.</Paragraph>
      <Paragraph position="5"> A variable in the matching part is in fact a wild card, which means that it can match any orthographic expression during matching.</Paragraph>
      <Paragraph position="7"> A variable in the generation part defines a substitution operation.</Paragraph>
      <Paragraph position="8"> [3-4] A generalized pattern.</Paragraph>
      <Paragraph position="10"> However, we found two problems in this kind of automatic generalization. The first is that restrictions on the patterns generalized from long sentences are too specific at the lexical level. In fact, the clauses and noun phrases used as modifiers have no effect on the considered paraphrasing phenomena and can be generalized further.</Paragraph>
      <Paragraph position="11"> The second is that some orthographic expressions with important meanings are generalized to wild cards; for instance, the numeral &quot;7247 (how many)&quot; may imply that the sentence is interrogative. Therefore, a method is needed to prevent some orthographic expressions from being automatically replaced with wild cards.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="3" type="sub_section">
      <SectionTitle>
3.3 Semi-Automatic Generalization of Instances
</SectionTitle>
      <Paragraph position="0"> Specifying which morphemes should be generalized and which orthographic expressions should be kept requires human experience. In order to integrate human experience into automatic generalization, we developed a semi-automatic generalization tool. The tool consists of description symbols and a transformation program. The description symbols are designed for people to define generalization information on instances, and the transformation program automatically transforms the defined instances into patterns.</Paragraph>
      <Paragraph position="1"> Three description symbols are defined as follows. []: This symbol is followed by a numeral and is used to enclose a sequence of morphemes. The enclosed part is a syntactic component, e.g., a noun phrase or a clause. Except for the part-of-speech of the last morpheme, the enclosed part will be replaced with a variable. In the Chinese language, the syntactic property of a sequence of words is most likely reflected in the last word, so we keep the part-of-speech of the last morpheme. The enclosed parts in the original sentence and the paraphrase denoted by the same numerals will be replaced with the same variables.</Paragraph>
      <Paragraph position="2"> {}: This symbol is used to enclose a morpheme. The orthographic expression of the morpheme will be kept. In this way, the lexical information of morphemes can be utilized to define the context. A few orthographic expressions can be defined inside one symbol so that words that can be paraphrased in the same way can be stored as one pattern.</Paragraph>
      <Paragraph position="3"> &lt;&gt; : This symbol is used to enclose a morpheme. The orthographic expression of the morpheme will be replaced with a variable.</Paragraph>
      <Paragraph position="4"> In this way, the orthographic expressions of verbs or adverbs can also be generalized.</Paragraph>
      <Paragraph position="5"> The usage of the symbols is explained in 3-5 and 3-6. Example 3-5 is a paraphrase pair in which description symbols are defined. Example 3-6 is the paraphrasing pattern generalized from 3-5.</Paragraph>
      <Paragraph position="6">  plies the same meaning, but the part-of-speech of the last morpheme is equal to NN. In addition to the automatic generalization for morphemes of category PN and CD, the defined &quot;&lt;56/M&gt;&quot; is also generalized to X  /NN, although they are not exactly the same.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.4 Construction of the Paraphrasing Patterns
</SectionTitle>
      <Paragraph position="0"> Using the developed tool, we manually defined generalization information on the extracted paraphrase pairs and then obtained the following four groups of paraphrasing patterns through automatic transformation.</Paragraph>
      <Paragraph position="1"> (1) 459 patterns of deleting negative expressions. (2) 160 patterns of inserting &quot;76&quot;. (3) 160 patterns of deleting &quot;76&quot;. (4) 2,030 patterns of reordering words.</Paragraph>
      <Paragraph position="2">  The patterns of (3) were obtained by reversing the matching part and the generation part of each pattern of (2).</Paragraph>
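      <Paragraph> Reversing a pattern amounts to swapping its two parts. The following Python fragment is a hedged sketch of that step, not the transformation program itself: the Pattern type, the functions variables and reverse_pattern, and the convention that variables look like X1, X2, ... are assumptions introduced for illustration. The sketch also checks that every variable used in the new generation part is bound by the new matching part, which is a precaution added here rather than a step stated in the text.</Paragraph>
      <Paragraph><![CDATA[
```python
import re
from typing import NamedTuple, Tuple

Morpheme = Tuple[str, str]  # (lexical information, part-of-speech)


class Pattern(NamedTuple):
    matching: Tuple[Morpheme, ...]
    generation: Tuple[Morpheme, ...]


def variables(part):
    """Return the variable names (assumed to look like X1, X2, ...) in a part."""
    return {lex for lex, _tag in part if re.fullmatch(r"X\d+", lex)}


def reverse_pattern(pattern):
    """Swap the matching part and the generation part of a pattern."""
    reversed_pattern = Pattern(pattern.generation, pattern.matching)
    unbound = variables(reversed_pattern.generation) - variables(reversed_pattern.matching)
    if unbound:
        # A substitution needs a value captured by the matching part.
        raise ValueError(f"unbound variables after reversal: {sorted(unbound)}")
    return reversed_pattern
```
]]></Paragraph>
      <Paragraph> Applying reverse_pattern twice returns the pattern that was started with, so an insertion pattern and its deletion counterpart can be stored as one pair.</Paragraph>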
    </Section>
  </Section>
  <Section position="5" start_page="3" end_page="3" type="metho">
    <SectionTitle>
4 Design of the Paraphrasing Process
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
Process
</SectionTitle>
      <Paragraph position="0"> In order to generate as many different expressions as possible, we designed a mechanism for applying the different groups of paraphrasing patterns. As described in Section 2, the paraphrasing process can be roughly classified into simplification paraphrasing, aimed at simplifying expressions, and diversity paraphrasing, aimed at increasing variation. Because simplification paraphrasing can reduce syntactic and semantic ambiguities, we apply this type of paraphrasing first and then apply diversity paraphrasing. Using this strategy, we anticipate that the accuracy of diversity paraphrasing will be higher because there will be fewer ambiguities in syntax and semantics. Among the four groups of patterns obtained above, group (1) belongs to simplification paraphrasing, and the other groups belong to diversity paraphrasing.</Paragraph>
      <Paragraph position="1"> For one input sentence, the procedure for applying the different groups of patterns is designed as follows.</Paragraph>
      <Paragraph position="2"> (1) Make the input sentence the application data for all groups of patterns. Set the group number i = 1.</Paragraph>
      <Paragraph position="3"> (2) To apply group i, get one pattern from the group and repeat steps (2.1) to (2.3).</Paragraph>
      <Paragraph position="4"> (2.1) Match the input against the matching part of the selected pattern. If the matching succeeds, generate a sentence according to the generation part of the pattern.</Paragraph>
      <Paragraph position="5"> (2.2) Make the generated sentence the application data for all groups j (i &lt; j ≤ 4). (At present there are four groups of patterns.) (2.3) Get another pattern and go to step (2.1), until there are no patterns left in group i.</Paragraph>
      <Paragraph position="6"> (3) Set i = i + 1 and go to step (2), until i &gt; 4. (4) When passing the generated sentences to the transfer, do not pass duplicates.</Paragraph>
      <Paragraph position="7"> Using this procedure, the generated paraphrases can be passed to the transfer at any point in the paraphrasing process. If one of the paraphrases can be translated by the transfer, the paraphrasing process is stopped. In addition, the generated paraphrases can be paraphrased further by the patterns of the following groups, so more expressions are likely to be produced. Based on this design, a paraphraser was implemented.</Paragraph>
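      <Paragraph> The procedure can be sketched in Python as follows. This is an illustrative reading of steps (1) to (4), not the implemented paraphraser: it assumes that each pattern is a callable that returns a generated sentence or None when the matching part does not match, that sentences are hashable values such as strings, and that the optional can_translate callback stands in for the transfer. The names paraphrase, groups and can_translate are hypothetical.</Paragraph>
      <Paragraph><![CDATA[
```python
def paraphrase(sentence, groups, can_translate=None):
    """Apply groups of patterns in order and collect distinct paraphrases.

    groups is a list of pattern groups in application order. Each pattern is
    a callable that takes a sentence and returns a paraphrase, or None when
    its matching part does not match the sentence.
    """
    # Step (1): the input sentence is application data for every group.
    pending = [[sentence] for _ in groups]
    seen = {sentence}
    results = []

    # Steps (2) and (3): process the groups in order.
    for i, group in enumerate(groups):
        for pattern in group:                      # steps (2) and (2.3)
            for data in list(pending[i]):
                generated = pattern(data)          # step (2.1)
                if generated is None or generated in seen:
                    continue                       # step (4): no duplicates
                seen.add(generated)
                results.append(generated)
                # Stop as soon as the transfer can translate a paraphrase.
                if can_translate is not None and can_translate(generated):
                    return results
                # Step (2.2): feed the new sentence to all later groups.
                for j in range(i + 1, len(groups)):
                    pending[j].append(generated)
    return results
```
]]></Paragraph>
      <Paragraph> Because a generated sentence is only added to the data of later groups, each group is applied once, and the simplification group placed first always runs before the diversity groups.</Paragraph>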
    </Section>
  </Section>
</Paper>