<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-4002">
  <Title>Is It Correct? - Towards Web-Based Evaluation of Automatic Natural Language Phrase Generation</Title>
  <Section position="5" start_page="5" end_page="6" type="metho">
    <SectionTitle>
3 Trivial Dialogue Phrases Generation: Transfer-like GA Approach
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
3.1 Initial Population Selection
</SectionTitle>
      <Paragraph position="0"> In the population selection process, a small population of phrases is selected randomly from the Phrase DB (footnote 4: in this paper, DB stands for database). This is a small database created beforehand, which was also used to set the thresholds for the evaluation of the generated phrases. It contains phrases extracted from real human-human trivial dialogues (obtained from the corpus of the University of Southern California (2005)) and from the hand-crafted ALICE database.</Paragraph>
      <Paragraph position="1"> For the experiments, this DB contained 15 trivial dialogue phrases. Some of those phrases are: do you like airplanes ?, have you have your lunch ?, I am glad you are impressed, what are your plans for the weekend ?, and so forth. The initial population consists of phrases selected at random from the database; its size is itself a random number between one and the total number of phrases in the database. No evaluation is performed on this initial population.</Paragraph>
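The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: PHRASE_DB is a hypothetical stand-in holding the four example phrases quoted in the text (the real database held 15), the function name is invented here, and sampling without replacement is an assumption, since the text only says the phrases are selected randomly.

```python
import random

# Hypothetical stand-in for the Phrase DB: the four example phrases quoted
# in the text. The database used in the experiments held 15 phrases.
PHRASE_DB = [
    "do you like airplanes ?",
    "have you have your lunch ?",
    "I am glad you are impressed",
    "what are your plans for the weekend ?",
]


def select_initial_population(phrase_db, rng):
    """Draw a random number of phrases (from 1 up to all of them), unevaluated."""
    size = rng.randint(1, len(phrase_db))  # randint is inclusive on both ends
    return rng.sample(phrase_db, size)     # assumption: no phrase is drawn twice


population = select_initial_population(PHRASE_DB, random.Random(0))
```

Passing an explicit random.Random instance keeps the sketch reproducible; the selected phrases go to the mutation step without any fitness check.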
    </Section>
    <Section position="2" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
3.2 Crossover
</SectionTitle>
      <Paragraph position="0"> Since the length (i.e., the number of words) differs among the analyzed phrases and our algorithm does not use semantic information, the crossover rate in our system was set to 0% in order to avoid distorting the original phrase. This choice also keeps the method language independent.</Paragraph>
      <Paragraph position="1"> New phrases are therefore generated solely by the mutation process explained below.</Paragraph>
    </Section>
    <Section position="3" start_page="5" end_page="6" type="sub_section">
      <SectionTitle>
3.3 Mutation
</SectionTitle>
      <Paragraph position="0"> During the mutation process, each phrase of the selected initial population is mutated at a rate of $1/N$, where $N$ is the total number of words in the phrase. The mutation is performed through a transfer process, using the Features DB.</Paragraph>
      <Paragraph position="1"> This DB contains descriptive features of different topics of human-human dialogues. The word "features" refers here to the specific parts of speech used, that is, nouns, adjectives and adverbs. In order to extract the descriptive features that the Features DB contains, different human-human dialogues (USC, 2005) were clustered by topic, and the most descriptive nouns, adjectives and adverbs of each topic were extracted. Both the word to be replaced within the original phrase and the substitution feature taken from the Features DB are selected at random. In order to obtain a language independent system, part of speech tagging was not performed at this stage. For this mutation process, the total number of possible different expressions that could be generated from a given phrase is $N^{n_{\mathrm{feat}}}$, where the exponent $n_{\mathrm{feat}}$ is the total number of features in the Features DB.</Paragraph>
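A minimal sketch of one mutation step follows. It reads the mutation rate as "exactly one word per phrase is replaced, each word being chosen with equal probability"; that reading, the function name, and the contents of FEATURES_DB are assumptions made for illustration, not details taken from the paper. Words are split on whitespace, so a detached question mark counts as a word.

```python
import random

# Hypothetical feature list. The real Features DB held topic-specific nouns,
# adjectives and adverbs extracted from clustered dialogues.
FEATURES_DB = ["game", "movie", "beautiful", "quickly"]


def mutate(phrase, features, rng):
    """Replace one randomly chosen word with one randomly chosen feature."""
    words = phrase.split()
    position = rng.randrange(len(words))    # every position is equally likely
    words[position] = rng.choice(features)  # no part-of-speech check is made
    return " ".join(words)


original = "what are your plans for the weekend ?"
mutated = mutate(original, FEATURES_DB, random.Random(1))
```

Because no part of speech tagging is involved, the replacement can land on any token, which is why many mutated phrases are ungrammatical and need the evaluation step.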
    </Section>
    <Section position="4" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
3.4 Evaluation
</SectionTitle>
      <Paragraph position="0"> In order to evaluate the correctness of the newly generated expression, we used the WWW as a database. Due to its significant growth (footnote 8), the WWW has become an attractive database for different system applications such as machine translation (Resnik and Smith, 2003), question answering (Kwok et al., 2001), commonsense retrieval (Matuszek et al., 2005), and so forth. In our approach we attempt to evaluate whether a generated phrase is correct through its frequency of appearance on the Web, i.e., the fitness is a function of the frequency of appearance. Since matching an entire phrase on the Web might result in very low retrieval, and in some cases no retrieval at all, we split the given phrase into its respective n-grams.</Paragraph>
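Splitting a phrase into n-grams can be sketched as follows. The whitespace tokenization and the function name are assumptions for illustration; the example phrase is one of the Phrase DB examples quoted earlier.

```python
def ngrams(phrase, n):
    """Return the contiguous word n-grams of a phrase, in order."""
    words = phrase.split()
    # range() is empty when the phrase has fewer than n words.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


phrase = "what are your plans for the weekend ?"
# Bigrams, trigrams and quadrigrams, keyed by n.
sections = {n: ngrams(phrase, n) for n in (2, 3, 4)}
```

For this eight-token phrase the sketch yields seven bigrams, six trigrams and five quadrigrams, each of which would be sent to the search engine as a separate query.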
      <Paragraph position="1">  For each generated phrase to be evaluated, n-grams are produced. The n-grams used are bigrams, trigrams, and quadrigrams. Their frequency of appearance on the Web is retrieved using the Google search engine and ranked. For each n-gram type, thresholds have been established (footnote 9). A phrase is evaluated according to the following algorithm (footnote 10):
if $\alpha \le \text{N-gram freq} \le \theta$, then N-gram "weakly accepted"
elsif $\text{N-gram freq} > \theta$, then N-gram "accepted"
else N-gram "rejected"
where $\alpha$ and $\theta$ are thresholds that vary according to the n-gram type, and N-gram freq is the frequency, or number of hits, returned by the search engine for a given n-gram. Table 1 shows some of the n-grams produced for the generated phrase what are your plans for the game? The frequency of each n-gram is also shown along with the system evaluation. The phrase was evaluated as accepted since none of the n-grams produced was rejected.</Paragraph>
      <Paragraph position="2"> Footnote 8: As of 1998, according to Lawrence and Giles (1999), the surface Web consisted of approximately 2.5 billion documents. As of January 2005, according to Gulli and Signorini (2005), the size of the indexable Web had become approximately 11.5 billion pages. Footnote 9: The tuning of the thresholds of each n-gram type was performed using the phrases of the Phrase DB. Footnote 10: The evaluation "weakly accepted" has been designed to reflect n-grams whose appearance on the Web is significant even though they are rarely used. In the experiment they were treated as accepted.</Paragraph>
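The decision rule can be sketched as below, with a lower and an upper threshold per n-gram type. Everything numeric here is hypothetical: the threshold values and hit counts are made up for illustration, no search engine is queried, and treating a count exactly equal to a threshold as falling inside the "weakly accepted" band is an assumption.

```python
def classify_ngram(freq, lower, upper):
    """Three-way decision on a single n-gram from its hit count."""
    if freq > upper:
        return "accepted"
    if freq >= lower:
        return "weakly accepted"
    return "rejected"


def classify_phrase(ngram_freqs, thresholds):
    """Accept a phrase only if none of its n-grams is rejected.

    ngram_freqs: list of (n, hit_count) pairs, one per n-gram.
    thresholds:  dict mapping n to its (lower, upper) pair.
    """
    labels = [classify_ngram(freq, *thresholds[n]) for n, freq in ngram_freqs]
    # Weakly accepted n-grams count as accepted, as in the experiment.
    return "rejected" if "rejected" in labels else "accepted"


# Hypothetical thresholds per n-gram type (2 = bigram, 3 = trigram, 4 = quadrigram).
THRESHOLDS = {2: (100, 10000), 3: (50, 5000), 4: (10, 1000)}
# Hypothetical hit counts for three n-grams of one phrase.
verdict = classify_phrase([(2, 20000), (3, 60), (4, 500)], THRESHOLDS)
```

Keeping the per-n-gram decision separate from the phrase-level decision makes it easy to swap in variable thresholds later without touching the aggregation rule.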
    </Section>
  </Section>
  <Section position="6" start_page="6" end_page="7" type="metho">
    <SectionTitle>
4 Preliminary Experiments and Results
</SectionTitle>
    <Paragraph position="0"> The system was set up to perform 150 generations (footnote 11). Table 2 contains the results. A total of 591 different phrases were generated, of which 80 were evaluated as "accepted" and the remaining 511 were rejected by the system.</Paragraph>
    <Paragraph position="1">  As part of the preliminary experiment, the generated phrases were evaluated by a native English speaker in order to determine their "naturalness". The human evaluation of the generated phrases used the following categories: a) Unnatural: a phrase that would not be used during a conversation.</Paragraph>
    <Paragraph position="2"> b) Usable: a phrase that could be used during a conversation, even though it is not a common phrase.</Paragraph>
    <Paragraph position="3"> c) Completely Natural: a phrase that might be commonly used during a conversation.</Paragraph>
    <Paragraph position="4"> The results of the human evaluation are shown in Table 3. In this evaluation, 26 of the 80 phrases accepted by the system were considered "completely natural" and 18 were considered "usable", for a total of 44 well-generated phrases (footnote 12). On the other hand, the system's mis-evaluation is observed mostly within the accepted phrases: 36 of the 80 accepted phrases were "unnatural", whereas among the 511 rejected phrases only 8 were considered "usable" and 2 "completely natural". This mis-evaluation negatively affected the precision of the system. Footnote 11: Processing time: 20 hours 13 minutes. The Web search results are as of March 2006. Footnote 12: Phrases that could be used during a conversation.</Paragraph>
    <Paragraph position="5"> In order to obtain a statistical view of the system's performance, the metrics of recall (R) and precision (P) were calculated from the counts in Table 3, where A stands for "Accepted".</Paragraph>
    <Paragraph position="7"> Table 4 shows the system output, i.e., phrases generated and evaluated as accepted by the system, for the original phrase what are your plans for the weekend ? According to the criteria shown above, the generated phrases were evaluated by a user to determine their naturalness, i.e., their applicability to dialogue.</Paragraph>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
4.1 Discussion
</SectionTitle>
      <Paragraph position="0"> Recall is the number of well-generated phrases accepted by the system divided by the total number of well-generated phrases. It measures the coverage of the system in terms of the well-generated phrases. Precision, on the other hand, is the number of well-generated phrases accepted by the system divided by the total number of accepted phrases. It measures the correctness of the system in terms of the evaluation of the phrases.</Paragraph>
      <Paragraph position="1"> For this experiment the recall of the system was 0.815, i.e., 81.5% of the total number of well-generated phrases were correctly selected. However, this implied a trade-off with the precision, which was compromised by the system's wide coverage.</Paragraph>
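As a check on these figures, recall and precision can be recomputed from the counts reported in Section 4. The variable names are chosen here for illustration; the counts are those given in the text, and "well-generated" groups the usable and completely natural phrases as defined there.

```python
# Counts reported in the human evaluation (Section 4).
accepted_natural, accepted_usable, accepted_total = 26, 18, 80
rejected_natural, rejected_usable = 2, 8

# Well-generated = usable or completely natural.
well_generated_accepted = accepted_natural + accepted_usable
well_generated_rejected = rejected_natural + rejected_usable
well_generated_total = well_generated_accepted + well_generated_rejected

recall = well_generated_accepted / well_generated_total  # 44 / 54
precision = well_generated_accepted / accepted_total     # 44 / 80

print(f"recall={recall:.3f} precision={precision:.3f}")
# prints: recall=0.815 precision=0.550
```

The recall matches the 0.815 quoted in the text, while the precision of 0.55 reflects the 36 unnatural phrases among the 80 accepted ones.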
      <Paragraph position="2"> An influential factor in the system's precision and recall is the selection of new features to be used during the mutation process. This is because the insertion of a new feature gives rise to a totally new phrase that might not be related to the original one. In the same vein, a decisive factor in the evaluation of a well-generated phrase is the constantly changing information available on the Web, which suggests applying variable thresholds for evaluation. Even though the system leaves room for improvement, its successful implementation has been confirmed.</Paragraph>
    </Section>
  </Section>
</Paper>