<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2096"> <Title>Adding Syntax to Dynamic Programming for Aligning Comparable Texts for the Generation of Paraphrases</Title> <Section position="6" start_page="750" end_page="753" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="750" end_page="751" type="sub_section"> <SectionTitle> 4.1 Experimental Setup </SectionTitle> <Paragraph position="0"> The data we use in our experiment come from a number of sentence clusters on a variety of topics, but all related to the Milan plane crash event. This cluster was collected manually from the Web of five different news agencies (ABC, CNN, Fox, MSNBC, and USAToday). It concerns the April 2002 crash of a small plane into a building in Milan, Italy and contains a total of 56 documents published over a period of 1.5 days. To divide this corpus into representative smaller clusters, we had a colleague thoroughly read all 56 documents in the cluster and then create a list of important facts surrounding the story. We then picked key terms related to these facts, such as names (Fasulo - the pilot) and locations (Locarno - the city from which the plane had departed). Finally, we automatically clustered sentences based on the presence of these key terms, resulting in 21 clusters of topically related (comparable) sentences. The 21 clusters are grouped into three categories: 7 in training set, 3 in dev-testing set, and the remaining 11 in testing set. Table 1 shows the name and size of each clus- null To test the usefulness of our work, we ran 5 different alignments on the clusters. The first three represent different levels of baseline performance (without syntax consideration) whereas the last two fully employ the syntactic features but treat stop words differently. Table 2 describes the 5 versions of alignment.</Paragraph> <Paragraph position="1"> testing clusters. For the results on the test clusters, see Table 6 The motivation of trying such variations is as follows. Stop words often cause invalid alignment because of their high frequencies, and so do punctuations. Aligning on commas, in particular, is likely to produce long sentences that contain multiple sentence segments ungrammatically patched together.</Paragraph> <Paragraph position="2"> In order to get the best possible performance of the syntactic alignment versions, we use clusters in the training and dev-test sets to tune up the parameter values in our algorithm for checking syntactic match. The parameters in our algorithm are not independent. We pay special attention to the threshold of relative position difference, the discount factor of the trace length difference penalty, and the scores for exactly matched and partially matched IOB values. We try different parameter settings on the training clusters, and apply the top ranking combinations (according to human judgments described later) on clusters in the dev-testing set. 
<Paragraph position="3"> Experimenting on the test data, we have two hypotheses to verify: 1) the two syntactic versions outperform the three baseline versions in both grammaticality and fidelity (discussed later) of the novel sentences produced by alignment; and 2) disallowing alignment on stop words and commas improves performance.</Paragraph>
</Section>
<Section position="2" start_page="751" end_page="752" type="sub_section">
<SectionTitle> 4.2 Experimental Results </SectionTitle>
<Paragraph position="0"> For each cluster, we ran the 5 alignment versions and produced 5 FSAs. From each FSA (corresponding to a cluster A and alignment version i), 100 sentences were randomly generated. We removed those that appear in the original cluster.</Paragraph>
<Paragraph position="1"> The remaining ones are hence novel sentences, among which we randomly chose 10 to test the performance of alignment version i on cluster A (a sketch of this sampling procedure appears at the end of this subsection).</Paragraph>
<Paragraph position="2"> In the human evaluation, each sentence received two scores: grammaticality and fidelity. These two properties are independent, since a sentence can score high on fidelity even if it is not fully grammatical. Four scores are possible for both criteria: (4) perfect (fully grammatical or faithful); (3) good (occasional errors or quite faithful); (2) bad (many grammar errors or unfaithful pieces); and (1) nonsense.</Paragraph>
<Paragraph position="3"> Four judges helped with our evaluation in the training phase. They were provided with the original clusters during the evaluation process, but they received the sentences in shuffled order so that they could not tell which alignment version generated each sentence. Table 3 shows the averages of their evaluations on the 10 clusters in the training and dev-test sets. Each cell corresponds to 400 data points, as we presented 10 sentences per cluster per alignment version to each of the 4 judges (10 sentences x 10 clusters x 4 judges = 400).</Paragraph>
<Paragraph position="4"> After optimizing the parameter configuration for our syntactic alignment in the training phase, we asked another 6 human judges to evaluate our work on the test data. These 6 judges come from diverse backgrounds, including Information, Computer Science, Linguistics, and Bioinformatics. We distributed the 11 test clusters among them so that each cluster was evaluated by at least 3 judges. The workload for each judge is 6 clusters x 5 versions per cluster x 10 sentences per version = 300 sentences. As in the training phase, they received the sentences in shuffled order without knowing the correspondence between sentences and alignment versions. Detailed average statistics are shown in Table 4 and Table 5 for grammaticality and fidelity, respectively. Each cell is the average over 30-40 data points; note that the last row is not the mean of the other rows, since the number of sentences evaluated per cluster varies.</Paragraph>
</Section>
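The sentence sampling step described at the beginning of this subsection can be sketched as follows. The FSA representation assumed here (a start state, a set of final states, and a transition map from a state to (word, next state) pairs) is chosen for illustration only; the authors' actual data structures are not specified in this section.

    import random

    def random_sentence(start, finals, transitions, max_len=60):
        """Take a random walk through an alignment FSA, emitting one word per arc."""
        state, words = start, []
        while state not in finals and len(words) < max_len:
            word, state = random.choice(transitions[state])
            words.append(word)
        return " ".join(words)

    def sample_novel(start, finals, transitions, original_sents, n_generate=100, n_keep=10):
        """Generate n_generate sentences, drop those already in the original cluster,
        and randomly keep n_keep novel ones for human evaluation."""
        generated = {random_sentence(start, finals, transitions) for _ in range(n_generate)}
        novel = list(generated - set(original_sents))
        random.shuffle(novel)
        return novel[:n_keep]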
<Section position="3" start_page="752" end_page="753" type="sub_section">
<SectionTitle> 4.3 Result Analysis </SectionTitle>
<Paragraph position="0"> The results support both our hypotheses. For Hypothesis I, we see that the two syntactic alignments score higher than the non-syntactic versions. In particular, Version 4 outperforms the best baseline version by 19.9% on grammaticality and by 22.8% on fidelity. Our second hypothesis is also verified: disallowing alignment on stop words and commas yields better results. This is reflected by the fact that Version 4 beats Version 5, and Version 3 wins over the other two baseline versions on both criteria.</Paragraph>
<Paragraph position="1"> At the level of individual clusters, the syntactic versions are also found to outperform the syntax-blind baselines. Applying a t-test to the score sets for the 5 versions, we can reject the null hypothesis with 99.5% confidence, confirming that the syntactic alignment performs better. Similarly, for Hypothesis II, the same holds for the versions with and without stop word alignment. Figures 5 and 6 provide a graphical view of how each alignment version performs on the test clusters. The clusters along the x-axis are listed in order of increasing size.</Paragraph>
<Paragraph position="2"> We have also analyzed inter-judge agreement in the evaluation. The judges were instructed about the evaluation scheme individually and did their work independently. We did not require them to be mutually consistent, as long as they were self-consistent. Nevertheless, Table 6 shows the mean and standard deviation of the human judgments (grammaticality and fidelity) for each version. The small deviation values indicate fairly high agreement.</Paragraph>
<Paragraph position="3"> Finally, because human evaluation is expensive, we additionally tried a language-model approach in the training phase for automatic evaluation of grammaticality. We used BLEU scores (Papineni et al., 2001), but observed that they are not consistent with the scores of the human judges. In particular, BLEU assigns overly high scores to segmented sentences that are otherwise grammatical. It has been noted in the literature that metrics like BLEU that are based solely on N-grams might not be suitable for checking grammaticality.</Paragraph>
</Section>
</Section>
</Paper>
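As a rough illustration of the significance test mentioned in Section 4.3, the sketch below runs one possible version of such a comparison: a paired t-test over per-cluster scores for a syntactic version against a baseline version. The score arrays are placeholders on the paper's 1-4 scale; the actual score sets come from the human judgments and are not reproduced here.

    from scipy import stats

    # Placeholder per-cluster average grammaticality scores (1-4 scale) for a
    # baseline version and a syntactic version over the 11 test clusters.
    baseline_scores  = [2.4, 2.1, 2.6, 2.3, 2.5, 2.2, 2.7, 2.4, 2.3, 2.5, 2.6]
    syntactic_scores = [3.1, 2.9, 3.3, 3.0, 3.2, 2.8, 3.4, 3.1, 3.0, 3.2, 3.3]

    # One-sided paired t-test: does the syntactic version score higher on the same clusters?
    t_stat, p_two_sided = stats.ttest_rel(syntactic_scores, baseline_scores)
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.4f}")
    # A p-value below 0.005 corresponds to the 99.5% confidence level reported in Section 4.3.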