<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2011"> <Title>Spectral Clustering for Example Based Machine Translation</Title> <Section position="4" start_page="0" end_page="41" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In EBMT, the source sentence to be translated is matched against the source language sentences present in a corpus of source-target sentence pairs.</Paragraph> <Paragraph position="1"> When a partial match is found, the corresponding target translations are obtained through subsentential alignment. These partial matches are put together to obtain the final translation by optimizing translation and alignment scores and using a statistical target language model in the decoding process. Prior work has shown that EBMT requires large amounts of data (in the order of two to three million words) (Brown, 2000) of pre-translated text, to function reasonably well. Thus, some modification of the basic EBMT method is required to make it effective when less data is available. In order to use the available text efficiently, systems such as, (Veale and Way, 1997) and (Brown, 1999), convert the examples in the corpus into templates against which the new text can be matched. Thus, source-target sentence pairs are converted to source-target generalized template pairs. An example of such a pair is shown below: The session opened at 2p.m La s'eance est ouverte 'a 2 heures The <event><verb-past-tense> at <time> La <event><verb-past-tense> a <time> This single template can be used to translate different source sentences, including for example, The session adjourned at 6p.m The seminar opened at 8a.m if 'session' and 'seminar' are both generalized to '<event>', 'opened' and 'adjourned' are both generalized to '<verb-past-tense>' and finally '6p.m' and '8a.m' are both generalized to '<time>'.</Paragraph> <Paragraph position="2"> The system used by (Brown, 1999) performs its generalization using both equivalence classes of words and a production rule grammar. This paper describes the use of spectral clustering (Ng. et. al., 2001; Zelnik-Manor and Perona, 2004), for automated extraction of equivalence classes. Spectral clustering is seen to be superior to Group Average Clustering (GAC) (Brown, 2000) both in terms of semantic similarity of words falling in a single cluster, and overall BLEU score (Papineni. et. al., 2002) in a large scale EBMT system.</Paragraph> <Paragraph position="3"> The next section explains the term vectors extracted for each word, which are then used to cluster words into equivalence classes and provides an outline of the Standard GAC algorithm. Section 3 describes the spectral clustering algorithm used. Sec- null tion 4 lists results obtained in a full evaluation of the algorithm. Section 5 concludes and discusses directions for future work.</Paragraph> </Section> <Section position="5" start_page="41" end_page="41" type="metho"> <SectionTitle> 2 Term vectors for clustering </SectionTitle> <Paragraph position="0"> Using a bilingual dictionary, usually created using statistical methods such as those of (Brown et. al., 1990) or (Brown, 1997), and the parallel text, a rough mapping between source and target words can be created. This word pair is then treated as an indivisible token for future processing. For each such word pair we then accumulate counts for each token in the surrounding context of its occurrences (N words, currently 3, immediately prior to and N words immediately following). 
<Paragraph position="1"> In this paper, we compare our algorithm against the incremental GAC algorithm (Brown, 2000). This method examines each word pair in turn, computing a similarity measure to every existing cluster. If the best similarity measure is above a predetermined threshold, the new word is placed in the corresponding cluster; otherwise, a new cluster is created, provided the maximum number of clusters has not yet been reached.</Paragraph> </Section> <Section position="6" start_page="41" end_page="41" type="metho"> <SectionTitle> 3 Spectral clustering </SectionTitle> <Paragraph position="0"> Spectral clustering is a general term used to describe a group of algorithms that cluster points using the eigenvalues and eigenvectors of 'distance' (affinity) matrices obtained from the data. In our case, the algorithm described in (Ng et al., 2001) was used, with the variations proposed by (Zelnik-Manor and Perona, 2004) for computing the scaling factors automatically and with the orthogonal treatment of (Verma and Meila, 2003) for the k-Means initialization. These scaling factors help in self-tuning the distances between points according to the local statistics of the neighborhoods of the points. The algorithm is briefly described below.</Paragraph> <Paragraph position="2"> 1. Let S = {s_1, s_2, ..., s_n} denote the term vectors to be clustered into k classes.</Paragraph> <Paragraph position="3"> 2. Form the affinity matrix A defined by A_ij = exp(-d(s_i, s_j)^2 / (sigma_i sigma_j + epsilon)), where d(s_i, s_j) = 1 - sim(s_i, s_j), sim(s_i, s_j) is the cosine similarity between s_i and s_j, epsilon is used to prevent the ratio from becoming infinite, and sigma_i is the local scaling parameter for s_i: sigma_i = d(s_i, s_T), where s_T is the T-th nearest neighbor of the point s_i for some fixed T (7 for this paper).</Paragraph> <Paragraph position="6"> 3. Define D to be the diagonal matrix given by D_ii = sum_j A_ij. 4. Compute L = D^(-1/2) A D^(-1/2). 5. Select the k eigenvectors of L corresponding to its k largest eigenvalues (k is presently an externally set parameter). The eigenvectors are normalized to have unit length. Form the matrix U by stacking all the eigenvectors in columns.</Paragraph> <Paragraph position="7"> 6. Form the matrix Y by normalizing each of U's rows to unit length: Y_ij = U_ij / (sum_k U_ik^2)^(1/2).</Paragraph> <Paragraph position="9"> 7. Perform k-Means clustering, treating each row of Y as a point in k dimensions. The k-Means algorithm is initialized either with random centers or with orthogonal vectors.</Paragraph> <Paragraph position="10"> 8. After clustering, assign the point s_i to cluster c if the corresponding row i of the matrix Y was assigned to cluster c.</Paragraph> <Paragraph position="11"> 9. Sum the distances between the members and the centroid of each cluster to obtain the classification cost.</Paragraph> <Paragraph position="12"> 10. Go to step 7 and iterate for a fixed number of iterations. In this paper, 20 iterations were performed with orthogonal k-Means initialization and 5 iterations with random k-Means initialization. 11. The clusters obtained from the iteration with the least classification cost are selected as the final k clusters.</Paragraph>
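<Paragraph> The steps above can be illustrated with the following sketch in Python using NumPy and scikit-learn. It is a minimal reconstruction under stated assumptions, not the system's actual code: the distance between term vectors is taken to be one minus their cosine similarity, the k-Means restarts simply use scikit-learn's default initialization rather than the orthogonal initialization of (Verma and Meila, 2003), and the function name self_tuning_spectral_clustering is purely illustrative.

import numpy as np
from sklearn.cluster import KMeans

def self_tuning_spectral_clustering(X, k, T=7, eps=1e-12, n_restarts=5):
    # X: dense array with one term vector per row; k: number of classes.
    # Cosine similarity and the derived distance between term vectors.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    dist = 1.0 - Xn @ Xn.T

    # Step 2: local scaling parameters (distance to the T-th neighbor)
    # and the affinity matrix A.
    sigma = np.sort(dist, axis=1)[:, T]
    A = np.exp(-dist ** 2 / (np.outer(sigma, sigma) + eps))
    np.fill_diagonal(A, 0.0)

    # Steps 3-4: L = D^(-1/2) A D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1) + eps)
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Step 5: eigenvectors of the k largest eigenvalues, stacked as columns.
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, np.argsort(vals)[-k:]]

    # Step 6: normalize each row of U to unit length.
    Y = U / (np.linalg.norm(U, axis=1, keepdims=True) + eps)

    # Steps 7-11: repeated k-Means on the rows of Y; keep the run with the
    # lowest inertia_ (sum of squared distances to the centroids), which
    # stands in for the classification cost of step 9.
    best_labels, best_cost = None, np.inf
    for _ in range(n_restarts):
        km = KMeans(n_clusters=k, n_init=1).fit(Y)
        if km.inertia_ < best_cost:
            best_labels, best_cost = km.labels_, km.inertia_
    return best_labels

Each entry of the returned label vector assigns the corresponding term vector (and hence the underlying word pair) to one of the k equivalence classes. </Paragraph>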
</Section> <Section position="7" start_page="41" end_page="43" type="metho"> <SectionTitle> 4 Preliminary Results </SectionTitle> <Paragraph position="0"> The clusters obtained from the spectral clustering method are seen, by inspection, to correspond to more natural and intuitive word classes than those obtained by GAC. Although this judgment is subjective and not guaranteed to lead to improved translation performance, it suggests that the greater power of spectral clustering to represent non-convex classes (non-convex in the term-vector domain) could be useful in a real translation experiment. Some example classes are shown in Table 1. The first class intuitively corresponds to measurement units. We see that in the <units> case, GAC misses some of the members, which are instead distributed among many different classes and hence are not well generalized. In the second class, <months>, spectral clustering places primarily the months in a single class, whereas GAC adds a number of seemingly unrelated words to the cluster. The classes were all obtained by finding 80 clusters in a 20,000-sentence-pair subset of the IBM Hansard Corpus (Linguistic Data Consortium, 1997) for spectral clustering; 80 was chosen as the number of clusters since it gave the highest BLEU score in the evaluation. For GAC, 300 clusters were used, as this gave the best performance.</Paragraph> <Paragraph position="1"> To show the effectiveness of the clustering methods in an actual evaluation, we set up the following experiment for an English-to-French translation task on the Hansard corpus. The training data consists of three sets of 10,000 (set1), 20,000 (set2) and 30,000 (set3) sentence pairs chosen from the first six files of the Hansard Corpus. Only sentences of length 5 to 21 words were taken. Only words with a frequency of occurrence greater than 9 were chosen for clustering, because more contextual information is available when a word occurs frequently, which helps in obtaining better clusters. The test data was a set of 500 sentences obtained from files 20, 40, 60 and 80 of the Hansard corpus, with 125 sentences from each file. Each of the methods was run with different numbers of clusters, and results are reported only for the optimal number of clusters in each case.</Paragraph> <Paragraph position="2"> The results in Table 2 show that spectral clustering requires moderate amounts of data to yield a large improvement. For small amounts of data it is slightly worse than GAC, but neither method gives much improvement over the baseline. For larger amounts of data, the two methods are again very similar, though spectral clustering is better. Finally, for moderate amounts of data, when generalization is the most useful, spectral clustering gives a significant improvement over both the baseline and GAC.</Paragraph> <Paragraph position="3"> By examining the clusters obtained with varying amounts of data, it can be concluded that high-purity clusters can be obtained with even just moderate amounts of data.</Paragraph> </Section> </Paper>