<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1024">
  <Title>Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences</Title>
  <Section position="4" start_page="0" end_page="12" type="metho">
    <SectionTitle>
3 A Syntax-Based Alignment Algorithm
</SectionTitle>
    <Paragraph position="0"> Our syntax-based alignment algorithm, whose pseudocode is shown in Figure 4, works in three steps. In the first step (lines 1-5 in Figure 4), we parse every sentence in a sentence group and merge all resulting parse trees into a parse forest. In the second step (line 6), we extract an FSA from the parse forest. In the third step (line 7), we compact the FSA further using a limited form of bottom-up alignment, which we call squeezing. In what follows, we describe each step in turn. (Footnote 1: Linguistic Data Consortium (LDC) Catalog Number LDC2002T01, ISBN 1-58563-217-1.)</Paragraph>
    <Paragraph position="1"> 1. ParseForest = ε
 2. foreach s ∈ SentenceGroup
 3.   t = parseTree(s);
 4.   ParseForest = Merge(ParseForest, t);
 5. endfor
 6. Extract FSA from ParseForest;
 7. Squeeze FSA;</Paragraph>
    <Paragraph position="2"> Top-down merging. Given a sentence group, we pass each of the 11 sentences to Charniak's (2000) parser to get 11 parse trees. The first step in the algorithm is to merge these parse trees into one parse-forest-like structure using a top-down process.</Paragraph>
    <Paragraph position="3"> Let's consider a simple case in which the parse forest contains one single tree, Tree 1 in Figure 5, and we are adding Tree 2 to it. Since the two trees correspond to sentences that have the same meaning and since both trees expand an S node into an NP and a VP, it is reasonable to assume that NP1 is a paraphrase of NP2 and VP1 is a paraphrase of VP2. We merge NP1 with NP2 and VP1 with VP2 and continue the merging process on each of the subtrees recursively, until we either reach the leaves of the trees or the two nodes that we examine are expanded using different syntactic rules.</Paragraph>
    <Paragraph position="4"> When we apply this process to the trees in Figure 5, the NP nodes are merged all the way down to the leaves, and we get &quot;12&quot; as a paraphrase of &quot;twelve&quot; and &quot;people&quot; as a paraphrase of &quot;persons&quot;; in contrast, the two VPs are expanded in different ways, so no merging is done beyond this level, and we are left with the information that &quot;were killed&quot; is a paraphrase of &quot;died&quot;. We repeat this top-down merging procedure with each of the 11 parse trees in a sentence group. So far, only constituents with the same syntactic type are treated as paraphrases. However, we shall see later that we can also match word spans whose syntactic types differ.</Paragraph>
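    <Paragraph> The top-down merging described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the tuple encoding of trees, the function names (label, is_preterminal, leaves, merge), and the part-of-speech tags in the example are assumptions, and the example trees are only modeled on the word pairs quoted in the text rather than copied from Figure 5. The sketch collects the aligned word spans instead of building the parse forest itself.

```python
# A node is (label, children); the children of a preterminal are plain words.
# Assumption: words appear only under preterminal (part-of-speech) nodes.


def label(node):
    return node[0]


def is_preterminal(node):
    # A preterminal dominates words directly.
    return all(isinstance(child, str) for child in node[1])


def leaves(node):
    """Return the words spanned by a node, left to right."""
    words = []
    for child in node[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def merge(a, b, pairs):
    """Align two nodes that are already assumed to be paraphrases.

    Recurse while both nodes are expanded by the same syntactic rule;
    otherwise stop and record the two word spans as a paraphrase pair.
    """
    expandable = not is_preterminal(a) and not is_preterminal(b)
    if expandable and [label(c) for c in a[1]] == [label(c) for c in b[1]]:
        for child_a, child_b in zip(a[1], b[1]):
            merge(child_a, child_b, pairs)
        return
    span_a, span_b = " ".join(leaves(a)), " ".join(leaves(b))
    if span_a != span_b:
        pairs.append((span_a, span_b))


# Hypothetical trees modeled on the example discussed in the text.
tree1 = ("S", [
    ("NP", [("CD", ["12"]), ("NNS", ["persons"])]),
    ("VP", [("AUX", ["were"]), ("VP", [("VBN", ["killed"])])]),
])
tree2 = ("S", [
    ("NP", [("CD", ["twelve"]), ("NNS", ["people"])]),
    ("VP", [("VBD", ["died"])]),
])

pairs = []
merge(tree1, tree2, pairs)
print(pairs)
```

    On these two trees the sketch pairs "12" with "twelve" and "persons" with "people", and stops at the VP level with "were killed" against "died", which mirrors the behavior described above.</Paragraph>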
    <Paragraph position="5"> Keyword checking. The matching process described above appears quite strict: the expansions must match exactly for two nodes to be merged. But consider the following parse trees:
 1. (S (NP1 people) (VP1 were killed in this battle))
 2. (S (NP2 this battle) (VP2 killed people))
 If we applied the algorithm described above, we would mistakenly align NP1 with NP2 and VP1 with VP2, because the algorithm described so far makes no use of lexical information. To prevent such erroneous alignments, we also implement a simple keyword checking procedure. We note that since the word &quot;battle&quot; appears in both VP1 and NP2, this can serve as evidence against the merging of (NP1, NP2) and (VP1, VP2). A similar argument can be constructed for the word &quot;people&quot;. So in this example we actually have double evidence against merging; in general, one such clue suffices to stop the merging.</Paragraph>
    <Paragraph position="6"> Our keyword checking procedure acts as a filter. A list of keywords is maintained for each node in a syntactic tree. This list contains all the nouns, verbs, and adjectives that are spanned by a syntactic node. Before merging two nodes, we check whether the keyword lists associated with them share words with other nodes. That is, suppose we have just merged nodes A and B, and they are expanded with the same syntactic rule into A1 A2 ... An and B1 B2 ... Bn respectively; before we merge each Ai with Bi, we check for each Bi whether its keyword list shares words with any Aj (j ≠ i). If no such overlap exists, we continue the top-down merging process; otherwise we stop. In our current implementation, a pair of synonyms cannot stop an otherwise legitimate merging, but it is possible to extend our keyword checking process with the help of lexical resources such as WordNet in future work.</Paragraph>
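    <Paragraph> The keyword filter can be illustrated with a short Python sketch. It is written under stated assumptions and is not the authors' code: trees are encoded as (label, children) tuples, the names keywords and may_merge_children are hypothetical, and nouns, verbs, and adjectives are identified by Penn-Treebank-style tag prefixes (NN, VB, JJ), with auxiliaries tagged AUX and therefore ignored. The part-of-speech tags in the two example trees are likewise assumed.

```python
# Assumption: Penn-Treebank-style tags; NN*, VB*, JJ* mark content words.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ")


def keywords(node):
    """Return the set of nouns, verbs, and adjectives spanned by a node."""
    tag, children = node
    if all(isinstance(child, str) for child in children):
        if tag.startswith(CONTENT_TAG_PREFIXES):
            return {word.lower() for word in children}
        return set()
    found = set()
    for child in children:
        found |= keywords(child)
    return found


def may_merge_children(children_a, children_b):
    """Check each Bi against every Aj with j != i, as described in the text.

    Returns False as soon as one shared keyword is found, because a single
    clue is enough to stop the merging.
    """
    keys_a = [keywords(child) for child in children_a]
    keys_b = [keywords(child) for child in children_b]
    for i, key_b in enumerate(keys_b):
        for j, key_a in enumerate(keys_a):
            if i != j and not key_b.isdisjoint(key_a):
                return False
    return True


# The two sentences from the text, with assumed part-of-speech tags.
battle1 = ("S", [
    ("NP", [("NNS", ["people"])]),
    ("VP", [("AUX", ["were"]),
            ("VP", [("VBN", ["killed"]),
                    ("PP", [("IN", ["in"]),
                            ("NP", [("DT", ["this"]), ("NN", ["battle"])])])])]),
])
battle2 = ("S", [
    ("NP", [("DT", ["this"]), ("NN", ["battle"])]),
    ("VP", [("VBD", ["killed"]), ("NP", [("NNS", ["people"])])]),
])

print(may_merge_children(battle1[1], battle2[1]))
```

    Here NP2 contributes the keyword "battle", which also occurs under VP1, so the check fails and the pairs (NP1, NP2) and (VP1, VP2) are not merged. Because the check compares surface words only, synonyms pass through it, as noted above.</Paragraph>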
    <Paragraph position="7"> Mapping Parse Forests into Finite State Automata.</Paragraph>
    <Paragraph position="8"> The process of mapping parse forests into finite state automata is simple. We traverse the parse forest top-down and create alternative paths for every merged node. For example, the parse forest in Figure 5 is mapped into the FSA shown at the bottom of the same figure. In the FSA, there is a word associated with each edge. Different paths between any two nodes are assumed to be paraphrases of each other. Each path that starts from the BEGIN node and ends at the END node corresponds to either an original input sentence or a paraphrase sentence.</Paragraph>
    <Paragraph position="9"> Squeezing. Since we adopted a very strict matching criterion in top-down merging, a small difference in the syntactic structure of two trees prevents some legitimate mergings from taking place. This behavior is exacerbated by errors in syntactic parsing. For instance, the three edges labeled &quot;detroit&quot; at the far left of the top FSA in Figure 6 were kept apart. To compensate for this effect, our algorithm implements an additional step, which we call squeezing. If two different edges that go into (or out of) the same node in an FSA are labeled with the same word, the nodes on the other end of the edges are merged. We apply this operation exhaustively over the FSAs produced by the top-down merging procedure. Figure 6 illustrates the effect of this operation: the FSA at the top of the figure is compressed into the more compact FSA shown at the bottom. Note that in addition to removing redundant edges, squeezing also gives us paraphrases that were not available in the FSA before squeezing (e.g. {reduced to rubble, blasted to ground}). Therefore, the squeezing operation, which implements a limited form of lexically driven alignment similar to that exploited by MSA algorithms, leads to FSAs that have a larger number of paths and paraphrases.</Paragraph>
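    <Paragraph> The squeezing rule can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' implementation: the FSA is encoded as a set of (source, word, destination) triples, the function names find_mergeable, squeeze, and spans are hypothetical, and the toy automaton is invented for the example and is not the FSA of Figure 6. The spans helper additionally assumes that the automaton is acyclic.

```python
def find_mergeable(edges):
    """Find two nodes that the squeezing rule says should be merged.

    Two edges with the same word that leave the same node have their
    destinations merged; two that enter the same node have their sources
    merged. Returns (keep, drop), or None when no merge applies.
    """
    seen_out = {}  # (source, word) -> first destination seen
    seen_in = {}   # (destination, word) -> first source seen
    for src, word, dst in sorted(edges):
        other = seen_out.setdefault((src, word), dst)
        if other != dst:
            return other, dst
        other = seen_in.setdefault((dst, word), src)
        if other != src:
            return other, src
    return None


def squeeze(edges):
    """Apply the merge rule exhaustively; each merge removes one node."""
    edges = set(edges)
    while True:
        pair = find_mergeable(edges)
        if pair is None:
            return edges
        keep, drop = pair
        edges = {
            (keep if s == drop else s, w, keep if d == drop else d)
            for s, w, d in edges
        }


def spans(edges, start, end):
    """All word sequences on paths from start to end (acyclic FSA assumed)."""
    if start == end:
        return [[]]
    found = []
    for src, word, dst in sorted(edges):
        if src == start:
            found.extend([word] + rest for rest in spans(edges, dst, end))
    return found


# Hypothetical FSA holding two unmerged sentences.
fsa = {
    ("BEGIN", "detroit", "a1"), ("a1", "was", "a2"), ("a2", "reduced", "a3"),
    ("a3", "to", "a4"), ("a4", "rubble", "END"),
    ("BEGIN", "detroit", "b1"), ("b1", "was", "b2"), ("b2", "blasted", "b3"),
    ("b3", "to", "b4"), ("b4", "ground", "END"),
}

squeezed = squeeze(fsa)
print(len(fsa), len(squeezed))
print(spans(squeezed, "a2", "END"))
```

    In this toy automaton the duplicated "detroit" and "was" edges collapse, so the ten edges shrink to eight. The two remaining branches now start from a shared node, which exposes "reduced to rubble" and "blasted to ground" as alternative paths between the same pair of nodes.</Paragraph>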
  </Section>
</Paper>