<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1022">
  <Title>Bootstrapping Lexical Choice via Multiple-Sequence Alignment</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> One or two homologous sequences whisper ... a full multiple alignment shouts out loud (Hubbard et al., 1996).</Paragraph>
    <Paragraph position="1"> Today's natural language generation systems typically employ a lexical chooser that translates complex semantic concepts into words. The lexical chooser relies on a mapping dictionary that lists possible realizations of elementary semantic concepts; sample entries might be [Parent [sex:female]] → mother or love(x,y) → {x loves y, x is in love with y}.¹ To date, creating these dictionaries has involved human analysis of a domain-relevant corpus comprised of semantic representations and corresponding human verbalizations (Reiter and Dale, 2000). The corpus analysis and knowledge engineering work required in such an approach is substantial, prohibitively so in large domains. But, since corpus data is already used in building lexical choosers by hand, an appealing alternative is to have the system learn a mapping dictionary directly from the data. Clearly, this would greatly reduce the human effort involved and ease porting the system to new domains. Hence, we address the following problem: given a parallel (but unaligned) corpus consisting of both complex semantic input and corresponding natural language verbalizations, learn a semantics-to-words mapping dictionary automatically.</Paragraph>
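    As a minimal sketch of what such a mapping dictionary might look like in code, the two sample entries above can be stored as a table from semantic concepts to lists of realization templates. The names MAPPING_DICTIONARY and realizations, the brace-style templates, and the example arguments are illustrative assumptions, not part of the system described here.

```python
# Hypothetical mapping dictionary: each semantic concept lists its possible
# realizations. The template syntax ({x}, {y}) is an assumption of this sketch.
MAPPING_DICTIONARY = {
    "[Parent [sex:female]]": ["mother"],
    "love(x,y)": ["{x} loves {y}", "{x} is in love with {y}"],
}


def realizations(concept, **arguments):
    """Return every listed realization of a concept, with arguments filled in."""
    templates = MAPPING_DICTIONARY.get(concept, [])
    return [template.format(**arguments) for template in templates]


if __name__ == "__main__":
    print(realizations("love(x,y)", x="Romeo", y="Juliet"))
    print(realizations("[Parent [sex:female]]"))
```

    A lexical chooser built on a table of this kind can only pick among the realizations that are listed, which is why coverage of alternative verbalizations matters for expressiveness.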
    <Paragraph position="2"> Now, we could simply apply standard statistical machine translation methods, treating verbalizations as "translations" of the semantics. These methods typically rely on one-parallel corpora consisting of text pairs, one in each "language" (but cf. Simard (1999); see Section 5). However, learning the kind of semantics-to-words mapping that we desire from one-parallel data alone is difficult even for humans. First, given the same semantic input, different authors may (and do) delete or insert information (see Figure 1); hence, direct comparison between a semantic text and a single verbalization may not provide enough information regarding their underlying correspondences. Second, a single verbalization certainly fails to convey the variety of potential linguistic realizations of the concept that an expressive lexical chooser would ideally have access to.</Paragraph>
    <Paragraph position="3"> The multiple-sequence idea. Our approach is motivated by an analogous situation that arises in computational biology. (Footnote 1: Throughout, fonts denote a mapping dictionary's two information types: semantics and realizations.)</Paragraph>
    <Paragraph position="4">  roughly correspond to the three arguments of show-from(a=0,b=0,a/b=0). (The phrases "as in the theorem statement" and "their product" correspond to chains of nodes, but are drawn as single nodes for clarity. Shading indicates argument-value matches (Section 3.1). All lattice figures omit punctuation nodes for clarity.)  (1) Given a and b as in the theorem statement, prove that a/b=0.</Paragraph>
    <Paragraph position="5"> (2) Suppose that a and b are equal to zero.</Paragraph>
    <Paragraph position="6"> Prove that their product is also zero.</Paragraph>
    <Paragraph position="7"> (3) Assume that a=0 and b=0.</Paragraph>
    <Paragraph position="8">  show-from(a=0,b=0,a/b=0).</Paragraph>
    <Paragraph position="9"> In brief, an important bioinformatics problem, which Gusfield (1997) refers to as "The Holy Grail", is to determine commonalities within a collection of biological sequences such as proteins or genes. Because of mutations within individual sequences, such as changes, insertions, or deletions, pair-wise comparison of sequences can fail to reveal which features are conserved across the entire group. Hence, biologists compare multiple sequences simultaneously to reveal hidden structure characteristic of the group as a whole.</Paragraph>
    <Paragraph position="10"> Our work applies multiple-sequence alignment techniques to the mapping-dictionary acquisition problem. The main idea is that using a multi-parallel corpus, one that supplies several alternative verbalizations for each semantic expression, can enhance both the accuracy and the expressiveness of the resulting dictionary. In particular, matching a semantic expression against a composite of the common structural features of a set of verbalizations ameliorates the effect of "mutations" within individual verbalizations. Furthermore, the existence of multiple verbalizations helps the system learn several ways to express concepts.</Paragraph>
    <Paragraph position="11"> To illustrate, consider a sample semantic expression from the mathematical theorem-proving domain. The expression show-from(a=0,b=0,a/b=0) means "assuming the two premises a = 0 and b = 0, show that the goal a / b = 0 holds". Figure 1 shows three human verbalizations of this expression. Even for so formal a domain as mathematics, the verbalizations vary considerably, and none directly matches the entire semantic input. For instance, it is not obvious without domain knowledge that "Given a and b as in the theorem statement" matches "a=0" and "b=0", nor that "their product" and "a/b" are equivalent. Moreover, sentence (3) omits the goal argument entirely. However, as Figure 2 shows, the combination of these verbalizations, as computed by our multiple-sequence alignment method, exhibits high structural similarity to the semantic input: the indicated "sausage" structures correspond closely to the three arguments of show-from.</Paragraph>
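    To make the variation among the Figure 1 verbalizations concrete, the sketch below compares them pair by pair at the word level. It uses the matching blocks found by Python's difflib.SequenceMatcher as a simple stand-in; it is not the multiple-sequence alignment method of this paper, and the names tokenize and shared_backbone are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# The three human verbalizations quoted in Figure 1.
VERBALIZATIONS = [
    "Given a and b as in the theorem statement, prove that a/b=0.",
    "Suppose that a and b are equal to zero. Prove that their product is also zero.",
    "Assume that a=0 and b=0.",
]


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding commas and periods."""
    return [word.strip(".,").lower() for word in sentence.split()]


def shared_backbone(first, second):
    """Tokens in the matching blocks difflib finds between two sentences, in order."""
    a, b = tokenize(first), tokenize(second)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    backbone = []
    for block in matcher.get_matching_blocks():
        # The final block is a zero-length sentinel and contributes nothing.
        backbone.extend(a[block.a:block.a + block.size])
    return backbone


if __name__ == "__main__":
    for i in range(len(VERBALIZATIONS)):
        for j in range(i + 1, len(VERBALIZATIONS)):
            shared = shared_backbone(VERBALIZATIONS[i], VERBALIZATIONS[j])
            print(f"({i + 1}) vs ({j + 1}): {shared}")
```

    Under this crude comparison, sentences (1) and (2) share noticeably more words than either shares with sentence (3), and no pair exposes all three arguments of show-from, which is consistent with the motivation for aligning all verbalizations jointly.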
  </Section>
</Paper>