<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0301">
  <Title>marization, and generation of natural language</Title>
  <Section position="4" start_page="0" end_page="4" type="metho">
    <SectionTitle>
3 The clause-like unit boundary and marker-identification algorithm
</SectionTitle>
    <Section position="1" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Determining the potential discourse
markers
</SectionTitle>
      <Paragraph position="0"> The corpus analysis discussed above provides information about the orthographic environment of cue phrases and the function that they have in texts.</Paragraph>
      <Paragraph position="1"> A cue phrase was assigned a sentential role when it had no function in structuring the discourse; a discourse role when it signalled a discourse relation between two textual units; or a pragmatic role when it signalled a relationship between a linguistic or nonlinguistic construct that pertained to the unit in which the cue phrase occurred and the beliefs, plans, intentions, and/or communicative goals of the speaker, hearer, or some character depicted in the text. In this case, the beliefs, plans, etc., did not have to be explicitly stated in the discourse; rather, it was the role of the cue phrase to help the reader infer them. (Footnote 2: This definition of pragmatic connective was first proposed by Fraser (1996). It should not be confused with the definition proposed by van Dijk (1979), who calls a connective "pragmatic" if it relates two speech acts and not two semantic units.) Different orthographic environments often correlate with different discourse functions. For example, if the cue phrase Besides occurs at the beginning of a sentence and is not followed by a comma, as in text (1), it usually signals a rhetorical relation that holds between the clause-like unit that contains it and the clause that comes after. However, if the same cue phrase occurs at the beginning of a sentence and is immediately followed by a comma, as in text (2), it usually signals a rhetorical relation that holds between the sentence to which Besides belongs and a textual unit that precedes it.</Paragraph>
      <Paragraph position="2">
(1) [Besides the lack of an adequate ethical dimension to the Governor's case,] [one can ask seriously whether our lead over the Russians in quality and quantity of nuclear weapons is so slight as to make the tests absolutely necessary.]
(2) [For pride's sake, I will not say that the coy and leering vade mecum of those verses insinuated itself into my soul.] [Besides, that particular message does no more than weakly echo the roar in all fresh blood.]
I have taken each of the cue phrases in the corpus and evaluated its potential contribution in determining the elementary textual units and discourse function for each orthographic environment that characterized its usage.</Paragraph>
      <Paragraph position="3"> I used the cue phrases and the orthographic environments that characterized the cue phrases that played a discourse role in most of the text fragments in the corpus in order to manually develop a set of regular expressions that can be used to recognize potential discourse markers in naturally occurring texts. If a cue phrase had different discourse functions in different orthographic environments, as was the case with Besides, I created one regular expression for each function. I ignored the cue phrases that played a sentential role in a majority of the text fragments and the cue phrases for which I was not able to infer straightforward rules that would allow a shallow algorithm to discriminate between their discourse and sentential usages. Because orthographic markers, such as commas, periods, dashes, paragraph breaks, etc., play an important role in the surface-based approach to discourse processing that I present here, I included them in the list of potential discourse markers as well.</Paragraph>
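      <Paragraph> As an illustration of the idea of one regular expression per orthographic environment, the following minimal Python sketch separates the two usages of Besides from texts (1) and (2). The patterns, the function name classify_besides, and the two labels are hypothetical; the actual expressions developed from the corpus are not given in this section.

```python
import re

# Hypothetical patterns; they are not the expressions derived from the corpus.
# Environment of text (1): sentence-initial "Besides" NOT followed by a comma.
BESIDES_NO_COMMA = re.compile(r"^\s*Besides(?!\s*,)\b")
# Environment of text (2): sentence-initial "Besides" immediately followed by a comma.
BESIDES_COMMA = re.compile(r"^\s*Besides\s*,")


def classify_besides(sentence):
    """Label the orthographic environment of a sentence-initial 'Besides'.

    Returns None when the sentence does not start with 'Besides'.
    """
    if BESIDES_COMMA.search(sentence):
        # Relation holds with a textual unit that precedes the sentence.
        return "relates-to-preceding-unit"
    if BESIDES_NO_COMMA.search(sentence):
        # Relation holds with the clause that comes after the unit.
        return "relates-to-following-clause"
    return None
```

Keeping one pattern per function, rather than one pattern per cue phrase, lets each match be tied directly to a single discourse usage.</Paragraph>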
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 From the corpus analysis to the elementary textual units of a text
</SectionTitle>
      <Paragraph position="0"> During the corpus analysis, I generated a set of eleven actions that constitutes the foundation of an algorithm to determine automatically the elementary units of a text. The algorithm processes a text given as input in a left-to-right fashion and "executes" the actions that are associated with each potential discourse marker and each punctuation mark that occurs in the text. Because the algorithm does not use any traditional parsing or tagging techniques, I call it a "shallow analyzer".</Paragraph>
      <Paragraph position="1"> The names and the intended semantics of the actions used by the shallow analyzer are:
* Action NOTHING instructs the shallow analyzer to treat the cue phrase under consideration as a simple word. That is, no textual unit boundary is normally set when a cue phrase associated with such an action is processed.</Paragraph>
      <Paragraph position="2"> For example, the action associated with the cue phrase accordingly is NOTHING.</Paragraph>
      <Paragraph position="3"> * Action NORMAL instructs the analyzer to insert a textual boundary immediately before the occurrence of the marker. Textual boundaries correspond to elementary unit breaks.</Paragraph>
      <Paragraph position="4"> * Action COMMA instructs the analyzer to insert a textual boundary immediately after the occurrence of the first comma in the input stream. If the first comma is followed by an and or an or, the textual boundary is set after the occurrence of the next comma. If no comma is found before the end of the sentence, a textual boundary is created at the end of the sentence.</Paragraph>
      <Paragraph position="5"> * Action NORMAL_THEN_COMMA instructs the analyzer to insert a textual boundary immediately before the occurrence of the marker and another textual boundary immediately after the occurrence of the first comma in the input stream. As in the case of the action COMMA, if the first comma is followed by an and or an or, the textual boundary is set after the occurrence of the next comma. If no comma is found before the end of the sentence, a textual boundary is created at the end of the sentence.</Paragraph>
      <Paragraph position="6"> * Action END instructs the analyzer to insert a textual boundary immediately after the cue phrase.</Paragraph>
      <Paragraph position="7"> * Action MATCH_PAREN instructs the analyzer to insert textual boundaries both before the occurrence of the open parenthesis that is normally characterized by such an action, and after the closed parenthesis that follows it.</Paragraph>
      <Paragraph position="8"> * Action COMMA_PAREN instructs the analyzer to insert textual boundaries both before the cue phrase and after the occurrence of the next comma in the input stream.</Paragraph>
      <Paragraph position="9"> * Action MATCH_DASH instructs the analyzer to insert a textual boundary before the occurrence of the cue phrase. The cue phrase is usually a dash. The action also instructs the analyzer to insert a textual boundary after the next dash in the text. If such a dash does not exist, the textual boundary is inserted at the end of the sentence.
The preceding three actions, MATCH_PAREN, COMMA_PAREN, and MATCH_DASH, are usually used for determining the boundaries of parenthetical units. These units, such as those marked with braces in (3) and (4) below, are related only to the larger units that they belong to or to the units that immediately precede them.</Paragraph>
      <Paragraph position="10">
(3) [With its distant orbit {-- 50 percent farther from the sun than the Earth --} and slim atmospheric blanket,] [Mars experiences frigid weather conditions.]
(4) [Yet, even on the summer pole, {where the sun remains in the sky all day long,} temperatures never warm enough to melt frozen water.]
Because the deletion of parenthetical units does not affect the readability of a text, in the algorithm that I present here I do not assign them an elementary unit status. Instead, I will only determine the boundaries of parenthetical units and record, for each elementary unit, the set of parenthetical units that belong to it.</Paragraph>
      <Paragraph position="11"> * Actions SET_AND (SET_OR) instruct the analyzer to store the information that the input stream contains the lexeme and (or).</Paragraph>
      <Paragraph position="12"> * Action DUAL instructs the analyzer to insert a textual boundary immediately before the cue phrase under consideration if there is no other cue phrase that immediately precedes it. If there exists such a cue phrase, the analyzer will behave as in the case of the action COMMA.</Paragraph>
      <Paragraph position="13"> The action DUAL is usually associated with cue phrases that can introduce some expectations about the discourse (Cristea and Webber, 1997). For example, the cue phrase although in text (5) signals a rhetorical relation of CONCESSION between the clause to which it belongs and the previous clause. However, in text (6), where although is preceded by an and, it signals a rhetorical relation of CONCESSION between the clause to which it belongs and the next clause in the text.</Paragraph>
      <Paragraph position="14">
(5) [I went to the theater] [although I had a terrible headache.]
(6) [The trip was fun,] [and although we were badly bitten by blackflies,] [I do not regret it.]</Paragraph>
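      <Paragraph> The comma rule shared by the actions COMMA and NORMAL_THEN_COMMA can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentence is taken to be already split into word and punctuation tokens, and the names ACTIONS and comma_boundary are hypothetical rather than part of the analyzer described here.

```python
# The eleven action names listed in this section.
ACTIONS = (
    "NOTHING", "NORMAL", "COMMA", "NORMAL_THEN_COMMA", "END",
    "MATCH_PAREN", "COMMA_PAREN", "MATCH_DASH", "SET_AND", "SET_OR", "DUAL",
)


def comma_boundary(tokens, start=0):
    """Return the index of the token after which a COMMA boundary is set.

    Rule from the text: break after the first comma; if that comma is
    followed by "and" or "or", break after the next comma instead; if no
    suitable comma is found, break at the end of the sentence.
    Assumption: `tokens` is a list of word and punctuation tokens.
    """
    skipped_first = False
    for i in range(start, len(tokens)):
        if tokens[i] != ",":
            continue
        # Word that follows the comma ("" when the comma is the last token).
        following = tokens[i + 1].lower() if i + 1 != len(tokens) else ""
        if not skipped_first and following in ("and", "or"):
            skipped_first = True      # skip this comma, wait for the next one
            continue
        return i
    return len(tokens) - 1            # no comma: boundary at the sentence end


# Text (6): scanning starts at "although" (token 6), which acts like COMMA here.
tokens = ("The trip was fun , and although we were badly bitten "
          "by blackflies , I do not regret it .").split()
cut_after = comma_boundary(tokens, start=6)
```

In this sketch the boundary for text (6) falls after the comma that follows blackflies, which agrees with the bracketing shown in (6).</Paragraph>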
    </Section>
    <Section position="3" start_page="2" end_page="4" type="sub_section">
      <SectionTitle>
3.3 The clause-like unit and discourse-marker identification algorithm
</SectionTitle>
      <Paragraph position="0"> On the basis of the information derived from the corpus, I have designed an algorithm that identifies elementary textual unit boundaries in sentences and cue phrases that have a discourse function. Figure 1 shows only its skeleton and focuses on the variables and steps that are used in order to determine the elementary units. Due to space constraints, the steps that assert the discourse function of a marker are not shown; however, these steps are mentioned in the discussion of the algorithm that is given below.</Paragraph>
      <Paragraph position="1"> Marcu (1997b) provides a full description of the algorithm.
The algorithm takes as input a sentence S and the array markers[n] of cue phrases (potential discourse markers) that occur in that sentence; the array is produced by a trivial algorithm that recognizes regular expressions (see section 3.1). Each element in markers[n] is characterized by a feature structure with the following entries:
* the action associated with the cue phrase;
* the position in the elementary unit of the cue phrase;
* a flag has_discourse_function that is initially set to "no".</Paragraph>
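      <Paragraph> One possible way to represent such a feature structure is sketched below in Python. The class name Marker and the extra field text are hypothetical conveniences; only the action, the position, and the discourse-function flag come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    """One element of the markers array (illustrative layout only)."""
    text: str                             # surface form; added here for convenience
    action: str                           # an action name from section 3.2
    position: int                         # where the cue phrase occurs
    has_discourse_function: bool = False  # the flag starts out as "no"


# A hypothetical entry for the cue phrase "although".
although = Marker(text="although", action="DUAL", position=7)
```
</Paragraph>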
      <Paragraph position="2"> The clause-like unit and discourse-marker identification algorithm traverses the array of cue phrases left-to-right (see the loop between lines 2 and 20) and identifies the elementary textual units in the sentence on the basis of the types of the markers that it processes. Crucial to the algorithm is the variable "status", which records the set of markers that have been processed earlier and that may still influence the identification of clause and parenthetical unit boundaries.</Paragraph>
      <Paragraph position="3"> The clause-like unit identification algorithm has two main parts. Lines 10-20 concern actions that are executed when the "status" variable is NIL. These actions can insert textual unit boundaries or modify the value of the variable "status", thus influencing the processing of further markers. Lines 3-9 concern actions that are executed when the "status" variable is not NIL. We now discuss each of these actions in turn.</Paragraph>
      <Paragraph position="4"> Lines 3-4 of the algorithm treat parenthetical information. Once an open parenthesis, a dash, or a discourse marker whose associated action is COMMA_PAREN has been identified, the algorithm ignores all other potential discourse markers until the element that closes the parenthetical unit is processed. Hence, the algorithm searches for the first closed parenthesis, dash, or comma, ignoring all other markers on the way. Obviously, this implementation does not assign a discourse usage to discourse markers that are used within a parenthetical span. However, this choice is consistent with the decision, discussed in section 3.2, to assign parenthetical information no elementary textual unit status. Because of this, the text marked with braces in text (7), for example, is treated as a single parenthetical unit, which is subordinated to "Yet, even on the summer pole, temperatures never warm enough to melt frozen water". In dealing with parenthetical units, the algorithm avoids setting boundaries in cases in which the first comma that comes after a COMMA_PAREN marker is immediately followed by an or or an and. As example (7) shows, taking the first comma as the boundary of the parenthetical unit would be inappropriate.</Paragraph>
      <Paragraph position="5">
(7) [Yet, even on the summer pole, {where the sun remains in the sky all day long, and where winds are not as strong as at the Equator,} temperatures never warm enough to melt frozen water.]
Obviously, one can easily find counterexamples to this rule (and to other rules that are employed by the algorithm). For example, the clause-like unit and discourse-marker identification algorithm will produce erroneous results when it processes the sentence shown in (8) below.</Paragraph>
      <Paragraph position="6">
(8) [I gave John a boat,] [which he liked, and a duck,] [which he didn't.]
Nevertheless, the evaluation results discussed in section 4 show that the algorithm produces correct results in the majority of the cases.</Paragraph>
      <Paragraph position="7"> If the "status" variable contains the action COMMA, the occurrence of the first comma that is not adjacent to an and or or marker determines the identification of a new elementary unit (see lines 5-7 in figure 1).</Paragraph>
      <Paragraph position="8"> Figure 1. Skeleton of the clause-like unit and discourse-marker identification algorithm.
Input: A sentence S.
       The array of n potential discourse markers markers[n] that occur in S.
Output: The clause-like units, parenthetical units, and discourse markers of S.
1. status := NIL; ... ;
2. for i from 1 to n
3.   if MATCH_PAREN ∈ status ∨ MATCH_DASH ∈ status ∨ COMMA_PAREN ∈ status
4.     (deal with parenthetical information)
5.   if COMMA ∈ status ∧ markerTextEqual(i, ",") ∧
6.      NextAdjacentMarkerIsNotAnd() ∧ NextAdjacentMarkerIsNotOr()
7.     (insert textual boundary after comma)
8.   if (SET_AND ∈ status ∨ SET_OR ∈ status) ∧ markerAdjacent(i - 1, i)
9.     (deal with adjacent markers)
10.  switch(getActionType(i)) {
11.    case DUAL: (deal with DUAL markers)
12.    case NORMAL: (insert textual boundary before marker)
13.    case COMMA: status := status ∪ {COMMA};
14.    case NORMAL_THEN_COMMA: (insert textual boundary before marker)
...</Paragraph>
      <Paragraph position="9"> Usually, the discourse role of the cue phrases and and or is ignored because the surface-form algorithm that we propose is unable to distinguish accurately enough between their discourse and sentential usages. However, lines 8-9 of the algorithm concern cases in which their discourse function can be unambiguously determined. For example, in our corpus, whenever and and or immediately preceded the occurrence of other discourse markers (function markerAdjacent(i - 1, i) returns true), they had a discourse function. In sentence (9), for instance, and acts as an indicator of a JOINT relation between the first two clauses of the text.</Paragraph>
      <Paragraph position="10">
(9) [Although the weather on Mars is cold] [and although it is very unlikely that water exists,] [scientists have not dismissed yet the possibility of life on the Red Planet.]
If a discourse marker is found that immediately follows the occurrence of an and (or an or) and if the left boundary of the elementary unit under consideration is found to the left of the and (or the or), a new elementary unit is identified whose right boundary is just before the and (or the or). In such a case the and (or the or) is considered to have a discourse function as well, so the flag has_discourse_function is set to "yes".</Paragraph>
      <Paragraph position="11"> If any of the complex conditions in lines 3, 5, or 8 in figure 1 is satisfied, the algorithm not only inserts textual boundaries as discussed above, but it also resets the "status" variable to NIL.</Paragraph>
      <Paragraph position="12"> Lines 10-19 of the algorithm concern the cases in which the "status" variable is NIL. If the type of the marker is DUAL, the determination of the textual unit boundaries depends on whether the marker under scrutiny is adjacent to the marker that precedes it. If it is, the "status" variable is set such that the algorithm will act as in the case of a marker of type COMMA. If the marker under scrutiny is not adjacent to the marker that immediately precedes it, a textual unit boundary is identified. This implementation will, for example, set the variable "status" to COMMA when processing the marker although in example (10), but will only insert a textual unit boundary when processing the same marker in example (11).</Paragraph>
      <Paragraph position="13"> The final textual unit boundaries that are assigned by the algorithm are shown using square brackets.</Paragraph>
      <Paragraph position="14">
(10) [John is a nice guy,] [but although his colleagues do not pick on him,] [they do not invite him to go camping with them.]
(11) [John is a nice guy,] [although he made a couple of nasty remarks last night.]
Line 12 of the algorithm concerns the most frequent marker type. The type NORMAL determines the identification of a new clause-like unit boundary just before the marker under scrutiny. Line 13 concerns the case in which the type of the marker is COMMA. If the marker under scrutiny is adjacent to the previous one, the previous marker is considered to have a discourse function as well.</Paragraph>
      <Paragraph position="15"> In either case, the "status" variable is updated such that a textual unit boundary will be identified at the first occurrence of a comma. When a marker of type NORMAL_THEN_COMMA is processed, the algorithm identifies a new clause-like unit as in the case of a marker of type NORMAL, and then updates the variable "status" such that a textual unit boundary will be identified at the first occurrence of a comma. When a marker of type NOTHING is processed, the only action that might be executed is that of assigning that marker a discourse usage.</Paragraph>
      <Paragraph position="16"> Lines 7-8 of the algorithm concern the treatment of markers that introduce expectations with respect to the occurrence of parenthetical units: the effect of processing such markers is that of updating the "status" variable according to the type of the action associated with the marker under scrutiny. The same effect is observed in the cases in which the marker under scrutiny is an and or an or.</Paragraph>
      <Paragraph position="17"> After processing all the markers, it is possible that some text will remain unaccounted for: this text usually occurs between the last marker and the end of the sentence. The procedure "finishUpParentheticalsAndClauses()" in line 21 of figure 1 flushes this text into the last clause-like unit that is under consideration.</Paragraph>
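      <Paragraph> The interplay between the marker actions and the "status" variable can be illustrated with a much reduced Python sketch. It is not the algorithm of figure 1: parenthetical actions and the discourse-function flag are left out, and several choices are assumptions made only for this illustration, namely that commas appear in the marker list with action NOTHING, that only a word marker directly to the left counts as adjacent, that a pending and/or expectation expires at the next marker, and that but carries the action NORMAL. The name segment and the (token index, action) pair format are likewise hypothetical.

```python
def segment(tokens, markers):
    """Split `tokens` into clause-like units (simplified illustration).

    `markers` holds (token_index, action) pairs in left-to-right order.
    """
    cuts = set()      # a cut c means that a unit ends just before token c
    status = set()    # pending expectations, in the spirit of "status"
    for k, (pos, action) in enumerate(markers):
        # Assumption: only a word marker directly to the left is "adjacent".
        adjacent = (k > 0 and markers[k - 1][0] == pos - 1
                    and tokens[pos - 1].isalpha())
        following = tokens[pos + 1].lower() if len(tokens) > pos + 1 else ""
        # A pending COMMA expectation is met by a comma not followed by and/or.
        if "COMMA" in status and tokens[pos] == "," and following not in ("and", "or"):
            cuts.add(pos + 1)
            status = set()
        # A pending and/or followed directly by another marker starts a unit.
        if status.intersection({"SET_AND", "SET_OR"}):
            if adjacent:
                cuts.add(pos - 1)
                status = set()
            else:
                status -= {"SET_AND", "SET_OR"}   # simplification: it expires
        # Action of the current marker.
        if action == "NORMAL":
            cuts.add(pos)
        elif action == "COMMA":
            status.add("COMMA")
        elif action == "NORMAL_THEN_COMMA":
            cuts.add(pos)
            status.add("COMMA")
        elif action == "DUAL":
            if adjacent:
                status.add("COMMA")   # behave as in the case of COMMA
            else:
                cuts.add(pos)
        elif action in ("SET_AND", "SET_OR"):
            status.add(action)
    # Remaining text is flushed into the last unit.
    edges = [0] + sorted(c for c in cuts if len(tokens) > c > 0) + [len(tokens)]
    return [" ".join(tokens[a:b]) for a, b in zip(edges, edges[1:])]


# Example (10), with hand-assigned markers.
tokens_10 = ("John is a nice guy , but although his colleagues do not pick "
             "on him , they do not invite him to go camping with them .").split()
markers_10 = [(5, "NOTHING"), (6, "NORMAL"), (7, "DUAL"), (15, "NOTHING")]
units_10 = segment(tokens_10, markers_10)
```

Under these assumptions the sketch splits example (10) into three units and example (11) into two, matching the square brackets shown in those examples.</Paragraph>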
    </Section>
  </Section>
</Paper>