<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-2100">
  <Title>A Three-level Revision Model for Improving Japanese Bad-styled Expressions</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
(1) The Analyzer must correctly capture the
</SectionTitle>
    <Paragraph position="0"> intermediate representation of the input text at a certain processing level.</Paragraph>
    <Paragraph position="1"> ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 665 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 (2) The Generator must be equipped with the complete set of prescriptive generation grammar.</Paragraph>
    <Paragraph position="2"> The first problem is serious, especially in revision support systems, and hard to overcome, because the input text may contain badly-styled expressions that prevent correct computational analysis. The second problem is also difficult to solve, because no perfect set of prescriptive generation grammar has been developed to date. Furthermore, even if such a set could be developed, single-pass generation has been pointed out to have many drawbacks in producing optimal texts [1]. In spite of these problems, the regeneration-based model is crucial for offsetting the weaknesses of the rewriting-based model.</Paragraph>
    <Paragraph position="3"> In the rewriting-based model, on the other hand, the original text is iteratively rewritten to improve each badly-styled expression that has been detected; only detected expressions are revised. This means that the rewriting-based model is a weak but practical model.</Paragraph>
    <Paragraph position="4"> Even if the set of revision rules is incomplete, the revision process will not destroy the original text entirely; the worst case is that the revision will be insufficient. Moreover, if the set of revision rules successfully covers numerous badly-styled expressions, the system can be expected to achieve good performance. Thus, if most of the considered style improvements can be handled with this model, we should combine it with the regeneration-based model.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Classification of Japanese Badly-styled Expressions
</SectionTitle>
    <Paragraph position="0"> It is obvious from the previous discussion that the effectiveness of the rewriting-based model depends on how many style improvements can be described as individual revision rules. Thus we must investigate what patterns of expressions should be considered to be badly-styled, especially in the technical communications field, and determine how many of them can be improved by revision rules.</Paragraph>
    <Paragraph position="2"> To investigate these issues, we have classified typical sentence-level Japanese badly-styled expressions. The classification mainly used examples from several books on technical writing [6],[7] as well as general writing [8]. Textual data from published manuals on computer systems was also investigated. The result is briefly outlined in Table.1. The viewpoints for classification are: (1) Does the item affect easy-understanding or correct-understanding? (2) In which linguistic structure does the item occur? (3) Is the item general or peculiar to technical writing? (4) Can the item be improved with an individual revision rule? The investigation showed that items peculiar to technical writing mainly affect easy-understanding, while general items principally affect correct-understanding. In addition, most of the items peculiar to technical writing can be improved by the application of discrete revision rules. Fig.2 exemplifies a revision rule for a typical badly-styled expression peculiar to technical writing; the expression directs the user's actions, but the actions are described in reverse order. We can identify most badly-styled expressions peculiar to technical writing by referring to particular partial syntactic structure patterns. As shown in Fig.2, such patterns allow bad styles to be detected and rewritten. Therefore, it is valid to adopt the rewriting-based model as the central component of our model.</Paragraph>
    <Paragraph position="3"> Type of the expression: Directing the user's operation</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 The Model
</SectionTitle>
      <Paragraph position="0"> The previous section has shown that the rewriting-based model is applicable to most of the style improvements peculiar to technical writing. Table.1, however, shows that there are a couple of items which are poorly handled by the model. They are: (a) excessively long complex sentences, and (b) ambiguous modification structures.</Paragraph>
      <Paragraph position="1"> These items cannot be detected and corrected by particular revision rules, because they do not have unique syntactic patterns. These errors cannot be characterized by particular words and/or particular linguistic attributes such as modality, tense, etc. Thus these badly-styled expressions cannot be easily corrected with particular structural conversion operations.</Paragraph>
      <Paragraph position="2"> We are proposing a three-level revision model which combines the rewriting-based and regeneration-based models. The first level is for dividing excessively long complex sentences and is based on the regeneration-based model at the morphological level.</Paragraph>
      <Paragraph position="3"> The second level is for improving several badly-styled expressions and is based on the rewriting-based model.</Paragraph>
      <Paragraph position="4"> The third level is for syntactic/semantic level regeneration, in which word ordering and punctuating to reduce the number of structural ambiguities are involved.</Paragraph>
      <Paragraph position="5"> Our model is a three-level sequential model.</Paragraph>
      <Paragraph position="6"> Here, the order of the components has the following computational significance: (1) As shown in 5.1, excessively long complex sentences can be identified and divided with morphological level information [9].</Paragraph>
      <Paragraph position="7"> (2) If long sentences are divided at the early stage of the total process, processing loads for the remaining operations are significantly reduced.</Paragraph>
      <Paragraph position="8"> (3) The style improving process should precede the syntactic/semantic level regeneration process, because the regeneration process should start with a well-formed syntactic/semantic structure.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Issues in Improving Style
</SectionTitle>
      <Paragraph position="0"> Most style improvements can be realized by sequential application of the revision rules. However, there are two major design issues. One is how to feed back the result of each rewriting operation to the initially produced analysis results. The other is the handling of structural ambiguity. That is, if the ambiguity is not eliminated, combinatorial explosion is inevitable in many aspects of the system. On the other hand, overall structural disambiguation is computationally expensive due to processes such as semantic analysis and context analysis. Moreover, uniform application of these processes violates one of the basic requirements of any writing aid; that is, it is unacceptable to incur high computational costs by processing good expressions that require no revision.</Paragraph>
      <Paragraph position="1"> We have three approaches to deal with these issues: (1) First, we detect all of the potential bad styles while accepting structural ambiguity. Each bad style is connected to an associated partial rewriting operation specified by its pattern. These operations are defined in a rule base, so that the detection process is the activation of these rules.</Paragraph>
      <Paragraph position="2"> (2) We then try to apply activated rules under an expectation-driven control strategy. That is, the system schedules the order of rule applications using a priority that reflects how important the rewriting operation is in improving the sentence. The scheduled application of a rule initiates the structural disambiguation of the applicable expression.</Paragraph>
      <Paragraph position="3"> (3) During the revision process, internal data, such as that generated by morphological or syntactic analyses and by the bad-style detection process, varies as a result of the partial rewriting operations. To avoid duplicative analysis and detection, the system tracks exactly what has been revised, and ensures the consistency of the internal data with respect to the revision. This scheme solves the feedback problem mentioned before.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 The Architecture of the Prototype
System
</SectionTitle>
      <Paragraph position="0"> Figure.3 shows the architecture of the prototype system REVISE-S based on the three-level revision model and the above design principles.</Paragraph>
      <Paragraph position="2"> The Morphological Analyzer divides the sentence string into word sequences. At this time, basic operational units (called 'Bunsetsu') are recognized. The sentence dividing algorithm in the Sentence Dividor utilizes the result of the morphological analysis, and is outlined in 5.1. The sentence dividing process is recursively invoked until each divided sentence satisfies some predefined condition that prevents further division.</Paragraph>
      <Paragraph position="3"> Next, the Syntactic Analyzer finds all possible binary relations between modifier Bunsetsu and modified Bunsetsu. The result is represented in a network called a Kakari-Uke network, which represents all possible syntactic structures intensionally.</Paragraph>
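      <Paragraph> As a rough illustration (not the REVISE-S implementation), such a network can be held as a map from each modifier Bunsetsu to its candidate modified Bunsetsus, and the structures it stands for can be enumerated only when needed. The sketch assumes the usual constraints on Japanese dependency structure, which this section does not state: every Bunsetsu except the last modifies exactly one later Bunsetsu, and relations do not cross. The names crosses and enumerate_structures are hypothetical.

```python
from itertools import product


def crosses(arc_a, arc_b):
    """True if two (modifier, head) arcs cross.

    Assumes every modifier index is smaller than its head index.
    """
    (i, hi), (j, hj) = sorted([arc_a, arc_b])
    return hj > hi > j > i


def enumerate_structures(network):
    """Expand a network into explicit dependency structures.

    network maps each modifier index to the list of its candidate heads.
    Each result maps every modifier to the single head chosen for it.
    """
    modifiers = sorted(network)
    structures = []
    for heads in product(*(network[m] for m in modifiers)):
        arcs = list(zip(modifiers, heads))
        has_crossing = any(
            crosses(a, b) for k, a in enumerate(arcs) for b in arcs[k + 1:]
        )
        if not has_crossing:
            structures.append(dict(arcs))
    return structures
```

For four Bunsetsus in which every later Bunsetsu is a candidate head, the six raw combinations reduce to five non-crossing structures, which shows why the compact network is cheaper to keep than the explicit set.</Paragraph>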
      <Paragraph position="4"> The Diagnoser, which utilizes the detection counterpart of the revision rule, finds all possible badly-styled expressions. The result semi-fires the conversion counterpart of the associated revision rule and constructs the agenda which lists the semi-fired rule instances. The Revision Process Controller sequences the successive execution of partial rewriting operations, and the Data Consistency Manager maintains consistency between the current sentence string and the internal data during the dynamic rewriting process.</Paragraph>
      <Paragraph position="5"> Finally, the regeneration process is invoked to generate a sentence with less reading ambiguity.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Generating Alternative Expressions
</SectionTitle>
    <Paragraph position="0"> Each component in our revision model generates alternative expressions for the user. This section gives a brief outline of the generation of alternative expressions in each level component.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Dividing Long Sentences
</SectionTitle>
      <Paragraph position="0"> Before dividing a long complex sentence, the component must first decide whether the sentence should be divided or not. If it should, the component must then identify the division point. The top level clause boundary indicates the division point.</Paragraph>
      <Paragraph position="1"> Finally the divided sentences must be generated. These processes can be conducted with morphological level information; that is, they do not require full syntactic parsing or any semantic interpretation.</Paragraph>
      <Paragraph position="2"> In the first step, the decision is made with a discriminant function that computes the weighted sum of the number of characters, the number of Bunsetsus, the number of predicates (verbs and adjectives), etc. Weighting coefficients and the threshold value for the decision were determined through experiments.</Paragraph>
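      <Paragraph> The decision step can be sketched as follows. This is only an illustration: the weights and the threshold below are placeholders, since the experimentally determined values are not given here, and only the three features named above are used. SentenceStats, division_score and should_divide are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SentenceStats:
    n_chars: int        # number of characters
    n_bunsetsu: int     # number of Bunsetsus
    n_predicates: int   # number of verbs and adjectives


# Placeholder values, NOT the coefficients determined in the experiments.
WEIGHTS = (0.5, 1.0, 4.0)
THRESHOLD = 60.0


def division_score(stats):
    """Weighted sum of the surface features of one sentence."""
    w_chars, w_bunsetsu, w_predicates = WEIGHTS
    return (
        w_chars * stats.n_chars
        + w_bunsetsu * stats.n_bunsetsu
        + w_predicates * stats.n_predicates
    )


def should_divide(stats):
    """A sentence is a division candidate when its score exceeds the threshold."""
    return division_score(stats) > THRESHOLD
```

Because every feature is available after morphological analysis, the decision needs no syntactic parse.</Paragraph>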
      <Paragraph position="3"> (a) Top Level Clause Boundary</Paragraph>
      <Paragraph position="4"> (b-1) The process advances while saving the result. Thus the result remains, even if error occurs.</Paragraph>
      <Paragraph position="5"> (b-2) The result remains, even if error occurs. Because the process advances while saving the result.</Paragraph>
      <Paragraph position="6"> Fig.4 An Example of the Sentence Division.</Paragraph>
      <Paragraph position="7"> The second step roughly analyzes the intra-sentence connective structure [9] and produces a shallow level intermediate representation as illustrated in Fig.4(a). The key to this process is the inter-predicate dependency relation analysis, which utilizes a set of dependency rules. These rules are based on the classification of predicate expressions (including modal, tense, and aspectual suffixes) in terms of their strength in forming connective structures. One significant point in the process is that the connective structure need not be fully disambiguated, because the main purpose of the analysis is identification of the division point; namely, there are cases where the division point can be uniquely identified even though the connective structure is ambiguous.</Paragraph>
      <Paragraph position="8"> The final step generates the divided sentence strings by applying generation rules to the intermediate representation. Fig.4(b) gives the generated alternatives ((b-1),(b-2)) for the example in Fig.4(a). In the process, the ordering of the divided sentences and the choice of the conjunctive expressions which provide cohesion between divided sentences are major considerations. In Fig.4(a), two top level clauses are connected with a causal relation. Thus associated conjunctive expressions (underlined in Fig.4(b)) are generated according to the alternatives in sentence ordering. To determine which is better, contextual processing is required; however, the determination is currently left to the user's selection.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Rewriting through Partial Structural
Conversions
</SectionTitle>
      <Paragraph position="0"> The main stream of the algorithm in the style improvement component is summarized in Fig.5. The rest of this subsection briefly introduces topics in each step (details are given in [5]).</Paragraph>
      <Paragraph position="1"> Detect all possible badly-styled expressions ; Construct the Agenda and the Revision Process Manager ; while (true) do Select an unmarked rule instance with the highest priority from the agenda ; if there is no such rule instance then break ; Test its presupposition ; if the presupposition holds then { Apply the associated partial rewriting operation ; if the operation succeeds then  The Diagnoser detects badly-styled expressions from the Kakari-Uke network, which contains all detectable syntactic structures. The process is the semi-firing of the partial rewriting rules, because each detected badly-styled expression is associated with a rewriting rule specified by the type of the bad-style pattern. 'Semi-firing' means that some of the focused rules are deactivated later in response to on-demand structural disambiguation or partial rewriting. From the computational viewpoint, the detection process should be regarded as a sort of feature extraction process. This allows the diagnosis process to be realized as an interpretation of a data-flow network; namely, the terminal node finally activated indicates which badly-styled expression has been detected, and the node's own data provides the justification.</Paragraph>
      <Paragraph position="2"> Constructing the Agenda and the Revision Process Manager The rules semi-fired through the process described in the previous section are instantiated based on their justifications. The instances are then placed on the agenda. These justifications specify the partial syntactic structures concerned with the detection patterns. Therefore, these are presuppositions to the application of the associated rewriting operations.</Paragraph>
      <Paragraph position="3"> A justification is represented as a conjunction of predicates on modification relations between two Bunsetsus (called the Kakari-Uke condition) and predicates on the Bunsetsu properties (called the Bunsetsu property condition). For instance, the conjunctive formula stated below is the justification of the detection pattern shown in Fig.2.</Paragraph>
      <Paragraph position="5"> Each literal of the formula is called a primitive condition. Literals must be treated as a sort of assumption, because all of them have the possibility of becoming unsatisfied due to structural disambiguation and/or partial rewriting operations.</Paragraph>
      <Paragraph position="6"> The Revision Process Manager for managing the presuppositions is constructed at the same time as the agenda. It holds a list of Bunsetsu property conditions and a list of the Kakari-Uke conditions. The data structure is suitable for managing all presuppositions systematically, because rule instances that share the same primitive condition are immediately found through a data slot which is indexed by the primitive conditions and contains pointers to the rule instances.</Paragraph>
      <Paragraph position="7"> Expectation-Driven Control and On-demand Disambiguation The priorities preassigned to the instances on the agenda sequence the successive application of partial rewriting operations within the revision process. That is, important rewriting operations are assigned high priority values and are scheduled for earlier application, even if their presuppositions are not confirmed prior to their application. To actually apply a scheduled rewriting operation, its presupposition is tested first. At this time, the Disambiguator, which involves the application of heuristic disambiguation rules and/or user interactions, is invoked, and the minimum range of structural ambiguities is resolved in expectation of applying the scheduled rewriting operation.</Paragraph>
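      <Paragraph> The control strategy can be pictured with a small priority-queue sketch. It is a simplification under stated assumptions: the two callbacks are hypothetical stand-ins for the presupposition test (including on-demand disambiguation) and for the partial rewriting operation, and the deactivation of rule instances during the loop is left out.

```python
import heapq


def run_revision(agenda, presupposition_holds, apply_rewrite):
    """Apply rule instances in descending priority order.

    agenda is a list of (priority, rule_name) pairs. A rule instance is
    tested only when it is scheduled, so no disambiguation work is spent
    on instances that are never reached.
    """
    # heapq is a min-heap, so priorities are negated to pop the highest first.
    heap = [(-priority, name) for priority, name in agenda]
    heapq.heapify(heap)
    applied = []
    while heap:
        _, name = heapq.heappop(heap)
        if presupposition_holds(name) and apply_rewrite(name):
            applied.append(name)
    return applied
```

The design point is that the expensive test sits inside the loop rather than in front of it, which mirrors the expectation-driven scheduling described above.</Paragraph>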
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Partial Rewriting
</SectionTitle>
      <Paragraph position="0"> If the presupposition is confirmed to be satisfied, the associated partial rewriting operation is applied. Before commencing any partial rewriting operation, a sub-network covering the scope of the rewriting is first extracted from the Kakari-Uke network according to the given scope name, such as 'simple sentence' or 'noun phrase'. Second, the extracted sub-network is converted into a set of dependency trees, wherein each element is an explicitly represented possible syntactic structure. Third, the partial rewriting rule, defined by the structural conversion with the lexical operations, is applied to the trees. Alternative expressions are generated from the rule application. A partial rewriting operation is completed by user selection or rejection of the generated expressions. The partial dependency trees giving the selected partial expression are then converted to the sub-network again, and restored in the Kakari-Uke network. This process is iterated until the agenda has no more applicable rule instances.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Maintenance of Data Consistency with Constraint Propagation
</SectionTitle>
      <Paragraph position="0"> Because the structural disambiguation and the partial rewriting operations affect internal data, the system must maintain the consistency of internal data whenever these operations are invoked. New information may be obtained as a result of the invoked operations, i.e., the acceptance/rejection of some modification or the change of some Bunsetsu structure. The new information can be considered as newly added constraints, so that data consistency can be maintained by propagating these constraints to the dependent internal data.</Paragraph>
      <Paragraph position="1"> For instance, if the Revision Process Manager is notified by the Difference Analyzer that a particular Bunsetsu no longer has a certain property as a result of a particular partial rewriting operation, the rule instances which share the condition are immediately deactivated. Another typical example of constraint propagation is created by structural disambiguation. If some Kakari-Uke relation is confirmed by the Disambiguator, exclusive Kakari-Uke relations are rejected at this time. This causes the deactivation of the rule instances which have these rejected Kakari-Uke relations as their primitive conditions.</Paragraph>
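      <Paragraph> A minimal sketch of this propagation is given below, assuming that primitive conditions are tuples and that two Kakari-Uke relations are exclusive when they share a modifier but name different modified Bunsetsus (relations that would cross the confirmed one are ignored here). The class layout and method names are hypothetical, not those of the prototype.

```python
from collections import defaultdict


class RevisionProcessManager:
    """Index from primitive conditions to the rule instances that presuppose them."""

    def __init__(self):
        self.index = defaultdict(set)  # primitive condition -> rule instance names
        self.active = set()            # rule instances still applicable

    def register(self, rule, conditions):
        """Record a rule instance together with its primitive conditions."""
        self.active.add(rule)
        for cond in conditions:
            self.index[cond].add(rule)

    def reject(self, cond):
        """A condition no longer holds: deactivate every rule instance sharing it."""
        self.active -= self.index.get(cond, set())

    def confirm_kakari_uke(self, modifier, head):
        """Confirm one relation and reject the registered relations it excludes."""
        for cond in list(self.index):
            kind, source, target = cond
            if kind == "kakari" and source == modifier and target != head:
                self.reject(cond)
```

Here a condition is written as ("kakari", modifier, modified) or ("prop", bunsetsu, property); a single lookup in the index finds every affected rule instance without rescanning the agenda.</Paragraph>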
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Word Ordering and Punctuating as
Regeneration
</SectionTitle>
      <Paragraph position="0"> Appropriate word ordering and punctuating help reduce the ambiguity in reading. Furthermore, they increase readability. In Japanese, however, word order is relatively free at the sentence constituent level and there are no strict grammatical restrictions on punctuation.</Paragraph>
      <Paragraph position="1"> Thus optimal word order and punctuation cannot be decided with syntactic information alone; reading and writing preferences must be considered.</Paragraph>
      <Paragraph position="2"> Our regeneration algorithm takes the syntactic structure (dependency tree structure) as its input and regenerates a new syntactic structure with fewer reading ambiguities. The algorithm employs the following heuristics based on the preferences in word ordering [10]: (1) Constituents which include the thematic marker (postposition 'ha') are put at the head of the sentence, and punctuation marks are put after them.</Paragraph>
      <Paragraph position="3"> (2) Punctuation marks are placed on clause boundaries.</Paragraph>
      <Paragraph position="4"> (3) Heavier constituents (containing more Bunsetsus) are made to precede lighter constituents on the same syntactic level.</Paragraph>
      <Paragraph position="5"> The algorithm first determines the constituent which includes the thematic marker. The constituent is positioned at the head of the sentence and a punctuation mark (Japanese comma) follows it. Next, a punctuation mark is added to the Bunsetsu which indicates the top level clause boundary. Then, at each syntactic level, constituents are sorted by their weight. Of course, the initially located constituents that include thematic markers are not moved by this constituent sorting operation. Finally, if the regenerated sentence string differs from the original, it is submitted for user confirmation.</Paragraph>
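      <Paragraph> The three heuristics can be sketched for a single syntactic level as follows. The data layout is an assumption made for illustration: each constituent is a list of Bunsetsu strings with two flags, the sentence-final predicate is passed separately, and a plain comma stands in for the Japanese comma. Constituent and regenerate are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Constituent:
    bunsetsus: list                 # Bunsetsu strings, in order
    thematic: bool = False          # contains the thematic marker
    clause_boundary: bool = False   # ends at the top level clause boundary


def regenerate(constituents, predicate):
    """Reorder the constituents of one level and insert commas."""
    themes = [c for c in constituents if c.thematic]
    others = [c for c in constituents if not c.thematic]
    # Heavier constituents first; the sort is stable, so ties keep their order.
    others.sort(key=lambda c: len(c.bunsetsus), reverse=True)
    out = []
    for c in themes + others:
        out.extend(c.bunsetsus)
        if c.thematic or c.clause_boundary:
            out[-1] += ","
    out.append(predicate)
    return out
```

Thematic constituents are excluded from the sort, which matches the remark that they are not moved once they have been placed at the head of the sentence.</Paragraph>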
      <Paragraph position="6"> Figure.6 gives an example. In this example, B2 is the Bunsetsu that contains the thematic marker and B5 indicates the top-level clause boundary. According to the regeneration algorithm, the segment (B1-B2) is placed at the head of the sentence and the Japanese comma is added. The segment (B4-B5) precedes segment (B3) because of its weight (two Bunsetsus) and another comma is added. B1 B2 B3 B4 B5 B6: This device will work manually, if the automatic-mode has been canceled. Fig.6 An Example of the Regeneration.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Evaluation
</SectionTitle>
    <Paragraph position="0"> An evaluation experiment to show the effectiveness of the prototype system and the validity of the proposed revision model was made by using 113 sentences taken from published manuals and constructed examples. The points for evaluation were how much the system contributes to easy-understanding and correct-understanding.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Readability
</SectionTitle>
      <Paragraph position="0"> There is no established way to evaluate understandability of texts. In this paper, we treated understandability as roughly equivalent to readability, because readability is encompassed by understandability.</Paragraph>
      <Paragraph position="1"> The readability measure used in the experiment was proposed by Tateishi et al. [11] for Japanese texts. The method computes the readability with the following formula, which utilizes surface level information. The term RS' indicates the readability; higher values indicate that the text is more readable. The coefficients were determined through statistical analyses to normalize the mean value to 50 and the standard deviation to 10.</Paragraph>
      <Paragraph position="2"> RS' = -0.12 × ls - 1.37 × la + 7.4 × lh - 23.18 × lc - 5.4 × lk - 4.67 × cp + 115.79. The terms are: ls: length of the sentences; la: mean length of alphabetic character runs; lh: mean length of Hiragana character runs; lc: mean length of Kanji character runs; lk: mean length of Katakana character runs; cp: mean number of commas per sentence. The system increased the RS' value from 42.5 to 49.0. This means that the readability was increased by the system. Sentence division and punctuation were the main contributors to this improvement.</Paragraph>
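      <Paragraph> For concreteness, the formula can be evaluated with the short sketch below. The coefficients are copied from the formula above; how each feature is measured on real text is not specified in this section, so the run-length helper and its use of regular-expression character classes are assumptions. The names readability_score and mean_run_length are hypothetical.

```python
import re

# Coefficients and intercept as given in the RS' formula.
COEFFS = {
    "ls": -0.12,
    "la": -1.37,
    "lh": 7.4,
    "lc": -23.18,
    "lk": -5.4,
    "cp": -4.67,
}
INTERCEPT = 115.79


def readability_score(features):
    """Return RS' for a dict holding the six features ls, la, lh, lc, lk, cp."""
    return INTERCEPT + sum(COEFFS[name] * features[name] for name in COEFFS)


def mean_run_length(text, char_class):
    """Mean length of maximal runs matching a regex character class.

    Returns 0.0 when the text contains no such run.
    """
    runs = re.findall(char_class + "+", text)
    return sum(len(run) for run in runs) / len(runs) if runs else 0.0
```

The only positive coefficient belongs to lh, so under this measure longer Hiragana runs raise the score while longer sentences, longer Kanji runs and more commas per sentence lower it.</Paragraph>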
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Structural Ambiguity
</SectionTitle>
      <Paragraph position="0"> It is also difficult to quantitatively estimate correct-understanding. In this paper, we estimate the level of correct-understanding from the structural ambiguity, because structurally ambiguous sentences/expressions obviously degrade correct-understanding. However, measuring systematic [12] or reading ambiguity with algorithms is still a difficult problem. Thus we use computational ambiguity to approximate systematic ambiguity. The Japanese dependency structure analyzer developed by Shirai [13] was used for this purpose.</Paragraph>
      <Paragraph position="1"> The original texts led to 18.4 analyses per sentence on average. After the texts were corrected by the prototype system, only 7.9 analyses were produced.</Paragraph>
      <Paragraph position="2"> This means that the system successfully reduced the amount of structural ambiguity. The major contributors to this improvement were sentence division and word ordering. Style improvements leading to drastic changes in the syntactic structure also contributed to this improvement.</Paragraph>
      <Paragraph position="3"> Incidentally, after revision, only 4.9 possible syntactic structures remained per sentence on average within the internal data of the system. This is noticeably fewer than the 7.9 analyses obtained by reanalyzing the revised text. Thus where the revised text is processed further (for instance, translation or summarization), the use of the internal data will help to reduce the effect of disambiguation on the remaining processes.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 Validity of the Model
</SectionTitle>
      <Paragraph position="0"> The validity of the proposed revision model was not directly evaluated in the experiments. However, the validity of the component order is evident, because structural ambiguity is continuously reduced with each processing step. If the style improvement component preceded the sentence division component, the structural conversion processes to improve the badly-styled expressions would handle numerous fruitless syntactic structures and generate too many inappropriate alternative expressions. Moreover, if the syntactic/semantic regeneration component preceded the style improvement component, each structural conversion rule would have to be constructed so as to preserve the word order and punctuation marks; this would affect the writability of the rules.</Paragraph>
    </Section>
  </Section>
</Paper>