<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1231">
  <Title>Shallow Post Morphological Processing with KURD</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
* Style checking
</SectionTitle>
    <Paragraph position="0"> Highly complex constructions or heavy phrases can disturb the reading and understanding process. To avoid this, style checkers can recognize such patterns so that the author can readjust his text for better communication.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
* Shallow parsing
</SectionTitle>
    <Paragraph position="0"> Shallow parsing can help to simplify the data before full parsing is undertaken. It recognizes syntactic phrases, mostly on the nominal level.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
* Segmentation
</SectionTitle>
    <Paragraph position="0"> The morphological analysis deals with words which are presented in texts. High level processing deals with units between the word level and the text level, mostly with sentences. Thus, sentence segmentation is a typical shallow process, but other subunits could be equally interesting.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
KURD
</SectionTitle>
    <Paragraph position="0"> Michael Carl and Antje Schmidt-Wigger (1998) Shallow Post Morphological Processing with KURD. In D.M.W. Powers (ed.) NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language Learning. ACL, pp 257-265.</Paragraph>
    <Paragraph position="1"> The basic idea of the presented formalism is the following: in a set of rules, patterns are defined which are mapped onto the morphologically analyzed input strings. If the mapping is successful, modifications of the analysis are undertaken according to the specifications in the rule. To ensure expressiveness and ease of formulation of the rules, we have introduced some elements of unification based systems into the formalism.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Morphological Analysis
</SectionTitle>
    <Paragraph position="0"> Morphological analysis 2 is the process of separating grammatical information and (a) stem(s) from the surface form of an input word. Lemmatization generates from an input string a basic word form that does not contain inflectional information. A lemma together with the grammatical information is thus equivalent to the surface form of the word.</Paragraph>
    <Paragraph position="1"> In addition, lemma decomposition can be carried out by the morphological processor. Recognition of composition and derivation yields knowledge about the internal structure of the word.</Paragraph>
    <Paragraph position="2"> Morphological information and the value of the lemma are represented in the form of sets of attribute/operator/value triples (A op V) which we will refer to as feature bundles (FBs). Beside morphological analysis and lemmatization, sentence segmentation is performed by the morphological processor. The output is thus a sentence descriptor SD that contains multiple Word Descriptors WDs. The distinction between WDs and deeper embedded FBs is useful later in this paper due to the important functional difference. The formal definition of a SD is as follows: Sentence Descriptor SD:</Paragraph>
    <Paragraph position="4"> A WD may consist of two types of disjunctive representation (local or complex disjunction) on a number of different levels. 2 In this section and in the paper we refer to MPRO as the analysis tool (Maas, 1996). MPRO is very powerful: it yields more than 95% correct morphological analysis and lemmas of arbitrary German and English text.</Paragraph>
    <Paragraph position="5"> Local disjunction is an alternation of atomic values; complex disjunction is an alternation of complex features (FBs).</Paragraph>
    <Paragraph position="6"> Which of the disjunctive representations is chosen depends on the one hand on the expressive requirements (i.e. no feature dependencies can be expressed with local disjunctions) and on the other hand on the linguistic assumptions of the morphological analysis.</Paragraph>
    <Paragraph position="8"> Both types of disjunction are shown in the representation of the German article &amp;quot;der&amp;quot;. A first level of disjunction occurs on the level of the word descriptors. Different analyses (as a determiner (lu=d_art) and as a relative pronoun (lu=d_rel)) are separated by a semicolon ';'. The second level of disjunction occurs in the feature &amp;quot;agr&amp;quot;, which has a complex disjunction as its value. The feature &amp;quot;case&amp;quot; in the first complex disjunctor has a local disjunction (g;d) as its value. The word &amp;quot;der&amp;quot; has seven different interpretations which are melted together here by means of the two different types of disjunction.</Paragraph>
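The two disjunction levels can be pictured in a small data sketch. The following is a hypothetical Python rendering, not MPRO's actual format: a WD is a list of alternative interpretations (complex disjunction on the word level), the value of "agr" is a list of FBs (complex disjunction inside a feature), and a local disjunction is a set of atomic values. The concrete agreement values are invented for illustration and do not spell out all seven readings of "der".

```python
# Hypothetical rendering of the WD for "der"; feature names follow the
# text (lu, c, sc, agr), the concrete agreement values are invented.
der = [
    # first interpretation: determiner (lu=d_art)
    {"lu": "d_art", "c": "w", "sc": "art",
     # complex disjunction: a list of alternative agreement FBs
     "agr": [{"gen": "m", "case": {"n"}, "nb": "sg"},
             # local disjunction: the set {"g", "d"} renders case g;d
             {"gen": "f", "case": {"g", "d"}, "nb": "sg"}]},
    # second interpretation: relative pronoun (lu=d_rel)
    {"lu": "d_rel", "c": "w", "sc": "rel",
     "agr": [{"gen": "m", "case": {"n"}, "nb": "sg"}]},
]
```

The word-level alternatives correspond to the semicolon-separated analyses, while the nested list and set reproduce the two inner disjunction types.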
    <Paragraph position="9"> Note that we do not need variable binding between different attributes of the same FB 3 because we presume that each attribute in a (morphological) FB expresses a different piece of information (it thus has a different type).</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Formalism
</SectionTitle>
    <Paragraph position="0"> The formalism we shall describe in this section applies a set of rules in a predefined order to sentence descriptors SD, thereby modifying selected word descriptors WD. The modified SDs are returned as a result. For each SD, each rule is repeatedly applied, starting from the first WD. 3 In many theories and formalisms (e.g. HPSG, CAT2 (Sharp and Streiter, 1995)) different attributes in a FB can be forced to always have the same values by assigning the same variable as their values (they share the same structure). However, these approaches allow structure sharing and variable binding only among equal types.</Paragraph>
    <Paragraph position="1"> Carl and Schmidt-Wigger 258 KURD</Paragraph>
    <Paragraph position="4"> A rule essentially consists of a description part and an action part. The description consists of a number of conditions that must match successive WDs. While matching the description part of a rule onto a SD, WDs are marked in order to be modified in the action part. A rule fails if a condition does not match. In this case the action part of the rule is not activated. The action part is activated if all conditions are satisfied. Actions may modify (Kill, Unify, Replace or Delete) a WD or single features of it.</Paragraph>
    <Paragraph position="5"> A condition of a rule can either match an interval or it can match a count of the WD. In the former case, one set of tests must be true. In the latter case two sets of tests must be true, one for an external interval and one for a count of an internal interval.</Paragraph>
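The description/action split can be sketched as follows. This is a hypothetical Python rendering of the assumed control flow (the real system is implemented in C): each condition is a function over successive WDs that marks the WDs it wants modified, and the action runs only if every condition matched.

```python
# Sketch of the assumed rule application loop: a rule is a description
# part (a list of condition functions) plus an action part applied to
# the WDs marked during matching.
def apply_rule(conditions, action, sd):
    for start in range(len(sd)):            # rules are tried from the first WD on
        marked, i = [], start
        for cond in conditions:
            consumed = cond(sd, i, marked)  # number of WDs matched, or None
            if consumed is None:
                break                       # one condition failed: no action
            i += consumed
        else:
            action(marked)                  # all conditions matched: run the acts

# toy demo over plain tokens: mark a "P" that is directly followed by "."
def is_p(sd, i, marked):
    if i < len(sd) and sd[i] == "P":
        marked.append(i)                    # mark this WD for the action part
        return 1

def is_dot(sd, i, marked):
    if i < len(sd) and sd[i] == ".":
        return 1

hits = []
apply_rule([is_p, is_dot], hits.extend, ["x", "P", "."])
```

In the toy demo only position 1 is marked, since only there do both conditions hold on consecutive tokens.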
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Some examples
</SectionTitle>
      <Paragraph position="0"> German verbs have detachable prefixes that can be homonyms to prepositions. Morphological analysis thus generates two interpretations for such a string. However, the syntactic position of prefixes and prepositions within a sentence is different. While prepositions occur as the head of prepositional phrases and are thus always followed by a nominal phrase or a pronoun, detached prefixes occur at the end of the matrix sentence, thus followed by a punctuation mark or a coordinator. The following rule disambiguates a prefix at the end of a sentence, i.e. the interpretation as a preposition ({c=w,sc=p}) shall be deleted from the WD.</Paragraph>
      <Paragraph position="2"> Rule (1) consists of two conditions (separated by a comma) in the description part and one act in the action part. It illustrates the capacity of the formalism to express disjunction and conjunction at the same time. The first condition matches a preposition ({c=w,sc=p}) and a prefix ({c=vpref}). That is, the matched WD is expected to be ambiguous with respect to its category. Feature cooccurrences are required in the first test, where both features c=w and sc=p must occur in conjunction in (at least) one interpretation of the matched WD. The existence quantifier e preceding the FB means that there is an appropriate interpretation in the WD, i.e.</Paragraph>
      <Paragraph position="3"> there is a non-empty intersection of FB and WD. The second condition consists of one test only. The FB matches an end-of-sentence item ({sc=punct;comma}). Here, the all quantifier a requires the WD to be a subset of the FB, i.e.</Paragraph>
      <Paragraph position="4"> there is no interpretation in the WD which is not an end-of-sentence item.</Paragraph>
      <Paragraph position="5"> A WD for which the first condition is true is marked by the marker &amp;quot;A&amp;quot;. The rule applies if the second condition is true for the following WD.</Paragraph>
      <Paragraph position="6"> The action part has one consequence that consists of one act. The WD which has been marked in the description part is unified with the FB ({c=vpref}) of the act. This results in the unambiguous identification of the prefix because the prepositional analysis is ruled out.</Paragraph>
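Rule (1)'s mechanics can be sketched under a simplified model (a hypothetical Python rendering; the feature values and the word are assumed for illustration): a WD is a list of interpretation dicts, a test FB maps each feature to the set of values it admits, and the unify act is approximated by filtering out incompatible interpretations.

```python
# Simplified model of the two quantifiers used by rule (1).
def e_test(wd, fb):   # existence quantifier: some interpretation fits the FB
    return any(all(ip.get(f) in vals for f, vals in fb.items()) for ip in wd)

def a_test(wd, fb):   # all quantifier: the FB subsumes every interpretation
    return all(all(ip.get(f) in vals for f, vals in fb.items()) for ip in wd)

# A word ambiguous between preposition and verb prefix, followed by an
# unambiguous end-of-sentence item.
word = [{"c": "w", "sc": "p"}, {"c": "vpref"}]
comma = [{"sc": "comma"}]
if e_test(word, {"c": {"w"}, "sc": {"p"}}) and a_test(comma, {"sc": {"punct", "comma"}}):
    # the u act, approximated: keep only interpretations compatible with {c=vpref}
    word = [ip for ip in word if ip.get("c") == "vpref"]
```

After the act, only the prefix interpretation survives, mirroring the disambiguation described above.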
      <Paragraph position="7"> An example of a rule that disambiguates the agreement of a (German) noun phrase is given below (2). The rule can be paraphrased as follows: for all sequences of WDs that have a unifiable agreement feature ({agr=_AGR}) and that consist of an article ({c=w,sc=art}) followed by zero or more adjectives (*{c=adj}) followed by one noun ({c=noun}): unify the intersection of the agreement ({agr=_AGR}) into the respective features of the marked word descriptors.</Paragraph>
      <Paragraph position="8">  (2) Disambiguate_Noun_Phrase = Ae {c=w, sc=art, agr=_AGR},</Paragraph>
      <Paragraph position="9"> *Aa {c=adj, agr=_AGR},</Paragraph>
      <Paragraph position="11"> The description part of rule (2) has three conditions. Each condition matches an interval of the WDs. The second condition can possibly be empty since it has the Kleene star scope ('*').</Paragraph>
      <Paragraph position="12"> All WDs for which the test is true are marked by the marker &amp;quot;A&amp;quot; and thus undergo the same act in the action part.</Paragraph>
      <Paragraph position="13"> The formalism allows the use of variables (e.g.</Paragraph>
      <Paragraph position="14"> _AGR) for the purpose of unification. WDs can only be modified by instantiations of variables, i.e. variable bindings may not be transferred into the WD. Each time a rule is activated, the variables are reinitialized.</Paragraph>
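What the _AGR variable achieves in rule (2) can be sketched as follows (a hypothetical Python rendering with invented agreement tags; agreement is modeled as a set of atomic tags per word): the agreement candidates of article, adjectives and noun are intersected, and the intersection is written back into every marked word.

```python
# Sketch of agreement disambiguation via a shared variable: intersect
# the candidate sets, then apply the same u act to all marked WDs.
def disambiguate_np(words):
    agr = None                          # the _AGR variable, fresh per activation
    for w in words:
        agr = set(w["agr"]) if agr is None else agr & w["agr"]
        if not agr:
            return False                # agreement not unifiable: rule fails
    for w in words:                     # the u act on every marked WD
        w["agr"] = set(agr)
    return True

# invented readings of "der alte Mann"
np = [{"lu": "der",  "agr": {"nom_sg", "gen_pl"}},
     {"lu": "alte", "agr": {"nom_sg"}},
     {"lu": "Mann", "agr": {"nom_sg", "acc_sg"}}]
ok = disambiguate_np(np)
```

A single pass thus disambiguates all three words at once, which is what the shared variable buys over word-by-word rules.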
      <Paragraph position="15"> The rule (2) matches a noun phrase, thereby disambiguating the agreement. With slight changes, the output of the rule can be turned into a shallow parse:</Paragraph>
      <Paragraph position="17"> The operator &amp;quot;r&amp;quot; in the second consequence of the rule (3) replaces the category value in the noun node by a new one ({c=np}). The determiner node ({c=w,sc=art}) and all adjective nodes ({c=adj}) are removed ('killed') by means of the kill operator Ak{} from the sentence descriptor such that only the NP node is printed as a result.</Paragraph>
      <Paragraph position="18"> Style checking often has to deal with the complexity 4 of a phrase. Therefore, it makes use of another type of rule where the presence of a number of word interpretations in a certain count is checked. For instance in technical texts, it may be advisable not to have more than eight words before the finite verb. The rule (4) unifies an appropriate warning number into the first finite verb analysis if more than eight words have occurred before it.</Paragraph>
      <Paragraph position="20"> The first condition matches the first WD in a sentence ({imrr=1}) if it has an interpretation different from a finite verb ({vtyp'=fiv}). The second condition is a count that matches a sequence of</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The complexity of a phrase is a function of different parameters
</SectionTitle>
    <Paragraph position="0"> such as its length, the number of lexical elements, the complexity of its structure. The definitions differ from one author to the next. In our calculation of complexity, only length and number of lexical elements are taken into account.</Paragraph>
    <Paragraph position="1"> WDs other than finite verbs. This is expressed by the external test ({vtyp'=fiv};{c'=verb}) following the vertical bar. The internal test ({sc'=comma;cit;slash}), i.e. the part before the vertical bar, counts the number of words in the count different from punctuation marks and slashes. The count is true if eight or more such internal tests are true. The motivation for the third condition is to put the marker &amp;quot;A&amp;quot; on the finite verb such that it can be unified with the warning in the action part. The warning can be used by further tools to select an appropriate message for the user.</Paragraph>
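The count mechanism behind rule (4) can be sketched as follows (a hypothetical Python rendering with simplified, assumed tags; each WD is one dict rather than a set of interpretations): scan WDs until the external test fails at the finite verb, counting only those that pass the internal test.

```python
# Sketch of a counting rule: warn if >= limit countable words occur
# before the first finite verb (tags vtyp/sc follow the text).
def too_many_words_before_verb(sd, limit=8):
    count = 0
    for wd in sd:
        if wd.get("vtyp") == "fiv":               # external test fails: count ends
            wd["warn"] = count >= limit           # unify a warning into the verb
            return wd["warn"]
        if wd.get("sc") not in ("comma", "slash"):  # internal test: not punctuation
            count += 1
    return False

# nine ordinary words before the finite verb trigger the warning
sentence = [{"lu": "w%d" % i} for i in range(9)] + [{"lu": "ist", "vtyp": "fiv"}]
warned = too_many_words_before_verb(sentence)
```

The warning flag left on the verb plays the role of the warning number that later tools turn into a user message.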
  </Section>
  <Section position="11" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3.2 Formal Definition
</SectionTitle>
    <Paragraph position="0"> The formal definition of rule syntax is given below: Definition of rule:</Paragraph>
    <Paragraph position="2"> Whether or not a rule applies (i.e. its action part is executed or not) depends on whether its conditions match. Each condition matches the longest possible sequence of WDs and, once matched, other segmentations are not considered.</Paragraph>
    <Paragraph position="3"> We do not foresee backtracking or multiple solution generation. The length of an interval depends on the scope of the interval and the outcome of the tests. In accordance with many linguistic formalisms we distinguish between four scopes.</Paragraph>
    <Paragraph position="4">  is true or false depends on the quantifier of the test. The existence quantifier &amp;quot;e&amp;quot; and the all quantifier &amp;quot;a&amp;quot; are implemented as follows: e The test is true if there is a non-empty intersection between the FB and the current WD. The FB describes a possible interpretation of the current WD. The test is true if there is at least one interpretation in the current WD that is unifiable with FB.</Paragraph>
    <Paragraph position="5">  a The test is true if the current WD is a subset of the FB. The FB describes the necessary interpretation of the current WD. The test is true if the FB subsumes all interpretations of the current WD.</Paragraph>
    <Paragraph position="6"> All consequences of the action part are executed if the description part matches. The acts of a consequence apply to the marked WD. The following operators are currently implemented: k kills the marked WD.</Paragraph>
    <Paragraph position="7"> u unifies FB into the marked WD.</Paragraph>
    <Paragraph position="8"> r replaces the values of the features in the marked WDs by those of FB.</Paragraph>
    <Paragraph position="9"> d deletes the specified features in FB from the marked WD.</Paragraph>
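The four acts can be sketched over the same simplified WD model used above (a list of interpretation dicts). This is a hypothetical Python rendering, not the C implementation: unify and replace are approximations, and the feature values in the demo are invented.

```python
# Sketches of the four acts k, u, r, d on the simplified WD model.
def kill(sd, i):                 # k: remove the marked WD from the sentence
    del sd[i]

def unify(wd, fb):               # u: keep interpretations compatible with fb
    wd[:] = [dict(ip, **fb) for ip in wd
             if all(ip.get(f, v) == v for f, v in fb.items())]

def replace(wd, fb):             # r: overwrite the features named in fb
    for ip in wd:
        ip.update(fb)

def delete(wd, feats):           # d: drop the named features
    for ip in wd:
        for f in feats:
            ip.pop(f, None)

wd = [{"c": "w", "sc": "p"}, {"c": "vpref"}]
unify(wd, {"c": "vpref"})        # rules out the prepositional reading
```

Under this model, unifying acts as a filter plus merge, while replace and delete edit each surviving interpretation in place.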
    <Paragraph position="10"> Apart from an interval, a condition can consist of a count. The length of a count is controlled by a set of external tests (interval_ext), i.e. the right border of the count is either the end of the SD or a WD where one of the external tests is false. The outcome of a count (whether it is true or false) is controlled by a set of internal tests (interval_int). For a count to be true, at least the specified number of internal tests must be true.</Paragraph>
  </Section>
  <Section position="12" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Related Work
</SectionTitle>
    <Paragraph position="0"> In order to compare KURD with other post-morphological processing systems, one can distinguish between the formalisms' design, the implementation of a grammar and the tasks for which the system is designed. Most such comparisons (e.g.</Paragraph>
    <Paragraph position="1"> (Abney, 1996)) are based on processing time, accuracy and recall, which in fact do not differentiate between the strength of the formalism and the strength of the grammar actually implemented.</Paragraph>
    <Paragraph position="2"> In this section we want to compare the capacities of KURD to other formalisms by describing its formal characteristics for each possible step in the chain of NLP applications. Two concrete applications will be presented in the following section. Similar to KURD, CGP of the 'Helsinki' project (cf. (Karlsson, 1990)) is a system working on morphologically analysed text that contains lexical ambiguities. KURD and CGP are somewhat alike with respect to the basic assumptions on the steps one would need to disambiguate morphological descriptions: an ambiguous word (WD) is observed in its context. If necessary it has to be ascertained that the context itself is not ambiguous. In a fitting context the disambiguation operation is triggered. The realization of these assumptions in the two formalisms differs in the following features: In KURD ...</Paragraph>
    <Paragraph position="3"> * a rule definition is based on pattern matching of a specific context, in which the action's focus is then selected.</Paragraph>
    <Paragraph position="4"> * the scope of disambiguation is fixed by means of markers. This allows more than one operation to be defined in the marked scope (WDs) at a time, and the same operation to be applied to more than one word (WD).</Paragraph>
    <Paragraph position="5"> * the context of an operation and the operation itself are defined in separate parts of the rule. Each part may contain a distinct set of features while in CGP, all features specified for the focused word are subject to the same disambiguation.</Paragraph>
    <Paragraph position="6">  * variable binding is supported. Multiple interpretations of several words can be disambiguated by unification as exemplified in rule (2). In CGP, rule batteries are necessary for this task, and disambiguation of the combination of features of more than two WD is not possible.</Paragraph>
    <Paragraph position="7"> * unbounded dependencies can be modeled by means of intervals. We are not sure whether these can be modeled in CGP by means of relative positions.</Paragraph>
    <Paragraph position="8"> In CGP ...</Paragraph>
    <Paragraph position="9"> * the focus of the rule is positioned before the left- and rightward context is described.</Paragraph>
    <Paragraph position="10"> * one can look backwards in a context. This is not always possible in KURD due to underspecification in the morphological input. * one can define sets of features. In KURD, this can be modeled by means of feature disjunction; thus more freedom in KURD, but less consistency.</Paragraph>
    <Paragraph position="11"> * one can discard a reading when the context is NOT realized. In KURD, this possibility can only be modeled using two rules and a meta-feature.</Paragraph>
    <Paragraph position="12"> * there is a specific clause boundary mode. In KURD, clause boundaries have to be enumerated as simple features.</Paragraph>
    <Paragraph position="13"> To summarize the comparison, backward looking seems basically the only difference with which CGP has an advantage over KURD in terms of expressiveness, while variable binding gives KURD an advantage over CGP. In terms of user-friendliness, the systems choose two different directions. In KURD the use of markers and rule separation into a description part and an action part may reduce the number of rules, while CGP allows for the simplification of rules by means of sets or the clause boundary mode.</Paragraph>
    <Paragraph position="14"> The next step in processing moves from the treatment of words towards the treatment of word groups, i.e. to parsing. Traditional parsers are full parsers building all possible deep parse trees over the flat input structure. Weaker models, usually referred to as 'shallow parsers' (cf. (Karlsson and Karttunen, 1997)), allow for partial parses, for trees of depth one or for one result only. The output data structure of a parser is generally a bracketed structure which preserves the original morphological flat structure inside the output structure. Some shallow parsers, however, such as CGP, assign syntactic functions to the words of a sentence and renounce the representation of the dependency structure.</Paragraph>
    <Paragraph position="15"> Parsing with KURD results in a one level representation where the nodes (WD) can be enriched with information concerning their syntactic functions. The insertion of brackets is not supported in KURD, but recognized phrases can be reduced to one node if they are part of higher level phrases. Also, the recursivity of language has to be approximated by means of iterative, multiple application of (not necessarily the same) rule sets. Thus KURD has to be classified as a typical shallow parsing system, also allowing for partial parsing. The last step in the NLP processing chain is a practical application of the linguistic knowledge for a specific task. The next section describes such an application of KURD for style checking.</Paragraph>
    <Paragraph position="16"> It does not rely on a full disambiguation and syntactic tagging of the morphological analysis. Disambiguation is undertaken only when necessary. We believe that 100% disambiguation is too expensive for a rule based system 5 especially when it has to be adapted to each new text type. In the next section, we show that good results can also be obtained on ambiguous input.</Paragraph>
  </Section>
  <Section position="13" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Style checking
</SectionTitle>
    <Paragraph position="0"> In this section we want to describe an application of KURD for style checking of technical documents. The application has been developed and tested for a car manufacturing environment (Haller, 1996).</Paragraph>
    <Paragraph position="1"> In technical documentation, the quality of the text in terms of completeness, correctness, consistency, readability and user-friendliness is a central goal (Fottner-Top, 1996). Therefore completed documents undergo a cycle of correction and re-editing. As our experiments in this production process have shown, 40% of re-editions in technical documents are motivated by stylistic considerations (compared to corrections of orthography, syntax, content or layout).</Paragraph>
    <Paragraph position="2"> On the basis of the observed re-editions, stylistic guidelines have been formulated, such as:  1. Do not use compounds made up of three or more elements.</Paragraph>
    <Paragraph position="3"> 2. Do not use the passive voice if there is an explicit agent.</Paragraph>
    <Paragraph position="4"> 3. Long coordinations should be represented in  lists.</Paragraph>
    <Paragraph position="5"> The compilation of these guidelines has influenced the architecture of KURD to a certain extent. Most scientists correlate the readability of a sentence with its complexity, defined often by length, number of content words and/or structural embedding. (5 CGP contained 400 rules for 90% disambiguation quality (cf. (Karlsson, 1990)). In order to reach nearly 100%, this number increased up to 1,109 rules, cf. (Karlsson and Karttunen, 1997).) Whereas such information is not common in NLP applications, its calculation can be modeled in KURD through the count mechanism. The basic idea of using the formalism for style checking is exemplified by rule (4): a morphosyntactic pattern is recognized by a specific rule unifying a warning number into the marked WD.</Paragraph>
    <Paragraph position="6"> This number triggers an appropriate message in further processing steps that signals the use of an undesirable formulation. As a result, the user can ameliorate that part of the text.</Paragraph>
    <Paragraph position="7"> For better results, the style checking application makes use of the disambiguating power of KURD; i.e. some tagging rules (e.g. rule (1)) precede the application of the style rules.</Paragraph>
    <Paragraph position="8"> The system contains at its present stage 36 style warnings which are expressed by 124 KURD rules: an average of 3 to 4 rules for each style problem. The warnings can be classified as follows (for examples, see above): 1. One word warnings (10 types of warning): These warnings can either recognize the complex internal structure of compound words, or forbid the use of a certain word. For the latter task, style checking moves towards checking against the lexicon of a Controlled Language.</Paragraph>
    <Paragraph position="9"> This task should not be over-used; a lexically driven control mechanism seems to be more adequate.</Paragraph>
    <Paragraph position="10"> 2. Structure-linked warnings (19 types of warning): These warnings react to complex syntactic structures and trigger the proposition of a reformulation to the writer. They are therefore the most interesting for the user and for the rule writer.</Paragraph>
    <Paragraph position="11"> 3. Counting warnings (7 types of warning): These warnings measure the complexity of a sentence or of a sub-phrase by counting its elements. Complexity is a central topic in the readability literature (see footnote 5), but it does not allow the triggering of a concrete reformulation proposition to the user.</Paragraph>
    <Paragraph position="12"> Most structure-linked warnings require more than one KURD rule. This is due to the fact that the pattern to be recognized can occur in different forms in the text. As shown by the following example (5), two rules would be necessary to detect the 'Future II' in German, because word order of verbal phrases in main sentences differs from that in subordinate clauses.</Paragraph>
    <Paragraph position="13"> (5) Der Mann wird schnell gekommen The man will quickly come sein.</Paragraph>
    <Paragraph position="14"> be.</Paragraph>
    <Paragraph position="15"> Er weiß, daß der Mann schnell He knows, that the man quickly gekommen sein wird.</Paragraph>
    <Paragraph position="16"> come be will.</Paragraph>
    <Paragraph position="17"> For recursive phenomena KURD's flat matching approach is somewhat inconvenient. In example (6), rule (2) applies to die Werkzeuge, although the article die should in fact be linked to Arbeiter, while Werkzeuge stands in a bare plural.</Paragraph>
    <Paragraph position="18"> (6) die Werkzeuge herstellenden Arbeiter the tools building workers To handle such problems, one can try to enumerate the elements which can be contained between the two borders of a pattern to be recognized. But this approach mostly yields only approximate results because it does not respect the generative capacity of language.</Paragraph>
    <Paragraph position="19"> However, most of the style warnings have been easily implemented in KURD, as the appropriate pattern can still often be recognized by one or two elements at its borders.</Paragraph>
    <Paragraph position="20"> The system has been tested against an analyzed corpus of approx. 76,000 sentences. More than 5,000 sentences to be ameliorated were detected by KURD. 757 of them were selected manually to control the validity of the rules of warning classes 2 and 3: In 8% (64), the warnings had been applied incorrectly. In these cases, syntactic structure could not adequately be described in the KURD formalism. These 8%, however, only reflect the erroneous results of warning classes 2 and 3. They do not cover sentences selected by simple rules such as those of class 1. Rules of warning class 1 are responsible for 20% of the automatically detected sentences to be ameliorated. These rules never apply incorrectly.</Paragraph>
    <Paragraph position="21"> In another test, a text of 30 pages was annotated by a human corrector and the KURD style checker. The results were compared. About 50%</Paragraph>
  </Section>
  <Section position="14" start_page="0" end_page="0" type="metho">
    <SectionTitle>
KURD
</SectionTitle>
    <Paragraph position="0"> of the human annotations were also annotated by the computer with a comparable amelioration proposition. 35% resisted an automatic diagnosis, either because the recursive structure could not adequately be modeled by the style checking rules, or because the information calculated by the morphological analysis was not sufficient (i.e. no semantic information was available). By writing new style rules, a 65% recall could be achieved.</Paragraph>
    <Paragraph position="1"> The precision of the style checker, on the other hand, seems to be a critical point. The checker produces three times more automatic warnings than the human corrector. This is mainly due to the 'counting rules', because the count limits were often too low. The choice of acceptable limits is still under discussion.</Paragraph>
    <Paragraph position="2"> It has been shown that pattern recognition could be a valuable means for applications needing at least basic information on syntactic structures and that KURD could be a tool for realizing these applications.</Paragraph>
  </Section>
  <Section position="15" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Chunk Reduction and
Refinement
</SectionTitle>
    <Paragraph position="0"> In the framework of the CBAG 6 module (cf.</Paragraph>
    <Paragraph position="1"> (Carl, 1998) in this volume) KURD is used in several components. CBAG is an example based translation engine whose aim it is to be used as a stand-alone Example Based Machine Translation system (EBMT) or to be dynamically integrated as a front-end into a Rule Based Machine Translation system.</Paragraph>
    <Paragraph position="2"> The CBAG module is divided into three submodules: * The Case Base Compilation module (CBC) compiles a set of bilingual SD equivalences into a case base, thereby inducing case abstractions from the concrete SD. Case abstractions ensure a greater recall and are thus needed for a better coverage of the system.</Paragraph>
    <Paragraph position="3"> * The Case Based Analysis module (CBA) decomposes and reduces an input SD into a set of chunks according to the cases in the case base. Reduced sequences are more likely to match a case in the case base because they are shorter abstractions from the original sequence of WDs.</Paragraph>
  </Section>
  <Section position="16" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 CBAG stands for Case Based Analysis and Generation
* The Case Based Generation module (CBG)
</SectionTitle>
    <Paragraph position="0"> re-generates sequences of target language WDs from the reduced chunks. In the refinement process lexical and grammatical information are merged together into WDs.</Paragraph>
    <Paragraph position="1"> KURD is used for two different tasks in these modules. In the CBC module and in the CBA module, KURD performs chunk reduction and in the CBG module, KURD performs chunk refinement.</Paragraph>
    <Paragraph position="2"> In order to do chunk reduction, the input SD is first decomposed into a sequence of chunks according to the entries in the case base. KURD reduces those chunks which match a case in the case base into one chunk descriptor according to the schema of rule 3.</Paragraph>
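Chunk reduction can be sketched after the schema of rule (3). The following is an assumed Python rendering, not the literal CBA code: a matched chunk of WDs collapses into a single chunk descriptor that keeps the head's features under a new category, so the shortened sequence can match a case in the case base.

```python
# Sketch of chunk reduction: replace a matched span of WDs by one
# chunk descriptor derived from the head word (labels are invented).
def reduce_chunk(sd, start, end, label):
    head = sd[end - 1]                        # keep the head word's descriptor
    sd[start:end] = [dict(head, c=label)]     # one node replaces the whole span

sd = [{"c": "w"}, {"c": "adj"}, {"c": "noun", "lu": "Mann"}, {"c": "verb"}]
reduce_chunk(sd, 0, 3, "np")
```

The reduced SD is shorter and more abstract than the original WD sequence, which is exactly why reduced sequences are more likely to match a case.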
    <Paragraph position="3"> In the refinement phase, KURD merges lexical and grammatical information which is extracted from two different sets of cases. These rules use all types of operators that are available in KURD.</Paragraph>
  </Section>
  <Section position="17" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Implementation
</SectionTitle>
    <Paragraph position="0"> The KURD formalism is implemented in C and compilable under gcc. It runs on sparc workstations and is currently being ported to PC (with gcc).</Paragraph>
  </Section>
</Paper>