<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1033">
  <Title>A Divide-and-Conquer Strategy for Shallow Parsing of German Free Texts</Title>
  <Section position="4" start_page="0" end_page="239" type="metho">
    <SectionTitle>
DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany, cbraun@dfki.de
</SectionTitle>
    <Paragraph position="0"> DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany, piskorsk@dfki.de ... are triggered by domain-specific predicates attached only to a relevant subset of verbs which express domain-specific selectional restrictions for possible argument fillers.</Paragraph>
    <Paragraph position="1"> In most of the well-known shallow text processing systems (cf. (Sundheim, 1995) and (SAIC, 1998)), cascaded chunk parsers are used which perform clause recognition after fragment recognition, following a bottom-up style as described in (Abney, 1996). We have also developed a similar bottom-up strategy for the processing of German texts, cf. (Neumann et al., 1997). However, the main problem we experienced using the bottom-up strategy was insufficient robustness: because the parser depends on the lower phrasal recognizers, its performance is heavily influenced by their respective performance. As a consequence, the parser frequently wasn't able to process structurally simple sentences, because they contained, for example, highly complex nominal phrases, as in the following example: &amp;quot;[Die vom Bundesgerichtshof und den Wettbewerbshütern als Verstoß gegen das Kartellverbot gegeißelte zentrale TV-Vermarktung] ist gängige Praxis.&amp;quot; Central television marketing, censured by the German Federal High Court and the guards against unfair competition as an infringement of anti-cartel legislation, is common practice.</Paragraph>
    <Paragraph position="2"> During free text processing it might not be possible (or even desirable) to recognize such a phrase completely. However, if we assume that domain-specific templates are associated with certain verbs or verb groups which trigger template filling, then it will be very difficult to find the appropriate fillers without knowing the correct clause structure. Furthermore, in a purely bottom-up approach some ambiguities (for example, relative pronouns) cannot be resolved without introducing much underspecification into the intermediate structures.</Paragraph>
    <Paragraph position="3"> Therefore we propose the following divide-and-conquer parsing strategy: in a first phase, only the verb groups and the topological structure of a sentence according to the linguistic field theory (cf. (Engel, 1988)) are determined domain-independently, as in: &amp;quot;[CoordS [SSent Diese Angaben konnte der Bundesgrenzschutz aber nicht bestätigen], [SSent Kinkel sprach von Horrorzahlen, [Relcl denen er keinen Glauben schenke]]].&amp;quot; This information couldn't be verified by the Border Police; Kinkel spoke of horror figures that he didn't believe. In a second phase, general (as well as domain-specific) phrasal grammars (nominal and prepositional phrases) are applied to the contents of the different fields of the main and sub-clauses (see fig. 1). This approach offers several advantages: * improved robustness, because parsing of the sentence topology is based only on simple indicators like verb groups and conjunctions and their interplay, * the resolution of some ambiguities, including relative pronoun vs. determiner, subjunction vs. preposition and sentence coordination vs.</Paragraph>
    <Paragraph position="4"> NP coordination, and * a high degree of modularity (easy integration of domain-dependent subcomponents).</Paragraph>
    <Paragraph position="5"> The shallow divide-and-conquer parser (DC-PARSER) is supported by means of powerful morphological processing (including on-line compound analysis), efficient POS-filtering and named entity recognition. Thus the architecture of the complete shallow text processing approach consists basically of two main components: the preprocessor and the DC-PARSER itself (see fig. 2).</Paragraph>
  </Section>
  <Section position="5" start_page="239" end_page="240" type="metho">
    <SectionTitle>
2 Preprocessor
</SectionTitle>
    <Paragraph position="0"> The DC-PARSER relies on a suitably configured preprocessing strategy in order to achieve the desired simplicity and performance. It consists of the following main steps: Tokenization The tokenizer maps sequences of consecutive characters into larger units called tokens and identifies their types. Currently we use more than 50 domain-independent token classes, including generic classes for semantically ambiguous tokens (e.g., &amp;quot;10:15&amp;quot; could be a time expression or a volleyball result, hence we classify this token as a number-dot compound) and complex classes like abbreviations or complex compounds (e.g., &amp;quot;AT&amp;T-Chief&amp;quot;). Such a variety of token classes proved to simplify the processing of subsequent submodules significantly. Morphology Each token identified as a potential word form is submitted to morphological analysis, including on-line recognition of compounds (which is crucial since compounding is a very productive process in German) and hyphen coordination (e.g., in &amp;quot;An- und Verkauf&amp;quot; (purchase and sale), &amp;quot;An-&amp;quot; is resolved to &amp;quot;Ankauf&amp;quot; (purchase)). Each token recognized as a valid word form is associated with the list of its possible readings, characterized by stem, inflection information and part-of-speech category.</Paragraph>
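The hyphen coordination step can be sketched as follows. This is a minimal illustration under stated assumptions: the prefix list and the prefix-swapping heuristic are invented for the example, whereas the paper's morphology component uses a full lexicon.

```python
# Hedged sketch of hyphen coordination: "An- und Verkauf" -> "Ankauf".
# PREFIXES is a toy list of German verb prefixes (an assumption, not
# the system's actual resource).
PREFIXES = ["ver", "ein", "aus", "ent", "an", "be"]

def resolve_elision(truncated: str, full: str) -> str:
    """Resolve a truncated coordination member like 'An-' against the
    full form ('Verkauf') by reusing the full form's stem."""
    trunc = truncated.rstrip("-")
    low = full.lower()
    for p in sorted(PREFIXES, key=len, reverse=True):
        if low.startswith(p):
            stem = full[len(p):]      # "Verkauf" -> stem "kauf"
            return trunc + stem       # "An" + "kauf" -> "Ankauf"
    return truncated + full           # fallback: no known prefix found

print(resolve_elision("An-", "Verkauf"))  # prints "Ankauf"
```

Applied to the example above, the elided head of &amp;quot;An-&amp;quot; is recovered from the full conjunct &amp;quot;Verkauf&amp;quot;.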
    <Paragraph position="1"> POS-filtering Since a large number of German word forms are ambiguous, especially word forms with a verb reading (30% of the word forms with a verb reading in the test corpus &amp;quot;Wirtschaftswoche&amp;quot; (business news journal) turned out to have at least one other, non-verb reading), and since the quality of the results of the DC-PARSER relies essentially on the proper recognition of verb groups, efficient disambiguation strategies are needed. Using case-sensitive rules is straightforward since generally only nouns (and proper names) are written in standard German with a capitalized initial letter (e.g., &amp;quot;das Unternehmen&amp;quot; - the enterprise vs. &amp;quot;wir unternehmen&amp;quot; - we undertake). However, for the disambiguation of word forms appearing at the beginning of a sentence, local contextual filtering rules are applied. For instance, the rule which forbids a capitalized verb reading to be followed by a finite verb would filter out the verb reading of the word &amp;quot;unternehmen&amp;quot; in the sentence</Paragraph>
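A minimal sketch of such a sentence-initial filtering rule follows. The tag names (NOUN, VERB, FIN-VERB) and the toy lexicon are assumptions for illustration; the actual system compiles all filtering rules into a single finite-state transducer.

```python
# Hedged sketch of one contextual POS-filtering rule: drop the verb
# reading of a capitalized, sentence-initial word form when the next
# word form is a finite verb. Tag names and lexicon are invented.
def filter_sentence_initial_verb(tokens, lexicon):
    """Return one set of candidate readings per token, with the rule
    applied to the first token."""
    readings = [set(lexicon.get(t.lower(), {"UNKNOWN"})) for t in tokens]
    if (len(tokens) > 1 and tokens[0][0].isupper()
            and "FIN-VERB" in readings[1]):
        readings[0].discard("VERB")   # a second finite verb cannot follow
    return readings

# "Unternehmen sind ..." -- the verb reading of "Unternehmen" is removed
LEX = {"unternehmen": {"NOUN", "VERB"}, "sind": {"FIN-VERB"}}
print(filter_sentence_initial_verb(["Unternehmen", "sind"], LEX))
```

The same rule leaves lowercase occurrences such as &amp;quot;wir unternehmen&amp;quot; untouched, since the capitalization test fails.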
    <Paragraph position="2"> &amp;quot;Unternehmen sind an Gewinnmaximierung interessiert.&amp;quot; (Enterprises are interested in maximizing their profits.) A major subclass of ambiguous word forms are those which have an adjective or attributively used participle reading besides the verb reading. For instance, in the sentence &amp;quot;Sie bekannten, die bekannten Bilder gestohlen zu haben.&amp;quot; (They confessed to having stolen the famous paintings.) the word form &amp;quot;bekannten&amp;quot; is first used as a verb (confessed) and then as an adjective (famous). Since adjectives and attributively used participles are in most cases part of a nominal phrase, a convenient rule would reject the verb reading if the previous word form is a determiner or the next word form is a noun. It is important to notice that such rules are based on regularities but may yield false results, as for instance the rule for filtering out the verb reading of some word forms extremely rarely used as verbs (e.g., &amp;quot;recht&amp;quot; - right, to rake (3rd person, sg)). All rules are compiled into a single finite-state transducer according to the approach described in (Roche and Schabes, 1995). Named entity finder Named entities such as organizations, persons, locations and time expressions are identified using finite-state grammars. Since some named entities (e.g., company names) may appear in the text either with or without a designator, we use a dynamic lexicon to store recognized named entities without their designators (e.g., &amp;quot;Braun AG&amp;quot; vs. &amp;quot;Braun&amp;quot;) in order to identify subsequent occurrences correctly. However, a named entity consisting solely of one word may also be a valid word form (e.g., &amp;quot;Braun&amp;quot; - brown). Hence we classify such words as candidates for named entities, since generally such ambiguities cannot be resolved at this level. 
Recognition of named entities could be postponed and integrated into the fragment recognizer, but performing this task at this stage of processing seems more appropriate: firstly, because the results of POS-filtering can be partially verified and improved, and secondly, because the number of word forms to be processed by subsequent modules can be considerably reduced. For instance, the verb reading of the word form &amp;quot;achten&amp;quot; (to watch vs. eighth) in the time expression &amp;quot;am achten Oktober 1995&amp;quot; (on the eighth of October 1995) can be filtered out if this has not been done yet.</Paragraph>
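The dynamic-lexicon mechanism can be sketched as follows. The class name, the toy designator list, and the tag values are assumptions for illustration; the paper's finder uses full finite-state grammars over many entity types.

```python
# Hedged sketch of a dynamic lexicon for named entities: a company name
# seen once with its designator ("Braun AG") lets later bare
# occurrences ("Braun") be tagged as candidate entities only, since
# "Braun" may also be an ordinary word form (brown).
DESIGNATORS = {"AG", "GmbH", "KG"}   # toy designator list (assumption)

class DynamicLexicon:
    def __init__(self):
        self.known = {}              # entity name without designator -> type

    def observe(self, tokens):
        tags = []
        for i, tok in enumerate(tokens):
            if i + 1 < len(tokens) and tokens[i + 1] in DESIGNATORS:
                self.known[tok] = "ORG"       # store name without designator
                tags.append("ORG")
            elif tok in self.known:
                tags.append("ORG-candidate")  # ambiguity stays unresolved here
            else:
                tags.append("O")
        return tags
```

A first sentence containing &amp;quot;Braun AG&amp;quot; registers &amp;quot;Braun&amp;quot;; a later bare &amp;quot;Braun&amp;quot; is then tagged only as a candidate, deferring disambiguation to later modules.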
  </Section>
  <Section position="6" start_page="240" end_page="243" type="metho">
    <SectionTitle>
3 A Shallow Divide-and-Conquer
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="240" end_page="240" type="sub_section">
      <SectionTitle>
Strategy
</SectionTitle>
      <Paragraph position="0"> The DC-PARSER consists of two major domain-independent modules based on finite state technology. (The manually constructed filtering rules proved to be a useful means for disambiguation, but were not sufficient to filter out all implausible readings; hence supplementary rules derived with Brill's tagger were used to achieve broader coverage.)</Paragraph>
      <Paragraph position="1"> These modules are: 1) construction of the topological sentence structure, and 2) application of phrasal grammars to each determined subclause (see also fig. 3). In this paper we will concentrate on the first step, because it is the more novel part of the DC-PARSER, and will only briefly describe the second step in section 3.2.</Paragraph>
    </Section>
    <Section position="2" start_page="240" end_page="241" type="sub_section">
      <SectionTitle>
3.1 Topological structure
</SectionTitle>
      <Paragraph position="0"> The DC-PARSER applies cascades of finite-state grammars to the stream of tokens and named entities delivered by the preprocessor in order to determine the topological structure of the sentence according to the linguistic field theory (Engel, 1988). Based on the fact that in German a verb group (like &amp;quot;hätte überredet werden müssen&amp;quot; - *have convinced been should, meaning should have been convinced) can be split into a left and a right verb part (&amp;quot;hätte&amp;quot; and &amp;quot;überredet werden müssen&amp;quot;), these parts (abbreviated as LVP and RVP) are used for the segmentation of a main sentence into several parts: the front field (VF), the left verb part, the middle field (MF), the right verb part, and the rest field (RF). Subclauses can also be expressed in this way, such that the left verb part is either empty or occupied by a relative pronoun or a subjunction element, and the complete verb group is placed in the right verb part, cf. figure 3. Note that each separated field can be arbitrarily complex, with very few restrictions on the ordering of the phrases inside a field.</Paragraph>
      <Paragraph position="1"> Recognition of the topological structure of a sentence can be described by the following four phases, realized as a cascade of finite state grammars (see also fig. 2; fig. 4 shows the different steps in action). Details concerning the implementation of the topological parsing strategy can be found in (Braun, 1999); details concerning the representation and compilation of the finite state machinery can be found in (Neumann et al., 1997). Initially, the stream of tokens and named entities is separated into a list of sentences based on punctuation signs; the tokenizer and named entity finder have already determined abbreviation signs, so this sort of disambiguation is resolved. Verb groups A verb grammar recognizes all single occurrences of verb forms (in most cases corresponding to LVP) and all closed verb groups (i.e., sequences of verb forms, corresponding to RVP). The parts of discontinuous verb groups (e.g., separated LVP and RVP, or separated verbs and verb prefixes) cannot be put together at this step of processing, because one needs contextual information which will only be available in the next steps.</Paragraph>
      <Paragraph position="2"> [Fig. 4 caption (excerpt): &amp;quot;... Verluste erlitten hat, musste sie Aktien verkaufen.&amp;quot; (Because the Siemens GmbH, which strongly depends on exports, suffered losses, they had to sell some of the shares.) It shows the separation of a sentence into the front field (VF), the verb group (VERB), and the middle field (MF). The elements of the different fields have been computed by fragment recognition, which takes place after the (possibly recursive) topological structure has been computed. Note that the front field consists of only one, but complex, subclause which itself has an internal field structure.] The major problem at this phase is not a structural one but the ambiguity of verb forms (for example, most plural verb forms can also be non-finite or imperative forms). This kind of ambiguity cannot be resolved without taking into account a wider context. Therefore these verb forms are assigned disjunctive types, similar to the underspecified chunk categories proposed by (Federici et al., 1996). These types, like for example Fin-Inf-PP or Fin-PP, reflect the different readings of the verb form and enable following modules to use these verb forms according to the wider context, thereby removing the ambiguity. In addition to a type, each recognized verb form is assigned a set of features which represent various properties of the form, like tense and mode information (cf. figure 5).</Paragraph>
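The field segmentation described above can be sketched as follows. The token-level LVP/RVP tags are assumed to come from the verb-group grammar; the function name and input shape are illustrative assumptions.

```python
# Hedged sketch of field segmentation: LVP/RVP tags on the tokens
# induce the front field (VF), middle field (MF) and rest field (RF)
# of a German main clause.
def segment_fields(tagged):
    """tagged: list of (token, tag) pairs with tag in {"LVP", "RVP", "O"}."""
    lvp = next(i for i, (_, t) in enumerate(tagged) if t == "LVP")
    rvp = [i for i, (_, t) in enumerate(tagged) if t == "RVP"]
    start, end = (rvp[0], rvp[-1] + 1) if rvp else (len(tagged),) * 2
    toks = [w for w, _ in tagged]
    return {"VF": toks[:lvp], "LVP": toks[lvp],
            "MF": toks[lvp + 1:start], "RVP": toks[start:end],
            "RF": toks[end:]}

# "Der Mann hätte überredet werden müssen" (the man should have been convinced)
sent = [("Der", "O"), ("Mann", "O"), ("hätte", "LVP"),
        ("überredet", "RVP"), ("werden", "RVP"), ("müssen", "RVP")]
print(segment_fields(sent))
```

Here the discontinuous verb group &amp;quot;hätte ... überredet werden müssen&amp;quot; brackets an (empty) middle field, while the subject occupies the front field.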
      <Paragraph position="3"> Base clauses (BC) are subclauses of subjunctive and subordinate type. Although they are embedded into a larger structure, they can be recognized independently and simply on the basis of commas, initial elements (like complementizer, interrogative or relative items - see also fig. 4, where SUBCONJ-CL and REL-CL are tags for subclauses) and verb fragments. [Fig. 5 example: &amp;quot;nicht gelobt haben kann&amp;quot; - *not praised have can, meaning could not have been praised] The different types of subclauses are described very compactly as finite state expressions.</Paragraph>
      <Paragraph position="4"> Figure 6 shows a (simplified) BC-structure in feature matrix notation.</Paragraph>
      <Paragraph position="5"> &amp;quot;..., wenn die Arbeitgeber Forderungen stellten, ohne als Gegenleistung neue Stellen zu schaffen.&amp;quot; ... if the employers make new demands, without compensating by creating new jobs.</Paragraph>
      <Paragraph position="6"> Clause combination It is very often the case that base clauses are recursively embedded, as in the following example: ... weil der Hund den Braten gefressen hatte, den die Frau, nachdem sie ihn zubereitet hatte, auf die Fensterbank gestellt hatte.</Paragraph>
      <Paragraph position="7"> Because the dog ate the beef which was put on the window sill after it had been prepared by the woman.</Paragraph>
      <Paragraph position="8"> Two sorts of recursion can be distinguished: 1) middle field (MF) recursion, where the embedded base clause is framed by the left and right verb parts of the embedding sentence, and 2) the rest field (RF) recursion, where the embedded clause follows the right verb part of the embedding sentence. In order to express and handle this sort of recursion using a finite state approach, both recursions are treated as iterations such that they destructively substitute recognized embedded base clauses with their type.</Paragraph>
      <Paragraph position="9"> Hence, the complexity of the recognized structure of the sentence is reduced successively. However, because subclauses of MF-recursion may have their own embedded RF-recursion, the CLAUSE COMBINATION (CC) module is used for bundling subsequent base clauses before they are combined with subclauses identified by the outer MF-recursion. The BC and CC modules are called until no more base clauses can be reduced. If the CC module were not used, the following incorrect segmentation could not be avoided: ... *[daß das Glück [, das Jochen Kroehne empfunden haben sollte Rel-Cl] [, als ihm jüngst sein Großaktionär die Übertragungsrechte bescherte Subj-Cl], nicht mehr so recht erwärmt Subj-Cl] In the correct reading the second subclause &amp;quot;... als ihm jüngst sein ...&amp;quot; is embedded into the first one &amp;quot;... das Jochen Kroehne ...&amp;quot;.</Paragraph>
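The iterative reduction of embedded base clauses can be sketched as follows. The bracket notation stands in for the recognized clause boundaries, and replacing a clause by the single type tag &lt;BC&gt; is a simplification of the typed substitution the BC/CC modules perform.

```python
import re

# Hedged sketch of treating MF/RF recursion as iteration: innermost
# base clauses (marked with brackets for illustration) are destructively
# replaced by their type until nothing more can be reduced, mirroring
# the repeated calls of the BC and CC modules.
def reduce_clauses(sentence):
    innermost = re.compile(r"\[[^\[\]]*\]")   # clause with no embedded clause
    while innermost.search(sentence):
        sentence = innermost.sub("<BC>", sentence)
    return sentence

print(reduce_clauses(
    "weil der Hund den Braten [den die Frau "
    "[nachdem sie ihn zubereitet hatte] "
    "auf die Fensterbank gestellt hatte] gefressen hatte"))
```

Each pass substitutes only innermost clauses, so a doubly embedded relative clause disappears in two iterations, successively simplifying the sentence.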
      <Paragraph position="10"> Main clauses (MC) Finally the MC module builds the complete topological structure of the input sentence on the basis of the recognized (remaining) verb groups and base clauses, as well as on the word form information not yet consumed. The latter includes basically punctuations and coordinations.</Paragraph>
      <Paragraph position="11"> The following figure schematically describes the current coverage of the implemented MC-module (see figure 1 for an example structure):</Paragraph>
    </Section>
    <Section position="3" start_page="241" end_page="243" type="sub_section">
      <SectionTitle>
3.2 Phrase recognition
</SectionTitle>
      <Paragraph position="0"> After the topological structure of a sentence has been identified, each substring is passed to the FRAGMENT RECOGNIZER in order to determine its internal phrasal structure. Note that processing of a substring might still be partial in the sense that no complete structure need be found (e.g., if we cannot combine sequences of phrases into one larger unit). The FRAGMENT RECOGNIZER uses finite state grammars in order to extract nominal and prepositional phrases, where the named entities recognized by the preprocessor are integrated into appropriate places (implausible phrases are rejected by agreement checking; see (Neumann et al., 1997) for more details). The phrasal recognizer currently only considers processing of simple, non-recursive structures (see fig. 3; here, *NP* and *PP* are used for denoting phrasal types). Note that because of the high degree of modularity of our shallow parsing architecture, it is very easy to exchange the currently domain-independent fragment recognizer with a domain-specific one, without affecting the domain-independent DC-PARSER.</Paragraph>
      <Paragraph position="1"> The final output of the parser for a sentence is an underspecified dependence structure (UDS). A UDS is a flat dependency-based structure of a sentence, where only upper bounds for attachment and scoping of modifiers are expressed. This is achieved by collecting all NPs and PPs of a clause into separate sets, as long as they are not part of some subclause. This means that although the exact attachment point of each individual PP is not known, it is guaranteed that a PP can only be attached to phrases which are dominated by the main verb of the sentence (which is the root node of the clause's tree). However, the exact point of attachment is a matter of domain-specific knowledge and hence should be defined as part of the domain knowledge of an application.</Paragraph>
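The shape of such a structure can be sketched as follows. The input dictionary format and the function name are illustrative assumptions; the point is that phrases are grouped flatly per clause under the clause's main verb, leaving PP attachment open.

```python
# Hedged sketch of an underspecified dependence structure (UDS): the
# NPs and PPs of each clause are collected into flat sets under the
# clause's main verb; phrases of a subclause never float up, so the
# clause boundary is an upper bound on attachment.
def build_uds(clause):
    """clause: {'verb': str, 'phrases': [(type, text), ...],
    'subclauses': [clause, ...]} (subclauses optional)."""
    return {
        "verb": clause["verb"],
        "nps": [t for ty, t in clause["phrases"] if ty == "NP"],
        "pps": [t for ty, t in clause["phrases"] if ty == "PP"],
        "subclauses": [build_uds(c) for c in clause.get("subclauses", [])],
    }

uds = build_uds({
    "verb": "gestellt hatte",
    "phrases": [("NP", "die Frau"), ("PP", "auf die Fensterbank")],
    "subclauses": [{"verb": "zubereitet hatte",
                    "phrases": [("NP", "sie"), ("NP", "ihn")]}],
})
print(uds)
```

Here &amp;quot;auf die Fensterbank&amp;quot; is only constrained to attach somewhere under the main verb of its own clause; picking the exact head is left to domain knowledge, as stated above.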
    </Section>
  </Section>
  <Section position="7" start_page="243" end_page="243" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Due to the limited space, we concentrate on the evaluation of the topological structure. An evaluation of the other components (based on a subset of 20,000 tokens of the mentioned corpus from the &amp;quot;Wirtschaftswoche&amp;quot;, see below) yields: Of the 93.89% of the tokens which were identified by the morphological component as valid word forms, 95.23% got a unique POS assignment, with an accuracy of 97.9%. An initial evaluation on the same subset yielded a precision of 95.77% and a recall of 85% (90.1% F-measure) for our current named entity finder. An evaluation of the compound analysis of nouns, i.e., how often a morphosyntactically correct segmentation was found, yields: Based on the 20,000 tokens, 1427 compounds were found, of which 1417 have the correct segmentation (99.29% precision). On a smaller subset of 1000 tokens containing 102 compounds, 101 correct segmentations were found (99.01% recall), which is a quite promising result. An evaluation of simple NPs yielded a recall of 76.11% and a precision of 91.94%. The low recall was mainly due to unknown words.</Paragraph>
    <Paragraph position="1"> Between the 2nd and 5th of July 1999, a test corpus of 43 messages from different press releases (viz.</Paragraph>
  </Section>
  <Section position="8" start_page="243" end_page="243" type="metho">
    <SectionTitle>
DEUTSCHE PRESSEAGENTUR (dpa), ASSOCIATED
</SectionTitle>
    <Paragraph position="0"> PRESS (ap) and REUTERS) and different domains (with an equal distribution of politics, business and sensations) was collected. (This data collection and evaluation was carried out by (Braun, 1999).) The corpus contains 400 sentences</Paragraph>
    <Paragraph position="1"> with a total of 6306 words. Note that it was created only after the DC-PARSER and all grammars had been fully implemented. Table 1 shows the results of the evaluations (the F-measure was computed with β=1). We used the correctness criteria as defined in figure 7.</Paragraph>
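The evaluation measure can be made concrete with a small sketch, checked against the named-entity figures quoted above (precision 95.77%, recall 85%, reported as 90.1% F-measure). The function name is ours; the formula is the standard weighted F-measure.

```python
# Hedged sketch of the weighted F-measure with beta = 1, i.e. the
# harmonic mean of precision and recall.
def f_measure(precision, recall, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_measure(0.9577, 0.85), 3))  # -> 0.901
```

With β=1, precision and recall are weighted equally, which matches how Table 1's F-measure values are computed.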
    <Paragraph position="2"> The evaluation of each component was measured on the basis of the results of all previous components. For the BC and MC modules we also measured the performance after manually correcting the errors of the previous components (denoted as &amp;quot;isolated evaluation&amp;quot;). In most cases the difference between the precision and recall values is quite small, meaning that the modules keep a good balance between coverage and correctness. Only in the case of the MC-module is the difference about 5%. However, the result for the isolated evaluation of the MC-module suggests that this is mainly due to errors caused by previous components.</Paragraph>
    <Paragraph position="3"> A more detailed analysis showed that the majority of errors were caused by mistakes in the preprocessing phase. For example ten errors were caused by an ambiguity between different verb stems (only the first reading is chosen) and ten errors because of wrong POS-filtering. Seven errors were caused by unknown verb forms, and in eight cases the parser failed because it could not properly handle the ambiguities of some word forms being either a separated verb prefix or adverb.</Paragraph>
    <Paragraph position="4"> The evaluation was performed with the Lisp-based version of SMES (cf. (Neumann et al., 1997)) by replacing the original bidirectional shallow bottom-up parsing module with the DC-PARSER.</Paragraph>
    <Paragraph position="5"> The average run-time per sentence (average length 26 words) is 0.57 sec. A C++ version is nearly finished, missing only the re-implementation of the base and main clause recognition phases, cf. (Piskorski and Neumann, 2000). Its run-time behavior is already encouraging: processing a German text document (a collection of business news articles from the &amp;quot;Wirtschaftswoche&amp;quot;) of 197,118 tokens (1.26 MB) takes 45 seconds on a Pentium II, 266 MHz, 128 MB RAM, which corresponds to 4380 tokens per second.</Paragraph>
    <Paragraph position="6"> Since this is a speed-up by a factor of more than 20 compared to the Lisp version, we expect to be able to process 75-100 sentences per second.</Paragraph>
  </Section>
</Paper>