<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1056"> <Title>An Integrated Architecture for Shallow and Deep Processing</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Integration </SectionTitle> <Paragraph position="0"> Morphology and POS The coupling between the morphology delivered by SPPC and the input needed for the German HPSG was easily established. The morphological classes of German are mapped onto HPSG types which expand to small feature structures representing the morphological information in a compact way. A mapping to the output of SPPC was automatically created by identifying the corresponding output classes.</Paragraph> <Paragraph position="1"> Currently, POS tagging is used in two ways. First, lexicon entries that are marked as preferred by the shallow component are assigned higher priority than the rest. Thus, the probability of finding the correct reading early should increase without excluding any reading. Second, if for an input item no entry is found in the HPSG lexicon, we automatically create a default entry, based on the part-of-speech of the preferred reading. This increases robustness, while avoiding an increase in ambiguity.</Paragraph> <Paragraph position="2"> Named Entity Recognition Writing HPSG grammars for the whole range of NE expressions is a tedious and not very promising task. NEs typically vary across text sorts and domains, and would require modularized subgrammars that can be easily exchanged without interfering with the general core.</Paragraph> <Paragraph position="3"> This can only be realized by using a type interface where a class of named entities is encoded by a general HPSG type which expands to a feature structure used in parsing. We exploit such a type interface for coupling shallow and deep processing. The classes of named entities delivered by shallow processing are mapped to HPSG types. However, some fine-tuning is required whenever deep and shallow processing differ in the amount of input material they assign to a named entity.</Paragraph> <Paragraph position="4"> An alternative strategy is used for complex syntactic phrases containing NEs, e.g., PPs describing time spans. It is based on ideas from Explanation-based Learning (EBL, see (Tadepalli and Natarajan, 1996)) for natural language analysis, where analysis trees are retrieved on the basis of the surface string. In our case, the part-of-speech sequence of NEs recognized by shallow analysis is used to retrieve pre-built feature structures. These structures are produced by extracting NEs from a corpus and processing them directly with the deep component. If a correct analysis is delivered, the lexical parts of the analysis, which are specific to the input item, are deleted. We obtain a skeletal analysis which is underspecified with respect to the concrete input items. The part-of-speech sequence of the original input forms the access key for this structure. In the application phase, the underspecified feature structure is retrieved and the empty slots for the input items are filled on the basis of the concrete input.</Paragraph> <Paragraph position="5"> The advantage of this approach lies in the more elaborate semantics of the resulting feature structures for DNLP, while avoiding the necessity of adding each and every name to the HPSG lexicon. Instead, good coverage and high precision can be achieved using prototypical entries.</Paragraph>
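To make the retrieval step concrete, the following is a minimal Python sketch of an EBL-style template cache keyed by POS sequences. The class and function names, the dictionary-based stand-in for feature structures, and the slot-filling scheme are illustrative assumptions, not the actual WHITEBOARD implementation, which operates on HPSG feature structures.

```python
# Minimal sketch of EBL-style retrieval of skeletal NE analyses.
# Plain dicts stand in for HPSG feature structures; all names are illustrative.

from dataclasses import dataclass, field

@dataclass
class SkeletalAnalysis:
    """A pre-built analysis whose input-specific (lexical) slots are left open."""
    structure: dict
    open_slots: list = field(default_factory=list)   # paths of the empty lexical slots

    def instantiate(self, tokens):
        """Fill the open lexical slots with the concrete input tokens."""
        filled = dict(self.structure)
        for path, token in zip(self.open_slots, tokens):
            filled[path] = token
        return filled

# Training phase: NEs extracted from a corpus are parsed by the deep component;
# successful analyses are stripped of their lexical material and indexed by the
# POS sequence of the original input.
template_cache = {}

def store(pos_sequence, skeleton):
    template_cache[tuple(pos_sequence)] = skeleton

# Application phase: the POS sequence of an NE recognized by shallow analysis
# is the access key; the retrieved skeleton is instantiated with the actual words.
def retrieve(pos_sequence, tokens):
    skeleton = template_cache.get(tuple(pos_sequence))
    if skeleton is None:
        return None   # no pre-built analysis; fall back to regular deep parsing
    return skeleton.instantiate(tokens)
```

In this sketch, two date expressions with the same POS sequence retrieve the same skeleton and differ only in the tokens filled into its open slots.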
<Paragraph position="6"> Lexical Semantics When first applying the original VERBMOBIL HPSG grammar to business news articles, the result was that 78.49% of the missing lexical items were nouns (ignoring NEs). In the integrated system, unknown nouns and NEs can be recognized by SPPC, which determines morpho-syntactic information. It is essential for the deep system to associate nouns with their semantic sorts, both for semantics construction and for providing semantically based selectional restrictions that help constrain the search space during deep parsing. GermaNet (Hamp and Feldweg, 1997) is a large lexical database where words are associated with POS information and semantic sorts, which are organized in a fine-grained hierarchy. The HPSG lexicon, on the other hand, is comparatively small and has a more coarse-grained semantic classification.</Paragraph> <Paragraph position="7"> To provide the missing sort information when recovering unknown noun entries via SPPC, an automatically acquired mapping from the GermaNet semantic classification to the HPSG semantic classification (Siegel et al., 2001) is applied. The training material for this learning process consists of those words that are annotated both with semantic sorts in the HPSG lexicon and with synsets in GermaNet. The learning algorithm computes a mapping relevance measure for associating semantic concepts in GermaNet with semantic sorts in the HPSG lexicon. For evaluation, we examined a corpus of 4664 nouns extracted from business news that were not contained in the HPSG lexicon. 2312 of these were known in GermaNet, where they are assigned 2811 senses. With the learned mapping, the GermaNet senses were automatically mapped to HPSG semantic sorts. The evaluation of the mapping accuracy yields promising results: in 76.52% of the cases the computed sort with the highest relevance probability was correct, and in a further 20.70% of the cases the correct sort was among the first three sorts.</Paragraph>
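As an illustration of the acquisition and application steps, the sketch below learns a synset-to-sort mapping from words annotated in both resources and uses it to rank sorts for unknown nouns. The simple relative-frequency score is a stand-in for the mapping relevance measure of (Siegel et al., 2001); all function names and data structures are assumptions made for this example.

```python
# Sketch: learn a GermaNet-synset -> HPSG-sort mapping from co-annotated words,
# then propose sorts for nouns unknown to the HPSG lexicon.

from collections import Counter, defaultdict

def learn_mapping(training_words, germanet_synsets, hpsg_sorts):
    """training_words: words annotated in both resources;
    germanet_synsets[w] -> synsets of w, hpsg_sorts[w] -> HPSG sort of w."""
    counts = defaultdict(Counter)
    for word in training_words:
        for synset in germanet_synsets[word]:
            counts[synset][hpsg_sorts[word]] += 1
    # relevance(sort | synset), here estimated by simple relative frequency
    return {synset: {sort: n / sum(c.values()) for sort, n in c.items()}
            for synset, c in counts.items()}

def propose_sorts(noun, germanet_synsets, mapping, n_best=3):
    """Rank HPSG semantic sorts for a noun unknown to the HPSG lexicon."""
    scores = Counter()
    for synset in germanet_synsets.get(noun, ()):
        for sort, relevance in mapping.get(synset, {}).items():
            scores[sort] += relevance
    return [sort for sort, _ in scores.most_common(n_best)]
```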
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Integration on Phrasal Level </SectionTitle> <Paragraph position="0"> In the previous paragraphs we described strategies for the integration of shallow and deep processing where the focus is on improving DNLP in the domain of lexical and sub-phrasal coverage.</Paragraph> <Paragraph position="1"> We can conceive of more advanced strategies for the integration of shallow and deep analysis at the level of phrasal syntax, by guiding the deep syntactic parser towards a partial pre-partitioning of complex sentences provided by shallow analysis systems. This strategy can reduce the search space and enhance the parsing efficiency of DNLP.</Paragraph> <Paragraph position="2"> Stochastic Topological Parsing The traditional syntactic model of topological fields divides basic clauses into distinct fields: so-called pre-, middle- and post-fields, delimited by verbal or sentential markers. This topological model of German clause structure is underspecified or partial as to non-sentential constituent boundaries, but provides a linguistically well-motivated and theory-neutral macrostructure for complex sentences. Due to its linguistic underpinning, the topological model provides a pre-partitioning of complex sentences that is (i) highly compatible with deep syntactic structures and (ii) maximally effective in increasing parsing efficiency. At the same time, (iii) its partiality regarding the constituency of non-sentential material ensures the important aspects of robustness, coverage, and processing efficiency.</Paragraph> <Paragraph position="3"> In (Becker and Frank, 2002) we present a corpus-driven stochastic topological parser for German, based on a topological restructuring of the NEGRA corpus (Brants et al., 1999). For the topological treebank conversion we build on methods and results in (Frank, 2001). The stochastic topological parser follows the probabilistic model of non-lexicalised PCFGs (Charniak, 1996). Due to the abstraction from constituency decisions at the sub-sentential level, and the essentially POS-driven nature of topological structure, this rather simple probabilistic model yields surprisingly high figures of accuracy and coverage (see Fig.2, which reports coverage, complete match, LP, LR, and crossing-bracket figures by sentence length, and (Becker and Frank, 2002) for more detail), while context-free parsing guarantees efficient processing.</Paragraph>
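To illustrate the kind of model involved, the toy grammar below parses a POS-tag sequence into topological fields with NLTK's Viterbi PCFG parser. The field inventory, rules, probabilities, and example tag sequence are invented for this sketch and do not reproduce the grammar actually induced from the restructured NEGRA treebank.

```python
from nltk import PCFG
from nltk.parse import ViterbiParser

# Toy non-lexicalised PCFG over topological fields (VF = pre-field,
# LK = left bracket, MF = middle field, VC = verbal complex).
# Rules and probabilities are invented for illustration.
toy_grammar = PCFG.fromstring("""
    CL -> VF LK MF [0.7] | VF LK MF VC [0.3]
    VF -> NP [1.0]
    LK -> 'VVFIN' [0.6] | 'VAFIN' [0.4]
    MF -> NP [0.5] | NP NP [0.5]
    VC -> 'VVPP' [0.5] | 'VVINF' [0.5]
    NP -> 'NE' [0.3] | 'ART' 'NN' [0.5] | 'PPER' [0.2]
""")

parser = ViterbiParser(toy_grammar)

# STTS part-of-speech tags of a verb-second clause such as
# "Hopp uebernimmt die Abteilung" (Hopp takes over the department).
pos_tags = ['NE', 'VVFIN', 'ART', 'NN']

for tree in parser.parse(pos_tags):
    tree.pretty_print()       # most probable topological bracketing
    print(tree.prob())        # its probability under the toy model
```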
<Paragraph position="4"> The next step is to elaborate a (partial) mapping of shallow topological and deep syntactic structures that is maximally effective for preference-guided deep syntactic analysis, and thus for efficiency improvements in deep syntactic processing. Such a mapping is illustrated for a verb-second clause in Fig.3, where matching constituents of topological and deep-syntactic phrase structure are indicated by circled nodes. With this mapping defined for all sentence types, we can proceed to the technical aspects of integration into the WHITEBOARD architecture and XML text chart, as well as preference-driven HPSG analysis in the PET system.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> An evaluation has been started using the NEGRA corpus, which contains about 20,000 newspaper sentences. The main objectives are to evaluate the syntactic coverage of the German HPSG on newspaper text and the benefits of integrating deep and shallow analysis. The sentences of the corpus were used in their original form, without stripping, e.g., parenthesized insertions.</Paragraph> <Paragraph position="1"> We extended the HPSG lexicon semi-automatically from about 10,000 to 35,000 stems, which roughly corresponds to 350,000 full forms. Then, we checked the lexical coverage of the deep system on the whole corpus, which resulted in 28.6% of the sentences being fully lexically analyzed. The corresponding experiment with the integrated system yielded an improved lexical coverage of 71.4%, due to the techniques described in section 3. This increase is not achieved by manual extension, but only through synergy between the deep and shallow components.</Paragraph> <Paragraph position="2"> To test the syntactic coverage, we processed the subset of the corpus that was fully covered lexically (5878 sentences) with deep analysis only. The results are shown in table 4 in the second column. In order to evaluate the integrated system we processed 20,568 sentences from the corpus without further extension of the HPSG lexicon (see table 4, third column). About 10% of the sentences that were successfully parsed by deep analysis only could not be parsed by the integrated system, and the number of analyses per sentence dropped from 16.2 to 8.6, which indicates a problem in the morphology interface of the integrated system. We expect better overall results once this problem is removed.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Applications </SectionTitle> <Paragraph position="0"> Since typed feature structures (TFS) in Whiteboard serve as both a representation and an interchange format, we developed a Java package (JTFS) that implements the data structures, together with the necessary operations. These include a lazy-copying unifier, a subsumption and equivalence test, deep copying, iterators, etc. JTFS supports a dynamic construction of typed feature structures, which is important for information extraction.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Information Extraction </SectionTitle> <Paragraph position="0"> Information extraction in Whiteboard benefits both from the integration of the shallow and deep analysis results and from their processing methods. We chose management succession as our application domain. Two sets of template filling rules are defined: pattern-based and unification-based rules.</Paragraph> <Paragraph position="1"> The pattern-based rules work directly on the output delivered by the shallow analysis. For example, one such rule matches expressions like (1) Nachfolger von Helmut Kohl ('successor of Helmut Kohl'), which contains the two string tokens Nachfolger and von followed by a person name, and fills the slot person_out with the recognized person name Helmut Kohl. The pattern-based grammar yields good results for the recognition of local relationships as in (1). The unification-based rules are applied to the deep analysis results. Given the fine-grained syntactic and semantic analysis of the HPSG grammar and its robustness (through SNLP integration), we decided to use the semantic representation (MRS, see (Copestake et al., 2001)) as additional input for IE. The reason is that MRSs express precise relationships between the chunks, in particular in constructions involving (combinations of) free word order, long distance dependencies, control and raising, or passive, which are very difficult, if not impossible, to recognize for a pattern-based grammar. E.g., the short sentence (2) illustrates a combination of free word order, control, and passive. The subject of the passive verb wurde gebeten is located in the middle field and is at the same time the subject of the infinitive verb zu übernehmen. A deep (HPSG) analysis can recognize these dependencies quite easily, whereas a pattern-based grammar cannot determine, e.g., for which verb Peter Miscke or Dietmar Hopp is the subject.</Paragraph> <Paragraph position="2"> (2) Peter Miscke following was Dietmar Hopp asked, the development sector to take over. &quot;According to Peter Miscke, Dietmar Hopp was asked to take over the development sector.&quot;</Paragraph> <Paragraph position="3"> We employ typed feature structures (TFS) as our modelling language for the definition of scenario template types and template element types. Therefore, the template filling results from shallow and deep analysis can be uniformly encoded in TFS.</Paragraph>
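The sketch below illustrates the idea of merging partially filled succession templates by a unification-like operation. Plain Python dictionaries stand in for typed feature structures, and the slot names (person_in, position, etc.) and values are hypothetical; the actual system encodes templates in JTFS and merges them with genuine TFS unification.

```python
# Sketch: merging partially filled scenario templates by a unification-like
# operation. Dicts stand in for typed feature structures; slot names are
# hypothetical and not taken from the paper's rule definitions.

FAIL = object()   # sentinel signalling unification failure

def unify(a, b):
    """Tiny structure unification: dicts unify feature-wise, atomic
    values unify only if they are equal (no types, no reentrancies)."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is FAIL:
                    return FAIL
                result[feature] = merged
            else:
                result[feature] = value
        return result
    return a if a == b else FAIL

# Partial template produced by the pattern-based (shallow) rules ...
shallow_template = {"type": "succession", "person_in": "Dietmar Hopp"}
# ... and by a unification-based rule over the deep (MRS) analysis.
deep_template = {"type": "succession",
                 "person_in": "Dietmar Hopp",
                 "position": "development sector"}

merged = unify(shallow_template, deep_template)
print(merged)   # one consolidated succession template, or FAIL on conflicting slots
```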
<Paragraph position="4"> As a side effect, we can easily adapt JTFS unification for the template merging task, by interpreting the partially filled templates from deep and shallow analysis as constraints. E.g., to extract the relevant information from the above sentence, a unification-based rule of this kind can be applied.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Language checking </SectionTitle> <Paragraph position="0"> Another area where DNLP can support existing shallow-only tools is grammar and controlled language checking. Due to the sparse distribution of true errors (Becker et al., to appear), there is a high a priori probability of false alarms. As the number of false alarms determines user acceptance, precision is of utmost importance and cannot easily be traded for recall. Current controlled language checking systems for German, such as MULTILINT (http://www.iai.uni-sb.de/en/multien.html) or FLAG (http://flag.dfki.de), build exclusively on SNLP: while checking of local errors (e.g., NP-internal agreement, prepositional case) can be performed quite reliably by such a system, error types involving non-local dependencies or access to grammatical functions are much harder to detect. The use of DNLP in this area is confronted with several systematic problems: first, formal grammars are not always available, e.g., in the case of controlled languages; second, erroneous sentences lie outside the language defined by the competence grammar; and third, due to the sparse distribution of errors, a DNLP system will spend most of its time parsing perfectly well-formed sentences. Using an integrated approach, a shallow checker can be used to cheaply identify initial error candidates, while false alarms can be eliminated based on the richer annotations provided by the deep parser.</Paragraph> </Section> </Section> </Paper>