<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1049">
  <Title>Lean Formalisms, Linguistic Theory, and Applications. Grammar Development in ALEP.</Title>
  <Section position="2" start_page="0" end_page="286" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Applications based on unification-based grammars (UGs) have so far been rare, to say the least.</Paragraph>
    <Paragraph position="1"> Though the advantages of UGs are obvious, in that properties such as monotonicity, declarativity and perspicuity are important for maintaining and easily extending grammars, their popularity (despite 15 years of history) is still restricted to academia.</Paragraph>
    <Paragraph position="2"> This paper reports on a project, LS-GRAM, which tried to take a further step in bringing UGs closer to applications.</Paragraph>
    <Paragraph position="3"> Application-oriented grammar development has to take into account the following parameters: [...] led to the decision to use a so-called 'lean' formalism, ALEP, providing efficient term unification. 'Leanness' means that computationally expensive formal constructs are sacrificed to gain efficiency. Though this comes at the cost of expressiveness, it is claimed that 'linguistic felicity' does not suffer from 'leanness'.</Paragraph>
    <Paragraph position="4"> * Coverage: Most grammar development projects are not based on an investigation of real texts, but start from 'the linguist's textbook'. This is different here, in that a corpus-based approach to grammar development has been adopted. It implements the simple principle that if a grammar is supposed to cover real texts, the coverage of these texts has to be determined first. There was a corpus investigation at the beginning, in the course of which tools were used and developed that allow for automatic and semi-automatic determination of linguistic phenomena.</Paragraph>
    <Paragraph position="5"> * Completeness: All modules needed, from text handling to semantics, had to be developed.</Paragraph>
    <Paragraph position="6"> This is why this paper does not focus on one single topic, but tries to present the major achievements of the system as a whole. The paper reports on a text handling component, Two Level morphology, word structure, phrase structure, semantics and, very importantly, the interaction of these components.</Paragraph>
    <Paragraph position="7"> * Mainstream approach: Nonetheless, the approach we adopted claims to be mainstream and is very much indebted to HPSG, and is thus based on the currently most prominent and recent linguistic theory.</Paragraph>
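    The efficient term unification mentioned in the parameter list above can be illustrated with a minimal sketch. It is not ALEP code: the term encoding (uppercase strings as variables, tuples as compound terms) and the names is_var, walk, occurs and unify are assumptions made only for this illustration.

```python
# Hypothetical term encoding (not ALEP syntax):
#   variables      : strings starting with an uppercase letter, e.g. "Num"
#   atoms          : any other string, e.g. "sg"
#   compound terms : tuples of the form (functor, arg1, arg2, ...)

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def walk(t, subst):
    # Follow bindings until a non-variable or an unbound variable is reached.
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def occurs(v, t, subst):
    # Occurs check: does variable v appear inside term t?
    t = walk(t, subst)
    if t == v:
        return True
    if isinstance(t, tuple):
        return any(occurs(v, arg, subst) for arg in t[1:])
    return False

def unify(a, b, subst=None):
    """Return an extended substitution dict, or None if the terms clash."""
    subst = {} if subst is None else subst
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return None if occurs(a, b, subst) else {**subst, a: b}
    if is_var(b):
        return None if occurs(b, a, subst) else {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple):
        # Fixed functor and fixed arity: no disjunction, no set values.
        if a[0] != b[0] or len(a) != len(b):
            return None
        for x, y in zip(a[1:], b[1:]):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

# Two agreement terms that each leave one value open.
print(unify(("agr", "Num", "third"), ("agr", "sg", "Per")))
# Two agreement terms with clashing number values.
print(unify(("agr", "sg"), ("agr", "pl")))
```

    In this sketch every term has a fixed functor and a fixed number of arguments, so unification only walks the two terms in parallel. This is one plausible reading of why a formalism restricted to term unification can be processed efficiently; the constructs that ALEP actually omits are not listed in this section.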
    <Paragraph position="8"> The relation (and tension) between these parameters is the topic of this paper.</Paragraph>
    <Paragraph position="9"> First, we will show how a corpus investigation established the basis for the coverage; second, how various phenomena determined by the corpus investigation are treated in text handling (TH); third, what the linguistic modules, Two Level Morphology (TLM), word and phrase structure, and the lexicons look like. The last section is devoted to the efficiency and performance of the system. Figures are given which show that the system is not so far from real applications. 2</Paragraph>
    <Paragraph position="11"> The project started with a corpus investigation. The corpus consisted of 150 newspaper articles from the German 'Die ZEIT'. They are descriptive texts from the domain of economics. They were analysed automatically by the (non-statistical) 'MPRO' tagger.</Paragraph>
    <Paragraph position="12"> 'MPRO' attaches rich linguistic information to the words. In addition, 'MPRO' provides a built-in facility for resolving categorial ambiguities on the basis of homograph reduction and a facility for handling unknown words, which are written to a file. Encoding the missing stems (which were very few) ensured complete tagging of the corpus.</Paragraph>
    <Paragraph position="13"> 'MPRO' also provides a facility for searching syntactic structures in corpora. A detailed analysis of the internal structure of main clauses, subordinate clauses, verbal clusters, clausal topoi (e.g. structure of Vorfeld and Nachfeld), NPs, PPs, APs, CARDPs, coordinate structures, and the occurrence of expletives, pronominals and negation in the corpus was made, which then guided grammar development. Another major result of the corpus investigation was that most sentences contain so-called 'messy details': brackets, figures, dates, proper names, appositions. Most sentences contain compounds.</Paragraph>
    <Paragraph position="14"> In general, most of the known linguistic phenomena occur in all known variations. Description has to be done in great detail (all frames, all syntactic realizations of frames). (Long-distance) discontinuities popular in theoretical linguistics did not play a role. In order to give a 'general flavour' of the corpus investigation, one noteworthy result should be reported: 25% of the words occur in NPs of the structure [ DET (A) (A) N ]. But 'A' and 'N' are of a complex and unexpected nature:</Paragraph>
    <Paragraph position="16"> pounding, including names and abbreviations).</Paragraph>
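    The NP statistic reported above can be made concrete with a minimal sketch that measures how many tokens of a tagged text fall inside NPs of the shape DET (A) (A) N. The one-letter tag codes and the function names np_spans and np_word_share are assumptions for illustration only; they are not the tag set or the search facility of 'MPRO', and the sketch treats A and N as atomic tags although the corpus showed them to be internally complex.

```python
import re

# Hypothetical one-letter tag codes (not the MPRO tag set):
#   D = determiner, A = adjective, N = noun,
#   V = verb, P = preposition, O = anything else
NP_PATTERN = re.compile(r"DA?A?N")  # DET (A) (A) N

def np_spans(tags):
    """Return (start, end) token offsets of non-overlapping pattern matches."""
    tag_string = "".join(tags)  # one character per token
    return [(m.start(), m.end()) for m in NP_PATTERN.finditer(tag_string)]

def np_word_share(tags):
    """Fraction of all tokens that lie inside a DET (A) (A) N match."""
    if not tags:
        return 0.0
    covered = sum(end - start for start, end in np_spans(tags))
    return covered / len(tags)

# Toy tag sequence, invented for this example.
tags = list("DANVDNPDAAN")
print(np_spans(tags))
print(np_word_share(tags))
```

    Encoding each tag as a single character lets an ordinary regular expression stand in for a structure search over the tagged corpus. A real tag set with multi-character tags or rich feature bundles would need a different encoding.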
    <Paragraph position="17"> The corpus investigation guided the grammar development. Among other things, it showed the necessity of developing a TH component and of separating out specific phenomena from the treatment in the grammar. (This was also necessary from an efficiency point of view.)</Paragraph>
    <Paragraph position="18"> 2 It should be mentioned that we are referring to the German grammar built in the LS-GRAM project. For other languages, similar systems exist.</Paragraph>
  </Section>
</Paper>