<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1049">
  <Title>Lean Formalisms, Linguistic Theory, and Applications. Grammar Development in ALEP.</Title>
  <Section position="3" start_page="286" end_page="287" type="metho">
    <SectionTitle>
3 Text Handling
</SectionTitle>
    <Paragraph position="0"> The ALEP platform provides a TH component which allows &amp;quot;pre-processing&amp;quot; of inputs, it converts a number of formats among them 'Latex'.</Paragraph>
    <Paragraph position="1"> Then the ASCII text first goes through SGML-based tagging: Convertion to an EDIF (Eurotra Document Interchange Format) format, then paragraph recognition, sentence recognition and word recognition. The output of these processes consists in the tagging of the recognized elements: 'P' for paragraphs, 'S' for sentences, 'W' for words (in case of morphological analysis, the tag 'M' is provided for morphemes) and 'PT' for punctuation signs as exelnplified in (1).</Paragraph>
    <Paragraph position="2">  In the default case, this is the information which is input to the TH-LS component (Text-Handling to Linguistic Structure) component. ALEP provides a facility (tsls-rules) which allows the grammar writer to identify information which is to flow from the TtI to the linguistic processes. We will show how this facility can be used for all efficient and consistent treatnlent of all kinds of 'messy details'.</Paragraph>
    <Paragraph position="3"> The TH component of the ALEP platform also foresees the integration of user-defined tags. The tag (USR) is used if the text is tagged by a user-defined tagger. An example of an integration of a user-defined tagger between the sentence recognition level and the word recognition level of the TH tool is given below.</Paragraph>
    <Paragraph position="4"> The tagger for 'messy details' has been integrated into the German grammar and has been adapted for the following patterns:  1 'Quantities' (a cardinal number followed by an amount and a currency name (e.g. &amp;quot;16,7 Millionen Dollar&amp;quot;)) 2 Percentages (12%, 12 Prozent, zwSlf Prozent) 3 Dates (20. Januar 1996) 4 Acronyms and abbreviations ('Dasa', 'CDU', 'GmbH', etc.).</Paragraph>
    <Paragraph position="5"> 5 Prepositional contractions (zum, zur, am etc.) * Appositions: Prof. Dr. Robin Cooper  We will examplify the technique for 'quantities'. Recursive patterns are described in the programruing language 'awk'. (2) defines cardinal numbers (in letters).</Paragraph>
    <Paragraph position="7"> On the basis of these variables other variables can be defined such as in (3).</Paragraph>
    <Paragraph position="8">  (3) range = &amp;quot;(&amp;quot;number&amp;quot;l&amp;quot;card&amp;quot;)&amp;quot; amount =&amp;quot; (&amp;quot;Millionen&amp;quot; I&amp;quot;Milliaxden&amp;quot;) currency=&amp;quot; (&amp;quot;Mark&amp;quot;, &amp;quot;DM&amp;quot;, &amp;quot;Dollar&amp;quot;)&amp;quot; curmeasure=&amp;quot; (&amp;quot;amount&amp;quot;??&amp;quot;currency&amp;quot; ?)&amp;quot; quantity =&amp;quot; (&amp;quot;range .... curmeasure&amp;quot;)&amp;quot;  The following inputs are automatically recognized  This treatment of regular expressions also means a significant improvement of efficiency because there is only one string whereas the original input consisted of five items (&amp;quot;vierzig bis ffinfzig Milharden  Dollar&amp;quot;): &amp;quot;vierzig_bis_fuenfzig_Milliarden_Dollar&amp;quot;. (4) gives an exalnple for information flow from TH to linguistic structure: (4) id:{  spec =&gt; spec:{ lu =&gt; TYPE}, sign =&gt; sign:{ string =&gt; string: { first =&gt; \[ STRING I REST\], rest =&gt; REST}, synsem =&gt; synsem:{ syn =&gt; SYN =&gt; syn:{ constype =&gt; morphol:{ lemma =&gt; VAL, rain =&gt; yes } } } } }, 'USR',\['TYPE' =&gt; TYPE, 'VAL' =&gt; VAL\], STRING ).</Paragraph>
    <Paragraph position="9">  The feature 'TYPE' bears the variable TYPE (in our case: &amp;quot;quantities&amp;quot;). The feature 'VAL' represents the original input (e.g.: &amp;quot;ffinfzig Milliarden Dollar&amp;quot;) and the variable STRING represents the output string of the tagged input (in this case: &amp;quot;fuenfzig_Milharden_Dollar&amp;quot;). This value is coshared with the value of the &amp;quot;string&amp;quot; feature of the lexicon entry. The definition of such generic entries in the lexicon allows to keep the lexicon smaller but also to deal with a potentially infinite number of words.</Paragraph>
    <Paragraph position="10"> These strategies were extended to the other phenomena. The TH component represents a pre-processing component which covers a substantial part of what occurs in real texts.</Paragraph>
  </Section>
  <Section position="4" start_page="287" end_page="289" type="metho">
    <SectionTitle>
4 The Linguistic Modules
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="287" end_page="288" type="sub_section">
      <SectionTitle>
4.1 Two Level Morphology (TLM)
</SectionTitle>
      <Paragraph position="0"> The TLM component deals with most major morphographemic variations occurring in German, such as umlautung, ss-fl alternation, schwa-instability.</Paragraph>
      <Paragraph position="1"> We will introduce this component by way of exemplification. (5) represents the treatment of'e'-'i' umlautung as occurring in German verbs like 'gebe', 'gibst', referring to (Trostg0).</Paragraph>
      <Paragraph position="2"> An ALEP TL rule comes as a four to five place PROLOG term, the first argument being the rule name, the second a TL description, the third (represented by the anonymous variable ) a specifier feature structure, the fourth a typed feature structure constraining the application of the rule and a fifth allowing for linking variables to predeflned character sets.</Paragraph>
      <Paragraph position="3">  syn\] cons phenol \[umlaut none&amp;i The morphologically relevant information is encoded in 'cons'. It contains two features, 'lemma' which encodes the abstract nlorpheme with a capital 'E' (this is the basis for a treatment according to ((Trost90)) and the feature 'phenol' which encodes phonologically relevant information, in this case whether ulnlautung is available or not.</Paragraph>
      <Paragraph position="4"> The values for 'phenol' is a boolean conjunction from two sets: sl ={none, no, yes} and s2 = {e, i\].</Paragraph>
      <Paragraph position="5"> The treatment consists of mapping a surface 'e' to a lexical 'E' in case the constraint which is expressed  as a feature structure in the fourth argument holds. It says that for the first rule 'e' is nrapped on 'E' if the feature 'umlaut' has a value which is 'no'. This applies to (6). This handles cases such as 'geb-e'. The second rule maps 'i' 'E' if 'umlaut' has the value 'yes &amp; i'. This also holds in cases like 'gib-st'. One would expect according to (Trost90) that only two values 'no' and 'yes' are used. The change has been done for merely esthetic reasons. The '2nd pers sing' morpheme 'st' e.g. requires an 'umlaut = yes' stenr, if the stem is capable of ulnlautung, at all which is the case for 'gibst'. In case the stem cannot have umlautung (as for 'kommst') 'st' also attaches. This makes that uon-unflautung stems have to be left unspecified for umlautung, as otherwise 'st' could not attach. 'st' can now be encoded fox' 'umlaut = no'.</Paragraph>
    </Section>
    <Section position="2" start_page="288" end_page="289" type="sub_section">
      <SectionTitle>
4.2 Lexicon
</SectionTitle>
      <Paragraph position="0"> f,exical information is distributed over three lexicons: null  The distribution of information over three lexicons has a sinrple reason, namely avoiding lexical anrbiguities at places where they cannot be resolved or where they have inrpact on eificiency. So, e.g. the verbal suffix 't' has lots of interpretations: '3rd pets sing', 2nd pers pl', preterite and more. These ambiguities are NOT introduced in the TLM lexicon as the only effect would be that a great nunlber of scgmentations would go into syntactic analysis. Only then these ambiguities could be resolved on the basis of morphotactic intorlnation. A similar situation holds on syntactic level. There is no point in nmltiplying syntactic entries by their semantic ambiguities and make all of these entries available for analysis. It would result in a desaster ibr efficiency. Semantic reading distinctions thus are put into the (semantic) refinement lexicon. We would like to introduce lexical information for the preposition 'in' by way of illustration.</Paragraph>
      <Paragraph position="1"> TL-Entry for 'in': (7) 'string \[inl_ \] ,,,od,,,,,,,,ut ,,ojjjj The nrorphological information is encoded in the 'cons' feature.</Paragraph>
      <Paragraph position="2"> Analysls-Entries for 'in': Prepositions \]nay occur in 'strongly bound PPs where they are functional elenrents (semantically empty, but syntactically relevant). This is encoded in (8). A PP headed by a functor cannot be an adjunct (rood=none). The head-dtr chosen by 'in' is an NPacc. The mother of such a construction also comes out as NPacc which is encoded in 'projects'. The major reason for such a treatment lies in the fact that it allows for a unified treatment of all functional elements like inflectional affixes, complenrentizers, auxiliaries, infinitival zu, functional prepositions etc..).</Paragraph>
      <Paragraph position="4"> \[. \[rood oo et L suDca~ func \[projects NPacc (9) is the entry for 'in' as a head of a PP subcategorizing for an NPacc.</Paragraph>
      <Paragraph position="6"> Semantic entries for qrlh Prepositions need (semantically) different entries depending on whether the.p heads a PP which is a conlplentent or &amp;it adjunct.</Paragraph>
      <Paragraph position="7"> qn' as complement: l subj &lt; &gt; The content of a PP is a relational psoa. 'in' as Adjunct:  The preposition puts its content on a restriction list. It composes the restriction list with the restriction list of the modified item. Quants and the psoa are copied.</Paragraph>
    </Section>
    <Section position="3" start_page="289" end_page="289" type="sub_section">
      <SectionTitle>
4.3 Word Structure and Phrase Structure (PS)
</SectionTitle>
      <Paragraph position="0"> (PS) Both the word structure and the phrase structure conlponent are based on the same snlall set of binary schenlata closely related to HPSG. In the systenl described here they exist as nlacros and they are spelt out in category-specific word and phrase structure rules. (Efficiency is the major reason, as underspecified syntax rules are very inefficient). Such a schema is e.g. the following head-compschema. null  Head information is propagated from head-dtr to mother, so is semantic information. 'subact' informarion is structured slightly differently as in HPSG to allow for one innovation wrt HPSG which is our treatment of functional elements.</Paragraph>
      <Paragraph position="1">  HEAD I BASE deg\[- -....deg oj\] Functor: l F &amp;quot;EAD II The functor macro is highly general. It shows the functor treatment applied in the German grammar, namely that the functor selects its head-dtr, combines itself with the head-dtr and projects the mother. More specifically: The functor-dtr, indicated by the value 'funct' of the attribute 'subcat' shares the value of the attribute 'selects' with the 'synsem' value of the head-dtr and its 'projects' value with the 'syn' attribute of the mother. The 'head' value is shared between head-dtr and mother, the 'base' value in addition between head-dtr and functor. The subcategorization is shared between head-dtr and mother.</Paragraph>
      <Paragraph position="2"> The difference to the head-comp schema is that head information comes from the functor, also the semantics. 'subcat' is inherited from the head-dtr. The powerful mechanism comes in by the subcatfeature of the functor which allows for a very detailed specification of the information to be projected. null The PS component covers all categories, especially all clausal categories (main clauses, subordinate clauses, relatives), NPs, APs, ADVPs, PPs.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML