<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0908">
  <Title>C/n G o C/ Q W</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Grammar Sharing
</SectionTitle>
    <Paragraph position="0"> A broad-coverage multilingual NLP system such as the one currently being developed at Microsoft Research faces the challenge of parallel grammar development in multiple languages (currently English, French, Spanish, German, Chinese, Japanese, and Korean). This development is by nature a very complex and time-consuming task. In addition, the design of the overall NLP system has to be well suited to be readily portable to languages other than the one the development started with (English in our case). For these reasons, few groups have succeeded at the challenge of multilingual NLP.</Paragraph>
    <Paragraph position="1"> 1 This work has benefited from comments and suggestions from other members of the Natural Language Processing group at Microsoft Research. Particular thanks go to Simon Corston, Bill Dolan, Ken Felder, Karen Jensen, Martine Penenaro, Hisami Suzuki, and Lucy Vanderwende.</Paragraph>
    <Paragraph position="2"> One approach to multilingual development is to rely on theoretical concepts such as Universal Grammar. The goal is to create a grammar that can easily be parameterized to handle many languages.</Paragraph>
    <Paragraph position="3"> Wu (1994) describes an effort aimed at accounting for word order variation, but his focus is on the demonstration of a theoretical concept. Kameyama (1988) describes a prototype shared grammar for the syntax of simple nominal expressions for five languages, but the effort covers only the noun phrase, which makes the approach inapplicable to a large-scale effort. Principle-based parsers are also designed with universal grammar in mind (Lin 1994), but have yet to demonstrate large-scale coverage in several languages. Other efforts have been presented in the literature, with a focus on generation (Bateman et al. 1991). An effort to port a grammar of English to French and Spanish is also underway at SRI (Rayner et al. 1996). The approach taken in the MSNLP project focused from the beginning on possibilities for grammar sharing between languages to facilitate grammar development and reduce the development time. We want to stress that our use of the term &quot;grammar sharing&quot; is not to be confused with &quot;code sharing.&quot; Grammar sharing, in our use of the term, simply means that the existing grammar for one language can be used totally or in part to serve as the development basis for a second language.</Paragraph>
    <Paragraph position="4"> In this paper we want to demonstrate that the jumpstart through grammar sharing considerably accelerated grammar development in French, Spanish, and German. We will present test and progress data from all languages to support our claim.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="50" type="metho">
    <SectionTitle>
2 The Microsoft NLP System
</SectionTitle>
    <Paragraph position="0"> The English grammar that we used as our starting point, as well as the target-language grammars that were spawned from it, are sketch grammars.</Paragraph>
    <Paragraph position="1"> Sketch grammars use a computational dictionary containing part-of-speech, morphological, and subcategorization information to yield an initial syntactic analysis (the sketch). The rules used in sketch have no access to any semantic information that would allow the assignment of semantic structure such as case frames or thematic roles. Further analysis proceeds through a stage of reattachment of phrases using both semantic and syntactic information to produce the portrait, then to a first representation of some aspects of meaning, the logical form, and to word-sense disambiguation and higher representations of meaning. In this paper, however, we will restrict our attention to the sketch grammars.</Paragraph>
    <Paragraph position="2"> A bottom-up parallel parsing algorithm is applied to the sketch grammar rules, resulting in one or more analyses for each input string, and defaulting in cases (such as PP attachment) where semantic information is needed at a later stage of the processing (portrait) to give the correct result. Context-sensitive binary rules are used because they have been found necessary for the successful analysis of natural languages (Jensen et al. 1993, pp. 33-35; Jensen 1987, pp. 65-86). Figure 1 gives a template for the rule formalism for a binary rule, in this case a rule that combines a verb phrase with a prepositional phrase to its right.</Paragraph>
    <Paragraph position="3"> Each sentence parse contains syntactic and functional role information. This information is carried through the system in the form of arbitrarily complex attribute-value pairs. The sketch always produces at least one constituent analysis, even for syntactically invalid input, and displays its analyses as parse trees. FITTED parses are obtained when an input string cannot be parsed up to a sentence node (possibly because it is a noun phrase, a sentence fragment, or otherwise deficient). FITTED parses contain as much constituent structure as the grammar could assign to the input string.</Paragraph>
    <Paragraph position="4"> [Figure 1: template of a binary rule that combines a VP with a PP to its right (VPwPPr).] Binary rules deal with the problem of free constituent order, which is significant even in a largely configurational language such as English. A case of free word order in English is the position of adverbials and prepositional phrases.</Paragraph>
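    <Paragraph> The interplay of binary rules, rule conditions, and FITTED parses can be illustrated with a small sketch. The Python below is not G and does not reproduce any MSNLP rule: the toy lexicon, the rule names other than VPwPPr, the single condition shown, the CKY-style chart, and the greedy fitting strategy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    label: str                          # constituent label, e.g. "VP"
    start: int                          # index of the first word covered
    end: int                            # index one past the last word covered
    rule: str = "LEX"                   # name of the rule that built this node
    children: Tuple["Node", ...] = ()   # daughters (empty for lexical nodes)


def always(left: Node, right: Node) -> bool:
    """Default condition: the rule applies whenever the labels match."""
    return True


@dataclass(frozen=True)
class BinaryRule:
    name: str
    left: str
    right: str
    result: str
    condition: Callable[[Node, Node], bool] = always


# Hypothetical toy lexicon and rules; only the name VPwPPr comes from the text.
LEXICON = {"John": "NP", "sleeps": "VP", "in": "PREP", "Paris": "NP"}
RULES = [
    BinaryRule("PPwNP", "PREP", "NP", "PP"),
    BinaryRule("VPwPPr", "VP", "PP", "VP"),
    # Illustrative condition: a sentence must start at the first word.
    BinaryRule("Sent", "NP", "VP", "SENT", lambda l, r: l.start == 0),
]


def parse(words: List[str]) -> Node:
    """Bottom-up chart parse; returns a FITTED node if no SENT spans the input."""
    n = len(words)
    chart: Dict[Tuple[int, int], List[Node]] = {}
    for i, word in enumerate(words):
        chart[(i, i + 1)] = [Node(LEXICON[word], i, i + 1)]
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left in chart.get((start, mid), []):
                    for right in chart.get((mid, end), []):
                        for rule in RULES:
                            if (left.label, right.label) != (rule.left, rule.right):
                                continue
                            if rule.condition(left, right):
                                node = Node(rule.result, start, end, rule.name, (left, right))
                                chart.setdefault((start, end), []).append(node)
    for node in chart.get((0, n), []):
        if node.label == "SENT":
            return node
    # No sentence node: keep the longest constituent found at each position.
    pieces, pos = [], 0
    while pos != n:
        end = max(e for (s, e) in chart if s == pos)
        pieces.append(chart[(pos, end)][0])
        pos = end
    return Node("FITTED", 0, n, "FITTED", tuple(pieces))
```

With this toy grammar, parse("John sleeps in Paris".split()) returns a SENT node whose VP daughter was built by VPwPPr, while parse("in Paris".split()) returns a FITTED node that wraps a single PP.</Paragraph>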
    <Paragraph position="5"> Two types of trees are available (Figure 2). One strictly follows the derivational history of the parse, and is therefore binary-branching. In the binary tree the names of the rules that have produced a node are displayed to the right of that node. The second (which is used in later processing because it accords better with our intuitive understanding of many structures) is n-ary branching, or &quot;flattened,&quot; and is computed from a small set of syntactic attributes of the analysis record. The * indicates the head of the phrase.</Paragraph>
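    <Paragraph> One way to picture the relation between the two tree types is sketched below in Python. The text does not say which attributes of the analysis record drive the flattening, so this sketch assumes a hypothetical head_side attribute on every binary node and collapses each chain of same-label head projections into one n-ary node; the rule names VPwAVPr and PPwNP are likewise invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tree:
    label: str
    word: str = ""                    # set on leaves only
    rule: str = ""                    # rule that built a binary node
    left: Optional["Tree"] = None
    right: Optional["Tree"] = None
    head_side: str = ""               # "left" or "right" on binary nodes


def flatten(node: Tree, is_head: bool = False) -> list:
    """Return [label, daughter, ...]; the head daughter's label ends in '*'."""
    label = node.label + ("*" if is_head else "")
    if node.left is None:             # leaf: label plus the word itself
        return [label, node.word]
    pre: List[list] = []              # daughters to the left of the head
    post: List[list] = []             # daughters to the right of the head
    cur = node
    # Walk down the head path while the projection keeps the same label.
    while cur.left is not None and cur.label == node.label:
        if cur.head_side == "left":
            post.insert(0, flatten(cur.right))
            cur = cur.left
        else:
            pre.append(flatten(cur.left))
            cur = cur.right
    return [label] + pre + [flatten(cur, is_head=True)] + post


# Binary derivation of "sleeps soundly in Paris" (toy analysis).
verb = Tree("VERB", word="sleeps")
adverb = Tree("AVP", word="soundly")
pp = Tree("PP", rule="PPwNP", left=Tree("PREP", word="in"),
          right=Tree("NP", word="Paris"), head_side="left")
inner = Tree("VP", rule="VPwAVPr", left=verb, right=adverb, head_side="left")
vp = Tree("VP", rule="VPwPPr", left=inner, right=pp, head_side="left")
flat = flatten(vp)
```

Here the two stacked VP nodes of the binary derivation become a single VP with three daughters: the starred verb, the adverbial, and the PP.</Paragraph>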
    <Paragraph position="7"> [Figure 2: binary and flattened parse trees for the sentence &quot;Dice Juan que Madrid es una ciudad hermosa&quot; (&quot;John says that Madrid is a beautiful city&quot;).] The sketch grammar is written in G, a Microsoft-internal programming language that has been specially designed for NLP. The English grammar contains 129 rules, which vary in size from two to 600 lines of code, according to the number of conditions that they contain. The coverage of English is broad. Processing time is rapid: a typical 20-word sentence requires about an eighth of a second on a Pentium 200 MHz machine. The goal of all Natural Language Research and Development at Microsoft Research is to produce a broad-coverage multilingual NLP system that is not tailored to any specific application, but has the potential to be used by any of them. To date, the English system is the foundation of the grammar checker in Word 97. We expect our multilingual technology to be used in as wide a spectrum of applications as possible.</Paragraph>
  </Section>
  <Section position="4" start_page="50" end_page="51" type="metho">
    <SectionTitle>
3 The Development of the French, Spanish and German Grammars
</SectionTitle>
    <Paragraph position="0"> In this section we briefly explain the common strategy of grammatical development in the MSNLP system and give the current status of development for each of these three languages. For each of the three languages under consideration, the development team consists of a lexicographer/morphologist and a grammarian. Grammar work in each language proceeds according to the same rationale: the grammarian processes sentences from diverse text sources and examines the resulting parses. He/she then determines whether the resulting parse is a desirable one. If this is the case, the sentence with the correct parse is added to a regression file. If the parse is incorrect, conditions on grammar rules are modified or added to accommodate the sentence in question and similar constructions. Regression tests are run frequently to ensure that new changes do not affect the performance of the system in any negative way. A debugging tool is available for the linguist to immediately view differences that arise in the processing of the regression file compared to an earlier run. Another important tool enables the grammarian to identify conditions in grammar rules that have been tried during a particular parse, and distinguish those that succeeded from those that failed.</Paragraph>
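    <Paragraph> The regression cycle described above can be sketched as follows. This is an illustration only, not the MSNLP tooling: the representation of the regression file as a mapping from sentence to accepted parse, the bracketed parse strings, and the function names are assumptions; difflib is the Python standard-library module.

```python
import difflib
from typing import Callable, Dict, List, Tuple


def run_regression(accepted: Dict[str, str],
                   parser: Callable[[str], str]) -> List[Tuple[str, str, str]]:
    """Return (sentence, accepted parse, current parse) for every changed parse."""
    changes = []
    for sentence, expected in accepted.items():
        actual = parser(sentence)
        if actual != expected:
            changes.append((sentence, expected, actual))
    return changes


def show_change(expected: str, actual: str) -> str:
    """Render the difference between two parses as a unified diff."""
    diff = difflib.unified_diff(expected.splitlines(), actual.splitlines(),
                                fromfile="accepted", tofile="current",
                                lineterm="")
    return "\n".join(diff)


# Toy regression file: sentences with the parses the grammarian accepted.
accepted = {
    "Juan duerme.": "(SENT (NP Juan) (VP duerme))",
    "Dice Juan.": "(SENT (VP Dice) (NP Juan))",
}
# Stand-in for the parser after a grammar change that broke one sentence.
current = dict(accepted)
current["Dice Juan."] = "(FITTED (VP Dice) (NP Juan))"

for sentence, old, new in run_regression(accepted, current.__getitem__):
    print(sentence)
    print(show_change(old, new))
```

Only sentences whose parse differs from the accepted one are reported, which mirrors the idea of viewing differences against an earlier run rather than rereading every parse.</Paragraph>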
    <Section position="1" start_page="50" end_page="51" type="sub_section">
      <SectionTitle>
3.1 French
</SectionTitle>
      <Paragraph position="0"> Development of the French grammar started in 1995. French grammar work has covered most major constructs including:  The German dictionary has over 140,000 entries. The morphology, which includes wordbreaking, is nearly complete, with 97% word recognition on a 400,000 word corpus.</Paragraph>
      <Paragraph position="1"> Because Spanish and German share the fundamental property of freer constituent order than English, the German grammar has benefited from some of the solutions for this challenge already worked out for Spanish. Grammar sharing between Spanish and German focused mainly on adoption of Spanish code from the binary rules that combine verbs and preceding/following noun phrases.</Paragraph>
    </Section>
    <Section position="2" start_page="51" end_page="51" type="sub_section">
      <SectionTitle>
3.4 Changes from the English Grammar to the Target Grammars
</SectionTitle>
      <Paragraph position="0"> In spite of the numerous areas of divergence between the target grammars and English, we found that the fundamental organization of the grammar changed by as little as 10-19% (see specifics in Table 1). The bulk of the required modifications occurred in the conditions on the rules. Since these conditions are complex, it is difficult to illustrate them fully here. To give one simple example, in French and Spanish, we found it necessary to exclude all NPs that consisted of clitic pronouns from rules that attach modifiers on NPs.</Paragraph>
      <Paragraph position="1"> Few rules had to be added or completely removed from the grammar. For example, bootstrapping the Spanish grammar from an English grammar consisting of 129 rules required that only 13 of the original English rules (10.1%) be deleted, while 10 new rules (7.8%) were introduced.</Paragraph>
      <Paragraph position="2"> [Table 1: ... with respect to the English source grammar.]</Paragraph>
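      <Paragraph> The percentages quoted for Spanish follow directly from the rule counts given in the text, as the short Python check below shows. The resulting size of the Spanish rule inventory is not stated in the text; the figure computed here assumes that the 13 deletions and 10 additions were the only changes to the inventory.

```python
ENGLISH_RULES = 129   # size of the English source grammar (from the text)
DELETED = 13          # English rules dropped for Spanish (from the text)
ADDED = 10            # rules newly introduced for Spanish (from the text)


def share(count: int, total: int = ENGLISH_RULES) -> float:
    """Percentage of the English rule inventory, rounded to one decimal."""
    return round(100 * count / total, 1)


deleted_pct = share(DELETED)
added_pct = share(ADDED)
# Assumption: no other rules were added or removed.
spanish_rules = ENGLISH_RULES - DELETED + ADDED
print(deleted_pct, added_pct, spanish_rules)   # prints: 10.1 7.8 126
```
</Paragraph>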
      <Paragraph position="3"> The new rules were added to accommodate constructions in the target language that are (virtually) non-existent in English. Spanish, for example, added rules to handle nominalized prepositional phrases like el de Juan and nominal uses of infinitives: al verlo. French needed rules to handle present participles introduced by en: en partant, and for sentential constructions like Heureusement qu'il est venu! German added rules for constructions such as postposed genitive NPs (das Buch Peters) and participial VPs premodifying NPs: die dem Mann gegebenen Bücher.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="51" end_page="51" type="metho">
    <SectionTitle>
4 Testing and Progress
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="51" end_page="51" type="sub_section">
      <SectionTitle>
Measurement
</SectionTitle>
      <Paragraph position="0"> Testing NLP systems is known to be a difficult task, and one that is hard to standardize across systems with different aims and different grammatical foundations (see e.g. the discussion in Balkan et al. n.d.). One relatively simple measurement that we found particularly useful for the beginning stages of grammar development is the percentage of non-FITTED parses on a corpus containing sentences from different types of text (news, literature, technical writing etc.).</Paragraph>
      <Paragraph position="1"> In what follows, this corpus for each language will be referred to as a benchmark corpus and coverage refers to the percentage of non-FITTED parses for the benchmark corpus. Sentence length refers to the number of words in the sentence. In testing, the linguist does not examine the output parses obtained from the benchmark corpus, in order to avoid targeting modifications of the grammar towards the particular problems with FITTED parses in the benchmark file. This &quot;blind&quot; test yields a rough measure of the real coverage of the grammar. It should be noted that although some non-FITTED parses may not constitute the desired parse, many FITTED parses yield a largely usable parse which has only failed at the sentence level (see footnote 3). But more important at this point is the fact that our measurement against a benchmark allows us to reliably track progress over time.</Paragraph>
      <Paragraph position="2"> Even though not all of the successfully parsed sentences are guaranteed to have received a desired parse, a stable increase in the percentage of parsed sentences during language-specific grammar work has proven to be a reliable measurement of progress. This is particularly true given that grammar work (as described above) proceeds on the basis of example sentences that come to a large extent from real-life text.</Paragraph>
      <Paragraph position="3"> A factor that influences the coverage considerably is sentence length. In order to assess the relationship between sentence length and the percentage of parsed sentences in a corpus, we use a tool that extracts information from a parsed corpus on the ratio of successfully parsed sentences to FITTED parses depending on sentence length.</Paragraph>
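      <Paragraph> The coverage measure and its breakdown by sentence length can be sketched in a few lines of Python. The definitions of coverage and sentence length follow the text; the representation of a parsed corpus as (sentence, fitted) pairs, the whitespace tokenization, and the bucket size of ten words are assumptions of this sketch and do not describe the actual tool.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (sentence text, True if the sketch produced a FITTED parse)
ParsedSentence = Tuple[str, bool]


def coverage(corpus: List[ParsedSentence]) -> float:
    """Percentage of sentences that received a non-FITTED parse."""
    if not corpus:
        raise ValueError("empty corpus")
    parsed = sum(1 for _, fitted in corpus if not fitted)
    return 100.0 * parsed / len(corpus)


def coverage_by_length(corpus: List[ParsedSentence],
                       bucket_size: int = 10) -> Dict[Tuple[int, int], float]:
    """Coverage per sentence-length bucket, e.g. (1, 10), (11, 20), ..."""
    buckets: Dict[Tuple[int, int], List[ParsedSentence]] = defaultdict(list)
    for sentence, fitted in corpus:
        length = len(sentence.split())   # sentence length = number of words
        low = ((length - 1) // bucket_size) * bucket_size + 1
        buckets[(low, low + bucket_size - 1)].append((sentence, fitted))
    return {bucket: coverage(items) for bucket, items in sorted(buckets.items())}
```

Running both functions on the same benchmark corpus after each round of grammar work gives one overall figure to track over time and a per-length breakdown that shows where longer sentences still end up as FITTED parses.</Paragraph>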
    </Section>
  </Section>
  <Section position="6" start_page="51" end_page="52" type="metho">
    <SectionTitle>
3
</SectionTitle>
    <Paragraph position="0"> Additional testing of considerable magnitude would be required to evaluate &quot;perfect correctness&quot;. This would take us away from development, and provide slower feedback of progress.</Paragraph>
  </Section>
</Paper>