<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2062">
  <Title>GF Parallel Resource Grammars and Russian</Title>
  <Section position="4" start_page="475" end_page="476" type="metho">
    <SectionTitle>
2 Word Classes
</SectionTitle>
    <Paragraph position="0"> Every resource grammar starts with a description of word classes. Their names belong to the language-independent API, although their implementations are language-specific. Russian fits quite well into the common API here, since like all other languages it has nouns, verbs, adjectives etc. The type system for word classes of a language is the most stable part of the resource grammar library, since it follows traditional linguistic descriptions (Shelyakin, 2000; Wade, 2000; Starostin, 2005). For example, let us look at the implementation of the Russian adjective type AdjDegree:</Paragraph>
    <Paragraph position="2"> First, we need to specify the parameters (param) on which inflection forms depend. A vertical bar (|) separates different parameter values. While in English the only parameter would be comparison degree (Degree), in Russian we have many more parameters: * Case, for example: bol'xie doma - bol'xih domov (big houses - big houses').</Paragraph>
    <Paragraph position="3"> * Animacy only plays a role in the accusative case (Acc), in the masculine (Masc) singular (ASingular) and in plural forms (APlural): the accusative animate form is the same as the genitive (Gen) form, while the accusative inanimate form is the same as the nominative (Nom): Ya lyublyu bol'xie doma - ya lyublyu bol'xih muzhqin (I love big houses - I love big men).</Paragraph>
    <Paragraph position="4"> * Gender only plays a role in the singular: bol'xoj dom - bol'xaya maxina (big house - big car). The plural never makes a gender distinction; thus, gender and number are combined in the GenNum parameter to avoid redundant inflection table items. The possible values of GenNum are ASingular Masc, ASingular Fem, ASingular Neut and APlural.</Paragraph>
    <Paragraph position="5"> * Number, for instance: bol'xoj dom - bol'xie doma (a big house - big houses).</Paragraph>
    <Paragraph position="6"> * Degree can be more complex, since most Russian adjectives have two comparative (Comp) forms, a declinable attributive one and an indeclinable predicative one: bolee vysokij (more high) - vyxe (higher), and more than one superlative (Super) form: samyj vysokij (the most high) - naivysxij (the highest).</Paragraph>
    <Paragraph position="7"> Yet another parameter can be added, since Russian adjectives in the positive (Pos) degree have long and short forms: spokojnaya reka (the calm river) - reka - spokojna (the river is calm). The short form has no case declension; thus, it can be considered an additional case (Starostin, 2005). Note that although the predicative usage of the long form is perfectly grammatical, it can have a slightly different meaning compared to the short form. For example: long, predicative on - bol'noj ("he is crazy") vs. short, predicative on - bolen ("he is ill"). An oper judgement combines the name of the defined operation, its type, and an expression defining it. The type for degree adjectives (AdjDegree) is a table of strings (s: .. =&gt; .. =&gt; Str) that has two main dimensions: Degree and AdjForm, where the latter is a combination of the parameters listed above. The reason for having the Degree parameter as a separate dimension is that a special adjective type Adj, which has only positive forms, is useful. It includes both non-degree adjective classes: possessive, like mamin (mother's), lisij (fox's), and relative, like russkij (Russian).</Paragraph>
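    <Paragraph> The parameter system described above can be pictured as a nested lookup table. The following Python sketch is only an illustration and not the GF source: it models a table of the shape Degree =&gt; AdjForm =&gt; Str as nested dictionaries and counts its slots. The case names Dat, Inst and Prepos, the treatment of AdjForm as a full cross product, and the helper names are assumptions; short forms and the second comparative form are left out.

```python
from itertools import product

# Parameter values named in the text. Dat, Inst and Prepos are assumed names
# for the three cases the text does not spell out.
DEGREES = ("Pos", "Comp", "Super")
CASES = ("Nom", "Gen", "Dat", "Acc", "Inst", "Prepos")
ANIMACY = ("Animate", "Inanimate")
# Gender and number are merged: the plural carries no gender.
GEN_NUM = (("ASingular", "Masc"), ("ASingular", "Fem"),
           ("ASingular", "Neut"), ("APlural",))

# AdjForm modelled as every combination of Case, Animacy and GenNum.
ADJ_FORMS = tuple(product(CASES, ANIMACY, GEN_NUM))


def empty_adj_degree():
    """Build an empty table shaped like Degree, then AdjForm, then string."""
    return {degree: {form: "" for form in ADJ_FORMS} for degree in DEGREES}


def table_size(adj):
    """Count the string slots stored for one adjective."""
    return sum(len(forms) for forms in adj.values())


big = empty_adj_degree()
plural = ("APlural",)
# Plural forms quoted in the text: nominative bol'xie, genitive bol'xih.
for animacy in ANIMACY:
    big["Pos"][("Nom", animacy, plural)] = "bol'xie"
    big["Pos"][("Gen", animacy, plural)] = "bol'xih"
# Animacy rule: accusative animate = genitive, accusative inanimate = nominative.
big["Pos"][("Acc", "Animate", plural)] = big["Pos"][("Gen", "Animate", plural)]
big["Pos"][("Acc", "Inanimate", plural)] = big["Pos"][("Nom", "Inanimate", plural)]

print(len(ADJ_FORMS))    # 6 cases * 2 animacy values * 4 GenNum values = 48
print(table_size(big))   # 3 degrees * 48 forms = 144
```

With separate Gender and Number parameters the same product would have 6 * 2 * 3 * 2 = 72 forms per degree instead of 48, which illustrates why merging them into GenNum keeps the table smaller.</Paragraph>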
    <Paragraph position="8"> As a part of the language-independent API, the name AdjDegree denotes the adjective degree type for all languages, although each language has its own implementation. Maintaining parallelism among languages is rather straightforward at this stage, since the only thing shared is the name of a part of speech. A possible complication is that parsing with inflectionally rich languages can be less efficient than with, for instance, English. This is because in GF all forms of a word are kept in the same declension table, which is convenient for generation, since GF is a generation-oriented grammar formalism. Therefore, the more forms there are, the bigger the tables we have to store in memory, which can become an issue as the grammars grow and more languages are added (Dada and Ranta, 2006).</Paragraph>
  </Section>
  <Section position="5" start_page="476" end_page="477" type="metho">
    <SectionTitle>
3 Inflection Paradigms and Lexicon
</SectionTitle>
    <Paragraph position="0"> Besides word class declarations, morphology modules also contain functions defining common inflectional patterns (paradigms) and a lexicon.</Paragraph>
    <Paragraph position="1"> This information is language-specific, so fitting into the common API is not a consideration here.</Paragraph>
    <Paragraph position="2"> Paradigms are used to build the lexicon incrementally as new words are used in applications. A lexicon can also be extracted from other sources.</Paragraph>
    <Paragraph position="3"> Unlike syntactic descriptions, morphological descriptions for many languages have already been developed in other projects. Thus, considerable effort can be saved by reusing existing code. How easily we can perform the transformation depends on how similar the input and output formats are. For example, the Swedish morphology module is generated automatically from the code of another project, called Functional Morphology (Forsberg and Ranta, 2004). In this case the formats are very similar, so extraction is rather straightforward. However, this might not be the case if we build the lexicon from a very different representation or even from corpora, where post-modification by hand is simply inevitable.</Paragraph>
    <Paragraph position="4"> A paradigm function usually takes one or more string arguments and forms a lexical entry. For example, the function nGolova describes the inflectional pattern for feminine inanimate nouns ending with -a in Russian. It takes the basic form of a word as a string (Str) and returns a noun (CN stands for Common Noun, see definition in section 4). Six cases times two numbers gives twelve forms, plus two inherent parameters Animacy and Gender (defined in section 2):</Paragraph>
    <Paragraph position="6"> where \golova is a λ-abstraction, which means that the function argument of the type Str will be denoted as golova in the definition. The construction let...in is used to extract the word stem (golov), in this case by cutting off the last letter (init). Of course, one could supply the stem directly; however, it is easier for the grammarian to just write the whole word without worrying about what stem it has and let the function take care of the stem automatically. The table structure is simple: each line corresponds to one parameter value. The sign =&gt; separates parameter values from the corresponding inflection forms. The plus sign denotes string concatenation.</Paragraph>
    <Paragraph position="7"> The type signature (nGolova: Str -&gt; CN) and perhaps a comment stating that the paradigm describes feminine inanimate nouns ending with -a are the only things the grammarian needs to know in order to use the function nGolova. Implementation details (the inflection table) are hidden. The name nGolova is actually a transliteration of the Russian word golova (head) that represents nouns conforming to the pattern. Therefore, the grammarian can just compare a new word to the word golova in order to decide whether nGolova is appropriate.</Paragraph>
    <Paragraph position="8"> For example, we can define the word mashina (maxina) corresponding to the English word car.</Paragraph>
    <Paragraph position="9"> Maxina is a feminine, inanimate noun ending with -a. Therefore, a new lexical entry for the word maxina can be defined by: oper mashina = nGolova "maxina" ; Access via type signature becomes especially helpful with more complex parts of speech like verbs.</Paragraph>
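    <Paragraph> To make the idea of a paradigm function concrete, here is a minimal Python sketch of something shaped like nGolova. It is not the GF definition: the endings follow the regular feminine -a declension written in ASCII transliteration, and the key names (Sg, Pl, Dat, Inst, Prepos) and field names (s, g, anim) are assumptions chosen for this illustration.

```python
def n_golova(golova):
    """Sketch of a paradigm like nGolova (feminine inanimate nouns in -a)."""
    golov = golova[:-1]  # drop the last letter to get the stem, as init does
    endings = {
        ("Sg", "Nom"): "a", ("Sg", "Gen"): "y", ("Sg", "Dat"): "e",
        ("Sg", "Acc"): "u", ("Sg", "Inst"): "oj", ("Sg", "Prepos"): "e",
        ("Pl", "Nom"): "y", ("Pl", "Gen"): "", ("Pl", "Dat"): "am",
        ("Pl", "Acc"): "y", ("Pl", "Inst"): "ami", ("Pl", "Prepos"): "ah",
    }
    return {
        # twelve forms: stem + ending for each (number, case) pair
        "s": {key: golov + ending for key, ending in endings.items()},
        # inherent parameters of the noun
        "g": "Fem",
        "anim": "Inanimate",
    }


mashina = n_golova("maxina")
print(mashina["s"][("Sg", "Acc")])   # maxinu
print(mashina["s"][("Pl", "Gen")])   # maxin
```

The caller supplies only the dictionary form of the word; the stem and the twelve forms are derived inside the function, which mirrors the point that the inflection table stays hidden behind the type signature.</Paragraph>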
    <Paragraph position="10"> Lexicon and inflectional paradigms are language-specific, although an attempt to build a general-purpose interlingua lexicon in GF has been made. A multilingual dictionary can work for words denoting unique objects like the sun etc., but otherwise a common lexicon interface does not sound like a very good idea, or at least not like something one would want to start with. Normally, multilingual dictionaries have bilingual organization (Kellogg, 2005).</Paragraph>
    <Paragraph position="11"> At the moment the resource grammar has an interlingua dictionary for so-called closed word classes like pronouns, prepositions, conjunctions and numerals. But even there, a number of discrepancies occur. For example, the impersonal pronoun one (OnePron) has no direct correspondence in Russian. Instead, to express the same meaning Russian uses the infinitive: esli oqen' zahotet', mozhno v kosmos uletet' (if one really wants, one can fly into space). Note that the modal verb can is transformed into the adverb mozhno (it is possible). The closest pronoun to one is the personal pronoun ty (you), which is omitted in the final sentence: esli oqen' zahoqex', mozhex' v kosmos uletet'. The Russian implementation of OnePron uses the latter construction, skipping the string (s), but preserving the number (n), person (p) and animacy (anim) parameters, which are necessary for agreement:</Paragraph>
    <Paragraph position="13"/>
  </Section>
  <Section position="6" start_page="477" end_page="479" type="metho">
    <SectionTitle>
4 Syntax
</SectionTitle>
    <Paragraph position="0"> Syntax modules describe rules for combining words into phrases and sentences. Designing a language-independent syntax API is the most difficult part: several revisions have been made as the resource coverage has grown. Russian is very different from the other resource languages; therefore, it sometimes fits poorly into the common API.</Paragraph>
    <Paragraph position="1"> Several factors have influenced the API structure so far: application domains, parsing algorithms and supported languages. In general, the resource syntax is built bottom-up, starting with rules for forming noun phrases and verb phrases, continuing with relative clauses, questions, imperatives, and coordination. Some textual and dialogue features might be added, such as contrasting, topicalization, and question-answer relations. On the way from dictionary entries towards complete sentences, categories lose declension forms and, consequently, get more parameters that "memorize" what forms are kept, which is necessary to arrange agreement later on. Closer to the end of the journey, string fields get longer as types contain more complex phrases, while parameters are used for agreement and then left behind. Sentence types are the ultimate types that just contain one string and no parameters, since everything is decided and agreed on by that point.</Paragraph>
    <Paragraph position="2"> Let us take a look at Russian nouns as an example. The noun lexicon entry type (CN) mentioned in section 3 is defined as follows:  As we have seen in section 3, the string table field s contains twelve forms. On the other hand, to use a noun in a sentence we need only one form and several parameters for agreement. Thus, the ultimate noun type to be used in a sentence as an object or a subject looks more like Noun Phrase  in section 2), while the table field s only contains six forms: one for each Case value.</Paragraph>
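    <Paragraph> The contrast between a twelve-form lexicon entry and a six-form noun phrase can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function fix_number is hypothetical and not part of the resource API, the person value P3 and the field names n and p are assumed, and the forms of maxina are written out by hand in ASCII transliteration.

```python
# A common-noun entry with twelve forms, written out by hand for maxina (car).
MAXINA_CN = {
    "s": {
        ("Sg", "Nom"): "maxina", ("Sg", "Gen"): "maxiny", ("Sg", "Dat"): "maxine",
        ("Sg", "Acc"): "maxinu", ("Sg", "Inst"): "maxinoj", ("Sg", "Prepos"): "maxine",
        ("Pl", "Nom"): "maxiny", ("Pl", "Gen"): "maxin", ("Pl", "Dat"): "maxinam",
        ("Pl", "Acc"): "maxiny", ("Pl", "Inst"): "maxinami", ("Pl", "Prepos"): "maxinah",
    },
    "g": "Fem",
    "anim": "Inanimate",
}


def fix_number(cn, number):
    """Turn a twelve-form noun into a six-form phrase by choosing its number."""
    return {
        # keep only the forms of the chosen number, indexed by case alone
        "s": {case: form for (num, case), form in cn["s"].items() if num == number},
        # the chosen number is remembered as a parameter for later agreement
        "n": number,
        "p": "P3",
        "g": cn["g"],
        "anim": cn["anim"],
    }


cars = fix_number(MAXINA_CN, "Pl")
print(len(cars["s"]))     # 6
print(cars["s"]["Gen"])   # maxin
print(cars["n"])          # Pl
```

The number dimension disappears from the string table and reappears as a stored parameter, which is the trade described above: fewer forms, more parameters that record what was chosen.</Paragraph>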
    <Paragraph position="3"> The transition from CN to NP can be done via various intermediate types. A noun can get modifiers like adjectives - krasnaya komnata (the red room), determiners - mnogo xuma (much ado), genitive constructions - geroj naxego vremeni (a hero of our time), relative phrases - qelovek, kotoryj smeyotsya (the man who laughs). Thus, the string field (s) can eventually contain more than one word. A noun can become a part of other phrases, e.g. a predicate in a verb phrase - znanie - sila (knowledge is power) or a complement in a prepositional phrase - za rekoj, v teni derev'ev (across the river and into the trees).</Paragraph>
    <Paragraph position="4"> The language-independent API has a hierarchy of intermediate types all the way from dictionary entries to sentences. All supported languages follow this structure, although in some cases this does not happen naturally. For example, the division between definite and indefinite noun phrases is not relevant for Russian, since Russian does not have any articles, although it is an important distinction for nouns in many European languages. The common API contains functions supporting such a division, which are all conflated into one in the Russian implementation. This is a simple case, where Russian easily fits into the common API, although a corresponding phenomenon does not really exist.</Paragraph>
    <Paragraph position="5"> Sometimes, a problem does not arise until the joining point, where agreement has to be made.</Paragraph>
    <Paragraph position="6"> For instance, in Russian, numeral modification uses different noun forms to form a noun phrase in the nominative case: tri tovariwa (three comrades), where the noun takes the genitive singular form, but pyat' tovariwej (five comrades), where the noun takes the genitive plural form. Two solutions are possible. An extra non-linguistic parameter bearing the semantics of a numeral can be included in the Numeral type.</Paragraph>
    <Paragraph position="7"> Alternatively, an extra argument (NumberVal), denoting the actual number value, can be introduced into the numeral modification function (IndefNumNP) to tell apart numbers with the last digit between 2 and 4 from other natural numbers:</Paragraph>
    <Paragraph position="9"> Unfortunately, this would require changing the language-independent API (adding the NumberVal argument) and consequent adjustments in all other languages that do not need this information. Note that IndefNumNP, Numeral, CN (Common Noun) and NP (Noun Phrase) belong to the language-independent API, i.e. they have different implementations in different languages. We prefer the encapsulation version (the extra parameter inside the Numeral type), since the other option would make the function more error-prone.</Paragraph>
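    <Paragraph> The encapsulation option can be illustrated with a short Python sketch. It is a toy model, not the GF code: the numeral carries a hidden class computed from its value, so the modification function keeps its two-argument shape. The names number_val, make_numeral and the class labels Few and Other are assumptions, and the rule is the last-digit rule exactly as stated in the text, so it does not handle 1 or the numbers 12 to 14.

```python
def number_val(n):
    """Classify a number by its last digit, following the rule stated in the text."""
    return "Few" if n % 10 in (2, 3, 4) else "Other"


def make_numeral(word, value):
    """A numeral that carries its own (non-linguistic) number class."""
    return {"s": word, "val": number_val(value)}


# The two noun forms quoted in the text, keyed by the numeral class that selects them.
TOVARIW = {"Few": "tovariwa", "Other": "tovariwej"}


def indef_num_np(numeral, noun_forms):
    """Combine a numeral with the noun form that its hidden class requires."""
    return numeral["s"] + " " + noun_forms[numeral["val"]]


TRI = make_numeral("tri", 3)
PYAT = make_numeral("pyat'", 5)
print(indef_num_np(TRI, TOVARIW))    # tri tovariwa
print(indef_num_np(PYAT, TOVARIW))   # pyat' tovariwej
```

Because the class lives inside the numeral, callers in other languages never see it; with the alternative design every call site would have to pass a matching number value by hand.</Paragraph>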
    <Paragraph position="10"> Nevertheless, one can argue for both solutions, which is rather typical when designing a common interface. One has to decide what should be kept language-specific and what belongs to the language-independent API. Often this decision is more or less a matter of taste. Since Russian is not the main language in the GF resource library, the tendency is to keep things language-specific at least until the common API becomes too restrictive for a representative number of languages. The example above demonstrates a syntactic construction that exists both in the language-independent API and in Russian, although the common version is not as universal as expected. There are also cases where Russian structures are not present in the common interface at all, since there is no direct analogy in other supported languages.</Paragraph>
    <Paragraph position="11"> For instance, a short adjective form is used in phrases like mne nuzhna pomow' (I need help) and ej interesno iskusstvo (she is interested in art). In Russian, the expressions do not have any verb, so they sound like to me needed help and to her interesting art, respectively. Here is the function predShortAdj describing such adjective predication specific to Russian:</Paragraph>
    <Paragraph position="13"> predShortAdj takes three arguments: a non-degree adjective (Adj) and two noun phrases (NP) that work as a predicate, a subject and an object in the returned sentence (S). The third line indicates that the arguments will be denoted as Needed, I and Help, respectively (λ-abstraction). The sentence type (S) only contains one string field (s). The construction let...in is used to first form the individual words (toMe, needed and help) and later put them into a sentence. Each word is produced by taking appropriate forms from the inflection tables of the corresponding arguments (Needed.s, Help.s and I.s). From the noun arguments I and Help, the dative and nominative cases, respectively, are taken (the !-sign denotes the selection operation). The adjective Needed agrees with the noun Help, so Help's gender (g) and number (n) are used to build an appropriate adjective form (AF Short Help.g Help.n). This is exactly where we finally use the parameters from the Help argument of the type NP defined above.</Paragraph>
    <Paragraph position="14"> We only use the declension tables from the arguments I and Needed; the other parameters are just thrown away. Note that predShortAdj uses the type Adj for non-degree adjectives instead of AdjDegree presented in section 2. We also use the Short adjective form as an extra Case value.</Paragraph>
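    <Paragraph> The selection and agreement steps described for predShortAdj can be mimicked in a few lines of Python. This is a hedged sketch rather than the GF function: the records below are cut down to the fields the description mentions, the short forms of the adjective are listed by hand in transliteration, and the names pred_short_adj, NEEDED, I_NP and HELP_NP are made up for the illustration.

```python
# Short forms of the adjective, indexed by the gender and number of the noun
# they agree with. Only the short forms are listed in this sketch.
NEEDED = {"s": {
    ("Short", "Masc", "Sg"): "nuzhen",
    ("Short", "Fem", "Sg"): "nuzhna",
    ("Short", "Neut", "Sg"): "nuzhno",
    ("Short", "Masc", "Pl"): "nuzhny",
    ("Short", "Fem", "Pl"): "nuzhny",
    ("Short", "Neut", "Pl"): "nuzhny",
}}
# Noun phrases: a small case table plus agreement parameters.
I_NP = {"s": {"Nom": "ya", "Dat": "mne"}, "n": "Sg", "p": "P1"}
HELP_NP = {"s": {"Nom": "pomow'", "Dat": "pomowi"}, "g": "Fem", "n": "Sg"}


def pred_short_adj(needed, i, help_):
    """Build a sentence such as: mne nuzhna pomow' (to-me needed help)."""
    to_me = i["s"]["Dat"]                                   # subject in the dative
    adj = needed["s"][("Short", help_["g"], help_["n"])]    # agrees with the object
    obj = help_["s"]["Nom"]                                 # object in the nominative
    return {"s": " ".join([to_me, adj, obj])}               # a sentence is one string


print(pred_short_adj(NEEDED, I_NP, HELP_NP)["s"])   # mne nuzhna pomow'
```

Only the gender and number of the object are consulted; the parameters of the subject are ignored, and the result keeps nothing but a single string.</Paragraph>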
  </Section>
  <Section position="7" start_page="479" end_page="480" type="metho">
    <SectionTitle>
5 An Example Application Grammar
</SectionTitle>
    <Paragraph position="0"> The purpose of the example is to show similarities between the same grammar written for different languages using the resource library. Such similarities increase the reuse of previously written code across languages: once written for one language, a grammar can be ported to another language relatively easily and quickly. The more language-independent API functions (names conventionally starting with a capital letter) a grammar contains, the more efficient the porting becomes.</Paragraph>
    <Paragraph position="1"> We will consider a fragment of Health - a small phrase-book grammar written using the resource grammar library in English, French, Italian, Swedish and Russian. It can form phrases like she has a cold and she needs a painkiller. The following categories (cat) and functions (fun) constitute the language-independent abstract syntax (domain semantics):  Abstract syntax determines the class of statements we are able to build with the grammar. The category Prop denotes complete propositions like she has a cold. We also have separate categories of smaller units like Patient, Medicine and Condition. To produce a proposition one can, for instance, use the function BeInCondition, which takes two arguments of the types Patient and Condition and returns a result of the type Prop. For example, we can form the phrase she has a cold by combining three of the functions above:</Paragraph>
    <Section position="1" start_page="479" end_page="480" type="sub_section">
      <SectionTitle>
BeInCondition
ShePatient CatchCold
</SectionTitle>
      <Paragraph position="0"> where ShePatient and CatchCold are constants used as arguments to the function BeInCondition.</Paragraph>
      <Paragraph position="1"> Concrete syntax translates abstract syntax into natural language strings. Thus, concrete syntax is language-specific. However, having the language-independent resource API helps to make even a part of concrete syntax shared among the languages:</Paragraph>
      <Paragraph position="3"> The first group (lincat) states that the semantic categories Patient, Condition, Medicine and Prop are expressed by the resource linguistic categories noun phrase (NP), verb phrase (VP), common noun (CN) and sentence (S), respectively. The second group (lin) states that the function And is the same as the resource coordination function ConjS, the function ShePatient is expressed by the resource pronoun SheNP and the function BeInCondition is expressed by the resource function PredVP (the classic NP VP-&gt;S rule). Exactly the same rules work for all five languages, which makes the porting trivial. However, this is not always the case.</Paragraph>
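      <Paragraph> The split between shared rules and language-specific entries can be sketched in Python. This is a loose analogy and not GF: abstract trees are tuples, the shared rule is a plain string concatenation standing in for a predication function, and the two lexicons hold illustrative strings written for this sketch (the Russian entry is an assumption in ASCII transliteration, not taken from the Health grammar).

```python
# Abstract syntax tree for: BeInCondition ShePatient CatchCold
TREE = ("BeInCondition", "ShePatient", "CatchCold")


def pred_vp(np, vp):
    """Stand-in for a shared predication rule (noun phrase + verb phrase)."""
    return np + " " + vp


# Rules shared by every language in this sketch.
SHARED_LIN = {"BeInCondition": pred_vp}

# Language-specific constants. These strings are illustrative entries written
# for this sketch, not taken from the Health grammar.
ENGLISH = {"ShePatient": "she", "CatchCold": "has a cold"}
RUSSIAN = {"ShePatient": "ona", "CatchCold": "prostuzhena"}


def linearize(tree, lexicon):
    """Map an abstract tree to a string using shared rules plus one lexicon."""
    if isinstance(tree, str):
        return lexicon[tree]
    fun, *args = tree
    return SHARED_LIN[fun](*(linearize(arg, lexicon) for arg in args))


print(linearize(TREE, ENGLISH))   # she has a cold
print(linearize(TREE, RUSSIAN))   # ona prostuzhena
```

In the resource library the shared functions operate on typed records with inflection tables rather than on plain strings, so this sketch only shows the division of labour: one tree, one set of shared rules, and a small language-specific part per language.</Paragraph>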
      <Paragraph position="4"> Writing even a small grammar in an inflectionally rich language like Russian requires a lot of work on morphology. This is the part where using the resource grammar library may help, since resource functions for adding new lexical entries are relatively easy to use. For instance, the word painkiller is defined similarly in five languages by taking a corresponding basic word form as an argument to an inflection paradigm function:</Paragraph>
      <Paragraph position="6"/>
      <Paragraph position="8"> The Gender parameter (Neut) is provided for Swedish.</Paragraph>
      <Paragraph position="9"> In the remaining functions we see bigger differences: the idiomatic expression I have a cold is formed by adjective predication in French, Swedish and Russian, while a transitive verb construction is used in English and Italian. Therefore, different functions (PosA and PosTV) are applied. tvHave and tvAvere denote the transitive verb to have in English and Italian, respectively. IndefOneNP is used for forming an indefinite noun phrase from a noun in English and Italian:  In the next example the Russian version is rather different from the other languages. The phrase I need a painkiller is a transitive verb predication together with a complementation rule in English and Swedish. In French and Italian we need to use the idiomatic expressions avoir besoin and aver bisogno. Therefore, a classic NP VP rule (PredVP) is used. In Russian the same meaning is expressed by using the adjective predication defined in section 4:  Note that the medicine argument (med) is used with an indefinite article in the English version (IndefOneNP), but without articles in Swedish, French and Italian. As we have mentioned in section 4, Russian does not have any articles, although the corresponding operations exist for the sake of consistency with the language-independent API.</Paragraph>
      <Paragraph position="10"> The Health grammar shows that the more similar the languages are, the easier the porting will be. However, as with traditional translation, the grammarian needs to know the target language, since it is not clear whether a particular construction is correct in both languages, especially when the languages seem to be very similar in general.</Paragraph>
    </Section>
  </Section>
</Paper>