<?xml version="1.0" standalone="yes"?> <Paper uid="E93-1023"> <Title>A Probabilistic Context-free Grammar for Disambiguation in Morphological Parsing</Title> <Section position="3" start_page="183" end_page="186" type="metho"> <SectionTitle> 2 Rule-based disambiguation </SectionTitle> <Paragraph position="0"> Decomposition of the input word is carried out in two successive stages. First, all the possible segmentations of an input word into strings of stems and affixes are generated. Secondly, each segmentation is tested for morpho-syntactic well-formedness. While the well-formedness is tested, the word class is determined.</Paragraph> <Paragraph position="1"> The task of recovering the morphemic segmentation with the help of a morpheme lexicon is very much complicated by the fact that a word can be segmented in more than one way. The number of alternative segmentations for an input word grows with increasing lexicon size, decreasing average length of the lexical elements and increasing average length of the input word. Our lexicon contains 17,087 entries, among which there is a large number of very small inflectional affixes. Furthermore, the input words may be very lengthy, as Dutch compounds are written as one word, and because nominal compounding, for instance, is a highly productive process. The result can be a combinatorial explosion, causing hundreds of segmentations to be generated.</Paragraph> <Paragraph position="2"> In order to restrict ambiguity in the segmentation stage, we employed a number of strategies. First, we made a pragmatic operationalisation of the theoretical notion "morpheme", which is traditionally defined as "the smallest meaningful unit" in word formation: in our lexicon we only listed words and affixes. Along with all simplex words and productive affixes, we listed all the word formations that belong to closed classes, i.e. words which are not formed according to productive word formation processes. Thus, our parser only has to analyse words formed according to productive rules.</Paragraph> <Paragraph position="3"> Secondly, MORPA performs, where available, tests on phonological and phonetic restrictions on the recognition of morphemes in a specific context. The ultimate effect of these tests is that incorrect recognition of highly frequent and very small inflectional suffixes, such as -e, -t, -d, -s, -r, -n, -en or -er, can be prevented in many cases.</Paragraph> <Paragraph position="4"> Finally, MORPA sees to it that words belonging to minor lexical categories (such as determiners, pronouns, conjunctions, etc.) are not recognised as word parts. They never take part in morphological processes. By rejecting these, we prevent the parser from doing work which we know beforehand will be in vain.</Paragraph> <Paragraph position="5"> To illustrate the effect of the segmentation procedure, its output for the noun beneveling (intoxication) is shown in (3)¹. All of the parts in the segmentations under (3) are Dutch morphemes listed in the morpheme lexicon. Because the segmentation procedure analyses the input word into all possible strings of morphemes without any further grammatical knowledge, it generates, along with the one and only plausible segmentation be + nevel + ing (3c), several alternative segmentations. Many of these violate grammatical and/or semantic restrictions.</Paragraph>
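To make the segmentation stage concrete, the following minimal sketch (an editorial illustration, not MORPA's implementation) enumerates all decompositions of an input word over a toy lexicon. The lexicon entries are hypothetical, and the spelling-change rules of footnote 1 are not modelled; the entry nev simply stands in for the spelling variant of the stem neef.

```python
# Editorial sketch of exhaustive lexicon-driven segmentation; the toy lexicon
# is hypothetical and spelling-change rules are not modelled ("nev" stands in
# for the spelling variant of the stem "neef").
from functools import lru_cache

LEXICON = {"be", "nevel", "nev", "eling", "ing"}

def segmentations(word):
    @lru_cache(maxsize=None)
    def seg(i):
        # All ways to segment word[i:] into lexicon entries.
        if i == len(word):
            return [[]]
        results = []
        for j in range(i + 1, len(word) + 1):
            if word[i:j] in LEXICON:
                results.extend([word[i:j]] + rest for rest in seg(j))
        return results
    return seg(0)

for s in segmentations("beneveling"):
    print(" + ".join(s))   # prints both alternative segmentations
```

Even this toy lexicon yields two segmentations for beneveling; with a 17,087-entry lexicon and long compounds, the combinatorial explosion described above follows directly.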
<Paragraph position="6"> In order to filter out ungrammatical segmentations, each segmentation is checked for morpho-syntactic well-formedness with the help of a categorial grammar. Consequently, every segmentation that is not in accordance with the rules of Dutch morphology is rejected by the parser. While checking, the word class of the grammatical segmentations is determined.</Paragraph> <Paragraph position="7"> In accordance with the principles of Categorial Grammar, our parser does not make use of a set of explicitly represented word formation rules. Instead, the morphological subcategorisation information is encoded in the form of category assignments in the lexicon. That is, prefixes have been assigned a category of type A/B, which means that they take a stem of category A on their right-hand side to yield a word of category B². For instance, the prefix be- with category N/V requires a nominal stem to the right to form a verb. Likewise, suffixes of category A\B look for a stem of category A on their left-hand side to yield a word of category B. Thus, the suffix -ing, V\N, requires a verbal stem to the left to form a noun. Free morphemes, such as nevel, are assigned primitive categories, such as V or N³.</Paragraph> <Paragraph position="8"> ¹When segmenting, MORPA takes into account that Dutch word stems, when inflected or used as the base of a derivation, may undergo spelling changes. It would take us too far to go into the spelling rules here, but in (3) the effect of rules such as 'vowel gemination' and 'devoicing of stem-final consonants' shows up. See [Heemskerk and van Heuven, 1993] for more detail.</Paragraph> <Paragraph position="9"> ²Note that in the literature on categorial grammar the notational variant B/A is frequently used.</Paragraph> <Paragraph position="10"> ³Since our parser only accounts for morphological subcategorisation, the set of lexical categories does not equal the set of syntactic categories. For example, all verbs are assigned category V, irrespective of (in)transitivity. The use of syntactic categories would complicate the grammar considerably. See [Dowty, 1979] and [Moortgat, 1987] for a discussion of this matter.</Paragraph> <Paragraph position="11"> In a strictly bottom-up fashion, the parser iteratively attempts to combine two adjacent elements, reducing them in accordance with their categorial specification with the help of three very general reduction laws:

(4) prefixation:   A/B · A → B
    suffixation:   A · A\B → B
    compounding:   A · B → B

For pragmatic reasons, MORPA's rule for compounding is not a categorial rule, but a categorial-like rule: two adjacent stems A B may, according to the Right-Hand Head Rule, be combined into a word of category B⁴. In addition to this general rule for compounding, the grammar contains a small set of rules defining productive compounding. An analysis fails as soon as a string of categories cannot be reduced to one single category.</Paragraph> <Paragraph position="12"> The examples in (5) illustrate how iterative categorial reduction results in a successful parse. The structures show the derivation and the determination of the output category of (3c). Also, the examples in (5) illustrate that, while the categorial grammar filters out many ungrammatical segmentations and derives the word class of the input word, parsing introduces a new type of ambiguity: one segmentation can be assigned more than one structure. The ambiguity in (5) is due to the fact that the morphemes be- (en-) and nevel (mist) can belong to more than one lexical category and as a consequence can be reduced in more than one way. The ambiguity in (5a) and (5b) is spurious in the sense that it does not correlate with a difference in pronunciation or word class assignment. The reduction in (5c) results in an incorrect word class assignment.</Paragraph>
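The reduction laws in (4) are simple enough to state directly as code. The sketch below is an editorial illustration, not MORPA's parser: the string-based category notation and the exhaustive pairwise reduction strategy are assumptions, and MORPA's additional rules for productive compounding are omitted.

```python
# Editorial sketch of bottom-up categorial reduction with the laws in (4).
# Categories are strings: atoms ("N", "V"), prefixes ("N/V"), suffixes ("V\\N").

def reduce_pair(left, right):
    """Try the three reduction laws on two adjacent categories."""
    if "/" in left:                                # prefixation: A/B . A -> B
        a, b = left.split("/")
        if right == a:
            return b
    if "\\" in right:                              # suffixation: A . A\B -> B
        a, b = right.split("\\")
        if left == a:
            return b
    if left.isalpha() and right.isalpha():         # compounding: A . B -> B
        return right                               # (Right-Hand Head Rule)
    return None

def top_categories(cats):
    """All categories the string can be reduced to (may repeat across orders)."""
    if len(cats) == 1:
        return cats[:]
    results = []
    for i in range(len(cats) - 1):
        combined = reduce_pair(cats[i], cats[i + 1])
        if combined is not None:
            results.extend(top_categories(cats[:i] + [combined] + cats[i + 2:]))
    return results

# be (N/V) + nevel (N) + ing (V\N) reduces to a noun:
print(top_categories(["N/V", "N", "V\\N"]))   # ['N']
```

Running this once per combination of lexical category assignments reproduces the structural ambiguity discussed next: distinct reduction orders and distinct category assignments can yield the same, or different, top categories.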
<Paragraph position="13"> Because the word syntax as such is not restrictive enough, it was supplemented with a component which heavily restrains the parser in building structures. This component, which is inspired by Lexical Phonology, imposes an ordering on the attachment of affixes and stems. Consequently, it restricts the type or the complexity of the stem that an affix or other stem may attach to. Rejection of structures can result in avoiding incorrect word class assignments and in the rejection of incorrect segmentations.</Paragraph> <Paragraph position="14"> In Lexical Phonology, the interaction between stress behaviour and affix order is explained. [Chomsky and Halle, 1968] distinguished two classes of suffixes with different stress properties, and [Siegel, 1979] observed that this distinction correlates with the order in which the suffixes attach. Over the years, theoretical linguists have become sceptical of these "level theories", because of the so-called "bracketing paradoxes", i.e. constructions in which two distinct constituent structures (for instance a morphological and a phonological one) have to be assigned to a word⁵. Despite the occurrence of bracketing paradoxes, however, the claims on level-ordered morphology following from these theories are highly interesting: in checking the morphological claims which follow from one of the theories that have been developed for Dutch, [van Beurden, 1987], against a large database containing approximately 123,000 Dutch words, relatively few counter-examples were found.</Paragraph> <Paragraph position="15"> Van Beurden claims that affix order does not depend on stress properties, but on categorial properties. Thus, the major characteristic of this model is that each attachment level is associated with a specific lexical output category. The model seems particularly suitable for use in MORPA, because it is easy to integrate with our categorial parser. The model implemented in MORPA, shown in (6), is an extension of Van Beurden's model in a way which is consistent with its basic assumptions⁶.</Paragraph> <Paragraph position="16"> On the basis of this model, the Dutch vocabulary can be divided into four levels. Each of the levels in (6) may be viewed as a possible successive stage in word formation. The first level, or lexical level, comprises the lexicon of simplex words, affixes and irregular formations. This level also contains all (borrowed) Romance words. The elements of this lexical level may be successively developed on the second level, on which V(erbal)-morphology takes place; the third level, on which A(djectival)-morphology takes place; and the fourth level, on which N(ominal)-morphology takes place. The name of the level indicates the resulting word class. Each of these levels preserves the possibility for suffixation, compounding and prefixation. On the levels for V-morphology and A-morphology each of these processes may take place only once. We assume that only the processes on the N-morphology level are recursive, i.e. may take place more than once (see [Heemskerk, 1989] for more details).</Paragraph> <Paragraph position="17"> ⁶In van Beurden's model each categorial level has a phonological level associated with it. As we are mainly interested in the morphological aspects, we leave the phonological claims for what they are: within SPRAAKMAKER, MORPA and MORPHON (the phonological module) are autonomous modules, and as MORPA precedes MORPHON, any interaction between the two systems is one way.</Paragraph>
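The core of the level module can be compressed into a single check, sketched below as an editorial illustration: the category-to-level mapping is assumed, and the bookkeeping that allows each process only once on the V- and A-levels is omitted. The idea is that the sequence of output categories produced by successive attachments may never fall back to a lower level.

```python
# Editorial sketch of the level-ordering constraint: V-morphology precedes
# A-morphology precedes N-morphology, so the levels of the categories produced
# by successive attachments may never decrease. (The restriction that each
# process applies only once on the V- and A-levels is omitted here.)
LEVEL = {"V": 1, "A": 2, "N": 3}

def level_ok(output_categories):
    """output_categories: category produced by each successive attachment."""
    levels = [LEVEL[c] for c in output_categories]
    return all(a <= b for a, b in zip(levels, levels[1:]))

# onverdraagzaamheid: ver- (V), -zaam (A), on- (A), -heid (N) -> accepted
print(level_ok(["V", "A", "A", "N"]))   # True
# (5c): the V-level prefix be- attached after the N-level suffix -ing -> rejected
print(level_ok(["N", "V"]))             # False
```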
<Paragraph position="18"> The model correctly predicts the derivation of the word onverdraagzaamheid (intolerance). As shown in (7), first verbal prefixation yields the verbal stem verdraag (tolerate), then adjectival suffixation yields the adjective verdraagzaam (tolerant), adjectival prefixation yields the adjective onverdraagzaam (intolerant) and, finally, nominal suffixation yields the noun onverdraagzaamheid. Also, the level module rules out the analysis in (5c): the nominal suffix -ing must not be attached before the verbal prefix be-. Therefore the word cannot be analysed as a verb.</Paragraph> <Paragraph position="19"> If we return to the example of beneveling, we find that of the six alternative segmentations in (3), only four are accepted by the categorial component. As is shown in (8), one of these segmentations has been assigned a wrong word class. In (8) it is also shown that, as a result of the level ordering, three of the assigned word classes (and matching structures⁷) were rejected. Consequently, two analyses remain.</Paragraph> </Section> <Section position="4" start_page="186" end_page="188" type="metho"> <SectionTitle> 3 Probability-based scoring function </SectionTitle> <Paragraph position="0"> Clearly, the ultimate handling of the remaining ambiguity in (8) demands recourse to semantics and world knowledge. For the large-scale domain we are dealing with, however, we considered it unfeasible to implement semantic and pragmatic constraints. Thanks to the availability of a large annotated corpus, the alternative of constructing a PCFG came within reach. The corpus, being a representative sample of the past or existing vocabulary, is expected to capture implicitly various semantic and pragmatic constraints [Fujisaki et al., 1989; Liberman, 1991]. Empirical estimation of the probability of a parse tree on the basis of the corpus enables us to order the competing analyses along a scale of plausibility and select the "best" parse out of the set of alternatives.</Paragraph> <Paragraph position="1"> A parse tree, such as (5a), is a series of applied production rules⁸. In a context-free grammar it is assumed that the application of a production rule is independent of previously applied rules. In a PCFG, each production rule r is assigned an estimated probability of use, and the probability of the parse tree t is the product of the probabilities of its constituting production rules:

(9) P(t) = ∏_{r ∈ t} P(r)

</Paragraph> <Paragraph position="3"> The probability of each production rule in the grammar has been estimated by means of straightforward counting of appearances in the corpus, resulting in relative frequencies. Let G be any non-terminal symbol of the grammar, n(G) the number of productions rewriting G, f(i, G) the corpus frequency of the ith of these productions, and P(i|G) the probability that the ith of these productions takes place; then

(10) P(i|G) = f(i, G) / Σ_{j=1..n(G)} f(j, G)

It is assumed that for all i = 1, 2, ..., n(G), P(i|G) is a positive number and that Σ_i P(i|G) = 1.</Paragraph> <Paragraph position="4"> ⁷In (8), I abstract from hierarchical structures, since they are irrelevant for pronunciation. Relevant for pronunciation are the morphemic segmentation and word class assignment. Consequently, the structures of (5) are represented as the segmentation be + nevel + ing, which has been assigned two word classes, N and V.</Paragraph>
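The estimation and scoring machinery fits in a few lines. In the sketch below (an editorial illustration: the rule inventory, the counts and the tree representation are all hypothetical), rule probabilities are relative frequencies as in (10), and a tree is scored as the product in (9).

```python
# Editorial sketch of the PCFG score: rule probabilities are relative
# frequencies (10); a tree's probability is the product over its rules (9).
# The rule inventory and counts below are hypothetical.
from collections import Counter

rule_counts = Counter({
    ("w", ("N",)): 60, ("w", ("V",)): 40,                             # w -> T
    ("N", ("V", "V\\N")): 8, ("V", ("N/V", "N")): 5,                  # T -> N1 N2
    ("N/V", ("be",)): 5, ("N", ("nevel",)): 3, ("V\\N", ("ing",)): 8, # N -> M
})

def rule_prob(lhs, rhs):
    total = sum(c for (l, _), c in rule_counts.items() if l == lhs)
    return rule_counts[(lhs, rhs)] / total

def tree_prob(tree):
    """tree = (lhs, children); a child is a subtree or a terminal string."""
    lhs, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob(lhs, rhs)
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p

# A (5a)-style analysis of be + nevel + ing as a noun:
t = ("w", [("N", [("V", [("N/V", ["be"]), ("N", ["nevel"])]), ("V\\N", ["ing"])])])
print(tree_prob(t))
```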
<Paragraph position="5"> ⁸In this section, I will give a top-down description of a parse tree and discuss production rules of the type "A → B C", rather than bottom-up reduction and rules of the sort "B C → A" used by the parser.</Paragraph> <Paragraph position="6"> MORPA's grammar comprises three different types of production rules:

(11) a. w → T
     b. T → N1 N2
     c. N → M

In (11), w is the start symbol for words⁹; T is any member of the set of atomic categories which are possible top nodes: 𝒯 = {n, v, a, ...}; N is any member of the set of non-terminals, containing both atomic and functor categories: 𝒩 = {n, n/v, v\n, v, ...}, with 𝒯 ⊂ 𝒩; and M is any member of the set of terminals: ℳ = {be, nevel, ing, ...}.</Paragraph> <Paragraph position="7"> The probability of (5a) is then determined as in (12)¹⁰:

(12) P(5a) = P(w → n) · P(n → v v\n) · P(v → n/v n) · P(n/v → be) · P(n → nevel) · P(v\n → ing)

Thus, this simple PCFG provides general information on how likely a parse tree is to appear.</Paragraph> <Paragraph position="8"> ⁹Although not in the grammar, this symbol is used to make it possible to describe the possibility of a word being of a certain category in terms of (5).</Paragraph> <Paragraph position="9"> ¹⁰For the reader's convenience, the probabilities denote the tree (in labelled bracketing) and the production rules involved.</Paragraph> <Paragraph position="10"> It is well known that the accuracy of the empirical estimate of a probability function depends heavily on the appropriateness of the training set: for one thing, it must have a reasonable size and be representative of the domain that is being modelled. Our training set was the CELEX database, which contains approximately 123,000 Dutch stems provided with syntactic information, a morphological decomposition and token frequency information [van der Wouden, 1988; Burnage, 1990]. The token frequency information is based on a 44-million-word corpus. We collected from this database both type and token frequencies: type frequencies indicate how often a production rule occurs in the Dutch vocabulary (i.e. in the 123,000-stem corpus); token frequencies indicate how often a production rule occurs in Dutch texts (i.e. in the 44-million-word corpus).</Paragraph>
<Paragraph position="11"> The underlying idea was that for tests on dictionary samples the empirical estimate must be based on type frequencies, whereas for tests on text samples it must be based on token frequencies.</Paragraph> <Paragraph position="12"> Given the information in the database, we expected the collection of frequency data to be a matter of straightforward counting: CELEX's morphological decomposition consists of hierarchical structures which are comparable to MORPA's structures (cf. the examples in (5)); the syntactic information consists of the word class; and because each stem in the stem corpus is provided with a token frequency, type and token frequencies could be collected simultaneously: every time a production rule was encountered in the stems corpus, 1 was added to its type frequency, and the token frequency of the word in which the rule was attested was added to its token frequency.</Paragraph> <Paragraph position="13"> Unfortunately, however, straightforward counting of all production rules contained in CELEX did not suffice to provide MORPA with the relevant information: it turned out that the set of production rules employed by MORPA was not contained in the set of production rules given by CELEX. For a very large part, the mismatch between the rules is caused by the fact that CELEX and MORPA yield different analyses. For example, because in MORPA all words formed according to unproductive rules are entirely listed in the lexicon, and the Dutch adjectival suffix -elijk '-ly' is considered to be unproductive, all words derived with this suffix are listed. In CELEX, however, these words are decomposed. Now, in order to analyse the word vriendelijk (friendly), MORPA will employ the production rule (13a), whereas CELEX employed the rules in (13b):

(13) a. A → vriendelijk
     b. A → N N\A
        N → vriend
        N\A → elijk

Consequently, straightforward counting of the production rules in CELEX would result in overestimating the probabilities of the productions "A → N N\A" and "N → vriend", and in a lack of frequency information for the production "A → vriendelijk".</Paragraph> <Paragraph position="14"> Amongst the MORPA rules which were not contained in the set of CELEX rules were also all the rules introducing inflectional affixes and inflected stems. Of course, this is due to the fact that the 123,000-entry corpus only contains stems. As CELEX stems are considered to be an abstract way of representing a whole inflectional paradigm, inflectional affixes and inflected stems were not included in the database, and the token frequency associated with a stem is the sum of the token frequencies of the stem and all its inflected forms. However, MORPA also contains inflectional rules, for which token frequencies should be available. For obtaining frequency information on inflectional affixes and stems, we had to use the CELEX corpus, containing approximately 44 million words. Unfortunately, the morphological information in this corpus does not contain any production rules or information on the affixes.</Paragraph>
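Collecting the two frequency counts is itself mechanical once decompositions are available. The following sketch of the counting step described above is an editorial illustration; the tree representation and the example token frequency are made up.

```python
# Editorial sketch of frequency collection: each rule occurrence adds 1 to its
# type count and the word's corpus frequency to its token count.
from collections import Counter

type_freq, token_freq = Counter(), Counter()

def count_rules(tree, word_token_freq):
    """tree = (lhs, children); a child is a subtree or a terminal string."""
    lhs, children = tree
    rule = (lhs, tuple(c if isinstance(c, str) else c[0] for c in children))
    type_freq[rule] += 1
    token_freq[rule] += word_token_freq
    for child in children:
        if not isinstance(child, str):
            count_rules(child, word_token_freq)

# vriendelijk decomposed CELEX-style, with a made-up token frequency of 152:
count_rules(("A", [("N", ["vriend"]), ("N\\A", ["elijk"])]), 152)
print(type_freq)
```

As the text explains, counting CELEX decompositions this way credits "A → N N\A" and "N → vriend" rather than MORPA's listed rule "A → vriendelijk", which is exactly the mismatch that had to be repaired by hand.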
<Paragraph position="15"> Thus, after all production rules in CELEX had been counted straightforwardly, we were only able to assign frequency information to part of the MORPA rules. Moreover, we knew that some of these frequencies were overestimated. Because we expected these facts to have a negative influence on the accuracy of the PCFG, we decided to put some effort into making the empirical estimate more reliable. We had to be very creative in finding other ways to provide the rules which are not in CELEX with frequency information (from CELEX), but we finally managed to provide almost all production rules employed by MORPA with frequency information. Also, we put some effort into "repairing" the overestimated frequencies. Consequently, the data have become more complete and more reliable, but as a result of these problems, the collection of frequency information became a time-consuming and error-sensitive job: a lot of work had to be done by hand. Therefore, repeating the entire exercise is practically unfeasible.</Paragraph> <Paragraph position="16"> With respect to the reliability of the frequency data, it turned out that the token frequencies are less reliable than the lexical frequencies. Most importantly, this was due to the fact that in CELEX, the token frequencies were "string" counts, i.e. they indicate how many times each separate string of letters occurs in the 44-million-word corpus. Because some of these separate strings of letters may be ambiguous in word class, morphemic segmentation or meaning, they are assigned different entries in the stems corpus. Ideally, the token frequencies in the corpus are disambiguated over the different entries, but at the time we collected our data they were not¹¹. As a consequence, numerous stems were assigned overestimated token frequencies.</Paragraph> <Paragraph position="17"> Consider, for example, the string met, which can be linked to two entries in the stems database: the entry of the preposition met 'with', and the entry of the noun met 'minced pork'. Since the individual frequencies of each of these entries have not been sorted out, the rules "P → met" and "N → met" have the same frequency, i.e. the frequency of the string met. Because the preposition is highly frequent and the noun hardly ever occurs, the latter rule has been assigned a frequency which is highly overestimated. Since, in addition to that overestimation, the rule "w → N" is more frequent than the rule "w → P", and the frequency of the two compounds in which the noun takes part is added to the frequency of the rule "N → met", MORPA will consider the noun to be the most likely analysis. Had the frequencies been sorted out, this would not be the case: the high probability of the rule "P → met" would have outweighed all other probabilities.</Paragraph> <Paragraph position="18"> The unreliability of token frequencies was borne out by some preliminary tests, in which we experimented with type and token frequencies on both dictionary and text test samples.</Paragraph>
<Paragraph position="19"> When examining MORPA's output on a text test sample (for which token frequencies were used), we discovered that many of the erroneous selections were indeed attributable to the lack of disambiguation of token frequencies. Especially when the sample contained highly frequent, string-ambiguous simplex words, such as met, which do not take part in derivation or compounding, MORPA's performance got worse. It turned out that MORPA's performance was best when type frequencies were used on a dictionary test sample.</Paragraph> <Paragraph position="20"> ¹¹By now, CELEX has disambiguated the token frequencies, but as the collection of reliable data was very time-consuming, we have not yet "repaired" our token frequencies.</Paragraph> <Paragraph position="21"> MORPA first generates all possible parses and the associated probabilities, ordering them along a scale of plausibility afterwards. Thus, as yet, it is not a probabilistic parser in the sense that it prunes the low-probability parses at an early stage [Fujisaki et al., 1989; Jelinek et al., 1990]. Adjusting the parser accordingly would speed it up considerably, but pruning low-ranked analyses may also lead to incompleteness.</Paragraph> <Paragraph position="22"> In conclusion, let us return to the example word beneveling. After likelihood determination and ordering of the two remaining analyses in (8), the correct analysis be + nevel + ing is in topmost position:

(14) 1. be + nevel + ing    N
     2. be + neef + eling   N

</Paragraph> </Section> <Section position="5" start_page="188" end_page="190" type="metho"> <SectionTitle> 4 The performance of MORPA </SectionTitle> <Paragraph position="0"> In order to evaluate the performance of our system, a test was run on a dictionary test sample of 3,077 words. The words contained in this sample were randomly taken from texts of the so-called "Bloemendal corpus" [Bringmann, 1990].</Paragraph> <Paragraph position="1"> For a correct interpretation of the results, it is necessary to know that a word was considered to be correctly analysed if it had been assigned the correct morphemic segmentation and word class. The analysis in (15) is the correct analysis of the word beneveling:

(15) [N [pref be] [stem nevel] [suff ing]]

Thus, in the final output of MORPA, morphological information which is irrelevant for pronunciation is eliminated: analyses which have the same segmentation, but are ambiguous in their hierarchical structure and/or the categorial labelling of the morphemes, such as (5a) and (5b), become one, as long as the morphemes have the same morphological classification, e.g. ((non-)native) prefix, suffix or stem, and the word is assigned the same word class.</Paragraph>
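The conversion from scored parse trees to this final output format can be sketched as follows (an editorial illustration; the tree representation, the category assignments in the second example tree and the probabilities are all hypothetical): analyses are collapsed to a segmentation plus word class, and the survivors are ranked by probability.

```python
# Editorial sketch: collapse scored parse trees to the pronunciation-relevant
# output (segmentation + word class), merging structurally distinct analyses,
# and rank the survivors by probability.
def leaves(tree):
    """Terminal strings of a tree of the form (category, children)."""
    _, children = tree
    out = []
    for c in children:
        out.extend([c] if isinstance(c, str) else leaves(c))
    return out

def collapse_and_rank(scored_trees):
    """scored_trees: list of (tree, probability), trees rooted in the word class."""
    best = {}
    for tree, p in scored_trees:
        key = (" + ".join(leaves(tree)), tree[0])   # segmentation, word class
        best[key] = max(best.get(key, 0.0), p)
    return sorted(best.items(), key=lambda kv: -kv[1])

# Two structurally different noun parses of be + nevel + ing (hypothetical
# category assignments and probabilities) collapse into a single analysis:
t1 = ("N", [("V", [("N/V", ["be"]), ("N", ["nevel"])]), ("V\\N", ["ing"])])
t2 = ("N", [("V", [("V/V", ["be"]), ("V", ["nevel"])]), ("V\\N", ["ing"])])
print(collapse_and_rank([(t1, 3e-7), (t2, 1e-7)]))
```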
<Paragraph position="2"> As MORPA combines a conventional grammar with a probability-based scoring function, it is interesting to look at the effects of both the rule-based part and the probability-based ordering technique in their own right: the segmentation procedure and grammar determine the quality of the analyses and the number of analyses generated, and the probability-based scoring function enables MORPA to select the most likely analysis from a set of alternatives.</Paragraph> <Paragraph position="3"> The results in (16) show how well the segmentation procedure and grammar succeeded in deriving the correct analysis for the test words:

(16) a correct analysis     2,968   96%
     no correct analysis       32    1%
     no analysis at all        77    3%

MORPA assigned no analysis at all to 3% of the test words. For 1% of the test words, one or more analyses were generated, but the set of alternatives did not contain a correct analysis. In these cases, the word either contains an unknown morpheme, or the grammar is too restrictive. 96% of the test words were assigned a correct analysis.</Paragraph> <Paragraph position="4"> Given the problem of ambiguity, the number of analyses generated for one word is remarkably small: considering only the words which were correctly analysed, MORPA assigned a single, correct analysis to 46% of the test words. For 54%, the correct analysis was among a set of alternatives.</Paragraph> <Paragraph position="5"> In order to evaluate the probability-based scoring function, which enables MORPA to order competing analyses along a scale of plausibility, it must be established how often MORPA succeeds in selecting the correct analysis from a set of alternatives. MORPA was able to select the correct analysis as the most likely member of a set of alternatives for 92% of the test words. For a proper judgement of this performance, the percentage must be compared with the chance of selecting the most likely analysis from the set of alternatives. It is not easy to tell which factors contributed to the fact that for 8% of the words the correct analysis was not selected as the best analysis. The frequency data may be unreliable, or the probability function may not be appropriate. Also, the correct analysis need not always be the most probable one.</Paragraph> <Paragraph position="6"> Most important, however, is the overall performance of MORPA's PCFG on the Bloemendal corpus: 92% of the test words were assigned a correct analysis which was also the first analysis yielded. Although we did not keep track of the number of segmentations assigned to the input words, it can generally be assumed that the number of alternative segmentations is very much reduced by the grammar. Also, by converting output that contains hierarchical structures and categorial labels (cf. (5a) and (5b)) to linear structures and a morpheme classification (cf. (15)), a lot of unnecessary ambiguity is eliminated. For the 8% of the test words which were not assigned a correct analysis in first position, MORPA either generated a correct analysis which was not in first position, or no correct analysis, or no analysis at all.</Paragraph> <Paragraph position="7"> In order to establish the relevance for word-level pronunciation, a test was run on a test file containing approximately 2,000 isolated words.
The test words were selected from different corpora, to make sure the file contained newspaper text, dictionary words and words of frequency 1¹². The words of the test file were analysed by MORPA, and the topmost analyses were used by MORPHON to derive a pronunciation transcription. A transcription was considered correct if it was the proper phonemic transcription, which means that all appropriate non-optional phonological rules must have been applied, and that the words must have the correct syllable structure and stress pattern.</Paragraph> <Paragraph position="8"> Fifteen percent of the words were assigned an erroneous phonemic transcription¹³. Twenty percent of the errors could be traced back to the phonological module; the remaining 80% are due to faulty morphological analyses. Of the errors made by MORPA, 88% led to an incorrect pronunciation representation. As expected, segmentation errors almost always led to an incorrect phonemic transcription. Category assignment errors also cause incorrect pronunciations, though less often. This bears out the importance of the category a word belongs to.</Paragraph> </Section> </Paper>