<?xml version="1.0" standalone="yes"?> <Paper uid="A94-1025"> <Title>A robust category guesser for Dutch medical language</Title> <Section position="4" start_page="150" end_page="150" type="metho"> <SectionTitle> 2 Full Form Dictionary </SectionTitle> <Paragraph position="0"> The lexical database for Dutch was built from several resources: an existing electronic valency dictionary and a list of words extracted from a medical corpus (cardiology patient discharge summaries).</Paragraph> <Paragraph position="1"> The existing electronic dictionary (resulting from the K.U. Leuven PROTON project (Dehaspe and Van Langendonck, )) and the newly coded entries were converted and merged into a common representation in a relational database (Dehaspe, 1993).</Paragraph> <Paragraph position="2"> The intention is to use the category guesser (cf. infra) as little as possible. To that end, the dictionary is conceived as a full-form dictionary. Currently, the lexical database contains some 100,000 full forms (corresponding to some 8,000 non-inflected forms). However, since an exhaustive dictionary is unrealistic, a category guesser handles all unknown word forms.</Paragraph> <Paragraph position="3"> Unknown words trigger a set of rules that identify the surface form, attribute syntactic categories to it, and calculate the possible canonical form(s). The category guesser can also enhance the robustness of the larger NLP system, since misspelled words can receive, to a certain extent, correct syntactic features. 
To this end, the category guesser combines morphological knowledge (section 3) with non-morphological knowledge (sections 4 and 6).</Paragraph> </Section> <Section position="5" start_page="150" end_page="151" type="metho"> <SectionTitle> 3 Morphological Analysis </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="150" end_page="151" type="sub_section"> <SectionTitle> 3.1 Preliminary Remarks </SectionTitle> <Paragraph position="0"> The morphological analyser consists mainly of three sections, which correspond more or less to the three linguistic operations on words: inflection, derivation and compounding. However, from an implementational point of view, the boundary between derivation and compounding is drawn differently.</Paragraph> <Paragraph position="1"> Compounds created by agglutination or joined by means of a hyphen are computationally treated as non-compounds. This implies that the same segmentation routine can be used for the computation of derivations and monolithical compounds.</Paragraph> </Section> <Section position="2" start_page="151" end_page="151" type="sub_section"> <SectionTitle> 3.2 Inflection </SectionTitle> <Paragraph position="0"> The inflection analyser produces one or more bundles of morphosyntactic feature-value pairs (= the cohort) for each submitted surface form. The generated feature bundles comprise, among other features, the surface form (lex), the supposed canonical form (nlAu) as well as its category (cat) (footnote 2). A reduced example of the cohort produced for &quot;geprobeerd&quot; (Eng.: &quot;tried&quot;) follows (see figure 1).</Paragraph> <Paragraph position="1"> The initial cohort is later reduced as much as possible (the ideal result in most cases being a single feature bundle). To this end, a cascading priority system has been defined. The attribute &quot;morf&quot; expresses the quality of the analysis, possible values being segm, suffix, string or guess, with segm > suffix > string > guess. 
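As a rough illustration of such a priority ordering, the following Python sketch keeps only the bundles that carry the best quality value. It is a sketch under stated assumptions, not the authors' implementation: the table MORF_RANK, the function best_bundles and the dictionary layout of a bundle are hypothetical, and the attribute name morf follows the spelling used in the later sections.

```python
# Hypothetical ranking of the quality values: segm > suffix > string > guess.
MORF_RANK = {"segm": 3, "suffix": 2, "string": 1, "guess": 0}


def best_bundles(cohort):
    """Keep only the feature bundles whose morf value has the highest rank."""
    if not cohort:
        return []
    top = max(MORF_RANK[bundle["morf"]] for bundle in cohort)
    return [bundle for bundle in cohort if MORF_RANK[bundle["morf"]] == top]


# Illustrative cohort (assumed layout): one validated and one guessed bundle.
cohort = [
    {"lex": "geprobeerd", "cat": "v", "morf": "string"},
    {"lex": "geprobeerd", "cat": "n", "morf": "guess"},
]
print(best_bundles(cohort))
```

In this sketch ties are kept, so several bundles can survive when they share the best quality value.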
More details are given below.</Paragraph> <Paragraph position="2"> Only the feature bundles of supposed nouns, verbs, adjectives and adverbs (i.e. the open categories) are admitted to the initial set of tentative analyses, the cohort.</Paragraph> </Section> <Section position="3" start_page="151" end_page="151" type="sub_section"> <SectionTitle> 3.3 Segmentation </SectionTitle> <Paragraph position="0"> Derivation and monolithical compounding are used to identify as many as possible of the canonical forms computed by the inflectional analyser.</Paragraph> <Paragraph position="1"> The starting principle here is that the right part of the computed canonical form usually constitutes the grammatical head of the whole word. The whole word thus inherits the feature bundle associated with its right part (Selkirk, 1982, p. 150) (footnote 3). In contrast to Williams (Williams, 1983) and Selkirk (Selkirk, 1982), we do not allow inflectional suffixes to be heads. The right part can be found in the dictionary (monolithical compounding) or in a list of suffixes (derivation). In the current segmentation program, the major part of this list contains medical suffixes, which constitute a clearly definable set that is fairly regular in its (morphological and syntactic) behaviour (Dujols et al., 1991).</Paragraph> <Paragraph position="2"> Footnote 2: v for verbs, adj for adjectives and n for nouns; others are nb [sing or plur] for number, pers [1, 2, 3 or nil] for person. Footnote 3: We are fully aware that linguistic reality is more complex: some derivations (for instance Dutch diminutives, cf. (Ritchie et al., 1992)) are regarded as left-headed. Maybe they should be treated computationally by the inflectional analyser.</Paragraph> <Paragraph position="3"> 
An extract of the suffix list is given below (see figure 2).</Paragraph> <Paragraph position="4"> suffix([a,r,i,s], [cat:adj, nb:sing]).</Paragraph> <Paragraph position="5"> suffix([a,a,l], [cat:adj, nb:sing]).</Paragraph> <Paragraph position="6"> The computed canonical form is scanned and segmented from right to left. All possible solutions are generated by a failure-driven loop (no exclusive longest-match principle). The segmentation routine first tries to identify a right part (head:dict or head:suffix) and then tries to recognize the remaining left part. If this succeeds, the segmentation is complete (morf:segm). Otherwise, it is only partial (morf:suffix).</Paragraph> <Paragraph position="7"> At the moment, only noun-noun compounds are treated. Many medical noun-noun compounds combine a medical non-head part with a non-medical head part (e.g. hartziekte, Eng.: heart disease).</Paragraph> <Paragraph position="8"> Only those feature bundles of the cohort are kept that are compatible (by means of graph unification) with the feature bundle associated with the head part (suffix or dictionary entry). At this stage of filtering, the feature cat (syntactic category) plays the most prominent role.</Paragraph> </Section> </Section> <Section position="6" start_page="151" end_page="152" type="metho"> <SectionTitle> 4 Endstring Matching </SectionTitle> <Paragraph position="0"> When nothing can be predicted by means of morphology, another heuristic is applied to reduce the set of remaining possible morphological analyses.</Paragraph> <Paragraph position="1"> This stage focuses more on general-language words. It is based on a series of endstrings (not limited by morphological boundaries) which determine the category of a word. Only the open syntactic classes are taken into account (noun, verb, adjective and adverb). 
Some endstrings uniquely identify the category of a word while others are more equivocal.</Paragraph> <Paragraph position="2"> The latter are correlated with two or even three categories. The linguistic knowledge needed to build a list of non-inflected endstrings and their associated category (or categories) was found in Lemmens (Lemmens, 1989). Some combinations of an endstring and its category are shown below (see figure 3).</Paragraph> <Paragraph position="3"> When a computed lexical form is presented to the endstring matcher, the above-mentioned list is checked to see whether an endstring constitutes the end part of the submitted word. In fact, the surface form as well as the hypothetical canonical form of the feature bundle are submitted to the endstring matcher.</Paragraph> <Paragraph position="4"> Only the categories resulting from both matching processes (= the intersection) are finally retained. end([d,r,e,e|_], [v,adj], [eerd]).</Paragraph> <Paragraph position="5"> end([l,e,e,i|_], [adj,n], [ieel]).</Paragraph> <Paragraph position="6"> end([l,e,i|_], [adj], [iel]).</Paragraph> <Paragraph position="7"> end([e,m,s,i|_], [n], [isme]).</Paragraph> <Paragraph position="8"> Subsequently, the feature bundle(s) of the cohort containing the proposed syntactic category are extended with an extra feature-value pair (morf:string). Below (see figure 4) the result of endstring matching applied to the verb &quot;geprobeerd&quot; (Eng.: &quot;tried&quot;) is shown (the rule for the ending -eerd applies) (footnote 4). The inflection rules were able to produce a canonical form together with its category which the endstring matcher considers correct. This implies that the inflection rule was correctly triggered and applied. As a corollary, the other syntactic information in such a validated feature bundle (with morf:string) is supposed to be correct as well. 
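The endstring matching step described in this section can be sketched roughly as follows in Python. This is an illustrative sketch, not the authors' implementation: the table ENDSTRINGS holds the four entries of figure 3 plus one hypothetical entry ("eren", verb) added only so that the canonical form of the example matches something; the bundle key canon, the function names, and the choice to collect the categories of every matching endstring are assumptions.

```python
# Endstring table: the first four entries come from figure 3.
ENDSTRINGS = {
    "eerd": {"v", "adj"},
    "ieel": {"adj", "n"},
    "iel": {"adj"},
    "isme": {"n"},
    "eren": {"v"},  # hypothetical entry, not taken from the paper
}


def endstring_categories(form):
    """Collect the categories of every listed endstring that ends the form."""
    cats = set()
    for ending, ending_cats in ENDSTRINGS.items():
        if form.endswith(ending):
            cats = cats.union(ending_cats)
    return cats


def match_bundles(cohort):
    """Return copies of the bundles whose category is confirmed by both the
    surface form and the canonical form, extended with morf 'string'."""
    validated = []
    for bundle in cohort:
        surface = endstring_categories(bundle["lex"])
        canonical = endstring_categories(bundle["canon"])
        if bundle["cat"] in surface.intersection(canonical):
            validated.append({**bundle, "morf": "string"})
    return validated


# Illustrative cohort (assumed layout) for the surface form "geprobeerd".
cohort = [
    {"lex": "geprobeerd", "canon": "proberen", "cat": "v"},
    {"lex": "geprobeerd", "canon": "geprobeerd", "cat": "n"},
]
print(match_bundles(cohort))
```

Under these assumptions only the verb reading survives, because the noun category is confirmed by neither the surface form nor its canonical form.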
However, many syntactic features are underspecified (footnote 5).</Paragraph> </Section> <Section position="7" start_page="152" end_page="152" type="metho"> <SectionTitle> 5 Default or Catch All Rule </SectionTitle> <Paragraph position="0"> If none of the aforementioned cases applies, the computed canonical forms and their corresponding grammatical features are pure guesses. The complete cohort is retained and each of its feature bundles is extended with one extra feature, morf:guess.</Paragraph> <Paragraph position="1"> 6 Final Selection of the Set of Solutions. After the stages mentioned above, only a subset of the feature bundles of the cohort will contain the feature morf. All of these feature bundles contain morphosyntactic information that is validated by the heuristics mentioned (cf. supra) (footnote 6). This subset is retained and passed to the syntactic parser. When segmentations of both types (complete versus partial) are produced, the latter (morf:suffix) are discarded in favour of the former (morf:segm). In that case, neither endstring matching nor the catch-all rule is applied.</Paragraph> </Section> <Section position="8" start_page="152" end_page="153" type="metho"> <SectionTitle> 7 Schematic Overview </SectionTitle> <Paragraph position="0"> Below, one can find a more formal description and a schematic overview (see figure 5) of the category guesser.</Paragraph> <Paragraph position="1"> linguistic features -- even when underspecified -- in the feature bundle.</Paragraph> <Paragraph position="2"> Footnote 6: When the default rule applies, the subset is identical to the complete cohort. Validation is too strong a word in this case.</Paragraph> </Section> <Section position="9" start_page="153" end_page="153" type="metho"> <SectionTitle> 8 Some Results and Statistics </SectionTitle> <Paragraph position="0"> To examine the effectiveness of the category guesser, all the words from the corpus not appearing in the dictionary were submitted to the analyser. The total number of unknown words was 2832. 
Manual categorisation revealed the presence of 679 adjectives, 2056 nouns and 82 verbs. The 2832 unique unknown forms led to the generation of 6342 supposed analyses, which means that for every unknown form about 2.24 possible canonical forms are retained on average. We consider the case where an unknown surface form receives more than two different categories to be a guess.</Paragraph> <Paragraph position="1"> Guesses are always interpreted as bad. If the category guesser is not able to attribute a correct category, the result is regarded as bad. Once a correct category is assigned to the submitted word, even alongside an incorrect one, the outcome is regarded as good (footnote 7). As the main concern lies with the syntactic characteristics, we did not consider an erroneously calculated canonical form a reason to reject the complete feature bundle. Manual examination of the results permits us to state that 83.4% of the unknown forms are correctly identified. We consider this result fairly good and are convinced that refinements can lead to an even better result.</Paragraph> <Paragraph position="2"> The linguistic coverage can still be improved by adding rules to treat comparatives and superlatives.</Paragraph> </Section> </Paper>