<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2209"> <Title>A tagger/lemmatiser for Dutch medical language</Title> <Section position="4" start_page="0" end_page="1147" type="metho"> <SectionTitle> 2 Linguistic Knowledge </SectionTitle> <Paragraph position="0"> In essence, the T/L is a generate-and-test engine. The generator (the database or the word recogniser, cf.</Paragraph> <Paragraph position="1"> section 2.1) supplies all possible morphological analyses of a word, and the test engine (the contextual disambiguator, cf. section 2.2) must reduce the potentially valid analyses as far as possible to the one(s) actually applicable in the context of the given input sentence 1</Paragraph> <Section position="1" start_page="0" end_page="1147" type="sub_section"> <SectionTitle> 2.1 Lexical Front-end </SectionTitle> <Paragraph position="0"> The dictionary is conceived as a full form dictionary in order to speed up the tagging process. Experiments (Dehaspe, 1993b) have shown that full form retrieval is in most cases significantly faster than canonical form computation and retrieval (cf. also Ritchie et al., 1992, p. 201). The lexical database for Dutch was built using several resources: an existing electronic valency dictionary 2 and a list of words extracted from a medical corpus (cardiology patient discharge summaries). The existing electronic dictionary and the newly coded entries were converted and merged into a common representation in a relational database (Dehaspe, 1993a). A Relational DataBase Management System (RDBMS) can handle very large amounts of data while guaranteeing flexibility and speed of execution. Currently, there are some 100,000 full forms in the lexical database (corresponding to some 8,000 non-inflected forms). For the moment, the database contains mostly simple wordforms. 
Neither complex wordforms nor idiomatic expressions are yet handled in a conclusive manner.</Paragraph> <Paragraph position="1"> However, since an exhaustive dictionary is an unrealistic assumption, an intelligent word recogniser tries to cope with all the unknown word forms (Spyns, 1994). The morphological recogniser tries to identify the unknown form by computing its potential linguistic characteristics (including its canonical form). For this purpose, it uses a set of heuristics that combine morphological (inflection, derivation and compounding) as well as non-morphological (lists of endstrings coupled to their syntactic category) knowledge. When these knowledge sources do not suffice to identify an unknown form, the form is marked as a guess and receives the noun category.</Paragraph> <Paragraph position="2"> Actually, a difference is made between the regular full form database dictionary and a much smaller canonical form dictionary. The latter consists of automatically generated entries. Those entries are asserted as temporary canonical form lexicon entries and do not need to be calculated again by the recogniser part of the T/L when encountered a second time in the submitted text. A substantial speedup can be gained that way.</Paragraph> </Section> <Section position="2" start_page="1147" end_page="1147" type="sub_section"> <SectionTitle> 2.2 The Disambiguator </SectionTitle> <Paragraph position="0"> The contextual 3 disambiguator of the DMLP is implemented as an &quot;expertlike system&quot; (Spyns, 1995), which takes into account not only the immediate left and/or right neighbour of a word in the sentence, but also the entire left or right part of the sentence, depending on the rule. E.g. 
if a simple form of the verb 'hebben' [have] appears, the auxiliary reading is kept only if a past participle is present in the context 4 3We only consider the syntactic context.</Paragraph> <Paragraph position="1"> 4Unlike in English, the past participle in Dutch does not need to occupy a position adjacent to the auxiliary.</Paragraph> <Paragraph position="2"> The rule base can be subdivided into 21 independent rulesets. A specific mechanism selects the appropriate ruleset to be triggered. Some rulesets are internally ordered. In that case, if the most specific rule is fired, the triggering of the more general rules is prevented. In other cases, all the rules of a ruleset are triggered sequentially. Some rules are mutually exclusive. The rules are implemented as Prolog clauses, which guarantees a declarative style of the rules (at least to a large extent).</Paragraph> <Paragraph position="3"> The control mechanism works with an agenda that contains the positions of the words in the input sentence. The position in the sentence uniquely identifies a word (and thus its corresponding (group of different) morphological reading(s)).</Paragraph> <Paragraph position="4"> Every position on the agenda is checked sequentially to see whether the word can be disambiguated. If an ambiguous word is encountered, its position is kept on the agenda. For every element of the agenda, all possible binary combinations of the syntactic categories are tried (failure driven loop). To avoid infinite loops (repeatedly firing the same rule that is not able to alter the current set of morphological readings), the same ruleset can only be fired once for the word on the same position during the same pass. 
As long as the disambiguator can reduce the number of readings and the agenda is not empty, a new pass is performed.</Paragraph> </Section> </Section> <Section position="5" start_page="1147" end_page="1148" type="metho"> <SectionTitle> 3 Software Engineering </SectionTitle> <Paragraph position="0"> In order to preserve the reusability of the dictionary, an extra software layer hides the database. This layer transforms the information from the database into a feature bundle containing the application specific features. The software layer restricts and adapts the &quot;view&quot; (just like SQL views) that the programs have on the information of a lexical entry. This method allows all sorts of information to be coupled to a lexical entry in the database while only the information relevant for a specific NLP application passes &quot;the software filter&quot;. Besides the qualitative aspect, the filter can also affect the quantitative aspect by collapsing or expanding certain entries (e.g. the 1st and 2nd person singular of many verbs constitute the same entry in the database but are differentiated afterwards) or excluding specific combinations after examination of the input.</Paragraph> <Paragraph position="1"> The feature bundles constitute the main datastructure of the T/L itself.</Paragraph> <Paragraph position="2"> They are conceived as Directed Acyclic Graphs (DAGs), which are implemented as open ended Prolog lists (Gazdar and Mellish, 1989). This &quot;low level&quot; implementation is only known by the predicates that make up the interface. Graph unification provides a neat and easy way to impose various restrictions. A linguistic restriction can be expressed in terms of feature value pairs, which in turn can be represented as a DAG. This DAG acts as a filter towards other DAGs. The DAGs that are unifiable with the &quot;filter DAG&quot; meet the imposed restriction. The only thing to do is to define the appropriate filters. 
The contextual rules mainly consist of such filter DAGs.</Paragraph> <Paragraph position="3"> The T/L, which can analyse words missing from the dictionary, is intended to function primarily as a lexical front-end for the DMLP syntactic analyser (Spyns and Adriaens, 1992). However, as the result of the tagging and lemmatising process consists of feature bundles implemented as DAGs, the output format can be adapted very easily if required (by defining various &quot;format filters&quot;). The output format can be transduced to the format required by the &quot;SAC-tools&quot; or the System Management Tools of the Menelas project (Ogonowski, 1993). Another filter transforms the output to the format of the Multi-TALE semantic tagger (Ceusters, 1994).</Paragraph> </Section> </Paper>
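<!-- The filter mechanism described in section 3 can be illustrated with a minimal sketch. It is not the authors' Prolog implementation: it assumes that feature bundles can be approximated by nested Python dictionaries without shared substructure, and the function names (unify, apply_filter) and the feature names (lemma, cat, type) are hypothetical, chosen only for this illustration.

```python
def unify(a, b):
    """Return the unification of two feature structures, or None on a clash.

    Assumption: nested dicts stand in for DAGs; atomic values are strings.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # conflicting values for the same feature
                result[feature] = merged
            else:
                result[feature] = value
        return result
    # Atomic values unify only when they are equal.
    return a if a == b else None


def apply_filter(filter_dag, readings):
    """Keep only the readings that are unifiable with the filter."""
    return [r for r in readings if unify(filter_dag, r) is not None]


# Hypothetical readings for an ambiguous form of the verb 'hebben'.
readings = [
    {"lemma": "hebben", "cat": "verb", "type": "aux"},
    {"lemma": "hebben", "cat": "verb", "type": "main"},
]
aux_filter = {"cat": "verb", "type": "aux"}
kept = apply_filter(aux_filter, readings)
```

Here kept holds only the auxiliary reading, since the main verb reading clashes with the filter on the type feature. A restriction is thus stated once as data, and the same unify routine serves every rule, which is the property the paper attributes to filter DAGs. -->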