File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/w99-0506_metho.xml
Size: 19,587 bytes
Last Modified: 2025-10-06 14:15:31
<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0506">
<Title>On Some Aspects of Lexical Standardization</Title>
<Section position="3" start_page="0" end_page="41" type="metho">
<SectionTitle> 2 A Generic Lexical Architecture </SectionTitle>
<Paragraph position="0"> To support the development of large lexicons, we implemented a Lexical Knowledge Base (LKB) called Habanera (Zajac 97). A Habanera LKB is composed of (1) several monolingual dictionaries, (2) translation relations linking these monolingual dictionaries, and (3) a multilingual dictionary schema that defines a shared multilingual inheritance hierarchy of lexical types for all monolingual dictionaries. The system supports a variety of linguistic architectures. Since the design of a lexical architecture is a complex task, flexibility in designing the structure of the LKB is an essential feature. This flexibility is provided by allowing for a multi-layered LKB schema in which each layer provides additional constraints on the structure of a lexical entry. This approach is congruent with the distinction made in (Eagles 93) between meta-schemata, schemata and instances. These requirements motivated various initial choices for the design and the implementation of the system. We use the Text Encoding Initiative (TEI) definition for printed dictionaries (Sperberg-McQueen &amp; Burnard 94, Chap 12) as a source of inspiration for the definition of a standard dictionary entry structure (definition of the 'meta-schema' in Eagles terminology). However, lexical entries are encoded as Typed Feature Structures (TFS), which are our primary descriptive device for encoding lexical data. Typed Feature Structures provide a declarative formalism with a well-defined formal semantics (and associated operations: unification and subsumption), which we use instead of SGML to encode lexical entries. A set of type definitions specifies what constitutes a valid lexical entry and plays a role similar to a DTD in SGML. A type definition specifies the
set of features and restrictions on values for types. Most of the lexical tools are parametrized by the type definitions, which are part of the LKB schema. Multilingual dictionaries are organized as a set of monolingual dictionaries plus translation relations between entries. In the case of Knowledge-Based Machine Translation, relations are also defined between word senses and ontological concepts. Dictionaries and lexical entries are stored in a commercial DBMS which allows concurrent access to a dictionary, an important consideration when a dictionary is developed by a team of lexicographers. In the database, the format of stored data is independent of the external representation formalism. All strings are encoded using Unicode and we use UTF-8 for file exchange (import/export functions). The system is designed to facilitate acquisition as well as exploitation of lexical resources. Acquisition tools are implemented using HTML forms for the acquisition interfaces and additional integrated utilities for checking the correctness of entries, for transcriptions, etc. These tools are parametrized by resources (e.g., HTML templates, grammars for transcriptions) that are loaded at runtime. A dictionary can be accessed interactively through an HTML browser (also parametrized by a set of HTML templates). Natural Language Processing tools such as parsers do not access the database. Instead, a dictionary is compiled into a compact binary format that allows fast runtime access to entries. The dictionary compiler can build several indexes to look up entries in the compiled dictionary. Runtime indexes are compressed tries that provide random access to a compact binary dictionary file.</Paragraph>
<Section position="1" start_page="38" end_page="40" type="sub_section">
<SectionTitle> 2.1 Dictionaries </SectionTitle>
<Paragraph position="0"> The linguist works with a source dictionary where each dictionary entry is structured as a set of sub-entries. An entry can, for example, group together senses for the same lemma,
different categories together for the same form, different lemmas in the same derivational family, etc. An entry has a unique key (a Unicode string) and a tree of sub-entries. At each node of the tree, we attach a feature structure which encodes lexical information. The feature structure must follow the type definitions specified in the dictionary schema. The tree of sub-entries defines an inheritance hierarchy. Logically, only the leaves are actual entries: the compiler traverses the tree of sub-entries, computing inheritance, and generating the compiled dictionary from the set of leaves.</Paragraph>
<Paragraph position="2"> The dictionary schema contains various information useful for managing the dictionary:
(1) The schema of entries, specified using Typed Feature Structure definitions.
(2) The schema of relations among entries, if any. A relation must specialize the pre-defined Relation type; relations are used to describe synonymy, hyperonymy, etc. They are also used to link several monolingual dictionaries to provide translations.
(3) The set of macros, defining abbreviations for complex feature structures.
(4) The location of the key in the entry, which is used to build the primary dictionary index (each entry has a unique key within a dictionary).
(5) The language (as a 3-letter ISO code).
(6) Additional indexes that are maintained by the database engine for interactive look-up of entries. These indexes are specified as a set of paths in an entry.
(7) The name of the checker class and of the defaulter class.
We use (typed) feature structures to model entries and relations (Zajac 98, 92). Each type has a definition, similar to a class definition in an Object Oriented language: the definition of a type specifies which features are allowed for that type and what the type of the value is for each feature. Types are used to define the structure of entries, of relations (links), and of lexical rules. Since types can be organized in an
inheritance hierarchy, it is possible to define a common framework for describing all dictionaries by defining a cross-language type hierarchy. This multilingual type hierarchy specifies dictionary-dependent (that is, language-dependent) elements such as the inventory of morphosyntactic categories by defining super-types that are common to two or more languages, thereby defining a multilingual inheritance hierarchy of lexical types. Only syntactically correct entries are stored in the database. However, some consistency checks escape both the checking done by the parser and the type-checking mechanism provided by the Typed Feature Structure engine. For example, all headwords must be written using the alphabet of the language; other characters are not allowed. Checks of this kind must be added specifically for each dictionary through the implementation of a checker class that is used by the database before adding entries to a dictionary. An optional defaulter can also be provided for a given dictionary: the defaulter analyzes a dictionary entry and applies default rules to fill in missing information. For example, if a feature number with value Plural is filled for a noun, the noun is an irregular plural; otherwise, it is a regular noun and the number feature is not further specified. Or, if the dictionary specifies a gender only for feminine nouns, the defaulter might add a masculine gender when it is not specified. Entries in the database might have such missing information. However, our Typed Feature Structure engine does not provide defaults, and a runtime dictionary must explicitly include all the defaults: the defaulter is used by the compiler to fill in default information and produce a compiled dictionary where all information is explicitly expanded. The compilation process is done as follows on each entry:
(1) Apply dictionary-specific checks using the checker class (if defined).
(2) Apply the defaulter to augment the dictionary entry and resolve all the defaults. Note that
the checker and the defaulter work on the tree of sub-entries, not on individual feature structures.
(3) Move all information down to the leaves of the tree of sub-entries (compute inheritance).
(4) Expand macro definitions.
(5) Compile a feature structure for each leaf of the sense tree.
(7) Use type inference to infer the most specific type for each sub-feature structure within a feature structure.
(8) Type check the feature structures: in a feature structure, expand the types of all sub-feature structures by unifying in the definition of the type.
Relationships between lexical entries are modeled using binary links (relations), used to describe synonymy relations, derivation relations, translation relations (see Section 1.4), thesaurus relations, etc. Any relation defined in the dictionary schema must inherit from the Relation type. Relations can be given an arbitrarily complex internal structure and can bear information. A relation is formally defined as Relation = [dom Entry, range Entry]. For example, in a relation that specifies a cross-reference defined freely by the lexicographer, the domain feature will point to the entry which is the source of the relation, and the target entry (range feature) will be identified by providing the key of that entry, as in
#0= [key &quot;arm&quot;, ....
xref [dom #0, range [key &quot;armament&quot;], note &quot;Collective for arm&quot;]]
A dictionary browser could interpret these relations by generating hyperlinks between entries, for example. A dictionary also contains rules which specify productive relations within an entry (see Section 1.3) or among entries, either across multiple dictionaries or within a single dictionary (see Section 1.4). The type Relation is used in the definition of translation relations, transfer rules and lexical rules: each of these rules is defined as a sub-type of Relation.</Paragraph>
</Section>
<Section position="2" start_page="40" end_page="40" type="sub_section">
<SectionTitle> 2.2 Schema and meta-schema </SectionTitle>
<Paragraph position="0"> The Eagles guidelines on standardization of lexical resources (Eagles 93) introduce the distinction between (1) &quot;The meta-schema which defines general well-formedness conditions for the schema&quot;, (2) The schema &quot;defines the logical format of language-specific and level-wise linguistic descriptions&quot;, and (3) &quot;Instances are the individual lexicons for which there is a translation relation expressed between the individual format of the instance and the 'type' defined by ...&quot; In a Habanera lexical knowledge base, the only fixed structure is the tree of sub-entries, and anything else is defined via the dictionary schema. Using the Typed Feature Structure language developed at CRL, it is possible to define dictionary schemata using several layers of abstraction, therefore introducing arbitrary intermediate layers between the meta-schema and the schema proper. In this TFS language, sets of type definitions are grouped into modules and sub-modules (a notion similar to the notion of package in programming languages such as Lisp or Java). The use of modules allows a schema to be structured as a set of modules introducing additional structures and more specific constraints on the format of an instance. In the next section, we will present the lexical
structure which is used in CRL dictionaries. The schemata of dictionaries are organized as follows. A generic module defines the generic structure of a dictionary. Language-specific modules add to that specification language-dependent information (e.g., a specific inventory of morphosyntactic features) that is grafted on the generic structure or which specializes the generic structure. The generic structure has been inspired by the TEI definition and is presented in Section 3. The set of type definitions specified in the dictionary schema is used by the type-checker, which checks that a dictionary entry is well-typed, and by the compiler, which builds a compact binary representation of a dictionary entry as a feature structure.</Paragraph>
</Section>
<Section position="3" start_page="40" end_page="41" type="sub_section">
<SectionTitle> 2.3 Tools </SectionTitle>
<Paragraph position="0"> The dictionary browser and editor are parametrized by a set of HTML templates which define the presentation format to be used for displaying feature structures at each level of the tree of sub-entries. The mapping of the structure of an entry ...
Figure 4: A Habanera browser for a Persian-English dictionary.
Since most Web browsers do not support input methods for languages other than English, input of character strings is done using a transcription. A set of transcription tables can be defined by the user and selected in the browser when inputting a character string, e.g., for headwords. However, Web browsers support the display of almost any major language, and Unicode strings can be directly embedded in HTML documents. Habanera also provides import/export functions. The format of a dictionary file uses a textual syntax for feature structures (the one used in the examples). The dictionary file encoding is UTF-8.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="41" end_page="43" type="metho">
<SectionTitle> 3 Standardizing the Structure of Lexical Entries </SectionTitle>
<Paragraph position="0"> The
dictionaries developed at CRL share the same generic structure. Each language-specific dictionary refines the shared schema by adding language-specific information (e.g., a specific inventory of morphosyntactic features). The data of a monolingual dictionary is a set of entries corresponding to word senses as described in (Meyer et al. 1990) and (Onyshkevych and Nirenburg, 1994). We distinguish between computational features that are used by NLP components such as parsers (form, gram, sem, synSem, trans, rel, lexRule, usg) and other features that are used by lexicographers: definition (def), example (eg), etymology (etym), cross-reference (xref) and note (note). The features present for each sub-entry are ... In the remainder of this section, we present the structure of the form and gram features (see Zajac et al. 98 for a description of other features).</Paragraph>
<Section position="1" start_page="41" end_page="43" type="sub_section">
<SectionTitle> 3.1 Orthography and Morphology </SectionTitle>
<Paragraph position="0"> The form feature records information about the type of word: whether the word is a full word, an acronym, or an abbreviation. These types are introduced since acronyms and abbreviations are typically processed differently from ordinary words, for example during a tokenization phase (see e.g. Grefenstette 94), and words or compounds are processed during or after a morphological analysis: the dictionary compiler will produce different runtime dictionaries that include different kinds of information as needed by the various components of the system. The orthography feature records the citation form of the word as well as a list of variants. There could also be additional information such as capitalization, hyphenation or syllabification (useful information for an English morphological analyzer, for example). The morphology feature records three different kinds of information: morphological information that is attached to the word and stored in the lexicon (e.g., gender information),
inflectional information that is typically computed by a morphological analyzer (and passed to the syntactic analyzer), and derivational information that could be either pre-computed in the lexicon or dynamically computed by a morphological analyzer. In our lexical model, we require that each dictionary include as lexical morphological information the part-of-speech (using the pos feature) and an indication of whether the word has a regular morphology or not (using the Boolean regular feature). Irregular forms are listed in the dictionary if the value of the regular feature is False. This feature is provided to handle simple cases where a given class of words has only one inflectional paradigm. English nouns, for example, can be defined as having only one paradigm for the number inflection, where phonological variants are handled by the morphological processor and anything that falls outside the domain of the morphological processor is treated as an irregular form. Note that the dictionary schema must allow for the inclusion of all inflected forms for irregulars. If the linguist has to define inflectional paradigms, as is the case in many languages, these paradigms must also be specified in the dictionary schema and should allow for the specification of the various stems involved. For example, one might consider that English verbs have two paradigms, one where all forms are derived from the citation form (want, wants, wanted, wanted, wanting) modulo phonological changes, one class where some forms must be specified in the lexicon (take, takes, took, taken, taking), and a class of irregulars (be, is, was, been, being). Therefore, English verbs could be classified as regular or irregular, and regular verbs fall into one of two paradigms. The reader will have noticed that the morphological model used in the lexicon must be compatible with the model implemented by any morphological processor using the dictionary. Our experience has shown that it is not always trivial to reconcile a
morphological analyzer developed independently from a dictionary with the dictionary. The structure of the form feature must therefore include the following elements:
[type Full | Abbreviation | Acronym,
 orth [cit String,       // The citation form
       variants List],   // Optionally, syllabification, capitalization, etc.
 morph [lex [pos POS, regular Boolean],
        infl InflectionalFeatures,   // Always unspecified in the dictionary
        deriv DerivationalStructure]]
For example, the form structure of an English entry might look like ...</Paragraph>
</Section>
<Section position="2" start_page="43" end_page="43" type="sub_section">
<SectionTitle> 3.2 Syntax </SectionTitle>
<Paragraph position="0"> The gram feature groups all information related to the syntactic behavior of the word. The grammar feature gram contains as required features the part-of-speech information (feature pos) and the subcategorization frame (feature frame). The frame feature encodes the subcategorization frame of the predicate expressed as a list of phrasal types. The grammar feature may include additional features such as the subcategory, for example Mass/Countable for nouns, or Intransitive/Transitive for verbs, although this is typically better represented by defining the appropriate sub-types for each part-of-speech. Additionally, an inflectional feature infl is also defined for use by syntactic processors: the value of this feature is shared with morphology. During processing, a morphological analyzer will produce a set of inflectional features and make them available to syntax through the feature gram infl. Conversely, a syntactic generator will produce a set of inflectional features for lexical heads and make them available to the morphological generator. The grammar feature (path gram in an entry) has type Gram. This type is defined as Gram = [pos POS, frame List, infl MorphInflection]. For example, the following (partial) entry specifies two subcategorization frames for the noun &quot;announcement&quot;:
[key &quot;announcement&quot;,
 gram [pos N,
       sense gram frame &lt;NpComp[head &quot;that&quot;]&gt;,
       sense gram frame &lt;NpObl[head &quot;of&quot;]&gt;]]</Paragraph>
</Section>
</Section>
</Paper>