<?xml version="1.0" standalone="yes"?>
<Paper uid="P89-1012">
  <Title>DICTIONARIES, DICTIONARY GRAMMARS AND DICTIONARY ENTRY PARSING</Title>
  <Section position="3" start_page="0" end_page="91" type="metho">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Machine-readable dictionaries (MRD's) axe typi, tally ayailable in the form of publishers typesetting tapes, and consequently are represented by a fiat character stream where lexical data proper is heavily interspersed with special (control) characters. These map to the font changes and other notational conventions used in the printed form of the dictionary and designed to pack, and present in a codified compact visual format, as much lexical data as possible.</Paragraph>
    <Paragraph position="1"> To make maximal use of MRD's, it is necessary to make their data, as well as structure, fully ex~ licit, in a data base format that lends itself to exible querying. However, since none of the lexical data base (LDB) creation efforts to date fully addresses both of these issues, they fail to offer a general framework for processing the wide range of dictionary resources available in machine-readable form. As one extreme, the conversion of an MRD into an LDB may be carried out by a 'one-off&amp;quot; program -- such as, for example, used for the Longman Dictionary of Contemporary English (LDOCE) and described in Bogtbr_ aev and Briscoe, 1989. While the resuiting LDB is quite explicit and complete with respect to the data in the source, all knowledge of the dictionary structure is embodied in the conversion program. On the other hand, more modular architectures consisting of a parser and a _grammar -- best exemplified by Kazman's (1986) analysis of the Oxford English Dictionary (OED) -- do not deliver the structurally rich and explicit LDB ideally required for easy and unconstrained access to the source data.</Paragraph>
    <Paragraph position="2"> The majority of computational lexicography projects, in fact, fall in the first of the categories above, in that they typically concentrate on the conversion of a single dictlonarv into an LDB: examples here include the work l~y e.g. Ahlswede et al., 1986, on The Webster's Seventh New Collegiate Dictionary; Fox et a/., 1988, on The Collins English Dictionary; Calzolari and Picchi, 1988, on H Nuovo Dizionario Italiano Garzanti; van der Steen, 1982, and Nakamura, 1988, on LDOCE. Even work based on multiple dictionaries (e.g. in bilingual context: see Calzolari and Picchi, 1986) appear to have used specialized programs for eac~ dictionary source. In addition, not an uncommon property of the LDB's cited above is their incompleteness with respect to the original source: there is a tendency_ to extract, in a pre-processing phase, only some fragments (e.g.</Paragraph>
    <Paragraph position="3">  part of speech information or definition fields) while ignoring others (e.g. etymology, pronunciation or usage notes).</Paragraph>
    <Paragraph position="4"> We have built a Dictionary Entry Parser (DEP) together with grammars for several different dictionaries. Our goal has been to create a general mechanism for converting to a common LDB format a wide range of MRD's demonstrating a wide range of phenomena. In contrast to the OED project, where the data in the dictionary is only tagged to indicate its structural characteristics, we identify ,two processes which are crucial for the 'unfolding, or making explicit, the structure of an MRD: identification of the structural markers, followed by their interpretation in context resulting in detailed parse trees for individual entries. Furthermore, unlike the tagging of the OED, carried out in several passes over the data and using different grammars (in order to cope with the highly complex, idiosyncratic and ambiguous nature of dictionary entries), we employ a parsing engine exploiting unification and backtracking, and using a single grammar consisting of three different sets of rules. The advantages of handling the structural complexities of MRD sources and deriving corresponding LDB s in one operation become clear below.</Paragraph>
    <Paragraph position="5"> While DEP has been described in general terms before (Byrd et al., 1987; Neff eta/., 1988), this paper draws on our experience in parsing the Collins German-English / Collins English-German (CGE/CEG) and LDOCE dictionaries, which represent two very different types of machine-readable sources vis-~t-vis format of the typesetting tapes and notational conventions exploited by the lexicographers. We examine more closely some of the phenomena encountered in these dictionaries, trace their implications for MRD-to-LDB parsing, show how they motivate the design of the DEP grammar formalism, and discuss treatment of typical entry configurations.</Paragraph>
  </Section>
  <Section position="4" start_page="91" end_page="92" type="metho">
    <SectionTitle>
2. STRUCTURAL PROPERTIES OF MRD'S
</SectionTitle>
    <Paragraph position="0"> The structure of dictionary entries is mostly implicit in the font codes and other special characters controlling the layout of an entry on the printed page; furthermore, data is typically compacted to save space in print, and it is common for different fields within an entry to employ radically different compaction schemes and abbreviatory devices. For example, the notation T5a, b,3 stands for the LDOCE grammar codes T5a;T5b;T3 (Boguraev and Briscoe, 1989, present a detailed description of the grammar coding system in this dictionary), and many adverbs are stored as run-ons of the adjectives, using the abbreviatory convention ~ly (the same convention appliesto ce~a~o types of atfixation in general: er, less, hess, etc.). In CGE, German compounds with a common first element appear grouped together under it: Kinder-: .~.ehor m children's choir; --doe nt children's \[ village; -ehe f child marriage. I Dictionaries often factor out common substrings in data fields as in the following LDOCE and CEG entries: ia.cu.bLtor ... a machine for a keeping eggs warm until they HATCH b keeping alive babies that are too small to live and breathe in ordinary air  Furthermore, a variety of conventions exists for making text fragments perfo.,rm more than one function (the capitalization of' HATCH above, for instance, signals a close conceptual link with the word being defined). Data of this sort is not very useful to an LDB user without explicit expansion and recovery of compacted headwords and fragments of entries. Parsing a dictionary to create an LDB that can be easily queried by a user or a program therefore implies not only tagg~ag the data in the entry, but also recovering ellided information, both in form and content.</Paragraph>
    <Paragraph position="1"> There are two broad types of machine-readable source, each requiring a different strategy for recovery of implicit structure and content of dictionary entries. On the one hand tapes may consist of a character stream with no explicit structure markings (as OED and the Collins bilinguals exemplify); all of their structure is iml~li.ed in the font changes and the overall syntax ot the entry. On the other hand, sources may employ mixed r~presentation, incorporating both global record delhniters and local structure encoded in font change codes and/or special character sequences (LDOCE and Webster s Seventh).</Paragraph>
    <Paragraph position="2"> Ideally, all MRD's should be mapped onto LDB structures of the same type, accessible with a sin~le query language that preserves the user s intuition about tile structure of lexical data (Neff et a/., 1988; Tompa, 1986), Dictionary entries can be naturally represented as shallov~ hierarchies with a variable number of instances of certain items at each level, e.g. multiple homographs within an entry or multiple senses within a homograph. The usual inlieritance mechanisms associated with a hierarchical orgardsation of data not only ensure compactness of representation, but also fit lexical intuitions. The figures overleaf show sample entries from CGE ,and LDOCE and their LDBforms with explicitly unfolded structure. null Within the taxonomy of normal forms .(NF) defreed by relational data base theo~, dictionary entries are 'unnormalized relations in which attributes can contain other relations, rather than simple scalar values; LDB's, therefore, cannot be correctly viewed as relational data bases (see Neff et al., 1988). Other kinds of hierarchically structured data similarly fall outside of the relational  NF mould; indeed recently there have been efforts to design a generalized data model which treats fiat relations, lists, and hierarchical struc-Ures uniformly (Dadam et al., 1986). Our LDB rmat and Lexical Query l_anguage (LQL) support the hierarchical model for dictionary data; the output of the .parser, similar to the examples in Figure 3 and Figure 4, is compacted, encoded, and loaded into an LDB. nei.~,.ce/'nju:s~ns II 'nu:-: n I a person or an/real that annoys or causes trouble, PEST: Don't make a nuisance of yourself.&amp;quot; sit down and be quiet! 2 an action or state of affairs which causes trouble, offence, or unpleasantness: What a nuisance! I've forgotten my ticket 3 Commit no nuisance (as a notice in a public place) Do not use this place as a a lavatory b aTIP ~</Paragraph>
  </Section>
  <Section position="5" start_page="92" end_page="94" type="metho">
    <SectionTitle>
3. DEP GRAMMAR FORMALISM
</SectionTitle>
    <Paragraph position="0"> The choice of the hierarchical model for the representation of the LDB entries (and thus the output of DEP) has consequences for the parsing mechanism. For us, parsing involves determining the structure of all the data, retrieving implicit information to make it explicit, reconstructing ellided information, and filling a (recursive) template, without any data loss. This contrasts with a strategy that fills slots in predefmed (and finite) sets of records for a relational system, often discarding information that does not fit.</Paragraph>
    <Paragraph position="1"> In order to meet these needs, the formalism for dictionary entry grammars must meet at least three criteria, in addition to being simply a notational device capable of describing any particular  dictionary format. Below we outline the basic requirements for such a formalism.</Paragraph>
    <Section position="1" start_page="93" end_page="93" type="sub_section">
      <SectionTitle>
3.1 Effects of context
</SectionTitle>
      <Paragraph position="0"> The graham,_ .~ formalism should be capable of handling mildly context sensitive' input streams, as structurally identical items may have widely differing functions depending on both local and global contexts. For example, parts of speech, field labels, paraphrases of cultural items, and many other dictionary fragments all appear in the CEG in italics, but their context defines their identity and, consequently, their interpretation.</Paragraph>
      <Paragraph position="1"> Thus, in the example entry in Figure 3 above, m, (also Sport), (of chapter), and (spec) acquire the very different labels of pos, do, in, us=g=_not=, and sty1.=. In addition, to distint~ish between domain labels, style labels, dialect els, and usage notes, the rules must be able to test candidate elements against a closed set of items. Situations like this, involving subsidiary application of auxiliary procedures (e.g. string matching, or dictionary lookup required for an example below), require that the rules be allowed to selectively invoke external functions.</Paragraph>
      <Paragraph position="2"> The assignment of labels discussed above is based on what we will refer to in the rest of this paper asglobal context. In procedural terms, this is defined as the expectations of a particular grammar fragment, reflected in the names of the assodated rides, which will be activated on a given pare through the grammar. Global context is a dynamic notion, best thought of as a 'snapshot' of the state of the parser at any_ point of processing an entry. In contrast, local context is defined by finite-length patterns of input tokens, ,arid has the effect of Identifying typographic 'clues to the structure of an entry. Finally, immediate context reflects v.ery loc~ character patte12as which tend t 9 drive the initial segmentatmn ot the 'raw' tape character stream and its fragmentation into structure- and information-carrying tokens.</Paragraph>
      <Paragraph position="3"> These three notions underlie our approach to structural analysis of dictionaries andare fundamental to the grammar formalism design.</Paragraph>
    </Section>
    <Section position="2" start_page="93" end_page="93" type="sub_section">
      <SectionTitle>
3.2 Structure manipulation
</SectionTitle>
      <Paragraph position="0"> The formalism should allow operations on the (partial) structures delivered during parsing, and not as.separate tree transtormations once processing is complete. This is needed, for instance, in order to handle a variety of scoping phenomena (discussed in section 5 below), factor out items common to more than one fragment within the same entry, and duplicate (sub-)trees as complete LDB representatmns ~ being fleshed out.</Paragraph>
      <Paragraph position="1"> Consider the CEG entry for abutment&amp;quot;: I abutment \[.,.\] n (Archit) Fltigel- or Wangenmauer f. I Here, as well as in &amp;quot;title&amp;quot; (Figure 3), a copy of the gender marker common to both translatmns needs to migrate back to the ftrst tram. In addition, a copy of the common second compound element -mauer also needs to migrate (note that</Paragraph>
      <Paragraph position="3"> /-gender: f identifying this needs a separate noun compound parser augmented with dictionary lookup). An example of structure duplication is illustrated by our treatment of (implicit) cross-references in LDOCE, where a link between two closely related words is indicated by having one of {hem typeset in small capitals embedded in, a definition of the other (e.g. &amp;quot;PEST' and &amp;quot;TIP' in the deftnitions of &amp;quot;nuisance&amp;quot; in Figure 4). The dual purpose such words serve requires them to appear on at least two different nodes in the final LDB structure: C/~f_string and implicit_xrf. In order to perform the required transformations, the formalism must provide an explicit dle on partial structures, as they are being built by the parser, together with operations which can mariipulate them -- both in terms of structure decomposition and node migration.</Paragraph>
      <Paragraph position="4"> In general, the formalism must be able to deal witli discontinuous constituents, a problem not dissimilar to the problems of discontinuous constituents in natural language parsing; however in dictionaries like the ones we discuss the phenomena seem less regular (if discontinuous constituents can be regarded as regular at all).</Paragraph>
    </Section>
    <Section position="3" start_page="93" end_page="94" type="sub_section">
      <SectionTitle>
3.3 Graceful failure
</SectionTitle>
      <Paragraph position="0"> The nature of the information contained in dictionaxies is such that certain fields within entries do not use any conventions or formal systems to present their data. For instance, the &amp;quot;USAGE&amp;quot; notes in LDOCE can be arbitrarily complex and unstructured. . fragments, .cdegmbining straaght text with a vanety of notattonal devices (e.g. font changes, item highlighting and notes segmentation) in such a way that no principled structure may be imposed on them. Consider, for example, the annotation of &amp;quot;loan&amp;quot;: loan 2 v ........ esp. AmE to give (someone) the use of, lend ........ USAGE It is perfectly good AmE to use loan in the meamng of lend: He loaned me ten dollars.</Paragraph>
      <Paragraph position="1"> The word is often used m BrE, esp. in the meaning 'to lend formally for a long period': He loaned h/s collection of pictures to the public GALLERY but many people do not like it to be used simply in the meaning of lend in BrE...</Paragraph>
      <Paragraph position="2"> Notwithstanding its complexity, we would still like to be able to process the complete entry, recovering as much as we can from the regularly encoded information and only 'skipping' over its truly unparseable fragment(s). Consequently, the formalism and the underlying processing flame- null work should incorporate a suitable mechanism for explicitly handling such data, systematically occumng in dictionaries.</Paragraph>
      <Paragraph position="3"> The notion of .graceful failure is, in fact, best regarded as 'seledive parsing'. Such a mechanism has the additional benefit of allowing the incremental development of dictionary grammars with (eventually) complete coverage, and arbit .r-~.ry depth of analysis, of the source data: a particular grammar might choose, for instance, to treat everything but the headword, part of speech, and pronunciation as 'junk', and concentrate on elaborate parsing of the pron.u:n, ciation fields, while still being able to accept all input without having to assign any structure to most of it.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="94" end_page="97" type="metho">
    <SectionTitle>
4. OVERVIEW OF DEP
</SectionTitle>
    <Paragraph position="0"> DEP uses as input a collection of 'raw' typesetting images of entries from a dictionary 0.e. a typesetting .tape. with begin-end' boundaries of entries explicitly marked) and, by consulting an externally supplied .gr-qmmar s.p~.&amp;quot; c for that particular dictionary, produces explicit structural representations for the individual entries, which are either displayed or loaded into an LDB.</Paragraph>
    <Paragraph position="1"> The system consists of a rule compiler, a parsing nDg~Be, a dictionary entry template generator, an loader, and various development facilities, all in a PROLOG shell. User-written PROLOG functions and primitives are easily added to the system. The fdrmalism and rule compiler use the Modular Logic Grammars of McCo/'d (1987) as a point of d~ure, but they have been substantially modified and extended to reflect the requirements of parsing dictionary entries.</Paragraph>
    <Paragraph position="2"> The compiler accepts three different kinds of rules corresponding to the three phases of dictionary entry analysis: tokenization, retokenization, and proper. Below we present informally ghts of the grammar formalism.</Paragraph>
    <Section position="1" start_page="94" end_page="94" type="sub_section">
      <SectionTitle>
4.1 Tokenization
</SectionTitle>
      <Paragraph position="0"> Unlike in sentence parsing, where tokenization (or lexical analysis) is driven entirely by blanks and punctuation, the DEP grammar writer explicitly defines token delimiters and token substitutions. Tokenixation rules specify a one-to-one mapping from a character substring to a rewrite token; the mapping is applied whenever the specified substring is encountered in the original typesetting tape character stream, and is only sensitive to immediate context. Delimiters are usually font change codes and other special characters or symbols; substitutions axe atoms (e.g.</Paragraph>
      <Paragraph position="1"> ital_correction, field_m) or structured terms be.g. fmtl italic l, ~! &amp;quot;1&amp;quot; I). Tokenization reaks the source character stream into a mixture of tokens and strings; the former embody the notational conventions employed by the printed dictionary, and are used by tlae parser to assign structure to an entry; the latter carry the textual (lexical) content of the dictionary. Some sample rules for the LDOCE machine-readable source, marking the beginning and end of font changes, or making explicit special print symbols, are shown below (to facilitate readability, (*AS) represents the hexadecimal symbol x'AS').</Paragraph>
      <Paragraph position="2"> dolim( &amp;quot;(~i)&amp;quot;, font( i~alic } ). dolia( &amp;quot;(UCA)&amp;quot;, font( beginl samll_caps ) I ). dolim(II{~mS) ii f~r~t ( end( small_caps ) ) ).</Paragraph>
      <Paragraph position="3"> dolim!&amp;quot;(~)&amp;quot;, ital correction). delill( &amp;quot;OqlO)&amp;quot;, hyl~in_mark ). Immediate context, as well as local string rewrite, &amp;quot; can be specified by more elaborate tokenization rules, in which two additional arguments specify strings to be 'glued' to the strings on the left and right of the token delimiter, respectively. For CEG, for instance, we have dotiml&amp;quot;. &gt;u4&lt;&amp;quot;, f~t;~l;)&gt;~).&lt;deg'). delim( &amp;quot;:&gt;u~&lt;&amp;quot;, delim( &amp;quot;&gt;uS&lt;&amp;quot;, font( roman ) ).</Paragraph>
      <Paragraph position="4"> Tokenization opeEates recursively on the string fragments formed by an active rule; thus, applicatton of the first two rules above to the stnng ,,mo~. :~a,: ~r~&amp;quot; results in the following token list: &amp;quot;xxx&amp;quot; . lad . fontlbold) , &amp;quot;y~C/&amp;quot;.</Paragraph>
    </Section>
    <Section position="2" start_page="94" end_page="97" type="sub_section">
      <SectionTitle>
4.2 Retokenization
</SectionTitle>
      <Paragraph position="0"> Longer_-range (but still local) context sensitivity~ is irfiplemented via retokenization, the effect ot which is the 'normalization' of the token list.</Paragraph>
      <Paragraph position="1"> Retokenization rules conform to a general rewrite format -- a pattern on the left-hand side defines a context as a sequence of (explicit or variable place holder) tokens, in which the token list should be adlusted as indicated by the right-hand side -- and can be used to .perform a range of cleaning up tasks before parsing proper.</Paragraph>
      <Paragraph position="2"> Streamlining the token list. Tokens without information- or structure-bearing content; such as associated with the codes for fialic correction or thin space, are removed: ital correction : ,Seg &lt;:&gt; /Seg.</Paragraph>
      <Paragraph position="3"> Superfluous font control characters can be simply deleted, when they follow or precede certain data-can'ying tokens which also incorporate typesetting information (such as a homogra.ph superscript symbol or a pronunciation marker indicating the be~finning of the scope of a phonetic font): rk font! phonetic ) &lt; * rk. supl N) &lt; * R (Re)adjusting the token list. New tokens can be introduced in place of certain token sequences: bra : fonttitalic) &lt;=&gt; beginlrestric~ion). f~'tt(r~m~'t) : ket &lt; * ~wl(r~stricti~'b). Reconstruction of string segments. Where the initial (blind) tokenization has produced spurious lragraentation, string sewnents can be suitably reconstructed. For instance, a hyphen-delimited sequence of syllables in place of the print form of a headword, created by tokeni~ation on ~,-rg), can be 'glued' back as follows: *Syl_l : ~ mark : +$ 1 Z t strxngpTSyl 1 ) : $s~r~ngp( SY=1 2 ) &lt;=&gt; w~oin(Seg, S~1_1.' .... .$yl_2.n:l&amp;quot;I t~.</Paragraph>
      <Paragraph position="4"> This rule demonstrates a characteristic property.</Paragraph>
      <Paragraph position="5"> of the DEP formalism, discussed in more detail  later: arbitrary Prolog predicates can be invoked to e.g. constrain rule application or manipulate strings. Thus, the rule oialy applies to string tokens surrounding a hyphen character; it manufactures, by string concatenation, a new segment which replaces the triggering pattern.</Paragraph>
      <Paragraph position="6"> Further segmentation. Often strings need to be split, with new tokens inserted between the pseces, to correct infelicities in the tapes, or to insert markers between recognizably distinct contiguous segments that appear in the same font.</Paragraph>
      <Paragraph position="7"> The rule below implements the CGE/CEG convention that a swung dash is an implicit switch to bold if the current font is not bold already.</Paragraph>
      <Paragraph position="9"> Dealing with irregular input. Rules that rearrange tokens are often needed to correct errors in the tapes. In CEG/CGE, parentheses surrounding italic items often appear (erroneously) in a roman font. A suite of rules detaches the stray parentheses from the surrounding tokens, moves them around the font marker, and glues them to the item to which they belong.</Paragraph>
      <Paragraph position="11"> The last of these rules invokes retokenization recursively on the sublist beginning with the font token and including all tokens to its right. In principle, the three rules can be subsumed by a single one; in practice, separate rules also 'catch' other types of erroneous or noisy input.</Paragraph>
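      <Paragraph> The rewrite format itself can be emulated with a small fixpoint rewriter; the sketch below (ours, not DEP's implementation) applies simplified analogues of the rules above anywhere in the token list until no rule fires:

  % Two simplified rewrite rules: drop italic-correction tokens, and
  % collapse consecutive font changes, keeping only the last one.
  rewrite([ital_correction], []).
  rewrite([font(_), font(F)], [font(F)]).

  retokenize(Tokens, Normal) :-
      append(Prefix, Rest, Tokens),          % choose a position in the list
      rewrite(Pattern, Replacement),
      append(Pattern, Suffix, Rest), !,      % match a rule's pattern there
      append([Prefix, Replacement, Suffix], Rewritten),
      retokenize(Rewritten, Normal).
  retokenize(Tokens, Tokens).                % no rule applies: normalized

  % ?- retokenize([font(roman), font(italic), "abc", ital_correction], T).
  % T = [font(italic), "abc"].
</Paragraph>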
      <Paragraph position="12"> Although retokenization is conceptually a separate process, it is interleaved in practice with tokemzation, bringing imp .rovements in performance. Upon completion, the tape stream corresponding, for instance, to the LDOCE entry non-trivial manipulation of (partial) trees, as implicit and/or ellided information packed in the bntries is being recovered and reor-gaxxized. Parsing is a top-down depth-first operation, and only the first successful parse is used. This strategy, augmented by a 'junk collection' mechanism (discussed below) to recover from parsing failures, turns out to be adequate for handling all of the phenomena encountered while assigning structural descriptions to dictionary entries.</Paragraph>
      <Paragraph position="13"> Dictionary grammars follow the basic notational conventions of logic grammars; .however, we use additional operators tailored to the structure manipulation requirements of dictionary parsing. In pLrticular, the right-hand side of grammar rules admits the use of-four different types ot operators, designed to deal with token list consumption, token list manipulation, structure assignment, and (local) tree transformations. These operators suitably modify the expansions of grammar rules; ultimately, all rules are compiled into Prolog.</Paragraph>
      <Paragraph position="14"> Token consumption. Tokens axe removed from the token list by the + and - operators; + also assigns them as terminal nodes under the head of the invoking rule. Typically, delimiters introduced by tokenization (and retokenization) are removed once they serve their primary function of identifying local context; string segments of the token list are assigned labels and migrate to appropriate places in the final structural represeniation ot an entry. A simple rule for the part of speech fields in CEG (Figure 3) would be: los ::&gt;-fzntl italic) = +Sag.</Paragraph>
      <Paragraph position="15"> A structured term stpos, &amp;quot;n&amp;quot;.nil) is built as a result of the rule consuming, for instance, the token &amp;quot;n&amp;quot;, Rule names are associated with attri- butes in the LDB representation for a dictionary entry; structures built by rules are pairs of the form sire, Vii=l, where velt~ is a list of one or more elements (strings or further structures 'returned' by reeunively invoked rules).</Paragraph>
      <Paragraph position="16"> au.tit.fiC/ ;C/C/'tistik, adj suffering from AUTISMI: I</Paragraph>
      <Paragraph position="18"> Token list manipulation. Adjustment of the token list may be required in, for instance, simple cases of recovering elided information or reordering tokens in the input stream. This is achieved by the ins and insx operators, which respectively insert single, or sequences of, tokens into the token list at the current position; and the ++ operator, which inserts tokens (or arbitrary tree fragments) directly into the structure under construction. Assuming a global variable, *hwd, bound to the headword of the current entry, and the ability to invoke a Prolog string concatenation function from within a rule (via the t operator; see below), abbreviated morphological derivations stored as run-ons might be recovered by a rule like the one below.</Paragraph>
      <Paragraph position="20"> runon ==> -"~" : -Suffix : $isa( Suffix, suffix ) : t concat( *hwd, Suffix, Deriv ) : ++Deriv. ($isa is separately defined to test for membership of a closed class of suffixes.)</Paragraph>
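      <Paragraph> Outside the rule formalism, the recovery itself is plain string manipulation; a sketch (ours; suffix inventory abridged) of what the invoked Prolog function amounts to:

  suffix("ly"). suffix("er"). suffix("ness"). suffix("less").

  expand_runon(Headword, Abbrev, Derived) :-
      string_concat("~", Suffix, Abbrev),     % strip the swung dash
      suffix(Suffix),                         % closed-class check (cf. isa)
      string_concat(Headword, Suffix, Derived).

  % ?- expand_runon("quick", "~ly", D).
  % D = "quickly".
</Paragraph>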
      <Paragraph position="21"> Structure assignment. The ++ operator can only assign arbitrary structures directly to the node in the tree which is currently under construction. A more general mechanism for retaining structures for future use is provided by allowing variables to be (optionally) associated with grammar rules: in this way the grammar writer can obtain an explicit handle on tree fragments, in contrast to the default situation where each rule implicitly 'returns' the structure it constructs to its caller. The following rule, for example, provides a skeleton treatment of the situation exemplified in Figure 4, where a definition-initial substring is common to more than one sub-definition.</Paragraph>
      <Paragraph position="23"> The defs rule removes the definition-initial string segment and passes it on to the repeatedly invoked subdef. This manufactures the complete definition string by concatenating the common initial segment, available as an argument instantiated two levels higher, with the continuation string specific to any given sub-definition. Tree transformations. The ability to refer, by name, to fragments of the tree being constructed by an active grammar rule, allows arbitrary tree transformations using the complementary operators -% and +%. They can only be applied to non-terminal grammar rules, and require the explicit specification of a place-holder variable as a rule argument; this is bound to the structure constructed by the rule. The effect of these operators on the tree fragments constructed by the rules they modify is to prevent their incorporation into the local tree (in the case of -%), to explicitly splice it in (in the case of +%), or simply to capture it (%). The use of this mechanism in conjunction with the structure naming facility allows both permanent deletion of nodes, as well as their practically unconstrained migration between, and within, different levels of grammar (thus implementing node raising and reordering). It is also possible to write a rule which builds no structure (the utility of such rules, in particular for controlling token consumption and junk collection, is discussed in section 5).</Paragraph>
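      <Paragraph> Returning to the defs example above: the string side of that treatment reduces to prefix concatenation. A plain-Prolog sketch (ours) of what the repeated concatenation computes, using the "incubator" example of section 2:

  expand_defs(Common, Subdefs, Full) :-
      maplist(string_concat(Common), Subdefs, Full).

  % ?- expand_defs("a machine for ",
  %                ["keeping eggs warm until they HATCH",
  %                 "keeping alive babies that are too small ..."],
  %                Full).
  % Full = ["a machine for keeping eggs warm until they HATCH",
  %         "a machine for keeping alive babies that are too small ..."].
</Paragraph>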
      <Paragraph position="24"> Node-raising is illustrated by the grammar fragment below, which might be used to deal with certain collocation phenomena. Sometimes dictionaries choose to explain a word in the course of defining another related word by arbitrarily inserting mini-entries in their definitions: lach.ry.mal /'lækriməl/ adj [Wa5] of or concerning tears or the organ (lachrymal gland /'.. ./) of the body that produces them. The potentially complex structure associated with the embedded entry specification does not belong to the definition string, and should be factored out as a separate node moved to a higher level of the tree, or even used to create a new tree entirely. The rule for parsing the definition fields of an entry makes a provision for embedded entries; the structure built as an embedded_entry is bound to the Struc argument in the defn rule. The -% operator prevents the embedded_entry node from being incorporated as a daughter to defn; however, by unification, it begins its migration 'upwards' through the tree, till it is 'caught' by the entry rule several levels higher and inserted (via +%).</Paragraph>
      <Paragraph position="25"> The expressive power of the system is further enhanced by allowing optionality (via the opt operator), alternations (|) and conditional constructs in the grammar rules; the latter are useful both for more compact rule specification and to control backtracking while parsing. Rule application may be constrained by arbitrary tests (invoked, as Prolog predicates, via a t operator), and a $string operator is available for sampling local context. The mechanism of escaping to Prolog, the motivation for which we discuss below, can also be invoked when arbitrary manipulation of lexical data -- ranging from e.g. simple string processing to complex morphological analysis -- is required during parsing.</Paragraph>
      <Paragraph position="26"> Tree structures. Additional control over the shape of dictionary entry trees is provided by having two types of non-terminal nodes: weak and strong ones. The difference is in the explicit presence or absence of nodes, corresponding to the rule names, in the final tree: a structure fragment manufactured by a weak non-terminal is effectively spliced into the higher level structure, without an intermediate level of naming. One common use of such a device is the 'flattening' of branching constructions, typically built by recursive rules: the declaration strong_nonterminals( defs . subdef . nil ). when applied to the sub-definitions fragment above, would lead to the creation of a group of sister subdef nodes, immediately dominated by a defs node. Another use of the distinction between weak and strong non-terminals is the effective mapping from typographically identical entry segments to appropriately named structure fragments, with global context driving the name assignment. Thus, assuming a weak label rule which captures the label string for further testing, analysis of the example labels discussed in 3.1 could be achieved as follows (also see Figure 3): label(X) ==> -begin( restriction ) : +X : $stringp( X ) : -end( restriction ). tran ==> opt( domain | style | dial | usage_note ) : word. domain ==> label( X ) : $isa( X, dom_lab ). style ==> label( X ) : $isa( X, style_lab ). dial ==> label( X ) : $isa( X, dial_lab ). usage_note ==> label( X ). Such a mechanism captures generalities in typographic conventions employed across any given dictionary, and yet preserves the distinct name spaces required for a meaningful unfolding of a dictionary entry structure.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="97" end_page="100" type="metho">
    <SectionTitle>
5. RANGE OF PHENOMENA TO HANDLE
</SectionTitle>
    <Paragraph position="0"> Below we describe some typical phenomena encountered in the dictionaries we have parsed and discuss their treatment.</Paragraph>
    <Section position="1" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
5.1 Messy token lists: controlling token consumption
</SectionTitle>
      <Paragraph position="0"> consumption The unsystematic encoding of font changes before, as well as after, punctuation marks (commas, semicolons, parentheses) causes blind tokenization to remove punctuation marks from the data to which they are visually and conceptually attached. As already discussed (see 4.2), most errors of this nature can be corrected by retokenization. Similarly, the confusing effects of another pervasive error, namely the occurrence of consecuti, e font changes, can be avoided by having a retokenization rule simply remove all but the last one. In general, context sensitivity is handled by (re)adjusting the token list; retokenization, however, is only sensitive to local context. Since global context cannot be determined unequivob.ally till parsing, the grammar writer is given complete control over the consumption and addition of tokens as parsing proceeds from left to right -- this allows for motivated recovery of ellisions, as well as discarding of tokens in local transformations.</Paragraph>
      <Paragraph position="1"> For instance, spurious occurrences of a font marker before a print symbol such as an opening parenthesis, which is not affected by a font dec' laration, clearly cannot be removed by a retokenization rule font! roman\] : bra &lt;=&gt; bra.</Paragraph>
      <Paragraph position="2"> (The marker may be genuinely closing a font segment prior to a different entry fragment which commences with, e.g., a left parenthesis). Instead, a grammar rule anticipating a br~ token within its scope can readiust the token list using either of: ... ==&gt; ... : -fontlroman) : -bra : inslbr-a). ... ==&gt; ... : -fantlromanl : stringlbra.*\]. (The $*ri-e operator tests for a token list with br~ as its first element.)</Paragraph>
    </Section>
    <Section position="2" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
5.2 The Peter-minus-1 principle: scoping phenomena
</SectionTitle>
      <Paragraph position="0"> Consider the entry for &amp;quot;Bankrott&amp;quot; in Figure 2.</Paragraph>
      <Paragraph position="1"> Translations sharing the label (fig) (&amp;quot;breakdown, collapse ') are grOUl&gt;ed together ~6ith commas and separated from other lists with semicolons. The restnctlon (context or label) precedes the llst and can be said to scope 'right' to the next semicolon.</Paragraph>
      <Paragraph position="2"> We place the righ-t-scoping labels or context under the (semicolon-delimited) t~,n_group as sister  among terms in the target langtmge responds to the &amp;quot;do not discard anything&amp;quot; philosophy; placing common data items as high as possible in the tree (the 'Peter-minus-1 princaple') is in the spirit of Flickinger et al. (1985), and implements the notion of placing a t~al node at the hi~. est position hi tlae tree wlaere its value is valid in combination with the values at or below its sister nodes. The latter principle also motivates sets of rules like ~rm~ ==&gt; &amp;quot;'&amp;quot; pr~n ... : homograph .... ==&gt; pratt used to account for entries in English where the pronunciation differs for different homographs.</Paragraph>
    </Section>
    <Section position="3" start_page="97" end_page="98" type="sub_section">
      <SectionTitle>
5.3 Tribal memory: rule variables
</SectionTitle>
      <Paragraph position="0"> Some compaction or notational conventions in dictionaries require a mechanism for a rule to re,member (part of) its ancestry or know its sister s descendants. Consider the l~roblem of determining the scope of gender or labels immediately following variants of the headword: Advolmturbfiro nt (Sw), Advokaturskanzlei f ( Aus) lawyer's offize. Tippfr~ein nt ( lnf), ~ppse f -, -n ( pej ) typist.</Paragraph>
      <Paragraph position="1"> Alchemic ( esp Aus) , Akhimief alchemy.</Paragraph>
      <Paragraph position="2"> The first two entries show forms differing, respectively, in dialect and gender, and register and gender. The third illustrates other combinations.</Paragraph>
      <Paragraph position="3"> The rule accounting for labels after a variant must know whether items of like type have already been found after the hcadword, since items before the variant belong to the headword, different items of identical type following both belong in.dividuaUy, and all the rest are common to botla.</Paragraph>
      <Paragraph position="4"> This 'tribal' memory is implemented using rule variables:</Paragraph>
      <Paragraph position="6"> In addition to enforcing rule constraints via unification, rule arguments also act as 'channels' for node raising and as a mcchanisrn for controlling rule behaviour depending on invocation context.</Paragraph>
      <Paragraph position="7"> In addition to enforcing rule constraints via unification, rule arguments also act as 'channels' for node raising and as a mechanism for controlling rule behaviour depending on invocation context. This latter need stems from a pervasive phenomenon in dictionaries: the notational conventions for a logical unit within an entry persist across different contexts, and the sub-grammar for such a unit should be aware of the environment it is activated in. Implicit cross-references in LDOCE are consistently introduced by font( begin( small_caps ) ), independent of whether the running text is a definition (roman font), example (italic), or an embedded phrase or idiom (bold); by enforcing the return to the font active before the invocation of implicit_xrf, we allow the analysis of cross-references to be shared: implicit_xrf( X ) ==> -font( begin( small_caps ) ) ...</Paragraph>
      <Paragraph position="9"/>
    </Section>
    <Section position="4" start_page="98" end_page="98" type="sub_section">
      <SectionTitle>
5.4 Unpacking, duplication and movement of structures: node migration
</SectionTitle>
      <Paragraph position="0"> structures: node migration The whole range of phenomena requiring explicit manipulation of entry fragment trees is handled by the mechanisms for node raising, reordering, and deletion. Our analysis of implicit cross-references in LDOCE factors them out as separate structural units participatingin the make-up of a word sense definition, as well as reconstructs a 'text image' of the definition text, with just the orthography of the cross-reference item 'spliced in' (see Figure 4).</Paragraph>
      <Paragraph position="1">  darn ==&gt; .dof_segs.! O_String) . : ooT_szringCD_St r trig J. clef segslStr_l) = * def_nugget(Seg) ( d~f segslStr O) Str-O : &amp;quot;&amp;quot; )tcon(~*( Seg,Str_O ,Str_l ).</Paragraph>
      <Paragraph position="2"> def_nugget(Ptr ) ==&gt; 7.iatPlicit xrC/ (s( impliEit xrf, . s( to, Ptr.Ril ). Resx ) ). def_nuggot! Seg ) ==&gt; -Seg : Sstringpt Seg ). def_strlngi Dof) ==&gt; /+Oef.</Paragraph>
      <Paragraph position="3">  The rules build a definition string from any sequence of substrings or lexical items used as cross-references: by invoking the appropriate deC/_nusmat rule, the simple segments are retained only for splicing the complete definition text; cross-reference pointers are extracted from the structural representation of an implicit erossreference; and itmlicit._xef nodes are propagated up to a sister position to the dab_string. The string image is built incrementally (by string concatenation, as the individual a-C/_nutmts are parsed); ultim, ately the ~C/_strir~ rule simply incorporates tt into the structure for ae~. Declaring darn, def string and implicit_xrf to be strong non-terminals ultimately results in a dean structure similar to the one illustrated in  Copying and lateral migration of common gender labels in CEG translations, exemplified by title' (Figure 3) and &amp;quot;abutment&amp;quot; (section 3.2), makes a differ r- ent use of the C/z operator. To capture the leftward scope of gender labels, in contrast to common (right-scoping) context labels, we create, for each noun translatton (tran), a gender node with an empty value. The comma-delimited *ran nodes are collected by a recursive weak non- terminal *fans rule. trams ==&gt; tran(G) : opt( -ca : trans(G) ). tran(G) :=&gt; ... word ... : opt( -Zoenektr! G ) ) : *7.gendor( G ).</Paragraph>
      <Paragraph position="4"> The (conditional) removal of gander&amp;quot; in the second rule followed by (obligatory) insertion of a ~ne~r node captures the gender if present and 'digs a hole' for it if absent. Unification on the last iteration of tear~ fills the holes.</Paragraph>
      <Paragraph position="5"> Noun compound fragments, as in &amp;quot;abutment&amp;quot; can be copied and migrated forward or backward using the same mechknism. Since we have not implemented the noun compound parsing mechamsm required for identification of segments to be copied, we have temporized by naming the fragments needing partners alt_.=C/x or alt_sex.</Paragraph>
    </Section>
    <Section position="5" start_page="98" end_page="99" type="sub_section">
      <SectionTitle>
5.5 Conflated lexical entries: homograph unpacking
</SectionTitle>
      <Paragraph position="0"> unpacking We have implemented a mechanism to allow creation of additional entries out of a single one, for example from orthographic, dialect, or morphological variants of the original headword. Some CGE examples were given in sections 2 and 5.3 above. To handle these, the rules build the second entry inside the main one and manufacture cross reference information for both main form and variant, in anticipation of the implementation of a splitting mechanism. Examples of other types appear in both CGE and CEG: vampire \[...\] n (lit) Vampir, Blutsauger (old~ m; (fig) Vampir m. - hat Vampir, Blutsauger (old) m.</Paragraph>
      <Paragraph position="1"> wader \[...\] n (a) (Orn) Watvogel m. (b) ~s pl (boots) Watstiefel pl.</Paragraph>
      <Paragraph position="2"> house in cpd~ HaLts-; ~ arrest n Hausarrest m; ~ boat n Hausboot n~ - baund adj ans Haus gefesselt; .... house:. --hunt vi auf Haussuche sein; they have started --hunting sic haben angefangen, nach einem Haus zu suchen; -hunting n Haussuche n; ....</Paragraph>
      <Paragraph position="3"> The conventions for morphological vari,'ants, used heavily in e.g. LDOCE and Webster s Seventh, are different and would require a different mechanism. We have not yet developed a generalized rule mechanism for ordering any kind of split; indeed we do not know if it ts possible, given the wide variation ~, seemingly aa hoc conventions for 'sneaking in logically separate entries into related headword definitions: the case of &amp;quot;lachrymal gland&amp;quot; in 4.3 is iust one instance of this phenomena; below we list some more conceptually similar, but notationally different, examples, demonstrating the embedding of homographs in the variant, run-on, word-sense and example fields of LDOCE.</Paragraph>
      <Paragraph position="4"> daddy long.legs .da~i lot~jz also (/'m/) crane fly -- n ... a type of flying insect with long legs ac.rLmo.ny ... n bitterness, as of manner or language -- -nious ~,kri'maunias/ adj: an acrimonious quarrel --niously adv crash I ... v ... 6 infml also gatecrash -- to join (a party) without having been invited ...</Paragraph>
      <Paragraph position="5"> folk et.y.mol.o.gy ,,..'--~ n the changing of straage or foreign words so that they become like quite common ones: some people say ~parrowgrass instead of ASPARAGUS: that ia an example of folk etymology  5.6 Notational promiscuity: selective tokenization Often distinctly different data items appear contiguous in the same font: the grammar codes of LDOCE (section 2) are just one example. Such run-together segments clearly need their own tokenization rules, which can only be applied when they are located during parsing. Thus, commas and parentheses take on special meaning in the string &amp;quot;X(to be)l,7&amp;quot;, indicating, respectively, ellision of data and optionality of p~ase. This is a different interpretation from e.g. alternation (consider the meaning of &amp;quot;adj, noun&amp;quot;)or the enclosing of italic labels m parentheses (Figure 3). Submission of a string token to further tokemzation is best done by revoking a special purpose pattern matching module; thus we avoid global (and blind) tokenization on common (and ambiguous) characters such as punctuation marks. The functionality required for selective tokenization is provided'by a ~e primitive; below we demonstrate the construction of a list of sister synca* nodes from a segment like &amp;quot;n, v, adj&amp;quot;, repetitively invoking oa)-~a) to break a string into two substrings separated by a comma:</Paragraph>
      <Paragraph position="7"/>
    </Section>
    <Section position="6" start_page="99" end_page="100" type="sub_section">
      <SectionTitle>
5.7 Parsing failures: junk collection
</SectionTitle>
      <Paragraph position="0"> The systematic irregularity of dictionary data (see section 3.3) is only one problem when parsing dictionary entries. Parsing failures in general are common during .gr-,~maar development; more specifically, they tmght arise due to the format of an entry segment being beyond (easy) capturing within the grammar formalism, or requiring non-trivial external functionality (such as compound word parsing or noun/verb phrase analysis).</Paragraph>
      <Paragraph position="1"> Typically, external procedures o~. rate on a newly constructed string token which represents a 'packed' unruly token list. AlternaUvely, if no format need be assigned to the input, the graxn. mar should be able to 'skip over' the tokens m the list, collecting them under a 'junk' node.</Paragraph>
      <Paragraph position="2"> If data loss is not an issue for a specific application, there is no need even to collect tokens from irregular token lists; a simple rule to skip over USAGE fields might be wntten as usacje ==&gt; -usage nmrk : use field. use field ==&gt; -U ToKen : Snotiee~d ufield} : opt( use_f ield ). (Rules like these, building no structure, are especially convenient when extensive reorganizatmn of tile token list is required -- typically in cases of grammar-driven token reordering or token deletion without token consumption.) In order to achieve skipping over unparseable input without data loss, we have implemented a ootleztive rule class. The structure built by such rules the (transitive) concatenation of all the character strings in daughter segments. Coping with gross irregularities is achieved by picking up any number of tokens and 'packing' them tother. This strategy is illustrated by a grammar phrases conjoined with italic 'or' in example sentences and/or their translations (see Figure 3).</Paragraph>
      <Paragraph position="3"> The italic conjunction is surrounded by slashes in the resulting collected string as an audit trail. The extra argument to conj enforces, following the strategy outlined in section 5.3, rule application only in the correct font context.</Paragraph>
      <Paragraph position="5"> strong_nonterminals( source . targ . nil ).</Paragraph>
      <Paragraph position="6"> collectives( conj . nil ).</Paragraph>
      <Paragraph position="7"> Finally, for the most complex cases of truly irregular input, a mechanism exists for constraining junk collection to operate only as a last resort and only at the point at which parsing can go no further. 5.8 Augmenting the power of the formalism: escape to Prolog. Several of the mechanisms described above, such as contextual control of token consumption (section 5.1), explicit structure handling (5.4), or selective tokenization (5.6), are implemented as separate Prolog modules. Invoking such external functionality from the grammar rules allows the natural integration of the form- and content-recovery procedures into the top-down process of dictionary entry analysis. The utility of this device should be clear from the examples so far.</Paragraph>
      <Paragraph position="8"> Such escape to the underlying implementation language goes against the grain of recent developments of declarative gran3m_ ar formalisms. (the procedural ramifications of, for instance, being able to call arbitrary LISP functions from the arcs of an ATN grammar have been discussed at length: see, for instance, the opening chapters in Whitelock et al., 1987). However, we feel justified in augmenting, the ..... formalism in such a way, as we are dealing with input which Is different m nature from, and on occasions possibly more complex than, straight natural language. Unhomogeneous mixtures of heavily formal notations and annotations in totally free format, interspersed with (occasionally incomplete) fragments of natural language phrases, can easily defeat any attempts at 'cleafi' parsing. Since the DEP system is designed to deal with an open-ended set of dictionaries, it must be able to corffront a similarly open-ended set of notational conventions and abbreviatory devices. Furthermore. dealing in full with some of these notations requires access to mechanisms and theories well beyond the power of any grammar formalism: consider, for stance, what is involved in analyzing pronunciation fields in a dictionary, where alternative pronunciation patterns are marked only for syllable(s) which differ from the primar3 ~ pronuncaation (as in arch.bish.op: /,a:tfbiDp II ,at-/); where the pronunciation string itself ts not marked for syllable structure; and where the assignment of syllable boundaries is far from trivial (as in fas.cist: /'f=ej'a,st/)!</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="100" end_page="100" type="metho">
    <SectionTitle>
6. CURRENT STATUS
</SectionTitle>
    <Paragraph position="0"> The run-time environment of DEP includes gr .ammar debugging utilities, and a number of opttons. All facilities have been implemented, except where noted. We have very detailed grammars for CGE (parsing 98% of the entries), CEG (95%), and LDOCE (93%); less detailed grammars for Webster s Seventh (98%), and both laalves of the Collins French Dictionary (approximately 90%).</Paragraph>
    <Paragraph position="1"> The Dictionary Entry Parser is an integra.1, part of a larger system designed to recover dictionary structure to an arbitrary depth of detail, convert the resulting trees into LDB records, and make the data av/tilable to end users via a flexible and powerful lexical query language (LQL). Indeed, we have built LDB's for all dictionaries we have parsed; further development of LQL and the exploitation of the LDB's via query for a number of lexical studies are separate projects.</Paragraph>
    <Paragraph position="2"> Finally, we note that, in the light of recent efforts to develop an interchange standard for (English mono-lingual) dictionaries (Amsler and Tompa, 1988), DEP acquires additional relevance, since it can be used, given a suitable annotation of the grammar rules for the machine-readable source, to transduce a typesetting tape into an interchangeable dictionary source, available to a larger user commumty.</Paragraph>
  </Section>
class="xml-element"></Paper>