<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2182"> <Title>Formal Description of Multi-Word Lexemes with the Finite-State Formalism IDAREX</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Most texts are rich in multi-word expressions that cannot be properly understood, let alone processed in an NLP system, if they are not recognized as complex lexical units. Such expressions, which we call multi-word lexemes (MWLs), range from idioms (to rack one's brains over sth) through phrasal verbs (to come up with) and lexical and grammatical collocations (to make love and with regard to, respectively) to compounds (on-line dictionary).</Paragraph> <Paragraph position="1"> While certain MWLs only occur in exactly one form, e.g. out of the blue or G: um Haaresbreite ('by a hair's breadth', lit. by hair's breadth), and can thus be easily recognized with simple pattern matching techniques, it is well known (see e.g. Gross 1982, Brundage et al. 1992, Nunberg et al. 1994) that most MWLs cannot be treated like completely fixed patterns, since they may undergo some variation. However, only a subset of the variations allowed by general rules is valid: outside that subset, the expression loses its special, idiomatic meaning, either reverting to its literal meaning or losing any significance altogether. (Footnote 1: Part of this work was funded under LRE 62-080 by the EEC.)</Paragraph> <Paragraph position="2"> In certain cases, MWLs can even contradict normal syntactic rules, as with by and large, or G: von Haus aus ('originally', lit. from house out), where general rules would require an article between the preposition and the noun.</Paragraph> <Paragraph position="3"> The identification of MWLs is essential for any natural language processing based on lexical information, ranging from intelligent dictionary look-up through concordancing or indexing to machine translation. 
Therefore, the restricted lexical and syntactic variability of MWLs and their idiosyncratic peculiarities need to be expressed in the computational lexicon so that the full range of their occurrences can be recognized. We propose to use local grammars for this, written as a special type of regular expressions (REs) in the finite-state formalism IDAREX, which makes use of a two-level morphological lexicon. So far, we have successfully applied this approach to approximately 15,000 English, French and German MWLs (see also Segond and Breidt 1995).</Paragraph> </Section> </Paper>