<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2115">
  <Title>Multiword Lexical Acquisition and Dictionary Formalization</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Multiword expressions (MWEs) have long been viewed as marginal, idiosyncratic combinations of words. In recent years, however, there has been a growing awareness in the NLP community of the problems that MWEs pose and of the need to handle them robustly. Several major conferences and satellite workshops have been dedicated to the subject (ACL, EACL, LREC, for instance), and major publications have devoted thematic issues to MWEs.</Paragraph>
    <Paragraph position="1"> Anticipating that growing interest, a significant part of LabEL's research in recent years has been devoted to the development of large-scale, linguistically precise language resources, namely the construction of computational lexicons for simple and multiword units (Eleuterio et al., 1995; Ranchhod et al., 1999; Ranchhod et al., 2004).</Paragraph>
    <Paragraph position="2"> In fact, we have observed that MWEs are used frequently in both everyday language and technical and scientific texts to express ideas and concepts that in general cannot be stated by &quot;free&quot; linguistic structures. They include a wide range of linguistic phenomena, such as: (i) lexical compounds (nouns: cellular phone, rush hour, New Jersey; adjectives: well-known; adverbs: for the  in spite of, in order to); (ii) phrasal verbs (give up); (iii) light verbs (give a lecture); (iv) fixed (proverbs and maxims) and semi-fixed sentences (to see the light at the end of the tunnel; to take the Lord's name in vain). From a linguistic point of view, all these expressions exhibit distributional and selectional constraints, i.e. they lack compositionality, and frequently have idiomatic interpretations.</Paragraph>
    <Paragraph position="3"> In this paper, we focus on multiword nouns.</Paragraph>
    <Paragraph position="4"> Special attention will be given to their formalization and generation using INTEX (Silberztein, 1993), a public NLP system based on finite-state transducers (FSTs). In this context, we present the main characteristics of a new inflectional module, conceived at LabEL, that is fully compatible with this system. Next, we describe the acquisition methodology used to gather new dictionary entries from a fragment (extracts 1,520,001 to 1,567,625) of the non-annotated version of a public Portuguese corpus, CETEMPublico.</Paragraph>
    <Paragraph position="5"> Finally, based on this experiment, we assess both the growth of the dictionary and the resulting improvement in lexical coverage of that corpus.</Paragraph>
  </Section>
</Paper>