<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0407">
  <Title>Representation and Treatment of Multiword Expressions in Basque Inaki Alegria, Olatz Ansa, Xabier Artola</Title>
  <Section position="2" start_page="0" end_page="2" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Most texts are rich in multiword expressions, which must necessarily be processed if any NLP tool is to perform accurately. Jackendoff (1997) estimates that their number in the speakers' lexicon &quot;is of the same order of magnitude as the number of single words&quot;.</Paragraph>
    <Paragraph position="1"> There is no agreement among authors about the definition of the term Multiword Expression.</Paragraph>
    <Paragraph position="2"> However, in this article, the term Multiword Expression (hereafter MWE) refers to any word combination, ranging from idioms, proper names, compounds, and lexical and grammatical collocations to institutionalized phrases. MWEs comprise both semantically compositional and non-compositional combinations, and both syntactically regular and idiosyncratic phrases, including complex named entities such as proper nouns, dates and number expressions (see section 2).</Paragraph>
    <Paragraph position="3"> In contrast, Multiword Lexical Units (hereafter MWLU) comprise lexicalized phrases (semantically non-compositional or syntactically idiosyncratic word combinations) that are represented and stored in the lexical database of Basque (EDBL).</Paragraph>
    <Paragraph position="4"> The remaining sections are organized as follows. Section 2 presents the main features of MWEs in Basque and defines which of them are currently considered for automatic processing. Section 3 describes the representation of MWLUs in the lexical database. Section 4 is devoted to the description and evaluation of the automatic treatment of MWEs by means of HABIL. Section 5 summarizes future work. Finally, Section 6 outlines some conclusions.</Paragraph>
    <Paragraph position="5"> Multiword Expressions in the processing of real texts in Basque. The definition of the term Multiword Expression and the types of MWEs to be treated in NLP may vary considerably depending on the purposes or &quot;the depth of processing being undertaken&quot; (Copestake et al., 2002). Multiword itself is a vague term. At text level, a word could be defined as &quot;any string of characters between two blanks&quot; (Fontenelle et al., 1994). This is not applicable to languages such as Japanese, which are typically written without spaces. Besides, a great number of MWEs that would be multiword in uninflected languages constitute a single typographic unit in agglutinative languages such as Basque (ziurrenik 'most probably', aurrerantzean 'from now on', aurretiaz 'in advance'). Therefore, we consider them single words and they are included in the lexical database as such (or recognized by means of morphological analysis). [Second ACL Workshop on Multiword Expressions: Integrating Processing, July 2004, pp. 48-55]</Paragraph>
    <Paragraph position="6"> In our case, when deciding which Basque MWEs to include in the database, we mostly rely on lexicographers' expertise, since we consider that lexicalized phrases have top priority for both lemmatizing and syntactic purposes. Thus, the MWEs dealt with in the database comprise fixed expressions, which admit no morphosyntactic or internal modification (including foreign expressions such as in situ, a priori, stricto sensu, etc.), idioms, both decomposable and nondecomposable, and lexicalized compounds. We also consider light verb constructions when they are syntactically idiosyncratic.</Paragraph>
    <Paragraph position="7"> However, we currently do not treat open collocations, proverbs, catch phrases and similes. For the most part, we do not include proper names in the database either, since complex named entities are given a separate treatment. Apart from proper nouns, dates and number expressions are also treated separately (see 4.1).</Paragraph>
    <Paragraph position="8"> So far we have described 2,270 MWLUs in our database. This work has been carried out in two phases. For the first phase, we made use of the Statistical Corpus of 20th Century Basque (http://www.euskaracorpusa.net) that contains about 4.7 million words. As a starting point, we chose the MWLUs that occurred more than 10 times in this manually lemmatized corpus. This amounted to about 1,300 expressions. For the second phase, this list has been enlarged using the Hiztegi Batua, a dictionary of standard Basque that the Basque Language Academy updates regularly (http://www2.euskaltzaindia.net/hiztegibatua).</Paragraph>
    <Paragraph position="9"> Main features of lexicalized phrases. Many of the lexicalized phrases are semantically non-compositional (or partially compositional), i.e. they can hardly be interpreted in terms of the meaning of their constituents (adarra jo 'to pull someone's leg', literally 'to play the horn').</Paragraph>
    <Paragraph position="10"> Often, a component of these sequences hardly occurs in any other context and it is difficult to assign it a part of speech. For example, the word noizik is an archaism of modern noiztik 'from when', which occurs only in the expressions noizik behin, noizik behinean, noizik noizera, and noizik behinka, all meaning 'once in a while'. Besides, it is not clear what the part of speech is of the words laprast in laprast egin 'to slip' or dir-dir in dir-dir egin 'to shine'.</Paragraph>
    <Paragraph position="11"> From a syntactic point of view, many of these MWEs present an unusual structure. For example, many complex verbs in Basque are light verb constructions in which the meaning of the compound is quite compositional, e.g. lo egin 'to sleep', literally 'to make (a) sleep', or lan egin 'to work', literally 'to make (a) work'. However, lo egin and lan egin can be considered 'syntactically idiomatic' since the nouns in these expressions, lo and lan, take no determiner, which would be completely ungrammatical for a noun functioning as a regular direct object (*arroz jan nuen 'I ate rice').</Paragraph>
    <Paragraph position="12"> Morphosyntactic flexibility, which is significant in this type of construction in Basque, may vary considerably. For example, in lo egin 'to sleep' the noun lo admits modification (lo asko egin zuen 'he slept a lot') and may take the partitive assignment (ez dut lorik egin 'I haven't slept'), while the verb egin can be subject to focalization (egin duzu lorik bart? 'did you sleep at all last night?'); besides, the components of the construction may change positions, and some elements and phrases may be placed between them (mendian egin omen zuen lasai lo 'it is said that he slept peacefully in the mountain'). In contrast, alde egin 'to escape' is morphosyntactically quite rigid. In all cases, the verb egin can take any inflection.</Paragraph>
    <Paragraph position="13"> For our database, we have worked out a single representation that covers all MWLUs, ranging from fixed expressions to those of the highest morphosyntactic flexibility.</Paragraph>
    <Paragraph position="14"> Representation of MWLUs in the lexical database. In this section we explain how MWLUs are represented in EDBL (Aldezabal et al., 2001), a lexical database oriented to language processing that currently contains more than 80,000 entries, out of which 2,270 are MWLUs. Among these: * ~69% are always unambiguous. The average number of Surface Realization Schemas (SRS, see section 3.2) is 1.02.</Paragraph>
    <Paragraph position="15"> * ~23% are sometimes unambiguous and have 3.6 SRSs on average, half of them ambiguous.</Paragraph>
    <Paragraph position="16"> * ~8% are always ambiguous and have 1.2 SRSs on average.</Paragraph>
    <Paragraph position="17"> We want to point out that almost all of the unambiguous MWLUs have only one SRS, their components appearing in contiguous positions and always in the same order. About half of them are inflected, so, even if we discard the interpretations of the components, there is still some morphosyntactic ambiguity left. However, the identification of these MWLUs helps in disambiguation, as the input of tagging is more precise.</Paragraph>
    <Paragraph position="18"> The description of MWLUs within a general-purpose lexical database must include at least two aspects (see Figure 1): (1) their composition, i.e. what the components of the MWLU are, whether each of them can be inflected or not, and according to which one-word lexical unit (OWLU) it inflects; and (2) what we call the surface realization, that is, the order in which the components may occur in the text, the mandatory or optional contiguousness of components, and the inflectional restrictions applicable to each one of the components.</Paragraph>
    <Paragraph position="19"> Composition. As just mentioned, the description of the composition of MWLUs in EDBL gathers two aspects: on the one hand, it states what the individual components of a MWLU are; on the other hand, it links each inflectable component of a MWLU to the corresponding OWLU according to which it inflects.</Paragraph>
    <Paragraph position="20"> In Figure 1, we can see that the composed of relationship links every MWLU to up to 9 individual components (MWLU_Components).</Paragraph>
    <Paragraph position="21"> [Footnote: We consider OWLUs to be lexical units with no spaces within their orthographical form; so, we also take hyphenated compounds as OWLUs.] Each component is characterized by the following attributes:</Paragraph>
    <Paragraph position="22"> * Component_Position: this indicates the position of the component word-form in the canonical form of the MWLU.</Paragraph>
    <Paragraph position="23"> * Component_Form: i.e. the word-form itself as it appears in the canonical form of the MWLU.</Paragraph>
    <Paragraph position="24"> * Conveys_Morph_Info?: this is a Boolean value, indicating whether the component inflection conveys the morphological information corresponding to the whole MWLU or not. Moreover, each component of a MWLU is linked to its corresponding OWLU (according to which it inflects). This is represented by means of the inflects according to relationship (see Figure 1).</Paragraph>
    <Paragraph position="25"> [Footnote: The morphological information that the attribute refers to is the set of morphological features the inflection takes in the current component instance.]</Paragraph>
    <Paragraph position="26"> These two aspects concerning the composition of a MWLU are physically stored in a single table of the relational database in which EDBL resides. The columns of the table are the following: [...] Morph_Info?, OWLU_Entry, and OWLU_Homograph_Id. In the example below, the composition of the MWLU begi bistan egon 'to be evident' is described. Note that one row is used per component:
      &lt;begi bistan egon, 0, 1, begi, -, begi, 2&gt;
      &lt;begi bistan egon, 0, 2, bistan, -, bista, 1&gt;
      &lt;begi bistan egon, 0, 3, egon, +, egon, 1&gt;
    This expression allows different realizations such as begi bistan dago 'it is evident' (literally 'it is within reach of the eyes'), begi bistan daude 'they are evident', begien bistan egon 'to be evident', etc. In the table rows above, it can be seen that the last component egon 1 'to be' conveys the morphological information for the whole MWLU (+ in the corresponding column).</Paragraph>
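The component rows above can be mirrored in a small data structure. What follows is a minimal sketch in Python, not the actual EDBL schema. The field names mwlu_entry and mwlu_homograph_id are assumptions for the first two values of each row, whose column names are not recoverable from the text, and the helpers canonical_form and morph_heads are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MwluComponent:
    mwlu_entry: str           # assumed name: canonical form of the MWLU
    mwlu_homograph_id: int    # assumed name: second value of each row
    component_position: int   # Component_Position
    component_form: str       # Component_Form
    conveys_morph_info: bool  # Conveys_Morph_Info? ('+' is True, '-' is False)
    owlu_entry: str           # OWLU_Entry
    owlu_homograph_id: int    # OWLU_Homograph_Id


# The three rows given in the text for begi bistan egon 'to be evident'.
BEGI_BISTAN_EGON = [
    MwluComponent("begi bistan egon", 0, 1, "begi", False, "begi", 2),
    MwluComponent("begi bistan egon", 0, 2, "bistan", False, "bista", 1),
    MwluComponent("begi bistan egon", 0, 3, "egon", True, "egon", 1),
]


def canonical_form(components):
    """Rebuild the canonical form from the component rows."""
    ordered = sorted(components, key=lambda c: c.component_position)
    return " ".join(c.component_form for c in ordered)


def morph_heads(components):
    """Components whose inflection carries the morphology of the whole MWLU."""
    return [c for c in components if c.conveys_morph_info]
```

Under these assumptions, canonical_form(BEGI_BISTAN_EGON) rebuilds the entry from its rows, and morph_heads(BEGI_BISTAN_EGON) returns only the egon row, which is the one marked with + in the table.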
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Surface realization
</SectionTitle>
      <Paragraph position="0"> As for surface realization, we have already mentioned that the components of a MWLU can occur in a text either contiguously or dispersed.</Paragraph>
      <Paragraph position="1"> Besides, the order of the constituents may be fixed or not, and they may either inflect or occur in an invariable form. In the case of inflected components, some of them may accept any inflection according to their corresponding OWLU, whilst others may only inflect in a restricted way.</Paragraph>
      <Paragraph position="2"> Moreover, some MWLUs are unambiguous and some are not, since it cannot be guaranteed that a given sequence of words in a text corresponds to a multiword entry in every context. For example, in the sentence Emilek buruaz baiezko keinu bat egin zuen 'Emile nodded his head' the words bat and egin do not correspond to the complex verb bat egin 'to unite' but to two separate phrases.</Paragraph>
      <Paragraph position="3"> According to these features, we use a formal description where different realization patterns may be defined for each MWLU.</Paragraph>
      <Paragraph position="4"> The corresp. SR schemas relationship in Figure 1 links every MWLU to one or more Surface_Realization_Schemas. Each SRS is characterized by the following attributes: * Order_Contiguousness: an expression that indicates both the order in which the components may appear in the different instances of the MWLU and the contiguousness of these components. In these expressions the position of the digits indicates the position each component takes in a particular SRS, * indicates that 0 or more words may occur between two components, and ? indicates that at most one single word may appear between two given components of the MWLU.</Paragraph>
      <Paragraph position="5"> * Unambiguousness: a Boolean value, indicating whether the particular SRS corresponds to an unambiguous MWLU or not. It expresses whether the sequence of words matching this SRS must be unambiguously analyzed as an instance of the MWLU or, on the contrary, may be analyzed as separate OWLUs in some contexts.</Paragraph>
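The way an Order_Contiguousness expression constrains a token sequence can be illustrated with a small matcher. This is a sketch under stated assumptions, not the procedure used by HABIL: the concrete schema strings ("12", "2*1", "2?1") are hypothetical, since the text gives no example expression, and the table of accepted surface forms stands in for the morphological analysis that a real system would use. The function names parse_schema and find_matches are also hypothetical.

```python
def parse_schema(schema):
    """Split a schema string into (component, gap) steps.

    gap is the maximum number of intervening words allowed before the
    component: 0 (contiguous), 1 (after '?') or None (after '*', unbounded).
    """
    steps = []
    gap = 0
    for ch in schema:
        if ch.isdigit():
            steps.append((int(ch), gap))
            gap = 0
        elif ch == "?":
            gap = 1
        elif ch == "*":
            gap = None
        else:
            raise ValueError("unexpected symbol in schema: " + repr(ch))
    return steps


def find_matches(schema, forms, tokens):
    """Return every list of token indices that satisfies the schema.

    forms maps a component number to the set of surface forms accepted
    for it (an assumption replacing real morphological analysis).
    """
    steps = parse_schema(schema)
    results = []

    def extend(step_no, start, picked):
        if step_no == len(steps):
            results.append(picked)
            return
        component, gap = steps[step_no]
        if step_no == 0:
            candidates = range(len(tokens))  # first component may start anywhere
        elif gap is None:
            candidates = range(start, len(tokens))
        else:
            candidates = range(start, min(len(tokens), start + gap + 1))
        for i in candidates:
            if tokens[i] in forms[component]:
                extend(step_no + 1, i + 1, picked + [i])

    extend(0, 0, [])
    return results


# Illustrative form tables for two MWLUs mentioned in the text.
LO_EGIN = {1: {"lo", "lorik"}, 2: {"egin"}}
BAT_EGIN = {1: {"bat"}, 2: {"egin"}}
```

With these hypothetical schemas, find_matches("2*1", LO_EGIN, "egin duzu lorik bart".split()) pairs egin with lorik across the intervening word, whereas the contiguous schema "21" finds nothing in the same sentence. Note that find_matches("12", BAT_EGIN, "Emilek buruaz baiezko keinu bat egin zuen".split()) also returns a match, although the text explains that bat egin is not the complex verb there: string matching alone cannot settle the reading, which is why each SRS carries the Unambiguousness flag.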
    </Section>
  </Section>
</Paper>