<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0407">
  <Title>Representation and Treatment of Multiword Expressions in Basque Inaki Alegria, Olatz Ansa, Xabier Artola</Title>
  <Section position="3" start_page="2" end_page="3" type="metho">
    <SectionTitle>
Inflection_Restrictions
</SectionTitle>
    <Paragraph position="0"> * Inflection_Restrictions: an expression that indicates the inflection paradigm according to which the MWLU may inflect in this specific SRS. Each component of the MWLU is represented by one list component, in the same order as the components appear in the canonical form of the MWLU. Three kinds of list component are possible: % indicates that the whole inflection paradigm of the corresponding inflectable component may occur; the minus sign (-) marks a non-inflectable component (no inflection at all may occur); and a logical expression composed of attribute-value pairs (the operators and, or and not are allowed) expresses the inflectional restrictions and the morphotactics the component undergoes in this particular SRS of the MWLU (in brackets in the examples below).</Paragraph>
    <Paragraph position="1"> In the examples below, one row is used per SRS. The columns of the table are Entry, Homograph_Id, Order_Contiguousness, Unambiguousness, and Inflection_Restrictions:
&lt;begi bistan egon, 0, 123, +, (((CAS=ABS) and (DEF=-)) or ((CAS=GEN) and (NUM=PL)), -, %)&gt;
&lt;begi bistan egon, 0, 312, +, (((CAS=ABS) and (DEF=-)) or ((CAS=GEN) and (NUM=PL)), -, %)&gt;
&lt;begi bistan egon, 0, 3?12, +, (((CAS=ABS) and (DEF=-)), -, %)&gt;
The first SRS matches occurrences such as begi bistan dago hau ez dela aski 'it is evident that it is not enough' or begien bistan zegoen honela bukatuko genuela 'it was evident that we would end up this way', where the components are contiguous and the analysis as an instance of the MWLU is unambiguous. This SRS allows the first component to be inflected in the absolutive case (non-definite) or in the genitive (plural), and allows the whole set of inflection morphemes for the third component. The third SRS matches occurrences such as ez dago horren begi bistan 'it is not so evident', where the components are not contiguous (at most one word is allowed between the "third" component and the "first" one) and occur in a non-canonical order: 3?12. In this case, the interpretation as an instance of the MWLU is also unambiguous. However, this SRS only allows the first component to be inflected in the absolutive case (non-definite).</Paragraph>
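    <Paragraph> As an illustration only, the SRS rows above could be encoded and checked roughly as follows. This is a minimal sketch and not the representation used in the lexical database: the class SRS, the functions satisfies and matches, the encoding of logical expressions as nested tuples, the use of an empty feature set for an uninflected form, and the feature ASP=PERF in the usage example are all assumptions introduced here.

```python
from dataclasses import dataclass

ANY = "%"   # the whole inflection paradigm may occur
NONE = "-"  # no inflection at all may occur


def satisfies(restriction, features):
    """Check one component's features against its restriction.

    `features` is a dict of attribute-value pairs; an empty dict stands
    for an uninflected form (an assumption of this sketch).
    """
    if restriction == ANY:
        return True
    if restriction == NONE:
        return not features
    op = restriction[0]
    if op == "and":
        return all(satisfies(r, features) for r in restriction[1:])
    if op == "or":
        return any(satisfies(r, features) for r in restriction[1:])
    if op == "not":
        return not satisfies(restriction[1], features)
    attribute, value = restriction  # leaf: one attribute-value pair
    return features.get(attribute) == value


@dataclass(frozen=True)
class SRS:
    entry: str
    homograph_id: int
    order: str           # Order_Contiguousness, e.g. "123" or "3?12"
    unambiguous: bool    # the "+" column in the rows above
    restrictions: tuple  # one restriction per component, canonical order


def matches(srs, component_features):
    """True if every component (in canonical order) meets its restriction."""
    if len(component_features) != len(srs.restrictions):
        return False
    return all(satisfies(r, f)
               for r, f in zip(srs.restrictions, component_features))


ABS_INDEF = ("and", ("CAS", "ABS"), ("DEF", "-"))
GEN_PL = ("and", ("CAS", "GEN"), ("NUM", "PL"))

srs1 = SRS("begi bistan egon", 0, "123", True,
           (("or", ABS_INDEF, GEN_PL), NONE, ANY))
srs3 = SRS("begi bistan egon", 0, "3?12", True,
           (ABS_INDEF, NONE, ANY))

# begien (genitive plural) is accepted by the first SRS but not the third.
begien_bistan = [{"CAS": "GEN", "NUM": "PL"}, {}, {"ASP": "PERF"}]
print(matches(srs1, begien_bistan), matches(srs3, begien_bistan))
```

The sketch checks only the Inflection_Restrictions column; matching the order and contiguity pattern against running text is left out.</Paragraph>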
    <Paragraph position="2">  Different information requirements in lemmatization and syntax processing
The first prototype for the treatment of MWEs in Basque, HABIL (Ezeiza et al., 1998; Ezeiza, 2003), was built for lemmatization purposes. However, we are now building a deep syntactic parser (Aduriz et al., 2004), and MWEs seem to need a different treatment there. Many MWEs are syntactically regular and, above all, an external element may hold a dependency relation with one of the constituents; both facts force us to analyze the elements independently. For example, in the verb beldur izan 'to be afraid (of)' an external noun phrase may have a modifier-noun dependency relation with beldur 'fear', as in sugeen beldur naiz 'I'm afraid of snakes'. In loak hartu 'to fall asleep' there is a subject-verb relation, as in loak hartu nau 'I have fallen asleep', literally 'sleep has caught me'; subject-auxiliary verb agreement would therefore be lost if both components were analyzed as one.</Paragraph>
    <Paragraph position="3"> The MWLU representation we have adopted allows us to lemmatize the word combination as a unit and yet to parse the components individually whenever necessary. In order to do so, when describing each MWLU, we specify whether the elements in the MWLU must be analyzed separately or not. (Footnote: Currently we are studying the MWLUs in the lexical database in order to determine which of them deserve to be parsed as separate elements. We have not yet defined how this will be formally represented in the database.)</Paragraph>
    <Paragraph position="4"> Treatment of multiword expressions
MWEs can be treated at different stages of language processing. Some approaches treat them at the tokenization stage, identifying fixed phrases, such as prepositional phrases or compounds, included in a list (Carmona et al., 1998; Karlsson et al., 1995). Other approaches rely on morphological analysis to better identify the features of the MWE using finite state technology (Breidt et al., 1996). Finally, another approach identifies them after the tagging process, which allows some tagging errors to be corrected (Leech et al., 1994).</Paragraph>
    <Paragraph position="5"> All of these approaches are based on a closed set of MWLUs that can be included in a list or a database. However, some groups of MWEs cannot be listed exhaustively in a database, because they comprise an open class of expressions. This is the case for collocations, compounds and named entities. Collocations and compounds should be delimited using statistical approaches, such as Xtract (Smadja, 1993) or LocalMax (Silva et al., 1999), so that only the most relevant ones (those of higher frequency) are included in the database.</Paragraph>
    <Paragraph position="6"> The named entity (NE) recognition task has been solved for a large set of languages. Most of this work is linked to the Message Understanding Conference (Chinchor, 1997). A variety of methods have been used in NE recognition, such as HMMs, Maximum Entropy Models, Decision Trees, Boosting and Voted Perceptron (Collins, 2002), Syntactic Structure based approaches and WordNet-based approaches (Magnini et al., 2002; Arevalo, 2002). Most references on the NE task can be accessed at http://www.muc.saic.com.</Paragraph>
    <Paragraph position="7">  Processing MWEs with HABIL
We have implemented HABIL, a tool for the treatment of multiword expressions (MWEs), based on the features described in the lexical database. The most important features of HABIL are the following:
* It deals with both contiguous and split MWEs.</Paragraph>
    <Paragraph position="8"> * It takes into account all the possible orders of the components (SRS).</Paragraph>
    <Paragraph position="9"> * It checks that inflectional restrictions are complied with.</Paragraph>
    <Paragraph position="10"> * It generates morphosyntactic interpretations for the MWE.</Paragraph>
    <Paragraph position="11"> This tool has two components: a search engine that identifies MWEs in the text, and a morphosyntactic processor that assigns the corresponding interpretations to the components of the MWE.</Paragraph>
    <Paragraph position="12"> The morphosyntactic processor generates the interpretations for MWEs using category and subcategory information in the lexical database. When one of the components adds information to the MWE, the processor applies pattern-matching techniques to extract the corresponding morphological features from the analyses of that component, and these features are included in the interpretation of the MWE. Then, for unambiguous MWEs, it replaces all the morphosyntactic interpretations of the components with the MWE interpretations. When MWEs are ambiguous, the new interpretations are added to the existing ones.
HABIL also identifies and treats dates and numerical expressions. As they make up an open class, they are obviously not included in the lexical database. However, their components are always contiguous, have a very strict structure, and use a closed lexicon, so it is quite easy to identify them using simple finite state transducers. For the morphosyntactic treatment of dates and numerical expressions, we use the morphosyntactic component of HABIL. These expressions may appear inflected and, in this case, the last component adds morphosyntactic features to the MWE. Finally, as they are unambiguous expressions, the processor discards the interpretations of the components and assigns them all the interpretations of the whole expression.</Paragraph>
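    <Paragraph> The replace-or-add behaviour described above can be sketched as follows. This is a hedged illustration, not HABIL's actual code: the function apply_mwe, the token layout (a dict with the keys "form" and "interpretations") and the interpretation labels in the usage example are hypothetical names chosen for this sketch.

```python
def apply_mwe(tokens, span, mwe_interpretations, unambiguous):
    """Attach MWE interpretations to the component tokens.

    tokens: list of dicts with keys "form" and "interpretations".
    span: positions in `tokens` of the MWE components.
    Unambiguous MWEs replace the component interpretations;
    ambiguous MWEs add theirs to the existing ones.
    Returns a new list and leaves the input untouched.
    """
    result = [dict(t, interpretations=list(t["interpretations"]))
              for t in tokens]
    for i in span:
        if unambiguous:
            result[i]["interpretations"] = list(mwe_interpretations)
        else:
            result[i]["interpretations"].extend(mwe_interpretations)
    return result


# Usage with invented interpretation labels.
sentence = [
    {"form": "begi", "interpretations": ["NOUN"]},
    {"form": "bistan", "interpretations": ["NOUN+INE"]},
    {"form": "dago", "interpretations": ["VERB"]},
]
replaced = apply_mwe(sentence, [0, 1, 2], ["MWE:begi bistan egon"], True)
extended = apply_mwe(sentence, [0, 1, 2], ["MWE:begi bistan egon"], False)
print(replaced[0]["interpretations"])
print(extended[0]["interpretations"])
```

Copying the tokens keeps the original analyses available, which is convenient when the two cases are compared side by side; whether HABIL works in place or on a copy is not stated in the text.</Paragraph>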
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> We performed several experiments using 650 unambiguous, contiguous and ordered MWEs. We processed a reference corpus of around 36,000 tokens, which contained 386 instances of 149 different MWEs. We also applied the process to a small test corpus of around 7,100 tokens, which contained 87 instances of 45 MWEs. Taking both corpora together, there were 473 instances of 167 different MWEs, which amounts to 25% of the expressions considered; 50% of the instances were ambiguous. In addition, only 14 dates and 12 numerical expressions were found in the reference corpus, and 18 dates and 9 numerical expressions in the test corpus.</Paragraph>
      <Paragraph position="1">  The ambiguity measures of the test corpus are shown in Table 1. The ambiguity rate of word-forms decreases by 2% and the average ambiguity rate by 1.5% after the processing of MWEs. It is important to point out that no errors are made in the process. Furthermore, some important MWEs, specifically some complex sentence connectors with highly ambiguous components, are correctly disambiguated.</Paragraph>
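      <Paragraph> The combined figures can be rechecked from the per-corpus counts with a few lines of arithmetic. The sketch below only uses numbers stated in the paragraph above; the variable names are ours, and the overlap is an inference that holds only if 167 counts the MWEs found in either corpus.

```python
reference = {"instances": 386, "distinct": 149}
test = {"instances": 87, "distinct": 45}
lexicon_size = 650        # MWEs used in the experiments
combined_distinct = 167   # different MWEs found across both corpora

total_instances = reference["instances"] + test["instances"]
coverage = combined_distinct / lexicon_size
# Inclusion-exclusion: MWEs that occur in both corpora (inferred).
shared = reference["distinct"] + test["distinct"] - combined_distinct

print(total_instances)
print(f"{coverage:.1%}")
print(shared)
```

The coverage comes out slightly above the 25% quoted, so the reported value appears to be rounded down.</Paragraph>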
      <Paragraph position="2"> Bearing in mind the proportion of words treated by HABIL, these results help significantly in improving precision results of tagging and avoiding almost 10% of the errors, as shown in</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="3" end_page="3" type="metho">
    <SectionTitle>
5 Future work
</SectionTitle>
    <Paragraph position="0"> After confirming the viability of the system and the good results in POS tagging, our main goal is to increase the number of MWLUs in the database, which will improve the identification of MWEs in corpora.</Paragraph>
    <Paragraph position="1"> A remaining difficulty is the problem of ambiguous split MWEs. At present, we are creating a disambiguation grammar that will discard or select the multiword interpretations in ambiguous MWLUs. We are developing similar rules using both the Constraint Grammar formalism and finite state transducers (XFST tools, Karttunen et al., 1997). The very first rules seem to be quite effective. We will soon assess the first results, and we will then be able to choose the method that performs best with less effort.</Paragraph>
    <Paragraph position="2"> Once we have chosen the best formalism, we intend to develop a comprehensive grammar that will disambiguate as many ambiguous MWLUs as possible.</Paragraph>
    <Paragraph position="3"> In addition, we are developing new processes after POS tagging in order to identify complex named entities and terminological units. These units constitute an open class and so their exhaustive inclusion in a database would not be viable.</Paragraph>
  </Section>
</Paper>