<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2115">
  <Title>Multiword Lexical Acquisition and Dictionary Formalization</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Characterization of Multiword Nouns
</SectionTitle>
    <Paragraph position="0"> Multiword (or compound) nouns are composed of non-capitalized simple words. Superficially, they seem to result from general rules of word combination, but they are subject to constraints (morphological, combinatorial, etc.) that restrict the properties such combinations would be expected to have.</Paragraph>
    <Paragraph position="1"> Regarding inflection, general rules presented by grammarians do apply to some cases, but most compounds exhibit inflectional restrictions on gender or number that cannot be described by the morphological properties of their constituents.</Paragraph>
    <Paragraph position="2"> Table 1 presents a few examples of the most representative classes of compound nouns in Portuguese.</Paragraph>
    <Paragraph position="3"> 2 CETEMPublico is a journalistic corpus containing about 180 million words (see Santos and Rocha, 2001, for technical information).</Paragraph>
    <Paragraph position="5"> These classes represent binary compounds, composed of two content words (one of which is a noun), optionally connected by a grammatical word3.</Paragraph>
    <Paragraph position="6"> The classification criteria are based on the noun's internal structures, which are generally associated with a characteristic inflectional pattern. For instance, compound nouns belonging to the NA class usually allow the inflection in gender and/or number of both constituents (e.g. bomba atomica, bombas atomicas); on the contrary, in the majority of NDN compound nouns, only the first noun can inflect (conselho de guerra, conselhos de guerra).</Paragraph>
    <Paragraph position="7"> In the following sections, further relevant information on inflection, formalization and generation of inflected forms will be given.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Formalization of NA and NDN Nouns
</SectionTitle>
    <Paragraph position="0"> Following methods and formalisms introduced at LADL [Gross, 1988; Courtois and Silberztein, 1990], linguistic attributes of simple and multiword units are systematically encoded in dictionaries compatible with INTEX.</Paragraph>
    <Paragraph position="1"> In this system, compound word entries are handled depending on their internal structure. They are formalized and processed separately by different programs.</Paragraph>
    <Paragraph position="2"> In order to simplify the formalization of linguistic attributes and make the generation process easier, we implemented a new inflectional module compatible with the INTEX system. The main strength of this tool is that it allows the simultaneous generation of all compounds, regardless of their internal structure, reusing the inflectional graphs already built for simple words [Mota, forthcoming]. The morphological constraints are specified manually, by assigning to each constituent the inflectional code that corresponds to its inflectional behavior within the compound, as illustrated by the following dictionary entries: actor(N040) secundario(A001),N+NA+Hum 3 Even though they are less productive than the previous structures, there are longer multiword combinations that may involve more than one compound form (e.g. cabo de alta tensao, high-tension electricity cable).</Paragraph>
    <Paragraph position="4"> In the first compound, both constituents keep their simple word dictionary inflectional code; they inflect in gender and number, according to the compound inflectional behavior.</Paragraph>
    <Paragraph position="5"> On the other hand, the compound noun ser humano inflects only in number, which means that the masculine nominal constituent ser preserves its inflectional code, but the adjectival constituent humano (which also inflects in gender as a simple word) receives a new code that allows only its inflection in number within the compound.</Paragraph>
    <Paragraph position="6"> In the case of the NDN compounds, as previously mentioned, only the head can inflect: ponto de vista inflects in number, so the head receives a code allowing its inflection; direitos de autor does not inflect (it is an exclusively masculine plural noun), hence the head is assigned an inflectional code that simply transmits these gender and number features to the compound.</Paragraph>
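    <Paragraph> The entry format shown in this section can be read mechanically. The following is a minimal Python sketch, not the INTEX implementation. It assumes that each constituent is written as a word optionally followed by an inflectional code in parentheses, and that a comma separates the constituents from the compound attributes joined by plus signs. The function name parse_entry and the layout of the returned dictionary are hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
import re

# Assumed shape of one constituent: a word optionally followed by "(CODE)".
CONSTITUENT = re.compile(r"^(?P<form>[^()\s]+)(?:\((?P<code>[^()]+)\))?$")


def parse_entry(line):
    """Split 'w1(C1) w2(C2),ATTR+ATTR' into constituents and attributes."""
    lemma_part, _, attr_part = line.strip().partition(",")
    constituents = []
    for token in lemma_part.split():
        match = CONSTITUENT.match(token)
        if match is None:
            raise ValueError(f"unexpected constituent: {token!r}")
        # code is None when a constituent carries no inflectional code
        constituents.append((match.group("form"), match.group("code")))
    attributes = attr_part.split("+") if attr_part else []
    return {"constituents": constituents, "attributes": attributes}


entry = parse_entry("actor(N040) secundario(A001),N+NA+Hum")
print(entry["constituents"])  # [('actor', 'N040'), ('secundario', 'A001')]
print(entry["attributes"])    # ['N', 'NA', 'Hum']
```
]]></Paragraph>
    <Paragraph> Keeping the code attached to each constituent, rather than to the compound as a whole, mirrors the idea that a constituent may receive a different code inside a compound than it has as a simple word.</Paragraph>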
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Generation of Inflected Forms
</SectionTitle>
    <Paragraph position="0"> The following example illustrates how the new inflection module uses the dictionary information briefly presented in the previous section: actor(N040) secundario(A001),N+NA Initially, the inflectional module generates the inflected forms of the noun actor, based on the inflectional paradigm described in the graph N040. Then, it combines the resulting inflected nouns with all inflected forms of the adjective secundario, generated from graph A001.</Paragraph>
    <Paragraph position="1"> Subsequently, the combinations that do not satisfy the gender and number agreement constraints are eliminated. Additionally, the constituent inflectional attributes are inherited by the compound.</Paragraph>
    <Paragraph position="2"> As a result, the following entries are obtained:
      actor secundario,actor secundario.N+NA:ms
      actriz secundaria,actor secundario.N+NA:fs
      actores secundarios,actor secundario.N+NA:mp
      actrizes secundarias,actor secundario.N+NA:fp
    This example illustrates the case where both words that constitute the compound have similar inflectional features. These attributes are directly transferred to the inflected forms of the compound. When one of the compound constituents lacks explicit gender or number morphemes, as artista (which can be either a masculine or a feminine singular form) in the following entry:
      artista(N101) plastico(A001),N+NA
    the inflectional module assigns to the compound the morphological attributes of the constituent that has explicit gender and/or number morphemes (in this case, plastico).</Paragraph>
    <Paragraph position="3"> The compounds just illustrated belong to the NA class4. The inflection of NDN forms simply corresponds to the inflection of the head noun and the assignment of its attributes to the compound.</Paragraph>
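    <Paragraph> The generation procedure described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the module itself: the hand-written paradigm tables stand in for the inflectional graphs N040, A001 and N101, whose real contents are not given in the text, and the function names inflect_na and inflect_ndn are hypothetical. Agreement is modeled as a non-empty intersection of the gender and number values of the two constituents, so an underspecified form such as artista inherits the features of the adjective. The N+NDN tag in the second function is likewise assumed.</Paragraph>
    <Paragraph><![CDATA[
```python
from itertools import product

# Toy paradigms standing in for the inflectional graphs (contents assumed).
# Each row is (form, compatible genders, compatible numbers).
N040_ACTOR = [("actor", "m", "s"), ("actriz", "f", "s"),
              ("actores", "m", "p"), ("actrizes", "f", "p")]
A001_SECUNDARIO = [("secundario", "m", "s"), ("secundaria", "f", "s"),
                   ("secundarios", "m", "p"), ("secundarias", "f", "p")]
# "artista" has no explicit gender morpheme, so both genders are listed.
N101_ARTISTA = [("artista", "mf", "s"), ("artistas", "mf", "p")]
A001_PLASTICO = [("plastico", "m", "s"), ("plastica", "f", "s"),
                 ("plasticos", "m", "p"), ("plasticas", "f", "p")]


def inflect_na(lemma, noun_forms, adj_forms, tags="N+NA"):
    """Combine every noun form with every adjective form, keep agreeing pairs."""
    entries = []
    for (noun, n_gen, n_num), (adj, a_gen, a_num) in product(noun_forms, adj_forms):
        genders = set(n_gen) & set(a_gen)
        numbers = set(n_num) & set(a_num)
        if not genders or not numbers:
            continue  # agreement constraint violated: drop the combination
        for gender in sorted(genders, key="mf".index):
            for number in sorted(numbers, key="sp".index):
                entries.append(f"{noun} {adj},{lemma}.{tags}:{gender}{number}")
    return entries


def inflect_ndn(lemma, head_forms, rest, tags="N+NDN"):
    """Inflect only the head; assumes each head form has one gender and number."""
    return [f"{head} {rest},{lemma}.{tags}:{gen}{num}" for head, gen, num in head_forms]


for line in inflect_na("actor secundario", N040_ACTOR, A001_SECUNDARIO):
    print(line)
for line in inflect_na("artista plastico", N101_ARTISTA, A001_PLASTICO):
    print(line)
for line in inflect_ndn("conselho de guerra",
                        [("conselho", "m", "s"), ("conselhos", "m", "p")],
                        "de guerra"):
    print(line)
```
]]></Paragraph>
    <Paragraph> With the toy tables above, the first call yields the four actor secundario entries listed in this section, in the same order, out of sixteen possible combinations.</Paragraph>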
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Acquisition of New Entries
</SectionTitle>
    <Paragraph position="0"> With the purpose of increasing the number of the most representative nominal entries (NA and NDN) in the common multiword dictionaries, we used a corpus-based approach.</Paragraph>
    <Paragraph position="1"> In the first stage, candidates were automatically extracted from a fragment of the CETEMPublico corpus (from now on, the acquisition corpus), using INTEX. After tokenization by INTEX, 6,385,531 tokens (corresponding to 138,230 different tokens) were identified in the acquisition corpus. Of those, 5,162,111 (138,174 different forms) are alphabetic words.</Paragraph>
    <Paragraph position="2"> LabEL's simple and multiword electronic dictionaries were then applied to the acquisition corpus. These dictionaries contained 171,159 nominal entries, of which 82% were simple words and 18% compounds. Of the 22,581 compounds, 61.7% are NA, 33.6% are NDN and the remaining 4.7% belong to other structures.</Paragraph>
    <Paragraph position="3"> Candidate identification was performed using the elementary regular expressions &lt;N&gt;de&lt;N&gt;</Paragraph>
    <Paragraph position="5"> nominal compounds presenting, respectively, an NDN and an NA structure. With respect to the latter structure, the expression also guarantees morphological agreement between nouns and adjectives.</Paragraph>
    <Paragraph position="6"> These expressions recognized 242,527 candidates (187,146 different forms) in the acquisition corpus, of which 117,616 (69,066 different forms) are NDN structures and 230,761 (118,080 different forms) are NA structures.</Paragraph>
    <Paragraph position="7"> The candidates of each class were integrated into a concordance, to which the existing compound dictionaries were applied. The resulting list of non-recognized candidates was then manually reviewed by linguists, with the aim of selecting and linguistically formalizing valid compounds. Graphic 1 reflects the effort involved in the selection procedure.</Paragraph>
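    <Paragraph> The candidate identification step can be approximated outside INTEX. The sketch below is a simplified Python stand-in for the two patterns, assuming the corpus is already available as a list of tagged tokens of the form (form, part of speech, gender, number). The tag names, the sample sentence and the function name find_candidates are hypothetical, and INTEX's own pattern syntax is not reproduced.</Paragraph>
    <Paragraph><![CDATA[
```python
def find_candidates(tokens):
    """Return (ndn, na) candidate lists from (form, pos, gender, number) tuples."""
    ndn, na = [], []
    for i, (form, pos, gender, number) in enumerate(tokens):
        if pos != "N":
            continue
        # NDN pattern: noun + "de" + noun
        if i + 2 < len(tokens) and tokens[i + 1][0] == "de" and tokens[i + 2][1] == "N":
            ndn.append(f"{form} de {tokens[i + 2][0]}")
        # NA pattern: noun + adjective agreeing in gender and number
        if i + 1 < len(tokens):
            nxt_form, nxt_pos, nxt_gender, nxt_number = tokens[i + 1]
            if nxt_pos == "A" and nxt_gender == gender and nxt_number == number:
                na.append(f"{form} {nxt_form}")
    return ndn, na


# Hypothetical tagged sample; a real run would use dictionary lookup output.
sample = [
    ("o", "DET", "m", "s"), ("conselho", "N", "m", "s"), ("de", "PREP", "", ""),
    ("guerra", "N", "f", "s"), ("aprovou", "V", "", ""), ("as", "DET", "f", "p"),
    ("bombas", "N", "f", "p"), ("atomicas", "A", "f", "p"),
]
ndn_candidates, na_candidates = find_candidates(sample)
print(ndn_candidates)  # ['conselho de guerra']
print(na_candidates)   # ['bombas atomicas']
```
]]></Paragraph>
    <Paragraph> Purely structural matching of this kind over-generates, which is consistent with the need for the manual review stage described in this section.</Paragraph>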
    <Paragraph position="8"> 4 This procedure also applies to other compounds composed of two or more elements that inflect (e.g.</Paragraph>
    <Paragraph position="10"> One clear observation is that there is a great discrepancy between the initial candidate lists (NA: 238,313; NDN: 116,246) and the final selected compound forms (NA: 21,289; NDN: 3,750)6. The percentage in the graphic was calculated based on the number of non-recognized different candidates (NA: 104,715; NDN: 56,741).</Paragraph>
    <Paragraph position="11"> Another interesting observation regarding Graphic 1 is that the NA candidate list is slightly more than double the size of the NDN candidate list. Nevertheless, the NA candidate list includes proportionally more valid compound forms (NA: 20%; NDN: 7%). In addition, the final selected NA compound list contains about five times more entries than the corresponding NDN list.</Paragraph>
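    <Paragraph> The proportions quoted above can be rechecked from the counts given in this section. The short Python sketch below only restates that arithmetic; the variable and function names are hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
# Counts quoted in the text.
non_recognized = {"NA": 104_715, "NDN": 56_741}   # different non-recognized candidates
selected = {"NA": 21_289, "NDN": 3_750}           # final selected compound forms
initial = {"NA": 238_313, "NDN": 116_246}         # initial candidate lists


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


for cls in ("NA", "NDN"):
    print(f"{cls}: {share(selected[cls], non_recognized[cls]):.1f}% selected")
# NA: 20.3% selected
# NDN: 6.6% selected

print(f"{initial['NA'] / initial['NDN']:.2f}")    # 2.05
print(f"{selected['NA'] / selected['NDN']:.2f}")  # 5.68
```
]]></Paragraph>
    <Paragraph> Rounded to whole percentages, the two shares correspond to the 20% and 7% reported for Graphic 1.</Paragraph>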
    <Paragraph position="12"> The selected compound forms resulted in a total of 19,825 NA and 3,769 NDN canonical entries, which correspond respectively to 41,267 NA and 7,722 NDN inflected forms.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Evaluation
</SectionTitle>
    <Paragraph position="0"> Before lexical acquisition, the application of LabEL's lexical resources to the acquisition corpus allowed assigning 94,229 different simple word tags (34,892 nouns) and 15,120 different compound tags (13,594 nouns). Hence, the compound noun percentage is low (28%), compared to the number of different nominal tags assigned to simple forms. This discrepancy is not unexpected, given the low number of nominal compound entries in dictionaries.</Paragraph>
    <Paragraph position="1"> As previously mentioned, the gathering of NA and NDN compounds in the acquisition corpus led to the formalization of 23,594 canonical entries. Accordingly, the inflected form dictionary grew to approximately 3 to 4 times its previous size (53,815 additional entries, for a total of 76,396 entries).</Paragraph>
    <Paragraph position="2"> 6 In this study, we did not assess the list of candidates recognized by dictionaries, which means that we did not count the hypothetical cases of embedded compound forms (e.g. cabo de alta-tensao, high-tension electricity cable).</Paragraph>
    <Paragraph position="3"> When we apply the enlarged dictionary to the same corpus, we observe that the percentage of compound nouns with respect to the total of nominal tags increased significantly. Now, 40,902 different compound forms were identified, which means that more than half of the nouns in the corpus correspond to multiwords.</Paragraph>
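    <Paragraph> The percentages in this section follow from the tag counts reported above. The Python sketch below restates the arithmetic; it assumes that the number of different simple noun tags (34,892) is unchanged after the dictionary is enlarged, which the text does not state explicitly. All names are hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
simple_nouns = 34_892       # different simple noun tags (from the text)
compounds_before = 13_594   # different compound noun tags before acquisition
compounds_after = 40_902    # different compound forms after acquisition


def compound_share(compounds, simples):
    """Percentage of compound nouns among all different nominal tags."""
    return 100.0 * compounds / (compounds + simples)


before = compound_share(compounds_before, simple_nouns)
# Assumption: the simple noun count stays at 34,892 after the enlargement.
after = compound_share(compounds_after, simple_nouns)
print(f"{before:.1f}% -> {after:.1f}%")  # 28.0% -> 54.0%

added, total = 53_815, 76_396            # inflected-form dictionary entries
growth = total / (total - added)
print(f"{growth:.2f}")                   # 3.38
```
]]></Paragraph>
    <Paragraph> Under that assumption, the compound share rises from 28% to about 54%, in line with the statement that more than half of the nouns correspond to multiwords, and the growth factor of about 3.4 falls within the reported range of 3 to 4 times.</Paragraph>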
    <Paragraph position="4"> Considering the compound occurrences in the acquisition corpus, Graphic 2 illustrates their frequency distribution.</Paragraph>
    <Paragraph position="5"> It is important to draw attention to the fact that 89% of compound forms occur fewer than five times; in particular, 58% occur just once.</Paragraph>
    <Paragraph position="6"> These figures demonstrate that, contrary to what is observed with simple nouns, which recur very often in texts, the average number of compound occurrences is, in general, extremely low. This evidence raises the question of whether statistical methods, based on frequencies, can adequately handle the majority of compound forms.</Paragraph>
    <Paragraph position="7"> Even though compound acquisition was carried out exclusively on one fragment of CETEMPublico, the application of the new dictionary to the remaining fragments of this corpus also increased the number of tags assigned to compound words.</Paragraph>
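    <Paragraph> A frequency profile of this kind can be computed directly from a list of compound occurrences. The following Python sketch is illustrative only: the occurrence list is toy data, not corpus data, and the function name frequency_profile and its threshold parameter are hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
from collections import Counter


def frequency_profile(occurrences, threshold=5):
    """Return the shares of distinct forms occurring once and below threshold."""
    counts = Counter(occurrences)
    total = len(counts)
    if total == 0:
        return 0.0, 0.0
    once = sum(1 for c in counts.values() if c == 1)
    rare = sum(1 for c in counts.values() if c < threshold)
    return once / total, rare / total


# Toy occurrence list (invented frequencies, for illustration only).
toy = (["ponto de vista"] * 6 + ["ser humano"] * 3
       + ["conselho de guerra"] + ["bomba atomica"])
once_share, rare_share = frequency_profile(toy)
print(once_share, rare_share)  # 0.5 0.75
```
]]></Paragraph>
    <Paragraph> Applied to the compound occurrences of the acquisition corpus, the two returned values would correspond to the 58% and 89% figures reported for Graphic 2.</Paragraph>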
    <Paragraph position="8"> On average, before lexical acquisition about 13,000 different compound nouns were recognized; this number more than doubled (approx. 33,000) when we applied the enlarged dictionaries to the other fragments. As mentioned before, in the acquisition corpus, the number of compound tags exceeded the number of simple noun tags. So, we may infer that a similar behavior would be expected in the remaining fragments, if they were also considered.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Final Remarks
</SectionTitle>
    <Paragraph position="0"> In this paper, we described a new FST-based compound inflectional tool. The main advantage of this inflectional module is that it reuses INTEX simple word inflectional graphs in the simultaneous generation of all compounds, regardless of their internal structure. Even though it has only been tested in the formalization and generation of Portuguese compound nouns, we believe it can be easily adapted to handle other languages having similar compound inflectional behavior.</Paragraph>
    <Paragraph position="1"> A corpus-based approach to multiword acquisition was also presented. We showed that, in spite of involving human effort, the results obtained effectively improved dictionary coverage. Moreover, the results concerning compound noun frequency raised the question of whether statistical approaches, based on word frequencies, are adequate to extract multiword nouns from texts.</Paragraph>
  </Section>
</Paper>