<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0801">
  <Title>A Multilingual Approach to Disambiguate Prepositions and Case Suffixes</Title>
  <Section position="3" start_page="0" end_page="2" type="metho">
    <SectionTitle>
2 Method for disambiguation
</SectionTitle>
    <Paragraph position="0"> The goal of the method is to disambiguate between the possible interpretations of a case suffix appearing in any text. As the target text we have taken the definitions from a monolingual Basque dictionary, Euskal Hiztegia (EH for short) (Sarasola, 1996). The method consists of the following steps: * Extraction of the definitions in EH where the target case suffix occurs.</Paragraph>
    <Paragraph position="1"> * Search of on-line Spanish and English dictionaries to obtain the translation equivalent of the definitions.</Paragraph>
    <Paragraph position="2"> * Extraction of the target preposition from the translation definitions.</Paragraph>
    <Paragraph position="3"> * Disambiguation based on the intersection of the interpretations of case suffix and prepositions.</Paragraph>
    <Paragraph position="4"> We will explain each step in turn.</Paragraph>
    <Section position="1" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
2.1 Extraction of relations from EH
</SectionTitle>
      <Paragraph position="0"> Given a case suffix, in this step we will search the EH dictionary for occurrences of the case suffix.</Paragraph>
      <Paragraph position="1"> We first lemmatize the definitions and perform morphological analysis on them (Aduriz et al., 1996).</Paragraph>
      <Paragraph position="2"> The definitions that contain the target case suffix in a morphological analysis are extracted, storing the following information: the Basque dictionary entry of the definition, the lemma that has the case suffix, the case suffix, and the following lemma.</Paragraph>
      <Paragraph position="3"> Below we can see a sample definition, its lemmatized version, and the two triples extracted from this definition. The occurrences of the instrumental -z are shown in bold.</Paragraph>
      <Paragraph position="4"> Ildo iz. A1 Goldeaz lurra irauliz  irauli#INS#egin Extracting lemma-suffix-lemma triples in this simple way leads to some errors (cf. section 5.1). For instance, the first triple should rather be the dependency golde#INS#irauli (plow#with#turn, to be read in reverse order). We will see that even in this case we are able to obtain correct translations and to disambiguate the preposition correctly. Nevertheless, in the future we plan to use a syntactic parser to better identify the lemmas that are related by the case suffix.</Paragraph>
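The extraction step of this section can be illustrated with a minimal Python sketch. It assumes that each definition has already been lemmatized and analyzed into (lemma, suffix tag) pairs; the token layout, the function name extract_triples and the sample analysis are hypothetical and are not taken from the original system.

```python
from typing import List, Optional, Tuple

# A token is (lemma, case-suffix tag); the tag is None when the word
# carries no case suffix of interest. This layout is an assumption.
Token = Tuple[str, Optional[str]]


def extract_triples(tokens: List[Token], target: str = "INS") -> List[str]:
    """Return lemma#SUFFIX#next-lemma triples for tokens bearing the target suffix."""
    triples = []
    # The last token has no following lemma, so it is never the start of a triple.
    for i, (lemma, suffix) in enumerate(tokens[:-1]):
        if suffix == target:
            next_lemma = tokens[i + 1][0]
            triples.append(f"{lemma}#{suffix}#{next_lemma}")
    return triples


# Hypothetical analysis of the sample definition of "ildo".
definition = [("golde", "INS"), ("lur", None), ("irauli", "INS"), ("egin", None)]
print(extract_triples(definition))  # ['golde#INS#lur', 'irauli#INS#egin']
```

As in the text, pairing each suffixed lemma with the word that follows it is a simplification: the first triple comes out as golde#INS#lur rather than the dependency golde#INS#irauli.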
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.2 Search for Spanish/English
translations
</SectionTitle>
      <Paragraph position="0"> After we have a list of entries in the Basque dictionary that contain the lemma-suffix-lemma triple, we search for their equivalent definitions in Spanish and English. We first look up the entry in the bilingual dictionary, and then retrieve the definitions for each of the possible translations from the monolingual dictionaries.</Paragraph>
      <Paragraph position="1"> ... over the ground with a plow. The translation of the first triple is plow#with#ground, to be read in reverse order. The translation of the second is turn#NULL#produce, also to be read in reverse order. In this second triple the instrumental case suffix is not translated explicitly by a preposition, but by a syntactic construct.</Paragraph>
      <Paragraph position="3"> We use two bilingual and six monolingual dictionaries. The on-line dictionaries include Colmex (online), Rae (online), and Vox (online). The Basque dictionary and the bilingual dictionaries are stored on a local server, while the monolingual dictionaries are accessed over the Internet using a wrapper.</Paragraph>
      <Paragraph position="4"> A partial list of the translations of ildo (furrow in English, surco in Spanish) is shown below. Note that we obtained two different definitions for surco, coming from different Spanish dictionaries.</Paragraph>
      <Paragraph position="5"> furrow#A long , narrow , shallow trench made in the ground by a plow surco#Excavacion alargada , angosta y poco profunda que se hace paralelamente en la tierra con el arado , para sembrarla despues surco#Hendedura que se hace en la tierra con el arado</Paragraph>
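The two-stage lookup described in this section can be sketched as follows. The in-memory dictionaries BILINGUAL and MONOLINGUAL are toy stand-ins (an assumption) for the local bilingual dictionaries and the wrapped on-line monolingual dictionaries; only the three definitions quoted above are included.

```python
from typing import Dict, List

# Toy stand-ins for the real resources; the data layout is hypothetical.
BILINGUAL: Dict[str, List[str]] = {"ildo": ["furrow", "surco"]}
MONOLINGUAL: Dict[str, List[str]] = {
    "furrow": ["A long , narrow , shallow trench made in the ground by a plow"],
    "surco": [
        "Excavacion alargada , angosta y poco profunda que se hace "
        "paralelamente en la tierra con el arado , para sembrarla despues",
        "Hendedura que se hace en la tierra con el arado",
    ],
}


def translated_definitions(entry: str) -> List[str]:
    """Look up the translations of a Basque entry, then collect every
    monolingual definition found for each translation."""
    results = []
    for translation in BILINGUAL.get(entry, []):
        for definition in MONOLINGUAL.get(translation, []):
            results.append(f"{translation}#{definition}")
    return results


for line in translated_definitions("ildo"):
    print(line)
```

An entry with no translation, or a translation with no definition, simply contributes nothing, which mirrors how missing definitions reduce coverage later on.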
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.3 Extraction of Spanish/English
equivalent relations
</SectionTitle>
      <Paragraph position="0"> Given a list of definitions in Spanish and English, we search each definition for the translation of the Basque triple found in section 2.1. That is, we look for a triple of consecutive words where the first word is the translation of the last word in the Basque triple, the second word is a preposition (which corresponds to the Basque suffix), and the third word is the translation of the first word in the Basque triple. Between the preposition and the last word in the triple we allow for the presence of a determiner or an adjective in the text. More complex patterns could be allowed, up to full syntactic analyses, but at this point we follow this simple scheme.</Paragraph>
      <Paragraph position="1"> Below we can find the triples for golde#INS#lur, obtained from the three definitions above. One triple is obtained twice from two different definitions.</Paragraph>
      <Paragraph position="2"> furrow#ground#by#plow surco#tierra#con#arado surco#tierra#con#arado Definitions that do not have a matching triple are discarded, so Basque triples without a matching triple remain ambiguous. For instance, we could not find triples for irauli#INS#egin (cf. example in section 2.1). The instrumental suffix is sometimes translated without prepositions (in this case &quot;... made turning ...&quot;).</Paragraph>
      <Paragraph position="3"> Looking up translations in the bilingual dictionaries requires lemmatization and part-of-speech tagging. For English we use the TnT PoS tagger (Brants, 2000) and WordNet for lemmatization (Miller et al., 1990). For Spanish we use the tools of Atserias et al. (1998).</Paragraph>
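The pattern matching of this section can be sketched in Python as below. The sketch assumes that definitions are already lemmatized and tagged with coarse part-of-speech labels; the tag names (DET, ADJ, and so on), the function name and the hand-tagged sample sentences are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Tagged = Tuple[str, str]  # (lemma, coarse PoS tag); tag names are assumptions

SKIPPABLE = {"DET", "ADJ"}  # at most one of these may follow the preposition


def find_translation_triple(
    tokens: List[Tagged], heads: Set[str], preps: Set[str], deps: Set[str]
) -> Optional[str]:
    """Find head, preposition, dependent, where `heads` holds translations of the
    last Basque lemma and `deps` holds translations of the first Basque lemma."""
    for i in range(len(tokens) - 2):
        first, prep = tokens[i][0], tokens[i + 1][0]
        if first not in heads or prep not in preps:
            continue
        # Try the word right after the preposition, then one word further
        # if a determiner or adjective intervenes.
        for j in (i + 2, i + 3):
            if j == len(tokens):
                break
            if tokens[j][0] in deps:
                return f"{first}#{prep}#{tokens[j][0]}"
            if tokens[j][1] not in SKIPPABLE:
                break
    return None


english = [("trench", "NOUN"), ("make", "VERB"), ("in", "PREP"), ("the", "DET"),
           ("ground", "NOUN"), ("by", "PREP"), ("a", "DET"), ("plow", "NOUN")]
spanish = [("hendedura", "NOUN"), ("hacer", "VERB"), ("en", "PREP"), ("el", "DET"),
           ("tierra", "NOUN"), ("con", "PREP"), ("el", "DET"), ("arado", "NOUN")]

print(find_translation_triple(english, {"ground"}, {"of", "by", "with", "in"}, {"plow"}))
print(find_translation_triple(spanish, {"tierra"}, {"de", "con", "a", "en"}, {"arado"}))
```

For golde#INS#lur, the head set holds the translations of lur and the dependent set the translations of golde, so the two calls yield ground#by#plow and tierra#con#arado. A definition that reorders the complements returns None, which is one of the coverage problems discussed in section 5.1.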
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.4 Disambiguation
</SectionTitle>
      <Paragraph position="0"> For each Basque case suffix, Spanish preposition and English preposition we have a list of interpretations (cf. Table 1). We assign the interpretations of the preposition to each Spanish/English triple. The intersection of all these interpretations is then assigned to the corresponding Basque triple.</Paragraph>
      <Paragraph position="1"> Continuing with our example, we can see that the intersection between the interpretations of the English preposition by (three interpretations) and the interpretations of the Spanish preposition con (four interpretations) is manner and instrument.</Paragraph>
      <Paragraph position="2"> Therefore, we can say that the Basque instrumental case interpretation in this case will be manner or instrument.</Paragraph>
      <Paragraph position="3"> furrow#ground#by a#plow# manner instrument during-time surco#tierra#con el#arado# manner instrument cause containing golde#INS#lur#instrument manner</Paragraph>
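The intersection step can be written down as a short Python sketch. Only the two interpretation sets shown in the example above are included; the full contents of Table 1 are not reproduced here, and the names INTERPRETATIONS and disambiguate are hypothetical.

```python
from functools import reduce
from typing import Dict, FrozenSet, List

# Interpretation sets for the two prepositions of the running example only.
INTERPRETATIONS: Dict[str, FrozenSet[str]] = {
    "by": frozenset({"manner", "instrument", "during-time"}),
    "con": frozenset({"manner", "instrument", "cause", "containing"}),
}


def disambiguate(prepositions: List[str]) -> FrozenSet[str]:
    """Intersect the interpretation sets of all prepositions found for one Basque triple."""
    sets = [INTERPRETATIONS[p] for p in prepositions]
    if not sets:
        return frozenset()
    return reduce(frozenset.intersection, sets)


# golde#INS#lur was matched once in English (by) and twice in Spanish (con).
print(sorted(disambiguate(["by", "con", "con"])))  # ['instrument', 'manner']
```

Repeated prepositions do not change the result, since intersecting a set with itself is a no-op; a triple matched in only one language keeps all the interpretations of that single preposition.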
    </Section>
  </Section>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Interpretations for the
instrumental case suffix and equivalent prepositions
</SectionTitle>
    <Paragraph position="0"> The method explained in the previous section is fully automatic, and it only requires the list of interpretations for each case suffix and preposition. In this work, we want to evaluate whether the overall approach is feasible, so we selected Basque as the target language and a single case suffix, the instrumental case -z. Table 1 shows the list of possible interpretations, and Tables 2 and 3 give examples for each interpretation.</Paragraph>
    <Paragraph position="1"> The sources for the interpretations of the instrumental case have been a grammar of Basque (Euskaltzaindia, 1985) and a bilingual dictionary (Elhuyar, 1996). Possible interpretations for Spanish and English prepositions have been taken from an English dictionary (Cambridge, online), a Spanish dictionary (Vox, online) and a Spanish grammar (Bosque &amp; Demonte, 1999).</Paragraph>
    <Paragraph position="2"> For this work we have taken a descriptive approach, but other more theoretically committed approaches are also possible. The overall method is independent of the set of interpretations, as it only needs a table of possible interpretations in the style of Table 1. Section 5.4 further discusses other alternatives.</Paragraph>
    <Paragraph position="3"> In order to disambiguate the occurrences of the instrumental case suffix we have taken the Spanish and English translations for this case suffix. The list of possible translations is preliminary and covers what we found necessary to carry out this experiment. Table 1 shows the list of prepositions and interpretations for Spanish and English. Examples of the interpretations can be found in Table 2. The Spanish preposition de had the same interpretations as the instrumental case suffix (cf. Table 1), so it was discarded.</Paragraph>
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> The instrumental case occurs in 4,004 different definitions in the EH dictionary. The algorithm in Section 2 was applied to all these definitions, yielding a result for 125 triples, 3.1% of the total.</Paragraph>
    <Paragraph position="1"> The triples for which we had an answer were tagged by hand independently, i.e. not consulting the results output by the algorithm. The hand-tagged set constitutes what we call the gold standard.</Paragraph>
    <Paragraph position="2"> A single linguist did the tagging, consulting other team members when in doubt. Apart from marking the interpretation, the linguist also flagged some special cases.</Paragraph>
    <Paragraph position="3">  1. In some of the examples, the instrumental case was part of a more complex scheme, and was tagged accordingly: * Part of a postposition (XPOST), e.g. -en bidez (by means of) or -en ordez (instead of).</Paragraph>
    <Paragraph position="4"> * Part of a conjunction (XLOK), e.g. batez ere (especially).</Paragraph>
    <Paragraph position="5"> * Part of a compound suffix -zko (XZKO), which results from the aggregation of the instrumental -z with the locative genitive -ko.</Paragraph>
    <Paragraph position="6"> 2. There were three errors in the lemmatization process (XLEM), due to lexicalized items, e.g. gizonezko (meaning male person).</Paragraph>
    <Paragraph position="7"> 3. Finally, the relation in the definition was sometimes wrongly retrieved, e.g.</Paragraph>
    <Paragraph position="8"> * The triple would contain the determiner or  an adjective instead of the dependencies. We thought that the algorithm would be able to work well even with those cases, so we decided to keep them.</Paragraph>
    <Paragraph position="9"> * The triple contains a conjunction (X): these were tagged as incorrect.</Paragraph>
    <Paragraph position="10"> Table 4 shows the number of such cases, alongside the frequency of each interpretation. The most frequent interpretation is instrument. In seven examples, the linguist decided to keep two interpretations: instrument and manner. In a single example, the linguist was unable to select an interpretation, so this example was discarded. The output of the algorithm was compared with the gold standard, yielding the accuracy figures in Table 5. An output was considered correct if it had at least one interpretation in common with the gold standard. The accuracy is given for each dictionary in isolation and for all the results merged (as mentioned in section 2, when two dictionaries propose interpretations for the same triple, their intersection is taken). The remaining ambiguity is 3.1 interpretations per example overall.</Paragraph>
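The scoring criterion just described can be made concrete with a small Python sketch. The function name evaluate and the toy triples are hypothetical; the real figures are those reported in Table 5.

```python
from typing import Dict, Set, Tuple


def evaluate(outputs: Dict[str, Set[str]], gold: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Return (accuracy, remaining ambiguity).

    An output counts as correct when it shares at least one interpretation
    with the gold standard; ambiguity is the mean number of readings kept.
    """
    correct = sum(1 for key in outputs if outputs[key].intersection(gold[key]))
    accuracy = correct / len(outputs)
    ambiguity = sum(len(readings) for readings in outputs.values()) / len(outputs)
    return accuracy, ambiguity


# Toy data, for illustration only.
outputs = {"t1": {"instrument", "manner"}, "t2": {"cause"}}
gold = {"t1": {"instrument"}, "t2": {"manner"}}
print(evaluate(outputs, gold))  # (0.5, 1.5)
```

Because a single shared interpretation is enough, accuracy and remaining ambiguity have to be read together: keeping more readings makes it easier to be counted as correct.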
    <Paragraph position="11"> Table 1: Basque: -z (ins.); English: of, by, with, in; Spanish: de, con, a, en.</Paragraph>
    <Paragraph position="13"> We also computed a most frequent baseline (MF), constructed as follows: for each occurrence of the suffix, the three most frequent interpretations are chosen. The accuracy of this baseline is practically equal to that of the algorithm. Note that the frequencies are computed on the same sample to which the baseline is applied, which yields better results than the baseline would otherwise obtain.</Paragraph>
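A most frequent baseline of this kind could be computed as in the following sketch. The function name and the toy gold standard are hypothetical, and the interpretation labels are limited to those that appear in the running example.

```python
from collections import Counter
from typing import Dict, Set


def most_frequent_baseline(gold: Dict[str, Set[str]], k: int = 3) -> Set[str]:
    """Return the k interpretations that occur most often in the gold standard.
    The same set is then proposed for every occurrence of the suffix."""
    counts = Counter(tag for tags in gold.values() for tag in tags)
    return {tag for tag, _ in counts.most_common(k)}


# Toy gold standard, for illustration only.
toy_gold = {
    "t1": {"instrument"}, "t2": {"instrument", "manner"}, "t3": {"manner"},
    "t4": {"instrument"}, "t5": {"cause"}, "t6": {"cause"}, "t7": {"during-time"},
}
print(sorted(most_frequent_baseline(toy_gold)))  # ['cause', 'instrument', 'manner']
```

Counting frequencies on the same examples that are later scored is what makes the baseline optimistic; a fairer variant would estimate the counts on held-out data.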
  </Section>
  <Section position="6" start_page="2" end_page="2" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> The results show very good accuracy, leaving a remaining ambiguity of 3.1 readings per example. This means that we were able to discard an average of 4 readings for each of the examples, with an error rate of only 5.5%. The results are practically equal to the most frequent baseline, which is usually hard to beat using knowledge-based techniques.</Paragraph>
    <Paragraph position="1"> Coverage of the method is very low, only 2.3%, but this was not an issue for us, as we plan to couple this method with other Machine Learning techniques in a bootstrapping framework. Nevertheless, we are still interested in increasing the coverage, in order to obtain more training data.</Paragraph>
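The percentages quoted in the Results and Discussion sections can be cross-checked with simple arithmetic. The check below assumes that 91 is the number of examples kept after hand-tagging (the count that appears with Table 4) and that the instrumental has 7 possible readings, as stated in section 5.3; it requires the amsmath package.

```latex
\begin{gather*}
\text{triples with an answer: } \frac{125}{4004} \approx 3.1\% \\
\text{coverage over kept examples: } \frac{91}{4004} \approx 2.3\% \\
\text{readings discarded per example: } 7 - 3.1 = 3.9 \approx 4 \\
\text{error rate: } \frac{5}{91} \approx 5.5\%
\end{gather*}
```

Under these assumptions the 3.1% figure refers to all triples for which the algorithm produced an answer, while the 2.3% coverage refers to the examples that remained after the special cases were set aside.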
    <Paragraph position="2"> Next, we analyze in more depth the causes of the low coverage, the sources of the errors and ambiguity, and the interpretations of case suffixes and prepositions.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
5.1 Sources of low coverage
</SectionTitle>
      <Paragraph position="0"> As soon as we started devising this method, it was clear to us that the coverage would be rather low.</Paragraph>
      <Paragraph position="1"> The main reason is that different dictionaries tend to give different details in their definitions, or use differing paraphrases. This fact is intrinsic to our method, and accounts for the large majority of missing answers.</Paragraph>
      <Paragraph position="2"> On the other hand, the simple method used to find triples means that a change in the order of the complements will cause our method to fail to find a translation triple. Syntactic analysis, even shallow parsing methods, will help increase the coverage.</Paragraph>
      <Paragraph position="3"> Another source of discarded triples is the set of cases where the suffix is not translated by a preposition, e.g. the relation is carried by a subject or direct object. When syntactic analysis is performed, we also plan to incorporate the interpretations of the other syntactic relations.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
5.2 Sources of error
</SectionTitle>
      <Paragraph position="0"> Only five errors were made by the algorithm, which were caused by wrong triple pairings, especially when the Basque triple contained a determiner instead of the related word. Examples: ...: punta batez osatua/made by a ...; ...e: odi batez osatua/wake made by ... These errors could be avoided using a syntactic parser. Other wrong pairings were caused by errors in the English PoS tagger, or arose when chance made the algorithm find an unrelated definition.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="2" end_page="2" type="metho">
    <SectionTitle>
Tables 4 and 5
</SectionTitle>
    <Paragraph position="0"> Table 4: Frequency of tags in gold standard. Total kept: 91.</Paragraph>
    <Paragraph position="1"> Table 5: Results for each of the dictionaries, overall, and the most frequent baseline. Columns: total, correct, accur., ambig.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
5.3 Remaining ambiguity
</SectionTitle>
      <Paragraph position="0"> The number of readings left by our method in this experiment is rather high: around 3.1 readings, compared to 7 possible readings for the instrumental. This is a strong reduction, but we would like to make it even smaller.</Paragraph>
      <Paragraph position="1"> We plan to study the source of the residual ambiguity. Alternative sets of interpretations (cf. Section 5.4) with coarser-grained distinctions and lower ambiguity could yield better results. Another alternative is to explore less frequent translations of the case suffixes, which might yield a narrower overlap.</Paragraph>
      <Paragraph position="2"> This is the case, for example, when the instrumental case suffix is translated with from, up, etc.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
5.4 Interpretations of case suffixes and
prepositions
</SectionTitle>
      <Paragraph position="0"> Different authors give differing interpretations for prepositions. We have chosen to take a descriptive list of possible interpretations from a set of sources, mainly dictionaries and grammar books.</Paragraph>
      <Paragraph position="1"> This work covers only the instrumental case suffix and its translations to English and Spanish. If tables for all case suffixes and prepositions were built, the method could be applied to all case suffixes and prepositions, yielding disambiguated relations in all three languages.</Paragraph>
      <Paragraph position="2"> More theoretically committed lists of interpretations (Dorr et al., 1998; Civit et al., 2000; Sowa, 2000) should also be considered, but unfortunately we have not found a full account of all prepositions. If such a full table of interpretations existed, it would be very easy to apply our method and obtain the outcome in terms of these other interpretations.</Paragraph>
    </Section>
  </Section>
</Paper>