<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2405"> <Title>Identifying idiomatic expressions using automatic word-alignment</Title> <Section position="4" start_page="34" end_page="34" type="metho"> <SectionTitle> 2 Data and resources </SectionTitle> <Paragraph position="0"> We base our investigations on the Europarl corpus consisting of several years of proceedings from the European Parliament (Koehn, 2003). We focus on Dutch expressions and their translations into English, Spanish and German.2 Thus, we used the entire sections of Europarl in these three languages.</Paragraph> <Paragraph position="1"> The corpus has been tokenized and aligned at the sentence level (Tiedemann and Nygaard, 2004).</Paragraph> <Paragraph position="2"> The Dutch part contains about 29 million tokens in about 1.2 million sentences. The English, Spanish and German counterparts are of similar size between 28 and 30 million words in roughly the same number of sentences.</Paragraph> <Paragraph position="3"> Automatic word alignment has been done using GIZA++ (Och, 2003). We used standard settings of the system to produce Viterbi alignments of IBM model 4. Alignments have been produced for both translation directions (source to target and target to source) on tokenized plain text.3 We also used a well-known heuristics for combining the two directional alignments, the so-called refined alignment (Och et al., 1999). Word-to-word alignments have been merged such that words are connected with each other if they are linked to the same target. In this way we obtained three different word alignment files: source to target (src2trg) with possible multi-word units in the source language, target to source (trg2src) with possible multi-word units in the target language, and refined with possible multi-word units in both languages. We also created bilingual word type links from the different word-aligned corpora. These lists include alignment frequencies that we will use later on for extracting default alignments for individual words. Henceforth, we will call them sentence and word alignment have not been done. We rely entirely on the results of automatic processes.</Paragraph> </Section> <Section position="5" start_page="34" end_page="35" type="metho"> <SectionTitle> 3 Extracting candidates from corpora </SectionTitle> <Paragraph position="0"> The Dutch section from the Europarl corpus was automatically parsed with Alpino, a Dutch wide-coverage parser.4 1.25% of the sentences could not be parsed by Alpino, given the fact that many sentences are rather lengthy. We selected those sentences in the Dutch Europarl section that contain at least one of a group of verbs that can function as main or support verbs. Support verbs are prone to lexicalization or idiomatization along with their complementation (Butt, 2003). The selected verbs are: doen, gaan, geven, hebben, komen, maken, nemen, brengen, houden, krijgen, stellen and zitten.5 A fully parsed sentence is represented by the list of its dependency triples. From the dependency triples, each main verb is tallied with every dependent prepositional phrase (PP). In this way, we collected all the VERB PP tuples found in the selected documents. To avoid data sparseness, the NP inside the PP is reduced to the head noun's lemma and verbs are lemmatized, too. Other potential arguments under a verb phrase node are ignored.</Paragraph> <Paragraph position="1"> A sample of more than 191,000 candidates types (413,000 tokens) was collected. 
<Paragraph position="2"> To obtain statistically reliable counts, types that occur fewer than 50 times were ignored.</Paragraph> <Paragraph position="3"> For each candidate triple, the log-likelihood (Dunning, 1993) and salience (Kilgarriff and Tugwell, 2001) scores were calculated. These scores have been shown to perform reasonably well in identifying collocations and other lexicalized expressions (Villada Moirón, 2005). In addition, the head dependence between each PP in the candidate dataset and its selecting verbs was measured. Merlo and Leybold (2001) used head dependence as a diagnostic to determine the argument (or adjunct) status of a PP. Head dependence is measured as the amount of entropy observed among the co-occurring verbs for a given PP, as suggested in (Merlo and Leybold, 2001; Baldwin, 2005). Using the two association measures and the head dependence heuristic, three different rankings of the candidate triples were produced.</Paragraph> <Paragraph position="4"> The three ranks assigned to each triple were uniformly combined to form the final ranking. From this list, we selected the top 200 triples, which we considered a manageable size to test our method.</Paragraph> </Section> <Section position="6" start_page="35" end_page="38" type="metho"> <SectionTitle> 4 Methodology </SectionTitle> <Paragraph position="0"> We examine how expressions in the source language (Dutch) are conceptualized in a target language. The translations in the target language encode the meaning of the expression in the source language. Using the translation links in parallel corpora, we attempt to establish what type of meaning the expression in the source language has. To accomplish this we make use of the three word-aligned parallel corpora from Europarl described in section 2.</Paragraph> <Paragraph position="1"> Once the translation links of each expression in the source language have been collected, the entropy observed among the translation links is computed per expression. We also take into account how often the translation of an expression coincides with the default alignment of each triple component. The default 'translation' is extracted from the corresponding bilingual link lexicon.</Paragraph> <Section position="1" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 4.1 Collecting alignments </SectionTitle> <Paragraph position="0"> For each triple in the source language (Dutch) we collect its corresponding (hypothetical) translations in a target language. Thus, we have a list of 200 VERB PP triples representing 200 potential MWEs in Dutch. We selected all occurrences of each triple in the source language and all aligned sentences containing their corresponding translations into English, German and Spanish. We restricted ourselves to instances found in 1:1 sentence alignments; other alignment units contain many errors in word and sentence alignment and were therefore discarded. Relying on automatic word alignment, we collect all translation links for each verb, preposition and noun occurrence within the triple context in the three target languages.</Paragraph> <Paragraph position="1"> To capture the meaning of a source expression (triple) S, we collect all the translation links of its component words s in each target language. Thus, for each triple, we gather three lists of translation links T_s, as in the sketch below.</Paragraph>
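As a concrete illustration of this collection step, here is a minimal sketch. Representing word alignments as sets of (source index, target index) pairs is an assumption made for illustration; it does not mirror the GIZA++ output format.

```python
from collections import defaultdict

NO_LINK = "NO LINK"

def collect_links(trg_tokens, alignment, triple_positions):
    """Gather the target words linked to each triple component in one
    1:1-aligned sentence pair.  `alignment` is a set of (src_idx, trg_idx)
    pairs; `triple_positions` maps component words to source positions."""
    links = defaultdict(list)
    for word, i in triple_positions.items():
        targets = [trg_tokens[j] for (s, j) in sorted(alignment) if s == i]
        # Unaligned words (usually aligned to the empty word) get NO LINK.
        links[word].append(" ".join(targets) if targets else NO_LINK)
    return links

# Toy sentence pair for the triple AAN LICHT BRENG:
trg = "this was revealed".split()       # "dit werd aan het licht gebracht"
alignment = {(0, 0), (1, 1), (4, 2)}    # licht -> revealed
positions = {"aan": 2, "licht": 4, "gebracht": 5}
print(dict(collect_links(trg, alignment, positions)))
# {'aan': ['NO LINK'], 'licht': ['revealed'], 'gebracht': ['NO LINK']}
```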
<Paragraph position="2"> Consider the example AAN LICHT BRENG, representing the MWE iets aan het licht brengen 'reveal'. Table 1 shows some of the links found for this triple. If a word in the source language has no link in the target language (usually because it is aligned to the empty word), NO LINK is assigned.</Paragraph> <Paragraph position="3"> Table 1: Links in English for the components of the triple AAN LICHT BRENG. aan: NO LINK, to, of, in, for, from, on, into, at. licht: NO LINK, light, revealed, exposed, highlight, shown, shed light, clarify. breng: NO LINK, brought, bring, highlighted, has, is, makes.</Paragraph> <Paragraph position="4"> Note that Dutch word order is more flexible than English word order and that the PP argument in a candidate expression may be separated from its selecting verb by any number of constituents. This introduces considerable noise when retrieving translation links. In addition, it is known that concepts may be lexicalized very differently in different languages. Because of this, words in the source language may translate to nothing in a target language, which introduces many mappings of a word to NO LINK.</Paragraph> </Section> <Section position="3" start_page="35" end_page="36" type="sub_section"> <SectionTitle> 4.2 Measuring translational entropy </SectionTitle> <Paragraph position="0"> Our intuition is that words in idiomatic expressions are harder to align than other words. Thus, we expect a larger variety of links (including erroneous alignments) for words in such expressions than for words taken from expressions with a more literal meaning. For the latter, we expect fewer alignment candidates, possibly with only one dominant default translation. Entropy is a good measure of the unpredictability of an event. We use this measure to compare the alignments of our candidates, and we expect a high average entropy for idiomatic expressions. In this way we approximate a measure of meaning predictability.</Paragraph> <Paragraph position="1"> For each word s in a triple, we compute the entropy of the aligned target words as shown in equation (1): $H(T_s \mid s) = -\sum_{t \in T_s} P(t \mid s) \log P(t \mid s)$ (1). This measure is equivalent to translational entropy (Melamed, 1997b). P(t|s) is estimated as the proportion of alignment t among all alignments of word s found in the corpus in the context of the given triple (note that we also consider cases where s is part of an aligned multi-word unit). Finally, the translational entropy of a triple is the average translational entropy of its components. It is unclear how to treat NO LINKs, so we experiment with three variants of entropy: (1) leaving out NO LINKs, (2) counting NO LINKs as multiple types, and (3) counting all NO LINKs as one unique type.</Paragraph> </Section>
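The three NO LINK treatments can be made concrete with a small sketch; the variant names are ours.

```python
import math
from collections import Counter

NO_LINK = "NO LINK"

def translational_entropy(links, no_link="one_type"):
    """Entropy of the aligned target words of one triple component.
    no_link: 'drop' leaves NO LINKs out, 'multiple' counts every NO LINK
    as its own type, 'one_type' treats all NO LINKs as a single type.
    A triple's score is the average over its three components."""
    if no_link == "drop":
        links = [t for t in links if t != NO_LINK]
    elif no_link == "multiple":
        links = [f"{NO_LINK}#{i}" if t == NO_LINK else t
                 for i, t in enumerate(links)]
    counts = Counter(links)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())

links = ["light", "light", "revealed", NO_LINK, NO_LINK]
for variant in ("drop", "multiple", "one_type"):
    print(variant, round(translational_entropy(links, variant), 3))
```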
<Section position="4" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 4.3 Proportion of default alignments (pda) </SectionTitle> <Paragraph position="0"> If an expression has a literal meaning, we expect the default alignments to be accurate literal translations. If an expression has an idiomatic meaning, the default alignments will be very different from the links observed in the translations.</Paragraph> <Paragraph position="1"> For each triple S, we count how often each of its components s is linked to one of its default alignments D_s; as defaults we used the four most frequent alignment types extracted from the corresponding link lexicon described in section 2. A large proportion of default alignments suggests that the expression is very likely to have a literal meaning; a low proportion is suggestive of non-transparent meaning. Note that we take NO LINKs into account when computing the proportions. Formally, pda is calculated as $\mathrm{pda}(S) = \frac{\sum_{s \in S} \mathrm{count}(s, D_s)}{\sum_{s \in S} \mathrm{count}(s)}$, where count(s, D_s) is the number of occurrences of component s (within the triple context) linked to one of its default alignments and count(s) is its total number of occurrences.</Paragraph>
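A minimal sketch of the pda computation, reusing the per-component link lists from section 4.1; the default sets shown here are invented for illustration, not taken from an actual link lexicon.

```python
def pda(link_lists, defaults):
    """Proportion of default alignments for one triple.  `link_lists`
    maps each component to its observed links (NO LINKs included);
    `defaults` maps each component to its four most frequent alignment
    types from the bilingual link lexicon."""
    hits = total = 0
    for word, links in link_lists.items():
        hits += sum(1 for t in links if t in defaults[word])
        total += len(links)
    return hits / total if total else 0.0

links = {"licht": ["light", "revealed", "NO LINK"],
         "breng": ["bring", "brought", "highlighted"]}
defaults = {"licht": {"light", "lights", "slight", "bright"},
            "breng": {"bring", "brought", "brings", "bringing"}}
print(pda(links, defaults))  # 3 of 6 links are defaults: 0.5
```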
<Paragraph position="2"> We experimented with the three word-alignment types (src2trg, trg2src and refined) and the two scoring methods (entropy and pda). The 200 candidate MWEs were assessed and classified as idiomatic or literal expressions by a human expert. For assessing performance, standard precision and recall are not applicable in our case because we do not want to define an artificial cut-off for our ranked list but rather evaluate the ranking itself. Instead, we measured the performance of each alignment type and scoring method with an evaluation metric employed in information retrieval, uninterpolated average precision (uap), which aggregates precision points into one evaluation figure: at each point c in the retrieved list where a true positive S_c is found, the precision P(S_1..S_c) is computed, and all precision points are then averaged (Manning and Schütze, 1999). We used the initial ranking of our candidates as baseline. Our list of potential MWEs shows an overall precision of 0.64 and an uap of 0.755.</Paragraph> </Section> <Section position="5" start_page="36" end_page="37" type="sub_section"> <SectionTitle> 5.1 Comparing word alignment types </SectionTitle> <Paragraph position="0"> Table 2 summarizes the results of using the entropy measure (leaving out NO LINKs) with the three alignment types for the NL-EN language pair. Using word alignments improves the ranking of the candidates in all three cases. Among them, src2trg shows the best performance. This is surprising because the quality of word alignment from English to Dutch (trg2src) is generally higher, owing to differences in compounding between the two languages. However, compounding is mainly an issue for noun phrases, which make up only one component of the triples.</Paragraph> <Paragraph position="1"> We assume that src2trg works better in our case because in this alignment model we explicitly link each word in the source language to exactly one target word (or the empty word), whereas in the trg2src model we often get multiple words (in the target language) aligned to individual words in the triple. Many errors are introduced in such alignment units. Table 3 illustrates this with links for the Dutch triple OP PRIJS STEL, corresponding to the expression iets op prijs stellen 'to appreciate sth.'</Paragraph> <Paragraph position="2"> Table 3: Example links for the triple OP PRIJS STEL (each line shows the src2trg link and the trg2src link, both given as source -> target). gesteld -> appreciate | stellen -> NO LINK. prijs -> appreciate | prijs -> much appreciate indeed. op -> appreciate | op -> NO LINK. gesteld -> be | stellen -> keenly appreciate. prijs -> delighted | prijs -> fact. op -> NO LINK | op -> NO LINK.</Paragraph> <Paragraph position="3"> The src2trg alignment proposes appreciate as a link for all three triple components. This type of alignment is not possible in trg2src. Instead, trg2src includes two NO LINKs in the first example in table 3. Furthermore, several multi-word units in the target language are linked to the triple components, partly because of alignment errors. This way, we end up with many NO LINKs and many alignment alternatives in trg2src that influence our entropy scores. This can be observed for idiomatic as well as literal expressions, which makes translational entropy in trg2src alignments less reliable for contrasting the two types of expressions.</Paragraph> <Paragraph position="4"> The refined alignment model starts with the intersection of the two directional models and iteratively adds links if they meet certain adjacency constraints. This results in many NO LINKs and also in alignments with multiple words on both sides, which seems to have the same negative effect as in the trg2src model.</Paragraph> </Section> <Section position="6" start_page="37" end_page="37" type="sub_section"> <SectionTitle> 5.2 Comparing scoring metrics </SectionTitle> <Paragraph position="0"> Table 4 compares translational entropy and pda across the three language pairs (the alignment is src2trg, given that it achieves the best performance; see Table 2). All scores produce better rankings than the baseline. In general, pda achieves slightly better accuracy than entropy, except for the NL-DE language pair. Nevertheless, the difference between the metrics is hardly significant.</Paragraph> </Section>
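For reference, the uap metric used in these comparisons can be computed as in the following sketch, with the gold-standard judgements assumed to be booleans marking true MWEs.

```python
def uap(ranked_labels):
    """Uninterpolated average precision: average the precision values
    at every rank where a true positive (an actual MWE) occurs."""
    precisions, tp = [], 0
    for c, is_mwe in enumerate(ranked_labels, start=1):
        if is_mwe:
            tp += 1
            precisions.append(tp / c)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Ranked candidate list with gold labels, best-scored first:
print(round(uap([True, True, False, True]), 3))  # (1/1 + 2/2 + 3/4) / 3 = 0.917
```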
<Section position="7" start_page="37" end_page="38" type="sub_section"> <SectionTitle> 5.3 Further improvements </SectionTitle> <Paragraph position="0"> One problem in our data is that we deal with word-form alignments rather than lemmatized versions. For Dutch, we know the lemma of each word instance in our candidate set. However, for the target languages we only have access to surface forms from the corpus. Naturally, inflectional variation inflates the entropy scores (because of the larger variety of alignment types) and also affects the pda scores (where exact word forms have to be matched against the default alignments instead of lemmas). In order to test the effect of lemmatization on different language pairs, we used CELEX (Baayen et al., 1993) for English and German to reduce word forms in the alignments and in the link lexicon to their corresponding lemmas. We assigned the most frequent lemma to ambiguous word forms.</Paragraph> <Paragraph position="1"> Table 5 shows the scores obtained when applying lemmatization to the src2trg alignments across language pairs with different settings. Surprisingly, lemmatization adds little or even decreases the accuracy of the pda and entropy scores. It is also surprising that lemmatization does not affect the scores for morphologically richer languages such as German (compared to English). One possible reason is that lemmatization discards morphological information that is crucial for identifying idiomatic expressions: nouns in idiomatic expressions are more fixed than nouns in literal expressions, whereas verbs in idiomatic expressions often allow tense inflection. By clustering word forms into lemmas we lose this information. In future work, we might lemmatize only the verb.</Paragraph> <Paragraph position="2"> Another issue is the reliability of the word alignment that we base our investigation upon. We want to exploit the fact that automatic word alignment has problems with aligning individual words that belong to larger lexical units. However, we believe that the alignment program in general has problems with highly ambiguous words such as prepositions. Therefore, prepositions might blur the contrast between idiomatic and literal expressions when measured on the alignment of individual words. Table 5 includes scores for ranking our candidate expressions with and without prepositions. We observe a large improvement when leaving out the alignments of prepositions. This is consistent across all language pairs and all scores we used for ranking.</Paragraph> <Paragraph position="3"> Table 6 provides an excerpt from the ranked list of candidate triples (its columns are rank, pda score, entropy, MWE judgement and triple, for 60 candidate MWEs ranked by pda score). The ranking has been done using src2trg alignments from Dutch to German with the best setting (see table 5). The column labeled MWE states whether the expression is idiomatic ('ok') or literal ('*'). One issue that emerges is whether we can find a threshold value that splits candidate expressions into idiomatic and transparent ones. Such a threshold should be chosen empirically; it will depend on the desired level of precision and on the final application of the list.</Paragraph> </Section> </Section> </Paper>