<?xml version="1.0" standalone="yes"?>
<Paper uid="P93-1003">
  <Title>AN ALGORITHM FOR FINDING NOUN PHRASE CORRESPONDENCES IN BILINGUAL CORPORA</Title>
  <Section position="6" start_page="94304" end_page="94304" type="evalu">
    <SectionTitle>
RESULTS
</SectionTitle>
    <Paragraph position="0"> A sample of the aligned corpus comprising 2,600 alignments was used for testing the algorithm (not all of the alignments contained sentences). 4,900 distinct English noun phrases and 5,100 distinct French noun phrases were extracted from the sample. When forming correspondences involving long sentences with many clauses, it was observed that the position at which a noun phrase occurred in Ei was very roughly proportional to the position of the corresponding noun phrase in Fi. In such cases it was not necessary to form correspondences with all noun phrases in Fi for each noun phrase in Ei. Instead, the location of a phrase in Ei was mapped linearly to a position in Fi and correspondences were formed for noun phrases occurring in a window around that position. This resulted in a total of 34,000 correspondences. The mappings are stable within a few (2-4) iterations.</Paragraph>
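The paper does not give an implementation of the windowed candidate-generation step, but it can be sketched roughly as follows; the function name, the default window size, and the rounding scheme are all illustrative assumptions.

```python
def candidate_correspondences(english_nps, french_nps, window=3):
    """Sketch of the linear position mapping described in the text:
    each English NP position is mapped proportionally onto the French
    NP sequence, and correspondences are formed only with French NPs
    inside a window around the mapped position."""
    pairs = []
    if not english_nps or not french_nps:
        return pairs
    for i, e_np in enumerate(english_nps):
        # Linear mapping: position i in Ei -> position j in Fi.
        j = round(i * (len(french_nps) - 1) / max(len(english_nps) - 1, 1))
        lo = max(0, j - window)
        hi = min(len(french_nps), j + window + 1)
        for f_np in french_nps[lo:hi]:
            pairs.append((e_np, f_np))
    return pairs
```

With a window of 1, an English NP near the start of a long sentence is never paired with a French NP near the end, which is how the candidate set shrinks to the reported 34,000 correspondences rather than the full cross-product.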
    <Paragraph position="1"> In discussing results, a selection of examples will be presented that demonstrates the strengths and weaknesses of the algorithm. To give an indication of noun phrase frequency counts in the sample, the highest ranking correspondences are shown in Table 1. The figures in columns (1) and (3) indicate the number of instances of the noun phrase to their right.</Paragraph>
    <Paragraph position="2">  To give an informal impression of overall performance, the hundred highest ranking correspondences were inspected and of these, ninety were completely correct. Less frequently occurring noun phrases are also of interest for purposes of evaluation; some of these are shown in Table 2.</Paragraph>
    <Paragraph position="3">  The table also illustrates an unembedded English noun phrase having multiple prepositional phrases in its French correspondent. Organizational acronyms (which may not be available in general-purpose dictionaries) are also extracted, as the taggers are robust. Even when a noun phrase occurs only once, a correct correspondence can be found if there are only single noun phrases in each sentence of the alignment. This is demonstrated in the last row of Table 2, which is the result of the following alignment: Ei: "The whole issue of free trade has been mentioned." Fi: "On a mentionné la question du libre-échange." Table 3 shows some incorrect correspondences produced by the algorithm (in the table, "usine" means "factory").</Paragraph>
    <Paragraph position="4"> [Table 3: garbled in extraction; its recoverable row pairs 1 instance of "mix of on-the-job" with 6 instances of "usine".] The sentences that are responsible for these correspondences illustrate some of the problems associated with the correspondence model: Ei: "They use what is known as the dual system in which there is a mix of on-the-job and off-the-job training." Fi: "Ils ont recours à une formation mixte, partie en usine et partie hors usine." The first problem is that the conjunctive modifiers in the English sentence cannot be accommodated by the noun phrase recognizer. The tagger also assigned "on-the-job" as a noun when adjectival use would be preferred. Had verb correspondences been included, there would be a mismatch between the three that exist in the English sentence and the single one in the French. If the English were to reflect the French, so that the correspondence model would be appropriate, the noun phrases would perhaps be "part in the factory" and "part out of the factory". Considered as a translation, this is lame. The majority of errors that occur are not the result of incorrect tagging or noun phrase recognition, but of the approximate nature of the correspondence model. The correspondences in Table 4 are likewise flawed (in the table, "souris" means "mouse" and "tigre de papier" means "paper tiger"). These correspondences are the result of the following sentences: Ei: "It is a roaring rabbit, a toothless tiger." Fi: "C'est un tigre de papier, une souris qui rugit." In the case of the alliterative English phrase "roaring rabbit", the (presumably) rhetorical aspect is preserved as a rhyme in "souris qui rugit", with the result that "rabbit" corresponds to "souris" (mouse). Here again, even if the best correspondence were made, the result would be wrong because of the relatively sophisticated considerations involved in the translation.</Paragraph>
  </Section>
  <Section position="7" start_page="94304" end_page="94304" type="evalu">
    <SectionTitle>
EXTENSIONS
</SectionTitle>
    <Paragraph position="0"> As regards future possibilities, the algorithm lends itself to a range of improvements and applications, which are outlined next.</Paragraph>
    <Paragraph position="1"> Finding Word Correspondences: The algorithm finds corresponding noun phrases but provides no information about word-level correspondences within them. One possibility is simply to eliminate the tagger and noun phrase recognizer, treating all words as individual phrases of length one, at the cost of a larger number of correspondences. Alternatively, the following strategy can be adopted, which involves fewer total correspondences. First, the algorithm is used to build noun phrase correspondences; the phrase pairs that are produced are then themselves treated as a bilingual noun phrase corpus. The algorithm is employed again on this corpus, treating all words as individual phrases. This results in a set of single-word correspondences for the internal words in noun phrases.</Paragraph>
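The two-pass strategy can be sketched as below. The paper's own correspondence algorithm is not reproduced here; a simple co-occurrence counter stands in for it, and all names are illustrative.

```python
from collections import Counter

def cooccurrence_counts(bicorpus):
    """Toy stand-in for the correspondence algorithm: count how often
    each source unit co-occurs with each target unit in an alignment."""
    counts = Counter()
    for src_units, tgt_units in bicorpus:
        for s in src_units:
            for t in tgt_units:
                counts[(s, t)] += 1
    return counts

def word_correspondences(phrase_pairs):
    """Second pass of the strategy in the text: each aligned phrase
    pair from the first pass becomes a tiny bilingual 'sentence' whose
    phrases are single words, and the aligner is run again."""
    word_corpus = [(e.split(), f.split()) for e, f in phrase_pairs]
    return cooccurrence_counts(word_corpus)
```

Because each phrase pair contains only a few words, the second pass involves far fewer candidate correspondences than running the aligner word-by-word over whole sentences.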
    <Paragraph position="2"> Reducing Ambiguity: The basic algorithm assumes that noun phrases can be uniquely identified in both languages, which is only true for simple noun phrases. The problem of prepositional phrase attachment is exemplified by the following correspondences:  The correct English and French noun phrases are "Secretary of State for External Affairs" and "secrétaire d'État aux Affaires extérieures". If prepositional phrases involving "for" and "à" were also permitted, these phrases would be correctly identified; however, many other adverbial prepositional phrases would also be incorrectly attached to noun phrases.</Paragraph>
    <Paragraph position="3"> If all embedded prepositional phrases were permitted by the noun phrase recognizer, the algorithm could be used to reduce the degree of ambiguity between alternatives. Consider a sequence np_e pp_e of an unembedded English noun phrase np_e followed by a prepositional phrase pp_e, and likewise a corresponding French sequence np_f pp_f.</Paragraph>
    <Paragraph position="4"> Possible interpretations of this are: 1. The prepositional phrase attaches to the noun phrase in both languages.</Paragraph>
    <Paragraph position="5"> 2. The prepositional phrase attaches to the noun phrase in one language and does not in the other.</Paragraph>
    <Paragraph position="6"> 3. The prepositional phrase does not attach to the noun phrase in either language.</Paragraph>
    <Paragraph position="7"> If the prepositional phrases attach to the noun phrases in both languages, they are likely to be repeated in most instances of the noun phrase; it is less likely that the same prepositional phrase will be used adverbially with each instance of the noun phrase. This provides a heuristic method for reducing ambiguity in noun phrases that occur several times. The only modifications required to the algorithm are that the additional possible noun phrases and correspondences between them must be included. Given thresholds on the number of occurrences and the probability of the correspondence, the most likely correspondence can be predicted.</Paragraph>
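The repetition heuristic above can be sketched as follows; the threshold values and function names are illustrative assumptions, not values from the paper.

```python
from collections import Counter

def likely_attachments(np_pp_observations, min_count=3, min_ratio=0.7):
    """Sketch of the heuristic in the text: a PP that truly attaches
    to an NP should recur with most instances of that NP, whereas an
    adverbial PP should not. np_pp_observations is a list of (np, pp)
    pairs observed adjacent in the corpus."""
    np_counts = Counter(np for np, _ in np_pp_observations)
    pair_counts = Counter(np_pp_observations)
    attached = set()
    for (np, pp), c in pair_counts.items():
        # Require enough occurrences of the NP, and require the PP to
        # accompany a large fraction of them, before predicting attachment.
        if np_counts[np] >= min_count and c / np_counts[np] >= min_ratio:
            attached.add((np, pp))
    return attached
```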
    <Paragraph position="8"> Including Context: In the algorithm, correspondences between source and target noun phrases are considered irrespective of other correspondences in an alignment. This does not make the best use of the information available, and can be improved upon. For example, consider the following alignment: Ei: "The Bill was introduced just before Christmas." Fi: "Le projet de loi a été présenté juste avant le congé des Fêtes." Here it is assumed that there are many instances of the correspondence "Bill" and "projet de loi", but only one instance of "Christmas" and "congé des Fêtes". This suggests that "Bill" corresponds to "projet de loi" with a high probability and that "Christmas" likewise corresponds strongly to "congé des Fêtes". However, the model will assert that "Christmas" corresponds to "projet de loi" and to "congé des Fêtes" with equal probability, no matter how likely the correspondence between "Bill" and "projet de loi".</Paragraph>
    <Paragraph position="9"> The model can be refined to reflect this situation by considering the joint probability that a target np_f(t) corresponds to a source np_e(s) and that all the other possible correspondences in the alignment are produced. This situation is very similar to that involved in training HMM text taggers, where joint probabilities are computed that a particular word corresponds to a particular part-of-speech and the rest of the words in the sentence are also generated (e.g. [Cutting et al., 1992]).</Paragraph>
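The effect of scoring an alignment jointly can be illustrated with a small sketch: given pair probabilities (assumed available), a near-certain pair such as "Bill"/"projet de loi" pins down the remaining correspondences. The brute-force search over one-to-one assignments is purely illustrative and only feasible for short alignments.

```python
from itertools import permutations

def best_joint_assignment(source_nps, target_nps, prob):
    """Return the one-to-one assignment of source NPs to target NPs
    maximizing the product of pair probabilities. `prob` maps
    (source, target) pairs to probabilities; unseen pairs get a
    small floor value."""
    best, best_score = None, -1.0
    for perm in permutations(target_nps, len(source_nps)):
        score = 1.0
        for s, t in zip(source_nps, perm):
            score *= prob.get((s, t), 1e-9)
        if score > best_score:
            best, best_score = list(zip(source_nps, perm)), score
    return best
```

In the example from the text, even if "Christmas" is equally likely to correspond to either French phrase in isolation, the joint score forces it onto "congé des Fêtes" once "Bill" takes "projet de loi".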
  </Section>
</Paper>