<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0414"> <Title>Using 'smart' bilingual projection to feature-tag a monolingual dictionary</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A feature value classifier </SectionTitle> <Paragraph position="0"> In this section, we describe the general algorithm for training a classifier that assigns a feature value to each word of a specific core pos. The following sections detail how this algorithm is applied to different pos and different features. The algorithm is general enough to be applied to various pos/feature combinations: the extraction of training examples is the only part of the process that changes when the algorithm is applied to a different pos and/or feature.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Extraction of training data </SectionTitle> <Paragraph position="0"> Although the following sections describe in greater detail how training data is obtained for each pos/feature combination, the basic approach is outlined here. As in the previous section, we use the sentence-aligned bilingual corpus in conjunction with the statistical bilingual dictionary to extract words that are likely to exhibit a feature. In the previous section, this feature was a particular pos tag. Here, we focus on other features, such as plural.</Paragraph> <Paragraph position="1"> For instance, when looking for plural nouns, we extract plural nouns from the English sentences (they are tagged as such by the Brill tagger, using the tag 'NNS'). We then extract the French word in the corresponding sentence that has the highest correspondence probability with the English word according to the statistical bilingual dictionary. This process again ensures that most (or at least a significant portion) of the extracted French words exhibit the feature in question. In principle, the purpose of classifier training is then to determine what all (or most) of the extracted words have in common and what sets them apart.</Paragraph> <Paragraph position="2"> 3.1.1 Tagging of tense on verbs The first feature we wish to add to the target language lexicon is tense on verbs. More specifically, we restrict our attention to PAST vs. NON-PAST. This is a pragmatic decision: the tagged lexicon is to be used in the context of Machine Translation, and the two tenses that Machine Translation systems encounter most frequently are past and present. In the future, we may investigate a richer tense set.</Paragraph> <Paragraph position="3"> In order to tag tense on verbs, we proceed in principle as described before when estimating lexical priors. We consider each word $e_i$ in the English corpus that is tagged as a past tense verb. Then we find its likely correspondence on the French side, $f_i$, by considering the list of French words that correspond to the given English word, starting from the pair with the highest correspondence probability (as obtained from the bilingual lexicon). The first French word from the top of the list that also occurs in the French sentence $S_F$ is extracted and added to the training set: $$f_i = \operatorname*{argmax}_{1 \le j \le n,\; f_j \in S_F} P(f_j \mid e_i),$$ where $n$ is the number of French words in the lexicon.</Paragraph> <Paragraph position="4"> 3.1.2 Tagging of number on nouns, adjectives, and verbs Further, we tag nouns with number information. Again, we restrict our attention to two possible values: SINGULAR vs. PLURAL. Not only does this make sense from a pragmatic standpoint (i.e. if the Machine Translation system can correctly determine whether a word should be singular or plural, much is gained); it also allows us to train a binary classifier, thus simplifying the problem. A sketch of the extraction procedure is given below.</Paragraph>
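To make the extraction step concrete, the following minimal sketch collects French training words for a given pos/feature combination. The input names are hypothetical stand-ins (the paper specifies no implementation): `bitext` is assumed to be a list of (tagged English sentence, French token list) pairs, and `t_table[e]` the list of (French word, correspondence probability) pairs for English word e, sorted by descending probability as obtained from the statistical bilingual dictionary.

def extract_training_words(bitext, t_table, trigger):
    """Collect French words whose English correspondent exhibits the feature.

    `trigger(i, sent)` decides whether the English token at position i should
    contribute a training example (e.g. it is tagged 'NNS', or it is a past
    tense verb).
    """
    training = []
    for tagged_sent, french_tokens in bitext:
        french_set = set(french_tokens)
        for i, (e_word, _tag) in enumerate(tagged_sent):
            if not trigger(i, tagged_sent):
                continue
            # Walk down the correspondence list from the most probable pair
            # and take the first French word that occurs in the sentence.
            for f_word, _prob in t_table.get(e_word, []):
                if f_word in french_set:
                    training.append(f_word)
                    break
    return training

# Example triggers. Plural nouns are tagged 'NNS' directly; for verbs, number
# is not marked in English, so the closest noun to the left must be plural.
def plural_noun_trigger(i, sent):
    return sent[i][1] == 'NNS'

def plural_verb_trigger(i, sent):
    if not sent[i][1].startswith('VB'):
        return False
    for _word, tag in reversed(sent[:i]):
        if tag.startswith('NN'):
            return tag == 'NNS'
    return False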
<Paragraph position="5"> The extraction of candidate French plural nouns is done as expected: we find the likely French correspondent of each English plural noun (as specified by the English pos-tagger) and add the French words to the training set.</Paragraph> <Paragraph position="6"> However, when tagging number on adjectives and verbs, things are less straightforward, as these features are not marked in English, and thus the information cannot be obtained from the English pos-tagger. In the case of verbs, we look for the first noun to the left of the candidate verb. More specifically, we consider an English verb from the corpus only if the closest noun to its left is tagged as plural. This makes intuitive sense linguistically, as in many cases the verb will follow the subject of a sentence.</Paragraph> <Paragraph position="7"> For adjectives, we apply a similar strategy. As most adjectives (in English) appear directly before the noun that they modify, we consider an adjective only if the closest noun to its right is in the plural. If this is the case, we extract the likely French correspondent of the adjective as before.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Projection into a similarity space of characters </SectionTitle> <Paragraph position="0"> The extracted words are then re-represented in a space that is similar in concept to a vector space. This is done as follows. Let $F_{pos,feat} = \{w_1, \ldots, w_k\}$ denote the set of French words that have been extracted as training data for a particular pos/feature combination. For notational convenience, we will usually refer to $F_{pos,feat}$ as $F$ in the remainder of the paper. The reader is, however, reminded that each $F$ is associated with a particular pos/feature combination. Let $maxlen = \max_i |w_i|$. We project each word in the set into a space of $maxlen$ dimensions, where each character index represents one dimension. This implies that for the longest word (or any word of length $maxlen$), each character fills one dimension. For shorter words, the projection will contain empty dimensions. Our idea is based on the fact that in many languages, the most common morphemes are either prefixes or suffixes. We are interested in comparing what most words in $F$ begin or end in, rather than emphasizing the root part, which tends to occur inside the word. Therefore, we simply assign an empty value ('-') to those dimensions of short words that are in the middle of the word: a word $w_i$ with $|w_i| < maxlen$ is split in the middle, and its characters are assigned to the dimensions of the space from both ends. In case $|w_i|$ is odd, we double the character at position $\lceil |w_i|/2 \rceil$, so that it can potentially be part of a suffix or a prefix. A sketch of this projection is given below.</Paragraph>
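The following sketch shows this projection, assuming (as an illustration, not taken from the paper) a handful of French verb forms as the extracted set.

def project(word, dims):
    """Map `word` onto `dims` character dimensions, '-' marking empty ones."""
    if len(word) % 2 == 1 and len(word) < dims:
        mid = len(word) // 2
        word = word[:mid + 1] + word[mid:]   # double the middle character
    half = len(word) // 2
    return list(word[:half] + '-' * (dims - len(word)) + word[half:])

words = ['parlent', 'donnent', 'vont']       # hypothetical training set F
dims = max(len(w) for w in words)            # maxlen: one dimension per index
space = [project(w, dims) for w in words]
# project('vont', 7) -> ['v', 'o', '-', '-', '-', 'n', 't']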
<Paragraph position="1"> For example, given a small set $F_{pos,feat}$ of extracted words, the corresponding space is represented as a table with one row per word and one character dimension per column, the empty middle dimensions of shorter words being filled with '-'.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Similarity measure </SectionTitle> <Paragraph position="0"> In order to determine what the common feature between most of the words in $F_{pos,feat}$ is, we define a similarity measure between any two words as represented in the space.</Paragraph> <Paragraph position="1"> We want our similarity measure to have certain properties. For instance, we want to 'reward' (consider as increasing similarity) two words having the same character in a dimension. By the same token, a different character should decrease similarity. Further, the empty character should not have any effect, even if both words have the empty character in the same dimension: rewarding a match on the empty character would simply consider short words similar, clearly not a desired effect.</Paragraph> <Paragraph position="2"> We therefore define our similarity measure as a measure related to the inner product of two vectors, $\vec{w_i} \cdot \vec{w_j} = \sum_{d=1}^{D} w_i[d]\, w_j[d]$, where $D$ is the number of dimensions. Note, however, two differences. First, the product $w_i[d]\, w_j[d]$ is replaced by a character match score $\delta(w_i[d], w_j[d])$, which is $1$ if the two characters are equal and non-empty, $-1$ if they differ and are both non-empty, and $0$ if either of them is the empty character. Second, we must normalize the measure by the number of dimensions. This will become important later in the process, when certain dimensions are ignored and we do not always compute the similarity over the same number of dimensions. The similarity measure then looks as follows: $$sim(w_i, w_j) = \frac{1}{D} \sum_{d=1}^{D} \delta(w_i[d], w_j[d]).$$ Note that when all dimensions are considered, $D$ will correspond to $maxlen$. The similarity measure is computed for each pair of words $w_i, w_j \in F$, and the negated average over all pairs can be regarded as a measure of the incoherence of the space: $$Incoh(F) = -\frac{2}{k(k-1)} \sum_{i=1}^{k} \sum_{j=i+1}^{k} sim(w_i, w_j).$$ Although it seems counterintuitive to define an incoherence measure as opposed to a coherence measure, calling the measure an incoherence measure fits with the intuition that low incoherence corresponds to a coherent space. A sketch of both measures is given below.</Paragraph>
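A minimal sketch of the two measures, following the definitions above (the negated-average form of the incoherence measure is a reconstruction, chosen to be consistent with "low incoherence corresponds to a coherent space"):

from itertools import combinations

def similarity(u, v, blocked=frozenset()):
    """Normalised character-match score of two projected words."""
    dims = [d for d in range(len(u)) if d not in blocked]
    total = 0
    for d in dims:
        if u[d] == '-' or v[d] == '-':
            continue                          # the empty value has no effect
        total += 1 if u[d] == v[d] else -1
    return total / len(dims)                  # normalise by dimensions used

def incoherence(space, blocked=frozenset()):
    """Negated average pairwise similarity: low values = coherent space."""
    pairs = list(combinations(space, 2))
    return -sum(similarity(u, v, blocked) for u, v in pairs) / len(pairs)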
</Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Run-time classification </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Perturbing and unifying dimensions </SectionTitle> <Paragraph position="0"> The next step in the algorithm is to determine what influence the various dimensions have on the coherence of the space. For each dimension, we determine its impact: does it increase or decrease the coherence of the space? To this end, we compute the incoherence of the space with one dimension blocked out at a time. We denote this new incoherence measure as before, but with an additional subscript to indicate which dimension was blocked out, i.e. disregarded in the computation of the incoherence. Thus, for each dimension $d$, $1 \le d \le maxlen$, we obtain a new measure $Incoh_{-d}(F)$. Two things should be noted. First, $Incoh_{-d}(F)$ measures the incoherence of the space without dimension $d$. Second, the normalization of the similarity metric becomes important now, if we want to be able to compare the incoherence measures.</Paragraph> <Paragraph position="1"> In essence, a dimension is unifying if disregarding it increases the incoherence of the space (the dimension was contributing to its coherence); it is perturbing if its deletion decreases the incoherence of the space. The impact of a dimension is measured as follows: $$Impact_d = Incoh_{-d}(F) - Incoh(F).$$ We then conjecture that those dimensions whose impact is positive (i.e. disregarding them results in an increased incoherence score) are somehow involved in marking the feature in question. These dimensions, together with their impact scores $Impact_d$, are retained in a set $$UnifDims = \{(d, Impact_d) \mid Impact_d > 0\},$$ which is used for classification as described in the following section. A sketch of this step is given below.</Paragraph>
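A sketch of this perturb-and-measure step, reusing `incoherence` from the previous sketch:

def dimension_impacts(space):
    """Impact of each dimension: change in incoherence when it is blocked out."""
    base = incoherence(space)
    return {d: incoherence(space, blocked=frozenset([d])) - base
            for d in range(len(space[0]))}

def unifying_dimensions(space):
    """Dimensions conjectured to mark the feature, with their impact scores."""
    return {d: imp for d, imp in dimension_impacts(space).items() if imp > 0}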
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Classification of French dictionary entries </SectionTitle> <Paragraph position="0"> From the start, we have aimed at tagging the target language dictionary with feature values. It is therefore not enough to determine which dimensions in the space carry information about a given feature; rather, we use this information to classify words from the target language dictionary.</Paragraph> <Paragraph position="1"> To this end, all those words in the target language dictionary that are of the pos in question are classified using the extracted information (the reader is reminded that the system learns a classifier for a particular pos/feature combination). For a given word $w_{test}$, we first project the word into the space defined by the training set. Note that it can happen that $|w_{test}| > maxlen$, i.e. that $w_{test}$ is longer than any word encountered during training. In this case, we delete enough characters from the middle of the word to fit it into the space defined by the training set. Again, this is guided by the intuition that morphemes are often marked at the beginning and/or the end of words. While the deletion of characters (and thus the elimination of information) is theoretically a suboptimal procedure, it has a negligible effect at run-time.</Paragraph> <Paragraph position="2"> After we project $w_{test}$ into the space, we compute the incoherence of the combined space defined by the set $F \cup \{w_{test}\}$ as follows, where the similarity is computed as above and $k$ again denotes the size of the set $F$: $$Incoh(F \cup \{w_{test}\}) = -\frac{1}{k} \sum_{i=1}^{k} sim(w_{test}, w_i).$$ In words, the test word $w_{test}$ is compared to each word in $F$. In the following, all dimensions $d \in UnifDims$ are blocked out in turn, and $Incoh_{-d}(F \cup \{w_{test}\})$ is computed, i.e. the incoherence of the set $F \cup \{w_{test}\}$ with one of the dimensions blocked out. As before, the impact of dimension $d$ is defined by $$Impact_d(F \cup \{w_{test}\}) = Incoh_{-d}(F \cup \{w_{test}\}) - Incoh(F \cup \{w_{test}\}).$$ Finally, the word $w_{test}$ is classified as 'true' (i.e. as exhibiting the feature) if blocking out the dimensions in $UnifDims$ shifts the incoherence of the combined space at least as much as it did when the incoherence measures were computed on the training set alone. Thus, the final decision rule is: $$\frac{1}{|UnifDims|} \sum_{d \in UnifDims} Impact_d(F \cup \{w_{test}\}) \;\ge\; \frac{1}{|UnifDims|} \sum_{d \in UnifDims} Impact_d(F).$$ In practice, this decision rule has the following effect: if, for instance, we wish to tag nouns with plural information, a word $w_{test}$ will be tagged as plural if classified as true, and as singular if classified as false. A sketch of the complete run-time procedure is given below.</Paragraph> </Section> </Section>
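A sketch of the run-time procedure, reusing `project`, `similarity`, and `unifying_dimensions` from the earlier sketches. The decision rule as reconstructed here compares the average impact of the retained dimensions on the combined space against their average impact on the training set alone:

def fit_into_space(word, dims):
    """Project a dictionary entry, truncating the middle if it is too long."""
    if len(word) > dims:
        half = dims // 2
        word = word[:half] + word[len(word) - (dims - half):]
    return project(word, dims)

def classify(test_word, space, unif_dims, train_threshold):
    """True iff `test_word` is judged to exhibit the feature in question.

    `unif_dims` are the dimensions from `unifying_dimensions(space)`, and
    `train_threshold` is the average of their impact scores on the training set.
    """
    w = fit_into_space(test_word, len(space[0]))

    def incoh(blocked=frozenset()):
        # Incoherence of the combined space: test word vs. each training word.
        return -sum(similarity(w, u, blocked) for u in space) / len(space)

    base = incoh()
    impact = sum(incoh(frozenset([d])) - base for d in unif_dims) / len(unif_dims)
    return impact >= train_threshold

# E.g. for noun number: tag PLURAL if classify(...) is True, else SINGULAR.
# ud = unifying_dimensions(space); train_threshold = sum(ud.values()) / len(ud)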
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experimental results </SectionTitle> <Paragraph position="0"> As with pos estimation, we tested the feature-tagging algorithms on parts of the Hansards, namely on 200,000 English-French sentence pairs. Accuracies were obtained from 2500 French dictionary entries that were not only hand-tagged with pos, but also with tense and number as appropriate. Table 2 summarizes the results (tagging nouns, adjectives, and verbs with plural vs. singular, and tagging verbs with past vs. non-past, based on two dictionaries that were tagged with pos automatically, one of which used the probabilities of the translation dictionary for pos estimation). As mentioned above, we tag nouns, adjectives, and verbs with PLURAL vs. SINGULAR values, and additionally verbs with PAST vs. NON-PAST information. In order to abstract away from pos-tagging errors, the algorithm is only evaluated on those words that were assigned the appropriate pos. In other words, if the test set contains a singular noun, it is looked up in the automatically produced target language dictionary. If this dictionary contains the word as an entry tagged as a noun, the number assignment to this noun is checked. If the classification algorithm assigned singular as the number feature, the algorithm classified the word successfully, otherwise not. This way, we can disregard pos-tagging errors.</Paragraph> <Paragraph position="1"> When estimating pos tags, we produced two separate target language dictionaries: one where the correspondence probabilities in the bilingual English-French dictionary were ignored, and one where they were used to weight the correspondences. Here, we report results for both of those dictionaries. Note that the only impact of a different dictionary (automatically tagged with pos tags) is that the test set is slightly different, given our evaluation method as described in the previous paragraph.</Paragraph> <Paragraph position="2"> The fact that evaluating on a different dictionary has no consistent impact on the results shows that the algorithm is robust across different test sets.</Paragraph> <Paragraph position="3"> The overall results are encouraging. It can be seen that the algorithm very successfully tags nouns and adjectives for plural versus singular. In contrast, tagging verbs is somewhat less reliable. This can be explained by the fact that French marks number on verbs differently in different tenses. In other words, the algorithm is faced with more inflectional paradigms, which are harder to learn because the data is fragmented into different patterns of plural marking.</Paragraph> <Paragraph position="4"> A similar argument explains the lower results for past versus non-past marking. French has several forms of the past, each with different inflectional paradigms. Further, different groups of verbs inflect for tense differently, fragmenting the data further.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Morpheme role assignment </SectionTitle> <Paragraph position="0"> While in this work we use the defined space merely for classification, our approach can also be used for assigning roles to morphemes. Various morpheme extraction algorithms can be applied to the data. However, the main advantage of our framework is that it presents the morphology algorithm of choice with a training set for particular linguistic features. This means that whatever morphemes are extracted, they can immediately be assigned their linguistic roles, such as number or tense. Role assignment generally receives little attention in morphology learning, or is excluded entirely. While mere morpheme extraction is useful and sufficient for certain tasks (such as root finding and stemming), for Machine Translation and other tasks involving deeper syntactic analysis it is not enough to find the morphemes unless they are also assigned roles. If, for instance, we are to translate a word for which there is no entry in the bilingual dictionary, but stripping off the plural morpheme yields a corresponding (singular) word in the other language, we can ensure that the target language word is turned into the plural by adding the appropriate plural morpheme.</Paragraph> <Paragraph position="1"> In this section, we present one possible algorithm for extracting morphemes in our framework. Other, more sophisticated, unsupervised morphology algorithms, such as (Goldsmith, 1995), are available and can be applied here. Staying within our framework ensures the additional benefit of immediate role assignment.</Paragraph> <Paragraph position="2"> Another strength of our approach is that we make no assumption about the contiguity of morphemes. Related work on morphology generally makes this assumption (e.g. (Goldsmith, 1995)), with the notable exception of (Schone and Jurafsky, 2001). While in the current experiments the possibility of non-contiguity is not reflected in the results, it can become important when applying the algorithm to other languages, such as German.</Paragraph> <Paragraph position="3"> We begin by conjecturing that most morphemes will be no longer than four characters, and we learn only patterns up to that length. Our algorithm starts by extracting all patterns of up to four characters in the training data, restricting its attention to the dimensions in $UnifDims$. If $UnifDims$ contains more than 4 dimensions, the algorithm works only with those 4 dimensions that had the greatest impact. All 1-, 2-, 3-, and 4-character combinations that occur in the training data are listed together with how often they occur. The reasoning behind this is that the patterns that occur most frequently in the training data are likely those 'responsible' for marking the given feature.</Paragraph> <Paragraph position="4"> However, it is not straightforward to determine automatically how long a morpheme is. For instance, consider the English morpheme '-ing' (the gerund morpheme). The algorithm will extract the patterns 'i  ', ' n ', '  g', 'in ', 'i g', ' ng', and 'ing'. If we based the morpheme extraction merely on the frequency of the patterns, the algorithm would surely extract one of the single-letter patterns, since they are guaranteed to occur at least as many times as the longer patterns; more likely, they will occur more frequently. In order to overcome this difficulty, we apply a subsumption filter: if a shorter pattern is subsumed by a longer one, we no longer consider it.</Paragraph> <Paragraph position="5"> Subsumption is defined as follows: suppose pattern $p_i$ appears with frequency $freq_{p_i}$, whereas pattern $p_j$ appears with frequency $freq_{p_j}$, and $p_i$ is shorter than $p_j$. Then $p_j$ subsumes $p_i$ if $p_i$ is a subpattern of $p_j$ (i.e. $p_j$ agrees with $p_i$ on all of $p_i$'s specified dimensions) and $freq_{p_i} = freq_{p_j}$, i.e. the shorter pattern never occurs outside the longer one. A sketch of the pattern extraction and the subsumption filter is given below.</Paragraph>
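The following sketch implements the pattern extraction and the subsumption filter. A pattern is represented as a tuple of (dimension, character) pairs, so non-contiguous patterns are handled naturally; the frequency-equality test encodes the reading of subsumption given above (the shorter pattern never occurs outside the longer one), which is a reconstruction rather than the paper's exact definition:

from collections import Counter
from itertools import combinations

def extract_patterns(space, unif_dims, max_dims=4):
    """Count all 1- to 4-character patterns over the highest-impact dimensions."""
    top = sorted(sorted(unif_dims, key=unif_dims.get, reverse=True)[:max_dims])
    counts = Counter()
    for w in space:
        filled = [d for d in top if w[d] != '-']      # skip empty values
        for r in range(1, len(filled) + 1):
            for dims in combinations(filled, r):
                counts[tuple((d, w[d]) for d in dims)] += 1
    return counts

def subsumption_filter(counts):
    """Drop any pattern subsumed by a longer pattern, until a fixpoint."""
    kept = dict(counts)
    changed = True
    while changed:
        changed = False
        for p in list(kept):
            if any(set(p) < set(q) and kept[p] == kept[q] for q in kept):
                del kept[p]
                changed = True
    return kept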
<Paragraph position="6"> The algorithm repeatedly checks for subsumption until no more subsumptions are found, at which point the remaining patterns are sorted by frequency. It then outputs the most frequent patterns. The cutoff value (i.e. how far down the list to go) is a tunable parameter. In our experiments, we set this parameter to a probability of 0.05; note that we convert the frequencies to probabilities by dividing each count by the sum of all patterns' frequencies.</Paragraph> <Paragraph position="7"> The patterns are listed simply as arrays of 4 characters (or fewer, if $UnifDims$ contains fewer elements). It should be noted that the characters are listed in the order of the dimensions. This, however, does not mean that the patterns have to be contiguous. For instance, if dimension 1 has a unifying effect, and so do dimensions 14, 15, and 16, the patterns are still listed as 4-character combinations in increasing order of the dimensions.</Paragraph> <Paragraph position="8"> For illustration purposes, Table 3 lists several patterns that were extracted for past tense marking on verbs.2 All highly-ranked extracted patterns contained only letters in the last two dimensions, so only those two dimensions are shown in the table.</Paragraph> <Paragraph position="9"> Further investigation and the development of a more sophisticated algorithm for extracting patterns should enable us to collapse some of the patterns into one. For instance, the patterns 'ée' and 'és' should be considered special cases of 'é '. Note further that the algorithm extracted the pattern ' s', which is caused by the fact that many verbs were marked for plural in the passé composé in French. In order to overcome this difficulty, a more complex morphology algorithm should combine findings from different pos/feature combinations. This has been left for future investigation.</Paragraph> <Paragraph position="10"> 2Note that no morphemes for the imparfait were extracted. This is an artifact of the training data, which contains very few instances of the imparfait.</Paragraph> <Paragraph position="11"> (Table 3: patterns extracted for past tense marking on verbs; only the last two dimensions are shown. No extracted pattern involved any of the other dimensions.)</Paragraph> </Section> </Paper>