<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0606">
  <Title>Computing Word Similarity and Identifying Cognates with Pair Hidden Markov Models</Title>
  <Section position="4" start_page="40" end_page="40" type="metho">
    <SectionTitle>
2 Word Similarity
</SectionTitle>
    <Paragraph position="0"> Word similarity is, at its core, an alignment task. In order to determine similarity between two words, we look at the various alignments that can exist between them. Each component of the alignment is assigned a probability-based score by our trained model. The scores are then combined to produce the overall similarity score for any word pair, which can be used to rank the word pairs against each other. Alternatively, a discrete cut-off point can be selected in order to separate pairs that show the required similarity from the ones that do not.</Paragraph>
    <Paragraph position="1"> Before we can align words, they must be separated into symbols. Typically, the symbols are characters in the orthographic representation, and phonemes in the phonetic representation. We also need to put some restrictions on the possible alignments between these symbols. By adopting the following two assumptions, we are able to fully exploit the simplicity and efficiency of the Pair Hidden Markov Model.</Paragraph>
    <Paragraph position="2"> First, we assume that the basic ordering of symbols remains the same between languages. This does not mean that every symbol has a corresponding one in the other language, but instead that word transformation comes from three basic operations: substitution, insertion and deletion. Exceptions to this rule certainly exist (e.g. metathesis), but are sufficiently infrequent to make the benefits of this constraint far outweigh the costs.</Paragraph>
    <Paragraph position="3"> Second, we assume that each symbol is aligned to at most one symbol in the other word. This assumption is aimed at reducing the number of parameters that have to be learned from limited-size training data. If there is a many-to-one correspondence that is consistent between languages, it would be beneficial to change the word representation so that the many symbols are considered as a single symbol instead. For example, a group of characters in the orthographic representation may correspond to a single phoneme if the word is written phonetically.</Paragraph>
  </Section>
  <Section position="5" start_page="40" end_page="41" type="metho">
    <SectionTitle>
3 Pair Hidden Markov Models
</SectionTitle>
    <Paragraph position="0"> Hidden Markov Models have been applied successfully to a number of problems in natural language processing, including speech recognition (Jelinek, 1999) and statistical machine translation (Och and Ney, 2000). One of the more intangible aspects of a Hidden Markov Model is the choice of the model itself. While algorithms exist to train the parameters of the model so that the model better describes its data, there is no formulaic way to create the model.</Paragraph>
    <Paragraph position="1"> We decided to adopt as a starting point a model developed in a different field of study.</Paragraph>
    <Paragraph position="2"> Durbin et al. (1998) created a new type of Hidden Markov Model that has been used for the task of aligning biological sequences (Figure 1). Called a Pair Hidden Markov Model, it uses two output streams in parallel, each corresponding to a sequence that is being aligned.1 The alignment model has three states that represent the basic edit operations: substitution (represented by state &amp;quot;M&amp;quot;), insertion (&amp;quot;Y&amp;quot;), and deletion (&amp;quot;X&amp;quot;). &amp;quot;M&amp;quot;, the match state, emits an aligned pair of symbols (not necessarily identical) with one symbol on the top and the other on the bottom output stream. &amp;quot;X&amp;quot; and &amp;quot;Y&amp;quot;, the gap states, output a symbol on only one stream against a gap on the other. Each state has its own emission probabilities representing the likelihood of producing a pairwise alignment of the type described by the state. The model has three transition parameters: d, e, and t. In order to reduce the number of parameters, there is no explicit start state. Rather, the probability of starting in a given state is equal to  natural language processing once before: Clark (2001) applied PHMMs to the task of learning stochastic finite-state transducers for modeling morphological paradigms.</Paragraph>
    <Paragraph position="3">  the probability of going from the match state to the given state.</Paragraph>
    <Paragraph position="4"> Durbin et al. (1998) describe several different algorithms that can be used to score and rank paired biological sequences. Two of them are based on common HMM algorithms. The Viterbi algorithm uses the most probable path through the model to score the pair. The forward algorithm computes the total overall probability for a pair by summing up the probabilities of every possible alignment between the words. A third algorithm (the log odds algorithm) was designed to take into account how likely the pair would be to occur randomly within the two languages by considering a separately trained random model (Figure 2) in conjunction with the similarity model. In the random model, the sequences are assumed to have no relationship to each other, so there is no match state. The log odds algorithm calculates a score for a pair of symbols by dividing the probability of a genuine correspondence between a pair of symbols (the similarity model) by the probability of them co-occurring by chance (the random model). These individual scores are combined to produce an overall score for the pair of sequences in the same way as individual symbol probabilities are combined in other algorithms.</Paragraph>
  </Section>
  <Section position="6" start_page="41" end_page="43" type="metho">
    <SectionTitle>
4 PHMMs for Word Similarity
</SectionTitle>
    <Paragraph position="0"> Because of the differences between biological sequence analysis and computing word similarity, the bioinformatics model has to be adapted to handle the latter task. In this section, we propose a number of modifications to the original model and the corre- null sponding algorithms. The modified model is shown in Figure 3.</Paragraph>
    <Paragraph position="1"> First, the original model's assumption that an insertion followed by a deletion is the same as a substitution is problematic in the context of word similarity. Covington (1998) illustrates the problem with an example of Italian &amp;quot;due&amp;quot; and the Spanish &amp;quot;dos&amp;quot;, both of which mean &amp;quot;two&amp;quot;. While there is no doubt that the first two pairs of symbols should be aligned, there is no historical connection between the Italian &amp;quot;e&amp;quot; and the Spanish &amp;quot;s&amp;quot;. In this case, a sequence of an insertion and a deletion is more appropriate than a substitution. In order to remedy this problem, we decided to a add a pair of transitions between states &amp;quot;X&amp;quot; and &amp;quot;Y&amp;quot;, which is denoted by l in Figure 3. The second modification involves splitting the parameter t into two separate values: tM for the match state, and tXY for the gap states. The original biological model keeps the probability for the transition to the end state constant for all other states. For cognates, and other word similarity tasks, it may be that similar words are more or less likely to end in gaps or matches. The modification preserves the symmetry of the model while allowing it to capture how likely a given operation is to occur at the end of an alignment.</Paragraph>
    <Section position="1" start_page="41" end_page="42" type="sub_section">
      <SectionTitle>
4.1 Algorithm Variations
</SectionTitle>
      <Paragraph position="0"> We have investigated several algorithms for the alignment and scoring of word pairs. Apart from the standard Viterbi (abbreviated VIT) and forward (FOR) algorithms, we considered two variations of the log odds algorithm, The original log odds algorithm (LOG) functions much like a Viterbi algo- null of states. We also created another variation, forward log odds (FLO), which uses a forward approach instead, considering the aggregate probability of all possible paths through both models.</Paragraph>
    </Section>
    <Section position="2" start_page="42" end_page="42" type="sub_section">
      <SectionTitle>
4.2 Model Variations
</SectionTitle>
      <Paragraph position="0"> Apart from comparing the effectiveness of different algorithms, we are also interested in establishing the optimal structure of the underlying model. The similarity model can be broken up into three sets of parameters: the match probabilities, the gap probabilities, and the transition probabilities. Our goal is to examine the relative contribution of various components of the model, and to find out whether simplifying the model affects the overall performance of the system. Since the match probabilities constitute the core of the model, we focus on the remaining emission and transition probabilities. We also investigate the necessity of including an explicit end state in the model.</Paragraph>
      <Paragraph position="1"> The first variation concerns the issue of gap emission probabilities. For the log odds algorithm, Durbin et al. (1998) allow the gap emission probabilities of both the similarity and random models to be equal. While this greatly simplifies the calculations and allows for the emphasis to be on matched symbols, it might be more in spirit with the word similarity task to keep the emissions of the two models separate. If we adopt such an approach, the similarity model learns the gap emission probabilities using the forward-backward algorithm, just as is done with the match probabilities, but the random model uses letter frequencies from the training data instead.</Paragraph>
      <Paragraph position="2"> A similar test of the effectiveness of trained gap parameters can be performed for the Viterbi and forward algorithms by proceeding in the opposite direction. Instead of deriving the gap probabilities from the training data (as in the original model), we can set them to uniform values after training, thus making the final scores depend primarily on matches.</Paragraph>
      <Paragraph position="3"> The second variation removes the effect the transition parameters have on the final calculation. In the resulting model, a transition probability from any state to any state (except the end state) is constant, effectively merging &amp;quot;X&amp;quot;, &amp;quot;Y&amp;quot;, and &amp;quot;M&amp;quot; into a single state. One of the purposes of the separated states was to allow for affine gap penalties, which is why there are different transition parameters for going to a gap state and for staying in that state. By making the transitions constant, we are also taking away the affine gap structure. As a third variant, we try both the first and second variation combined.</Paragraph>
      <Paragraph position="4"> The next variation concerns the effect of the end state on the final score. Unlike in the alignment of biological sequences, word alignment boundaries are known beforehand, so an end state is not strictly necessary. It is simple enough to remove the end state from our model after the training has been completed. The remaining transition probability mass is shifted to the transitions that lead to the match state.</Paragraph>
      <Paragraph position="5"> Once the end state is removed, it is possible to reduce the number of transition parameters to a single one, by taking advantage of the symmetry between the insertion and deletion states. In the resulting model, the probability of entering a gap state is equal to 1[?]x2 , where x is the probability of a transition to the match state. Naturally, the log odds algorithms also have a separate parameter for the random model.</Paragraph>
    </Section>
    <Section position="3" start_page="42" end_page="43" type="sub_section">
      <SectionTitle>
4.3 Correcting for Length
</SectionTitle>
      <Paragraph position="0"> Another problem that needs to be addressed is the bias introduced by the length of the words. The principal objective of the bioinformatics model is the optimal alignment of two sequences. In our case, the alignment is a means to computing word similarity. In fact, some of the algorithms (e.g. the forward algorithm) do not yield an explicit best alignment. While the log odds algorithms have a built-in length correction, the Viterbi and the forward do not.</Paragraph>
      <Paragraph position="1">  These algorithms continually multiply probabilities together every time they process a symbol (or a symbol pair), which means that the overall probability of an alignment strongly depends on word lengths. In order to rectify this problem, we multiply the final probability by 1Cn , where n is the length of the longer word in the pair, and C is a constant. The value of C can be established on a held-out data set.2</Paragraph>
    </Section>
    <Section position="4" start_page="43" end_page="43" type="sub_section">
      <SectionTitle>
4.4 Levenshtein with Learned Weights
</SectionTitle>
      <Paragraph position="0"> Mann and Yarowsky (2001) investigated the induction of translation lexicons via bridge languages.</Paragraph>
      <Paragraph position="1"> Their approach starts with a dictionary between two well studied languages (e.g. English-Spanish). They then use cognate pairs to induce a bridge between two strongly related languages (e.g. Spanish and Italian), and from this create a smaller translation dictionary between the remaining two languages (e.g. English and Italian). They compared the performances of several different cognate similarity (or distance) measures, including one based on the Levenshtein distance, one based on the stochastic transducers of Ristad and Yianilos (1998), and a variation of a Hidden Markov Model. Somewhat surprisingly, the Hidden Markov Model falls well short of the baseline Levenshtein distance.3 Mann and Yarowsky (2001) developed yet another model, which outperformed all other similarity measures. In the approach, which they call &amp;quot;Levenshtein with learned weights&amp;quot;, the probabilities of their stochastic transducer are transformed into substitution weights for computing Levenshtein distance: 0.5 for highly similar symbols, 0.75 for weakly similar symbols, etc. We have endeavored to emulate this approach (abbreviated LLW) by converting the log odds substitution scores calculated from the fully trained model into the substitution  tinctly different design than our PHMM model. For example, the emission probabilities corresponding to the atomic edit operations sum to one for each alphabet symbol. In our model, the emission probabilities for different symbols are interdependent.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="43" end_page="44" type="metho">
    <SectionTitle>
5 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> We evaluated our word similarity system on the task of the identification of cognates. The input consists of pairs of words that have the same meaning in distinct languages. For each pair, the system produces a score representing the likelihood that the words are cognate. Ideally, the scores for true cognate pairs should always be higher than scores assigned to unrelated pairs. For binary classification, a specific score threshold could be applied, but we defer the decision on the precision-recall trade-off to downstream applications. Instead, we order the candidate pairs by their scores, and evaluate the ranking using 11-point interpolated average precision (Manning and Schutze, 2001).</Paragraph>
    <Paragraph position="1"> Word similarity is not always a perfect indicator of cognation because it can also result from lexical borrowing and random chance. It is also possible that two words are cognates and yet exhibit little surface similarity. Therefore, the upper bound for average precision is likely to be substantially lower than 100%.</Paragraph>
    <Section position="1" start_page="43" end_page="44" type="sub_section">
      <SectionTitle>
5.1 Data
</SectionTitle>
      <Paragraph position="0"> Training data for our cognate recognition model comes from the Comparative Indoeuropean Data Corpus (Dyen et al., 1992). The data contains word lists of 200 basic meanings representing 95 speech varieties from the Indoeuropean family of languages. Each word is represented in an orthographic form without diacritics using the 26 letters of the Roman alphabet. All cognate pairs are also identified in the data.</Paragraph>
      <Paragraph position="1"> The development set4 consisted of two language pairs: Italian and Serbo-Croatian, as well as Polish and Russian. We chose these two language pairs because they represent very different levels of relatedness: 25.3% and 73.5% of the word pairs are cognates, respectively. The percentage of cognates within the data is important, as it provides a simple baseline from which to compare the success of our algorithms. If our cognate identification process 4Several parameters used in our experiments were determined during the development of the word similarity model. These include the random model's parameter e, the constant transition probabilities in the simplified model, and the constant C for correcting the length bias in the Viterbi and forward algorithms. See (Mackay, 2004) for complete details.</Paragraph>
      <Paragraph position="2">  were random, we would expect to get roughly these percentages for our recognition precision (on average). null The test set consisted of five 200-word lists representing English, German, French, Latin, and Albanian, compiled by Kessler (2001). The lists for these languages were removed from the training data (except Latin, which was not part of the training set), in order to keep the testing and training data as separate as possible.5 We converted the test data to have the same orthographic representation as the training data.</Paragraph>
    </Section>
    <Section position="2" start_page="44" end_page="44" type="sub_section">
      <SectionTitle>
5.2 Significance tests
</SectionTitle>
      <Paragraph position="0"> We performed pairwise statistical significance tests for various model and algorithm combinations. Following the method proposed by Evert (2004), we applied Fisher's exact test to counts of word pairs that are accepted by only one of the two tested algorithms. For a given language pair, the cutoff level was set equal to the actual number of cognate pairs in the list. For example, since 118 out of 200 word pairs in the English/German list are cognate, we considered the true and false positives among the set of 118 top scoring pairs. For the overall average of a number of different language pairs, we took the union of the individual sets. For the results in Tables 1 and 2, the pooled set contained 567 out of 2000 pairs, which corresponds to the proportion of cognates in the entire test data (28.35%).</Paragraph>
    </Section>
  </Section>
</Paper>