<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1103">
  <Title>Multilingual Comparable Corpora</Title>
  <Section position="4" start_page="818" end_page="818" type="metho">
    <SectionTitle>
2 Previous work
</SectionTitle>
    <Paragraph position="0"> There has been other work on automatically discovering NEs with minimal supervision. Both (Cucerzan and Yarowsky, 1999) and (Collins and Singer, 1999) present algorithms to obtain NEs from untagged corpora. However, they focus on the classification stage of already segmented entities, and make use of contextual and morphological clues that require knowledge of the language beyond the level we want to assume with respect to the target language.</Paragraph>
    <Paragraph position="1"> The use of similarity of time distributions for information extraction, in general, and NE extraction, in particular, is not new. (Hetland, 2004) surveys recent methods for scoring time sequences for similarity. (Shinyama and Sekine, 2004) used the idea to discover NEs, but in a single language, English, across two news sources.</Paragraph>
    <Paragraph position="2"> A large amount of previous work exists on transliteration models. Most are generative and consider the task of producing an appropriate transliteration for a given word, and thus require considerable knowledge of the languages. For example, (AbdulJaleel and Larkey, 2003; Jung et al., 2000) train English-Arabic and English-Korean generative transliteration models, respectively. (Knight and Graehl, 1997) build a generative model for backward transliteration from Japanese to English.</Paragraph>
    <Paragraph position="3"> While generative models are often robust, they tend to make independence assumptions that do not hold in data. The discriminative learning framework argued for in (Roth, 1998; Roth, 1999) as an alternative to generative models is now used widely in NLP, even in the context of word alignment (Taskar et al., 2005; Moore, 2005). We make use of it here too, to learn a discriminative transliteration model that requires little knowledge of the target language.</Paragraph>
    <Paragraph position="4"> We extend our preliminary work in (Klementiev and Roth, 2006) to discover multi-word Named Entities and to take advantage of a dictionary (if one exists) to handle NEs which are partially or entirely translated. We take advantage of a dynamically growing feature space to reduce the number of supervised training examples.</Paragraph>
  </Section>
  <Section position="5" start_page="818" end_page="819" type="metho">
    <SectionTitle>
3 Co-Ranking: An Algorithm for NE Discovery
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="818" end_page="819" type="sub_section">
      <SectionTitle>
3.1 The algorithm
</SectionTitle>
      <Paragraph position="0"> In essence, the algorithm we present uses temporal alignment as a supervision signal to iteratively train a transliteration model. On each iteration, it selects a list of top ranked transliteration candidates for each NE according to the current model (line 6). It then uses temporal alignment (with thresholding) to re-rank the list and select the best transliteration candidate for the next round of training (lines 8 and 9).</Paragraph>
      <Paragraph position="1"> Once training is complete, lines 4 through 10 are executed without thresholding for each constituent NE word. If a dictionary is available, the transliteration candidate lists collected on line 6 are augmented with translations. We then combine the best candidates (as chosen on line 8, without thresholding) into a complete target language NE.</Paragraph>
      <Paragraph position="2"> Finally, we discard transliterations which do not actually appear in the target corpus.</Paragraph>
      <Paragraph position="3"> Input: Bilingual, comparable corpus (S, T), set of named entities NE_S from S, threshold theta. Output: Transliteration model M.</Paragraph>
      <Paragraph position="5"> Use M to collect a list of candidates with high transliteration scores;</Paragraph>
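      <Paragraph position="6"> Only the input/output header of the pseudocode survived extraction; as an informal sketch (not the authors' code), the co-ranking loop described above might look like the following, where score and time_sim are hypothetical stand-ins for the transliteration model M and the temporal similarity function:

```python
# Sketch of the co-ranking loop of Section 3.1 (hypothetical helper names).
def co_rank(source_nes, target_words, score, time_sim, theta, top_k=30, iters=5):
    """Iteratively grow a positive training set using temporal alignment."""
    positives = []
    for _ in range(iters):
        positives = []
        for ne in source_nes:
            # line 6: top-k candidates under the current transliteration model
            cands = sorted(target_words, key=lambda w: score(ne, w),
                           reverse=True)[:top_k]
            # lines 8-9: re-rank by time-sequence similarity; keep if above theta
            best = max(cands, key=lambda w: time_sim(ne, w))
            if time_sim(ne, best) >= theta:
                positives.append((ne, best))
        # re-train the transliteration model on positives here (omitted in sketch)
    return positives
```

The thresholding step is what keeps noisy candidates out of the next training round.</Paragraph>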
    </Section>
    <Section position="2" start_page="819" end_page="819" type="sub_section">
      <SectionTitle>
3.2 Time sequence generation and matching
</SectionTitle>
      <Paragraph position="0"> In order to generate a time sequence for a word, we divide the corpus into a sequence of temporal bins and count the number of occurrences of the word in each bin. We then normalize the sequence.</Paragraph>
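      <Paragraph position="4"> The binning step can be sketched as follows; the choice of L2 normalization is an assumption (the paper only says the sequence is normalized), made here so the sequences fit the Euclidean scoring used below:

```python
def time_sequence(dated_tokens, word, bins):
    """Count occurrences of `word` per temporal bin, then L2-normalize.

    dated_tokens: iterable of (bin_index, token) pairs; bins is the
    number of temporal bins the corpus is divided into.
    """
    counts = [0.0] * bins
    for b, tok in dated_tokens:
        if tok == word:
            counts[b] += 1.0
    norm = sum(c * c for c in counts) ** 0.5
    return [c / norm for c in counts] if norm else counts
```
</Paragraph>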
      <Paragraph position="1"> We use a method called the F-index (Hetland, 2004) to implement the similarity function on line 8 of the algorithm. We first run a Discrete Fourier Transform on a time sequence to extract its Fourier expansion coefficients. The score of a pair of time sequences is then computed as the Euclidean distance between their expansion coefficient vectors. As we mentioned in the introduction, an NE may map to more than one transliteration in another language. Identifying the entity's equivalence class of transliterations is important for obtaining its accurate time sequence.</Paragraph>
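      <Paragraph position="5"> A minimal pure-Python rendering of this scoring scheme might look like the following; the number of coefficients kept (k) is an assumption, and lower distance means more similar sequences:

```python
import cmath

def dft_coeffs(seq, k):
    """First k complex DFT expansion coefficients of a real sequence."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(k)]

def f_index_distance(a, b, k=3):
    """Euclidean distance between truncated Fourier coefficient vectors."""
    ca, cb = dft_coeffs(a, k), dft_coeffs(b, k)
    return sum(abs(x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5
```
</Paragraph>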
      <Paragraph position="2"> In order to keep to our objective of requiring as little language knowledge as possible, we took a rather simplistic approach for both languages of our corpus. For Russian, two words were considered variants of the same NE if they share a prefix of length five or longer. On the English side of the corpus, each unique word formed its own equivalence class, although, in principle, ideas such as those in (Li et al., 2004) could be incorporated.</Paragraph>
      <Paragraph position="3"> A cumulative distribution was then collected for such equivalence classes.</Paragraph>
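      <Paragraph position="6"> The prefix rule for Russian can be sketched as grouping words by their leading characters; how words shorter than the prefix length are handled is an assumption not stated in the text:

```python
from collections import defaultdict

def equivalence_classes(words, prefix_len=5):
    """Group words that share a prefix of length >= prefix_len.

    Words shorter than prefix_len form singleton classes keyed by the
    whole word (an assumption made for this sketch).
    """
    classes = defaultdict(list)
    for w in words:
        key = w[:prefix_len] if len(w) >= prefix_len else w
        classes[key].append(w)
    return list(classes.values())
```
</Paragraph>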
    </Section>
    <Section position="3" start_page="819" end_page="819" type="sub_section">
      <SectionTitle>
3.3 Transliteration model
</SectionTitle>
      <Paragraph position="0"> Unlike most of the previous work, which considers generative transliteration models, we take the discriminative approach. We train a linear model to decide whether a word w_t in the target language is a transliteration of an NE w_s in the source language. The words in the pair are partitioned into sets of substrings S_s and S_t up to a particular length (including the empty string). Couplings of substrings (s_s, s_t) from the two sets produce the features we use for training. Note that couplings with the empty string represent insertions/omissions.</Paragraph>
      <Paragraph position="1"> Consider the following example: (w_s, w_t) = (powell, pauel). We build a feature vector from this example in the following manner: First, we split both words into all possible substrings of up to size two: {p, o, w, e, l, po, ow, we, el, ll} and {p, a, u, e, l, pa, au, ue, el}.</Paragraph>
      <Paragraph position="3"> We then build a feature vector out of couplings of substrings from the two sets: ((p, p), (p, a), ..., (el, ue), ...). We use the observation that transliteration tends to preserve phonetic sequence to limit the number of couplings. For example, we can disallow the coupling of substrings whose starting positions are too far apart: thus, we might not consider a pairing such as (ll, p) in the above example. In our experiments, we paired substrings only if their starting positions in their respective words differed by -1, 0, or 1.</Paragraph>
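      <Paragraph position="5"> The substring coupling scheme with the position constraint can be sketched as follows (insertion/omission couplings with the empty string are left out, as in the experiments of Section 4):

```python
def couple_features(ws, wt, max_len=2, max_offset=1):
    """All substring couplings (s, t) whose starting positions in their
    respective words differ by at most max_offset, following the
    phonetic-sequence heuristic."""
    def subs(w):
        # (start position, substring) pairs for all substrings up to max_len
        return [(i, w[i:i + n]) for n in range(1, max_len + 1)
                for i in range(len(w) - n + 1)]
    return [(s, t) for i, s in subs(ws) for j, t in subs(wt)
            if abs(i - j) <= max_offset]
```
</Paragraph>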
      <Paragraph position="4"> We use the perceptron (Rosenblatt, 1958) algorithm to train the model. The model activation provides the score we use to select the best transliterations on line 6. Our version of the perceptron takes a variable number of features in its examples; each example is the subset of all features seen so far that are active in the input. As the iterative algorithm observes more data, it discovers and makes use of more features. This model is called the infinite attribute model (Blum, 1992), and it follows the perceptron version of SNoW (Roth, 1998).</Paragraph>
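      <Paragraph position="6"> A minimal sketch of a perceptron over a dynamically growing feature space, in the spirit of the infinite attribute model, might look like this (a sketch, not the SNoW implementation):

```python
class SparsePerceptron:
    """Perceptron whose weight vector exists only for features seen so far:
    unseen features have an implicit weight of zero."""

    def __init__(self):
        self.w = {}  # feature -> weight

    def score(self, feats):
        return sum(self.w.get(f, 0.0) for f in feats)

    def update(self, feats, label):
        # label in {+1, -1}; standard mistake-driven perceptron update
        pred = 1 if self.score(feats) > 0 else -1
        if pred != label:
            for f in feats:
                self.w[f] = self.w.get(f, 0.0) + label
```
</Paragraph>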
      <Paragraph position="5"> Positive examples used for iterative training are pairs of NEs and their best temporally aligned (thresholded) transliteration candidates. Negative examples are English non-NEs paired with random Russian words.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="819" end_page="822" type="metho">
    <SectionTitle>
4 Experimental Study
</SectionTitle>
    <Paragraph position="0"> We ran experiments using a bilingual comparable English-Russian news corpus we built by crawling a Russian news web site (www.lenta.ru).</Paragraph>
    <Paragraph position="1"> The site provides loose translations of (and pointers to) the original English texts. We collected pairs of articles spanning from 1/1/2001 through 10/05/2005. The corpus consists of 2,327 documents, with 0-8 documents per day.</Paragraph>
    <Paragraph position="2"> The corpus is available on our web page at http://L2R.cs.uiuc.edu/~cogcomp/.</Paragraph>
    <Paragraph position="3"> The English side was tagged with a publicly available NER system based on the SNoW learning architecture (Roth, 1998), that is available on the same site. This set of English NEs was hand-pruned to remove incorrectly classified words to obtain 978 single word NEs.</Paragraph>
    <Paragraph position="4"> In order to reduce running time, some limited pre-processing was done on the Russian side. All classes whose temporal distributions were close to uniform (i.e. words with a similar likelihood of occurrence throughout the corpus) were removed from the pool of candidates. Unique words were thus grouped into 14,781 equivalence classes.</Paragraph>
    <Paragraph position="5"> Unless mentioned otherwise, the transliteration model was initialized with a set of 20 pairs of English NEs and their Russian transliterations. Negative examples here and during the rest of the training were pairs of randomly selected non-NE English and Russian words.</Paragraph>
    <Paragraph position="6"> New features were discovered throughout training; all but the top 3,000 features from positive and 3,000 from negative examples were pruned based on the number of their occurrences so far. Features remaining at the end of training were used for NE discovery.</Paragraph>
    <Paragraph position="7"> Insertions/omissions features were not used in the experiments as they provided no tangible benefit for the languages of our corpus.</Paragraph>
    <Paragraph position="8"> In each iteration, we used the current transliteration model to find a list of the 30 best transliteration equivalence classes for each NE. We then computed the time sequence similarity score between the NE and each class from its list to find the one with the best matching time sequence. If its similarity score surpassed a set threshold, it was added to the list of positive examples for the next round of training. Positive examples were constructed by pairing an NE with the common stem of its transliteration equivalence class. We used the same number of positive and negative examples.</Paragraph>
    <Paragraph position="9"> [Figure 4 caption fragment: ... pairs vs. the initial example set size. As long as the size is large enough, decreasing the number of examples does not have a significant impact on the performance of the later iterations.]</Paragraph>
    <Paragraph position="10"> We used the Mueller English-Russian dictionary to obtain translations in our multi-word NE experiments. We only considered the first dictionary definition as a candidate.</Paragraph>
    <Paragraph position="11"> For evaluation, a random 727 of the total 978 NEs were matched to correct transliterations by a language expert (partly due to the fact that some of the English NEs were not mentioned in the Russian side of the corpus). Accuracy was computed as the percentage of NEs correctly identified by the algorithm.</Paragraph>
    <Paragraph position="12"> In the multi-word NE experiment, 282 random multi-word (2 or more) NEs and their transliterations/translations discovered by the algorithm were verified by a language expert.</Paragraph>
    <Section position="1" start_page="819" end_page="821" type="sub_section">
      <SectionTitle>
4.1 NE discovery
</SectionTitle>
      <Paragraph position="0"> Figure 3 shows the proportion of correctly discovered NE transliteration equivalence classes throughout the training stage. The figure also shows the accuracy if transliterations are selected according to the current transliteration model (top scoring candidate) and temporal sequence matching alone.</Paragraph>
      <Paragraph position="1"> The transliteration model alone achieves an accuracy of about 38%, while the time sequence alone gets about 41%. The combined algorithm achieves about 63%, a significant improvement. [Table 1 caption fragment: ... vs. corpus misalignment (Δ) for each of the three measures. The DFT-based measure provides significant advantages over commonly used metrics for weakly aligned corpora.]</Paragraph>
      <Paragraph position="2"> [Table 2 caption fragment: ... vs. sliding window size (W) for each of the three measures.]</Paragraph>
      <Paragraph position="3"> In order to understand what happens to the transliteration model as the training proceeds, let us consider the following example. Figure 5 shows parts of the transliteration lists for the NE forsyth for two iterations of the algorithm. The weak transliteration model selects the correct transliteration (italicized) as only the 24th best transliteration in the first iteration. The time sequence scoring function chooses it to be one of the training examples for the next round of training of the model. By the eighth iteration, the model has improved to select it as the best transliteration.</Paragraph>
      <Paragraph position="4"> Not all correct transliterations make it to the top of the candidate list (the transliteration model by itself is never as accurate as the complete algorithm in Figure 3). That is not required, however, as the model only needs to be good enough to place the correct transliteration anywhere in the candidate list.</Paragraph>
      <Paragraph position="5"> Not surprisingly, some of the top transliteration candidates start sounding like the NE itself as training progresses. In Figure 5, candidates for forsyth on iteration 7 include fross and fossett. Once the transliteration model was trained, we ran the algorithm to discover multi-word NEs, augmenting candidate sets of dictionary words with their translations as described in Section 3.1. We achieved an accuracy of about 66%. The correctly discovered Russian NEs included entirely transliterated, partially translated, and entirely translated NEs. Some of them are shown in Figure 6.</Paragraph>
    </Section>
    <Section position="2" start_page="821" end_page="821" type="sub_section">
      <SectionTitle>
4.2 Initial example set size
</SectionTitle>
      <Paragraph position="0"> We ran a series of experiments to see how the size of the initial training set affects the accuracy of the model as training progresses (Figure 4). Although the performance of the early iterations is significantly affected by the size of the initial training example set, the algorithm quickly improves its performance. As we decrease the size from 80 to 20, the accuracy of the first iteration drops by over 20%, but a few iterations later the two have similar performance. However, when initialized with a set of size 5, the algorithm never manages to improve.</Paragraph>
      <Paragraph position="1"> The intuition is the following. The few examples in the initial training set produce features corresponding to substring pairs characteristic of English-Russian transliterations. A model trained on these (few) examples chooses other transliterations containing these same substring pairs. In turn, the chosen positive examples contain other characteristic substring pairs, which will be used by the model to select more positive examples on the next round, and so on. On the other hand, if the initial set is too small, too few of the characteristic transliteration features are extracted to select a clean enough training set on the next round of training.</Paragraph>
      <Paragraph position="2"> In general, one would expect the size of the training set necessary for the algorithm to improve to depend on the level of temporal alignment of the two sides of the corpus. Indeed, the weaker the temporal supervision the more we need to endow the model so that it can select cleaner candidates in the early iterations.</Paragraph>
    </Section>
    <Section position="3" start_page="821" end_page="822" type="sub_section">
      <SectionTitle>
4.3 Comparison of time sequence scoring functions
</SectionTitle>
      <Paragraph position="0"> We compared the performance of the DFT-based time sequence similarity scoring function we use in this paper to the commonly used cosine (Salton and McGill, 1986) and Pearson's correlation measures. We perturbed the Russian side of the corpus in the following way. Articles from each day were randomly moved (with uniform probability) within a Δ-day window. We ran single word NE temporal sequence matching alone on the perturbed corpora using each of the three measures (Table 1).</Paragraph>
      <Paragraph position="1"> Some of the accuracy drop due to misalignment can be accommodated by using a larger temporal bin for collecting occurrence counts. We tried various (sliding) window sizes W for a perturbed corpus with fixed Δ (Table 2). [Figure 5 caption fragment: ... improves, the correct transliteration moves up the list.]</Paragraph>
      <Paragraph position="2"> The DFT metric outperforms the other measures significantly in most cases. NEs tend to have distributions with a few pronounced peaks. If two such distributions are not well aligned, we expect both the Pearson and cosine measures to produce low scores, whereas the DFT metric should catch their similarities in the frequency domain.</Paragraph>
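      <Paragraph position="3"> As a concrete illustration of this point (a sketch, not from the paper): for two identical but misaligned one-peak sequences, cosine similarity is zero and Pearson's correlation is negative, even though the peak shapes match.

```python
def cosine(a, b):
    """Cosine similarity between two sequences."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return num / den if den else 0.0

def pearson(a, b):
    """Pearson's correlation coefficient between two sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    den = (sum(x * x for x in da) * sum(y * y for y in db)) ** 0.5
    return sum(x * y for x, y in zip(da, db)) / den if den else 0.0

# Two identical peaks, shifted by one bin: shape is the same, alignment is not.
peak_a = [0.0, 1.0, 0.0, 0.0]
peak_b = [0.0, 0.0, 1.0, 0.0]
```
</Paragraph>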
    </Section>
  </Section>
  <Section position="7" start_page="822" end_page="822" type="metho">
    <SectionTitle>
5 Conclusions
</SectionTitle>
    <Paragraph position="0"> We have proposed a novel algorithm for cross-lingual multi-word NE discovery in a bilingual weakly temporally aligned corpus. We have demonstrated that using two independent sources of information (transliteration and temporal similarity) together to guide NE extraction gives better performance than using either of them alone (see Figure 3).</Paragraph>
    <Paragraph position="1"> We developed a linear discriminative transliteration model, and presented a method to automatically generate features. For time sequence matching, we used a scoring metric novel in this domain. We provided experimental evidence that this metric outperforms other scoring metrics traditionally used.</Paragraph>
    <Paragraph position="2"> In keeping with our objective to require as little language knowledge as possible, we introduced a simplistic approach to identifying transliteration equivalence classes, which sometimes produced erroneous groupings (e.g. the equivalence class for the NE congolese in Russian included both congo and congolese in Figure 6). We expect that more language-specific knowledge used to discover accurate equivalence classes would result in performance improvements.</Paragraph>
    <Paragraph position="3"> The other type of supervision was in the form of a very small bootstrapping transliteration set. [Figure 6 caption fragment: ...covered by the algorithm.]</Paragraph>
  </Section>
  <Section position="8" start_page="822" end_page="822" type="metho">
    <SectionTitle>
6 Future Work
</SectionTitle>
    <Paragraph position="0"> The algorithm can be naturally extended to comparable corpora of more than two languages.</Paragraph>
    <Paragraph position="1"> Pair-wise time sequence scoring and transliteration models should give better confidence in NE matches.</Paragraph>
    <Paragraph position="2"> The ultimate goal of this work is to automatically tag NEs so that they can be used for training of an NER system for a new language. To this end, we would like to compare the performance of an NER system trained on a corpus tagged using this approach to one trained on a hand-tagged corpus.</Paragraph>
  </Section>
</Paper>