<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1011">
  <Title>Named Entity Transliteration and Discovery from Multilingual Comparable Corpora</Title>
  <Section position="4" start_page="83" end_page="84" type="metho">
    <SectionTitle>
3 Co-ranking: An Algorithm for NE Discovery
</SectionTitle>
    <Paragraph position="0"> In essence, the algorithm we present uses temporal alignment as a supervision signal to iteratively train a discriminative transliteration model, which can be viewed as a distance metric between an English NE and a potential transliteration. On each iteration, it selects a set of transliteration candidates for each NE according to the current model (line 6). It then uses temporal alignment (with thresholding) to select the best transliteration candidate for the next round of training (lines 8 and 9).</Paragraph>
    <Paragraph position="1"> Once training is complete, lines 4 through 10 are executed without thresholding for each NE in the source corpus to discover its counterpart in the target corpus.</Paragraph>
    <Section position="1" start_page="83" end_page="84" type="sub_section">
      <SectionTitle>
3.1 Time Sequence Generation and Matching
</SectionTitle>
      <Paragraph position="0"> In order to generate a time sequence for a word, we divide the corpus into a sequence of temporal bins and count the number of occurrences of the word in each bin. We then normalize the sequence.</Paragraph>
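As a concrete sketch of this step (the function name, input layout, and binning scheme are illustrative, not from the paper):

```python
def time_sequence(word, dated_docs, num_bins):
    """Normalized temporal histogram for `word`.

    `dated_docs` is a list of (day_index, tokens) pairs; the corpus span
    is split into `num_bins` equal temporal bins.
    """
    max_day = max(day for day, _ in dated_docs) + 1
    counts = [0.0] * num_bins
    for day, tokens in dated_docs:
        b = min(day * num_bins // max_day, num_bins - 1)
        counts[b] += sum(1 for t in tokens if t == word)
    total = sum(counts)
    # Normalize so the sequence sums to 1; back off to uniform for unseen words.
    return [c / total for c in counts] if total else [1.0 / num_bins] * num_bins
```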
      <Paragraph position="1"> We use a method called the F-index (Hetland, 2004) to implement the similarity function on line 8 of the algorithm. We first run a Discrete Fourier Transform on a time sequence to extract its Fourier expansion coefficients. The score of a pair of time sequences is then computed as the Euclidean distance between their expansion coefficient vectors.</Paragraph>
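A minimal sketch of this scoring idea follows; the exact F-index construction of Hetland (2004) may differ in details such as the number of coefficients kept, which is a free parameter here.

```python
import cmath
import math

def f_index_score(seq_a, seq_b, k=4):
    """Similarity of two equal-length time sequences: take the first k
    Fourier expansion coefficients of each and return the negated
    Euclidean distance between the coefficient vectors, so that a
    higher score means more similar sequences."""
    def dft_coeffs(seq):
        n = len(seq)
        return [sum(x * cmath.exp(-2j * math.pi * f * t / n)
                    for t, x in enumerate(seq))
                for f in range(k)]
    ca, cb = dft_coeffs(seq_a), dft_coeffs(seq_b)
    return -math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(ca, cb)))
```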
      <Paragraph position="3"> [Algorithm input: a bilingual comparable corpus, a set of named entities from its source side, and a threshold.] As we mentioned earlier, an NE in one language may map to multiple morphological variants and transliterations in another. Identifying the entity's equivalence class of transliterations is important for obtaining its accurate time sequence. In keeping with our objective of requiring as little language knowledge as possible, we took a rather simplistic approach to the morphological ambiguities of NEs in Russian: two words were considered variants of the same NE if they share a prefix of size five or longer. At this point, our algorithm takes a simplistic approach on the English side of the corpus as well - each unique word has its own equivalence class - although, in principle, we could incorporate work such as (Li et al., 2004) into the algorithm. A cumulative distribution was then collected for such equivalence classes.</Paragraph>
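The prefix heuristic can be sketched as follows (Latin-script stand-ins are used for the Russian word forms; grouping by the five-character prefix is equivalent to "share a prefix of size five or longer"):

```python
from collections import defaultdict

def equivalence_classes(words, prefix_len=5):
    """Group words into NE-variant classes: two words fall into the same
    class iff they share a prefix of length `prefix_len` (five in the
    paper's Russian heuristic).  Words shorter than the prefix length
    form singleton classes."""
    classes = defaultdict(set)
    for w in words:
        key = w[:prefix_len] if len(w) >= prefix_len else w
        classes[key].add(w)
    return dict(classes)
```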
    </Section>
    <Section position="2" start_page="84" end_page="84" type="sub_section">
      <SectionTitle>
3.2 Transliteration Model
</SectionTitle>
      <Paragraph position="0"> Unlike most previous work on transliteration, which uses generative transliteration models, we take a discriminative approach. We train a linear model to decide whether a target-language word is a transliteration of a source-language NE. The words in the pair are partitioned into sets of substrings up to a particular length (including the empty string). Couplings of substrings from the two sets produce the features we use for training; note that couplings with the empty string represent insertions/omissions. Consider the following example: the pair (powell, pauel). We build a feature vector from this example in the following manner. First, we split both words into all possible sub-strings of up to size two:</Paragraph>
      <Paragraph position="2"> {p, o, w, e, l, po, ow, we, el, ll} for powell and {p, a, u, e, l, pa, au, ue, el} for pauel. Second, we build features by coupling substrings from the two sets, e.g. (p, p), (p, pa), ..., (el, el), ..., (ll, el). We use the observation that transliteration tends to preserve phonetic sequence to limit the number of couplings. For example, we can disallow the coupling of substrings whose starting positions are too far apart: thus, we might not consider a pairing such as (p, el) in the above example. In our experiments, we paired substrings if their positions in their respective words differed by -1, 0, or 1.</Paragraph>
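The feature generation just described can be sketched as follows; as in the experiments of Section 4, this version omits the empty-string (insertion/omission) couplings, and the function name is illustrative:

```python
def coupled_features(ws, wt, max_len=2, max_offset=1):
    """Coupled-substring features for a pair (ws = English NE,
    wt = candidate transliteration): pair substrings of length
    1..max_len whose start positions differ by at most max_offset."""
    def substrings(w):
        # All (start_position, substring) pairs up to length max_len.
        return [(i, w[i:i + n])
                for n in range(1, max_len + 1)
                for i in range(len(w) - n + 1)]
    return {(a, b)
            for i, a in substrings(ws)
            for j, b in substrings(wt)
            if abs(i - j) <= max_offset}
```

For (powell, pauel) this keeps positionally close pairs such as (ll, el) while discarding distant ones such as (po, el).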
      <Paragraph position="3"> We use the perceptron (Rosenblatt, 1958) algorithm to train the model. The model activation provides the score we use to select the best transliterations on line 6. Our version of the perceptron takes examples with a variable number of features; each example is the set of all features seen so far that are active in the input. As the iterative algorithm observes more data, it discovers and makes use of more features. This model is called the infinite attribute model (Blum, 1992), and it follows the perceptron version in SNoW (Roth, 1998).</Paragraph>
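A minimal sketch of such a learner, assuming a sparse weight map so that weights exist only for features observed so far (this is a toy stand-in, not the SNoW implementation):

```python
from collections import defaultdict

class SparsePerceptron:
    """Mistake-driven perceptron over a growing feature space, in the
    spirit of the infinite attribute model."""
    def __init__(self, lr=1.0):
        self.w = defaultdict(float)  # unseen features implicitly weigh 0
        self.lr = lr

    def score(self, feats):
        # Model activation: sum of weights of active features.
        return sum(self.w[f] for f in feats)

    def update(self, feats, label):
        """label is +1 (transliteration pair) or -1 (random pair)."""
        pred = 1 if self.score(feats) > 0 else -1
        if pred != label:  # update weights only on mistakes
            for f in feats:
                self.w[f] += self.lr * label
```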
      <Paragraph position="4"> Positive examples used for iterative training are pairs of NEs and their best temporally aligned (thresholded) transliteration candidates. Negative examples are English non-NEs paired with random Russian words.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="84" end_page="86" type="metho">
    <SectionTitle>
4 Experimental Study
</SectionTitle>
    <Paragraph position="0"> We ran experiments using a bilingual comparable English-Russian news corpus we built by crawling a Russian news web site (www.lenta.ru).</Paragraph>
    <Paragraph position="1"> The site provides loose translations of (and pointers to) the original English texts. We collected pairs of articles spanning from 1/1/2001 through 12/24/2004. The corpus consists of 2,022 documents with 0-8 documents per day.</Paragraph>
    <Paragraph position="2"> The corpus is available on our web page at http://L2R.cs.uiuc.edu/~cogcomp/.</Paragraph>
    <Paragraph position="3"> The English side was tagged with a publicly available NER system based on the SNoW learning architecture (Roth, 1998), which is available at the same site. This set of English NEs was hand-pruned to remove incorrectly classified words, yielding 978 single-word NEs.</Paragraph>
    <Paragraph position="4"> In order to reduce running time, some limited preprocessing was done on the Russian side. All classes whose temporal distributions were close to uniform (i.e., words with a similar likelihood of occurrence throughout the corpus) were deemed common and were not considered as NE candidates.</Paragraph>
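The paper does not specify the exact uniformity test; one plausible stand-in is an L1 distance to the uniform distribution, sketched below (the tolerance value is an arbitrary illustration):

```python
def near_uniform(seq, tol=0.1):
    """Flag a normalized time sequence as near-uniform (hence a likely
    common word rather than an NE candidate): true when the L1 distance
    to the uniform distribution is below `tol`."""
    u = 1.0 / len(seq)
    return sum(abs(x - u) for x in seq) < tol
```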
    <Paragraph position="5"> Unique words were grouped into 15,594 equivalence classes, and 1,605 of those classes were discarded using this method.</Paragraph>
    <Paragraph position="6"> Insertion/omission features were not used in the experiments, as they provided no tangible benefit for the languages of our corpus.</Paragraph>
    <Paragraph position="7"> Unless mentioned otherwise, the transliteration model was initialized with a subset of 254 pairs of NEs and their transliteration equivalence classes. Negative examples here and during the rest of the training were pairs of randomly selected non-NE English and Russian words.</Paragraph>
    <Paragraph position="8"> [Figure 3: accuracy of matched NE pairs vs. iteration; the complete algorithm outperforms both the transliteration model and temporal sequence matching used on their own.]</Paragraph>
    <Paragraph position="9"> In each iteration, we used the current transliteration model to find a list of the 30 best transliteration equivalence classes for each NE. We then computed the time sequence similarity score between the NE and each class from its list to find the one with the best matching time sequence. If its similarity score surpassed a set threshold, it was added to the list of positive examples for the next round of training. Positive examples were constructed by pairing each English NE with each of the transliterations from the best equivalence class that surpassed the threshold. We used the same number of positive and negative examples. For evaluation, a random sample of 727 of the 978 NE pairs matched by the algorithm was selected and checked by a language expert. Accuracy was computed as the percentage of those NEs correctly discovered by the algorithm.</Paragraph>
    <Section position="1" start_page="85" end_page="86" type="sub_section">
      <SectionTitle>
4.1 NE Discovery
</SectionTitle>
      <Paragraph position="0"> Figure 3 shows the proportion of correctly discovered NE transliteration equivalence classes throughout the run of the algorithm. The figure also shows the accuracy when transliterations are selected according to the current transliteration model alone (top scoring candidate) and according to sequence matching alone. [Figure 5: accuracy of matched NE pairs for various initial example set sizes; decreasing the size does not significantly affect performance on later iterations.]</Paragraph>
      <Paragraph position="1"> The transliteration model alone achieves an accuracy of about 47%, while the time sequence alone gets about 41%. The combined algorithm achieves about 66%, a significant improvement.</Paragraph>
      <Paragraph position="2"> In order to understand what happens to the transliteration model as the algorithm proceeds, consider the following example. Figure 4 shows parts of the transliteration lists for the NE forsyth for two iterations of the algorithm. The weak transliteration model selects the correct transliteration (italicized) as only the 24th best transliteration in the first iteration. The time sequence scoring function chooses it as one of the training examples for the next round of training of the model. By the eighth iteration, the model has improved enough to select it as the best transliteration.</Paragraph>
      <Paragraph position="3"> Not all correct transliterations make it to the top of the candidate list (the transliteration model by itself is never as accurate as the complete algorithm in Figure 3). That is not required, however: the model only needs to be good enough to place the correct transliteration somewhere in the candidate list.</Paragraph>
      <Paragraph position="4"> Not surprisingly, some of the top transliteration candidates start sounding like the NE itself as training progresses. In Figure 4, candidates for forsyth on iteration 7 include fross and fossett.</Paragraph>
    </Section>
    <Section position="2" start_page="86" end_page="86" type="sub_section">
      <SectionTitle>
4.2 Rate of Improvement vs. Initial Example
Set Size
</SectionTitle>
      <Paragraph position="0"> We ran a series of experiments to see how the size of the initial training set affects the accuracy of the model as training progresses (Figure 5). Although performance in the early iterations is significantly affected by the size of the initial training example set, the algorithm quickly improves. As we decrease the size from 254 to 127 to 85 examples, the accuracy of the first iteration drops by roughly 10% each time. However, starting at the 6th iteration, the three are within 3% of one another.</Paragraph>
      <Paragraph position="1"> These numbers suggest that we only need a few initial positive examples to bootstrap the transliteration model. The intuition is the following: the few examples in the initial training set produce features corresponding to substring pairs characteristic of English-Russian transliterations. A model trained on these few examples chooses other transliterations containing the same substring pairs. In turn, the chosen positive examples contain other characteristic substring pairs, which the model uses to select more positive examples on the next round, and so on.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="86" end_page="87" type="metho">
    <SectionTitle>
5 Conclusions
</SectionTitle>
    <Paragraph position="0"> We have proposed a novel algorithm for cross-lingual NE discovery in a bilingual, weakly temporally aligned corpus. We have demonstrated that using two independent sources of information (transliteration and temporal similarity) together to guide NE extraction gives better performance than using either of them alone (see Figure 3).</Paragraph>
    <Paragraph position="1"> [Figure 6: example transliteration equivalence classes discovered by the algorithm.]</Paragraph>
    <Paragraph position="2"> We developed a linear discriminative transliteration model, and presented a method to automatically generate features. For time sequence matching, we used a scoring metric novel in this domain. As supported by our own experiments, this method outperforms other scoring metrics traditionally used (such as cosine (Salton and McGill, 1986)) when corpora are not well temporally aligned.</Paragraph>
    <Paragraph position="3"> In keeping with our objective to use as little language knowledge as possible, we introduced a simplistic approach to identifying transliteration equivalence classes, which sometimes produced erroneous groupings (e.g. an equivalence class for the NE lincoln in Russian included both lincoln and lincolnshire in Figure 6). This approach is specific to Russian morphology and would have to be altered for other languages. For example, for Arabic, a small set of prefixes can be used to group most NE variants. We expect that language-specific knowledge used to discover accurate equivalence classes would result in performance improvements.</Paragraph>
  </Section>
  <Section position="7" start_page="87" end_page="87" type="metho">
    <SectionTitle>
6 Future Work
</SectionTitle>
    <Paragraph position="0"> In this work, we only consider single word Named Entities. A subject of future work is to extend the algorithm to the multi-word setting. Many of the multi-word NEs are translated as well as transliterated. For example, Mount in Mount Rainier will probably be translated, and Rainier - transliterated.</Paragraph>
    <Paragraph position="1"> If a dictionary exists for the two languages, it can be consulted first, and, if a match is found, the transliteration model can be bypassed.</Paragraph>
    <Paragraph position="2"> The algorithm can be naturally extended to comparable corpora of more than two languages. Pair-wise time sequence scoring and transliteration models should give better confidence in NE matches.</Paragraph>
    <Paragraph position="3"> It seems plausible that phonetic features (if available) would help in learning our transliteration model. We would like to verify whether this is indeed the case.</Paragraph>
    <Paragraph position="4"> The ultimate goal of this work is to automatically tag NEs so that they can be used for training of an NER system for a new language. To this end, we would like to compare the performance of an NER system trained on a corpus tagged using this approach to one trained on a hand-tagged corpus.</Paragraph>
  </Section>
</Paper>