<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2026">
  <Title>Desparately Seeking Cebuano</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Obtaining Language Resources
</SectionTitle>
    <Paragraph position="0"> Our basic approach to development of an agile system for interactive CLIR relies on three strategies: (1) create an infrastructure in advance for English as a query language that makes only minimal assumptions about the document language; (2) leverage the asymmetry inherent in the problem by assembling strong resources for English in advance; and (3) develop a robust suite of capabilities to exploit any language resources that can be found for the &amp;quot;surprise language.&amp;quot; We defer the first two topics to the next section, and focus here on the third. We know of five possible sources of translation expertise: People. People who know the language are an excellent source of insight, and universities are an excellent place to find such people. We were able to locate a speaker of Cebuano within 50 feet of one of our offices, and to schedule an interview with a second Cebuano speaker within 36 hours of the announcement of the language.</Paragraph>
    <Paragraph position="1"> Scholarly literature. Major research universities are also an excellent place to find written materials describing a broad array of languages. Within 12 hours of the announcement, reference librarians at the University of Maryland had identified a textbook on &amp;quot;Beginning Cebuano,&amp;quot; and we had located a copy at the University of Southern California. Together with the excellent electronic resources located by the LDC, this allowed us to develop a rudimentary stemmer within 36 hours.</Paragraph>
    <Paragraph position="2"> Translation lexicons. Simple bilingual term lists are available for many language pairs. Using links provided by the LDC and our own Web searches, we were able to construct an English-Cebuano term list with over 14,000 translation pairs within 12 hours of the announcement. This largely duplicated a simultaneous effort at the LDC, and we later merged our term list with theirs.</Paragraph>
    <Paragraph position="3"> Parallel text. Translation-equivalent documents, when aligned at the word level, provide an excellent source of information about not just possible translations but also their relative predominance. Within 24 hours of the announcement, we had aligned Cebuano and English versions of the Holy Bible at the word level using Giza++. An evaluation by a native Cebuano speaker of a stratified random sample of 88 translation pairs showed remarkably high precision. On a 4-point scale with 1=correct and 4=incorrect, the 100 most frequent words averaged 1.3, the next 400 most frequent terms averaged 1.6, and the 500 terms after that averaged 1.7. The Bible's vocabulary covers only about half of the words found in typical English news text (counted by token), so it is useful to have additional sources of parallel text. For this reason, we have extended our previously developed STRAND system to locate likely translations in the Internet Archive.</Paragraph>
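Once the aligner has produced word-level alignment links, turning them into a ranked bilingual term list is essentially a counting exercise. A minimal sketch of that step (the alignment pairs shown are illustrative toy data, not output from the actual Bible alignment):

```python
from collections import Counter, defaultdict

def build_term_list(aligned_pairs):
    """Count word-level alignment links (src, tgt) and return, for each
    source term, its candidate translations ranked by link frequency."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    return {src: [t for t, _ in c.most_common()] for src, c in counts.items()}

# Toy alignment links (Cebuano -> English), for illustration only.
pairs = [("dios", "god"), ("dios", "god"), ("dios", "lord"), ("balay", "house")]
lexicon = build_term_list(pairs)
```

Frequency-ranking the candidates is what lets the term list capture relative predominance rather than just translation possibility.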
    <Paragraph position="4"> Those runs were not yet complete when this paper was submitted.</Paragraph>
    <Paragraph position="5"> Printed Dictionaries. People learning a new language make extensive use of bilingual dictionaries, so we have developed a system that mimics that process to some extent. Within 12 hours of the announcement we had zoned the page images of a commercially available Cebuano-English dictionary in Adobe Portable Document Format (PDF) to identify each dictionary entry, performed optical character recognition, and parsed the entries to construct a bilingual term list. We were aided in this process by the fact that Cebuano is written in a Roman script.</Paragraph>
    <Paragraph position="6"> Again, we achieved good precision, with a sampled word error rate for OCR of 6.9% and a precision of 87% for a random sample of translation pairs. Part-of-speech tags were also extracted, although they are not used in our process.</Paragraph>
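Parsing OCR'd dictionary entries into translation pairs amounts to matching each entry against an expected layout. A minimal sketch, assuming a hypothetical entry format of headword, abbreviated part-of-speech tag, and comma-separated glosses (the real dictionary's layout and the entry shown are not from the source):

```python
import re

# Hypothetical entry layout: "headword pos. gloss1, gloss2, ..."
ENTRY = re.compile(r"^(?P<head>\w+)\s+(?P<pos>[a-z]+)\.\s+(?P<glosses>.+)$")

def parse_entry(line):
    """Parse one dictionary entry line into (headword, POS, gloss list),
    or None if the line does not match the expected layout."""
    m = ENTRY.match(line.strip())
    if not m:
        return None
    glosses = [g.strip() for g in m.group("glosses").split(",")]
    return m.group("head"), m.group("pos"), glosses

entry = parse_entry("balay n. house, home")
```

Lines that fail to match can be routed to a manual-review queue, which is one way a pipeline like this tolerates residual OCR errors.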
    <Paragraph position="7"> As this description illustrates, these five sources provide complementary information. Since there is some uncertainty at the outset about how long it will be before each delivers useful results, we chose a strategy based on concurrency, balancing our investment across each of the five sources. This allowed us to use whatever resources became available first to get an initial system running, with refinements subsequently being made as additional resources became available. Because Cebuano and English are written in the same script, we did not need character set conversion or phonetic cognate matching in this case. The CLIR system described in the next section was therefore constructed using only English resources that were (or could have been) pre-assembled, plus a Cebuano-English bilingual term list, a rule-based stemmer, and the Cebuano Bible.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3 Building a Cross-Language Retrieval
System
</SectionTitle>
      <Paragraph position="0"> Ideally, we would like to build a system that would find whatever documents the searcher would wish to read in a fully automatic mode. In practice, fully automatic search systems are imperfect even in monolingual applications.</Paragraph>
      <Paragraph position="1"> We therefore have developed an interactive approach that functions something like a typical Web search engine: (1) the searcher poses their query in English, (2) the system ranks the Cebuano documents in decreasing order of likely relevance to the query, (3) the searcher examines a list of document titles in something approximating English, and (4) the searcher may optionally examine the full text of any document in something approximating English. The intent is to support an iterative process in which searchers learn to better express their query through experience. We are only able to provide very rough translations, so we expect that such a system would be used in an environment where searchers could send documents that appear promising off for professional translation when necessary.</Paragraph>
      <Paragraph position="2"> At the core of our system is the capability to automatically rank Cebuano documents based on an English query. We chose a query translation architecture using backoff translation and Pirkola's structured query method, implemented using Inquery version 3.1p1. The key idea in backoff translation is first to try to find consecutive sequences of query words on the English side of the bilingual term list; where that fails, to try to find the surface form of each remaining English term; to fall back to stem matching when necessary; and ultimately to retain the English term unchanged in the hope that it might be a proper name or some other form of cognate with Cebuano. Accents are stripped from the documents and all language resources to facilitate matching at that final step.</Paragraph>
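The backoff chain can be sketched as follows. This is a simplified illustration (the multiword-sequence lookup step is omitted, and the term list, stemmed index, and toy stemmer are invented for the example), not the authors' implementation:

```python
def backoff_translate(tokens, term_list, stemmed_index, stem):
    """Simplified backoff translation for one query:
    1. surface-form lookup in the bilingual term list;
    2. fall back to matching on the English stem;
    3. keep the term unchanged, hoping it is a name or cognate."""
    translations = []
    for tok in tokens:
        if tok in term_list:                 # step 1: surface form
            translations.append(term_list[tok])
        elif stem(tok) in stemmed_index:     # step 2: stem match
            translations.append(stemmed_index[stem(tok)])
        else:                                # step 3: pass through unchanged
            translations.append([tok])
    return translations

# Toy resources, for illustration only.
term_list = {"god": ["dios"]}
stemmed_index = {"hous": ["balay"]}
stem = lambda w: w[:-1] if w.endswith("e") else w   # toy "stemmer"
result = backoff_translate(["god", "house", "manila"], term_list, stemmed_index, stem)
```

Each query term yields a set of candidate translations, which is exactly the shape Pirkola's structured query method consumes: all candidates for one English term are treated as a single synonym set at retrieval time.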
      <Paragraph position="3"> Although we have chosen techniques that are relatively robust and therefore require relatively little domain-specific tuning, stemmer design is an area of uncertainty that could adversely affect retrieval effectiveness. We therefore needed a test collection on which we could try out variants of the Cebuano stemmer. We built this test collection using 34,000 Cebuano Bible verses and 50 English questions that we found on the Web for which appropriate Bible verses were known. Each question was posed as a query using the batch mode of Inquery, and the rank of the known relevant verse was taken as a measure of effectiveness. We took the mean reciprocal rank (the inverse of the harmonic mean) as a figure of merit for each configuration, and used a paired two-tailed t test (with p &lt; 0.05) to assess the statistical significance of observed differences. Our initial configuration, without stemming, obtained a mean inverse rank of 0.14, which is a statistically significant improvement over no translation at all (mean inverse rank 0.02 from felicitous cognate and loan word matches). The addition of Cebuano stemming resulted in a reduction in mean inverse rank to 0.09. Although the reduction is not statistically significant in that case, the result suggests that our initial stemmer is not yet useful for information retrieval tasks.</Paragraph>
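The figure of merit used above can be computed directly from the per-query ranks. A generic sketch (not the evaluation scripts used in the work; the rank values shown are illustrative):

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over queries, where rank is the 1-based position
    of the known relevant verse in the ranked list, or None if the
    relevant verse was not retrieved at all (contributing 0)."""
    reciprocal = [1.0 / r if r else 0.0 for r in ranks]
    return sum(reciprocal) / len(reciprocal)

# Illustrative ranks for four queries; the last query missed entirely.
mrr = mean_reciprocal_rank([1, 2, 10, None])
```

Because each query contributes one reciprocal-rank value, a paired significance test between two configurations simply pairs these per-query values.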
      <Paragraph position="4"> The other key capability that is needed is title and document translation. We can accomplish this in one of two ways. The simplest approach is to reverse the bilingual term list, and to reverse the role of Cebuano and English in the process described above for query translation. Our user interface is capable of displaying multiple translations for a single term (arranged horizontally for compact depiction or vertically for clearer depiction), but searchers can choose to display only the single most likely translation. When reliable translation probability statistics (from parallel text) are not available, we use the relative word unigram frequency of each translation of a Cebuano term in a representative English collection as a substitute for that probability. A more sophisticated way is to build a statistical machine translation system using parallel text. We built our first statistical machine translation system within 40 hours of the announcement, and one sentence of the resulting translation using each technique is shown below: Cebuano: 'ang rebeldeng milf, kinsa lakip sa nangamatay, nagdala og backpack nga dunay explosives nga niguba sa waiting lounge sa airport, matod sa mga defense official.' Term-by-term translation: '(carelessness, circumference, conveyence) rebeldeng milf, who lakip (at in of) nangamatay, nagdala og backpack nga valid explosives nga niguba (at, in of) waiting lounge (at, in, of) airport, matod (at, in, of) mga defense official' Statistical translation: 'who was accused of rank, ehud og niguba waiting lounge defense of those dumah milf rebeldeng explosives backpack airport matod official.' At this point, term-by-term translation is clearly the better choice. But as more parallel text becomes available, we expect the situation to reverse. 
The LDC is preparing a set of human reference translations that will allow us to detect that changeover point automatically using the NIST variant of the BLEU measure for machine translation effectiveness.</Paragraph>
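The single-best-translation display mode described above reduces to an argmax over the reversed term list. A minimal sketch of the unigram-frequency substitute for translation probability (the reversed term list and frequency counts here are toy data, not the actual resources):

```python
from collections import Counter

def best_translation(term, reversed_list, eng_unigrams):
    """Pick one English gloss for a Cebuano term, using English unigram
    frequency in a representative collection as a stand-in for missing
    translation probabilities; unknown terms pass through unchanged."""
    candidates = reversed_list.get(term)
    if not candidates:
        return term  # untranslated passthrough, as in query translation
    return max(candidates, key=lambda w: eng_unigrams[w])

# Toy resources, for illustration only.
eng_unigrams = Counter({"house": 120, "home": 300})
reversed_list = {"balay": ["house", "home"]}
choice = best_translation("balay", reversed_list, eng_unigrams)
```

When parallel text later supplies genuine translation probabilities, only the scoring function inside the argmax needs to change.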
    </Section>
  </Section>
</Paper>