<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1071">
  <Title>Towards Automatic Sign Translation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. PROBLEM DESCRIPTION
</SectionTitle>
    <Paragraph position="0"> A sign can be a displayed structure bearing letters or symbols, used to identify or advertise a place of business. It can also be a posted notice bearing a designation, direction, or command. Figure 1 and Figure 2 illustrate two examples of signs. Figure 1 shows a Russian sign completely embedded in the background. Figure 2 is a sign that contains German text with no verb and article. In this research, we are interested in translating signs that have direct influence upon a tourist from a different country or culture. These signs, at least, include the following categories:  * Names: street, building, company, etc.</Paragraph>
    <Paragraph position="1"> * Information: designation, direction, safety advisory, warning, notice, etc.</Paragraph>
    <Paragraph position="2"> * Commercial: announcement, advertisement, etc.</Paragraph>
    <Paragraph position="3"> * Traffic: warning, limitation, etc.</Paragraph>
    <Paragraph position="4"> * Conventional symbol: especially those are  confusable to a foreign tourist, e.g., some symbols are not international.</Paragraph>
    <Paragraph position="5"> Fully automatic extraction of signs from the environment is a challenging problem because signs are usually embedded in the environment. The related work includes video OCR and automatic text detection. Video OCR is used to capture text in the video images and recognize the text. Many video images contain text contents. Such text can come from computer-generated text that is overlaid on the imagery (e.g., captions in broadcast news programs) or text that appears as a part of the video scene itself (e.g., a sign outside a place of business, or a post). Location and recognition of text in video imagery is challenging due to low resolution of characters and complexity of background. Research in video OCR has mainly focused on locating the text in the image and preprocessing the text area for OCR [4][6][7][9][10]. Applications of the research include automatically identifying the contents of video imagery for video index [7][9], and capturing documents from paper source during reading and writing [10]. Compared to other video OCR tasks, sign extraction takes place in a more dynamic environment. The user's movement can cause unstable input images. Non-professional equipment can make the video input poorer than that of other video OCR tasks, such as detecting captions in broadcast news programs. In addition, sign extraction has to be implemented in real time using limited resources.</Paragraph>
    <Paragraph position="6">  Sign translation requires sign recognition. A straightforward idea is to use advanced OCR technology. Although OCR technology works well in many applications, it requires some improvements before it can be applied to sign recognition. At current stage of the research, we will focus our research on sign detection and translation while taking advantage of state-of-the-art OCR technologies.</Paragraph>
    <Paragraph position="7"> Sign translation has some special problems compared to a traditional language translation task. The function of signs lead to the characteristic of the text used in the sign: it has to be short and concise. The lexical mismatch and structural mismatch problems become more severe in sign translation because shorter words/phrases are more likely to be ambiguous and insufficient information from the text to resolve the ambiguities which are related to the environment of the sign.</Paragraph>
    <Paragraph position="8"> We assume that a tourist has a video camera to capture signs into a wearable or portable computer. The procedure of sign translation is as follows: capturing the image with signs, detecting signs in the image, recognizing signs, and translating results of sign recognition into target language.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. AUTOMATIC SIGN DETECTION
</SectionTitle>
    <Paragraph position="0"> Fully automatic extraction of signs from the environment is very difficult, because signs are usually embedded in the environment. There are many challenges in sign detection, such as variation, motion and occlusion. We have no control in font, size, orientation, and position of sign texts.</Paragraph>
    <Paragraph position="1"> Originating in 3-D space, text on signs in scene images can be distorted by slant, tilt, and shape of objects on which they are found [8]. In addition to the horizontal left-to-right orientation, other orientations include vertical, circularly wrapped around another object, slanted, sometimes with the characters tapering (as in a sign angled away from the camera), and even mixed orientations within the same text area (as would be found on text on a T-shirt or wrinkled sign). Unlike other text detection and video OCR tasks, sign extraction is in a more dynamic environment. The user's movement can cause unstable input images. Furthermore, the quality of the video input is poorer than that of other video OCR tasks, such as detecting captions in broadcast news programs, because of low quality of equipment.</Paragraph>
    <Paragraph position="2"> Moreover, sign detection has to be real-time using a limited resource. Though automatic sign detection is a difficult task, it is crucial for a sign translation system.</Paragraph>
    <Paragraph position="3"> We use a hierarchical approach to address these challenges.</Paragraph>
    <Paragraph position="4"> We detect signs at three different levels. At the first level, the system performs coarse detection by extracting features from edges, textures, colors/intensities. The system emphasizes robust detection at this level and tries to effectively deal with the different conditions such as lighting, noise, and low resolution. A multi-resolution detection algorithm is used to compensate different lighting and low contrasts. The algorithm provides hypotheses of sign regions for a variety of scenes with large variations in both lighting condition and contrast. At the second level, the system refines the initial detection by employing various adaptive algorithms. The system focuses on each detected area and makes elaborate analysis to guarantee reliable and complete detection. In most cases, the adaptive algorithms can lead to finding the regions without missing any sign region. At the third level, the system performs layout analysis based on the outcome from the previous levels.</Paragraph>
    <Paragraph position="5"> The design and layout of signs are language and culture dependent. For example, many Asia languages, such as Chinese and Japanese, have two types of layout: the horizontal and the vertical. The system provides considerable flexibility to allow the detection of slanted signs and signs with non-uniform character sizes.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. SIGN TRANSLATION
</SectionTitle>
    <Paragraph position="0"> Sign translation has some special problems compared to a traditional language translation task. Sign translation depends not only on domain but also on functionality of the sign. The same text on different signs can be treated differently. In general, the text used in the sign is short and concise. For example, the average length of each sign in our Chinese sign database is 6.02 Chinese characters. The lexical mismatch and structural mismatch problems become more severe for sign translation because shorter words/phrases are more likely to be ambiguous and there isn't sufficient information from the text to resolve the ambiguities which are related to the environment of the sign. For example, in order to make signs short, abbreviations are widely used in signs, e.g., (/ji yan suo/) is the abbreviation for ,G3(/ji sheng chong yan jiu suo/ institute of parasites), such abbreviations are difficult, if not impossible, even for a human to understand without knowledge of the context of the sign. Since designers of signs always assume that readers can use the information from other sources to understand the meaning of the sign, they tend to use short words. e.g. in sign (/man xing/, drive slowly), the word (/xing/, walk, drive) is ambiguous, it can mean (/xing zou/ &amp;quot;move of human,&amp;quot; walk) or &amp;quot;move of a car,&amp;quot; drive). The human reader can understand the meaning if he knows it is a traffic sign for cars, but without this information, MT system cannot select the correct translation for this word. Another problem in sign is structural mismatch. Although this is one of the basic problems for all MT systems, it is more serious in sign translation: some grammatical functions are omitted to make signs concise. Examples include: (1) the subject &amp;quot;we&amp;quot; is omitted in (/li mao dai ke/, treat customers politely); (2) the sentence is reordered to emphasize the topic: rather than saying (/qing jiang bao zhuang zhi tou ru la ji xiang/, please throw wrapping paper into the garbage can), using (/bao zhuang zhi qing tou ru la ji xiang/, wrapping paper, please throw it into the garbage can) to highlight the &amp;quot;wrapping paper.&amp;quot; With these special features, sign translation is not a trivial problem of just using existing MT technologies to translate the text recognized by OCR module.</Paragraph>
    <Paragraph position="1"> Although a knowledge-based MT system works well with grammatical sentences, it requires a great amount of human effort to construct its knowledge base, and it is difficult for such a system to handle ungrammatical text that appears frequently in signs.</Paragraph>
    <Paragraph position="2"> We can use a database search method to deal with names, phrases, and symbols related to tourists. Names are usually location dependent, but they can be easily obtained from many information sources such as maps and phone books.</Paragraph>
    <Paragraph position="3"> Phrases and symbols related to tourists are relative fixed for a certain country. The database of phrases and symbols is relatively stable once it is built We propose to apply Generalized Example Based Machine Translation (GEBMT) [1][2] enhanced with domain detection to a sign translation task. This is a data-driven approach. What EBMT needs are a set of bilingual corpora each for one domain and a bilingual dictionary where the latter can be constructed statistically from the corpora.</Paragraph>
    <Paragraph position="4"> Matched from the corpus, EBMT can give the same style of translations as the corpus. The domain detection can be achieved from other sources. For example, shape/color of the sign and semantics of the text can be used to choose the domain of the sign.</Paragraph>
    <Paragraph position="5"> We will start with the EBMT software [1]. The system will be used as a shallow system that can function using nothing more than sentence-aligned plain text and a bilingual dictionary; and given sufficient parallel text, the dictionary can be extracted statistically from the corpus. In a translation process, the system looks up all matching phrases in the source-language half of the parallel corpus and performs a word-level alignment on the entries containing matches to determine a (usually partial) translation. Portions of the input for which there are no matches in the corpus do not generate a translation.</Paragraph>
    <Paragraph position="6"> Because the EBMT system does not generate translations for 100% of its input text, a bilingual dictionary and phrasal glossary are used to fill any gaps. Selection of the &amp;quot;best&amp;quot; translation is guided by a trigram model of the target language and a chart table [3].</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. A PROTOTYPE SYSTEM
</SectionTitle>
    <Paragraph position="0"> We have developed a prototype system for Chinese sign recognition and translation. Figure 3 shows the architecture of the prototype system. A user can interactively involve sign recognition and translation process when needed. For example, a user can select the area of interest, or indicate that the sign is a street name. The system works as follows.</Paragraph>
    <Paragraph position="1"> The system captures the sign in a natural background using a video camera. The system then automatically detects or interactively selects the sign region. The system performs sign recognition and translation within the detected/selected region. It first preprocesses the selected region, binarizes the image to get text or symbol, and feeds the binary image into the sign recognizer. OCR software from a third party is used for text recognition. The recognized text is then translated into English. The output of the translation is fed to the user by display on screen or synthesized speech.</Paragraph>
    <Paragraph position="2"> Festival, a general purpose multi-lingual text-to-speech (TTS) system is used for speech synthesis.</Paragraph>
    <Paragraph position="3">  An efficient user interface is important to a user-centered system. Use of interaction is not only necessary for an interactive system, but also useful for an automatic system. A user can select a sign from multiple detected signs for translation, and get involved when automatic sign detection is wrong. Figure 4 is the interface of the system. The window of the interface displays the image from a video camera. The translation result is overlaid on the location of the sign. A user can select the sign text using pen or mouse anywhere in the window.</Paragraph>
  </Section>
class="xml-element"></Paper>