<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0506">
  <Title>Pre-processing Closed Captions for Machine Translation</Title>
  <Section position="4" start_page="38" end_page="39" type="metho">
    <SectionTitle>
2 Pre-processing design
</SectionTitle>
    <Paragraph position="0"> Input pre-processing is essential in an embedded real time system, in order to simplify the core processing and make it both time- and memoryeffective. In addition to this, we followed the guideline of separating domain-dependent processes and resources from general purpose ones.</Paragraph>
    <Paragraph position="1"> On the one hand, grammars and lexicons are costly resources. It would be desirable for them to be domain-independent and portable across different domains, as well as declarative and bidirectional. On the other hand, a domain with distinctive characteristics requires some specific treatment, if a system aims at robustness. We decided to have a domain independent core MT system, locating the domain dependent processing in a pipeline of low-level components, easy to implement, aiming at fast and robust processing and using limited linguistic knowledge.</Paragraph>
    <Paragraph position="2"> We use declarative and bidirectional grammars and lexicons. The lexicMist approach is indeed suitable to the closed caption domain, e.g. in terms of its capability of handling loosely structured or incomplete sentences. Also, the linguistic resources are geared towards this domain in terms of grammatical and lexical coverage. However, our system architecture and formalism make them equally usable in any other domain and translation direction, as the linguistic knowledge therein contained is valid in any domain. For the architecture we refer the reader to (Popowich et al., 1997). In the rest of this paper we focus on the pre-processing module  and how it deals with the issues discussed in the introduction.</Paragraph>
    <Paragraph position="3"> The task of the pre-processing pipeline is to make the input amenable to a linguisticallyprincipled, domain independent treatment. This task is accomplished in two ways: 1. By normalizing the input, i.e. removing noise, reducing the input to standard typographical conventions, and also restructuring and simplifying it, whenever this can be done in a reliable, meaning-preserving way.</Paragraph>
    <Paragraph position="4"> 2. By annotating the input with linguistic information, whenever this can be reliably done with a shallow linguistic analysis, to reduce input ambiguity and make a full linguistic analysis more manageable.</Paragraph>
    <Paragraph position="5"> Figure (1) shows the system architecture, with a particular emphasis on the pre-processing pipeline. The next section describes the pipeline up to tagging. Proper name recognition and segmentation, which deal more specifically with the problems described in the introduction, are discussed in further sections.</Paragraph>
  </Section>
  <Section position="5" start_page="39" end_page="39" type="metho">
    <SectionTitle>
3 Normalization and tagging
</SectionTitle>
    <Paragraph position="0"> The label normalization groups three components, which clean up and tokenize the input.</Paragraph>
    <Paragraph position="1"> The text-level normalization module performs operations at the string level, such as removing extraneous text and punctuation (e.g. curly brackets , used to mark off sound effects), or removing periods from abbreviations. E.g.: (I) &amp;quot;I went to high school in the u.s.&amp;quot; &amp;quot;I went to high school in the usa.&amp;quot; The tokenizer breaks a line into words. The token-level normalization recognizes and annotates tokens belonging to special categories  (times, numbers, etc.), expands contractions, recognizes, normalizes and annotates stutters (e.g. b-b-b-bright), identifies compound words  and converts number words into digits. E.g.: (2) &amp;quot;I&amp;quot; &amp;quot;went&amp;quot; &amp;quot;to&amp;quot; &amp;quot;high&amp;quot; &amp;quot;school&amp;quot; &amp;quot;in&amp;quot; &amp;quot;the&amp;quot; &amp;quot;usa&amp;quot; &amp;quot; &amp;quot; &amp;quot;I&amp;quot; &amp;quot;went&amp;quot; &amp;quot;to&amp;quot; &amp;quot;high school&amp;quot; &amp;quot;in&amp;quot; &amp;quot;the&amp;quot; &amp;quot;usa&amp;quot; &amp;quot; &amp;quot; (3) &amp;quot;W-wh-wha~'s&amp;quot; &amp;quot;that&amp;quot; &amp;quot;?&amp;quot;0 &amp;quot;what&amp;quot;/stutter &amp;quot;is&amp;quot; &amp;quot;that&amp;quot; &amp;quot;?&amp;quot;  Note that annotations associated with tokens are carried along the entire translation process, so as to be used in producing the output (e.g.</Paragraph>
    <Paragraph position="2"> stutters are re-inserted in the output).</Paragraph>
    <Paragraph position="3"> The tagger assigns parts of speech to tokens.</Paragraph>
    <Paragraph position="4"> Part of speech information is used by the subsequent pre-processing modules, and also in parsing, to prioritize the most likely lexical assignments of ambiguous items.</Paragraph>
  </Section>
  <Section position="6" start_page="39" end_page="42" type="metho">
    <SectionTitle>
4 Proper name recognition
</SectionTitle>
    <Paragraph position="0"> Proper names are ubiquitous in closed captions (see Table 1). Their recognition is important for effective comprehension of closed captions, particularly in consideration of two facts: (i) users have little time to mentally rectify a mistranslation; (ii) a name can occur repeatedly in a program (e.g. a movie), with an annoying effect if it is systematically mistranslated (e.g. a golf tournament where the golfer named Tiger Woods is systematically referred to as los bosques del tigre, lit. 'the woods of the tiger').</Paragraph>
    <Paragraph position="1"> Name recognition is made harder in the closed caption domain by the fact that no capitalization information is given, thus making unusable all methods that rely on capitalization as the main way to identify candidates (Wolinski et al., 1995) (Wacholder et al., 1997). For instance, an expression like 'mark shields', as occurs in Table (1), is problematic in the absence of capitalization, as both 'mark' and 'shields' are three-way ambiguous (proper name, common noun and verb). Note that this identical problem may be encountered if an MT system is embedded in a speech-to-speech translation as well. This situation forced us to explore different ways of identifying proper names.</Paragraph>
    <Paragraph position="2"> The goal of our recognizer is to identify proper names in a tagged line and annotate them accordingly, in order to override any other possiblelexical assignment in the following modules. The recognizer also overrides previous tokenization, by possibly compounding two or more tokens into a single one, which will be treated as such thereafter. Besides part of speech, the only other information used by the recognizer is the lexical status of words, i.e.</Paragraph>
    <Paragraph position="3"> their ambiguity class (i.e. the range of possible syntactic categories it can be assigned) or their status as an unknown word (i.e. a word that is not in the lexicon). The recognizer scans an input line from left to right, and tries to match  each item against a sequences of patterns. Each pattern expresses constraints (in terms of word, part of speech tag and lexical status) on the item under inspection and its left and right contexts. Any number of items can be inspected to the left and right of the current item. Such patterns also make use of regular expression bperators (conjunction, disjunction, negation, Kleene star). For instance (a simplified version of) a pattern might look like the following: (4) /the/DEW (NOUNIADJ)*\] X' \['NOUN\] where we adopt the convention of representing words by lowercase strings, part of speech tags by uppercase strings and variables by primed Xs. The left and right context are enclosed in square brackets, respectively to the left and right of the current item. They can also contain special markers for the beginning and end of a line, and for the left or right boundary of the proper name being identified. This way tokenization can be overridden and separate tokens joined into a single name. Constraints on the lexical status of items are expressed as predicates associated with pattern elements, e.g.: (5) proper_and_common (X') A pattern like the one above (4-5) would match a lexically ambiguous proper/common noun preceded by a determiner (with any number of nouns or adjectives in between), and not followed by a noun (e.g. 'the bill is...'). Besides identifying proper names, some patterns may establish that a given item is not a name (as in the case above). A return value is associated with each pattern, specifying whether the current match is or is not a proper name.</Paragraph>
    <Paragraph position="4"> Once a successful match occurs, no further patterns are tried. Patterns are ordered from more to less specific. At the bottom of the pattern sequence are the simplest patterns, e.g.: (6) ([] X' []), proper_and_common(X') => yes which is the default assignment for words like 'bill' if no other pattern matched. However, (6) is overridden by more specific patterns like:</Paragraph>
    <Paragraph position="6"> The former pattern covers cases like 'telecommunications bill', preventing 'bill' from being interpreted as a proper name; the latter covers cases like 'damian bill', where 'bill' is more likely to be a name. In general, the recognizer tries to disambiguate lexically ambiguous nouns or to assign a category to unknown words on the basis of the available context. However, in principle any word could be turned into a proper name. For instance, verbs or adjectives can be turned into proper names when the context contains strong cues, like a title. Increasingly larger contexts provide evidence for more informed guesses, which override guesses based on narrower contexts. Consider the following examples, which show how a word or expression is treated differently depending on the available context. Recognized names are in italics.</Paragraph>
    <Paragraph position="7">  (9) biZ~ ~ (i0) the bill is ...</Paragraph>
    <Paragraph position="8"> (11) the bill clinton is ...</Paragraph>
    <Paragraph position="9"> (12) the bill clinton administration is  The lexically ambiguous bill, interpreted as a proper name in isolation, becomes a common noun if preceded by a determiner. However, the interpretation reverts to proper name if another noun follows. Likewise the unknown word clinton is (incorrectly) interpreted as a common noun in (11), as it is the last item of a noun phrase introduced by a determiner, but it becomes a proper name if another noun follows. We also use a name memory, which patterns have access to. As proper names are found in an input stream, they are added to the name memory. A previous occurrence of a proper name is used as evidence in making decisions about further occurrences. The idea is to cache names occurred in an 'easy' context (e.g. a name preceded by a title, which provides strong evidence for its status as a proper name), to use them later to make decisions in 'difficult' contexts, where the internal evidence would not be sufficient to support a proper name interpretation. Hence, what typically happens is that the same name in the same context is interpreted differently at different times, if previously the name has occurred in an 'easy' context and has been  memorized. E.g.: (13) the individual title went to tiger  woods.</Paragraph>
    <Paragraph position="10"> mr. tiger woods struggled today with a final round 80.</Paragraph>
    <Paragraph position="11"> [name memory] the short, well publicized professional life of tiger woods has been an open book.</Paragraph>
    <Paragraph position="12"> The name memory was designed to suit the peculiarities of closed captions. Typically, proper names have a low dispersion in this domain.</Paragraph>
    <Paragraph position="13"> They are concentrated in sections of an input stream (e.g. the names of the main characters in a movie), then disappear for long sections (e.g. after the movie is over). Therefore, a name memory needs to be reset to reflect such changes. However, it is problematic to decide when to reset the name memory. Even if it were possible to detect when a new program starts, one should take into account the possible scenario of an MT system embedded in a consumer product, in which case the user might unpredictably change channel at any time. In order to keep the name memory aligned with the current program, without any detection of program changes, we structured it as a relatively short queue (first in, first out). Every time a new item is added to the end of the queue, the first item is removed and all the other items are shifted. Moreover, we do not check whether a name is already in the memory. Every time a suitable item is found, we add it to the memory, regardless of whether it is already there. Hence, the same item could be present twice or more in the memory at any given time. The result of this arrangement is that a name only remains in the memory for a relatively short time. It can only remain for a longer time if it keeps reappearing frequently in the input stream (as typically happens); otherwise it is removed shortly after it stops appearing. In this way, the name memory is kept aligned with the current program, with only a short transition period, during which names that are no longer pertinent are still present in the memory, before being replaced by pertinent ones.</Paragraph>
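    A minimal Python sketch of the name memory as described above: a short first-in-first-out queue with duplicates allowed, so that a name survives only while it keeps reappearing in the input stream. The queue size is an illustrative assumption; the paper does not state one.

from collections import deque

class NameMemory:
    def __init__(self, size=20):            # size is assumed, not from the paper
        self._queue = deque(maxlen=size)    # oldest item drops off when full

    def add(self, name):
        # no duplicate check: a frequently recurring name occupies several
        # slots, and therefore stays in memory longer than one that has
        # stopped appearing
        self._queue.append(name)

    def __contains__(self, name):
        return name in self._queue

memory = NameMemory()
memory.add("tiger woods")          # found in an 'easy' context, e.g. after a title
print("tiger woods" in memory)     # True: later used as evidence in 'difficult' contexts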
    <Paragraph position="14"> The recognizer currently contains 63 patterns. We tested the recognizer on a sample of 1000 lines (5 randomly chosen continuous fragments of 200 lines each). The results, shown in Table (2), illustrate a recall of 72.7% and a precision of 95.0%. These results reflect our cautious approach to name recognition. Since the core MT system has its own means of identifying some proper names (either in the lexicon or via default assignments to unknown words) we aimed at recognizing names in pre-processing only when this could be done reliably. Note also that 6 out of the 8 false positives were isolated interjections that would be better left untranslated (e.g. pffoo, el smacko), or closed captioner's typos (e.g. yo4swear).</Paragraph>
  </Section>
  <Section position="7" start_page="42" end_page="43" type="metho">
    <SectionTitle>
5 Segmentation
</SectionTitle>
    <Paragraph position="0"> Segmentation breaks a line into one or more segments, which are passed separately to subsequent modules (Ejerhed, 1996) (Beeferman et al., 1997). In translation, segmentation is applied to split a line into a sequence of translationally self-contained units (Lavie et al., 1996). In our system, the translation units we identify are syntactic units, motivated by cross-linguistic considerations. Each unit is a constituent that dan be translated independently.</Paragraph>
    <Paragraph position="1"> Its translation is insensitive to the context in which the unit occurs, and the order of the units is preserved by translation.</Paragraph>
    <Paragraph position="2"> One motivation for segmenting is that processing is faster: syntactic ambiguity is reduced, and backtracking from a module to a previous one does not involve re-processing an entire line, but only the segment that failed. A second motivation is robustness: a failure in one segment does not involve a failure in the entire line, and error-recovery can be limited only to a segment. Further motivations are provided by the colloquial nature of closed captions. A line often contains fragments with a loose syntactic relation to each other and to the main clause: vocatives, false starts, tag questions, etc. These are most easily translated as individual segments. Parenthetical expressions are often also found in the middle of a main clause, thus making complete parses problematic. However, the solution involves a heavier intervention than just segmenting. Dealing with parentheticals requires restructuring a line, and reducing it to a 'normal' form which ideally always has parenthetical expressions at one end of a sentence (under the empirical assumption that the overall meaning is not affected). We will see how this kind of problem is handled in segmentation. A third motivation is given by the format of closed captions, with input lines split across non-constituent boundaries. One solution would be delaying translation until a sentence boundary is found, and restructuring the stored lines in a linguistically principled way.</Paragraph>
    <Paragraph position="3"> However, the requirements of real time translation (either because of real time captioning at the source, or because the MT system is embedded in a consumer product), together with the requirement that translations be aligned with the source text and, above all, with the images, makes this solution problematic. The solution we are left with, if we want lines to be broken along constituent boundaries, is to further segment a sentence, even at the cost of sometimes separating elements that should go together for an optimal translation. We also argued elsewhere (Toole et al., 1998) that in a time-constrained application the output grammaticality is of paramount importance, even at the cost of a complete meaning equivalence with the source. For this reason, we also simplify likely problematic input, when a simplification is possible without affecting the core meaning.</Paragraph>
    <Paragraph position="4"> To sum up, the task at hand is broader than just segmentation: re-ordering of constituents and removal of words are also required, to syntactically 'normalize' the input. As with name recognition, we aim at using efficient and easy to implement techniques, relying on limited linguistic information. The segmenter works by matching input lines against a set of templates represented by pushdown transducers. Each transducer is specified in a fairly standard way (Gazdar and Mellish, 1989, 82), by defining an initial state, a final state, and a set of transitions of the following form: (14) (State I, State2, Label, Transducer&gt; Such a transition specifies that Transducer can move from Statel to State2 when the input specified by Label is found. Label can be either a pair (InputSymbol, OutputSymbol) or the name of another transducer, which needs to be entirely traversed for the transition from Statel to State2 to take place. An input symbol is a &lt;Word, Tag&gt; pair. An output symbol is an integer ranging from 0 to 3, specifying to which of two output segments an input symbol is assigned (0 = neither segment, 3 = both segments, 1 and 2 to be interpreted in the obvious way). The output codes are then used to perform the actual split of a line. A successful match splits a line into two segments at most.</Paragraph>
    <Paragraph position="5"> However, on a successful split, the resulting segments are recursively fed to the segmenter, until no match is found. Therefore, there is no limit to the number of segments obtained from an input line. The segmenter currently contains 37 top-level transducers, i.e. segmenting patterns. Not all of them are used at the same time. The implementation of patterns is straightforward and the segmenter can be easily adapted to different domains, by implementing specific patterns and excluding others. For instance, a very simple patterns split a line at every comma, a slightly more sophisticated one, splits a line at every comma, unless tagged as a coordination; other patterns split a final adverb, interjection, prepositional phrase, etc.</Paragraph>
    <Paragraph position="6"> Note that a segment can be a discontinuous part of a line, as the same output code can be assigned to non-contiguous elements. This feature is used, e.g., in restructuring a sentence, as when a parenthetical expression is encountered.</Paragraph>
    <Paragraph position="7"> Thefollowing example shows an input sentence, an assignment, and a resulting segmentation.</Paragraph>
    <Paragraph position="8"> (15) this, however, is a political science course.</Paragraph>
    <Paragraph position="9"> (16) this/2 ,/0 however/l ,/i is/2 a/2 political/2 science/2 course/2.</Paragraph>
    <Paragraph position="10"> (17) I. however ,  2. this is a political science course We sometimes use the segmenter's ability to simplify the input, e.g. with adverbs like just, which are polysemous and difficult to translate, but seldom contribute to the core meaning of a sentence.</Paragraph>
  </Section>
class="xml-element"></Paper>