<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1020"> <Title>tRuEcasIng</Title>
<Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Approach </SectionTitle>
<Paragraph position="0"> In this paper we take a statistical approach to truecasing. First we present the baseline: a simple, straightforward unigram model which performs reasonably well in most cases. Then, we propose a better, more flexible statistical truecaser based on language modeling.</Paragraph>
<Paragraph position="1"> From a truecasing perspective we observe four general classes of words: all lowercase (LC), first letter uppercase (UC), all letters uppercase (CA), and mixed case (MC). The MC class could be further refined into meaningful subclasses, but for the purpose of this paper it is sufficient to correctly identify the specific true MC form of each MC instance.</Paragraph>
<Paragraph position="2"> We are interested in correctly assigning case labels to words (tokens) in natural language text. This requires the ability to discriminate between case labels for the same lexical item, taking into account the surrounding words. We are interested in casing word combinations observed during training as well as new phrases. The model must be able to generalize in order to recognize that even though the possibly misspelled token &quot;lenon&quot; has never been seen before, words in the same context usually take the UC form.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Baseline: The Unigram Model </SectionTitle>
<Paragraph position="0"> The goal of this paper is to show the benefits of truecasing in general. The unigram baseline (presented below) is introduced in order to put the task-based evaluations in perspective, not to serve as a straw-man baseline.</Paragraph>
<Paragraph position="1"> The vast majority of vocabulary items have only one surface form, so it is natural to adopt the unigram model as a baseline for truecasing. In most situations, the unigram model is a simple and efficient model for surface form restoration. It associates with each surface form a score based on its frequency of occurrence. Decoding is very simple: the true case of a token is predicted by the most likely case of that token.</Paragraph>
<Paragraph position="2"> The unigram model's upper bound on truecasing performance is given by the percentage of tokens that occur during decoding under their most frequent case. Approximately 12% of the vocabulary items have been observed under more than one surface form. Hence it is inevitable for the unigram model to fail on tokens such as &quot;new&quot;. Due to the overwhelming frequency of its LC form, &quot;new&quot; will take this form regardless of what token follows it. Whether the subsequent word is &quot;information&quot; or &quot;york&quot;, &quot;new&quot; will be labeled LC. In the latter case, &quot;new&quot; occurs under one of its less frequent surface forms.</Paragraph> </Section>
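To make the baseline concrete, the following sketch (our illustration, not code from the paper; the toy data and function names are hypothetical) builds a most-frequent-casing table from case-sensitive training text and restores case by simple lookup:

from collections import Counter, defaultdict

def train_unigram_truecaser(tokens):
    # Count how often each lowercased item appears under each surface form,
    # then keep only its most frequent casing.
    counts = defaultdict(Counter)
    for token in tokens:
        counts[token.lower()][token] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def unigram_truecase(tokens, best_form):
    # Predict the most frequent observed casing; unseen tokens are left lowercased.
    return [best_form.get(t.lower(), t.lower()) for t in tokens]

# Toy usage: "new" is mostly LC in training, so it stays LC even before "York".
train = "I saw New York today and read new reports , new data and new results".split()
table = train_unigram_truecaser(train)
print(unigram_truecase("new york offers new data".split(), table))
# -> ['new', 'York', 'offers', 'new', 'data']

The failure on &quot;new York&quot; in the last line is exactly the context blindness that the approach in the next subsection addresses.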
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Truecaser </SectionTitle>
<Paragraph position="0"> The truecasing strategy that we propose seeks to capture local context and bootstrap it across a sentence. The case of a token depends on the most likely meaning of the sentence, where local meaning is approximated by n-grams observed during training. However, the local context of a few words alone is not enough for case disambiguation.</Paragraph>
<Paragraph position="1"> Our proposed method employs sentence-level context as well.</Paragraph>
<Paragraph position="2"> We capture local context through a trigram language model, but the case label is decided at the sentence level. A reasonable improvement over the unigram model would have been to decide the casing of a word given the previous two lexical items and their corresponding case content. However, this greedy approach still disregards global cues. Our goal is to maximize the probability of a larger text segment (i.e., a sentence) occurring under a certain surface form. Towards this goal, we first build a language model that can provide local context statistics.</Paragraph>
<Paragraph position="3"> Language modeling provides features for a labeling scheme. These features are based on the probability of a lexical item and its case content conditioned on the previous two lexical items and their corresponding case content:</Paragraph>
<Paragraph position="4"> $P(w_i \mid w_{i-2} w_{i-1}) = \lambda_3 P_3(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P_2(w_i \mid w_{i-1}) + \lambda_1 P_1(w_i) + \lambda_0 P_0$ (1)</Paragraph>
<Paragraph position="5"> where the trigram, bigram, unigram, and uniform probabilities are scaled by individual $\lambda$ weights, which are learned by observing training examples. Here $w_i$ represents a word together with its case tag, treated as a unit for probability estimation.</Paragraph>
<Paragraph position="6"> Using the language model probabilities we decode the case information at the sentence level. We construct a trellis (Figure 1) which incorporates all the sentence surface forms as well as the features computed during training. A node in this trellis consists of a lexical item, a position in the sentence, a possible casing, and a history of the previous two lexical items and their corresponding case content. Hence, for each token, all surface forms appear as nodes carrying additional context information. In the trellis, thicker arrows indicate higher transition probabilities; for example, &quot;delay&quot; and &quot;DeLay&quot; are most probable, perhaps in the contexts of &quot;time delay&quot; and &quot;Senator Tom DeLay&quot; respectively. The trellis can be viewed as a Hidden Markov Model (HMM) computing the state sequence which best explains the observations. The states $(q_1, q_2, \dots, q_n)$ of the HMM are combinations of case and context information, the transition probabilities are the language-model-based features, and the observations $(O_1 O_2 \dots O_t)$ are lexical items.</Paragraph>
<Paragraph position="7"> During decoding, the Viterbi algorithm (Rabiner, 1989) is used to compute the highest probability state sequence ($q^*$, at the sentence level) that yields the desired case information:</Paragraph>
<Paragraph position="8"> $q^* = \arg\max_{q_{i_1} q_{i_2} \dots q_{i_t}} P(q_{i_1} q_{i_2} \dots q_{i_t} \mid O_1 O_2 \dots O_t, \lambda)$ (2)</Paragraph>
<Paragraph position="9"> where $P(q_{i_1} q_{i_2} \dots q_{i_t} \mid O_1 O_2 \dots O_t, \lambda)$ is the probability of a given state sequence conditioned on the observation sequence and the model parameters $\lambda$. A more sophisticated approach could be envisioned, in which either the observations or the states are more expressive. These alternate design choices are not explored in this paper.</Paragraph>
<Paragraph position="10"> Testing speed depends on the width and length of the trellis; the overall decoding complexity is $C_{decoding} = O(S \cdot M^{H+1})$, where $S$ is the sentence size, $M$ is the number of surface forms we are willing to consider for each word, and $H$ is the history size ($H = 3$ in the trigram case).</Paragraph> </Section>
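As a concrete rendering of the trellis search, the sketch below (ours, not the authors' implementation) runs Viterbi decoding over candidate surface forms. The callables candidate_casings and lm_prob are hypothetical stand-ins for the paper's trained components, with lm_prob expected to return the interpolated probability of equation (1); for simplicity the state here carries only the two most recent surface forms.

import math

def viterbi_truecase(tokens, candidate_casings, lm_prob):
    # candidate_casings(word) -> surface forms considered for `word`
    # lm_prob(w2, w1, w)      -> P(w | w2 w1); assumed strictly positive, as the
    #                            uniform lambda_0 term of equation (1) guarantees.
    BOS = "<s>"
    # chart[i] maps a state (previous form, current form) to
    # (best log-probability of any path reaching it, predecessor state).
    chart = [{(BOS, BOS): (0.0, None)}]
    for word in tokens:
        column = {}
        for form in candidate_casings(word):
            for (w2, w1), (logp, _) in chart[-1].items():
                score = logp + math.log(lm_prob(w2, w1, form))
                state = (w1, form)
                if state not in column or score > column[state][0]:
                    column[state] = (score, (w2, w1))
        chart.append(column)
    # Backtrack from the best final state to recover the sentence-level casing.
    state = max(chart[-1], key=lambda s: chart[-1][s][0])
    forms = []
    for column in reversed(chart[1:]):
        forms.append(state[1])
        state = column[state][1]
    return list(reversed(forms))

Restricting candidate_casings to an n-best list of surface forms is what keeps the $O(S \cdot M^{H+1})$ cost discussed above manageable.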
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Unknown Words </SectionTitle>
<Paragraph position="0"> In order for truecasing to be generalizable, it must deal with unknown words -- words not seen during training. For large training sets, an extreme assumption is that most of the words and casings possible in a language have been observed during training. Hence, most new tokens seen during decoding are going to be either proper nouns or misspellings. The simplest strategy is to consider all unknown words as being of the UC form (i.e., people's names, places, organizations).</Paragraph>
<Paragraph position="1"> Another approach is to replace the less frequent vocabulary items with case-carrying special tokens.</Paragraph>
<Paragraph position="2"> During training, the word &quot;mispeling&quot; is replaced by UNKNOWN LC and the word &quot;Lenon&quot; by UNKNOWN UC. This transformation is based on the observation that similar types of infrequent words will occur during decoding. It creates the precedent of unknown words of a particular format being observed in a certain context. When a truly unknown word is seen in the same context, the most appropriate casing will be applied.</Paragraph>
<Paragraph position="3"> This was the method used in our experiments. A similar method is to apply the case-carrying special token transformation only to a small random sample of all tokens, thus capturing context regardless of frequency of occurrence.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Mixed Casing </SectionTitle>
<Paragraph position="0"> A reasonable truecasing strategy is to focus on token classification into three categories: LC, UC, and CA. In most text corpora, mixed case tokens such as McCartney, CoOl, and TheBeatles occur with moderate frequency. Some NLP tasks might prefer mapping MC tokens starting with an uppercase letter into the UC surface form. This technique reduces the feature space and allows for sharper models. However, the decoding process can be generalized to include mixed cases in order to find a closer fit to the true sentence. In a clean version of the AQUAINT (ARDA) news stories corpus, 90% of the tokens occurred under their most frequent surface form (Figure 2).</Paragraph>
<Paragraph position="1"> The expensive brute-force approach considers all possible casings of a word. Even with the full casing space covered, some mixed cases will not be seen during training, and the language model probabilities for n-grams containing such words will back off to an unknown word strategy. A more feasible method is to account only for the mixed case items observed during training, relying on a large enough training corpus. A variable beam decoding would assign non-zero probabilities to all known casings of each word. An n-best approximation is somewhat faster and easier to implement, and is the approach employed in our experiments: during the sentence-level decoding, only the n most frequent mixed casings seen during training are considered.</Paragraph>
<Paragraph position="2"> If the true capitalization is not among these n-best versions, the decoding is not correct. Additional lexical and morphological features might be needed if identifying MC instances is critical.</Paragraph> </Section>
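To illustrate how the case-carrying unknown tokens of Section 2.3 and the n-best mixed-case restriction of Section 2.4 might fit together, here is a sketch (our own; the UNKNOWN_* spellings, the threshold, and the helper names are assumptions, not the paper's code):

from collections import Counter

def case_class(token):
    # The four classes used above: LC, UC, CA, and MC for everything else.
    if token.islower():
        return "LC"
    if token.isupper():
        return "CA"
    if token[0].isupper() and token[1:].islower():
        return "UC"
    return "MC"

def mask_rare_words(tokens, freq, min_count=2):
    # Training-time transformation: infrequent words become case-carrying
    # special tokens, e.g. "mispeling" -> UNKNOWN_LC, "Lenon" -> UNKNOWN_UC.
    # `freq` is a Counter over lowercased training tokens.
    return [t if freq[t.lower()] >= min_count else "UNKNOWN_" + case_class(t)
            for t in tokens]

def candidate_casings(word, observed_forms, n_best_mc=2):
    # Decoding-time candidates: LC, UC, CA, plus the n most frequent mixed-case
    # forms of this word seen in training (observed_forms: lowercased word -> Counter).
    low = word.lower()
    candidates = {low, low.capitalize(), word.upper()}
    mixed = [f for f, _ in observed_forms.get(low, Counter()).most_common()
             if case_class(f) == "MC"]
    candidates.update(mixed[:n_best_mc])
    return sorted(candidates)

If the true mixed casing of a word is not among the proposed candidates, the decoder cannot recover it, which is the n-best limitation noted above.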
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 First Word in the Sentence </SectionTitle>
<Paragraph position="0"> The first word in a sentence is generally under the UC form. This sentence-begin indicator is sometimes ambiguous even when paired with sentence-end indicators such as the period. While sentence splitting is not within the scope of this paper, we want to emphasize that many NLP tasks would benefit from knowing the true case of the first word in a sentence, thus avoiding having to learn that beginnings of sentences are artificially important. Since it is straightforward to convert the first letter of a sentence to uppercase, a more interesting problem from a truecasing perspective is to learn how to predict the correct case of the first word in a sentence (i.e., not always UC).</Paragraph>
<Paragraph position="1"> If the language model is built on clean sentences with sentence boundaries taken into account, the decoding will most likely uppercase the first letter of any sentence. On the other hand, if the language model is trained on clean sentences while disregarding sentence boundaries, the model will be less accurate, since different casings will be presented for the same context and artificial n-grams will be seen when transitioning between sentences. One way to obtain the desired effect is to discard the first n tokens of the training sentences in order to escape the sentence-begin effect; the language model is then built on smoother context. A similar effect can be obtained by initializing the decoding with n-gram state probabilities so that the boundary information is masked.</Paragraph> </Section> </Section> </Paper>