<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1020"> <Title>tRuEcasIng</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Evaluation </SectionTitle> <Paragraph position="0"> Both the unigram model and the language model based truecaser were trained on the AQUAINT (ARDA) and TREC (NIST) corpora, each consisting of 500M token news stories from various news agencies. The truecaser was built using IBM's ViaVoiceTMlanguage modeling tools. These tools implement trigram language models using deleted interpolation for backing off if the trigram is not found in the training data. The resulting model's perplexity is 108.</Paragraph> <Paragraph position="1"> Since there is no absolute truth when truecasing a sentence, the experiments need to be built with some reference in mind. Our assumption is that professionally written news articles are very close to an intangible absolute truth in terms of casing. Furthermore, we ignore the impact of diverging stylistic forms, assuming the differences are minor.</Paragraph> <Paragraph position="2"> Based on the above assumptions we judge the truecasing methods on four different test sets. The first test set (APR) consists of the August 25, 2002 top 20 news stories from Associated Press and Reuters excluding titles, headlines, and section headers which together form the second test set lier news stories from AP and New York Times belonging to the ACE dataset. The last test set (MT) includes a set of machine translation references (i.e. human translations) of news articles from the Xinhua agency. The sizes of the data sets are as follows: APR - 12k tokens, ACE - 90k tokens, and MT - 63k tokens. For both truecasing methods, we computed the agreement with the original news story considered to be the ground truth.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Results </SectionTitle> <Paragraph position="0"> The language model based truecaser consistently displayed a significant error reduction in case restoration over the unigram model (figure 3). On current news stories, the truecaser agreement with the original articles is 98%.</Paragraph> <Paragraph position="1"> Titles and headlines usually have a higher concentration of named entities than normal text. This also means that they need a more complex model to assign case information more accurately. The LM based truecaser performs better in this environment while the unigram model misses named entity components which happen to have a less frequent surface form.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Qualitative Analysis </SectionTitle> <Paragraph position="0"> The original reference articles are assumed to have the absolute true form. However, differences from these original articles and the truecased articles are not always casing errors. The truecaser tends to modify the first word in a quotation if it is not proper name: &quot;There has been&quot; becomes &quot;there has been&quot;. It also makes changes which could be considered a correction of the original article: &quot;Xinhua news agency&quot; becomes &quot;Xinhua News Agency&quot; and &quot;northern alliance&quot; is truecased as &quot;Northern Alliance&quot;. 
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Qualitative Analysis </SectionTitle> <Paragraph position="0"> The original reference articles are assumed to have the absolute true form. However, differences between these original articles and the truecased articles are not always casing errors. The truecaser tends to modify the first word in a quotation if it is not a proper name: &quot;There has been&quot; becomes &quot;there has been&quot;. It also makes changes which could be considered a correction of the original article: &quot;Xinhua news agency&quot; becomes &quot;Xinhua News Agency&quot; and &quot;northern alliance&quot; is truecased as &quot;Northern Alliance&quot;. In more ambiguous cases both the original version and the truecased fragment represent different stylistic forms: &quot;prime minister Hekmatyar&quot; becomes &quot;Prime Minister Hekmatyar&quot;.</Paragraph> <Paragraph position="1"> There are also cases where the truecaser described in this paper makes errors. New movie names are sometimes miscased: &quot;my big fat greek wedding&quot; or &quot;signs&quot;. In conducive contexts, person names are correctly cased: &quot;DeLay said in&quot;. However, in ambiguous, adverse contexts they are treated as common nouns: &quot;pond&quot; or &quot;to delay that&quot;. Unseen organization names which form perfectly normal phrases are erroneously cased as well: &quot;international security assistance force&quot;.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Application: Machine Translation Post-Processing </SectionTitle> <Paragraph position="0"> We have applied truecasing as a post-processing step to a state-of-the-art machine translation system in order to improve readability. For translation between Chinese and English, or Japanese and English, there is no transfer of case information. In these situations the translation output has no case information, and it is beneficial to apply truecasing as a post-processing step. This makes the output more legible, and system performance increases when case information is required (truecasing improves legibility, not the translation itself).</Paragraph> <Paragraph position="1"> We have applied truecasing to Chinese-to-English translation output. The data source consists of news stories (2500 sentences) from the Xinhua News Agency. The news stories are first translated, then subjected to truecasing. The translation output is evaluated with BLEU (Papineni et al., 2001), a robust, language independent automatic machine translation evaluation method. BLEU scores are highly correlated with human judgments, providing a way to perform frequent and accurate automated evaluations. BLEU uses a modified n-gram precision metric and a weighting scheme that places more emphasis on longer n-grams.</Paragraph> <Paragraph position="2"> In table 1, both truecasing methods are applied to machine translation output with and without uppercasing the first letter in each sentence. The truecasing methods are compared against the all-lowercase version of the articles as well as against an existing rule-based system which is aware of a limited number of entity casings such as dates, cities, and countries. The LM based truecaser is very effective in increasing the readability of articles and captures an important aspect that the BLEU score is sensitive to. Truecasing the translation output yields</Paragraph> </Section>
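Because the table 1 comparison hinges on BLEU, a minimal sketch of its modified n-gram precision and brevity penalty may help. This follows the general description in (Papineni et al., 2001) and is not the scorer used for the reported numbers; the single-reference setup, whitespace tokenization, and absence of smoothing are simplifying assumptions.

```python
# Sketch of BLEU: clipped ("modified") n-gram precision, geometric mean, and
# brevity penalty. Single reference, no smoothing -- illustration only.
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Modified precision: clip each candidate n-gram count by its count in
        # the reference so repeated words cannot be rewarded more than once.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        p_n = clipped / total
        log_precisions.append(math.log(p_n) if p_n > 0 else float("-inf"))
    # Geometric mean of the n-gram precisions (uniform weights in this sketch);
    # longer n-grams dominate in practice because their precisions fall fastest.
    if any(p == float("-inf") for p in log_precisions):
        geo_mean = 0.0
    else:
        geo_mean = math.exp(sum(log_precisions) / max_n)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

if __name__ == "__main__":
    ref = "Xinhua News Agency reported the agreement on Monday"
    print(bleu("xinhua news agency reported the agreement on monday", ref))
    print(bleu("Xinhua News Agency reported the agreement on Monday", ref))
```

Since n-grams are compared as case-sensitive strings, restoring case in the translation output directly affects how many candidate n-grams match the cased references, which is the aspect of BLEU that truecasing is able to recover.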
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Task Based Evaluation </SectionTitle> <Paragraph position="0"> Case restoration and normalization can be employed for more complex tasks. We have successfully leveraged truecasing in improving named entity recognition and automatic content extraction.</Paragraph> <Paragraph position="1"> In order to evaluate the effect of truecasing on extracting named entity labels, we tested an existing named entity system on a test set that has significant case mismatch with the training data of the system. The base system is an HMM based tagger, similar to (Bikel et al., 1997). The system has 31 semantic categories, which are extensions of the MUC categories. The tagger creates a lattice of decisions corresponding to tokenized words in the input stream. When tagging a word $w_i$ in a sentence of words $w_0 \ldots w_N$, there are two possibilities. If a tag begins: $p(t_1^N | w_1^N)_i = p(t_i | t_{i-1}, w_{i-1}) \, p^{\dagger}(w_i | t_i, w_{i-1})$. If a tag continues: $p(t_1^N | w_1^N)_i = p(w_i | t_i, w_{i-1})$. The $\dagger$ indicates that the distribution is formed from words that are the first words of entities. The $p^{\dagger}$ distribution predicts the probability of seeing that word given the tag and the previous word instead of the tag and previous tag. Each word has a set of features, some of which indicate the casing and embedded punctuation. These models have several levels of back-off when the exact trigram has not been seen in training. A trellis spanning the 31 futures is built for each word in a sentence and the best path is derived using the Viterbi algorithm.</Paragraph>
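The begin/continue decoding just described can be sketched as follows. This is not the IBM tagger itself: the three-tag set stands in for the 31 futures, the probability tables are hypothetical toy values, and a small floor replaces the real multi-level back-off.

```python
# Sketch of the begin/continue HMM decisions and the Viterbi search over tag
# "futures". All probability tables are hypothetical toy values.
import math

TAGS = ["PER", "ORG", "O"]   # stand-in for the 31 semantic futures
EPS = 1e-6                   # crude floor instead of real back-off

# p(tag | prev_tag, prev_word): consulted when a new tag begins
P_TAG = {("O", "<s>", "PER"): 0.4, ("PER", "DeLay", "O"): 0.6,
         ("O", "said", "PER"): 0.3}
# p_dagger(word | tag, prev_word): emission for the first word of an entity/tag
P_FIRST = {("DeLay", "PER", "<s>"): 0.2, ("said", "O", "DeLay"): 0.3}
# p(word | tag, prev_word): emission when the same tag continues
P_CONT = {("in", "O", "said"): 0.4}

def score(word, tag, prev_tag, prev_word):
    if tag != prev_tag:  # a tag begins
        return (P_TAG.get((prev_tag, prev_word, tag), EPS)
                * P_FIRST.get((word, tag, prev_word), EPS))
    return P_CONT.get((word, tag, prev_word), EPS)  # the tag continues

def viterbi(words):
    """Best tag path over a trellis spanning TAGS, one column per word."""
    trellis = {"O": (0.0, [])}   # tag -> (log score, best path so far)
    prev_word = "<s>"
    for word in words:
        new_trellis = {}
        for tag in TAGS:
            candidates = [
                (old + math.log(score(word, tag, ptag, prev_word)), path + [tag])
                for ptag, (old, path) in trellis.items()]
            new_trellis[tag] = max(candidates, key=lambda c: c[0])
        trellis, prev_word = new_trellis, word
    return max(trellis.values(), key=lambda c: c[0])[1]

if __name__ == "__main__":
    print(viterbi(["DeLay", "said", "in"]))   # expected: ['PER', 'O', 'O']
```

In the real system each word additionally carries casing and punctuation features, which is where truecased input helps the entity paths compete in the trellis.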
<Paragraph position="2"> The performance of the system shown in table 2 indicates an overall 26.52% F-measure improvement when using truecasing. The alternative to truecasing text is to destroy case information in the training material, using the SNORIFY procedure in (Bikel et al., 1997). Case is an important feature in detecting most named entities, but particularly so for the title of a work, an organization, or an ambiguous word with two frequent cases. Truecasing the sentence is essential in detecting that &quot;To Kill a Mockingbird&quot; is the name of a book, especially if the quotation marks are left off.</Paragraph> <Paragraph position="3"> Automatic Content Extraction (ACE) is a task focusing on the extraction of mentions of entities and relations between them from textual data. The textual documents are from newswire, broadcast news with text derived from automatic speech recognition (ASR), and newspaper with text derived from optical character recognition (OCR) sources. The mention detection task (ace, 2001) comprises the extraction of named (e.g. &quot;Mr. Isaac Asimov&quot;), nominal (e.g. &quot;the complete author&quot;), and pronominal (e.g. &quot;him&quot;) mentions of Persons, Organizations, Locations, Facilities, and Geo-Political Entities.</Paragraph> <Paragraph position="4"> The automatically transcribed (using ASR) broadcast news documents and the translated Xinhua News Agency (XINHUA) documents in the ACE corpus do not contain any case information, while human transcribed broadcast news documents contain casing errors (e.g. &quot;George bush&quot;). This problem occurs especially when the data source is noisy or the articles are poorly written.</Paragraph> <Paragraph position="5"> For all documents from broadcast news (human transcribed and automatically transcribed) and XINHUA sources, we extracted mentions before and after applying truecasing. The ASR transcribed broadcast news data comprised 86 documents containing a total of 15,535 words; the human transcribed version contained 15,131 words. There were only two XINHUA documents in the ACE test set, containing a total of 601 words. None of this data, or any other ACE data, was used for training the truecasing models.</Paragraph> <Paragraph position="6"> Table 3 shows the result of running our maximum entropy mention detection system, which participated in the ACE evaluation, on the raw text as well as on truecased text. For ASR transcribed documents, we obtained an eightfold improvement in mention detection, from 5% F-measure to 46% F-measure. The low baseline score is mostly due to the fact that our system had been trained on newswire stories available from previous ACE evaluations, while the latest test data included ASR output. It is very likely that the improvement due to truecasing will be more modest for the next ACE evaluation, when our system will be trained on ASR output as well.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Possible Improvements &amp; Future Work </SectionTitle> <Paragraph position="0"> Although the statistical model we have considered performs very well, further improvements must go beyond language modeling and enhance how expressive the model is. Additional features are needed during decoding to capture context outside of the current lexical item, medium range context, as well as discontinuous context. Another potentially helpful feature would provide a distribution over similar lexical items, perhaps using an edit or phonetic distance.</Paragraph> <Paragraph position="1"> Truecasing can be extended to cover a more general notion of surface form that includes accents. Depending on the context, words might take different surface forms. Since punctuation is a natural extension of surface form, shallow punctuation restoration (e.g. a word followed by a comma) can also be addressed through truecasing.</Paragraph> </Section> </Paper>