<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3237">
  <Title>Adaptation of Maximum Entropy Capitalizer: Little Data Can Help a Lot</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>Abstract</SectionTitle>
    <Paragraph position="0">A novel technique for maximum &quot;a posteriori&quot; (MAP) adaptation of maximum entropy (MaxEnt) and maximum entropy Markov models (MEMM) is presented.</Paragraph>
    <Paragraph position="1">The technique is applied to the problem of recovering the correct capitalization of uniformly cased text: a &quot;background&quot; capitalizer trained on 20Mwds of Wall Street Journal (WSJ) text from 1987 is adapted to two Broadcast News (BN) test sets -- one containing ABC Primetime Live text and the other NPR Morning News/CNN Morning Edition text -- from 1996.</Paragraph>
    <Paragraph position="2">The &quot;in-domain&quot; performance of the WSJ capitalizer is 45% better than that of the 1-gram baseline, when evaluated on a test set drawn from WSJ 1994. When evaluating on the mismatched &quot;out-of-domain&quot; test data, the 1-gram baseline is outperformed by 60%; the improvement brought by the adaptation technique using a very small amount of matched BN data -- 25-70kwds -- is about 20-25% relative. Overall, an automatic capitalization error rate of 1.4% is achieved on BN data.</Paragraph>
  </Section>
</Paper>