<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2192"> <Title>Tagging Spoken Language Using Written Language Statistics</Title>
<Section position="3" start_page="0" end_page="1078" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="1078" type="sub_section"> <SectionTitle> 2.1 Probabilistic Part-of-speech Tagging </SectionTitle>
<Paragraph position="0"> The problem of (automatically) assigning parts of speech to words in context has received a lot of attention within computational corpus linguistics.</Paragraph>
<Paragraph position="1"> A variety of different methods have been investigated, most of which fall into two broad classes: probabilistic methods and rule-based methods.</Paragraph>
<Paragraph position="2"> Probabilistic taggers have typically been implemented as hidden Markov models, using probabilistic models with two kinds of basic probabilities:
* The lexical probability of seeing the word w given the part-of-speech t: P(w | t).</Paragraph>
<Paragraph position="3"> * The contextual probability of seeing the part-of-speech ti given the context of n - 1 parts-of-speech: P(ti | ti-(n-1), ..., ti-1).</Paragraph>
<Paragraph position="4"> Models of this kind are usually referred to as n-class models, the most common instances of which are the biclass (n = 2) and triclass (n = 3) models. The lexical and contextual probabilities of an n-class tagger are usually estimated using one of two methods:¹
¹The terms 'RF training' and 'ML training' are taken from Merialdo 1994. It should be pointed out, though, that the use of relative frequencies to estimate occurrence probabilities is also a case of maximum likelihood estimation (MLE).</Paragraph>
<Paragraph position="5"> * Relative Frequency (RF) training: Given a tagged training corpus, the probabilities can be estimated with relative frequencies.</Paragraph>
<Paragraph position="6"> * Maximum Likelihood (ML) training: Given an untagged training corpus, the probabilities can be estimated using the Baum-Welch algorithm (also known as the Forward-Backward algorithm) (Baum 1972).</Paragraph>
<Paragraph position="7"> Of these two methods, RF training seems to give better estimations while being more labor-intensive (Merialdo 1994). With proper training, n-class taggers typically reach an accuracy rate of about 95% for English texts (Charniak 1993), and similar results have been reported for other languages such as French and Swedish (Chanod & Tapanainen 1995; Brants & Samuelsson 1995).</Paragraph> </Section>
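To make the biclass model and RF training concrete, here is a minimal sketch in Python. It is not from the paper: the toy tagged corpus, the tag names, and the function names are all invented for illustration. The sketch estimates P(w | t) and P(ti | ti-1) with relative frequencies and decodes with the Viterbi algorithm; a real tagger would additionally need smoothing and a treatment of unknown words.

```python
# Minimal sketch of a biclass (bigram) HMM tagger with RF training.
# Toy corpus and tag names are hypothetical, for illustration only.
from collections import defaultdict

# Hypothetical tagged training corpus: one (word, tag) list per sentence.
TAGGED_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

def rf_train(corpus):
    """RF training: estimate P(w | t) and P(ti | ti-1) as relative frequencies."""
    emit = defaultdict(lambda: defaultdict(int))   # tag  -> word -> count
    trans = defaultdict(lambda: defaultdict(int))  # tag(i-1) -> tag(i) -> count
    for sent in corpus:
        prev = "<s>"  # sentence-initial pseudo-tag
        for word, tag in sent:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            prev = tag
    def normalize(table):
        return {k: {x: n / sum(v.values()) for x, n in v.items()}
                for k, v in table.items()}
    return normalize(emit), normalize(trans)

def viterbi(words, emit, trans):
    """Most probable tag sequence for a non-empty word list (biclass model)."""
    tags = list(emit)
    best = {t: (trans.get("<s>", {}).get(t, 0.0) * emit[t].get(words[0], 0.0), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max((best[s][0] * trans.get(s, {}).get(t, 0.0)
                        * emit[t].get(w, 0.0), best[s][1] + [t])
                       for s in tags)
                for t in tags}
    return max(best.values())[1]

emit, trans = rf_train(TAGGED_CORPUS)
print(viterbi(["the", "dog", "sleeps"], emit, trans))  # -> ['DET', 'NOUN', 'VERB']
```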
<Section position="2" start_page="1078" end_page="1078" type="sub_section"> <SectionTitle> 2.2 Tagging Spoken Language </SectionTitle>
<Paragraph position="0"> Spoken language transcriptions are essentially a kind of text, and can therefore be tagged with the methods used for other kinds of text. However, since the transcription of spoken language is a fairly labor-intensive task, the availability of suitable training corpora is much more limited than for ordinary written texts. One way to circumvent this problem is to use taggers trained on written texts to tag spoken language as well. This has apparently been done successfully for the spoken language part of the British National Corpus, using the CLAWS tagger (Garside). However, the application of written language taggers to spoken language is not entirely unproblematic.</Paragraph>
<Paragraph position="1"> First of all, spoken language transcriptions are typically produced in a different format and with different conventions than ordinary written texts. For example, a transcription is likely to contain markers for pauses, (aspects of) prosody, overlapping speech, etc. Moreover, they do not usually contain the punctuation marks found in ordinary texts. This means that the application of a written language tagger to spoken language minimally requires a special tokenizer, i.e., a preprocessor segmenting the text into appropriate coding units (words).</Paragraph>
<Paragraph position="2"> A second type of difficulty arises from the fact that spoken language is often transcribed using non-standard orthography. Even if no phonetic transcription is used, most transcription conventions support the use of modified orthography to capture typical features of spoken language (such as goin' instead of going, kinda instead of kind of, etc.). Thus, the application of a written language tagger to spoken language typically requires a special lexicon, mapping spoken language variants onto their canonical written language forms, in addition to a special tokenizer.</Paragraph>
<Paragraph position="3"> The problems considered so far may be seen as problems of a practical nature, but there is also a more fundamental problem with the use of written language statistics to analyze spoken language, namely that the probability estimates derived from written language may not be representative for spoken language. In the extreme case, some spoken language phenomena (such as hesitation markers) may be (nearly) non-existent in written language. But even for words and collocations that occur both in written and in spoken language, the occurrence probabilities may vary greatly between the two media. How this affects the performance of taggers and what methods can be used to overcome or circumvent the problems are issues that, surprisingly, do not seem to have been discussed in the literature at all. The present paper can be seen as a first attempt to explore this area.</Paragraph> </Section>
<Section position="3" start_page="1078" end_page="1078" type="sub_section"> <SectionTitle> 2.3 Tagging Swedish </SectionTitle>
<Paragraph position="0"> As far as we know, the methods for automatic part-of-speech tagging have not previously been applied to (transcribed) spoken Swedish. For written Swedish, there are a few tagged corpora available, such as the Teleman corpus (see, e.g., Brants & Samuelsson 1995) and the Stockholm-Umeå Corpus (Ejerhed et al 1992). A subpart of the latter has been used as training data in the experiments reported below.</Paragraph> </Section> </Section> </Paper>
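As an illustration of the preprocessing described in Section 2.2, the following sketch shows a tokenizer that strips transcription markers and a lexicon that maps spoken-language variants onto canonical written forms. It is not from the paper: the marker syntax (pauses as "(.)", prosody and overlap in square brackets) and the lexicon entries are hypothetical, since real transcription conventions vary.

```python
# Sketch of spoken-language preprocessing: tokenization + normalization.
# Marker syntax and lexicon entries are hypothetical examples.
import re

# Hypothetical normalization lexicon: spoken variant -> canonical form(s).
# Note that one spoken token may map to several written tokens (kinda -> kind of).
NORMALIZATION_LEXICON = {
    "goin": ["going"],
    "kinda": ["kind", "of"],
}

# Assumed marker conventions: pauses like "(.)" or "(..)", and
# prosody/overlap annotations in square brackets, e.g. "[overlap]".
MARKER = re.compile(r"\(\.+\)|\[[^\]]*\]")

def tokenize(transcription):
    """Remove transcription markers and split into word tokens."""
    return MARKER.sub(" ", transcription).split()

def normalize(tokens):
    """Map spoken-language variants onto canonical written-language forms."""
    out = []
    for tok in tokens:
        out.extend(NORMALIZATION_LEXICON.get(tok.lower(), [tok]))
    return out

line = "it was (.) kinda [overlap] hard goin home"
print(normalize(tokenize(line)))
# -> ['it', 'was', 'kind', 'of', 'hard', 'going', 'home']
```

After this step, the normalized token sequence can be fed to a tagger trained on written language, which is exactly the setup whose statistical adequacy Section 2.2 calls into question.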