<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2192"> <Title>Tagging Spoken Language Using Written Language Statistics</Title> <Section position="4" start_page="1078" end_page="1079" type="metho"> <SectionTitle> 3 Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1078" end_page="1078" type="sub_section"> <SectionTitle> 3.1 The Tagger </SectionTitle> <Paragraph position="0"> The tagger used for the experiments is a standard HMM tagger using the Viterbi algorithm to calculate the most probable sequence of parts-of-speech for each string of words according to the following probabilistic biclass model:</Paragraph> <Paragraph position="1"> (1) argmax over t1, ..., tn of the product over i = 1, ..., n of P(wi | ti) P(ti | ti-1)</Paragraph> <Paragraph position="2"> The tagger is coupled with a tokenizer that segments a transcription into utterances (strings of words), which are fed to the tagger one by one. Besides ordinary words, the utterances may also contain markers for pauses and inaudible stretches of speech. 2</Paragraph> </Section> <Section position="2" start_page="1078" end_page="1079" type="sub_section"> <SectionTitle> 3.2 Training the Tagger </SectionTitle> <Paragraph position="0"> The lexical and contextual probabilities were estimated with relative frequencies in a tagged corpus of written Swedish, a subpart of the Stockholm-Umeå Corpus (SUC) containing 122,377 word tokens (18,343 word types). The tagset included 27 parts-of-speech. 3</Paragraph> </Section>
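To make Sections 3.1-3.2 concrete, here is a minimal Python sketch of how a biclass tagger of this kind can be trained with relative frequencies and decoded with the Viterbi algorithm. It is an illustration only, not the implementation used in the experiments: the function names and the toy corpus are invented, and the unknown-word and unknown-biclass heuristics of Section 3.4 are left to a separate sketch after Section 3.5.

    # A minimal sketch, not the authors' implementation: train a biclass (bigram)
    # HMM by relative frequencies and decode with the Viterbi algorithm, as in (1).
    from collections import defaultdict

    def train(tagged_sentences):
        """Estimate P(w | t), P(t | t_prev) and a word lexicon by relative frequencies."""
        word_tag = defaultdict(int)   # counts of (word, tag)
        bigram = defaultdict(int)     # counts of (previous tag, tag)
        tag_count = defaultdict(int)  # counts of tag
        for sent in tagged_sentences:
            prev = "<s>"              # pseudo-tag for the utterance boundary
            tag_count[prev] += 1
            for w, t in sent:
                word_tag[(w, t)] += 1
                bigram[(prev, t)] += 1
                tag_count[t] += 1
                prev = t
        lex = {wt: c / tag_count[wt[1]] for wt, c in word_tag.items()}  # P(w | t)
        ctx = {bt: c / tag_count[bt[0]] for bt, c in bigram.items()}    # P(t | t_prev)
        lexicon = defaultdict(set)                                      # word -> possible tags
        for w, t in word_tag:
            lexicon[w].add(t)
        return lex, ctx, lexicon

    def viterbi(words, lex, ctx, lexicon):
        """Most probable tag sequence under the biclass model (1), known words only."""
        chart = [{"<s>": (1.0, None)}]          # per position: tag -> (best prob, backpointer)
        for w in words:
            prev_cell, cell = chart[-1], {}
            for t in lexicon.get(w, ()):        # unknown words get no tags in this sketch
                p_lex = lex.get((w, t), 0.0)
                cell[t] = max(((p * ctx.get((prev, t), 0.0) * p_lex, prev)
                               for prev, (p, _) in prev_cell.items()),
                              default=(0.0, None))
            chart.append(cell)
        if not chart[-1]:
            return []
        t = max(chart[-1], key=lambda k: chart[-1][k][0])
        tags = []
        for cell in reversed(chart[1:]):        # follow backpointers right to left
            tags.append(t)
            t = cell[t][1] if t in cell else None
        return list(reversed(tags))

    # Toy example with an invented two-utterance training "corpus".
    corpus = [[("hon", "PN"), ("springer", "VB")],
              [("en", "DT"), ("hund", "NN"), ("springer", "VB")]]
    lex, ctx, lexicon = train(corpus)
    print(viterbi(["en", "hund", "springer"], lex, ctx, lexicon))  # ['DT', 'NN', 'VB']

The chart stores, for every position and tag, the best probability so far together with a backpointer, which is exactly what the maximization in (1) requires.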
<Section position="3" start_page="1079" end_page="1079" type="sub_section"> <SectionTitle> 3.3 The Spoken Language Lexicon </SectionTitle> <Paragraph position="0"> As noted earlier, the spoken language transcriptions contain many deviations from standard orthography. Therefore, in order to make optimal use of the written language statistics, a special lexicon is required to map spoken language variants onto their canonical written forms. For the present experiments we have developed a lexicon covering 2113 spoken language variants (which are mapped onto 1764 written language forms). We know, however, that this lexicon has less than total coverage and that many regular spoken language reductions are not currently covered. 4 (Footnote 4: A common example is the ending -igt, which appears in many adjectives (neuter singular) and adverbs and which is usually reduced to -it in ordinary speech.)</Paragraph> </Section> <Section position="4" start_page="1079" end_page="1079" type="sub_section"> <SectionTitle> 3.4 Unknown Words and Collocations </SectionTitle> <Paragraph position="0"> The occurrence of &quot;unknown words&quot;, i.e., words not occurring in the training corpus, is a notorious problem in (probabilistic) part-of-speech tagging.</Paragraph> <Paragraph position="1"> In our case, this problem is even more serious, since we know beforehand that some words will be treated as unknown although they do in fact occur in the training corpus (because of deviations from standard orthography). In the experiments reported below, we have allowed unknown words to belong to any part-of-speech which is possible in the given context, but with different weightings for different parts-of-speech. More precisely, when a word cannot be found in the lexicon, we replace the product in (2) (cf. equation 1 above) with the product in (3), where TTR(ti) is the type-token ratio of ti (in the training corpus).</Paragraph> <Paragraph position="2"> (2) P(wi | ti) P(ti | ti-1) (3) P(ti | ti-1) P(ti) TTR(ti) In this way, we favor parts-of-speech with high probability and high type-token ratio. In practice, this favors open classes (such as nouns, verbs, adjectives) over closed classes (determiners, conjunctions, etc.), and more frequent ones (e.g., nouns) over less frequent ones (e.g., adjectives).</Paragraph> <Paragraph position="3"> In addition to &quot;unknown words&quot;, we have to deal with &quot;unknown collocations&quot;, i.e., biclasses that do not occur in the training data. If these biclasses are simply assigned zero probability, then in the extreme case a word which is in the lexicon may fail to get a tag because the contextual probabilities of all its known parts-of-speech are zero in the given context. In order to prevent this, we use the following formula to assign contextual probabilities to unknown collocations: (4) P(ti | ti-1) = P(ti)K The constant K is chosen in such a way that the contextual probabilities defined by equation (4) are significantly lower than the &quot;real&quot; contextual probabilities derived from the training corpus, so that they only come into play when no known collocation is possible.</Paragraph> </Section> <Section position="5" start_page="1079" end_page="1079" type="sub_section"> <SectionTitle> 3.5 Pauses and Inaudible Speech </SectionTitle> <Paragraph position="0"> As indicated earlier, the utterances to be tagged included markers for pauses and inaudible speech, since these were thought to contain information relevant for the tagging process. The symbol for inaudible (and therefore untranscribed) speech (...) was simply added to the lexicon and assigned the &quot;part-of-speech&quot; major delimiter (mad), which is the category assigned to full stops, etc. in written texts. The result is that the tagger will not treat the last word before the untranscribed passage as immediate context for the first word after the passage.</Paragraph> <Paragraph position="1"> For pauses we have experimented with two different treatments, which are compared below. We refer to these treatments as tagging conditions 1 and 2, respectively: * Condition 1: Pauses are simply ignored in the tagging process, which means that the last word before a pause is treated as immediate context for the first word after the pause.</Paragraph> <Paragraph position="2"> * Condition 2: Pause symbols are added to the lexicon, where short pauses are categorized as minor delimiters (mid) (commas, etc.), while long pauses are categorized as mad (full stops, etc.), which means that the contextual probabilities of words occurring before and after pauses in spoken language will be modelled on the probabilities of words occurring before and after certain punctuation marks in written language.</Paragraph> <Paragraph position="3"> It was hypothesized that, in certain cases, the tagger might perform better under condition 2, since pauses in spoken language often, though by no means always, indicate major phrase boundaries or even breaks in the grammatical structure.</Paragraph> </Section>
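Continuing the earlier sketch, the fragment below illustrates how the normalization lexicon of Section 3.3 and the two fall-backs of Section 3.4 could enter the per-position score of the biclass model. It assumes the lex, ctx and lexicon tables produced by the training sketch; the variant lexicon, the constant K and all names are invented placeholders, and K only needs to be small enough that the back-off in (4) never outweighs an observed biclass.

    # Illustrative only: the spoken-to-written normalization of Section 3.3 and the
    # two fall-backs of Section 3.4, folded into a per-position score.  The variant
    # lexicon, the constant K and all names are invented placeholders.
    from collections import defaultdict

    SPOKEN_TO_WRITTEN = {"nåt": "något", "sen": "sedan"}  # toy variant lexicon (Section 3.3)
    K = 1e-6                                              # small constant of equation (4)

    def tag_statistics(tagged_sentences):
        """Unigram tag probabilities P(t) and type-token ratios TTR(t) from the training data."""
        tokens, types = defaultdict(int), defaultdict(set)
        for sent in tagged_sentences:
            for w, t in sent:
                tokens[t] += 1
                types[t].add(w)
        total = sum(tokens.values())
        p_tag = {t: c / total for t, c in tokens.items()}
        ttr = {t: len(types[t]) / tokens[t] for t in tokens}
        return p_tag, ttr

    def score(word, t, t_prev, lex, ctx, lexicon, p_tag, ttr):
        """Score tag t after t_prev for one word, with the paper's two fall-backs."""
        word = SPOKEN_TO_WRITTEN.get(word, word)      # map to canonical written form (Section 3.3)
        p_ctx = ctx.get((t_prev, t), p_tag[t] * K)    # unseen biclass: back off to P(t)K as in (4)
        if word in lexicon:                           # known word: the product in (2)
            return lex.get((word, t), 0.0) * p_ctx
        return p_ctx * p_tag[t] * ttr[t]              # unknown word: the product in (3)

In a full tagger this score would replace the product inside the Viterbi maximization of the first sketch, so that every word receives some analysis even when its lexical or contextual statistics are missing.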
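The two tagging conditions of Section 3.5 differ only in how pause symbols are preprocessed and entered into the lexicon. The sketch below is one possible rendering: the pause symbols "/" and "//" are placeholders rather than the transcription standard's actual notation, while "..." is the inaudible-speech symbol mentioned above.

    # One possible rendering of the two tagging conditions in Section 3.5.  The pause
    # symbols "/" (short) and "//" (long) are placeholders, not the transcription
    # standard's actual notation; "..." is the inaudible-speech symbol.

    SHORT_PAUSE, LONG_PAUSE, INAUDIBLE = "/", "//", "..."

    def prepare_utterance(tokens, condition):
        """Return the token sequence that is actually fed to the tagger."""
        if condition == 1:
            # Condition 1: pauses are dropped, so the words on either side of a
            # pause become immediate context for one another.
            return [tok for tok in tokens if tok not in (SHORT_PAUSE, LONG_PAUSE)]
        return list(tokens)       # Condition 2: pauses are kept and tagged as delimiters

    def delimiter_entries(condition):
        """Extra lexicon entries: inaudible speech is always a major delimiter (mad);
        under condition 2, short pauses behave like commas (mid) and long pauses
        like full stops (mad)."""
        entries = {INAUDIBLE: "mad"}
        if condition == 2:
            entries[SHORT_PAUSE] = "mid"
            entries[LONG_PAUSE] = "mad"
        return entries

    # Example: the pause disappears under condition 1 but stays (and is tagged mid,
    # like a comma in written text) under condition 2.
    print(prepare_utterance(["ja", "/", "det", "tror", "jag"], 1))  # ['ja', 'det', 'tror', 'jag']
    print(prepare_utterance(["ja", "/", "det", "tror", "jag"], 2))  # ['ja', '/', 'det', 'tror', 'jag']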
<Section position="6" start_page="1079" end_page="1079" type="sub_section"> <SectionTitle> 3.6 Test Corpus </SectionTitle> <Paragraph position="0"> The test corpus was composed of a set of 47 utterances, chosen randomly from a corpus of transcribed spoken Swedish containing 267,206 words.</Paragraph> <Paragraph position="1"> The utterance length varied from 1 word to 688 words (not counting pauses as words), with a mean length of 29 words. The test corpus contained 1360 word tokens and 498 word types.</Paragraph> </Section> </Section> <Section position="5" start_page="1079" end_page="1080" type="metho"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> The number of correctly tagged word tokens under condition 1 was 1153 out of a total of 1360, i.e., 84.8%. The results for condition 2 were slightly better: 1248/1457 = 85.7%. However, the latter figures also include the tagged pauses, for which only one category was possible. If these tokens are subtracted, the results for condition 2 are: 1151/1360 = 84.6%.</Paragraph> </Section> </Paper>