<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2192">
  <Title>Tagging Spoken Language Using Written Language Statistics</Title>
  <Section position="6" start_page="1080" end_page="1080" type="evalu">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> The overall accuracy rate for the tagger is around 85%, which is not too impressive when compared to the results reported for written language. However, if we take a closer look at the results, it seems that an important source of error is the lack of coverage of the lexicon and the training corpus. Of the two hundred or so errors made by the tagger, more than eighty concern tokens that could not be matched with any word form occurring in the training corpus. The most common type of error in this class is that a word is erroneously tagged as a noun. It is likely that this is an artifact of the way we assign lexical probabilities to unknown words and that a more sophisticated method may improve the results for this class of words. More importantly, though, if we only consider the results for words that were known to the tagger, the accuracy rate goes up to about 90%, and most of the remaining errors concern classes that are notoriously difficult even under normal circumstances, such as adverbs vs verb particles and prepositions vs subordinating conjunctions. Taken together, these results seem to indicate that with a more extensive lexicon, a larger training corpus of written language, and perhaps a more sophisticated treatment of unknown words, it should be possible to obtain results approaching those obtained for written language.</Paragraph>
    <Paragraph position="1"> As regards the two treatments of pauses, the results are virtually identical in terms of overall accuracy rate. If we look at individual words, however, we find that the part-of-speech assignment differs in 25 cases. In 10 of these cases, the correct part-of-speech is assigned under condition 1; in 9 cases, the correct tag is found under condition 2; and in 6 cases, both conditions yield an incorrect assignment. The conclusion to draw from these results is probably that the treatment of pauses as delimiters yields a better analysis in cases where the pause marks an interruption or major phrase boundary, while it is better to ignore pauses when they do not mark any break in grammatical structure. Unfortunately, these two types of pauses seem to be equally common, which means that neither treatment results in any gain in overall accuracy. However, preliminary observations seem to indicate that it may be possible to get better results if a more fine-grained analysis of pause length is taken into account. This presupposes, of course, that this kind of information is available in the transcriptions.</Paragraph>
  </Section>
</Paper>