File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/05/p05-2021_intro.xml
Size: 2,376 bytes
Last Modified: 2025-10-06 14:03:09
<?xml version="1.0" standalone="yes"?> <Paper uid="P05-2021"> <Title>Speech Recognition of Czech - Inclusion of Rare Words Helps</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Large vocabulary continuous speech recognition of inflective languages is a challenging task for two main reasons. First, rich morphology generates a huge number of word forms that limited-size dictionaries do not capture, which leads to worse recognition results. Second, relatively free word order admits an enormous number of word sequences and thus impoverishes n-gram language models. In this paper we are concerned with the former issue.</Paragraph> <Paragraph position="1"> Previous work on excessive vocabulary growth follows two main lines. Authors have either broken words into sub-word units or adapted dictionaries in a multi-pass scenario. On Czech data, (Byrne et al., 2001) suggest using linguistically motivated recognition units: words are broken down into stems and endings, which serve as the recognition units in the first recognition phase; in the second phase, stems and endings are concatenated. On Serbo-Croatian, (Geutner et al., 1998) also tested morphemes as the recognition units. Both groups of authors agreed that this approach is not beneficial for speech recognition of inflective languages. Vocabulary adaptation, however, brought considerable improvement. Both (Icring and Psutka, 2001) on Czech and (Geutner et al., 1998) on Serbo-Croatian reported a substantial reduction of word error rate. Both groups followed the same procedure.</Paragraph> <Paragraph position="2"> In the first pass, they used a dictionary composed of the most frequent words. The generated lattices were then processed to obtain a list of all words that appeared in them.
This list served as the basis for a new adapted dictionary, into which morphological variants were added.</Paragraph> <Paragraph position="3"> It can be concluded that large corpora contain a host of words which are ignored when estimating the language models used in the first pass, even though these rare words can bring substantial improvement. Therefore, it is desirable to explore how to incorporate rare or even unseen words into a language model that can be used in the first pass.</Paragraph> </Section> </Paper>