<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1008">
  <Title>Generating statistical language models from interpretation grammars in dialogue systems</Title>
  <Section position="4" start_page="57" end_page="58" type="metho">
    <SectionTitle>
2 Description of Corpora
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="57" end_page="58" type="sub_section">
      <SectionTitle>
2.1 The MP3 corpus
</SectionTitle>
      <Paragraph position="0"> The domain that we are considering in this paper is the domain of an MP3 player application.</Paragraph>
      <Paragraph position="1"> The talking MP3 player, DJGoDiS, is one of several applications that are under development in the TALK project. It has been built as a GoDiS application with the TrindiKit toolkit (Larsson et al, 2002) and the GoDiS dialogue system (Larsson, 2002), and works as a voice interface to a graphical MP3 player. The user can, among other things, change settings, choose stations or songs to play, and create playlists. The current version of DJGoDiS works in both English and Swedish.</Paragraph>
      <Paragraph position="2"> The interpretation and generation grammars are written in the GF grammar formalism. GF is being further developed in the project to adapt it for use in spoken dialogue systems. This adaptation includes the facility of generating Nuance recognition grammars from the interpretation grammar and the possibility of generating corpora from the grammars. The interpretation grammar for the domain, written in GF, translates user utterances to dialogue moves and thereby holds all possible interpretations of user utterances (Ljunglöf et al, 2005). We used GF's facilities to generate a corpus in Swedish consisting of all possible meaningful utterances generated by the grammar to a certain depth of the analysis trees in GF's abstract syntax, as explained in (Weilhammer et al, 2006).</Paragraph>
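The idea of exhaustively generating every utterance up to a bounded tree depth can be pictured with a small sketch. The toy grammar, symbol names and depth bound below are all invented for illustration; GF's actual corpus generation operates over its abstract syntax trees rather than a flat CFG.

```python
# Illustrative sketch: exhaustively expanding a toy interpretation grammar
# up to a fixed tree depth, in the spirit of GF's corpus generation.
# This is NOT the actual DJGoDiS grammar.

TOY_GRAMMAR = {
    "S":      [["PLAY", "SONG"], ["PLAY", "SONG", "BY", "ARTIST"]],
    "PLAY":   [["play"], ["i", "want", "to", "listen", "to"]],
    "BY":     [["with"], ["by"]],
    "SONG":   [["mamma", "mia"], ["waterloo"]],
    "ARTIST": [["abba"]],
}

def expand(symbol, depth):
    """Return all terminal word sequences derivable from symbol within depth steps."""
    if symbol not in TOY_GRAMMAR:          # terminal word
        return [[symbol]]
    if depth == 0:                         # depth budget exhausted
        return []
    results = []
    for rhs in TOY_GRAMMAR[symbol]:
        partials = [[]]
        for sym in rhs:                    # cartesian product over the RHS
            partials = [p + e for p in partials for e in expand(sym, depth - 1)]
        results.extend(partials)
    return results

corpus = [" ".join(words) for words in expand("S", 4)]
```

As in the real corpus, the output is highly repetitive: every song combines with every artist, and each utterance occurs exactly once.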
      <Paragraph position="3"> As the current grammar is under development, it is not complete and some linguistic structures are missing. The grammar is written on the phrase level, accepting spoken language utterances such as "next, please".</Paragraph>
      <Paragraph position="4"> The corpus of possible user utterances resulted in around 320,000 user utterances (about 3 million words) corresponding to a vocabulary of only 301 words. The database of songs and artists in this first version of the application is limited to 60 Swedish songs, 60 Swedish artists, 3 albums and 3 radio stations. The vocabulary may seem small considering the number of songs and artists included, but the small size is due to a huge overlap of words between song and artist names, as pronouns (such as Jag (I) and Du (You)) and articles (such as Det (The)) are very common. This corpus is very domain-specific, as it includes many artist names, songs and radio stations that often consist of rare words. It is also very repetitive, covering all combinations of songs and artists in utterances such as "I want to listen to Mamma Mia with Abba". All utterances in the corpus occur exactly once.</Paragraph>
    </Section>
    <Section position="2" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
2.2 The GSLC corpus
</SectionTitle>
      <Paragraph position="0"> The Gothenburg Spoken Language Corpus (GSLC) consists of transcribed Swedish spoken language from different social activities such as auctions, phone calls, meetings, lectures and task-oriented dialogue (Allwood, 1999). To be able to use the GSLC corpus for language modelling, it was pre-processed to remove annotations and all non-alphabetic characters. The final GSLC corpus consisted of about 1,300,000 words with a vocabulary of almost 50,000 words.</Paragraph>
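A cleaning step of this kind might look as follows. The GSLC transcription conventions are not described here, so this sketch simply assumes bracketed annotations and strips everything non-alphabetic while keeping the Swedish letters å, ä and ö.

```python
import re

# Hypothetical pre-processing of a transcription line: drop bracketed
# annotations, keep only (Swedish) letters, and normalise whitespace.
# The annotation format is an assumption, not the documented GSLC scheme.

def clean_transcription(line):
    line = re.sub(r"\[[^\]]*\]", " ", line)         # drop [annotations]
    line = re.sub(r"[^a-zåäöA-ZÅÄÖ ]", " ", line)   # keep letters only
    return " ".join(line.lower().split())           # collapse whitespace

print(clean_transcription("ja [skratt] det är ju KUL!"))
```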
    </Section>
    <Section position="3" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
2.3 The newspaper corpus
</SectionTitle>
      <Paragraph position="0"> We have also used a corpus consisting of a collection of Swedish newspaper texts of 397 million words. Additionally, we have created a subcorpus of the newspaper corpus by extracting only the sentences that include domain-related words. By domain-related words we mean words typical of an MP3 domain, such as "music", "mp3-player", "song" etc. This domain vocabulary was hand-crafted. The domain-adapted newspaper corpus, obtained by selecting sentences where these words occurred, consisted of about 15 million words, i.e. 4% of the larger corpus.</Paragraph>
    </Section>
    <Section position="4" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
2.4 The Test Corpus
</SectionTitle>
      <Paragraph position="0"> To collect a test set, we asked students to describe how they would address a speech-enabled MP3 player by writing Nuance grammars that would cover the domain and its functionality. Another group of students evaluated these grammars by recording utterances they thought they would say to an MP3 player. One of the Nuance grammars was used to create a development test set by generating a corpus of 1500 utterances from it. The corpus generated from another grammar, written by other students, was used as the evaluation test set. Added to the evaluation test set were the transcriptions of the recordings made by the third group of students that evaluated both grammars.</Paragraph>
      <Paragraph position="1"> This resulted in an evaluation test set of 1700 utterances. The recording test set was made up partly of the students' recordings. Additional recordings were carried out by letting people at our lab record randomly chosen utterances from the evaluation test set. We also had a demo running for a short time to collect user interactions at a demo session. The final test set included 500 recorded utterances from 26 persons. This test set has been used to compare recognition performance between the different models under consideration.</Paragraph>
      <Paragraph position="2"> The recording test set is only an approximation of the real task and conditions, as the students merely captured how they think they would act in an MP3 task. Their actual interaction in a real dialogue situation may differ considerably, so ideally we would want more recordings from dialogue system interactions, which at the moment constitute only a fifth of the test set. However, until we can collect more recordings we will have to rely on this approximation.</Paragraph>
      <Paragraph position="3"> In addition to the recorded evaluation test set, a second set of recordings was created covering only in-grammar utterances, by randomly generating a test set of 300 utterances from the GF grammar. These were recorded by 8 persons. This test set was used for a contrastive comparison of in-grammar recognition performance.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="58" end_page="59" type="metho">
    <SectionTitle>
3 Language modelling
</SectionTitle>
    <Paragraph position="0"> To generate the different trigram language models we used the SRI language modelling toolkit (Stolcke, 2002) with Good-Turing discounting.</Paragraph>
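The trigram statistics underlying such a model can be illustrated as below. The real models were built with SRILM's ngram-count, which also applies Good-Turing discounting and back-off; those steps are omitted here, and the BOS/EOS markers are stand-ins for SRILM's sentence-boundary tokens.

```python
from collections import Counter

# Minimal illustration of raw trigram counting for a language model.
# SRILM's ngram-count additionally discounts (Good-Turing) and backs off.

BOS, EOS = "BOS", "EOS"   # stand-ins for sentence-boundary markers

def trigram_counts(sentences):
    tri = Counter()
    for sent in sentences:
        words = [BOS, BOS] + sent.split() + [EOS]   # pad each sentence
        for i in range(len(words) - 2):
            tri[tuple(words[i:i + 3])] += 1
    return tri

counts = trigram_counts(["play mamma mia", "play waterloo"])
print(counts[(BOS, BOS, "play")])   # both sentences start with "play"
```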
    <Paragraph position="1"> The first model was generated directly from the MP3 corpus we got from the GF grammar. This simple SLM (named MP3GFLM) has the same vocabulary as the Nuance Grammar and models the same language as the GF grammar. This model was chosen to see if we could increase flexibility and robustness in such a simple way while maintaining in-grammar performance.</Paragraph>
    <Paragraph position="2"> We also created two other simple SLMs: a class-based one (with the classes Song, Artist and Radiostation) and a model based on a variant of the MP3 corpus where the utterances in which songs and artists co-occur would only match real artist-song pairs (i.e. including some music knowledge in the model).</Paragraph>
    <Paragraph position="3"> These three SLMs were the three basic MP3 models considered, although we only report the results for the MP3GFLM in this article (the class-based model gave slightly worse results and the other slightly better results).</Paragraph>
    <Paragraph position="4"> In addition to this, we used our general corpora to produce three different models: GSLCLM from the GSLC corpus, NewsLM from the newspaper corpus and DomNewsLM from the domain-adapted newspaper corpus.</Paragraph>
    <Paragraph position="5"> 3.1 Interpolating the GSLC corpus and the MP3 corpus  A technique used in language modelling to combine different SLMs is linear interpolation (Jelinek &amp; Mercer, 1980). This is often used when the domain corpus is too small and a bigger corpus is available. There have been many attempts at combining domain corpora with news corpora, as news has been the biggest type of corpus available, and this has given slightly better models (Janiszek et al, 1998; Rosenfeld, 2000a). Linear interpolation has also been used when building state-dependent models by combining the state models with a general domain model (Xu &amp; Rudnicky, 2000; Solsona et al, 2002).</Paragraph>
    <Paragraph position="6"> Rosenfeld (2000a) argues that a little more domain corpus is always better than a lot more training data outside the domain. Many of these interpolation experiments have been carried out by adding news text, i.e. written language. In this experiment we are going to interpolate our domain model (MP3GFLM) with a spoken language corpus, the GSLC, to see if this improves perplexity and recognition rates. As the MP3 corpus is generated from a grammar without probabilities, this is hopefully a way to obtain better and more realistic estimates of words and word sequences. Ideally, what we would like to capture from the GSLC corpus is language that is invariant from domain to domain. However, Rosenfeld (2000b) is quite pessimistic about this, arguing that it is not possible with today's interpolation methods. The GSLC corpus is also quite small.</Paragraph>
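Linear interpolation itself is a weighted sum of the two models' probabilities. A minimal sketch over toy unigram tables (all numbers invented):

```python
# Jelinek-Mercer linear interpolation of two language models:
#   P(w|h) = lam * P_domain(w|h) + (1 - lam) * P_general(w|h)
# The probability tables below are made up for illustration.

def interpolate(p_domain, p_general, lam):
    vocab = set(p_domain).union(p_general)
    return {w: lam * p_domain.get(w, 0.0) + (1 - lam) * p_general.get(w, 0.0)
            for w in vocab}

mp3 = {"spela": 0.5, "abba": 0.5}    # toy domain unigram model
gslc = {"spela": 0.1, "ja": 0.9}     # toy general unigram model
mixed = interpolate(mp3, gslc, 0.65)
print(round(mixed["spela"], 3))      # 0.65*0.5 + 0.35*0.1 = 0.36
```

Words unseen in the domain model (here "ja") still receive probability mass from the general model, which is exactly the robustness gain sought.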
    <Paragraph position="7"> The interpolation was carried out with the standard linear interpolation formula P(w|h) = lambda * P_MP3GF(w|h) + (1 - lambda) * P_GSLC(w|h).</Paragraph>
    <Paragraph position="9"> The optimal lambda weight was estimated at 0.65 with the SRILM toolkit using the development test set.</Paragraph>
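The weight estimation can be sketched as a small EM loop over per-token probabilities on the development set (SRILM's compute-best-mix script does this in practice); the token probabilities below are invented.

```python
# Hedged sketch of EM estimation of an interpolation weight on a dev set.
# dev_probs holds, for each dev token, the pair (p_domain, p_general)
# assigned by the two component models; the numbers are invented.

def estimate_lambda(dev_probs, iterations=20, lam=0.5):
    for _ in range(iterations):
        # E-step: posterior responsibility of the domain model per token
        resp = [lam * pd / (lam * pd + (1 - lam) * pg)
                for pd, pg in dev_probs]
        # M-step: the new weight is the average responsibility
        lam = sum(resp) / len(resp)
    return lam

dev = [(0.4, 0.1), (0.3, 0.05), (0.01, 0.2)]
print(round(estimate_lambda(dev), 2))
```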
    <Paragraph position="11"> In addition to these models we created a model where we interpolated both the GSLC model and the domain adapted newspaper model with MP3GFLM. This model was named TripleLM.</Paragraph>
    <Paragraph position="12">  The resulting mixed models have a huge vocabulary as the GSLC corpus and the newspaper corpus include thousands of words. This is not a convenient size for recognition as it will affect accuracy and speed. Therefore we tried to find an optimal vocabulary combining the small MP3 vocabulary of around 300 words with a smaller part of the GSLC vocabulary and the newspaper vocabulary.</Paragraph>
    <Paragraph position="13"> We used the CMU toolkit (Clarkson &amp; Rosenfeld, 1997) to obtain the most frequent words of the GSLC corpus and the News Corpus.</Paragraph>
    <Paragraph position="14"> We then merged these vocabularies with the small MP3 vocabulary. It should be noted that the overlap between the most frequent GSLC words and the MP3 vocabulary was quite low (73 words for the smallest vocabulary) showing the peculiarity of the MP3 domain. We also added the vocabulary used for extracting domain data to this mixed vocabulary. This merging of vocabularies resulted in a vocabulary of 1153 words. The vocabulary for the MP3GFLM and the MP3NuanceGr is the small MP3 vocabulary.</Paragraph>
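The merge itself is a simple set union of the domain vocabulary with the top-frequency general words. A sketch with invented words and counts:

```python
from collections import Counter

# Sketch of the vocabulary merge: the n most frequent general-corpus words
# are combined with the small domain vocabulary. All words are invented.

def top_words(corpus_tokens, n):
    return {w for w, _ in Counter(corpus_tokens).most_common(n)}

mp3_vocab = {"spela", "abba", "waterloo"}
gslc_tokens = "ja ja det det det är spela".split()
merged = mp3_vocab.union(top_words(gslc_tokens, 3))
print(sorted(merged))
```

As in the paper, the overlap between the two sources is small, so the merged vocabulary is close to the sum of the parts.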
  </Section>
</Paper>