The "Casual Cashmere Diaper Bag": Constraining Speech Recognition Using Examples

2 Background

Currently, the best speaker-independent continuous speech recognition (SR) is orders of magnitude weaker than a human native speaker at recognizing arbitrary sequences of words. That is, humans do well on clearly spoken sequences of words chosen randomly from a pool of tens of thousands of words, while unconstrained SR systems do as well only when the vocabulary is much smaller, in the range of hundreds of words. When the recognition is to be done over the telephone, the reduced signal-to-noise ratio of the speech data makes this weakness even more dramatic.

2.1 Language Models

In order to achieve useful recognition rates, current SR systems impose constraints beyond just a limited vocabulary, either by specifying an exact grammar of the sequences which are allowed or by providing statistical likelihoods for word sequences (n-gram statistics). The grammars are built by hand as context-free formalisms determining allowable word sequences. The statistical models use tables of the "raw" probabilities of each word (unigram), usually augmented with additional tables of the likelihood of each word given each possible preceding word (bigram) or each possible pair of two preceding words (trigram). These statistical systems have been experimentally extended to include n-grams where n exceeds three, but even for higher n they generally express only the probability of a word based on the adjacent preceding words.
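To make the bigram case concrete, the likelihood tables are estimated by counting adjacent word pairs in a training corpus. The following minimal sketch is our own illustration, not code from the paper:

    from collections import Counter

    def train_bigram_model(corpus):
        """Build unigram and bigram count tables from a list of sentences.

        The bigram likelihood P(w2 | w1) is approximated as
        count(w1 w2) / count(w1).
        """
        unigrams = Counter()
        bigrams = Counter()
        for sentence in corpus:
            words = sentence.lower().split()
            unigrams.update(words)
            bigrams.update(zip(words, words[1:]))  # adjacent pairs only
        return unigrams, bigrams

    def bigram_prob(unigrams, bigrams, w1, w2):
        if unigrams[w1] == 0:
            return 0.0
        return bigrams[(w1, w2)] / unigrams[w1]

    # Toy training set (hypothetical): two catalog-style utterances.
    unigrams, bigrams = train_bigram_model(
        ["show me the sweaters", "show me the jackets"])
    print(bigram_prob(unigrams, bigrams, "show", "me"))       # 1.0
    print(bigram_prob(unigrams, bigrams, "the", "sweaters"))  # 0.5

Note that such a model conditions only on the adjacent preceding words; it captures no longer-distance structure.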
Hand-built grammars can provide exquisitely fine control over the word sequences recognized, but their construction is difficult and painstaking, even for those who are practiced in the art.

Conversely, statistical "grammars" can be built automatically by running an analysis program over an appropriate collection of the kinds of sentences one wishes to recognize. A prime example of this technology is the ARPA-initiated Wall Street Journal (WSJ) dictation project, where recognizers trained on the text of previously-printed WSJ articles are tested by having them recognize text read aloud from a later edition of the WSJ. Unfortunately, the database of WSJ text used in these experiments contained approximately 40 million words, and researchers using this database have indicated that their speech systems worked better when they were able to double the size of their training set (Schwartz et al., 1994).

While the recognition achieved on the WSJ with this technique is impressive, the information embodied in the statistical model is so specific that there is not much "transfer" to recognizing text that varies in style, even when content and vocabulary are shared. [cite example of NYTimes financial stories and the ads in WSJ not working well]

2.2 Command and Control

In the domain of command and control of computer programs, the utterances to be recognized do not correspond directly to any existing body of text that could play the role the WSJ text plays in training the dictation recognizers. Traditional statistical modeling requires a relatively huge database of example utterances, and the models do not include any abstraction of the words, so actual co-occurrences of words are necessary to count the relative frequency of each. For many applications of speech recognition there simply is not enough training data to support statistical models.

2.2.1 Automating the Lands' End catalog

We discovered the need for a new method of restricting a speech recognizer when we attempted to implement an automated customer service agent to interact with users wanting to browse and order items from an online catalog. Lands' End Direct Merchants provided a collection of "video assets" from one of their catalogs for this experiment. A typical "page" illustrated and described an item or a collection of related items, and might have associated with it additional information such as a video clip, color and size pages, and indications of the pages that are specializations of this page. We prototyped a speech-controlled application which allows a user to interact with the automated agent by telephone while viewing the video on a television.[1] Allowing a free conversational dialogue and supporting a large subset of the myriad ways an untrained caller might describe the catalog items overwhelmed our speech recognizer.

[1] In a real installation, the television would be connected to a pay-per-view channel or a cable system such as in a hotel.
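The page structure just described might be modeled roughly as follows; this is a hypothetical sketch of our own, with illustrative field names that are not taken from the actual system:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class CatalogPage:
        """One "page" of the video-accessible catalog."""
        description: str                  # e.g. "women's chino slacks"
        video_clip: Optional[str] = None  # associated video asset, if any
        color_size_pages: List["CatalogPage"] = field(default_factory=list)
        specializations: List["CatalogPage"] = field(default_factory=list)

    # A general page with a more specific page beneath it.
    menswear = CatalogPage(description="men's clothes")
    menswear.specializations.append(CatalogPage(description="dress shirts"))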
3 How Restrictive Is a Grammar?

Writing a grammar to allow a user to make queries about the contents of this computerized catalog was the concrete example that drove our new approach.

3.1 What do users say?

We collected examples of what users said to an expert human service representative in a "Wizard of Oz" experiment (Yankelovich, forthcoming). Besides the action words and phrases ("can you show me <itemPhrase>?" or "what <itemPhrase> do you carry?") in a shopping query, the user commonly supplied a phrase that names or describes the item of interest.

D: I'd like a soft-sided attache.
<displays luggage page>
D: The canvas line.
C: How about kids?
B: Can I see the squall jacket?
C: Could I see the men's clothes?
<displays menswear page>
C: Dress shirts.
S: Could we switch to children's clothing.
L: Let's look at some casual dresses.
M: I'd like to see the sweaters please.
S: I'm looking for things from bed and bath.
B: Let's go back to sweaters.
B: Can I go back to the main screen.
L: I'll go back to the womens.
A: I'm looking for a blazer and slacks and skirts to go with it.
C: I need a flat sheet and a fitted sheet in queen.

Example queries users said to the "Wizard" system.

3.1.1 A grammar to collect semantics

We implemented the prototype Lands' End system using our SpeechActs (Martin et al., 1996) system, collecting the relevant semantics from utterances with a simple grammar specifying the allowable phrases.

One over-simplified grammar of such "item specification" phrases would allow any basic item (such as "pants") to be modified by any combination of metastyle, pattern style, color, size, gender, wearer's age, fabric type, fabric style, and maker's name. A particular sweater could be referred to as "the petite women's medium dusty sage jewel-neck cashmere fine-knit 'drifter' sweater". While no one would ever spontaneously utter this monster, we cannot predict which portion of these options will be used in any given utterance. Such an accepting grammar works just fine for extracting the meaning from a written form of the item description, and in fact is used in the Lands' End system to identify what items are displayed on each "page" of the video-accessible catalog.

[Figure: Example UG rule allowing many possible modifiers]

3.2 Semantic grammar is too loose for SR

Unfortunately, the perplexity of the grammar produced by the cross product of all these choices is so large that the word accuracy of the speech recognition becomes uselessly low. Phrases that no user would ever utter are "heard" by the SR engine; the "casual cashmere diaper bag" mentioned in the title of this paper is one of the more outrageous combinations that passes muster under this weakly-constraining grammar.

If the lexical entry for every modifier were marked with a feature containing the set of things it could realistically modify (or, better yet, the set of classes of things), then the grammar could be written to allow only the "reasonable" combinations and to rule out the ridiculous ones, reducing the perplexity. With a grammar compiler that accepts such restrictions based on features in the lexicon, this markup appears to be a feasible solution. The grammar writer could create and record classes of basic items, noting that "chinos" and "jeans" are "tough clothing" and then allowing them to be associated only with fabrics appropriate for "tough" clothes, as sketched below.
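As a rough illustration of this class-based restriction (our own sketch, not the paper's unified grammar formalism), a modifier-item pair could be licensed only when the two share a class:

    # Hypothetical lexicon fragment: each basic item carries a set of
    # classes, and each modifier lists the classes it may realistically
    # modify.
    ITEM_CLASSES = {
        "chinos": {"tough-clothing"},
        "jeans": {"tough-clothing"},
        "blouses": {"dressy-clothing"},
    }

    MODIFIER_CLASSES = {
        "denim": {"tough-clothing"},
        "lace": {"dressy-clothing"},
        "silk": {"dressy-clothing"},
    }

    def allowed(modifier, item):
        """License a modifier-item pair only if they share a class."""
        return bool(MODIFIER_CLASSES.get(modifier, set()) &
                    ITEM_CLASSES.get(item, set()))

    print(allowed("denim", "jeans"))   # True
    print(allowed("lace", "chinos"))   # False: ruled out as unrealistic
    print(allowed("silk", "blouses"))  # True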
This strategy would block combinations such as "lace chinos" but allow "silk blouses" and "denim jeans."

The biggest disadvantage of requiring a grammar writer to figure out and record the features that determine allowable modifiers is the large amount of detailed work required to make such annotations. If these markings could be derived automatically from some pre-existing or easily-created data, then the task would be much reduced, and the cost of adding new items to the catalog would be much smaller. (In the particular case of modeling a catalog, the effort required to accommodate each subsequent revision of the items carried is a primary concern.[2])

[Figure: Example indexing of an item page described as "women's chino slacks" and as "casual cotton pants."]

In the Lands' End example, we already have item descriptions which are part of their standard catalog database. We use these descriptive phrases both to navigate to the item or item collection (such as "men's jackets") the user has requested and to verify that the semantic grammar and lexicon will accept the phrases used by the catalog designers. Any new version of the catalog will necessarily already have these phrases created for it; using them additionally for grammar restriction almost automates the update chore for new editions of the catalog.

If the grammar were written incorporating tests that require the lexical markings indicating allowable modifiers, then it would reject any phrase that lacked the needed marks. If such a grammar were used with a "bare" lexicon (one lacking these modifier markings), it would not support parsing the page descriptors, and would compile into a speech grammar allowing only bare item names, devoid of any modifiers. We addressed this problem by adding the ability to switch the restrictions on or off, and then turning them off when parsing the (written) page descriptors. (See the example of switched tests in a grammar rule below.)

3.2.3 Automating the markup

Indexing and then processing the results of all the page descriptor parses provides the information needed to automatically mark up the lexicon with the compatibility results derived from the page descriptors. Once the lexicon has been enhanced with this information, the restrictions can be turned on while the unified grammar is used by the speech recognizer. In our system, we compile the unified grammar to produce a BNF reflecting the restrictions, but logically these restrictions could be applied "on the fly" by a speech recognizer or used in post-processing to choose among the n-best alternatives from a less restricted SR. Regardless of how it is implemented, the resultant grammar will not allow "lace jeans" simply because no page description phrase mentions any such thing.

[Figure: Example grammar rule with switched tests. The *= operator is the switched test operator in this example grammar rule.]
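To summarize the automation step in 3.2.3, here is a minimal sketch of deriving lexical compatibility markings from the page descriptor parses; the parses and names are hypothetical illustrations of ours, not the SpeechActs implementation:

    from collections import defaultdict

    # Hypothetical page descriptor phrases, already reduced to
    # (modifiers, item) pairs by parsing with the restrictions off.
    PARSED_DESCRIPTORS = [
        (["women's", "chino"], "slacks"),
        (["casual", "cotton"], "pants"),
        (["denim"], "jeans"),
        (["silk"], "blouses"),
    ]

    def derive_lexicon_markings(parses):
        """Record, for each modifier, the items it actually occurs with."""
        markings = defaultdict(set)
        for modifiers, item in parses:
            for modifier in modifiers:
                markings[modifier].add(item)
        return markings

    markings = derive_lexicon_markings(PARSED_DESCRIPTORS)
    # With the restrictions switched back on, "lace jeans" is rejected
    # simply because no page descriptor ever paired "lace" with "jeans".
    print("jeans" in markings.get("lace", set()))   # False
    print("jeans" in markings.get("denim", set()))  # True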