<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1001"> <Title>CommandTalk: A Spoken-Language Interface for Battlefield Simulations</Title> <Section position="3" start_page="0" end_page="5" type="intro"> <SectionTitle> 2 Architecture </SectionTitle>
<Paragraph position="0"> CommandTalk combines a number of separate components, developed independently, some of which are implemented in C and others in Prolog. These components are integrated through the use of the Open Agent Architecture (OAA) (Cohen et al., 1994).</Paragraph>
<Paragraph position="1"> OAA makes use of a facilitator agent that plans and coordinates interactions among agents during distributed computation. Other processes are encapsulated as agents that register with the facilitator the types of messages they can respond to. An agent posts a message in an Interagent Communication Language (ICL) to the facilitator, which dispatches the message to the agents that have registered their ability to handle messages of that type. This mediated communication makes it possible to &quot;hot-swap&quot; or restart individual agents without restarting the whole system. The ICL communications mechanism is built on top of TCP/IP, so an OAA-based system can be distributed across both local- and wide-area networks based on Internet technology. OAA also provides an agent library to simplify turning independent components into agents. The agent library supplies common functionality to agents in multiple languages for multiple platforms, managing network communication, ICL parsing, trigger and monitor handling, and distributed message primitives.</Paragraph>
<Paragraph position="2"> CommandTalk is implemented as a set of agents communicating as described above. The principal agents are described in the following subsections.</Paragraph>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.1 Speech Recognition </SectionTitle>
<Paragraph position="0"> The speech recognition (SR) agent consists of a thin agent layer on top of the Nuance (formerly Corona) speech recognition system. Nuance is a commercial speech recognition product based on technology developed by SRI International. The recognizer listens on the audio port of the computer on which it is running, and produces its best hypothesis as to what string of words was spoken. The SR agent accepts messages that tell it to start and stop listening and to change grammars, and generates messages indicating that it has stopped listening and messages containing the hypothesized word string.</Paragraph>
<Paragraph position="1"> The Nuance recognizer is customized in two ways for use in CommandTalk. First, we have replaced the narrow-band (8-bit, 8-kHz sampled) acoustic models included with the Nuance recognizer and designed for telephone applications, with wide-band (16-bit, 16-kHz sampled) acoustic models that take advantage of the higher-quality audio available on computer workstations. Second, any practical application of speech recognition technology requires a vocabulary and grammar tailored to the particular application, since for high accuracy the recognizer must be restricted as to what sequences of words it will consider. To produce the recognition vocabulary and grammar for CommandTalk, we have implemented an algorithm that extracts these from the vocabulary and grammar specifications for the natural-language component of CommandTalk. This eases development by automatically keeping the language that can be recognized and the language that can be parsed in sync; that is, it guarantees that every word string that can be parsed by the natural-language component is a potential recognition hypothesis, and vice versa. The module that generates the recognition grammar for CommandTalk is described in Section 3.</Paragraph> </Section>
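<Paragraph> To make the agent pattern above concrete, the following is a minimal, hypothetical sketch in Python of registration and dispatch in the style of the facilitator described in Section 2; the class and message names are invented for illustration and are not the actual OAA or ICL API.

    # Schematic sketch of facilitator-mediated messaging (hypothetical, not the OAA API).
    class Facilitator:
        def __init__(self):
            self.handlers = {}  # maps a message type to the handlers registered for it

        def register(self, msg_type, handler):
            # An agent registers the types of messages it can respond to.
            self.handlers.setdefault(msg_type, []).append(handler)

        def post(self, msg_type, **content):
            # A posted message is dispatched to every agent registered for its type.
            return [handler(content) for handler in self.handlers.get(msg_type, [])]

    class SpeechRecognitionAgent:
        # Thin agent layer over a recognizer, loosely following Section 2.1.
        def __init__(self, facilitator):
            self.facilitator = facilitator
            self.grammar = "commandtalk"
            facilitator.register("start_listening", self.start_listening)
            facilitator.register("stop_listening", self.stop_listening)
            facilitator.register("change_grammar", self.change_grammar)

        def start_listening(self, msg):
            # A real agent would start audio capture; here we fake a hypothesis message.
            words = "charlie four five advance to checkpoint one"
            self.facilitator.post("recognized_words", words=words)
            return "listening"

        def stop_listening(self, msg):
            return "stopped"

        def change_grammar(self, msg):
            self.grammar = msg["grammar"]
            return self.grammar

    facilitator = Facilitator()
    sr_agent = SpeechRecognitionAgent(facilitator)
    print(facilitator.post("start_listening"))  # ['listening']

Because all traffic goes through the facilitator, an individual agent can be restarted or replaced without the other agents needing to know about it. </Paragraph>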
<Section position="2" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 2.2 Natural Language </SectionTitle>
<Paragraph position="0"> The natural-language (NL) agent consists of a thin agent layer on top of Gemini (Dowding et al., 1993, 1994), a natural-language parsing and semantic interpretation system based on unification grammar.</Paragraph>
<Paragraph position="1"> &quot;Unification grammar&quot; means that grammatical categories incorporate features that can be assigned values, so that when grammatical category expressions are matched in the course of parsing or semantic interpretation, the information contained in the features is combined, and if the feature values are incompatible the match fails. Gemini applies a set of syntactic and semantic grammar rules to a word string using a bottom-up parser to generate a logical form, a structured representation of the context-independent meaning of the string. The NL agent accepts messages containing word strings to be parsed and interpreted, and generates messages containing logical forms or, if no meaning representation can be found, error messages to be displayed to the user.</Paragraph>
<Paragraph position="2"> Gemini is a research system that has been developed over several years, and includes an extensive grammar of general English. For CommandTalk, however, we have developed an application-specific grammar, which gives us a number of advantages.</Paragraph>
<Paragraph position="3"> First, because it does not include rules for English expressions not relevant to the application, the grammar runs faster and finds fewer grammatical ambiguities. Second, because the semantic rules are tailored to the application, the logical forms they generate require less subsequent processing to produce commands to the application system. Finally, by restricting the form of the CommandTalk grammar, we are able to automatically extract the grammar that guides the speech recognizer.</Paragraph>
<Paragraph position="4"> The Nuance recognizer, like all other practical recognizers, requires a grammar that defines a finite-state language model. The Gemini grammar formalism, on the other hand, is able to define grammars of much greater computational complexity. For CommandTalk, extraction of the recognition grammar is made possible by restricting the Gemini syntactic rules to a finite-state backbone with finitely valued features. It should be noted that, although we are not using the full power of the Gemini grammar formalism, we still gain considerable benefit from Gemini: the feature constraints let us write the grammar much more compactly, Gemini's morphology component simplifies maintaining the vocabulary, and Gemini's unification-based semantic rules let us specify the translation from word strings into logical forms easily and systematically.</Paragraph> </Section>
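<Paragraph> As a schematic illustration of the unification behavior described above (this is not Gemini's actual rule or feature notation), finitely valued features combine when they match and cause the match to fail when they conflict:

    # Sketch of unification over finitely valued features (illustrative notation only).
    FAIL = None

    def unify(feats1, feats2):
        # Combine two feature sets; fail if a shared feature has incompatible values.
        result = dict(feats1)
        for name, value in feats2.items():
            if name in result and result[name] != value:
                return FAIL
            result[name] = value
        return result

    # A singular noun phrase unifies with a singular, objective-case requirement ...
    print(unify({"num": "sg"}, {"num": "sg", "case": "obj"}))  # {'num': 'sg', 'case': 'obj'}
    # ... but not with a plural requirement.
    print(unify({"num": "sg"}, {"num": "pl"}))                 # None

Because each feature ranges over a finite set of values such as these, the grammar can later be expanded into the finite-state form required by the recognizer. </Paragraph>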
<Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 2.3 Contextual Interpretation </SectionTitle>
<Paragraph position="0"> The contextual-interpretation (CI) agent accepts a logical form from the NL agent, and produces one or more commands to ModSAF. Since a logical form encodes only information that is directly expressed in the utterance, the CI agent often must apply contextual information to produce a complete interpretation. Sources of this information can include linguistic context, situational context, and defaults.</Paragraph>
<Paragraph position="1"> Since ModSAF itself is the source of situational information about the simulation, the interaction between the CI agent and ModSAF is not a simple one-direction pipeline. Often, there will be a series of queries to ModSAF about the current state of the simulation before the ModSAF command or commands that represent the final interpretation of an utterance are produced.</Paragraph>
<Paragraph position="2"> Some of the problems that must be solved by the CI agent include the following. A noun phrase denoting an object in the simulation must be resolved to the unique ModSAF identifier for that object. &quot;M1 platoon,&quot; &quot;tank platoon,&quot; or &quot;Charlie 4 5&quot; could all refer to the same entity in the simulation. To keep the CI agent informed about the objects in the simulation and their properties, the ModSAF agent notifies the CI agent whenever an object is created, modified, or destroyed. Since the CI agent is immediately notified whenever the user creates an object through the ModSAF GUI, the CI agent can note the salience of such objects, and make them available for pronominal reference (just as objects created by speech are), leading to smoother interoperation between speech and the GUI.</Paragraph>
<Paragraph position="3"> While users employ generic verbs like move, attack, and assault to give verbal commands, the corresponding ModSAF tasks often differ depending on the units involved. The ModSAF movement task for a tank platoon is different from the one for an infantry platoon or the one for a tank company. Similarly, the parameter value indicating a column formation for tanks is different from the one indicating a column formation for infantry, and the parameter that controls the speed of vehicles has a different name than the one that controls the speed of infantry. All these differences need to be taken into account when generating the ModSAF command for something like &quot;Advance in a column to Checkpoint 1 at 10 kph,&quot; depending on what type of unit is being given the command.</Paragraph>
<Paragraph position="4"> The CI agent also needs to determine when a command given to a unit should be carried out. The command may be part of a mission to be carried out later, or it may be an order to be carried out immediately. If the latter, it may be a permanent change to the current mission, or merely a temporary interruption of the current task in the mission, which should be resumed when the interrupting task is completed. The CI agent decides these questions based on a combination of phrasing and context.</Paragraph>
<Paragraph position="5"> Sometimes, explicit indicators may be given as to when the command is to be carried out, such as a specific time, or after a given duration of time has elapsed, or on the commander's order.</Paragraph>
<Paragraph position="6"> Sometimes a verbal command does not include all the information required by the simulation. The CI agent attempts to fill in this missing information by using a combination of linguistic and situational context, plus defaults. For instance, if no unit is explicitly addressed by a command, it is assumed that the addressee is the unit to whom the last verbal command was given. The ModSAF &quot;occupy position&quot; and &quot;attack by fire&quot; tasks require that a line be given as a battle position, but users often give just a point location for the position of the unit. In such cases, the CI agent calls ModSAF to construct a line through the point, and uses that line for the battle position.</Paragraph> </Section>
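<Paragraph> A minimal sketch of the kind of unit-specific lookup the CI agent must perform when turning a generic verb such as &quot;move&quot; into a simulator command is given below; the task and parameter names are invented placeholders, not actual ModSAF identifiers.

    # Sketch of mapping a generic verb to a unit-type-specific simulator task.
    # Task and parameter names are invented placeholders, not real ModSAF identifiers.
    TASK_TABLE = {
        ("move", "tank_platoon"):     {"task": "vehicle-move-task",
                                       "column_formation": "COL_ARMOR",
                                       "speed_param": "vehicle-speed"},
        ("move", "infantry_platoon"): {"task": "troop-move-task",
                                       "column_formation": "COL_DISMOUNT",
                                       "speed_param": "troop-speed"},
    }

    def build_command(verb, unit_type, destination, speed_kph):
        entry = TASK_TABLE[(verb, unit_type)]
        # The same spoken command yields a different task and parameters per unit type.
        return {"task": entry["task"],
                "formation": entry["column_formation"],
                entry["speed_param"]: speed_kph,
                "destination": destination}

    # "Advance in a column to Checkpoint 1 at 10 kph", addressed to two unit types:
    print(build_command("move", "tank_platoon", "checkpoint 1", 10))
    print(build_command("move", "infantry_platoon", "checkpoint 1", 10))

In CommandTalk itself, the relevant unit information comes from queries to ModSAF about the current state of the simulation, as described above, rather than from a static table. </Paragraph>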
<Section position="4" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 2.4 Push to Talk </SectionTitle>
<Paragraph position="0"> The push-to-talk (PTT) agent manages the interactions with the user. It provides a long narrow window running across the top of the screen, which is the only visible indication that a ModSAF is CommandTalk-enabled. This window contains a microphone icon that indicates the state of CommandTalk (ready, listening, or busy), an area for the most recent recognized string to be printed, and an area for text messages from the system to appear (confirmation messages and error messages).</Paragraph>
<Paragraph position="1"> This agent provides two mechanisms for the user to initiate a spoken command. A push-to-talk button attached to the serial port of the computer can be pushed down to signal the computer to start listening and released to indicate that the utterance is finished (push-and-hold-to-talk). The second option is to click on the microphone icon with the left mouse button to signal the computer to start listening (click-to-talk). With click-to-talk, the system listens for speech until a sufficiently long pause is detected. The length of time to wait is a parameter that can be set in the recognizer. The push-and-hold method generally seems more satisfactory for a number of reasons: push-and-hold leads to faster response, because the system does not have to wait to hear whether the user is done speaking; click-to-talk tends to cut off users who pause in the middle of an utterance to figure out what to say next; and push-and-hold seems natural to military users because it works like a tactical field radio.</Paragraph>
<Paragraph position="2"> The PTT agent issues messages to the SR agent to start and stop listening. It accepts messages from the SR agent containing the words that were recognized, messages that the user has stopped speaking (for click-to-talk), and messages, from any agent, that contain confirmation or error messages to be displayed to the user.</Paragraph> </Section>
<Section position="5" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 2.5 ModSAF </SectionTitle>
<Paragraph position="0"> The ModSAF agent consists of a thin layer on top of ModSAF. It sends messages that keep the CI agent informed of the current state of the simulation and executes commands that it receives from the CI agent.
Generally, these commands access functions that are also available using the GUI, but not always.</Paragraph>
<Paragraph position="1"> For example, it is possible with CommandTalk to tell ModSAF to center its map display on a point that is not currently visible. This cannot be done with the GUI, because there is no way to select a point that is not currently displayed on the map. The set of messages that the ModSAF agent responds to is defined by the ModSAF Agent Layer Language (MALL).</Paragraph> </Section>
<Section position="6" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 2.6 Start-It </SectionTitle>
<Paragraph position="0"> Start-It is a graphical process-spawning agent that helps control the large number of processes that make up the CommandTalk system. It provides a mouse-and-menu interface to configure and start other processes. While it is particularly useful for starting agent processes, it can also be used to start nonagent processes such as additional ModSAF simulators and interfaces, CommandVu, and the LeatherNet sound server.</Paragraph> </Section>
<Section position="7" start_page="4" end_page="4" type="sub_section"> <SectionTitle> Features of Start-It </SectionTitle> <Paragraph position="0"/> </Section>
<Section position="8" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 3 Recognition Grammar Compiler </SectionTitle>
<Paragraph position="0"> The SR agent requires a grammar to tell the recognizer what sequences of words are possible in a particular application, and the NL agent requires a grammar to specify the translation of word strings into logical forms. For optimal performance, these two grammars should, as nearly as possible, accept exactly the same word sequences. In general, we would like the recognizer to accept all word sequences that can be interpreted, and any over-generation by the recognition grammar increases the likelihood of recognition errors without providing any additional functionality. In order to keep these two grammars synchronized, we have implemented a compiler that derives the recognition grammar automatically from the NL grammar.</Paragraph>
<Paragraph position="1"> To derive a recognition grammar with coverage equivalent to the NL grammar, we must restrict the form of the NL grammar. Like virtually all practical speech recognizers, the Nuance recognizer requires a finite-state grammar, while the Gemini parser accepts grammars that have a context-free backbone, plus unification-based feature constraints that give Gemini grammars the power of an arbitrary Turing machine. To make it possible to derive an equivalent finite-state grammar, we restrict the Gemini grammars used as input to our Gemini-to-Nuance compiler as follows: * All features in the Gemini grammar that are compiled into the recognition grammar must allow only a finite number of values. This means that no feature values are structures that can grow arbitrarily large.</Paragraph>
<Paragraph position="2"> * The Gemini grammar must not contain any indirect recursion. That is, no rule subsets are allowed with patterns such as A → B C, C → A D.</Paragraph>
<Paragraph position="3"> * Immediately recursive rules are allowed, but only if the recursive category is leftmost or rightmost in the list of daughters, so that there is no form of center embedding. That is, A → A B and A → C A are allowed (even simultaneously), but not A → C A B.</Paragraph>
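<Paragraph> The following is a small sketch of how the last two of these restrictions might be checked mechanically over a rule set; the encoding of a rule as a (left-hand side, list of right-hand-side symbols) pair is ours, not the compiler's actual input format.

    # Sketch of checking for indirect recursion and center embedding in a rule set.
    from itertools import combinations

    def reachability(rules):
        # Map each nonterminal to the nonterminals it can reach in one or more steps.
        nonterminals = {lhs for lhs, _ in rules}
        reach = {nt: set() for nt in nonterminals}
        for lhs, rhs in rules:
            reach[lhs].update(sym for sym in rhs if sym in nonterminals)
        changed = True
        while changed:
            changed = False
            for nt in nonterminals:
                before = len(reach[nt])
                for target in list(reach[nt]):
                    reach[nt].update(reach[target])
                changed = changed or len(reach[nt]) != before
        return reach

    def check_restrictions(rules):
        problems = []
        reach = reachability(rules)
        # No indirect recursion: two distinct nonterminals must not reach each other.
        for a, b in combinations(sorted(reach), 2):
            if b in reach[a] and a in reach[b]:
                problems.append(f"indirect recursion between {a} and {b}")
        # Immediate recursion only with the recursive category leftmost or rightmost.
        for lhs, rhs in rules:
            for i, sym in enumerate(rhs):
                if sym == lhs and i != 0 and i != len(rhs) - 1:
                    problems.append(f"center embedding in a rule for {lhs}")
        return problems

    allowed   = [("A", ["A", "B"]), ("A", ["C", "A"]), ("A", ["D"])]
    forbidden = [("A", ["B", "C"]), ("C", ["A", "D"]), ("A", ["C", "A", "B"])]
    print(check_restrictions(allowed))    # []
    print(check_restrictions(forbidden))  # reports indirect recursion and center embedding

</Paragraph>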
<Paragraph position="4"> There are many possible formats for specifying a finite-state grammar, and the one used by the Nuance recognition system specifies a single definition for each atomic nonterminal symbol as a regular expression over vocabulary words and other nonterminals, such that there is no direct or indirect recursion in the set of definitions. To transform a restricted Gemini grammar into this format, we first transform the Gemini rules over categories with feature constraints into rules over atomic symbols, and we then transform these rules into a set of definitions in terms of regular expressions.</Paragraph> </Section>
<Section position="9" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 3.1 Generating Atomic Categories </SectionTitle>
<Paragraph position="0"> Given the restriction that all features must allow only a finite number of values, it would be trivial to transform all unification rules into rules over atomic categories by generating all possible full feature instantiations of every rule, and making up an atomic name for each combination of category and feature values that occurs in these fully instantiated rules.</Paragraph>
<Paragraph position="1"> This would, however, increase the total number of rules to a size that would be too large to deal with.</Paragraph>
<Paragraph position="2"> We therefore instantiate the rules in a more careful way that avoids unnecessarily instantiating features and prunes out useless rules.</Paragraph>
<Paragraph position="3"> The set of atomic categories is defined by considering, for each daughter category of each rule, all instantiations of just the subset of features on the daughter that are constrained by the rule. Thus, if there is a rule that does not constrain a feature on a particular daughter category, an atomic category will be created for that daughter that is under-specified for the value of that feature. A prime example of this in the CommandTalk grammar is the rule coordinate_nums:[] → digit:[] digit:[] digit:[] digit:[], which says that a set of coordinate numbers can be a sequence of four digits. In the CommandTalk grammar the digit category has features (singular vs. plural, zero vs. nonzero, etc.) that would generate at least 60 combinations if all instantiations were considered. So, if we naively generated all possible complete instantiations of this rule, we would get at least 60^4 rules. Even worse, we need other rules to permit up to eight digits to form a set of coordinate numbers, which would give rise to 60^8 rules. Since the original rule, however, puts no constraints on any of the features of the digit category, by generating an atomic category that is under-specified for all features, we only need a single rule in the derived grammar.</Paragraph>
<Paragraph position="4"> From the set of atomic categories defined in this way, we generate all rules consistent with the original Gemini rules, except that for daughters that have unconstrained features, we use only the corresponding under-specified categories. We then iteratively remove all rules that cannot participate in a complete parse of an utterance, either because they contain daughter categories that cannot be expanded into any sequence of words, given the particular lexicon we have, or because they have a mother category that cannot be reached from the top category of the grammar.</Paragraph> </Section>
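<Paragraph> A small sketch of the saving described in Section 3.1 follows; the feature inventory for digit is a made-up stand-in for the real one, and the notation for atomic category names is ours.

    # Sketch of atomic-category naming: instantiate only the features a rule constrains.
    DIGIT_FEATURES = {          # a made-up, finitely valued feature inventory for digit
        "num":  ["sg", "pl"],
        "zero": ["zero", "nonzero"],
        "role": ["ones", "tens", "hundreds", "thousands", "other"],
    }

    def atomic_name(category, bindings):
        # e.g. digit(num=sg); features the rule does not constrain are simply left out.
        if not bindings:
            return category
        inside = ",".join(f"{f}={v}" for f, v in sorted(bindings.items()))
        return f"{category}({inside})"

    def naive_rule_count(n_daughters, features):
        # Fully instantiating every feature on every daughter multiplies the rules out.
        per_daughter = 1
        for values in features.values():
            per_daughter *= len(values)
        return per_daughter ** n_daughters

    # The coordinate-number rule constrains no digit features, so one under-specified
    # atomic category, and hence one derived rule, suffices:
    print(atomic_name("digit", {}))              # digit
    print(atomic_name("digit", {"num": "sg"}))   # digit(num=sg)
    print(naive_rule_count(4, DIGIT_FEATURES))   # rules needed under naive instantiation
    print(naive_rule_count(8, DIGIT_FEATURES))   # and far more for the eight-digit case

</Paragraph>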
<Section position="10" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 3.2 Compiling Rules to Regular Expressions </SectionTitle>
<Paragraph position="0"> Once we have transformed the Gemini unification grammar into an equivalent grammar over atomic nonterminals, we then rewrite the grammar as a set of definitions of the nonterminals as regular expressions. For the nonterminals that have no recursive rules, we simply collect all the rules with the same left-hand side and create a single rule by forming the disjunction of all the right-hand sides. For example, if the only rules for the nonterminal A are A → B C and A → D E, then the regular expression defining A would be [(B C) (D E)]. In the Nuance regular expression notation, &quot;( )&quot; indicates a sequence and &quot;[ ]&quot; indicates a set of disjunctive alternatives.</Paragraph>
<Paragraph position="1"> For nonterminals with recursive rules, we eliminate the recursion by introducing regular expressions using the Kleene star operator. For each recursive nonterminal A, we divide the rules defining A into right-recursive, left-recursive, and nonrecursive subsets. For the right-recursive subset, we form the disjunction of the expressions that occur to the left of A. That is, for the rules A → B A and A → C A, we generate [B C]. Call this expression LEFT-A. For the left-recursive subset, we form the disjunction of the expressions that occur to the right of A, which we may call RIGHT-A. Finally, we form the disjunction of all the right-hand sides of the nonrecursive rules, which we may call NON-REC-A. The complete regular expression defining A is then (LEFT-A* NON-REC-A RIGHT-A*), where the Kleene star indicates zero or more repetitions.</Paragraph>
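<Paragraph> A sketch of this recursion-elimination step is given below, using a generic textual notation with &quot;*&quot; for the Kleene star rather than Nuance's own syntax; the rule encoding is the same invented one used earlier.

    # Sketch of compiling the rules for one recursive nonterminal into a regular
    # expression of the form (LEFT-A* NON-REC-A RIGHT-A*), as described above.
    def seq(symbols):
        return symbols[0] if len(symbols) == 1 else "(" + " ".join(symbols) + ")"

    def disj(exprs):
        return exprs[0] if len(exprs) == 1 else "[" + " ".join(exprs) + "]"

    def compile_nonterminal(nt, rules):
        left, right, nonrec = [], [], []
        for lhs, rhs in rules:
            if lhs != nt:
                continue
            if rhs[-1] == nt:                  # right-recursive rule: keep material left of nt
                left.append(seq(rhs[:-1]))
            elif rhs[0] == nt:                 # left-recursive rule: keep material right of nt
                right.append(seq(rhs[1:]))
            else:
                nonrec.append(seq(rhs))
        parts = []
        if left:
            parts.append(disj(left) + "*")
        parts.append(disj(nonrec))
        if right:
            parts.append(disj(right) + "*")
        return "(" + " ".join(parts) + ")"

    rules = [("A", ["B", "A"]), ("A", ["C", "A"]), ("A", ["A", "D"]), ("A", ["E"])]
    print(compile_nonterminal("A", rules))     # ([B C]* E D*)

</Paragraph> </Section> </Section> </Paper>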