<?xml version="1.0" standalone="yes"?>
<Paper uid="P81-1011">
  <Title>Natural Language Processing: The REL System as</Title>
  <Section position="4" start_page="39" end_page="39" type="metho">
    <SectionTitle>
AND ERROR ANALYSIS
</SectionTitle>
    <Paragraph position="0"> System performance can obviously be evaluated in a number of ways, but without good response time meaningful experiments are impossible. When much data is involved in processing a delay of a few minutes can probably be tolerated, but the vast majority of requests should be responded to within seconds. The latter was the case in my experiments. Fairly complex messages of about 12 words were responded to in about l0 seconds.</Paragraph>
    <Paragraph position="1"> The system clearly has to be reasonably free of bugs -in my case, 12 bugs were hit in the total of 1615 parsed and nonparsed messages. The adequate extent of natural language syntax is impossible to determine. Table 3 shows the syntax used by my subjects.</Paragraph>
    <Paragraph position="2"> sentences; or possibly just &amp;quot;baby talk&amp;quot; due to the suspicion of the computer's limitations.</Paragraph>
    <Paragraph position="3"> An interesting fact to note is that similar results with respect to syntax were obtained in the exper~nents with USL, the &amp;quot;sister system&amp;quot; of REL developed by IBM Heidelberg \[10\] -- with German used as gLl in two studies of high school students: predominance of wh-questions (317 in total of 451); not many relative clauses (66); commands (35); conjunctions (26); quantifiers (15); definitions (ii); comparisons (2); yes/no questions (i).</Paragraph>
    <Paragraph position="4"> An evaluation which would not include an analysis of unparsed input would at best be of limited value. It was shown in Table i that i093 out of 1515 or about ~o thirds were parsed in my experiments.</Paragraph>
    <Paragraph position="5">  All sentences Simple sentences, e.g., &amp;quot;List the decks of the Alamo.&amp;quot; 73.8 Sentences with pronouns, e.g., '~/hat is its length?&amp;quot;, &amp;quot;what is in its pyrotechnic looker?&amp;quot; 30 3.A Sentences with quantifier(s), e.g., &amp;quot;List the class of each cargo.&amp;quot; 71 8.0 Sentences with conjunctions, e.g. &amp;quot;What is the maxim,-- stow height and bale cube of the pyrotechnic locker of the AL?&amp;quot; 88 I0.0 Sentences with quantifier and conjunction(s), e.g., &amp;quot;List hatch width and hatch length of each deck of the Alamo.&amp;quot; 13 2.6 Sentences with relative clause, e.g., &amp;quot;List the ships that have water.&amp;quot; 6 .7 Sentences with relative clause (or related construction) and cemparator, e.g., &amp;quot;List the ships with a beam less than lO00.&amp;quot; 6 .7 Sentences with quantifier and relative clause, e.g., &amp;quot;List height of each content whose class is class IV.&amp;quot; 2 .23 Sentences with quantifier, conjunction and relative clause, e.g., &amp;quot;List length, width and height of each content whose class is a--nunicion.&amp;quot; 2 .23 Sentences with quantifiers and comparator, e.g., '~Iow many ships have a beam greater  Considering the wide range of R k'r- syntax \[7\], the paucity of complex sentences is surprising. The use of definitions which often involved complex constructions (relative clauses, conjunctions, even quantifiers) had a definite influence. So did, undoubtedly, the task situation causing optimization of work methods. The influence of the specific nature of the task would require additional studies, but the special device provided by the system (a loading prompt sequence -- which was not analyzed) was employed by every subject. Dewices such as these obviously are a great aid in accomplishin 8 tasks. They should be tested extensively to determine how they can augment the uaturalness of NLIs. Other reasons for the relatively simple syntax used were special strategies: paraphrasing into simpler syntax even though a sentence did not parse for other reasons; &amp;quot;SUCCesS strategy&amp;quot; resulting in repetitious simple  predominance of vocabulary is not surprising, but relatively few syntactic errors are. In part this may be due to the method of scoring in which errors were counted only once, so if a sentence contained an unknown vocabulary item (e.g. &amp;quot;On what decks of the Alamo cargo be stored?&amp;quot;) but would have failed on syatactic grounds as well, it would fall in the vocabulary category. A comparison can be made here with Damerau's study Ill\] of the use of the ll~A system by the city plannin S department in White Plains, at least with regard to the total of queries to those completed: 788 to 513. So, again, roughly t~ao thirds were parsed. In other categories &amp;quot;parsin S failure&amp;quot; is 147, &amp;quot;lookup failures&amp;quot; 119, &amp;quot;nothing in data base&amp;quot; 61, &amp;quot;program error&amp;quot; 39, but this only points to the general difficulties of comparisons of system performance.</Paragraph>
  </Section>
  <Section position="5" start_page="39" end_page="41" type="metho">
    <SectionTitle>
SOME CONCLUSIONS
</SectionTitle>
    <Paragraph position="0"> Norm Sondheimer suggested some questions we might try to answer. What has been learned about user needs? What most important linguistic phenomena to allOW for? What other kinds of interactions? Error analysis points in the obvious directions of user needs, and so do the types of sentences employed. While it is justified to quit the search for an almost perfect grnmm,r, it would be a mistake to constrain it to the constructions used.</Paragraph>
    <Paragraph position="1"> Improved naturalness can be achieved with diagnostics, definitions, and devices geared to specific tasks such as special prompting sequences. Some tasks clearly require math in the NLI. How good are systems? An objective measurement is probably impossible, but the percentage of requests processed might give some idea.</Paragraph>
    <Paragraph position="2"> In the case of a task situation such as loading cargo items, the percentage of task completion may signal both system performance and user satisfaction. System response times are a very important measure. The questionnaire method can and has been used (in the case of MT and USL), but as yet there is too little experience to measure user satisfaction. Users seem very good at adapting to systems. They paraphrase, use success strategy, simplify syntax, use special devices -- what they really do is maximize their performance with respect Co a given task.</Paragraph>
    <Paragraph position="3">  What have we learned about running evaluations7 It is important Co know what to look for, therefore the need for good knowledge of human to hmnan discourse. Good system response times are a sine qua non. Controlled experiments have the advantage of being replicable, a crucial factor in arriving ac evaluation criteria.</Paragraph>
    <Paragraph position="4"> Determining user bias and experience nay be important, but even more so PSs user training. Controlled experiments can show what methods are ~ost effective (e.g. a manual or study of proCocols~). Study of user commence -- phacic material -- gives some measure of user (dis)satisfaction (I have seen '&amp;quot;/ou lie,&amp;quot; buc I have yeC to see &amp;quot;Good boy, youZ&amp;quot;). Clearly, the best indication of user satisfaction is whether he or she uses the system again. Extensive IonS-term studies are needed for that.</Paragraph>
    <Paragraph position="5"> What should the future look like? Task oriented situations seem to be a promising envirooment for ~LZ. The standards of NL systems performance will be set by the users. Future evaluations? As Antoine de Sainc-Zxup&amp;r7 wrote, &amp;quot;As for the Future, your task is not to foresee, but to enable it.&amp;quot;</Paragraph>
  </Section>
class="xml-element"></Paper>