<?xml version="1.0" standalone="yes"?> <Paper uid="J02-4007"> <Title>Human Variation and Lexical Choice</Title> <Section position="3" start_page="546" end_page="550" type="metho"> <SectionTitle> 2. Evidence for Human Lexical Variation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="546" end_page="546" type="sub_section"> <SectionTitle> 2.1 Previous Research </SectionTitle> <Paragraph position="0"> Linguists have long acknowledged that people may associate different meanings with the same word. Nunberg (1978, page 81), for example, writes: There is considerable variation among speakers in beliefs about what does and does not constitute a member of the category. ... Take jazz. I may believe that the category includes ragtime, but not blues; you may believe the exact opposite. After all, we will have been exposed to a very different set of exemplars. And absent a commonly accepted authority, we must construct our own theories of categories, most probably in the light of varying degrees of musical sophistication.</Paragraph> <Paragraph position="1"> Many modern theories of mental categorization (Rosch 1978; Smith and Medin 1981) assume that mental categories are represented by prototypes or exemplars. Therefore, if different people are exposed to different category prototypes and exemplars, they are likely to have different rules for evaluating category membership.</Paragraph> <Paragraph position="2"> Parikh (1994) made a similar point and backed it up with some simple experimentation. For example, he showed squares from the Munsell chart to participants and asked them to characterize the squares as red or blue; different individuals characterized the squares in different ways. In another experiment he showed that the differences remained even when participants were allowed to associate fuzzy-logic-style truth values with statements.</Paragraph> <Paragraph position="3"> In the psychological community, Malt et al. (1999) investigated what names participants gave to real-world objects. For example, these researchers wished to know whether participants would describe a pump-top hand lotion dispenser as a bottle or a container. They were primarily interested in variations across linguistic communities, but they also discovered that even within a linguistic community there were differences in how participants named objects. They state (page 242) that only 2 of the 60 objects in their study were given the same name by all of their 76 native-English-speaker participants.</Paragraph> <Paragraph position="4"> In the lexicographic community, field-workers for the Dictionary of American Regional English (DARE) (Cassidy and Hall 1996) asked a representative set of Americans to respond to fill-in-the-blank questionnaires. The responses they received revealed substantial differences among participants.
For example, there were 228 different responses to question B12, When the wind begins to increase, you say it's _____, the most common of which were getting windy and blowing up; and 201 different responses to question B13, When the wind begins to decrease, you say it's _____, the most common of which were calming down and dying down.</Paragraph> </Section> <Section position="2" start_page="546" end_page="550" type="sub_section"> <SectionTitle> 2.2 SUMTIME Project </SectionTitle> <Paragraph position="0"> The SUMTIME project at the University of Aberdeen is researching techniques for generating summaries of time-series data.</Paragraph> <Paragraph position="1"> Much of the project focuses on content determination (see, for example, Sripada et al. [2001]), but it is also examining lexical choice algorithms for time-series summaries, which is where the work described in this article originated. To date, SUMTIME has primarily focused on two domains, weather forecasts and summaries of gas turbine sensor data, although we have recently started work in a third domain as well, summaries of sensor readings in neonatal intensive care units.</Paragraph> <Paragraph position="2"> In the gas turbine domain, we asked two experts to write short descriptions of 38 signal fragments. (A signal fragment is an interval of a single time-series data channel.) The descriptions were short, with an average length of 8.3 words. In no case did the experts produce exactly the same description for a fragment. Many of the differences simply reflected usage of different words to express the same underlying concept; for example, one expert typically used rise to describe what the other expert called increase. In other cases the differences reflected different levels of detail. For example, one fragment was described by expert A as Generally steady with a slow rising trend. Lots of noise and a few small noise steps, whereas expert B used the shorter phrase Rising trend, with noise. Both experts also had personal vocabulary; for example, the terms bathtub and dome were used only by expert A, whereas the terms square wave and transient were used only by expert B.</Paragraph> <Paragraph position="3"> Most importantly from the perspective of this article, there were cases in which the differences between the experts reflected a difference in the meanings associated with words. For example, both experts used the word oscillation. Six signals were described by both experts as oscillations, but two signals, including the one shown in Figure 1, were described as oscillations only by expert B. We do not have enough examples to support solid hypotheses about why the experts agreed on applying the term oscillation to some signals and disagreed on its application to others, but one explanation that fits the available data is that the experts agreed on applying the term to signals that were very similar to a sine wave (which presumably is the prototype [Rosch 1978] of an oscillation), but sometimes disagreed on its application to signals that were less similar to a sine wave, such as the one in Figure 1.</Paragraph>
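To make the prototype explanation concrete, the following sketch (our illustration, not part of the SUMTIME study; the cutoff values and function names are invented) scores a signal fragment by its best correlation with a sine-wave prototype and models each expert as applying a different cutoff for calling a fragment an oscillation:

```python
import numpy as np

def sine_similarity(signal, n_freqs=50, n_phases=8):
    """Score how sine-like a signal fragment is: the maximum Pearson
    correlation with a sine wave over a grid of frequencies and phases
    (1.0 means the fragment is itself a perfect sinusoid)."""
    x = np.asarray(signal, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)
    t = np.linspace(0.0, 1.0, len(x))
    best = 0.0
    for freq in np.linspace(0.5, 10.0, n_freqs):
        for phase in np.linspace(0.0, np.pi, n_phases):
            proto = np.sin(2 * np.pi * freq * t + phase)
            proto = (proto - proto.mean()) / (proto.std() + 1e-12)
            best = max(best, abs(float(np.dot(x, proto))) / len(x))
    return best

# Invented expert-specific cutoffs: both experts apply "oscillation" to
# near-prototypical signals, but expert B extends the term to signals
# further from the sine-wave prototype than expert A does.
CUTOFFS = {"expert_A": 0.8, "expert_B": 0.6}

def calls_it_oscillation(expert, signal):
    return sine_similarity(signal) >= CUTOFFS[expert]
```

Under this reading, the six signals both experts called oscillations would score above both cutoffs, whereas a signal like the one in Figure 1 would fall between them.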
<Paragraph position="4"> In the weather forecast domain, we analyzed a corpus of 1,099 human-written weather forecasts for offshore oil rigs, together with the data files (produced by a numerical weather simulation) that the forecasters examined when writing the forecasts. The forecasts were written by five different forecasters. A short extract from a typical forecast is shown in Figure 2; this text describes changes in wind speed and direction predicted to occur two days after the forecast was issued. An extract from the corresponding data file is shown in Table 1; it gives the wind speed and direction predicted by the numerical weather simulation at three-hourly intervals.</Paragraph> <Paragraph position="5"> As in the gas turbine domain, our corpus analysis showed that individual forecasters had idiosyncratic vocabulary that only they used. For example, one forecaster used the verb freshening to indicate a moderate increase in wind speed from a low or moderate initial value, but no other forecaster used this verb.
[Figure 1: Signal fragment (gas turbine exhaust temperature): Is this an oscillation?]
[Figure 2: Wind (at 10m) extract from a five-day weather forecast issued on October 2, 2000.]
There were also differences in orthography. For example, some forecasters lexicalized the four basic directions as N, E, S, and W, whereas others used the lexicalizations N'LY, E'LY, S'LY, and W'LY.</Paragraph> <Paragraph position="6"> We performed a number of semantic analyses to determine when different forecasters used different words; these invariably showed differences between authors. For example, we attempted to infer the meaning of time phrases such as by evening by searching for the first data file record that matched the corresponding wind descriptor. The forecast in Figure 2, for example, says that the wind will change to SSW 10-14 at the time suggested by BY MIDDAY. In the corresponding data shown in Table 1, the first entry with a direction of SSW and a speed in the 10-14 range is 1200; hence in this example the time phrase by midday is associated with the time 1200. A similar analysis suggests that in this example the time phrase by evening is associated with the time 0000 (on 5-10-00).</Paragraph> <Paragraph position="7"> We repeated this procedure for every forecast in our corpus and statistically analyzed the results to determine how individual forecasters used time phrases. More details about the analysis procedure are given by Reiter and Sripada (2002). As reported in that paper, the forecasters seemed to agree on the meaning of some time phrases; for example, all forecasters predominantly used by midday to mean 1200. They disagreed, however, on the use of other terms, including by evening. The use of by evening is shown in Table 2; in particular, whereas forecaster F3 (the author of the text in Figure 2) most often used this phrase to mean 0000, forecasters F1 and F4 most often used this phrase to mean 1800. The differences between forecasters in their usage of by evening are significant at p < .001 under both a chi-squared test (which treats time as a categorical variable) and a one-way analysis of variance (which compares the mean time for each forecaster; for this test we recoded the hour 0 as 24).</Paragraph>
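A minimal sketch of this two-step procedure, assuming hypothetical record and forecast structures (the field names, data layout, and hour buckets are our inventions; Reiter and Sripada [2002] describe the actual analysis):

```python
from collections import defaultdict
from scipy.stats import chi2_contingency, f_oneway

def infer_time(records, direction, speed_range):
    """Return the hour of the first simulation record that matches the
    wind descriptor, e.g., direction 'SSW' with speed in the 10-14 range."""
    lo, hi = speed_range
    for rec in records:  # rec: {'hour': 12, 'dir': 'SSW', 'speed': 12}
        if rec['dir'] == direction and lo <= rec['speed'] <= hi:
            return rec['hour']
    return None

def collect_usage(corpus):
    """Map each (forecaster, time phrase) pair to its inferred hours,
    recoding hour 0 as 24 so that midnight sorts after evening."""
    usage = defaultdict(list)
    for fc in corpus:  # fc: {'author': 'F3', 'records': [...], 'phrases': [...]}
        for phrase, direction, speed_range in fc['phrases']:
            hour = infer_time(fc['records'], direction, speed_range)
            if hour is not None:
                usage[(fc['author'], phrase)].append(24 if hour == 0 else hour)
    return usage

def compare_forecasters(usage, phrase, authors, hours=(15, 18, 21, 24)):
    """Chi-squared test over a forecaster-by-hour contingency table (time
    as a categorical variable) and a one-way ANOVA comparing the mean
    inferred hour per forecaster."""
    table = [[usage[(a, phrase)].count(h) for h in hours] for a in authors]
    chi2, p_chi2, _, _ = chi2_contingency(table)
    _, p_anova = f_oneway(*(usage[(a, phrase)] for a in authors))
    return p_chi2, p_anova
```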
<Paragraph position="8"> Weather forecasting in particular is a domain with an established sublanguage and usage conventions, whose words might correspond to technical terms, and we wondered what would happen in a domain in which there was no established sublanguage and technical terminology. We therefore performed a small experiment at the University of Aberdeen in which we asked 21 postgraduate students and academic staff members to fill out a questionnaire asking which of five described individuals, A through E, they would regard as knowing Java. The descriptions spanned a range of expertise; item E, the highest level, described someone who wrote Java programs but needed occasionally to refer to documentation for details of the Java language and class libraries.</Paragraph> <Paragraph position="9"> Respondents could tick Yes, Unsure, or No. All 21 respondents ticked No for A and Yes for E. They disagreed about whether B, C, and D could be considered to know Java; 3 ticked Yes for B, 5 ticked Yes for C, and 13 ticked Yes for D. In other words, even among this relatively homogeneous group, there was considerable disagreement over what the phrase knows Java meant in terms of actual knowledge of the Java programming language.</Paragraph>
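The extent of the disagreement can be quantified with a few lines of arithmetic over the Yes counts reported above (a sketch of ours; for simplicity it treats every non-Yes response as No, collapsing Unsure):

```python
# Yes counts (out of 21 respondents) for questionnaire items A-E.
yes_counts = {"A": 0, "B": 3, "C": 5, "D": 13, "E": 21}
N = 21

for item, yes in yes_counts.items():
    p = yes / N
    # Probability that two randomly chosen respondents give the same
    # answer to this item; 1.0 means unanimity.
    agreement = p * p + (1 - p) * (1 - p)
    print(f"{item}: {yes}/{N} Yes ({p:.0%}), pairwise agreement {agreement:.2f}")
```

Items A and E show perfect agreement (1.00), whereas item D, on which respondents split 13 to 8, drops to roughly 0.53, barely better than a coin flip.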
<Paragraph position="10"> 2 The questionnaire in fact contained a sixth item: "F can create complex Java libraries and almost never needs to refer to documentation because she has memorised most of it." However, a few participants were unsure whether create complex Java libraries meant programming or meant assembling compiled object files into a single archive file using a tool such as tar or jartool, so we dropped F from our study.</Paragraph> </Section> </Section> <Section position="4" start_page="550" end_page="552" type="metho"> <SectionTitle> 3. Implications for Natural Language Generation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="550" end_page="551" type="sub_section"> <SectionTitle> 3.1 Lexical Choice </SectionTitle> <Paragraph position="0"> The previous section has argued that people in many cases do associate different meanings with lexemes and phrases such as oscillation, by evening, and knows; and that some words, such as bathtub and freshening, are used only by a subset of authors in a particular domain. What impact does this have on lexical choice? In applications in which users read only one generated text, it may be necessary to restrict lexeme definitions to those that we expect all users to share. Indeed, essentially this advice was given to us by a domain expert in an earlier project on generating personalized smoking-cessation letters (Reiter, Robertson, and Osman 2000). In applications in which users read many generated texts over a period of time, however, an argument could be made for using a richer vocabulary and set of lexeme definitions and expecting users to adapt to and learn the system's vocabulary and usage over the course of time; it may be appropriate to add "redundant" information to texts (section 3.2) while the user is still learning the system's lexical usage. This strategy has some risks, but if successful, it can lead to shorter and less awkward texts. Consistency is essential if this strategy is followed; the system should not, for example, sometimes use by evening to mean 1800 and sometimes use by evening to mean 0000.</Paragraph> <Paragraph position="1"> We are not aware of previous research in lexical choice that focuses on dealing with differences in the meanings that different human readers associate with words.</Paragraph> <Paragraph position="2"> Perhaps the closest research strand is the one that investigates tailoring word choice and phrasing according to the expertise of the user (Bateman and Paris 1989; Reiter 1991; McKeown, Robin, and Tanenblatt 1993). For example, the Coordinated Multimedia Explanation Testbed (COMET) (McKeown, Robin, and Tanenblatt 1993) could generate Check the polarity for skilled users and Make sure the plus on the battery lines up with the plus on the battery compartment for unskilled users. General reviews of previous lexical choice research in NLG are given by Stede (1995) and Wanner (1996); Zukerman and Litman (2001) review research in user modeling and NLP.</Paragraph> <Paragraph position="3"> In the linguistic community, Parikh (1994) has suggested that utility theory be applied to word choice. In other words, if we know (1) the probability of a word's being correctly interpreted or misinterpreted and (2) the benefit to the user of a correct interpretation and the cost of a misinterpretation, then we can compute the overall utility to the user of using the word. This seems like an interesting theoretical model, but in practice (at least in the applications we have looked at), whereas it may be just about possible to get data on the likelihood of correct interpretation of a word, it is probably impossible to calculate the cost of misinterpretation, because we do not have accurate task models that specify exactly how the user will use the generated texts (and our domain experts have told us that it is probably impossible to construct such models).</Paragraph>
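Read concretely (this is our formulation; Parikh's own presentation is game-theoretic and more general), the proposal amounts to a one-line expected-utility calculation, and the practical difficulty is that the cost term has no principled value:

```python
def expected_utility(p_correct, benefit, cost):
    """Expected utility of a word choice: probability-weighted benefit of
    correct interpretation minus probability-weighted cost of
    misinterpretation."""
    return p_correct * benefit - (1.0 - p_correct) * cost

# Invented numbers: usage data might tell us that readers interpret
# 'by evening' as intended only 60% of the time, versus 95% for an
# explicit 'by 1800'; but without an accurate task model, the cost
# of misinterpretation (here 3.0) is essentially a guess.
print(expected_utility(0.60, benefit=1.0, cost=3.0))  # 'by evening' -> -0.6
print(expected_utility(0.95, benefit=1.0, cost=3.0))  # 'by 1800'   ->  0.8
```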
<Paragraph position="4"> 3.1.1 Near Synonyms. A related problem is choosing between near synonyms, that is, words with similar meanings: for example, choosing between easing and decreasing when describing changes in wind speed, or between saw-tooth transient and shark-tooth transient when describing gas turbine signals. The most in-depth examination of choosing between near synonyms was undertaken by Edmonds (1999), who essentially suggested using rules based on lexicographic work such as Webster's New Dictionary of Synonyms (Gove 1984).</Paragraph> <Paragraph position="5"> Edmonds was working on machine translation, not on generating texts from nonlinguistic data, and hence was looking at larger differences than the ones with which we are concerned. Indeed, dictionaries do not in general give definitions at the level of detail required by SUMTIME; for example, we are not aware of any dictionary that defines oscillation in enough detail to specify whether it is appropriate to describe the signal in Figure 1 as an oscillation. We also have some doubts, however, as to whether the definitions given in synonym dictionaries such as Gove (1984) do indeed accurately represent how all members of a language community use near synonyms. For example, when describing synonyms and near synonyms of error, Gove (page 298) states that faux pas is "most frequently applied to a mistake in etiquette." This seems to be a fair match to DARE's fieldwork question (see section 2.1) JJ41, An embarrassing mistake: Last night she made an awful _____, and indeed DARE (volume 2, page 372) states that faux pas was a frequent response to this question. DARE adds, however, that faux pas was less often used in the South Midland region of the United States (the states of Kentucky and Tennessee and some adjacent regions of neighboring states) and also was less often used by people who lacked a college education. So although faux pas might be a good lexicalization of a "mistake in etiquette" for most Americans, it might not be appropriate for all Americans; if an NLG system knew that its user was, for example, a non-college-educated man from Kentucky, then perhaps it should consider using another word for this concept.</Paragraph> </Section> <Section position="2" start_page="551" end_page="551" type="sub_section"> <SectionTitle> 3.2 Redundancy </SectionTitle> <Paragraph position="0"> Another implication of human variation in word usage is that if there is a chance that people may not interpret words as expected, it may be useful for an NLG system to include extra information in the texts it generates, beyond what would be needed if users could be expected to interpret words exactly as the system intended. Indeed, unexpected word interpretations could perhaps be considered a type of "semantic" noise; and as with all noise, redundancy in the signal (text) can help the reader recover the intended meaning.</Paragraph> <Paragraph position="1"> For example, the referring-expression generation model of Dale and Reiter (1995) selects attributes to identify referents based on the assumption that the hearer will interpret the attributes as the system expects. Assume, for instance, that there are two books in focus, B1 and B2, and that B1 is red while B2 is blue. Then, according to Dale and Reiter, the red book is a distinguishing description that uniquely identifies B1.</Paragraph> <Paragraph position="2"> Parikh, however, has shown that people can in fact disagree about which objects are red and which are blue; if this is the case, then it is possible that the hearer will not in fact be able to identify B1 as the referent after hearing the red book. To guard against this eventuality, it might be useful to include additional information about a different attribute in the referring expression; for example, if B1 is 100 pages long and B2 is 1,000 pages long, the system could generate the thin red book. This is longer than the red book and thus may take longer to utter and comprehend, but its redundancy provides protection against unexpected lexical interpretations.</Paragraph> </Section>
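A minimal sketch of this redundancy strategy, grafted onto a Dale-and-Reiter-style incremental attribute selection (the data structures, attribute names, and the one-extra-attribute heuristic are our inventions, not the published algorithm):

```python
def describe(referent, distractors, preferred_attrs, redundancy=1):
    """Incrementally pick attributes that rule out distractors (after
    Dale and Reiter 1995), then add up to `redundancy` further attributes
    that would also have discriminated, as protection against readers
    who interpret an attribute value differently than the system does."""
    chosen, remaining, extra = [], list(distractors), 0
    for attr in preferred_attrs:
        value = referent[attr]
        discriminates = any(d.get(attr) != value for d in distractors)
        if remaining and any(d.get(attr) != value for d in remaining):
            chosen.append((attr, value))
            remaining = [d for d in remaining if d.get(attr) == value]
        elif not remaining and extra < redundancy and discriminates:
            chosen.append((attr, value))  # redundant but still informative
            extra += 1
    return chosen

# Two books in focus: B1 (red, thin) and B2 (blue, thick).
b1 = {"colour": "red", "thickness": "thin"}
b2 = {"colour": "blue", "thickness": "thick"}
print(describe(b1, [b2], ["colour", "thickness"]))
# [('colour', 'red'), ('thickness', 'thin')] -- realized as "the thin red book"
```

With redundancy=0 the sketch reproduces the plain distinguishing description the red book.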
<Section position="3" start_page="551" end_page="552" type="sub_section"> <SectionTitle> 3.3 Corpus Analysis </SectionTitle> <Paragraph position="0"> A final point is that differences between individuals should perhaps be considered more generally by anyone performing corpus analyses to derive rules for NLG systems (Reiter and Sripada 2002). To take one randomly chosen example, Hardt and Rambow (2001) suggest a set of rules for deciding on verb phrase (VP) ellipsis that are based on machine learning techniques applied to the Penn Treebank corpus. These rules achieve a 35% reduction in error rate against a baseline rule. Hardt and Rambow do not consider variation among authors, however, and we wonder whether a considerable amount of the remaining error is due to individual variation in deciding when to elide a VP. It would be interesting to perform a similar analysis with author specified as one of the features given to the machine learning algorithm and see whether this improved performance.</Paragraph> <Paragraph position="1"> 4. Natural Language Generation versus Human-Written Texts
To finish on a more positive note, human variation may potentially be an opportunity for NLG systems, because such systems can guarantee consistency. In the weather-forecasting domain we examined, for example, users receive texts written by all five forecasters, which means that they may have problems reliably interpreting phrases such as by evening; an NLG system, in contrast, could be programmed to use this phrase consistently. An NLG system could also be programmed to avoid idiosyncratic terms with which users might not be familiar (bathtub, for example) and not to use terms in cases in which people disagree about their applicability (e.g., oscillation for the signal in Figure 1). Our corpus analyses and discussions with domain experts suggest that it is not always easy for human writers to follow such consistency rules, especially if they have limited amounts of time.</Paragraph> <Paragraph position="2"> Psychologists believe that interaction is a key part of the process by which humans come to agree on word usage (Garrod and Anderson 1987). Perhaps a small group of people who constantly communicate with each other over a long period (presumably the circumstances under which language evolved) will agree on word meanings. But in the modern world it is common for human writers to write documents for people whom they have never met or otherwise interacted with, which may reduce the effectiveness of this natural interaction mechanism for agreeing on word meanings.</Paragraph> <Paragraph position="3"> In summary, dealing with lexical variation among human readers is a challenge for NLG systems and will undoubtedly require a considerable amount of thought, research, and data collection. But if NLG systems can do a good job of this, they might end up producing texts superior to those of many human writers, which would greatly enhance the appeal of NLG technology.</Paragraph> </Section> </Section> </Paper>