<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0216"> <Title>Animacy Encoding in English: why and how</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 The Animacy Hierarchy </SectionTitle> <Paragraph position="0"> Given the pervasive importance of animacy information in human languages, one might expect it to be a well-understood linguistic category.</Paragraph> <Paragraph position="1"> Nothing could be farther from the truth. Linguistic descriptions appeal to a hierarchy that in its minimal form distinguishes between human, non-human animate and inanimate, but can contain more distinctions, such as distinctions between higher and lower animals (see Yamamoto, 1999, for a particularly elaborate scheme).</Paragraph> <Paragraph position="2"> What makes it difficult to develop clear categories of animacy is that the linguistically relevant distinctions between animate and non-animate and between human and non-human are not the same as the biological distinctions. Part of this research is devoted to discovering the principles that underlie the distinctions, and the type of distinctions proposed depends on the assumptions that a researcher makes about the underlying motivation for them, e.g. as a reflection of the language user's empathy with living beings (e.g.</Paragraph> <Paragraph position="3"> Yamamoto, 1999). What is of particular interest for natural language processing is the observation that the distinctions are most likely not the same across languages (cf. Comrie, 1989) and can even change over time in a given language. They are similar to other scalar phenomena such as voicing onset times, which play a role in different languages but where the categorization into voiced and unvoiced does not correspond to the same physical boundary in each language. But whereas voicing onset times can be physically measured, we do not have an objective measure of animacy. 
The categories involved correspond to the degree to which various entities are construed as human-like by a given group of speakers, and at this point we have no language-independent measure for this.</Paragraph> <Paragraph position="4"> Moreover, languages make ample use of metaphor and metonymy. The intent of an animacy coding is to encode the animacy status of the referent of the linguistic expression. But sometimes in figurative language it is not clear what the referent is.</Paragraph> <Paragraph position="5"> Especially prevalent cases of metonymy are the use of names to refer both to organizations (e.g.</Paragraph> <Paragraph position="6"> IBM) and to characteristic members of them, and the use of place names (e.g. Russia) to refer both to organizational entities and geographical places or inhabitants of them. Terms belonging to these semantic classes are systematically ambiguous.</Paragraph> <Paragraph position="7"> Whereas it is true that animacy can be determined by looking at the entity an expression refers to, in practice it is not always clear what the referent of an expression is.</Paragraph> <Paragraph position="8"> The notions that the animacy hierarchy appeals to, then, are not a priori well defined. Work is therefore necessary on two levels: to better define which distinctions play a role in English and to determine where they play a role. Conceptually, it might be desirable to replace the idea of a hierarchy of discrete classes with a partial ordering of entities.</Paragraph> <Paragraph position="9"> This is, however, not the place to pursue this idea.</Paragraph> <Paragraph position="10"> Fortunately, one does not need to wait until the first problem is solved completely to tackle the second.</Paragraph> <Paragraph position="11"> The results obtained in certain linguistic contexts are robust for the top and the bottom of the hierarchy. Uncertainty about the middle does not prevent us from establishing the importance of the dimension as such. 
Refining the definition of animacy will, however, be important for more detailed studies of the interaction between the various accessibility hierarchies. This more precise notion will be needed for cross-linguistic studies, and, in the context of natural language processing, for high-quality generation and translation.</Paragraph> </Section> <Section position="5" start_page="0" end_page="1" type="metho"> <SectionTitle> 5 Animacy Annotation </SectionTitle> <Paragraph position="0"> As we have discussed above, the animacy scale is an important factor in the choice of certain constructions in English. But it is only a soft constraint and as such outside the realm of things that native speakers have clear judgments about. The best ways to study such phenomena are psychological experiments and corpus analysis.</Paragraph> <Paragraph position="1"> The annotation exercise we engaged in is meant to facilitate the latter.</Paragraph> <Paragraph position="2"> Given the situation described with respect to animacy categories, a natural way to proceed is to start with a commonsensical approach and see where it leads. In 2000-2002, two rather similar initiatives led to the need for animacy annotations: one, the paraphrase project, a collaboration between Stanford and Edinburgh, concentrating on the factors that influence the choice between different sentence-level syntactic paraphrases (Bresnan et al. 2002), and another concentrating on the possessive alternation (O'Connor, 2000). The two projects used a very similar animacy annotation scheme, developed in the context of the O'Connor project.</Paragraph> <Paragraph position="3"> The scheme was used in two different ways. The Boston team coded 20,000 noun phrases in 'possessive' constructions from the Brown Corpus. The first round of coding was automated, with the animacy annotation based primarily on word lists and morphological information. The second round was performed manually by pairs of coders using a decision tree. 
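An automated first round of this kind could look something like the following sketch. The word lists and the agentive '-er' cue here are purely illustrative assumptions of ours, not the actual resources or rules used by the Boston team.

```python
# Illustrative sketch of a first-pass animacy coder driven by word lists
# and a crude morphological cue. The lists below are invented examples.

HUMAN_WORDS = {"man", "woman", "child", "teacher", "student"}
ORG_WORDS = {"company", "government", "committee", "school"}
TIME_WORDS = {"year", "week", "summer", "moment"}

def code_animacy(head_noun):
    """Return a tentative animacy tag for the head noun of an NP."""
    noun = head_noun.lower()
    if noun in HUMAN_WORDS:
        return "HUMAN"
    if noun in ORG_WORDS:
        return "ORG"
    if noun in TIME_WORDS:
        return "TIME"
    if noun.endswith("er"):        # crude agentive '-er' cue, e.g. 'driver'
        return "HUMAN"
    return "NONCONC"               # the scheme's default category

print(code_animacy("teacher"))     # HUMAN
print(code_animacy("summer"))      # TIME (the word list beats the '-er' cue)
print(code_animacy("idea"))        # NONCONC
```

A tagger of this sort is cheap but error-prone, which is why the manual second round was needed.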
The two coders were required to agree on each code; every case in which there was not complete agreement was discussed by the rest of the team, until a choice of code was made. This way of annotating does not lend itself to a study of reliability, except between the automated coder and the human coders as a group. For more information on this use of the coding system, see Garretson & O'Connor (2004).</Paragraph> <Paragraph position="4"> In what follows we concentrate on the use of the coding scheme in the Stanford-Edinburgh paraphrase project.</Paragraph> <Paragraph position="5"> The overall aim of the paraphrase project is to provide the community of linguists and computational linguists with a corpus that can be used to calculate the impact of the various factors on different constructions. The annotation scheme assumes that the main distinction is three-way: human, other animates and inanimates, but the latter two categories are subdivided further as follows: - Other animates: organizations, animals, intelligent machines and vehicles.</Paragraph> <Paragraph position="6"> - Inanimates: concrete inanimate, non-concrete inanimate, place and time.</Paragraph> <Paragraph position="7"> The category 'organization' is important because organizations are often presented as groups of humans engaging in actions that are typically associated with humans (they make pronouncements, decisions, etc.). The categories 'place' and 'time' are especially important for the possessive encoding, as it has often been observed that some spatial and temporal expressions are realized as Saxon genitives (see e.g. 
Rosenbach (2002)).</Paragraph> <Paragraph position="8"> For the cases in which no clear decision could be made, a category 'variable animacy' was invented, and the coders were also given the option to defer the decision by marking an item with 'oanim'.</Paragraph> <Paragraph position="9"> The overall coding scheme, with a summary of the instructions given to the coders, looks as follows:</Paragraph> </Section> <Section position="6" start_page="1" end_page="1" type="metho"> <SectionTitle> HUMAN </SectionTitle> <Paragraph position="0"> Refers to one or more humans; this includes imaginary entities that are presented as human, gods, elves, ghosts, etc.: things that look human and act like humans.</Paragraph> <Paragraph position="1"> ORG This tag was proposed for collectivities of humans when displaying some degree of group identity. The properties that are deemed relevant can be represented by the following implicational hierarchy:
+/- chartered/official
+/- temporally stable
+/- collective voice/purpose
+/- collective action
+/- collective
The cut-off point between HUMAN and ORG was put at 'having a collective voice/purpose': a group with collective voice and purpose is deemed to be an ORG, whereas a group with merely collective action, such as a mob, is not.</Paragraph> <Paragraph position="2"> For a more extensive description of the annotation scheme, see Garretson et al. (2004).</Paragraph> </Section>
The coding scheme takes the view that only potential locations for humans are thought of as 'places'. On the other hand, some places can be thought of as ORGs. The tag was applied in a rather restricted way: in a sentence such as 'my house was built in 1960', 'my house' is coded as CONC (see below), whereas in 'I was at my house' it would be a PLACE.</Paragraph> </Section> <Section position="9" start_page="1" end_page="1" type="metho"> <SectionTitle> TIME </SectionTitle> <Paragraph position="0"> This tag is meant to be applied to expressions referring to periods of time. It was applied rather liberally.</Paragraph> </Section> <Section position="10" start_page="1" end_page="1" type="metho"> <SectionTitle> CONCRETE </SectionTitle> <Paragraph position="0"> This tag is restricted to 'prototypical' concrete objects or substances. Excluded are things like air, voice, wind and other intangibles. Body parts are concrete.</Paragraph> </Section> <Section position="11" start_page="1" end_page="1" type="metho"> <SectionTitle> NONCONC </SectionTitle> <Paragraph position="0"> This is the default category. It is used for events, and anything else that is not prototypically concrete but clearly inanimate.</Paragraph> <Paragraph position="1"> MAC A minor tag used for intelligent machines, such as computers or robots.</Paragraph> <Paragraph position="2"> VEH Another minor category used for vehicles, as it has been observed that these are treated as living beings in some linguistic contexts (e.g. pronoun selection in languages such as English where normal gender distinctions only apply to animates). 
OANIM This tag is used when the coder is completely unsure and wants to come back to the example later.</Paragraph> </Section> <Section position="12" start_page="1" end_page="1" type="metho"> <SectionTitle> VANIM </SectionTitle> <Paragraph position="0"> This tag can be used in conjunction with another one to indicate that the coder is not entirely sure of the code and thinks there are reasons to give another code too.</Paragraph> <Paragraph position="1"> Finally, NOT-UNDERSTOOD was supposed to be used when the text as a whole was not clear. Three coders coded the parsed part of the Switchboard corpus (Godfrey et al. 1992) over the summer of 2003. The corpus consists of around 600 transcribed dialogues on various predetermined topics among speakers of American English. Before the annotation exercise began, the dialogues were converted into XML (Carletta et al. 2004). The entities that needed to be annotated (the NPs and possessive determiners) were automatically selected and filtered for the coders. The three coders were undergraduate students at Stanford University who were paid for the work.</Paragraph> <Paragraph position="2"> The schema presented above was discussed with them and presented in the form of a decision tree. Difficult cases were discussed, but eventually each coder worked independently. 
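The decision tree given to the coders is not reproduced here, but the tag set above might be rendered in that form roughly as follows. The ordering of the questions is our own reconstruction, not the tree the coders actually saw.

```python
# Hypothetical decision-tree rendering of the animacy scheme. The order
# of the questions is our own reconstruction, not the coders' actual tree.

def animacy_tag(human=False, collective_voice=False, animal=False,
                machine=False, vehicle=False, place=False, time=False,
                concrete=False):
    if human:
        # A group with a collective voice/purpose is an ORG; looser
        # groups (e.g. a mob) fall back to HUMAN.
        return "ORG" if collective_voice else "HUMAN"
    if animal:
        return "ANIMAL"      # non-human animates, incl. viruses, bacteria
    if machine:
        return "MAC"         # intelligent machines: computers, robots
    if vehicle:
        return "VEH"
    if place:
        return "PLACE"       # only things referred to 'as a place'
    if time:
        return "TIME"        # periods of time
    if concrete:
        return "CONCRETE"    # prototypical concrete objects/substances
    return "NONCONC"         # the default category

print(animacy_tag(human=True, collective_voice=True))  # ORG
print(animacy_tag(time=True))                          # TIME
```

OANIM and VANIM, being meta-codes about the coder's confidence rather than animacy categories, sit outside such a tree.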
In total, 599 dialogues were annotated.</Paragraph> </Section> <Section position="13" start_page="1" end_page="2" type="metho"> <SectionTitle> 6 Coding reliability </SectionTitle> <Paragraph position="0"> The reliability of the annotation was evaluated using the kappa statistic (Carletta, 1996).</Paragraph> <Paragraph position="1"> Although there are no hard and fast rules about what makes an acceptable kappa coefficient--it depends on the use to which the data will be put--many researchers in the computational linguistics community have adopted the rule of thumb that discourse annotation should have a kappa of at least .8.</Paragraph> <Paragraph position="2"> For the reliability study, we had three individuals work separately to code the same four dialogues with the animacy scheme. Markables (in this case NPs and possessives) had been extracted automatically from the data, leading the coders to mark around 10% of the overall set with a category that indicated that they were not proper markables and therefore not to be coded. Omitting these (non-)markables, for the data set overall, K=.92 (k=3, N=1081).</Paragraph> <Paragraph position="3"> In general, coders did not agree about which cases were problematic enough to mark as VANIM, and omitting the markables that any coder marked as problematic using the VANIM code leads to a slight improvement (K=.96, k=3, N=1135).</Paragraph> <Paragraph position="4"> It is important to note that these kappa coefficients are so high primarily because two categories which are easy to differentiate from each other, HUMAN and NONCONC, swamp the rest of the categories.</Paragraph> <Paragraph position="5"> The cross-coder reliability for them is satisfactory, but the intermediate categories were not defined well enough to allow reliable coding.</Paragraph> <Paragraph position="6"> Figure 1 shows the confusion matrix for the data, including markables that any coder additionally marked as problematic using the VANIM notation. 
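Agreement figures of this kind can be computed with a short sketch. The labels below are toy data of our own, and the kappa shown is the standard Fleiss-style generalization to k coders (the paper itself follows Carletta, 1996).

```python
from collections import Counter
from itertools import combinations

def fleiss_kappa(items):
    """Fleiss-style kappa. items is a list of per-markable label tuples,
    one label per coder. Assumes chance agreement p_exp < 1."""
    k = len(items[0])                  # number of coders
    n = len(items)                     # number of markables
    pairs = k * (k - 1)
    # Observed agreement: mean proportion of agreeing coder pairs per item.
    p_obs = sum(
        sum(c * (c - 1) for c in Counter(item).values()) / pairs
        for item in items
    ) / n
    # Chance agreement from the marginal category proportions.
    total = n * k
    categories = {label for item in items for label in item}
    p_exp = sum(
        (sum(item.count(cat) for item in items) / total) ** 2
        for cat in categories
    )
    return (p_obs - p_exp) / (1 - p_exp)

def pairwise_confusion(items):
    """Symmetric pairwise confusion counts (cf. Figure 1): how often one
    coder of a pair chose the row label while the other chose the column
    label for the same markable."""
    conf = Counter()
    for item in items:
        for a, b in combinations(item, 2):
            conf[(a, b)] += 1
            if a != b:
                conf[(b, a)] += 1
    return conf

# Toy data: three coders, three markables.
data = [("HUMAN", "HUMAN", "HUMAN"),
        ("NONCONC", "NONCONC", "NONCONC"),
        ("HUMAN", "HUMAN", "NONCONC")]
print(round(fleiss_kappa(data), 2))                    # 0.55
print(pairwise_confusion(data)[("HUMAN", "NONCONC")])  # 2
```

As the toy example shows, a single disagreeing item among mostly unanimous ones already pulls kappa well below the .8 rule of thumb, which is why the dominance of HUMAN and NONCONC matters for interpreting the reported coefficients.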
Considering the coders one pair at a time, the matrix records the frequency with which one coder of the pair chose the category named in the row header whilst the other chose the category named in the column header for the same markable.</Paragraph> <Paragraph position="7"> Although we were aware of the less than formal definitions given for the categories, we had hoped that the coders would share the intuitive understanding of the developers of the categories. This is obviously not the case for all categories. What was also surprising was that allowing coders to mark cases as problematic using the VANIM code was not worthwhile, since the coders did not often take advantage of this option, and taking the VANIM codes into account during analysis has little effect.</Paragraph> <Paragraph position="8"> Analysis of the four annotated dialogues points to several sources of the intercoder disagreement.</Paragraph> <Paragraph position="9"> * The categories TIME and PLACE were defined in a way that did not coincide with the coders' intuitive understanding of them. The tag TIME was supposed to refer to 'periods of time'. This led to some wavering interpretations for temporal expressions that do not designate a once-occurring period of time.</Paragraph> <Paragraph position="10"> For instance, 'this time' and 'next time' were coded as TIME by two coders but as 'NONCONC' by the third one. Clearer training on what was meant could have helped here.</Paragraph> <Paragraph position="11"> * As mentioned above, the choice between HUMAN, ORG and NONCONC depended on how the coders interpreted the referent of the expression. Although guidelines were given about the difference between HUMAN and ORG (see above), the cut-off point wasn't always clear. 
The distinction between ORGs as proposed in our schema and less organized human groups seems too unstable to be useful.</Paragraph> <Paragraph position="12"> * Vagueness of reference: for instance, a school as an organization can be marked as ORG by the coders, but later in the dialogue there is discussion about what is done with napping children in the school, and one speaker says 'if they (the children) fall asleep they kind of let them sleep'. One coder interpreted the second 'they' as simply referring to the school organization and marked it as ORG, whereas another interpreted it as referring to a rather vague group of humans, presumably some teachers, and marked it as HUMAN. This vagueness of reference is quite prevalent in spoken language, especially with the pronoun 'they'.</Paragraph> <Paragraph position="13"> * Attention errors: vehicles, for example, were supposed to get a special code, but, presumably because there were so few, this was sometimes forgotten. One coder coded 'a couple of weeks' as HUMAN. These kinds of mistakes are unavoidable, and the very tools that make the encoding easier (e.g. the automatic advancing from one markable to another) might make them more frequent.</Paragraph> <Paragraph position="14"> While the problems with ORG and HUMAN do not come as a surprise, the difficulties with PLACE, TIME and CONCRETE were unexpected. The two minor classes, MAC and VEH, and the ANIMAL class occurred so seldom that no significant results were obtained for them in this sample. They were equally rare in the corpus as a whole.</Paragraph> </Section> </Paper>