<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1052">
  <Title>Using an Ontology to Determine English Countability</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Linguistic Background
</SectionTitle>
    <Paragraph position="0"> Grammatical countability is motivated by the semantic distinction between object and substance reference (also known as bounded/non-bounded or individuated/non-individuated). Imai and Gentner (1997) show that the presence of countability in English and its absence in Japanese influences how native speakers conceptualize unknown nouns as objects or substances. There is definitely some link between countability and conceptualization, but it is a subject of contention among linguists as to how far grammatical countability is motivated and how much it is arbitrary. Jackendoff (1991) assumes countability and number to be fully motivated, and shows various rules for conversion between countable and uncountable meanings, but does not discuss any of the problematic exceptions.</Paragraph>
    <Paragraph position="1"> The prevailing position in the natural language processing community is to effectively treat countability as though it were arbitrary and encode it as a lexical property of nouns.</Paragraph>
    <Paragraph position="2"> Copestake (1992) has gone some way toward representing countability at the semantic level using a type form with subtypes countable and uncountable with further subtypes below these. Words that undergo conversion between different values of form can be linked with lexical rules, such as the grinding rule that links a countable animal with its uncountable interpretation as meat. These are not, however, directly linked to a full ontology. Therefore there is no direct connection between being an animal and being countable.</Paragraph>
    <Paragraph position="3"> Bond et al. (1994) suggested a division of countability into five major types, based on Allan (1980)'s noun countability preferences (NCPs). Nouns which rarely undergo conversion are marked as either fully countable, uncountable or plural only. Nouns that are non-specified are marked as either strongly countable (for count nouns that can be converted to mass, such as cake) or weakly countable (for mass nouns that are readily convertible to count, such as beer). Conversion is triggered by surrounding context. Noun phrases headed by uncountable nouns can be converted to countable noun phrases by generating classifiers: one piece of equipment, as described in Bond and Ikehara (1996).</Paragraph>
    <Paragraph position="4"> Full knowledge of the referent of a noun phrase is not enough to predict countability.</Paragraph>
    <Paragraph position="5"> There is also language-specific knowledge required. There are at least three sources of evidence for this. The first is that different languages encode the countability of the same referent in different ways. To use Allan (1980)'s example, there is nothing about the concept denoted by lightning that rules out *a lightning being interpreted as a flash of lightning. In both German and French (which distinguish between countable and uncountable uses of words) the translation equivalents of lightning are fully countable (ein Blitz and un éclair respectively). Even within the same language, the same referent can be encoded countably or uncountably: clothes/clothing, things/stuff, jobs/work. The second source of evidence comes from the psycho-linguistic studies of Imai and Gentner (1997), who show that speakers of Japanese and English characterize the same referent in different ways depending on whether they consider it to be countable (more common for English speakers) or uncountable (more common for Japanese speakers). Further evidence comes from the English of non-native speakers, particularly those whose native grammar does not mark countability. Presumably, their knowledge of the world is just as complete as that of native English speakers, but they tend to have difficulty with the English-specific conceptual encoding of countability.</Paragraph>
    <Paragraph position="6"> In the next section (§ 3) we describe the resources we use to measure the predictability of countability by meaning, and then describe our experiment (§ 4).</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Resources
</SectionTitle>
    <Paragraph position="0"> We use the five noun countability classes of Bond et al. (1994), and the 2,710 semantic classes used in the Japanese-to-English machine translation system ALT-J/E (Ikehara et al., 1991). These are combined in the machine translation lexicons, allowing us to quantify how well semantic classes predict countability.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Semantic Transfer Dictionary
</SectionTitle>
      <Paragraph position="0"> We use the common noun part of ALT-J/E's Japanese-to-English semantic transfer dictionary. It contains 71,833 linked Japanese-English pairs. A simplified example of the entry for usagi "rabbit" is given in Figure 1. Each record of the dictionary has a Japanese index form, a sense number, an English index form, English syntactic information, English semantic information, domain information and so on.</Paragraph>
      <Paragraph position="1"> English syntactic information includes the part of speech, noun countability preference, default number, default article and whether the noun is inherently possessed. The semantic information includes common and proper noun semantic classes. In this example, there are two semantic classes: animal subsumed by living thing, and meat subsumed by foodstuff.</Paragraph>
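      <Paragraph> The record layout described above can be pictured with a small data structure. This is only an illustrative sketch: the field names, the `TransferEntry` class, the sense number 1 and the default number "sg" are assumptions made for the example, not the actual ALT-J/E format. The code BC is the strongly countable code listed in Table 1, the NCP the paper gives for usagi.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TransferEntry:
    """One Japanese-English pair (hypothetical layout, not the real ALT-J/E schema)."""
    japanese_index: str
    sense: int
    english_index: str
    pos: str
    ncp: str                 # noun countability preference code, e.g. "BC"
    default_number: str      # "sg" or "pl"
    semantic_classes: List[str] = field(default_factory=list)

    def is_multiword(self) -> bool:
        # Looked at from the English side: more than one token means multi-word.
        return len(self.english_index.split()) > 1


# usagi "rabbit": two semantic classes and one NCP (strongly countable).
rabbit = TransferEntry("usagi", 1, "rabbit", "noun", "BC", "sg", ["animal", "meat"])
dinner_set = TransferEntry("youshokki", 1, "dinner set", "noun", "CO", "sg",
                           ["910:tableware"])
```

The `is_multiword` helper mirrors the single-word versus multi-word split used later in the experiments.</Paragraph>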
      <Paragraph position="2"> Because the dictionary was developed for a Japanese-to-English machine translation system, many of the English translations are longer than the Japanese source terms: many concepts encoded in a single lexical item in Japanese may need multiple words in English. Of the 71,833 entries, 41,285 are multi-word expressions in English (57.4%).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Semantic Ontology
</SectionTitle>
      <Paragraph position="0"> ALT-J/E's ontology classifies concepts to use in expressing relationships between words. The meanings of common nouns are given in terms of a semantic hierarchy of 2,710 nodes. Each node in the hierarchy represents a semantic class.</Paragraph>
      <Paragraph position="1"> Edges in the hierarchy represent is-a or has-a relationships, so that the child of a semantic class related by an is-a relation is subsumed by it. For example, organ is-a body-part.</Paragraph>
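      <Paragraph> As a rough illustration of subsumption over is-a edges, the sketch below stores a child-to-parent map and walks upward. The three edges are the ones mentioned in this paper; the `IS_A` table and the `subsumes` function are hypothetical names, and has-a edges are deliberately left out.

```python
# Hypothetical fragment of the hierarchy: child class -> parent class (is-a only).
IS_A = {
    "animal": "living thing",
    "meat": "foodstuff",
    "organ": "body-part",
}


def subsumes(ancestor, cls, is_a=IS_A):
    """Return True if `ancestor` is `cls` itself or lies above it via is-a edges."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls == ancestor:
            return True
        seen.add(cls)          # guard against accidental cycles in the table
        cls = is_a.get(cls)    # None once we step past a root
    return False
```

For instance, `subsumes("body-part", "organ")` holds, while the reverse direction does not.</Paragraph>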
      <Paragraph position="2"> The semantic hierarchy and the Japanese dictionary marked with it have been published as Goi-Taikei: A Japanese Lexicon (Ikehara et al., 1997).</Paragraph>
      <Paragraph position="3"> The semantic classes are primarily used to distinguish between word-senses using the selectional restrictions which predicates place on their arguments. Countability has not been used as a criterion in deciding which word should go into which class. In fact, because the dictionary has been built mainly by native Japanese speakers, who do not have reliable intuitions on countability, it was not possible to use countability to help decide into which class to put a given word.</Paragraph>
      <Paragraph position="4"> Although the dictionary has been extensively used in a machine translation system, errors still exist. A detailed examination of user dictionaries with the same information content, made by the same lexicographers who built the lexicon, found errors in 11–21% of the entries (Ikehara et al., 1995). A particularly common source of errors was words being placed one level too high or low in the hierarchy. The same study found that 90% of words entered into a user dictionary could be automatically assigned to lexical classes with 13–25% errors, although words were assigned to too many semantic classes 32–56% of the time (the range in errors is due to different results from different domains: newspapers and software manuals).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Noun Countability Preferences
</SectionTitle>
      <Paragraph position="0"> Nouns in the dictionary are marked with one of five major countability preference classes: fully countable, strongly countable, weakly countable, uncountable and plural only, described at length in Bond (2001).</Paragraph>
      <Paragraph position="1"> In addition to countability, default values for number and classifier (cl) are also part of the lexicon. The classes and additional features are summarized in Table 1, along with their distribution in ALT-J/E's common noun dictionary (see footnote 1). The most common NCP is fully countable, followed by uncountable.</Paragraph>
      <Paragraph position="2"> The two most basic types are fully countable and uncountable. Fully countable nouns such as knife have both singular and plural forms, and cannot be used with determiners such as much, little, a little, less and overmuch. Uncountable nouns, such as furniture, have no plural form, and can be used with much.</Paragraph>
      <Paragraph position="3"> Between these two extremes there are a vast number of nouns, such as cake, that can be used in both countable and uncountable noun phrases. They have both singular and plural forms, and can also be used with much.</Paragraph>
      <Paragraph position="4"> Whether such nouns will be used countably or uncountably depends on whether their referent is being thought of as made up of discrete units or not. As it is not always possible to determine this explicitly when translating from Japanese to English, we divide these nouns into two groups: strongly countable, those that refer to discrete entities by default, such as cake, and weakly countable, those that refer to non-bounded referents by default, such as beer. At present, these distinctions are made by the lexicographers' intuition, as there are no large sense-tagged corpora to train from.</Paragraph>
      <Paragraph position="5"> In fact, almost all English nouns can be used in uncountable environments, for example, if they are given the ground interpretation. The only exception is classifiers such as piece or bit, which refer to quanta, and thus have no uncountable interpretation.</Paragraph>
      <Paragraph position="6"> Language users are sensitive to relative frequencies of variant forms and senses of lexical items (Briscoe and Copestake, 1999, p. 511). The division into fully countable, strongly countable, weakly countable and uncountable is, in effect, a coarse way of reflecting this variation for noun countability. (Footnote 1: We ignore the two subclasses in this paper: collective nouns are treated as fully countable and semi-countable nouns as uncountable.)</Paragraph>
      <Paragraph position="7"> Table 1: noun countability preference | code | example | default number | default classifier | entries | %
strongly countable | BC | cake | sg | - | 3,110 | 4.3
weakly countable | BU | beer | sg | - | 3,377 | 4.7
uncountable | UC | furniture | sg | piece | 15,435 | 21.5
plural only | PT | scissors | pl | pair | 2,107 | 2.9
The last major type of countability preference is plural only: nouns that only have a plural form, such as scissors. They can neither be denumerated nor modified by much. Plural only nouns are further divided depending on what classifier they take. For example, pair plural only nouns use pair as a classifier when they are denumerated: a pair of scissors. This is motivated by the shape of the referent: pair plural only nouns are things that have a bipartite structure. Such words only use a singular form when used as modifiers (a scissor movement). Other plural only nouns, such as clothes, use the plural form even as modifiers (a clothes horse). In this case, the base (uninflected) form is clothes, and the plural form is zero-derived from it. The word clothes cannot be denumerated at all. If clothes must be counted, then a countable word of similar meaning is substituted, or clothing is used with a classifier: a garment, a suit, a piece of clothing.</Paragraph>
      <Paragraph position="8"> Information this detailed about noun countability preferences is not found in standard dictionaries. To enter this information into the transfer lexicon, a single (Australian) English native speaker with some knowledge of Japanese examined all of the entries in Goi-Taikei's common-noun dictionary and determined appropriate values for their countability preferences.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiment and Results
</SectionTitle>
    <Paragraph position="0"> To test how well the semantic classes predict the countability preferences, we carried out a series of experiments.</Paragraph>
    <Paragraph position="1"> We ran the experiments under several conditions, to test the effect of combinations of semantic classes and single-word or multi-word entries. In all cases the baseline was to give the most frequently occurring noun countability preference (which was always fully countable).</Paragraph>
    <Paragraph position="2"> In the experiments, we use five NCPs (fully, strongly and weakly countable, uncountable and plural only); we do not consider default number in any of the experiments.</Paragraph>
    <Paragraph position="3"> For each combination of semantic classes in the lexicon, we calculated the most common NCP. Ties are resolved as follows: fully countable beats strongly countable beats weakly countable beats uncountable beats plural only. For example, consider the semantic class 910:tableware with four members: shokki "tableware" (UC), youshokki "dinner set" (CO), youshokki "Western-style tableware" (UC) and toukirui "crockery" (UC).</Paragraph>
    <Paragraph position="4">  The most common NCP is UC, so the NCP associated with this class is uncountable.</Paragraph>
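    <Paragraph> The majority vote with its tie-breaking order can be sketched as follows. The function name `most_common_ncp` is made up for this example; the two-letter codes are those used in the paper (CO for fully countable, BC, BU, UC and PT as in Table 1).

```python
from collections import Counter

# Tie-break order from the text:
# fully countable, strongly countable, weakly countable, uncountable, plural only.
PRIORITY = ["CO", "BC", "BU", "UC", "PT"]


def most_common_ncp(ncps):
    """Return the most frequent NCP code; ties go to the earlier code in PRIORITY."""
    counts = Counter(ncps)
    # Sort key: higher count first, then higher priority (smaller index).
    return max(PRIORITY, key=lambda code: (counts[code], -PRIORITY.index(code)))


# The four members of 910:tableware listed above.
tableware = ["UC", "CO", "UC", "UC"]
print(most_common_ncp(tableware))  # UC
```

With an empty input every count is zero, so the function falls back to CO, which happens to match the paper's overall default of fully countable.</Paragraph>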
    <Paragraph position="5"> In our first experiment, we calculated the percentage of entries whose NCP was the same as the most common one. For example, the NCP associated with the semantic class 910:tableware is uncountable. This is correct for three out of the four words in this semantic class. This is equivalent to testing on the training data, and gives a measure of how well semantic classes actually predict noun countability in ALT-J/E's lexicon: 77.9% of the time. This is better than the baseline of all fully countable, which would give 65.8%. All the results are presented in Table 2.</Paragraph>
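    <Paragraph> Testing on the training data can be sketched in a few lines. This is a minimal illustration, not the original experimental code: `train` and `accuracy` are hypothetical names, and entries are reduced to a pair of semantic classes and NCP code.

```python
from collections import Counter, defaultdict

PRIORITY = ["CO", "BC", "BU", "UC", "PT"]  # tie-break order given in the text


def train(entries):
    """Map each combination of semantic classes to its most common NCP."""
    by_combo = defaultdict(Counter)
    for classes, ncp in entries:
        by_combo[frozenset(classes)][ncp] += 1
    return {combo: max(PRIORITY, key=lambda c: (counts[c], -PRIORITY.index(c)))
            for combo, counts in by_combo.items()}


def accuracy(model, entries, default="CO"):
    """Fraction of entries whose NCP equals the one predicted from their classes."""
    hits = sum(1 for classes, ncp in entries
               if model.get(frozenset(classes), default) == ncp)
    return hits / len(entries)


# The 910:tableware class: three uncountable members and one fully countable.
lexicon = [(["910:tableware"], "UC"), (["910:tableware"], "CO"),
           (["910:tableware"], "UC"), (["910:tableware"], "UC")]
model = train(lexicon)
print(accuracy(model, lexicon))  # 0.75
```

On this toy lexicon the score is three out of four, as in the tableware example.</Paragraph>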
    <Paragraph position="6"> In order to test how useful semantic classes would be in predicting the countability of unknown words, we tested the system using stratified ten-fold cross validation. That is, we divided the common noun dictionary into ten sets, then tested on each set in turn, with the other nine-tenths of the data used as the training set. In order to ensure an even distribution, the data was stratified by sorting according to semantic class with every 10th item included in the same set. If the combination of semantic classes was not found in the training set, we took the countability to be the overall most common NCP: fully countable. This occurred 11.6% of the time. Using only nine-tenths of the data, the accuracy went down to 71.2%, 5.4 percentage points above the baseline. In this case the training set for 910:tableware will still always contain a majority of uncountable nouns, so it will be associated with UC. This will be correct for all the words in the class except youshokki "dinner set" (CO).</Paragraph>
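    <Paragraph> The fold construction and fallback can be sketched as below. This is an assumption-laden illustration rather than the original code: the function names are invented, entries are simplified to pairs of semantic classes and NCP code, and the toy data is fabricated purely to exercise the fallback to fully countable (CO).

```python
from collections import Counter, defaultdict

PRIORITY = ["CO", "BC", "BU", "UC", "PT"]  # tie-break order given in the text


def train(entries):
    """Map each semantic-class combination to its most common NCP."""
    by_combo = defaultdict(Counter)
    for classes, ncp in entries:
        by_combo[frozenset(classes)][ncp] += 1
    return {combo: max(PRIORITY, key=lambda c: (counts[c], -PRIORITY.index(c)))
            for combo, counts in by_combo.items()}


def stratified_folds(entries, k=10):
    """Sort by semantic class, then put every k-th item into the same fold."""
    ordered = sorted(entries, key=lambda e: sorted(e[0]))
    return [ordered[i::k] for i in range(k)]


def cross_validate(entries, k=10, default="CO"):
    """Accuracy over k folds; unseen class combinations fall back to `default`."""
    folds = stratified_folds(entries, k)
    hits = 0
    for i, test_fold in enumerate(folds):
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train(training)
        hits += sum(1 for classes, ncp in test_fold
                    if model.get(frozenset(classes), default) == ncp)
    return hits / len(entries)


# Toy data: two well-populated classes and one singleton that is never seen in training.
toy = [(["A"], "UC")] * 20 + [(["B"], "CO")] * 20 + [(["C"], "PT")]
score = cross_validate(toy)
```

In the toy data the singleton class C can never be learned, so its entry receives the default and is counted as an error, which is the situation the paper reports for 11.6% of test items.</Paragraph>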
    <Paragraph position="7"> Finally, we divided the dictionary into single and multiple word entries (looked at from the English side) and re-tested. It was much harder to predict countability for single words (66.6%) than it was for multi-word expressions (74.8%). We will discuss the reason for this in the next section.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> The upper bound of 78% was lower than we expected. There were some problems with the granularity of the hierarchy. In English, the class names of heterogeneous collections of objects tend to be uncountable, while the names of the actual objects are countable.</Paragraph>
    <Paragraph position="1"> For example, the following terms are all hyponyms of tableware in WordNet (Fellbaum, 1998): cutlery, chopsticks, crockery, dishware, dinnerware, glassware, glasswork, gold plate, service, tea set, .... Most of the entries are either uncountable, or multi-word expressions headed by group classifiers, such as service and set. The words below these classes are almost all countable, with a sprinkling of plural only (like tongs). Thus, of the three levels of the hierarchy, the top two are mainly uncountable and the level below is mainly countable.</Paragraph>
    <Paragraph position="2"> However, ALT-J/E's ontology only has two levels here: 910:tableware has four daughters, all leaf nodes in the semantic hierarchy: 911:crockery, 912:cookware, 913:cutlery and 914:tableware (other). The majority NCPs for all four of these classes are fully countable. The question arises as to whether words such as cutlery should be in the upper or lower level. Using countability as an additional criterion for deciding which class to add a word to makes the task more constrained, and therefore more consistent. In this case, we would add cutlery to the parent node 910:tableware, on the basis of its countability (or add a new layer to the ontology).</Paragraph>
    <Paragraph position="3"> Adding countability as a criterion would also help to solve the problem of words being entered in a class one level too high or too low, as noted in Section 3.2.</Paragraph>
    <Paragraph position="4"> We were resigned to getting almost all of the pair plural only wrong, and we did, but they amount to less than 3% of the total. Although there are some functional similarities, such as a large percentage of 820:clothes for the lower body, it was more common to get one or two in an otherwise large group, such as tongs in the 913:cutlery class, which is overwhelmingly fully countable. Because the major differentiator is physical shape, which is not included in our semantic hierarchy, these words cannot be learned by our method. This is another argument for the importance of representing physical shape so that it is accessible for linguistic processing.</Paragraph>
    <Paragraph position="5"> We had expected single word entries to be easier to predict than multiple word entries, because of the lack of influence of modifiers. However, the experiment showed the opposite. On investigation, we found that single word entries tended to have more semantic classes per word (1.38 vs 1.34) and more varied combinations of semantic classes. This meant that there were 5.1 entries per combination to train on for the multi-word entries, but only 3.7 for the single word entries. Therefore, it was harder to train for the single word entries.</Paragraph>
    <Paragraph position="6"> As can be seen in the case of tableware given above, there were classes where the single-word and multi-word expressions in the same semantic class had different countabilities. Therefore, even though there were fewer training examples, learning the NCPs differently for single and multi-word expressions and then combining the results gave an improved score: 72.0%.</Paragraph>
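    <Paragraph> One way to learn NCPs separately for single-word and multi-word expressions is to add the word-count flag to the lookup key. The paper does not say how the two sets of results were combined, so routing each entry to the model for its own type is an assumption of this sketch, as are the names `train_split` and `predict`.

```python
from collections import Counter, defaultdict

PRIORITY = ["CO", "BC", "BU", "UC", "PT"]  # tie-break order given in the text


def is_multiword(english):
    # Looked at from the English side: more than one token.
    return len(english.split()) > 1


def train_split(entries):
    """Learn one majority NCP per (is_multiword, class combination) pair."""
    by_key = defaultdict(Counter)
    for english, classes, ncp in entries:
        by_key[(is_multiword(english), frozenset(classes))][ncp] += 1
    return {k: max(PRIORITY, key=lambda c: (counts[c], -PRIORITY.index(c)))
            for k, counts in by_key.items()}


def predict(model, english, classes, default="CO"):
    """Look up the NCP learned for this entry type, else fall back to `default`."""
    return model.get((is_multiword(english), frozenset(classes)), default)


# The four 910:tableware members, keyed by their English translations.
entries = [
    ("tableware", ["910:tableware"], "UC"),
    ("crockery", ["910:tableware"], "UC"),
    ("dinner set", ["910:tableware"], "CO"),
    ("Western-style tableware", ["910:tableware"], "UC"),
]
model = train_split(entries)
```

Here the single-word members of the class are learned as uncountable, while the two multi-word members tie and the tie-break order selects fully countable.</Paragraph>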
    <Paragraph position="7"> Finally, there were also substantial numbers of genuine errors, such as sofuto karā, which has two translations, soft colour and soft collar. Their semantic classes should have been hue and clothing respectively, but the semantic labels were reversed. In this case the countability preferences were correct, but the semantic classes incorrect.</Paragraph>
    <Paragraph position="8"> An initial analysis of the erroneous predictions suggested that the upper bound with all genuine errors in the lexicon removed would be closer to 85% than 78%. We speculate that this would be true for languages other than English because the ontology is not specifically tuned to English; it was developed for Japanese analysis.</Paragraph>
    <Paragraph position="9"> Unfortunately we do not have a large lexicon of French, German or some other countable language marked with the same ontology to test on.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Further Work
</SectionTitle>
      <Paragraph position="0"> First, we would like to look more at multi-word expressions. There is a general trend for the head of a multi-word expression to determine the overall countability, which we did not exploit. Modifiers can also be informative, particularly for quantified expressions such as zasshoku "various colors", whose English part must be countable as it is explicitly denumerated. Second, we would like to investigate further the relation between under-specified semantics and countability. Words such as usagi "rabbit" are marked with the semantic classes for animal and meat, and the single NCP strongly countable. It may be better to explicitly identify countability with the animal sense, and uncountability with the meat sense. In this way, we could learn NCPs for each semantic class individually (ignoring plural only) and look at ways of combining them, or of dynamically assigning countability during sense disambiguation. Learning NCPs for each class individually could also help to predict NCPs for entries with idiosyncratic combinations, for which training data may not be found.</Paragraph>
      <Paragraph position="1"> Finally, from a psycho-linguistic point of view, it would be interesting to test whether unpredictable countabilities (that is, those words whose countability is not motivated by their semantic class) are in fact harder for non-native speakers to use, and more likely to be translated incorrectly by humans.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Applications
</SectionTitle>
      <Paragraph position="0"> In general, many errors in countability that had been overlooked by the lexicographers in the original compilation of the lexicon and its subsequent revisions became obvious when looking at the words grouped by semantic class and noun countability preference. Most entries were made by Japanese native speakers, who do not make countability distinctions. They were checked by a native speaker of English, who in turn did not always understand the Japanese source word, and thus was unable to identify the correct sense.</Paragraph>
      <Paragraph position="1"> Adding a checker to the dictionary tools, which warns if the semantic class does not predict the assigned countability, would help to avoid such errors. Such a tool could also be used for fine-tuning the position of words in the hierarchy, and spotting flat-out errors.</Paragraph>
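      <Paragraph> Such a checker might look roughly like the sketch below. It is only a hypothetical illustration: `check_entry` and the shape of `model` (a map from class combinations to their majority NCP code) are assumptions, not part of any existing dictionary tool.

```python
def check_entry(word, classes, ncp, model):
    """Return a warning if the class-based prediction disagrees, else None."""
    predicted = model.get(frozenset(classes))
    if predicted is None or predicted == ncp:
        # Nothing to compare against, or the assignment matches the prediction.
        return None
    return (f"{word}: assigned {ncp}, but classes {sorted(classes)} "
            f"predict {predicted}")


# Majority NCP for the tableware class, as in the running example.
model = {frozenset(["910:tableware"]): "UC"}
print(check_entry("dinner set", ["910:tableware"], "CO", model))
# dinner set: assigned CO, but classes ['910:tableware'] predict UC
```

A warning is a prompt for a lexicographer to look again, not proof of an error: dinner set is correctly fully countable even though its class is mostly uncountable.</Paragraph>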
      <Paragraph position="2"> Another application of these results is in automatically predicting the countability of unknown words. It is possible to automatically predict semantic classes up to 80% of the time (Ikehara et al., 1995). These semantic classes could then be used to predict the countability at a level substantially above the baseline.</Paragraph>
    </Section>
  </Section>
</Paper>