<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2077">
  <Title>Reinforcing English Countability Prediction with One Countability per Discourse Property</Title>
  <Section position="4" start_page="595" end_page="596" type="metho">
    <SectionTitle>
2 One Countability per Discourse
</SectionTitle>
    <Paragraph position="0"> One countability per discourse is an extension of one sense per discourse proposed by Gale et al. (1992). One sense per discourse claims that when a polysemous word appears more than once in a discourse, its occurrences are likely to share the same sense. Yarowsky (1995) tested the claim on about 37,000 examples and found that when a polysemous word appeared more than once in a discourse, its occurrences took on the majority sense for the discourse 99.8% of the time on average.</Paragraph>
    <Paragraph position="1"> Based on one sense per discourse, we hypothesize that when a noun appears more than once in a discourse, its occurrences will all share the same countability in the discourse, that is, one countability per discourse. The motivation for this hypothesis is that if one sense per discourse is satisfied, so is one countability per discourse, because countability is often determined by word sense. For example, if the noun paper appears in a discourse and it has the sense of newspaper, which is countable, the other instances of paper in the discourse also have the same sense according to one sense per discourse, and thus they are also countable.</Paragraph>
    <Paragraph position="2"> We tested this hypothesis on a set of nouns1 as Yarowsky (1995) did. We calculated how accurately the majority countability for each discourse predicted the countability of the nouns in the discourse when they appeared more than once. If the one countability per discourse property is always satisfied, the majority countability for each discourse should predict countability with an accuracy of 100%. In other words, the obtained accuracy represents how often the one countability per discourse property is satisfied.</Paragraph>
    <Paragraph position="3"> Table 1 shows the results. MCD in Table 1 stands for Majority Countability for Discourse, and its corresponding column denotes the accuracy obtained when the countability of individual nouns was predicted by the majority countability for the discourse in which they appeared. Baseline denotes the accuracy obtained when it was predicted by the majority countability for the whole corpus used in this test.</Paragraph>
    <Paragraph position="4"> 1 The conditions of this test are shown in Section 5. Note that although the source of the data is the same as in Section 5, discourses in which the target noun appears only once are excluded from this test, unlike in Section 5.</Paragraph>
    <Paragraph position="5"> Table 1 reveals that the one countability per discourse property is a good source of evidence for predicting countability compared to the baseline, although it is not as strong as the one sense per discourse property. It also reveals that the tendency toward one countability per discourse varies from noun to noun. For instance, nouns such as aid and cover show a strong tendency while others such as advantage and improvement do not. On average, MCD achieves an improvement of approximately 10% in accuracy over the baseline.</Paragraph>
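    <Paragraph> As an illustration only, the test described above can be sketched in Python as follows. The function names and the toy label lists are hypothetical and not taken from the experiments. How ties were scored in Table 1 is not stated, so this sketch assumes that a tied discourse yields no majority and its instances count as incorrect; it also assumes the baseline is computed over the same instances as MCD.
```python
from collections import Counter

def majority(labels):
    """Return the most frequent label, or None on a tie or empty input."""
    counts = Counter(labels).most_common(2)
    if not counts:
        return None
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def mcd_accuracy(discourses):
    """Accuracy of predicting each instance by its discourse's majority label.

    Discourses with a single instance are skipped, as in the test above.
    """
    correct = total = 0
    for labels in discourses:
        if len(labels) > 1:
            guess = majority(labels)
            correct += sum(1 for lab in labels if lab == guess)
            total += len(labels)
    return correct / total if total else 0.0

def baseline_accuracy(discourses):
    """Accuracy of predicting every instance by the corpus-wide majority label."""
    kept = [lab for labels in discourses if len(labels) > 1 for lab in labels]
    guess = majority(kept)
    return sum(1 for lab in kept if lab == guess) / len(kept) if kept else 0.0

# Hypothetical toy data: one list of gold labels per discourse
# ("c" = countable, "u" = uncountable).
toy = [["c", "c", "u"], ["u", "u"], ["c"]]
mcd = mcd_accuracy(toy)
base = baseline_accuracy(toy)
```
On the toy data, the single-instance discourse is ignored by both measures, so the two accuracies are computed over the same five instances.</Paragraph>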
    <Paragraph position="6"> Having observed the results, it is reasonable to exploit the one countability per discourse property for predicting countability. To do so, however, the following two questions should be addressed. First, how can the majority countability be obtained from a novel discourse? Since our intention is to predict the countability of instances in a novel discourse, none of their values are known. Second, even if the majority countability is known, how can it be efficiently exploited for predicting countability? Although we could simply predict the countability of individual instances of a target noun in a discourse by the majority countability for the discourse, it is highly possible that this simple method will cause side effects, considering the results in Table 1. These two questions are addressed in the next section.</Paragraph>
  </Section>
  <Section position="5" start_page="596" end_page="599" type="metho">
    <SectionTitle>
3 Basic Idea
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="596" end_page="596" type="sub_section">
      <SectionTitle>
3.1 How Can the Majority Countability be Obtained from a Novel Discourse?
</SectionTitle>
      <Paragraph position="0"> Although we do not know the true value of the majority countability for a novel discourse, we can at least estimate it, because we already have the countability prediction method that the proposed method is meant to reinforce. That is, we can predict the countability of the target noun in a novel discourse using that method. Simply counting the results gives an estimate of the majority countability for the discourse.</Paragraph>
      <Paragraph position="1"> Here, we should note that the countability of each instance is not the true value but a predicted one.</Paragraph>
      <Paragraph position="2"> Considering this fact, it is sensible to set a certain criterion in order to filter out spurious predictions. Fortunately, most methods based on machine learning algorithms give predictions with their confidences. We use the confidences as the criterion. Namely, we only take account of predictions whose confidences are greater than a certain threshold when we estimate the majority countability for a novel discourse.</Paragraph>
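      <Paragraph> A minimal Python sketch of this estimation step is given below for illustration. The function name, the "c"/"u" label encoding, and the threshold of 0.9 used in the call are assumptions made for the sketch, not values from the paper. Returning unknown on a tie follows the procedure given later in Subsection 4.3.
```python
def estimate_majority(predictions, threshold):
    """Estimate a discourse's majority countability from confident predictions.

    predictions: list of (label, confidence) pairs, label is "c" or "u".
    Returns "c", "u", or "unknown" (tie, or nothing above the threshold).
    """
    countable = sum(1 for label, conf in predictions
                    if label == "c" and conf > threshold)
    uncountable = sum(1 for label, conf in predictions
                      if label == "u" and conf > threshold)
    if countable > uncountable:
        return "c"
    if uncountable > countable:
        return "u"
    return "unknown"

# Hypothetical predictions for three instances of one noun in one discourse.
predicted = [("c", 0.97), ("c", 0.98), ("u", 0.57)]
estimate = estimate_majority(predicted, threshold=0.9)  # 0.9 is an assumed value
```
With the assumed threshold, the low-confidence uncountable prediction is filtered out and only the two countable predictions are counted.</Paragraph>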
    </Section>
    <Section position="2" start_page="596" end_page="597" type="sub_section">
      <SectionTitle>
3.2 How Can the Majority Countability be Efficiently Exploited?
</SectionTitle>
      <Paragraph position="0"> In order to efficiently exploit the one countability per discourse property, we treat the majority countability for each discourse as a feature in addition to other features extracted from instances of the target noun. In doing so, we let a machine learning algorithm decide which features are relevant to the prediction. If the majority countability feature is relevant, the machine learning algorithm should give it a high weight compared to the others.</Paragraph>
      <Paragraph position="1"> To see this, let us suppose that we have a set of discourses in which instances of the target noun are tagged with their countability (either countable or uncountable2) for the moment; we will describe how to obtain it in Subsection 4.1. For each discourse, we can know its majority countability by counting the numbers of countables and uncountables. We can also generate a model for predicting countability from the set of discourses using a machine learning algorithm. All we have to do is to extract a set of training data from the tagged instances and to apply a machine learning algorithm to it. This is where the majority countability feature comes in. The majority countability for each instance is added to its corresponding training data as a feature to create a new set of training data before applying a machine learning algorithm; then a machine learning algorithm is applied to the new set. The resulting model takes the majority countability feature into account as well as the other features when making predictions.</Paragraph>
      <Paragraph position="2"> It is important to exercise some care in counting the majority countability for each discourse.</Paragraph>
      <Paragraph position="3"> Note that one countability per discourse is always satisfied in discourses where the target noun appears only once. This suggests that the resulting model is likely to favor the majority countability feature too strongly. To avoid this, we could split the discourses into two sets, one where the target noun appears only once and one where it appears more than once, and train a model on each set. However, we do not take this strategy because we want to use as much data as possible for training. As a compromise, we set the majority countability to the value unknown for discourses where the target noun appears only once.</Paragraph>
      <Paragraph position="4"> 2 This paper concentrates solely on countable and uncountable nouns, since they account for the vast majority of nouns (Lapata and Keller, 2005).</Paragraph>
    </Section>
    <Section position="3" start_page="597" end_page="598" type="sub_section">
      <SectionTitle>
4.1 Generating Training Data
</SectionTitle>
      <Paragraph position="0"> As discussed in Subsection 3.2, training data are needed to exploit the one countability per discourse property. In other words, the proposed method requires a set of discourses in which instances of the target noun are tagged with their countability. Fortunately, Nagata et al. (2005b) have proposed a method for tagging nouns with their countability. This paper follows it to generate training data.</Paragraph>
      <Paragraph position="1"> To generate training data, first, instances of the target noun used as a head noun are collected from a corpus with their surrounding words. This can be done simply with an existing chunker or parser.</Paragraph>
      <Paragraph position="2"> Second, the collected instances are tagged with either countable or uncountable by tagging rules.</Paragraph>
      <Paragraph position="3"> For example, the underlined paper: ... read a paper in the morning ...</Paragraph>
      <Paragraph position="4"> is tagged as ... read a paper/countable in the morning ...</Paragraph>
      <Paragraph position="5"> because it is modified by the indefinite article.</Paragraph>
      <Paragraph position="6"> Figure 1 and Table 2 represent the tagging rules based on Nagata et al. (2005b)'s method. Figure 1 shows the framework of the tagging rules. Each node in Figure 1 represents a question applied to the instance in question. For instance, the root node reads "Is the instance in question plural?". Each leaf represents a result of the classification. For instance, if the answer is yes at the root node, the instance in question is tagged with countable. Otherwise, the question at the lower node is applied, and so on. The tagging rules do not classify instances in some cases. These unclassified instances are tagged with the symbol "?". Unfortunately, they cannot readily be included in training data. For simplicity of implementation, they are excluded from training data (we will discuss the use of these excluded data in Section 6). Note that the tagging rules cannot be used for countability prediction aimed at detecting errors in article usage and singular/plural usage. The reason is that in error detection it is unknown whether the determiners and the singular/plural distinction are correct, whereas the tagging rules assume that the target text contains no errors.</Paragraph>
      <Paragraph position="7"> Third, features are extracted from each instance.</Paragraph>
      <Paragraph position="8"> As the features, the following three types of contextual cues are used: (i) words in the noun phrase that the instance heads, (ii) three words to the left of the noun phrase, and (iii) three words to its right. Here, the words in Table 2 are excluded.</Paragraph>
      <Paragraph position="9"> Also, function words (except prepositions) such as pronouns, cardinal and quasi-cardinal numerals, and the target noun are excluded. All words are reduced to their morphological stem and converted entirely to lower case when collected. In addition to these features, the majority countability is used as a feature. For each discourse, the numbers of countables and uncountables are counted to obtain its majority countability. In case of ties, it is set to unknown. Also, it is set to unknown when only one instance appears in the discourse, as explained in Subsection 3.2.</Paragraph>
      <Paragraph position="10"> To illustrate feature extraction, let us consider the following discourse (target noun: paper): ... writing a new paper/countable in his room ...</Paragraph>
      <Paragraph position="11"> ... read papers/countable with ...</Paragraph>
      <Paragraph position="12"> The discourse would give two sets of features, one per instance: (-3=write, NP=new, +3=in, +3=room, MC=c) and (-3=read, +3=with, MC=c), where MC=c denotes that the majority countability for the discourse is countable. In this example (and in the following examples), the features are represented in a somewhat simplified manner for the purpose of illustration. In practice, features are represented as a vector.</Paragraph>
      <Paragraph position="13"> Finally, the features are stored in a file with their corresponding countability as training data. Each piece of training data would be as follows: -3=read, +3=with, MC=c, LABEL=c where LABEL=c denotes that the countability for the instance is countable.</Paragraph>
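      <Paragraph> The following Python sketch illustrates how the majority countability feature and the label could be attached to already-extracted contextual features. It is only an illustration under stated assumptions: the function names and the list-of-strings record format are hypothetical, and chunking, stemming, and function-word filtering are assumed to have been done beforehand.
```python
from collections import Counter

def majority_feature(labels):
    """Majority countability of a discourse: "c", "u", or "unknown"."""
    if len(labels) == 1:
        # A single instance is set to unknown (Subsection 3.2).
        return "unknown"
    counts = Counter(labels)
    if counts["c"] > counts["u"]:
        return "c"
    if counts["u"] > counts["c"]:
        return "u"
    return "unknown"  # tie

def training_records(discourse):
    """Attach MC and LABEL to each instance of one target noun in one discourse.

    discourse: list of (features, label) pairs, label is "c" or "u".
    """
    mc = majority_feature([label for _, label in discourse])
    return [features + ["MC=" + mc, "LABEL=" + label]
            for features, label in discourse]

# The two-instance example discourse from the text (target noun: paper).
discourse = [
    (["-3=write", "NP=new", "+3=in", "+3=room"], "c"),
    (["-3=read", "+3=with"], "c"),
]
records = training_records(discourse)
```
The second record reproduces the piece of training data shown above for the instance read papers.</Paragraph>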
    </Section>
    <Section position="4" start_page="598" end_page="598" type="sub_section">
      <SectionTitle>
4.2 Model Generation
</SectionTitle>
      <Paragraph position="0"> The model used in the proposed method can be regarded as a function. It takes as its input a feature vector extracted from the instance in question and predicts countability (either countable or uncountable). Formally, $f : x \rightarrow y$, where $f$, $x$, and $y$ denote the model, the feature vector, and the predicted countability with $y \in \{0, 1\}$, respectively; here, 0 and 1 correspond to countable and uncountable, respectively.</Paragraph>
      <Paragraph position="1"> Given the specification, almost any kind of machine learning algorithm can be used to generate the model used in the proposed method. In this paper, the Maximum Entropy (ME) algorithm is used, which has been shown to be effective in a wide variety of natural language processing tasks.</Paragraph>
      <Paragraph position="2"> Model generation is done by applying the ME algorithm to the training data. The resulting model takes account of the features including the majority countability feature and is used for reinforcing countability prediction.</Paragraph>
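      <Paragraph> For illustration, a model of this kind can be sketched in plain Python as a two-class logistic regression, which is the binary form of a maximum entropy classifier. This is a stand-in under stated assumptions and not the ME implementation used in the experiments: the function names, the learning rate, the number of epochs, the absence of regularization, and the four toy training records are all hypothetical.
```python
import math

def probability(weights, features):
    """P(uncountable | features) under the given weights."""
    score = sum(weights.get(name, 0.0) for name in features)
    return 1.0 / (1.0 + math.exp(-score))

def train_logistic(records, epochs=200, rate=0.5):
    """Fit feature weights by batch gradient ascent on the log-likelihood.

    records: list of (features, label); features is a list of strings and
    label is 0 (countable) or 1 (uncountable).
    """
    weights = {}
    for _ in range(epochs):
        gradient = {}
        for features, label in records:
            error = label - probability(weights, features)
            for name in features:
                gradient[name] = gradient.get(name, 0.0) + error
        for name, value in gradient.items():
            weights[name] = weights.get(name, 0.0) + rate * value / len(records)
    return weights

def predict(weights, features):
    """Return (label, confidence), with 0 = countable and 1 = uncountable."""
    p = probability(weights, features)
    return (1, p) if p >= 0.5 else (0, 1.0 - p)

# Hypothetical toy training data that includes the majority countability feature.
records = [
    (["-3=read", "MC=c"], 0),
    (["-3=write", "MC=c"], 0),
    (["-3=make", "MC=u"], 1),
    (["-3=use", "MC=u"], 1),
]
weights = train_logistic(records)
label, confidence = predict(weights, ["-3=submit", "MC=c"])
```
Because the toy labels agree with the MC feature in every record, the learned weight for MC=c pushes an unseen context such as -3=submit toward countable.</Paragraph>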
    </Section>
    <Section position="5" start_page="598" end_page="599" type="sub_section">
      <SectionTitle>
4.3 Reinforcing Countability Prediction
</SectionTitle>
      <Paragraph position="0"> Before explaining the reinforcement procedure, let us introduce the following discourse for illustration (target noun: paper): ... writing paper in room ... wrote paper in ...</Paragraph>
      <Paragraph position="1"> ... submitted paper to ...</Paragraph>
      <Paragraph position="2"> Note that articles and the singular/plural distinction are deliberately removed from the discourse. This kind of situation can happen in machine translation from a source language that does not have articles and the singular/plural distinction3. The situation is similar in the writing of second language learners of English since they often omit articles and the singular/plural distinction or use improper ones. Here, suppose that the true values of the countability for all instances are countable.</Paragraph>
      <Paragraph position="3"> A method to be reinforced by the proposed method would predict countability as follows: ... writing paper/countable (0.97) in room ...</Paragraph>
      <Paragraph position="4"> ... wrote paper/countable (0.98) in ...</Paragraph>
      <Paragraph position="5"> ... submitted paper/uncountable (0.57) to ...</Paragraph>
      <Paragraph position="6"> where the numbers in brackets denote the confidences given by the method. The third instance is mistakenly predicted as uncountable4.</Paragraph>
      <Paragraph position="7"> Now let us move on to the reinforcement procedure. It is divided into three steps. First, the majority countability for the discourse in question is estimated by counting the numbers of the predicted countables and uncountables whose confidences are greater than a certain threshold. In case of ties, the value of the majority countability is set to unknown. In the above example, the majority countability for the discourse is estimated to be countable when the threshold is set to [value illegible in source] (two countables). Second, the features explained in Subsection 4.1 are extracted from each instance. As for the majority countability feature, the estimated one is used. Returning to the above example, the three instances would give three sets of features: (-3=write, +3=in, +3=room, MC=c); (-3=write, +3=in, MC=c); and (-3=submit, +3=to, MC=c).</Paragraph>
      <Paragraph position="8"> Finally, the model generated in Subsection 4.2 is applied to the features to predict countability.</Paragraph>
      <Paragraph position="9"> is likely that previous mispredictions are overridden by correct ones. In the above example, the third one would be correctly overridden by countable because of the majority countability feature (MC=c) that is informative for the instance being countable.</Paragraph>
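      <Paragraph> The second and third steps can be sketched in Python as follows, taking the majority countability estimated in the first step as an input. The sketch is illustrative only: the function names are hypothetical, and the hand-set weights merely stand in for a trained model so that the example from the text can be replayed.
```python
def reinforce(instances, majority, model):
    """Re-predict each instance with the estimated majority feature added.

    instances: list of feature lists (Subsection 4.1 features, no MC feature).
    majority: "c", "u", or "unknown", as estimated in the first step.
    model: callable mapping a feature list to "c" or "u".
    """
    return [model(features + ["MC=" + majority]) for features in instances]

# Hypothetical hand-set weights standing in for a trained model;
# positive scores favour countable.
WEIGHTS = {"-3=submit": -0.4, "-3=write": 0.8, "MC=c": 1.5, "MC=u": -1.5}

def toy_model(features):
    """Predict "c" when the summed weights are positive, otherwise "u"."""
    score = sum(WEIGHTS.get(name, 0.0) for name in features)
    return "c" if score > 0 else "u"

# The three instances of paper from the example discourse.
instances = [
    ["-3=write", "+3=in", "+3=room"],
    ["-3=write", "+3=in"],
    ["-3=submit", "+3=to"],
]
before = [toy_model(features) for features in instances]
after = reinforce(instances, "c", toy_model)
```
Under these assumed weights, the third instance is predicted as uncountable before reinforcement and as countable once MC=c is added, mirroring the example above.</Paragraph>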
    </Section>
  </Section>
  <Section position="6" start_page="599" end_page="601" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="599" end_page="599" type="sub_section">
      <SectionTitle>
5.1 Experimental Conditions
</SectionTitle>
      <Paragraph position="0"> In the experiments, we chose Nagata et al. (2005a)'s method as the one to be reinforced by the proposed method. In this method, the decision list (DL) learning algorithm (Yarowsky, 1995) is used. However, we used the ME algorithm because we found that the method with the ME algorithm instead of the DL learning algorithm performed better when trained on the same training data.</Paragraph>
      <Paragraph position="1"> As the target noun, we selected 23 nouns that were also used in Nagata et al. (2005a)'s experiments. They are exemplified as nouns that are used as both countable and uncountable by Huddleston and Pullum (2002).</Paragraph>
      <Paragraph position="2"> Training data were generated from the written part of the British National Corpus (Burnard, 1995). A text tagged with the text tags was used as a discourse unit. From the corpus, 314 texts, which amounted to about 10% of all texts, were randomly taken to obtain test data. The rest of the texts were used to generate training data.</Paragraph>
      <Paragraph position="3"> We evaluated the performance of prediction by accuracy. We defined accuracy as the ratio of the number of correct predictions to the number of instances of the target noun in the test data.</Paragraph>
    </Section>
    <Section position="2" start_page="599" end_page="599" type="sub_section">
      <SectionTitle>
5.2 Experimental Procedures
</SectionTitle>
      <Paragraph position="0"> First, we generated training data for each target noun from the texts using the tagging rules explained in Subsection 4.1. We used the OAK system5 to extract noun phrases and their heads. Of the extracted instances, we excluded those that had no contextual cues from the training data (and also the test data). We also generated another set of training data by removing the majority countability features from them. This set of training data was used for comparison.</Paragraph>
      <Paragraph position="1"> Second, we obtained test data by applying the tagging rules described in Subsection 4.1 to each instance of the target noun in the 314 texts. Nagata et al. (2005b) showed that the tagging rules achieved an accuracy of 0.997 on texts that contained no errors. Considering these results, we used the tagging rules to obtain test data. Instances tagged with "?" were excluded in the experiments.</Paragraph>
      <Paragraph position="2"> Third, we applied the ME algorithm6 to the training data without the majority countability feature. Using the resulting model, countability of the target nouns in the test data was predicted.</Paragraph>
      <Paragraph position="3"> Then, the predictions were reinforced by the proposed method. The threshold to filter out spurious predictions was set to [value illegible in source]. For comparison, the predictions obtained by the ME model were simply replaced with the estimated majority countability for each discourse. In this method, the original predictions were used when the estimated majority countability was unknown. Also, Nagata et al. (2005a)'s method that was based on the DL learning algorithm was implemented for comparison. Finally, we calculated the accuracy of each method.</Paragraph>
      <Paragraph position="4"> In addition to the results, we evaluated the baseline on the same test data where all predictions were done by the majority countability for the whole corpus (training data).</Paragraph>
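      <Paragraph> The replacement method used for comparison and the accuracy measure of Subsection 5.1 can be sketched in Python as below. The function names and the toy labels are hypothetical and serve only to show the mechanics; they are not experimental data.
```python
def replace_with_majority(predictions, majority):
    """Overwrite every prediction with the discourse majority, if one is known.

    The original predictions are kept when the estimate is "unknown".
    """
    if majority == "unknown":
        return list(predictions)
    return [majority for _ in predictions]

def accuracy(predicted, gold):
    """Ratio of correct predictions to the number of instances."""
    return sum(1 for p, g in zip(predicted, gold) if p == g) / len(gold)

# Hypothetical discourse: two countable instances and one uncountable instance.
gold = ["c", "c", "u"]
original = ["c", "c", "u"]
replaced = replace_with_majority(original, "c")
acc_original = accuracy(original, gold)
acc_replaced = accuracy(replaced, gold)
```
In this toy discourse the original predictions are all correct, and wholesale replacement turns the correctly predicted uncountable instance into an error.</Paragraph>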
    </Section>
    <Section position="3" start_page="599" end_page="601" type="sub_section">
      <SectionTitle>
5.3 Experimental Results and Discussion
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the accuracies7. ME and Proposed in Table 3 refer to accuracies of the ME model and the ME model reinforced by the proposed method, respectively. ME+MCD refers to accuracy obtained by replacing predictions of the ME model with the estimated majority countability for each discourse. Also, DL refers to accuracy of the DL-based method.</Paragraph>
      <Paragraph position="1"> Table 3 shows that the three ME-based methods (Proposed, ME, and ME+MCD) perform better than DL and the baseline. In particular, Proposed outperforms the other methods for most of the target nouns.</Paragraph>
      <Paragraph position="2"> Figure 2 summarizes the comparison between the three ME-based methods. Each plot in Figure 2 represents a target noun. The horizontal and vertical axes correspond to the accuracy of ME and that of Proposed (or ME+MCD), respectively. The diagonal line corresponds to the line $y = x$. So if Proposed (or ME+MCD) achieved no improvement at all over ME, all the plots would be on the line. Plots above the line mean improvement over ME, and the distance from the line expresses the amount of improvement. Plots below the line mean the opposite.</Paragraph>
      <Paragraph position="3"> Figure 2 clearly shows that most of the plots corresponding to the comparison between ME and Proposed are above the line. This means that the proposed method successfully reinforced ME for most of the target nouns. Indeed, the average accuracy of Proposed is significantly superior to that of ME at the 99% confidence level (paired t-test). This improvement is close to that of one sense per discourse (Yarowsky, 1995) (improvement ranging from 1.3% to 1.7%), which seems to be a sensible upper bound of the proposed method. By contrast, about half of the plots corresponding to the comparison between ME and ME+MCD are below the line.</Paragraph>
      <Paragraph position="4"> From these results, it follows that the one countability per discourse property is a good source of evidence for predicting countability, but it is crucial to devise a way of exploiting the property as we did in this paper. Namely, simply replacing original predictions with the majority countability for the discourse causes side effects, as already suggested by Table 1. This is also exemplified as follows. Suppose that several instances of the target noun advantage appear in a discourse and that its majority countability is countable. Further suppose that the idiomatic phrase take advantage of, in which advantage is uncountable, happens to appear in it. On the one hand, simply replacing all the predictions with the majority countability (countable) would lead to a misprediction for the idiomatic phrase even if the original prediction is correct. On the other hand, the proposed method would correctly predict the countability because the contextual cues strongly indicate that it is uncountable.</Paragraph>
      <Paragraph position="5"> 7 [...] because discourses where the target noun appears only once are not taken into account in Table 1.</Paragraph>
    </Section>
  </Section>
</Paper>