<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1191"> <Title>Inferring parts of speech for lexical mappings via the Cyc KB</Title> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Inference of default part of speech </SectionTitle> <Paragraph position="0"> Our method of inferring the part of speech for lexicalizations is to apply machine learning techniques over the lexical mappings from English words or phrases to Cyc terms. For each target denotatum term, the corresponding types and generalizations are extracted from the ontology. This includes terms of which the denotatum term is an instance or a specialization, either explicitly asserted or inferable via transitivity. For simplicity, these are referred to as ancestor terms. The association between the lexicalization parts of speech and the common ancestor terms forms the basis for the main criteria used in the lexicalization speech part classifier and for the special-case mass-count classifier. In addition, this is augmented with features indicating whether known suffixes occur in the headword, as well as with corpus statistics.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 Cyc ancestor term features </SectionTitle> <Paragraph position="0"> There are several possibilities for mapping the Cyc ancestor terms into a feature vector for use in machine learning algorithms. The most direct method is to have a binary feature for each possible ancestor term, but this would require about ten thousand features. To prune the list of potential features, frequency considerations can be applied, such as taking the most frequent terms that occur in type definition assertions. Alternatively, the training data can be analyzed to see which reference terms are most correlated with the classifications. For simplicity, the frequency approach is used here.
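The frequency-based pruning of ancestor terms into a fixed set of binary reference features might be sketched as follows. This is a minimal illustration, not the paper's implementation: the input layout, the function names, and the tiny k used in the example are assumptions.

```python
from collections import Counter

def select_reference_terms(type_assertions, k=1024):
    """Pick the k most frequent ancestor terms as reference terms,
    excluding bookkeeping terms, with half the budget taken from isa
    assertions and half from genls assertions (as in the paper).
    `type_assertions` maps 'isa'/'genls' to lists of ancestor-term
    names, one entry per assertion (hypothetical data layout)."""
    bookkeeping = {"PublicConstant"}  # marks terms for KB releases only
    half = k // 2
    selected = []
    for relation in ("isa", "genls"):
        counts = Counter(t for t in type_assertions[relation]
                         if t not in bookkeeping)
        selected.extend(term for term, _ in counts.most_common(half))
    return selected

def ancestor_features(ancestors, reference_terms):
    """Binary feature vector over the reference terms: 1 if the
    reference term is an ancestor of the denotatum term, else 0."""
    ancestor_set = set(ancestors)
    return [1 if term in ancestor_set else 0 for term in reference_terms]
```

With k=1024 this yields the 512 most frequent type terms and the 512 most frequent generalization terms described below.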
The most-frequent 1024 atomic terms are selected, excluding terms used for bookkeeping purposes (e.g., PublicConstant, which marks terms for public releases of the KB); half of these terms are taken from the isa assertions, and the other half from the genls assertions. These are referred to as the reference terms. For instance, ObjectType is a type for 21,108 of the denotation terms (out of 44,449 cases), compared to 20,283 for StuffType.</Paragraph> <Paragraph position="1"> These occur at ranks 13 and 14, so they are both included. In contrast, SeparationEvent occurs only 185 times as a generalization term, at rank 522, so it is pruned. See (O'Hara et al., 2003) for more details on extracting the reference term features.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 Morphology and corpus-based features </SectionTitle> <Paragraph position="0"> In English, the suffix of a word can provide a good clue to its speech part. For example, agentive nouns commonly end in '-or' or '-er.' Features to account for this are derived by checking whether the headword ends in one of a predefined set of suffixes and adding the suffix as the value of an enumerated feature variable corresponding to suffixes of the given length. Currently, the suffixes used are the most common two- to four-letter sequences found in the headwords.</Paragraph> <Paragraph position="1"> [Figure caption: forms derived from the headword; 'plural' and 'singular' are derived via morphology, and 'head' uses the headword as is.]</Paragraph> <Paragraph position="2"> Often the choice of speech parts for lexicalizations reflects idiosyncratic usage rather than just the underlying semantics. To account for this, a set of features is included that is based on the relative frequency with which the denotational headword occurs in contexts indicative of each of the main speech parts: singular, plural, count, mass, verbal, adjectival, and adverbial. See Figure 3.
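The two kinds of surface features just described, suffix endings and relative corpus frequencies, might be derived along these lines. This is a sketch under stated assumptions: the suffix inventory shown is a tiny stand-in, and the function names are illustrative, not taken from the paper.

```python
def suffix_features(headword, lengths=(2, 3, 4), known_suffixes=None):
    """One enumerated feature per suffix length: the value is the
    headword's ending of that length when it is among the known common
    suffixes, else None. The paper's actual inventory is the most
    common 2-4 letter endings observed in the headwords."""
    if known_suffixes is None:
        known_suffixes = {"er", "or", "ing", "ion", "tion"}  # stand-in set
    features = {}
    for n in lengths:
        ending = headword[-n:] if len(headword) >= n else None
        features[f"suffix_{n}"] = ending if ending in known_suffixes else None
    return features

def normalize_counts(counts):
    """Turn raw corpus frequency counts for the speech-part indicator
    patterns into relative frequencies usable as numeric features."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]
```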
These patterns were determined by analyzing part-of-speech-tagged text and seeing which function words co-occur predominantly in the immediate context of words of the given grammatical category. Note that high-frequency function words such as 'to' were not considered, because they are usually not indexed for information retrieval.</Paragraph> <Paragraph position="3"> These features are derived as follows. Given a lexical assertion (e.g., (denotation Hound-TheWord CountNoun 0 Dog)), the headword is extracted, and then the plural or singular variant wordform is derived for use in the pattern templates. Corpus checks are done for each, producing a vector of frequency counts (e.g., (29, 17, 0, 0, 0, 0, 0)). These counts are then normalized and used as numeric features for the machine learning algorithm. Table 3 shows the results for the hound example along with a few other cases.</Paragraph> </Section> <Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.3 Sample criteria </SectionTitle> <Paragraph position="0"> We use decision trees for this classification. Part of the motivation is that the result is readily interpretable and can be incorporated directly by knowledge-based applications. Decision trees are induced in a process that recursively splits the training examples on the feature that partitions the current set of examples so as to maximize the information gain (Witten and Frank, 1999).</Paragraph> <Paragraph position="1"> This is commonly done by selecting the feature that minimizes the entropy of the resulting distribution (i.e., yields the least uniform distribution). A fragment of the decision tree is shown to give an idea of the criteria being considered in the speech part classification. See Figure 4.
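The entropy-minimizing split criterion can be stated concretely. The following is a generic information-gain sketch, not Weka's code; the example dictionaries and names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels: 0 for a
    pure set, higher for more uniform label distributions."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, feature):
    """Reduction in label entropy obtained by partitioning `examples`
    (dicts with a 'label' key) on the values of `feature`; decision
    tree induction selects the feature maximizing this quantity."""
    base = entropy([e["label"] for e in examples])
    partitions = {}
    for e in examples:
        partitions.setdefault(e.get(feature), []).append(e["label"])
    remainder = sum(len(part) / len(examples) * entropy(part)
                    for part in partitions.values())
    return base - remainder
```

A feature that splits the examples into pure subsets reduces the entropy to zero, so its gain equals the entropy of the parent set.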
In this example, the semantic types mostly provide exceptions to associations inferred from the suffixes, with corpus clues used occasionally for differentiation.</Paragraph> </Section> </Section> <Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 4 Evaluation and results </SectionTitle> <Paragraph position="0"> To test the performance of the speech part classification, 10-fold cross-validation is applied to each configuration that was considered. Except as noted below, all the results are produced using Weka's J4.8 classifier (Witten and Frank, 1999), which is an implementation of Quinlan's C4.5 decision tree learner (Quinlan, 1993). Other classifiers were considered as well (e.g., Naive Bayes and nearest neighbor), but J4.8 generally gave the best overall results.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 Results for mass-count distinction </SectionTitle> <Paragraph position="0"> Table 4 shows the results for the special-case mass-count classification. The system achieves an accuracy of 93.0%, an improvement of 24.4 percentage points over the standard baseline of always selecting the most frequent case (i.e., count noun). Other baselines are included for comparison purposes. For example, using the headword as the sole feature (just-headwords) performs fairly well compared to the system based on Cyc; but this classifier would lack generalizability, relying simply upon table lookup. (In this case, the decision tree induction process ran into memory constraints, so a Naive Bayes classifier was used instead.) In addition, a system based only on the suffixes (just-suffixes) performs marginally better than always selecting the most common case.</Paragraph> <Paragraph position="1"> Thus, morphology alone would not be adequate for this task. The OpenCyc version of the classifier also performs well.
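The 10-fold evaluation protocol itself is straightforward; a minimal sketch follows. The fold scheme and the majority-class baseline learner are illustrative assumptions, and the paper's actual learner is Weka's J4.8, not this code.

```python
from collections import Counter

def cross_validate(examples, train_fn, predict_fn, k=10):
    """Plain k-fold cross-validation: train on k-1 folds, score
    accuracy on the held-out fold, and average over the k folds."""
    folds = [examples[i::k] for i in range(k)]
    accuracies = []
    for i, test_fold in enumerate(folds):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train_fn(train)
        correct = sum(predict_fn(model, e) == e["label"] for e in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / k

# The "always pick the most frequent case" baseline used in the paper:
def train_majority(train):
    return Counter(e["label"] for e in train).most_common(1)[0][0]

def predict_majority(model, example):
    return model
```

Any learner with the same train/predict interface (e.g., a decision tree induced as in Section 3.3) can be dropped into the same loop for comparison against the baseline.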
This illustrates that sufficient data is already available in OpenCyc to allow for good approximations for such classifications. Note that for the mass-count experiments and for the experiments discussed later, the combined system over full Cyc leads to statistically significant improvements compared to the other cases.</Paragraph> </Section> </Section> <Section position="6" start_page="1" end_page="1" type="metho"> <SectionTitle> 4.2 Results for general speech part classification </SectionTitle> <Paragraph position="0"> Running the same classifier setup over all speech parts produces the results shown in Table 5. The overall result is not as high, but there is a similar improvement over the baselines. Relying solely on suffixes or on corpus checks performs slightly better than the baseline. Using headwords performs well, but again that amounts to table lookup. In terms of absolute accuracy, it might seem that the system based on OpenCyc is doing nearly as well as the system based on full Cyc. This is somewhat misleading, since the distribution of parts of speech is simpler in OpenCyc, as shown by the lower entropy value (Jurafsky and Martin, 2000).</Paragraph> </Section> <Section position="7" start_page="1" end_page="1" type="metho"> <SectionTitle> 5 Related work </SectionTitle> <Paragraph position="0"> There has not been much work on the automatic determination of the preferred lexicalization part of speech, outside of work related to part-of-speech tagging (Brill, 1995), which concentrates on sequences of speech tags rather than the default tags. [Table captions: Table 5 covers speech part classification over the Cyc lexical mappings, using all speech parts in Cyc; see Table 4 for the legend. Legend: Instances is the size of the training data; Classes is the number of choices; Entropy characterizes distribution uniformity; Baseline uses the most frequent case; the just-X entries each incorporate a single feature type (headwords from the lexical mapping, suffixes of the headword, corpus co-occurrence of part-of-speech indicators, or Cyc reference terms); Combination uses all features except the headwords and, for Cyc, yields a statistically significant improvement over the others.]</Paragraph> <Paragraph position="1"> Brill uses an error-driven, transformation-based learning approach that learns lists of rules for transforming the initial tags assigned to the sentence. Unknown words are handled via rules that change the default assignment to another tag based on the suffixes of the unknown word. Pedersen and Chen (1995) discuss an approach to inferring the grammatical categories of unknown words using constraint solving over the properties of the known words. Toole (2000) applies decision trees to a similar problem, distinguishing common nouns, pronouns, and various types of names, using a framework analogous to that commonly applied in named-entity recognition.</Paragraph> <Paragraph position="2"> In work closer to ours, Woods (2000) describes an approach to this problem using manually constructed rules incorporating syntactic, morphological, and semantic tests (via an ontology). For example, patterns targeting specific stems are applied provided that the root meets certain semantic constraints. There has been clustering-based work in part-of-speech induction, but it tends to target idiosyncratic classes, such as capitalized words and words ending in '-ed' (Clark, 2003).</Paragraph> <Paragraph position="3"> The special case of classifying the mass-count distinction has received some attention. Bond and Vatikiotis-Bateson (2002) infer five types of countability distinctions using NTT's Japanese-to-English transfer dictionary, including the categories strongly countable, weakly countable, and plural only. The countability assigned to a particular semantic category is based on the most common case associated with the English words mapping into the category.
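That majority-vote heuristic for lifting word-level countability up to a semantic category can be sketched as follows. This is an illustration of the idea, not Bond and Vatikiotis-Bateson's code; the data layout and names are assumptions.

```python
from collections import Counter

def category_countability(word_countability, category_words):
    """Assign each semantic category the most common countability
    among the English words mapping into it (majority vote).
    `word_countability` maps word -> countability label;
    `category_words` maps category -> list of words (hypothetical)."""
    result = {}
    for category, words in category_words.items():
        votes = Counter(word_countability[w] for w in words
                        if w in word_countability)
        result[category] = votes.most_common(1)[0][0] if votes else None
    return result
```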
Our earlier work (O'Hara et al., 2003) likewise used just semantic features, but it accounted for inheritance of types, achieving 89.5% accuracy against a baseline of 68.2%. Schwartz (2002) uses the five NTT countability distinctions when tagging word occurrences in a corpus (i.e., word tokens), based primarily on clues provided by determiners.</Paragraph> <Paragraph position="4"> Results are given in terms of agreement rather than accuracy; compared to NTT's dictionary, there is about 90% agreement for the fully or strongly countable types and about 40% agreement for the weakly countable or uncountable types, with half of the tokens left untagged for countability. Baldwin and Bond (2003) apply sophisticated preprocessing to derive a variety of countability clues, such as the grammatical number of modifiers, co-occurrence of specific types of determiners and pronouns, and specific types of prepositions. They achieve 94.6% accuracy using four categories of countability, including two categories for types of plural-only nouns. Since multiple assignments are allowed, negative agreement is considered as well as positive. When restricted to just count versus mass nouns, the accuracy is 89.9% (personal communication). Note that, as with Schwartz, the task is different from ours and that of Bond and Vatikiotis-Bateson: we assign countability to word/concept pairs instead of just to words.</Paragraph> </Section> </Paper>