
<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0207">
  <Title>Measuring Semantic Entropy</Title>
  <Section position="5" start_page="42" end_page="44" type="metho">
    <SectionTitle>
3 Results
</SectionTitle>
    <Paragraph position="0"> To estimate the semantic entropy of English words, roughly thirteen million words were used from the record of proceedings of the Canadian parliament (&amp;quot;Hansards&amp;quot;), which is available in English and in French. Before induction of the translation lexicon, both halves of the bitext were tagged for part  tagger (Bri92). The POS information was not used in the lexicon induction process but, after estimating the semantic entropies for all the English words in the corpus, the words were grouped into rough part-of-speech categories.</Paragraph>
    <Paragraph position="1"> First, mean semantic entropy was compared across parts of speech. Table I lists the mean semantic entropies Ep for each part of speech P, sorted by \]~p, and the variance of each Ep. The table provides empirical evidence for the intuition that function words are translated less consistently than content words: The mean semantic entropy of each function-word POS is higher than that of any content-word POS. The table also shows that punctuation and interjections rank between the function words at the top and the content words at the bottom. This ranking is consistent with the intuition that punctuation and interjections have more semantic weight than function words, but less than content words.</Paragraph>
    <Paragraph position="2">  After analyzing the aggregated results, it was time to peek into the semantic entropy rankings within each POS. Several of these were particularly interesting. Table 2 explains the atypically high variance of the semantic entropy of punctuation.</Paragraph>
    <Paragraph position="3"> End-of-sentence punctuation is used very consistently and almost identically in English and in French. So, the question mark, the exclamation mark and the period have almost no semantic entropy. In contrast, the two languages have different rules for comas and colons, especially around quotations. Comas and dashes are often used for similar purposes, so one is often translated as the other.</Paragraph>
    <Paragraph position="4"> Moreover, English comas are often lost in translation. For these reasons, the short Table 2 includes both the lowest and the highest semantic entropy values for English words in the Hansards.</Paragraph>
    <Paragraph position="5"> Table 3 shows some of the adjectives, ranked by semantic Entropy. The top eight adjectives in the table say very little about the nouns that they might modify. They seem like thinly disguised function words that happen to appear in syntactic positions normally reserved for adjectives. Adjectives in the middle of the table are more typical, but they are less specific than the adjectives in the bottom third of the table.</Paragraph>
    <Paragraph position="6"> Table 4 displays a sorted sample of the pronouns.</Paragraph>
    <Paragraph position="7"> Topping the list are the English possessive suffixes, which have no equivalent in French or in most other languages. Existential &amp;quot;there&amp;quot; is next. &amp;quot;It&amp;quot; is high on the list because of its frequent pleonastic function (&amp;quot;It is necessary to....&amp;quot;). These four pronouns are atypically functional. The most frequent of the thirty seven pronouns in the corpus, &amp;quot;I,&amp;quot; is eleventh from the bottom of the list. The most consistently  translated pronouns are the archaic forms &amp;quot;thee&amp;quot; and &amp;quot;thou.&amp;quot;  The most interesting ranking of semantic entropies is among the verbs, including present and past participles. As shown in Table 5, verbs can have high entropies for several reasons. The verb with the highest semantic entropy by far is the functional verb place-holder &amp;quot;do.&amp;quot; Very high on the list are various forms of the functional auxiliaries &amp;quot;be, .... have,&amp;quot; and &amp;quot;(be) going (to),&amp;quot; as well as the modals &amp;quot;may,&amp;quot; &amp;quot;might,&amp;quot; and &amp;quot;shall.&amp;quot; The past participles &amp;quot;concerning, .... involving,&amp;quot; &amp;quot;according,&amp;quot; &amp;quot;dealing,&amp;quot; and &amp;quot;regarding&amp;quot; are near the top of the list because they occur most often as the heads of adjectival phrases modifying noun phrases, as in &amp;quot;the world according to NP&amp;quot;, an English construction that is usually paraphrased in translation. &amp;quot;Try&amp;quot; and &amp;quot;let&amp;quot; axe up there because they often serve as mere modal modifiers of a sentential argument. Most of the other verbs at the top of the list are light verbs. Verbs like &amp;quot;get,&amp;quot; &amp;quot;make,&amp;quot; &amp;quot;come,&amp;quot; &amp;quot;take,&amp;quot; &amp;quot;put, .... stand,&amp;quot; and &amp;quot;give&amp;quot; are often used as syntactic filler while most of the semantic content of the phrase is conveyed by their argument.</Paragraph>
  </Section>
  <Section position="6" start_page="44" end_page="45" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> The most in-depth study of semantic entropy and its applications to date was carried out by Resnik (Res93; Res95). Resnik's approach differs from the present one in three major ways. First, he defines semantic entropy over concepts, rather than over words. This definition is more useful for his particular applications, namely evaluating concept similarity and estimating selectional preferences. Second, in order to measure semantic similarity over concepts, his method requires a concept taxonomy, such as the Princeton WordNet (Milg0), which is grounded in the lexical ontology of a particular language. In contrast, the method presented in this paper requires a large bitext. Both kinds of resources are still available only for a limited number of languages, so only one of the two methods may be a viable option in any given situation. Third, Resnik's measure of information content is defined in terms of the logarithm of each concept's frequency in text, where the frequency of a concept is defined as the sum of the frequencies of words representing that concept in the taxonomy.</Paragraph>
    <Paragraph position="1"> Given only monolingual data, log-frequency is a relatively good estimator of semantic entropy. Looking through the various tables in this paper, you may have noticed that words with higher entropy tend to have higher frequency. Semantic entropy, as measured here, actually correlates quite well with the logarithm of word frequency (p = 0.79). This correlation is to be expected, since the maximum possible entropy of a word with frequency f is log(f), which is what Equation (3) evaluates to when a word is always linked to nothing. Yet the correlation is not perfect; simply sorting the words by frequency would produce a suboptimal result. For instance, the most frequent pronoun in Table 4 is eleventh from the bottom of the list of thirty seven, because 'T' has a very consistent meaning. Likewise, &amp;quot;going&amp;quot; has a higher entropy than &amp;quot;go&amp;quot; in Table 5, even though it is less than one fifth as frequent, because &amp;quot;going&amp;quot; can be used as a near-future tense marker whereas &amp;quot;go&amp;quot; has no such function. The best counter-example to the correlation between semantic entropy and log-frequency is the period, which is the most frequent token in the English Hansards and has a semantic entropy of zero.</Paragraph>
    <Paragraph position="2"> The method presented here for measuring semantic entropy is sensitive to ontological and syntactic differences between languages. It is partly motivated by the observation that translators must paraphrase when the target language has no obvious equivalent for some word or syntactic construct in the source text. There are many more ways to paraphrase something than to translate it literally, and translators usually strive for variety in order to improve readability. That's why, for example, English light verbs have such high entropies even though there are many English verbs that are more frequent. The entropy of English light verbs would likely remain relatively high if English/Chinese bitexts were used instead of English/French, because the lexicalization patterns involving light verbs in English are particular to English. Reliance on this property of translated texts is a double-edged sword, however, due to the converse possibility that two languages share an unusual syntactic construct or an unusual bit of ontology. In that case, the relevant semantic entropies may be estimated too low. Ideally, semantic entropy should be estimated by averaging each source language of interest over several different target languages.</Paragraph>
    <Paragraph position="3"> A more serious drawback of translational entropy as an estimate of semantic entropy is that words may be inconsistently translated either because they don't mean very much or because they mean several different things, or both. For example, WordNet 1.5 lists twenty six senses for the English verb &amp;quot;run.&amp;quot; We would expect the different senses to have different translations in other languages, and we would expect several of these senses to occur in any sufficiently large bitext, resulting in a high estimate of semantic entropy for &amp;quot;run&amp;quot; (5.65 in the Hansards). Meanwhile, Table 5 shows that the English verb &amp;quot;be&amp;quot; is translated much less consistently s than &amp;quot;run,&amp;quot; even though only nine senses are listed for it in WordNet.</Paragraph>
    <Paragraph position="4"> This is because &amp;quot;be&amp;quot; rarely conveys much information. It is useful to know about both of these components of semantic entropy, but it would be more useful to know about them separately (Ros97). This knowledge is contingent on knowledge of the elusive  Pr(senselword), which is currently the subject of much research (see, e.g. (NS~L96) and references therein). Knowing Pr(senselword) would also improve Resnik's method, which st) far has been forced to assume that this distribution is uniform (Res95).</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML