<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1202">
  <Title>Sense-Tagging Chinese Corpus</Title>
  <Section position="3" start_page="0" end_page="137" type="metho">
    <SectionTitle>
2 Degree of Polysemy in Mandarin Chinese
</SectionTitle>
    <Paragraph position="0"> The degree of polysemy is defined as the average number of senses per word. We adopt the tag set of tong2yi4ci2ci2lin2 (同義詞詞林), abbreviated as Cilin (Mei et al., 1982). It is composed of 12 large categories, 94 middle categories, and 1,428 small categories.</Paragraph>
    <Paragraph position="1"> The small categories (the finest granularity) are used to compute the distribution of word senses.</Paragraph>
    <Paragraph position="2"> Besides Cilin, the ASBC corpus is used to count word frequencies. In total, 28,321 word types appear in both Cilin and the ASBC corpus.</Paragraph>
    <Paragraph position="3"> Here a word type corresponds to a dictionary entry.</Paragraph>
    <Paragraph position="5"/>
    <Section position="1" start_page="0" end_page="137" type="sub_section">
      <SectionTitle>
Total Word
Types
</SectionTitle>
      <Paragraph position="0"/>
      <Paragraph position="2"> Of these, 5,922 words are polysemous, i.e., they have more than one sense. Table 1 lists the statistics. We divide the degree of ambiguity into three levels according to the number of senses of a word: low (2-4), middle (5-8), and high (&gt;8). The statistics show that 93.77% of the polysemous word types belong to the low-ambiguity class.</Paragraph>
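      <Paragraph> As an illustration only, the three ambiguity levels can be written as a short Python sketch. The function names and the toy input format (a dictionary from word to number of senses) are assumptions made for this example and are not part of the original description.

```python
def ambiguity_level(num_senses: int) -> str:
    """Map a sense count to the ambiguity level used in the text."""
    if num_senses >= 9:
        return "high"        # more than 8 senses
    if num_senses >= 5:
        return "middle"      # 5-8 senses
    if num_senses >= 2:
        return "low"         # 2-4 senses
    return "monosemous"      # a single sense, so not polysemous


def level_distribution(sense_counts: dict) -> dict:
    """Percentage of polysemous word types in each level (hypothetical helper)."""
    levels = [ambiguity_level(n) for n in sense_counts.values() if n >= 2]
    total = len(levels)
    if total == 0:
        return {"low": 0.0, "middle": 0.0, "high": 0.0}
    return {lv: 100.0 * levels.count(lv) / total
            for lv in ("low", "middle", "high")}
```

      Monosemous words are excluded from the denominator, in line with the statement that the percentages refer to polysemous word types.</Paragraph>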
      <Paragraph position="3"> We further consider POS when computing the distribution of word senses. Table 2 shows the statistics. N, V, A, F, and K denote nouns, verbs, adjectives, numerals, and auxiliaries (adverbs), respectively. Most words belong to the low-ambiguity class regardless of their POS. Moreover, ambiguity decreases when POS is considered.</Paragraph>
      <Paragraph position="4"> The number of polysemous words drops to 4,132. For A and K, no word has more than 7 senses, and the percentages in the low-ambiguity class are 98.22% and 97.08%, respectively. For N and V, there are some highly ambiguous words. In particular, the verb 打 (da3) has 19 senses. The percentages in the low-ambiguity class are 97.53% and 94.70%, respectively.</Paragraph>
      <Paragraph position="5"> Next, the frequency of word types is considered. The ASBC corpus is used to count the occurrences of word types. Table 3 lists the statistics. A word token is an occurrence of a type in the corpus. On average, words of low, middle, and high ambiguity appear 205.96, 1926.65, and 4480.28 times, respectively. Table 1 shows that 93.77% of polysemous words belong to the low-ambiguity class, but Table 3 shows that they account for only 58.52% of the tokens in the ASBC corpus.</Paragraph>
      <Paragraph position="6"> Table 4 summarizes the distribution of word senses and frequencies. Low frequency denotes fewer than 100 occurrences, middle frequency denotes between 100 and 1000 occurrences, and high frequency denotes more than 1000 occurrences. Rows C and A in Table 4 denote the number of word types and word tokens, respectively. The last column gives the percentage for each ambiguity degree. For example, the percentage of word types with low ambiguity is 96.64% (i.e., 3993/4132). This table shows the following two phenomena: (1) POS information reduces the degree of ambiguity. In total, 8.94% of word tokens are highly ambiguous in Table 3; this decreases to 0.47% in Table 4.</Paragraph>
      <Paragraph position="7"> (2) Highly ambiguous words tend to be highly frequent. In the low-ambiguity row, there are 3,112 low-frequency words, which occur 70,131 times in the ASBC corpus.</Paragraph>
      <Paragraph position="8"> By comparison, there are only 881 middle- or high-frequency words, but they occur 966,774 times. That is, 23.67% of word types are middle- or high-frequency words, and they account for 94.06% of word tokens. In the high-ambiguity row, there are only a few words, but they occur frequently in the ASBC corpus. This shows that semantic tagging is a challenging problem in Mandarin Chinese.</Paragraph>
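      <Paragraph> The frequency bands and the worked percentage above can be checked with a minimal Python sketch. The treatment of the exact boundary values (100 and 1000 occurrences) is an assumption, because the text does not say which band they fall into; the helper names are hypothetical.

```python
def frequency_band(occurrences: int) -> str:
    """Frequency band of a word type, following the thresholds in the text."""
    # Assumption: exactly 100 and exactly 1000 occurrences count as "middle".
    if occurrences > 1000:
        return "high"
    if occurrences >= 100:
        return "middle"
    return "low"


def share(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Low-ambiguity word types quoted in the text:
# 3,112 low-frequency plus 881 middle- or high-frequency types.
low_ambiguity_types = 3112 + 881
print(share(low_ambiguity_types, 4132))  # 96.64, the 3993/4132 example
```

      The sum 3,112 + 881 = 3,993 reproduces the numerator of the 3993/4132 example given for Table 4.</Paragraph>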
    </Section>
  </Section>
  <Section position="4" start_page="137" end_page="137" type="metho">
    <SectionTitle>
3 Semantic Tagging
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="137" end_page="137" type="sub_section">
      <SectionTitle>
3.1 Tagging Unambiguous Words
</SectionTitle>
      <Paragraph position="0"> In semantic tagging, the small categories are used as sense tags. We postulate that the sense definitions for each word in Cilin are complete. That is, a word that has only one sense in Cilin is called an unambiguous word or a monosemous word. If POS information is also considered, a word may be unambiguous under a specific POS.</Paragraph>
      <Paragraph position="1"> Because we do not have a semantically tagged corpus for training, we try to acquire the context for each semantic tag starting from the unambiguous words.</Paragraph>
      <Paragraph position="2"> The ASBC corpus is the target of our study. At the first stage, only those words that are unambiguous in Cilin and also appear in the ASBC corpus are tagged. Figure 1 shows this case.</Paragraph>
      <Paragraph position="3">  An unambiguous word (and hence its sense tag) is characterized by the words surrounding it. The window size is set to 6, and stop words are removed. A list of stop words is trained from the ASBC corpus. Words with the POS tags Neu (數詞定詞), DE (的, 之, 得, 地), SHI (是), FW (外文標記), C (連接詞), T (語助詞), and I (感嘆詞) are regarded as stop words. A sense tag Ctag is represented as a vector (w1, w2, ..., wn), where n is the vocabulary size and wi is the weight of the i-th word cw. The weight can be determined in the following two ways.</Paragraph>
      <Paragraph position="4">  where P(Ctag) is the probability of Ctag, P(cw) is the probability of cw, P(Ctag, cw) is the co-occurrence probability of Ctag and cw, f(Ctag) is the frequency of Ctag, f(cw) is the frequency of cw, f(Ctag, cw) is the co-occurrence frequency of Ctag and cw, and N is the total number of words in the corpus.</Paragraph>
      <Paragraph position="5"> (2) EM metric (Ballesteros and Croft, 1998)</Paragraph>
      <Paragraph position="7"/>
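      <Paragraph> The construction of sense vectors from unambiguous words can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the first weighting formula is not preserved in this text, so the sketch assumes the standard mutual information form log2(P(Ctag, cw) / (P(Ctag) P(cw))), which equals log2(f(Ctag, cw) N / (f(Ctag) f(cw))) in terms of the quantities defined above. It also assumes that a window size of 6 means three words on each side, and that stop words are removed after the window is taken. All function names and the input format are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_words(tokens, index, stop_words, half_window=3):
    """Words around tokens[index] (assumed 3 per side), stop words dropped."""
    left = tokens[max(0, index - half_window):index]
    right = tokens[index + 1:index + 1 + half_window]
    return [w for w in left + right if w not in stop_words]


def mi_sense_vectors(tagged_sentences, stop_words):
    """Build one sparse MI-weighted vector per sense tag.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs,
    where tag is a sense tag for unambiguous words and None otherwise.
    """
    tag_freq = Counter()               # f(Ctag)
    word_freq = Counter()              # f(cw)
    pair_freq = defaultdict(Counter)   # f(Ctag, cw)
    total = 0                          # N
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        total += len(words)
        word_freq.update(words)
        for i, (_, tag) in enumerate(sentence):
            if tag is None:
                continue
            tag_freq[tag] += 1
            pair_freq[tag].update(context_words(words, i, stop_words))
    vectors = {}
    for tag, pairs in pair_freq.items():
        vectors[tag] = {
            cw: math.log2(f_pair * total / (tag_freq[tag] * word_freq[cw]))
            for cw, f_pair in pairs.items()
        }
    return vectors
```

      Each vector is stored sparsely as a dictionary from context word to weight, which avoids materializing all n vocabulary dimensions.</Paragraph>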
    </Section>
    <Section position="2" start_page="137" end_page="137" type="sub_section">
      <SectionTitle>
3.2 Tagging Ambiguous Words
</SectionTitle>
      <Paragraph position="0"> At the second stage, we deal with those words that have more than one sense in Cilin.</Paragraph>
      <Paragraph position="1"> Figure 2 shows the words we consider.</Paragraph>
      <Paragraph position="2">  The approach we adopt for semantic tagging rests on an underlying assumption: each sense has a characteristic context that is different from the contexts of all the other senses. In addition, all words expressing the same sense share the same characteristic context. We apply the information trained at the first stage to select the best sense tag from the candidates of each ambiguous word. Recall that a vector corresponds to a sense tag. We use the same procedure as in Section 3.1 to build the context vector of an ambiguous word. The cosine formula shown below measures the similarity between a sense vector and a context vector, where w and v are a sense vector and a context vector, respectively. The sense tag with the highest similarity score is chosen.</Paragraph>
      <Paragraph position="3"> $$\cos(w, v) = \frac{w \cdot v}{|w|\,|v|}$$ We retrain the sense vector for each sense tag after the unambiguous words are resolved.</Paragraph>
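      <Paragraph> The selection step can be sketched in Python as follows. The sparse dictionary representation of vectors and the function names are assumptions made for illustration; only the cosine measure and the choice of the highest-scoring candidate come from the text.

```python
import math


def cosine(w: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors (word -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in w.items())
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_w == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_w * norm_v)


def best_sense(candidates, sense_vectors, context_vector):
    """Return the candidate sense tag whose vector is closest to the context."""
    return max(
        candidates,
        key=lambda tag: cosine(sense_vectors.get(tag, {}), context_vector),
    )
```

      For an ambiguous word, candidates would be the sense tags listed for it in Cilin, and context_vector would be built from its surrounding words as in Section 3.1.</Paragraph>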
    </Section>
    <Section position="3" start_page="137" end_page="137" type="sub_section">
      <SectionTitle>
3.3 Tagging Unknown Words
</SectionTitle>
      <Paragraph position="0"> Those words that appear in the ASBC corpus but are not included in Cilin are called unknown words. All 1,428 sense tags are possible candidates. Intuitively, the algorithm in Section 3.2 can be applied directly to select a sense tag from the 1,428 candidates. However, the candidate set is very large. Here we use external evidence from the mapping between WordNet synsets (Fellbaum, 1998) and Cilin sense tags to narrow down the candidate set.</Paragraph>
      <Paragraph position="2"/>
      <Paragraph position="3"> Figure 3 summarizes the flow of our algorithm.</Paragraph>
      <Paragraph position="4"> It is illustrated as follows.</Paragraph>
      <Paragraph position="5">  (1) Find all the English translations of an unknown Chinese word by looking up a Chinese-English dictionary.</Paragraph>
      <Paragraph position="6"> (2) Find all the synsets of the English translations by looking up WordNet. We do not resolve translation ambiguity and target polysemy at these two steps, so the retrieved synsets may cover more senses than those of the original Chinese word.</Paragraph>
      <Paragraph position="7"> (3) Transform the synsets back to Cilin sense tags by looking up a mapping table. How the mapping table is set up is discussed in Section 3.3.1.</Paragraph>
      <Paragraph position="8"> (4) Select a sense tag from the candidates proposed at step (3) by using the WSD method of Section 3.2.</Paragraph>
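      <Paragraph> Steps (1) to (4) can be sketched in Python as follows. The three lookup tables are represented as plain dictionaries, and the similarity function is passed in as a parameter; these interfaces and all names are assumptions made for illustration rather than the resources actually used.

```python
def candidate_tags(word, zh_en_dict, synsets_of, synset_to_cilin):
    """Steps (1)-(3): Chinese word -> translations -> synsets -> Cilin tags."""
    tags = set()
    for translation in zh_en_dict.get(word, []):            # step (1)
        for synset in synsets_of.get(translation, []):      # step (2)
            tags.update(synset_to_cilin.get(synset, []))    # step (3)
    return tags


def tag_unknown_word(word, context_vector, zh_en_dict, synsets_of,
                     synset_to_cilin, sense_vectors, similarity):
    """Step (4): pick the candidate tag most similar to the word's context.

    Returns None when no candidate is found, for example when the word
    is missing from the Chinese-English dictionary.
    """
    candidates = candidate_tags(word, zh_en_dict, synsets_of, synset_to_cilin)
    if not candidates:
        return None
    return max(
        sorted(candidates),  # sorted so that ties are broken deterministically
        key=lambda tag: similarity(sense_vectors.get(tag, {}), context_vector),
    )
```

      Because translation ambiguity and target polysemy are not resolved in steps (1) and (2), the candidate set may contain tags unrelated to the word; the similarity comparison in step (4) is what filters them.</Paragraph>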
      <Paragraph position="9"> Figure 4 shows the unknown words we deal with at this stage. Words that are not included in our Chinese-English dictionary are not considered, so only some of the unknown words are resolved. In other words, some words remain without sense tags.</Paragraph>
    </Section>
    <Section position="4" start_page="137" end_page="137" type="sub_section">
      <SectionTitle>
Unambiguous Words
Unknown Words
Ambiguous Words
</SectionTitle>
      <Paragraph position="0"> First, we map the unambiguous words (specified in Section 3.1) into WordNet by looking up a Chinese-English dictionary. Although these words do not have translation ambiguity, the corresponding English translation may have the target polysemy problem. In other words, the English translation may cover irrelevant senses besides the correct one. The following algorithm finds the synset most similar to the Chinese sense tag.</Paragraph>
      <Paragraph position="1">  (1) If the English translation corresponds to only one synset, this synset is the solution. (2) If the English translation corresponds to more than one synset, POS is considered: (a) If the Chinese sense tag belongs to one of categories A-D in Cilin (i.e., a noun sense), and there is only one noun synset, then that synset is adopted. Otherwise, we translate the context vector of the Chinese sense into English, compare it with the vectors of the synsets, and select the most similar synset.</Paragraph>
      <Paragraph position="2"> (b) If the Chinese sense tag belongs to one of categories F-J in Cilin (i.e., a verb sense), we try to find a verb synset in the same way as in (a). If this fails, we try noun and adjective synsets instead.</Paragraph>
      <Paragraph position="3"> (c) If the Chinese sense tag belongs to category E in Cilin (i.e., an adjective sense), we try adjective, adverb, noun, and verb synsets in sequence.</Paragraph>
      <Paragraph position="4"> (d) If the Chinese sense tag belongs to category K in Cilin (i.e., an adverb sense), only adverb synsets are considered.</Paragraph>
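      <Paragraph> The POS-driven selection in rules (1) and (2) can be sketched in Python as follows. Several details are assumptions: synsets are given as (identifier, POS) pairs using the WordNet POS letters n, v, a, and r; the vector comparison is abstracted as a similarity function over synset identifiers; the comparison is restricted to synsets of the POS being tried; and for verb senses the fallback tries noun before adjective. Categories not mentioned in the rules yield no synset.

```python
# POS letters to try, in order, keyed by the first letter of the Cilin tag.
POS_ORDER = {
    **dict.fromkeys("ABCD", ["n"]),             # noun senses
    **dict.fromkeys("FGHIJ", ["v", "n", "a"]),  # verb senses, then fallbacks
    "E": ["a", "r", "n", "v"],                  # adjective senses
    "K": ["r"],                                 # adverb senses
}


def choose_synset(cilin_tag, synsets, similarity):
    """Pick one synset for a Cilin tag.

    synsets: list of (synset_id, pos) pairs for the English translation.
    similarity: function from synset_id to a score against the translated
    context vector of the Chinese sense (hypothetical interface).
    """
    if len(synsets) == 1:                       # rule (1)
        return synsets[0][0]
    for pos in POS_ORDER.get(cilin_tag[0], []):  # rule (2)
        matching = [sid for sid, p in synsets if p == pos]
        if len(matching) == 1:
            return matching[0]
        if matching:
            return max(matching, key=similarity)
    return None
```

      Returning None models the case in which no synset of an admissible POS exists, so the Cilin tag stays unmapped.</Paragraph>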
      <Paragraph position="5"> Next, we consider the ambiguous words.</Paragraph>
      <Paragraph position="6"> A Chinese-English dictionary lookup finds all the English translations. A WordNet search collects the synset candidates for the translations. Some synsets are selected and regarded as the mapping of the Cilin sense tag. Here the problems of translation ambiguity and target polysemy must be faced. In other words, not all English translations cover the Cilin sense. Because the goal is to build a mapping table between WordNet synsets and Cilin sense tags, we neglect the problem of translation ambiguity and follow the method in the previous paragraph to choose the most similar synsets.</Paragraph>
      <Paragraph position="7"> During mapping, English translations of a word may not be found in the Chinese-English dictionary, and WordNet may not contain the English translations even when the dictionary lookup is successful. Thus, only 1,328 of the 1,428 Cilin tags are mapped to WordNet synsets. Conversely, there remain some WordNet synsets that do not correspond to any Cilin sense tag. Let such a synset be Si. We follow relational pointers such as hypernym, hyponym, similar, derived, antonym, or participle to collect the neighboring synsets, denoted by Sj. The following method selects suitable Cilin tag(s) for Si.</Paragraph>
      <Paragraph position="8"> (1) If Sj is the only synset that has been mapped to Cilin tags, we choose a Cilin tag and map Si to it.</Paragraph>
      <Paragraph position="9"> (2) If more than one Sj (say, Sj1, Sj2, ...) has been mapped to Cilin tags, we choose the Cilin tags that the most synsets map to.</Paragraph>
      <Paragraph position="10"> The above method is called the more restrictive scheme. An alternative (called the less restrictive scheme) selects all the Cilin tags that the neighboring synsets map to. If no Cilin tags can be found from the neighboring synsets, we extend the range by one more step and repeat the selection procedure until all the synsets are considered.</Paragraph>
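      <Paragraph> Both schemes, together with the range extension, can be sketched in Python as a breadth-first search over relational pointers. The graph and mapping are plain dictionaries, and all names are hypothetical. One detail is an assumption: when several Cilin tags receive the same highest number of votes (including the case of a single mapped neighbor with several tags), the sketch keeps all of them, since the text does not say how such ties are broken.

```python
from collections import Counter


def tags_from_neighbors(synset, neighbors, mapping, restrictive=True):
    """Select Cilin tags for an unmapped synset Si from its neighbors Sj.

    neighbors: synset -> list of synsets reachable by one relational pointer.
    mapping:   synset -> list of Cilin tags already assigned to it.
    """
    seen = {synset}
    frontier = [synset]
    while frontier:
        next_frontier = []
        votes = Counter()  # Cilin tag -> number of neighbors mapped to it
        for current in frontier:
            for sj in neighbors.get(current, []):
                if sj in seen:
                    continue
                seen.add(sj)
                next_frontier.append(sj)
                votes.update(set(mapping.get(sj, [])))
        if votes:
            if not restrictive:
                return set(votes)  # less restrictive: every tag found
            top = max(votes.values())
            return {tag for tag, n in votes.items() if n == top}
        frontier = next_frontier   # nothing found: extend the range by one
    return set()
```

      The search stops at the first distance at which any mapped neighbor is found, which corresponds to extending the range only when the closer neighbors provide no Cilin tag.</Paragraph>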
    </Section>
  </Section>
</Paper>