<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1207">
  <Title>Statistically-Enhanced New Word Identification in a Rule-Based Chinese System</Title>
  <Section position="3" start_page="46" end_page="47" type="metho">
    <SectionTitle>
2 Selection of candidate strings
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="46" end_page="46" type="sub_section">
      <SectionTitle>
2.1 Hypothesis
</SectionTitle>
      <Paragraph position="0"> Chinese used to be a monosyllabic language, with one-to-one correspondences between syllables, characters and words, but most words in modern Chinese, especially new words, consist of two or more characters. Of the 85,135 words in our system's dictionary, 9,217 are monosyllabic, 47,778 are disyllabic, 17,094 are trisyllabic, and the rest have four or more characters. Since hardly any new character is being added to the language, the unfound words we are trying to identify are almost always multiple-character words. Therefore, if we find a sequence of single characters (not subsumed by any words) after the completion of basic word segmentation, derivational morphology and proper name identification, this sequence is very likely to be a new word. This basic intuition has been discussed in many papers, such as Tung and Lee (1994). Consider the following sentence.</Paragraph>
      <Paragraph position="1"> (1) ~.~rj~ IIA~,~t~l~.J~)-~l~-~-.~t:a--.</Paragraph>
      <Paragraph position="2"> This sentence contains two new words (not including the name &amp;quot;~t~l~&amp;quot;, which is recognized by the proper name identification mechanism) that are unknown to our system: ~f~:~rj (probably the abbreviated name of a junior high school) and ~:~j (a word used in sports only but not in our dictionary). Initial lexical processing based on dictionary lookup and proper name identification produces the following segmentation, where ~-~rJ and ~a~.~\]- are segmented into single characters. In this case, both single-character strings are the new words we want to find.</Paragraph>
      <Paragraph position="3"> However, not every character sequence is a word in Chinese. Many such sequences are simply sequences of single-character words.</Paragraph>
      <Paragraph position="4"> Here is an example: after dictionary lookup, we get a sequence of 10 single characters.</Paragraph>
      <Paragraph position="5"> However, every character here is an independent word and there is no new word in the sentence. From this we see that, while most new words show up as a sequence of single characters, not every sequence of single characters forms a new word. The existence of a single-character string is the necessary but not sufficient condition for a new word. Only those sequences of single characters where the characters are unlikely to be a sequence of independent words are good candidates for new words.</Paragraph>
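The selection step described above can be sketched in Python as a scan over the segmented token list, collecting maximal runs of single-character tokens as candidates. The function name and token representation are illustrative assumptions, not the paper's code:

```python
def candidate_strings(tokens):
    """Collect maximal runs of two or more single-character tokens.

    `tokens` is a segmented sentence: multi-character entries are
    dictionary words or identified names; runs of single characters
    are potential new words (a necessary, not sufficient, condition).
    """
    candidates, run = [], []
    for tok in tokens:
        if len(tok) == 1:
            run.append(tok)
        else:
            if len(run) >= 2:
                candidates.append("".join(run))
            run = []
    if len(run) >= 2:
        candidates.append("".join(run))
    return candidates
```

Only the collected runs go on to the IWP test of Section 2.2; a lone single-character token between dictionary words is not a candidate.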
    </Section>
    <Section position="2" start_page="46" end_page="47" type="sub_section">
      <SectionTitle>
2.2 Implementation
</SectionTitle>
      <Paragraph position="0"> The hypothesis in the previous section can be implemented with the use of the Independent Word Probability (IWP), which can be a property of a single character or a string of characters.</Paragraph>
      <Paragraph position="1">  Most Chinese characters can be used either as independent words or component parts of multiple character words. The IWP of a single character is the likelihood for this character to appear as an independent word in texts:</Paragraph>
      <Paragraph position="2"> IWP(c) = N(Word(c)) / N(c) </Paragraph>
      <Paragraph position="3"> where N(Word(c)) is the number of occurrences of a character as an independent word in the sentences of a given text corpus and N(c) is the total number of occurrences of this character in the same corpus. In our implementation, we computed the probability from a parsed corpus: we went through all the leaves of the trees, counting the occurrences of each character and the occurrences of each character as an independent word.</Paragraph>
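This counting procedure can be sketched as follows, assuming the parsed corpus is flattened into (character, is_independent_word) leaf observations; the function and variable names are hypothetical:

```python
from collections import Counter

def estimate_iwp(leaves):
    """Estimate IWP(c) = N(Word(c)) / N(c) from parsed-corpus leaves.

    `leaves` yields (character, is_word) pairs: one entry per character
    occurrence, with is_word=True when the character stood alone as an
    independent word at that leaf of the parse tree.
    """
    total, as_word = Counter(), Counter()
    for char, is_word in leaves:
        total[char] += 1
        if is_word:
            as_word[char] += 1
    return {c: as_word[c] / total[c] for c in total}
```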
      <Paragraph position="4"> The parsed corpus we used contains about 5,000 sentences and was of course not big enough to contain every character in the Chinese language. This did not turn out to be a major problem, though. We find that, as long as all the frequently used single-character words are in the corpus, we can get good results, for what really matters is the IWP of this small set of frequent characters/words. These characters/words are bound to appear in any reasonably large collection of texts.</Paragraph>
      <Paragraph position="5"> Once we have the IWP of individual characters (IWP(c)), we can compute the IWP of a character string (IWP(s)). IWP(s) is the probability of a sequence of two or more characters being a sequence of independent words. This is simply the joint probability of the IWP(c) of the component characters: IWP(s) = IWP(c1) * IWP(c2) * ... * IWP(cn).</Paragraph>
      <Paragraph position="6">  With IWP(c) and IWP(s) defined, we then define a threshold T for IWP. A sequence S of two or more characters is considered a candidate for a new word only if its IWP(s) &lt; T. When IWP(s) reaches T, the likelihood for the characters to be a sequence of independent words is too high and the string will not be considered to be a possible new word. In our implementation, the value of T is empirically determined. A lower T results in higher precision and lower recall, while a higher T improves recall at the expense of precision. We tried different values and weighed recall against precision until we got the best performance. ~-~)J and ~'~ in Sentence (1) are identified as candidates because IWP(s)(~) = 8% and IWP(s)(~'~\]~) = 10% while the threshold is 15%. In our system, precision is not a big concern at this stage because the final filtering is done in the parsing process. We put recall first to ensure that the parser will have every word it needs. We also tried to increase precision, but not at the expense of recall.</Paragraph>
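A minimal sketch of the IWP(s) computation and threshold test, assuming per-character IWP values are available as a dictionary; the 15% default mirrors the threshold reported in the text, and the helper names are assumptions:

```python
def iwp_string(s, iwp):
    """IWP(s): joint probability that every character in s is an
    independent word, treating the characters as independent events."""
    p = 1.0
    for c in s:
        p *= iwp.get(c, 0.0)
    return p

def is_new_word_candidate(s, iwp, threshold=0.15):
    """A string is kept as a new-word candidate only if IWP(s) < T."""
    return iwp_string(s, iwp) < threshold
```

An unseen character contributes probability 0.0 here, which makes the string an immediate candidate; this matches the observation that only the frequent single-character words really matter for the test.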
    </Section>
  </Section>
  <Section position="4" start_page="47" end_page="49" type="metho">
    <SectionTitle>
3 POS Assignment
</SectionTitle>
    <Paragraph position="0"> Once a character string is identified as a candidate for a new word, we must decide what syntactic category or POS to assign to this possible new word. This is required for sentence analysis, where every word in the sentence must have at least one POS.</Paragraph>
    <Section position="1" start_page="47" end_page="48" type="sub_section">
      <SectionTitle>
3.1. Hypothesis
</SectionTitle>
      <Paragraph position="0"> Most multiple-character words in Chinese have word-internal syntactic structures, which are roughly the POS sequences of the component characters (assuming each character has a POS or potential POS). A two-character verb, for example, can have a V-V, V-N or A(dv)-V internal structure. For a two-character string to be assigned the POS of verb, the POS/potential POS of its component characters must match one of those patterns. However, this matching alone is not a sufficient condition for POS assignment. Considering the fact that a single character can have more than one POS and a single POS sequence can correspond to the internal word structures of different parts of speech (V-N can be a verb or a noun, for instance), simply assigning POS on the basis of word-internal structure will result in massive over-generation and introduce too much noise into the parsing process. To prune away the unwanted guesses, we need more help from statistics.</Paragraph>
      <Paragraph position="1"> When we examine the word formation process in Chinese, we find that new words are often modeled on existing words. Take the newly coined verb ~C/~J&amp;quot; as an example. Scanning our dictionary, we find that ~&amp;quot; appears many times as the first character of a two-character verb, such as F~'5~, ~, ~'~, ~'~, ~\[,, ~'~'~J~, etc.</Paragraph>
      <Paragraph position="2"> Meanwhile, ~J&amp;quot; appears many times as the second  character of a two-character verb, such as ~\]~, ~,.~\]~j-, z\]z~, ~\]~\]., ~l-~J, ~\]r~, etc. This leads us to the following hypothesis: A candidate character string for a new word is likely to have a given POS if the component characters of this string have appeared in the corresponding positions of many existing words with this POS.</Paragraph>
    </Section>
    <Section position="2" start_page="48" end_page="49" type="sub_section">
      <SectionTitle>
3.2. Implementation
</SectionTitle>
      <Paragraph position="0"> To represent the likelihood for a character to appear in a given position of a word with a given POS and a given length, we assign probabilities of the following form to each character:</Paragraph>
      <Paragraph position="1"> P(Cat, Pos, Len) </Paragraph>
      <Paragraph position="2"> where Cat is the category/POS of a word, Pos is the position of the character in the word, and Len is the length (number of characters) of the word.</Paragraph>
      <Paragraph position="3"> The probability of a character appearing as the second character in a four-character verb, for instance, is represented as P(Verb,2,4).</Paragraph>
      <Paragraph position="4"> 3.2.1. Computing P(Cat, Pos, Len) There are many instantiations of P(Cat, Pos, Len), depending on the values of the three variables. In our implementation, we limited the values of Cat to Noun, Verb and Adjective, since they are the main open-class categories and therefore the POSes of most new words. We also assume that most new words will have between 2 and 4 characters, thereby limiting the values of Pos to 1--4 and the values of Len to 2--4. Consequently each character will have 27 different kinds of probability values associated with it. We assign to each of them a 4-character name where the first character is always &amp;quot;P&amp;quot;, the second the value of Cat, the third the value of Pos, and the fourth the value of Len. Here are some examples: Pn12 (the probability of appearing as the first character of a two-character noun), Pv22 (the probability of appearing as the second character of a two-character verb), Pa34 (the probability of appearing as the third character of a four-character adjective). The values of those 27 kinds of probabilities are obtained by processing the 85,135 headwords in our dictionary. For each character in Chinese, we count the number of occurrences of this character in a given position of words with a given length and given category and then divide it by the total number of occurrences of this character in the headwords of the dictionary. For example,</Paragraph>
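The headword counting just described might be sketched as follows, assuming the dictionary is available as (word, category) pairs with single-letter category codes (n, v, a); a key such as ("v", 1, 2) corresponds to the paper's Pv12. This is a sketch under those assumptions, not the authors' implementation:

```python
from collections import Counter

def estimate_pos_probs(headwords, cats=("n", "v", "a"), lens=(2, 3, 4)):
    """Estimate P(Cat, Pos, Len) for every character from dictionary
    headwords.

    Returns {char: {(cat, pos, length): probability}}, where each
    probability is the count of the character at that position in words
    of that category and length, divided by the character's total
    occurrences in all headwords.
    """
    total = Counter()   # N(c): occurrences of char in all headwords
    hits = Counter()    # N(cat, pos, len)(c)
    for word, cat in headwords:
        for pos, char in enumerate(word, start=1):
            total[char] += 1
            if cat in cats and len(word) in lens:
                hits[(char, cat, pos, len(word))] += 1
    probs = {}
    for (char, cat, pos, length), n in hits.items():
        probs.setdefault(char, {})[(cat, pos, length)] = n / total[char]
    return probs
```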
      <Paragraph position="5"> Pv12(c) = N(v12(c)) / N(c) </Paragraph>
      <Paragraph position="6"> where N(v12(c)) is the number of occurrences of a character in the first position of a two-character verb while N(c) is the total number of occurrences of this character in the dictionary headwords. Here are some of the values we get for the character~:</Paragraph>
      <Paragraph position="8"> It is clear from those numbers that the character tends to occur in the second position of two-character and three-character verbs.</Paragraph>
      <Paragraph position="9"> 3.2.2. Using P(Cat, Pos, Len) Once a character string is identified as a new word candidate, we will calculate the POS probabilities for the string. For each string, we will get P(noun), P(verb) and P(adj), which are respectively the probabilities of this string being a noun, a verb or an adjective. They are the joint probabilities of the P(Cat, Pos, Len) of the component characters of this string. We then measure the outcome against a threshold. For a new word string to be assigned the syntactic category Cat, its P(Cat) must reach the threshold. The threshold for each P(Cat) is independently determined so that we do not favor a certain POS (e.g. Noun) simply because there are more nouns in the dictionary.</Paragraph>
      <Paragraph position="10"> If a character string reaches the threshold of more than one P(Cat), it will be assigned more than one syntactic category. A string that has both P(noun) and P(verb) reaching the threshold, for example, will have both a noun and a verb added to the word lattice. The ambiguity is then resolved in the parsing process. If a string passes the IWP test but fails the P(Cat) test, it will receive noun as its syntactic category. In other words, the default POS for a new word candidate is noun. This is what happened to ~f~ in Sentence (1). ~-~D passed the IWP test, but failed each of the P(Cat) tests. As a result, it is made a noun by default. As we can see, this assignment is the correct one (at least in this particular sentence).</Paragraph>
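Putting the pieces together, the P(Cat) scoring with per-category thresholds and the noun default can be sketched as follows; the threshold values and data layout are assumptions, and `probs` has the shape produced by the counting step of 3.2.1:

```python
def assign_pos(s, probs, thresholds):
    """Assign syntactic categories to a candidate string s.

    P(Cat) is the joint probability of P(Cat, pos, len(s)) over the
    component characters. Every category whose P(Cat) reaches its own
    threshold is assigned (ambiguity is left to the parser); a
    candidate that fails all tests defaults to noun ("n").
    """
    assigned = []
    for cat, t in thresholds.items():
        p = 1.0
        for pos, char in enumerate(s, start=1):
            p *= probs.get(char, {}).get((cat, pos, len(s)), 0.0)
        if p >= t:
            assigned.append(cat)
    return assigned or ["n"]
```

Keeping one threshold per category, rather than a single global cutoff, is what prevents the dictionary's surplus of nouns from biasing every assignment toward noun.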
    </Section>
  </Section>
</Paper>