<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1207">
  <Title>Statistically-Enhanced New Word Identification in a Rule-Based Chinese System</Title>
  <Section position="2" start_page="0" end_page="46" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In this paper, new words refer to newly coined words, occasional words and other rarely used words that are neither found in the dictionary of a natural language processing system nor recognized by the derivational rules or proper name identification rules of the system. Typical examples of such words are shown in the following sentences, with the new words underlined in bold.</Paragraph>
    <Paragraph position="1">  The automatic identification of such words by a machine is a trivial task in languages where words are separated by spaces in written texts. In languages like Chinese, where no word boundary exists in written texts, this is by no means an easy job. In many cases the machine will not even realize that there is an unfound word in the sentence since most single Chinese characters can be words by themselves.</Paragraph>
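To make the ambiguity concrete, here is a minimal sketch (the toy dictionary and input string are invented for illustration, not taken from the paper) that enumerates every dictionary-consistent segmentation of a short character string. Because single characters are themselves dictionary words, a sentence containing an unlisted word can still segment without error, so nothing signals that a word is missing.

```python
# Toy illustration of segmentation ambiguity in unsegmented text.
# The dictionary is hypothetical; a real system would have a far
# larger lexicon plus derivational and proper-name rules.

DICTIONARY = {"中", "国", "人", "中国", "中国人"}

def segmentations(chars: str):
    """Yield every way to split `chars` into dictionary words."""
    if not chars:
        yield []
        return
    for end in range(1, len(chars) + 1):
        prefix = chars[:end]
        if prefix in DICTIONARY:
            for rest in segmentations(chars[end:]):
                yield [prefix] + rest

# "中国人" yields three segmentations, and none of them betrays a
# missing dictionary entry, since every character is a word by itself.
for seg in segmentations("中国人"):
    print(" / ".join(seg))
```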
    <Paragraph position="2"> Purely statistical methods of word segmentation (e.g. de Marcken 1996; Sproat et al. 1996; Tung and Lee 1994; Lin et al. 1993; Chiang et al. 1992; Lua; Huang et al.) often fail to identify those words because of the sparse-data problem: the likelihood of such words appearing in the training texts is extremely low. There are also hybrid approaches, such as Nie et al. (1995), in which statistical methods and heuristic rules are combined to identify new words. These generally perform better than purely statistical segmenters, but the new words they are able to recognize are usually proper names and other relatively frequent words. They also require a reasonably large training corpus, and their performance is often domain-specific, depending on the training corpus used.</Paragraph>
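As a rough illustration of the sparse-data problem, the sketch below scores adjacent character pairs by pointwise mutual information, a generic ingredient of statistical segmentation (this is a textbook formulation, not the specific method of any system cited above). The `min_count` cutoff is the crux: a newly coined word that appears only once or twice in the training corpus never accumulates enough counts to be scored reliably.

```python
import math
from collections import Counter

def pmi_pairs(corpus: list[str], min_count: int = 5):
    """Score adjacent character pairs by pointwise mutual information.

    Pairs rarer than `min_count` are skipped: their probability
    estimates are too noisy to trust, which is exactly why
    low-frequency new words slip past purely statistical segmenters.
    """
    chars = Counter()
    pairs = Counter()
    total_pairs = 0
    for sent in corpus:
        for a, b in zip(sent, sent[1:]):
            chars[a] += 1
            pairs[(a, b)] += 1
            total_pairs += 1
        if sent:
            chars[sent[-1]] += 1  # count the final character too
    scores = {}
    total_chars = sum(chars.values())
    for (a, b), count in pairs.items():
        if count < min_count:
            continue  # sparse-data cutoff: rare pairs are unscoreable
        p_ab = count / total_pairs
        p_a = chars[a] / total_chars
        p_b = chars[b] / total_chars
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores
```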
    <Paragraph position="3"> Many word segmenters ignore low-frequency new words and treat their component characters as independent words, since such words are often of little significance in applications where sentence structure is not taken into consideration. For in-depth natural language understanding, however, where full parsing is required, identifying those words is critical: a single unidentified word can cause an entire sentence to fail.</Paragraph>
    <Paragraph position="4"> The new word identification mechanism presented here is used in a wide-coverage Chinese parser that performs full sentence analysis. It assumes the word segmentation process described in Wu and Jiang (1998). In this model, word segmentation, including unfound-word identification, is not a stand-alone process but an integral part of sentence analysis. The segmentation component provides a word lattice of the sentence that contains all the possible words, and final disambiguation is achieved during parsing.</Paragraph>
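The following is a minimal sketch of the word-lattice idea, assuming plain dictionary lookup (the actual segmentation component of Wu and Jiang (1998) is considerably richer): every dictionary word found at every starting position becomes an edge, candidate new words can be added as further edges, and the parser, not the segmenter, chooses the final path.

```python
def build_word_lattice(sentence: str, dictionary: set[str], max_len: int = 4):
    """Return lattice edges (start, end, word) for all dictionary matches.

    The segmenter does not commit to one segmentation; it hands the
    parser every possible word, and parsing itself disambiguates.
    Candidate new words can later be appended as additional edges.
    """
    edges = []
    for start in range(len(sentence)):
        for end in range(start + 1, min(start + max_len, len(sentence)) + 1):
            word = sentence[start:end]
            if word in dictionary or end - start == 1:
                # Single characters always enter the lattice, since
                # most Chinese characters can be words by themselves.
                edges.append((start, end, word))
    return edges
```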
    <Paragraph position="5"> In what follows, we discuss two hypotheses and their implementation. The first concerns the selection of candidate strings; the second concerns the assignment of parts of speech (POS) to those strings.</Paragraph>
  </Section>
class="xml-element"></Paper>