<?xml version="1.0" standalone="yes"?>
<Paper uid="P94-1010">
  <Title>REFERENCES</Title>
  <Section position="3" start_page="66" end_page="66" type="relat">
    <SectionTitle>
PREVIOUS WORK
</SectionTitle>
    <Paragraph position="0"> There is a sizable literature on Chinese word segmentation: recent reviews include (Wang et al., 1990; Wu and Tseng, 1993). Roughly, previous work can be classified into purely statistical approaches (Sproat and Shih, 1990), statistical approaches which incorporate lexical knowledge (Fan and Tsai, 1988; Lin et al., 1993), and approaches that include lexical knowledge combined with heuristics (Chen and Liu, 1992).</Paragraph>
    <Paragraph position="1"> Chen and Liu's (1992) algorithm matches words of an input sentence against a dictionary; in cases where various parses are possible, a set of heuristics is applied to disambiguate the analyses. Various morphological rules are then applied to allow for morphologically complex words that are not in the dictionary. Precision and recall rates of over 99% are reported, but note that this covers only words that are in the dictionary: &quot;the... statistics do not count the mistakes [that occur] due to the existence of derived words or proper names&quot; (Chen and Liu, 1992, page 105). Lin et al. (1993) describe a sophisticated model that includes a dictionary and a morphological analyzer. They also present a general statistical model for detecting 'unknown words' based on hanzi and part-of-speech sequences. Their unknown word model has the disadvantage, however, that it does not identify a sequence of hanzi as an unknown word of a particular category, but merely as an unknown word (of indeterminate category). For an application like TTS, it is necessary to know that a particular sequence of hanzi is of a particular category because, for example, that knowledge could affect the pronunciation. We therefore prefer to build particular models for different classes of unknown words, rather than building a single general model.</Paragraph>
  </Section>
</Paper>