<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1728">
  <Title>
Table 1: LMR Tagging
                        Right Boundary (R)    Not Right Boundary (M)
Left Boundary (L)       LR                    LM
Not Left Boundary (M)   MR                    MM
2 Tagging Algorithms
</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In this paper we present Chinese word segmentation algorithms based on so-called LMR tagging. Our LMR taggers are implemented with the Maximum Entropy Markov Model, and we then use Transformation-Based Learning to combine the results of two LMR taggers that scan the input in opposite directions. Our system achieves F-scores of 95.9% and 91.6% on the Academia Sinica corpus and the Hong Kong City University corpus, respectively.</Paragraph>
    <Paragraph position="1"> 1 Segmentation as Tagging
Unlike English text, in which sentences are sequences of words delimited by whitespace, Chinese text represents sentences as strings of Chinese characters, or hanzi, without comparable natural delimiters. Therefore, the first step in a Chinese language processing task is to identify the sequence of words in a sentence and mark the boundaries in the appropriate places. This may sound simple, but in practice identifying words in Chinese is a non-trivial problem that has drawn a large body of research in the Chinese language processing community (Fan and Tsai, 1988; Gan et al., 1996; Sproat et al., 1996; Wu, 2003; Xue, 2003).</Paragraph>
    <Paragraph position="2"> The key to accurate automatic word identification in Chinese lies in the successful resolution of ambiguities and a proper way to handle out-of-vocabulary words. Ambiguity in Chinese word segmentation arises because a hanzi can occur in different word-internal positions (Xue, 2003). Given the proper context, generally provided by the sentence in which it occurs, the position of a hanzi can be determined. In this paper, we model Chinese word segmentation as a hanzi tagging problem and use a machine-learning algorithm to determine the appropriate position for each hanzi. There are several reasons why we may expect this approach to work. First, Chinese words generally have fewer than four characters. As a result, the number of positions is small. Second, although each hanzi can in principle occur in all possible positions, not all hanzi behave this way. A substantial number of hanzi are distributed in a constrained manner. For example, 们, the plural marker, almost always occurs in the word-final position. Finally, although Chinese words cannot be exhaustively listed and new words are bound to occur in naturally occurring text, the same is not true for hanzi. The number of hanzi stays fairly constant and we do not generally expect to see new hanzi.</Paragraph>
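The claim that many hanzi are positionally constrained can be checked on any segmented corpus. The Python sketch below is an illustration only, not part of the system described here: the function name position_counts, the four position labels, and the toy sentences are assumptions introduced for the example.

```python
from collections import Counter, defaultdict


def position_counts(segmented_sentences):
    """Count how often each hanzi occurs in each word-internal position.

    `segmented_sentences` is a list of sentences, each a list of words.
    """
    counts = defaultdict(Counter)
    for words in segmented_sentences:
        for word in words:
            last = len(word) - 1
            for i, hanzi in enumerate(word):
                if last == 0:
                    position = "single"   # the hanzi is a word by itself
                elif i == 0:
                    position = "initial"  # left edge of a longer word
                elif i == last:
                    position = "final"    # right edge of a longer word
                else:
                    position = "middle"
                counts[hanzi][position] += 1
    return counts


# Toy, hand-segmented data (illustrative only, not taken from the paper's corpora).
toy_corpus = [
    ["我们", "喜欢", "学生"],
    ["学生们", "上", "大学"],
    ["科学家", "来", "了"],
    ["他们", "学习"],
]

counts = position_counts(toy_corpus)
print(dict(counts["们"]))  # {'final': 3}
print(dict(counts["学"]))  # {'initial': 3, 'final': 1, 'middle': 1}
```

In this toy sample the plural marker appears only word-finally, while 学 is spread over several positions, which is the contrast the paragraph above describes.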
    <Paragraph position="3"> We represent the positions of a hanzi with four different tags (Table 1): LM for a hanzi that occurs on the left periphery of a word, followed by other hanzi; MM for a hanzi that occurs in the middle of a word; MR for a hanzi that occurs on the right periphery of a word, preceded by other hanzi; and LR for a hanzi that is a word by itself. We call this LMR tagging. With this approach, word segmentation is a process in which each hanzi is assigned an LMR tag and sequences of hanzi are then converted into sequences of words based on the LMR tags. The use of four tags is linguistically intuitive in that LM tags morphemes that are prefixes, or stems in the absence of prefixes; MR tags morphemes that are suffixes, or stems in the absence of suffixes; MM tags stems with affixes; and LR tags stems without affixes. Representing the distributions of hanzi with LMR tags also makes it easy to use machine learning algorithms that have been successfully applied to other tagging problems, such as POS tagging and the IOB tagging used in text chunking.</Paragraph>
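The mapping between segmented words and LMR tags can be sketched in a few lines of Python. This is a minimal illustration of the tag scheme only, not the taggers themselves: the function names words_to_lmr and lmr_to_words are hypothetical, and the lenient handling of a sequence that ends mid-word is an assumption of this sketch.

```python
def words_to_lmr(words):
    """Assign one LMR tag to every hanzi of an already segmented sentence."""
    tagged = []
    for word in words:
        if not word:
            continue  # skip empty strings rather than fail
        if len(word) == 1:
            tagged.append((word, "LR"))  # single-hanzi word
            continue
        tagged.append((word[0], "LM"))   # left periphery
        tagged.extend((hanzi, "MM") for hanzi in word[1:-1])  # word-internal
        tagged.append((word[-1], "MR"))  # right periphery
    return tagged


def lmr_to_words(tagged):
    """Rebuild words from (hanzi, tag) pairs; a word ends at every LR or MR tag."""
    words, current = [], ""
    for hanzi, tag in tagged:
        current += hanzi
        if tag in ("LR", "MR"):
            words.append(current)
            current = ""
    if current:
        # Assumption of this sketch: keep a trailing unfinished word
        # instead of discarding it.
        words.append(current)
    return words


tagged = words_to_lmr(["科学家", "来", "了"])
print(tagged)
# [('科', 'LM'), ('学', 'MM'), ('家', 'MR'), ('来', 'LR'), ('了', 'LR')]
print(lmr_to_words(tagged))
# ['科学家', '来', '了']
```

Because the decoder only looks at right boundaries, it always yields some segmentation, even for a tag sequence that a tagger produced inconsistently (for example, LM followed directly by LM).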
  </Section>
</Paper>