<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0902">
  <Title>The applications of unsupervised learning to Japanese grapheme-phoneme alignment</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The objective of this paper is to analyse the applicability of statistical and learning methods to automated grapheme-phoneme alignment in Japanese, without reliance on pre-annotated training data or any form of supervision. The two principal models proposed herein are a simple statistical model that does not rely on learning techniques, and an incremental learning method derived therefrom, incorporating automated &amp;quot;pseudo-supervision&amp;quot; drawing on prior alignments. The incremental learning method selects a single alignment candidate to accept at each iteration, and adjusts the statistical model accordingly to aid in the subsequent disambiguation of the residual G-P tuples.</Paragraph>
    <Paragraph position="1"> Grapheme-phoneme (&amp;quot;G-P&amp;quot;) alignment is defined as the task of maximally segmenting a grapheme compound into morpho-phonic units, and aligning each unit to the corresponding substring in the phoneme compound (Bilac et al., 1999). Its main use is in portraying the phonological interaction between adjoining grapheme segments, and in implicitly describing the range of readings each grapheme segment can take. We further suggest that a large-scale database of maximally aligned G-P tuples has applications within the more conventional task of G-P translation (Klatt, 1987; Huang et al., 1994; Divay and Vitale, 1997).</Paragraph>
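As a purely illustrative sketch of the candidate space this definition induces, the following enumerates every way of splitting a grapheme string into contiguous segments and pairing each segment, in order, with a non-empty substring of the phoneme string. The function name `alignments` and the plain-string representation are our own assumptions, not the system described in this paper.

```python
def alignments(graphemes, phonemes):
    """Enumerate every candidate alignment: each splits the grapheme
    string into contiguous segments and pairs each segment, in order,
    with a non-empty contiguous substring of the phoneme string."""
    if not graphemes and not phonemes:
        yield []          # both strings fully consumed: one complete alignment
        return
    if not graphemes or not phonemes:
        return            # one side exhausted before the other: dead end
    for gi in range(1, len(graphemes) + 1):
        for pi in range(1, len(phonemes) + 1):
            head = (graphemes[:gi], phonemes[:pi])
            for rest in alignments(graphemes[gi:], phonemes[pi:]):
                yield [head] + rest
```

For a two-character grapheme string over a three-character phoneme string this yields three candidates; real paradigms are far larger, which is why the constraint-based pruning discussed later is needed.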
    <Paragraph position="2"> Our particular interest in developing a database of G-P tuples is to apply it in the development of a kanji tester which can dynamically predict plausibly incorrect readings for a given grapheme string. For this purpose, we require as great a coverage of grapheme strings as possible, and the proposed system has thus been designed to exhaustively align the input set of G-P tuples, sacrificing precision for 100% recall.</Paragraph>
    <Paragraph position="3"> 'Grapheme string' in this research refers to the maximal kanji representation of a given word or compound, and 'phoneme string' refers to the kana (hiragana and/or katakana) mora correlate. 1 By 'maximal' segmentation we mean that the grapheme string must be segmented to the degree that each segment corresponds to a self-contained component of the phonemic description of that compound, and that no segment can be further segmented into aligning sub-segments. The statement of 'maximality' of segmentation is qualified by the condition that each segment must constitute a morpho-phonic unit, in that for conjugating parts-of-speech, namely verbs and adjectives, the conjugating suffix must be contained in the same segment as the stem.</Paragraph>
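The maximality condition can be made concrete with a toy check. In the sketch below, a segment pair is flagged as non-maximal if it can be decomposed into smaller aligning sub-segments; the `readings` table licensing sub-alignments is a hypothetical stand-in for the replaceability test used in this paper, and the function names are our own.

```python
def splittable(gseg, pseg, readings):
    """True if (gseg, pseg) decomposes into two or more smaller aligned
    segments, each licensed by the toy readings table (a hypothetical
    stand-in for the replaceability test)."""
    if len(gseg) < 2:
        return False
    for gi in range(1, len(gseg)):
        for pi in range(1, len(pseg)):
            g1, g2 = gseg[:gi], gseg[gi:]
            p1, p2 = pseg[:pi], pseg[pi:]
            if p1 in readings.get(g1, ()) and (
                    p2 in readings.get(g2, ()) or splittable(g2, p2, readings)):
                return True
    return False

def is_maximal(alignment, readings):
    """A candidate alignment is maximal iff no segment pair is splittable."""
    return not any(splittable(g, p, readings) for g, p in alignment)
```

Under such a check, an under-segmented candidate like [("感謝", "kansya")] would be rejected in favour of [("感", "kan"), ("謝", "sya")] whenever both component readings are attested.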
    <Paragraph position="4"> By way of illustration of the alignment process, let us consider the example of the verb ka-n-sya-su-ru [感-謝-su-ru] &amp;quot;to thank/be thankful&amp;quot;,2 a portion of the 35-member alignment paradigm for which is given in Fig. 1. The importance of maximality of alignment is observable by way of align35, which constitutes a legal (under-)alignment of the correct solution in align1. Here, there is scope for further segmentation, as evidenced by the replaceability of 感 by its phoneme content of ka-n in isolation of 謝 (producing the string ka-n-謝-su-ru). Thus, we are able to discount align35 on the grounds of it being non-maximal. That a segment boundary exists between sya and su-ru, on the other hand, is a result of su-ru being a light verb and hence an independent morpheme. 1 Our description of kana as phoneme units represents a slight abuse of terminology, in that individual kana characters are uniquely associated with a broad phonetic transcription potentially extending over multiple phones. Note, however, that in abstracting away to this meta-phonemic representation, we are freed from consideration of low-level phonological concerns such as phoneme connection constraints.</Paragraph>
    <Paragraph position="5"> 2 So as to make this paper as accessible as possible to readers not familiar with Japanese, hiragana and katakana characters have been transliterated into Latin script throughout this paper and are essentially treated as being identical. The graphemic kanji character set, on the other hand, has been provided in its original form to give the reader a feel for the significance of the kana-kanji dichotomy. For both the grapheme and phoneme strings, character boundaries are indicated by &amp;quot;-&amp;quot; and segment boundaries (which double as character boundaries) are indicated by &amp;quot;(r)&amp;quot;.</Paragraph>
    <Paragraph position="6"> The overall alignment procedure first generates the full set of alignment candidates (PSseg)-(GSseg) for each G-P tuple. This alignment paradigm is pruned through application of a series of constraints, and either of the two proposed alignment selection methods is then applied to identify the single most plausible alignment from each alignment paradigm. Both the simple statistical model (&amp;quot;method-1&amp;quot;) and the incremental learning method (&amp;quot;method-2&amp;quot;) rely on a slightly modified form of the TF-IDF model. In the case of method-1, statistical analysis is applied to the full range of alignment paradigms, and all alignment paradigms are disambiguated in parallel. For method-2, we commence identically to method-1, but single out one alignment paradigm to disambiguate at each iteration, and incrementally adjust the statistical model based on both the reduced set of unresolved paradigms and the expanded set of accepted alignments. As such, the principal difference between the two methods is that method-2 incorporates statistical feedback from the accepted alignments into the disambiguation of the remaining paradigms, whereas method-1 does not.</Paragraph>
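The feedback structure distinguishing the two selection regimes can be sketched in a few lines. The toy version below is our own reconstruction under stated assumptions: the modified TF-IDF weighting is replaced by a simple smoothed log-frequency score, and `score`, `method2`, and the data layout are hypothetical names, not the authors' code. It shows only the essential loop: the accepted alignment's segment-reading pairs are fed back into the counts before the next paradigm is disambiguated.

```python
import math
from collections import Counter

def score(alignment, freq, total):
    """Average smoothed log-frequency of the alignment's segment-reading
    pairs (a stand-in for the paper's modified TF-IDF weighting)."""
    s = 0.0
    for pair in alignment:
        s += math.log((freq[pair] + 1) / (total + 1))
    return s / len(alignment)

def method2(paradigms):
    """Incremental selection: seed counts from all candidates, then
    repeatedly accept the most confident paradigm's best alignment and
    feed its pairs back into the statistics (the method-2 feedback)."""
    freq = Counter(pair for cands in paradigms for a in cands for pair in a)
    total = sum(freq.values())
    unresolved, resolved = list(paradigms), []
    while unresolved:
        best = max(unresolved,
                   key=lambda cands: max(score(a, freq, total) for a in cands))
        choice = max(best, key=lambda a: score(a, freq, total))
        resolved.append(choice)
        unresolved.remove(best)
        freq.update(choice)   # feedback: accepted pairs reinforce the model
        total += len(choice)
    return resolved
```

Removing the two lines after `unresolved.remove(best)` and scoring all paradigms against the initial counts would give the method-1 behaviour, in which all paradigms are disambiguated in parallel with no feedback.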
    <Paragraph position="7">  In the remainder of this paper, we first present the methodology used to derive all legal alignments for a given G-P tuple (Section 2), then give full details of both the simple statistical method and incremental learning method (Section 3), before evaluating the various methods against a baseline rule-based method (Section 4). Finally, in Section 5, we consider additional applications of the basic methodology proposed here.</Paragraph>
  </Section>
</Paper>