<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2099"> <Title>Segmenting Sentences into Linky Strings Using D-bigram Statistics</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Characteristics of Japanese Text </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Letters in Japanese </SectionTitle> <Paragraph position="0"> Japanese text is composed of four kinds of characters: kanji, hiragana, katakana, and others such as alphabetic and numeral characters.</Paragraph> <Paragraph position="1"> Hiragana is used for Japanese words, inflections and function words, while katakana is used for words from foreign languages and for other special purposes.</Paragraph> <Paragraph position="2"> Table 1 shows examples of the rates of these four character types in texts (Teller and Batchelder, 1994). The bus. corpus consists of a set of newspaper articles on business ventures from Yomiuri. The ed. corpus contains a series of editorial columns from Asahi Shinbun.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.2 Morphemes in Japanese </SectionTitle> <Paragraph position="0"> Segmenting a Japanese text is a difficult task.
A phrase &quot;勉強していました (was studying)&quot; can be a single lexical unit or can be separated into as many as six elements (Teller and Batchelder, 1994): 'study' 'do' particle progressive polite past. Acquiring &quot;morphemes&quot; from Japanese text is not a simple task because of this flexibility.</Paragraph> </Section> </Section> <Section position="3" start_page="0" end_page="586" type="metho"> <SectionTitle> 2 Linky Strings </SectionTitle> <Paragraph position="0"> This paper is on dividing sentences of non-separated languages into meaningful strings of letters without using any grammar or linguistic knowledge.</Paragraph> <Paragraph position="1"> Instead, this system uses statistical information between letters to select the best ways to segment sentences in non-separated languages.</Paragraph> <Paragraph position="2"> It is not very hard to divide a sentence using a suitable dictionary. The problem is that such a dictionary is not easily obtainable. There is no perfect dictionary which holds all the words that exist in the language. Moreover, building a dictionary is very hard work, since there are no perfect automatic dictionary-making systems.</Paragraph> <Paragraph position="3"> However, machine-readable dictionaries are needed anyway. For this reason, we propose a new method for picking out meaningful strings. Our purpose is not to segment a sentence into conventional morphemes. We introduce a concept for a type of language unit for machine use. We named the unit a 'linky string'. A linky string is a series of letters extracted from a corpus using statistical information only.
It is a series of letters which share a strong statistical relationship.</Paragraph> </Section> <Section position="4" start_page="586" end_page="587" type="metho"> <SectionTitle> 3 LINKING SCORE </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="586" end_page="586" type="sub_section"> <SectionTitle> 3.1 Linking Score </SectionTitle> <Paragraph position="0"> To pick out linky strings, we need to find highly connectable letters in a sentence. We introduce the linking score, which shows the linkability between two neighboring letters in a sentence. This score is estimated using d-bigram statistics.</Paragraph> </Section> <Section position="2" start_page="586" end_page="586" type="sub_section"> <SectionTitle> 3.2 D-bigram </SectionTitle> <Paragraph position="0"> The idea of bigrams and trigrams is often used in studies on NLP. An n-gram is the information of the association between n certain events. In this study we use the d-bigram data (Tsutsumi et al., 1993), which is a kind of bigram data with the concept of distance between events (Figure 1). D-bigram is equal to bigram when d = 1; thus d-bigram data includes the conventional bigram relation.</Paragraph> <Paragraph position="2"/> </Section> <Section position="3" start_page="586" end_page="586" type="sub_section"> <SectionTitle> 3.3 Calculating Mutual Information with Distance </SectionTitle> <Paragraph position="0"> Expression (1) is for calculating mutual information between two events (Nobesawa et al., 1994): MI(ai, bj, d) = log( P(ai, bj, d) / ( P(ai) P(bj) ) ) (1) where ai is a letter, P(ai) is the probability that the letter ai appears, and P(ai, bj, d) is the probability that ai and bj appear together with the distance d in a sentence. The parameter d shows the distance between two events. In Figure 2, the distance between &quot;a&quot; and &quot;pen&quot; is 1, and the distance between &quot;is&quot; and &quot;pen&quot; is 2.
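Expression (1) can be computed directly from letter and d-bigram counts. The following Python sketch is ours, not the paper's (function names and the corpus handling are hypothetical); it estimates the probabilities by simple relative frequency over a training corpus:

```python
from collections import Counter
import math

def train_dbigram(sentences, d_max=5):
    """Count single letters and distance-d letter pairs in a corpus."""
    uni = Counter()
    dbi = Counter()            # keys: (letter_a, letter_b, distance)
    total = 0
    for s in sentences:
        uni.update(s)
        total += len(s)
        for i, a in enumerate(s):
            for d in range(1, d_max + 1):
                if i + d < len(s):
                    dbi[(a, s[i + d], d)] += 1
    return uni, dbi, total

def mi(a, b, d, uni, dbi, total):
    """MI(a, b, d) = log( P(a, b, d) / (P(a) P(b)) ), expression (1)."""
    pair = dbi.get((a, b, d), 0)
    if pair == 0:
        return float("-inf")   # unseen pair: no measurable attraction
    p_ab = pair / total
    p_a = uni[a] / total
    p_b = uni[b] / total
    return math.log(p_ab / (p_a * p_b))
```

A pair that co-occurs more often than chance predicts gets a positive score; a pair seen less often than chance gets a negative one.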
Since the event order has a meaning, in this case the distance between &quot;pen&quot; and &quot;a&quot; is defined as -1.</Paragraph> <Paragraph position="1"> This is a pen .</Paragraph> <Paragraph position="2"> As the value of MI gets bigger, the association between the two events gets stronger.</Paragraph> </Section> <Section position="4" start_page="586" end_page="587" type="sub_section"> <SectionTitle> Linking Score </SectionTitle> <Paragraph position="0"> Expression (2) is for calculating the linking score between two letters in a sentence.</Paragraph> <Paragraph position="2"> wi : the i-th letter in the sentence w; g(d) : a certain weight for MI concerning the distance between letters. The information between two remote words has less meaning in a sentence when it comes to semantic analysis (Church and Hanks, 1989). Following this idea we put g(d) in the expression so that nearer pairs are more effective in calculating the score of the sentence.</Paragraph> <Paragraph position="3"> A pair of far-apart letters does not have a strong relation, either syntactically or semantically. For this reason we use dmax, and in this paper we set the dmax value to 5 and to 1. When dmax is 1, the only MI used in the calculation is bigram data.</Paragraph> </Section> </Section> <Section position="5" start_page="587" end_page="588" type="metho"> <SectionTitle> 4 THE SYSTEM LSS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="587" end_page="587" type="sub_section"> <SectionTitle> 4.1 Overview </SectionTitle> <Paragraph position="0"> This system is called LSS, a &quot;linky string segmentor&quot;.
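The body of expression (2) did not survive extraction, so the sketch below is one plausible reading, not the paper's formula: the score of a boundary sums g(d)-weighted MI over every letter pair that spans it within dmax. The weight g(d) = 1/d is likewise an assumption; the paper only says nearer pairs should count more.

```python
def linking_score(s, i, mi_fn, d_max=5, g=lambda d: 1.0 / d):
    """Linking score for the boundary between s[i] and s[i+1].

    Sums g(d) * MI(a, b, d) over every letter pair (s[j], s[k]) that
    spans the boundary with distance d = k - j <= d_max.
    NOTE: this is an assumed reconstruction of expression (2).
    """
    score = 0.0
    for j in range(max(0, i - d_max + 1), i + 1):           # letters left of the cut
        for k in range(i + 1, min(len(s), j + d_max + 1)):  # letters right of it
            d = k - j
            m = mi_fn(s[j], s[k], d)
            if m != float("-inf"):                          # skip unseen pairs
                score += g(d) * m
    return score
```

With d_max = 1 this collapses to the MI of the single adjacent pair, matching the paper's remark that dmax = 1 uses only bigram data.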
This system takes a corpus made of non-separated sentences as its input and segments it into linky strings using d-bigram statistics.</Paragraph> <Paragraph position="1"> Figure 4 shows the flow of LSS's processing.</Paragraph> <Paragraph position="2"> Input sentences to segment.</Paragraph> <Paragraph position="3"> Calculate the linking score of each pair of neighboring letters.</Paragraph> <Paragraph position="4"> Check the score graph to see where to segment.</Paragraph> <Paragraph position="5"> Pick out each linky string found in the given corpus.</Paragraph> <Paragraph position="6"> In this paper we used a fixed score for the starting score, so that LSS can decide whether the first letter should be a one-letter linky string.</Paragraph> </Section> <Section position="2" start_page="587" end_page="587" type="sub_section"> <SectionTitle> 4.2 The Score Graph What a Score Graph Is </SectionTitle> <Paragraph position="0"> To segment a sentence into statistically-meaningful strings, we use the linking scores to locate boundaries between linky strings. A score graph has the letters in a sentence on the x-axis and linking scores on the y-axis (Figure 5). We get one score graph for each sentence. Figure 5 shows two sentences (one above and one below), each of 14 letters (including an exclamation/question mark as the sentence terminator).</Paragraph> <Paragraph position="1"> When the linking score between a pair of neighboring letters is high, we assume they are part of the same word. When it is low, we assume that the letters, though neighbors, are statistically independent of one another.
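The high/low intuition above can be sketched as a cutting rule: given the per-boundary scores of one sentence, cut where the score graph has a valley. This is our reading of LSS's behavior, with the valley condition (a local minimum that also falls below a constant mountain threshold) assumed rather than quoted from the paper:

```python
def segment(sentence, scores, threshold=5.0):
    """Split a sentence at valley points in its score graph.

    scores[i] is the linking score between sentence[i] and sentence[i+1].
    A boundary is a valley if its score is a local minimum of the graph
    and falls below the mountain threshold (a constant; 5.0 here).
    """
    cuts = []
    for i, sc in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("inf")
        if sc <= left and sc <= right and sc < threshold:
            cuts.append(i + 1)              # cut after sentence[i]
    pieces, prev = [], 0
    for c in cuts:
        pieces.append(sentence[prev:c])
        prev = c
    pieces.append(sentence[prev:])
    return pieces
```

Raising the threshold keeps more valleys above it uncut, so it merges neighboring mountains into longer linky strings, as the mountain-threshold discussion below describes.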
In a score graph, a series of scores in the shape of a mountain (e.g., the A-B and C-F parts in Figure 5) becomes a linky string, and a valley (e.g., between letters B and C in Figure 5) is a spot to segment.</Paragraph> </Section> <Section position="3" start_page="587" end_page="588" type="sub_section"> <SectionTitle> Score-Graph Segmenting Algorithm </SectionTitle> <Paragraph position="0"> The system LSS finds the valley points in a sentence and segments the sentence there into strings.</Paragraph> <Paragraph position="1"> Following is the algorithm to find the segmenting points in a sentence.</Paragraph> <Paragraph position="2"> 1. Do not segment in a mountain.</Paragraph> <Paragraph position="3"> 2. Segment at the valley point.</Paragraph> <Paragraph position="4"> 3. Cut before and after a one-lettered linky string. A one-lettered linky string needs to (a) be placed at a valley point, and (b) look flat3 in the score graph. In Figure 5, the one-lettered linky strings are G, L, N4, O, Y, Z and ?.</Paragraph> <Paragraph position="5"> Mountain Threshold A linky string takes a mountain shape because of high linking scores. Note that a linky string is not equal to a morpheme in human-handmade grammars. When a certain pair of morphemes occurs in a corpus very often, the system recognizes the pair's high linking score and puts them together into one linky string. For example, &quot;ブッシュ大統領 (President Bush)&quot; is often treated as a linky string, since &quot;ブッシュ (Bush)&quot; and &quot;大統領 (president)&quot; appear next to each other very frequently. The mountains of letters are not always simply hat-shaped; most of the time they have other smaller mountains in them. This means that there can be shorter strings in one linky string. In the linky string &quot;ブッシュ大統領 (President Bush)&quot;, there must be two smaller mountains, just like H-I and J-K in the mountain H-K in Figure 5.
To control the size of linky strings we introduce a mountain threshold, which is shown in the sentence below in Figure 5. When the score of a valley point is higher than the mountain threshold, the system judges that the point is not a segmenting spot. In this paper the mountain threshold value is 5.0.</Paragraph> <Paragraph position="6"> 3We use a constant value as a threshold.</Paragraph> <Paragraph position="7"> 4N is a special one-lettered linky string which is placed at the beginning of a sentence.</Paragraph> </Section> <Section position="4" start_page="588" end_page="588" type="sub_section"> <SectionTitle> 4.3 Corpus </SectionTitle> <Paragraph position="0"> LSS accepts any non-separated sentences with little preparation. All we need is a certain amount of target-language corpus for training.</Paragraph> <Paragraph position="1"> In this paper we show experimental results on Japanese. The corpus prepared for this paper is from the Asahi Shinbun newspaper.</Paragraph> </Section> </Section> </Paper>