File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/w99-0701_metho.xml
Size: 11,186 bytes
Last Modified: 2025-10-06 14:15:33
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0701"> <Title>Unsupervised Learning of Word Boundary with Description Length Gain</Title> <Section position="4" start_page="0" end_page="3" type="metho"> <SectionTitle> 2. Description Length Gain </SectionTitle>
<Paragraph position="0"> Kit defines the description length of a corpus X = x_1 x_2 ... x_n, a sequence of linguistic tokens (e.g., characters, words, POS tags), as the Shannon-Fano code length for the corpus [Kit 1998]. Following classic information theory [Shannon 1948, Cover and Thomas 1991], it can be formulated in terms of token counts in the corpus as below for empirical calculation:

    DL(X) = - \sum_{x \in V} c(x) \log \frac{c(x)}{n}    (1)

where V is the set of distinct tokens (i.e., the vocabulary) in X and c(x) is the count of x in X. </Paragraph>
<Paragraph position="1"> Accordingly, the description length gain (DLG) from identifying a (sub)sequence s = s_1 s_2 ... s_k in the corpus X as a segment or chunk, which is expected to have a nice correspondence to a linguistically significant unit (e.g., a lexical item such as a word, or a syntactic phrase), is formulated as

    DLG(s \in X) = DL(X) - DL(X[r \to s] \oplus s)    (2)

where r is an index, X[r \to s] represents the corpus that results from replacing all occurrences of s with r throughout X (in other words, we extract a rule r \to s from X), and \oplus represents the concatenation of two strings (e.g., X[r \to s] and s) with a delimiter inserted in between. It is straightforward that the average DLG for extracting an individual instance of s from X is

    aDLG(s) = \frac{DLG(s \in X)}{c(s)}    (3)

This average DLG is an estimation of the compression effect of extracting an individual instance of s from X. As the extracted s is appended to the modified corpus by a string concatenation, as shown in (2), the original corpus can be easily recovered by a transformation that reverses the extraction, i.e., replacing all r's in X[r \to s] with the string s. </Paragraph>
<Paragraph position="2"> It is worth noting that DL(X[r \to s] \oplus s) can be calculated without carrying out the string substitution operations throughout the original corpus. The calculation can be based on the token count changes involved in the substitution operations that derive the new corpus X[r \to s] \oplus s, as follows:

    DL(X[r \to s] \oplus s) = - \sum_{x \in V \cup \{r\}} c'(x) \log \frac{c'(x)}{n'}    (4)

where c'(x) is the new count of x in the new corpus and n' is the new corpus length. The new counts can be derived as

    c'(x) = c(x) - c(s) c_s(x) + c_s(x)  for x \in V,    c'(r) = c(s)    (5)

where c(x) and c_s(x) are the counts of x in the original corpus X and in the string s, respectively. </Paragraph>
<Paragraph position="3"> A key problem in this straightforward calculation is that we need to derive the count c(s) for all possible strings s in the original corpus X, because during the lexical learning process it is necessary to consider all fragments (i.e., all n-grams) in the corpus in order to select a set of good candidates for lexical items. Kit and Wilks provide an efficient method for deriving n-grams of any length and their counts from large-scale corpora [Kit and Wilks 1998]. It has been adopted as the basis for the operational implementation of the unsupervised lexical acquisition algorithm reported in the next sections. </Paragraph>
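To make the count-based calculation in (1)-(5) concrete, here is a minimal Python sketch. It assumes the corpus is already summarised as a token-count table, treats characters as tokens in the toy usage, and uses the hypothetical symbols "<r>" and "<#>" for the new non-terminal r and the delimiter; it illustrates the DLG arithmetic rather than reproducing the authors' implementation.

```python
from collections import Counter
from math import log2

def description_length(counts):
    """Empirical Shannon-Fano code length of a corpus (eq. 1):
    DL = -sum_x c(x) * log2(c(x) / n), where n is the total token count."""
    n = sum(counts.values())
    return -sum(c * log2(c / n) for c in counts.values())

def dlg(corpus_counts, s, c_s):
    """Description length gain of extracting the string s (eqs. 2, 4, 5),
    computed from count changes alone, without rewriting the corpus.

    corpus_counts : token counts c(x) of the original corpus X
    s             : the candidate segment, as a sequence of tokens
    c_s           : the count of s in X
    """
    s_counts = Counter(s)
    new_counts = Counter()
    for x, c in corpus_counts.items():
        # each extracted occurrence of s removes c_s(x) copies of x;
        # the single appended copy of s adds c_s(x) back (eq. 5)
        c_new = c - c_s * s_counts.get(x, 0) + s_counts.get(x, 0)
        if c_new > 0:
            new_counts[x] = c_new
    new_counts["<r>"] = c_s   # the new index symbol r (hypothetical token name)
    new_counts["<#>"] = 1     # the delimiter before the appended copy of s
    return description_length(corpus_counts) - description_length(new_counts)

def adlg(corpus_counts, s, c_s):
    """Average DLG per extracted occurrence of s (eq. 3)."""
    return dlg(corpus_counts, s, c_s) / c_s

# Toy usage: characters as tokens
corpus = "the cat sat on the mat"
counts = Counter(corpus)
print(adlg(counts, "the", corpus.count("the")))
```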
<Paragraph position="4"> 3. Learning Algorithm </Paragraph>
<Paragraph position="5"> Given an utterance U = t_0 t_1 ... t_n as a string of some linguistic tokens (e.g., characters, words, POS tags), the unsupervised lexical acquisition algorithm seeks an optimal segmentation OS(U) over the string U such that the sum of the compression effect over the segments is maximal. Formally,

    OS(U) = \arg\max_{s_1 ... s_k \; \mathrm{s.t.} \; U = s_1 + ... + s_k} \sum_{i=1}^{k} aDLG(s_i)    (6)

where 0 < k <= n, + represents a string concatenation and aDLG(s_i) is the average DLG for each instance of the string s_i in the original corpus, as defined in (3) above. </Paragraph>
<Paragraph position="6"> Based on this description length gain calculation, a Viterbi algorithm is formulated to search for the optimal segmentation over an utterance U that fulfils (6). It is presented in Figure 1 with an illustration (the figure, which depicts a candidate segment t_{j+1} ... t_k within t_1 t_2 ... t_n, is not reproduced here). The algorithm uses a list of intermediate variables OS[0], OS[1], ..., OS[n], where each OS[i] stores the optimal segmentation over t_0 t_1 ... t_i (for i = 0, 1, 2, ..., n). A segmentation is an ordered set (or list) of adjacent segments. The sign \uplus represents an ordered set union operation. The DLG over a list of segments, e.g., DLG(OS[j]), is defined as the sum of all segments' DLGs in the set:

    DLG(OS[j]) = \sum_{s \in OS[j]} DLG(s)    (7)
</Paragraph>
<Paragraph position="7"> Notice that the algorithm has a bias against the extraction of a single token as a rule, due to the fact that a single-token rule bears a negative DLG. When j = k - 1, OS[j] \uplus {[t_{j+1} ... t_k]} becomes OS[k-1] \uplus {[t_k]}, which is less preferable than OS[k-1] \uplus {t_k}. The difference between the denotations [t_k] and t_k is that the former indicates that the string t_k is extracted from the corpus as the right-hand side of a rule (a deterministic CFG rule), which results in a negative DLG, whereas the latter treats t_k as an individual token instead of a segment, which has a zero DLG. </Paragraph>
<Paragraph position="8"> It is worth noting that the breaking condition c([t_j ... t_k]) < 2 in the inner loop of the algorithm is an empirical condition. Its main purpose is to speed up the algorithm by avoiding fruitless iterations on strings of count 1. According to our observation in experiments, learning without this breaking condition leads to exactly the same result on large-scale corpora, but the speed is many times slower. Strings with a count c = 1 can be skipped in the learning, because they are all long strings with a negative DLG (see the note below) and none of them can become a good segment that contributes a positive compression effect to the entire segmentation of the utterance. Rather, they can be broken into shorter segments with a positive DLG. </Paragraph>
<Paragraph position="9"> (Note: extracting a string [t_i ... t_k] of count 1 as a rule does not change any token's count in the new corpus X[r \to t_i ... t_k] \oplus t_i ... t_k, except for the new non-terminal r and the delimiter, whose counts become 1 (i.e., c(r) = c([t_i ... t_k]) = 1, and the delimiter's count is 1) after the extraction. Thus the description length can only increase, and the DLG of such a string is negative.) </Paragraph>
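As a concrete companion to (6)-(7), the following Python sketch runs the same kind of dynamic-programming (Viterbi-style) search over an utterance. The scoring callback adlg_of is assumed to return the average DLG of a candidate segment (e.g., computed from corpus n-gram counts as in Section 2); single tokens are left unscored, mirroring the bias discussed above, and a comment marks where the c(.) < 2 breaking condition would apply. This is an illustrative reconstruction, not the algorithm of Figure 1 verbatim.

```python
def optimal_segmentation(tokens, adlg_of):
    """Dynamic-programming (Viterbi-style) search for the segmentation of an
    utterance that maximises the summed average DLG of its segments (eq. 6).

    tokens  : the utterance t_0 ... t_n as a list of tokens
    adlg_of : callback mapping a candidate segment (tuple of tokens) to its
              average DLG in the corpus
    Returns (best_score, segments).
    """
    n = len(tokens)
    best = [0.0] * (n + 1)   # best[i]: maximal summed aDLG over tokens[:i]
    back = [0] * (n + 1)     # back[i]: start position of the last segment
    for k in range(1, n + 1):
        # default: leave tokens[k-1] as a bare token, which contributes zero DLG
        # (extracting a single token as a rule would give a negative DLG)
        best[k], back[k] = best[k - 1], k - 1
        for j in range(k - 2, -1, -1):   # candidate segments of length >= 2
            seg = tuple(tokens[j:k])
            # the paper additionally breaks out of this loop as soon as the
            # corpus count of seg falls below 2 (the c(.) < 2 condition)
            score = best[j] + adlg_of(seg)
            if score > best[k]:
                best[k], back[k] = score, j
    # follow the back-pointers to recover the segment list
    segments, k = [], n
    while k > 0:
        j = back[k]
        segments.append(list(tokens[j:k]))
        k = j
    segments.reverse()
    return best[n], segments
```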
<Paragraph position="10"> Time complexity analysis also shows that this breaking condition can speed up the algorithm significantly. Without this condition, the time complexity of the algorithm is O(n^2). With it, the complexity is bounded by O(mn), where m is the maximal common prefix length of sub-strings (i.e., n-grams) in the corpus. Accordingly, the average time complexity of the algorithm is O(an), where a is the average common prefix length in the corpus, which is much smaller than m. </Paragraph> </Section>
<Section position="5" start_page="3" end_page="4" type="metho"> <SectionTitle> 4. Experiments </SectionTitle>
<Paragraph position="0"> We have conducted a series of lexical acquisition experiments with the above algorithm on large-scale English corpora, e.g., the Brown corpus [Francis and Kucera 1982] and the PTB WSJ corpus [Marcus et al. 1993]. Below is the segmentation result on the first few sentences in the Brown corpus:

    [the] [_fulton_county] [_grand_jury] [_said_] [friday_] [an] [_investigation_of] [_atlanta] [_'s_] [recent] [_primary_] [election] [_produced] [_''_no] [_evidence] ...

where uppercase letters are converted to lowercase ones, the spaces are visualised by an underscore, and the full stops are all replaced by @'s. </Paragraph>
<Paragraph position="1"> Although a space is not distinguished from any other character for the learner, we have to rely on the spaces to judge the correctness of a word boundary prediction: a predicted word boundary immediately before or after a space is judged as correct. But we have also observed that this criterion overlooks many meaningful predictions like "...charge] [d_by...", "...are_outmode] [d_..." and "...government] [s...". If this is taken into account, the learning performance is evidently better than the precision and recall figures reported in Table 1 below. </Paragraph>
<Paragraph position="2"> Interestingly, it is observed that n-gram counts derived from a larger volume of data can significantly improve the precision but decrease the recall of the word boundary prediction. The correlation between the volume of data used for deriving n-gram counts and the change of precision and recall is shown in Table 1. The effectiveness of the unsupervised learning is evidenced by the fact that its precision and recall are, respectively, about three times as high as the precision and recall of random guessing. The best learning performance, in terms of both precision and recall, in the experiments is the one with 79.33% precision and 63.01% recall, obtained from the experiment on the entire Brown corpus. </Paragraph>
<Paragraph position="3"> It is straightforwardly understandable that the increase of data volume leads to a significant increase of precision in the learning, because prediction based on more data is more reliable. The reason for the drop of recall is that when the volume of data increases, more multi-word strings have a higher compression effect (than individual words) and, consequently, they are learned by the learner as lexical items, e.g., [fulton_county], [grand_jury] and [_took_place]. If the credit in such multi-word lexical items is counted, the recall must be much better than the one in Table 1. Of course, this also reflects a limitation of the learning algorithm: it only conducts an optimal segmentation instead of a hierarchical chunking on an utterance. </Paragraph>
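The word-boundary scoring criterion described above (a predicted boundary counts as correct if it falls immediately before or after a space) can be sketched as follows. This is a hypothetical re-implementation of the criterion for illustration, not the authors' scoring script; the segment and text representations are assumptions.

```python
def boundary_offsets(segments):
    """Character offsets at which the learner places an utterance-internal boundary."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets

def boundary_precision_recall(text, segments):
    """Score predicted word boundaries against the spaces in the raw text:
    a predicted boundary is correct iff it lies immediately before or after
    a space, and a true boundary (a space) is recalled iff some prediction
    is adjacent to it."""
    predicted = boundary_offsets(segments)
    spaces = [i for i, ch in enumerate(text) if ch == " "]
    correct = sum(
        1 for p in predicted
        if (p < len(text) and text[p] == " ") or (p > 0 and text[p - 1] == " ")
    )
    found = sum(1 for i in spaces if i in predicted or i + 1 in predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = found / len(spaces) if spaces else 0.0
    return precision, recall

# Toy usage (spaces shown literally rather than as underscores)
text = "the fulton county grand jury"
segments = ["the", " fulton county", " grand jury"]
print(boundary_precision_recall(text, segments))
```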
<Paragraph position="4"> The precision and recall reported above are not a big surprise. To our knowledge, however, it is the first time that the performance of unsupervised learning of word boundaries has been examined with the criteria of both precision and recall. Unfortunately, this performance cannot be compared with any previous studies, for several reasons. One is that the learning results of previous studies are not presented in a comparable manner, for example, [Wolff 1975, Wolff 1977] and [Nevill-Manning 1996], as noted by [de Marcken 1996] as well. Another is that the learning outcomes are different. For example, the output of lexical learning from an utterance (as a character sequence) in [Nevill-Manning 1996] and [de Marcken 1995, de Marcken 1996] is a hierarchical chunking of the utterance. The chance of hitting the correct words in such a chunking is obviously many times higher than in a flat segmentation. The hierarchical chunking leads to a recall above 90% in de Marcken's work. Interestingly, however, de Marcken does not report the precision, which seems too low, and therefore meaningless, to report, because the learner produces so many chunks. </Paragraph> </Section> </Paper>