<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2056"> <Title>Unsupervised Segmentation of Chinese Text by Use of Branching Entropy</Title> <Section position="4" start_page="428" end_page="429" type="metho"> <SectionTitle> 2 The Assumption </SectionTitle> <Paragraph position="0"> Given a set of elements and a set of n-gram sequences n formed of , the conditional entropy of an element occurring after an n-gram</Paragraph> <Paragraph position="2"> ability of occurrence of x.</Paragraph> <Paragraph position="3"> A well-known observation on language data states that H(XjX n ) decreases as n increases (Bell et al., 1990). For example, Figure 2 shows how H(XjX n ) shifts when n increases from 1 to 8 characters, where n is the length of a word prex. This is calculated for all words existing in the test corpus, with the entropy being measuredin the learning data(thelearning and test data are dened in x4). This phenomenon indicates that X will become easier to estimate as the context of X n gets longer. This can be intuitively understood: it is easy to guess that \e&quot; will follow after \Hello! How ar&quot;, but it is dicult to guess what comes after the short string \He&quot;. The last term BnZrlog P(xjx n ) in the above formula indicates the information of a token of x coming after x n , and thus the branching after</Paragraph> <Paragraph position="5"> . The latter half of the formula, the local entropy value for a given x</Paragraph> <Paragraph position="7"> (2) indicates the average information of branching for a specic n-gram sequence x n . As our interest in this paper is this local entropy, we</Paragraph> <Paragraph position="9"> ) in the rest of this paper. The decrease in H(XjX n ) globally indicates that given an n-length sequence x</Paragraph> <Paragraph position="11"> One reason why inequality (3) holds for language data is that there is context in language,</Paragraph> <Paragraph position="13"> carries a longer context as compared with x n . Therefore, if we suppose that x</Paragraph> <Paragraph position="15"> holds, because the longer the preceding ngram, the longer the same context. For example, it is easier to guess what comes after x =\natura&quot; than what comes after x = \natur&quot;. Therefore, the decrease in H(XjX n ) can be expressed asthe concept thatif thecontext is longer, the uncertainty of the branching decreases on average. Then, taking the logical contraposition, if the uncertainty does not decrease, the context is not longer, which can be interpreted as the following: If the entropy of successive tokens increases, the location is at a context border. (B) For example, in the case of x = \natural&quot;, the entropy h(\natural&quot;)should be larger than h(\natura&quot;),because it is uncertain what character will allow x to succeed. In the next section, we utilize assumption (B) to detect context boundaries.</Paragraph> <Paragraph position="16"> Figure 3: Our model for boundary detection based on the entropy of branching</Paragraph> </Section> <Section position="5" start_page="429" end_page="430" type="metho"> <SectionTitle> 3 Boundary Detection Using the </SectionTitle> <Paragraph position="0"> Assumption (B) gives a hint on how to utilize the branching entropy as an indicator of the context boundary. When two semantic units, both longer than 1, are put together, the entropy would appear as in the rst gure of Figure 3. The rst semantic unit is from osets 0 to 4, and the second is from 4 to 8, with each unit formed by elements of . 
<Section position="5" start_page="429" end_page="430" type="metho"> <SectionTitle> 3 Boundary Detection Using the Entropy of Branching </SectionTitle> <Paragraph position="0"> Assumption (B) gives a hint on how to utilize the branching entropy as an indicator of context boundaries. When two semantic units, both longer than 1, are put together, the entropy would appear as in the first graph of Figure 3. The first semantic unit is from offsets 0 to 4, and the second is from 4 to 8, with each unit formed by elements of \chi. In the figure, one possible transition of the branching degree is shown, where the plot at k on the horizontal axis denotes the entropy h(x_{0,k}), with x_{n,m} denoting the substring between offsets n and m.</Paragraph> <Paragraph position="1"> Ideally, the entropy would take a maximum at 4, because it will decrease as k is increased in the ranges k < 4 and 4 < k < 8 and rise at k = 4. One way to detect a context boundary is therefore to monitor h(x_{0,k}) over k and look for its local maxima. The boundary condition after such observation can be redefined as the following:

B_max: Boundaries are locations where the entropy is locally maximized.

A similar method is proposed by Harris (Harris, 1955), where morpheme borders can be detected by using the local maximum of the number of different tokens coming after a prefix.</Paragraph> <Paragraph position="2"> This only holds, however, for semantic units longer than 1. Units often have a length of 1, especially in our case with Chinese characters as elements, so that there are many one-character words. If a unit has length 1, the situation will look like the second graph in Figure 3, where three semantic units x_{0,4}, x_{4,5}, and x_{5,8} are present, with the middle unit having length 1. First, at k = 4, the value of h increases. At k = 5, the value may increase or decrease, because a longer context results in a decrease in uncertainty, though a decrease in uncertainty does not necessarily mean a longer context. When h increases at k = 5, the situation looks like the second graph, and the condition B_max fails to detect the boundary at k = 4. We therefore also use the condition suggested directly by assumption (B):

B_increase: Boundaries are locations where the entropy increases.

The same monitoring applies not only to prefixes x_{0,k} but also to substrings x_{i,k}, where 0 < i < k. According to inequality (3), then, a similar trend should be present for plots of h(x_{i,k}), assuming that h(x_{i,k}) > h(x_{i,k+1}) holds inside a unit, in analogy with inequality (4). Therefore, when the target language consists of many one-element units, B_increase is crucial for collecting all boundaries. Note that the boundaries detected by B_max are included in those detected by B_increase, and also that B_increase is a boundary condition representing assumption (B) more directly.</Paragraph> <Paragraph position="3"> So far, we have considered only regular-order processing: the branching degree is calculated for the elements succeeding x_n. We can also consider the reverse order, which involves calculating h for the element preceding x_n. In that case, the question is whether the head of x_n forms the beginning of a context boundary.</Paragraph> <Paragraph position="4"> Next, we explain how we actually applied the above formalization to the problem of Chinese segmentation.</Paragraph> </Section>
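To make the two conditions concrete, the following sketch flags boundary offsets along one plot of h such as those in Figure 3. This is our own illustrative Python, not the authors' implementation; the list h_values, the threshold delta, and the offset convention are assumptions.

```python
def boundaries_from_plot(h_values, delta=0.0):
    """h_values[k] is assumed to hold h(x_{i, i+k+1}), the branching entropy
    after the first k+1 characters of the substring being scanned.

    Returns two sets of boundary offsets (relative to the start of the
    substring): those satisfying B_max (local maxima of h) and those
    satisfying B_increase (h rising by more than `delta`).
    """
    b_max, b_increase = set(), set()
    for k in range(1, len(h_values)):
        if h_values[k] - h_values[k - 1] > delta:
            b_increase.add(k + 1)            # boundary after k+1 characters
            last = (k + 1 == len(h_values))
            if last or h_values[k] > h_values[k + 1]:
                b_max.add(k + 1)             # the rise is also a local maximum
    return b_max, b_increase

# Example shaped like the first graph of Figure 3: h falls inside a unit
# and rises at the unit border (offset 4).
print(boundaries_from_plot([2.5, 1.8, 1.2, 3.0, 2.2, 1.5, 1.0, 0.8]))
```

By construction, every B_max point is also a B_increase point, which matches the inclusion noted in the text above.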
<Section position="6" start_page="430" end_page="430" type="metho"> <SectionTitle> 4 Data </SectionTitle> <Paragraph position="0"> The whole training data amounted to 200 MB of text from the Contemporary Chinese Corpus of the Center for Chinese Linguistics at Peking University (Center for Chinese Linguistics, 2006). It consists of several years of the People's Daily newspaper, contemporary Chinese literature, and some popular Chinese magazines. Note that as our method is unsupervised, this learning corpus is plain text without any segmentation.</Paragraph> <Paragraph position="1"> The test data were constructed by selecting sentences from the manually segmented People's Daily corpus of Peking University. In total, the test data amount to 1,001 KB, consisting of 147,026 Chinese words. The word boundaries indicated in the corpus were used as our gold standard.</Paragraph> <Paragraph position="2"> As punctuation marks clear boundaries in Chinese text, we pre-processed the test data by splitting sentences at punctuation to form text fragments. Then, from all fragments, n-grams of no more than 6 characters were obtained. The branching entropies for all these n-grams occurring in the test data were estimated from the 200 MB of training data.</Paragraph> <Paragraph position="3"> We used 6 as the maximum n-gram length because Chinese words longer than 5 characters are rare, so scanning n-grams up to a length of 6 was sufficient. Another reason is that we also conducted the experiment with up to 8-grams, but the performance did not improve over using 6-grams.</Paragraph> <Paragraph position="4"> Using this list of n-grams, from unigrams to 6-grams, and their branching entropies, the test data were processed to obtain the word boundaries.</Paragraph> </Section> <Section position="7" start_page="430" end_page="431" type="metho"> <SectionTitle> 5 Analysis for Small Examples </SectionTitle> <Paragraph position="0"> Figure 4 shows an actual graph of the entropy shift for the input phrase wei lai fa zhan de mu biao he zhi dao fang zhen ("the aim and guideline of future development"). The upper figure shows the entropy shift for the forward case, and the lower figure shows it for the backward case. Note that for the backward case, the branching entropy was calculated for the characters preceding x_n.</Paragraph> <Paragraph position="1"> Each figure shows two of the possible lines. In the upper figure, each line plots h for successively longer substrings starting from a fixed character, and its increasing points indicate segmentation locations within the phrase. The lower figure is read in the same way, except that each line runs from back to front, plotting h for successively longer substrings ending with a fixed suffix; its increasing points (seen from back to front) likewise indicate segmentation locations.</Paragraph> <Paragraph position="2"> If we consider all the increasing points in these four lines and take their set union, we obtain the correct segmentation wei lai | fa zhan | de | mu biao | he | zhi dao | fang zhen, which is 100% correct in terms of both recall and precision.</Paragraph> <Paragraph position="3"> In fact, as there are 12 characters in this input, there should be 12 lines, one starting from each character, covering all substrings. For readability, however, we only show two lines each for the forward and backward cases. Also, the maximum length of a line is 6, because we only took 6-grams from the learning data. If we consider all the increasing points in all 12 lines and take the set union, we again obtain 100% precision and recall; remarkably, all 12 lines indicate only correct word boundaries.</Paragraph> <Paragraph position="4"> Also, note how the correct full segmentation is obtained with only partial information from 4 of the 12 lines. Based on this observation, we next explain the algorithm that we used for a larger-scale experiment.</Paragraph> </Section>
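The union-of-increasing-points procedure illustrated above, and formalized in the next section, can be sketched end to end as follows. This is our own illustrative Python rather than the authors' implementation; the function names, the in-memory dictionaries, and the default threshold delta = 0.0 are assumptions.

```python
from collections import Counter, defaultdict
from math import log2

def entropy_table(corpus, max_n=6):
    """h(x_n), as in Eq. (2), for every n-gram of length <= max_n in `corpus`."""
    followers = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(corpus) - n):
            followers[corpus[i:i + n]][corpus[i + n]] += 1
    table = {}
    for gram, counts in followers.items():
        total = sum(counts.values())
        table[gram] = -sum(c / total * log2(c / total) for c in counts.values())
    return table

def increase_points(fragment, table, max_n=6, delta=0.0):
    """Offsets of `fragment` where h rises along some scan line (B_increase)."""
    cuts = set()
    for i in range(len(fragment)):
        for j in range(i + 2, min(i + max_n, len(fragment)) + 1):
            shorter, longer = fragment[i:j - 1], fragment[i:j]
            if shorter in table and longer in table \
                    and table[longer] - table[shorter] > delta:
                cuts.add(j)                      # boundary right after `longer`
    return cuts

def segment(fragment, fwd, bwd, max_n=6, delta=0.0):
    """Union of the forward and backward B_increase points, as in Section 5."""
    cuts = increase_points(fragment, fwd, max_n, delta)
    cuts |= {len(fragment) - j
             for j in increase_points(fragment[::-1], bwd, max_n, delta)}
    cuts = sorted(c for c in cuts if 0 < c < len(fragment))
    return [fragment[a:b] for a, b in zip([0] + cuts, cuts + [len(fragment)])]

# Usage sketch: `train` is unsegmented training text; fragments are obtained
# by splitting the test text at punctuation (Section 4).
# fwd, bwd = entropy_table(train), entropy_table(train[::-1])
# print("|".join(segment(fragment, fwd, bwd)))
```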
<Section position="8" start_page="431" end_page="432" type="metho"> <SectionTitle> 6 Algorithm for Segmentation </SectionTitle> <Paragraph position="0"> Having determined the entropy for all n-grams in the learning data, we can scan through each chunk of test data in both the forward order and the backward order to determine the locations of segmentation.</Paragraph> <Paragraph position="1"> As our intention in this paper is above all to study the innate linguistic structure described by assumption (B), we do not want to add any artifacts beyond this assumption. For such exact verification, we have to scan through all possible substrings of an input, which amounts to O(n^2) computational complexity, where n is the input length in characters.</Paragraph> <Paragraph position="2"> Usually, however, h(x_{m,n}) becomes impossible to measure when n - m becomes large. Also, as noted in the previous section, words longer than 6 characters are very rare in Chinese text. Therefore, given a string x, all n-grams of length no more than 6 are scanned, and the points where the boundary condition holds are output as boundaries.</Paragraph> <Paragraph position="3"> As for the boundary conditions, we use B_max and B_increase from §3, together with a simpler condition in which a boundary is detected when the branching entropy h(x_n) is simply above a given threshold. Precisely, the three boundary conditions are

B_max: h(x_n) is a local maximum and h(x_n) > valmax,
B_increase: h(x_{n+1}) - h(x_n) > valdelta,
B_ordinary: h(x_n) > val,

where valmax, valdelta, and val are arbitrary thresholds.</Paragraph> </Section> <Section position="9" start_page="432" end_page="433" type="metho"> <SectionTitle> 7 Large-Scale Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="432" end_page="432" type="sub_section"> <SectionTitle> 7.1 Definition of Precision and Recall </SectionTitle> <Paragraph position="0"> Usually, when precision and recall are addressed in the Chinese word segmentation domain, they are calculated based on the number of words. For example, consider a correctly segmented sequence "aaa|bbb|ccc|ddd", with a, b, c, d being characters and "|" indicating a word boundary. Suppose that the machine's result is "aaabbb|ccc|ddd"; then the correct words are only "ccc" and "ddd", giving a count of 2. Therefore, the precision is 2 divided by the number of words in the result (i.e., 3, for the words "aaabbb", "ccc", "ddd"), giving 67%, and the recall is 2 divided by the total number of words in the gold standard (i.e., 4, for the words "aaa", "bbb", "ccc", "ddd"), giving 50%. We call these values the word precision and recall, respectively, throughout this paper.</Paragraph> <Paragraph position="1"> In our case, we use slightly different measures, the boundary precision and recall, which are based on the number of correctly detected boundaries. These scores are also used in previous work on unsupervised segmentation (Ando and Lee, 2000; Sun et al.):

precision = N_correct / N_test,   recall = N_correct / N_true,

where N_correct is the number of correct boundaries in the result, N_test is the number of boundaries in the test result, and N_true is the number of boundaries in the gold standard.</Paragraph> <Paragraph position="2"> For example, in the case of the machine result being "aaabbb|ccc|ddd", the precision is 100% and the recall is 75%. Thus, the output "aaabbb|ccc|ddd" is considered to contain no imprecise boundaries.</Paragraph> <Paragraph position="3"> The crucial reason for using the boundary precision and recall is that boundary detection and word extraction are not exactly the same task. In this sense, assumption (A) or (B) is a general assumption about a boundary (of a sentence, phrase, word, or morpheme), and the boundary precision and recall measure boundaries directly.</Paragraph> <Paragraph position="4"> Note that all precision and recall scores from now on in this paper are boundary precision and recall. Even when comparing supervised methods with our unsupervised method later, the precision and recall values are all re-calculated as boundary precision and recall.</Paragraph> </Section>
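A small sketch of this boundary-based scoring follows (our own code, not the paper's script). Here gold and predicted are word lists for the same fragment, and the decision to count each word's right-hand boundary, including the fragment-final one, is an assumption chosen so that the worked example above reproduces the reported 100% precision and 75% recall.

```python
def boundary_prf(gold, predicted):
    """Boundary precision and recall in the spirit of Section 7.1.

    `gold` and `predicted` are non-empty segmentations of the same fragment,
    given as lists of words; each word contributes the offset of its right
    boundary (the fragment-final boundary is counted as well).
    """
    def boundaries(words):
        offsets, pos = set(), 0
        for w in words:
            pos += len(w)
            offsets.add(pos)
        return offsets

    n_true, n_test = boundaries(gold), boundaries(predicted)
    n_correct = len(n_true & n_test)
    return n_correct / len(n_test), n_correct / len(n_true)

# Gold "aaa|bbb|ccc|ddd" vs. output "aaabbb|ccc|ddd":
print(boundary_prf(["aaa", "bbb", "ccc", "ddd"], ["aaabbb", "ccc", "ddd"]))
# -> (1.0, 0.75)
```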
<Section position="2" start_page="432" end_page="433" type="sub_section"> <SectionTitle> 7.2 Precision and Recall </SectionTitle> <Paragraph position="0"> The precision and recall graph is shown in Figure 5. The horizontal axis is the precision and the vertical axis is the recall. The three lines, from right to left (top to bottom), correspond to the three boundary conditions, with each threshold varied at an interval of 0.1. For every condition, the larger the threshold, the higher the precision and the lower the recall.</Paragraph> <Paragraph position="1"> We can see that B_increase and B_max keep a higher precision than B_ordinary. We can also see that boundaries are detected more reliably when they are judged from the change between neighbouring values of h(x_n), as in B_increase and B_max, than from its absolute value alone.</Paragraph> <Paragraph position="2"> For B_increase, in particular, when valdelta = 0.0, the precision and recall are still at 0.88 and 0.79, respectively. Upon increasing the threshold to valdelta = 2.4, the precision rises above 0.96 at the cost of a low recall of 0.29. For B_max, we observe a similar tendency but with lower recall, due to the smaller number of local maximum points as compared with the number of increasing points. Thus, B_increase attains the best performance among the three conditions, which supports assumption (B). From now on, we consider only B_increase in our other experiments.</Paragraph> <Paragraph position="3"> Next, we investigated how the training data size affects the precision and recall. This time, the horizontal axis is the amount of learning data, varying from 10 KB up to 200 MB on a log scale, and the vertical axis shows the precision and recall. The boundary condition is B_increase with valdelta = 0.1.</Paragraph> <Paragraph position="4"> We can see that the precision always remains high, whereas the recall depends on the amount of data. The precision is stable at a remarkably high value even when the branching entropy is obtained from a corpus as small as 10 KB. Also, the linear increase in the recall suggests that with more than 200 MB of data we could expect an even higher recall. As the horizontal axis is on a log scale, however, gigabytes of data would be needed to gain the last several percent of recall.</Paragraph> </Section> <Section position="3" start_page="433" end_page="433" type="sub_section"> <SectionTitle> 7.3 Error Analysis </SectionTitle> <Paragraph position="0"> According to our manual error analysis, the three most frequent error types were the following: numbers, such as dates, years, and quantities (for example, 1998 written in Chinese numerals); one-character words; and words segmented into smaller units (for example, a word glossed "open mind" segmented into "open" and "mind").</Paragraph> <Paragraph position="1"> The poor results with numbers probably arise because the branching entropy for digits is less biased than that for ordinary ideograms. Also, for one-character words, our method is limited, as we explained in §3. Both of these problems, however, can be addressed by special preprocessing for numbers and one-character words, given that many one-character words are functional characters, which are limited in number.
Such improvements remain for our future work.</Paragraph> <Paragraph position="2"> The third error type, in fact, is one that could be judged as correct segmentation. In the case of "open mind", the word was not segmented into two words in the gold standard; therefore, our result was judged as incorrect. It could, however, arguably be judged as correct.</Paragraph> <Paragraph position="3"> The structures of Chinese words and phrases are very similar, and there are no clear criteria for distinguishing a word from a phrase. The unsupervised method determines the structure and segments words and phrases into smaller pieces. Manually recalculating the accuracy to account for such cases also remains for our future work.</Paragraph> </Section> </Section> </Paper>