<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1705"> <Title>A Bottom-up Merging Algorithm for Chinese Unknown Word Extraction</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 System Overview </SectionTitle> <Paragraph position="0"> The purpose to our unknown word extraction system is to online extract all types of unknown words from a Chinese text. Figure 1 illustrates the block diagram of the system proposed in this paper.</Paragraph> <Paragraph position="1"> Initially, the input sentence is segmented by a conventional word segmentation program. As a result, each unknown word in the sentence will be segmented into several adjacent tokens (known words or monosyllabic morphemes). At unknown word detection stage, every monosyllable is decided whether it is a word or an unknown word morpheme by a set of syntactic discriminators, which are learned from a corpus. Afterward, a bottom-up merging process applies the general rules to extract unknown word candidates. Finally, the input text is re-segmented by consulting the</Paragraph> <Paragraph position="3"> if can increase gross profit rate &quot;if gross profit rate can be increased...&quot; (2) after first step word segmentation:</Paragraph> <Paragraph position="5"> For example, the correct segmentation of (1) is shown, but the unknown word &quot;a5a9a6a10a7 &quot; is segmented into three monosyllabic words after the first step of word segmentation process as shown in (2). The unknown word detection process will mark the sentence as &quot;a0 () a1 () a2a8a3 () a5 (?) a6 (?) a7 (?)&quot;, where (?) denotes the detected monosyllabic unknown word morpheme and () denotes common words. During extracting process, the rule matching process focuses on the morphemes marked with (?) only and tries to combine them with left/right neighbors according to the rules for unknown words. After that, the unknown word &quot;a5a4a6a8a7 &quot; is extracted. During the process, we do not need to take care of other superfluous combinations such as &quot;a0 a1 &quot; even though they might have strong statistical association or co-occurrence too.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Analysis of Unknown Word Detection </SectionTitle> <Paragraph position="0"> The unknown word detection method proposed by (Chen & Bai 1998) is applied in our system. It adopts a corpus-based learning algorithm to derive a set of syntactic discriminators, which are used to distinguish whether a monosyllable is a word or an unknown word morpheme after an initial segmentation process. If all occurrences of monosyllabic words are considered as morphemes of unknown words, the recall of the detection will be about 99%, but the precision is as low as 13.4%.</Paragraph> <Paragraph position="1"> The basic idea in (Chen & Bai 1998) is that the complementary problem of unknown word detection is the problem of monosyllabic knownword detection, i.e. to remove the monosyllabic known-words as the candidates of unknown word morphemes. Chen and Bai (1998) adopt ten types of context rule patterns, as shown in table 1, to generate rule instances from a training corpus. The generated rule instances were checked for applicability and accuracy. Each rule contains a key token within curly brackets and its contextual tokens without brackets. For some rules there may be no contextual dependencies. 
<Paragraph position="2"> The function of each rule is that, in a sentence, if a character and its context match the key token and the contextual tokens of the rule respectively, the character is a common word (i.e., not a morpheme of an unknown word). For instance, the rule &quot;{Dfa} Vh&quot; says that a character with syntactic category Dfa is a common word if it follows a word of syntactic category Vh.</Paragraph>
<Paragraph position="3">
Rule type                   Example
category category {char}    Na Dfa {a9 }
{char} category category    {a10 } Vh T
Table 1. Rule types and examples (two of the ten types)</Paragraph>
<Paragraph position="4"> The final rule set contains 45,839 rules, which were used to detect unknown words in the experiment. It achieves a detection rate of 96% and a precision of 60%. A detection rate of 96% means that for 96% of the unknown words in the testing data, at least one morpheme is detected as part of an unknown word; a precision of 60% means that 60% of the detected monosyllables in the testing data are actually unknown word morphemes. Although the precision is not high, most over-detection errors are &quot;isolated&quot;, meaning that it is rare for two adjacent detected monosyllabic unknown word morphemes to both be wrong at the same time. These characteristics are very important later in the design of the general rules for unknown words.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Rules for Unknown Words </SectionTitle>
<Paragraph position="0"> Although morphological rules work well for regular unknown word extraction, it is difficult to induce morphological rules for irregular unknown words.</Paragraph>
<Paragraph position="1"> In this section, we represent a common structure for unknown words from another point of view: an unknown word is regarded as a combination of morphemes that are consecutive morphemes/words in context after segmentation, most of which are monosyllables. We adopt context-free grammar (Chomsky 1956), the most commonly used generative grammar for modelling constituent structures, to express our unknown word structure.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Rule Derivation </SectionTitle>
<Paragraph position="0"> According to the discussion in section 3, for 96% of unknown words at least one morpheme is detected as part of an unknown word, which motivates us to represent the unknown word structure with at least one detected morpheme.</Paragraph>
<Paragraph position="1"> Taking this phenomenon into consideration, the rules for modeling unknown words and an unknown word example are presented as follows.</Paragraph>
<Paragraph position="3"> Notes: There is one non-terminal symbol, &quot;UW&quot;, which denotes &quot;unknown word&quot; and is also the start symbol. There are three terminal symbols: ms(?), which denotes a detected monosyllabic unknown word morpheme; ms(), which denotes a monosyllable that is not detected as a morpheme; and ps(), which denotes a polysyllabic token. For example, the name &quot;a0a2a1a4a3 &quot; (Chen Zhi Ming) is segmented initially and detected as &quot;a0 (?) a1 (?) a3 ()&quot;, where &quot;a3 &quot; was marked incorrectly at the detection stage.</Paragraph>
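<Paragraph> Since Table 2 itself is not preserved in this copy, the sketch below encodes only the general shape described in the notes: binary rules over the terminals ms(?), ms(), and ps(), with UW as the sole non-terminal. The merge condition shown (at least one side is a detected morpheme or an already-built UW) is an assumption for illustration, not the paper's exact twelve rules:
# Sketch of the grammar's shape as data. Terminal tags: "ms?" is a detected
# monosyllabic morpheme, "ms" an undetected monosyllable, "ps" a polysyllabic
# token; "UW" is the single non-terminal and start symbol.

def can_merge(left, right):
    """License UW -> left right when one side is detected or already a UW."""
    return "ms?" in (left, right) or "UW" in (left, right)

# "Chen Zhi Ming", detected as ms(?) ms(?) ms():
assert can_merge("ms?", "ms?")  # the two detected syllables merge into a UW
assert can_merge("UW", "ms")    # the UW then absorbs the undetected syllable
</Paragraph>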
<Paragraph position="5"> There are three kinds of commonly used measures for evaluating grammars: 1. generality (recall), the range of sentences the grammar analyzes correctly; 2. selectivity (precision), the range of non-sentences it identifies as problematic; and 3. understandability, the simplicity of the grammar itself (Allen 1995). For generality, 96% of unknown words have this kind of structure, so the grammar has high generality in generating unknown words.</Paragraph>
<Paragraph position="7"> For selectivity, however, our rules over-generate: many patterns accepted by the rules are not words. The main reason is that the rules have to include non-detected morphemes to maintain high generality, so selectivity is sacrificed momentarily. In the next section, the rules are constrained by linguistic and text-based statistical constraints to compensate for the selectivity of the grammar. For understandability, each rule in (1)-(12) consists of just two right-hand-side symbols. The reason for this kind of presentation is that it regards the unknown word structure as a series of combinations of two consecutive morphemes, so that we can simplify the analysis of unknown word structure by analyzing only the combinations of two consecutive morphemes.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Appending Constraints </SectionTitle>
<Paragraph position="0"> Since the general rules in Table 2 have high generality and low selectivity in modeling unknown words, we append constraints to restrict their application. There is, however, a tradeoff between generality and selectivity: higher selectivity usually results in lower generality. To keep generality high while assigning constraints, we assign different constraints to different rules according to their characteristics, so that generality is degraded only slightly while selectivity is upgraded significantly.</Paragraph>
<Paragraph position="1"> The rules in Table 2 are classified into two kinds: rules in which both right-hand-side symbols consist of detected morphemes, i.e., (1), (2), (7), and (10), and rules in which just one right-hand-side symbol consists of detected morphemes, i.e., (3), (4), (5), (6), (8), (9), (11), and (12). The former are regarded as &quot;strong&quot; structures, since they are considered more likely to compose an unknown word or an unknown word morpheme; the latter are regarded as &quot;weak&quot; structures, considered less likely to do so.</Paragraph>
<Paragraph position="2"> The basic idea is to assign more constraints to the rules with weak structure and fewer constraints to the rules with strong structure.</Paragraph>
<Paragraph position="3"> The constraints we apply include word length, linguistic constraints, and statistical constraints. For the statistical constraints, since the target of our system is to extract unknown words from a text, we use text-based statistical measures. It is well known that keywords often reoccur in a document (Church 2000), and it is very possible that the keywords are also unknown words. Therefore the reoccurrence frequency within a document is adopted as a constraint. Another useful statistical phenomenon in a document is that a polysyllabic morpheme is very unlikely to be a morpheme of two different unknown words within the same text. Hence we restrict the rules with polysyllabic symbols by evaluating the conditional probability of the polysyllabic symbols; a sketch of both text-based constraints follows.</Paragraph>
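<Paragraph> A minimal sketch of the two text-based statistical constraints just described, assuming all counts are taken within the current document only; the exact threshold function is not given in the text, so the form used here is an assumption:
from collections import Counter

# Reoccurrence constraint: the candidate pair must repeat often enough
# in the text. The threshold's functional form is assumed; the text only
# says it grows with length(LR) and text size.

def reoccurs_enough(pair, pair_freq, length_lr, text_size):
    threshold = max(2, length_lr * text_size // 10000)  # assumed form
    return pair_freq[pair] >= threshold

def exclusivity(pair, pair_freq, left_freq):
    """P(R|L) for a polysyllabic left token L: a polysyllabic morpheme
    should pair with only one right neighbor in a text, so a value below
    1 blocks the merge (cf. the constraint of rule (1) in section 5.2)."""
    return pair_freq[pair] / left_freq[pair[0]]

pair_freq = Counter({("AB", "C"): 3})  # hypothetical within-document counts
left_freq = Counter({"AB": 3})
print(exclusivity(("AB", "C"), pair_freq, left_freq))  # 1.0 -> allowed
</Paragraph>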
<Paragraph position="4"> In addition, syntactic constraints are also utilized. The syntactic categories of most unknown word morphemes are &quot;bound&quot;, &quot;verb&quot;, &quot;noun&quot;, or &quot;adjective&quot; rather than &quot;conjunction&quot;, &quot;preposition&quot;, etc. So we restrict the rules with non-detected symbols by checking whether the syntactic categories of their non-detected symbols belong to &quot;bound&quot;, &quot;verb&quot;, &quot;noun&quot;, or &quot;adjective&quot;. To avoid unlimited recursive rule application, the length of a matched unknown word is restricted unless a very strong statistical association does occur between the two matched tokens.</Paragraph>
<Paragraph position="5"> The constraints adopted so far are presented in Table 3. Rules might be restricted by multiple constraints. Notes: L denotes the left terminal of the right-hand side; R denotes the right terminal of the right-hand side; Threshold is a function of length(LR) and text size, the idea being that a larger length(LR) or a larger text size corresponds to a larger Threshold.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Priority </SectionTitle>
<Paragraph position="0"> To schedule and rank ambiguous rule matchings, each rule matching step is associated with a priority measure, calculated from the association strength of the right-hand-side symbols.</Paragraph>
<Paragraph position="1"> In our extracting algorithm, the priority measure helps the extracting process dynamically decide which rule should be applied first. A more detailed discussion of the ambiguity problem and the complete disambiguation process is presented in section 5.</Paragraph>
<Paragraph position="2"> We regard the possibility of a rule application as the co-occurrence and association strength of its right-hand-side symbols within a text. In other words, a rule has a higher priority of application when its right-hand-side symbols are strongly associated with each other, or co-occur frequently in the same text. Many statistical measures for estimating co-occurrence and degree of association have been proposed in previous research, such as mutual information (Church 1990, Sproat 1990), t-score (Church 1991), and the Dice coefficient (Smadja 1993, 1996).</Paragraph>
<Paragraph position="3"> Here, we adopt four well-developed statistical measures individually as our priority: mutual information (MI), a variant of mutual information (VMI), t-score, and co-occurrence. The formulas are listed in Table 4; MI mainly focuses on association strength, while VMI and t-score consider both co-occurrence and association strength.</Paragraph>
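<Paragraph> Table 4 is not preserved in this copy; the sketch below therefore uses standard textbook formulations of the four measures, which may differ in detail from the paper's exact variants:
import math

# f_lr = f(L,R); f_l and f_r are the tokens' individual counts;
# n = N, the total number of token occurrences in the text.

def mi(f_lr, f_l, f_r, n):
    """Mutual information: association strength only."""
    return math.log2((f_lr / n) / ((f_l / n) * (f_r / n)))

def vmi(f_lr, f_l, f_r, n):
    """A common MI variant: MI weighted by co-occurrence frequency."""
    return f_lr * mi(f_lr, f_l, f_r, n)

def t_score(f_lr, f_l, f_r, n):
    """t-score: co-occurrence against the independence expectation."""
    return (f_lr - f_l * f_r / n) / math.sqrt(f_lr)

def co_occurrence(f_lr):
    """Raw within-text co-occurrence count."""
    return f_lr

print(mi(5, 6, 7, 300), t_score(5, 6, 7, 300))  # hypothetical counts
</Paragraph>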
<Paragraph position="5"> The performance of these four measures is evaluated in our experiments, discussed in section 6.</Paragraph>
<Paragraph position="6"> Notes: f(L,R) denotes the number of occurrences of the pair L,R in the text; N denotes the number of occurrences of all tokens in the text; length(*) denotes the length of *.</Paragraph>
<Paragraph position="7"/> </Section> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Unknown Word Extraction </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Ambiguity </SectionTitle>
<Paragraph position="0"> Even though the general rules are appended with well-designed constraints, ambiguous matchings, such as overlapping and covering, still exist.</Paragraph>
<Paragraph position="1"> We take the following instance to illustrate this: &quot;a0a2a1a4a3 &quot; (La Fa Yeh), a warship name, occurs frequently in the text and is segmented and detected as &quot;a0 (?) a1 (?) a3 (?)&quot;. Although &quot;a0a5a1 a3 &quot; can be derived as an unknown word &quot;((a0a6a1 ) a3 )&quot; by rule 2 and rule 10, &quot;a0a5a1 &quot; and &quot;a1a4a3 &quot; might also be derived as the unknown words &quot;(a0a6a1 )&quot; and &quot;(a1a2a3 )&quot; individually by rule 2. Hence there are in total three possible ambiguous unknown words, and only one is actually correct.</Paragraph>
<Paragraph position="2"> Several approaches to the unsupervised segmentation of Chinese words have been proposed to solve the overlapping ambiguity of whether to group &quot;xyz&quot; as &quot;xy z&quot; or &quot;x yz&quot;, where x, y, and z are Chinese characters. Sproat and Shih (1990) adopt a greedy algorithm: group the pair of adjacent characters with the largest mutual information above some threshold within a sentence, and apply the algorithm recursively to the rest of the sentence until no character pair satisfies the threshold. Sun et al. (1998) use various association measures, such as t-score in addition to mutual information, to improve on (Sproat & Shih 1990). They developed an efficient algorithm to solve overlapping character pair ambiguity.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Bottom-up Merging Algorithm </SectionTitle>
<Paragraph position="0"> Following the greedy strategy of (Sproat & Shih 1990), we present an efficient bottom-up merging algorithm that consults the general rules to extract unknown words. The basic idea is that, for a segmented sentence, if there are several rule-matched token pairs that also satisfy the rule constraints, the token pair with the highest rule priority within the sentence is merged first and forms a new token string. The same procedure is then applied to the updated token string recursively until no token pair satisfies the general rules; a skeleton of this loop is sketched below. The algorithm is illustrated by the example in Figure 3: by the general rules and the greedy strategy, besides overlapping character pair ambiguity, it can handle more complex overlapping and coverage ambiguities, even those resulting from consecutive unknown words. The input sentence &quot;a15a17a16a6a18a20a19a22a21 &quot; is derived into the two correct unknown words &quot;((a15a17a16 )a18 )&quot; and &quot;(a19 a21 )&quot; by rule (2), rule (10), and rule (2) in turn. &quot;a15 a16a6a18 &quot; and &quot;a19a23a21 &quot; are not merged further, because P(a19a24a21 |a15a25a16a24a18 )<1 violates the constraint of rule (1). The same reason explains why &quot;a15a23a16a24a18 &quot; and &quot;a19 &quot; do not satisfy rule (10) in the third iteration.</Paragraph>
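<Paragraph> A compact sketch of the merging loop, assuming that rule_match, constraints_ok, and priority are supplied by the components of sections 4.2 and 4.3, and that tokens are (string, tag) pairs produced by the detection stage; the names are illustrative:
# Greedy bottom-up merging over a segmented, detection-marked sentence.
# Each iteration merges the highest-priority pair that matches a general
# rule and passes its constraints, then repeats on the updated string.

def extract_unknown_words(tokens, rule_match, constraints_ok, priority):
    while True:
        candidates = [
            (priority(tokens[i], tokens[i + 1]), i)
            for i in range(len(tokens) - 1)
            if rule_match(tokens[i], tokens[i + 1])
            and constraints_ok(tokens[i], tokens[i + 1])
        ]
        if not candidates:
            break                      # no pair satisfies the general rules
        _, i = max(candidates)         # merge the highest-priority pair first
        merged = (tokens[i][0] + tokens[i + 1][0], "UW")
        tokens = tokens[:i] + [merged] + tokens[i + 2:]
    return [tok for tok in tokens if tok[1] == "UW"]
</Paragraph>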
<Paragraph position="1"> By this simple algorithm, unknown words of unlimited length all have the possibility of being extracted. Observing the extraction process of &quot;a26a25a27 a28 &quot;, one finds that the boundaries of an unknown word may extend during the iterations until no rule can be applied.</Paragraph> </Section> </Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Experiment </SectionTitle>
<Paragraph position="0"> In our experiments, a word is considered an unknown word if it is not in the CKIP lexicon and is not identified by the word segmentation program as a foreign word (for instance, English) or a number. The CKIP lexicon contains about 80,000 entries.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Evaluation Formulas </SectionTitle>
<Paragraph position="0"> The extraction process is evaluated in terms of precision and recall. The target of our approach is to extract unknown words from a text, so we define &quot;correct extractions&quot; as the unknown word types correctly extracted from the text. The precision and recall formulas are defined over the following per-document counts: NCi = number of correct extractions in document i; NEi = number of extracted unknown words in document i; NTi = number of total unknown words in document i.</Paragraph> <Paragraph position="2"/> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Data Sets </SectionTitle>
<Paragraph position="0"> We use the Sinica balanced corpus version 3.0, which contains 5 million segmented and part-of-speech-tagged words, as our training set for unknown word detection. We randomly select 150 documents of Chinese news from the Internet as our testing set. These testing data are segmented by hand according to the segmentation standard for information processing designed by Academia Sinica (Huang et al. 1997). On average, each testing text contains about 300 words and 16.6 unknown word types.</Paragraph> </Section> </Section> </Paper>
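As a companion to the evaluation formulas of section 6.1, a minimal sketch of the precision/recall computation, assuming counts are pooled over all test documents (micro-averaging); whether the paper pools counts or averages per document is not recoverable from this copy, and only the aggregation step would change:
# Sketch of the section 6.1 evaluation over per-document counts.
# per_doc_counts: iterable of (NCi, NEi, NTi) triples, one per document.

def evaluate(per_doc_counts):
    nc, ne, nt = map(sum, zip(*per_doc_counts))
    precision = nc / ne   # correct extractions / extracted unknown words
    recall = nc / nt      # correct extractions / total unknown words
    return precision, recall

print(evaluate([(10, 15, 17), (8, 12, 16)]))  # hypothetical counts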