<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1506"> <Title>Multi-Language Named-Entity Recognition System based on HMM</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. System Architecture </SectionTitle>
<Paragraph position="0"> Our goal is to build a practical multi-language named-entity recognition system for multi-language information retrieval. To accomplish this aim, several conditions must be fulfilled. The first is to handle the differences between the features of the languages. The second is good adaptability to a variety of genres, because there is an endless variety of texts on the WWW. The third is to combine high accuracy with high processing speed, because users of information retrieval are sensitive to response time. To fulfill the first condition, we divided our system architecture into language-dependent parts and language-independent parts.</Paragraph>
<Paragraph position="1"> For the second and third conditions, we used a combination of a statistical language model and optimal word sequence search. Details of the language model and the word sequence search are discussed later; we start with an explanation of the system's architecture.</Paragraph>
<Paragraph position="2"> Figure 1 overviews the multi-language named-entity recognition system. We have implemented Japanese (JP), Chinese (CN), Korean (KR) and English (EN) versions, but the system can, in principle, treat any other language.</Paragraph>
<Paragraph position="3"> There are two language-dependent aspects. One involves the character encoding system, and the other involves the language features themselves, such as orthography, the kinds of character types, and word segmentation. We adopted a character code converter for the former and a lexical analyzer for the latter.</Paragraph>
<Paragraph position="4"> In order to handle the language-independent aspects, we adopted N-best word sequence search and a statistical language model in the analytical engine. The following sections describe the character code converter, the lexical analyzer, and the analytical engine.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1. Character Code Conversion </SectionTitle>
<Paragraph position="0"> If computers are to handle multilingual text, it is essential to decide on the character set and its encoding. A character set is a collection of characters, and an encoding is a mapping between characters and numbers. One character set can have several encoding schemes. Hundreds of character sets and attendant encoding schemes are used on a regional basis. Most of them are standards of the countries where the language is spoken, and they differ from country to country.</Paragraph>
<Paragraph position="1"> Examples include JIS from Japan, GB from China and KSC from Korea; EUC-JP, EUC-CN and EUC-KR are the corresponding encoding schemes [3]. We call these encoding schemes 'local codes' in this paper. It is impossible for a local code to handle two different character sets at the same time, so Unicode was invented to bring together all the languages of the world [4]. In Unicode, character types such as alphanumerics, symbols, kanji (Chinese characters), hiragana (Japanese syllabary characters) and hangul (Korean characters) are defined as Unicode properties through the assignment of ranges of code points. The proposed lexical analyzer also allows us to define arbitrary properties other than those defined by the Unicode standard.</Paragraph>
<Paragraph position="2"> The character code converter changes the input text encoding from local code to Unicode and the output from Unicode to local code; that is, the internal code of our system is Unicode (UCS-4). Our system can accept EUC-JP, EUC-CN, EUC-KR and UTF-8 as input-output encoding schemes. In principle, we can use any encoding scheme that has a round-trip conversion mapping to and from Unicode. We assume that the input encoding is either specified by the user or automatically detected by using conventional techniques such as [5].</Paragraph>
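To make the round-trip design concrete, here is a minimal Python sketch (ours, not the paper's implementation) that decodes a local code into the internal Unicode representation, assigns each character a coarse character type, and re-encodes on output; the encoding names follow the paper, while the function names and code-point ranges are illustrative assumptions.

    # Sketch of the character code converter and Unicode properties (assumed interface).
    SUPPORTED = {"euc-jp", "euc-cn", "euc-kr", "utf-8"}   # input-output encodings named above

    def to_unicode(raw: bytes, encoding: str) -> str:
        """Local code -> internal Unicode (a Python str is a Unicode string)."""
        assert encoding in SUPPORTED
        return raw.decode(encoding)

    def from_unicode(text: str, encoding: str) -> bytes:
        """Internal Unicode -> local code for output (round-trip conversion)."""
        return text.encode(encoding)

    def char_property(ch: str) -> str:
        """Coarse character type from Unicode code-point ranges (ranges are illustrative)."""
        cp = ord(ch)
        if 0x3040 <= cp <= 0x309F:
            return "hiragana"
        if 0x30A0 <= cp <= 0x30FF:
            return "katakana"
        if 0x4E00 <= cp <= 0x9FFF:
            return "kanji"
        if 0xAC00 <= cp <= 0xD7A3:
            return "hangul"
        if ch.isdigit():
            return "number"
        if ch.isalpha():
            return "alphabet"
        return "symbol"

    text = to_unicode("パスワード".encode("euc-jp"), "euc-jp")   # 'pasuwado (password)'
    print([(c, char_property(c)) for c in text])                  # every character is katakana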
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2. Lexical Analyzer </SectionTitle>
<Paragraph position="0"> The lexical analyzer recognizes words in the input sentence. It also plays an important role in handling the language differences; that is, it generates adequate word candidates for each language.</Paragraph>
<Paragraph position="1"> The lexical analyzer uses regular expressions and is controlled by lexical analysis rules that reflect the differences in language features. We assume the following three language features: 1. character type and word length, 2. orthography and spacing, 3. word candidate generation. These features can be set as parameters in the lexical analyzer. We explain them in the following sections.</Paragraph>
<Paragraph position="2"> Table 1 shows the varieties of character types in each language. Character types influence the average word length. For example, in Japanese, kanji (Chinese character) words are about 2 characters long and katakana (phonetic characters used primarily to represent loanwords) words are about 5 characters long, such as 'pasuwado (password)'. In Chinese, most kanji words have 2 characters, but proper nouns for native Chinese are usually 3 characters, and those representing loanwords are about 4 characters long, such as 'Bei Ke Yi Mu (Beckham)'. In Korean, one hangul syllable corresponds to one kanji and consists of a consonant, a vowel, and a consonant, so loanwords written in hangul are about 3 characters long, such as 'inteones (internet)'. Character type and word length are related to word candidate generation in section 2.2.3. There is also an obvious difference in orthography among these languages: European languages put a space between words, while Japanese and Chinese do not. In Korean, spaces are used to delimit phrases (called eojeol in Korean), not words, and space usage depends greatly on the individual writer.</Paragraph>
<Paragraph position="3"> Therefore, another important role of the lexical analyzer is to handle spaces. In Japanese and Chinese, spaces should usually be recognized as tokens, but in English and Korean, spaces must be treated only as delimiters because they indicate word or phrase boundaries. For example, the input 'I have a pen' should be analyzed as 'I/pronoun' 'have/verb' 'a/article' 'pen/noun'; the spaces themselves must never be output as tokens.</Paragraph>
<Paragraph position="5"> There are, however, many compound nouns that include spaces, such as 'New York', 'United States' and so on. In this case, a space must be recognized as a character within the compound word. In Korean, it is necessary not only to segment each space-delimited phrase into words, as in Japanese, but also to recognize compound words that include spaces, as in English.</Paragraph>
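As an illustration of these space-handling rules, the following Python sketch (ours, not the paper's code; the compound lexicon and function names are assumptions) treats English spaces only as delimiters, except inside known compounds, while a Japanese-style rule keeps every space as a token.

    import re

    COMPOUNDS = {"New York", "United States"}          # illustrative compound-word entries

    def tokenize_en(sentence: str) -> list[str]:
        """EN/KR style: spaces delimit words but may survive inside a compound word."""
        words = sentence.split()
        tokens, i = [], 0
        while i < len(words):
            pair = " ".join(words[i:i + 2])            # try a 2-word compound first
            if pair in COMPOUNDS:
                tokens.append(pair)
                i += 2
            else:
                tokens.append(words[i])
                i += 1
        return tokens

    def split_jp(sentence: str) -> list[str]:
        """JP/CN style: every space is kept as a token; the remaining runs are passed
        on to word candidate generation (section 2.2.3)."""
        return [t for t in re.split(r"( )", sentence) if t]

    print(tokenize_en("I have a pen in New York"))
    # ['I', 'have', 'a', 'pen', 'in', 'New York']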
<Paragraph position="6"> These differences in handling spaces are related to the problem of whether spaces should be included in the statistical language model or not. In Japanese and Chinese, it is rare for spaces to appear in a sentence, so the appearance of a space is an important clue that improves analysis accuracy. In English and Korean, however, spaces are used so often that they carry little contextual information.</Paragraph>
<Paragraph position="7"> The lexical analyzer can treat spaces appropriately. The rules for Japanese and Chinese always recognize a space as a token, while those for English and Korean consider spaces only as part of compound words such as 'New York'.</Paragraph>
<Paragraph position="8"> In our system, the analytical engine can list all dictionary word candidates from the input string by dictionary lookup. However, it is also necessary to generate candidates for words that are not in the dictionary, i.e., unknown word candidates. We use the lexical analyzer to generate these candidates.</Paragraph>
<Paragraph position="9"> It is more difficult to generate word candidates for Asian languages than for European languages, because Asian languages do not put a space between words, as mentioned above.</Paragraph>
<Paragraph position="10"> The first step in word candidate generation is to make word candidates from the input string. The simplest way is to list all substrings as word candidates at every point in the sentence. This technique can be used for any language, but its disadvantage is that there are so many linguistically meaningless candidates that it takes too long to calculate the probabilities of all combinations of the candidates in the subsequent analytical process. A much more effective approach is to limit word candidates to only those substrings that are likely to be words.</Paragraph>
[Table 1 (character types in each language): alphabet, symbol, number, kanji, hangul]
<Paragraph position="13"> The character types are often helpful in word candidate generation. For example, a cross-linguistic characteristic is that numbers and symbols are often used for serial numbers, phone numbers, block numbers, and so on, and some distinctive strings of alphabetic characters and symbols such as 'http://www...' and 'name@abc.mail.address' are URLs, e-mail addresses, and so on. This is not foolproof, since writing styles often differ from language to language. Furthermore, it is better to generate such word candidates based on the longest match method, because substrings of these candidates do not usually constitute a word. In Japanese, a change between character types often indicates a word boundary. For example, katakana words are loanwords and so must be generated based on the longest match method. In Chinese and Korean, sentences mainly consist of one character type, such as kanji or hangul, so character types are not as effective for word recognition as they are in Japanese. However, changes from kanji or hangul to alphanumerics and symbols often indicate word boundaries.</Paragraph>
<Paragraph position="14"> Word length is also useful for limiting the length of word candidates. It is a waste of time to generate long kanji candidates (5 or more characters) in Japanese unless the substring matches the dictionary, because the average kanji word is about 2 characters long. In Korean, although hangul syllables are converted internally into a sequence of hangul Jamo (consonants and vowels) to facilitate morphological analysis, the length limits for hangul words are defined in terms of hangul syllables.</Paragraph>
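A minimal Python sketch of this character-type-driven candidate generation follows (ours, not the paper's code); the property function is simplified, and the length limit of 3 for kanji and hiragana follows the Japanese rules described in the next paragraph.

    SHORT_TYPES = {"kanji", "hiragana"}                  # candidates of length 1-3 at each offset
    LONGEST_TYPES = {"katakana", "alphabet", "number"}   # one candidate per same-type run

    def char_type(ch: str) -> str:
        # Simplified property function for the demo (a full version would use Unicode properties).
        cp = ord(ch)
        if ch.isdigit():
            return "number"
        if ch.isalpha() and cp < 128:
            return "alphabet"
        if 0x3040 <= cp <= 0x309F:
            return "hiragana"
        if 0x30A0 <= cp <= 0x30FF:
            return "katakana"
        if 0x4E00 <= cp <= 0x9FFF:
            return "kanji"
        return "symbol"

    def candidates(text: str) -> list[tuple[int, str]]:
        """Unknown-word candidates as (start offset, string); dictionary words are
        listed separately by dictionary lookup in the analytical engine."""
        out, i, n = [], 0, len(text)
        while i < n:
            t = char_type(text[i])
            if t in LONGEST_TYPES:
                j = i
                while j < n and char_type(text[j]) == t:   # longest match until type changes
                    j += 1
                out.append((i, text[i:j]))
                i = j
            elif t in SHORT_TYPES:
                for length in (1, 2, 3):                   # average kanji word is about 2 characters
                    if i + length <= n and all(char_type(c) in SHORT_TYPES for c in text[i:i + length]):
                        out.append((i, text[i:i + length]))
                i += 1
            else:
                out.append((i, text[i]))                   # symbols and spaces as single tokens
                i += 1
        return out

    print(candidates("1500km"))    # [(0, '1500'), (4, 'km')] -- no subset strings such as '15'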
<Paragraph position="15"> We designed the lexical analyzer so that it can correctly handle spaces and word candidate generation according to the character types of each language. Table 2 shows sample lexical analysis rules for Japanese (JP) and English (EN). For example, in Japanese, if the character type is kanji or hiragana, the lexical analyzer outputs word candidates with lengths of 1 to 3. If the character type is katakana, alphabet, or number, it generates one candidate based on the longest match method until the character type changes. If the input is '1500km', the word candidates are '1500' and 'km'. Subset strings such as '1', '15', '500', 'k' and 'm' are never output as candidates. It is possible for a candidate to consist of several character types. Japanese has many words that consist of kanji and hiragana, such as 'Li rete (away from)'. In any language there are many words that consist of numbers and alphabetic characters, such as '2nd', or of alphabetic characters and symbols, such as 'U.N.'. Furthermore, if we want to handle digit grouping and decimal numbers, we may need to change the Unicode properties; that is, we add '.' and ',' to the number property. The character type 'compound' in the English rules indicates compound words. The lexical analyzer generates compound-word candidates (up to 2 words long), recognizing the space between the words. In Japanese, a space is always recognized as a single word of type symbol.</Paragraph>
<Paragraph position="16"> Table 3 shows the word candidates output by the lexical analyzer following the rules of Table 2. The Japanese and English inputs are parallel sentences. It is apparent that the efficiency of word candidate generation improves dramatically compared with generating all substrings: in Japanese, kanji and hiragana strings yield several candidates with lengths of 1 to 3, while alphabet and katakana strings yield one candidate based on the longest match method until the character type changes. In English, single words and compound words are recognized as candidates. Only the candidates that are not in the dictionary become unknown word candidates in the analytical engine.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3. Analytical Engine </SectionTitle>
<Paragraph position="0"> The analytical engine consists of N-best word sequence search and a statistical language model. Our system uses a word bigram model for morphological analysis and a hidden Markov model for named-entity recognition. These models are trained from tagged corpora that have been manually word-segmented, part-of-speech tagged, and named-entity annotated, respectively. Since the N-best word sequence search and the statistical language model do not depend on the language, we can apply this analytical engine to all languages. This makes it possible to treat any language for which a corpus is available for training the language model. The next section explains the hidden Markov model used for named-entity recognition.</Paragraph>
</Section> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. Named-entity Recognition Model </SectionTitle>
<Paragraph position="0"> The named-entity task is to recognize entities such as organizations, personal names, and locations. Several papers have tackled named-entity recognition with hidden Markov models (HMM) [6], the maximum entropy method (ME) [7, 8], and support vector machines (SVM) [9]. It is generally said that HMM is inferior to ME and SVM in terms of accuracy, but superior with regard to training and processing speed. That is, HMM is suitable for applications that require real-time response or have to process large amounts of text, such as information retrieval. We extended the original HMM reported by BBN.</Paragraph>
<Paragraph position="1"> BBN's named-entity system is for English and offers high accuracy.</Paragraph>
<Paragraph position="2"> The HMM used in BBN's system is described as follows. Let w denote a morpheme and NC a name class, i.e., the type of named entity such as organization, personal name, or location. The joint probability of the word sequence and the NC sequence is modeled as a product of NC transition probabilities and word generation probabilities. Here, the special symbol <end> indicates the end of an NC sequence.</Paragraph>
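The equation itself did not survive extraction; as a hedged reconstruction only, a BBN-style model of the kind referred to above factors the joint probability of the word sequence W and the name-class sequence NC roughly as follows, where the notation (including w^first and w^last) is ours and may differ in detail from the paper's.

    % Sketch of a BBN-style factorization (assumed notation, not the paper's exact formula)
    P(W, NC) \;\approx\; \prod_{i=1}^{m}
        P\bigl(NC_i \mid NC_{i-1},\, w_{i-1}^{\mathrm{last}}\bigr)
        \cdot P\bigl(w_i^{\mathrm{first}} \mid NC_i,\, NC_{i-1}\bigr)
        \cdot \prod_{j \ge 2} P\bigl(w_i^{j} \mid w_i^{j-1},\, NC_i\bigr)

Here w_i^j is the j-th word generated inside name class NC_i, and each class generates the distinguished word <end> as its final token, which is how the model decides where the class, and ultimately the NC sequence, terminates.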
<Paragraph position="3"> In this model, morphological analysis and named-entity recognition can be performed at the same time. This is preferable for Asian languages because their word segmentation is ambiguous. To adapt BBN's HMM to Asian languages, we extended the original HMM as follows. Due to the ambiguity of word segmentation, morphological analysis is performed before applying the HMM; the analysis uses a word bigram model, and the N-best morpheme sequence candidates are output as a word graph structure. Named-entity recognition is then performed over this word graph using the HMM. We use a forward-DP backward-A* N-best search algorithm to obtain the N-best morpheme sequence candidates [10]. In this way, multiple morpheme candidates are considered in named-entity recognition, and the problem of word segmentation ambiguity is mitigated.</Paragraph>
<Paragraph position="4"> BBN's original HMM used a back-off model and smoothing to avoid the sparse data problem. We changed this smoothing to linear interpolation to improve the accuracy, and in addition, we used not only morpheme frequencies but also part-of-speech frequencies. Table 4 shows the linear interpolation scheme used here. The underlined items are those added in our model. The weight for each probability was determined experimentally.</Paragraph>
</Section> </Paper>