<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1506">
<Title>Multi-Language Named-Entity Recognition System based on HMM</Title>
<Section position="6" start_page="2" end_page="2" type="concl">
<SectionTitle> 5. Application to Bilingual Lexicon Extraction from Parallel Text </SectionTitle>
<Paragraph position="0"> To illustrate the benefit of our multi-language named-entity recognition system, we conducted a simple experiment on extracting bilingual named-entity lexicons from parallel texts. Bilingual lexicons of named entities are very difficult to gather because an enormous number of new named entities keep appearing. Establishing a bilingual named-entity dictionary automatically would therefore be extremely useful.</Paragraph>
<Paragraph position="1"> Extracting a bilingual lexicon takes three steps: 1. recognize named entities in the parallel text; 2. extract bilingual lexicon candidates; 3. winnow the candidates to yield a reasonable lexicon list. The multi-language named-entity recognition system is used in the first step. In this step, the parallel texts are analyzed morphologically and named entities are recognized.</Paragraph>
<Paragraph position="2"> In the second step, bilingual lexicon candidates are listed automatically under the following conditions: the candidate is a word sequence of up to 5 words; it includes one or more named entities; it does not include function words. The cooccurrence frequency of the candidates is calculated at the same time.</Paragraph>
<Paragraph position="3"> In the third step, a reasonable lexicon is created from the candidates. To judge whether a candidate is suitable for entry into a bilingual lexicon, we use the translation model called the IBM model [18]. Let a word sequence in language X be [definitions and equation (5) not recoverable from the source], where $\epsilon$ is a constant.
$t(y_j \mid x_i)$ is the translation probability and is estimated by applying the EM algorithm to a large number of parallel texts.</Paragraph>
<Paragraph position="4"> Because $P(Y \mid X)$ becomes smaller as the word sequences X and Y grow longer, its values cannot be compared across word sequences of different lengths. Therefore, we improved equation (5) to take into account the difference in word sequence length and the cooccurrence frequency as follows: [equation not recoverable from the source], where $t_{yx}$ is the average of $t(y_j \mid x_i)$. $S(Y \mid X)$ is used as a measure of candidate suitability. As parallel text, we used the Japanese-English news article alignment data released by CRL [19, 20]. In this data, articles and sentences are aligned automatically. We separated the parallel text into a small set (about 1000 sentences) and a large set (about 150 thousand sentences). We extracted bilingual lexicons from the small set and estimated $t(y_j \mid x_i)$ from the large set. Table 7 shows bilingual lexicon entries that achieved very high scores. They appear adequate as bilingual lexicon entries. Although a more detailed evaluation remains future work, the accuracy is about 86% for the top 50 candidates. This suggests that the proposed system can be applied to bilingual lexicon extraction for automatically creating bilingual dictionaries of named entities.
Conclusion. We developed a multi-language named-entity recognition system based on an HMM. We have implemented Japanese, Chinese, Korean and English versions, but in principle the system can handle any language for which training data is available. Our system is very fast and has state-of-the-art accuracy.</Paragraph>
</Section>
</Paper>