<?xml version="1.0" standalone="yes"?> <Paper uid="C92-1019"> <Title>WORD IDENTIFICATION FOR MANDARIN CHINESE SENTENCES</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> WORD IDENTIFICATION FOR MANDARIN CHINESE SENTENCES Abstract </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Academia Sinica </SectionTitle> <Paragraph position="0"> Chinese sentences are composed of strings of characters, with no blanks to mark word boundaries. However, the basic unit for sentence parsing and understanding is the word. Therefore, the first step in processing Chinese sentences is to identify the words. The difficulties of word identification include (1) identifying complex words, such as Determinative-Measure compounds, reduplications, and derived words; (2) identifying proper names; and (3) resolving ambiguous segmentations. In this paper, we propose possible solutions to these difficulties. We adopt a matching algorithm with 6 different heuristic rules to resolve the ambiguities and achieve a success rate of 99.77%.</Paragraph> <Paragraph position="1"> The statistical data show that maximal matching is the most effective of the heuristics.</Paragraph> </Section> </Paper>