<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2203"> <Title>CHINESE SEGMENTATION DISAMBIGUATION</Title> <Section position="4" start_page="0" end_page="1245" type="metho"> <SectionTitle> 2 Difficulties in Chinese segmentation </SectionTitle> <Paragraph position="0"> As claimed in (lSu, 1987), the main causes of segmentation ambiguity are vagueness in word definition and the phenomenon of word chains. The vagueness of the word definitions causes segmentation ambiguities, as in the string 现代化工厂. It can stand either for 现代化 工厂 (modern, factory) or for 现代 化工厂 (modern chemical factory). A word chain is a sequence of Chinese characters from which several words can be produced with or without overlap. Two types of word chains have been recognized in Chinese literature, i.e. multi-sense combinations and intersection combinations (\]hlallg and Liu, 1988). The string 冰箱 is an example of multi-sense combination; 冰 (ice), 箱 (box) and 冰箱 (refrigerator) are all words. The character string 球拍卖 is an example of intersection combination; 球拍 (paddle) is a word and 拍卖 (sell-at-sale-price) is also a word, whereas 拍 is the intersection character. A longer string containing 球拍卖 illustrates the typical segmentation ambiguity caused by word chains. The segmentation of this string can be either (The ping-pong balls were sold out at sale price.) or (The paddles for table tennis were sold out.) Some ambiguities can be solved by word structure knowledge. Others can be disambiguated by syntactic and/or semantic knowledge. The most difficult disambiguation is that requiring contextual or pragmatic knowledge to arrive at an appropriate interpretation, as in a string which can be segmented into: (students will write a paper.) or (student-association writes a paper.) Both are syntactically and semantically correct. 
In this case, contextual information would allow the reader to trace the information claimed in the previous statements to solve ambiguity problems.</Paragraph> </Section> <Section position="5" start_page="1245" end_page="1247" type="metho"> <SectionTitle> 3 Reasoning theory for Chinese segmentation disambiguation </SectionTitle> <Paragraph position="0"> A model of evidential strength in inexact reasoning studied by (Buchanan and Shortliffe, 1984) has been successfully implemented in the MYCIN system. The theory is that, if a hypothesis can be derived from various types of mutually exclusive evidence, then the strength of truth of the hypothesis can be increased to reach a plausible conclusion.</Paragraph> <Paragraph position="1"> Two concepts, MB[h,e] and MD[h,e], have been introduced as the measures of belief and disbelief. MB[h,e] means the measure of increased belief in the hypothesis h, based on the evidence e. MD[h,e] means the measure of increased disbelief in the hypothesis h, based on the evidence e. To facilitate comparison of the evidential strength of competing hypotheses, a certainty factor CF is introduced to combine degrees of belief and disbelief as follows: CF[h, e] = MB[h, e] - MD[h, e]. In the case that a hypothesis is derived from a number of mutually exclusive observations, the combining functions are defined as:</Paragraph> <Paragraph position="3"> In the case that two hypotheses are established with positive evidence from syntactic and semantic knowledge with the same degree, no discrimination between the strengths of truth of the hypotheses can be drawn. If world knowledge provides positive evidence for the first hypothesis and negative evidence for the second, then the strength of the first hypothesis is stronger than that of the second. 
Therefore, the first hypothesis would be the most likely correct segmentation.</Paragraph> <Paragraph position="4"> A weighted certainty factor is proposed here to represent the importance of various linguistic aspects. The weight is a vector of four elements representing the importance of morphology, syntax, semantics and pragmatics, respectively, which total 1, i.e.</Paragraph> <Paragraph position="5"> CFi[h, e] = Wi * CF[h, e] where Wi is the weight of the certainty factor CFi in hypothesis h supported by the evidence e with respect to one of the linguistic aspects. Suppose the weight vector (0.1, 0.2, 0.3, 0.4) is assigned for morphology, syntax, semantics and pragmatics, respectively; then the following example illustrates the function. Therefore, this segmentation is unlikely to be a coherent string.</Paragraph> </Section> </Paper>
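The certainty-factor reasoning of Section 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the per-aspect CF values for the two competing segmentations are invented for the example, and summing the weighted factors into a single score is an assumption made here for comparison purposes. The `combine` function uses the incremental form commonly given for MYCIN, m1 + m2 * (1 - m1), and omits the special case in which the opposing measure equals 1.

```python
from typing import Sequence


def combine(m1: float, m2: float) -> float:
    """Combine two measures of the same kind (both MB or both MD)
    obtained from two separate pieces of evidence.
    Commonly cited MYCIN incremental form; special cases omitted."""
    return m1 + m2 * (1.0 - m1)


def certainty_factor(mb: float, md: float) -> float:
    """CF[h, e] = MB[h, e] - MD[h, e]."""
    return mb - md


def weighted_cfs(weights: Sequence[float], cfs: Sequence[float]) -> list:
    """Apply CFi = Wi * CF to each linguistic aspect in turn."""
    if len(weights) != len(cfs):
        raise ValueError("weights and cfs must have the same length")
    return [w * cf for w, cf in zip(weights, cfs)]


# Weight vector from the text: morphology, syntax, semantics, pragmatics.
WEIGHTS = (0.1, 0.2, 0.3, 0.4)

# Hypothetical per-aspect certainty factors for two competing segmentations.
# Both are equally supported by morphology, syntax and semantics; pragmatic
# (world) knowledge supports the first and speaks against the second.
h1_cfs = (0.5, 0.5, 0.5, 0.5)
h2_cfs = (0.5, 0.5, 0.5, -0.5)

# Assumption: the weighted factors are summed to rank the hypotheses.
h1_score = sum(weighted_cfs(WEIGHTS, h1_cfs))
h2_score = sum(weighted_cfs(WEIGHTS, h2_cfs))

if __name__ == "__main__":
    print("hypothesis 1:", round(h1_score, 3))
    print("hypothesis 2:", round(h2_score, 3))
```

With these invented values the first segmentation receives the higher score, which mirrors the argument in the text: when syntactic and semantic evidence is tied, the heavily weighted pragmatic evidence decides between the hypotheses.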