File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/w02-1812_intro.xml
Size: 9,381 bytes
Last Modified: 2025-10-06 14:01:33
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1812"> <Title>Segmentation Rate(%) Number of Words Economics Engineering Correct Segmentation Rate Unsegmentation Rate</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Algorithm </SectionTitle> <Paragraph position="0"> Fig. 1 shows the outline of the proposed method.</Paragraph> <Paragraph position="1"> (1) Input sentences are segmented using the word candidates that have been acquired in the dictionary so far.</Paragraph> <Paragraph position="2"> (2) For the remaining character strings that are left unsegmented by the known words, the system predicts unknown words by extracting WS using Inductive Learning. The system extracts WS as word candidates. This process rests on the supposition that a character string appearing repeatedly in the text has a high probability of being a word. (3) The user judges whether the results of the word segmentation are correct. If there are errors in the result, the user corrects them. (4) The system compares the proofread results with the segmentation results to update the information in the dictionary. Through this procedure, the certainty of a WS as a word is confirmed and increased.</Paragraph> <Paragraph position="3"> Here, the WS that are used in correct segmentations are called CW (Correct Word).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Segmentation by Known Words </SectionTitle> <Paragraph position="0"> A sentence is input, and the system segments it into words using the registered CW and WS that it has acquired through Inductive Learning up to that point.</Paragraph> <Paragraph position="1"> (1) In the first step, the system compares the registered CW and WS in the dictionary with the character string of the input sentence, starting from the beginning of the sentence, and finds the character strings that match registered words. 
The system repeats this comparison until the end of the sentence is reached. A list of segmentation candidates is established. Then the system segments the sentence into words.</Paragraph> <Paragraph position="2"> (2) In the second step, for character strings that allow multiple segmentations, we use the registered candidates in order of their ranks in the dictionary (Section 2.3). When more than one word candidate has the same rank, we decide the correct segmentation from the list of segmentation candidates by the value of LEF. We define LEF as follows:</Paragraph> <Paragraph position="4"> Where: FR, CS, ES and LE are the frequency with which the CW or WS appears in the text, the frequency of correct segmentation, the frequency of erroneous segmentation and the length of the CW or WS, respectively. α, β and γ are coefficients. The optimum coefficients of LEF were determined in preliminary experiments using the Greedy method: α=10, β=1 and γ=5.</Paragraph> <Paragraph position="5"> The word that has the maximum value of LEF is chosen as the correct segmentation candidate. (3) When the LEF values of the possible segmentations are equal, the correct segmentation candidate is chosen by applying the following criteria in turn: the minimum ES, the maximum CS, the maximum FR, the longest LE, and the leftmost segmentation position in the sentence.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Prediction for Unknown Words </SectionTitle> <Paragraph position="0"> Fig. 2 shows an example of a non-segmented sentence. In this example, every character stands for a Chinese character, so the example serves as a general sentence of a non-segmented language for presenting the proposed method. Words that are not registered in the dictionary are predicted using Inductive Learning. 
After the sentences have been segmented by the known words registered in the dictionary, the unsegmented parts of the character string are used to extract WS. The prediction method finds common character strings in the text. The extraction procedure is carried out as Fig. 3 shows: the extraction of common parts, the sifting out of the common parts most likely to be words, the re-extraction of common parts and the extraction of different parts.</Paragraph> <Paragraph position="1"> A common part in non-segmented text is extracted in two steps: (1) When a character string appears frequently in the text, we call it a common character string.</Paragraph> <Paragraph position="2"> If the common character string consists of more than two characters, we extract it as a word candidate, call it a common part and represent it by S1 (Segment one). Here, we use the length, frequency and location of S1 in the sentence to sift the candidates and obtain the S1 most likely to be words. At this step, we acquired the following S1 from the sentence shown in Fig. 2: &quot;abkhd&quot;, &quot;epsilon1phg&quot; and &quot;Thputgb&quot;.</Paragraph> <Paragraph position="3"> (2) When a character string appears in the sentence only once but is included in another extracted common part and consists of more than two characters, we also extract it as a word candidate. For example, in the foregoing example &quot;tgb&quot; is extracted and belongs to S1. The S1 extracted at 2.2.1 may still include a common character string. In this situation, the common character string can be further re-extracted from the extracted S1. We consider that the common parts re-extracted in this procedure have a higher probability of being words. The conditions of re-extraction are as follows: (1) A common part can be re-extracted from an extracted S1 when the S1 includes a common character string of more than two characters. 
For example, &quot;Thputgb&quot; contains &quot;tgb&quot;, which can be extracted from &quot;Thputgb&quot;, so &quot;Thputgb(S1)&quot; is equal to &quot;Thpu(S2)&quot; + &quot;tgb(S3)&quot;.</Paragraph> <Paragraph position="4"> The re-extracted part is called the high dimensional common part and is represented by S2 (Segment two). The remaining part is called the different part and is represented by S3 (Segment three). The S1 is deleted from the dictionary when it is divided into S2 and S3.</Paragraph> <Paragraph position="5"> (2) Furthermore, a single character can also be extracted as a word candidate when the strings on both sides of it have been extracted as word candidates or have been segmented by known words.</Paragraph> <Paragraph position="6"> For example, &quot;p&quot; in &quot;Thputgbpabkhd&quot; is surrounded by &quot;Thputgb&quot; and &quot;abkhd&quot;, so &quot;p&quot; is extracted as a word candidate belonging to S2.</Paragraph> <Paragraph position="7"> The extraction procedure is repeated until no new WS can be extracted and the input cannot be segmented further.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Segmentation by an Updated Dictionary </SectionTitle> <Paragraph position="0"> The extracted WS are classified into &quot;S1&quot;, &quot;S2&quot; and &quot;S3&quot;. Those WS that are confirmed as words by the proofreading process are called &quot;CW&quot; (Correct Word). 
Furthermore, the FR (appearing FRequency), CS (Correct Segmentation frequency), ES (Erroneous Segmentation frequency), LE (LEngth) and rank of a word candidate are registered at the same time.</Paragraph> <Paragraph position="1"> Word segmentation is then carried out with the updated dictionary as in Section 2.1.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Feedback Process </SectionTitle> <Paragraph position="0"> After the system segments the sentence into words, the user judges whether the results are correct and corrects any errors in them. The system updates the rank of the registered CW and WS in the dictionary by comparing the corrected results with the segmentation results.</Paragraph> <Paragraph position="1"> The system increases the priority degree of the words that were used in correct segmentation and decreases the priority degree of the words that were used in erroneous segmentation. (1) For the Correct Segmentation Results: When the result of segmentation is correct, the values of FR and CS of each word used in the segmentation are increased by one.</Paragraph> <Paragraph position="2"> If the rank of the word is not CW, it is changed to CW.</Paragraph> <Paragraph position="3"> (2) For the Erroneous Segmentation Results: If the dictionary does not have the correct words, the system registers the words in the dictionary. 
In this case, their FRs are 1 and their ranks are CW.</Paragraph> <Paragraph position="4"> If the dictionary has the correct words, the system adds one to the value of FR for each such word and changes the value of CL to CW if it is not already CW.</Paragraph> <Paragraph position="5"> If the erroneous segmentation was caused by the use of an erroneous word, the ES of that erroneous word is increased by one.</Paragraph> <Paragraph position="6"> (3) For the Unsegmented Parts: The system registers the words in the dictionary with FR equal to 1 and rank equal to CW.</Paragraph> </Section> </Section> </Paper>
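
The feedback rules of Section 2.4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Entry`, `spans` and `feedback` are hypothetical, the system output and the proofread result are compared by character span (one possible reading of "comparing the corrected results with the segmentation results"), and system chunks that are not in the dictionary are treated as unsegmented parts.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One dictionary record; field names follow the paper's abbreviations."""
    fr: int = 0        # FR: frequency of appearance
    cs: int = 0        # CS: correct-segmentation frequency
    es: int = 0        # ES: erroneous-segmentation frequency
    le: int = 0        # LE: length in characters
    rank: str = "S1"   # "CW", "S1", "S2" or "S3"


def spans(words):
    """Map each (start, end) character span to the word that occupies it."""
    out, pos = {}, 0
    for w in words:
        out[(pos, pos + len(w))] = w
        pos += len(w)
    return out


def feedback(dictionary, system_words, corrected_words):
    """Update `dictionary` in place from one proofread sentence."""
    known = set(dictionary)            # words registered before this update
    sys_spans = spans(system_words)
    ok_spans = spans(corrected_words)

    for span, word in ok_spans.items():
        entry = dictionary.get(word)
        if entry is None:
            # Rules (2) and (3): a correct word missing from the dictionary
            # is registered with FR = 1 and rank CW.
            dictionary[word] = Entry(fr=1, le=len(word), rank="CW")
            continue
        entry.fr += 1                  # FR grows for every confirmed word
        if span in sys_spans:
            entry.cs += 1              # Rule (1): the system used it correctly
        entry.rank = "CW"              # promote to CW if it was not already

    for span, word in sys_spans.items():
        # Rule (2): a registered word used at a wrong position gains ES.
        if span not in ok_spans and word in known:
            dictionary[word].es += 1


if __name__ == "__main__":
    # Placeholder strings stand in for Chinese characters, as in Fig. 2.
    d = {"ab": Entry(fr=3, cs=2, le=2), "abc": Entry(fr=1, le=3)}
    feedback(d, system_words=["abc", "d"], corrected_words=["ab", "cd"])
    print(d)
```

In the usage example the system wrongly chose "abc", so that entry gains one ES, while the proofread words "ab" and "cd" are confirmed or newly registered as CW. How LEF then uses these counters is not shown, because the LEF formula itself is not recoverable from this section.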