File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/90/p90-1039_metho.xml
Size: 16,874 bytes
Last Modified: 2025-10-06 14:12:38
<?xml version="1.0" standalone="yes"?> <Paper uid="P90-1039"> <Title>A HARDWARE ALGORITHM FOR HIGH SPEED MORPHEME EXTRACTION AND ITS IMPLEMENTATION</Title> <Section position="4" start_page="0" end_page="308" type="metho"> <SectionTitle> 2 MACHINE DESIGN STRATEGY 2.1 MORPHEME EXTRACTION </SectionTitle> <Paragraph position="0"> Morphological analysis methods are generally composed of two processes: (1) a morpheme extraction process and (2) a morpheme determination process. In process (1), all morphemes that may be used to construct the input text are extracted by searching a morpheme dictionary. These morphemes are extracted as candidates. Therefore, they are selected mainly by morpheme conjunction constraint. Morphemes which actually construct the text are determined in process (2).</Paragraph> <Paragraph position="1"> The authors selected morpheme extraction as the first process to be implemented on specific hardware, for the following three reasons. The first is that the speed-up requirement for the morphological analysis process is very strong in Japanese text parsing systems.</Paragraph> <Paragraph position="2"> [Figure 1: Example of the morpheme extraction process for Japanese text.]</Paragraph> <Paragraph position="4"/> <Section position="1" start_page="307" end_page="308" type="sub_section"> <SectionTitle> 2.2 STRATEGY DISCUSSION </SectionTitle> <Paragraph position="0"> In conventional morpheme extraction methods, which are the software methods used on sequential processing computers, the comparison operation between one key string in the morpheme dictionary and one sub-string of input text is repeated. This is one to one comparison. 
On the other hand, many to one comparison or one to many comparison is practicable in parallel computing.</Paragraph> <Paragraph position="1"> Content-addressable memories (CAMs) (Chisvin, 1989) (Yamada, 1987) realize the many to one comparison. One sub-string of input text is simultaneously compared with all key strings stored in a CAM. However, presently available CAMs have only several tens of kilobits of memory, which is too small to store the data for a dictionary of more than 50,000 morphemes.</Paragraph> <Paragraph position="2"> The above mentioned parallel processing computers realize the one to many comparison. On the parallel processing computers, one processor searches the dictionary at one text position, while another processor searches the same dictionary at the next position at the same time (Nakamura, 1988). However, there is an access conflict problem involved, as already mentioned.</Paragraph> <Paragraph position="3"> The above discussion has led the authors to the following strategy for designing the morpheme extraction machine (Fukushima, 1989a). This strategy is to shorten the one to one comparison cycle. A simple architecture, which will be described in the next section, can realize this strategy. The morphological analysis process is necessary for natural language parsing, because it is the first step in the parsing. However, it is more laborious for Japanese and several other languages, which have no explicit word boundaries, than for English and many European languages (Miyazaki, 1983) (Ohyama, 1986) (Abe, 1986). English text has the advantage of including blanks between words. 
Figure 1 shows an example of the morpheme extraction process for Japanese text.</Paragraph> <Paragraph position="4"> Because all symbols are strung together without any explicit break between words, the morpheme dictionary, which includes more than 50,000 morphemes in Japanese, is searched at almost all positions of Japanese text to extract morphemes. The authors' investigation results, indicating that the morpheme extraction process takes more than 70% of the morphological analysis time in conventional Japanese parsing systems, demonstrate the strong need for the speed-up.</Paragraph> <Paragraph position="5"> The second reason is that the morpheme extraction process is suitable for implementation on specific hardware, because simple character comparison operations account for the largest share of this process. The third reason is that this speed-up will be effective in avoiding the common memory access conflict problem mentioned in Section 1.</Paragraph> </Section> </Section> <Section position="5" start_page="308" end_page="309" type="metho"> <SectionTitle> 3 A HARDWARE ALGORITHM FOR MORPHEME EXTRACTION 3.1 FUNDAMENTAL ARCHITECTURE </SectionTitle> <Paragraph position="0"> A new hardware algorithm for morpheme extraction, which was designed with the strategy mentioned in the previous section, is described in this section.</Paragraph> <Paragraph position="1"> The fundamental architecture, used to implement the algorithm, is shown in Fig. 2. The main components of this architecture are a dictionary block, a shift register block, an index memory, an address generator and comparators.</Paragraph> <Paragraph position="2"> The dictionary block consists of character memories (i.e. 1st character memory, 2nd character memory, ..., N-th character memory). The n-th character memory (1 ≤ n ≤ N) stores the n-th characters of all key strings in the morpheme dictionary, as shown in Fig. 3. 
In Fig. 3, &quot;iI~&quot;, &quot;~f&quot;, &quot;@1:~ &quot;, &quot;~&quot;, &quot;~&quot;, and so on are Japanese morphemes. For morphemes shorter than the key length N, pre-defined remainder symbols fill the rest of their key areas. In Fig. 3, '*' indicates the remainder symbol. The shift register block consists of character registers (i.e. 1st character register, 2nd character register, ..., N-th character register). These registers</Paragraph> <Paragraph position="4"/> <Paragraph position="5"> store the sub-string of input text, which can be shifted, as shown in Fig. 4. The index memory receives a character from the 1st character register.</Paragraph> <Paragraph position="5"> Then, it outputs the top address and the number of morphemes in the dictionary, whose 1st character corresponds to the input character. Because morphemes are arranged in ascending order of their key strings in the dictionary, the pair of the top address and the number expresses the address range in the dictionary. Figure 3 shows the relation between the index memory and the character memories. For example, when the shift register block content is as shown in Fig. 4(a), where '~' is stored in the 1st character register, the index memory's output expresses the address range for the morpheme set {&quot;~&quot;, &quot;~&quot;, &quot;~\]~&quot;, &quot;~\]~ ~\[~&quot;, &quot;~\]~&quot;, ..., &quot;~J&quot;} in Fig. 3. The address generator sets the same address to all the character memories, and changes their addresses simultaneously within the address range which the index memory expresses. Then, the dictionary block outputs all characters constructing one morpheme (key string with length N) simultaneously at one address. The comparators are N in number (i.e. 1st comparator, 2nd comparator, ..., N-th comparator). The n-th comparator compares the character in the n-th character register with the one from the n-th character memory. When the two characters correspond, a match signal is output. 
In this comparison, the remainder symbol operates as a wild card. This means that the comparator also outputs a match signal when the n-th character memory outputs the remainder symbol. Otherwise, it outputs a no match signal.</Paragraph> <Paragraph position="6"> The algorithm, implemented on the above described fundamental architecture, is as follows.</Paragraph> <Paragraph position="7"> * Main procedure Step 1: Load the top N characters from the input text into the character registers in the shift register block.</Paragraph> <Paragraph position="8"> ... morphemes in the dictionary, whose 1st character corresponds to the character in the 1st character register. Then, set the top address for this range to the current address for the character memories.</Paragraph> <Paragraph position="9"> ... simultaneous comparisons at the current address.</Paragraph> <Paragraph position="10"> When all the comparators output match signals, detection of one morpheme is indicated. When at least one comparator outputs the no match signal, there is no detection.</Paragraph> <Paragraph position="11"> Step 2: Increase the current address.</Paragraph> <Paragraph position="12"> For example, Fig. 4(a) shows the sub-string in the shift register block immediately after Step 1 for Main procedure, when the input text is &quot;~J~}~L~ bfc...&quot;. Step 3 for Procedure 1 causes such movement as (a)→(b), (b)→(c), (c)→(d), (d)→(e), and so on. Step 1 and Step 2 for Procedure 1 are executed in each of the states (a), (b), (c), (d), (e), and so on. In state (a) for Fig. 4, the index memory's output expresses the address range for the morpheme set {&quot;~&quot;, &quot;~&quot;~&quot;, &quot;~'~&quot;, &quot;~;&quot;, &quot;~:~\]~&quot;, ..., &quot;~J&quot;} if the dictionary is as shown in Fig. 3. 
Then, Step 1 for Procedure 2 is repeated at each address for the morpheme set {&quot;~:&quot;, &quot;~&quot;, &quot;~f~f&quot;, &quot;~:~&quot;, &quot;~f&quot;, ..., &quot;~&quot;}.</Paragraph> <Paragraph position="13"> Figure 5 shows two examples of Step 1 for Procedure 2. In Fig. 5(a), the current address for the dictionary is at the morpheme &quot;~&quot;. In Fig. 5(b), the address is at the morpheme &quot;~$; \]~&quot;. In Fig. 5(a), all eight comparators output match signals as the result of the simultaneous comparisons. This means that the morpheme &quot;~&quot; has been detected at the top position of the sub-string &quot;~~j~:~ ~ L&quot;. On the other hand, in Fig. 5(b), seven comparators output match signals, but one comparator, at the 2nd position, outputs a no match signal, due to the mismatch between the two characters, '~' and '~\[~'. This means that the morpheme &quot;~\]~&quot; has not been detected at this position.</Paragraph> <Section position="1" start_page="309" end_page="309" type="sub_section"> <SectionTitle> 3.2 EXTENDED </SectionTitle> <Paragraph position="0"/> </Section> </Section> <Section position="6" start_page="309" end_page="311" type="metho"> <SectionTitle> ARCHITECTURE </SectionTitle> <Paragraph position="0"> The architecture described in the previous section treats one stream of text string. In this section, the architecture is extended to treat multiple text streams, and the algorithm for extracting morphemes from multiple text streams is proposed. Generally, in character recognition results or speech recognition results, there is a certain amount of ambiguity, in that a character or a syllable has multiple candidates. Such multiple candidates form the multiple text streams. Figure 6(a) shows an example of multiple text streams, expressed by a two dimensional matrix. One dimension corresponds to the position in the text.</Paragraph> <Paragraph position="1"> The other dimension corresponds to the candidate level. 
Candidates on the same level form one stream. For example, in Fig. 6(a), the character at the 3rd position has three candidates: the 1st candidate is '~', the 2nd one is '~' and the 3rd one is '\]~'. The 1st level stream is &quot;~\]:~:.~...&quot;. The 2nd level stream is &quot;~R...&quot;. The 3rd level stream is &quot;~R ~... &quot;.</Paragraph> <Paragraph position="2"> Figure 6(b) shows an example of the morphemes extracted from the multiple text streams shown in Fig. 6(a). In the morpheme extraction process for the multiple text streams, the key strings in the morpheme dictionary are compared with the combinations of various candidates. For example, &quot;~ ~&quot;, one of the extracted morphemes, is composed of the 2nd candidate at the 1st position, the 1st candidate at the 2nd position and the 3rd candidate at the 3rd position. The architecture described in the previous section can be easily extended to treat multiple text streams. Figure 7 shows the extended architecture. This extended architecture differs from the fundamental architecture in the following three points. First, there are M sets of character registers in the shift register block. Each set is composed of N character registers, which store and shift the sub-string for one text stream. Here, M is the number of text streams. N has already been introduced in Section 3.1. The text streams move simultaneously in all the register sets.</Paragraph> <Paragraph position="3"> Second, the n-th comparator compares the character from the n-th character memory with the M characters at the n-th position in the shift register block. A match signal is output when the character from the memory corresponds to any of the M characters in the registers. Third, a selector is a new component. It changes the index memory's input. It connects one of the registers at the 1st position to the index memory's input, in turn. 
This changeover occurs M times in one state of the shift register block.</Paragraph> <Paragraph position="4"> Regarding the algorithm described in Section 3.1, the following modification enables treating multiple text streams. Procedure 1 and Procedure 1.5, shown below, replace the previous ... morphemes in the dictionary, whose 1st character corresponds to the character in the register at the 1st position with the current level. Then, set the top address for this range to the current address for the character memories.</Paragraph> <Paragraph position="5"> ... comparators output the match signal as a result of simultaneous comparisons, when the morpheme from the dictionary is &quot;~:&quot;. Characters marked with a circle match the characters from the dictionary. This means that the morpheme &quot;~:&quot; has been detected.</Paragraph> <Paragraph position="6"> When each character has M candidates, the worst case time complexity for sequential morpheme extraction algorithms is O(MN). On the other hand, the above proposed algorithm (Fukushima's algorithm) has the advantage that the time complexity is O(M).</Paragraph> <Paragraph position="7"> ... 1988), proposed for speech recognition systems, is similar to Fukushima's algorithm. In Hamaguchi's algorithm, an S bit memory space expresses a set of syllables, when there are S different kinds of syllables (S = 101 in Japanese). The syllable candidates at the same position in input phonetic text are located in one S bit space. Therefore, Hamaguchi's algorithm shows more advantages, as the full set size of syllables is smaller and the number of syllable candidates is larger. On the other hand, Fukushima's algorithm is very suitable for text with a large character set, such as Japanese (more than 5,000 different characters are computer readable in Japanese). 
This algorithm also has the advantage of high speed text stream shift, compared with conventional algorithms, including Hamaguchi's.</Paragraph> </Section> <Section position="7" start_page="311" end_page="312" type="metho"> <SectionTitle> 4 A MORPHEME EXTRACTION MACHINE </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="311" end_page="312" type="sub_section"> <SectionTitle> 4.1 A MACHINE OUTLINE </SectionTitle> <Paragraph position="0"> This section describes a morpheme extraction machine, called MEX-I. It is specific hardware which realizes the extended architecture and algorithm proposed in the previous section.</Paragraph> <Paragraph position="1"> It works as a backend machine for the NEC Personal Computer PC-9801VX (CPU: 80286 or V30, clock: 8MHz or 10MHz). It receives Japanese text from the host personal computer, and returns morphemes extracted from the text after a short time. MEX-I is composed of 12 boards. Approximately 80 memory IC chips (whose total memory storage capacity is approximately 2MB) and 500 logic IC chips are on the boards.</Paragraph> <Paragraph position="2"> The algorithm parameters in MEX-I are as follows. The key length (the maximum morpheme length) in the dictionary is 8 (i.e. N = 8).</Paragraph> <Paragraph position="3"> The maximum number of text streams is 3 (i.e.</Paragraph> <Paragraph position="4"> M = 1, 2, 3). The dictionary includes approximately 80,000 Japanese morphemes. This dictionary size is common in Japanese word processors. The data length for the memories and the registers is 16 bits, corresponding to the character code in Japanese text.</Paragraph> </Section> </Section> </Paper>
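The architecture of Sections 3.1 and 3.2 can be modelled in software to make the data flow concrete. The following is only a minimal sketch, not the authors' implementation: the names `PAD`, `build_dictionary` and `extract` are hypothetical, the ASCII morphemes are invented for illustration, and the loop structure is one reading of the partly lost procedure listing. The parallel comparators are simulated sequentially, and identical candidates at the 1st position are looked up only once, which is a simplification.

```python
PAD = "*"  # remainder symbol, as in Fig. 3 (assumed not to occur in real keys)


def build_dictionary(morphemes, n):
    """Build the n character memories and the index memory (hypothetical helper)."""
    # Keys are padded to length n with the remainder symbol and kept in sorted
    # order, so keys sharing a 1st character occupy one contiguous address range.
    keys = sorted({m.ljust(n, PAD) for m in morphemes if n >= len(m)})
    # The i-th character memory stores the i-th character of every key.
    char_memories = [[key[i] for key in keys] for i in range(n)]
    # Index memory: 1st character -> (top address, number of morphemes).
    index_memory = {}
    for address, key in enumerate(keys):
        top, count = index_memory.get(key[0], (address, 0))
        index_memory[key[0]] = (top, count + 1)
    return char_memories, index_memory


def extract(streams, char_memories, index_memory):
    """Return (position, morpheme) pairs found in M equal-length text streams."""
    n = len(char_memories)
    length = len(streams[0])
    found = []
    for pos in range(length):  # one shift register state per text position
        # Register contents: the M candidates at each of the n positions.
        # Positions past the end of the text hold no candidate.
        registers = [
            {s[pos + i] for s in streams} if length > pos + i else set()
            for i in range(n)
        ]
        # Selector: feed each 1st-position candidate to the index memory in turn.
        for first in dict.fromkeys(s[pos] for s in streams):
            top, count = index_memory.get(first, (0, 0))
            for address in range(top, top + count):  # address generator
                key = [memory[address] for memory in char_memories]
                # A comparator matches on the remainder symbol (wild card)
                # or on any of the M candidates at its position.
                if all(k == PAD or k in r for k, r in zip(key, registers)):
                    found.append((pos, "".join(key).rstrip(PAD)))
    return found


memories, index = build_dictionary(["ab", "abc", "b", "bcd", "cd"], n=4)
print(extract(["abcd"], memories, index))        # single stream (M = 1)
print(extract(["xbc", "ayd"], memories, index))  # two candidate levels (M = 2)
```

With a single stream the sketch behaves like the fundamental architecture. With several streams, a key is reported whenever each of its characters matches some candidate at the corresponding position, as described for the extended comparators.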