<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2132">
  <Title>A Multi-Neuro Tagger Using Variable Lengths of Contexts</Title>
  <Section position="3" start_page="802" end_page="802" type="metho">
    <SectionTitle>
2 POS Tagging Problems
</SectionTitle>
    <Paragraph position="0"> Since each input Thai text can be segmented into individual words that can be further tagged with all possible POSs using an electronic Thai dictionary, the POS tagging tasks can be regarded as a kind of POS disambiguation problem using contexts as follows:</Paragraph>
    <Paragraph position="2"> where ipt_t is the element related to the possible POSs of the target word, (ipt_lt,..., ipt_ll) and (ipt_rl,...,ipt_rr) are the elements related to the contexts, i.e., the POSs of the words to the left and right of the target word, respectively, and POS_t is the correct POS of the target word in the contexts.</Paragraph>
  </Section>
  <Section position="4" start_page="802" end_page="802" type="metho">
    <SectionTitle>
3 Information Gain
</SectionTitle>
    <Paragraph position="0"> Suppose each element, ipt_x (x = li,t, or rj), in (1) has a weight, w_z, which can be obtained using information theory as follows. Let S be the training set and Ci be the ith class, i.e., the ith POS (i = 1,...,n, where n is the total number of POSs). The entropy of the set S, i.e., the average amount of information needed to identify the class (the POS) of an example in 5', is in f o( S) = _ ~-~ f req( Ci, S) ~(~\]i, S) ), ISl x In( fre (2) where ISl is the number of examples in S and freq(Ci, S) is the number of examples belonging to class Ci. When S has been partitioned to h subset Si (i = 1,...,h) according to the element ipt.x, the new entropy can be found as the weighted sum over these subsets, or</Paragraph>
    <Paragraph position="2"> Thus, the quantity of information gained by this partitioning, or by knowing the POSs of element ipt_x, can be obtained by</Paragraph>
    <Paragraph position="4"> which is used as the weight, w_T, i.e., w_x= gain(x). (5)</Paragraph>
  </Section>
  <Section position="5" start_page="802" end_page="804" type="metho">
    <SectionTitle>
4 Multi-Neuro Tagger
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="802" end_page="803" type="sub_section">
      <SectionTitle>
4.1 Single-Neuro Tagger
</SectionTitle>
      <Paragraph position="0"> Figure 1 shows a single-neuro tagger (SNT) which consists of a 3-layer feedforward neural network. The SNT can disambiguate the POS of each word using a fixed length of the context by training it in a supervised manner with a well-known error back-propagation algorithm (for details see e.g., Haykin, 1994).</Paragraph>
      <Paragraph position="2"> When word x is given in position y (y = t, li, or rj), element ipt-y of input IPT is a weighted pattern defined as</Paragraph>
      <Paragraph position="4"> where w_y is the weight obtained in (5), n is the total number of POSs defined in Thai, and  Izi = w_y.e~i ( i = 1,...,n ). Ifx is aknown word, i.e., it appears in the training data, each bit ezi is obtained as follows:</Paragraph>
      <Paragraph position="6"> Here tile Prob(POSi\[x) is the prior probability of POSi that the word x can be and is estimated from tile training data as</Paragraph>
      <Paragraph position="8"> where IPOSi,x\[ is the number of times both POSi and x appear and Ixl is the number of times x appears in all the training data. If x is an unknown word, i.e., it does not appear in the training data, each bit e,:i is obtained as follows: 1__ if POSi is a candidate = n,' (9) exi 0, otherwise, where nx is the number of POSs that the word x can be (this number can be simply obtained from an electronic Thai dictionary). The OPT is a pattern defined as follows:</Paragraph>
      <Paragraph position="10"> There is more information available for constructing the input for the words on the left because they have already been tagged. In the tagging phase, instead of using (6)-(9), the input may be constructed simply as follows: ipt_li(t) = wdi. OPT(t - i), (12) where t is the position of the target word in a sentence and i = 1,2,...,1 for t - i &gt; 0. However, in the training process the output of the tagger is not correct and cannot be fed back to the inputs directly. Instead, a weighted average of the actual output and the desired output is used as follows:</Paragraph>
      <Paragraph position="12"> where EOBJ and EACT are the objective and actual errors, respectively. Thus, the weighting of the desired output is large at the beginning of the training, and decreases to zero during training. null</Paragraph>
    </Section>
    <Section position="2" start_page="803" end_page="804" type="sub_section">
      <SectionTitle>
4.2 Multi-Neuro Tagger
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows the structure of the multi-neuro tagger. The individual SNTi has input IPTi with length (the number of input elements: l +</Paragraph>
      <Paragraph position="2"> When a sequence of words (word_ll, ..., word_ll, word_t, word_r1, ..., word_r~), which has a target word word_t in the center and a maximum length l(IPTm ), is inputed, its subsequence of words, which also has the target word word_t in the center and length l(IPTi), will be encoded into IPTi in the same way as described in the previous section. The outputs OPTi (for</Paragraph>
      <Paragraph position="4"> coded into RSTi by (11). The RSTi are next inputed into the longest-context-priority selector which obtains the final result as follows:</Paragraph>
      <Paragraph position="6"> This means that the output of the single-neuro tagger that gives a result being not unknown and has the largest length of input is regarded as a final answer.</Paragraph>
    </Section>
    <Section position="3" start_page="804" end_page="804" type="sub_section">
      <SectionTitle>
4.3 Training
</SectionTitle>
      <Paragraph position="0"> If we use the weights trained by the single-neuro taggers with short inputs as the initial values of those with long inputs, the training time for the latter ones can be greatly reduced and the cost to train multi-neuro taggers would be almost the same as that to train the single-neuro taggers. Figure 3 shows an example of training a tagger with four input elements. The trained weights, w\] and w2, of the tagger with three input elements are copied to the corresponding part of the tagger and used as initial values for its training.</Paragraph>
    </Section>
    <Section position="4" start_page="804" end_page="804" type="sub_section">
      <SectionTitle>
4.4 Features
</SectionTitle>
      <Paragraph position="0"> Suppose that at most seven elements are adopted in the inputs for tagging and that there are 50 POSs. The n-gram models must estin\]ate 50 T = 7.8e + 11 n-grams, while the single-neuro tagger with the longest input uses only 70,000 weights, which can be calculated by nipt * nhid q- nhid * nopt where nipt, nhid, and nopt are, respectively, the number of units in the input, the hidden, and the output layers, and nhid is set to be nipt/2. That neuro models require few parameters may offer another advantage: their performance is less affected by a small amount of training data than that of the statistical methods (Schmid, 1994). Neuro taggers also offer fast tagging compared to other models, although its training stage is longer.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>