<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2202">
  <Title>Word Extraction from Corpora and Its Part-of-Speech Estimation Using Distributional Analysis</Title>
  <Section position="4" start_page="1119" end_page="1119" type="metho">
    <SectionTitle>
3 Algorithm
</SectionTitle>
    <Paragraph position="0"> In this section we describe the algorithm used to calculate the word measure of an arbitrary string and the probabilities that the string belongs to each of a set of POSs. We used observations from the EDR corpus, which is divided into words and tagged as to POS, to calculate the POS environments, and then used a raw corpus (no indication of word or morpheme boundaries, and no POS tags) to calculate the string environments.</Paragraph>
    <Section position="1" start_page="1119" end_page="1119" type="sub_section">
      <SectionTitle>
3.1 Calculating POS Environments
</SectionTitle>
      <Paragraph position="0"> The environment of each POS is obtained by calculating statistics on all contexts that precede and follow the POS in a tagged corpus, as follows:  1. Let all elements of left and right probability vectors be 0.</Paragraph>
      <Paragraph position="1"> 2. For each occurrence of the POS in the corpus, increment the left vector element corresponding to the context preceding this occurrence of the POS, and increment the right vector element corresponding to the context following the POS.</Paragraph>
      <Paragraph position="2"> 3. Divide each vector element by the total number of occurrences of the POS.</Paragraph>
      <Paragraph position="3">  Figure 1 shows a sample sentence from the EDR corpus, and Table 2 shows the computation of the one-character environment of Noun in the tiny corpus consisting of this single sentence. In practice, instead of a single character, we used as contexts the preceding or following POS-tagged string (a morpheme or word). [Table 2 column headers: freq., prob., str., str., freq., prob.]</Paragraph>
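The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pos_environments`, the input layout (sentences as lists of (string, POS) pairs), and the boundary markers "BOS" and "EOS" are all hypothetical. Contexts are the neighbouring tagged strings, as described in the text.

```python
from collections import Counter, defaultdict


def pos_environments(tagged_sentences):
    """Return {pos: (left_vector, right_vector)} of relative context frequencies.

    tagged_sentences: list of sentences, each a list of (string, pos) pairs.
    """
    left = defaultdict(Counter)   # step 1: every vector element starts at 0
    right = defaultdict(Counter)
    total = Counter()
    for sentence in tagged_sentences:
        # ASSUMPTION: hypothetical boundary markers, so that the first and
        # last tokens of a sentence also have a context on both sides.
        strings = ["BOS"] + [s for s, _ in sentence] + ["EOS"]
        for i, (_, pos) in enumerate(sentence):
            left[pos][strings[i]] += 1       # step 2: preceding context
            right[pos][strings[i + 2]] += 1  # step 2: following context
            total[pos] += 1
    # step 3: divide each element by the number of occurrences of the POS
    return {
        pos: (
            {c: n / total[pos] for c, n in left[pos].items()},
            {c: n / total[pos] for c, n in right[pos].items()},
        )
        for pos in total
    }
```

Storing each vector as a sparse dictionary keyed by context string means the keys can later double as a morpheme lexicon for matching contexts in raw text.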
    </Section>
    <Section position="2" start_page="1119" end_page="1119" type="sub_section">
      <SectionTitle>
3.2 Calculating String Environments
</SectionTitle>
      <Paragraph position="0"> The calculation of the environment of an arbitrary string (possible word) in a corpus is basically identical to the POS algorithm above, except that because Japanese has no blank space between words and a raw (unsegmented) corpus is used, the extent of the environment is ambiguous. There are two ways to determine the extent of the left and right environment: one is to specify a fixed number of characters, and the other is to use a look-up-and-match procedure to identify specific morphemes. We adopted the second method, and used as a morpheme lexicon the set of hash keys representing the POS environments.</Paragraph>
      <Paragraph position="1"> Where there was a conflict between two or more possible matches of a string context with the POS hash keys, the longest match was selected. For instance, although a right context 'kara' could match either the postposition 'ka' or the postposition 'kara', the longer match 'kara' would always be chosen.</Paragraph>
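The look-up-and-match procedure with longest-match selection might look like the sketch below. The names `longest_match` and `string_environment` are hypothetical, romanized strings stand in for Japanese text, and two details are assumptions not stated in the text: occurrences whose context matches no lexicon entry still count toward the total, and overlapping occurrences of the string are all counted.

```python
from collections import Counter


def longest_match(context, lexicon, from_left):
    """Longest lexicon entry that starts (from_left) or ends the context, else None."""
    for length in range(len(context), 0, -1):
        piece = context[:length] if from_left else context[-length:]
        if piece in lexicon:
            return piece
    return None


def string_environment(corpus, target, lexicon):
    """Relative frequencies of the morphemes adjacent to `target` in a raw corpus."""
    max_len = max(map(len, lexicon))  # assumes a non-empty lexicon
    left, right, total = Counter(), Counter(), 0
    start = corpus.find(target)
    while start != -1:
        end = start + len(target)
        total += 1
        # Only the max_len characters on each side can contain a lexicon entry.
        before = longest_match(corpus[max(0, start - max_len):start], lexicon, False)
        after = longest_match(corpus[end:end + max_len], lexicon, True)
        if before is not None:
            left[before] += 1
        if after is not None:
            right[after] += 1
        start = corpus.find(target, start + 1)
    if total == 0:
        return {}, {}
    return ({c: n / total for c, n in left.items()},
            {c: n / total for c, n in right.items()})
```

With a lexicon containing both 'ka' and 'kara', a right context beginning with 'kara' is counted under 'kara', matching the example in the text.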
    </Section>
    <Section position="3" start_page="1119" end_page="1119" type="sub_section">
      <SectionTitle>
3.3 Optimization
</SectionTitle>
      <Paragraph position="0"> The environments for a string and for each POS which it represents become the parameters of the objective function defined by formula (2), and the optimization of this function then yields the probabilities that the string belongs to each POS. The problem can be solved easily by the optimal gradient method because both the objective function and the feasible region are convex.</Paragraph>
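Formula (2) is not reproduced in this section, so the following sketch only illustrates the general shape of such a convex problem. It assumes, purely for illustration, that the objective is the squared distance between the string environment and a mixture of the POS environments, with the mixture weights constrained to the probability simplex. The solver is a conditional-gradient (Frank-Wolfe) iteration with exact line search, substituted here for the optimal gradient method named in the text. The function names and the single-vector representation of an environment are hypothetical.

```python
def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(value * b.get(key, 0.0) for key, value in a.items())


def pos_probabilities(string_env, pos_envs, iterations=200):
    """Minimise |string_env - sum_k p[k] * pos_envs[k]|^2 over the simplex.

    ASSUMPTION: this quadratic is a stand-in for formula (2), which is not
    given in this section. Returns (p, objective value at p).
    """
    tags = sorted(pos_envs)
    keys = set(string_env).union(*pos_envs.values())
    p = {t: 1.0 / len(tags) for t in tags}  # start at the centre of the simplex

    def mixture():
        return {c: sum(p[t] * pos_envs[t].get(c, 0.0) for t in tags) for c in keys}

    for _ in range(iterations):
        mix = mixture()
        resid = {c: string_env.get(c, 0.0) - mix[c] for c in keys}
        # The gradient with respect to p[t] is -2 * (pos_envs[t] . resid), so the
        # best simplex vertex is the POS whose environment aligns most with resid.
        best = max(tags, key=lambda t: dot(pos_envs[t], resid))
        step_dir = {c: pos_envs[best].get(c, 0.0) - mix[c] for c in keys}
        norm = dot(step_dir, step_dir)
        if norm == 0.0:
            break
        # Exact line search for a quadratic, clipped so p stays in the simplex.
        gamma = min(1.0, max(0.0, dot(resid, step_dir) / norm))
        if gamma == 0.0:
            break
        for t in tags:
            p[t] += gamma * ((1.0 if t == best else 0.0) - p[t])

    mix = mixture()
    resid = {c: string_env.get(c, 0.0) - mix[c] for c in keys}
    return p, dot(resid, resid)
```

Because every update is a convex combination of the current point and a vertex of the simplex, the weights remain non-negative and sum to one without an explicit projection step.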
    </Section>
  </Section>
</Paper>