<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2148">
  <Title>A Stochastic Language Model using Dependency and Its Improvement by Word Clustering</Title>
  <Section position="3" start_page="898" end_page="900" type="metho">
    <SectionTitle>
2 Stochastic Language Model based
on Dependency
</SectionTitle>
    <Paragraph position="0"> In this section, we propose a stochastic language model based on dependency. Formally this model is based on a stochastic context-free grammar (SCFG).</Paragraph>
    <Paragraph position="1"> The terminal symbol is the attribute of a bunsetsu, represented by the product of the head of the content part and that of the function part. From the attribute, a word sequence that matches the bun.</Paragraph>
    <Paragraph position="2"> setsu is predicted by a word-based 2-gram model, and unknown words axe predicted from POS by a character-based 2-gram model.</Paragraph>
    <Section position="1" start_page="898" end_page="898" type="sub_section">
      <SectionTitle>
2.1 Sentence Model
</SectionTitle>
      <Paragraph position="0"> A Japanese sentence is considered as a sequence of units called bunsetsu composed of one or more content words and function words. Let Cont be a set of content words, Func a set of function words and Sign a set of punctuation symbols. Then bunsetsu is defined as follows: Bnst = Cont+ Func * U Cont+ Func* Sign, where the signs &amp;quot;+&amp;quot; and &amp;quot;*&amp;quot; mean positive closure and Kleene closure respectively. Since the relations between bunsetsu known as dependency are not always between sequential ones, we use SCFG to describe them (Fu, 1974). The first problem is how to choose terminal symbols. The simplest way is to select each bunsetsu as a terminal symbol. In this case, however, the data-sparseness problem would surely be invoked, since the number of possible bunsetsu is enormous. To avoid this problem we use the concept of class proposed for a word n-gram model (Brown et al., 1992). All bunsetsu axe grouped by the attribute defined as follows:</Paragraph>
      <Paragraph position="2"> where the functions cont, func and sign take a bun~etsu as their argument and return its content word sequence, its function word sequence and its punctuation respectively. In addition, the function last(m) returns the POS of the last element of word sequence m or NULL if the sequence has no word.</Paragraph>
      <Paragraph position="3"> Given the attribute, the content word sequence and the function word sequence of the bunsetsu axe independently generated by word-based 2-gram models (Mori and Yamaji, 1997).</Paragraph>
    </Section>
    <Section position="2" start_page="898" end_page="900" type="sub_section">
      <SectionTitle>
2.2 Dependency Model
</SectionTitle>
      <Paragraph position="0"> In order to describe the relation between bunsetsu called dependency, we make the generally accepted assumption that no two dependency relations cross each other, and we introduce a SCFG with the attribute of bunsetsu as terminals. It is known, as a characteristic of the Japanese language, that each bunsetsu depends on the single bunsetsu appearing just before it. We say of two sequential bunsetsu that the first to appear is the anterior and the second is the posterior. We assume, in addition, that the dependency relation is a binary relation - that each relation is independent of the others. Then this relation is representing by the following form of rewriting rule of CFG: B =~ AB, where A is the attribute of the anterior bunsetsu and B is that of the posterior.</Paragraph>
      <Paragraph position="1"> Similarly to terminal symbols, non-terminal symbols can be defined as the attribute of bunsetsu. Also they can be defined as the product of the attribute and some additional information to reflect the characteristics of the dependency. It is reported that the dependency is more frequent between closer bunsetsu in terms of the position in the sentence (Maruyama and Ogino, 1992). In order to model these characteristics, we add to the attribute of bunsetsu an  (verb. ending, period. 2.0) (noun, NULL. comma, O, 0) kyou/noun ./sign (today) (noun. postp.. NULL. 0. 0)  additional information field holding the number of bunsetsu depending on it. Also the fact that a bun. setsu has a tendency to depend on a bunsetsu with comma. For this reason the number of bunsetsu with comma depending on it is also added. To avoid data-sparseness problem we set an upper bound for these numbers. Let d be the number of bunsetsu depending on it and v be the number of bunsetsu with comma depending on it, the set of terminal symbols T and that of non-terminal symbols V is represented as follows (see Figure 1):</Paragraph>
      <Paragraph position="3"> It should be noted that terminal symbols have no bunsetsu depending on them. It follows that all rewriting rules are in the following forms:</Paragraph>
      <Paragraph position="5"> where a is the attribute of bunsetsu.</Paragraph>
      <Paragraph position="6"> The attribute sequence of a sentence is generated through applications of these rewriting rules to the start symbol S. Each rewriting rule has a probability and the probability of the attribute sequence is the product of those of the rewriting rules used for its generation. Taking the example of Figure 1, this value is calculated as follows:  (verb, ending, period, 0, 0)).</Paragraph>
      <Paragraph position="7"> The probability value of each rewriting rule is estimated from its frequency N in a syntactically annotated corpus as follows:</Paragraph>
      <Paragraph position="9"> In a word n-gram model, in order to cope with data-sparseness problem, the interpolation technique is applicable to SCFG. The probability of the interpolated model of grammars G1 and G2, whose  probabilities axe P1 and P2 respectively, is represented as follows:</Paragraph>
      <Paragraph position="11"> estimated by held-out method or deleted interpolation method (Jelinek et al., 1991).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="900" end_page="900" type="metho">
    <SectionTitle>
3 Word Clustering
</SectionTitle>
    <Paragraph position="0"> The model we have mentioned above uses the POS given manually for the attribute of bunsetsu. Changing it into some class may improve the predictive power of the model. This change needs only a slight replacement in the model representing formula (1): the function last returns the class of the last word of a word sequence rn instead of the POS. The problem we have to solve here is how to obtain such classes i.e. word clustering. In this section, we propose an objective function and a search algorithm of the word clustering.</Paragraph>
    <Section position="1" start_page="900" end_page="900" type="sub_section">
      <SectionTitle>
3.1 Objective Function
</SectionTitle>
      <Paragraph position="0"> The aim of word clustering is to build a language model with less cross entropy without referring to the test corpus. Similar reseaxch has been successful, aiming at an improvement of a word n-gram model both in English and Japanese (Mori et al., 1997). So we have decided to extend this research to obtain an optimal word-class relation. The only difference from the previous research is the language model. In this case, it is a SCFG in stead of a n-gram model. Therefore the objective function, called average cross entropy, is defined as follows:</Paragraph>
      <Paragraph position="2"> where Li is the i-th learning corpus and Mi is the language model estimated from the learning corpus excluding the i-th learning corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="900" end_page="900" type="sub_section">
      <SectionTitle>
3.2 Algorithm
</SectionTitle>
      <Paragraph position="0"> The solution space of the word clustering is the set of all possible word-class relations. The caxdinality of the set, however, is too enormous for the dependency model to calculate the average cross entropy for all word-class relations and select the best one. So we abandoned the best solution and adopted a greedy algorithm as shown in Figure 2.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="900" end_page="901" type="metho">
    <SectionTitle>
4 Syntactic Analysis
</SectionTitle>
    <Paragraph position="0"> Syntactic Analysis is defined as a function which receives a character sequence as an input, divides it into a bunsetsu sequence and determines dependency relations among them, where the concatenation of character sequences of all the bunsetsu must Let ml, m2, ..., mn be .b4 sorted in the descending order of frequency.</Paragraph>
    <Paragraph position="2"> be equal to the input. Generally there axe one or more solutions for any input. A syntactic analyzer chooses the structure which seems the most similar to the human decision. There are two kinds of analyzer: one is called a rule-based analyzer, which is based on rules described according to the intuition of grarnmarians; the other is called a corpus-based analyzer, because it is based on a large number of analyzed examples. In this section, we describe a stochastic syntactic analyzer, which belongs to the second category.</Paragraph>
    <Section position="1" start_page="900" end_page="901" type="sub_section">
      <SectionTitle>
4.1 Stochastic Syntactic Analyzer
</SectionTitle>
      <Paragraph position="0"> A stochastic syntactic analyzer, based on a stochastic language model including the concept of dependency, calculates the syntactic tree (see Figure 1) with the highest probability for a given input x according to the following formula:</Paragraph>
      <Paragraph position="2"/>
      <Paragraph position="4"> where to (T) represents the character sequence of the syntactic tree T. P(T) in the last line is a stochastic language model including the concept of dependency. We use, as such a model, the POS-based dependency model described in section 2 or the class-based dependency model described in section 3.</Paragraph>
    </Section>
    <Section position="2" start_page="901" end_page="901" type="sub_section">
      <SectionTitle>
4.2 Solution Search Algorithm
</SectionTitle>
      <Paragraph position="0"> The stochastic context-free grammar used for syntactic analysis consists of rewriting rules (see formula (3)) in Chom~ky normal form (Hopcroft and Ullman, 1979) except for the derivation from the start symbol (formula (2)). It follows that a CKY method extended to SCFG, a dynamic-programming method, is applicable to calculate the best solution in O(n 3) time, where n is the number of input characters. It should be noted that it is necessary to multiply the probability of the derivation from the start symbol at the end of the process.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML