<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1119">
  <Title>Automatic Acquisition of Language Model based on Head-Dependent Relation between Words</Title>
  <Section position="3" start_page="0" end_page="724" type="metho">
    <SectionTitle>
2 A Simple Dependency Grammar
</SectionTitle>
    <Paragraph position="0"> In this paper, we assume a kind of simple dependency grammar which describes a language  by a set of head-dependent relations between words. A sentence is analyzed by establishing dependency links between individual words in the sentence. A dependency analysis, :D, of a sentence can be represented with arrows pointing from head to dependent as depicted in Figure 1. For structural generality, we assume that there is always a marking tag, &amp;quot;EOS&amp;quot;(End of Sentence), at the end of a sentence and it has the head word of the sentence as its own dependent(&amp;quot;gave&amp;quot; in Figure 1).</Paragraph>
    <Paragraph position="1"> I gave him a book EOS  A/) is a set of inter-word dependencies which satisfy the following conditions: (1) every word in the sentence has its head in the sentence except the head word of the sentence. (2) every word can have only one head. (3) there is neither crossing nor cycle of dependencies.</Paragraph>
    <Paragraph position="2"> The probabilistic model of the simple dependency grammar is given by</Paragraph>
    <Paragraph position="4"> Here, we define complete-link and complete-sequence which represent partial :Ds for substrings. They are used to construct overall 79s and used as the basic structures for the reestimation algorithm in section 3.</Paragraph>
    <Paragraph position="5"> A set of dependency relations on a word sequence, wij l, is a complete-link when the following conditions are satisfied: * there is (wi -+ wi) or (wi e-- wj) exclusively. null  * Every inner word has a head in the word sequence.</Paragraph>
    <Paragraph position="6"> * Neither crossing nor cycle of dependency relations is allowed.</Paragraph>
    <Paragraph position="7"> tWe use wi for ith word in a sentence and wi,j for the word sequence from wl to wj(i &lt; j).</Paragraph>
    <Paragraph position="8"> k her second child the bus  A complete-sequence is a sequence of 0 or more adjacent complete-links that have the same direction. A unit complete-sequence is defined on a string of one word. It is 0 sequence of complete-links. The direction of a complete-sequence is determined by the direction of the component complete-links. In Figure 3, (a) is a rightward complete-sequence composed of two complete-links, and (b) is a leftward one. (c) is a complete-sequence composed of zero completelinks, and it can be both leftward and rightward. The word of &amp;quot;complete&amp;quot; means that the dependency relations on the inner words are completed and that consequently there is no need to process further on them. From now on, we use Lr(i,j)/Lt(i,j) for rightward/leftward complete-links and Sr(i,j)/St(i,j) for rightward/leftward complete-sequences on wi, j. Any complete-link on wi, j can be viewed as the following combination.</Paragraph>
    <Paragraph position="10"> foram(i&lt;m&lt;j).</Paragraph>
    <Paragraph position="11"> Otherwise, the set of dependencies does not satisfy the conditions of no crossing, no cycle and no multiple heads and is not a complete-link any more.</Paragraph>
    <Paragraph position="12"> Similarly, any complete-sequence on wi,j can be viewed as the following combination.</Paragraph>
    <Paragraph position="14"> foram(i&lt;m&lt;j).</Paragraph>
    <Paragraph position="15"> In the case of complete-sequence, we can prevent multiple constructions of the same  a/) of an n-word sentence. When wk(1 &lt; k &lt;_ n) is the head of the sentence, any D of the sentence can be represented by a St(l, EOS) uniquely by the assumption that there is always the dependency relation, (wk +-- wEos).</Paragraph>
  </Section>
  <Section position="4" start_page="724" end_page="725" type="metho">
    <SectionTitle>
3 Reestimation Algorithm
</SectionTitle>
    <Paragraph position="0"> The reestimation algorithm is a variation of Inside-Outside algorithm(Jelinek et al., 1990) adapted to dependency grammar. In this section we first define the inside-outside probabilities of complete-links and complete-sequences, and then describe the reestimation algorithm based on them 2.</Paragraph>
    <Paragraph position="1"> In the followings, ~ indicates inside probability and a, is for outside probability. The superscripts, l and s, are used for &amp;quot;complete-link&amp;quot; and &amp;quot;complete-sequence&amp;quot; respectively. The subscripts indicate direction: r for &amp;quot;rightward&amp;quot; and I for &amp;quot;leftward&amp;quot;.</Paragraph>
    <Paragraph position="2"> The inside probabilities of complete-links (n~(i,j), Lt(i,j)) and complete-sequences</Paragraph>
    <Paragraph position="4"> ~A little more detailed explanation of the expressions can be found in (Lee and Choi, 1997).</Paragraph>
    <Paragraph position="6"> /37(1, EOS) is the sentence probability because every dependency analysis, D, is represented by a St(l, EOS) and/37(1 , EOS) is sum of the probability of every St(l, EOS).</Paragraph>
    <Paragraph position="7"> probabilities for complete(i, j)) and complete-sequences are as follows.</Paragraph>
    <Paragraph position="9"> Given a training corpus, the initial grammar is just a list of all pairs of unique words in the corpus. The initial pairs represent the tentative head-dependent relations of the words.</Paragraph>
    <Paragraph position="10"> And the initial probabilities of the pairs can be given randomly. The training starts with the initial grammar. The train corpus is analyzed with the grammar and the occurrence frequency of each dependency relation is calculated. Based on the frequencies, probabilities of dependency relations are recalculated by C(wp --+ w~) The process w,) = C(w continues until the entropy of the training corpus becomes the minimum. The frequency of occurrence, C(wi --+ wj), is calculated by</Paragraph>
    <Paragraph position="12"> where O~(wi ~ wj, D, wl,n) is 1 if the dependency relation, (wi --+ wj), is used in the D,  and 0 otherwise. Similarly, the occurrence frequency of the dependency relation, (wi +- wj), is computed by ~----L---o~l(i,j)~\[(i,j ).</Paragraph>
  </Section>
  <Section position="5" start_page="725" end_page="726" type="metho">
    <SectionTitle>
4 Preliminary experiments
</SectionTitle>
    <Paragraph position="0"> We have experimented with three language models, tri-gram model (TRI), bi-gram model (BI), and the proposed model (DEP) on a raw corpus extracted from KAIST corpus 3. The raw corpus consists of 1,589 sentences with 13,139 words, describing animal life in nature. We randomly divided the corpus into two parts: a training set of 1,445 sentences and a test set of 144 sentences. And we made 15 partial training sets which include the first s sentences in the whole training set, for s ranging from 100 to 1,445 sentences. We trained the three language models for each partial training set, and tested the training and the test corpus entropies.</Paragraph>
    <Paragraph position="1"> TRI and BI was trained by counting the occurrence of tri-grams and bi-grams respectively. DEP was trained by running the reestimation algorithm iteratively until it converges to an optimal dependency grammar. On the average, 26 iterations were done for the training sets.</Paragraph>
    <Paragraph position="2"> Smoothing is needed for language modeling due to the sparse data problem. It is to compensate for the overestimated and the underestimated probabilities. Smoothing method itself is an important factor. But our goal is not to find out a better smoothing method. So we fixed on an interpolation method and applied it for the three models. It can be represented as</Paragraph>
    <Paragraph position="4"> The Ks is the global smoothing factor. The bigger the Ks, the larger the degree of smoothing.</Paragraph>
    <Paragraph position="5"> For the experiments we used 2 for Ks.</Paragraph>
    <Paragraph position="6"> We take the performance of a language model to be its cross-entropy on test corpus,  words), POS-tagged collection(6,750,000 words), and tree-tagged collection(30,000 sentences) at present. where the test corpus contains a total of IV\] words and is composed of S sentences.</Paragraph>
    <Section position="1" start_page="725" end_page="726" type="sub_section">
      <Paragraph position="0"> Figure 5 shows the training corpus entropies of the three models. It is not surprising that DEP performs better than BI. DEP can be thought of as a kind of linguistic bi-gram model in which long distance dependencies can be represented through the head-dependent relations between words. TRI shows better performance than both BI and DEP. We think it is because TRI overfits the training corpus, judging from the experimental results for the test corpus.</Paragraph>
      <Paragraph position="1">  For the test corpus, BI shows slightly better performance than TRI as depicted in Figure 6. Increase in the order of n-gram from two to three shows no gains in entropy reduction. DEP, however, Shows still better performance than the n-gram models. It shows about 11.5% entropy reduction to BI and about 11% entropy reduction to TRI. Figure 7 shows the entropies for the mixed corpus of training and test sets. From the results, we can see that head-dependent relations between words are more useful information than the naive n-gram sequences, for language modeling. We can see also that the reestimation algorithm can find out properly the hidden head-dependent relations between words, from a raw corpus.</Paragraph>
      <Paragraph position="2">  Related to the size of model, however, DEP has much more parameters than TRI and BI as depicted in Figure 8. This can be a serious problem when we create a language model from a large body of text. In the experiments, however, DEP used the grammar acquired automatically as it is. In the grammar, many inter-word dependencies have probabilities near 0. If we exclude such dependencies as was experimented for n-grams by Seymore and Rosenfeld (1996), we may get much more compact DEP model with very slight increase in entropy.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>