<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0114"> <Title>Towards Automatic Grammar Acquisition from a Bracketed Corpus Thanaruk Theeramunkong</Title> <Section position="4" start_page="168" end_page="169" type="metho"> <SectionTitle> 2 Grammar Acquisition with a Bracketed Corpus </SectionTitle> <Paragraph position="0"> In this section, we give a brief explanation of grammar acquisition using a bracketed corpus. In this work, the grammar acquisition utilizes a lexically tagged corpus with bracketings. An example of the parse structures of two sentences in the corpus is shown graphically in Figure 1.</Paragraph> <Paragraph position="2"> (Figure 1: parse structures of the sentences "A big man slipped on the ice" and "The boy dropped his wallet somewhere".) In the parse structures, each terminal category (leaf node) is given a name (tag) while there is no label for each nonterminal category (intermediate node). With this corpus, the grammar learning task corresponds to a process that determines the nonterminal label of each bracket in the corpus. More precisely, this task is concerned with the way to classify the brackets into certain groups and give each group a label. For instance, in Figure 1, it is reasonable to classify the brackets¹ (c2), (c4) and (c5) into the same group and give them the same label (e.g., NP (noun phrase)). As a result, we obtain three grammar rules: NP → (ART)(NOUN), NP → (PRON)(NOUN) and NP → (ART)(c1). To perform this task, our grammar acquisition algorithm operates in the following stages; a schematic sketch of the core merging loop, steps (1)-(4), is given at the end of this section.</Paragraph> <Paragraph position="3"> 1. Assign a unique label to each node whose lower nodes are assigned labels. At the initial step, such a node is one whose lower nodes are lexical categories². This process is performed throughout all parse trees in the corpus.</Paragraph> <Paragraph position="4"> ¹ A bracket corresponds to a node in Figure 1.</Paragraph> <Paragraph position="5"> ² In Figure 1, there are three unique labels derived: c1 → (ADJ)(NOUN), c2 → (ART)(NOUN) and c5 → (PRON)(NOUN).</Paragraph> <Paragraph position="6"> 2. Calculate the similarity of every pair of the derived labels.</Paragraph> <Paragraph position="7"> 3. Merge the most similar pair into a single new label (i.e., a label group) and recalculate the similarity of this new label with the other labels.</Paragraph> <Paragraph position="8"> 4. Repeat (3) until a termination condition is detected. As a result of this step, a certain set of label groups is derived.</Paragraph> <Paragraph position="9"> 5. Replace the labels in each label group with a new label in the corpus. For example, if (ART)(NOUN) and (PRON)(NOUN) are in the same label group, we replace them with a new label (such as NP) in the whole corpus.</Paragraph> <Paragraph position="10"> 6. Repeat (1)-(5) until all brackets (nodes) in the corpus are assigned labels. In this paper, as a first step of our grammar acquisition, we focus on steps (1)-(4), that is, how to group nodes whose lower nodes are lexical categories. Figure 2 depicts an example of the grouping process.</Paragraph> <Paragraph position="12"> To compute the similarity of a pair of labels (in step 2), we propose two types of techniques called distributional analysis and hierarchical Bayesian clustering, as shown in section 3. In section 4, we introduce the concept of differential entropy as the termination condition used in step (4).</Paragraph> </Section>
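The following is a minimal Python sketch (not from the original paper) of the merging loop in steps (1)-(4). The function name `acquire_label_groups` and the callables `similarity` and `should_terminate` are hypothetical placeholders for the concrete measures described in sections 3 and 4.

```python
# Minimal sketch of steps (1)-(4): greedy agglomerative grouping of
# bracket labels.  `similarity` and `should_terminate` stand in for the
# measures of sections 3 and 4.
from itertools import combinations

def acquire_label_groups(labels, similarity, should_terminate):
    """labels: the initially derived unique labels, e.g. ['c1', 'c2', 'c5'].
    Returns the label groups found before the termination condition fires."""
    groups = [frozenset([label]) for label in labels]    # step (1): one group per label
    while len(groups) > 1:
        # step (2): similarity of every pair of current groups
        scored = [(similarity(a, b), a, b) for a, b in combinations(groups, 2)]
        score, a, b = max(scored, key=lambda t: t[0])
        if should_terminate(a, b, score):                # step (4): stop before a bad merge
            break
        groups = [g for g in groups if g not in (a, b)]  # step (3): merge the best pair
        groups.append(a | b)
    return groups
```

Each surviving group would then receive a fresh nonterminal label (step 5) before the whole procedure is repeated on the next layer of brackets (step 6).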
<Section position="5" start_page="169" end_page="170" type="metho"> <SectionTitle> 3 Local Contextual Information as Similarity Measure </SectionTitle> <Paragraph position="0"> In this section, we describe two techniques which utilize "local contextual information" to calculate the similarity between two labels. The term "local contextual information" considered here is represented by the pair of words immediately before and after a label. In the rest of this section, we first describe distributional analysis in subsection 3.1. Next, we give the concept of Bayesian clustering in subsection 3.2.</Paragraph> <Section position="1" start_page="169" end_page="170" type="sub_section"> <SectionTitle> 3.1 Distributional Analysis </SectionTitle> <Paragraph position="0"> Distributional analysis is a statistical method originally proposed by Harris [Har51] to uncover regularities in the distributional relations among the features of speech. Applications of this technique are varied [Bri92][Per93]. In this paper, we apply this technique to group similar brackets in a bracketed corpus. The detail of this technique is illustrated below.</Paragraph> <Paragraph position="1"> Let P1 and P2 be two probability distributions over environments. The relative entropy between P1 and P2 is defined as follows.

D(P_1 \| P_2) = \sum_{e} P_1(e) \log \frac{P_1(e)}{P_2(e)}

</Paragraph> <Paragraph position="3"> Relative entropy D(P1||P2) is a measure of the amount of extra information beyond P2 needed to describe P1. The divergence between P1 and P2 is defined as D(P1||P2) + D(P2||P1), and is a measure of how difficult it is to distinguish between the two distributions. The environment is the pair of words immediately before and after a label (bracket). A pair of labels is considered to be identical when they are distributionally similar, i.e., when the divergence of their probability distributions over environments is low.</Paragraph> <Paragraph position="4"> The probability distribution can be simply calculated by counting the occurrences of (α) and (word1 α word2) for each label α. For the example in Figure 1, the numbers of appearances of (c1), (c2), (c5), (ART c1 VI), (PREP c2 NULL) and (VT c5 ADV) are collected from the whole corpus. NULL stands for a blank tag representing the beginning or ending mark of a sentence.</Paragraph> <Paragraph position="5"> When divergence is utilized as a similarity measure, a serious problem is caused by the sparseness of the existing data or by characteristics of the language itself. In the formula of relative entropy, there is a possibility that P2(e) becomes zero. In this case, we cannot calculate the divergence of the two probability distributions. To cope with this problem, we extend the original probability to the one shown in the following formula.

P_\alpha(e) = (1 - \lambda) \frac{N(\alpha, e)}{N(\alpha)} + \lambda \frac{1}{(N_{tags} + 1)^2}

</Paragraph> <Paragraph position="7"> where N(α, e) is the number of times α occurs in environment e, N(α) is the occurrence frequency of α, N_tags is the number of terminal categories and λ is an interpolation coefficient. The first term on the right side of the formula is the original estimated probability. The second term is generally called a uniform distribution, where the probability of an unseen event is estimated as a uniform fixed number. λ is applied as a balancing weight between the observed distribution and the uniform distribution. Intuitively, when the size of the data is large, a small number should be used as λ. In the experimental results in this paper, we assigned λ a value of 0.6.</Paragraph> </Section>
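As an illustration of subsection 3.1, here is a small Python sketch of the smoothed environment distribution and the symmetric divergence. The function names (`smoothed_dist`, `divergence`), the toy environment counts, and the use of (N_tags + 1)² as the size of the environment space (terminal categories plus the NULL boundary tag, squared for the before/after pair) are assumptions for illustration, not values taken from the paper.

```python
# Sketch of the divergence-based similarity of subsection 3.1.
# Environments are (preceding tag, following tag) pairs; the counts below
# are illustrative.  lam corresponds to the interpolation coefficient
# (0.6 in the paper's experiments); the (n_tags + 1)**2 uniform term is an
# assumption about the size of the environment space.
import math

def smoothed_dist(counts, environments, n_tags, lam=0.6):
    """Interpolate the observed relative frequencies with a uniform distribution."""
    total = sum(counts.values())
    uniform = 1.0 / ((n_tags + 1) ** 2)
    return {e: (1 - lam) * counts.get(e, 0) / total + lam * uniform
            for e in environments}

def divergence(p1, p2):
    """D(P1||P2) + D(P2||P1): low divergence = distributionally similar labels."""
    kl = lambda p, q: sum(p[e] * math.log(p[e] / q[e]) for e in p if p[e] > 0)
    return kl(p1, p2) + kl(p2, p1)

# toy counts for two labels over (before, after) environments
envs = {('ART', 'VI'), ('PREP', 'NULL'), ('VT', 'ADV')}
c2 = smoothed_dist({('PREP', 'NULL'): 3, ('ART', 'VI'): 1}, envs, n_tags=10)
c5 = smoothed_dist({('VT', 'ADV'): 2, ('PREP', 'NULL'): 1}, envs, n_tags=10)
print(divergence(c2, c5))
```

Because the smoothing keeps every environment probability strictly positive, the divergence is always defined, which is exactly the motivation given above.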
<Section position="2" start_page="170" end_page="170" type="sub_section"> <SectionTitle> 3.2 Hierarchical Bayesian Clustering Method </SectionTitle> <Paragraph position="0"> As a probabilistic method, hierarchical Bayesian clustering was proposed by Iwayama [Iwa95] to automatically classify given texts. It was applied to improve the efficiency and the effectiveness of text retrieval/categorization. Referring to this method, we try to make use of the Bayesian posterior probability as another similarity measure for grouping similar brackets. In this section, we summarize the concept of this measure as follows.</Paragraph> <Paragraph position="1"> Let us denote the posterior probability by P(G|C), where C is a collection of data (i.e., in Figure 2, C = {c1, c2, ..., cN}) and G is a set of groups (clusters) (i.e., G = {g1, g2, ...}). Each group (cluster) gj is a set of data and the groups are mutually exclusive. In the initial stage, each group is a singleton set; gi = {ci} for all i. The method tries to select and merge the group pair that brings about the maximum value of the posterior probability P(G|C). That is, in each step of merging, this method searches for the most plausible situation in which the data in C are partitioned into the groups G.</Paragraph> <Paragraph position="2"> For instance, at a merge step k + 1 (0 ≤ k < N - 1), the data collection C has been partitioned into a set of groups Gk; that is, each datum c belongs to a group g ∈ Gk. The posterior probability at merging step k + 1 can be calculated from the posterior probability at merging step k as shown below, where gx and gy are the pair of groups being merged (for more detail, see [Iwa95]).

P(G_{k+1} | C) = P(G_k | C) \cdot \frac{PC(G_{k+1})}{PC(G_k)} \cdot \frac{SC(g_x \cup g_y)}{SC(g_x)\,SC(g_y)}

</Paragraph> <Paragraph position="4"> Here PC(Gk) corresponds to the prior probability that N random data are classified into a set of groups Gk. As for the factor PC(G_{k+1})/PC(G_k), a well-known estimate [Ris89] is applied and it is reduced to a constant value A^{-1} regardless of the merged pair. For a certain merging step, P(Gk|C) is identical independently of which groups are merged together. Therefore we can use the following measure to select the best group pair to merge. The similarity between two bracket groups (labels), gx and gy, is defined by SIM(gx, gy). Here, the larger SIM(gx, gy) is, the more similar the two bracket groups are.

SIM(g_x, g_y) = \frac{SC(g_x \cup g_y)}{SC(g_x)\,SC(g_y)}, \qquad SC(g) = \prod_{c \in g} P(c|g), \qquad P(c|g) = P(c) \sum_{e} \frac{P(e|c)\,P(e|g)}{P(e)}

</Paragraph> <Paragraph position="6"> where SC(g) expresses the probability that all the labels in a group g are produced from the group, the elemental probability P(c|g) means the probability that a group g produces its member c, P(e|c) denotes the relative frequency of an environment e of a label c, P(e|g) means the relative frequency of an environment e of a group g, and P(e) is the relative frequency of an environment e over the entire label set. In the calculation of SIM(gx, gy), we can ignore the value of P(c) because it occurs |gx ∪ gy| times in both the denominator and the numerator. Normally, SIM(gx, gy) ranges between 0 and 1 due to the fact that P(c | gx ∪ gy) ≤ P(c | gx) when c ∈ gx.</Paragraph> </Section> </Section>
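A Python sketch of the SIM measure is given below, assuming the formulas as reconstructed above: P(e|c), P(e|g) and P(e) are taken as relative frequencies of environments for a label, a group, and the entire label set, and the P(c) factor is dropped because it cancels in SIM. The helper names (`rel_freq`, `sim`, `env_freq`) and the toy counts are illustrative, not from the paper.

```python
# Sketch of the Bayesian-clustering similarity of subsection 3.2:
# SC(g) multiplies P(c|g) over the members of g, and P(c|g) is taken to be
# proportional to sum_e P(e|c) P(e|g) / P(e)  (the P(c) factor cancels in SIM).
from collections import Counter

def rel_freq(counter):
    total = sum(counter.values())
    return {e: n / total for e, n in counter.items()}

def sim(gx, gy, env_freq):
    """gx, gy: frozensets of label names; env_freq: label -> Counter of environments."""
    p_e_all = rel_freq(sum((env_freq[c] for c in env_freq), Counter()))

    def sc(group):
        p_e_g = rel_freq(sum((env_freq[c] for c in group), Counter()))
        prod = 1.0
        for c in group:
            p_e_c = rel_freq(env_freq[c])
            prod *= sum(p_e_c[e] * p_e_g.get(e, 0.0) / p_e_all[e] for e in p_e_c)
        return prod

    return sc(gx | gy) / (sc(gx) * sc(gy))

# illustrative environment counts for two labels
env_freq = {'c2': Counter({('PREP', 'NULL'): 3, ('ART', 'VI'): 1}),
            'c5': Counter({('VT', 'ADV'): 2, ('PREP', 'NULL'): 1})}
print(sim(frozenset({'c2'}), frozenset({'c5'}), env_freq))
```

A function like `sim` could serve directly as the `similarity` callable in the merging-loop sketch at the end of section 2, with the most similar pair being the one that maximizes it.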
<Section position="6" start_page="170" end_page="170" type="metho"> <SectionTitle> 4 Differential Entropy as Termination Condition </SectionTitle> <Paragraph position="0"> If we iteratively merge the most similar labels, all labels will finally be gathered into a single group.</Paragraph> <Paragraph position="1"> Due to this, it is necessary to provide a criterion for determining whether the merging process should be continued or terminated. In this section, we describe a criterion named differential entropy, which is a measure of the entropy (perplexity) fluctuation before and after merging a pair of labels. Let c1 and c2 be the most similar pair of labels based on divergence or Bayesian posterior probability. Also let c3 be the resulting label. Pc1(e), Pc2(e) and Pc3(e) are the probability distributions over environments e of c1, c2 and c3, respectively. Pc1, Pc2 and Pc3 are the estimated probabilities of c1, c2 and c3, respectively.</Paragraph> <Paragraph position="2"> The differential entropy (ΔE) is defined as follows.

\Delta E = -P_{c_3} \sum_{e} P_{c_3}(e) \log P_{c_3}(e) + P_{c_1} \sum_{e} P_{c_1}(e) \log P_{c_1}(e) + P_{c_2} \sum_{e} P_{c_2}(e) \log P_{c_2}(e)

That is, ΔE is the weighted entropy of the merged label minus the total weighted entropy of the two original labels.</Paragraph> <Paragraph position="4"> The larger ΔE is, the larger the information fluctuation before and after merging becomes. Generally, we prefer a small fluctuation to a larger one. When ΔE is large, the current merging process introduces a large amount of information fluctuation and its reliability should be low. From this viewpoint, we apply this measure as a criterion for determining the termination of the merging process, which will be given in the next section.</Paragraph> </Section> </Paper>
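A possible realization of the differential-entropy test, under the reconstruction above, is sketched below. The way the merged label's environment distribution is pooled (a probability-weighted mixture), the choice of P_c as a label's relative frequency, and the fixed stopping threshold are assumptions for illustration only; the paper itself defers the concrete termination criterion to its next section.

```python
# Sketch of the differential-entropy termination test of section 4:
# weighted entropy of the merged label minus the weighted entropies of the
# two original labels.  The pooling of the merged distribution and the
# threshold are illustrative assumptions.
import math

def weighted_entropy(p_label, env_dist):
    """Entropy of a label's environment distribution, weighted by the label's probability."""
    return -p_label * sum(p * math.log(p) for p in env_dist.values() if p > 0)

def delta_entropy(p1, d1, p2, d2):
    """p1, p2: estimated probabilities of labels c1, c2; d1, d2: their
    environment distributions.  The merged label c3 pools both (assumption)."""
    p3 = p1 + p2
    d3 = {e: (p1 * d1.get(e, 0.0) + p2 * d2.get(e, 0.0)) / p3
          for e in set(d1) | set(d2)}
    return weighted_entropy(p3, d3) - weighted_entropy(p1, d1) - weighted_entropy(p2, d2)

def should_terminate(p1, d1, p2, d2, threshold=0.1):
    # stop merging when the best candidate merge would introduce too large a fluctuation
    return delta_entropy(p1, d1, p2, d2) > threshold
```

Such a predicate would slot into the `should_terminate` placeholder of the merging-loop sketch in section 2, stopping the agglomeration before an unreliable merge is made.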