<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1012">
  <Title>Combination of Symbolic and Statistical Approaches for Grammatical Knowledge Acquisition</Title>
  <Section position="4" start_page="72" end_page="73" type="metho">
    <SectionTitle>
2 The System Organization
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="72" end_page="72" type="sub_section">
      <SectionTitle>
2.1 Hypothesis Generation
</SectionTitle>
      <Paragraph position="0"> Figure 1 shows the framework of our system. When the parser fails to analyse a sentence, the Hypothesis Generator (HG) produces hypotheses of missing knowledge each of which could rectify the defects of the current grammar. As the parser is a sort of Chart Parser and maintains partial parsing results in the form of inactive and active edges, a parsing failure means that no inactive edge of category S spanning the whole sentence exists.</Paragraph>
      <Paragraph position="1"> The HG tries to introduce an inactive edge of S by making hypotheses of missing linguistic knowledge.</Paragraph>
      <Paragraph position="2"> It generates hypotheses of rewriting rules which collect existing sequences of inactive edges into an expected category. It also calls itself recursively to in- null troduce necessary inactive edges for each rule of the expected category whose application is prevented due to the lack of necessary inactive edges. The simplest form of the algorithm is shown below.</Paragraph>
      <Paragraph position="3"> \[Algorithm\] An inactive edge \[ie(A) : xo, xn\] can be introduced, with label A, between word positions x0 and xn by each of the hypotheses generated from the following two steps.</Paragraph>
      <Paragraph position="4"> \[Step 1\] For each sequence of inactive edges,</Paragraph>
      <Paragraph position="6"> spanning from x0 to Xn, generates a new rule.</Paragraph>
      <Paragraph position="7"> A ==~ B1,.-.,B, \[Step 2\] For each existing rule of form A ::V A1, * * -, An, finds an incomplete sequence of inactive edges, \[ie(A1) : xo, xl\], ..., \[ie(A~_l) : x~-2, xi-1\], \[ie(Ai+l) : xi, xi+l\], ..., \[ie(An) : xn-1, xn\], and calls this algorithm for \[ie(Ai) : xi-1, xi\].</Paragraph>
      <Paragraph position="8"> This algorithm has been further augmented in order to treat sentences which contain more than one construction not covered by the current version of the grammar and to generate hypotheses concerning complex features like subcategorization frames.</Paragraph>
    </Section>
    <Section position="2" start_page="72" end_page="72" type="sub_section">
      <SectionTitle>
2.2 Hypothesis Filtering
</SectionTitle>
      <Paragraph position="0"> The greater number of the hypotheses generated by the algorithm are linguistically unnatural, because the algorithm does not embody any linguistic principle to judge the appropriateness of hypotheses, and therefore we introduced a set of criteria to filter out unnatural hypotheses (Kiyono and Tsujii, 1993; Kiyono and Tsujii, 1994). This includes, for example, * The maximum number of daughter constituents of a rule is set to 3.</Paragraph>
      <Paragraph position="1"> * Supposing that the current version of the grammar contains all the category conversion rules, a unary rule with one daughter constituent is not generated.</Paragraph>
      <Paragraph position="2"> * Using generalizations embodied in the current version of the grammar, a rule containing a sequence of constituents which can be collected into a larger constituent by the current version of grammar is not generated.</Paragraph>
      <Paragraph position="3"> * Distinguishing non-lexical categories from lexical categories, a rule whose mother category is a lexical category is not generated.</Paragraph>
      <Paragraph position="4"> These criteria significantly reduce the number of hypotheses to be generated.</Paragraph>
    </Section>
    <Section position="3" start_page="72" end_page="73" type="sub_section">
      <SectionTitle>
2.3 Hypothesis Graph
</SectionTitle>
      <Paragraph position="0"> As the criteria which the HG uses to filter out unnatural hypotheses are solely based on the forms of hypotheses, they cannot identify the &amp;quot;correct&amp;quot; hypotheses on their own. The correct ones are rather chosen by the Hypothesis Selector (HS), which resorts to examining the statistical behaviour of hypotheses throughout a given corpus.</Paragraph>
      <Paragraph position="1"> A straightforward method is to count the frequency of hypotheses, but this simple method does not work, because hypotheses are not independent of each other. A hypothesis is either competing with or complemenlary to other hypotheses generated from the same sentence. A group of hypotheses generated for restoring the same inactive edge constitutes a set of competing hypotheses and only one of them contributes to the correct structure of the sentence. On the other hand, two groups of hypotheses which are generated to treat two different parts of the same sentence stand in complementary relationships.</Paragraph>
      <Paragraph position="2"> A hypothesis should be recognized as being correct, only when no other competing hypothesis is more plausible. That is, even if a hypothesis is generated frequently, it should not be chosen as the correct one, if more plausible competing hypotheses are always generated together with it. On the other hand, even if a hypothesis is generated only once, it should be chosen as the correct one, if there is no other competing hypothesis.</Paragraph>
      <Paragraph position="3"> In order to realize the above conception, the HS maintains mutual relationships among hypotheses as an AND-OR graph. In a graph, AND nodes and OR nodes express complementary relationships and competing relationships, respectively. A node is shared, when different recursion steps in the HG try to restore the same inactive edge. Figure 2 shows the AND-OR graph for the hypotheses generated from the sentence '~Failing students looked embarrassed&amp;quot; when the current version of grammar does not contain rules for participles. The top node is an AND node which has two groups of hypotheses that treat two different parts of the sentence, i.e. &amp;quot;failing students&amp;quot; and &amp;quot;looked embarrased&amp;quot;.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="73" end_page="75" type="metho">
    <SectionTitle>
3 Statistical Analysis
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
3.1 Two Measures of Plausibility
</SectionTitle>
      <Paragraph position="0"> The HS uses two measures of plausibility of hypotheses. One is computed for an instance hypothesis and the other is for a generic hypothesis. (See 3.3 for the relationship between the two types of hypotheses.)  (1) Local Plausibility: This value shows how  plausible an instance hypothesis is as grammatical knowledge to contribute to the correct analysis of a unsuccessfully parsed sentence. (2) Global Plausibility: This value shows how plausible the hypothesis of the generic form is as grammatical knowledge to be acquired.</Paragraph>
      <Paragraph position="1"> As we describe in the following section, the Local Plausibility (LP) of an instance hypothesis is computed on the basis of the values of the Global Plausibility (GP) of the generic hyoptheses which are linked to instance hypotheses in the same hypothesis graph. On the other hand, the GP of a generic hypothesis is computed from the LP values of its instance hypotheses across the whole corpus. Intuitively speaking, the GP of a generic hypothesis is high if its instances are frequently generated and if they receive high LP values, while the LP of a instance hypothesis is high if the GP of the corresponding generic hypothesis is high and if the GP values of the generic hypotheses corresponding to its competing hypotheses are low. Because of this mutual dependence between LP and GP, they cannot be computed in a single step but rather computed iteratively by repeating the following steps until the halt condition is satisfied.</Paragraph>
      <Paragraph position="2"> \[Step 1\] Estimates the initial values of LP.</Paragraph>
      <Paragraph position="3"> \[Step 2\] Calculates GP values from LP values.</Paragraph>
      <Paragraph position="4"> \[Step 3\] Checks the halt condition.</Paragraph>
      <Paragraph position="5"> \[Step 4\] Calculates LP values from GP values and GOT0 \[Step 2\].</Paragraph>
    </Section>
    <Section position="2" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
3.2 Initial Estimation of Local Plausibility
</SectionTitle>
      <Paragraph position="0"> If the current version of the grammar is reasonably comprehensive, pieces of linguistic knowledge which have to be acquired are likely to be lexical or idiosyncratic. That is, we assume that sublanguage-specificity tends to be manifested by unknown words, new usages of existing words, and syntactic constructions idiosyncratic to the sublanguage. In order to quantify such plausibility, the following value is given to each hypothesis.</Paragraph>
      <Paragraph position="2"> This value shows the proportion of the syntactic structure in the whole sentence which is not covered by the hypothesis. It ranges from 0 to 1 and gets larger if the hypothesis rectifies a smaller part of the sentence. W(Hypol), the width of the hypothesis, is defined as the word count of the subtree and H(Hypoi), the height, is defined as the shortest path from lexical nodes to the top node of the subtree.</Paragraph>
    </Section>
    <Section position="3" start_page="73" end_page="74" type="sub_section">
      <SectionTitle>
3.3 Generic Hypothesis and Global Plausibility
</SectionTitle>
      <Paragraph position="0"> The GP of a hypothesis is computed based on the LP values of its instance hypotheses, but the relationship between a generic hypothesis and its instances is not straightforward because we adopted a unification-based grammar formalism. For example, the instance hypothesis of NP =:~ VP, NP in Figure 2 contains not only this CFG skeleton but also further feature descriptions of the three constituents which include specific surface words like &amp;quot;failing&amp;quot; and &amp;quot;students&amp;quot;. Unless we generalize them, we cannot obtain the generic form of this instance hypothesis, and therefore cannot judge whether the hypotheses generated from different sentences are identical.</Paragraph>
      <Paragraph position="1"> Such generalization of instance hypotheses requires an inductive mechanism for judging which parts of the feature specification are common to all instance hypotheses and should be included in a hypothesis of the generic form. This kind of induction is beyond the scope of the current framework, because such induction may need a lot of time and space if it is carried out from scratch. We first gather a set of instance hypotheses which are likely to be instances of the same generic hypothesis which, in turn, is likely to be &amp;quot;correct&amp;quot; linguistic knowledge. Our current framework uses a simple definition of generic hypotheses and their instances. That is, if two rule hypotheses have the same CFG skeleton, then they are judged to be instances of the same generic hypotheses. As for lexical hypotheses, we use a set of fixed templates of lexical entries in order to acquire detailed knowledge like subcategorization frames. Features which are not included in the templates are ignored in the judgement of whether generic hypotheses are identical.</Paragraph>
      <Paragraph position="2">  The GP of a generic hypotheses is defined as being the probability of the event that at least one instance hypothesis recovers the true cause of a parsing failure, and it is computed by the following formula when a set of its instance hypotheses is identified.</Paragraph>
      <Paragraph position="3"> In the formula, HP is a generic hypothesis and HPi are its instances.</Paragraph>
      <Paragraph position="5"> The more instance hypotheses are generated, the closer to 1 GP(HP) becomes. If one of the instances is regarded to be recovering the true cause of a parsing failure, the GP of the generic hypothesis is assigned 1, because the hypothesis is indispensable to the analysis of the corpus.</Paragraph>
    </Section>
    <Section position="4" start_page="74" end_page="75" type="sub_section">
      <SectionTitle>
3.4 Local Plausibility
</SectionTitle>
      <Paragraph position="0"> The calculation of LP is carried out on each hypothesis graph based on the assumption that an instance hypothesis or a set of instance hypotheses which recovers the true cause(s) of the parsing failure should exist in the graph. This assumption means that the top node of a hypothesis graph is assigned 1 as its LP value.</Paragraph>
      <Paragraph position="1"> The LP value assigned to a node is to be distributed to its daughter nodes by considering the GP values of the corresponding generic hypotheses.</Paragraph>
      <Paragraph position="2"> For example, the daughter nodes of an OR node, which constitute a set of competing hypotheses, receive their LP values which are dividents of the LP value of the mother node proportional to their GP values.</Paragraph>
      <Paragraph position="3"> However, as GP is defined only for hypotheses, we first determine the GP values of all nodes in a hypothesis graph in a bottom-up manner, starting from the tip nodes of the graph to which instance hypotheses are attached. Therefore, \[Step 2\] in the statistical analysis is further divided into the following three steps.</Paragraph>
      <Paragraph position="4"> \[Step 2-1\] Bottom-up Calculation of GP The GP value of an intermediate node is determined as follows (See Figure 3(a)).</Paragraph>
      <Paragraph position="5"> * The GP value of an OR node is computed by the following formula based on the GP values of the daughter nodes, which corresponds to the probability that at least one of the daughter nodes represents &amp;quot;correct&amp;quot; grammatical knowledge.</Paragraph>
      <Paragraph position="7"> * The GP value of an AND node is computed by the following formula, which corresponds to the probability that all the daughter nodes represent &amp;quot;correct&amp;quot; grammatical knowledge.</Paragraph>
      <Paragraph position="9"> The nodes which have significantly smaller GP values than the highest one among the daughter nodes of the same mother OR node (less than one tenth, in our current implementation) will be removed from the hypothesis graph. For example, HP2 in Figure 3 was considered to be much less plausible than HP4 and removed from the graph.</Paragraph>
      <Paragraph position="10"> As a node in a hypothesis graph could have more than one mother nodes, the hypothesis deletion is realized by removing the link between the node representing the hypothesis and one of its mother OR nodes (not removing the node itself). For example, in Figure 3, when HP4 is removed in comparison with HP2 or HP3, the link between HP4 and the OR node is removed, while the link between HP4 and the AND node still remains.</Paragraph>
      <Paragraph position="11"> The deletion of less viable nodes accelerates the convergence of the iterative process of computing GP and LP.</Paragraph>
      <Paragraph position="12">  \[Step 2-3\] Top-down Calculation of LP This step distributes the LP assigned to the top node (that is, 1) to the nodes below in a top-down way according to the following rules (See Figure 3(b)).</Paragraph>
      <Paragraph position="13"> * The LP value of an OR node is distributed to its daughter nodes proportional to their GP values so that the sum of their LP values is the same as that of the OR node because the daughter nodes of the same OR node represent mutually exclusive hypotheses.</Paragraph>
      <Paragraph position="15"> If a hypothesis has more than one mother nodes and its LP can be calculated through several paths, the sum of those is given to the hypothesis. For example, the value for HP4 in Figure 3 is 0.56 + 0.38 = 0.94.</Paragraph>
      <Paragraph position="16"> As we discussed before, these newly computed LP values are used to compute the GP values at \[Step 2\] in the next cycle of iteration.</Paragraph>
    </Section>
    <Section position="5" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
3.5 Halt Condition
</SectionTitle>
      <Paragraph position="0"> The iterative calculation process is regarded to have converged if the GP values of all the generic hypotheses do not change in comparison with the previous cycle, but as it possibly takes a lot of time for the process to reach such a situation, we use an easier condition to stop the process. That is, we count the number of deleted instance hypotheses at each cycle and terminate the iteration when no instance hypothesis is deleted in a number of consecutive iterations. Actually, the process halts after 5 zerodeletion cycles in our current implementation.</Paragraph>
      <Paragraph position="1"> When the interative process terminates, the hypotheses with high GP values are presented as the final candidates of new knowledge to be added to the current version of grammar.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="75" end_page="75" type="metho">
    <SectionTitle>
4 Preliminary Experiment
</SectionTitle>
    <Paragraph position="0"> In order to demonstrate how the HS works, we carried out a preliminary experiment with 1,000 sentences in the UNIX on-line manual (approximately one fifth of the whole manual). As the initial knowledge for the experiment, we prepared a grammar set which contains 120 rules covering English basic expressions and deliberately removed rules for participles in order to check whether the HS can discover adequate rules. The input data to the statistical process is a set of 5,906 instance hypotheses generated  The statistical process removed 4,034 instance hypotheses and stopped after 63 cycles of the iterative computation of GP and LP. The instance hypotheses were grouped into 2,876 generic hypotheses and the GP values of 2,331 generic hypotheses were reduced to 0 by the hypothesis deletion.</Paragraph>
    <Paragraph position="1"> Table 1 is the list of &amp;quot;correct&amp;quot; hypotheses picked up from the whole list of generic hypotheses sorted by GP values. The hypothesis for participles, np =&gt; vp,np, is one of the 128 hypotheses whose GP values are 1. This table also shows that quite a few &amp;quot;correct&amp;quot; lexical hypotheses are in higher positions because lexical knowledge for unknown words is indispensable to the successful parsing of the corpus.</Paragraph>
    <Paragraph position="2"> The distribution of &amp;quot;correct&amp;quot; hypotheses within the whole list is shown in Table 2. The fact that &amp;quot;correct&amp;quot; hypotheses exist more in higher ranges supports our mechanism. Although some of the &amp;quot;correct&amp;quot; ones have zero GP values, they do not diminish our framework because most of them are the hypotheses treating participles as adjectives, which are the alternative hypotheses of np =&gt; vp,np.</Paragraph>
    <Paragraph position="3"> The parameter which we can adjust to select more plausible hypotheses is the threshold for the hypothesis deletion. Generally speaking, giving a higher threshold causes an increase of the number of deleted hypotheses and therefore accelerates the convergence of the iterative process. In the experiment, however, the use of one fifth as the threshold instead of one tenth did not bring a major difference.</Paragraph>
  </Section>
class="xml-element"></Paper>