<?xml version="1.0" standalone="yes"?> <Paper uid="J95-3002"> <Title>Robust Learning, Smoothing, and Parameter Tying on Syntactic Ambiguity Resolution</Title> <Section position="4" start_page="325" end_page="327" type="intro"> <SectionTitle> ACTION REDUCE SHIFT </SectionTitle> <Paragraph position="0"> Figure 2 The decomposition of a given syntactic tree X into different phrase levels. steps. First, the tree is decomposed into a number of phrase levels, such as L1, L2,..., L8 in Figure 2. A phrase level is a sequence of symbols (terminal or nonterminal) that acts as an intermediate result in parsing the input sentence, and is also called a sentential form in formal language theory (Hopcroft and Ullman 1974). In the second step, we formulate the transition between phrase levels as a context-sensitive rewriting process. With the formulation, each transition probability between two phrase levels is calculated by consulting a finite-length window that comprises the symbols to be reduced and their left and right contexts.</Paragraph> <Paragraph position="1"> Let the label ti in Figure 2 be the time index for the ith state transition, which corresponds to a reduce action, and Li be the ith phrase level. Then the syntactic score of the tree in Figure 2 is defined as:</Paragraph> <Paragraph position="3"> The transition probability between two phrase levels, say P(L7 I C6), is the product of the probabilities of two events. Taking P(L7 I L6) as an example, the first probability corresponds to the event that {F, G} are the constituents to be reduced, and the second probability corresponds to the event that they are reduced to C. The transition probability can thus be expressed as follows:</Paragraph> <Paragraph position="5"> According to the results of our experiments, the first term is equal to one in most cases, and it makes little contribution to discriminating different syntactic structures.</Paragraph> <Paragraph position="6"> In addition, to simplify the computation, we approximate the full context {B,F,G} with a window of finite length around {F, G}. The formulation for the syntactic scoring Tung-Hui Chiang et al. Robust Learning, Smoothing, and Parameter Tying function can thus be expressed as follows:</Paragraph> <Paragraph position="8"> where Treex is the parse tree X, $ and 0 correspond to the end-of-sentence marker and the null symbol, respectively; and li and ri represent the left and right contexts to be consulted in the ith phrase level) respectively. In the above equation, it is assumed that each phrase level is highly correlated with its immediately preceding phrase level but less correlated with other preceding phrase levels. In other words, the inter-level correlation is assumed to be a first-order Markov process. In addition, for computational feasibility, only a finite number of left and right contextual symbols are considered in the formulation. If M left context symbols and N right context symbols are consulted in evaluating Equation 9, the model is said to operate in the LMRN mode.</Paragraph> <Paragraph position="9"> Notice that the last formula in Equation 9 corresponds to the rightmost derivation sequence in a generalized LR parser with left and right contexts taken into account (Su et al. 1991). Such a formulation is particularly useful for a generalized LR parsing algorithm, in which context-sensitive processing power is desirable. 
Although the context-sensitive model in Equation 9 provides the ability to deal with intra-level context-sensitivity, it fails to capture inter-level correlation. In addition, the formulation of Equation 9 results in the normalization problem (Su et al. 1991; Briscoe and Carroll 1993) when the candidate syntactic trees have different numbers of nodes. An alternative formulation, which compacts the highly correlated phrase levels into a single one, was proposed by Su et al. (1991) to resolve the normalization problem. For instance, for the syntactic tree in Figure 2, the syntactic score under the modified formulation is expressed as follows:

    S_syn(Tree_X) ≈ P(L8, L7, L6 | L5) × P(L5 | L4) × P(L4, L3 | L2) × P(L2 | L1)
                  ≈ P(L8 | L5) × P(L5 | L4) × P(L4 | L2) × P(L2 | L1).    (12)

Each pair of phrase levels in the above equation corresponds to the change in the LR parser's stack before and after an input word is consumed by a shift operation. Because the total number of shift actions, which equals the number of product terms in Equation 12, is always the same for all alternative syntactic trees of a sentence, the normalization problem is resolved in such a formulation. Moreover, the formulation in Equation 12 provides a way to consider both the intra-level context-sensitivity and the inter-level correlation of the underlying context-free grammar. With such a formulation, the capability of context-sensitive parsing (in a probabilistic sense) can be achieved with a context-free grammar.
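As a rough illustration of the grouping in Equation 12, the sketch below (under assumed data structures, not the paper's code) compacts the run of REDUCE transitions between consecutive SHIFT actions into a single factor, so that every candidate tree for the same sentence is scored with the same number of factors. The function and variable names, the toy probabilities, and the group boundaries are all invented for the example.

```python
# Score a tree by compacting the REDUCE transitions between SHIFT actions
# (the idea behind Equation 12); every factor spans one consumed input token,
# so all candidate trees for a sentence receive the same number of factors.
from math import prod

def grouped_score(phrase_levels, group_starts, pair_prob):
    """
    phrase_levels : list [L1, ..., Lm] of phrase-level representations.
    group_starts  : 0-based indices of the phrase levels at which each group of
                    consecutive REDUCE transitions begins (just after a SHIFT);
                    each group ends where the next begins, the last at Lm.
    pair_prob     : callable (L_end, L_start) -> assumed estimate of P(L_end | L_start).
    """
    group_ends = group_starts[1:] + [len(phrase_levels) - 1]
    return prod(
        pair_prob(phrase_levels[end], phrase_levels[start])
        for start, end in zip(group_starts, group_ends)
    )

# Toy usage mirroring Equation 12: factors P(L2|L1), P(L4|L2), P(L5|L4), P(L8|L5).
levels = [f"L{i}" for i in range(1, 9)]              # L1 .. L8
starts = [0, 1, 3, 4]                                # positions of L1, L2, L4, L5
toy = {("L2", "L1"): 0.95, ("L4", "L2"): 0.7, ("L5", "L4"): 0.9, ("L8", "L5"): 0.8}
print(grouped_score(levels, starts, lambda a, b: toy.get((a, b), 1e-6)))
```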
It is interesting to compare our framework (Su et al. 1991) with the work of Briscoe and Carroll (1993) on probabilistic LR parsing. Instead of assigning probabilities to the production rules, as a conventional stochastic context-free grammar parser does, Briscoe and Carroll distribute probability over the states of the LR parser so that the probabilities of the transitions from a state sum to one; the preference for a SHIFT action is based on one right context symbol (i.e., the lookahead symbol), and the preference for a REDUCE action depends on the lookahead symbol and the previous state reached after the REDUCE action. With such an approach, it is very easy to implement (mildly) context-sensitive probabilistic parsing on existing LR parsers, and the probabilities can be easily trained. The probabilities assigned to the states implicitly encode different preferences for the left contextual environment of the reduced symbol, since a state, in general, can indicate part of the past parsing history (i.e., the left context) from which the current reduced symbol follows.

However, because of this implicit encoding of the parsing history, a state may fail to distinguish some left contextual environments correctly. This is not surprising, because the LR parsing table generator merges certain states according to the context-free grammar and the closure operations on the sets of items. Therefore, there are cases in which the same string is reduced, under different left contexts, to the same symbol at the same state and returns to the same state after the reduction. For instance, if several identical constructs, e.g., [X → a], are allowed in a recursive structure, and the input contains a Y followed by three (or more) consecutive Xs, e.g., "YXXX", then the reductions of the second and third Xs will return to the same state after the same rule is applied at that state. Under such circumstances, the associated probabilities for these two REDUCE actions will be identical and thus will not reflect the different preferences between them. In our framework, operating in an L2R1 mode, it is easy to tell that the first of these REDUCE actions is applied when the two left context symbols are {Y, X}, and the second is applied when the left context consists of two Xs. Because such recursion is not rare (it appears, for example, in groups of adjectives, nouns, conjunction constructs, and prepositional phrases in English), the estimated scores will be affected by such differences. In other words, we use context symbols explicitly and directly to evaluate the probability of a substructure, instead of using the parsing state to encode the past history implicitly, which may fail to provide a sufficient characterization of the left context. In addition, explicitly using the left context symbols allows easy use of smoothing techniques, such as deleted interpolation (Bahl, Jelinek, and Mercer 1983), clustering techniques (Brown et al. 1992), and model refinement techniques (Lin, Chiang, and Su 1994), to estimate the probabilities more reliably by changing the window sizes of the context and weighting the various estimates dynamically (see the sketch at the end of this section). This kind of improvement is desirable when the training data is limited.

Furthermore, Briscoe and Carroll (1993) use the geometric mean of the probabilities, rather than their product, as the preference score, to avoid biasing their procedure in favor of parse trees that have a smaller number of nodes (i.e., a smaller number of rules being applied). The geometric mean, however, does not fit into the probabilistic framework for disambiguation. In our approach, such a normalization problem is avoided by considering a group of highly correlated phrase levels as a single phrase level and evaluating the sequence of transitions for such phrase levels between the SHIFT actions. Alternatively, it is also possible to treat each group of highly correlated phrase levels as a joint event when evaluating its probability, provided enough data is available. The optimization criteria are thus not compromised by the topologies of the parse trees, because the number of SHIFT actions (i.e., the number of input tokens) is fixed for a given input sentence.
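As noted above, using explicit context symbols makes it straightforward to smooth the windowed estimates, for example by interpolating over progressively narrower context windows in the style of deleted interpolation. The sketch below is a hypothetical illustration of that idea, not the paper's estimation procedure: the counts, window modes, and fixed weights are invented for the example, and in practice the weights would be estimated on held-out data (Bahl, Jelinek, and Mercer 1983).

```python
# Deleted-interpolation-style smoothing over context windows of decreasing size.
# All counts and weights below are invented for illustration; real weights would
# be estimated on held-out data rather than fixed by hand.

def interpolated_reduce_prob(parent, constituents, left2, right1, counts, weights):
    """
    Combine relative-frequency estimates of P(parent | context, constituents)
    from an L2R1, an L1R1, and an L0R0 window using interpolation weights.
    counts[mode] maps (key, parent) -> joint count and (key,) -> total count.
    """
    def rel_freq(mode, key):
        joint = counts[mode].get((key, parent), 0)
        total = counts[mode].get((key,), 0)
        return joint / total if total else 0.0

    estimates = [
        rel_freq("L2R1", (tuple(left2), tuple(constituents), right1)),
        rel_freq("L1R1", (left2[-1], tuple(constituents), right1)),
        rel_freq("L0R0", tuple(constituents)),
    ]
    return sum(w * e for w, e in zip(weights, estimates))

toy_counts = {
    "L2R1": {((("B", "E"), ("F", "G"), "$"), "C"): 8,
             ((("B", "E"), ("F", "G"), "$"),): 10},
    "L1R1": {(("E", ("F", "G"), "$"), "C"): 40,
             (("E", ("F", "G"), "$"),): 50},
    "L0R0": {(("F", "G"), "C"): 300,
             (("F", "G"),): 400},
}
print(interpolated_reduce_prob("C", ["F", "G"], ["B", "E"], "$",
                               toy_counts, weights=(0.6, 0.3, 0.1)))
# 0.6 * 0.8 + 0.3 * 0.8 + 0.1 * 0.75 = 0.795
```

Narrower windows are consulted with smaller weights here; when the widest window has been seen often enough in training, its estimate dominates, and otherwise the narrower windows keep the probability from collapsing to zero.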