<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1509">
  <Title>Lexical and Structural Biases for Function Parsing</Title>
  <Section position="4" start_page="84" end_page="85" type="metho">
    <SectionTitle>
2 The Basic Architecture
</SectionTitle>
    <Paragraph position="0"> To achieve the complex task of assigning function labels while parsing, we use a family of statistical parsers, the Simple Synchrony Network (SSN) parsers (Henderson, 2003), which do not make any explicit independence assumptions, and are therefore likely to adapt without much modification to the current problem. This architecture has shown state-of-the-art performance.</Paragraph>
    <Paragraph position="1"> SSN parsers comprise two components, one which estimates the parameters of a stochastic model for syntactic trees, and one which searches for the most probable syntactic tree given the parameter estimates. As with many other statistical parsers (Collins, 1999; Charniak, 2000), SSN parsers use a history-based model of parsing. Events in such a model are derivation moves. The set of well-formed sequences of derivation moves in this parser is defined by a Predictive LR pushdown automaton (Nederhof, 1994), which implements a form of left-corner parsing strategy.</Paragraph>
    <Paragraph position="2"> This pushdown automaton operates on configurations of the form (G,v), where G represents the stack, whose right-most element is the top, and v the remaining input. The initial configuration is (ROOT,w) where ROOT is a distinguished non-terminal symbol. The final configuration is (ROOT,epsilon1). Assuming standard notation for context-free grammars (Nederhof, 1994), three derivation moves are defined:</Paragraph>
    <Paragraph position="4"> where A - a, D - Ad and B - bCg are productions such that D is a left-corner of C.</Paragraph>
    <Paragraph position="6"> where both A - a and B - bAg are productions. null The joint probability of a phrase-structure tree and its terminal yield can be equated to the probability of a finite (but unbounded) sequence of derivation moves. To bound the number of parameters, standard history-based models partition the set of well-formed sequences of transitions into equivalence classes. While such a partition makes the problem of searching for the most probable parse polynomial, it introduces hard independence assumptions: a derivation move only depends on the equivalence class to which its history belongs. SSN parsers, on the other hand, do not state any explicit independence assumptions: they use a neural network architecture, called Simple Synchrony Network (Henderson and Lane, 1998), to induce a finite history representation of an unbounded sequence of moves. The history representation of a parse history d1,... ,di[?]1, which we denote h(d1,... ,di[?]1), is assigned to the constituent that is on the top of the stack before the ith move.</Paragraph>
    <Paragraph position="7"> The representation h(d1,... ,di[?]1) is computed from a set f of features of the derivation move di[?]1 and from a finite set D of recent history representations h(d1,... ,dj), where j &lt; i [?] 1. Because the history representation computed for the move i[?]1 is included in the inputs to the computation of the representation for the next move i, virtually any information about the derivation history could flow from history representation to history representation and be used to estimate the probability of a derivation move. However, the recency preference exhibited by recursively defined neural networks biases learning towards information which flows through fewer  history representations. (Henderson, 2003) exploits this bias by directly inputting information which is considered relevant at a given step to the history representation of the constituent on the top of the stack before that step. To determine which history representations are input to which others and provide SSNs with a linguistically appropriate inductive bias, the set D includes history representations which are assigned to constituents that are structurally local to a given node on the top of the stack. In addition to history representations, the inputs to h(d1,... ,di[?]1) include hand-crafted features of the derivation history that are meant to be relevant to the move to be chosen at step i. For each of the experiments reported here, the set D that is input to the computation of the history representation of the derivation moves d1,... ,di[?]1 includes the most recent history representation of the following nodes: topi, the node on top of the pushdown stack before the ith move; the left-corner ancestor of topi (that is, the second top-most node on the parser's stack); the leftmost child of topi; and the most recent child of topi, if any. The set of features f includes the last move in the derivation, the label or tag of topi, the tag-word pair of the most recently shifted word, and the leftmost tag-word pair that topi dominates. Given the hidden history representation h(d1,***,di[?]1) of a derivation, a normalized exponential output function is computed by SSNs to estimate a probability distribution over the possible next derivation moves di.2 The second component of SSN parsers, which searches for the best derivation given the parameter estimates, implements a severe pruning strategy. Such pruning handles the high computational cost of computing probability estimates with SSNs, and renders the search tractable. The space of possible derivations is pruned in two different ways. The first pruning occurs immediately after a tag-word pair has been pushed onto the stack: only a fixed beam of the 100 best derivations ending in that tag-word pair are expanded. For training, the width of such beam is set to five. A second reduction of the search space prunes the space of possible project or attach deriva2The on-line version of Backpropagation is used to train SSN parsing models. It performs the gradient descent with a maximum likelihood objective function and weight decay regularization (Bishop, 1995).</Paragraph>
    <Paragraph position="8"> tion moves: the best-first search strategy is applied to the five best alternative decisions only.</Paragraph>
  </Section>
  <Section position="5" start_page="85" end_page="87" type="metho">
    <SectionTitle>
3 Learning Lexical Projection and Locality Domains of Function Labels
</SectionTitle>
    <Paragraph position="0"> Locality Domains of Function Labels Recent approaches to functional or semantic labels are based on two-stage architectures. The first stage selects the elements to be labelled, while the second determines the labels to be assigned to the selected elements. While some of these models are based on full parse trees (Gildea and Jurafsky, 2002; Blaheta, 2004), other methods have been proposed that eschew the need for a full parse (CoNLL, 2004; CoNLL, 2005). Because of the way the problem has been formulated, - as a pipeline of parsing feeding into labelling - specific investigations of the interaction of lexical projections with the relevant structural parsing notions during function labelling has not been studied.</Paragraph>
    <Paragraph position="1"> The starting point of our augmentation of SSN models is the observation that the distribution of function labels can be better characterised structurally than sequentially. Function labels, similarly to semantic roles, represent the interface between lexical semantics and syntax. Because they are projections of the lexical semantics of the elements in the sentence, they are projected bottom-up, they tend to appear low in the tree and they are infrequently found on the higher levels of the parse tree, where projections of grammatical, as opposed to lexical, elements usually reside. Because they are the interface level with syntax, function and semantic labels are also subject to distributional constraints that govern syntactic dependencies, especially those governing the distribution of sequences of long distance elements. These relations often correspond to top-down constraints. For example, languages like Italian allow inversion of the subject (the Agent) in transitive sentences, giving rise to a linear sequence where the Theme precedes the Agent (Mangia la mela Gianni, eats the apple Gianni). Despite this freedom in the linear order, however, it is never the case that the structural positions can be switched. It is a well-attested typological generalisation that one does not find sentences where the subject is a Theme and the object is the Agent. The hierarchical description, then, captures the underlying generalisa- null to capture the notion of c-command (solid lines).</Paragraph>
    <Paragraph position="2"> tion better than a model based on a linear sequence.</Paragraph>
    <Paragraph position="3"> In our augmented model, inputs to each history representation are selected according to a linguistically motivated notion of structural locality over which dependencies such as argument structure or subcategorization could be specified. We attempt to capture the sequence and the structural position by indirectly modelling the main definition of syntactic domain, the notion of c-command. Recall that the c-command relation defines the domain of interaction between two nodes in a tree, even if they are not close to each other, provided that the first node dominating one node also dominates the other.</Paragraph>
    <Paragraph position="4"> This notion of c-command captures both linear and hierarchical constraints and defines the domain in which semantic role labelling applies, as well as many other linguistic operations.</Paragraph>
    <Paragraph position="5"> In SSN parsing models, the set D of nodes that are structurally local to a given node on the top of the stack defines the structural distance between this given node and other nodes in the tree. Such a notion of distance determines the number of history representations through which information passes to flow from the representation assigned to a node i to the representation assigned to a node j. By adding nodes to the set D, one can shorten the structural distance between two nodes and enlarge the locality domain over which dependencies can be specified.</Paragraph>
    <Paragraph position="6"> To capture a locality domain appropriate for function parsing, we include two additional nodes in the set D: the most recent child of topi labelled with a syntactic function label and the most recent child of topi labelled with a semantic function label. These additions yield a model that is sensitive to regularities in structurally defined sequences of nodes bearing function labels, within and across constituents. First, in a sequence of nodes bearing function labels within the same constituent - possibly interspersed with nodes not bearing function labels - the structural distance between a node bearing a function label and any of its right siblings is shortened and constant. This effect comes about because the representation of a node bearing a function label is directly input to the representation of its parent, until a farther node with a function label is attached. Second, the distance between a node labelled with a function label and any node that it c-commands is kept constant: since the structural distance between a node [A - a] on top of the stack and its left-corner ancestor [B - b] is constant, the distance between the most recent child node of B labelled with a function label and any child of A is kept constant. This modification of the biases is illustrated in Figure 2.</Paragraph>
    <Paragraph position="7"> This figure displays two constituents, S and VP with some of their respective child nodes. The VP node is assumed to be on the top of the parser's stack, and the S one is supposed to be its left-corner ancestor. The directed arcs represent the information that flows from one node to another. According to the original SSN model in (Henderson, 2003), only the information carried over by the leftmost child and the most recent child of a constituent di- null rectly flows to that constituent. In the figure above, only the information conveyed by the nodes a and d is directly input to the node S. Similarly, the only bottom-up information directly input to the VP node is conveyed by the child nodes epsilon1 and th. In both the no-biases and H03 models, nodes bearing a function label such as ph1 and ph2 are not directly input to their respective parents. In our extended model, information conveyed by ph1 and ph2 directly flows to their respective parents. So the distance between the nodes ph1 and ph2, which stand in a c-command relation, is shortened and kept constant.</Paragraph>
    <Paragraph position="8"> As well as being subject to locality constraints, functional labels are projected by the lexical semantics of the words in the sentence. We introduce this bottom-up lexical information by fine-grained modelling of function tags in two ways. On the one hand, extending a technique presented in (Klein and Manning, 2003), we split some part-of-speech tags into tags marked with semantic function labels. The labels attached to a non-terminal which appeared to cause the most trouble to the parser in a separate experiment (DIR, LOC, MNR, PRP or TMP) were propagated down to the pre-terminal tag of its head. To affect only labels that are projections of lexical semantics properties, the propagation takes into account the distance of the projection from the lexical head to the label, and distances greater than two are not included. Figure 3 illustrates the result of the tag splitting operation.</Paragraph>
    <Paragraph position="9"> On the other hand, we also split the NULL label into mutually exclusive labels. We hypothesize that the label NULL (ie. SYN-NULL and SEM-NULL) is a mixture of types, some of which of semantic nature, such as CLR, which will be more accurately learnt separately. The NULL label was split into the mutually exclusive labels CLR, OBJ and OTHER. Constituents were assigned the OBJ label according to the conditions stated in (Collins, 1999). Roughly, an OBJ non-terminal is an NP, SBAR or S whose parent is an S, VP or SBAR. Any such non-terminal must not bear either syntactic or semantic function labels, or the CLR label. In addition, the first child following the head of a PP is marked with the OBJ label. (For more detail on this lexical semantics projection, see (Merlo and Musillo, 2005).) We report the effects of these augmentations on parsing results in the experiments described below.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML