<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1100">
  <Title>Morphology and Reranking for the Statistical Parsing of Spanish (Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 795-802, Vancouver, October 2005. (c) 2005 Association for Computational Linguistics)</Title>
  <Section position="4" start_page="795" end_page="797" type="metho">
    <SectionTitle>
3 Models
</SectionTitle>
    <Paragraph position="0"> This section details our two approaches for adding features to a baseline parsing model. First, we describe how morphological information can be added to a parsing model by modifying the POS tagset.</Paragraph>
    <Paragraph position="1"> Second, we describe an approach that reranks the n-best output of the morphologically-rich parser, using arbitrary, general features of the parse trees as additional information.</Paragraph>
    <Section position="1" start_page="795" end_page="797" type="sub_section">
      <SectionTitle>
3.1 Adding Morphological Information
</SectionTitle>
      <Paragraph position="0"> The mechanism we employ for incorporating morphological information is the modification of the POS tagset of a lexicalized PCFG -- the Model 1 parser described in (Collins, 1999) (hereafter Model 1). Each POS tagset can be thought of as a particular morphological model or a subset of morphological attributes. Table 1 shows the complete set of morphological features we considered for Spanish. There are 22 morphological features in total in this table; different POS sets can be created by deciding whether or not to include each of these 22 features; hence, there are 2^22 different morphological models we could have created. For instance, one particular model might capture the modal information of verbs. In this model, there would be six POS tags for verbs (one for each of indicative, subjunctive, imperative, infinitive, gerund, and participle) instead of just one. A model that captured both the number and mode of verbs would have 18 verbal POS tags, assuming three values (singular, plural, and neutral) for the number feature.</Paragraph>
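The tagset-expansion arithmetic above can be sketched in code. This is a hypothetical illustration, not the paper's implementation: the feature names, value inventories, and tag-naming scheme are assumptions chosen to mirror the verb example (6 tags for mode alone, 18 for mode crossed with number).

```python
from itertools import product

# Hypothetical sketch: expanding a base POS tag into a richer tagset
# by crossing it with the values of selected morphological features.
# Feature names and values mirror the verb example in the text.
FEATURE_VALUES = {
    "mode": ["indicative", "subjunctive", "imperative",
             "infinitive", "gerund", "participle"],
    "number": ["singular", "plural", "neutral"],
}

def expand_tag(base_tag, features):
    """Return one POS tag per combination of the chosen features' values."""
    if not features:
        return [base_tag]
    value_lists = [FEATURE_VALUES[f] for f in features]
    return [base_tag + "+" + "+".join(combo) for combo in product(*value_lists)]

mode_only = expand_tag("v", ["mode"])              # 6 verbal tags
mode_number = expand_tag("v", ["mode", "number"])  # 18 verbal tags
```

Each additional binary include/exclude decision doubles the number of candidate models, which is where the 2^22 figure comes from.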
      <Paragraph position="1"> The Effect of the Tagset on Model 1 Modifying the POS tagset allows Model 1 to better distinguish events that are unlikely from those that are likely, on the basis of morphological evidence. An example will help to illustrate this point. (Footnote 2: Hand-crafted head rules are used to lexicalize the trees.)</Paragraph>
      <Paragraph position="3"> [Figure 1 caption fragment: "... is unlikely to modify the singular verb corrió."]</Paragraph>
      <Paragraph position="5"> Model 1 relies on statistics conditioned on lexical headwords for practically all parameters in the model. This sensitivity to headwords is achieved by propagating lexical heads and POS tags to the non-terminals in the parse tree. Thus, any statistic based on headwords may also be sensitive to the associated POS tag. For instance, consider the subtree in Figure 1. Note that this structure is ungrammatical because the subject, gatos (cats), is plural, but the verb, corrió (ran), is singular. In Model 1, the probability of generating the noun phrase (NP) with headword gatos and headtag noun (n) is defined as follows (see footnote 3):</Paragraph>
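The equation itself was lost in extraction at this point. Assuming it follows Collins' (1999) Model 1 notation, and consistent with the P1/P2 terms discussed below, it can be reconstructed as a sketch:

```latex
% Reconstruction (assumed, following Collins 1999): the dependency
% probability factors into a nonterminal/POS step and a headword step.
P\bigl(\mathrm{NP}(\textit{gatos}, \mathrm{n}) \mid \mathrm{S}(\textit{corri\'o}, \mathrm{v}), \mathrm{VP}\bigr)
  = \underbrace{P_1(\mathrm{NP}, \mathrm{n} \mid \textit{corri\'o}, \mathrm{v}, \mathrm{S}, \mathrm{VP})}_{\text{generate dependent label and POS}}
  \times
  \underbrace{P_2(\textit{gatos} \mid \mathrm{NP}, \mathrm{n}, \textit{corri\'o}, \mathrm{v}, \mathrm{S}, \mathrm{VP})}_{\text{generate dependent headword}}
```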
      <Paragraph position="7"> The parser smooths parameter values using backed-off statistics, and in particular smooths statistics based on headwords with coarser statistics based on POS tags alone. This allows the parser to effectively use POS tags as a way of separating different lexical items into subsets or classes depending on their syntactic behavior. In our example, each term is estimated as follows:</Paragraph>
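The back-off equations were also lost in extraction. Under the assumption of a three-level linear interpolation (consistent with the li,j weights and the specific estimates ^P1,2(pn, NP | sv, S, VP) and ^P2,1 referred to in the surrounding text), they can be sketched as:

```latex
% Reconstruction (assumed): each factor interpolates maximum-likelihood
% estimates at three back-off levels, from the most specific context
% (headword + POS tag) down to the unlexicalized one.
P_1 = \lambda_{1,1}\,\hat{P}_{1,1}(\mathrm{NP},\mathrm{n} \mid \textit{corri\'o},\mathrm{v},\mathrm{S},\mathrm{VP})
    + \lambda_{1,2}\,\hat{P}_{1,2}(\mathrm{NP},\mathrm{n} \mid \mathrm{v},\mathrm{S},\mathrm{VP})
    + \lambda_{1,3}\,\hat{P}_{1,3}(\mathrm{NP},\mathrm{n} \mid \mathrm{S},\mathrm{VP})

P_2 = \lambda_{2,1}\,\hat{P}_{2,1}(\textit{gatos} \mid \mathrm{NP},\mathrm{n},\textit{corri\'o},\mathrm{v},\mathrm{S},\mathrm{VP})
    + \lambda_{2,2}\,\hat{P}_{2,2}(\textit{gatos} \mid \mathrm{NP},\mathrm{n},\mathrm{v},\mathrm{S},\mathrm{VP})
    + \lambda_{2,3}\,\hat{P}_{2,3}(\textit{gatos} \mid \mathrm{n})
```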
      <Paragraph position="9"> (Footnote 3: The parsing model includes other features, such as distance, which we omit from the parameter definitions for the sake of brevity.)</Paragraph>
      <Paragraph position="10"> Here the ^Pi,j terms are maximum likelihood estimates derived directly from counts in the training data. The li,j parameters are defined so that, for each i, they are non-negative and sum to one; they control the relative contribution of each level of back-off to the final estimate.</Paragraph>
      <Paragraph position="13"> Note that thus far our example has not included any morphological information in the POS tags. Because of this, we will see that there is a danger of the estimates P1 and P2 both being high, in spite of the dependency being ungrammatical. P1 will be high because all three estimates ^P1,1, ^P1,2 and ^P1,3 will most likely be high. Next, consider P2. Of the three estimates ^P2,1, ^P2,2, and ^P2,3, only ^P2,1 retains the information that the noun is plural and the verb is singular. Thus P2 will be sensitive to the morphological clash between gatos and corrió only if l2,1 is high, reflecting a high level of confidence in the estimate of ^P2,1. This will only happen if the context &lt;corrió, v, S, VP&gt; is seen frequently enough for l2,1 to take a high value. This is unlikely, given that this context is quite specific. In summary, the impoverished model can only capture morphological restrictions through lexically-specific estimates based on extremely sparse statistics.</Paragraph>
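The dependence of the interpolation weights on context frequency can be sketched in code. This is a hypothetical smoothing scheme in the spirit of Witten-Bell weighting; the function name, the constant c, and the specific counts are all illustrative assumptions, not the parser's actual estimator.

```python
# Hypothetical sketch of frequency-sensitive interpolation weights,
# in the spirit of Witten-Bell smoothing: a back-off level only receives
# a large weight when its conditioning context was observed often.
def interpolation_weight(context_count, distinct_outcomes, c=5.0):
    """lambda rises toward 1 as the conditioning context becomes frequent."""
    return context_count / (context_count + c * max(distinct_outcomes, 1))

# A sparse lexicalized context like <corrió, v, S, VP> gets a small weight,
# so its (grammar-sensitive) estimate barely contributes.
rare = interpolation_weight(context_count=2, distinct_outcomes=2)

# A frequent unlexicalized context gets a weight near 1.
frequent = interpolation_weight(context_count=500, distinct_outcomes=20)
```

This is exactly the failure mode described above: the only back-off level that sees the number clash is conditioned on a context too rare to earn a high weight.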
      <Paragraph position="14"> Now consider a model that incorporates morphological information -- in particular, number information -- in the noun and verb POS tags. gatos will have the POS pn, signifying a plural noun; corrió will have the POS sv, signifying a singular verb.</Paragraph>
      <Paragraph position="15"> All estimates in the previous equations will reflect these POS changes. For example, P1 will now be estimated as follows:</Paragraph>
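The enriched equation was lost in extraction; assuming the same three-level interpolation with the number-marked tags pn and sv substituted for n and v (consistent with the ^P1,2(pn, NP | sv, S, VP) term discussed next), it can be sketched as:

```latex
% Reconstruction (assumed): with number marked on the POS tags, the
% pn/sv clash appears at both of the first two back-off levels.
P_1 = \lambda_{1,1}\,\hat{P}_{1,1}(\mathrm{NP},\mathrm{pn} \mid \textit{corri\'o},\mathrm{sv},\mathrm{S},\mathrm{VP})
    + \lambda_{1,2}\,\hat{P}_{1,2}(\mathrm{NP},\mathrm{pn} \mid \mathrm{sv},\mathrm{S},\mathrm{VP})
    + \lambda_{1,3}\,\hat{P}_{1,3}(\mathrm{NP},\mathrm{pn} \mid \mathrm{S},\mathrm{VP})
```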
      <Paragraph position="17"> Note that the two estimates ^P1,1 and ^P1,2 include an (unlikely) dependency between the POS tags pn and sv. Both of these estimates will be 0, assuming that a plural noun is never seen as the subject of a singular verb. At the very least, the context &lt;sv, S, VP&gt; will be frequent enough for ^P1,2 to be a reliable estimate. The value for l1,2 will therefore be high, leading to a low estimate for P1, thus correctly assigning low probability to the ungrammatical dependency. In summary, the morphologically-rich model can make use of non-lexical statistics such as ^P1,2(pn, NP | sv, S, VP), which contain dependencies between POS tags and which will most likely be estimated reliably by the model.</Paragraph>
    </Section>
    <Section position="2" start_page="797" end_page="797" type="sub_section">
      <SectionTitle>
3.2 The Reranking Model
</SectionTitle>
      <Paragraph position="0"> In the reranking model, we use an n-best version of the morphologically-rich parser to generate a number of candidate parse trees for each sentence in training and test data. These parse trees are then represented through a combination of the log probability under the initial model, together with a large number of global features. A reranking model uses the information from these features to derive a new ranking of the n-best parses, with the hope of improving upon the baseline model. Previous approaches (e.g., (Collins and Koo, 2005)) have used a linear model to combine the log probability under a base parser with arbitrary features derived from parse trees. There are a variety of methods for training the parameters of the model. In this work, we use the algorithm described in (Bartlett et al., 2004), which applies the large-margin training criterion of support vector machines (Cortes and Vapnik, 1995) to the reranking problem.</Paragraph>
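The scoring side of such a reranker can be sketched as a linear model over the base parser's log probability plus global features. This is a minimal hypothetical sketch: the feature names and weights are invented for illustration, and it omits the large-margin training of Bartlett et al. (2004) entirely, showing only how a trained weight vector would rescore an n-best list.

```python
# Hypothetical sketch of reranking an n-best list with a linear model:
# score(tree) = w0 * log P(tree) + w . f(tree); pick the argmax.
def rerank(candidates, weights, w0=1.0):
    """candidates: list of (log_prob, feature_dict); returns index of the best."""
    def score(log_prob, feats):
        return w0 * log_prob + sum(weights.get(name, 0.0) * value
                                   for name, value in feats.items())
    scores = [score(log_prob, feats) for log_prob, feats in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# Illustrative feature names and weights only (not from the paper).
weights = {"rule:S->NP_VP": 0.5, "bigram:NP_PP": -0.3}
nbest = [(-42.0, {"rule:S->NP_VP": 1.0}),
         (-41.5, {"bigram:NP_PP": 2.0})]
best = rerank(nbest, weights)
```

With zero feature weights the reranker reduces to the baseline parser's ranking, which is why the log probability is kept as one component of the score.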
      <Paragraph position="1"> The motivation for the reranking model is that a wide variety of features, which can essentially be sensitive to arbitrary context in the parse trees, can be incorporated into the model. In our work, we included all features described in (Collins and Koo, 2005). As far as we are aware, this is the first time that a reranking model has been applied to parsing a language other than English. One goal was to investigate whether the improvements seen on English parsing can be carried across to another language.</Paragraph>
      <Paragraph position="2"> We have found that features in (Collins and Koo, 2005), initially developed for English parsing, also give appreciable gains in accuracy when applied to Spanish.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="797" end_page="798" type="metho">
    <SectionTitle>
4 Data
</SectionTitle>
    <Paragraph position="0"> The Spanish 3LB treebank is a freely-available resource with about 3,500 sentence/tree pairs that we have used to train our models. The average sentence length is 28 tokens. The data is taken from 38 complete articles and short texts. Roughly 27% of the texts are news articles, 27% scientific articles, 14% narrative, 11% commentary, 11% sports articles, 6% essays, and 5% articles from weekly magazines. The trees contain information about both constituency structure and syntactic functions.</Paragraph>
    <Section position="1" start_page="797" end_page="798" type="sub_section">
      <SectionTitle>
4.1 Preprocessing
</SectionTitle>
      <Paragraph position="0"> It is well-known that tree representation influences parsing performance (Johnson, 1998). Prior to training our models, we made some systematic modifications to the corpus trees in an effort to make it easier for Model 1 to represent the linguistic phenomena present in the trees. For the convenience of the reader, Table 2 gives a key to the non-terminal labels in the 3LB treebank that are used in this section and the remainder of the paper.</Paragraph>
      <Paragraph position="1"> Relative and Subordinate Clauses Cases of relative and subordinate clauses appearing in the corpus trees have the basic structure of the example in Figure 2a. Figure 2b shows the modifications we impose on such structures. The modified structure has the advantage that the SBAR selects the CP node as its head, making the relative pronoun que the head-word for the root of the subtree. This change allows, for example, better modeling of verbs that select for particular complementizers. In addition, the new subtree rooted at the S node now looks like a top-level sentence, making sentence types more uniform in structure and easier to model statistically. Additionally, the new structure differentiates phrases embedded in the complementizers of SBARs from those used in other contexts, allowing relative pronouns like quien in Figure 2 to surface as lexical head-words when embedded in larger phrases beneath the CP node.4 [Figure 2 caption fragment: "... for the phrase a quien todos consideraban, or whom everyone considered. We transform structures like (a) into (b) by inserting SBAR and CP nodes, and by marking all non-terminals below the CP with a -CP tag."]</Paragraph>
      <Paragraph position="2"> Coordination In the treebank, coordinated constituents and their coordinating conjunction are placed as sister nodes in a flat structure. We enhance the structure of such subtrees, as in Figure 3. Our structure helps to rule out unlikely phrases such as cats and dogs and; the model trained with the original treebank structures will assign non-zero probability to ill-formed structures such as these.</Paragraph>
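One way such a coordination transform might be implemented is sketched below. The nested target shape (wrapping the conjunction and its right conjunct under a new node) is an assumption chosen to illustrate the idea; the exact structure shown in the paper's Figure 3, and the node label COORD, may differ.

```python
# Hypothetical sketch: restructuring a flat coordination [X A conj B]
# into a nested form [X A [COORD conj B]], so that a grammar read off
# the trees cannot generate a dangling trailing conjunction.
def restructure_coordination(label, children, conj_tags=("conj",)):
    """children: list of (tag, word-or-subtree) pairs.
    Nest everything from the first internal conjunction onward."""
    for i, (tag, _) in enumerate(children):
        if tag in conj_tags and 0 < i < len(children) - 1:
            nested = ("COORD", children[i:])
            return (label, children[:i] + [nested])
    return (label, children)  # no internal conjunction: leave unchanged

flat = [("n", "cats"), ("conj", "and"), ("n", "dogs")]
tree = restructure_coordination("NP", flat)
```

After the transform, the rule expanding NP always ends in a complete coordinated constituent, which removes the non-zero probability the flat treatment assigns to strings like cats and dogs and.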
    </Section>
  </Section>
</Paper>