File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/c02-2025_metho.xml
Size: 20,204 bytes
Last Modified: 2025-10-06 14:07:50
<?xml version="1.0" standalone="yes"?> <Paper uid="C02-2025"> <Title>The LinGO Redwoods Treebank Motivation and Preliminary Applications</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A Rich and Dynamic Treebank </SectionTitle> <Paragraph position="0"> The Redwoods treebank is based on open-source HPSG resources developed by a broad consortium of research groups including researchers at Stanford (USA), Saarbr&quot;ucken (Germany), Cambridge, Edinburgh, and Sussex (UK), and Tokyo (Japan). Their wide distribution and common acceptance make the HPSG framework and resources an excellent anchor point for the Redwoods treebanking initiative.</Paragraph> <Paragraph position="1"> The key innovative aspect of the Redwoods approach to treebanking is the anchoring of all linguistic data captured in the treebank to the HPSG framework and a generally-available broad-coverage grammar of English, the LinGO English Resource Grammar (Flickinger, 2000) as implemented with the LKB grammar development environment (Copestake, 2002). Unlike existing treebanks, there is no need to define a (new) form of grammatical representation specific to the treebank. Instead, the treebank records complete syntacto-semantic analyses as defined by the LinGO ERG and provide tools to extract different types of linguistic information at varying granularity.</Paragraph> <Paragraph position="2"> The treebanking environment, building on the [incr tsdb()] profiling environment (Oepen & Callmeier, 2000), presents annotators, one sentence at a time, with the full set of analyses produced by the grammar. Using a pre-existing tree comparison tool in the LKB (similar in kind to the SRI Cambridge TreeBanker; Carter, 1997), annotators can quickly navigate through the parse forest and identify the correct or preferred analysis in the current context (or, in rare cases, reject all analyses proposed by the grammar). The tree selection tool presents users, who need little expert knowledge of the underlying grammar, with a range of basic properties that distinguish competing analyses and that are relatively easy to judge. All disambiguating decisions made by annotators are recorded in the [incr tsdb()] database and thus become available for (i) later dynamic extraction from the annotated profile or (ii) dynamic propagation into a more recent profile obtained from re-running a newer version of the grammar on the same corpus.</Paragraph> <Paragraph position="3"> Important innovative research aspects in this approach to treebanking are (i) enabling users of the treebank to extract information of the type they need and to transform the available representation into a form suited to their needs and (ii) the ability to update the treebank with an enhanced version of the grammar in an automated fashion, viz. by re-applying the disambiguating decisions on the corpus with an updated version of the grammar.</Paragraph> <Paragraph position="4"> Depth of Representation and Transformation of Information Internally, the [incr tsdb()] database records analyses in three different formats, viz. (i) as a derivation tree composed of identifiers of lexical items and constructions used to build the analysis, (ii) as a traditional phrase structure tree labeled with an inventory of some fifty atomic labels (of the type 'S', 'NP', 'VP' et al.), and (iii) as an underspecified MRS (Copestake, Lascarides, & Flickinger, 2001) meaning representation. While representation (ii) will in many cases be similar to the representation found in the Penn Treebank, representation (iii) subsumes the functor - argument (or tectogrammatical) structure advocated in the Prague Dependency Tree-bank or the German TiGer corpus. Most importantly, however, representation (i) provides all the information required to replay the full HPSG analysis (using the original grammar and one of the open-source HPSG processing environments, e.g., the LKB or PET, which already have been interfaced to [incr tsdb()]). Using the latter approach, users of the treebank are enabled to extract information in whatever representation they require, simply by reconstructing full analyses and adapting the existing mappings (e.g., the inventory of node labels used for phrase structure trees) to their needs. Likewise, the existing [incr tsdb()] facilities for comparing across competence and performance profiles can be deployed to evaluate results of a (stochastic) parse disambiguation system, essentially using the preferences recorded in the treebank as a 'gold standard' target for comparison.</Paragraph> <Paragraph position="5"> Automating Treebank Construction Although a precise HPSG grammar like the LinGO ERG will typically assign a small number of analyses to a given sentence, choosing among a few or sometimes a few dozen readings is time-consuming and error-prone. The project is exploring two approaches to automating the disambiguation task, (i) seeding lexical selection from a part-of-speech (POS) tagger and (ii) automated inter-annotator comparison and assisted resolution of conflicts.</Paragraph> <Paragraph position="6"> Treebank Maintenance and Evolution One of the challenging research aspects of the Redwoods initiative is about developing a methodology for automated updates of the treebank to reflect the continuous evolution of the underlying linguistic framework and of the LinGO grammar. Again building on the notion of elementary linguistic discriminators, we expect to explore the semi-automatic propagation of recorded disambiguating decisions into newer versions of the parsed corpus. While it can be assumed that the basic phrase structure inventory and granularity of lexical distinctions have stabilized to a certain degree, it is not guaranteed that one set of discriminators will always fully disambiguate a more recent set of analyses for the same utterance (as the grammar may introduce new ambiguity), nor that re-playing a history of disambiguating decisions will necessarily identify the correct, preferred analysis for all sentences.</Paragraph> <Paragraph position="7"> A better understanding of the nature of discriminators and relations holding among them is expected to provide the foundations for an update procedure that, ultimately, should be mostly automated, with minimal manual inspection, and which can become part of the regular regression test cycle for the grammar.</Paragraph> <Paragraph position="8"> Scope and Current State of Seeding Initiative The first 10,000 trees to be hand-annotated as part of the kick-off initiative are taken from a domain for which the English Resource Grammar is known to exhibit broad and accurate coverage, viz. transcribed face-to-face dialogues in an appointment scheduling and travel arrangement domain.1 For the follow-up phase of the project, it is expected to move into a second domain and text genre, presumably more formal, edited text taken from newspaper text or another widely available on-line source. As of June 2002, the seeding initiative is well underway.</Paragraph> <Paragraph position="9"> The integrated treebanking environment, combining [incr tsdb()] and the LKB tree selection tool, has been established and has been deployed in a first iteration of annotating the VerbMobil utterances. The approach to parse selection through minimal discriminators turned out to be not hard to learn for a second-year Stanford undergraduate in linguistics, and allowed completion of the first iteration in less than ten weeks. Table 1 summarizes the current Redwoods status.</Paragraph> <Paragraph position="10"> an expert grammarian and (almost) completely disambiguated to date; therefore it exhibits an interestingly higher degree of phrasal ambiguity in the 'active D 1' subset.</Paragraph> <Paragraph position="11"> been annotated. The columns are, from left to right, the total number of sentences (excluding fragments) for which the LinGO grammar has at least one analysis (']'), average length ('k'), lexical and structural ambiguity (' ' and ' ', respectively), followed by the last four metrics broken down for the following subsets: sentences (i) for which the annotator rejected all analyses (no active trees), (ii) where annotation resulted in exactly one preferred analysis (one active tree), (iii) those where full disambiguation was not accomplished through the first round of annotation (more than one active tree), and (iv) massively ambiguous sentences that have yet to be annotated.2</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Early Experimental Results </SectionTitle> <Paragraph position="0"> Development of the treebank has just started. Nonetheless, we have performed some preliminary experiments on concrete applications to motivate the utility of the resource being developed. In this section, we describe experiments using the Redwoods treebank to build and test systems for parse disambiguation. As a component, we build a tagger for the HPSG lexical tags in the treebank, and report results on this application as well.</Paragraph> <Paragraph position="1"> Any linguistic system that allows multiple parses of strings must address the problem of selecting from among the admitted parses the preferred one. A variety of approaches for building statistical models of parse selection are possible. At the simplest end, we might look only at the lexical type sequence assigned to the words by each parse and rank the parse based on the likelihood of that sequence. These lexical types - the preterminals in the derivation - are essentially part-of-speech tags, but encode considerably finer-grained information about the words. Well-understood statistical part-of-speech tagging technology is sufficient for this approach.</Paragraph> <Paragraph position="2"> In order to use more information about the parse, we might examine the entire derivation of the string.</Paragraph> <Paragraph position="3"> Most probabilistic parsing research - including, for example, work by by Collins (1997), and Charniak (1997) - is based on branching process models (Harris, 1963).</Paragraph> <Paragraph position="4"> The HPSG derivations that the treebank makes available can be viewed as just such a branching process, and a stochastic model of the trees can be built as a probabilistic context-free grammar (PCFG) model. Abney (1997) notes important problems with the soundness of the approach when a unification-based grammar is actually determining the derivations, motivating the use of log-linear models (Agresti, 1990) for parse ranking that Johnson and colleagues further developed (Johnson, Geman, Canon, Chi, & Riezler, 1999). These models can deal with the many interacting dependencies and the structural complexity found in constraint-based or unification-based theories of syntax.</Paragraph> <Paragraph position="5"> Nevertheless, the naive PCFG approach has the advantage of simplicity, so we pursue it and the tagging approach to parse ranking in these proof-of-concept experiments (more recently, we have begun work on building log-linear models over HPSG signs (Toutanova & Manning, 2002)). The learned models were used to rank possible parses of unseen test sentences according to the probabilities they assign to them. We report parse selection performance as percentage of test sentences for which the correct parse was highest ranked by the model.</Paragraph> <Paragraph position="6"> (We restrict attention in the test corpus to sentences that are ambiguous according to the grammar, that is, for which the parse selection task is nontrivial.) We examine four models: an HMM tagging model, a simple PCFG, a PCFG with grandparent annotation, and a hybrid model that combines predictions from the PCFG and the tagger.</Paragraph> <Paragraph position="7"> These models will be described in more detail presently.</Paragraph> <Paragraph position="8"> The tagger that we have implemented is a standard tri-gram HMM tagger, defining a joint probability distribution over the preterminal sequences and yields of these trees. Trigram probabilities are smoothed by linear interpolation with lower-order models. For comparison, we present the performance of a unigram tagger and an upper-bound oracle tagger that knows the true tag sequence and scores highest the parses that have the correct preterminal sequence.</Paragraph> <Paragraph position="9"> The PCFG models define probability distributions over the trees of derivational types corresponding to the HPSG analyses of sentences. A PCFG model has parameters i; j for each rule Ai ! j in the corresponding context free grammar.3 In our application, the nonterminals in the PCFG Ai are rules of the HPSG grammar used to build the parses (such as HEAD-COMPL or HEAD-ADJ).</Paragraph> <Paragraph position="10"> We set the parameters to maximize the likelihood of the set of derivation trees for the preferred parses of the sentences in a training set. As noted above, estimating probabilities from local tree counts in the treebank does not provide a maximum likelihood estimate of the observed data, as the grammar rules further constrain the possible derivations. Essentially, we are making an assumption of context-freeness of rule application that does not hold in the case of the HPSG grammar. Nonetheless, we can still build the model and use it to rank parses.</Paragraph> <Paragraph position="11"> 3For an introduction to PCFG grammars see, for example, Manning & Sch&quot;utze (1999).</Paragraph> <Paragraph position="12"> As previously noted by other researchers (Charniak & Caroll, 1994), extending a PCFG with grandparent annotation improves the accuracy of the model. We implemented an extended PCFG that conditions each node's expansion on its parent in the phrase structure tree. The extended PCFG (henceforth PCFG-GP) has parameters P.Ak Ai ! jjAk; Ai/ . The resulting grammar can be viewed as a PCFG whose nonterminals are pairs of the nonterminals of the original PCFG.</Paragraph> <Paragraph position="13"> The combined model scores possible parses using probabilities from the PCFG-GP model together with the probability of the preterminal sequence of the parse tree according to a trigram tag sequence model. More specifically, for a tree T , Score.t/ D log.PPCFG-GP.T // C log.PTRIG.tags.T// where PTRIG.tags.T// is the probability of the sequence of preterminals t1 tn in T according to a trigram tag model:</Paragraph> <Paragraph position="15"> with appropriate treatment of boundaries. The trigram probabilities are smoothed as for the HMM tagger. The combined model is relatively insensitive to the relative weights of the two component models, as specified by ; in any case, exact optimization of this parameter was not performed. We refer to this model as Combined. The Combined model is not a sound probabilistic model as it does not define a probability distribution over parse trees. It does however provide a crude way to combine ancestor and left context information.</Paragraph> <Paragraph position="16"> The second column in Table 2 shows the accuracy of parse selection using the models described above.</Paragraph> <Paragraph position="17"> For comparison, a baseline showing the expected performance of choosing parses randomly according to a uniform distribution is included as the first row. The accuracy results are averaged over a ten-fold cross-validation on the data set summarized in Table 1. The data we used for this experiment was the set of disambiguated sentences that have exactly one preferred parse (comprising a total of 5312 sentences). Often the stochastic models we are considering give the same score to several different parses. When a model ranks a set of m parses highest with equal scores and one of those parses is the preferred parse in the treebank, we compute the accuracy on this sentence as 1=m.</Paragraph> <Paragraph position="18"> Since our approach of defining the probability of analyses using derivation trees is different from the traditional approach of learning PCFG grammars from phrase structure trees, a comparison of the two is probably in order. We tested the model PCFG-GP defined over the corresponding phrase structure trees and its average accuracy was 65:65% which is much lower than the accuracy of the same model over derivation trees (71:73%).</Paragraph> <Paragraph position="19"> This result suggests that the information about grammar constructions is very helpful for parse disambiguation.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Method Task </SectionTitle> <Paragraph position="0"> tag sel. parse sel.</Paragraph> <Paragraph position="1"> tag and parse selection tasks (accuracy). The results in Table 2 indicate that high disambiguation accuracy can be achieved using very simple statistical models. The performance of the perfect tagger shows that, informally speaking, roughly half of the information necessary to disambiguate parses is available in the lexical types alone. About half of the remaining information is recovered by our best method, Combined.</Paragraph> <Paragraph position="2"> An alternative (more primitive) task is the tagging task itself. It is interesting to know how much the tagging task can be improved by perfecting parse disambiguation. With the availability of a parser, we can examine the accuracy of the tag sequence of the highest scoring parse, rather than trying to tag the word sequence directly. We refer to this problem as the tag selection problem, by analogy with the relation between the parsing problem and the parse selection problem. The first column of Table 2 presents the performance of the models on the tag selection problem. The results are averaged accuracies over 10 cross-validation splits of the same corpus as the previous experiment, and show that parse disambiguation using information beyond the lexical type sequence slightly improves tag selection performance. Note that in these experiments, the models are used to rank the tag sequences of the possible parses and not to find the most probable tag sequence. Therefore tagging accuracy results are higher than they would be in the latter case.</Paragraph> <Paragraph position="3"> Since our corpus has relatively short sentences and low ambiguity it is interesting to see how much the performance degrades as we move to longer and more highly ambiguous sentences. For this purpose, we report in Table 3 the parse ranking accuracy of the Combined model as a function of the number of possible analyses for sentences. Each row corresponds to a set of sentences with number of possible analyses greater or equal to the bound shown in the first column. For example, the first row contains information for the sentences with ambiguity 2, which is all ambiguous sentences. The columns show the total number of sentences in the set, the expected accuracy of guessing at random, and the accuracy of the Combined model. We can see that the parse ranking accuracy is decreasing quickly and more powerful models will be needed to achieve good accuracy for highly ambiguous sentences.</Paragraph> <Paragraph position="4"> Despite several differences in corpus size and compo- null sition, it is perhaps nevertheless useful to compare this work with other work on parse selection for unification-based grammars. Johnson et al. (1999) estimate a Stochastic Unification Based Grammar (SUBG) using a log-linear model. The features they include in the model are not limited to production rule features but also adjunct and argument and other linguistically motivated features. On a dataset of 540 sentences (total training and test set) from a Verbmobil corpus they report parse disambiguation accuracy of 58.7% given a baseline accuracy for choosing at random of 9.7%. The random base-line is much lower than ours for the full data set, but it is comparable for the random baseline for sentences with more than 5 analyses. The accuracy of our Combined model for these sentences is 59.64%, so the accuracies of the two models seem fairly similar.</Paragraph> </Section> </Section> class="xml-element"></Paper>