<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1673">
  <Title>Solving the Problem of Cascading Errors: Approximate Bayesian Inference for Linguistic Annotation Pipelines</Title>
  <Section position="5" start_page="619" end_page="621" type="metho">
    <SectionTitle>
3 Generating Samples
</SectionTitle>
    <Paragraph position="0"> The method we have outlined requires the ability to sample from the conditional distributions in the factored distribution of (1): in our case, the probability of a particular linguistic annotation, conditioned on other linguistic annotations. Note that this differs from the usual annotation task: taking the argmax. But for most algorithms the change is a small and easy change. We discuss how to obtain samples efficiently from a few different annotation models: probabilistic context free grammars (PCFGs), and conditional random fields (CRFs).</Paragraph>
    <Section position="1" start_page="619" end_page="620" type="sub_section">
      <SectionTitle>
3.1 Sampling Parses
</SectionTitle>
      <Paragraph position="0"> Bod (1995) discusses parsing with probabilistic tree substitution grammars, which, unlike simple PCFGs, do not have a one-to-one mapping between output parse trees and a derivation (a bag of rules) that produced it, and hence the most-likely derivation may not correspond to the most likely parse tree. He therefore presents a bottom-up approach to sampling derivations from a derivation forest, which does correspond to a sample from the space of parse trees. Goodman (1998) presents a top-down version of this algorithm. Although we use a PCFG for parsing, it is the grammar of (Klein and Manning, 2003), which uses extensive statesplitting, and so there is again a many-to-one correspondence between derivations and parses, and we use an algorithm similar to Goodman's in our work.</Paragraph>
      <Paragraph position="1"> PCFGs put probabilities on each rule, such as S - NP VP and NN - 'dog'. The probability of a parse is the product of the probabilities of the rules used to construct the parse tree. A dynamic programing algorithm, the inside algorithm, can be used to find the probability of a sentence. The  inside probability bk(p,q) is the probability that words p through q, inclusive, were produced by the non-terminal k. So the probability of the sentence The boy pet the dog. is equal to the inside probability bS(1,6), where the first word, w1 is The and the sixth word, w6, is [period]. It is also useful for our purposes to view this quantity as the sum of the probabilities of all parses of the sentence which have S as the start symbol. The probability can be defined recursively (Manning and Sch&amp;quot;utze, 1999) as follows:</Paragraph>
      <Paragraph position="3"> where Nk, Nr and Ns are non-terminal symbols and wp is the word at position p. We have omitted the case of unary rules for simplicity since it requires a closure operation.</Paragraph>
      <Paragraph position="4"> These probabilities can be efficiently computed using a dynamic program. or memoization of each value as it is calculated. Once we have computed all of the inside probabilities, they can be used to generate parses from the distribution of all parses of the sentence, using the algorithm in Figure 1.</Paragraph>
      <Paragraph position="5"> This algorithm is called after all of the inside probabilities have been calculated and stored, and take as parameters S, 1, and length(sentence). It works by building the tree, starting from the root, and recursively generating children based on the posterior probabilities of applying each rule and each possible position on which to split the sentences. Intuitively, the algorithm is given a non-terminal symbol, such as S or NP, and a span of words, and has to decide (a) what rule to apply to expand the non-terminal, and (b) where to split the span of words, so that each non-terminal resulting from applying the rule has an associated word span, and the process can repeat. The inside probabilities are calculated just once, and we can then generate many samples very quickly; DrawSamples is linear in the number of words, and rules.</Paragraph>
    </Section>
    <Section position="2" start_page="620" end_page="621" type="sub_section">
      <SectionTitle>
3.2 Sampling Named Entity Taggings
</SectionTitle>
      <Paragraph position="0"> To do named entity recognition, we chose to use a conditional random field (CRF) model, based on Lafferty et al. (2001). CRFs represent the state of</Paragraph>
      <Paragraph position="2"> This is a recursive algorithm which starts at the root of the tree and expands each node by sampling from the distribution of possible rules and ways to split the span of words. Its arguments are a non-terminal and two integers corresponding to word indices, and it is initially called with arguments S, 1, and the length of the sentence. There is a call to sampleFrom, which takes an (unnormalized) probability distribution, normalizes it, draws a sample and then returns the sample.</Paragraph>
      <Paragraph position="3"> the art in sequence modeling - they are discriminatively trained, and maximize the joint likelihood of the entire label sequence in a manner which allows for bi-directional flow of information. In order to describe how samples are generated, we generalize CRFs in a way that is consistent with the Markov random field literature. We create a linear chain of cliques, each of which represents the probabilistic relationship between an adjacent set of n states using a factor table containing |S|n values. These factor tables on their own should not be viewed as probabilities, unnormalized or otherwise. They are, however, defined in terms of exponential models conditioned on features of the observation sequence, and must be instantiated for each new observation sequence. The probability of a state sequence is then defined by the sequence of factor tables in the clique chain, given the observation sequence:</Paragraph>
      <Paragraph position="5"> where Fi(si[?]n ...si) is the element of the factor table at position i corresponding to states si[?]n through si, and Z(o) is the partition function which serves to normalize the distribution.1 To in- null fer the most likely state sequence in a CRF it is customary to use the Viterbi algorithm.</Paragraph>
      <Paragraph position="6"> We then apply a process called clique tree calibration, which involves passing messages between the cliques (see Cowell et al. (2003) for a full treatment of this topic). After this process has completed, the factor tables can be viewed as unnormalized probabilities, which can be used to compute conditional probabilities, PCRF(si|si[?]n ...si[?]1,o). Once these probabilities have been calculated, generating samples is very simple. First, we draw a sample for the label at the first position,2 and then, for each subsequent position, we draw a sample from the distribution for that position, conditioned on the label sampled at the previous position. This process results in a sample of a complete labeling of the sequence, drawn from the posterior distribution of complete named entity taggings.</Paragraph>
      <Paragraph position="7"> Similarly to generating sample parses, the expensive part is calculating the probabilities; once we have them we can generate new samples very quickly.</Paragraph>
    </Section>
    <Section position="3" start_page="621" end_page="621" type="sub_section">
      <SectionTitle>
3.3 k-Best Lists
</SectionTitle>
      <Paragraph position="0"> At first glance, k-best lists may seem like they should outperform sampling, because in effect they are the k best samples. However, there are several important reasons why one might prefer sampling. One reason is that the k best paths through a word lattice, or the k best derivations in parse forest do not necessarily correspond to the k best sentences or parse trees. In fact, there are no known sub-exponential algorithms for the best outputs in these models, when there are multiple ways to derive the same output.3 This is not just a theoretical concern - the Stanford parser uses such a grammar, and we found that when generating a 50-best derivation list that on average these derivations corresponded to about half as many unique parse trees. Our approach circumvents this issue entirely, because the samples are generated from the actual output distribution.</Paragraph>
      <Paragraph position="1"> Intuition also suggests that sampling should give more diversity at each stage, reducing the likelihood of not even considering the correct output. Using the Brown portion of the SRL test set (discussed in sections 4 and 6.1), and 50samples/50-best, we found that on average the 50- null this argument.</Paragraph>
      <Paragraph position="2"> samples system considered approximately 25% more potential SRL labelings than the 50-best system. null When pipelines have more than two stages, it is customary to do a beam search, with a beam size of k. This means that at each stage in the pipeline, more and more of the probability mass gets &amp;quot;thrown away.&amp;quot; Practically, this means that as pipeline length increases, there will be increasingly less diversity of labels from the earlier stages. In a degenerate 10-stage, k-best pipeline, where the last stage depends mainly on the first stage, it is probable that all but a few labelings from the first stage will have been pruned away, leaving something much smaller than a k-best sample, possibly even a 1-best sample, as input to the final stage. Using approximate inference to estimate the marginal distribution over the last stage in the pipeline, such as our sampling approach, the pipeline length does not have this negative impact or affect the number of samples needed. And unlike k-best beam searches, there is an entire research community, along with a large body of literature, which studies how to do approximate inference in Bayesian networks and can provide performance bounds based on the method and the number of samples generated.</Paragraph>
      <Paragraph position="3"> One final issue with the k-best method arises when instead of a linear chain pipeline, one is using a general directed acyclic graph where a node can have multiple parents. In this situation, doing the k-best calculation actually becomes exponential in the size of the largest in-degree of a node for a node with n parents, you must try all kn combinations of the values for the parent nodes. With sampling this is not an issue; each sample can be generated based on a topological sort of the graph.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="621" end_page="622" type="metho">
    <SectionTitle>
4 Semantic Role Labeling
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="621" end_page="622" type="sub_section">
      <SectionTitle>
4.1 Task Description
</SectionTitle>
      <Paragraph position="0"> Given a sentence and a target verb (the predicate) the goal of semantic role labeling is to identify and label syntactic constituents of the parse tree with semantic roles of the predicate. Common roles are agent, which is the thing performing the action, patient, which is the thing on which the action is being performed, and instrument, which is the thing with which the action is being done. Additionally, there are modifier arguments which can specify the location, time, manner, etc. The following sentence provides an example of a predi- null cate and its arguments: [The luxury auto maker]agent [last year]temp [sold]pred [1,214 cars]patient in [the U.S]location.</Paragraph>
      <Paragraph position="1"> Semantic role labeling is a key component for systems that do question answering, summarization, and any other task which directly uses a semantic interpretation.</Paragraph>
    </Section>
    <Section position="2" start_page="622" end_page="622" type="sub_section">
      <SectionTitle>
4.2 System Description
</SectionTitle>
      <Paragraph position="0"> We modified the system described in Haghighi et al. (2005) and Toutanova et al. (2005) to test our method. The system uses both local models, which score subtrees of the entire parse tree independently of the labels of other nodes not in that subtree, and joint models, which score the entire labeling of a tree with semantic roles (for a particular predicate).</Paragraph>
      <Paragraph position="1"> First, the task is separated into two stages, and local models are learned for each. At the first stage, the identification stage, a classifier labels each node in the tree as either ARG, meaning that it is an argument (either core or modifier) to the predicate, or NONE, meaning that it is not an argument. At the second stage, the classification stage, the classifier is given a set of arguments for a predicate and must label each with its semantic role.</Paragraph>
      <Paragraph position="2"> Next, a Viterbi-like dynamic algorithm is used to generate a list of the k-best joint (identification and classification) labelings according to the local models. The algorithm enforces the constraint that the roles should be non-overlapping. Finally, a joint model is constructed which scores a completely labeled tree, and it is used to re-rank the k-best list. The separation into local and joint models is necessary because there are an exponential number of ways to label the entire tree, so using the joint model alone would be intractable. Ideally, we would want to use approximate inference instead of a k-best list here as well. Particle filtering would be particularly well suited - particles could be sampled from the local model and then reweighted using the joint model. Unfortunately, we did not have enough time modify the code of (Haghighi et al., 2005) accordingly, so the k-best structure remained.</Paragraph>
      <Paragraph position="3"> To generate samples from the SRL system, we take the scores given to the k-best list, normalize them to sum to 1, and sample from them. One consequence of this, is that any labeling not on the k-best list has a probability of 0.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="622" end_page="623" type="metho">
    <SectionTitle>
5 Recognizing Textual Entailment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="622" end_page="622" type="sub_section">
      <SectionTitle>
5.1 Task Description
</SectionTitle>
      <Paragraph position="0"> In the task of recognizing textual entailment (RTE), also commonly referred to as robust textual inference, you are provided with two passages, a text and a hypothesis, and must decide whether the hypothesis can be inferred from the text. The term robust is used because the task is not meant to be domain specific. The term inference is used because this is not meant to be logical entailment, but rather what an intelligent, informed human would infer. Many NLP applications would benefit from the ability to do robust textual entailment, including question answering, information retrieval and multi-document summarization. There have been two PASCAL workshops (Dagan et al., 2005) with shared tasks in the past two years devoted to RTE.</Paragraph>
      <Paragraph position="1"> We used the data from the 2006 workshop, which contains 800 text-hypothesis pairs in each of the test and development sets4 (there is no training set). Here is an example from the development set from the first RTE challenge: Text: Researchers at the Harvard School of Public Health say that people who drink coffee may be doing a lot more than keeping themselves awake - this kind of consumption apparently also can help reduce the risk of diseases. null Hypothesis: Coffee drinking has health benefits.</Paragraph>
      <Paragraph position="2"> The positive and negative examples are balanced, so the baseline of guessing either all yes or all no would score 50%. This is a hard task - at the first challenge no system scored over 60%.</Paragraph>
    </Section>
    <Section position="2" start_page="622" end_page="623" type="sub_section">
      <SectionTitle>
5.2 System Description
</SectionTitle>
      <Paragraph position="0"> MacCartney et al. (2006) describe a system for doing robust textual inference. They divide the task into three stages - linguistic analysis, graph alignment, and entailment determination. The first of these stages, linguistic analysis is itself a pipeline of parsing and named entity recognition. They use the syntactic parse to (deterministically) produce a typed dependency graph for each sentence. This pipeline is the one we replace. The second stage, graph alignment consists of trying to find good alignments between the typed dependency graphs  for the text and hypothesis. Each possible alignment has a score, and the alignment with the best score is propagated forward. The final stage, entailment determination, is where the decision is actually made. Using the score from the alignment, as well as other features, a logistic model is created to predict entailment. The parameters for this model are learned from development data.5 While it would be preferable to sample possible alignments, their system for generating alignment scores is not probabilistic, and it is unclear how one could convert between alignment scores and probabilities in a meaningful way.</Paragraph>
      <Paragraph position="1"> Our modified linguistic analysis pipeline does NER tagging and parsing (in their system, the parse is dependent on the NER tagging because some types of entities are pre-chunked before parsing) and treats the remaining two sections of their pipeline, the alignment and determination stages, as one final stage. Because the entailment determination stage is based on a logistic model, a probability of entailment is given and sampling is straightforward.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="623" end_page="624" type="metho">
    <SectionTitle>
6 Experimental Results
</SectionTitle>
    <Paragraph position="0"> In our experiments we compare the greedy pipelined approach with our sampling pipeline approach. null</Paragraph>
    <Section position="1" start_page="623" end_page="624" type="sub_section">
      <SectionTitle>
6.1 Semantic Role Labeling
</SectionTitle>
      <Paragraph position="0"> For the past two years CoNLL has had shared tasks on SRL (Carreras and M`arquez (2004) and Carreras and M`arquez (2005)). We used the CoNLL 2005 data and evaluation script. When evaluating semantic role labeling results, it is common to present numbers on both the core arguments (i.e., excluding the modifying arguments) and all arguments. We follow this convention and present both sets of numbers. We give precision, 5They report their results on the first PASCAL dataset, and use only the development set from the first challenge for learning weights. When we test on the data from the second challenge, we use all data from the first challenge and the development data from the second challenge to learn these weights.</Paragraph>
      <Paragraph position="1">  recall and F-measure, which are based on the number of arguments correctly identified. For an argument to be correct both the span and the classification must be correct; there is no partial credit. To generate sampled parses, we used the Stanford parser (Klein and Manning, 2003). The CoNLL data comes with parses from Charniak's parser (Charniak, 2000), so we had to re-parse the data and retrain the SRL system on these new parses, resulting in a lower baseline than previously presented work. We choose to use Stanford's parser because of the ease with which we could modify it to generate samples. Unfortunately, its performance is slightly below that of the other parsers.</Paragraph>
      <Paragraph position="2"> The CoNLL data has two separate test sets; the first is section 23 of the Penn Treebank (PTB), and the second is &amp;quot;fresh sentences&amp;quot; taken from the Brown corpus. For full results, please see Table 1.</Paragraph>
      <Paragraph position="3"> On the Penn Treebank portion we saw an absolute F-score improvement of 0.7% on both core and all arguments. On the Brown portion of the test set we saw an improvement of 1.25% on core and 1.16% on all arguments. In this context, a gain of over 1% is quite large: for instance, the scores for the top 4 systems on the Brown data at CoNLL 2005 were within 1% of each other. For both portions, we generated 50 samples, and did this 4 times, averaging the results. We most likely saw better performance on the Brown portion than the PTB portion because the parser was trained on the Penn Treebank training data, so the most likely parses will be of higher quality for the PTB portion of the test data than for the Brown portion. We also</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="624" end_page="624" type="metho">
    <SectionTitle>
RTE Results
</SectionTitle>
    <Paragraph position="0"> pled numbers are averaged over several runs, as discussed.</Paragraph>
    <Paragraph position="1"> ran the pipeline using a 50-best list, and found the two results to be comparable.</Paragraph>
    <Section position="1" start_page="624" end_page="624" type="sub_section">
      <SectionTitle>
6.2 Textual Entailment
</SectionTitle>
      <Paragraph position="0"> For the second PASCAL RTE challenge, two different types of performance measures were used to evaluate labels and confidence of the labels for the text-hypothesis pairs. The first measure is accuracy - the percentage of correct judgments. The second measure is average precision. Responses are sorted based on entailment confidence and then average precision is calculated by the following equation:</Paragraph>
      <Paragraph position="2"> E(i)# correct up to pair ii (5) where n is the size of the test set, R is the number of positive (entailed) examples, E(i) is an indicator function whose value is 1 if the ith pair is entailed, and the is are sorted based on the entailment confidence. The intention of this measure is to evaluate how well calibrated a system is. Systems which are more confident in their correct answers and less confident in their incorrect answers will perform better on this measure.</Paragraph>
      <Paragraph position="3"> Our results are presented in Table 2. We generated 25 samples for each run, and repeated the process 7 times, averaging over runs. Accuracy was improved by 1.5% and average precision by 2%. It does not come as a surprise that the average precision improvement was larger than the accuracy improvement, because our model explicitly estimates its own degree of confidence by estimating the posterior probability of the class label.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML