<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1673">
  <Title>Solving the Problem of Cascading Errors: Approximate Bayesian Inference for Linguistic Annotation Pipelines</Title>
  <Section position="4" start_page="618" end_page="619" type="intro">
    <SectionTitle>
2 Approach
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="618" end_page="618" type="sub_section">
      <SectionTitle>
2.1 Overview
</SectionTitle>
      <Paragraph position="0"> In order to do approximate inference, we model the entire pipeline as a Bayesian network. Each stage in the pipeline corresponds to a variable in the network. For example, the parser stage corresponds to a variable whose possible values are all possible parses of the sentence. The probabilities of the parses are conditioned on the parent variables, which may just be the words of the sentence, or may be the part of speech tags output by a part of speech tagger.</Paragraph>
      <Paragraph position="1"> The simple linear structure of a typical linguistic annotation network permits exact inference that is quadratic in the number of possible labels at each stage, but unfortunately our annotation variables have a very large domain. Additionally, some networks may not even be linear; frequently one stage may require the output from multiple previous stages, or multiple earlier stages may be completely independent of one another. For example, a typical QA system will do question type classification on the question, and from that extract keywords which are passed to the information retreival part of the system. Meanwhile, the retreived documents are parsed and tagged with named entities; the network rejoins those outputs with the question type classification to decide on the correct answer. We address these issues by using approximate inference instead of exact inference. The structure of the nodes in the network permits direct sampling based on a topological sort of the nodes. Samples are drawn from the conditional distributions of each node, conditioned on the samples drawn at earlier nodes in the topological sort.</Paragraph>
    </Section>
    <Section position="2" start_page="618" end_page="619" type="sub_section">
      <SectionTitle>
2.2 Probability of a Complete Labeling
</SectionTitle>
      <Paragraph position="0"> Before we can discuss how to sample from these Bayes nets, we will formalize how to move from an annotation pipeline to a Bayes net. Let A be the set of n annotators A1, A2, ..., An (e.g., part of speech tagger, named entity recognizer, parser).</Paragraph>
      <Paragraph position="1"> These are the variables in the network. For annotator ai, we denote the set of other annotators whose input is directly needed as Parents(Ai) [?] A and a particular assignment to those variables is parents(Ai). The possible values for a particu- null lar annotator Ai are ai (e.g., a particular parse tree or named entity tagging). We can now formulate the probability of a complete annotation (over all annotators) in the standard way for Bayes nets:</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="3" start_page="619" end_page="619" type="sub_section">
      <SectionTitle>
2.3 Approximate Inference in Bayesian
Networks
</SectionTitle>
      <Paragraph position="0"> This factorization of the joint probability distribution facilitates inference. However, exact inference is intractable because of the number of possible values for our variables. Parsing, part of speech tagging, and named entity tagging (to name a few) all have a number of possible labels that is exponential in the length of the sentence, so we use approximate inference. We chose Monte Carlo inference, in which samples drawn from the joint distribution are used to approximate a marginal distribution for a subset of variables in the distribution. First, the nodes are sorted in topological order. Then, samples are drawn for each variable, conditioned on the samples which have already been drawn. Many samples are drawn, and are used to estimate the joint distribution.</Paragraph>
      <Paragraph position="1"> Importantly, for many language processing tasks our application only needs to provide the most likely value for a high-level linguistic annotation (e.g., the guessed semantic roles, or answer to a question), and other annotations such as parse trees are only present to assist in performing that task. The probability of the final annotation is given by:</Paragraph>
      <Paragraph position="3"> Because we are summing out all variables other than the final one, we effectively use only the samples drawn from the final stage, ignoring the labels of the variables, to estimate the marginal distribution over that variable. We then return the label which had the highest number of samples. For example, when trying to recognize textual entailment, we count how many times we sampled &amp;quot;yes, it is entailed&amp;quot; and how many times we sampled &amp;quot;no, it is not entailed&amp;quot; and return the answer with more samples.</Paragraph>
      <Paragraph position="4"> When the outcome you are trying to predict is binary (as is the case with RTE) or n-ary for small n, the number of samples needed to obtain a good estimate of the posterior probability is very small.</Paragraph>
      <Paragraph position="5"> This is true even if the spaces being sampled from during intermediate stages are exponentially large (such as the space of all parse trees). Ng and Jordan (2001) show that under mild assumptions, with only N samples the relative classification error will be at most O( 1N) higher than the error of the Bayes optimal classifier (in our case, the classifier which does exact inference). Even if the outcome space is not small, the sampling technique we present can still be very useful, as we will see later for the case of SRL.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>