<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1016">
  <Title>Cascaded Markov Models</Title>
  <Section position="3" start_page="118" end_page="119" type="metho">
    <SectionTitle>
2 Encoding of Syntactical Information as Markov Models
</SectionTitle>
    <Paragraph position="0"> When encoding a part-of-speech tagger as a Markov Model, states represent syntactic categories¹ and outputs represent words. Contextual probabilities of tags are encoded as transition probabilities, and lexical probabilities are encoded as output probabilities of words in states.</Paragraph>
    <Paragraph position="1"> We introduce a modification to this encoding.</Paragraph>
    <Paragraph position="2"> States additionally may represent non-terminal categories (phrases). These new states emit partial parse trees (cf. figure 2). This can be seen as collapsing a sequence of terminals into one nonterminal. Transitions into and out of the new states are performed in the same way as for words and parts-of-speech.</Paragraph>
    <Paragraph position="3"> Transitional probabilities for this new type of Markov Model can be estimated from annotated data in much the same way as for a part-of-speech tagger. The only difference is that sequences of terminals may be replaced by one non-terminal.</Paragraph>
    <Paragraph position="4"> Lexical probabilities need a new estimation method. We use probabilities of context-free partial parse trees. Thus, the lexical probability of the state NP in figure 2 is determined by</Paragraph>
    <Paragraph position="6"> Note that the last three probabilities are the same as for the part-of-speech model.</Paragraph>
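    <Paragraph> The lexical probability of a phrase state can be sketched as the product of the probabilities of all context-free rules in the emitted partial parse tree. The following Python fragment is only an illustration under stated assumptions: the tuple encoding of trees, the function name tree_prob, the rule NP -> ART ADJA NN, and all probability values are hypothetical and are not taken from the trained model. <![CDATA[
```python
def tree_prob(node, rule_prob):
    """Probability of a partial parse tree: product of the probabilities of its rules."""
    label, children = node
    if isinstance(children, str):
        # Pre-terminal node: a rule of the form TAG -> word.
        return rule_prob[(label, (children,))]
    rhs = tuple(child[0] for child in children)
    p = rule_prob[(label, rhs)]
    for child in children:
        p *= tree_prob(child, rule_prob)
    return p


# Hypothetical rule probabilities, chosen only for illustration.
rule_prob = {
    ("NP", ("ART", "ADJA", "NN")): 0.5,
    ("ART", ("ein",)): 0.5,
    ("ADJA", ("enormer",)): 0.25,
    ("NN", ("Posten",)): 0.125,
}
np_tree = ("NP", [("ART", "ein"), ("ADJA", "enormer"), ("NN", "Posten")])
p_np = tree_prob(np_tree, rule_prob)
```
]]> In this sketch, the three TAG -> word factors are exactly the quantities a part-of-speech tagger would use as lexical probabilities; only the phrase-level rule is new.</Paragraph>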
    <Paragraph position="7"> ¹ Categories and states correspond directly in bigram models. For higher-order models, tuples of categories are combined into one state.</Paragraph>
    <Paragraph position="10"> [Figure 2, caption fragment: "... to part-of-speech tagging, outputs of states may consist of structures with probabilities according to a stochastic context-free grammar."]</Paragraph>
  </Section>
  <Section position="4" start_page="119" end_page="122" type="metho">
    <SectionTitle>
3 Cascaded Markov Models
</SectionTitle>
    <Paragraph position="0"> The basic idea of Cascaded Markov Models is to construct the parse tree layer by layer, first structures of depth one, then structures of depth two, and so forth. For each layer, a Markov Model determines the best set of phrases. These phrases are used as input for the next layer, which adds one more layer. Phrase hypotheses at each layer are generated according to stochastic context-free grammar rules (the outputs of the Markov Model) and subsequently filtered from left to right by Markov Models.</Paragraph>
    <Paragraph position="1"> Figure 3 gives an overview of the parsing model.</Paragraph>
    <Paragraph position="2"> Starting with part-of-speech tagging, new phrases are created at higher layers and filtered by Markov Models operating from left to right.</Paragraph>
    <Section position="1" start_page="119" end_page="119" type="sub_section">
      <SectionTitle>
3.1 Tagging Lattices
</SectionTitle>
      <Paragraph position="0"> The processing example in figure 3 only shows the best hypothesis at each layer. But there are alternative phrase hypotheses and we need to determine the best one during the parsing process.</Paragraph>
      <Paragraph position="1"> All rules of the generated context-free grammar with right-hand sides that are compatible with part of the sequence are added to the search space. Figure 4 shows an example of the hypotheses at the first layer when processing the sentence of figure 1.</Paragraph>
      <Paragraph position="2"> Each bar represents one hypothesis. The position of the bar indicates the covered words. It is labeled with the type of the hypothetical phrase, an index in the upper left corner for later reference, and the negative logarithm of the probability that this phrase generates the terminal yield (i.e., the smaller the better; probabilities for part-of-speech tags are omitted for clarity). This part is very similar to the chart entries of a chart parser.</Paragraph>
      <Paragraph position="3"> All phrases that are newly introduced at this layer are marked with an asterisk (*). They are produced according to context-free rules, based on the elements passed from the next lower layer.</Paragraph>
      <Paragraph position="4"> The layer below layer 1 is the part-of-speech layer.</Paragraph>
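      <Paragraph> Generating the phrase hypotheses for one layer can be sketched as matching rule right-hand sides against the sequence passed up from the layer below. The Python fragment below is a simplified illustration, not the original implementation: it assumes a single unambiguous input sequence (the actual input may itself be a lattice), and the function name phrase_hypotheses and the three rules are hypothetical. <![CDATA[
```python
def phrase_hypotheses(symbols, rules):
    """Return lattice edges (start, end, phrase) for every rule whose
    right-hand side matches a contiguous part of the symbol sequence."""
    edges = []
    for lhs, rhs in rules:
        n = len(rhs)
        for start in range(len(symbols) - n + 1):
            if tuple(symbols[start:start + n]) == tuple(rhs):
                edges.append((start, start + n, lhs))
    return sorted(edges)


# Hypothetical part-of-speech layer and rules, for illustration only.
tags = ["ART", "ADJA", "NN", "APPR", "NN", "KON", "NN"]
rules = [
    ("NP", ("ART", "ADJA", "NN")),
    ("PP", ("APPR", "NN")),
    ("CNP", ("NN", "KON", "NN")),
]
edges = phrase_hypotheses(tags, rules)
```
]]> Positions count the gaps between words, so an edge (0, 3, "NP") covers the first three words. Overlapping edges such as the PP and the CNP in this example are alternatives that the Markov Model has to choose between.</Paragraph>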
      <Paragraph position="5"> The hypotheses form a lattice, with the word boundaries being states and the phrases being edges. Selecting the best hypothesis means finding the best path from node 0 to the last node (node 14 in the example). The best path can be found efficiently with the Viterbi algorithm (Viterbi, 1967), which runs in time linear in the length of the word sequence. Seen this way, processing a layer is similar to word lattice processing in speech recognition (cf. Samuelsson, 1997).</Paragraph>
      <Paragraph position="7"> Two types of probabilities are important when searching for the best path in a lattice. First, these are probabilities of the hypotheses (phrases) generating the underlying terminal nodes (words).</Paragraph>
      <Paragraph position="8"> They are calculated according to a stochastic context-free grammar and given in figure 4. The second type are context probabilities, i.e., the probability that some type of phrase follows or precedes another. The two types of probabilities coincide with lexical and contextual probabilities of a Markov Model, respectively.</Paragraph>
      <Paragraph position="9"> According to a trigram model (generated from a corpus), the path in figure 4 that is marked grey is the best path in the lattice. Its probability is</Paragraph>
      <Paragraph> [Figure 3, caption fragment: "... possibly ambiguous output together with probabilities is passed to higher layers (only the best hypotheses are shown for clarity). At each layer, new phrases and grammatical functions are added."]</Paragraph>
      <Paragraph position="11"> Start and end of the path are indicated by a dollar sign ($). This path is very close to the correct structure for layer 1. The CNP and PP are correctly recognized. Additionally, the best path correctly predicts that APPR, VAFIN and VVPP should not be attached in layer 1. The only error is the NP ein enormer Posten. Although this is on its own a perfect NP, it is not complete because the PP an Arbeit und Geld is missing. ART, ADJA and NN should be left unattached in this layer in order to be able to create the correct structure at higher layers.</Paragraph>
      <Paragraph position="12"> The presented Markov Models act as filters.</Paragraph>
      <Paragraph position="13"> The probability of a connected structure is determined only based on a stochastic context-free grammar. The joint probabilities of unconnected partial structures are determined by additionally using Markov Models. While building the structure bottom up, parses that are unlikely according to the Markov Models are pruned.</Paragraph>
    </Section>
    <Section position="2" start_page="119" end_page="119" type="sub_section">
      <SectionTitle>
3.2 The Method
</SectionTitle>
      <Paragraph position="0"> The standard Viterbi algorithm is modified in order to process Markov Models operating on lattices. In part-of-speech tagging, each hypothesis (a tag) spans exactly one word. Now, a hypothesis can span an arbitrary number of words, and the same span can be covered by an arbitrary number of alternative word or phrase hypotheses. Using terms of a Markov Model, a state is allowed to emit a context-free partial parse tree, starting with the represented non-terminal symbol, yielding part of the sequence of words. This is in contrast to standard Markov Models. There, states emit atomic symbols. Note that an edge in the lattice is represented by a state in the corresponding Markov Model. Figure 2 shows the part of the Markov Model that represents the best path in the lattice of figure 4.</Paragraph>
      <Paragraph position="1"> The equations of the Viterbi algorithm are adapted to process a language model operating on a lattice. Instead of the words, the gaps between the words are enumerated (see figure 4), and an edge between two states can span one or more words, such that an edge is represented by a triple &lt;t, t', q&gt;, starting at time t, ending at time t' and representing state q.</Paragraph>
      <Paragraph position="2"> We introduce accumulators $\Delta_{t,t'}(q)$ that collect the maximum probability of state q covering words from position t to t'. We use $\delta_{i,j}(q)$ to denote the probability of the derivation emitted by state q having a terminal yield that spans positions i to j. These are needed here as part of the accumulators $\Delta$.</Paragraph>
      <Paragraph position="3"> Initialization:</Paragraph>
      <Paragraph position="5"/>
      <Paragraph position="7"> Additionally, it is necessary to keep track of the elements in the lattice that maximized each $\Delta_{t,t'}(q)$.</Paragraph>
      <Paragraph position="8"> When reaching time T, we get the best last element in the lattice</Paragraph>
      <Paragraph position="10"> for i &gt; 1, until we reach $t^m_k = 0$. Now, $q^m_1 \ldots q^m_k$ is the best sequence of phrase hypotheses (read backwards).</Paragraph>
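      <Paragraph> The lattice variant of the Viterbi algorithm can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions, not the original implementation: it uses bigram transitions instead of trigrams, works with log probabilities, and the names lattice_viterbi, safe_log, and the toy probabilities are hypothetical. Each edge is a tuple (start, end, state, output probability), and one accumulator is kept per edge. <![CDATA[
```python
import math
from collections import defaultdict


def safe_log(p):
    return math.log(p) if p > 0 else float("-inf")


def lattice_viterbi(edges, trans, T, boundary="$"):
    """edges: (start, end, state, output_prob); trans[(prev, state)] = P(state | prev).
    Returns the best edge sequence from position 0 to T and its log probability."""
    delta = {}  # accumulator: best log probability of a path ending with this edge
    back = {}   # predecessor edge that maximized the accumulator
    ending_at = defaultdict(list)
    # Sorting by start position guarantees all predecessors are processed first.
    for start, end, state, emit in sorted(edges):
        key = (start, end, state)
        if start == 0:
            cands = [(safe_log(trans.get((boundary, state), 0.0)), None)]
        else:
            cands = [(delta[p] + safe_log(trans.get((p[2], state), 0.0)), p)
                     for p in ending_at[start]]
        if not cands:
            continue  # edge cannot be reached from position 0
        score, prev = max(cands, key=lambda c: c[0])
        delta[key] = score + safe_log(emit)
        back[key] = prev
        ending_at[end].append(key)
    finals = [(delta[k] + safe_log(trans.get((k[2], boundary), 0.0)), k)
              for k in ending_at[T]]
    if not finals:
        return [], float("-inf")
    best_score, last = max(finals, key=lambda c: c[0])
    path = []
    while last is not None:  # walk backwards through the stored predecessors
        path.append(last)
        last = back[last]
    path.reverse()
    return path, best_score


# Toy lattice over three words; all probabilities are hypothetical.
trans = {("$", "ART"): 0.5, ("ART", "ADJA"): 0.5, ("ADJA", "NN"): 0.5,
         ("NN", "$"): 0.5, ("$", "NP"): 0.5, ("NP", "$"): 0.5}
edges = [(0, 1, "ART", 0.5), (1, 2, "ADJA", 0.5), (2, 3, "NN", 0.5),
         (0, 3, "NP", 0.1)]
path, score = lattice_viterbi(edges, trans, T=3)
```
]]> Because every edge is compared only with the edges that end where it starts, the work grows with the number of adjacent edge pairs in the lattice rather than with the number of complete paths.</Paragraph>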
    </Section>
    <Section position="3" start_page="119" end_page="119" type="sub_section">
      <SectionTitle>
3.3 Passing Ambiguity to the Next Layer
</SectionTitle>
      <Paragraph position="0"> The process can move on to layer 2 after the first layer is computed. The results of the first layer are taken as the base and all context-free rules that apply to the base are retrieved. These again form a lattice and we can calculate the best path for layer 2.</Paragraph>
      <Paragraph position="1"> The Markov Model for layer 1 operates on the output of the Markov Model for part-of-speech tagging, the model for layer 2 operates on the output of layer 1, and so on. Hence the name of the processing model: Cascaded Markov Models.</Paragraph>
      <Paragraph position="2"> Very often, it is not sufficient to calculate just the best sequence of words/tags/phrases. A single error at one layer may lead to subsequent errors at higher layers. Therefore, we calculate not only the best sequence but several top-ranked sequences. The number of passed hypotheses depends on a pre-defined threshold $\theta &gt; 1$. We select all hypotheses with probabilities $P &gt; P_{best}/\theta$. These are passed to the next layer together with their probabilities.</Paragraph>
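      <Paragraph> The selection of hypotheses that are passed upwards can be sketched in a few lines of Python. The function name select_hypotheses, the hypothesis labels, and the probability values are hypothetical; only the criterion (keep everything within a factor of the threshold from the best probability) follows the description above. <![CDATA[
```python
def select_hypotheses(scored, theta):
    """Keep all (hypothesis, probability) pairs whose probability exceeds
    the best probability divided by the threshold theta (theta > 1)."""
    if theta <= 1:
        raise ValueError("theta must be greater than 1")
    best = max(p for _, p in scored)
    return [(h, p) for h, p in scored if p > best / theta]


# Hypothetical hypotheses with their probabilities.
scored = [("a", 0.4), ("b", 0.1), ("c", 0.01)]
kept = select_hypotheses(scored, theta=10)
```
]]> With a threshold of 10, hypothesis "c" is dropped because its probability is more than ten times smaller than the best one. A larger threshold passes more ambiguity to the next layer at the cost of a larger lattice.</Paragraph>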
    </Section>
    <Section position="4" start_page="119" end_page="122" type="sub_section">
      <SectionTitle>
3.4 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> Transitional parameters for Cascaded Markov Models are estimated separately for each layer.</Paragraph>
      <Paragraph position="1"> Output parameters are the same for all layers; they are taken from the stochastic context-free grammar that is read off the treebank.</Paragraph>
      <Paragraph position="2"> Training on annotated data is straightforward.</Paragraph>
      <Paragraph position="3"> First, we number the layers, starting with 0 for the part-of-speech layer. Subsequently, information for the different layers is collected.</Paragraph>
      <Paragraph position="4"> Each sentence in the corpus represents one training sequence for each layer. This sequence consists of the tags or phrases at that layer. If a span is not covered by a phrase at a particular layer, we take the elements of the highest layer below the actual layer. Figure 5 shows the training sequences for layers 0-4 generated from the sentence in figure 1. Contextual parameter estimation is done in analogy to models for part-of-speech tagging, and the same smoothing techniques can be applied. We use a linear interpolation of uni-, bi-, and trigram models.</Paragraph>
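      <Paragraph> Reading the per-layer training sequences off an annotated tree can be sketched as follows. This Python fragment rests on assumptions that are not spelled out above: the layer of a phrase is taken to be its height above the part-of-speech tags, trees are encoded as nested tuples, and the example tree is a simplified, hypothetical fragment. The function names height and layer_sequence are likewise illustrative. <![CDATA[
```python
def height(node):
    """Part-of-speech tags have height 0; a phrase is one above its highest child."""
    label, children = node
    if isinstance(children, str):
        return 0
    return 1 + max(height(child) for child in children)


def layer_sequence(nodes, layer):
    """Training sequence for one layer: a node is emitted as a unit if it belongs
    to this layer or a lower one; otherwise its children are used instead."""
    sequence = []
    for node in nodes:
        label, children = node
        if height(node) <= layer:
            sequence.append(label)
        else:
            sequence.extend(layer_sequence(children, layer))
    return sequence


# Simplified, hypothetical tree fragment for illustration.
sentence = [
    ("NP", [("ART", "ein"), ("ADJA", "enormer"), ("NN", "Posten"),
            ("PP", [("APPR", "an"),
                    ("CNP", [("NN", "Arbeit"), ("KON", "und"), ("NN", "Geld")])])]),
    ("VAFIN", "wird"),
]
layers = {k: layer_sequence(sentence, k) for k in range(4)}
```
]]> In this sketch, layer 1 contains the CNP but leaves ART, ADJA, NN and APPR unattached, because the phrases that contain them are only completed at higher layers.</Paragraph>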
      <Paragraph position="5"> A stochastic context-free grammar is read off the corpus. The rules derived from the annotated sentence in figure 1 are also shown in figure 5. [...] same for all layers. We could estimate probabilities for rules separately for each layer, but this would worsen the sparse data problem.</Paragraph>
      <Paragraph> [Figure 5, caption fragment: "... used to estimate transition probabilities for the corresponding Markov Models. The context-free rules are used to estimate the SCFG, which determines the output probabilities of the Markov Models."]</Paragraph>
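      <Paragraph> Reading a stochastic context-free grammar off annotated trees can be sketched with relative-frequency estimates. Relative-frequency estimation is an assumption here (the text only states that the grammar is read off the corpus), and the tree encoding, the function names rules_of and estimate_scfg, and the two toy trees are hypothetical. <![CDATA[
```python
from collections import Counter


def rules_of(node):
    """Yield every context-free rule (lhs, rhs) used in a tree."""
    label, children = node
    if isinstance(children, str):
        yield (label, (children,))
    else:
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from rules_of(child)


def estimate_scfg(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rule for tree in trees for rule in rules_of(tree))
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}


# Two hypothetical trees, for illustration only.
trees = [
    ("NP", [("ART", "ein"), ("NN", "Posten")]),
    ("NP", [("ART", "der"), ("ADJA", "enorme"), ("NN", "Posten")]),
]
scfg = estimate_scfg(trees)
```
]]> Pooling the counts over all layers, as in this sketch, gives each rule a single probability regardless of the layer at which it applies.</Paragraph>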
    </Section>
  </Section>
  <Section position="5" start_page="122" end_page="123" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> This section reports on results of experiments with Cascaded Markov Models. We evaluate chunking precision and recall, i.e., the recognition of kernel NPs and PPs. These exclude prenominal adverbs and postnominal PPs and relative clauses, but include all other prenominal modifiers, which can be fairly complex adjective phrases in German. Figure 6 shows an example of a complex NP and the output of the parsing process.</Paragraph>
    <Paragraph position="1"> For our experiments, we use the NEGRA corpus (Skut et al., 1997). It consists of German newspaper texts (Frankfurter Rundschau) that are annotated with predicate-argument structures. We extracted all structures for NPs, PPs, APs, AVPs (i.e., we mainly excluded sentences, VPs and coordinations). The version of the corpus used contains 17,000 sentences (300,000 tokens).</Paragraph>
    <Paragraph position="2"> The corpus was divided into a training part (90%) and a test part (10%). Experiments were repeated 10 times and the results were averaged. Cross-evaluation was done in order to obtain more reliable performance estimates than from a single test run. The input of the process is a sequence of words (divided into sentences); the output consists of part-of-speech tags and structures like the one indicated in figure 6.</Paragraph>
    <Paragraph position="3"> [Caption fragment: "... precision. Differences to labeled recall/precision are small, since the number of different non-terminal categories is very restricted."]</Paragraph>
    <Paragraph position="4"> [...] they started with correctly tagged data, so our task is harder since it includes the process of part-of-speech tagging.</Paragraph>
    <Paragraph position="5"> Recall increases with the number of layers. It ranges from 54.0% for 1 layer to 84.8% for 9 layers. This could be expected, because the number of layers determines the number of phrases that can be parsed by the model. The additional line for "topline recall" indicates the percentage of phrases that can be parsed by Cascaded Markov Models with the given number of layers. All nodes that belong to higher layers cannot be recognized.</Paragraph>
    <Paragraph position="6"> Precision slightly decreases with the number of layers. It ranges from 91.4% for 1 layer to 88.3% for 9 layers.</Paragraph>
    <Paragraph position="7"> The F-score is a weighted combination of recall R and precision P and defined as follows:</Paragraph>
    <Paragraph position="9"> $\beta$ is a parameter encoding the importance of recall and precision. Using an equal weight for both ($\beta = 1$), the maximum F-score is reached for 7 layers (F = 86.5%).</Paragraph>
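    <Paragraph> The weighted combination can be sketched in Python, assuming the standard weighted F-measure F = (β² + 1)·P·R / (β²·P + R). Both the formula and the function name f_score are assumptions made for this illustration, and the input values are arbitrary. <![CDATA[
```python
def f_score(precision, recall, beta=1.0):
    """Weighted F-measure; beta > 1 gives more weight to recall,
    beta < 1 gives more weight to precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (b2 + 1) * precision * recall / denominator


balanced = f_score(0.5, 0.5)
recall_weighted = f_score(1.0, 0.5, beta=2.0)
```
]]> With equal weights the measure reduces to the harmonic mean of precision and recall, so it rewards configurations in which neither value is much lower than the other.</Paragraph>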
    <Paragraph position="10"> The part-of-speech tagging accuracy slightly increases with the number of Markov Model layers (bottom line in figure 7). This can be explained by top-down decisions of Cascaded Markov Models.</Paragraph>
    <Paragraph position="11"> A model at a higher layer can select a tag with a lower probability if this increases the probability at that layer. Thereby some errors made at lower layers can be corrected. This leads to the increase of up to 0.3% in accuracy.</Paragraph>
    <Paragraph position="12"> Results for chunking Penn Treebank data were previously presented by several authors (Ramshaw and Marcus, 1995; Argamon et al., 1998; Veenstra, 1998; Cardie and Pierce, 1998).</Paragraph>
    <Paragraph position="13"> These are not directly comparable to our results, because they processed a different language and generated only one layer of structure (the chunk boundaries), while our algorithm also generates the internal structure of chunks. In general, however, Cascaded Markov Models can be reduced to generating just one layer and can be trained on Penn Treebank data.</Paragraph>
    <Paragraph position="14"> [Figure 6, example of a complex NP:
      die von der Bundesregierung angestrebte Entlassung des Bundes aus einzelnen Bereichen
      ART APPR ART NN ADJA NN ART NN APPR ADJA NN
      "the by the government intended dismissal (of) the federation from several areas"]</Paragraph>
    <Paragraph position="15"> [Figure 7, caption fragment: "... depending on the number of layers that are used for parsing. Layer 0 is used for part-of-speech tagging, for which tagging accuracies are given at the bottom line. Topline recall is the maximum recall possible for that number of layers."]</Paragraph>
    <Paragraph position="16"> Proceedings of EACL '99</Paragraph>
  </Section>
</Paper>