<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0102">
  <Title>An Interactive Spreadsheet for Teaching the Forward-Backward Algorithm</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Subject Matter
</SectionTitle>
    <Paragraph position="0"> Among topics in natural language processing, the forward-backward or Baum-Welch algorithm (Baum, 1972) is particularly difficult to teach.</Paragraph>
    <Paragraph position="1"> The algorithm estimates the parameters of a Hidden Markov Model (HMM) by Expectation-Maximization (EM), using dynamic programming to carry out the expectation steps efficiently.</Paragraph>
    <Paragraph position="2"> HMMs have long been central in speech recognition (Rabiner, 1989). Their application to part-of-speech tagging (Church, 1988; DeRose, 1988) kicked off the era of statistical NLP, and they have found additional NLP applications to phrase chunking, text segmentation, word-sense disambiguation, and information extraction.</Paragraph>
    <Paragraph position="3"> The algorithm is also important to teach for pedagogical reasons, as the entry point to a family of EM algorithms for unsupervised parameter estimation. Indeed, it is an instructive special case of (1) the inside-outside algorithm for estimation of probabilistic context-free grammars; (2) belief propagation for training singly-connected Bayesian networks and junction trees (Pearl, 1988; Lauritzen, 1995); (3) algorithms for learning alignment models such as weighted edit distance; (4) general finite-state parameter estimation (Eisner, 2002).</Paragraph>
    <Paragraph position="4"> Before studying the algorithm, students should first have worked with some if not all of the key ideas in simpler settings. Markov models can be introduced through n-gram models or probabilistic finite-state automata. EM can be introduced through simpler tasks such as soft clustering. Global optimization through dynamic programming can be introduced in other contexts such as probabilistic CKY parsing or edit distance. Finally, the students should understand supervised training and Viterbi decoding of HMMs, for example in the context of part-of-speech tagging.</Paragraph>
    <Paragraph position="5"> Even with such preparation, however, the forward-backward algorithm can be difficult for beginning students to apprehend. It requires them to think about all of the above ideas at once, in combination, and to relate them to the nitty-gritty of the algorithm, namely (1) the two-pass computation of mysterious α and β probabilities; (2) the conversion of these prior path probabilities to posterior expectations of transition and emission counts. Just as important, students must develop an understanding of the algorithm's qualitative properties, which it shares with other EM algorithms: (1) performs unsupervised learning (what is this and why is it possible?); (2) alternates expectation and maximization steps; (3) maximizes p(observed training data) (i.e., total probability of all hidden paths that generate those data); (4) finds only a local maximum, so is sensitive to initial conditions; (5) cannot escape zeroes or symmetries, so they should be avoided in initial conditions; (6) uses the states as it sees fit, ignoring the suggestive names that we may give them (e.g., part-of-speech tags); (7) may overfit the training data unless smoothing is used. The spreadsheet lesson was deployed in two 50-minute lectures at Johns Hopkins University, in an introductory NLP course aimed at upper-level undergraduates and first-year graduate students. A single lecture might have sufficed for a less interactive presentation.</Paragraph>
    <Paragraph position="6"> The lesson appeared in week 10 of 13, by which time the students had already been exposed to most of the preparatory topics mentioned above, including Viterbi decoding of a part-of-speech trigram tagging model. However, the present lesson was their first exposure to EM or indeed to any kind of unsupervised learning.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The Ice Cream Climatology Data
</SectionTitle>
    <Paragraph position="0"> [While the spreadsheet could be used in many ways, the next several sections offer one detailed lesson plan. Questions for the class are included; subsequent points often depend on the answers, which are concealed here in footnotes. Some fragments of the full spreadsheet are shown in the figures.] The situation: You are climatologists in the year 2799, studying the history of global warming. You can't find any records of Baltimore weather, but you do find my diary, in which I assiduously recorded how much ice cream I ate each day (see Figure 3).</Paragraph>
    <Paragraph position="1"> What can you figure out from this about the weather that summer? Let's simplify and suppose there are only two kinds of days: C (cold) and H (hot). And let's suppose you have guessed some probabilities as shown on the spreadsheet (Figure 2).</Paragraph>
    <Paragraph position="2"> Thus, you guess that on cold days, I usually ate only 1 ice cream cone: my probabilities of 1, 2, or 3 cones were 70%, 20% and 10%. That adds up to 100%. On hot days, the probabilities were reversed--I usually ate 3 ice creams. So other things equal, if you know I ate 3 ice creams, the odds are 7 to 1 that it was a hot day, but if I ate 2 ice creams, the odds are 1 to 1 (no information).</Paragraph>
    <Paragraph position="3"> You also guess (still Figure 2) that if today is cold, tomorrow is probably cold, and if today is hot, tomorrow is probably hot. (Q: How does this setup resemble part-of-speech tagging?1) We also have some boundary conditions. I only kept this diary for a while. If I was more likely to start or stop the diary on a hot day, then that is useful information and it should go in the table. (Q: Is there an analogy in part-of-speech tagging?2) For simplicity, let's guess that I was equally likely to start or stop on a hot or cold day. So the first day I started writing was equally likely (50%) to be hot or cold, and any given day had the same chance (10%) of being the last recorded day, e.g., because on any day I wrote (regardless of temperature), I had a 10% chance of losing my diary.</Paragraph>
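The guessed parameters described in this section can be written down as a small table, which makes the "7 to 1" and "1 to 1" odds easy to check. This is only a sketch: the names EMIT, TRANS and odds_hot are hypothetical, and the 0.8 probability of staying in the same state is an inference from the 10% stopping chance and the 0.1 switching probability used in the later footnotes, not a value quoted from Figure 2.

```python
import math

# Emission probabilities p(cones | weather), as guessed in the text.
EMIT = {
    "C": {1: 0.7, 2: 0.2, 3: 0.1},
    "H": {1: 0.1, 2: 0.2, 3: 0.7},
}

# Transition probabilities p(next | current). The 0.8 self-transitions are an
# assumption: what is left after 0.1 for switching and 0.1 for Stop.
TRANS = {
    "Start": {"C": 0.5, "H": 0.5},
    "C": {"C": 0.8, "H": 0.1, "Stop": 0.1},
    "H": {"C": 0.1, "H": 0.8, "Stop": 0.1},
}

def odds_hot(cones):
    """Odds of hot against cold from one day's cone count, other things equal."""
    return EMIT["H"][cones] / EMIT["C"][cones]

# Every row of both tables should be a probability distribution.
for table in (EMIT, TRANS):
    for row in table.values():
        assert math.isclose(sum(row.values()), 1.0)

print(odds_hot(3))  # about 7, i.e. 7 to 1 for hot
print(odds_hot(2))  # 1.0, i.e. no information
```

Keeping Start and Stop in the same table as C and H mirrors the boundary conditions discussed above: they are ordinary transition probabilities.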
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 The Trellis and Decoding
</SectionTitle>
    <Paragraph position="0"> [The notation p_i(H) in this paper stands for the probability of H on day i, given all the observed ice cream data. On the spreadsheet itself the subscript i is clear from context and is dropped; thus in Figure 3, p(H) denotes the conditional probability p_i(H), not a prior. The spreadsheet likewise omits subscripts on α_i(H) and β_i(H).] Scroll down the spreadsheet and look at the lower line of Figure 3, which shows a weather reconstruction under the above assumptions. It estimates the relative hot and cold probabilities for each day. Apparently, the summer was mostly hot with a cold spell in the middle; we are unsure about the weather on a few transitional days.</Paragraph>
    <Paragraph position="1"> We will now see how the reconstruction was accomplished. Look at the trellis diagram on the spreadsheet (Figure 4). Consistent with visual intuition, arcs (lines) represent days and states (points) represent the intervening midnights. A cold day is represented by an arc that ends in a C state.3 (So each arc effectively inherits the C or H label of its terminal state.) Q: According to the trellis, what is the a priori probability that the first three days of summer are H,H,C and I eat 2,3,3 cones respectively (as I did)?4 Q: Of the 8 ways to account for the 2,3,3 cones, which is most probable?5 Q: Why do all 8 paths have low probabilities?6</Paragraph>
    <Paragraph position="2"> Recall that the Viterbi algorithm computes, at each state of the trellis, the maximum probability of any path from Start. Similarly, define α at a state to be the total probability of all paths to that state from Start. Q: How would you compute it by dynamic programming?7 Q: Symmetrically, how would you compute β at a state, which is defined to be the total probability of all paths to Stop? The α and β values are computed on the spreadsheet (Figure 1). Q: Are there any patterns in the values?8</Paragraph>
    <Paragraph position="3"> Now for some important questions. Q: What is the total probability of all paths from Start to Stop in which day 3 is hot?9 It is shown in column H of Figure 1. Q: Why is column I of Figure 1 constant at 9.13e-19 across rows?10 Q: What does that column tell us about ice cream or weather?11 Now the class may be able to see how to complete the reconstruction:</Paragraph>
    <Paragraph position="4"> p_3(H) = α_3(H) · β_3(H) / (α_3(C) · β_3(C) + α_3(H) · β_3(H))</Paragraph>
    <Paragraph position="5"> which is 0.989, as shown in cell K29 of Figure 5. Figure 3 simply graphs column K.</Paragraph>
    <Paragraph position="6"> 1A: This is a bigram tag generation model with tags C and H. Each tag independently generates a word (1, 2, or 3); the word choice is conditioned on the tag. 2A: A tagger should know that sentences tend to start with determiners and end with periods. A tagging that ends with a determiner should be penalized because p(Stop | Det) ≈ 0.</Paragraph>
    <Paragraph position="7"> 3 These conventions are a compromise between a traditional view of HMMs and a finite-state view used elsewhere in the course. (The two views correspond to Moore vs. Mealy machines.) In the traditional view, states would represent days and would bear emission probabilities such as p(3 | H). In Figure 4, as in finite-state machines, this role is played by the arcs (which also carry transition probabilities such as p(H | C)); this allows α and β to be described more simply as sums of path probabilities. But we persist in a traditional labeling of the states as H or C so that the notation can refer to them.</Paragraph>
    <Paragraph position="8"> 4A: Consult the path Start → H → H → C, which has probability (0.5 · 0.2) · (0.8 · 0.7) · (0.1 · 0.1) = 0.1 · 0.56 · 0.01 = 0.00056. Note that the trellis is specialized to these data. 5A: H,H,H gives probability 0.1 · 0.56 · 0.56 = 0.03136. (Starting with C would be as cheap as starting with H, but then getting from C to H would be expensive.) 6A: It was a priori unlikely that I'd eat exactly this sequence of ice creams. (A priori there were many more than 8 possible paths, but this trellis only shows the paths generating the actual data 2,3,3.) We'll be interested in the relative probabilities of these 8 paths.</Paragraph>
    <Paragraph position="9"> 7A: In terms of α at the predecessor states: just replace "max" with "+" in the Viterbi algorithm. 8A: α probabilities decrease going down the column, and β probabilities decrease going up, as they become responsible for ...</Paragraph>
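The path probabilities in footnotes 4 and 5, and the "replace max with +" recipe of footnote 7, can be checked by brute force on the first three days (2, 3, 3 cones). The sketch below is an illustration, not the spreadsheet's own formulas: the function names are hypothetical, the 0.8 self-transition is inferred from the footnotes, and the Stop arc is left out because the three-day products in footnotes 4 and 5 do not include it.

```python
from itertools import product

STATES = ("C", "H")
EMIT = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "H": {1: 0.1, 2: 0.2, 3: 0.7}}
START = {"C": 0.5, "H": 0.5}
# Assumption: 0.8 to stay, 0.1 to switch (the remaining 0.1 goes to Stop).
TRANS = {"C": {"C": 0.8, "H": 0.1}, "H": {"C": 0.1, "H": 0.8}}

def path_prob(path, cones):
    """A priori probability of one weather path together with the cones eaten."""
    prob, prev = 1.0, None
    for state, n in zip(path, cones):
        step = START[state] if prev is None else TRANS[prev][state]
        prob *= step * EMIT[state][n]
        prev = state
    return prob

def forward(cones):
    """alpha[i][s]: total probability of all paths from Start ending in s on day i."""
    alpha = [{s: START[s] * EMIT[s][cones[0]] for s in STATES}]
    for n in cones[1:]:
        prev = alpha[-1]
        # Same recurrence as Viterbi, with sum in place of max.
        alpha.append({s: sum(prev[q] * TRANS[q][s] for q in STATES) * EMIT[s][n]
                      for s in STATES})
    return alpha

cones = (2, 3, 3)
paths = {p: path_prob(p, cones) for p in product(STATES, repeat=len(cones))}
best = max(paths, key=paths.get)
alpha = forward(cones)
print(paths[("H", "H", "C")])   # footnote 4 reports 0.00056
print(best, paths[best])        # footnote 5 reports H,H,H with 0.03136
print(sum(alpha[-1].values()), sum(paths.values()))  # the two totals agree
```

Enumerating all paths is only feasible for a toy prefix like this one; the forward recurrence gives the same total in time linear in the number of days.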
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Understanding the Reconstruction
</SectionTitle>
    <Paragraph position="0"> Notice that the lower line in Figure 3 has the same general shape as the upper line (the original data), but is smoother. For example, some 2-ice-cream days were tagged as probably cold and some as probably hot. Q: Why?12 Q: Since the first day has 2 ice creams and doesn't follow a hot day, why was it tagged as hot?13 Q: Why was day 11, which has only 1 ice cream, tagged as hot?14 We can experiment with the spreadsheet (using the Undo command after each experiment). Q: What ... or remove the "weather inertia" in Figure 2?15 Q: What happens if we try "anti-inertia"?16 Even though the number of ice creams is not decisive (consider day 11), it is influential. Q: What do you predict will happen if the distribution of ice creams is the same on hot and cold days?17 Q: What if we also change p(H | Start) from 0.5 to 0?18</Paragraph>
    <Paragraph position="2"> ... ranges over all 2^33 possible weather state sequences such as H,H,C,... Each summand is the probability of a trellis path.</Paragraph>
    <Paragraph position="3"> 12A: Figure 2 assumed a kind of "weather inertia" in which a hot day tends to be followed by another hot day, and likewise for cold days.</Paragraph>
    <Paragraph position="4"> 13A: Because an apparently hot day follows it. (See footnote 5.) It is the β factors that consider this information from the future, and make α_1(H) · β_1(H) &gt; α_1(C) · β_1(C).</Paragraph>
    <Paragraph position="5"> 14A: Switching from hot to cold and back (HCH) has probability 0.01, whereas staying hot (HHH) has probability 0.64. So although the fact that I ate only one ice cream on day 11 favors ...</Paragraph>
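The last two experiments in this section can be tried without the spreadsheet. The sketch below computes p_i(H) by a forward and a backward pass and then makes the ice cream distribution identical on hot and cold days. It is an illustration under stated assumptions: the function and table names are hypothetical, only the three-day prefix 2, 3, 3 is used because the full diary is not reproduced here, the shared emission row is simply the cold-day row, and the 0.8 self-transition is inferred rather than quoted.

```python
STATES = ("C", "H")

def posteriors(cones, start, trans, stop, emit):
    """Return p_i(H) for each day i, given all observed cone counts."""
    # Forward pass: alpha[i][s] sums all paths from Start that reach s on day i.
    alpha = [{s: start[s] * emit[s][cones[0]] for s in STATES}]
    for n in cones[1:]:
        prev = alpha[-1]
        alpha.append({s: sum(prev[q] * trans[q][s] for q in STATES) * emit[s][n]
                      for s in STATES})
    # Backward pass: beta[i][s] sums all paths from s on day i to Stop.
    beta = [{s: stop[s] for s in STATES}]
    for n in reversed(cones[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(trans[s][r] * emit[r][n] * nxt[r] for r in STATES)
                        for s in STATES})
    result = []
    for a, b in zip(alpha, beta):
        total = sum(a[s] * b[s] for s in STATES)  # same value on every day
        result.append(a["H"] * b["H"] / total)
    return result

INERTIA = {"C": {"C": 0.8, "H": 0.1}, "H": {"C": 0.1, "H": 0.8}}  # 0.8 is inferred
STOP = {"C": 0.1, "H": 0.1}
# Assumption: both kinds of day share the cold-day distribution.
FLAT = {s: {1: 0.7, 2: 0.2, 3: 0.1} for s in STATES}
cones = (2, 3, 3)

even = posteriors(cones, {"C": 0.5, "H": 0.5}, INERTIA, STOP, FLAT)
cold_start = posteriors(cones, {"C": 1.0, "H": 0.0}, INERTIA, STOP, FLAT)
print(even)        # every entry is 0.5 up to rounding
print(cold_start)  # starts at 0.0 and rises toward 0.5
```

With uninformative emissions the β factors are equal for C and H, so the posterior reduces to the prior drift of the Markov chain away from its starting state.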
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Reestimating Emission Probabilities
</SectionTitle>
    <Paragraph position="0"> We originally assumed (Figure 2) that I had a 20% chance of eating 2 cones on either a hot or a cold day. But if our reconstruction is right (Figure 3), I actually ate 2 cones on 20% of cold days but 40+% of hot days.</Paragraph>
    <Paragraph position="1"> So now that we "know" which days are hot and which days are cold, we should really update our probabilities to 0.2 and 0.4, not 0.2 and 0.2. After all, our initial probabilities were just guesses.</Paragraph>
    <Paragraph position="2"> Q: Where does the learning come from--isn't this circular? Since our reconstruction was based on the guessed probabilities 0.2 and 0.2, why didn't the reconstruction perfectly reflect those guesses?19 Scrolling rightward on the spreadsheet, we find a table giving the updated probabilities (Figure 7).</Paragraph>
    <Paragraph position="3"> This table feeds into a second copy of the forward-backward calculation and graph. Q: The second graph of p_i(H) (not shown here) closely resembles the first; why is it different on days 11 and 27?20 The updated probability table was computed by the spreadsheet. Q: When it calculated how often I ate 2 cones on a reconstructed hot day, do you think it counted day 27 as a hot day or a cold day?21</Paragraph>
    <Paragraph position="4"> 15A: Changing p(C | C) = p(H | C) = p(C | H) = p(H | H) = 0.45 cancels the smoothing effect (Figure 6a). The lower line now tracks the upper line exactly.</Paragraph>
    <Paragraph position="5"> 16A: Setting p(C | H) = p(H | C) = 0.8 and p(C | C) = p(H | H) = 0.1, rather than vice-versa, yields Figure 6b. 17A: The ice cream data now gives us no information about the weather, so p_i(H) = p_i(C) = 0.5 on every day i. 18A: p_1(H) = 0, but p_i(H) increases toward an asymptote ...</Paragraph>
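One way to picture the emission update is as counting with fractional days: a day with p_i(H) = 0.9 contributes 0.9 of a day to the hot counts and 0.1 of a day to the cold counts. The sketch below illustrates that idea only. The function name is hypothetical, and the five-day diary and its posteriors are invented for the example; they are not taken from the spreadsheet.

```python
from collections import defaultdict

def reestimate_emissions(cones, p_hot):
    """New p(cones | H) and p(cones | C) from fractional day counts."""
    counts = {"H": defaultdict(float), "C": defaultdict(float)}
    for n, ph in zip(cones, p_hot):
        counts["H"][n] += ph          # this day counts as ph of a hot day ...
        counts["C"][n] += 1.0 - ph    # ... and as the remainder of a cold day
    return {s: {n: c / sum(row.values()) for n, c in row.items()}
            for s, row in counts.items()}

# Hypothetical five-day diary and posteriors, invented for illustration only.
cones = [2, 3, 3, 1, 2]
p_hot = [0.9, 1.0, 0.8, 0.1, 0.2]
new_emit = reestimate_emissions(cones, p_hot)
print(new_emit["H"])
print(new_emit["C"])
```

Under this scheme an uncertain day is neither hot nor cold outright; it is split between the two rows of the table in proportion to its posterior.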
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8 Reestimating Transition Probabilities
</SectionTitle>
    <Paragraph position="0"> Notice that Figure 7 also updated the transition probabilities. This involved counting the 4 kinds of days distinguished by Figure 8:22 e.g., what fraction of H days were followed by H? Again, fractional counts were used to handle uncertainty.</Paragraph>
    <Paragraph position="1"> Q: Does Figure 3 (first-order reconstruction) contain enough information to construct Figure 8 (second-order reconstruction)?23 Continuing with the probabilities from the end of footnote 23, suppose we increase p(H | Start) to 0.7. Q: What will happen to the first-order graph?24 Q: What if we switch from anti-inertia back to inertia (Figure 9)?25 Q: In this last case, what do you predict will happen when we reestimate the probabilities?26 This reestimation (Figure 10) slightly improved the reconstruction. [Defer discussion of what "improved" means: the class still assumes that good reconstructions look like Figure 3.] Q: Now what? A: Perhaps we should do it again. And again, and again... Scrolling rightward past 10 successive reestimations, we see that this arrives at the intuitively correct answer (Figure 11)! Thus, starting from an uninformed probability table, the spreadsheet learned sensible probabilities (Figure 11) that enabled it to reconstruct the weather.</Paragraph>
    <Paragraph position="2"> The 3-D graph shows how the reconstruction improved over time.</Paragraph>
    <Paragraph position="3"> The only remaining detail is how the transition probabilities in Figure 8 were computed. Recall that to get Figure 3, we asked what fraction of paths passed through each state. This time we must ask what fraction of paths traversed each arc. (Q: How to compute this?27) Just as there were two possible states each day, there are four possible arcs each day, and the graph reflects their relative probabilities.</Paragraph>
    <Paragraph position="4"> 19A: The reconstruction of the weather underlying the observed data was a compromise between the guessed probabilities (Figure 2) and the demands of the actual data. The model in Figure 2 disagreed with the data: it would not have predicted that 2-cone days actually accounted for more than 20% of all days, or that they were disproportionately likely to fall between 3-cone days.</Paragraph>
    <Paragraph position="5"> 20A: These days fall between hot and cold days, so smoothing has little effect: their temperature is mainly reconstructed from the number of ice creams. 1 ice cream is now better evidence of a cold day, and 2 ice creams of a hot day. Interestingly, days 11 ...</Paragraph>
    <Paragraph position="6"> 23A: No. A dramatic way to see this is to make the distribution of ice cream the same on hot and cold days. This makes the first-order graph constant at 0.5 as in footnote 17. But we can still get a range of behaviors in the second-order graph; e.g., if we switch from inertia to anti-inertia as in footnote 16, then we switch from thinking the weather is unknown but constant to thinking it is unknown but oscillating. 24A: p_i(H) alternates and converges to 0.5 from both sides. 25A: p_i(H) converges to 0.5 from above (cf. footnote 18), as shown in Figure 9. 26A: The first-order graph suggests that the early days of summer were slightly more likely to be hot than cold. Since we ate more ice cream on those days, the reestimated probabilities (unlike the initial ones) slightly favor eating more ice cream on hot days. So the new reconstruction based on these probabilities has a very shallow "U" shape (bottom of Figure 10), in which the low-ice-cream middle of the summer is slightly less ... by reestimation.</Paragraph>
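The fraction of paths traversing each arc can be sketched with the same α and β tables. In the code below the weight of an arc q → r is taken to be the transition probability times the emission probability at r, following the convention that arcs carry both. The names are hypothetical, the 0.8 self-transition is inferred rather than quoted, and only the three-day prefix 2, 3, 3 is used since the full diary is not reproduced here.

```python
STATES = ("C", "H")
EMIT = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "H": {1: 0.1, 2: 0.2, 3: 0.7}}
START = {"C": 0.5, "H": 0.5}
TRANS = {"C": {"C": 0.8, "H": 0.1}, "H": {"C": 0.1, "H": 0.8}}  # 0.8 is inferred
STOP = {"C": 0.1, "H": 0.1}

def forward_backward(cones):
    """Return (alpha, beta), one dict per day."""
    alpha = [{s: START[s] * EMIT[s][cones[0]] for s in STATES}]
    for n in cones[1:]:
        prev = alpha[-1]
        alpha.append({s: sum(prev[q] * TRANS[q][s] for q in STATES) * EMIT[s][n]
                      for s in STATES})
    beta = [dict(STOP)]
    for n in reversed(cones[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(TRANS[s][r] * EMIT[r][n] * nxt[r] for r in STATES)
                        for s in STATES})
    return alpha, beta

def arc_posteriors(cones):
    """For each day after the first: posterior probability of each arc (q, r)."""
    alpha, beta = forward_backward(cones)
    total = sum(alpha[-1][s] * beta[-1][s] for s in STATES)
    return [
        {(q, r): alpha[i - 1][q] * TRANS[q][r] * EMIT[r][cones[i]] * beta[i][r] / total
         for q in STATES for r in STATES}
        for i in range(1, len(cones))
    ]

cones = (2, 3, 3)
for day, arcs in enumerate(arc_posteriors(cones), start=2):
    print(day, {q + r: round(p, 4) for (q, r), p in arcs.items()})
```

The four arc posteriors on a given day sum to 1, and adding the two arcs that end in H recovers that day's p_i(H), which is why the arc-level reconstruction carries strictly more information than the state-level one.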
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
9 Reestimation Experiments
</SectionTitle>
    <Paragraph position="0"> We can check whether the algorithm still learns from other initial guesses. The following examples appear on the spreadsheet and can be copied over the table of initial probabilities. (Except for the pathologically symmetric third case, they all learn the same structure.)</Paragraph>
    <Paragraph position="1"> 1. No weather inertia, but more ice cream on hot days. The model initially behaves as in footnote 15, but over time it learns that weather does have inertia.</Paragraph>
    <Paragraph position="2"> 2. Inertia, but only a very slight preference for more ice cream on hot days. The p_i(H) graph is initially almost as flat as in footnote 17. But over several passes the model learns that I eat a lot more ice cream on hot days.</Paragraph>
    <Paragraph position="3"> 3. A completely symmetric initial state: no inertia, and no preference at all for more ice cream on hot days. Q: What do you expect to happen under reestimation?28</Paragraph>
    <Paragraph position="4"> 4. Like the previous case, but break the symmetry by giving cold days a slight preference to eat more ice cream (Figure 12). This initial state is almost perfectly symmetric. Q: Why doesn't this case appear to learn the same structure as the previous ones?29</Paragraph>
    <Paragraph position="5"> The final case does not converge to quite the same result as the others: C and H are reversed. (It is now H that is used for the low-ice-cream midsummer days.) Should you care about this difference? As climatologists, you might very well be upset that the spreadsheet reversed cold and hot days. But since C and H are ultimately just arbitrary labels, perhaps the outcome is equally good in some sense.</Paragraph>
    <Paragraph position="6"> What does it mean for the outcome of this unsupervised learning procedure to be "good"? The dataset is just the ice cream diary, which makes no reference to weather. Without knowing the true weather, how can we tell whether we did a good job learning it?</Paragraph>
    <Paragraph position="7"> 27A: The total probability of all paths traversing q → r is α(q) · p(q → r) · β(r). 28A: Nothing changes, since the situation is too symmetric. As H and C behave identically, there is nothing to differentiate them and allow them to specialize. 29A: Actually it does; it merely requires more iterations to converge. (The spreadsheet is only wide enough to hold 10 iterations; to run for 10 more, just copy the final probabilities back over the initial ones. Repeat as necessary.) It learns both inertia and a preference for more ice cream on cold days.</Paragraph>
    <Paragraph position="8"> Figure 10: The effect of reestimation on Figure 9. ... (33 days plus Stop), then p(T) increases rapidly during reestimation. To compress the range of the graph, we don't plot p(T) but rather perplexity per observation = 1 / p(T)^(1/34).</Paragraph>
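The symmetric third case can be reproduced with a compact reestimation loop. The sketch below is one possible implementation of forward-backward reestimation for a two-state model with Start and Stop, not the spreadsheet's formulas. All names are hypothetical, the twelve-day diary is invented for illustration because the real diary is not reproduced here, and the code assumes every state keeps a nonzero expected number of days.

```python
STATES = ("C", "H")
CONES = (1, 2, 3)

def em_step(cones, start, trans, stop, emit):
    """One forward-backward reestimation. Returns new tables plus the old likelihood."""
    days = range(len(cones))
    alpha = [{s: start[s] * emit[s][cones[0]] for s in STATES}]
    for n in cones[1:]:
        prev = alpha[-1]
        alpha.append({s: sum(prev[q] * trans[q][s] for q in STATES) * emit[s][n]
                      for s in STATES})
    beta = [dict(stop)]
    for n in reversed(cones[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(trans[s][r] * emit[r][n] * nxt[r] for r in STATES)
                        for s in STATES})
    total = sum(alpha[-1][s] * stop[s] for s in STATES)  # likelihood of the diary
    gamma = [{s: alpha[i][s] * beta[i][s] / total for s in STATES} for i in days]
    new_trans, new_stop, new_emit = {}, {}, {}
    for s in STATES:
        visits = sum(g[s] for g in gamma)  # expected days spent in s (assumed nonzero)
        new_emit[s] = {n: sum(g[s] for g, c in zip(gamma, cones) if c == n) / visits
                       for n in CONES}
        new_trans[s] = {r: sum(alpha[i][s] * trans[s][r] * emit[r][cones[i + 1]]
                               * beta[i + 1][r] for i in days[:-1]) / total / visits
                        for r in STATES}
        new_stop[s] = gamma[-1][s] / visits
    return dict(gamma[0]), new_trans, new_stop, new_emit, total

def run(cones, start, trans, stop, emit, iterations=10):
    """Repeat em_step, recording the likelihood seen at each iteration."""
    likelihoods = []
    for _ in range(iterations):
        start, trans, stop, emit, total = em_step(cones, start, trans, stop, emit)
        likelihoods.append(total)
    return emit, trans, likelihoods

# Hypothetical twelve-day diary, invented for illustration (not the paper's data).
DIARY = (2, 3, 3, 2, 3, 2, 1, 1, 1, 2, 1, 3)
HALF = {"C": 0.5, "H": 0.5}
STOP = {"C": 0.1, "H": 0.1}
FLAT_TRANS = {"C": {"C": 0.45, "H": 0.45}, "H": {"C": 0.45, "H": 0.45}}
UNIFORM = {s: {n: 1 / 3 for n in CONES} for s in STATES}
INERTIA = {"C": {"C": 0.8, "H": 0.1}, "H": {"C": 0.1, "H": 0.8}}
GUESS = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "H": {1: 0.1, 2: 0.2, 3: 0.7}}

sym_emit, sym_trans, sym_like = run(DIARY, HALF, FLAT_TRANS, STOP, UNIFORM)
print(sym_emit["C"], sym_emit["H"])  # the two rows remain identical
_, _, learned = run(DIARY, HALF, INERTIA, STOP, GUESS)
print(learned)  # likelihood at each iteration
```

Starting from the symmetric tables, the C and H rows receive identical fractional counts at every step, so they can never drift apart; starting from the asymmetric guess, the recorded likelihood should not decrease from one iteration to the next.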
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
10 Local Maximization of Likelihood
</SectionTitle>
    <Paragraph position="0"> The answer: A good model is one that predicts the dataset as accurately as possible. The dataset actually has temporal structure, since I tended to have long periods of high and low ice cream consumption. That structure is what the algorithm discovered, regardless of whether weather was the cause.</Paragraph>
    <Paragraph position="1"> The state C or H distinguishes between the two kinds of periods and tends to persist over time.</Paragraph>
    <Paragraph position="2"> So did this learned model predict the dataset well? It was not always sure about the state sequence, but Figure 13 shows that the likelihood of the observed dataset (summed over all possible state sequences) increased on every iteration. (Q: How is this found?30) That behavior is actually guaranteed: repeated forward-backward reestimation converges to a local maximum of likelihood. We have already discovered two symmetric local maxima, both with perplexity of 2.827 per day: the model might use C to represent cold and H to represent hot, or vice versa.</Paragraph>
    <Paragraph position="3"> Q: How much better is 2.827 than a model with no temporal structure?31 Remember that maximizing the likelihood of the training data can lead to overfitting. Q: Do you see any evidence of this in the final probability table?32 Q: Is there a remedy?33</Paragraph>
    <Paragraph position="4"> 30A: It is the total probability of paths that explain the data, i.e., all paths in Figure 4, as given by column I of Figure 1; see footnote 10.</Paragraph>
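Perplexity per observation is just the likelihood rescaled to a per-observation figure, which makes very small probabilities easier to compare. A minimal sketch, assuming 34 observations (33 days plus Stop) and using the column I value of 9.13e-19 quoted earlier as an example input; the function name is hypothetical.

```python
import math

def perplexity_per_observation(likelihood, n_observations):
    """Inverse geometric mean of the per-observation probability."""
    # Computed through logarithms so that tiny likelihoods stay well behaved.
    return math.exp(-math.log(likelihood) / n_observations)

# 34 observations: 33 days plus Stop.
print(perplexity_per_observation(9.13e-19, 34))       # the column I value quoted earlier
print(perplexity_per_observation((1 / 3) ** 34, 34))  # each observation at 1/3 gives about 3
```

A lower perplexity means the model found the diary less surprising, so a rising likelihood and a falling perplexity are the same trend seen on two scales.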
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
11 A Trick Ending
</SectionTitle>
    <Paragraph position="0"> We get very different results if we slightly modify Figure 12 by putting p(1 | H) = 0.3 with p(2 | H) = 0.4. The structure of the solution is very different (Figure 14). In fact, the final parameters now show anti-inertia, giving a reconstruction similar to Figure 6b. Q: What went wrong?34 In the two previous local maxima, H meant "low ice-cream day" or "high ice-cream day." Q: According to Figure 14, what does H mean here?35 Q: What does the low value of p(H | H) mean?36 So we see that there are actually two kinds of structure coexisting in this dataset: days with a lot (little) ice cream tend to repeat, and days with 2 ice creams tend not to repeat. The first kind of structure did a better job of lowering the perplexity, but both are useful. Q: How could we get our model to discover both kinds of structure (thereby lowering the perplexity further)?37 Q: We have now seen three locally optimal models in which the H state was used for 3 different things--even though we named it H for "Hot." What does this mean for the application of this algorithm to part-of-speech tagging?38</Paragraph>
    <Paragraph position="2"> 31A: A model with no temporal structure is a unigram model. A good guess is that it will have perplexity 3, since it will be completely undecided between the 3 kinds of observations. (It so happens that they were equally frequent in the dataset.) However, if we prevent the learning of temporal structure (by setting the initial conditions so that the model is always in state C, or is always equally likely to be in states C and H), we find that the perplexity is 3.314, reflecting the four-way unigram distribution ...</Paragraph>
    <Paragraph position="3"> 32A: p(H | Start) → 1 because we become increasingly sure that the training diary started on a hot day. But this single training observation, no matter how justifiably certain we are of it, might not generalize to next summer's diary.</Paragraph>
    <Paragraph position="4"> 33A: Smoothing the fractional counts. Note: If a prior is used for smoothing, the algorithm is guaranteed to locally maximize the posterior (in place of the likelihood).</Paragraph>
    <Paragraph position="5"> 34A: This is a third local maximum of likelihood, unrelated to the others, with worse perplexity (3.059). Getting stuck in poor local maxima is an occupational hazard.</Paragraph>
    <Paragraph position="6"> 35A: H usually emits 2 ice creams, whereas C never does. So H stands for a 2-ice-cream day.</Paragraph>
    <Paragraph position="7"> 36A: That 2 ice creams are rarely followed by 2 ice creams. Looking at the dataset, this is true. So even this local maximum successfully discovered some structure: it discovered (to my surprise) that when I make up data, I tend not to repeat 2's!</Paragraph>
    <Paragraph position="8"> Figure 14: A suboptimal local maximum.</Paragraph>
  </Section>
  <Section position="11" start_page="0" end_page="0" type="metho">
    <SectionTitle>
12 Follow-Up Assignment
</SectionTitle>
    <Paragraph position="0"> In a follow-up assignment, students applied Viterbi decoding and forward-backward reestimation to part-of-speech tagging.39 In the assignment, students were asked to test their code first on the ice cream data (provided as a small tagged corpus) before switching to real data. This cemented the analogy between the ice cream and tagging tasks, helping students connect the class to the assignment.</Paragraph>
    <Paragraph position="1"> Furthermore, students could check their ice cream output against the spreadsheet, and track down basic bugs by comparing their intermediate results to the spreadsheet's. They reported this to be very useful.</Paragraph>
    <Paragraph position="2"> Presumably it helps learning when students actually find their bugs before handing in the assignment, and when they are able to isolate their misconceptions on their own. It also made office hours and grading much easier for the teaching assistant.</Paragraph>
    <Paragraph position="3"> 37A: Use more states. Four states would suffice to distinguish hot/2, cold/2, hot/not2, and cold/not2 days.</Paragraph>
    <Paragraph position="4"> 38A: There is no guarantee that N and V will continue to distinguish nouns and verbs after reestimation. They will evolve to make whatever distinctions help to predict the word sequence. 39 Advanced students might also want to read about a modern supervised trigram tagger (Brants, 2000), or the mixed results when one actually trains trigram taggers by EM (Merialdo, 1994).</Paragraph>
  </Section>
</Paper>