File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/w05-1515_metho.xml
Size: 28,684 bytes
Last Modified: 2025-10-06 14:09:58
<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1515"> <Title>Vancouver, October 2005. c(c)2005 Association for Computational Linguistics Constituent Parsing by Classification</Title> <Section position="4" start_page="141" end_page="144" type="metho"> <SectionTitle> 2 Parsing by Classification </SectionTitle> <Paragraph position="0"> Recall that with typical probabilistic parsers, our goal is to output the parse ^P with the highest likelihood for the given input sentence x:</Paragraph> <Paragraph position="2"> where each I is a constituency inference in the parse path P.</Paragraph> <Paragraph position="3"> In this work, we explore a generalization in which each inference I is assigned a real-valued confidence score Q(I) and individual confidences are aggregated using some function A, which need not be a sum or product: In Section 2.1 we describe how we induce scoring function Q(I). In Section 2.2 we discuss the aggregation function A. In Section 2.3 we describe the method used to restrict the size of the search space over P(x).</Paragraph> <Section position="1" start_page="141" end_page="143" type="sub_section"> <SectionTitle> 2.1 Learning the Scoring Function Q(I) </SectionTitle> <Paragraph position="0"> During training, our goal is to induce the scoring function Q, which assigns a real-valued confidence score Q(I) to each candidate inference I (Equation 4). We treat this as a classification task: If inference I is correct, we would like Q(I) to be a positive value, and if inference I is incorrect, we would like Q(I) to be a negative value.</Paragraph> <Paragraph position="1"> Training discriminative parsers can be computationally very expensive. Instead of having a single classifier score every inference, we parallelize training by inducing 26 sub-classifiers, one for each constituent label l in the Penn Treebank (Taylor, Marcus, & Santorini, 2003): Q(I ) = Q (I ), where Q is the l-classifier and I is an inference that infers a constituent with labell. For example, the VPclassifier QVP would score the VP-inference in Figure 1, preferably assigning it a positive confidence.</Paragraph> <Paragraph position="2"> Figure 1 A candidate VP-inference, with headchildren annotated using the rules given in (Collins, 1999).</Paragraph> <Paragraph position="4"> DT/The NN/timing JJ/perfect Eachl-classifier is independently trained on training set E , where each example e 2E is a tuple (I ,y), I is a candidatel-inference, and y2f 1g. y=+1 if I is a correct inference and 1 otherwise. This approach differs from that of Yamada and Matsumoto (2003) and Sagae and Lavie (2005), who parallelize according to the POS tag of one of the child items.</Paragraph> <Paragraph position="5"> Our method of generating training examples does not require a working parser, and can be run prior to any training. It is similar to the method used in the literature by deterministic parsers (Yamada & Matsumoto, 2003; Sagae & Lavie, 2005) with one exception: Depending upon the order constituents are inferred, there may be multiple bottom-up paths that lead to the same final parse, so to generate training examples we choose a single random path that leads to the gold-standard parse tree.1 The training examples correspond to all candidate inferences considered in every state along this path, nearly all of which are incorrect inferences (with y = 1). 
During training, for each label $l$ we induce scoring function $Q_l$ to minimize the loss over training set $E_l$:

  $\sum_{(I,y) \in E_l} L\big(y \cdot Q_l(I)\big)$   (5)

where $y \cdot Q_l(I)$ is the margin of example $(I, y)$. Hence, the learning task is to maximize the margins of the training examples, i.e. to induce scoring function $Q_l$ such that it classifies correct inferences with positive confidence and incorrect inferences with negative confidence. In our work, we minimized the logistic loss:

  $L(z) = \ln\big(1 + \exp(-z)\big)$   (6)

i.e. the negative log-likelihood of the training sample.

Our classifiers are ensembles of decision trees, which we boost (Schapire & Singer, 1999) to minimize the above loss using the update equations given in Collins, Schapire, and Singer (2002). More specifically, classifier $Q^T$ is an ensemble comprising decision trees $q^1, \ldots, q^T$, where:

  $Q^T(I) = \sum_{t=1}^{T} q^t(I)$

At iteration $t$, decision tree $q^t$ is grown, its leaves are confidence-rated, and it is added to the ensemble. The classifier for each constituent label is trained independently, so we henceforth omit $l$ subscripts. An example $(I, y)$ is assigned weight $w^t(I, y)$:2

  $w^t(I, y) = \frac{1}{1 + \exp\big(y \cdot Q^{t-1}(I)\big)}$

Let $W^t_{f,+}$ and $W^t_{f,-}$ be the total weight of the correct and incorrect examples, respectively, that fall in leaf $f$ of decision tree $q^t$; this leaf has loss $Z^t_f$:

  $Z^t_f = 2\sqrt{W^t_{f,+} \cdot W^t_{f,-}}$

Growing the decision tree: The loss of the entire decision tree $q^t$ is

  $Z(q^t) = \sum_{f \in \mathrm{leaves}(q^t)} Z^t_f$

2 If the weight were instead $w^t(I, y) = \exp\big(y \cdot Q^{t-1}(I)\big)^{-1}$, but we left the remainder of the algorithm unchanged, this algorithm would be confidence-rated AdaBoost (Schapire & Singer, 1999), minimizing the exponential loss $L(z) = \exp(-z)$. In preliminary experiments, however, we found that the logistic loss provided superior generalization accuracy.

We will use $Z^t$ as a shorthand for $Z(q^t)$. When growing the decision tree, we greedily choose node splits to minimize this $Z$ (Kearns & Mansour, 1999). In particular, the loss reduction of splitting leaf $f$ using feature $\phi$ into two children, $f \wedge \phi$ and $f \wedge \neg\phi$, is

  $\Delta Z^t_f(\phi) = Z^t_f - \big(Z^t_{f \wedge \phi} + Z^t_{f \wedge \neg\phi}\big)$

All inferences that fall in a particular leaf node are assigned the same confidence: if inference $I$ falls in leaf node $f$ of the $t$-th decision tree, then $q^t(I) = k^t_f$, where

  $k^t_f = \frac{1}{2} \ln \frac{W^t_{f,+} + \epsilon}{W^t_{f,-} + \epsilon}$   (14)

Equation 14 is smoothed by the $\epsilon$ term (Schapire & Singer, 1999) to prevent numerical instability in the case that either $W^t_{f,+}$ or $W^t_{f,-}$ is 0. In our experiments, we used $\epsilon = 10^{-8}$. Although our example weights are unnormalized, so far we've found no benefit from scaling $\epsilon$ as Collins and Koo (2005) suggest.

An important concern is when to stop growing the decision tree. We propose the minimum reduction in loss (MRL) stopping criterion: during training, there is a value $\Theta^t$ at iteration $t$ which serves as a threshold on the minimum reduction in loss for leaf splits. If there is no splitting feature for leaf $f$ that reduces loss by at least $\Theta^t$, then $f$ is not split. Formally, leaf $f$ will not be bisected during iteration $t$ if $\max_\phi \Delta Z^t_f(\phi) < \Theta^t$.
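As a concrete illustration of the bookkeeping just described, the sketch below uses the reconstructed formulas above (it is hypothetical code, not the authors' implementation) to compute the logistic-loss example weights, the smoothed leaf confidence of Equation 14, the leaf loss Z, the gain of a candidate split, and the MRL check. Representing each example as a (feature-set, y, previous-score) triple is an assumption made purely for illustration.

import math

EPS = 1e-8   # the smoothing term of Equation 14

def weight(y, q_prev):
    # logistic-loss weight of example (I, y), given the previous ensemble score Q^{t-1}(I)
    return 1.0 / (1.0 + math.exp(y * q_prev))

def leaf_confidence_and_loss(examples):
    # examples: (features, y, q_prev) triples that fall in this leaf
    w_pos = sum(weight(y, q) for _, y, q in examples if y == +1)
    w_neg = sum(weight(y, q) for _, y, q in examples if y == -1)
    conf = 0.5 * math.log((w_pos + EPS) / (w_neg + EPS))   # k_f, Equation 14
    loss = 2.0 * math.sqrt(w_pos * w_neg)                  # Z_f
    return conf, loss

def split_gain(leaf_examples, phi):
    # loss reduction Delta Z_f(phi) of splitting this leaf on binary feature phi
    yes = [e for e in leaf_examples if phi in e[0]]
    no  = [e for e in leaf_examples if phi not in e[0]]
    z_f = leaf_confidence_and_loss(leaf_examples)[1]
    return z_f - (leaf_confidence_and_loss(yes)[1] + leaf_confidence_and_loss(no)[1])

def maybe_split(leaf_examples, candidate_features, theta):
    # MRL stopping criterion: split only if the best gain reaches the threshold theta
    best_phi = max(candidate_features, key=lambda phi: split_gain(leaf_examples, phi))
    return best_phi if split_gain(leaf_examples, best_phi) >= theta else None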
The MRL stopping criterion is essentially $\ell_0$ regularization: $\Theta^t$ corresponds to the $\ell_0$ penalty parameter and each feature with non-zero confidence incurs a penalty of $\Theta^t$, so to outweigh the penalty each split must reduce loss by at least $\Theta^t$.

$\Theta^t$ decreases monotonically during training at the slowest rate possible that still allows training to proceed. We start by initializing $\Theta^1$ to $\infty$, and at the beginning of iteration $t$ we decrease $\Theta^t$ only if the root node $\emptyset$ of the decision tree cannot be split. Otherwise, $\Theta^t$ is set to $\Theta^{t-1}$. Formally,

  $\Theta^t = \min\big(\Theta^{t-1}, \max_\phi \Delta Z^t_\emptyset(\phi)\big)$

In this manner, the decision trees are induced in order of decreasing $\Theta^t$.

During training, the constituent classifiers $Q_l$ never do any parsing per se, and they train at different rates: if $l \neq l'$, then $\Theta^t_l$ isn't necessarily equal to $\Theta^t_{l'}$. We calibrate the different classifiers by picking some meta-parameter $\hat\Theta$ and insisting that the sub-classifiers comprised by a particular parser have all reached some fixed $\hat\Theta$ in training. Given $\hat\Theta$, the constituent classifier for label $l$ is $Q^t_l$, where $\Theta^t_l \geq \hat\Theta > \Theta^{t+1}_l$. To obtain the final parser, we cross-validate $\hat\Theta$, picking the value whose set of constituent classifiers maximizes accuracy on a development set.

2.1.4 Types of Features used by the Scoring Function

Our parser operates bottom-up. Let the frontier of a state be the top-most items (i.e. the items with no parents). The children of a candidate inference are those frontier items below the item to be inferred, the left context items are those frontier items to the left of the children, and the right context items are those frontier items to the right of the children. For example, in the candidate VP-inference shown in Figure 1, the frontier comprises the NP, VBD, and ADJP items; the VBD and ADJP items are the children of the VP-inference (the VBD is its head child), the NP is the left context item, and there are no right context items.

The design of some parsers in the literature restricts the kinds of features that can be usefully and efficiently evaluated. Our scoring function and parsing algorithm have no such limitations. $Q$ can, in principle, use arbitrary information from the history to evaluate constituent inferences. Although some of our feature types are based on prior work (Collins, 1999; Klein & Manning, 2003; Bikel, 2004), we note that our scoring function uses more history information than typical parsers.

All features check whether an item has some property; specifically, whether the item's label/headtag/headword is a certain value. These features perform binary tests on the state directly, unlike Henderson (2003), which works with an intermediate representation of the history. In our baseline setup, feature set $\Phi$ contained five different feature types, described in Table 1 (one example entry from Table 1: "Is there an item dominated by non-rightmost children items that has headword 'quux'?" — False).
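These feature tests can be pictured as simple string-valued checks on the items surrounding a candidate inference. The sketch below is illustrative only: the Item fields, the naming scheme of the generated tests, and the head word assigned to the VBD item are assumptions, not the exact feature set of Table 1.

from dataclasses import dataclass

@dataclass
class Item:
    label: str      # constituent label or POS tag
    headtag: str    # POS tag of the item's head word
    headword: str   # head word (lower-cased)

def atomic_features(children, left_context, right_context, n_context=2):
    """Binary tests on the state for one candidate inference."""
    feats = set()
    groups = {
        'child': children,
        'left':  list(reversed(left_context))[:n_context],   # nearest context first
        'right': right_context[:n_context],
    }
    for group, items in groups.items():
        for pos, item in enumerate(items):
            for attr in ('label', 'headtag', 'headword'):
                feats.add(f'{group}[{pos}].{attr}={getattr(item, attr)}')
    return feats

# The VP-inference of Figure 1 (the VBD head word is assumed for illustration):
np_item   = Item('NP',   'NN',  'timing')
vbd_item  = Item('VBD',  'VBD', 'was')
adjp_item = Item('ADJP', 'JJ',  'perfect')
print(sorted(atomic_features([vbd_item, adjp_item], [np_item], [])))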
</Section>
<Section position="2" start_page="143" end_page="143" type="sub_section">
<SectionTitle>2.2 Aggregating Confidences</SectionTitle>

To get the cumulative score of a parse path $P$, we apply aggregator $A$ over the confidences $Q(I)$ in Equation 4. Initially, we defined $A$ in the customary fashion, summing the loss of each inference's confidence:

  $A_{I \in P}\, Q(I) = -\sum_{I \in P} L\big(Q(I)\big)$

with the logistic loss $L$ as defined in Equation 6. (We negate the final sum because we want to minimize the loss.) This definition of $A$ is motivated by viewing $L$ as a negative log-likelihood given by a logistic function (Collins et al., 2002), and then using Equation 3. It is also inspired by the multiclass loss-based decoding method of Schapire and Singer (1999).

With this additive aggregator, loss monotonically increases as inferences are added, as in a PCFG-based parser in which all productions decrease the cumulative probability of the parse tree. In preliminary experiments, this aggregator gave disappointing results: precision increased slightly, but recall dropped sharply. Exploratory data analysis revealed that, because each inference incurs some positive loss, the aggregator very cautiously builds the smallest trees possible, thus harming recall. We had more success by defining $A$ to maximize the minimum confidence. Essentially,

  $A_{I \in P}\, Q(I) = \min_{I \in P} Q(I)$

Ties are broken according to the second-lowest confidence, then the third-lowest, and so on.
</Section>
<Section position="3" start_page="143" end_page="144" type="sub_section">
<SectionTitle>2.3 Search</SectionTitle>

Given input sentence $x$, we choose the parse path $\hat{P}$ in $\mathcal{P}(x)$ with the maximum aggregated score (Equation 4). Since it is computationally intractable to consider every possible sequence of inferences, we use beam search to restrict the size of $\mathcal{P}(x)$. As an additional guard against excessive computation, search stopped if more than a fixed maximum number of states were popped from the agenda. As usual, search also ended if the highest-priority state in the agenda could not have a better aggregated score than the best final parse found thus far.
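A minimal sketch of such a decoder follows. It assumes a state object with apply/is_final methods, a candidate enumerator `candidates_fn`, and a scoring function `q_fn`; the way the beam and tie-breaking are organized here is an illustrative choice, not a description of the authors' implementation.

import heapq, itertools

def key(confidences):
    # larger minimum confidence is better, with ties broken by the 2nd lowest,
    # 3rd lowest, ...; negate so that Python's min-heap pops the best path first
    return tuple(-c for c in sorted(confidences))

def parse(init_state, candidates_fn, q_fn, beam=1000, max_pops=10000):
    """Beam search over parse paths, maximizing the aggregated score."""
    tie = itertools.count()                       # tie-breaker for equal priorities
    agenda = [(key(()), next(tie), init_state, ())]
    best_key, best_state, pops = None, None, 0
    while agenda and pops < max_pops:             # cap on popped states
        k, _, state, confs = heapq.heappop(agenda)
        pops += 1
        if best_key is not None and k >= best_key:
            break     # the min-confidence aggregate only worsens as inferences are added,
                      # so no remaining state can beat the best finished parse
        if state.is_final():
            best_key, best_state = k, state
            continue
        for inf in candidates_fn(state):
            c = confs + (q_fn(state, inf),)
            heapq.heappush(agenda, (key(c), next(tie), state.apply(inf), c))
        agenda = heapq.nsmallest(beam, agenda)    # beam restriction on the search space
    return best_state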
</Section> </Section>
<Section position="5" start_page="144" end_page="149" type="metho">
<SectionTitle>3 Experiments</SectionTitle>

Following Taskar, Klein, Collins, Koller, and Manning (2004), we trained and tested on sentences of no more than 15 words in the English Penn Treebank (Taylor et al., 2003), 10% of the entire treebank by word count.3 We used sections 02-21 (9753 sentences) for training, section 24 (321 sentences) for development, and section 23 (603 sentences) for testing, preprocessed as per Table 3. We evaluated our parser using the standard PARSEVAL measures (Black et al., 1991): labelled precision, recall, and F-measure (LPRC, LRCL, and LFMS, respectively), which are computed based on the number of constituents in the parser's output that match those in the gold-standard parse. We tested whether the observed differences in PARSEVAL measures are significant at p = 0.05 using a stratified shuffling test (Cohen, 1995, Section 5.3.2) with one million trials.4 As mentioned in Section 1, the parser cannot infer any item that crosses an item already in the state.

3 There was insufficient time before the deadline to train on all sentences.
4 The shuffling test we used was originally implemented by Dan Bikel (http://www.cis.upenn.edu/~dbikel/software.html) and subsequently modified to compute p-values for LFMS differences.

[Table 3: Steps for preprocessing the data. Starred steps are performed only on input with tree structure.
1. * Strip functional tags and trace indices, and remove traces.
2. * Convert PRT to ADVP. (This convention was established by Magerman (1995).)
3. Remove quotation marks (i.e. terminal items tagged `` or ''). (Bikel, 2004)
4. * Raise punctuation. (Bikel, 2004)
5. Remove outermost punctuation.a
6. * Remove unary projections to self (i.e. duplicate items with the same span and label).
7. POS tag the text using Ratnaparkhi (1996).
8. Lowercase headwords.
9. Replace any word observed fewer than 5 times in the (lower-cased) training sentences with UNK.
a As pointed out by an anonymous reviewer of Collins (2003), removing outermost punctuation may discard useful information. It's also worth noting that Collins and Roark (2004) saw an LFMS improvement of 0.8% over their baseline discriminative parser after adding punctuation features, one of which encoded the sentence-final punctuation.]

We placed three additional candidacy restrictions on inferences: (a) items must be inferred under the bottom-up item ordering; (b) to ensure the parser does not enter an infinite loop, no two items in a state can have both the same span and the same label; (c) an item can have no more than K = 5 children. (Only 0.24% of non-terminals in the preprocessed development set have more than five children.) The number of candidate inferences at each state, as well as the number of training examples generated by the algorithm in Section 2.1.1, is proportional to K. In our experiment, there were roughly $|E_l| \approx 1.7$ million training examples for each classifier.
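These candidacy restrictions can be folded directly into the enumeration of candidate inferences. The following sketch is a simplified illustration: the label set shown is a small subset of the 26 labels, and items are bare (label, start, end) spans rather than the parser's actual item structures.

LABELS = ['NP', 'VP', 'PP', 'S', 'ADJP', 'ADVP']   # illustrative subset of the labels
K = 5                                              # restriction (c): at most K children

def candidate_inferences(frontier, state_items, labels=LABELS, k=K):
    """Candidate inferences at a state. `frontier` is the left-to-right list of
    parentless (label, start, end) items; `state_items` is the set of all items
    already in the state. Candidates dominate 1..K contiguous frontier items, so
    they are built bottom-up (a) and cannot cross an existing item; duplicates of
    an existing span-and-label pair are excluded (b)."""
    candidates = []
    for i in range(len(frontier)):
        for j in range(i + 1, min(i + k, len(frontier)) + 1):
            start, end = frontier[i][1], frontier[j - 1][2]
            for label in labels:
                if (label, start, end) in state_items:    # restriction (b)
                    continue
                candidates.append((label, start, end))
    return candidates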
<Section position="1" start_page="144" end_page="144" type="sub_section">
<SectionTitle>3.1 Baseline</SectionTitle>

In the baseline setting, context item features (Section 2.1.4) could refer to the two nearest items of context in each direction. The parser used a beam width of 1000, and was terminated in the rare event that more than 10,000 states were popped from the agenda. Figure 2 shows the accuracy of the baseline on the development set as training progresses.

[Figure 2: PARSEVAL scores of the baseline on the <= 15 words development set of the Penn Treebank. The top x-axis shows accuracy as the minimum reduction in loss $\hat\Theta$ decreases. The bottom x-axis shows the corresponding number of decision tree splits in the parser, summed over all classifiers.]

Cross-validating the choice of $\hat\Theta$ against the LFMS (Section 2.1.3) suggested an optimum of $\hat\Theta = 1.42$. At this $\hat\Theta$, there were a total of 9297 decision tree splits in the parser (summed over all constituent classifiers), LFMS = 87.16, LRCL = 86.32, and LPRC = 88.02.
</Section>
<Section position="2" start_page="144" end_page="145" type="sub_section">
<SectionTitle>3.2 Beam Width</SectionTitle>

To determine the effect of the beam width on the accuracy, we evaluated the baseline on the development set using a beam width of 1, i.e. parsing entirely greedily (Wong & Wu, 1999; Kalt, 2004; Sagae & Lavie, 2005). Table 4 compares the baseline results on the development set with a beam width of 1 and a beam width of 1000.5 The wider beam seems to improve the PARSEVAL scores of the parser, although we were unable to detect a statistically significant improvement in LFMS on our relatively small development set.

[Table 4: PARSEVAL results on the <= 15 words development set of the baseline, varying the beam width, along with the MRL that achieved this LFMS and the total number of decision tree splits at this MRL.]

5 Using a beam width of 100,000 yielded output identical to using a beam width of 1000.
</Section>
<Section position="3" start_page="145" end_page="146" type="sub_section">
<SectionTitle>3.3 Context Size</SectionTitle>

Table 5 compares the baseline to parsers that could not examine as many context items. A significant portion of the baseline's accuracy is due to contextual clues, as evidenced by the poor accuracy of the no-context run. However, we did not detect a significant difference between using one context item or two.

[Table 5: PARSEVAL results on the <= 15 words development set, given the amount of context available. A shaded cell means that the difference between this value and that of the baseline is statistically significant. The score differences between "context 0" and "context 1" are significant, whereas the differences between "context 1" and the baseline are not.]
</Section>
<Section position="4" start_page="146" end_page="146" type="sub_section">
<SectionTitle>3.4 Decision Stumps</SectionTitle>

Our features are of relatively fine granularity. To test whether a less powerful machine could provide accuracy comparable to the baseline, we trained a parser in which we boosted decision stumps, i.e. decision trees of depth 1. Stumps are equivalent to learning a linear discriminant over the atomic features. Since the stumps run trained quite slowly, it only reached 8200 splits in total. To ensure a fair comparison, in Table 6 we chose the best baseline parser with at most 8200 splits. The LFMS of the stumps run on the development set was 85.72%, significantly less accurate than the baseline.

[Table 6: PARSEVAL results of the stumps run and the baseline on the <= 15 words development set, through 8200 splits. The differences between the stumps run and the baseline are statistically significant.]

For example, Figure 3 shows a case where NP classification is better served by the informative conjunction $\phi_1 \wedge \phi_2$ found by the decision trees. Given the sentence "The man left", at the initial state there are six candidate NP-inferences, one for each span, and "(NP The man)" is the only candidate inference that is correct. $\phi_1$ is true for the correct inference and for two of the incorrect inferences ("(NP The)" and "(NP The man left)"). $\phi_1 \wedge \phi_2$, on the other hand, is true only for the correct inference, and so it is better at discriminating NPs over this sample.

[Figure 3: An example of a decision (a) stump and (b) tree for scoring NP-inferences. Each leaf's value is the confidence assigned to all inferences that fall in this leaf. $\phi_1$ asks "does the first child have a determiner headtag?"; $\phi_2$ asks "does the last child have a noun label?". NP classification is better served by the informative conjunction $\phi_1 \wedge \phi_2$ found by the decision trees.]
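The contrast between a stump on $\phi_1$ alone and the conjunction $\phi_1 \wedge \phi_2$ can be reproduced directly on this example. In the sketch below (hypothetical code; the POS tags are assumed for the example), the initial-state frontier items are the tagged words, so $\phi_2$'s test of the last child's label amounts to a test of its POS tag.

# "The man left": word index -> word and POS tag in the preprocessed input
words = ['The', 'man', 'left']
tags  = ['DT',  'NN',  'VBD']
spans = [(i, j) for i in range(3) for j in range(i + 1, 4)]   # the six candidate NP spans

def phi1(i, j):   # "does the first child have a determiner headtag?"
    return tags[i] == 'DT'

def phi2(i, j):   # "does the last child have a noun label?" (here, a noun tag)
    return tags[j - 1] in ('NN', 'NNS', 'NNP', 'NNPS')

gold_np = (0, 2)  # "(NP The man)" is the only correct NP-inference

for (i, j) in spans:
    print(' '.join(words[i:j]).ljust(12),
          'phi1:', phi1(i, j),
          ' phi1&phi2:', phi1(i, j) and phi2(i, j),
          ' correct:', (i, j) == gold_np)

Running this prints True for phi1 on three spans (one correct, two incorrect) but True for the conjunction only on "(NP The man)", mirroring the discussion above.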
</Section>
<Section position="5" start_page="146" end_page="147" type="sub_section">
<SectionTitle>3.5 Deterministic Parsing</SectionTitle>

Our baseline parser simulates a non-deterministic machine, as at any state there may be several correct decisions. We trained deterministic variations of the parser, for which we imposed strict left-to-right (l2r) and right-to-left (r2l) item orderings. For these variations we generated training examples using the corresponding unique path to each gold-standard training tree. The r2l run reached only 8700 splits in total, so in Table 7 we chose the best baseline and l2r parsers with at most 8700 splits.

[Table 7: PARSEVAL results on the <= 15 words development set, through 8700 splits. A shaded cell means that the difference between this value and that of the baseline is statistically significant. All differences between l2r and r2l are significant.]

r2l parsing is significantly more accurate than l2r. The reason is that the deterministic runs (l2r and r2l) must avoid prematurely inferring items that come later in the item ordering. This puts the l2r parser in a tough spot. If it makes far-right decisions, it is more likely to prevent correct subsequent decisions that are earlier in the l2r ordering, i.e. to the left. But if it makes far-left decisions, then it goes against the right-branching tendency of English sentences. In contrast, the r2l parser is more likely to be correct when it infers far-right constituents.

We also observed that the accuracy of the deterministic parsers dropped sharply as training progressed (see Figure 4). This behavior was unexpected, as the accuracy curve levelled off in every other experiment. In fact, the accuracy of the deterministic parsers fell even when parsing the training data. To explain this behavior, we examined the margin distributions of the r2l NP-classifier (Figure 5). As training progressed, the NP-classifier was able to reduce loss by driving up the margins of the incorrect training examples, at the expense of incorrectly classifying a slightly increased number of correct training examples. However, this is detrimental to parsing accuracy: the more correct inferences there are with negative confidence, the less likely it is at some state that the highest-confidence inference is correct. This effect is particularly pronounced in the deterministic setting, where there is only one correct inference per state.
</Section>
<Section position="6" start_page="147" end_page="147" type="sub_section">
<SectionTitle>3.6 Full Vocabulary</SectionTitle>

As in traditional parsers, the baseline was smoothed by replacing any word that occurs fewer than five times in the training data with the special token UNK (Table 3, step 9).
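A minimal sketch of this smoothing step follows (hypothetical code; the treatment of case here is an illustrative simplification of Table 3, steps 8-9):

from collections import Counter

def build_vocab(train_sentences, min_count=5):
    """Words seen fewer than min_count times in the (lower-cased) training
    sentences are outside the vocabulary and will be mapped to UNK."""
    counts = Counter(w.lower() for sent in train_sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def apply_vocab(sentence, vocab):
    return [w if w.lower() in vocab else 'UNK' for w in sentence]

# e.g. apply_vocab("Would service be voluntary or compulsory ?".split(), vocab)
# would map "compulsory" to UNK if it occurred fewer than five times in training.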
Table 8 compares the baseline to a full vocabulary run, in which the vocabulary contained all words observed in the training data. As evidenced by the results therein, controlling for lexical sparsity did not significantly improve accuracy in our setting. In fact, the full vocabulary run is slightly more accurate than the baseline on the development set, although this difference was not statistically significant. This was a late-breaking result, and we used the full vocabulary condition as our final parser for parsing the test set.

[Table 8: PARSEVAL results of the full vocabulary parser on the <= 15 words development set. The differences between the full vocabulary run and the baseline are not statistically significant.]
</Section>
<Section position="7" start_page="147" end_page="147" type="sub_section">
<SectionTitle>3.7 Test Set Results</SectionTitle>

Table 9 shows the results of our best parser on the <= 15 words test set, as well as the accuracy reported for a recent discriminative parser (Taskar et al., 2004) and scores we obtained by training and testing the parsers of Charniak (2000) and Bikel (2004) on the same data. Bikel (2004) is a "clean room" reimplementation of the Collins parser (Collins, 1999) with comparable accuracy. Both Charniak (2000) and Bikel (2004) were trained using the gold-standard tags, as this produced higher accuracy on the development set than using Ratnaparkhi (1996)'s tags.

[Table 9: PARSEVAL results on the <= 15 words test set of various parsers in the literature. The differences between the full vocabulary run and Bikel or Charniak are significant. Taskar et al. (2004)'s output was unavailable for significance testing, but presumably its differences from the full vocabulary parser are also significant.]
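For reference, the bracket-matching computation behind the reported PARSEVAL scores can be sketched as follows (details such as which brackets the standard evaluation software excludes are omitted):

from collections import Counter

def parseval(gold_brackets, test_brackets):
    """Labelled precision, recall, and F-measure over (label, start, end)
    constituents, counted with multiplicity."""
    gold, test = Counter(gold_brackets), Counter(test_brackets)
    matched = sum((gold & test).values())
    lprc = matched / sum(test.values()) if test else 0.0
    lrcl = matched / sum(gold.values()) if gold else 0.0
    lfms = (2 * lprc * lrcl / (lprc + lrcl)) if (lprc + lrcl) else 0.0
    return lprc, lrcl, lfms

# Corpus-level scores accumulate matched, proposed, and gold counts over all
# sentences before taking the ratios, rather than averaging per-sentence scores.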
</Section>
<Section position="8" start_page="147" end_page="149" type="sub_section">
<SectionTitle>3.8 Exploratory Data Analysis</SectionTitle>

To gain a better understanding of the weaknesses of our parser, we examined a sample of 50 development sentences that the full vocabulary parser did not get entirely correct. Besides noise and cases of genuine ambiguity, the following list outlines all error types that occurred in more than five sentences, in roughly decreasing order of frequency. (Note that there is some overlap between these groups.)

* ADVPs and ADJPs: A disproportionate amount of the parser's error was due to ADJPs and ADVPs. Out of the 12.5% total error of the parser on the development set, an absolute 1.0% was due to ADVPs, and 0.9% due to ADJPs. The parser had LFMS = 78.9%, LPRC = 82.5%, LRCL = 75.6% on ADVPs, and LFMS = 68.0%, LPRC = 71.2%, LRCL = 65.0% on ADJPs. These constructions can sometimes involve tricky attachment decisions. For example, in the fragment "to get fat in times of crisis", the parser's output was "(VP to (VP get (ADJP fat (PP in (NP (NP times) (PP of (NP crisis)))))))" instead of the correct construction "(VP to (VP get (ADJP fat) (PP in (NP (NP times) (PP of (NP crisis))))))". The amount of noise present in ADJP and ADVP annotations in the PTB is unusually high. Annotation of ADJP and ADVP unary projections is particularly inconsistent. For example, the development set contains the sentence "The dollar was trading sharply lower in Tokyo .", with "sharply lower" bracketed as "(ADVP (ADVP sharply) lower)". "sharply lower" appears 16 times in the complete training section, every time bracketed as "(ADVP sharply lower)", and "sharply higher" 10 times, always as "(ADVP sharply higher)". Because of the high number of negative examples, the classifiers' bias is to cope with the noise by favoring negative-confidence predictions for ambiguous ADJP and ADVP decisions, hence their abysmal labelled recall. One potential solution is the weight-sharing strategy described in Section 3.5.

* Tagging Errors: Many of the parser's errors were due to poor tagging. Preprocessing the sentence "Would service be voluntary or compulsory ?" gives "would/MD service/VB be/VB voluntary/JJ or/CC UNK/JJ" and, as a result, the parser brackets "service . . . compulsory" as a VP instead of correctly bracketing "service" as an NP. We also found that the tagger we used has difficulty with completely capitalized words, and tends to tag them NNP. By giving the parser access to the same features used by taggers, especially rich lexical features (Toutanova et al., 2003), the parser might learn to compensate for tagging errors.

* Attachment decisions: The parser does not detect affinities between certain word pairs, so it has difficulties with bilexical dependency decisions. In principle, bilexical dependencies can be represented as conjunctions of the features given in Section 2.1.4. Given more training data, the parser might learn these affinities.
</Section> </Section> </Paper>