<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1179">
  <Title>FrameNet-based Semantic Parsing using Maximum Entropy Models</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Maximum Entropy
</SectionTitle>
    <Paragraph position="0"> ME models implement the intuition that the best model is the one that is consistent with the set of constraints imposed by the evidence, but otherwise is as uniform as possible (Berger et al. 1996). We model the probability of a class c given a vector of features x according to the ME formulation below:</Paragraph>
    <Paragraph position="2"> feature function which maps each class and vector element to a binary value, n is the total number of feature functions, and i l is a weight for the feature function. The final classification is just the class with the highest probability given its feature vector and the model.</Paragraph>
    <Paragraph position="3"> It is important to note that the feature functions described here are not equivalent to the subset conditional distributions that are used in G &amp; J's model. ME models are log-linear models in which feature functions map specific instances of features and classes to binary values. Thus, ME is not here being used as another way to find weights for an interpolated model. Rather, the ME approach provides an overarching framework in which the full distribution of classes (semantic roles) given features can be modeled.</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="3" type="metho">
    <SectionTitle>
4 Model
</SectionTitle>
    <Paragraph position="0"> We define the problem into three subsequent processes (see Figure 1): 1) sentence segmentation 2) frame element identification, and 3) semantic role tagging for the identified frame elements. In order to use sentence-wide features for the FE identification, a sentence should have a single non-overlapping constituent sequence instead of all the independent constituents. Sentence segmentation is applied before FE identification for this purpose.</Paragraph>
    <Paragraph position="1"> For each segment the classification into FE or not is performed in the FE identification phase, and from the FE-tagged constituents the semantic role classification is applied in the role tagging phase.</Paragraph>
    <Paragraph position="2"> He got up, bent briefly over her hand.</Paragraph>
    <Paragraph position="3">  apply ME classification to classify each segment into classes of FE (frame element), T (target), NO (none) Extract the identified FEs: choose segments that are identified as FEs  3) Semantic Role Tagging: apply ME classification to classify each FE  Into classes of 120 semantic roles Output role: Agent (He), Manner (briefly), Path (over her hand) for the target &amp;quot;bent&amp;quot;</Paragraph>
    <Section position="1" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Sentence Segmentation
</SectionTitle>
      <Paragraph position="0"> The advantages of applying sentence segmentation before FE identification are considered in two ways. First we can utilize sentence-wide features, and second the number of constituents as FE candidates is reduced, which reduces the convergence time in training.</Paragraph>
      <Paragraph position="1"> We segment a sentence with parse constituents  .</Paragraph>
      <Paragraph position="2"> During training, we split a sentence into true frame elements and the remainder. After choosing frame elements as segments, we choose the highest level constituents in parse tree for other parts, and then make a complete sentence composed of a sequence of constituent segments. During testing, we need to consider all combinations of various level constituents. We know the given target word should be a separate segment because a target word is not a part of other FEs. Since most frame elements tend to be among the higher levels of a parse tree, we decide to use the highest constituents while separating the target word. Figure 2 shows an example of the segmentation for  We use Michael Collins's parser : http://www.cis.upenn.edu/~mcollins/ an actual sentence in FrameNet with the target  a target predicate in a sentence and the shaded constituent represents each segment.</Paragraph>
      <Paragraph position="3"> However, this segmentation for testing reduces the FE coverage of constituents, which means our FE classification performance is limited. Table 1 shows the FE coverage and the number of constituents for our development set. The FE coverage of individual constituents (86.36%) means the accuracy of the parser. This limitation and will be discussed in detail in Section 4.4.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Frame Element Identification
</SectionTitle>
      <Paragraph position="0"> Frame element identification is executed for the sequence of segments. For the example sentence in Figure 2, &amp;quot;(He) (got up) (bent) (briefly) (over her hand)&amp;quot;, there are five segments and each segment has its own feature vector. Maximum Entropy classification into the classes of FE, Target, or None is conducted for each. Since the target predicate is given we don't need to classify a target word into a class, but we do not exclude it from the segments because we want to get benefit of using previous segment's features.</Paragraph>
      <Paragraph position="1"> The initial features are adopted from G &amp; J and FKH, and most features are common to both of frame element identification and semantic role classification. The features are: * Target predicate (target): The target predicate, the principal word in a sentence, is the feature that is provided by the user.</Paragraph>
      <Paragraph position="2"> Although there can be many predicates in a sentence, only one predicate is defined at a time.</Paragraph>
      <Paragraph position="3"> * Target identification (tar): The target identification is a binary value, indicating whether the given constituent is a target or not. Because we have a target word in a sequence of segments, we provide this information explicitly.</Paragraph>
      <Paragraph position="4"> * Constituent path (path): From the syntactic parse tree of a sentence, we extract the path from each constituent to the target predicate.</Paragraph>
      <Paragraph position="5"> The path is represented by the nodes through which one passes while traveling up the tree from the constituent and then down through the governing category to the target word. For example, &amp;quot;over her hand&amp;quot; in a sentence of Figure 2 has a path PP |VP |VBD.</Paragraph>
      <Paragraph position="6"> * Phrase Type (pt): The syntactic phrase type (e.g., NP, PP) of each constituent is also extracted from the parse tree.</Paragraph>
      <Paragraph position="7"> * Syntactic Head (head): The syntactic head of each constituent is obtained based on Michael Collins's heuristic method  . When the head is a proper noun, &amp;quot;proper-noun&amp;quot; substitutes for the real head. The decision if the head is proper noun is done by the part of speech tag in a parse tree.</Paragraph>
      <Paragraph position="8"> * Logical Function (lf): The logical functions of constituents in a sentence are simplified into three values: external argument, object argument, other. We follow the links in the parse tree from the constituent to the ancestors until we meet either S or VP. If the S is found first, we assign external argument to the constituent, and if the VP is found, we assign object argument. Otherwise, other is assigned. Generally, a grammatical function of external argument is a subject, and that of object argument is an object. This feature is applied only to constituents whose phrase type is NP.</Paragraph>
      <Paragraph position="9"> * Position (pos): The position indicates whether a constituent appears before or after the target predicate and whether the constituent has the same parent as the target predicate or not.</Paragraph>
      <Paragraph position="10"> * Voice (voice): The voice of a sentence (active, passive) is determined by a simple regular expression over the surface form of the sentence.</Paragraph>
      <Paragraph position="11"> * Previous class (c_n): The class information of the n th -previous constituent (target, frame element, or none) is used to exploit the dependency between constituents. During training, this information is provided by simply  http://www.ai.mit.edu/people/mcollins/papers/heads looking at the true classes of the frame element occurring n-positions before the current element. During testing, hypothesized classes of the n elements are used and Viterbi search is performed to find the most probable tag sequence for a sentence.</Paragraph>
      <Paragraph position="12"> The combination of these features is used in ME classification as feature sets. The feature sets are optimized by previous work and trial and error experiments. Table 2 shows the lists of feature sets for &amp;quot;briefly&amp;quot; in a sentence of &amp;quot;He got up, bent briefly over her hand&amp;quot;. These feature sets contain the previous or next constituent's features, for example, pt_-1 represents the previous constituent's phrase type and lf_1 represents the next constituent's logical function.</Paragraph>
      <Paragraph position="13">  f(c, target) f(c, &amp;quot;bent&amp;quot;) = 1 f(c, target, pt) f(c, &amp;quot;bent&amp;quot;,ADVP) = 1 f(c, target, pt, lf) f(c, &amp;quot;bent&amp;quot;,ADVP,other) = 1 f(c, pt, pos, voice) f(c, ADVP,after_yes,active) = 1 f(c, pt, lf) f(c, ADVP,other) = 1 f(c, pt_-1, lf_-1) f(c, VBD_-1, other_-1) = 1 f(c, pt_1, lf_1) f(c, PP_1, other_1) = 1 f(c, pt_-1, pos_-1,voice) f(c, VBD_-1,t_-1,active) = 1 f(c, pt_1, pos_1, voice) f(c, PP_1,after_yes_1, active) = 1 f(c, head) f(c, &amp;quot;briefly&amp;quot;) = 1 f(c, head, target) f(c, &amp;quot;briefly&amp;quot;, &amp;quot;bent&amp;quot;) = 1 f(c, path) f(c, ADVP |VP |VBD) = 1 f(c, path_-1) f(c, VBD_-1) = 1 f(c, path_1) f(c, PP |VP |VBD_1) = 1 f(c, tar) f(c, 0) = 1 f(c, c_-1) f(c, &amp;quot;target&amp;quot;_-1) = 1 f(c, c_-1,c_-2) f(c, &amp;quot;target&amp;quot;_-1,&amp;quot;NO FE&amp;quot;_-2) = 1  identification. Example functions of &amp;quot;briefly&amp;quot; from the sample sentence in Fig.2 are shown.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.3 Semantic Role Classification
</SectionTitle>
      <Paragraph position="0"> The semantic role classification is executed only for the constituents that are classified into FEs in the previous FE identification phase. Maximum Entropy classification is performed to classify each FE into classes of semantic roles.</Paragraph>
      <Paragraph position="1"> Most features from the frame element identification in Section 4.2 are still used, and two additional features are applied. The feature sets are in Table 3.</Paragraph>
      <Paragraph position="2"> * Order (order): The relative position of a frame element in a sentence is given. For example, in the sentence from Figure 2, there are three frame elements, and the element &amp;quot;He&amp;quot; has order 0, while &amp;quot;over her hand&amp;quot; has order 2.</Paragraph>
      <Paragraph position="3"> * Syntactic pattern (pat): The sentence level syntactic pattern is generated from the parse tree by looking at the phrase type and logical functions of each frame element in a sentence.</Paragraph>
      <Paragraph position="4"> For example, in the sentence from Figure 2, &amp;quot;He&amp;quot; is an external argument Noun Phrase, &amp;quot;bent&amp;quot; is a target predicate, and &amp;quot;over her hand&amp;quot; is an external argument Prepositional Phrase. Thus, the syntactic pattern associated with the sentence is [NP-ext, target, PP-ext].</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
Feature Sets
</SectionTitle>
      <Paragraph position="0"> f(c, target) f(r, head) f(r, target, pt) f(r, head, target) f(r, target, pt, lf) f(r, head, target, pt) f(r, pt, pos, voice) f(r, order, syn) f(r, pt, pos, voice, target) f(r,target, order, syn) f(r, r_-1) f(r,r_-1,r_-2)</Paragraph>
    </Section>
    <Section position="5" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.4 Experiments and Results
</SectionTitle>
      <Paragraph position="0"> Since FrameNet II was published during our research, we continued using FrameNet I (120 semantic role categories). We can, therefore, compare our results with previous research by matching exactly the same data as used in G &amp; J and FKH. We thank Dan Gildea for providing the following data set: training (36,993 sentences / 75,548 frame elements), development (4,000 sentences / 8,167 frame elements), and held our test sets (3,865 sentences / 7,899 frame elements). We train the ME models using the GIS algorithm (Darroch and Ratcliff, 1972) as implemented in the YASMET ME package (Och, 2002). For testing, we use the YASMET MEtagger (Bender et al. 2003) to perform the Viterbi search for choosing the most probable tag sequence for a sentence using the probabilities from training. Feature weights are smoothed using Gaussian priors with mean 0 (Chen and Rosenfeld, 1999). The standard deviation of this distribution and the number of GIS iterations for training are optimized on development set for each experiment.</Paragraph>
      <Paragraph position="1"> Table 4 shows the performance for test set. The evaluation is done for individual frame elements.</Paragraph>
      <Paragraph position="2"> To segment a sentence before FE identification or role tagging improves the overall performance (from 57.6% to 60.0% in Table 4). Since the segmentation reduces the FE coverage of segments, we conduct the experiment with the manually chosen segmentation to see how much the segmentation helps the performance. Here, we extract segments from the parse tree constituents, so the FE coverage is 86% for test set, which maches the parsing accuracy. Table 5 shows the performance of the frame element identification for test set: F-score is 77.2% that is much better than 71.7% of our automatic segmentation.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="3" end_page="5" type="metho">
    <SectionTitle>
5 n-best Lists and Re-ranking
</SectionTitle>
    <Paragraph position="0"> As stated, the sentence segmentation improves the performance by using sentence-wide features, but it drops the FE coverage of constituents. In order to determine a good segmentation for a sentence that does not reduce the FE coverage, we perform another experiment by using re-ranking.</Paragraph>
    <Paragraph position="1"> We obtain all possible segmentations for a given sentence, and conduct frame element identification and semantic role classification for all segmentations. During both phases, we get n-best lists with Viterbi search, and finally choose the best output with re-ranking method. Figure 3 shows the overall framework of this task.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.1 Maximum Entropy Re-ranking
</SectionTitle>
      <Paragraph position="0"> We model the probability of output r given  feature function which maps each output and all candidates' feature sets to a binary value, n is the total number of feature functions, and l i is the weight for a given feature function. The weight l i is associated with only each feature function while the weight in the ME classifier is associated with all possible classes as well as feature functions. The final decision is r having the highest probability of p(r|{x</Paragraph>
      <Paragraph position="2"> }) from t number of candidates.</Paragraph>
      <Paragraph position="3"> As a feature set for each candidate, we use the ME classification probability that is calculated during Viterbi search. These probabilities are conditional probabilities given feature sets and these feature sets depend on the previous output, for example, semantic role tagging is done for the identified FEs in the previous phase. For this reason, the product of these conditional probabilities is used as a feature set.</Paragraph>
      <Paragraph position="4"> )|(*)|(*)|()|( ferpsegfepssegpsrp = where s is a given sentence, seg is a segmentation, fe is a frame element identification, and r is the final semantic role tagging. p(fe|seg) and p(r|fe) are produced from the ME classification but p(seg|s) is computed by a heuristic method and a development set optimization experiment. The adopted p(seg|s) is composed of p(each segment's part of speech tag  |target's part of speech tag), p(the number of total segments in a sentence  |total number of words in a sentence), and the average of each segment's p(head word of FE  |target).</Paragraph>
      <Paragraph position="5"> Two additional feature sets other than p(r|s) are applied to get slight improvement for re-ranking performance, which are average of p(parse tree depth of FE  |target) and average of p(head word of FE  |target).</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="5" type="sub_section">
      <SectionTitle>
5.2 Experiments and Results
</SectionTitle>
      <Paragraph position="0"> We apply ME re-ranking in YASMET-ME package. We train re-ranking model with development set after obtaining candidate lists for the set. For a simple cross validation, the development set is divided into a sub-training set (3,200 sentences) and a sub-development set (800 sentences) by selecting every fifth sentence.</Paragraph>
      <Paragraph position="1"> Training for re-ranking is executed with the sub-training set and optimization is done with the sub-development set. The final test is applied to test set.</Paragraph>
      <Paragraph position="2"> The possible number of segmentations is different depending on sentences, but the average number of segmentation lists is 15.2  for the development set.</Paragraph>
      <Paragraph position="3"> For these segmentations, we compute 10-best  lists for the FE identification and 10-best lists for the semantic role classification.</Paragraph>
      <Paragraph position="4">  To reduce the number of different segmenations while not dropping the FE coverage, the segmentations having too many segments for a long sentence are excluded.</Paragraph>
      <Paragraph position="5">  The experiment showed 10-best lists outperformed other n-best lists where n is less than 10. The bigger number was not tested because of huge number of lists. He craned over the balcony again but finally he seemed to sigh.  of segmentations depending on each sentence, (2) has mn number of lists when we obtain m possible segmentations in (1) and we get n-best FE identifications, (3) has mnn number of lists when we get n-best role classifications given mn lists (4) shows finally chosen output. Table 6 shows the performance of re-ranking. To evaluate the performance of top-n, the best tagging output for a sentence is chosen among nlists and the performance is computed for that list. The top-5 lists show two interesting points: one is that precision is very high, and the other is that F-score including role tagging is not much different from F-score of only FE identification. In other words, there are a few (not 120) confusing roles for a given frame element, and we have many frame elements that are not identified even in n-best lists.  To improve our re-ranker, more features regarding these problems should be added, and a more principled method to obtain the probability of segmenations, p(seg) in Sectioin 5.1, needs to be investigated.</Paragraph>
      <Paragraph position="6"> Table 7 compares the final output with G &amp; J's best result. Our model is slightly worse than their integrated model, but it supports much further experimentation in segmentation and re-ranking.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>