<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2401"> <Title>A Linear Programming Formulation for Global Inference in Natural Language Tasks</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We describe below two experiments on the problem of simultaneously recognizing entities and relations. In the first, we view the task as a knowledge acquisition task - we let the system read sentences and identify entities and relations among them. Given that this is a difficult task which may require quite often information beyond the sentence, we consider also a &quot;forced decision&quot; task, in which we simulate a question answering situation we ask the system, say, &quot;who killed whom&quot; and evaluate it on identifying correctly the relation and its arguments, given that it is known that somewhere in this sentence this relation is active. In addition, this evaluation exhibits the ability of our approach to incorporate task specific constraints at decision time.</Paragraph> <Paragraph position="1"> Our experiments are based on the TREC data set (which consists of articles from WSJ, AP, etc.) that we annotated for named entities and relations. In order to effectively observe the interaction between relations and entities, we picked 1437 sentences that have at least one active relation. Among those sentences, there are 5336 entities, and 19048 pairs of entities (binary relations). Entity labels include 1685 persons, 1968 locations, 978 organizations and 705 others. Relation labels include 406 located in, 394 work for, 451 orgBased in, 521 live in, 268 kill, and 17007 none. Note that most pairs of entities have no active relations at all. Therefore, relation none significantly outnumbers others. Examples of each relation label and the constraints between a relation variable and its two entity arguments are shown as follows.</Paragraph> <Paragraph position="2"> Relation Entity1 Entity2 Example located in loc loc (New York, US) work for per org (Bill Gates, Microsoft) orgBased in org loc (HP, Palo Alto) live in per loc (Bush, US) kill per per (Oswald, JFK) In order to focus on the evaluation of our inference procedure, we assume the problem of segmentation (or phrase detection) (Abney, 1991; Punyakanok and Roth, 2001) is solved, and the entity boundaries are given to us as input; thus we only concentrate on their classifications. We evaluate our LP based global inference procedure against two simpler approaches and a third that is given more information at learning time. Basic, only tests our entity and relation classifiers, which are trained independently using only local features. In particular, the relation classifier does not know the labels of its entity arguments, and the entity classifier does not know the labels of relations in the sentence either. Since basic classifiers are used in all approaches, we describe how they are trained here.</Paragraph> <Paragraph position="3"> For the entity classifier, one set of features are extracted from words within a size 4 window around the target phrase. They are: (1) words, part-of-speech tags, and conjunctions of them; (2) bigrams and trigrams of the mixture of words and tags. 
In addition, some other features are extracted from the target phrase, including:

  symbol     explanation
  icap       the first character of a word is capitalized
  acap       all characters of a word are capitalized
  incap      some characters of a word are capitalized
  suffix     the suffix of a word is "ing", "ment", etc.
  bigram     bigram of words in the target phrase
  len        number of words in the target phrase
  place [3]  the phrase is/has a known place's name
  prof [3]   the phrase is/has a professional title (e.g. Lt.)
  name [3]   the phrase is/has a known person's name

[3] We collect names of famous places, people and popular titles from other data sources in advance.
</Paragraph>
<Paragraph position="4"> For the relation classifier, there are three sets of features: (1) features similar to those used in the entity classification, extracted from the two argument entities of the relation; (2) conjunctions of the features from the two arguments; (3) some patterns extracted from the sentence or between the two arguments. Some features in category (3) are "the number of words between arg1 and arg2", "whether arg1 and arg2 are the same word", or "arg1 is the beginning of the sentence and has words that consist of all capitalized characters", where arg1 and arg2 represent the first and second argument entities respectively. In addition, Table 1 presents some patterns we use.

  Table 1: Some patterns used by the relation classifier
  Pattern                          Example
  arg1 , arg2                      San Jose, CA
  arg1 , *** a *** arg2 prof       John Smith, a Starbucks manager ***
  in/at arg1 in/at/, arg2          Officials in Perugia in Umbria province said ***
  arg2 prof arg1                   CNN reporter David McKinley ***
  arg1 *** native of *** arg2      Elizabeth Dole is a native of Salisbury, N.C.
</Paragraph>
<Paragraph position="5"> The learning algorithm used is a variation of the Winnow update rule incorporated in SNoW (Roth, 1998; Roth and Yih, 2002), a multi-class classifier that is specifically tailored for large-scale learning tasks. SNoW learns a sparse network of linear functions, in which the targets (entity classes or relation classes, in this case) are represented as linear functions over a common feature space. While SNoW can be used as a classifier that predicts using a winner-take-all mechanism over the activation values of the target classes, we can also rely directly on the raw activation value it outputs, which is the weighted linear sum of the active features, to estimate the posteriors. It can be verified that the resulting values are monotonic with the confidence in the prediction, and therefore provide a good source of probability estimates. We use softmax (Bishop, 1995) over the raw activation values as conditional probabilities. Specifically, suppose the number of classes is n, and the raw activation value of class i is act_i. The posterior estimate for class i is derived by the following equation.</Paragraph>
<Paragraph position="6"> p_i = \frac{e^{act_i}}{\sum_{j=1}^{n} e^{act_j}} </Paragraph>
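As a concrete illustration of this estimate, here is a minimal sketch (ours, not part of SNoW); subtracting the maximum activation is a standard numerical-stability detail not mentioned in the paper.

```python
# Sketch of the softmax posterior estimate over raw activation values.
import math
from typing import Dict

def softmax_posteriors(activations: Dict[str, float]) -> Dict[str, float]:
    """p_i = exp(act_i) / sum_j exp(act_j); the max is subtracted for numerical stability."""
    m = max(activations.values())
    exp_act = {label: math.exp(a - m) for label, a in activations.items()}
    z = sum(exp_act.values())
    return {label: v / z for label, v in exp_act.items()}

# Example with activations for the four entity classes used in the experiments
print(softmax_posteriors({"person": 2.1, "location": 0.3, "organization": -0.5, "other": -1.0}))
```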
<Paragraph position="7"> The second approach, Pipeline, mimics the typical strategy in solving complex natural language problems - separating a task into several stages and solving them sequentially. For example, a named entity recognizer may be trained on a different corpus in advance, and given to a relation classifier as a tool to extract features. This approach first trains an entity classifier as described in the basic approach, and then uses the predicted entity labels, in addition to other local features, to learn the relation identifier. Note that although the true entity labels are known here when training the relation identifier, this may not be the case in general NLP problems. Since only the predicted entity labels are available in testing, learning on the predictions of the entity classifier presumably makes the relation classifier more tolerant of the entity classifier's mistakes. In fact, we observe this phenomenon empirically: when the relation classifier is trained using the true entity labels, the performance is much worse than when it is trained using the predicted entity labels.</Paragraph>
<Paragraph position="8"> The third approach, LP, is our global inference procedure. It takes as input the constraints between a relation and its entity arguments, and the output (the estimated probability distributions over labels) of the basic classifiers. Note that LP may change the predictions for either entity labels or relation labels, whereas Pipeline fully trusts the labels of the entity classifier, so only its relation predictions may differ from those of the basic relation classifier. In other words, LP is able to enhance the performance of entity classification, which is impossible for Pipeline.</Paragraph>
<Paragraph position="9"> The final approach, Omniscience, tests the conceptual upper bound of this entity/relation classification problem.</Paragraph>
<Paragraph position="10"> It also trains the two classifiers separately, as in the basic approach. However, it assumes that the entity classifier knows the correct relation labels, and similarly the relation classifier knows the right entity labels as well. This additional information is then used as features in training and testing. Note that this assumption is totally unrealistic. Nevertheless, it may give us a hint of how much global inference can achieve.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.1 Results </SectionTitle>
<Paragraph position="0"> Tables 2 and 3 show the performance of each approach in Fβ=1 using 5-fold cross-validation. The results show that LP performs consistently better than Basic and Pipeline, on both entities and relations. Note that LP does not apply learning at all, yet it still outperforms Pipeline, which uses entity predictions as new features in learning. The results of the omniscient classifiers reveal that there is still room for improvement. One option is to apply learning to tune a better cost function in the LP approach.</Paragraph>
<Paragraph position="1"> One of the more significant results in our experiments, we believe, is the improvement in the quality of the decisions. As mentioned in Sec. 1, incorporating constraints helps to avoid inconsistency in classification. It is interesting to investigate how often such mistakes happen without global inference, and to see how effectively global inference reduces them.</Paragraph>
<Paragraph position="2"> For this purpose, we define the quality of the decisions as follows. For an active relation whose label is classified correctly, if both its argument entities are also predicted correctly, we count it as a coherent prediction. Quality is then the number of coherent predictions divided by the sum of coherent and incoherent predictions.</Paragraph>
<Paragraph position="3"> Since the Basic and Pipeline approaches do not have a global view of the labels of entities and relations, 5% to 25% of their predictions are incoherent; therefore, the quality is not always good. On the other hand, our global inference procedure, LP, takes the natural constraints into account, so it never generates incoherent predictions.
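To make this inference step concrete, the following is a minimal sketch of such an integer linear program written with the open-source PuLP library. It is our illustration rather than the authors' implementation: the -log-probability costs, the solver choice, and the variable encoding are assumptions; the argument-type constraints are the ones listed in the relation/entity table above.

```python
# Sketch of the global inference step as an integer linear program (ILP),
# built with the open-source PuLP library; an illustration, not the authors' code.
import math
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

ENT_LABELS = ["person", "location", "organization", "other"]
REL_LABELS = ["located_in", "work_for", "orgBased_in", "live_in", "kill", "none"]
# Argument-type constraints from the relation/entity table in Section 4.
ALLOWED = {"located_in":  ("location", "location"),
           "work_for":    ("person", "organization"),
           "orgBased_in": ("organization", "location"),
           "live_in":     ("person", "location"),
           "kill":        ("person", "person")}  # "none" is unconstrained

def global_inference(ent_probs, rel_probs):
    """ent_probs: {entity: {label: prob}}; rel_probs: {(e1, e2): {label: prob}}."""
    prob = LpProblem("entity_relation_inference", LpMinimize)
    x = {(e, l): LpVariable(f"x_{e}_{l}", cat=LpBinary) for e in ent_probs for l in ENT_LABELS}
    z = {(p, r): LpVariable(f"z_{p[0]}_{p[1]}_{r}", cat=LpBinary) for p in rel_probs for r in REL_LABELS}
    cost = lambda pr: -math.log(max(pr, 1e-6))  # assignment cost = -log probability
    prob += (lpSum(cost(ent_probs[e][l]) * x[e, l] for e, l in x) +
             lpSum(cost(rel_probs[p][r]) * z[p, r] for p, r in z))
    for e in ent_probs:                       # exactly one label per entity
        prob += lpSum(x[e, l] for l in ENT_LABELS) == 1
    for p in rel_probs:                       # exactly one label per entity pair
        prob += lpSum(z[p, r] for r in REL_LABELS) == 1
    for (e1, e2), r in z:                     # a relation forces compatible argument types
        if r in ALLOWED:
            t1, t2 = ALLOWED[r]
            prob += z[(e1, e2), r] <= x[e1, t1]
            prob += z[(e1, e2), r] <= x[e2, t2]
    # The forced decision test described later simply adds one more equality, e.g.:
    # prob += lpSum(z[p, "kill"] for p in rel_probs) == 1
    prob.solve()
    ents = {e: next(l for l in ENT_LABELS if x[e, l].value() > 0.5) for e in ent_probs}
    rels = {p: next(r for r in REL_LABELS if z[p, r].value() > 0.5) for p in rel_probs}
    return ents, rels
```

Because every feasible solution must satisfy the type constraints, the assignment returned by the solver cannot, for example, label a pair as kill while labeling one of its arguments as a location.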
If the relation classifier has the correct entity labels as features, a good learner should learn the constraints as well. As a result, the quality of Omniscience is almost as good as that of LP.</Paragraph>
<Paragraph position="4"> Another experiment we performed is the forced decision test, which boosts the F1 of the "kill" relation to 86.2%. Here we consider only sentences in which the "kill" relation is active. We force the system to determine which of the possible relations in a sentence (i.e., which pair of entities) carries this relation by adding a new linear equality. This is a realistic situation (e.g., in the context of question answering): it adds an external constraint that was not present at the time of learning the classifiers, and it evaluates the ability of our inference algorithm to cope with it. The results show that our expectations are correct.</Paragraph>
<Paragraph position="5"> In fact, we believe that in natural situations the number of constraints that can apply is even larger. Observing how the algorithm performs on other specific forced decision tasks verifies that LP is reliable in these situations. As shown in the experiment, it even performs better than Omniscience, which is given more information at learning time but cannot adapt to the situation at decision time.</Paragraph>
</Section>
</Section>
</Paper>