<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1045">
<Title>Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns</Title>
<Section position="7" start_page="358" end_page="360" type="evalu">
<SectionTitle> 6 Experiments </SectionTitle>
<Paragraph position="0"> We used the Multi-Perspective Question Answering (MPQA) corpus for our experiments. This corpus contains 535 documents manually annotated with opinion-related information, including direct and indirect sources. We used 135 documents as a tuning set for model development and feature engineering, and used the remaining 400 documents for evaluation, performing 10-fold cross-validation. These texts are English-language versions of articles that come from many countries and cover many topics. We evaluate performance using three measures: overlap match (OL), head match (HM), and exact match (EM). OL is a lenient measure that considers an extraction to be correct if it overlaps with any of the annotated words. HM is a more conservative measure that considers an extraction to be correct if its head matches the head of the annotated source. We report these somewhat loose measures because the annotators vary in where they place the exact boundaries of a source. EM is the strictest measure, requiring an exact match between the extracted words and the annotated words. We use three evaluation metrics: recall, precision, and F-measure with recall and precision equally weighted.</Paragraph>
<Section position="1" start_page="358" end_page="359" type="sub_section">
<SectionTitle> 6.1 Baselines </SectionTitle>
<Paragraph position="0"> We developed three baseline systems to assess the difficulty of our task. Baseline-1 labels as sources all phrases that belong to the semantic categories authority, government, human, media, organization or company, and proper name.</Paragraph>
<Paragraph position="1"> Table 1 shows that the precision is poor, suggesting that the third condition described in Section 3.1 (opinion recognition) does play an important role in source identification. The recall is much higher but still limited, due to sources that fall outside of the semantic categories or are not recognized as belonging to these categories. Baseline-2 labels a noun phrase as a source if any of the following are true: (1) the NP is the subject of a verb phrase containing an opinion word, (2) the NP follows &quot;according to&quot;, (3) the NP contains a possessive and is preceded by an opinion word, or (4) the NP follows &quot;by&quot; and attaches to an opinion word. Baseline-2's heuristics are designed to address the first and the third conditions in Section 3.1. Table 1 shows that Baseline-2 is substantially better than Baseline-1. Baseline-3 labels a noun phrase as a source if it satisfies both Baseline-1's and Baseline-2's conditions (this should satisfy all three conditions described in Section 3.1). As shown in Table 1, the precision of this approach is the best of the three baselines, but the recall is the lowest.</Paragraph>
</Section>
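The sketch below is one possible reading of the Baseline-2 heuristics; it is illustrative only, not the authors' implementation. The NounPhrase class and its fields (is_subject, governing_vp_words, preceding_words, has_possessive, attached_head) are hypothetical stand-ins for the chunker, parser, and opinion-lexicon information the heuristics assume.

```python
# Illustrative sketch of the four Baseline-2 heuristics from Section 6.1.
# The NounPhrase interface is hypothetical: each field stands in for
# information that would come from a parser/chunker and an opinion lexicon.

class NounPhrase:
    """Minimal stand-in for a chunked NP with access to its syntactic context."""

    def __init__(self, text, is_subject=False, governing_vp_words=(),
                 preceding_words=(), has_possessive=False, attached_head=None):
        self.text = text
        self.is_subject = is_subject                        # NP is a clausal subject
        self.governing_vp_words = set(governing_vp_words)   # words in its governing VP
        self.preceding_words = list(preceding_words)        # tokens just before the NP
        self.has_possessive = has_possessive                # e.g., "their", "'s"
        self.attached_head = attached_head                  # head a "by"-PP attaches to


def baseline2_is_source(np, opinion_words):
    """Label an NP as an opinion source if any of the four heuristics fires."""
    opinion_words = set(opinion_words)
    # (1) NP is the subject of a verb phrase containing an opinion word.
    if np.is_subject and np.governing_vp_words & opinion_words:
        return True
    # (2) NP follows "according to".
    if np.preceding_words[-2:] == ["according", "to"]:
        return True
    # (3) NP contains a possessive and is preceded by an opinion word.
    if np.has_possessive and set(np.preceding_words) & opinion_words:
        return True
    # (4) NP follows "by" and attaches to an opinion word.
    if np.preceding_words[-1:] == ["by"] and np.attached_head in opinion_words:
        return True
    return False


# Hypothetical usage: the NP follows "according to", so rule (2) fires.
np = NounPhrase("the foreign ministry", preceding_words=["according", "to"])
print(baseline2_is_source(np, {"criticized", "accused"}))  # True
```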
<Section position="2" start_page="359" end_page="359" type="sub_section">
<SectionTitle> 6.2 Extraction Pattern Experiment </SectionTitle>
<Paragraph position="0"> We evaluated the performance of the learned extraction patterns on the source identification task. The learned patterns were applied to the test data, and the extracted sources were scored against the manual annotations. Table 1 shows that the extraction patterns produced lower recall than the baselines, but with considerably higher precision. These results show that the extraction patterns alone can identify nearly half of the opinion sources with good accuracy. (These results were obtained using the patterns that had a probability > .50 and a frequency > 1.)</Paragraph>
</Section>
<Section position="3" start_page="359" end_page="359" type="sub_section">
<SectionTitle> 6.3 CRF Experiments </SectionTitle>
<Paragraph position="0"> We developed our CRF model using the MALLET code from McCallum (2002). For training, we used a Gaussian prior of 0.25, selected based on the tuning data. We evaluate the CRF using the basic features from Section 3, both with and without the IE pattern features from Section 5. Table 1 shows that the CRF with basic features outperforms all of the baselines as well as the extraction patterns, achieving an F-measure of 66.3 using the OL measure, 65.0 using the HM measure, and 59.2 using the EM measure. Adding the IE pattern features further increases performance, boosting recall by about 3 points for all of the measures and slightly increasing precision as well.</Paragraph>
<Paragraph position="1"> CRF with feature induction. One limitation of log-linear models such as CRFs is that they cannot form a decision boundary from conjunctions of existing features unless the conjunctions are explicitly given as part of the feature vector. For the task of identifying opinion sources, we observed that the model could benefit from conjunctive features. For instance, instead of using two separate features, HUMAN and PARENT-CHUNK-INCLUDES-OPINION-EXPRESSION, the conjunction of the two is more informative.</Paragraph>
<Paragraph position="2"> For this reason, we applied the CRF feature induction approach introduced by McCallum (2003).</Paragraph>
<Paragraph position="3"> As shown in Table 1, where CRF-FI stands for the CRF model with feature induction, we see consistent improvements from automatically generating conjunctive features. The final system, which combines the basic features, the IE pattern features, and feature induction, achieves an F-measure of 69.4 (recall=60.6%, precision=81.2%) for the OL measure, an F-measure of 68.0 (recall=59.5%, precision=79.3%) for the HM measure, and an F-measure of 62.0 (recall=54.1%, precision=72.7%) for the EM measure.</Paragraph>
</Section>
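To make the motivation for conjunctive features concrete, the sketch below enumerates pairwise conjunctions of a token's binary features, so that a log-linear model can weight a conjoined indicator such as HUMAN&PARENT-CHUNK-INCLUDES-OPINION-EXPRESSION directly. This is only an illustration: McCallum's (2003) feature induction adds conjunctions selectively, based on likelihood gain, rather than enumerating all pairs as done here.

```python
# Illustrative sketch only: build pairwise conjunctions of a token's binary
# features so a log-linear model can weight their co-occurrence directly.
# (McCallum 2003 induces such conjunctions selectively, by likelihood gain,
# rather than enumerating them all as done here.)

from itertools import combinations


def with_pairwise_conjunctions(features):
    """Return the atomic features plus all pairwise conjunctions."""
    atomic = sorted(features)
    conjoined = set(atomic)
    for a, b in combinations(atomic, 2):
        conjoined.add(f"{a}&{b}")
    return conjoined


# Example with the two token-level features discussed in the text.
token_features = {"HUMAN", "PARENT-CHUNK-INCLUDES-OPINION-EXPRESSION"}
for f in sorted(with_pairwise_conjunctions(token_features)):
    print(f)
# HUMAN
# HUMAN&PARENT-CHUNK-INCLUDES-OPINION-EXPRESSION
# PARENT-CHUNK-INCLUDES-OPINION-EXPRESSION
```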
<Section position="4" start_page="359" end_page="360" type="sub_section">
<SectionTitle> 6.4 Error Analysis </SectionTitle>
<Paragraph position="0"> An analysis of the errors revealed several common mistakes. Some errors resulted from error propagation in our subsystems: errors from the sentence boundary detector in GATE (Cunningham et al., 2002) were especially problematic because they caused the Collins parser to fail, leaving no dependency tree information.</Paragraph>
<Paragraph position="1"> Some errors were due to complex and unusual sentence structure, which our rather simple feature encoding for the CRF could not capture well. Other errors were due to the limited coverage of the opinion lexicon: we failed to recognize some cases in which idiomatic or vague expressions were used to express opinions.</Paragraph>
<Paragraph position="2"> Below are some examples of errors that we found interesting. Doubly underlined phrases indicate incorrectly extracted sources (either false positives or false negatives); opinion words are singly underlined.</Paragraph>
<Paragraph position="3"> False positives: (1) Actually, these three countries do have one common denominator, i.e., that their values and policies do not agree with those of the United States and none of them are on good terms with the United States.</Paragraph>
<Paragraph position="4"> (2) Perhaps this is why Fidel Castro has not spoken out against what might go on in Guantanamo.</Paragraph>
<Paragraph position="5"> In (1), &quot;their values and policies&quot; seems like a reasonable phrase to extract, but the annotation does not mark it as a source, perhaps because it is somewhat abstract. In (2), &quot;spoken out&quot; is negated, which means that the verb phrase does not bear an opinion, but our system failed to recognize the negation.</Paragraph>
<Paragraph position="6"> False negatives: (3) And for this reason, too, they have a moral duty to speak out, as Swedish Foreign Minister Anna Lindh, among others, did yesterday.</Paragraph>
<Paragraph position="7"> (4) In particular, Iran and Iraq are at loggerheads with each other to this day.</Paragraph>
<Paragraph position="8"> Example (3) involves a complex sentence structure that our system could not handle. Example (4) involves an uncommon opinion expression that our system did not recognize.</Paragraph>
</Section>
</Section>
</Paper>