<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2063">
  <Title>Automatic Identification of Pro and Con Reasons in Online Reviews</Title>
  <Section position="7" start_page="486" end_page="488" type="evalu">
    <SectionTitle>
5 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> We describe two goals in our experiments in this section. The first is to investigate how well our pro and con detection model with different feature combinations performs on the data we collected from epinions.com. The second is to see how well the trained model performs on new data from a different source, complaint.com.</Paragraph>
    <Paragraph position="1"> For both datasets, we carried out two separate sets of experiments, for the domains of mp3 players and restaurant reviews. We divided data into 80% for training, 10% for development, and 10% for test for our experiments.</Paragraph>
    <Section position="1" start_page="486" end_page="487" type="sub_section">
      <SectionTitle>
5.1 Experiments on Dataset 1
</SectionTitle>
      <Paragraph position="0"> Identification step: Table 3 and 4 show pros and cons sentences identification results of our system for mp3 player and restaurant reviews respectively. The first column indicates which combination of features was used for our model (see Table 2 for the meaning of Op, Lex, and Pos feature categories). We measure the performance with accuracy (Acc), precision (Prec), recall (Recl), and F-score  .</Paragraph>
      <Paragraph position="1"> The baseline system assigned all sentences as reason and achieved 57.75% and 54.82% of accuracy. The system performed well when it only used lexical features in mp3 player reviews (76.27% of accuracy in Lex), whereas it performed well with the combination of lexical and opinion features in restaurant reviews (Lex+Op row in Table 4).</Paragraph>
      <Paragraph position="2"> It was very interesting to see that the system achieved a very low score when it only used opinion word features. We can interpret this phenomenon as supporting our hypothesis that pro and con sentences in reviews are often purely  At the time (December 2005), there were total 42593 complaint reviews available in the database.  factual. However, opinion features improved both precision and recall when combined with lexical features in restaurant reviews. It was also interesting that experiments on mp3 players reviews achieved mostly higher scores than restaurants. Like the observation we described in Sub-section 4.1, frequently mentioned keywords of product features (e.g. durability) may have helped performance, especially with lexical features. Another interesting observation is that the positional features that helped in topic sentence identification did not help much for our task. Classification step: Tables 5 and 6 show the system results of the pro and con classification task. The baseline system marked all sentences as pros and achieved 53.87% and 50.71% accuracy for each domain. All features performed better than the baseline but the results are not as good as in the identification task. Unlike the identification task, opinion words by themselves achieved the best accuracy in both mp3 player and restaurant domains. We think opinion words played more important roles in classifying pros and cons than identifying them. Position features helped recognizing con sentences in mp3 player reviews.</Paragraph>
    </Section>
    <Section position="2" start_page="487" end_page="488" type="sub_section">
      <SectionTitle>
5.2 Experiments on Dataset 2
</SectionTitle>
      <Paragraph position="0"> This subsection reports the evaluation results of our system on Dataset 2. Since Dataset 2 from complaints.com has no training data, we trained a system on Dataset 1 and applied it to Dataset 2.</Paragraph>
      <Paragraph position="1">  A tough question, however, is how to evaluate the system results. Since it seemed impossible to evaluate the system without involving a human judge, we annotated a small set of data manually for evaluation purposes.</Paragraph>
      <Paragraph position="2"> Gold Standard Annotation: Four humans annotated 3 sets of test sets: Testset 1 with 5 complaints (73 sentences), Testset 2 with 7 complaints (105 sentences), and Testset 3 with 6 complaints (85 sentences). Testset 1 and 2 are from mp3 player complaints and Testset 3 is from restaurant reviews. Annotators marked sentences if they describe specific reasons of the complaint. Each test set was annotated by 2 humans. The average pair-wise human agreement was 82.1%  .</Paragraph>
      <Paragraph position="3"> System Performance: Like the human annotators, our system also labeled reason sentences. Since our goal is to identify reason sentences in complaints, we applied a system modeled as in the identification phase described in Subsection</Paragraph>
    </Section>
    <Section position="3" start_page="488" end_page="488" type="sub_section">
      <SectionTitle>
3.2 instead of the classification phase
</SectionTitle>
      <Paragraph position="0"> . Table 7 reports the accuracy, precision, and recall of the system on each test set. We calculated numbers in each A and B column by assuming each annotator's answers separately as a gold standard. In Table 7, accuracies indicate the agreement between the system and human annotators. The average accuracy 68.0% is comparable with the pair-wise human agreement 82.1% even if there is still a lot of room for improvement  . It was interesting to see that Testset 3, which was from restaurant complaints, achieved higher accuracy and recall than the other test sets from mp3 player complaints, suggesting that it would be interesting to further investigate the performance  The kappa value was 0.63.</Paragraph>
      <Paragraph position="1">  In complaints reviews, we believe that it is more important to identify reason sentences than to classify because most reasons in complaints are likely to be cons.</Paragraph>
      <Paragraph position="2">  The baseline system which assigned the majority class to each sentence achieved 59.9% of average accuracy.</Paragraph>
      <Paragraph position="3"> of reason identification in various other review domains such as travel and beauty products in future work. Also, even though we were somewhat able to measure reason sentence identification in complaint reviews, we agree that we need more data annotation for more precise evaluation. null Finally, the followings are examples of sentences that our system identified as reasons of complaints.</Paragraph>
      <Paragraph position="5"> your establishment because of the unprofessional, rude, obnoxious, and unsanitary treatment from the employees.</Paragraph>
      <Paragraph position="6"> (2) They never get my order right the first time and what really disgusts me is how they handle the food.</Paragraph>
      <Paragraph position="7">  (3) The kids play area at Braum's in The Colony, Texas is very dirty.</Paragraph>
      <Paragraph position="8"> (4) The only complaint that I have is that the French fries are usually cold.</Paragraph>
      <Paragraph position="9"> (5) The cashier there had short  changed me on the payment of my bill.</Paragraph>
      <Paragraph position="10"> As we can see from the examples, our system was able to detect con sentences which contained opinion-bearing expressions such as in (1), (2), and (3) as well as reason sentences that mostly described mere facts as in (4) and (5).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>