<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2915">
  <Title>Which Side are You on? Identifying Perspectives at the Document and Sentence Levels</Title>
  <Section position="5" start_page="109" end_page="110" type="metho">
    <SectionTitle>
3 Corpus
</SectionTitle>
    <Paragraph position="0"> Our corpus consists of articles published on the bitterlemons website. The website is set up to "contribute to mutual understanding [between Palestinians and Israelis] through the open exchange of ideas." Every week an issue concerning the Israeli-Palestinian conflict is selected for discussion (e.g., "Disengagement: unilateral or coordinated?"), and a Palestinian editor and an Israeli editor each contribute one article addressing the issue. In addition, the Israeli and Palestinian editors invite one Israeli and one Palestinian guest to express their views on the issue (sometimes in the form of an interview), resulting in a total of four articles in a weekly edition. We chose the bitterlemons website for two reasons. First, each article is already labeled as either Palestinian or Israeli by the editors, allowing us to exploit existing annotations. Second, the bitterlemons corpus enables us to test the generalizability of the proposed models in a very realistic setting: training on articles written by a small number of writers (two editors) and testing on articles from a much larger group of writers (more than 200 different guests).</Paragraph>
    <Paragraph position="1"> We collected a total of 594 articles published on the website from late 2001 to early 2005. The distribution of documents and sentences across the two perspectives is listed in the corpus statistics table. We removed metadata from all articles, including edition numbers, publication dates, topics, titles, author names, and biographic information. We used OpenNLP Tools to automatically detect sentence boundaries, and reduced word variants using the Porter stemming algorithm.</Paragraph>
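The preprocessing pipeline above (sentence-boundary detection followed by stemming) can be sketched as follows. This is a minimal standard-library sketch: the regex splitter stands in for OpenNLP Tools, the crude suffix stripper stands in for the full Porter stemmer, and all function names are ours.

```python
import re

def split_sentences(text):
    # naive sentence-boundary detection (stand-in for OpenNLP Tools)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def stem(word):
    # crude suffix stripping as a stand-in for the Porter stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(article):
    # lowercase, tokenize, and stem every sentence of an article
    return [[stem(w) for w in re.findall(r"[a-z]+", s.lower())]
            for s in split_sentences(article)]
```

In the actual pipeline each stemmed sentence becomes one observation for the sentence-level model described in Section 4.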
    <Paragraph position="2"> We evaluated the subjectivity of each sentence using the automatic subjective-sentence classifier of (Riloff and Wiebe, 2003), and found that 65.6% of Palestinian sentences and 66.2% of Israeli sentences are classified as subjective. The high but almost identical percentages of subjective sentences in the two perspectives support our observation in Section 2 that a perspective is largely expressed in subjective language, but that the amount of subjectivity in a document is not necessarily indicative of its perspective.</Paragraph>
  </Section>
  <Section position="6" start_page="110" end_page="112" type="metho">
    <SectionTitle>
4 Statistical Modeling of Perspectives
</SectionTitle>
    <Paragraph position="0"> We develop algorithms for learning perspectives within a statistical framework. Denote a training corpus as a set of documents W_n and their perspective labels D_n, n = 1, ..., N, where N is the total number of documents in the corpus. Given a new document ~W with an unknown perspective, the perspective ~D is predicted from the conditional probability P(~D | ~W, {D_n, W_n}_{n=1}^{N}).</Paragraph>
    <Paragraph position="2"> We are also interested in how strongly each sentence in a document conveys perspective information. Denote the intensity of the m-th sentence of the n-th document as a binary random variable S_{m,n}.</Paragraph>
    <Paragraph position="3"> To evaluate S_{m,n}, i.e., how strongly a sentence reflects a particular perspective, we calculate the conditional probability P(S_{m,n} | {D_n, W_n}_{n=1}^{N}).</Paragraph>
    <Paragraph position="5"/>
    <Section position="1" start_page="110" end_page="111" type="sub_section">
      <SectionTitle>
4.1 Naïve Bayes Model
</SectionTitle>
      <Paragraph position="0"> We model the process of generating documents from a particular perspective as follows:</Paragraph>
      <Paragraph position="2"> First, the parameters π and θ are sampled once from prior distributions for the whole corpus. Beta and Dirichlet priors are chosen because they are conjugate to the binomial and multinomial distributions, respectively. We set the hyperparameters a_π, b_π, and a_θ to one, resulting in non-informative priors. A document perspective D_n is then sampled from a binomial distribution with parameter π. The value of D_n is either d_0 (Israeli) or d_1 (Palestinian). Words in the document are then sampled from a multinomial distribution with parameter θ, where L_n is the length of the document. A graphical representation of the model is shown in Figure 1.</Paragraph>
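The generative story above can be made concrete with a toy sampler. This is a minimal standard-library sketch; the vocabulary and the values of π and θ are invented for illustration (in the model they would themselves be drawn from the Beta and Dirichlet priors):

```python
import random

VOCAB = ["peace", "security", "settlement", "negotiation"]  # toy vocabulary

def sample_document(pi, theta, length, rng):
    """Draw a perspective D from a binomial with parameter pi, then draw
    `length` words i.i.d. from the class-conditional multinomial theta[D]."""
    d = 1 if rng.random() < pi else 0       # d_0 = Israeli, d_1 = Palestinian
    words = rng.choices(VOCAB, weights=theta[d], k=length)
    return d, words

rng = random.Random(42)
pi = 0.5                                    # P(D = d_1)
theta = {0: [0.4, 0.4, 0.1, 0.1],           # word distribution under d_0
         1: [0.1, 0.1, 0.4, 0.4]}           # word distribution under d_1
d, words = sample_document(pi, theta, 8, rng)
```

Repeating the call N times with shared π and θ generates a whole synthetic corpus, mirroring the plate over documents in Figure 1.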
      <Paragraph position="3"> The model described above is commonly known as a naïve Bayes (NB) model. NB models have been widely used for various classification tasks, including text categorization (Lewis, 1998). The NB model is also a building block for the model described later that incorporates sentence-level perspective information.</Paragraph>
      <Paragraph position="4"> To predict the perspective of an unseen document using naïve Bayes, we calculate the posterior distribution of ~D in (5) by integrating out the parameters: P(~D | ~W, {D_n, W_n}_{n=1}^{N}) = ∫∫ P(~D | ~W, π, θ) p(π, θ | {D_n, W_n}_{n=1}^{N}) dπ dθ.</Paragraph>
      <Paragraph position="6"> However, the above integral is difficult to compute.</Paragraph>
      <Paragraph position="7"> As an alternative, we use Markov Chain Monte Carlo (MCMC) methods to obtain samples from the posterior distribution. Details about MCMC methods can be found in Appendix A.</Paragraph>
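A common alternative to sampling, for the plain NB model, is the plug-in approximation: replace the integral over π and θ with their posterior means, which for these all-ones hyperparameters amounts to add-one (Laplace) smoothing. The following is a minimal sketch under that assumption, on invented toy documents; it is not the paper's MCMC procedure, and all function names are ours.

```python
import math
from collections import Counter

def train(docs, labels):
    """Collect the counts that define the posterior-mean parameters."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for doc, lab in zip(docs, labels):
        word_counts[lab].update(doc)
    return vocab, class_counts, word_counts

def log_posterior(doc, c, vocab, class_counts, word_counts):
    n = sum(class_counts.values())
    # Beta(1, 1) prior on pi -> posterior-mean class probability
    lp = math.log((class_counts[c] + 1) / (n + 2))
    total = sum(word_counts[c].values())
    for w in doc:
        # Dirichlet(1) prior on theta -> add-one smoothed word probability
        lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return lp

def classify(doc, model):
    # pick the perspective with the highest (approximate) posterior
    _, class_counts, _ = model
    return max(class_counts, key=lambda c: log_posterior(doc, c, *model))

model = train([["peace", "security"], ["security", "peace"],
               ["settlement", "rights"], ["rights", "settlement"]],
              ["d0", "d0", "d1", "d1"])
```

The MCMC approach used in the paper instead averages predictions over parameter samples, which also accounts for posterior uncertainty in π and θ.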
    </Section>
    <Section position="2" start_page="111" end_page="112" type="sub_section">
      <SectionTitle>
4.2 Latent Sentence Perspective Model
</SectionTitle>
      <Paragraph position="0"> We introduce a new binary random variable, S, to model how strongly a perspective is reflected at the sentence level. The value of S is either s_1 or s_0, where s_1 indicates that a sentence is written strongly from a perspective and s_0 indicates that it is not. The whole generative process is modeled as follows:</Paragraph>
      <Paragraph position="2"> The parameters π and θ have the same semantics as in the naïve Bayes model. S is naturally modeled as a binomial variable, where τ is the parameter of S.</Paragraph>
      <Paragraph position="3"> τ represents how likely it is that a sentence strongly conveys a perspective. We call this model the Latent Sentence Perspective Model (LSPM) because S is not directly observed. The graphical model representation of LSPM is shown in Figure 2.</Paragraph>
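The LSPM generative process can likewise be sketched with a toy sampler. As before, the vocabulary and parameter values are invented for illustration; in the model, π, τ, and θ would be drawn from their priors.

```python
import random

VOCAB = ["ceasefire", "borders", "justice", "security"]  # toy vocabulary

def sample_lspm_document(pi, tau, theta, n_sentences, sent_len, rng):
    """Draw the document perspective d, then for each sentence draw an
    intensity s from a binomial with parameter tau and emit the sentence's
    words from the multinomial theta[(d, s)]."""
    d = 1 if rng.random() < pi else 0
    sentences = []
    for _ in range(n_sentences):
        s = 1 if rng.random() < tau else 0    # s_1: strong perspective
        words = rng.choices(VOCAB, weights=theta[(d, s)], k=sent_len)
        sentences.append((s, words))
    return d, sentences

rng = random.Random(7)
theta = {(0, 1): [0.5, 0.2, 0.1, 0.2],       # d_0, strong perspective
         (1, 1): [0.1, 0.2, 0.5, 0.2],       # d_1, strong perspective
         (0, 0): [0.25, 0.25, 0.25, 0.25],   # d_0, little perspective
         (1, 0): [0.25, 0.25, 0.25, 0.25]}   # d_1, little perspective
d, sentences = sample_lspm_document(0.5, 0.6, theta, 3, 5, rng)
```

Note that word emission now depends on the pair (d, s), which is exactly the four-way parameterization whose identifiability is discussed below.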
      <Paragraph position="4"> To use LSPM to identify the perspective ~D of a new document ~W with unknown sentence perspectives ~S, we calculate the posterior probability by summing over all possible combinations of sentence perspectives in the document and integrating out the parameters.</Paragraph>
      <Paragraph position="5"> P(~D | ~W, {D_n, W_n}_{n=1}^{N}) = ∫∫∫ Σ_{~S} P(~D, ~S | ~W, π, τ, θ) p(π, τ, θ | {D_n, W_n}_{n=1}^{N}) dπ dτ dθ</Paragraph>
      <Paragraph position="7"> As before, we resort to MCMC methods to sample from the posterior distributions, given in Equations (5) and (6).</Paragraph>
      <Paragraph position="8"> As is often encountered in mixture models, there is an identifiability issue in LSPM. Because the values of S can be permuted without changing the likelihood function, the meanings of s_0 and s_1 are ambiguous. In Figure 3a, four θ values are used to represent the four possible combinations of document perspective d and sentence perspective intensity s. If we do not impose any constraints, s_1 and s_0 are exchangeable, and we can no longer strictly interpret s_1 as indicating a strong sentence-level perspective and s_0 as indicating that a sentence carries little or no perspective information. Another problem with this parameterization is that any improvement of LSPM over the naïve Bayes model is not necessarily due to the explicit modeling of sentence-level perspective: S may capture aspects of the document collection that we never intended to model. For example, s_0 may capture the editors' writing style and s_1 the guests' writing style in the bitterlemons corpus.</Paragraph>
      <Paragraph position="9"> We solve the identifiability problem by forcing θ_{d_1,s_0} and θ_{d_0,s_0} to be identical, reducing the number of θ parameters to three. As shown in Figure 3b, there are separate θ parameters conditioned on the document perspective (left branch of the tree, where d_0 is Israeli and d_1 is Palestinian), but there is a single θ parameter for S = s_0 shared by both document-level perspectives (right branch of the tree). We assume that sentences with little or no perspective information, i.e., S = s_0, are generated independently of the perspective of a document. In other words, sentences that present common background information or introduce an issue, and that do not strongly convey any perspective, should look similar whether they appear in Palestinian or Israeli documents. By imposing this constraint, we can be more confident that s_0 represents sentences carrying little perspective and s_1 represents sentences strongly conveying a perspective in both d_1 and d_0 documents.</Paragraph>
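The tying constraint can be expressed by sharing one parameter object across both (d, s_0) cells. A small illustration with invented values:

```python
# After tying, only three theta parameters remain: one per document
# perspective for strong-perspective (s_1) sentences, and one shared
# background distribution for s_0 sentences on both sides.
theta_background = [0.25, 0.25, 0.25, 0.25]   # shared theta_{., s0}
theta = {
    ("d0", "s1"): [0.55, 0.25, 0.05, 0.15],   # Israeli, strong perspective
    ("d1", "s1"): [0.05, 0.15, 0.55, 0.25],   # Palestinian, strong perspective
    ("d0", "s0"): theta_background,           # tied ...
    ("d1", "s0"): theta_background,           # ... to the same object
}
# The s_0 cells literally share one parameter, so any update to one
# during inference is an update to the other.
assert theta[("d0", "s0")] is theta[("d1", "s0")]
```

This sharing is what lets s_0 be interpreted as perspective-neutral background language rather than an arbitrary mixture component.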
    </Section>
  </Section>
</Paper>