<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2069">
  <Title>Examining the Content Load of Part of Speech Blocks for Information Retrieval</Title>
  <Section position="5" start_page="531" end_page="533" type="metho">
    <SectionTitle>
3 Methodology
</SectionTitle>
    <Paragraph position="0"> We present the steps realised in order to assess our hypotheses in the context of IR. Firstly, POS blocks with their respective frequencies are extracted from a corpus. The probability of occurrence of each POS block is statistically estimated. In order to test our first hypothesis, we remove from the query all but POS blocks of high probability of occurrence, on the assumption that the latter are content-rich. In order to test our second hypothesis, POS blocks that contain more closed class than open class tags are removed from the queries, on the assumption that these blocks are contentpoor. null</Paragraph>
    <Section position="1" start_page="531" end_page="532" type="sub_section">
      <SectionTitle>
3.1 Inducing POS blocks from a corpus
</SectionTitle>
      <Paragraph position="0"> We extract POS blocks from a corpus and estimate their probability of occurrence, as follows.</Paragraph>
      <Paragraph position="1">  The corpus is POS tagged. All lexical word forms are eliminated. Thus, sentences are constituted solely by sequences of POS tags. The following example illustrates this point.</Paragraph>
      <Paragraph position="2"> [Original sentence] Many of the proposals for directives and action programmes planned by the Commission have for some obscure reason never seen the light of day.</Paragraph>
      <Paragraph position="4"> For each sentence in the corpus, all possible POS blocks are extracted. Thus, for a given sentence ABCDEFGH, where POS tags are denoted by single letters, and where POS block length a0 = 4, the POS blocks extracted are ABCD, BCDE, CDEF, and so on. The extracted POS blocks overlap. The order in which the POS blocks occur in the sentence is disregarded.</Paragraph>
      <Paragraph position="5"> We statistically infer the probability of occurrence of each POS block, on the basis of the individual POS block frequencies counted in the corpus. Maximum Likelihood inference is eschewed, as it assigns the maximum possible likelihood to the POS blocks observed in the corpus, and no probability to unseen POS blocks. Instead, we employ statistical estimation that accounts for unseen POS blocks, namely Laplace and Good-Turing (Manning and Schutze, 1999).</Paragraph>
    </Section>
    <Section position="2" start_page="532" end_page="533" type="sub_section">
      <SectionTitle>
3.2 Removing POS blocks from the queries
</SectionTitle>
      <Paragraph position="0"> In order to test our first hypothesis, POS blocks of low probability of occurrence are removed from the queries. Specifically, we POS tag the queries, and remove the POS blocks that have a probability of occurrence below an empirical threshold a0 . The following example illustrates this point.</Paragraph>
      <Paragraph position="1"> [Original query] A relevant document will focus on the causes of the lack of integration in a significant way; that is, the mere mention of immigration difficulties is not relevant. Documents that discuss immigration problems unrelated to Germany are also not relevant.</Paragraph>
      <Paragraph position="2"> [Tags-only query] DT JJ NN MD VV IN  DT NNS IN DT NN IN NN IN DT JJ NN; WDT VBZ DT JJ NN IN NN NNS VBZ RB JJ. NNS WDT VVP NN NNS JJ TO NP VBP RB RB JJ [Query with high-probability POS blocks] DT NNS IN DT NN IN NN IN NN IN NN NNS  [Resulting query] the causes of the lack of integration in mention of immigration difficulties Some of the low-probability POS blocks, which are removed from the query in the above example, are DT JJ NN MD, JJ NN MD VV, NN MD VV IN, and so on. The resulting query contains fragments of the original query, assumed to be content-rich. In the context of the bag-of-words approach to IR investigated here, the grammatical well-formedness of the query is thus not an issue to be considered.</Paragraph>
      <Paragraph position="3"> In order to test the second hypothesis, we remove from the queries POS blocks that contain less open class than closed class components. We propose a simple heuristic Content Load algorithm, to 'count' the presence of content within a POS block, on the premise that open class tags bear more content than closed class tags. The order of tags within a POS block is ignored. Figure 1 displays our Content Load algorithm.</Paragraph>
      <Paragraph position="4"> After the a0a1a0a3a2 POS block component has been 'counted', if the Content Load is zero or more, we consider the POS block content-rich. If the  for pos a4 from 1 to POSblock-size do</Paragraph>
      <Paragraph position="6"> Content Load is strictly less than zero, we consider the POS block content-poor. We assume an underlying equivalence of content in all open class parts of speech, which albeit being linguistically counter-intuitive, is shown to be effective when applied to IR (Section 4). The following example illustrates this point. In this example, POS block length a0 = 4.</Paragraph>
      <Paragraph position="7"> [Original query] A relevant document will focus on the causes of the lack of integration in a significant way; that is, the mere mention of immigration difficulties is not relevant. Documents that discuss immigration problems unrelated to Germany are also not relevant.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>