<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1161"> <Title>Lexical Query Paraphrasing for Document Retrieval</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Resources </SectionTitle>
<Paragraph position="0"> Our system uses syntactic, semantic and statistical information for paraphrase generation. Syntactic information for each query was obtained from Brill's part-of-speech (PoS) tagger (Brill, 1992). Semantic information, consisting of different types of synonyms for the words in each query, was obtained from WordNet (Miller et al., 1990).</Paragraph>
<Paragraph position="1"> The corpus used for information retrieval and for the collection of statistical information was the LA Times portion of the NIST Text Research Collection (http://trec.nist.gov). This corpus was small enough to satisfy our disk space limitations, and sufficiently large to yield statistically significant results (131,896 documents). Full-text indexing was performed for the documents in the LA Times collection, using lemmas (rather than words). The lemmas for the words in the LA Times collection were also obtained from WordNet (Miller et al., 1990).</Paragraph>
<Paragraph position="2"> The statistical information was used to assign a score to the paraphrases generated for a query (Section 4.4). This information was stored in a lemma dictionary (202,485 lemmas) and a lemma-pair dictionary (37,341,156 lemma-pairs). The lemma dictionary associates with each lemma the number of times it appears in the corpus. The lemma-pair dictionary associates with each ordered lemma-pair $l_1$-$l_2$ the number of times $l_1$ appears before $l_2$ in a five-word window in the corpus (not counting stop words and closed-class words). The dictionary maintains a separate entry for the reverse lemma-pair $l_2$-$l_1$. Lemma-pairs which appear only once constitute 64% of the pairs, and were omitted from our dictionary owing to disk space limitations.</Paragraph> </Section>
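To make the dictionary construction described above concrete, here is a minimal Python sketch of counting lemma and ordered lemma-pair frequencies over a lemmatized corpus. The stop-word list, the pruning of singleton pairs, and the measurement of the five-word window over content lemmas are assumptions based on the description above, not the authors' code.

```python
from collections import Counter

# Hypothetical stop-word list; the paper's actual list is not given in this excerpt.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "or", "to", "be", "is", "who", "how", "what"}
WINDOW = 5  # five-word co-occurrence window (Section 3)


def build_dictionaries(lemmatized_docs):
    """Count lemma frequencies and ordered lemma-pair frequencies over a corpus.

    lemmatized_docs: iterable of documents, each a list of lemmas.
    Stop words (and, in the paper, closed-class words) are skipped, so the
    five-word window is measured over content lemmas only.
    """
    lemma_freq = Counter()
    pair_freq = Counter()  # keyed by ordered pair (l1, l2): l1 occurs before l2
    for doc in lemmatized_docs:
        content = [lemma for lemma in doc if lemma not in STOP_WORDS]
        lemma_freq.update(content)
        for i, l1 in enumerate(content):
            for l2 in content[i + 1 : i + WINDOW]:
                pair_freq[(l1, l2)] += 1
    # The paper drops pairs seen only once (64% of all pairs) to save disk space.
    pair_freq = Counter({pair: n for pair, n in pair_freq.items() if n > 1})
    return lemma_freq, pair_freq
```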
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Paraphrasing and Retrieval Procedure </SectionTitle>
<Paragraph position="0"> The procedure for paraphrasing a query consists of the following steps:
1. Tokenize, tag and lemmatize the query.
2. Generate synonyms for each content lemma in the query (stop words are ignored).</Paragraph>
<Paragraph position="1"> 3. Propose paraphrases for the query using different synonym combinations, compute a score for each paraphrase, and rank the paraphrases according to their score. The lemmatized query plus the 19 top paraphrases are retained.
Documents are then retrieved for the query and its paraphrases.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Tagging and lemmatizing the queries </SectionTitle>
<Paragraph position="0"> We used the part-of-speech (PoS) of a word to constrain the number of synonyms generated for it. Brill's tagger correctly tagged 84% of the queries. In order to determine the effect of tagging errors on retrieval performance, we manually corrected the erroneous tags and ran our system with both automatically-obtained and manually-corrected tags (Section 6). After tagging, each query was lemmatized (using WordNet). This was done since the index used for document retrieval is lemma-based.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Proposing synonyms for each word </SectionTitle>
<Paragraph position="0"> The following types of WordNet synonyms were generated for each content lemma in a query: synonyms, attributes, pertainyms and seealsos (Miller et al., 1990).1 For example, according to WordNet, a synonym for &quot;high&quot; is &quot;steep&quot;, an attribute is &quot;height&quot;, and a seealso is &quot;tall&quot;; a pertainym for &quot;chinese&quot; is &quot;China&quot;. In order to curb the combinatorial explosion, we do not allow multiple-word synonyms for a lemma, and do not generate synonyms for proper nouns or stop words. (Footnote 1: In preliminary experiments we also generated hypernyms and hyponyms. However, this increased the number of alternative paraphrases exponentially, without improving the quality of the results in most cases.)</Paragraph> </Section>
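The following sketch gathers the synonym types listed in Section 4.2 using NLTK's WordNet interface. It is only an approximation of the paper's procedure: the paper used an earlier WordNet release, and the Penn-tag-to-WordNet mapping and proper-noun filter shown here are assumptions.

```python
from nltk.corpus import wordnet as wn

# Assumed mapping from Penn Treebank tag prefixes to WordNet parts of speech.
POS_MAP = {"NN": wn.NOUN, "VB": wn.VERB, "JJ": wn.ADJ, "RB": wn.ADV}


def synonym_candidates(lemma, penn_tag):
    """Collect WordNet synonyms, attributes, pertainyms and see-also terms
    for a content lemma, constrained by its part of speech (Section 4.2)."""
    if penn_tag.startswith("NNP"):   # no synonyms for proper nouns
        return set()
    pos = POS_MAP.get(penn_tag[:2])
    if pos is None:                  # stop words / closed-class tags
        return set()
    candidates = set()
    for synset in wn.synsets(lemma, pos=pos):
        for wn_lemma in synset.lemmas():
            candidates.add(wn_lemma.name())                             # synonyms
            candidates.update(p.name() for p in wn_lemma.pertainyms())  # pertainyms
        candidates.update(l.name() for s in synset.attributes() for l in s.lemmas())  # attributes
        candidates.update(l.name() for s in synset.also_sees() for l in s.lemmas())   # see-alsos
    # Single-word candidates only, and never the original lemma itself.
    return {c for c in candidates if "_" not in c and c.lower() != lemma.lower()}
```

For example, synonym_candidates("high", "JJ") should surface candidates such as "steep", "tall" and "height", mirroring the example above, although the exact set depends on the WordNet version.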
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Paraphrasing queries </SectionTitle>
<Paragraph position="0"> Query paraphrases are generated by an iterative process which considers each content lemma in a query in turn, and proposes a synonym for it from those collected from WordNet (Section 4.2). Queries which do not have sufficient context are not paraphrased. These are queries where all the words except one are stop words or closed-class words.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Computing paraphrase scores </SectionTitle>
<Paragraph position="0"> The score of a paraphrase is based on how common the lemma combinations in it are. Ideally, this score would be represented by $\Pr(l_1, \ldots, l_n)$, where $n$ is the number of lemmas in the paraphrase. However, in the absence of sufficient information to compute this joint probability, approximations based on conditional probabilities are often used, e.g., $\Pr(l_1, \ldots, l_n) \simeq \prod_{i=2}^{n} \Pr(l_i \mid l_{i-1})$.</Paragraph>
<Paragraph position="1"> Unfortunately, this approximation yielded poor paraphrases in preliminary trials. We postulate that this is due to two reasons: (1) it takes into account the interaction between a lemma $l_i$ and only one other lemma (without considering the rest of the lemmas in the query), and (2) relatively infrequent lemma combinations involving one frequent lemma are penalized (which is correct for conditional probabilities). For instance, if $l_2$ appears 10 times in the corpus and $l_1$-$l_2$ appears 4 times, then $\Pr(l_1 \mid l_2) = 0.4\alpha$ (where $\alpha$ is a normalizing constant). In contrast, if $l'_2$ appears 200 times in the corpus and $l'_1$-$l'_2$ appears 30 times, then $\Pr(l'_1 \mid l'_2) = 0.15\alpha$. However, $l'_1$-$l'_2$ is a more frequent lemma combination, and should contribute a higher score to the paraphrase.</Paragraph>
<Paragraph position="2"> To address these problems, we propose using the joint probability of a pair of lemmas instead of their conditional probability. In the above example, this yields $\Pr(l_1, l_2) = 4\beta$ and $\Pr(l'_1, l'_2) = 30\beta$ (where $\beta$ is a normalizing constant). These probabilities more accurately reflect the goodness of paraphrases containing these lemma-pairs. The resulting approximation of the probability of a paraphrase composed of lemmas $l_1, \ldots, l_n$ is as follows:
$\Pr(l_1, \ldots, l_n) \simeq \beta \prod_{i=1}^{n-1} \prod_{j=i+1}^{n} \mathrm{freq}(l_i, l_j)$   (1)
where $\beta$ is a normalizing constant. Since this constant is not informative with respect to the relative scores of the paraphrases for a particular query, we drop it from consideration, and use only the frequencies to calculate the score of a paraphrase. Thus, our paraphrase scoring function is
$\mathrm{score}(l_1, \ldots, l_n) = \prod_{i=1}^{n-1} \prod_{j=i+1}^{n} \mathrm{freq}(l_i, l_j)$   (2)</Paragraph>
<Paragraph position="3"> When calculating the score of a paraphrase using Equation 2, the following aspects regarding $\mathrm{freq}(l_i, l_j)$ must be specified: (1) the extent to which the order of $l_i$ and $l_j$ (as it appears in the paraphrase) should be enforced; and (2) how to handle $l_i$-$l_j$ pairs in the paraphrase that are absent from the lemma-pair dictionary. To illustrate these aspects, consider the candidate paraphrase &quot;who is the greek deity of the ocean?&quot; (proposed for &quot;who is the greek god of the sea?&quot;). The first aspect determines whether the frequency of only &quot;greek deity&quot; should be used, or whether &quot;deity greek&quot; should also be taken into account. The second aspect determines how to score the paraphrase if &quot;greek ocean&quot; is absent from the lemma-pair dictionary. These aspects are specified as experimental parameters of the system.</Paragraph>
<Paragraph position="4"> Relative word order. The extent to which we enforce the order of $l_i$-$l_j$ when calculating $\mathrm{freq}(l_i, l_j)$ is determined by the weight $w_{\mathrm{order}}$ as follows:
$\mathrm{freq}(l_i, l_j) = \mathrm{freq}(l_i \rightarrow l_j) + w_{\mathrm{order}} \times \mathrm{freq}(l_j \rightarrow l_i)$   (3)
where $\mathrm{freq}(l_i \rightarrow l_j)$ is the frequency of the lemma-pair $(l_i, l_j)$ when $l_i$ is followed by $l_j$. Setting $w_{\mathrm{order}} = 0$ allows only the word order in the paraphrase, while $w_{\mathrm{order}} = 1$ counts equally the order in the paraphrase and the reverse order. We experimented with weights of 0, 1 and 0.5 for $w_{\mathrm{order}}$ (Section 6).</Paragraph>
<Paragraph position="5"> Absent lemma-pairs. When a lemma-pair is not in the dictionary, a frequency of 0 is returned. Using this frequency is too strict, because it invalidates an entire paraphrase on account of one culprit which may actually be innocent (recall that 64% of the lemma-pairs in the corpus, approximately 66 million pairs, had a frequency of 1 and were not stored). To address this problem, we assigned a penalty frequency of AbsFreq = 0.1 to a lemma-pair in a paraphrase that does not appear in the dictionary. That is, the score of a paraphrase is divided by 10 for each of its lemma-pairs that is absent from the dictionary.</Paragraph>
<Paragraph position="6"> In addition, we defined the experimental parameter AbsAdjDiv, which models the impact of adjacent lemma-pairs on paraphrasing and retrieval performance. This parameter takes the form of a divisor for AbsFreq: it stipulates by how much to divide AbsFreq for a lemma-pair that is adjacent in the paraphrase but absent from the dictionary. In the above example, AbsAdjDiv = 10 would cause an absent &quot;deity ocean&quot; to receive a penalty of 0.01 (= 0.1/10), compared to an absent &quot;greek ocean&quot;, which would receive a penalty of 0.1. We experimented with four values for AbsAdjDiv: 1, 2, 10 and 20 (Section 6).</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Retrieving documents for each query </SectionTitle>
<Paragraph position="0"> Our retrieval process differs from the standard one in that for each query $Q$, we adjust the scores of the retrieved documents according to the scores of the paraphrases of $Q$ (obtained from Equation 2). Our retrieval process consists of the following steps:
1. For each paraphrase $P_i$ of $Q$ ($i = 0, \ldots, \#\mathrm{para}(Q)$), where $P_0$ is the lemmatized query:
(a) Extract the content lemmas from $P_i$.
(b) Retrieve documents for these lemmas. The retrieval score of a document $D_j$ indicates how well $D_j$ matches the lemmas in paraphrase $P_i$. In order to take into account the plausibility of $P_i$, this score is multiplied by $\mathrm{score}(P_i)$, the score of $P_i$ obtained from Equation 2. This yields $s_{j,i}$, the score of document $D_j$ for paraphrase $P_i$.
2. For each document $D_j$, add the scores from each paraphrase (Equation 4), yielding
$\mathrm{score}(D_j) = \sum_{i=0}^{\#\mathrm{para}(Q)} s_{j,i}$   (4)</Paragraph>
<Paragraph position="1"> An outcome of this method is that lemmas which appear in several paraphrases receive a higher weight. This indirectly identifies the important words in a query, which positively affects retrieval performance (Section 6).</Paragraph> </Section> </Section>
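Below is a minimal sketch of the scoring and retrieval machinery of Sections 4.4 and 4.5, assuming the lemma-pair dictionary built earlier: Equation 2's product of pair frequencies, the order weight and the AbsFreq/AbsAdjDiv penalties, and the paraphrase-weighted combination of document scores from Equation 4. The data structures (plain dictionaries keyed by lemma pairs and document ids) and the interface to the retrieval engine are illustrative choices, not the paper's implementation.

```python
from itertools import combinations

ABS_FREQ = 0.1     # penalty frequency for lemma-pairs absent from the dictionary
ABS_ADJ_DIV = 10   # extra divisor when the absent pair is adjacent in the paraphrase
W_ORDER = 1.0      # weight of the reverse lemma order (0, 0.5 or 1 in the experiments)


def pair_frequency(l1, l2, adjacent, pair_freq):
    """freq(l1, l2) with the order weight and absent-pair penalties of Section 4.4."""
    freq = pair_freq.get((l1, l2), 0) + W_ORDER * pair_freq.get((l2, l1), 0)
    if freq == 0:
        return ABS_FREQ / ABS_ADJ_DIV if adjacent else ABS_FREQ
    return freq


def paraphrase_score(lemmas, pair_freq):
    """Equation 2: product of freq(l_i, l_j) over all content-lemma pairs."""
    score = 1.0
    for i, j in combinations(range(len(lemmas)), 2):
        score *= pair_frequency(lemmas[i], lemmas[j], adjacent=(j == i + 1),
                                pair_freq=pair_freq)
    return score


def combine_document_scores(paraphrases, retrieval_scores, pair_freq):
    """Section 4.5: weight each document's retrieval score by its paraphrase score
    and sum over the lemmatized query and its paraphrases (Equation 4).

    paraphrases: list of content-lemma lists, with the lemmatized query at index 0.
    retrieval_scores: retrieval_scores[i] maps a document id to the engine's
    match score for paraphrase i (the retrieval engine is outside this sketch).
    """
    doc_scores = {}
    for i, lemmas in enumerate(paraphrases):
        weight = paraphrase_score(lemmas, pair_freq)
        for doc_id, match in retrieval_scores[i].items():
            doc_scores[doc_id] = doc_scores.get(doc_id, 0.0) + weight * match
    return doc_scores
```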
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Sample Results </SectionTitle>
<Paragraph position="0"> Table 1 shows the top 10 paraphrases generated by our system for three sample queries, and the 7 paraphrases generated for a fourth query (the lemmatized query is listed first). These paraphrases were obtained with $w_{\mathrm{order}} = 1$, AbsAdjDiv = 10, and manually-corrected tagging (Section 4). The first column contains the score of a paraphrase, the second column the number of its lemma-pairs that were not found in the dictionary, and the third column the paraphrase itself. An excerpt of Table 1 (score, number of absent lemma-pairs, paraphrase):
1.00E-01  1  how grandiloquent be the giraffe ?
1.00E-01  1  how magniloquent be the giraffe ?
1.00E-01  1  how improbable be the giraffe ?
1.00E-01  1  how marvelous be the giraffe ?</Paragraph>
<Paragraph position="1"> These examples illustrate the combined effect of contextual information and WordNet senses. The first query yields mostly felicitous paraphrases, despite their low overall score and absent lemma-pairs. This outcome may be attributed to the generally appropriate synonyms returned by WordNet for the lemmas in this query. The second query produces mixed paraphrasing performance. The problematic paraphrases are generated because our corpus-based information supports WordNet's inappropriate suggestions of &quot;manufacture&quot; as a synonym for &quot;invent&quot; and &quot;video&quot; as a synonym for &quot;television&quot;, thus yielding highly-ranked but incorrect paraphrases. The third query is an extreme example of this behaviour, where WordNet synonyms conspire with contextual information to steer the paraphrasing process toward inappropriate synonyms of &quot;bear&quot;. The final example illustrates the opposite case, where the corpus information overcomes the effect of WordNet's less appropriate suggestions, which yield low-scoring paraphrases.</Paragraph> </Section> </Paper>