<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1611"> <Title>Paraphrasing Japanese noun phrases using character-based indexing</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Thanks to the richness of natural language, we can use various linguistic expressions to denote a single concept. However, this richness becomes a serious obstacle when natural language is processed by computer. For example, in information retrieval systems that retrieve documents on the basis of surface string matching, mismatches of index terms cause relevant documents to be missed. To remedy this problem, current information retrieval systems adopt query expansion techniques, which replace a query term with a set of its synonyms (Baeza-Yates and Ribeiro-Neto, 1999). Query expansion works well for single-word index terms, but more sophisticated techniques are necessary for larger index units, such as phrases. The effectiveness of phrasal indexing has recently drawn researchers' attention (Lewis, 1992; Mitra et al., 1997; Tokunaga et al., 2002). However, query expansion of phrasal index terms has not yet been fully investigated (Jacquemin et al., 1997).</Paragraph> <Paragraph position="1"> To deal with variations of linguistic expressions, paraphrasing has recently been studied for various applications of natural language processing, such as machine translation (Mitamura, 2001; Shimohata and Sumita, 2002), dialog systems (Ebert et al., 2001), QA systems (Katz, 1997) and information extraction (Shinyama et al., 2002). Paraphrasing is defined as the process of transforming an expression into another while keeping its meaning intact. However, it is difficult to define what &quot;keeping its meaning intact&quot; means, although it is the core of the definition. On what basis can we consider different linguistic expressions to denote the same meaning? 
This becomes a crucial question when finding paraphrases automatically.</Paragraph> <Paragraph position="2"> In past research, various types of clues have been used to find paraphrases. For example, Shinyama et al. tried to find paraphrases by assuming that two sentences sharing many Named Entities and a similar structure are likely to be paraphrases of each other (Shinyama et al., 2002).</Paragraph> <Paragraph position="3"> Barzilay and McKeown assumed that two translations of the same original text contain paraphrases (Barzilay and McKeown, 2001). Torisawa used subcategorization information of verbs to paraphrase the Japanese noun phrase construction &quot;NP1 no NP2&quot; into a noun phrase with a relative clause (Torisawa, 2001). Most previous work on paraphrasing took a corpus-based approach, with the notable exceptions of Jacquemin (Jacquemin et al., 1997; Jacquemin, 1999) and Katz (Katz, 1997). In particular, text alignment techniques are generally used to find sentence-level paraphrases (Shimohata and Sumita, 2002; Barzilay and Lee, 2002).</Paragraph> <Paragraph position="4"> In this paper, we follow the corpus-based approach and propose a method to find paraphrases of a Japanese noun phrase in a large corpus using information retrieval techniques. The distinctive feature of our method is the use of character-based indexing. Japanese uses four types of script: Kanzi (Chinese characters), Hiragana, Katakana, and the Roman alphabet. Among these, Hiragana and Katakana are phonographic, and Kanzi is ideographic. Each Kanzi character has a certain meaning of its own and provides a basis for the rich word formation ability of Japanese. We use Kanzi characters as index terms to retrieve paraphrase candidates, assuming that noun phrases sharing the same Kanzi characters could be paraphrases of each other. For example, character-based indexing enables us to retrieve a paraphrase meaning &quot;a commuting child&quot; for a noun phrase meaning &quot;a child going to school&quot;. 
Note that the two phrases have the same head, the word for &quot;child&quot;, and that their modifiers are different but share the Kanzi characters for &quot;commute&quot; and &quot;study&quot;. As this example shows, paraphrases generated by Japanese word formation rules cannot be classified under the existing paraphrase classification (Jacquemin et al., 1997).</Paragraph> <Paragraph position="5"> The proposed method is summarized as follows. Given a Japanese noun phrase as input, the method finds its paraphrases in a set of documents. In this paper, we used a collection of newspaper articles as the set of documents from which paraphrases are retrieved. The process is decomposed into the following three steps: 1. retrieving paraphrase candidates, 2. filtering the retrieved candidates based on syntactic and semantic constraints, and 3. ranking the resulting candidates.</Paragraph> <Paragraph position="6"> Newspaper articles are segmented into passages at punctuation symbols; the passages are then indexed by Kanzi characters and stored in a database. The database is searched with a query, an input noun phrase, to obtain a set of passages, which are the paraphrase candidates. In general, using smaller index units, such as characters, yields gains in recall at the cost of precision. To remedy this, we introduce a filtering step after retrieving paraphrase candidates. Filtering is performed based on syntactic and semantic constraints. The resulting candidates are ranked and provided as paraphrases.</Paragraph> <Paragraph position="7"> Sections 2, 3 and 4 describe each of the three steps in detail. Section 5 describes experiments to evaluate the proposed method. Finally, Section 6 concludes the paper and discusses future work.</Paragraph> </Section> </Paper>
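<!--
Editorial sketch of the candidate retrieval step described in the introduction: passages are cut at punctuation, indexed by Kanzi character, and retrieved by the Kanzi characters of an input noun phrase. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the use of the CJK Unified Ideographs block (U+4E00 to U+9FFF) as a stand-in for "Kanzi", the overlap-count ordering, and the sample strings are all assumptions made for illustration and are not taken from the paper. The filtering and ranking steps of the paper are not modelled here.

```python
import re
from collections import defaultdict


def kanzi_chars(text):
    """Return the set of Kanzi characters in text.

    Assumption: Kanzi is approximated by the CJK Unified Ideographs block.
    """
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}


def split_passages(article):
    """Split an article into passages at punctuation symbols, dropping empty pieces."""
    return [p for p in re.split(r"[。、，．！？,.!?]", article) if p.strip()]


def build_index(passages):
    """Build an inverted index mapping each Kanzi character to passage ids."""
    index = defaultdict(set)
    for pid, passage in enumerate(passages):
        for ch in kanzi_chars(passage):
            index[ch].add(pid)
    return index


def retrieve(query, index, passages):
    """Return (passage, shared character count) pairs for passages that share
    at least one Kanzi character with the query, most shared characters first.

    The overlap count is a hypothetical ordering used only for this sketch.
    """
    counts = defaultdict(int)
    for ch in kanzi_chars(query):
        for pid in index.get(ch, ()):
            counts[pid] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(passages[pid], n) for pid, n in ranked]


# Made-up sample text; not an example from the paper.
article = "通学する子供が増えた。天気は晴れだ、明日は雨が降る。"
passages = split_passages(article)
index = build_index(passages)

# Only the first passage shares Kanzi characters with this query.
results = retrieve("学校に通う子供", index, passages)
print(results)
```

Because every passage sharing even one character is returned, a sketch like this favours recall over precision, which is the motivation the text gives for the subsequent filtering step.
-->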