<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1609"> <Title>Paraphrase Acquisition for Information Extraction</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Overall Procedure of Paraphrase Acquisition </SectionTitle> <Paragraph position="0"> Our main goal is to obtain pattern clusters for IE, which consist of sets of equivalent patterns capturing the same information. We therefore tried to discover paraphrases contained in Japanese news articles for a specific domain. Our basic idea is to search news articles from the same day. We focused on the fact that various newspapers describe a single event in different ways. So if we can discover an event which is reported in more than one newspaper, we can hope these articles can be used as a source of paraphrases. For example, the following articles appeared in &quot;Health&quot; sections in different newspapers on Apr. 11: 1. &quot;The government has announced that two more people have died in Hong Kong after contracting the SARS virus and 61 new cases of the illness have been detected.&quot; (Reuters, Apr. 11) 2. &quot;Hong Kong reported two more deaths and 61 fresh cases of SARS Friday as governments across the world took tough steps to stop the killer virus at their borders.&quot; (Channel News Asia, Apr. 11) In these articles, we can find several corresponding parts, such as &quot;NUMBER people have died in LOCATION&quot; and &quot;LOCATION reported NUMBER deaths&quot;. Although their syntactic structures are different, they still convey the same single fact. Here it is worth noting that even if a different expression is used, some noun phrases such as &quot;Hong Kong&quot; or &quot;two more&quot; are preserved across the two articles. We found that these words shared by the two sentences provide firm anchors for two different expressions.
In particular, Named Entities (NEs) such as names, locations, dates or numbers can be the firmest anchors since they are indispensable for reporting an event and difficult to paraphrase.</Paragraph> <Paragraph position="1"> We tried to obtain paraphrases by using this property. First we collect a set of comparable articles which report the same event, and pull appropriate portions out of the sentences which share the same anchors. If we carefully choose appropriate portions of the sentences, the extracted expressions will convey the same information; i.e., they are paraphrases. After corresponding portions are obtained, we generalize the expressions to templates of paraphrases which can be used in the future.</Paragraph> <Paragraph position="2"> Our method is divided into four steps: 1. Find comparable sentences which report the same event from different newspapers.</Paragraph> <Paragraph position="3"> 2. Identify anchors in the comparable sentences.</Paragraph> <Paragraph position="4"> 3. Extract corresponding portions from the sentences. 4. Generalize the obtained expressions to paraphrase templates.</Paragraph> <Paragraph position="5"> Figure 1 shows the overall procedure. In the remainder of this section, we describe each step in turn.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Find Comparable Sentences </SectionTitle> <Paragraph position="0"> To find comparable articles and sentences, we used methods developed for Topic Detection and Tracking (Wayne, 1998). The actual process is divided into two parts: article level matching and sentence level matching. Currently we assume that a pair of paraphrases can be found in a single sentence of each article and that corresponding expressions do not range across two or more sentences.
Article level matching is first required to narrow the search space and reduce erroneous matching of anchors.</Paragraph> <Paragraph position="1"> Before applying this technique, we first preprocessed the articles by stripping off the strings which are not considered sentences. Then we used a part-of-speech tagger to obtain segmented words. In the actual matching process we used a method described in (Papka et al., 1999) to find a set of comparable articles. Then we used a simple vector space model for sentence matching.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Identify Anchors </SectionTitle> <Paragraph position="0"> Before extracting paraphrases, we find anchors in comparable sentences. We used Extended Named Entity tagging to identify anchors. A Named Entity tagger identifies proper expressions such as names, locations and dates in sentences. In addition to these expressions, an Extended Named Entity tagger identifies some common nouns, such as disease names or numbers, that are also unlikely to change (Sekine et al., 2002). For each corresponding pair of sentences, we apply the tagger and identify the same noun phrases which appear in both sentences as anchors.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Extract Corresponding Sentence Portions </SectionTitle> <Paragraph position="0"> Now we identify appropriate boundaries of expressions which share the anchors identified in the previous stage. To avoid extracting ungrammatical expressions, we operate on syntactically structured text rather than sequences of words. Dependency analysis is suitable for this purpose, since using dependency trees we can reconstruct grammatically correct expressions from a spanning subtree whose root is a predicate.
Dependency analysis also allows us to extract expressions which are subtrees but do not correspond to a single contiguous sequence of words.</Paragraph> <Paragraph position="1"> We applied a dependency analyzer to a pair of corresponding sentences and obtained a tree structure for each sentence. Each node of the tree is either a predicate such as a verb or an adjective, or an argument such as a noun or a pronoun. Each predicate can take one or more arguments. We generated all possible combinations of subtrees from each dependency tree, and compared the anchors which are included in both subtrees. Once a pair of corresponding subtrees which share the anchors is found, the subtree pair can be recognized as paraphrases. In actual experiments, we put some restrictions on these subtrees, which will be discussed later. This way we can obtain grammatically well-formed portions of sentences (Figure 2).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Generalize Expressions </SectionTitle> <Paragraph position="0"> After corresponding portions are obtained, we generalize the expressions to form usable templates of paraphrases. In fact, this is already done by Extended Named Entity tagging. An Extended Named Entity tagger classifies proper expressions into several categories, much as a part-of-speech tagger classifies words into several part-of-speech categories. For example, &quot;Hong Kong&quot; is tagged as a location name, and &quot;two more&quot; as a number. So an expression such as &quot;two more people die in Hong Kong&quot; is finally converted into the form &quot;NUMBER people die in LOCATION&quot; where NUMBER and LOCATION are slots to fill in. This way we obtain expressions which can be used as IE patterns.</Paragraph> </Section> </Section> </Paper>
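<!--
The generalization step of Section 2.4 can be illustrated with a minimal sketch.
It is not the authors' implementation: the tagger output is hard-coded, the names
generalize and shared_anchors are hypothetical, and plain string replacement stands
in for whatever the real system does on dependency subtrees. It only shows how
tagged entities become slots and how entities common to two expressions act as anchors.

```python
from typing import List, Set, Tuple

# Assumption: an Extended Named Entity tagger would yield (surface, category)
# pairs like these. The tagger itself is not implemented here.
Entity = Tuple[str, str]


def generalize(expression: str, entities: List[Entity]) -> str:
    """Replace each tagged entity string with its category label (a slot)."""
    template = expression
    # Longer surface strings are replaced first so that a short entity
    # cannot overwrite part of a longer one.
    ordered = sorted(entities, key=lambda e: len(e[0]), reverse=True)
    for surface, category in ordered:
        template = template.replace(surface, category)
    return template


def shared_anchors(a: List[Entity], b: List[Entity]) -> Set[Entity]:
    """Entities tagged in both expressions are treated as anchors."""
    return set(a).intersection(b)


# Hard-coded tagger output for the two example expressions in the text.
entities_1 = [("two more", "NUMBER"), ("Hong Kong", "LOCATION")]
entities_2 = [("Hong Kong", "LOCATION"), ("two more", "NUMBER")]

template_1 = generalize("two more people die in Hong Kong", entities_1)
template_2 = generalize("Hong Kong reported two more deaths", entities_2)
anchors = shared_anchors(entities_1, entities_2)

print(template_1)  # NUMBER people die in LOCATION
print(template_2)  # LOCATION reported NUMBER deaths
```

Because both expressions share the same two anchors, the resulting pair of templates
would be a candidate paraphrase pair under the procedure described above.
-->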