<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1009">
  <Title>Automatic Pattern Acquisition for Japanese Information Extraction</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Information Extraction (IE) systems today are commonly based on pattern matching. New patterns need to be written when we customize an IE system for a new scenario (extraction task); this is costly if done by hand. This has led to recent research on automated acquisition of patterns from text with minimal pre-annotation. Riloff [4] reported a successful result for her procedure that needs only a pre-classified corpus. Yangarber [6] developed a procedure for unannotated natural language texts.</Paragraph>
    <Paragraph position="1"> One assumption they share is that relevant documents contain good patterns. Riloff implemented this idea by applying predefined heuristic rules to pre-classified (relevant) documents. Yangarber went further: given seed patterns specific to a scenario, his system classifies the documents by itself and then finds the best patterns in the resulting relevant document set.</Paragraph>
    <Paragraph position="2"> Considering how they represent patterns, we can see that, in general, both Riloff and Yangarber relied on the sentence structure of English. Riloff's predefined heuristic rules are based on syntactic structures, such as &quot;&lt;subj&gt; active-verb&quot; and &quot;active-verb &lt;dobj&gt;&quot;.</Paragraph>
    <Paragraph position="3"> Yangarber used triples of a predicate and some of its arguments, such as &quot;&lt;pred&gt; &lt;subj&gt; &lt;obj&gt;&quot;.</Paragraph>
    <Paragraph position="4"> The Challenges. Our close examination of Japanese revealed several challenges for automated pattern acquisition and information extraction in Japanese and Japanese-like languages, as well as other challenges that arise regardless of language.</Paragraph>
    <Paragraph position="5"> Free word order. Free word order is one of the most significant problems in analyzing Japanese. To capture all possible patterns for a predicate and its arguments, we need to permute the arguments and list each resulting pattern separately. For example, for &quot;&lt;subj&gt; &lt;dobj&gt; &lt;iobj&gt; &lt;predicate&gt;&quot;, with the constraint that the predicate comes last in the sentence, there are six possible patterns (the permutations of the three arguments). The number of patterns needed to cover even simple facts would become unacceptably large.</Paragraph>
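The count of six can be checked with a minimal sketch. It assumes flat, ordered patterns written as space-separated slot names; the function name and the slot labels are illustrative and not part of any system described here.

```python
from itertools import permutations


def predicate_final_patterns(arguments, predicate="predicate"):
    """Enumerate every ordering of the arguments, keeping the predicate last."""
    return [" ".join(order + (predicate,)) for order in permutations(arguments)]


# Three arguments give 3! = 6 predicate-final patterns.
patterns = predicate_final_patterns(["subj", "dobj", "iobj"])
for pattern in patterns:
    print(pattern)
print(len(patterns))
```

Because the count grows factorially with the number of arguments, listing flat patterns one by one quickly becomes impractical.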
    <Paragraph position="6"> Flexible case marking. A language with a flexible case marking system, like Japanese, poses a further difficulty. In particular, we found that some arguments that are usually marked as objects in English are marked by a variety of postpositions in Japanese, and that some case markers (postpositions) mark more than one grammatical category depending on the situation. For example, the Japanese topic marker &quot;wa&quot; can mark almost any entity that would be marked in various ways in English. It is difficult to handle this variety by simply fixing the number of arguments of a predicate when creating patterns in Japanese.</Paragraph>
    <Paragraph position="7"> Relationships beyond direct predicate-argument. Furthermore, we may want to capture the relationship between a predicate and a modifier of one of its arguments. In previous approaches, one had to introduce an ad hoc frame for such a relationship, such as &quot;verb obj [PP &lt;head-noun&gt;]&quot;, to extract the relationship between &quot;to assume&quot; and &quot;&lt;organization&gt;&quot; in the sentence &quot;&lt;person&gt; will assume the &lt;post&gt; of &lt;organization&gt;&quot;. Relationships beyond clause boundaries. Another problem lies in relationships that cross clause boundaries, especially when the event is described in a subordinate clause. For example, for a sentence like &quot;&lt;organization&gt; announced that &lt;person&gt; retired from &lt;post&gt;,&quot; it is hard to find a relationship between &lt;organization&gt; and the event of retiring without the global view from the predicate &quot;announce&quot;.</Paragraph>
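The clause-boundary example can be illustrated with a small sketch. It assumes a hand-built, hypothetical dependency analysis of the example sentence in which each node points to its head; the HEADS table and both helper functions are illustrative assumptions, not the representation proposed in the paper.

```python
# Hypothetical dependency tree: each node maps to its head (None marks the root).
HEADS = {
    "announce": None,
    "organization": "announce",
    "retire": "announce",
    "person": "retire",
    "post": "retire",
}


def path_to_root(node, heads):
    """Return the chain of heads from a node up to the root, inclusive."""
    path = [node]
    while heads[path[-1]] is not None:
        path.append(heads[path[-1]])
    return path


def connecting_path(a, b, heads):
    """Return the tree path joining a and b through their lowest common head."""
    up_a, up_b = path_to_root(a, heads), path_to_root(b, heads)
    common = next(n for n in up_a if n in up_b)
    return up_a[: up_a.index(common) + 1] + up_b[: up_b.index(common)][::-1]


# The organization reaches the retiring event only through "announce".
print(connecting_path("organization", "retire", HEADS))
```

Under this assumed analysis, any pattern limited to the clause headed by "retire" never sees the organization, which is the gap the paragraph above describes.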
    <Paragraph position="8"> These problems cause IE systems to miss some of the arguments needed to fill the template. Overcoming them lets the system find more patterns in the training data and, therefore, more slot fillers for the template.</Paragraph>
    <Paragraph position="9"> In this paper, we introduce a tree-based pattern representation and consider how it can be acquired automatically.</Paragraph>
  </Section>
</Paper>