File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/01/h01-1009_metho.xml

Size: 16,349 bytes

Last Modified: 2025-10-06 14:07:32

<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1009">
  <Title>Automatic Pattern Acquisition for Japanese Information Extraction</Title>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
2. TREE-BASED PATTERN REPRESENTATION (TBP)
Definition
</SectionTitle>
    <Paragraph position="0"> Tree-based representation of patterns (TBP) is a representation of patterns based on the dependency tree of a sentence. A pattern is defined as a path in the dependency tree passing through zero or more intermediate nodes within the tree. The dependency tree is a directed tree whose nodes are bunsetsus or phrasal units, and whose directed arcs denote the dependency between two bunsetsus: AAXB denotes A's dependency on B (e.g. A is a subject and B is a predicate.) Here dependency relationships are not limited to just those between a case-marked element and a predicate, but also include those between a modifier and its head element, which covers most relationships within sentences.</Paragraph>
    <Paragraph position="1">  TBP for Information Extraction Figure 2 shows how TBP is used in comparison with the word-order based pattern, where A...F in the left part of the figure is a sequence of the phrasal units in a sentence appearing in this order and the tree in the right part is its dependency tree. To find the relationship between BAXF, a word-order based pattern needs a dummy expression to hold C, D and E, while TBF can denote the direct relationship as BAXF. TBP can also represent a complicated pattern for a node which is far from the root node in the dependency tree, like CAXDAXE, which is hard to represent without the sentence structure.</Paragraph>
    <Paragraph position="2"> For matching with TBP, the target sentence should be parsed into a dependency tree. Then all the predicates are detected and the subtrees which have a predicate node as a root are traversed to find a match with a pattern.</Paragraph>
    <Paragraph position="3"> Benefit of TBP TBP has some advantages for pattern matching over the surface word-order based patterns in addressing the problems mentioned in the previous section: AF Free word-order problem TBP can offer a direct representation of the dependency relationship even if the word-order is different.</Paragraph>
    <Paragraph position="4"> AF Free case-marking problem TBP can freely traverse the whole dependency tree and find any significant path as a pattern. It does not depend on pre-defined case-patterns as Riloff [4] and Yangarber [6] did.</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
• Indirect relationships
</SectionTitle>
    <Paragraph position="0"> TBP can find indirect relationships, such as the relationship between a predicate and the modifier of the argument of the</Paragraph>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
BD
</SectionTitle>
    <Paragraph position="0"> In this paper, we used the Japanese parser KNP [1] to obtain the dependency tree of a sentence.</Paragraph>
    <Paragraph position="1"> predicate. For example, the pattern</Paragraph>
    <Paragraph position="3"> AXappoint&amp;quot; can capture the relationship between &amp;quot;BOorganizationBQ&amp;quot; and &amp;quot;to be appointed&amp;quot; in the sentence &amp;quot;BOpersonBQ was appointed to BOpostBQ of BOorganizationBQ.&amp;quot; AF Relationships beyond clausal boundaries TBP can capture relationships beyond clausal boundaries. The pattern &amp;quot;BOpostBQ</Paragraph>
    <Paragraph position="5"> AX announce&amp;quot; can find the relationship between &amp;quot;BOpostBQ&amp;quot; and &amp;quot;to announce&amp;quot;. This relationship, later on, can be combined with the relationship &amp;quot;BOorganizationBQ&amp;quot; and &amp;quot;to announce&amp;quot; and merged into one event.</Paragraph>
  </Section>
  <Section position="7" start_page="1" end_page="2" type="metho">
    <SectionTitle>
3. ALGORITHM
</SectionTitle>
    <Paragraph position="0"> In this section, we outline our procedure for automatic acquisition of patterns. We employ a cascading procedure, as is shown in Figure 3. First, the original documents are processed by a morphological analyzer and NE-tagger. Then the system retrieves the relevant documents for the scenario as a relevant document set. The system, further, selects a set of relevant sentences as a relevant sentence set from those in the relevant document set. Finally, all the sentences in the relevant sentence set are parsed and the paths in the dependency tree are taken as patterns.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Document Preprocessing
</SectionTitle>
      <Paragraph position="0"> Morphological analysis and Named Entity (NE) tagging is performed on the training data at this stage. We used JUMAN [2] for the former and a NE-system which is based on a decision tree algorithm [5] for the latter. Also the part-of-speech information given by JUMAN is used in the later stages.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Document Retrieval
</SectionTitle>
      <Paragraph position="0"> The system first retrieves the documents that describe the events of the scenario of interest, called the relevant document set. A set of narrative sentences describing the scenario is selected to create a query for the retrieval. For this experiment, we set the size of the relevant document set to 300 and retrieved the documents using CRL's stochastic-model-based IR system [3], which performed well in the IR task in IREX, Information Retrieval and Extraction evaluation project in Japan  . All the sentences used to create the patterns are retrieved from this relevant document set.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Sentence Retrieval
</SectionTitle>
      <Paragraph position="0"> The system then calculates the TF/IDF-based score of relevance to the scenario for each sentence in the relevant document set and retrieves the n most relevant sentences as the source of the patterns, where n is set to 300 for this experiment. The retrieved sentences will be the source for pattern extraction in the next subsection.</Paragraph>
      <Paragraph position="1"> First, the TF/IDF-based score for every word in the relevant document set is calculated. TF/IDF score of word w is:</Paragraph>
      <Paragraph position="3"> where N is the number of documents in the collection, TF(w)is the term frequency of w in the relevant document set and DF(w)is the document frequency of w in the collection.</Paragraph>
      <Paragraph position="4"> Second, the system calculates the score of each sentence based on the score of its words. However, unusually short sentences and  where length(s) is the number of words in s, and AVE is the average number of words in a sentence.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.4 Pattern Extraction
</SectionTitle>
      <Paragraph position="0"> Based on the dependency tree of the sentences, patterns are extracted from the relevant sentences retrieved in the previous subsection. Figure 4 shows the procedure. First, the retrieved sentence is parsed into a dependency tree by KNP [1] (Stage 1). This stage also finds the predicates in the tree. Second, the system takes all the predicates in the tree as the roots of their own subtrees, as is shown in (Stage 2). Then each path from the root to a node is extracted, and these paths are collected and counted across all the relevant sentences. Finally, the system takes those paths with frequency higher than some threshold as extracted patterns. Figure 5 shows examples of the acquired patterns.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4. EXPERIMENT
</SectionTitle>
    <Paragraph position="0"> It is not a simple task to evaluate how good the acquired patterns are without incorporating them into a complete extraction system with appropriate template generation, etc. However, finding a match of the patterns and a portion of the test sentences can be a good measure of the performance of patterns.</Paragraph>
    <Paragraph position="1"> The task for this experiment is to find a bunsetsu, a phrasal unit, that includes slot-fillers by matching the pattern to the test sentence. The performance is measured by recall and precision in terms of the number of slot-fillers that the matched patterns can find; these are calculated as follows.</Paragraph>
    <Paragraph position="3"> The procedure proposed in this paper is based on bunsetsus, and an individual bunsetsu may contain more than one slot filler. In such cases the procedure is given credit for each slot filler.</Paragraph>
    <Paragraph position="4"> Strictly speaking, we don't know how many entities in a matched pattern might be slot-fillers when, actually, the pattern does not contain any slot-fillers (in the case of over-generating). We approximate the potential number of slot-fillers by assigning 1 if the (falsely) matched pattern does not contain any Named-Entities, or assigning the number of Named-Entities in the (falsely) matched pattern. For example, if we have a pattern &amp;quot;go to dinner&amp;quot; for a management succession scenario and it matches falsely in some part of the test sentences, this match will gain one at the number of All Matched Slot-fillers (the denominator of the precision). On the other hand, if the pattern is &amp;quot;BOpostBQBOpersonBQ laugh&amp;quot; and it falsely matches &amp;quot;President Clinton laughed&amp;quot;, this will gain two, the number of the Named Entities in the pattern.</Paragraph>
    <Paragraph position="5"> For the sake of comparison, we defined the baseline system with the patterns acquired by the same procedure but only from the direct relationships between a predicate and its arguments (PA in Figure 6 and 7).</Paragraph>
    <Paragraph position="6"> We chose the following two scenarios.</Paragraph>
    <Paragraph position="7"> AF Executive Management Succession: events in which corporate managers left their positions or assumed new ones regardless of whether it was a present (time of the report) or past event.</Paragraph>
    <Paragraph position="8"> Items to extract: Date, person, organization, title.</Paragraph>
    <Paragraph position="9"> AF Robbery Arrest: events in which robbery suspects were arrested. null Items to extract: Date, suspect, suspicion.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Data
</SectionTitle>
      <Paragraph position="0"> For all the experiments, we used the Mainichi-Newspaper-95 corpus for training. As described in the previous section, the system retrieved 300 articles for each scenario as the relevant document set from the training data and it further retrieved 300 sentences as the relevant sentence set from which all the patterns were extracted.</Paragraph>
      <Paragraph position="1"> Test data was taken from Mainichi-Newspaper-94 by manually reviewing the data for one month. The statistics of the test data are shown in Table 1 and 2.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4.2 Results
</SectionTitle>
    <Paragraph position="0"> experiment for the executive management succession scenario and robbery arrest scenario, respectively. We ranked all the acquired patterns by calculating the sum of the TF/IDF-based score (same as for sentence retrieval in Section 3.3) for each word in the pattern and sorting them on this basis. Then we obtained the precision-recall curve by changing the number of the top-ranked patterns in the list.</Paragraph>
    <Paragraph position="1"> Figure 6 shows that TBP is superior to the baseline system both in recall and precision. The highest recall for TBP is 34% while the baseline gets 29% at the same precision level. On the other hand, at the same level of recall, TBP got higher precision (75%) than the baseline (70%).</Paragraph>
    <Paragraph position="2"> We can also see from Figure 6 that the curve has a slightly anomalous shape where at lower recall (below 20%) the precision is also low for both TBP and the baseline. This is due to the fact that the pattern lists for both TBP and the baseline contains some non-reliable patterns which get a high score because each word in the patterns gets higher score than others.</Paragraph>
    <Paragraph position="3">  rest scenario. Although the overall recall is low, TBP achieved higher precision and recall (as high as 30% recall at 40% of precision) than the baseline except at the anomalous point where both TBP and the baseline got a small number of perfect slot-fillers by a highly ranked pattern, namely &amp;quot;gotoyogi-de AX taihosuru (to arrest  on suspicion of robbery)&amp;quot; for the baseline and &amp;quot;BOpersonBQ yogisha AX BOnumberBQ-o AX taihosuru (to arrest the suspect, BOpersonBQ, age BOnumberBQ)&amp;quot;.</Paragraph>
  </Section>
  <Section position="10" start_page="2" end_page="2" type="metho">
    <SectionTitle>
5. DISCUSSION
Low Recall
</SectionTitle>
    <Paragraph position="0"> It is mostly because we have not made a class of types of crimes that the recall on the robbery arrest scenario is low. Once we have a classifier as reliable as Named-Entity tagger, we can make a significant gain in the recall of the system. And in turn, once we have a class name for crimes in the training data (automatically annotated by the classifier) instead of a separate name for each crime, it becomes a good indicator to see if a sentence should be used to acquire patterns. And also, incorporating the classes in patterns can reduce the noisy patterns which do not carry any slot-fillers of the template.</Paragraph>
    <Paragraph position="1"> For example on the management succession scenario, all the slot-fillers defined there were able to be tagged by the Named-Entity tagger [5] we used for this experiment, including the title. Since we knew all the slot-fillers were in one of the classes, we also knew those patterns whose argument was not classified any of the classes would not likely capture slot-fillers. So we could put more weight on those patterns which contained BOpersonBQ, BOorganizationBQ, BOpostBQ and BOdateBQ to collect the patterns with higher performance, and therefore we could achieve high precision.</Paragraph>
    <Paragraph position="2"> Erroneous Case Analysis We also investigated other scenarios, namely train accident and airplane accident scenario, which we will not report in this paper.</Paragraph>
    <Paragraph position="3"> However, some of the problems which arose may be worth mentioning since they will arise in other, similar scenarios.</Paragraph>
  </Section>
  <Section position="11" start_page="2" end_page="2" type="metho">
    <SectionTitle>
• Results or Effects of the Target Event
</SectionTitle>
    <Paragraph position="0"> Especially for the airplane accident scenario, most errors were identified as matching the effect or result of the incident. A typical example is &amp;quot;Because of the accident, the airport had been closed for an hour.&amp;quot; In the airplane accident scenario, the performance of the document retrieval and the sentence retrieval is not as good as the other two scenarios, and therefore, the frequency of relevant acquired patterns is rather low because of the noise. Further improvement in retrieval and a more robust approach is necessary.</Paragraph>
  </Section>
  <Section position="12" start_page="2" end_page="2" type="metho">
    <SectionTitle>
• Related but Not-Desired Sentences
</SectionTitle>
    <Paragraph position="0"> If the scenario is specific enough to make it difficult as an IR task, the result of the document retrieval stage may include many documents related to the scenario in a broader sense but not specific enough for IE tasks. In this experiment, this was the case for the airplane accident scenario. The result of document retrieval included documents about other accidents in general, such as traffic accidents. Therefore, the sentence retrieval and pattern acquisition for these scenarios were affected by the results of the document retrievals.</Paragraph>
  </Section>
class="xml-element"></Paper>