Automatic Pattern Acquisition
for Japanese Information Extraction
Kiyoshi Sudo
Computer Science
Department
New York University
715 Broadway, 7th floor,
New York, NY 10003 USA
sudo@cs.nyu.edu
Satoshi Sekine
Computer Science
Department
New York University
715 Broadway, 7th floor,
New York, NY 10003 USA
sekine@cs.nyu.edu
Ralph Grishman
Computer Science
Department
New York University
715 Broadway, 7th floor,
New York, NY 10003 USA
grishman@cs.nyu.edu
ABSTRACT
One of the central issues for information extraction is the cost of
customization from one scenario to another. Research on the auto-
mated acquisition of patterns is important for portability and scala-
bility. In this paper, we introduce Tree-Based Pattern representation
where a pattern is denoted as a path in the dependency tree of a sen-
tence. We outline the procedure to acquire Tree-Based Patterns in
Japanese from un-annotated text. The system extracts the relevant
sentences from the training data based on TF/IDF scoring and the
common paths in the parse tree of relevant sentences are taken as
extracted patterns.
Keywords
Information Extraction, Pattern Acquisition
1. INTRODUCTION
Information Extraction (IE) systems today are commonly based
on pattern matching. New patterns need to be written when we
customize an IE system for a new scenario (extraction task); this is
costly if done by hand. This has led to recent research on automated
acquisition of patterns from text with minimal pre-annotation. Riloff
[4] reported a successful result for her procedure that needs only
a pre-classified corpus. Yangarber [6] developed a procedure for
unannotated natural language texts.
One of their common assumption is that the relevant documents
include good patterns. Riloff implemented this idea by applying the
pre-defined heuristic rules to pre-classified (relevant) documents
and Yangarber advanced further so that the system can classify the
documents by itself given seed patterns specific to a scenario and
then find the best patterns from the relevant document set.
Considering how they represent the patterns, we can see that,
in general, Riloff and Yangarber relied on the sentence structure
of English. Riloff’s predefined heuristic rules are based on syn-
tactic structures, such as “BOsubjBQ active-verb” and “active-verb
.
BOdobjBQ”. Yangarber used triples of a predicate and some of its
arguments, such as “BOpredBQBOsubjBQBOobjBQ”.
The Challenges
Our careful examination of Japanese revealed some of the chal-
lenges for automated acquisition of patterns and information ex-
traction on Japanese(-like) language and other challenges which
arise regardless of the languages.
Free Word-ordering
Free word order is one of the most significant problems in analyz-
ing Japanese. To capture all the possible patterns given a predicate
and its arguments, we need to permute the arguments and list all the
patterns separately. For example, for “BOsubjBQBOdobjBQBOiobjBQ
BOpredicateBQ” with the constraint that the predicate comes last in
the sentence, there would be six possible patterns (permutations of
three arguments). The number of patterns to cover even simple
facts would rise unacceptably high.
Flexible case marking system
There is also a difficulty in a language with a flexible case marking
system, like Japanese. In particular, we found that, in Japanese,
some of the arguments that are usually marked as object in En-
glish were variously marked by different post-positions, and some
case markers (postpositions) are used for marking more than one
grammatical category in different situations. For example, the topic
marker in Japanese, “wa”, can mark almost any entity that would
have been variously marked in English. It is difficult to deal with
this variety by simply fixing the number of arguments of a predicate
for creating patterns in Japanese.
Relationships beyond direct predicate-argument
Furthermore, we may want to capture the relationship between a
predicate and a modifier of one of its arguments. In previous ap-
proaches, one had to introduce an ad hoc frame for such a relation-
ship, such as “verb obj [PP BOhead-nounBQ]”, to extract the relation-
ship between “to assume” and “BOorganizationBQ” in the sentence
“BOpersonBQ will assume the BOpostBQ of BOorganizationBQ”.
Relationships beyond clausal boundaries
Another problem lies in relationships beyond clause boundaries, es-
pecially if the event is described in a subordinate clause. For exam-
ple, for a sentence like “BOorganizationBQ announced that BOpersonBQ
retired from BOpostBQ,” it is hard to find a relationship between
BOorganizationBQ and the event of retiring without the global view
from the predicate “announce”.
These problems lead IE systems to fail to capture some of the ar-
guments needed for filling the template. Overcoming the problems
above makes the system capable of finding more patterns from the
training data, and therefore, more slot-fillers in the template.
In this paper, we introduce Tree-based pattern representation and
consider how it can be acquired automatically.
2. TREE-BASED PATTERN REPRESENTA-
TION (TBP)
Definition
Tree-based representation of patterns (TBP) is a representation of
patterns based on the dependency tree of a sentence. A pattern is
defined as a path in the dependency tree passing through zero or
more intermediate nodes within the tree. The dependency tree is a
directed tree whose nodes are bunsetsus or phrasal units, and whose
directed arcs denote the dependency between two bunsetsus: AAXB
denotes A’s dependency on B (e.g. A is a subject and B is a pred-
icate.) Here dependency relationships are not limited to just those
between a case-marked element and a predicate, but also include
those between a modifier and its head element, which covers most
relationships within sentences.
1
TBP for Information Extraction
Figure 2 shows how TBP is used in comparison with the word-
order based pattern, where A...F in the left part of the figure is a
sequence of the phrasal units in a sentence appearing in this or-
der and the tree in the right part is its dependency tree. To find
the relationship between BAXF, a word-order based pattern needs a
dummy expression to hold C, D and E, while TBF can denote the
direct relationship as BAXF. TBP can also represent a complicated
pattern for a node which is far from the root node in the depen-
dency tree, like CAXDAXE, which is hard to represent without the
sentence structure.
For matching with TBP, the target sentence should be parsed into
a dependency tree. Then all the predicates are detected and the
subtrees which have a predicate node as a root are traversed to find
a match with a pattern.
Benefit of TBP
TBP has some advantages for pattern matching over the surface
word-order based patterns in addressing the problems mentioned
in the previous section:
AF Free word-order problem
TBP can offer a direct representation of the dependency re-
lationship even if the word-order is different.
AF Free case-marking problem
TBP can freely traverse the whole dependency tree and find
any significant path as a pattern. It does not depend on pre-
defined case-patterns as Riloff [4] and Yangarber [6] did.
AF Indirect relationships
TBP can find indirect relationships, such as the relationship
between a predicate and the modifier of the argument of the
BD
In this paper, we used the Japanese parser KNP [1] to obtain the
dependency tree of a sentence.
predicate. For example, the pattern
“BOorganizationBQ
D3CU
AXBOpostBQ
D8D3
AXappoint” can capture the rela-
tionship between “BOorganizationBQ” and “to be appointed”
in the sentence
“BOpersonBQ was appointed to BOpostBQ of BOorganizationBQ.”
AF Relationships beyond clausal boundaries
TBP can capture relationships beyond clausal boundaries.
The pattern “BOpostBQ
D8D3
AXappoint
BVC7C5C8
AX announce” can find the
relationship between “BOpostBQ” and “to announce”. This re-
lationship, later on, can be combined with the relationship
“BOorganizationBQ” and “to announce” and merged into one
event.
3. ALGORITHM
In this section, we outline our procedure for automatic acquisi-
tion of patterns. We employ a cascading procedure, as is shown
in Figure 3. First, the original documents are processed by a mor-
phological analyzer and NE-tagger. Then the system retrieves the
relevant documents for the scenario as a relevant document set. The
system, further, selects a set of relevant sentences as a relevant sen-
tence set from those in the relevant document set. Finally, all the
sentences in the relevant sentence set are parsed and the paths in
the dependency tree are taken as patterns.
3.1 Document Preprocessing
Morphological analysis and Named Entity (NE) tagging is per-
formed on the training data at this stage. We used JUMAN [2] for
the former and a NE-system which is based on a decision tree algo-
rithm [5] for the latter. Also the part-of-speech information given
by JUMAN is used in the later stages.
3.2 Document Retrieval
The system first retrieves the documents that describe the events
of the scenario of interest, called the relevant document set. A set
of narrative sentences describing the scenario is selected to create
a query for the retrieval. For this experiment, we set the size of
the relevant document set to 300 and retrieved the documents us-
ing CRL’s stochastic-model-based IR system [3], which performed
well in the IR task in IREX, Information Retrieval and Extraction
evaluation project in Japan
2
. All the sentences used to create the
patterns are retrieved from this relevant document set.
3.3 Sentence Retrieval
The system then calculates the TF/IDF-based score of relevance
to the scenario for each sentence in the relevant document set and
retrieves the n most relevant sentences as the source of the patterns,
where n is set to 300 for this experiment. The retrieved sentences
will be the source for pattern extraction in the next subsection.
First, the TF/IDF-based score for every word in the relevant doc-
ument set is calculated. TF/IDF score of word w is:
D7CRD3D6CTB4DBB5BP
B4
CCBYB4DBB5A1
D0D3CVB4C6B7BCBMBHB5
BWBYB4DBB5
D0D3CVB4C6B7BDB5
if w is Noun, Verb or Named Entity
BC otherwise
where N is the number of documents in the collection, TF(w)is
the term frequency of w in the relevant document set and DF(w)is
the document frequency of w in the collection.
Second, the system calculates the score of each sentence based
on the score of its words. However, unusually short sentences and
BE
IREX Homepage: http://cs.nyu.edu/cs/projects/proteus/irex
Dependency Tree Tree-Based Pattern
CU
CU CU
CU CU
A0
A0
A0
A0
A0
A0
A0
BS
BS
BS
BS
BS
BS
BS
A0
A0
A0
A0
A0
A0
A0
BS
BS
BS
BS
BS
BS
BS
BOorganizationBQ-wa
BOorganizationBQ-TOPIC
BOpersonBQ-ga
BOpersonBQ-SUBJ
BOpostBQ-kara
BOpostBQ-FROM
taininsuru
(retire)
happyosuru
(announce)
A0
A0
A0
A0A9
1
BS
BS
BS
BSCA
3
BS
BS
BS
BS
BS
A0
A0
A0
A0
A0A9
2
CU
CU
CU
CU
CU
CU
CU
B9
B9 B9
B9
happyosuru
(announce)
taininsuru
(retire)
BOorganizationBQ-wa
(BOorganizationBQ-TOPIC)
BOpersonBQ-ga
(BOpersonBQ-SUBJ)
BOpostBQ-kara
(BOpostBQ-FROM)
1
2
3
Figure 1: Tree-Based Pattern Representation
Word-order Pattern
A B C D E F
BI
Pattern [B * F]
F
E
BA
D
C
AH
AH
BF
C8
C8
C8
C8D5
AH
AH
AHBF
C9
C9D7
B9
Pattern [BAXF]
C8
C8
C8
C8
C8
C8
C8
C8
C8D5
Pattern [CAXEAXF]
TBP
Figure 2: Extraction using Tree-Based Pattern Representation
unusually long sentences will be penalized. The TF/IDF score of
sentence s is:
D7CRD3D6CTB4D7B5BP
C8
DBBED7
D7CRD3D6CTB4DBB5
D0CTD2CVD8CWB4D7B5B7CYD0CTD2CVD8CWB4D7B5A0BTCEBXCY
where length(s) is the number of words in s, and AVE is the av-
erage number of words in a sentence.
3.4 Pattern Extraction
Based on the dependency tree of the sentences, patterns are ex-
tracted from the relevant sentences retrieved in the previous sub-
section. Figure 4 shows the procedure. First, the retrieved sentence
is parsed into a dependency tree by KNP [1] (Stage 1). This stage
also finds the predicates in the tree. Second, the system takes all
the predicates in the tree as the roots of their own subtrees, as is
shown in (Stage 2). Then each path from the root to a node is
extracted, and these paths are collected and counted across all the
relevant sentences. Finally, the system takes those paths with fre-
quency higher than some threshold as extracted patterns. Figure 5
shows examples of the acquired patterns.
4. EXPERIMENT
It is not a simple task to evaluate how good the acquired pat-
terns are without incorporating them into a complete extraction sys-
tem with appropriate template generation, etc. However, finding a
match of the patterns and a portion of the test sentences can be a
good measure of the performance of patterns.
The task for this experiment is to find a bunsetsu, a phrasal unit,
that includes slot-fillers by matching the pattern to the test sentence.
The performance is measured by recall and precision in terms of the
number of slot-fillers that the matched patterns can find; these are
calculated as follows.
CACTCRCPD0D0 BP
AZD3CU C5CPD8CRCWCTCS CACTD0CTDACPD2D8CBD0D3D8BYCXD0D0CTD6D7
AZ D3CU BTD0D0 CACTD0CTDACPD2D8 CBD0D3D8BYCXD0D0CTD6D7
Original Document Set
Document
Preprocessing
B9
Preprocessed Document Set
Document
Retrieval
B9
Relevant Document Set
Sentence Retrieval
A0
A0
A0
A0
A0
A0
A0
A9
Relevant Sentence Set
(Tree representation)
CU
CU
CU
CU
CUAG
AG
AG
AGAG
C8
C8
C8
C8C8
AG
AG
AG
AG
AG
C8
C8
C8
C8C8
Pattern
Extraction
B9
CU CU CU
CU CU CU
CU CU
Extracted Patterns
Figure 3: Pattern Acquisition Procedure Overall Process
C8D6CTCRCXD7CXD3D2 BP
AZ D3CU C5CPD8CRCWCTCS CACTD0CTDACPD2D8 CBD0D3D8BYCXD0D0CTD6D7
AZD3CU BTD0D0 C5CPD8CRCWCTCS CBD0D3D8BYCXD0D0CTD6D7
The procedure proposed in this paper is based on bunsetsus, and
an individual bunsetsu may contain more than one slot filler. In
such cases the procedure is given credit for each slot filler.
Strictly speaking, we don’t know how many entities in a matched
pattern might be slot-fillers when, actually, the pattern does not
contain any slot-fillers (in the case of over-generating). We ap-
proximate the potential number of slot-fillers by assigning 1 if the
(falsely) matched pattern does not contain any Named-Entities, or
assigning the number of Named-Entities in the (falsely) matched
pattern. For example, if we have a pattern “go to dinner” for a
management succession scenario and it matches falsely in some
part of the test sentences, this match will gain one at the number
of All Matched Slot-fillers (the denominator of the precision). On
the other hand, if the pattern is “BOpostBQBOpersonBQ laugh” and it
falsely matches “President Clinton laughed”, this will gain two, the
number of the Named Entities in the pattern.
For the sake of comparison, we defined the baseline system with
the patterns acquired by the same procedure but only from the di-
rect relationships between a predicate and its arguments (PA in Fig-
ure 6 and 7).
We chose the following two scenarios.
AF Executive Management Succession: events in which corpo-
rate managers left their positions or assumed new ones re-
gardless of whether it was a present (time of the report) or
past event.
Items to extract: Date, person, organization, title.
AF Robbery Arrest: events in which robbery suspects were ar-
rested.
Items to extract: Date, suspect, suspicion.
4.1 Data
Management Succession
Documents 15
Sentences 79
DATE 43
PERSON 41
ORGANIZATION 22
OLD-ORGANIZATION 2
NEW-POST 30
OLD-POST 39
Table 1: Test Set for Management Succession scenario
Robbery Arrest
Documents 28
Sentences 182
DATE 26
SUSPICION 34
SUSPECT 50
Table 2: Test Set for Robbery Arrest scenario
For all the experiments, we used the Mainichi-Newspaper-95
corpus for training. As described in the previous section, the system
retrieved 300 articles for each scenario as the relevant document set
from the training data and it further retrieved 300 sentences as the
relevant sentence set from which all the patterns were extracted.
Test data was taken from Mainichi-Newspaper-94 by manually
reviewing the data for one month. The statistics of the test data are
shown in Table 1 and 2.
4.2 Results
Figure 6 and Figure 7 illustrates the precision-recall curve of this
Stage 1 (Dependency Tree)
BOorgBQ-wa
BOpsnBQ-ga
BOpostBQ-ni
shuninsuru(p)-to
happyoshita(p)
AY
AY
CQ
CQ
AY
AY
CQ
CQ
CU
CU
CU
CU
CU
B9
Stage 2 (Separated Trees)
BOorgBQ-wa
BOpsnBQ-ga
BOpostBQ-ni
happyoshita(p)
AH
AHC9
C9CZ
AY
AY
CQ
CQ
AY
AY
CQ
CQ
C9
C9
CZ
AH
AH
AH
AH
AHB7
AH
AHB7
1
2
3
4
CU
CU
CU
CU
CU
BOpsnBQ-ga
BOpostBQ-ni
shuninsuru(p)
CQ
CQ
AY
AY
C9
C9CZ
AH
AHB7
5
6
CU
CU
CU
(p indicates the node is a predicate.)
Extracted Patterns
1 BOorganizationBQ-wa AX happyosuru
2 BOpersonBQ-ga AX shuninsuru-to AX happyosuru
3 BOpostBQ-ni AX shuninsuru-to AX happyosuru
4 shuninsuru-to AX happyosuru
5 BOpersonBQ-ga AX shuninsuru
6 BOpostBQ-ni AX shuninsuru
Japanese sentence :
English Translation :
BOorganizationBQ-wa
BOorganizationBQ-TOPIC
BOpersonBQ-ga
BOpersonBQ-SBJ
BOpostBQ-ni
BOpostBQ-TO
shuninsuru-to
start-COMP
happyoshita.
announced.
(BOorganizationBQ announced that BOpersonBQ was appointed to BOpostBQ.)
Figure 4: Pattern Acquisition from “BOorgBQ-wa BOpsnBQ-ga BOpstBQ-ni shuninsuru-to happyoshita.”
experiment for the executive management succession scenario and
robbery arrest scenario, respectively. We ranked all the acquired
patterns by calculating the sum of the TF/IDF-based score (same
as for sentence retrieval in Section 3.3) for each word in the pattern
and sorting them on this basis. Then we obtained the precision-
recall curve by changing the number of the top-ranked patterns in
the list.
Figure 6 shows that TBP is superior to the baseline system both
in recall and precision. The highest recall for TBP is 34% while the
baseline gets 29% at the same precision level. On the other hand,
at the same level of recall, TBP got higher precision (75%) than the
baseline (70%).
We can also see from Figure 6 that the curve has a slightly anoma-
lous shape where at lower recall (below 20%) the precision is also
low for both TBP and the baseline. This is due to the fact that
the pattern lists for both TBP and the baseline contains some non-
reliable patterns which get a high score because each word in the
patterns gets higher score than others.
Figure 7 shows the result of this experiment on the Robbery Ar-
rest scenario. Although the overall recall is low, TBP achieved
higher precision and recall (as high as 30% recall at 40% of pre-
cision) than the baseline except at the anomalous point where both
TBP and the baseline got a small number of perfect slot-fillers by a
highly ranked pattern, namely “gotoyogi-de AX taihosuru (to arrest
0 2040608010
Precision
0
20
40
60
Recall
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
*
*
*
*
*
*
*
*
*
*
**
+ ... TBP
* ... Baseline (PA)
Figure 6: Result on Management Succession Scenario
Scenario Patterns
Executive Succession : BOpostBQ-ni AX shokakusuru (to be promoted to BOpostBQ)
BOpostBQ-ni AX shuninsuru (to assume BOpostBQ)
BOpostBQ-ni AX shokakusuru AX (to announce an informal decision of promoting
BOjinjiBQ-o AX happyosuru somebody to BOpostBQ)
Robbery Arrest : satsujin-yogi-de AX taihosuru (to arrest in suspicion of murder)
BOdateBQ AX taihosuru (to arrest on BOdateBQ)
satsujin-yogi-de AX taihosuru (to arrest in suspicion of murder)
BOpersonBQ-yogisha AX #-o AX taihosuru (to arrest the suspect, BOpersonBQ, age #)
Figure 5: Acquired Patterns
0 2040608010
Precision
0
20
40
60
Recall
+
+
+
+++
+
+
+++
*
*
*
**
**
****
****
+ ... TBP
* ... Baseline (PA)
Figure 7: Result on Robbery Arrest Scenario
on suspicion of robbery)” for the baseline and “BOpersonBQ yogisha
AX BOnumberBQ-o AX taihosuru (to arrest the suspect, BOpersonBQ,
age BOnumberBQ)”.
5. DISCUSSION
Low Recall
It is mostly because we have not made a class of types of crimes
that the recall on the robbery arrest scenario is low. Once we have
a classifier as reliable as Named-Entity tagger, we can make a sig-
nificant gain in the recall of the system. And in turn, once we have
a class name for crimes in the training data (automatically anno-
tated by the classifier) instead of a separate name for each crime,
it becomes a good indicator to see if a sentence should be used to
acquire patterns. And also, incorporating the classes in patterns can
reduce the noisy patterns which do not carry any slot-fillers of the
template.
For example on the management succession scenario, all the
slot-fillers defined there were able to be tagged by the Named-
Entity tagger [5] we used for this experiment, including the title.
Since we knew all the slot-fillers were in one of the classes, we
also knew those patterns whose argument was not classified any
of the classes would not likely capture slot-fillers. So we could
put more weight on those patterns which contained BOpersonBQ,
BOorganizationBQ, BOpostBQ and BOdateBQ to collect the patterns with
higher performance, and therefore we could achieve high precision.
Erroneous Case Analysis
We also investigated other scenarios, namely train accident and air-
plane accident scenario, which we will not report in this paper.
However, some of the problems which arose may be worth men-
tioning since they will arise in other, similar scenarios.
AF Results or Effects of the Target Event
Especially for the airplane accident scenario, most errors were
identified as matching the effect or result of the incident. A
typical example is “Because of the accident, the airport had
been closed for an hour.” In the airplane accident scenario,
the performance of the document retrieval and the sentence
retrieval is not as good as the other two scenarios, and there-
fore, the frequency of relevant acquired patterns is rather low
because of the noise. Further improvement in retrieval and a
more robust approach is necessary.
AF Related but Not-Desired Sentences
If the scenario is specific enough to make it difficult as an IR
task, the result of the document retrieval stage may include
many documents related to the scenario in a broader sense
but not specific enough for IE tasks. In this experiment, this
was the case for the airplane accident scenario. The result of
document retrieval included documents about other accidents
in general, such as traffic accidents. Therefore, the sentence
retrieval and pattern acquisition for these scenarios were af-
fected by the results of the document retrievals.
6. FUTURE WORK
Information Extraction
To apply the acquired patterns to an information extraction task,
further steps are required besides those mentioned above. Since
the patterns are a set of the binary relationships of a predicate and
another element, it is necessary to merge the matched elements into
a whole event structure.
Necessity for Generalization
We have not yet attempted any (lexical) generalization of pattern
candidates. The patterns can be expanded by using a thesaurus
and/or introducing a new (lexical) class suitable for a particular
domain. For example, the class of expressions of flight number
clearly helps the performance on the airplane accident scenario.
Especially, the generalized patterns will help improve recall.
Robust Pattern Extraction
As is discussed in the previous section, the performance of our sys-
tem relies on each component. If the scenario is difficult for the IR
task, for example, the whole result is affected. The investigation of
a more conservative approach would be necessary.
Translingualism
The presented results show that our procedure of automatic pat-
tern acquisition is promising. The procedure is quite general and
addresses problems which are not specific to Japanese. With an ap-
propriate morphological analyzer, a parser that produces a depen-
dency tree and an NE-tagger, our procedure should be applicable to
almost any language.
7. ACKNOWLEDGMENTS
This research is supported by the Defense Advanced Research
Projects Agency as part of the Translingual Information Detec-
tion, Extraction and Summarization (TIDES) program, under Grant
N66001-00-1-8917 from the Space and Naval Warfare Systems Cen-
ter San Diego. This paper does not necessarily reflect the position
or the policy of the U.S. Government.
8. REFERENCES
[1] S. Kurohashi and M. Nagao. Kn parser : Japanese
dependency/case structure analyzer. In the Proceedings of the
Workshop on Sharable Natural Language Resources, 1994.
[2] Y. Matsumoto, S. Kurohashi, O. Yamaji, Y. Taeki, and
M. Nagano. Japanese morphological analyzing system:
Juman. Kyoto University and Nara Institute of Science and
Technology, 1997.
[3] M. Murata, K. Uchimoto, H. Ozaku, and Q. Ma. Information
retrieval based on stochastic models in irex. In the
Proceedings of the IREX Workshop, 1994.
[4] E. Riloff. Automatically generating extraction patterns from
untagged text. In the Proceedings of Thirteenth National
Conference on Artificial Intelligence (AAAI-96), 1996.
[5] S. Sekine, R. Grishman, and H. Shinnou. A decision tree
method for finding and classifying names in japanese texts. In
the Proceedings of the Sixth Workshop on Very Large
Corpora, 1998.
[6] R. Yangarber, R. Grishman, P. Tapanainen, and S. Huttunen.
Unsupervised discovery of scnario-level patterns for
information extraction. In the Proceedings of the Sixth Applied
Natural Language Processing Conference, 2000.
