Analyzing the Complexity of a Domain With Respect To An
Information Extraction Task
Amit Bagga
Dept. of Computer Science
Box 90129, Duke University
Durham, N. C. 27708-0129
Email: amit@cs.duke.edu
Phone: 919-660-6507
Abstract
In this paper we describe a method of classifying facts (information) into categories or levels, where
each level signifies a different degree of syntactic complexity related to a fact. Based on this classification
mechanism, we also propose a method of evaluating a domain by assigning to it a "domain number"
based on the levels of a set of standard facts present in the articles of that domain.
Introduction
The Message Understanding Conferences (MUCs) have been held with the goal of qualitatively evaluating
message understanding systems. While the MUCs held thus far have been quite successful at providing
such an evaluation, very little work has been done in analyzing the difficulty of understanding a text in
a particular domain, either independently or in comparison with understanding a text in some other
domain.
The organizers of MUC-5 attempted to compare the difficulty of the EJV (English Joint Ventures) task
in MUC-5 to the terrorist task of MUC-3 and MUC-4. The criteria used for comparing these two tasks
included the vocabulary size, the average sentence length, the average number of sentences per text, the
number of texts, etc. [Sundheim 1993]. The organizers of MUC-6 did not attempt to compare the difficulty
of the MUC-6 task to the previous MUC tasks, saying that "the problem of coming up with a reasonable,
objective way of measuring relative task difficulty has not been adequately addressed" [Sundheim 1995].
In this paper we describe a method of classifying facts (information) into categories or levels, where
each level signifies a different degree of syntactic complexity related to a fact. Based upon this classification
mechanism, we also propose a method of evaluating a domain by assigning to it a "domain number" based
on the levels of a set of standard facts present in the articles of that domain. In addition, using the proposed
classification mechanism, we analyze the complexity of the MUC-7 information extraction task and compare
it to the complexities of the information extraction tasks of MUC-4, MUC-5, and MUC-6.
Definitions
Network:
A network consists of a collection of nodes interconnected by an accompanying set of arcs. Each node denotes
an object and each arc represents a binary relation between the objects. [Hendrix 1979]
A Partial Network:
A partial network is a collection of nodes interconnected by an accompanying set of arcs where the collection
of nodes is a subset of a collection of nodes forming a network, and the accompanying set of arcs is a subset
of the set of arcs accompanying the set of nodes which form the network.
[Figure 1: A Sample Network. Nodes: "The Extraditables," "the armed branch," "the Medellin Cartel,"
"responsibility," "the murder," "two employees," "Bogota's Daily El Espectador," "Nov 15," and "Medellin."
Arc labels: "or," "of," "have claimed," "for," "on," and "took place in."]
Figure 1 shows a sample network for the following piece of text:
"The Extraditables," or the Armed Branch of the Medellin Cartel have claimed responsibility
for the murder of two employees of Bogota's daily El Espectador on Nov 15. The murders took
place in Medellin.
The Level of A Fact
The level of a fact, F, in a piece of text is defined by the following algorithm:
1. Build a network, S, for the piece of text.
2. Suppose the fact, F, consists of several nodes $\{x_1, x_2, \ldots, x_n\}$. Let s be the partial network consisting
of the set of nodes $\{x_1, x_2, \ldots, x_n\}$ interconnected by the set of arcs $\{t_1, t_2, \ldots, t_k\}$.
We define the level of the fact, F, with respect to the network, S, to be equal to k, the number of arcs
linking the nodes which comprise the fact F.
Observations
Given the definition of the level of a fact, the following observations can be made:
• The level of a fact is related to the concept of "semantic vicinity" defined by Schubert et al. [Schubert 1979].
The semantic vicinity of a node in a network consists of the nodes and the arcs reachable from that
node by traversing a small number of arcs. The fundamental assumption used here is that "the knowledge
required to perform an intellectual task generally lies in the semantic vicinity of the concepts
involved in the task" [Schubert 1979].
The level of a fact is equal to the number of arcs that one needs to traverse to reach all the concepts
(nodes) which comprise the fact of interest.
• A level-0 fact consists of a single node (i.e. no transitions) in a network.
• A level-k fact is a union of k level-1 facts.
• Conjunctions/disjunctions increase the level of a fact.
• The higher the level of a fact, the harder it is to extract it from a piece of text.
• A fact appearing at one level in a piece of text may appear at some other level in the same piece of
text.
• The level of a fact in a piece of text depends on the granularity of the network constructed for that
piece of text. Therefore, the level of a fact with respect to a network built at the word level (i.e. words
represent objects and the relationships between the objects) will be greater than the level of a fact with
respect to a network built at the phrase level (i.e. noun groups represent objects while verb groups
and preposition groups represent the relationships between the objects).
Examples
Let S be the network shown in Figure 1. S has been built at the phrase level.
• The city mentioned, in S, is an example of a level-0 fact because the "city" fact consists only of one
node, "Medellin."
• The type of attack, in S, is an example of a level-1 fact.
We define the type of attack in the network to be an attack designator such as "murder," "bombing," or
"assassination" with one modifier giving the victim, perpetrator, date, location, or other information.
In this case the type of attack fact is composed of the "the murder" and the "two employees" nodes
and their connector. This makes the type of attack a level-1 fact.
The type of attack could appear as a level-0 fact as in "the Medellin bombing" (assuming that the
network is built at the phrase level) because in this case both the attack designator (bombing) and
the modifier (Medellin) occur in the same node. The type of attack fact occurs as a level-2 fact in
the following sentence (once again assuming that the network is built at the phrase level): "10 people
were killed in the offensive which included several bombings." In this case there is no direct connector
between the attack designator (several bombings) and its modifier (10 people). They are connected by
the intermediary "the offensive" node, thereby making the type of attack a level-2 fact. The type of
attack can also appear at higher levels.
• In S, the date of the murder of the two employees is an example of a level-2 fact.
This is because the attack designator (the murder) along with its modifier (two employees) accounts for
one level and the arc to "Nov 15" accounts for the second level.
The date of the attack, in this case, is not a level-1 fact consisting only of the two nodes "the murder" and
"Nov 15," because the phrase "the murder on Nov 15" does not tell one that an attack actually took
place. The article could have been talking about a seminar on murders that took place on Nov 15 and
not about the murder of two employees which took place then.
• In S, the location of the murder of the two employees is an example of a level-2 fact.
The exact same argument as for the date of the murder of the two employees applies here.
• The complete information, in S, about the victims is an example of a level-2 fact because to know that
two employees of Bogota's Daily El Espectador were victims, one has to know that they were murdered.
The attack designator (the murder) with its modifier (two employees) accounts for one level, while the
connector between "two employees" and "Bogota's Daily El Espectador" accounts for the other.
• Similarly, the complete information, in S, about the perpetrators of the murder of the two employees
is an example of a level-5 fact. The breakup of the 5 levels is as follows: the fact that two employees
were murdered accounts for one level; the fact that "The Extraditables" have claimed responsibility
for the murders accounts for two additional levels; and the fact that the Extraditables are the "armed
branch of the Medellin Cartel" accounts for the remaining two levels.
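The levels worked out in the examples above can be checked mechanically. The following is a minimal sketch, not the procedure used in the paper: it assumes that the partial network for a fact contains every arc of S whose two endpoints both belong to the fact, and the arc list is our own transcription of Figure 1 (arc directions are an assumption; only connectivity matters). The names `Arc`, `FIGURE_1`, `FACTS`, and `level_of_fact` are hypothetical.

```python
from typing import Iterable, NamedTuple


class Arc(NamedTuple):
    source: str
    label: str
    target: str


# Phrase-level network for the sample text, transcribed by hand from Figure 1.
# Arc directions are an assumption of this sketch.
FIGURE_1 = [
    Arc("The Extraditables", "or", "the armed branch"),
    Arc("the armed branch", "of", "the Medellin Cartel"),
    Arc("The Extraditables", "have claimed", "responsibility"),
    Arc("responsibility", "for", "the murder"),
    Arc("the murder", "of", "two employees"),
    Arc("two employees", "of", "Bogota's Daily El Espectador"),
    Arc("the murder", "on", "Nov 15"),
    Arc("the murder", "took place in", "Medellin"),
]


def level_of_fact(network: Iterable[Arc], fact_nodes: Iterable[str]) -> int:
    """Count the arcs whose two endpoints both belong to the fact."""
    nodes = set(fact_nodes)
    return sum(1 for arc in network if arc.source in nodes and arc.target in nodes)


# Node sets for the facts discussed in the examples.
FACTS = {
    "city": ["Medellin"],
    "type of attack": ["the murder", "two employees"],
    "date": ["the murder", "two employees", "Nov 15"],
    "location": ["the murder", "two employees", "Medellin"],
    "victims": ["the murder", "two employees", "Bogota's Daily El Espectador"],
    "perpetrators": [
        "the murder", "two employees", "responsibility",
        "The Extraditables", "the armed branch", "the Medellin Cartel",
    ],
}

LEVELS = {name: level_of_fact(FIGURE_1, nodes) for name, nodes in FACTS.items()}

if __name__ == "__main__":
    for name, level in LEVELS.items():
        print(f"{name}: level-{level}")
```

Under these assumptions the computed levels are 0, 1, 2, 2, 2, and 5, which agree with the levels given in the examples.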
Justification of the Methodology
The level of a fact quantifies the "spread" in the information that makes up the fact. Therefore, the higher
the level of a fact, the greater is the "spread" in the information that makes up the fact. This means that
more processing has to be done to identify and link all the individual pieces of information that make up
the fact. In fact, an exploratory study done by Beth Sundheim during MUC-3 showed "a degradation in
correctness of message processing as the information distribution in the message became more complex, that
is, as slot fills were drawn from larger portions of the message" [Hirschman 1992].
An argument can be made that there are other factors, apart from the spread of information, which
influence the difficulty of extracting a fact from text. Some of these factors include the amount of training
done on an information extraction system, the quality of training, and the frequency of occurrence of the
patterns that a system has been trained on. While these factors do influence the performance of an information
extraction system and they do give some indication as to how difficult it was for a particular system
to extract the fact, they do not give a system-independent way of determining the complexity of extracting
the fact.
In [Hirschman 1992], Lynette Hirschman proposed the following hypothesis: there are facts that are
simply harder to extract, across all systems. Based on our definition of the level of a fact, we analyzed
the performances of several information extraction systems on the MUC-4 terrorist reports domain. Our
analysis shows that all the systems consistently did much worse on higher level facts. In addition to confirming
Hirschman's hypothesis, the analysis also shows that higher level facts are indeed harder to extract.
[Bagga 1998] gives the complete details about the analysis.
Building the Networks
As mentioned earlier, the level of a fact for a piece of text depends on the network constructed for the text.
Since there is no unique network corresponding to a piece of text, care has to be taken so that the networks
are built consistently.
For the set of experiments described in the rest of the paper we used the following algorithm to build the
networks:
1. Every article was broken up into a non-overlapping sequence of noun groups (NGs), verb groups (VGs),
and preposition groups (PGs). The rules employed to identify the NGs, VGs, and PGs were almost
the same as the ones employed by SRI's FASTUS system (see footnote 1).
2. The nodes of the network consisted of the NGs while the transitions between the nodes consisted of
the VGs and the PGs.
3. Identification of coreferent nodes and prepositional phrase attachments was done manually.
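Step 2 can be illustrated with a small sketch. It is not the procedure used for the experiments: the chunk sequence is typed in by hand rather than produced by the FASTUS-style rules, and the attachment rule (each VG or PG links the nearest NG before it to the nearest NG after it) is a naive assumption standing in for the manual attachment of step 3. The names `Chunk`, `Arc`, and `build_network` are hypothetical.

```python
from typing import List, Optional, Tuple

Chunk = Tuple[str, str]     # (group type, text); group type is "NG", "VG" or "PG"
Arc = Tuple[str, str, str]  # (source node, transition label, target node)


def build_network(chunks: List[Chunk]) -> Tuple[List[str], List[Arc]]:
    """Turn a chunk sequence into nodes (NGs) and arcs (VGs and PGs).

    Naive attachment (an assumption of this sketch): the VGs/PGs found
    between two consecutive NGs become one arc linking those two NGs.
    """
    nodes: List[str] = []
    arcs: List[Arc] = []
    previous_ng: Optional[str] = None
    pending_labels: List[str] = []
    for kind, text in chunks:
        if kind == "NG":
            if text not in nodes:
                nodes.append(text)
            if previous_ng is not None and pending_labels:
                arcs.append((previous_ng, " ".join(pending_labels), text))
            previous_ng = text
            pending_labels = []
        elif kind in ("VG", "PG"):
            pending_labels.append(text)
        else:
            raise ValueError(f"unknown group type: {kind}")
    return nodes, arcs


# Hand-chunked, simplified version of the first sample sentence.
sample_chunks = [
    ("NG", "The Extraditables"), ("VG", "have claimed"),
    ("NG", "responsibility"), ("PG", "for"),
    ("NG", "the murder"), ("PG", "of"),
    ("NG", "two employees"), ("PG", "of"),
    ("NG", "Bogota's daily El Espectador"), ("PG", "on"),
    ("NG", "Nov 15"),
]

sample_nodes, sample_arcs = build_network(sample_chunks)

if __name__ == "__main__":
    for source, label, target in sample_arcs:
        print(f"{source} --{label}--> {target}")
```

The naive rule attaches "on Nov 15" to the newspaper node, whereas the hand-built network attaches it to "the murder." This is the kind of prepositional phrase attachment that step 3 resolves manually.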
Obviously, if one were to employ a different algorithm for building the networks, one would get different
numbers for the level of a fact. But, if the algorithm were employed consistently across all the facts of
interest and across all articles in a domain, the numbers on the level of a fact would be consistently different
and one would still be able to analyze the relative complexity of extracting that fact from a piece of text in
the domain.
Footnote 1: We wish to thank Jerry Hobbs of SRI for providing us with the rules of their partial parser.
[Figure 2: MUC-7: Level Distribution of the Facts Combined. Plot of percentage of facts against levels 0 to 10 for all MUC-7 facts.]
Analysis of the MUC-7 Domain
Based on our definition of the level of a fact, we analyzed the MUC-7 domain which consisted of reports
on air vehicle launches. We selected a set of standard facts from the official MUC-7 template that we felt
captured most of the information in the template. This set consists of:
• Launch Date, Launch Site
• Mission Type, Mission Function, Mission Status
• Vehicle, Vehicle Type, Vehicle Owner, Vehicle Manufacturer
• Payload, Payload Type, Payload Function, Payload Owner
• Payload Manufacturer, Payload Origin, Payload Target, Payload Recipient
Using the algorithm described in the previous section, we built the networks for 32 of the 64 relevant
articles in the 100 article MUC-7 test set. From the network for each article, we calculated the level of each
of the standard facts mentioned in the article. The standard facts appeared 619 times in the 32 articles
analyzed. Figure 2 shows the level distribution of all the standard facts. In addition, Figures 3, 4, 5, and 6
show the level distributions of each of the standard facts.
Figures 4 and 5 show that the curves for the Vehicle and the Vehicle Type facts, and the Payload and
Payload Type facts are nearly the same. The main reason is that the phrases (in the text) which describe
the vehicles and the payloads, in most cases, also mention their types. For example, the phrase "Long March
3B Rocket" describes the air launch vehicle and also mentions its type.
The Vehicle Owner and Manufacturer facts, and the Payload Owner and Manufacturer facts all occur at
higher levels than the Vehicle and the Payload facts, indicating that they are harder to extract. The Payload
Recipient fact appears to be the most complex fact, occurring most frequently at level-9 (Figure 6).
Evaluating the Difficulty of the MUC-7 Domain
We extended our analysis to assess the difficulty of understanding a text in the MUC-7 domain.
Obviously, the difficulty of understanding a text in a domain depends directly on the expected level of
a fact in that domain. We define this expected level of a fact in a domain to be the domain number of the
domain. The domain number is measured in level units (LUs). Two domains can therefore be compared on
the basis of their domain numbers.
[Figure 3: MUC-7: Level Distribution of Each of the Facts. Plot of number of facts against levels 0 to 10 for Launch Date, Launch Site, Mission Type, Mission Function, and Mission Status.]
[Figure 4: MUC-7: Level Distribution of Each of the Facts. Plot of number of facts against levels 0 to 10 for Vehicle, Vehicle Type, Vehicle Owner, and Vehicle Manufacturer.]
The formula used to calculate the domain number is:
$$\text{domain number} = \frac{\sum_{l=0}^{\infty} l \cdot x_l}{\sum_{l=0}^{\infty} x_l}$$
where $x_l$ is the number of times one of the standard facts appeared at level-l in the articles of the domain.
Based on the levels of the standard facts in the MUC-7 test set, we calculated the domain number of the
air vehicle launch domain to be 2.44 LUs. The fact that the domain number of this domain is greater than
2 is to be expected given the fact that most curves in Figures 3, 4, 5, and 6 peak at level-2 or higher.
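The domain number is a count-weighted mean of the levels, which the following minimal sketch computes. The function name `domain_number` and the counts in `example_counts` are hypothetical; the counts are illustrative only and are not data from the MUC-7 analysis.

```python
from typing import Mapping


def domain_number(level_counts: Mapping[int, int]) -> float:
    """Expected level of a fact, sum(l * x_l) / sum(x_l), in level units (LUs)."""
    total = sum(level_counts.values())
    if total == 0:
        raise ValueError("no facts were counted")
    weighted = sum(level * count for level, count in level_counts.items())
    return weighted / total


# Hypothetical counts x_l (level -> number of occurrences); NOT data from the paper.
example_counts = {0: 2, 1: 10, 2: 20, 3: 12, 4: 6}
example_value = domain_number(example_counts)

if __name__ == "__main__":
    print(f"{example_value:.2f} LUs")  # prints "2.20 LUs"
```

Levels with no occurrences can simply be left out of the mapping, since they contribute nothing to either sum.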
[Figure 5: MUC-7: Level Distribution of Each of the Facts. Plot of number of facts against levels 0 to 10 for Payload, Payload Type, Payload Function, and Payload Owner.]
[Figure 6: MUC-7: Level Distribution of Each of the Facts. Plot of number of facts against levels 0 to 10 for Payload Manufacturer, Payload Origin, Payload Target, and Payload Recipient.]
Analysis of MUC-4
The MUC-4 domain consisted of articles reporting terrorist activities in Latin America. Based on the
official MUC-4 template, we selected a set of standard facts that we felt captured most of the information in
the template. They are: (The full definition of each fact is not included here.)
• The type of attack.
• The date of the attack.
• The location of the attack.
• The victim (including damage to property).
• The perpetrator(s) (including suspects).
We then built the networks (using the algorithm described earlier) for the relevant articles from the
MUC-4 TST3 set of 100 articles. From the network for each article, we calculated the levels of each of the
five standard facts. The level distribution of the five facts for the MUC-4 TST3 set is shown in Figure 7.
The level distribution of the five facts combined is shown in Figure 8.
[Figure 7: MUC-4: Level Distribution of Each of the Five Facts. Plot of number of facts against levels 0 to 15 for the Attack, Date, Location, Victim, and Perpetrator facts.]
Based on the data collected above, we made the following observations:
• There were 69 relevant articles in the MUC-4 TST3 set of 100 articles, each reporting one or more
terrorist attacks.
• The five facts of interest appeared 570 times in the 69 articles.
• A number of articles reported the same fact at two different places and at two different levels in the
same article. The first occurrence was usually in the first paragraph of the text, which reported the attack
without giving too many details; the second came later in the article, when the attack was reported with
all the details.
As one would expect, the level of the first occurrence of a fact in an article is usually less than or equal
to the level of the second occurrence of that fact in the same article.
• From Figure 8, we can see that almost 50% of the five facts were at level-1. This is not surprising
because four out of the five standard facts most frequently occur as level-1 facts (Figure 7).
[Figure 8: MUC-4: Level Distribution of the Five Facts Combined. Plot of percentage of facts against levels 0 to 15.]
[Figure 9: MUC-5: Level Distribution of Each of the Five Facts. Plot of number of facts against levels 0 to 15 for Parents of JV, Child Venture, Location of Child, Product, and Percentage Ownership.]
Based on the levels of the five standard facts in the MUC-4 TST3 set of articles, we calculated the domain
number of the terrorist domain to be 1.87 LUs. We assume that the set of 100 randomly chosen
articles in the MUC-4 TST3 set is representative of the domain. This assumption may not necessarily hold,
but, given the large number of articles we analyzed, we hope that the domain number calculated is close to
the actual domain number of the terrorist domain.
Analysis of MUC-5
Because two different domains were used in MUC-5 (each in two different languages), we decided to
focus only on the English Joint Ventures (EJV) domain. Once again, the standard facts were selected
from the official MUC-5 template and were chosen such that they contained most of the information in the
template. They are: (The full definition of each fact is not included here.)
• The parent(s) of the joint venture formed.
• The child joint venture formed.
• The location of the child.
• Product that the child will produce.
• Percentage ownership of each parent.
[Figure 10: MUC-5: Level Distribution of the Five Facts Combined. Plot of percentage of facts against levels 0 to 15.]
[Figure 11: MUC-6: Level Distribution of Each of the Six Facts. Plot of number of facts against levels 0 to 8 for Succession Event: Organization, Succession Event: Title(s), Person In, Hired From, Person Out, and Old Person Going To.]
Due to the unavailability of the official test set used for the MUC-5 EJV evaluation, we used a set of 50
articles used by the systems for training on the EJV domain. Using the algorithm described earlier, we then
built the networks for the relevant articles. Out of the 50 articles, 47 were relevant and the five standard facts
appeared 209 times in these articles. The level distribution of each of the five facts is shown in Figure 9. The
level distribution of the five facts combined is shown in Figure 10. Based on Figure 9 one can deduce that
the MUC-5 EJV domain is harder than the MUC-4 terrorist domain because three out of the five standard
facts most frequently occur as level-2 facts. Figure 10 peaks at level-2, giving further indication that the
domain number for this domain is more than 2 LUs.
Based on the levels of the standard set of facts, we calculated the domain number of the MUC-5 EJV
domain to be 2.67 LUs. This domain number is 0.80 LUs higher than the domain number for the MUC-4
terrorist attack domain and it indicates that the MUC-5 EJV task was much harder than the MUC-4 task. In
comparison, an analysis done by Beth Sundheim, using the features described earlier, shows that the nature
of the MUC-5 EJV task is approximately twice as hard as the nature of the MUC-4 task [Sundheim 1993].
Analysis of MUC-6
The domain used for MUC-6 consisted of articles regarding changes in corporate executive management
personnel. As in the case of our analyses of the previous two MUCs, we selected a set of standard facts based
on the official MUC-6 template. This set consisted of the following facts: (The full definition of each fact is
not included here.)
• Organization where the change(s) in the personnel took place.
• The position involved in the changes.
• The person coming in to the position.
• The person leaving the position.
• The company/post from where the person coming in is hired.
• The company/post that the person going out is going to.
[Figure 12: MUC-6: Level Distribution of the Six Facts Combined. Plot of percentage of facts against levels 0 to 15.]
MUC    Domain                           Domain Number (in LUs)  Highest P&R F-Measure
MUC-4  Terrorist Attacks                1.87                    55.93%
MUC-5  Joint Ventures                   2.67                    52.75%
MUC-6  Changes in Management Personnel  2.47                    56.40%
MUC-7  Air Vehicle Launch Reports       2.44
Figure 13: Domain Numbers of MUC-4, MUC-5, MUC-6, and MUC-7
We analyzed the levels of the standard set of facts in the official MUC-6 test set by building the networks
for the relevant articles in the test set (using the algorithm described earlier). This test set consisted of 100
articles, 56 of which were relevant. The six standard facts appeared 478 times in the relevant articles. The
level distribution of each of these six facts is shown in Figure 11. The level distribution of these six facts
combined is shown in Figure 12.
We calculated the domain number for the MUC-6 domain to be 2.47 LUs. Our analysis therefore indicates
that the MUC-6 domain is almost as hard as the MUC-5 EJV domain.
Comparing the MUC Information Extraction Tasks
Figure 13 shows the domain numbers for the MUC tasks that have been analyzed. For each of the MUCs,
the figure also shows the highest P&R F-Measure achieved by a system (for the information extraction task).
Our analysis clearly separates the MUC-4 task from the later ones. The tasks for the later MUCs, however,
have surprisingly similar complexity profiles: 20 to 30 percent level-1 facts, a substantially higher share of
level-2 facts, and decreasing values for higher level facts. Given that the analysis has been done for only four
tasks, we do not want to infer too much from the difference between the complexities of the MUC-5 task and the
MUC-6 and the MUC-7 tasks (which have roughly the same complexities).
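The separation between MUC-4 and the later tasks can be read directly off the domain numbers in Figure 13. The short sketch below ranks the four tasks and reports each task's gap from the lowest one; the variable names are hypothetical, and the only inputs are the domain numbers reported in Figure 13.

```python
# Domain numbers (in LUs) as reported in Figure 13.
DOMAIN_NUMBERS = {"MUC-4": 1.87, "MUC-5": 2.67, "MUC-6": 2.47, "MUC-7": 2.44}

# Rank the tasks from lowest to highest domain number.
ranked = sorted(DOMAIN_NUMBERS.items(), key=lambda item: item[1])
baseline_name, baseline = ranked[0]

# Gap (in LUs) between each remaining task and the lowest-ranked task.
gaps = {name: round(value - baseline, 2) for name, value in ranked[1:]}

if __name__ == "__main__":
    print(f"lowest: {baseline_name} ({baseline} LUs)")
    for name, gap in gaps.items():
        print(f"{name}: +{gap:.2f} LUs")
```

The three later tasks lie within 0.23 LUs of one another, while the nearest of them is 0.57 LUs above MUC-4.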
Conclusion
The level of a fact with respect to a network for a piece of text provides a new method of classifying a
fact based on the degree of syntactic complexity related to a fact. This classification tries to quantify the
complexity of the syntax that holds the information in text. Moreover, studies based upon this classification
scheme indicate that there is a correlation between the level of a fact and the difficulty of its extraction.
The analysis of the degree of difficulty of understanding a text in a domain comes as a by-product of our
approach and is a big step up from some of the techniques used earlier.
This measure is just the beginning of a search for useful complexity measures. Despite its shortcomings,
the current measure does quantify complexity on one very important dimension, namely the number of
clauses (or phrases) required to specify a fact. For the short term it appears to be the best available vehicle
for understanding the complexity of extracting a fact as shown in this paper.
Acknowledgments
The author would like to thank Ralph Grishman for the opportunity to analyze the MUC-7 information
extraction task. The author also thanks Beth Sundheim for providing him with the official data from
MUC-4, MUC-5, and MUC-6.
The author was supported by a Fellowship from IBM Corporation.

REFERENCES
[Bagga 1998] Bagga, Amit, and Alan W. Biermann. Analyzing the Performance of Message Understanding Systems, to appear in Journal of Computational Linguistics and Chinese Language Processing, 1998.
[Hendrix 1979] Hendrix, Gary G. Encoding Knowledge in Partitioned Networks. In Associative Networks. Nicholas V. Findler (ed.). New York: Academic Press, 1979, pp. 51-92.
[Hirschman 1992] Hirschman, Lynette. An Adjunct Test for Discourse Processing in MUC-4, Proceedings of the Fourth Message Understanding Conference (MUC-4), pp. 67-77, June 1992.
[MUC-3 1991] Proceedings of the Third Message Understanding Conference (MUC-3), May 1991, San Mateo: Morgan Kaufmann.
[MUC-6 1995] Proceedings of the Sixth Message Understanding Conference (MUC-6), November 1995, San Francisco: Morgan Kaufmann.
[Schubert 1979] Schubert, Lenhart K., et al. The Structure and Organization of a Semantic Net for Comprehension and Inference. In Associative Networks. Nicholas V. Findler (ed.). New York: Academic Press, 1979, pp. 121-175.
[Sundheim 1993] Sundheim, Beth M. TIPSTER/MUC-5 Information Extraction System Evaluation, Proceedings of the Fifth Message Understanding Conference (MUC-5), pp. 27-44, August 1993.
[Sundheim 1995] Sundheim, Beth M. Overview of Results of the MUC-6 Evaluation, Proceedings of the Sixth Message Understanding Conference (MUC-6), pp. 13-31, November 1995.
