An Annotation System for Enhancing Quality of Natural Language
Processing
Hideo Watanabe*, Katashi Nagao**, Michael C. McCord*** and Arendse Bernth***
* IBM Research,
Tokyo Research Laboratory
1623-14 Shimotsuruma, Yamato,
Kanagawa 242-8502, Japan
hiwat@jp.ibm.com
** Dept. of Information Engineering
Nagoya University
Furo-cho, Chikusa-ku,
Nagoya 464-8603, Japan
nagao@nuie.nagoya-u.ac.jp
*** IBM T. J. Watson
Research Center
Route 134, Yorktown Heights,
NY 10598, USA
mcmccord@us.ibm.com,
arendse@us.ibm.com
Abstract
Natural language processing (NLP) programs are
confronted with various difficulties in processing
HTML and XML documents, and have the po-
tential to produce better results if linguistic infor-
mation is annotated in the source texts. We have
therefore developed the Linguistic Annotation Lan-
guage (or LAL), an XML-compliant tag set
for assisting natural language processing programs,
as well as NLP tools such as parsers and machine trans-
lation programs that can accept LAL-annotated
input. In addition, we have developed a LAL-
annotation editor that allows users to annotate
documents graphically without seeing tags. Fur-
ther, we have conducted an experiment to measure
the improvement in translation quality obtained by
using LAL annotation.
1 Introduction
Recently there has been increasing interest in
applying natural language processing (NLP) sys-
tems, such as keyword extraction, automatic text
summarization, and machine translation, to Inter-
net documents. However, various obstacles make
it difficult for these systems to produce good results.
It is true that NLP technologies are not perfect,
but some of the difficulties result from problems
in HTML. Further, in general, linguistic informa-
tion added to source texts greatly helps NLP pro-
grams produce better results. In what follows, we
would like to show some examples related to ma-
chine translation.
In general, it is very helpful for machine transla-
tion programs to know boundaries on many levels
(such as sentences, phrases, and words) and to know
word-to-word dependency relations. For instance,
since "St." has two possible meanings, "street" and
"saint," it is difficult to determine whether the
following example consists of one or two sentences.

I went to Newark St. Paul lived there
two years ago.
As another example, the following sentence has
two interpretations: one is that what he likes is
people, and the other is that what he likes is
accommodating.

He likes accommodating people.

If there are tags indicating the direct-object mod-
ifier of the word "like," then the correct interpreta-
tion is possible. NLP may eventually be able to
resolve these ambiguities by using advanced context-
processing techniques, but current NLP technology
generally needs a hint from the author for these
sorts of ambiguities.
Further, there are issues in HTML/XML. When
MT systems are applied to Web pages, most of the
errors are caused by the linguistic incompleteness
of MT technology, but some are caused by
problems in HTML and XML tag usage. For in-
stance, writers often use the <br> tag for sentence
termination. Sometimes writers intend that a <br>
tag should terminate the sentence (even without
terminating punctuation such as a period), and in
other cases writers intend <br> only as a format-
ting device. In the HTML <table> shown in Figure
1, the writer intends each line of a cell to express
one linguistic unit. The MT program cannot tell
whether each line is a unit for translation or whether,
instead, the two lines form one unit. In this example,
some MT programs would try to produce a transla-
tion of the unit "NetVista Models ThinkPad News."

As shown in the above examples, NLP appli-
cations do not achieve their full potential, on ac-
count of problems unrelated to the essential NLP
processes. If tags expressing linguistic information
<table><tr><td>
<a href="...">NetVista Models</a><br>
<a href="...">ThinkPad News</a><br>
</td></tr></table>

Figure 1: An example of using <br> tags in a table
are inserted into source documents, they help NLP
programs recognize document and linguistic struc-
tures properly, allowing the programs to produce
much better results. At the same time, it is true
that NLP technologies are incomplete, but their de-
ficiencies can sometimes be circumvented through
the use of such tags. Therefore, this paper proposes
a set of tags for helping NLP programs, called the
Linguistic Annotation Language (or LAL).
2 Linguistic Annotation Language
LAL is an XML-compliant tag set, and its XML
namespace prefix is lal.
The LAL tag set is designed to be as simple as
possible, for the following reasons: (1) A simple tag
set is easier for developers to check manually. (2)
An easy-to-use annotation tool is mandatory for
this annotation scheme, and simplicity is important
for building such a tool, since a feature-rich tag set
would force the user to check many annotation items.
2.1 Basic Tags
The sentence tag s is used to delimit a sentence.

<lal:s>This is the first sentence.</lal:s>
<lal:s>This is the second sentence.</lal:s>

The attribute type="hdr" means that the sen-
tence is a title or header.
The word tag w is used to delimit a word. It
can have attributes for additional information such
as the base form (lex), part of speech (pos), features
(ftr), and sense (sense) of a word. The values of
these attributes are language-dependent, and are
not described in this paper because of space limi-
tations. The following example illustrates some of
these tags and attributes.

<lal:s>
<lal:w lex="this" pos="det">This</lal:w>
<lal:w lex="be" pos="verb" ftr="sg,3rd">
is</lal:w>
<lal:w lex="a" pos="det">a</lal:w>
<lal:w lex="pen" pos="noun" ftr="sg,count">
pen</lal:w>
</lal:s>
The dependency (or word-to-word modification)
relationship can be expressed by using the id and
mod attributes of a word tag; that is, a word can
carry the ID value of its modifiee in a mod attribute.
The value of a mod attribute must be the ID value
of a word tag or a segment tag. For instance, the
following example contains attributes showing that
the word "with" modifies the word "saw," meaning
that "she" has a telescope.

She <lal:w id="w1" lex="see" pos="v"
sense="see1">saw</lal:w> a man
<lal:w mod="w1">with</lal:w>
a telescope.
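To illustrate how such annotations might be consumed, the following Python sketch extracts the modifier-modifiee pairs from a LAL fragment with the standard xml.etree parser. The wrapping doc element, the namespace URI, and the variable names are our own assumptions for this illustration; they are not part of the LAL specification.

```python
import xml.etree.ElementTree as ET

# A small LAL fragment; the <doc> wrapper and the xmlns:lal declaration
# (with a made-up namespace URI) are added so that it parses standalone.
fragment = """<doc xmlns:lal="http://example.org/lal">
She <lal:w id="w1" lex="see" pos="v" sense="see1">saw</lal:w> a man
<lal:w mod="w1">with</lal:w> a telescope.
</doc>"""

LAL = "{http://example.org/lal}"
root = ET.fromstring(fragment)

words = {}  # id -> surface form
deps = []   # (modifier surface form, modifiee id)
for w in root.iter(LAL + "w"):
    if "id" in w.attrib:
        words[w.attrib["id"]] = w.text.strip()
    if "mod" in w.attrib:
        deps.append((w.text.strip(), w.attrib["mod"]))

# Resolve modifiee ids to surface forms.
resolved = [(modifier, words[target]) for modifier, target in deps]
print(resolved)  # [('with', 'saw')]
```

Because mod values are ordinary ID references, an NLP program can read off the intended attachment ("with" modifies "saw") without re-parsing the sentence.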
The phrase (or segment) tag seg is used to spec-
ify a phrase scope on any level. In addition, the
syntactic category of a phrase can be specified with
an optional attribute cat. The following example
specifies the scope of the phrase "a man
... a telescope" and marks it as a noun phrase. This also
implies that the prepositional phrase "with a tele-
scope" modifies the noun phrase "a man."

She saw <lal:seg cat="np">a man with a
telescope</lal:seg>.
The attribute para="yes" means that the seg-
ment is a coordinated segment. The following ex-
ample shows that the words "software" and
"hardware" are coordinated.

This company deals with <lal:seg cat="np"
para="yes">software and hardware</lal:seg>
for networking.
The ref attribute holds the ID value of the refer-
ent of the current word. This can be used to specify
a pronoun referent, for instance:

<lal:s>He bought <lal:seg id="w1">a
new car</lal:seg> yesterday.</lal:s>
<lal:s>She was very surprised to
learn that <lal:w ref="w1">it</lal:w>
was very expensive.</lal:s>
2.2 Expressing Multiple Parses
As mentioned earlier, since natural language con-
tains ambiguities, it is useful for LAL annotation
to have a mechanism for expressing syntactic am-
biguities.

We have introduced a parse identifier (or PID)
into attribute values for distinguishing parses. An
attribute value that may differ from parse to parse
can be expressed as multiple space-separated values,
each of which consists of a PID prefix followed by a
colon and an attribute value.

<lal:s>
<lal:w id="1" mod="2">He</lal:w>
<lal:w id="2" mod="0">likes</lal:w>
<lal:w id="3" mod="p1:2 p2:4">
accommodating</lal:w>
<lal:w id="4" mod="p1:3 p2:2">people
</lal:w>.</lal:s>

This example shows that there are two interpre-
tations, whose PIDs are p1 and p2: in interpretation
p1, "He likes people," and in interpretation p2, "He
likes accommodating."
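As an illustrative sketch (ours, not part of the LAL specification), the following Python function splits such PID-prefixed mod values into one dependency map per parse; unprefixed values are treated as shared by all parses, and the fallback PID name "p1" for the unambiguous case is an assumption of this sketch.

```python
def split_parses(mods):
    """Split PID-prefixed mod values into one dependency map per parse.

    mods: dict mapping a word id to its mod attribute string,
          e.g. {"3": "p1:2 p2:4"} or {"1": "2"}.
    Returns a dict mapping each PID to {word id: modifiee id}; values
    without a PID prefix are shared by every parse.
    """
    shared, per_pid = {}, {}
    for wid, value in mods.items():
        for item in value.split():
            if ":" in item:
                pid, target = item.split(":", 1)
                per_pid.setdefault(pid, {})[wid] = target
            else:
                shared[wid] = item
    for pid in per_pid:  # merge the unambiguous dependencies into each parse
        per_pid[pid] = {**shared, **per_pid[pid]}
    return per_pid if per_pid else {"p1": shared}  # single-parse fallback

# The "He likes accommodating people" example from the text:
mods = {"1": "2", "2": "0", "3": "p1:2 p2:4", "4": "p1:3 p2:2"}
parses = split_parses(mods)
print(parses["p1"])  # {'1': '2', '2': '0', '3': '2', '4': '3'}
print(parses["p2"])  # {'1': '2', '2': '0', '3': '4', '4': '2'}
```

A word tag whose mod value has no PID prefix (like ids 1 and 2 above) contributes the same dependency to every parse.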
3 LAL-Aware NLP Programs
We have modified certain NLP systems to be
LAL-aware. ESG [5, 6] is an English parsing sys-
tem developed at the IBM Watson Research Cen-
ter, and it has been updated to accept and generate
LAL-annotated English. We have also developed a
Japanese parsing system with LAL output functionality.
These LAL-aware versions of the parsers are used as
a back-end process to show users the system's default
interpretation of a given sentence in the LAL-annotation
editor described below.

Further, the English-to-German, French, Span-
ish, Italian, and Portuguese translation engines [6,
7] and the English-to-Japanese translation engine [9]
have been modified to accept LAL-annotated English
HTML input.¹
4 The LAL-Annotation Editor
Since inserting tags into documents manually is
not generally an easy task for end users, it is impor-
tant to provide an easy-to-use GUI-based editing
environment. In developing such an environment,
we took into consideration the following points: (1)
Users should not have to see any tags. (2) Users
should not have to see internal representations ex-
pressing linguistic information. (3) Users should be
able to view and modify linguistic information such
as feature values, but only if they want to.
Considering these points, we have found that
most of the errors made by NLP programs result
from their failure to recognize the phrasal struc-
tures of sentences. Therefore, we have decided to
show only a structural view of a sentence on the ini-
tial screen; other information is shown only if the
user requests it.

¹ In addition, Watanabe [11] reported on an algorithm
for accelerating CFG-parsing by using LAL tag informa-
tion, and it is implemented in the above English-to-Japanese
translation engine.
The important issue here is how to present the
syntactic structure of a sentence to the user. NLP
programs normally deal with a linguistic structure
by means of a syntactic tree, but such a structure
is not necessarily easy for end users to understand.
For instance, Figure 2 shows the dependency struc-
ture of the English sentence "IBM announced a new
computer system for children with voice function."
This dependency structure is difficult for end users,
partly because a dependency tree does not keep the
surface word order, so it is difficult to map it
quickly to the original sentence.² Therefore, an im-
portant property of the linguistic structural view
is that users can easily reconstruct the original sur-
face sentence string.
The next important issue is how easily a user
can understand the overall linguistic structure. If
a user is at first presented with the detailed linguistic
structure at the word level, then it is difficult to
grasp the important linguistic skeleton of a sen-
tence. Therefore, another necessary property is
to give users a view in which the overall sentence
structure is easily recognized.

Figure 2: An example of the tree structure of an En-
glish sentence
With these requirements in mind, we have devel-
oped a GUI tool called the LAL Editor. To satisfy
the last requirement, this editor has two presenta-
tion modes: the reduced presentation view and the
expanded presentation view. In the reduced pre-
sentation view, a main verb and its modifiers are
the basic units for presenting dependencies, and they
are placed on different lines, keeping the surface
order. Figure 3 shows an example of this reduced
presentation view. In this view, since dependen-
cies that are obvious to native speakers (e.g., between
"a" and "computer") are not displayed explicitly, the
user can concentrate on dependencies between key
units (or phrases). If the user finds any depen-
dency errors in the reduced view, he or she can
enter the expanded view mode, in which all words
are basic units for presenting dependencies. Fig-
ure 4 (a, b) shows examples of this expanded view.
In these views, to satisfy the former requirement,
dependencies between basic units are expressed by
indentation. The user can therefore easily recon-
struct the surface sentence string by reading the
words from top to bottom and from left to right,
and can easily see the dependencies of words by
looking at words located in the same column. For
details of the algorithm, see [12].

² An inorder tree walk must be performed to reconstruct a
surface sentence string.

Figure 3: Screen images of the LAL Editor - reduced
view
In Figure 3, the user can easily grasp the overall
structure. In this case, since the dependencies be-
tween "for" and "announced," and between "with"
and "announced," are wrong, the user changes the
mode to the expanded view (as shown in Figure 4 (a)).
In this view, the user can change dependencies by
dragging a modifier to the correct modifiee with
the mouse. The corrected dependency structure is
shown in Figure 4 (b).
In addition, the LAL Editor has the capability of
testing translation by using LAL annotation. Fig-
ure 5 shows a window in which the top pane shows
the input sentence, the second pane shows the LAL
annotation of the input, the third pane shows the
translation result obtained using the LAL annotation,
and the fourth pane shows the default translation
produced without the LAL annotation. The user can
thus easily check whether the current annotation
improves the translation.
5 Experiment
We have conducted a small experiment to eval-
uate LAL annotation with our English-to-Japanese
machine translation system [9]. We gathered about
60 sentences from Web pages in the computer do-
main, and added LAL annotation to these sen-
tences with the LAL annotation editor. In this
experiment, only word-to-word modifications were
corrected. Because of severe parsing errors and glitches
in the annotation editor, 53 of the 60 sentences
were used in this experiment. The average sentence
length for this test set was 21 words. Two evalu-
ators assigned a quality score ranging from 1
(worst) to 5 (best) to each translation, with and
without use of annotation.

Figure 4: Screen images of the LAL Editor - expanded
view: (a) before correction; (b) after correction
Translation results for 18 sentences (about 34%)
were better in the annotated case than in the non-
annotated case. These sentences were better by 1.16
points on average (a 27% improvement in quality
score). On the other hand, 26 sentences (about 49%)
were unchanged, and 9 sentences (about 17%) were
worse. The main reason why these 9 sentences were
worse was a structural mismatch between the output
of the LAL Editor and the structure expected by the
EtoJ translation system, since the LAL Editor and
the EtoJ MT system use different parsing systems.
We have developed a structure conversion routine
from LAL Editor output to EtoJ input, but it does
not yet cover all situations; this is why these 9 sen-
tences became worse.

Figure 5: Translation test window of the LAL Editor
Note that this experiment used only word-to-
word modification corrections, so there is room for
producing better translations if we use other types
of annotation such as part of speech and word sense.
6 Discussion
There have been several efforts to define tags
for describing language resources, such as TEI [10],
OpenTag [8], CES [1], EAGLES [2], and GDA [3, 4].
The main focus of these efforts, other than GDA, has
been to share linguistic resources by expressing them
in a standard tag set, and they therefore define very
detailed levels of tags for expressing linguistic de-
tails. GDA has almost the same purposes, but it
too has defined a very complex tag set. This com-
plexity discourages people from using these tag sets
when writing documents, and it also makes it dif-
ficult to build an annotation tool for these tags.
LAL is not opposed to these previous efforts, but
attempts to strike a useful balance between expres-
siveness and simplicity, so that annotation can be
used widely.
As mentioned in the discussion of the experi-
ment, an issue arises when the parsing system
of the LAL Editor and the parsing system of an NLP
tool that accepts the LAL Editor's output are
different. As mentioned before, we used the ESG
parser to produce LAL-annotated English, and
the English-to-Japanese MT system to accept LAL-
annotated English. Since these systems were
developed independently, by different developers and
based on different approaches, we found some
structural differences between them. For instance,
given a prepositional phrase Prep N, ESG takes Prep
as the head word of the prepositional phrase, whereas
the EtoJ MT engine takes N. In most cases, we can
write systematic conversion routines for such differ-
ing structures. In fact, for most of the sentences
whose translations were worse when annotation was
used, we can provide structural conversion routines
for the linguistic structures they contain. The basic
idea of LAL-awareness for NLP tools is that an NLP
tool should use LAL information as much as possible,
but if the LAL information conflicts severely with the
internal processing, then that information should not
be used. Our EtoJ MT program was basically imple-
mented this way, based on the algorithm described
in [11], but more research seems to be needed on
this issue.
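The kind of structural conversion routine discussed above can be sketched as follows. This Python fragment is a simplified reconstruction of the idea, not the routine actually used in the system: it rewrites ESG-style dependencies, in which the preposition heads a prepositional phrase, into an EtoJ-style convention in which the noun does.

```python
def esg_to_etoj(deps, pos):
    """Convert dependency heads from the ESG convention to the EtoJ one.

    deps: dict mapping each word to its head word (ESG convention).
    pos:  dict mapping each word to a part-of-speech tag ("prep", ...).
    Every word attached to a preposition is re-attached to the
    preposition's former head, and the preposition is attached to that
    word instead (a simplified EtoJ-style convention).
    """
    out = dict(deps)
    for word, head in deps.items():
        if pos.get(head) == "prep":
            out[word] = deps[head]  # the noun takes over the prep's head
            out[head] = word        # the prep now attaches to the noun
    return out

# "saw ... with telescope": under the ESG convention, "with" heads the PP.
deps = {"with": "saw", "telescope": "with"}
pos = {"saw": "verb", "with": "prep", "telescope": "noun"}
print(esg_to_etoj(deps, pos))  # {'with': 'telescope', 'telescope': 'saw'}
```

In practice such routines must handle many more constructions, which is consistent with the observation above that the conversion routine did not yet cover all situations.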
7 Conclusion
In this paper, we have proposed an XML-compliant
tag set called the Linguistic Annotation Language (or
LAL), which helps NLP programs perform their
tasks more correctly. LAL is designed to be as
simple as possible so that humans can use it with
minimal help from assisting tools. We have also de-
veloped a GUI-based LAL annotation editor, and
have shown in an experiment that the use of LAL anno-
tation enhances translation quality. We hope that
wide acceptance of LAL will make it possible to build
more intelligent Internet tools and services.

References

[1] CES, "Corpus Encoding Standard (CES),"
(http://www.cs.vassar.edu/CES/)

[2] EAGLES, "Expert Advisory Group on Language Engi-
neering Standards,"
(http://www.ilc.pi.cnr.it/EAGLES/home.html)

[3] GDA, "Global Document Annotation,"
(http://www.etl.go.jp/etl/nl/gda/)

[4] Hashida, K., Nagao, K., et al., "Progress
and Prospect of Global Document Annotation" (in
Japanese), Proc. of the 4th Annual Meeting of the Asso-
ciation for Natural Language Processing, pp. 618–621,
1998.

[5] McCord, M. C., "Slot Grammars," Computational Lin-
guistics, Vol. 6, pp. 31–43, 1980.

[6] McCord, M. C., "Slot Grammar: A System for Sim-
pler Construction of Practical Natural Language Gram-
mars," in R. Studer (ed.), Natural Language and Logic:
International Scientific Symposium, Lecture Notes in
Computer Science, pp. 118–145, Springer Verlag, 1990.

[7] McCord, M. C., and Bernth, A., "The LMT Transfor-
mational System," Proc. of AMTA-98,
pp. 344–355, 1998.

[8] OpenTag, "A Standard Extraction/Abstraction Text
Format for Translation and NLP Tools,"
(http://www.opentag.org/)

[9] Takeda, K., "Pattern-Based Machine Translation,"
Proc. of the 16th COLING, Vol. 2, pp. 1155–1158, August
1996.

[10] TEI, "Text Encoding Initiative (TEI),"
(http://www.uic.edu:80/orgs/tei/)

[11] Watanabe, H., "A Method for Accelerating CFG-
Parsing by Using Dependency Information," Proc. of
the 18th COLING, 2000.

[12] Watanabe, H., Nagao, K., McCord, M. C., and Bernth,
A., "Improving Natural Language Processing by Lin-
guistic Document Annotation," Proc. of the COLING 2000
Workshop on Semantic Annotation and Intelligent Con-
tent, pp. 20–27, 2000.
