Tagging English by Path Voting Constraints 
Ghkhan Tfir and Kemal Oflazer 
Department of Computer Engineering and Information Science 
Bilkent University, Bilkent, Ankara, TR-06533, TURKEY 
{tur, ko }@cs. bilkent, edu. tr 
Abstract: We describe a constraint-based 
tagging approach where individual constraint 
rules vote on sequences of matching tokens and 
tags. Disambiguation of all tokens in a sentence 
is performed at the very end by selecting tags 
that appear on the path that receives the high- 
est vote. This constraint application paradigm 
makes the outcome of the disambiguation in- 
dependent of the rule sequence, and hence re- 
lieves the rule developer from worrying about 
potentially conflicting rule sequencing. The ap- 
proach can also combine statistically and manu- 
ally obtained constraints, and incorporate neg- 
ative constraint rules to rule out certain pat- 
terns. We have applied this approach to tagging 
English text from the Wall Street Journal and 
the Brown Corpora. Our results from the Wall 
Street Journal Corpus indicate that with 400 
statistically derived constraint rules and about 
800 hand-crafted constraint rules, we can attain 
an average accuracy of 9Z89~ on the training 
corpus and an average accuracy of g7.50~ on 
the testing corpus. We can also relax the single 
tag per token limitation and allow ambiguous 
tagging which lets us trade recall and precision. 
1 Introduction 
Part-of-speech tagging is one of the preliminary 
steps in many natural language processing sys- 
tems in which the proper part-of-speech tag of 
the tokens comprising the sentences are disam- 
biguated using either statistical or symbolic lo- 
cal contextual information. Tagging systems 
have used either a statistical approach where 
a large corpora is employed to train a proba- 
bilistic model which then is used to tag unseen 
text, (e.g., Church (1988), Cutting et al. (1992), 
DeRose (1988)), or a constraint-based approach 
which employs a large number of hand-crafted 
linguistic constraints that are used to eliminate 
impossible sequences or morphological parses 
for a given word in a given context, recently 
most prominently exemplified by the Constraint 
Grammar work (Karlsson et al., 1995; Vouti- 
lainen, 1995b; Voutilainen et al., 1992; Vouti- 
lainen and Tapanainen, 1993). BriU (1992; 
1994; 1995) has presented a transformation- 
based learning approach. 
This paper extends a novel approach to 
constraint-based tagging first applied for Turk- 
ish (Oflazer and Tiir, 1997), which relieves the 
rule developer from worrying about conflicting 
rule ordering requirements and constraints. The 
approach depends on assigning votes to con- 
straints via statistical and/or manual means, 
and then letting constraints vote on match- 
ing sequences on tokens, as depicted in Figure 
1. This approach does not reflect the outcome 
of matching constraints to the set of morpho- 
logical parses immediately as usually done in 
constraint-based systems. Only after all appli- 
cable rules are applied to a sentence, tokens are 
disambiguated in parallel. Thus, the outcome of 
the rule applications is independent of the order 
of rule applications. 
W1 W2 W3 W4 Wn Tokens 
R1 R3 R2 --" Rm voting Rules 
Figure 1: Voting Constraint Rules 
1277 
© 
( can, HD) (can, l,~) 
(I.PRP) ~~- _ (the,DT) 
( cart o HD | 
Figure 2: Representing sentences with a directed acyclic graph 
2 Tagging by Path Voting 
Constraints 
We assume that sentences are delineated and 
that each token is assigned all possible tags by a 
lexicon or by a morphological analyzer. We rep- 
resent each sentence as a standard chart using 
a directed acyclic graph where nodes represent 
token boundaries and arcs are labeled with am- 
biguous interpretations of tokens. For instance, 
the sentence I can can the can. would be 
represented as shown in Figure 2, where bold 
arcs denote the correct tags. 
We describe constraints on token sequences 
using rules of the sort R = (C1,C2,-. ",Cn; V), 
where the Ci are, in general, feature constraints 
on a sequence of the ambiguous parses, and V 
is an integer denoting the vote of the rule. For 
English, the features that we use are: (1) LEX: 
the lexical form, and (2) TAG: the tag. It is 
certainly possible to extend the set of features 
used, by including features such as initial letter 
capitalization, any derivational information, 
etc. (see (Oflazer and Tiir, 1997)). For in- 
stance, (\[ThG=MD\], \[ThG=RB\], \[ThGfVB\] ; 100) 
is a rule with a high vote to promote modal 
followed by a verb with an intervening adverb. 
The rule (\[TAG=DT,LEX=that\], \[ThG=NNS\] ; 
-100) demotes a singular determiner read- 
ing of that before a plural noun, while 
( \[ThG=DT, LEX=each\], \[TAG=J J, LEX=o'cher\] ; 
100) is a rule with a high vote that captures a 
collocation (Santorini, 1995). 
The constraints apply to a sentence in the 
following manner: Assume for a moment that 
all possible paths from the start node to the 
end node of a sentence graph are explicitly enu- 
merated, and that after the enumeration, each 
path is augmented by a vote component. For 
each path at hand, we apply each constraint 
to all possible sequences of token parses. Let 
R = (C1,C2,'",C,~;V) be a constraint and 
let wi,wi+l,-'-, wi+,~-i be a sequence of token 
parses labeling sequential arcs of the path. We 
say rule R matches this sequence of parses, if 
wj, i _< j < i + n - 1 is subsumed by the corre- 
sponding constraint Cj-i+l. When such a match 
occurs, the vote of the path is incremented by 
V. When all constraints are applied to all pos- 
sible sequences in all paths, we select the path 
with the maximum vote. If there are multiple 
paths with the same maximum vote, the tokens 
whose parses are different in those paths are as- 
sumed to be left ambiguous. 
Given that each token has on the average 
more than 2 possible tags, the procedural de- 
scription above is very inefficient for all but very 
short sentences. However, the observation that 
our constraints are localized to a window of a 
small number of tokens (say at most 5 tokens 
in a sequence), suggests a more efficient scheme 
originally used by Church (1988). Assume our 
constraint windows are allowed to look at a win- 
dow of at most size k sequential parses. Let 
us take the first k tokens of a sentence and 
generate all possible paths of k arcs (spanning 
k + 1 nodes), and apply all constraints to these 
"short" paths. Now, if we discard the first to- 
ken and consider the (k + 1) st token, we need 
to consider and extend only those paths that 
have accumulated the maximum vote among the 
paths whose last k - 1 parses are the same. The 
reason is that since the first token is now out 
of the context window, it can not influence the 
application of any rules. Hence only the high- 
est scoring (partial) paths need to be extended, 
as lower scoring paths can not later accumu- 
late votes to surpass the current highest scoring 
paths. 
In Figure 3 we describe the procedure in a 
more formal way where wl, w2," ", ws denotes 
a sequence of tokens in a sentence, amb(wi) de- 
notes the number of ambiguous tags for token 
wi, and k denotes the maximum context win- 
dow size (determined at run time). 
1278 
1. P = { all I-I~_--: arnb(wj) paths of the first k- 1 
tokens } 
2. i=k 
3. while i < s 
4. begin 
4.1) Create amb(wi) copies of each path in P 
and extend each such copy with one of the 
distinct tags for token wl. 
4'.2) Apply all constraints to the last k tokens 
of every path in P, updating path votes 
accordingly. 
4.3) Remove from P any path p if there is some 
other path p' such that vote(p') > vote(p) 
and the last k - 1 tags of path p are same 
as the last k - 1 tags of p'. 
4.4) i=i+1 
end 
Figure 3: Procedure for fast constraint apphca- 
tion 
3 Results from Tagging English 
We evaluated our approach using l 1-fold cross 
validation on the Wall Street Journal Corpus 
and 10-fold cross validation on a portion of the 
Brown Corpus from the Penn Treebank CD. 
We used two classes of constraints: (i) we ex- 
tracted a set of tag k-grams from a training 
corpus and used them as constraint rules with 
votes assigned as described below, and (ii) we 
hand-crafted a set rules mainly incorporating 
negative constraints (demoting impossible or 
unlikely situations), or lezicalized positive con- 
straints. These were constructed by observing 
the failures of the statistical constraints on the 
training corpus. 
Rules derived from the training corpus 
For the statistical constraint rules, we extract 
tag k-grams from the tagged training corpus 
for k = 2, and k = 3. For each tag 
k-gram, we compute a vote which is essen- 
tially very similar to the rule strength used 
by Tzoukermann et al. (1995) except that 
we do not use their notion of genotypes ex- 
actly in the same way. Given a tag k-gram 
tl,t2,...tk, let n = count(t1 E Tags(wi),t2 E 
Tags(wi+l),...,tk E Tags(wi+k-1)) for all pos- 
sible i's in the training corpus, be the number 
of possible places the tags sequence can possi- 
bly occur, footnoteTags(wi) is the set of tags 
associated with the token wi. Let f be the num- 
ber of times the tag sequence tl,t2,...tk ac- 
tually occurs in the tagged text, that is, f = 
count(tl,t~,...tk). We smooth fin by defining 
/+0.5 so that neither p nor 1 -p is zero. The P"- n+l 
uncertainty of p is then given as ~/p(1- p)/n 
(Tzoukermann et al., 1995). We then computed 
the vote for this k-gram as 
Vote(tl,t2,...tk) = (p- ~fp(1 - p)/n) • 100. 
This formulation thus gives high votes to k- 
grams which are selected most of the time they 
are "selectable." And, among the k-grams 
which are equally good (same f/n), those with 
a higher n (hence less uncertainty) are given 
higher votes. 
After extracting the k-grams as described 
above for k = 2 and k = 3, we ordered each 
group by decreasing votes and conducted an ini- 
tim set of experiments to select a small group 
of constraints performing satisfactorily. We se- 
lected the first 200 (with highest votes) of the 2- 
gram and the first 200 of the 3-gram constraints, 
as the set of statistical constraints. It should be 
noted that the constraints obtained this way are 
purely constraints on tag sequences and do not 
use any lexical or genotype information. 
Hand-crafted rules In addition to these 
statistical constraint rules, we introduced 824 
hand-crafted constraint rules. Most of the 
hand-crafted constraints imposed negative con- 
straints (with large negative votes) to rule out 
certain tag sequences that we encountered in 
the Wall Street Journal Corpus. Another set 
of rules were lexicahzed rules involving the to- 
kens as well as the tags. A third set of rules for 
idiomatic constructs and collocations was also 
used. The votes for negative and positive hand- 
crafted constraints are selected to override any 
vote a statisticM constraint may have. 
Initial Votes To reflect the impact of lexical 
frequencies we initialize the totM vote of each 
path with the sum of the lexical votes for the 
token and tag combinations on it. These lexical 
votes for the parse ti,j of token wi are obtained 
from the training corpus in the usuM way, i.e., 
as count(wi,ti,j)/count(w~), and then are nor- 
mahzed to between 0 and 100. 
Experiments on WSJ and Brown Corpora 
We tested our approach on two English Corpora 
1279 
from the Penn Treebank CD. We divided a 5500 
sentence portion of the Wall Street Journal Cor- 
pus into 11 different sets of training texts (with 
about 118,500 words on the average), and corre- 
sponding testing texts (with about 11,800 words 
on the average), and then tagged these texts 
using the statistical rules and hand-crafted con- 
straints. The hand-crafted rules were obtained 
from only one of the training text portions, and 
not from all, but for each experiment the 400 
statistical rules were obtained from the respec- 
tive training set. 
We also performed a similar experiment with 
a portion of the Brown Corpus. We used 4000 
sentences (about 100,000 words) with 10-fold 
cross validation. Again we extracted the statis- 
tical rules from the respective training sets, but 
the hand-crafted rules were the ones developed 
from the Wall Street Journal training set. For 
each case we measured the accuracy by counting 
the correctly disambiguated tokens. The man- 
ual rules used for Brown Corpus were the rules 
derived the from Wall Street Journal data. The 
results of these experiments are shown in Table 
1. 
WSJ Brown 
Const. Tra. Test Tra. Test 
Set Acc. Acc. Acc. Acc. 
1 95.59 94.54 95.75 94.25 
1+2 96.47 95.68 96.78 95.76 
1+3 96.39 95.37 96.50 95.10 
1+2+3 96.66 95.96 96.91 96.02 
1+4 97.21 96.70 96.27 95.53 
1+2+4 97.85 97.43 97.13 96.51 
1+3+4 97.60 97.08 96.80 96.09 
1+2+3+4 97.89 97.50 97.18 96.67 
(I) Lexical Votes (2) 200 2-grams 
(3) 200 3-grams (4) 824 Manual Constr. 
Table 1: Results from tagging the WSJ and 
Brown Corpora. 
We feel that the results in the last row of 
Table 1 are quite satisfactory and warrant fur- 
ther extensive investigation. On the Wall Street 
Journal Corpus, our tagging approach is on par 
or even better than stochastic taggers making 
closed vocabulary assumption. Weischedel et al. 
(1993) report a 96.7% accuracy with 1,000,000 
words of training corpus. The performance of 
P 
0.99 
0.98 
0.97 
0.96 
0.95 
0.94 
0.93 
0.92 
0.91 
Recall 
Test Set 
Precision Ambiguity 
97.94 96.70 1.012 
98.27 95.29 1.031 
98.48 93.63 1.052 
98.65 91.63 1.076 
98.78 90.21 1.095 
98.98 88.92 1.113 
99.03 88.05 1.124 
99.10 87.19 1.136 
99.13 86.68 1.143 
Table 2: Recall and precision results on a WSJ 
test set with some tokens left ambiguous 
our system with Brown corpus is very close 
to that of Brill's transformation-based tagger, 
which can reach 97.2% accuracy with closed vo- 
cabulary assumption and 96.5% accuracy with 
open vocabulary assumption with no ambiguity 
(Brill, 1995). Our tagging speed is also quite 
high. With over 1000 constraint rules (longest 
spanning 5 tokens) loaded, we can tag at about 
1600 tokens/sec on a Ultrasparc 140, or a Pen- 
tium 200. 
It is also possible for our approach to allow 
for some ambiguity. In the procedure given ear- 
lier, in line 4.3, if one selects all (partial) paths 
whose accumulated vote is within p (0 < p <__ 1) 
of the (partial) path with the largest vote, then 
a certain amount of ambiguity can be intro- 
duced, at the expense of a slowdown in tagging 
speed and an increase in memory requirements. 
In such a case, instead of accuracy, one needs 
to use ambiguity, recall, and precision (Vouti- 
lainen, 1995a). Table 2 presents the recall, pre- 
cision and ambiguity results from tagging .one of 
the Wall Street Journal test sets using the same 
set of constraints but with p ranging from 0.91 
to 0.99. These compare quite favorably with 
the k-best results of Brill(1995), but reduction 
in tagging speed is quite noticeable, especially 
for lower p's. Any improvements in single tag 
per token tagging (by additional hand crafted 
constraints) will certainly be reflected to these 
results also. 
4 Conclusions 
We have presented an approach to constraint- 
based tagging that relies on constraint rules vot- 
1280 
ing on sequences of tokens and tags. This ap- 
proach can combine both statistically and man- 
ually derived constrMnts, and relieves the rule 
developer from worrying about rule ordering, as 
removal of tags is not immediately committed 
but only after all rules have a say. Using posi- 
tive or negative votes, we can promote meaning- 
ful sequences of tags or collocations, or demote 
impossible sequences. Our approach is quite 
general and is applicable to any language. Our 
results from the Wall Street Journal Corpus in- 
dicate that with 400 statistically derived con- 
straint rules and about 800 hand-crafted con- 
straint rules, we can attain an average accuracy 
of 9Z89~ on the training corpus and an average 
accuracy of 9Z50~ on the testing corpus. Our 
future work involves extending to open vocabu- 
lary case and evaluating unknown word perfor- 
mance. 
5 Acknowledgments 
A portion of the first author's work was done 
while he was visiting Johns Hopkins University, 
Department of Computer Science with a NATO 
Visiting Student Scholarship. This research was 
in part supported by a NATO Science for Stabil- 
ity Program Project Grant - TU-LANGUAGE. 

References 
Eric Brill. 1992. A simple-rule based part-of- 
speech tagger. In Proceedings of the Third 
Conference on Applied Natural Language 
Processing, Trento, Italy. 
Eric Brill. 1994. Some advances in rule-based 
part of speech tagging. In Proceedings of the 
Twelfth National Conference on Articial In- 
telligence (AAAI-94), Seattle, Washinton. 
Eric Brill. 1995. Transformation-based error- 
driven learning and natural language pro- 
cessing: A case study in part-of-speech tag- 
ging. Computational Linguistics, 21(4):543- 
566, December. 
Kenneth W. Church. 1988. A stochastic parts 
program and a noun phrase parser for un- 
restricted text. In Proceedings of the Sec- 
ond Conference on Applied Natural Language 
Processing, Austin, Texas. 
Doug Cutting, Julian Kupiec, Jan Pedersen, 
and Penelope Sibun. 1992. A practical 
part-of-speech tagger. In Proceedings of the 
Third Conference on Applied Natural Lan- 
guage Processing, Trento, Italy. 
Steven J. DeRose. 1988. Grammatical cate- 
gory disambiguation by statistical optimiza- 
tion. Computational Linguistics, 14(1):31- 
39. 
Fred Karlsson, Atro Voutilainen, Juha 
tteikkil~, and Arto Anttila. 1995. Con- 
straint Grammar-A Language-Independent 
System for Parsing Unrestricted Text. 
• Mouton de Gruyter. 
Kemal Oflazer and GSkhan Tiir. 1997. 
Morphological disambiguation by vot- 
ing constraints. In Proceedings of 
ACL '97/EACL '97, The 35th Annual Meet- 
ing of the Association for Computational 
Linguistics, June. 
Beatrice Santorini. 1995. Part-of-speech tag- 
ging guidelines fro the penn treebank project. 
Available at http ://www. ldc.upenn, edu/. 
3rd Revision, 2rid Printing. 
Evelyne Tzoukermann, Dragomir R. Radev, 
and William A. Gale. 1995. Combining lin- 
guistic knowledge and statistical learning in 
french part-of-speech tagging. In Proceedings 
of the ACL SIGDAT Workshop From Texts 
to Tags: Issues in Multilingual Language 
Analysis, pages 51-57. 
Atro Voutilainen and Pasi Tapanainen. 1993. 
Ambiguity resolution in a reductionistic 
parser. In Proceedings of EACL'93, Utrecht, 
Holland. 
Atro Voutilainen, Juha Heikkil~, and Arto 
Anttila. 1992. Constraint Grammar of En- 
glish. University of Helsinki. 
Atro Voutilainen. 1995a. Morphological dis- 
ambiguation. In Fred Karlsson, Atro Vouti- 
lainen, Juha Heikkil~, and Arto Anttila, 
editors, Constraint Grammar-A Language- 
Independent System for Parsing Unrestricted 
Text, chapter 5. Mouton de Gruyter. 
Atro Voutilainen. 1995b. A syntax-based part- 
of-speech analyzer. In Proceedings of the Sev- 
enth Conference of the European Chapter of 
the Association of Computational Linguistics, 
Dublin, Ireland. 
Ralph Weischedel, Marie Meteer, Richard 
Schwartz, Lance Ramshaw, and Jeff Pal- 
mucci. 1993. Coping with ambiguity and un- 
known words through probabilistic models. 
Computational Linguistics, 19(2):359-382. 
